Beyond Chat: Specialized SLMs for Structured Data Extraction in the Browser


How to Extract Structured Data in the Browser Using Small Language Models
- Choose a quantized small language model (sub-1GB) suitable for browser deployment, such as SmolLM2-360M-Instruct.
- Install Transformers.js via CDN ESM import to access the high-level pipeline API.
- Detect WebGPU support at runtime and fall back to the WASM backend if unavailable.
- Load the quantized model client-side with a progress indicator; browser Cache API handles repeat visits.
- Craft extraction prompts that include the target JSON schema and a few-shot example to anchor output format.
- Delegate inference to a Web Worker to keep the main thread responsive during generation.
- Parse model output with layered repair strategies—substring extraction, suffix appending—to guarantee valid JSON.
- Validate results by projecting parsed output onto your schema, coercing types, and discarding unexpected keys.
Table of Contents
- Why Structured Extraction Belongs in the Browser
- How SLMs Differ from LLMs for Extraction Tasks
- Setting Up the Browser-Based SLM Pipeline
- Extracting Structured JSON from Unstructured Text
- Real-World Use Case: Client-Side Form Parsing and Auto-Fill
- Performance, Accuracy, and Constraints
- Building the Live Demo
- Key Takeaways and What's Next
Why Structured Extraction Belongs in the Browser
Every web application eventually hits the same wall: unstructured text that needs to become structured JSON. Users paste address blocks, forward receipts, drop in contact details from email signatures, or free-type information that downstream logic requires in a precise schema. Browser data extraction has traditionally meant one of two paths. The first is sending that text to a server-side LLM API, which introduces latency, per-token costs, and the privacy risk of transmitting potentially sensitive user data off-device. The second is regex and rule-based parsing, which is brittle, expensive to maintain, and collapses the moment input formatting deviates from expectations.
A third path has become viable. Specialized small language models, quantized to run under 1GB, can now perform inference directly in the browser via WebGPU and ONNX Runtime Web. Because they run client-side, they eliminate API calls and keep all data on-device. By the end of this article, readers will have a working in-browser pipeline that extracts structured data from contacts, addresses, and receipts using local form parsing powered by a quantized SLM.
Prerequisites
- Browser: Chrome 113+ or Edge 113+ for WebGPU acceleration (enabled by default in these versions). Any modern browser works with the WASM fallback, but expect slower inference.
- A discrete or integrated GPU with WebGPU driver support delivers best performance. CPU-only machines will use the WASM backend.
- Network: ~250MB download on first model load (cached automatically for subsequent visits).
- Headers: If your development server doesn't already send `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp` headers, full WASM threading won't work. A quick local option: `npx serve --cors`.
- No Node.js required at runtime. This runs entirely in the browser.
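If you use `npx serve`, the two headers can be configured via a `serve.json` file in the served directory. This is a sketch based on `serve-handler`'s documented config format; verify it against the version you install:

```json
{
  "headers": [
    {
      "source": "**",
      "headers": [
        { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" },
        { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" }
      ]
    }
  ]
}
```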
How SLMs Differ from LLMs for Extraction Tasks
What Makes a Model "Small" and Why It Matters for the Browser
The term "small language model" generally refers to models with fewer than one billion parameters. At that scale, aggressive quantization (INT4 or INT8) can compress model weights into files between 200MB and roughly 1GB, which is within the range a browser can download, cache, and run inference on using client hardware. The runtime layer matters: WebGPU provides GPU-accelerated compute in supported browsers (Chrome 113+, Edge 113+; earlier versions require enabling via chrome://flags/#enable-unsafe-webgpu and are not recommended for production), while ONNX Runtime Web offers a cross-platform fallback that can target both WebGPU and WebAssembly backends. Transformers.js, maintained by Hugging Face, wraps these runtimes with a familiar high-level pipeline() API that mirrors the Python transformers library at that abstraction level.
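The arithmetic behind those file sizes is a straightforward back-of-the-envelope estimate. Real ONNX files run larger than this floor because some tensors (embeddings, normalization layers) are usually kept at higher precision and the format adds metadata; `estimateWeightMB` is an illustrative helper, not a library function:

```javascript
// Rough quantized-weight size floor: params × bits-per-weight ÷ 8 bytes.
// Actual on-disk files exceed this because not every tensor is quantized
// to the same width and the container format adds overhead.
function estimateWeightMB(paramCount, bitsPerWeight) {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1e6; // decimal megabytes
}

console.log(estimateWeightMB(360e6, 4)); // 360M params at INT4 → 180 MB floor
console.log(estimateWeightMB(360e6, 8)); // INT8 doubles that → 360 MB
```

This is why a 360M-parameter model at INT4 lands near the ~250MB download cited later: 180MB of quantized weights plus higher-precision tensors and file overhead.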
The trade-off is real. Sub-1B models have a lower accuracy ceiling than a 70B parameter server-side model. But scoped extraction tasks with well-defined schemas close most of the accuracy gap, and the latency and privacy gains are substantial.
Why Specialized Fine-Tuning Beats General Chat Models
A general-purpose chat SLM asked to "extract the name and email from this text" may hallucinate fields, wrap output in conversational preamble, or produce malformed JSON. Fine-tuned extraction models constrain output far more reliably. Models like SmolLM2 (Hugging Face, 135M–1.7B parameters) and DistilBERT variants fine-tuned for NER or JSON output are strong candidates for browser deployment. Phi-3-mini (Microsoft, 3.8B parameters) sits above the sub-1B SLM threshold, but quantized ONNX variants can fit under 1GB and remain viable on capable hardware.
Grammar-constrained decoding, sometimes called grammar-guided generation, adds another layer of reliability. By restricting the model's token selection at each step to only tokens valid within a JSON grammar, grammar-constrained decoding guarantees structurally valid output. Transformers.js does not yet ship native grammar-constrained decoding as of the v3 release line, though third-party libraries offer partial support. The post-processing repair strategies covered below bridge the gap effectively for most extraction use cases.
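To make the idea concrete, here is a toy sketch of the masking step at the heart of grammar-constrained decoding. Both `maskLogits` and `greedyPick` are hypothetical helpers for illustration, not Transformers.js APIs:

```javascript
// Toy grammar-constrained decoding: before selecting the next token,
// every token the grammar disallows in the current state has its logit
// forced to -Infinity, so it can never win.
function maskLogits(logits, allowedTokenIds) {
  const allowed = new Set(allowedTokenIds);
  return logits.map((logit, tokenId) => (allowed.has(tokenId) ? logit : -Infinity));
}

function greedyPick(logits) {
  let best = 0;
  for (let i = 1; i < logits.length; i++) if (logits[i] > logits[best]) best = i;
  return best;
}

// Suppose token 2 is `{` and the JSON grammar requires output to start with it.
const logits = [3.1, 2.7, 0.4, 1.9];         // the model prefers token 0...
const constrained = maskLogits(logits, [2]); // ...but only `{` is legal here
console.log(greedyPick(constrained)); // → 2
```

A real implementation advances a grammar state machine after each emitted token to compute the next allowed set, but the masking principle is exactly this.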
Setting Up the Browser-Based SLM Pipeline
Choosing Your Runtime: Transformers.js vs. ONNX Runtime Web
| Feature | Transformers.js | ONNX Runtime Web |
|---|---|---|
| API style | High-level pipeline() | Low-level session/tensor API |
| Model format | ONNX (auto-converted from HF Hub) | ONNX |
| Backend support | WebGPU, WASM | WebGPU, WASM, WebGL (deprecated in recent versions) |
| Ease of use | High (mirrors Python API) | Lower (manual pre/post processing) |
| Model hosting | HF Hub integration built-in | Bring your own |
This tutorial uses Transformers.js for accessibility. ONNX Runtime Web is the better choice when developers need fine-grained control over tensor operations or want to target WebGL as a fallback for older GPUs.
Loading a Quantized Extraction Model Client-Side
Model selection matters. A quantized SmolLM2-360M-Instruct ONNX variant hosted on the Hugging Face Hub hits a sweet spot: small enough for fast downloads (~250MB quantized), capable enough for schema-constrained extraction. Verify that the onnx/ directory containing the q4 quantized file exists in the HuggingFaceTB/SmolLM2-360M-Instruct repository before running the code below; ONNX quantized variant availability can change. Transformers.js caches model files via the browser's Cache API automatically, so the model downloads once and loads from cache on repeat visits. Browsers may evict Cache API entries under storage pressure, so users on constrained devices may occasionally re-download the model.
// Code Example 1: Initialize Transformers.js pipeline in the browser
// Pin to an exact verified patch version. Check releases at:
// https://github.com/huggingface/transformers.js/releases
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';
// Transformers.js caches models via the browser Cache API automatically.
// No cacheDir configuration is needed in browser environments.
// Detect WebGPU support before requesting it
const device = navigator.gpu ? 'webgpu' : 'wasm';
let extractionPipeline = null;
async function loadModel(onProgress) {
try {
extractionPipeline = await pipeline(
'text-generation',
'HuggingFaceTB/SmolLM2-360M-Instruct',
{
dtype: 'q4', // INT4 quantization
device: device, // Detected above: 'webgpu' if supported, else 'wasm'
progress_callback: (progress) => {
if (onProgress && progress.status === 'progress') {
onProgress(Math.round(progress.progress));
}
}
}
);
return extractionPipeline;
} catch (err) {
document.getElementById('loading-status').textContent =
`Model load failed: ${err.message}`;
console.error('[loadModel] Failed to initialize pipeline:', err);
throw err;
}
}
// Cleanup function for SPA environments — call on route change or unmount
async function unloadModel() {
await extractionPipeline?.dispose();
extractionPipeline = null;
}
// Usage with a loading indicator
loadModel((pct) => {
document.getElementById('loading-bar').style.width = `${pct}%`;
}).then(() => {
document.getElementById('loading-status').textContent = 'Model ready';
});
Extracting Structured JSON from Unstructured Text
Crafting Extraction Prompts with Output Schemas
The prompt engineering pattern for extraction is direct: provide the input text, specify the exact JSON schema expected, embed one few-shot example to anchor the format, and instruct the model to output nothing but JSON. No preamble, no explanation.
// Code Example 2: Prompt template function
function buildExtractionPrompt(rawText, schemaFields, exampleInput, exampleOutput) {
const schemaString = JSON.stringify(schemaFields, null, 2);
return [
{
role: 'system',
content: `You are a data extraction assistant. Extract structured data from text.
Output ONLY valid JSON matching this schema. No explanation, no markdown.
Schema: ${schemaString}
Example input: "${exampleInput}"
Example output: ${JSON.stringify(exampleOutput)}`
},
{
role: 'user',
content: `Extract from this text:
${rawText}`
}
];
}
// Define schema as a JS object
const contactSchema = {
name: 'string',
email: 'string',
phone: 'string',
company: 'string',
role: 'string'
};
const exampleIn = 'Jane Doe, CTO at Acme Corp. [email protected] / 555-0199';
const exampleOut = {
name: 'Jane Doe', email: '[email protected]',
phone: '555-0199', company: 'Acme Corp', role: 'CTO'
};
const prompt = buildExtractionPrompt(userInput, contactSchema, exampleIn, exampleOut);
Running Inference and Parsing the Response
SLM output frequently contains trailing tokens, partial markdown fencing, or truncated closing braces. A robust post-processing step is essential.
// Code Example 3: End-to-end extraction with error handling
async function extractStructuredData(rawText, schema, example) {
const messages = buildExtractionPrompt(rawText, schema, example.input, example.output);
const result = await extractionPipeline(messages, {
max_new_tokens: 256,
// Greedy decoding (do_sample: false) ensures deterministic output.
do_sample: false
});
const lastMessage = result?.[0]?.generated_text?.at(-1);
if (!lastMessage || typeof lastMessage.content !== 'string') {
return { success: false, data: null, raw: null, error: 'Unexpected model output shape' };
}
return parseModelOutput(lastMessage.content, schema);
}
function parseModelOutput(text, schema) {
// Locate JSON boundaries once, available to all repair strategies
const start = text.indexOf('{');
const end = text.lastIndexOf('}');
// Try direct parse first
try {
const parsed = JSON.parse(text);
return { success: true, data: projectToSchema(parsed, schema) };
} catch (e) {
console.warn('[parseModelOutput] Direct parse failed:', e.message, '| raw:', text.slice(0, 120));
}
// Repair strategy 1: extract JSON substring between first { and last }
if (start !== -1 && end > start) {
try {
const parsed = JSON.parse(text.slice(start, end + 1));
return { success: true, data: projectToSchema(parsed, schema) };
} catch (e) {
console.warn('[parseModelOutput] Substring repair failed:', e.message);
}
}
// Repair strategy 2: try appending common closing token sequences
if (start !== -1) {
for (const suffix of ['}', ']}', ']}}']) {
try {
const parsed = JSON.parse(text.slice(start) + suffix);
return { success: true, data: projectToSchema(parsed, schema) };
} catch (_) { /* try next suffix */ }
}
console.warn('[parseModelOutput] All repair strategies exhausted. raw:', text.slice(0, 120));
return { success: false, data: null, raw: text };
}
return { success: false, data: null, raw: text };
}
// Projects an object down to only the keys defined in the schema.
// Coerces values to match declared schema types where possible.
// For production use with nested structures, consider a dedicated schema
// validator (e.g., Zod, Ajv) instead of this simple projection utility.
function projectToSchema(obj, schema) {
const result = {};
for (const key of Object.keys(schema)) {
let value = obj[key] !== undefined ? obj[key] : null;
// Coerce values based on declared schema type
if (value !== null && typeof schema[key] === 'string') {
if (schema[key] === 'number' && typeof value !== 'number') {
const coerced = Number(value);
value = Number.isFinite(coerced) ? coerced : null;
}
}
result[key] = value;
}
return result;
}
// Run it
const input = `John Smith - Senior Engineer
Globex Corporation
[email protected] | (415) 555-0173`;
const output = await extractStructuredData(input, contactSchema, {
input: exampleIn, output: exampleOut
});
console.log(output.data);
// { name: "John Smith", email: "[email protected]", phone: "(415) 555-0173",
// company: "Globex Corporation", role: "Senior Engineer" }
Real-World Use Case: Client-Side Form Parsing and Auto-Fill
Parsing Pasted Address Blocks into Structured Fields
<!-- Code Example 4: Form integration with paste-triggered extraction -->
<form id="address-form">
<label>Paste full address:
<textarea id="address-raw" rows="3" placeholder="Paste address here..."></textarea>
</label>
<fieldset id="parsed-fields" disabled>
<input name="street" placeholder="Street" />
<input name="city" placeholder="City" />
<input name="state" placeholder="State" />
<input name="zip" placeholder="ZIP" />
<input name="country" placeholder="Country" />
</fieldset>
</form>
<script type="module">
const addressSchema = { street: 'string', city: 'string', state: 'string', zip: 'string', country: 'string' };
const addressExample = {
input: '742 Evergreen Terrace, Springfield, IL 62704, USA',
output: { street: '742 Evergreen Terrace', city: 'Springfield', state: 'IL', zip: '62704', country: 'USA' }
};
document.getElementById('address-raw').addEventListener('paste', async (e) => {
// Read clipboard data synchronously within the event handler
const raw = e.clipboardData.getData('text');
if (!raw.trim()) return;
const result = await extractStructuredData(raw, addressSchema, addressExample);
if (result.success) {
Object.keys(result.data).forEach(key => {
const input = document.querySelector(`#parsed-fields input[name="${key}"]`);
if (input) input.value = result.data[key] ?? '';
});
document.getElementById('parsed-fields').disabled = false;
}
});
</script>
Extracting Line Items from Receipt or Invoice Text
// Code Example 5: Receipt extraction
const receiptSchema = {
vendor: 'string',
date: 'string',
items: [{ description: 'string', quantity: 'number', price: 'string' }],
total: 'string'
};
const receiptText = `CORNER CAFE
03/15/2025
Espresso x2 $7.50
Croissant x1 $4.25
Sparkling Water x3 $8.25
--------------------------
TOTAL $20.00`;
const receiptResult = await extractStructuredData(receiptText, receiptSchema, {
  input: 'SHOP\n01/01/25\nItem x1 $5.00\nTOTAL $5.00',
  output: { vendor: 'SHOP', date: '01/01/25', items: [{ description: 'Item', quantity: 1, price: '$5.00' }], total: '$5.00' }
});
// HTML-escape utility to prevent XSS when rendering model output
const esc = s => String(s).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
// Render as table
if (receiptResult.success && Array.isArray(receiptResult.data.items)) {
const table = document.createElement('table');
table.innerHTML = `
<caption>${esc(receiptResult.data.vendor)} — ${esc(receiptResult.data.date)}</caption>
<thead><tr><th>Item</th><th>Qty</th><th>Price</th></tr></thead>
<tbody>${receiptResult.data.items.map(i => {
const qty = Number.isFinite(i.quantity) ? i.quantity : 0;
return `<tr><td>${esc(i.description)}</td><td>${esc(qty)}</td><td>${esc(i.price)}</td></tr>`;
}).join('')}</tbody>
<tfoot><tr><td colspan="2">Total</td><td>${esc(receiptResult.data.total)}</td></tr></tfoot>`;
document.getElementById('receipt-output').appendChild(table);
}
Note: The max_new_tokens: 256 limit used in the extraction function may truncate JSON output for receipts with many line items. Increase this value (e.g., to 512) if you expect complex documents, at the cost of slightly higher inference time.
Performance, Accuracy, and Constraints
Benchmarking Extraction Quality
Measuring extraction quality means tracking two metrics: exact-match rate on known inputs (does the entire JSON object match?) and field-level accuracy (what percentage of individual fields are correct?). Informal tests on simple, single-entity schemas suggest accuracy is high, but results depend heavily on schema complexity, input formatting, and prompt design. Developers should benchmark against their own data before relying on any general estimate. Without post-processing, roughly 70–85% of outputs parsed as valid JSON in our limited tests on single-entity schemas; the repair strategies above exist because the remainder need them.
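Both metrics are easy to compute once you have a hand-labeled test set. `scoreExtractions` below is an illustrative helper for flat schemas, not part of any library:

```javascript
// Computes exact-match rate and field-level accuracy over a labeled test set.
// Each case pairs a model prediction with the expected (gold) flat object.
function scoreExtractions(cases) {
  let exactMatches = 0;
  let correctFields = 0;
  let totalFields = 0;
  for (const { predicted, expected } of cases) {
    const keys = Object.keys(expected);
    const matching = keys.filter(k => predicted?.[k] === expected[k]).length;
    if (matching === keys.length) exactMatches++;
    correctFields += matching;
    totalFields += keys.length;
  }
  return {
    exactMatchRate: exactMatches / cases.length,
    fieldAccuracy: correctFields / totalFields
  };
}

const scores = scoreExtractions([
  { predicted: { name: 'Jane', email: '[email protected]' }, expected: { name: 'Jane', email: '[email protected]' } },
  { predicted: { name: 'John', email: null }, expected: { name: 'John', email: '[email protected]' } }
]);
console.log(scores); // → { exactMatchRate: 0.5, fieldAccuracy: 0.75 }
```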
Fallback to server-side processing remains the right call for complex multi-page documents, heavily ambiguous multilingual input, or cases where extraction errors carry significant downstream cost (financial documents, medical records).
Optimizing for Speed and Memory
First-load cost is the biggest UX hurdle: a 250MB model download on a slow connection is painful. After caching, subsequent loads drop to seconds. Inference latency varies significantly with GPU model, browser version, system memory, and input length. As a rough baseline, short extraction tasks (under 200 input tokens) completed in about 2 seconds on an Intel Iris Xe integrated GPU running Chrome 124 with WebGPU enabled. Without WebGPU (WASM fallback), expect extraction to take several times longer. Benchmark on your target hardware rather than trusting any single number.
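A minimal timing wrapper makes that per-device benchmarking trivial. It uses the standard `performance.now()` API, available in both browsers and recent Node versions; the commented usage assumes the `extractStructuredData` function from the earlier examples:

```javascript
// Times an async operation and returns both its result and elapsed ms.
// Useful for comparing WebGPU vs. WASM inference latency on real hardware.
async function timed(label, fn) {
  const start = performance.now();
  const result = await fn();
  const ms = performance.now() - start;
  console.log(`${label}: ${ms.toFixed(1)} ms`);
  return { result, ms };
}

// Usage sketch (extractStructuredData as defined in Code Example 3):
// const { ms } = await timed('contact extraction',
//   () => extractStructuredData(input, contactSchema, example));
```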
Moving inference off the main thread is non-negotiable for responsive UI. Note: WebGPU access from Web Workers is supported in Chrome/Edge 113+ but has limited support in Firefox and is unavailable in Safari as of this writing. For broader compatibility, use device: 'wasm' as the Worker default, or perform the same navigator.gpu check inside the Worker before requesting WebGPU.
// Code Example 6: Web Worker for model inference
// --- extraction-worker.js (must be a separate file served from the same origin) ---
// Pin to an exact verified patch version. Check releases at:
// https://github.com/huggingface/transformers.js/releases
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';
let pipe = null;
self.onmessage = async ({ data }) => {
if (data.type === 'load') {
// Detect WebGPU support inside the Worker context
const workerDevice = self.navigator?.gpu ? 'webgpu' : 'wasm';
try {
pipe = await pipeline('text-generation', data.modelId, {
dtype: 'q4', device: workerDevice
});
self.postMessage({ type: 'ready' });
} catch (err) {
self.postMessage({ type: 'error', msg: `Model load failed: ${err.message}` });
}
}
if (data.type === 'extract') {
if (!pipe) {
self.postMessage({ type: 'error', msg: 'Model not loaded yet. Wait for ready event.' });
return;
}
try {
const result = await pipe(data.messages, {
max_new_tokens: 256,
// Greedy decoding for deterministic output
do_sample: false
});
const lastMessage = result?.[0]?.generated_text?.at(-1);
const content = (lastMessage && typeof lastMessage.content === 'string')
? lastMessage.content
: '';
// Forward the format key so the main thread resolves the correct schema
self.postMessage({ type: 'result', payload: content, format: data.format });
} catch (err) {
self.postMessage({ type: 'error', msg: `Extraction failed: ${err.message}` });
}
}
};
A model warm-up strategy (running a trivial extraction immediately after load, before the user interacts) eliminates the cold-start penalty on the first real extraction.
Building the Live Demo
Putting It All Together
The architecture is a single HTML page: Transformers.js loaded via CDN ESM import, model cached in the browser after first load, inference delegated to a Web Worker. The UI provides a textarea for pasting input, a format selector (contact, address, or receipt), an extract button, and a rendered JSON output panel.
The functions buildExtractionPrompt, parseModelOutput, and projectToSchema from Code Examples 2–3 must be included or imported into the main thread script for the scaffold below to work. The same applies to all schema and example definitions from earlier examples.
<!-- Code Example 7: Mini-app scaffold -->
<div id="app">
<select id="format-select">
<option value="contact">Contact Info</option>
<option value="address">Address</option>
<option value="receipt">Receipt</option>
</select>
<textarea id="input-text" rows="6" placeholder="Paste unstructured text..."></textarea>
<button id="extract-btn" disabled>Extract</button>
<pre id="json-output"></pre>
<div id="receipt-output"></div>
</div>
<script type="module">
// --- Include or import buildExtractionPrompt, parseModelOutput, projectToSchema ---
// --- from Code Examples 2–3 above. ---
// Schemas and examples defined per format (from Code Examples 2, 4, 5)
const contactSchema = {
name: 'string', email: 'string', phone: 'string', company: 'string', role: 'string'
};
const contactExample = {
input: 'Jane Doe, CTO at Acme Corp. [email protected] / 555-0199',
output: { name: 'Jane Doe', email: '[email protected]', phone: '555-0199', company: 'Acme Corp', role: 'CTO' }
};
const addressSchema = {
street: 'string', city: 'string', state: 'string', zip: 'string', country: 'string'
};
const addressExample = {
input: '742 Evergreen Terrace, Springfield, IL 62704, USA',
output: { street: '742 Evergreen Terrace', city: 'Springfield', state: 'IL', zip: '62704', country: 'USA' }
};
const receiptSchema = {
vendor: 'string', date: 'string',
items: [{ description: 'string', quantity: 'number', price: 'string' }],
total: 'string'
};
const receiptExample = {
input: 'SHOP\n01/01/25\nItem x1 $5.00\nTOTAL $5.00',
output: { vendor: 'SHOP', date: '01/01/25', items: [{ description: 'Item', quantity: 1, price: '$5.00' }], total: '$5.00' }
};
const configs = {
contact: { schema: contactSchema, example: contactExample },
address: { schema: addressSchema, example: addressExample },
receipt: { schema: receiptSchema, example: receiptExample }
};
const worker = new Worker('./extraction-worker.js', { type: 'module' });
worker.postMessage({ type: 'load', modelId: 'HuggingFaceTB/SmolLM2-360M-Instruct' });
worker.onmessage = ({ data }) => {
if (data.type === 'ready') document.getElementById('extract-btn').disabled = false;
if (data.type === 'result') {
// Use the format key returned by the Worker to resolve the correct schema,
// preventing race conditions if the user switches format mid-extraction.
const config = configs[data.format];
const parsed = parseModelOutput(data.payload, config.schema);
document.getElementById('json-output').textContent = JSON.stringify(parsed, null, 2);
}
if (data.type === 'error') {
console.error('Worker reported error:', data.msg);
document.getElementById('json-output').textContent = `Error: ${data.msg}`;
}
};
worker.onerror = (e) => {
console.error('Worker error:', e.message);
};
document.getElementById('extract-btn').addEventListener('click', () => {
const format = document.getElementById('format-select').value;
const raw = document.getElementById('input-text').value;
const config = configs[format];
const messages = buildExtractionPrompt(
raw, config.schema, config.example.input, config.example.output
);
// Send format key alongside messages so the Worker can echo it back with the result
worker.postMessage({ type: 'extract', messages, format });
});
</script>
A working demo can be constructed from the code examples in this article.
Key Takeaways and What's Next
Specialized SLMs are production-viable for scoped structured data extraction in the browser right now. Zero per-inference cost, full offline capability, and complete data privacy make this approach particularly useful for applications that must keep sensitive data on-device: health forms where HIPAA compliance prohibits server-side transmission, financial documents, personal contact information.
The limitations are real and worth tracking. Model download size remains a first-visit friction point. Device capability variance means WebGPU acceleration is not universal, and the WASM fallback runs roughly 3–5x slower based on our limited testing. The accuracy ceiling on complex or highly ambiguous documents still favors server-side models.
WebGPU support is expanding across browsers. Quantization techniques continue to shrink model sizes without proportional accuracy loss. Grammar-constrained decoding support in Transformers.js may reduce or eliminate the need for post-processing JSON repair entirely, and third-party constrained generation libraries already offer partial solutions. Smaller, better fine-tuned extraction-specific models are appearing on the Hugging Face Hub regularly. For a concrete next step, try swapping in a custom schema against your own production data and measuring field-level accuracy against a hand-labeled test set.