The Complete Guide to Local-First AI: WebGPU, Wasm, and Chrome's Built-in Model


Local-first AI in the browser has crossed the threshold from experimental curiosity to production-viable technology. This guide is a deep technical walkthrough for experienced JavaScript and web developers who want to understand the architecture beneath browser-based LLM inference, not just copy-paste a demo.
Table of Contents
- Why Local-First AI Is Having Its Moment
- The Case for Running LLMs in the Browser
- WebGPU Architecture: The GPU Compute Layer Browsers Were Missing
- WebAssembly's Role: The CPU Fallback and Orchestration Layer
- Chrome's Built-in Model: The Prompt API and Gemini Nano
- Bring Your Own Model: WebGPU-Accelerated Open-Source LLMs
- Building the Interactive WebGPU Inference Benchmark
- Performance Optimization and Production Considerations
- Security, Ethical, and Practical Caveats
- The Local-First AI Stack Is Ready
Why Local-First AI Is Having Its Moment
Three developments converged to make local-first AI viable: 4-bit quantized small language models (Q4_0, Q4_K_M) that fit in consumer GPU memory, WebGPU reaching stable status in Chrome 113 (May 2023) with real compute shader pipelines (Firefox and Safari support remains partial as of writing), and Chrome shipping a built-in Gemini Nano model accessible through the Prompt API. Together, these form a complete local-first AI stack that didn't exist twelve months ago.
This guide covers the WebGPU compute layer, how WebAssembly serves as the orchestration and fallback substrate, Chrome's Prompt API for zero-setup on-device inference, and the bring-your-own-model path using open-source frameworks. Working code accompanies every major section.
Prerequisites: solid JavaScript fundamentals, a basic understanding of how ML inference works (matrix multiplications, tokenization, autoregressive generation), and Chrome 128 or later with the #prompt-api-for-gemini-nano flag enabled (see the Prompt API setup instructions below).
The Case for Running LLMs in the Browser
Privacy by Architecture
Consider a therapist's note-taking app or a legal document reviewer. When inference runs entirely on a user's device, data never leaves that device. This isn't a policy promise or a contractual guarantee; it's an architectural fact. No HTTP request carries the prompt to an external server, and no response payload carries results back. For applications handling sensitive text (medical notes, legal drafts, personal journals), this eliminates an entire category of data-handling risk. On-device processing eliminates transfer risks, but GDPR obligations around consent, retention, and access rights remain. HIPAA-adjacent use cases become more tractable too, though developers should note that on-device processing alone doesn't constitute full HIPAA compliance without broader controls.
Latency and Cost Elimination
Cloud-based LLM inference carries two unavoidable costs: network round-trip latency and per-token billing. Local inference eliminates both. Once a model is loaded into GPU memory, generation happens at whatever speed the hardware allows, with zero network overhead. There's no metered API. No usage cap. No surprise invoice at the end of the month. For high-volume, latency-sensitive features like autocomplete, inline rewriting, or real-time summarization, the cost difference is stark: cloud inference for GPT-class models runs $0.15–$0.60 per million input tokens (varying by provider and model tier), while local inference has zero marginal cost after the initial model download.
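To make the cost gap concrete, the arithmetic is straightforward. The usage profile below (50,000 requests per day at 500 input tokens each) is hypothetical, and the price range comes from the figures quoted above:

```javascript
// Back-of-envelope monthly cloud cost for a token-heavy feature.
// Usage numbers are hypothetical; prices are per million input tokens.
function monthlyCloudCostUSD(requestsPerDay, tokensPerRequest, pricePerMillion) {
  const tokensPerMonth = requestsPerDay * tokensPerRequest * 30;
  return (tokensPerMonth / 1e6) * pricePerMillion;
}

// 50,000 autocomplete requests/day at 500 input tokens each:
console.log(monthlyCloudCostUSD(50_000, 500, 0.15).toFixed(2)); // → "112.50"
console.log(monthlyCloudCostUSD(50_000, 500, 0.60).toFixed(2)); // → "450.00"
```

At that volume, cloud inference costs between roughly $112 and $450 per month for input tokens alone; the local path pays only the one-time model download.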
Offline-First Capabilities
Browser-based inference works without a network connection. Combined with service workers and cached model weights, developers can build tools that function in genuinely offline environments: field data collection, embedded kiosks, airplane-mode productivity applications. The model ships with the app or downloads once, then runs indefinitely without phoning home.
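A minimal sketch of that caching layer, assuming model weights are served as .gguf/.onnx/.bin files; the cache name and extension filter are illustrative, and the typeof-self guard simply lets the file load outside a worker context:

```javascript
// sw.js — cache-first service worker for model weight files (illustrative).
const MODEL_CACHE = 'model-weights-v1';

// Only intercept large, immutable model artifacts (assumed extensions)
const isModelAsset = (pathname) => /\.(gguf|onnx|bin)$/.test(pathname);

async function cacheFirst(request) {
  const cache = await caches.open(MODEL_CACHE);
  const cached = await cache.match(request);
  if (cached) return cached; // Cache hit: serves the weights fully offline
  const response = await fetch(request);
  if (response.ok) cache.put(request, response.clone()); // Cache for next load
  return response;
}

if (typeof self !== 'undefined' && 'addEventListener' in self) {
  self.addEventListener('fetch', (event) => {
    const url = new URL(event.request.url);
    if (isModelAsset(url.pathname)) event.respondWith(cacheFirst(event.request));
  });
}
```

After the first successful download, subsequent page loads serve the multi-gigabyte weights straight from the Cache API, so the app works in airplane mode.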
WebGPU Architecture: The GPU Compute Layer Browsers Were Missing
From WebGL to WebGPU: What Changed
WebGL was designed for rasterization. It maps well to vertex and fragment shaders for drawing triangles, but it has no native concept of general-purpose compute. Developers who tried to repurpose WebGL for matrix math had to encode data into textures and abuse fragment shaders, a technique that works but is fragile, memory-inefficient, and miserable to debug.
WebGPU is fundamentally different. It exposes compute shaders as a first-class pipeline stage, provides storage buffers for arbitrary read/write data, and gives developers explicit control over GPU resource management. The programming model sits closer to Vulkan and Metal than to OpenGL.
| Capability | WebGL2 | WebGPU |
|---|---|---|
| Compute shaders | No | Yes |
| Storage buffers | No (texture workarounds) | Yes |
| Explicit memory management | No | Yes |
| Shader language | GLSL ES 3.0 | WGSL |
| Parallel dispatch control | Limited | Workgroups + dispatch |
| Suitable for ML inference | Barely | Purpose-built |
The WebGPU Pipeline for Matrix Operations
LLM inference is dominated by matrix multiplications: projecting embeddings, computing attention scores, applying feed-forward layers. Every transformer block repeats these operations, and every generated token triggers another full forward pass. WebGPU maps this work onto compute shaders executed across thousands of GPU threads.
The pipeline follows a specific sequence: request a GPU adapter, obtain a device, create shader modules written in WGSL, bind input and output buffers, encode compute commands, submit them to the GPU queue, and read back results. Workgroups define how threads are organized, and dispatch calls determine how many workgroups execute.
// Code Example 1: WebGPU device init + minimal matrix multiply compute shader
// ⚠️ Warning: This shader uses a bounds guard but is configured for 64×64 matrices.
// Production shaders should pass N as a uniform for flexibility.
// ⚠️ This example is illustrative — bufferA and bufferB are written to before
// dispatch (see writeBuffer calls below).
const BYTES_PER_F32 = 4;
const N = 64;

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU not supported on this browser/device.');
const device = await adapter.requestDevice();

const shaderModule = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read> a: array<f32>;
    @group(0) @binding(1) var<storage, read> b: array<f32>;
    @group(0) @binding(2) var<storage, read_write> result: array<f32>;

    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) gid: vec3u) {
      let N = 64u;
      let row = gid.x;
      let col = gid.y;
      if (row >= N || col >= N) { return; }
      var sum = 0.0;
      for (var k = 0u; k < N; k = k + 1u) {
        sum = sum + a[row * N + k] * b[k * N + col];
      }
      result[row * N + col] = sum;
    }
  `
});

const bufferSize = N * N * BYTES_PER_F32; // 64x64 f32 matrix
const bufferA = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
const bufferB = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
const bufferResult = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST
});
const readBuffer = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST
});

// Write input data into buffers before dispatch
const matrixAData = new Float32Array(N * N); // populate with your data
const matrixBData = new Float32Array(N * N); // populate with your data
device.queue.writeBuffer(bufferA, 0, matrixAData);
device.queue.writeBuffer(bufferB, 0, matrixBData);

const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: shaderModule, entryPoint: 'main' }
});
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: bufferA } },
    { binding: 1, resource: { buffer: bufferB } },
    { binding: 2, resource: { buffer: bufferResult } },
  ]
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(8, 8); // 64/8 = 8 workgroups per dimension
pass.end();
encoder.copyBufferToBuffer(bufferResult, 0, readBuffer, 0, bufferSize);
device.queue.submit([encoder.finish()]);

// Read back results — required to access GPU output in JavaScript
await readBuffer.mapAsync(GPUMapMode.READ);
const resultData = new Float32Array(readBuffer.getMappedRange().slice(0));
readBuffer.unmap();
console.log('Result[0,0]:', resultData[0]);
Memory Constraints and Model Quantization
Browser GPU memory is constrained. Discrete GPUs typically expose 6–8 GB of dedicated VRAM to WebGPU, while integrated GPUs and Apple Silicon share system RAM with the GPU, commonly making 4–6 GB available depending on total system memory and OS allocation. The browser itself consumes a portion of that. At these budgets, half-precision (FP16) weights are impractical for models beyond a few billion parameters, and full-precision FP32 is even less feasible.
This is why quantization is non-negotiable for browser inference. 4-bit quantized models (Q4_0, Q4_K_M in GGUF format) or INT8 ONNX models shrink memory requirements by approximately 4× compared to FP16, or 8× compared to FP32. Practically, this sets the in-browser model size ceiling at 3B to 7B parameters, depending on the quantization scheme and available memory. A 4-bit quantized 3B model occupies approximately 1.5 to 2 GB for weights alone (KV cache and runtime overhead add to this), which fits comfortably. A 7B model at 4-bit needs around 3.5 to 4 GB for weights, which works on machines with 8 GB available GPU memory but leaves little headroom.
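The arithmetic behind these figures is worth making explicit. The estimator below is a rough sketch: the ~4.5 effective bits per weight for Q4_K_M is an approximation that folds in scale and block metadata, not exact GGUF accounting:

```javascript
// Rough weight-memory estimate for a quantized model.
// bitsPerWeight values are approximations: 16 for FP16, ~4.5 for Q4_K_M
// once quantization scales/block metadata are included.
function estimateWeightMemoryGB(paramsBillions, bitsPerWeight) {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1024 ** 3; // GiB
}

console.log(estimateWeightMemoryGB(3, 16).toFixed(1));  // FP16 3B   → "5.6"
console.log(estimateWeightMemoryGB(3, 4.5).toFixed(1)); // Q4_K_M 3B → "1.6"
console.log(estimateWeightMemoryGB(7, 4.5).toFixed(1)); // Q4_K_M 7B → "3.7"
```

These weight-only numbers line up with the 1.5–2 GB and 3.5–4 GB figures above; remember that the KV cache and runtime overhead come on top.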
WebAssembly's Role: The CPU Fallback and Orchestration Layer
Why Wasm Still Matters When You Have GPU
GPU compute handles the heavy matrix math, but LLM inference involves more than matrix multiplications. Tokenization (converting text to token IDs), detokenization (converting IDs back to text), model file parsing, sampling logic (temperature, top-k, top-p), and session management are all CPU-bound tasks that don't benefit from GPU parallelism. WebAssembly handles these operations with near-native speed.
Wasm SIMD extensions can further accelerate tokenizer performance, depending on the implementation. BPE and SentencePiece tokenizers involve tight loops over byte sequences, and SIMD instructions can process multiple bytes per cycle. This helps keep tokenization from becoming a bottleneck in the inference pipeline.
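SIMD availability can be feature-detected by validating a tiny module that uses a v128 instruction. The byte sequence below is borrowed from the open-source wasm-feature-detect project; treat it as an assumption and verify it against that library (in practice you would simply depend on the package):

```javascript
// Feature-detect Wasm SIMD: WebAssembly.validate returns true only if the
// engine accepts the v128 instructions in this minimal module.
// Byte sequence assumed from the wasm-feature-detect project — verify there.
const SIMD_TEST_MODULE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3,
  2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

const simdSupported = WebAssembly.validate(SIMD_TEST_MODULE);
console.log('Wasm SIMD supported:', simdSupported);
```

Use the result to choose between a SIMD-enabled and a plain Wasm build of your tokenizer at load time.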
Wasm threading and SIMD optimizations require cross-origin isolation. Serve your page with Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers, or SharedArrayBuffer (which some threading models depend on) will be unavailable.
The Wasm + WebGPU Tandem Architecture
The data flow in a browser-based LLM follows a clear split: user input enters JavaScript, passes to a Wasm-compiled tokenizer that produces token IDs, those IDs feed into the WebGPU inference engine for the forward pass, the resulting logits return to Wasm (or JavaScript) for sampling, and the sampled token ID passes through the Wasm detokenizer to produce text. Libraries like the Wasm port of llama.cpp and ONNX Runtime Web implement exactly this architecture, using Wasm for everything except the core matrix operations.
// Code Example 2: Loading a Wasm tokenizer and generating token IDs
//
// This example uses `@bokuweb/sentencepiece-js`.
// Install with: npm install @bokuweb/sentencepiece-js
// Adjust the API surface if the package's exports differ from what is shown here;
// run: node -e "const m = require('@bokuweb/sentencepiece-js'); console.log(Object.keys(m))"
// to confirm the available exports.
import { SentencePieceProcessor } from '@bokuweb/sentencepiece-js';
const tokenizer = new SentencePieceProcessor();
await tokenizer.load('tokenizer.model'); // Load SentencePiece model file
const prompt = "Explain quantum entanglement in simple terms.";
const tokenIds = tokenizer.encodeIds(prompt);
console.log('Token IDs:', tokenIds); // e.g., [1, 12018, 13261, ...]
console.log('Token count:', tokenIds.length);
// These token IDs are then passed to the WebGPU inference pipeline
// as input_ids for the model's forward pass.
const decoded = tokenizer.decodeIds(tokenIds);
console.log('Round-trip:', decoded);
// Wasm tokenizer handles BPE merges and special tokens at near-native speed.
// The WebGPU pipeline only receives integer arrays.
Chrome's Built-in Model: The Prompt API and Gemini Nano
What Is the Prompt API?
Chrome ships Gemini Nano, a small on-device language model (exact parameter count not officially confirmed by Google), directly in the browser and exposes it to JavaScript through the Prompt API's ai.languageModel namespace, with no API keys or external dependencies. Earlier experimental Chrome builds used an informal window.ai namespace that is not equivalent to the current API. The Prompt API has been available via origin trial and behind feature flags since Chrome 128; check chromestatus.com for current availability before shipping.
The API surface is straightforward: ai.languageModel.create() produces a session, sessions accept prompts via .prompt() or .promptStreaming(), and system prompts configure behavior at session creation. The model runs entirely on-device using the user's hardware.
Setting Up and Feature-Detecting the Prompt API
For development and testing, the on-device model must be explicitly enabled. Navigate to chrome://flags/#optimization-guide-on-device-model and set it to "Enabled BypassPerfRequirement" (the bypass flag skips hardware checks that might otherwise block lower-spec machines). After restarting Chrome, the model downloads in the background as a separate component. Check chrome://components for "Optimization Guide On Device Model" to verify download status. The model is not bundled with the Chrome binary — it is downloaded separately after you enable the flag or meet eligibility criteria.
Feature detection should always precede usage, since the API may be unavailable on non-Chrome browsers or older versions.
// Code Example 3: Feature detection, session creation, and streaming prompt
async function runPromptAPI() {
  // Feature detection
  if (!('ai' in self) || !('languageModel' in self.ai)) {
    console.error('Prompt API not available in this browser.');
    return null;
  }

  // Check capabilities
  const capabilities = await ai.languageModel.capabilities();
  console.log('Available:', capabilities.available); // 'readily', 'after-download', or 'no'
  console.log('Default temperature:', capabilities.defaultTemperature);
  console.log('Default topK:', capabilities.defaultTopK);
  console.log('Max topK:', capabilities.maxTopK);
  if (capabilities.available === 'no') {
    console.error('Language model not available on this device.');
    return null;
  }

  // Create a session with system prompt, using capability defaults as fallbacks
  const session = await ai.languageModel.create({
    systemPrompt: 'You are a concise technical assistant. Answer in two sentences or fewer.',
    temperature: 0.7,
    topK: Math.min(5, capabilities.maxTopK),
  });

  // Streaming prompt
  const stream = await session.promptStreaming('What is WebGPU used for in ML inference?');
  let fullResponse = '';
  for await (const chunk of stream) {
    fullResponse = chunk; // Each chunk is the full accumulated text so far, not a delta
    document.getElementById('output').textContent = chunk; // Overwrite for live display
  }
  console.log('Final response:', fullResponse);
  return session;
}
Session Management and Context Windows
Sessions maintain conversational context, but that context consumes tokens from a fixed budget. The maxTokens, temperature, and topK parameters control generation behavior. Critically, countPromptTokens() lets developers measure how much of the context window a prompt will consume before sending it, enabling proactive truncation or summarization of conversation history.
Session cloning via .clone() creates a branching point: both sessions share the context up to the clone, then diverge. This is useful for exploring multiple response paths or implementing undo-style interactions. Sessions should be explicitly destroyed via .destroy() when no longer needed to free GPU and system memory.
// Code Example 4: Multi-turn conversation with token counting and cloning
// Note: This function does NOT destroy the session — the caller owns the session
// lifecycle and is responsible for calling session.destroy() when appropriate.
async function multiTurnDemo(session) {
  // First turn
  const r1 = await session.prompt('What is WGSL?');
  console.log('Turn 1:', r1);

  // Count tokens before sending a long follow-up
  const followUp = 'How does WGSL differ from GLSL for compute workloads? Include examples.';
  const tokenCount = await session.countPromptTokens(followUp);
  const remaining = session.maxTokens - session.tokensSoFar;
  console.log(`Follow-up will consume ${tokenCount} tokens`);
  console.log(`Tokens remaining: ${remaining}`);
  console.log(`Max tokens: ${session.maxTokens}`);
  if (tokenCount > remaining) {
    console.warn('Prompt too long for remaining context.');
    return { status: 'context_exceeded' }; // Caller decides whether to destroy or reset
  }

  const r2 = await session.prompt(followUp);
  console.log('Turn 2:', r2);

  // Clone to branch the conversation
  const branchedSession = await session.clone();
  const r3a = await session.prompt('Give me a code example.');
  const r3b = await branchedSession.prompt('What are the performance implications?');
  console.log('Branch A:', r3a);
  console.log('Branch B:', r3b);

  // Clean up the branched session (caller still owns the original session)
  branchedSession.destroy();
}
Limitations of the Built-in Model
Gemini Nano is a small model, and that places hard limits on capability. As of current Chrome stable releases at time of writing, the model is English-only with no multimodal input support (no image or audio processing). Check ai.languageModel.capabilities() for current feature flags, as these constraints may change with Chrome updates. There is no fine-tuning mechanism; the system prompt is the only customization lever. The model version is coupled to Chrome's update cycle, meaning developers cannot pin a specific model version or roll back if a Chrome update changes behavior. For tasks requiring domain-specific knowledge, multilingual support, or larger context windows, the bring-your-own-model approach is necessary.
Bring Your Own Model: WebGPU-Accelerated Open-Source LLMs
Frameworks for In-Browser LLM Inference
Three frameworks dominate the BYOM space for browser-based LLM inference:
Web-LLM (MLC AI) compiles models to WebGPU shaders via Apache TVM. It supports quantized versions of Llama, Mistral, Phi, Gemma, and other popular model families. Models are pre-compiled to WebGPU-optimized artifacts and cached locally after first download.
What sets Transformers.js (Hugging Face) apart is its familiar API: developers who have used the Python transformers library will recognize the patterns immediately. Under the hood, it uses ONNX Runtime Web as its backend, with WebGPU acceleration for supported operations, and provides access to a broad range of models from the Hugging Face Hub.
MediaPipe LLM Inference API is Google's task-focused approach. It wraps model loading and inference behind a simpler API designed for specific use cases rather than general-purpose chat.
Running a Llama 3.2 1B Model with Web-LLM
// Code Example 5: Complete Web-LLM setup with streaming chat
// npm install @mlc-ai/web-llm — verify the current version at npmjs.com/package/@mlc-ai/web-llm
import { CreateMLCEngine } from '@mlc-ai/web-llm';

async function runWebLLM() {
  // Verify this model ID against the current MLC model registry;
  // model IDs may change between library versions.
  const modelId = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';

  // Initialize with progress tracking
  const engine = await CreateMLCEngine(modelId, {
    initProgressCallback: (progress) => {
      console.log(`Loading: ${progress.text}`);
      // progress.progress is a float 0-1 during model download/compile
      if (progress.progress) {
        const pct = (progress.progress * 100).toFixed(1);
        document.getElementById('status').textContent = `${pct}% loaded`;
      }
    },
  });
  console.log('Model ready. Starting inference...');

  // Streaming chat completion
  const messages = [
    { role: 'system', content: 'You are a helpful coding assistant.' },
    { role: 'user', content: 'Write a JavaScript function to debounce any callback.' },
  ];
  const stream = await engine.chat.completions.create({
    messages,
    stream: true,
    temperature: 0.6,
    max_tokens: 256,
  });

  let output = '';
  const startTime = performance.now();
  let chunkCount = 0;
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content || '';
    output += delta;
    chunkCount++;
    document.getElementById('output').textContent = output;
  }
  const elapsed = performance.now() - startTime;
  console.log(`Generated ${chunkCount} chunks in ${elapsed.toFixed(0)}ms`);
  // For accurate tokens/sec, use engine.runtimeStatsText() which reports true token counts
  console.log(await engine.runtimeStatsText());

  // Return the engine for further use; caller is responsible for calling engine.unload()
  // when the engine is no longer needed.
  return engine;
}

// Usage:
// const engine = await runWebLLM();
// ... use engine for additional inference ...
// engine.unload(); // explicit teardown when done
When to Use Built-in vs. BYOM
The decision depends on several factors. Chrome's Prompt API offers zero-setup convenience and guaranteed availability on supported Chrome versions once the model has downloaded. It suits lightweight tasks: summarization, rewriting, simple Q&A, and text classification.
The BYOM path is appropriate when developers need a specific model architecture, larger parameter counts, non-English languages, or domain-specialized models. The tradeoff: model weights must be downloaded (often 1–4 GB), initial load takes seconds to minutes depending on connection speed and hardware, and the developer assumes responsibility for model version management and caching.
| Factor | Prompt API (Gemini Nano) | BYOM (Web-LLM / Transformers.js) |
|---|---|---|
| Setup complexity | Minimal | Moderate to high |
| Model choice | Fixed (Gemini Nano) | Llama, Mistral, Phi, Gemma, etc. |
| Download size | Downloaded separately (~1 GB+) via Chrome's background update mechanism; verify at chrome://components | 1-4 GB per model |
| Language support | English only (as of current stable) | Multilingual available |
| Offline | Yes (after model downloaded) | Yes (after model cached) |
| Fine-tuning | System prompt only | Not supported at runtime; use pre-fine-tuned model variants (e.g., instruction-tuned Llama) |
Building the Interactive WebGPU Inference Benchmark
Benchmark Architecture
A meaningful benchmark needs to measure three values: time to first token (TTFT), sustained token generation speed (tokens/second), and total inference time. Comparing across backends requires holding variables constant: same prompt text, same requested output length, and same hardware.
The benchmark harness below wraps any inference backend behind a common interface, captures timing data, and returns a standardized results object.
Each adapter function (promptApiInfer, webLlmInfer, transformersJsInfer) must conform to the (prompt, maxTokens, onTokenCallback) signature. The onToken callback must be invoked with the actual number of tokens generated per chunk (a numeric argument) for accurate tokens/second measurement. See each backend's documentation for how to obtain true token counts (e.g., Web-LLM's engine.runtimeStatsText()).
// Code Example 6: Benchmark harness for comparing inference backends
// Adapter stub signatures — implement these for each backend:
// async function promptApiInfer(prompt, maxTokens, onToken) { /* onToken(numTokens) per chunk */ }
// async function webLlmInfer(prompt, maxTokens, onToken) { /* onToken(numTokens) per chunk */ }
// async function transformersJsInfer(prompt, maxTokens, onToken) { /* onToken(numTokens) per chunk */ }
async function benchmarkInference(backendName, inferFn, prompt, maxTokens = 128) {
  // Note: gc() is unavailable in normal browser contexts; GC cannot be manually triggered from web pages
  await new Promise((r) => setTimeout(r, 500)); // Settle

  const t0 = performance.now();
  let firstTokenTime = null;
  let tokenCount = 0;
  const onToken = (tokenDelta) => {
    if (typeof tokenDelta !== 'number') {
      throw new Error('Adapter must pass token count as a numeric argument to onToken');
    }
    if (!firstTokenTime) firstTokenTime = performance.now();
    tokenCount += tokenDelta;
  };

  let result;
  try {
    // inferFn must accept (prompt, maxTokens, onTokenCallback) and return full text
    result = await inferFn(prompt, maxTokens, onToken);
  } catch (err) {
    console.error(`[${backendName}] Inference failed:`, err);
    throw err;
  }

  const totalMs = performance.now() - t0;
  const ttft = firstTokenTime ? firstTokenTime - t0 : null;
  const generationMs = firstTokenTime ? totalMs - (firstTokenTime - t0) : totalMs;
  let tokensPerSecond;
  if (generationMs < 10) {
    tokensPerSecond = null; // Insufficient resolution for meaningful measurement
  } else {
    tokensPerSecond = tokenCount > 0 ? (tokenCount / (generationMs / 1000)) : 0;
  }

  return {
    backend: backendName,
    ttft: ttft ? Math.round(ttft) : 'N/A',
    tokensPerSecond: tokensPerSecond !== null ? parseFloat(tokensPerSecond.toFixed(1)) : 'N/A (insufficient resolution)',
    totalMs: Math.round(totalMs),
    tokenCount,
    outputPreview: result.slice(0, 100),
  };
}

// Run across multiple backends
async function runComparison(prompt) {
  const results = [];
  if ('ai' in self && 'languageModel' in self.ai) {
    results.push(await benchmarkInference('Prompt API', promptApiInfer, prompt));
  }
  results.push(await benchmarkInference('Web-LLM (Phi-3.5)', webLlmInfer, prompt));
  results.push(await benchmarkInference('Transformers.js', transformersJsInfer, prompt));
  console.table(results);
  return results;
}
Rendering Results and Sharing
Note: This section is a design sketch. No code example is provided; the implementation will vary based on your UI framework and sharing requirements.
The benchmark results render into a simple HTML table or a lightweight Canvas-based chart. A shareable "results card" can be implemented as a styled div with fixed dimensions, designed to be screenshot-friendly for social media sharing. The interactive hosted version serves as the viral asset: developers run the benchmark on their own hardware and share their numbers, naturally generating engagement as different GPU configurations produce different results.
Performance varies significantly across hardware. Integrated GPUs (Intel Iris, Apple M-series) and discrete GPUs (NVIDIA, AMD) exhibit different throughput characteristics. Apple Silicon's unified memory architecture avoids the CPU-to-GPU transfer penalty that discrete GPU setups incur, which benefits WebGPU workloads where model weights and KV cache must remain GPU-accessible. Actual throughput depends on the specific chip, model size, and quantization scheme.
Performance Optimization and Production Considerations
GPU Memory Management
Model loading is the most expensive operation. Pre-warm the model on page load (or during an explicit "preparing" step) to avoid a multi-second stall when the user first triggers inference. Monitor the GPUDevice.lost promise to detect and recover from GPU context loss, which can happen if the system reclaims GPU memory under pressure.
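A sketch of device-loss monitoring follows; `initModel` is a hypothetical stand-in for your own model-loading routine, not a library API:

```javascript
// Hypothetical placeholder for your model-loading routine.
async function initModel(device) {
  // ... upload weights, build compute pipelines ...
}

// Create a GPUDevice and re-initialize if the system reclaims it.
async function createMonitoredDevice() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('WebGPU not available');
  const device = await adapter.requestDevice();

  device.lost.then((info) => {
    console.warn(`GPU device lost (${info.reason}): ${info.message}`);
    if (info.reason !== 'destroyed') {
      // Lost involuntarily (e.g., memory pressure): rebuild device and pipelines
      createMonitoredDevice().then((d) => initModel(d));
    }
  });

  return device;
}
```

The `reason !== 'destroyed'` check distinguishes deliberate teardown from involuntary loss, so cleanup paths don't trigger an unwanted reload.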
Run all inference work off the main thread. Web Workers prevent UI jank during model loading and token generation. The worker handles the Wasm/WebGPU pipeline and posts generated tokens back to the main thread via postMessage. Workers cannot share WebGPU device and buffer objects via postMessage; each worker must create its own GPUDevice. Consult the WebGPU spec section on multi-threading before implementing.
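A minimal sketch of the main-thread side of that pattern; the worker file name and the message shapes ({ type: 'generate' } / { type: 'token' } / { type: 'done' }) are our own conventions, not a library API:

```javascript
// main.js — delegate inference to a Web Worker so the UI stays responsive.
// 'inference-worker.js' (hypothetical) would run the Wasm/WebGPU pipeline
// and post each generated token back via postMessage.
function createInferenceWorker(onToken, onDone) {
  const worker = new Worker('inference-worker.js', { type: 'module' });
  worker.onmessage = (e) => {
    if (e.data.type === 'token') onToken(e.data.text);       // stream to UI
    else if (e.data.type === 'done') onDone(e.data.fullText); // final text
  };
  return {
    generate: (prompt) => worker.postMessage({ type: 'generate', prompt }),
    terminate: () => worker.terminate(),
  };
}
```

Because the worker must create its own GPUDevice, model weights load inside the worker; only prompts and generated text cross the postMessage boundary.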
Progressive Enhancement Strategy
Production applications should never assume a single backend is available. A cascading detection strategy provides resilience:
// Code Example 7: Unified inference factory with progressive enhancement
const BYOM_MODEL_ID = 'Phi-3.5-mini-instruct-q4f16_1-MLC'; // Change to your preferred model

async function createInferenceEngine() {
  // Tier 1: Chrome Prompt API (lowest friction)
  if ('ai' in self && 'languageModel' in self.ai) {
    const caps = await ai.languageModel.capabilities();
    if (caps.available === 'readily') {
      const session = await ai.languageModel.create();
      return {
        backend: 'prompt-api',
        infer: (prompt) => session.prompt(prompt),
        inferStream: (prompt) => session.promptStreaming(prompt),
        cleanup: () => session.destroy(),
      };
    }
  }

  // Tier 2: WebGPU BYOM (Web-LLM)
  if ('gpu' in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) {
      const { CreateMLCEngine } = await import('@mlc-ai/web-llm');
      const engine = await CreateMLCEngine(BYOM_MODEL_ID);
      return {
        backend: 'web-llm',
        infer: async (prompt) => {
          const r = await engine.chat.completions.create({
            messages: [{ role: 'user', content: prompt }],
          });
          return r.choices[0].message.content;
        },
        cleanup: () => engine.unload(),
      };
    } else {
      console.warn('WebGPU adapter unavailable, falling back to cloud');
    }
  }

  // Tier 3: Cloud API fallback
  // ⚠️ Warning: The cloud fallback shown here is a skeleton. Production implementations
  // must add authentication headers, rate-limit handling, and appropriate error reporting.
  return {
    backend: 'cloud-fallback',
    infer: async (prompt) => {
      const controller = new AbortController();
      const timeoutId = setTimeout(() => controller.abort(), 10_000);
      try {
        const res = await fetch('/api/infer', {
          method: 'POST',
          body: JSON.stringify({ prompt }),
          headers: { 'Content-Type': 'application/json' },
          signal: controller.signal,
        });
        if (!res.ok) throw new Error(`Cloud fallback failed: ${res.status}`);
        return (await res.json()).text;
      } finally {
        clearTimeout(timeoutId);
      }
    },
    cleanup: () => {},
  };
}
This factory returns a uniform infer(prompt) interface. The calling code doesn't need to know which backend was selected.
Security, Ethical, and Practical Caveats
Security Surface
Running inference locally does not eliminate prompt injection. Malicious input can still manipulate model outputs, and there are no server-side guardrails to intercept harmful generation. Client-side filtering is the developer's responsibility. Additionally, anyone can inspect model weights served to the browser by extracting them from the cache or network tab. Proprietary models should never be deployed this way.
Ethical Considerations
On-device inference consumes real hardware resources. GPU utilization spikes during generation. Battery life on laptops and mobile devices decreases. Users may not realize a webpage triggered a multi-gigabyte model download (Gemini Nano's downloaded component is approximately 1 GB or more depending on the Chrome version) or that their GPU is running at high utilization. Transparent disclosure and explicit consent for model downloads and inference execution should be standard practice, not an afterthought.
The Local-First AI Stack Is Ready
WebGPU provides the GPU compute substrate, WebAssembly handles tokenization and orchestration, and Chrome's Prompt API plus frameworks like Web-LLM and Transformers.js cover both zero-config and custom model paths. The gaps are concrete: Safari lacks WebGPU compute shader support as of June 2025, the Prompt API remains behind a flag in Chrome, and no cross-browser standard for on-device model access exists yet. But the core architecture works today, and production use cases are already shipping: privacy-preserving writing tools, offline-capable field applications, and latency-sensitive UX features that can't tolerate a network round-trip.
Build the benchmark. Try the Prompt API on a real feature. Load a quantized Llama model with Web-LLM and measure the tokens-per-second on your target hardware. The code examples in this guide are designed to be copied, modified, and deployed.