Benchmarking Browser Inference: WebGPU vs. WebASM for Transformers.js


WebGPU vs WebAssembly for Transformers.js Comparison
| Dimension | WebGPU | WebAssembly (WASM) |
|---|---|---|
| Best for | Large models (>100M params), autoregressive generation, batched inference | Small models (<100M params), single-pass tasks (embedding, classification) |
| Throughput (TinyLlama 1.1B, discrete GPU) | 25–40 tokens/sec | 2–5 tokens/sec |
| Cold-start penalty | 1–5 s shader compilation on first run | Negligible; no shader compilation step |
| Browser support | Chrome 113+, Edge stable; Firefox behind flag; Safari experimental | All major browsers (broad, stable support) |
Table of Contents
- Why Browser Inference Backends Matter Now
- Prerequisites
- Understanding the Two Backends
- Benchmark Methodology
- Benchmark Results
- Analysis: When to Use Which Backend
- Practical Implementation Tips
- Limitations and What's Coming Next
- Choosing Your Backend with Data, Not Guesswork
Why Browser Inference Backends Matter Now
Client-side ML inference has shifted from novelty to necessity. Privacy regulations push computation to the edge, latency-sensitive applications demand local execution, and server-side GPU costs remain high. Transformers.js, the library that brings Hugging Face models into the browser, sits at the center of this shift. And the single most consequential decision developers face when using it is one most never measure: which backend to run. As the benchmarks below show, choosing incorrectly between WebGPU and WebAssembly (via ONNX Runtime Web) can mean a 10 to 15x performance difference depending on model size, task shape, and hardware. This article provides a reproducible methodology, concrete benchmark results, and an actionable decision framework for selecting the right backend.
Prerequisites
All code in this article requires:
- @huggingface/transformers v3.x or later (not @xenova/transformers, which is the v2 package and does not support WebGPU).
- Chrome 113+ or Edge stable for WebGPU. Firefox 118+ supports WebGPU behind a flag. Safari support remains experimental.
- An HTTPS or localhost origin with the following HTTP response headers for full WASM multi-threading and memory measurement:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
- A fresh browser profile with no extensions for reliable benchmarking.
- Record your Chrome version (chrome://version) and GPU driver version (chrome://gpu) alongside any benchmark results.
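Capturing this metadata can be scripted rather than copied by hand. A minimal sketch, with the caveat that `adapter.info` is a recent WebGPU addition and may be absent; every field degrades to null when an API is unavailable, and the helper name is an illustration:

```javascript
// Gather environment metadata worth recording next to benchmark numbers.
// Accepts a navigator-like object so it can be exercised outside a browser.
async function collectEnvInfo(nav = navigator) {
  const info = {
    userAgent: nav.userAgent ?? null,
    cores: nav.hardwareConcurrency ?? null,
    crossOriginIsolated: Boolean(globalThis.crossOriginIsolated),
    gpu: null,
  };
  if (nav.gpu) {
    const adapter = await nav.gpu.requestAdapter();
    // adapter.info is a newer spec addition; hedge on its presence.
    if (adapter && adapter.info) {
      info.gpu = {
        vendor: adapter.info.vendor ?? null,
        architecture: adapter.info.architecture ?? null,
      };
    }
  }
  return info;
}
```

Log the returned object (or attach it to your results file) before each benchmark run, so numbers from different machines and browser releases remain comparable.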
Understanding the Two Backends
WebAssembly (ONNX Runtime Web): The Established Path
Transformers.js converts Hugging Face models to ONNX format and executes them through ONNX Runtime Web. When using the WASM backend, inference runs entirely on the CPU. Modern browsers support WASM SIMD extensions, which accelerate vectorized operations common in neural network computation, and multi-threading via Web Workers backed by SharedArrayBuffer (see "Optimizing WASM Performance" below for required HTTP headers). This combination gives the WASM path strong single-pass performance for smaller models: the embedding benchmark later in this article shows all-MiniLM-L6-v2 completing inference in 8 to 12ms on an M2 MacBook Air. Browser compatibility is broad: every major browser released in the last several years supports WASM with SIMD.
WebGPU Compute Shaders: The GPU-Accelerated Challenger
WebGPU compute shaders dispatch matrix multiplication, attention computation, and other tensor operations directly to the GPU. Unlike WebGL, which was designed for graphics rendering and required awkward workarounds for general-purpose computation, WebGPU was purpose-built with compute workloads in mind. Transformers.js v3 integrates WebGPU through ONNX Runtime Web's WebGPU execution provider, meaning the same ONNX model graph can target either backend without model conversion.
Browser support is narrower. Chrome 113+ and Edge stable ship with WebGPU enabled. Firefox 118+ supports WebGPU behind a flag (dom.webgpu.enabled in about:config), but it is not enabled by default in release builds. Safari has experimental support in Technology Preview, but production Safari on macOS and iOS remains inconsistent. Any application targeting broad audiences must treat WebGPU as a progressive enhancement.
Architectural Differences at a Glance
The WASM pipeline keeps everything in CPU memory. Data flows from JavaScript into the WASM linear memory space, through the ONNX Runtime graph execution, and back. There is no memory transfer penalty beyond the WASM boundary crossing, which is minimal.
The WebGPU pipeline introduces a fundamentally different data flow. Input tensors must be uploaded from CPU memory to GPU buffers before the GPU can execute compute shader dispatches, and results must then be read back from GPU memory to CPU memory. This round-trip marshaling cost is what makes WebGPU slower than WASM for small, fast inference passes. The GPU only wins when computation time dominates transfer time, which requires sufficient model size or batch volume to amortize that overhead.
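To make the amortization argument concrete, here is a toy cost model. The constants are illustrative placeholders, not measurements:

```javascript
// Toy break-even model: the GPU path pays a fixed marshaling cost (buffer
// upload + readback) on every call, so it only wins once its compute saving
// exceeds that overhead. All numbers below are made up for illustration.
function gpuWins({ wasmMs, gpuComputeMs, transferMs }) {
  return gpuComputeMs + transferMs < wasmMs;
}

// Small embedding pass: WASM finishes before the GPU round-trip pays off.
gpuWins({ wasmMs: 10, gpuComputeMs: 3, transferMs: 12 }); // → false
// Heavy generation step: compute dominates and transfer is amortized.
gpuWins({ wasmMs: 300, gpuComputeMs: 25, transferMs: 12 }); // → true
```

The benchmark sections below put real numbers behind both sides of this inequality.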
import { pipeline, env } from '@huggingface/transformers'; // v3.x required
// --- WASM Backend Configuration ---
// Cap threads to avoid degradation on high-core-count machines.
env.backends.onnx.wasm.numThreads = Math.min(navigator.hardwareConcurrency || 4, 8);
env.backends.onnx.wasm.simd = true;
const wasmClassifier = await pipeline('text-classification', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english', {
device: 'wasm',
});
// --- WebGPU Backend Configuration with Fallback ---
async function createGPUPipeline(task, model) {
if (!navigator.gpu) {
console.warn('WebGPU not available, falling back to WASM');
return pipeline(task, model, { device: 'wasm' });
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
console.warn('No GPU adapter found, falling back to WASM');
return pipeline(task, model, { device: 'wasm' });
}
// Verify that a device can actually be created before committing to WebGPU.
let device;
try {
device = await adapter.requestDevice();
} catch {
console.warn('GPU device creation failed, falling back to WASM');
return pipeline(task, model, { device: 'wasm' });
}
device.destroy();
// Yield to allow driver cleanup before pipeline creates its own device.
await new Promise((r) => setTimeout(r, 0));
return pipeline(task, model, { device: 'webgpu' });
}
const gpuClassifier = await createGPUPipeline(
'text-classification',
'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
);
Benchmark Methodology
Hardware and Environment
Rigorous comparison requires testing across hardware tiers. The benchmark methodology described here targets three profiles: a discrete GPU machine (NVIDIA RTX 3060 or equivalent), an integrated GPU system (Apple M2 or Intel Iris Xe), and a CPU-only baseline (older laptop without capable GPU). All tests run in Chrome 125 and Edge stable, using fresh browser profiles with no extensions. Record Chrome version (chrome://version), GPU driver version (chrome://gpu), and OS version alongside all results, as WebGPU behavior and performance differ substantially across releases. Thermal throttling is controlled by allowing cooldown between test suites.
Models Under Test
Three model categories cover the most common browser inference workloads:
- Text embedding: Xenova/all-MiniLM-L6-v2 (22M parameters, quantized INT8). The canonical small, fast embedding model.
- Text generation: Xenova/Phi-3-mini-4k-instruct (~3.8B parameters) and Xenova/TinyLlama-1.1B-Chat-v1.0 (~1.1B parameters), both quantized ONNX. These represent the emerging class of small LLMs targeting in-browser use, and they stress both memory bandwidth and sustained compute throughput.
- Image classification: Xenova/vit-base-patch16-224 (86M parameters). Vision Transformers exercise different operator patterns than text models.
Quantization levels include FP32, FP16, and INT8 where the backend and model support them.
What Gets Measured
- Time-to-first-token (TTFT) for generative models, capturing initialization and first decode pass
- Tokens per second (TPS) for sustained text generation
- End-to-end latency for single-pass tasks (embedding, classification)
- Memory footprint via JS heap snapshots and GPU buffer allocation. performance.measureUserAgentSpecificMemory() reports the total agent heap at the moment of the call, not the peak memory used during inference. For per-backend memory comparisons, capture a snapshot immediately after creating each pipeline and running a single inference, before creating the next pipeline.
- Warm-up penalty: the difference between first-run and subsequent-run latency, which captures shader compilation cost for WebGPU
async function benchmarkPipeline(pipe, input, iterations = 20) {
const timings = [];
const TIMEOUT_MS = 30_000;
function timedInference() {
return Promise.race([
pipe(input),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('Inference timeout')), TIMEOUT_MS)
),
]);
}
// Warm-up run (excluded from stats but recorded for warm-up penalty measurement)
const warmupStart = performance.now();
await timedInference();
const warmupMs = performance.now() - warmupStart;
for (let i = 0; i < iterations; i++) {
const start = performance.now();
await timedInference();
timings.push(performance.now() - start);
}
// Sort a copy so `raw` preserves insertion order.
const sorted = [...timings].sort((a, b) => a - b);
// Correct median: average of two middle values for even-length arrays.
const mid = sorted.length / 2;
const median =
sorted.length % 2 === 0
? (sorted[mid - 1] + sorted[mid]) / 2
: sorted[Math.floor(mid)];
// Correct p95: ceil-based index so p95 ≠ max for typical iteration counts.
const p95 = sorted[Math.min(Math.ceil(sorted.length * 0.95) - 1, sorted.length - 1)];
const mean = sorted.reduce((a, b) => a + b, 0) / sorted.length;
// Memory measurement (Chrome 89+ only; Firefox and Safari do not support
// measureUserAgentSpecificMemory — memory will be null on those browsers).
// This reports the heap at call time, not peak during inference.
let memory = null;
if (
typeof crossOriginIsolated !== 'undefined' &&
crossOriginIsolated &&
typeof performance.measureUserAgentSpecificMemory === 'function'
) {
memory = await performance.measureUserAgentSpecificMemory();
}
return { median, p95, mean, warmupMs, memory, raw: timings };
}
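The harness above measures single-pass latency. The TTFT and TPS metrics for generative models need per-token timestamps instead. Here is a sketch: the pure throughput helper is exact arithmetic, while the streaming part assumes the v3 TextStreamer API with a callback_function option, so verify that signature against the version you install:

```javascript
// Tokens per second from per-token timestamps (ms). Dividing (n - 1) decode
// steps by the span between first and last token excludes the first interval,
// so TTFT (prefill + shader compilation) does not skew throughput.
function tokensPerSecond(timestamps) {
  if (timestamps.length < 2) return 0;
  const elapsedMs = timestamps[timestamps.length - 1] - timestamps[0];
  return elapsedMs > 0 ? ((timestamps.length - 1) / elapsedMs) * 1000 : 0;
}

// Usage sketch with a text-generation pipeline (`generator`) created as in
// the earlier examples — the TextStreamer options shown are assumptions:
//
//   import { TextStreamer } from '@huggingface/transformers';
//   const timestamps = [];
//   const streamer = new TextStreamer(generator.tokenizer, {
//     skip_prompt: true,
//     callback_function: () => timestamps.push(performance.now()),
//   });
//   const start = performance.now();
//   await generator(prompt, { max_new_tokens: 128, streamer });
//   const ttftMs = timestamps[0] - start;
//   const tps = tokensPerSecond(timestamps);
```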
Benchmark Results
Text Embedding: all-MiniLM-L6-v2
For small embedding models processing short input sequences (single sentences, under 128 tokens), the WASM backend consistently matches or outperforms WebGPU. On an M2 MacBook Air, WASM median latency for a single embedding inference lands around 8 to 12ms. WebGPU on the same machine measures 15 to 25ms. The GPU dispatch overhead, including buffer upload, shader dispatch, and readback, exceeds the computation itself.
This result contradicts the assumption that GPU acceleration is always faster. For small models with minimal computation per inference call, the CPU path through optimized SIMD WASM avoids all the marshaling overhead.
Text Generation: Small LLMs (Token Throughput)
The picture inverts dramatically with autoregressive generation. When generating 128 tokens with TinyLlama-1.1B, WebGPU on a discrete NVIDIA RTX GPU produces 25 to 40 tokens per second, while WASM manages 2 to 5 tokens per second on the same machine. That is a 10 to 15x throughput advantage. This test used a model variant where INT4 operators were fully supported by the WebGPU provider at the time of testing; your hardware or driver combination may produce different numbers. Check the browser console for ONNX Runtime operator fallback warnings, which indicate that the runtime is executing unsupported ops on the CPU instead of the GPU.
On integrated GPUs (Apple M2), WebGPU still wins at 15 to 25 TPS versus WASM's 3 to 6 TPS. The gap narrows on integrated hardware but remains substantial.
There is a caveat: first-token latency on WebGPU runs higher than WASM for generative models. Shader compilation during the first forward pass adds 1 to 5 seconds of startup cost. Subsequent tokens stream quickly, but that initial delay is perceptible. WASM delivers a faster first token because there is no shader compilation step.
Image Classification: ViT-base
Single-image classification shows a modest WebGPU advantage on discrete GPUs. On an RTX 3060, WebGPU median latency measured 18 to 22ms versus WASM's 35 to 45ms. On integrated GPUs (M2, Intel Iris Xe), the gap narrows to near-parity: WebGPU at 30 to 40ms versus WASM at 35 to 50ms, with overlap in the confidence intervals depending on driver version. The real differentiation appears with batched inference. Processing 16 images in a batch, WebGPU on a discrete GPU completes the full batch in the time WASM takes for 2 to 3 images. GPU parallelism shines when there is enough work to fill the compute units.
The Quantization Factor
INT8 quantization delivers outsized benefits to the WASM backend because SIMD instructions operate on packed 8-bit integers natively. An INT8 model on WASM can run 2 to 3x faster than its FP32 equivalent. WebGPU's quantization support is less straightforward. Most modern GPUs support FP16 natively, and FP16 provides a clean 2x memory reduction with near-equivalent throughput. INT8 and INT4 support in WebGPU compute shaders depends on the GPU and driver. ONNX Runtime Web's WebGPU provider does not yet cover all quantized operator categories consistently; common gaps include certain attention variants, custom fused kernels, and some integer arithmetic ops. Check the ONNX Runtime operator support matrix for current coverage, and watch the browser console for operator fallback warnings: these indicate the runtime is executing unsupported ops on the CPU.
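One way to act on this in code is to pick the quantization level per backend. The heuristic below follows the discussion above; the `dtype` values and the pipeline() option are assumptions to verify against the Transformers.js version and model repository you use:

```javascript
// Heuristic from the discussion above: packed INT8 plays to WASM's SIMD,
// while FP16 is the safer default on GPUs with patchy INT8/INT4 coverage.
function chooseDtype(device) {
  return device === 'wasm' ? 'q8' : 'fp16';
}

// Usage sketch — Transformers.js v3 exposes a `dtype` option on pipeline(),
// but the set of available levels depends on which weight files the model
// repository actually ships, so treat these values as assumptions:
//
//   const device = navigator.gpu ? 'webgpu' : 'wasm';
//   const pipe = await pipeline('text-generation',
//     'Xenova/TinyLlama-1.1B-Chat-v1.0',
//     { device, dtype: chooseDtype(device) });
```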
Memory Consumption Comparison
WASM allocates from the JS heap and its linear memory, which shows up clearly in DevTools. A 22M parameter model in INT8 uses about 25 to 30MB for weights alone; total runtime allocation will be higher depending on input size and activations. WebGPU allocates GPU buffers that are less visible in standard profiling tools. A 1.1B parameter INT4 model can consume 600MB to 1GB of GPU memory. On mobile devices and machines with shared CPU/GPU memory, this directly reduces available system memory. Low-end devices with 4GB total RAM hit practical limits quickly with WebGPU and larger models. Phi-3-mini at ~3.8B parameters requires substantially more VRAM; budget 4 to 8GB of GPU memory for FP16.
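These figures follow from simple arithmetic on parameter count and bytes per parameter. A small budgeting helper (weights only; activations, KV cache, and runtime buffers come on top, which is why the measured numbers above run higher):

```javascript
// Back-of-envelope weight memory from parameter count and precision.
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, q8: 1, q4: 0.5 };

function estimateWeightBytes(params, dtype) {
  const bpp = BYTES_PER_PARAM[dtype];
  if (bpp === undefined) throw new Error(`Unknown dtype: ${dtype}`);
  return params * bpp;
}

// 22M params at INT8 ≈ 22 MB of raw weights (the 25–30 MB above includes
// packing overhead); 1.1B params at INT4 ≈ 550 MB.
estimateWeightBytes(22e6, 'q8'); // → 22000000
estimateWeightBytes(1.1e9, 'q4'); // → 550000000
```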
The self-contained page below benchmarks both backends on your own hardware. SRI hash required: replace sha384-REPLACE_WITH_ACTUAL_HASH below with the real SRI hash. Generate it by running:
curl -s https://cdn.jsdelivr.net/npm/@huggingface/[email protected] \
| openssl dgst -sha384 -binary \
| openssl base64 -A \
| xargs -I{} echo "sha384-{}"
<!DOCTYPE html>
<html>
<head><title>Transformers.js Backend Benchmark</title></head>
<body>
<pre id="output">Running benchmark...</pre>
<script type="module"
src="https://cdn.jsdelivr.net/npm/@huggingface/[email protected]"
integrity="sha384-REPLACE_WITH_ACTUAL_HASH"
crossorigin="anonymous">
</script>
<script type="module">
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';
const out = document.getElementById('output');
const input = 'The quick brown fox jumps over the lazy dog';
const model = 'Xenova/all-MiniLM-L6-v2';
async function bench(device) {
const pipe = await pipeline('feature-extraction', model, { device });
try {
const times = [];
await pipe(input); // warm-up
for (let i = 0; i < 10; i++) {
const t0 = performance.now();
await pipe(input);
times.push(performance.now() - t0);
}
times.sort((a, b) => a - b);
// Correct median for even-length array (10 elements: average of indices 4 and 5)
return (times[4] + times[5]) / 2;
} finally {
if (typeof pipe.dispose === 'function') await pipe.dispose();
}
}
const wasmMs = await bench('wasm');
out.textContent = `WASM median: ${wasmMs.toFixed(1)}ms\n`;
if (navigator.gpu) {
const gpuMs = await bench('webgpu');
out.textContent += `WebGPU median: ${gpuMs.toFixed(1)}ms`;
} else {
out.textContent += 'WebGPU not available in this browser.';
}
</script>
</body>
</html>
Analysis: When to Use Which Backend
WebGPU Wins When...
Models exceed 100M parameters. Tasks involve sustained computation: autoregressive generation, batched image processing, anything where the GPU remains busy long enough to amortize transfer overhead. The user has a discrete GPU or a modern integrated GPU (Apple M-series, recent Intel Xe). The application can tolerate a shader compilation warm-up cost on first run, either by masking it behind a loading screen or by pre-warming the pipeline.
Pick WebAssembly When Compatibility or Cold-Start Matters
Small models with single-pass inference are WASM's home turf. Embedding a sentence or classifying a single input completes in under 15ms on WASM, and WebGPU cannot compete at that scale because the marshaling overhead alone exceeds total WASM execution time.
WASM is also the only viable default when your audience spans older hardware, budget Android devices, or browsers without WebGPU support. And if cold-start latency is a hard constraint, WASM avoids the 1 to 5 second shader compilation penalty entirely.
The Hybrid Strategy
The strongest approach detects capabilities at runtime and routes accordingly. Cache compiled pipelines and pre-warm WebGPU sessions during idle time to hide shader compilation delay.
async function selectBackend(modelParamsBillions = 0.5) {
if (!navigator.gpu) return 'wasm';
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) return 'wasm';
// WebGPU does not expose total VRAM. Use model parameter count and
// quantization level to estimate memory requirements offline, then gate on
// adapter availability plus device tier heuristics.
//
// requestAdapterInfo() may prompt for permission in some browsers and
// returns limited data; it is not used for routing here.
let device;
try {
device = await adapter.requestDevice();
} catch {
return 'wasm';
}
const limits = device.limits;
// maxBufferSize is the per-buffer allocation limit, NOT total VRAM.
// This check guards against extremely constrained WebGPU environments only.
const maxBufferSize = limits.maxBufferSize;
// FP16: 2 bytes per parameter. Reject WebGPU if the per-buffer limit
// cannot hold a single large weight tensor for the target model.
const BYTES_PER_PARAM_FP16 = 2;
const minRequiredBytes = modelParamsBillions * 1e9 * BYTES_PER_PARAM_FP16;
if (maxBufferSize < minRequiredBytes) {
device.destroy();
// Yield one microtask to allow driver cleanup before caller proceeds.
await new Promise((r) => setTimeout(r, 0));
return 'wasm';
}
// Release the probe device. On some drivers, a brief delay before creating
// a new device avoids racing with device teardown.
device.destroy();
await new Promise((r) => setTimeout(r, 0));
return 'webgpu';
}
// Usage
const backend = await selectBackend(0.022); // all-MiniLM-L6-v2 ≈ 22M params
const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
device: backend,
});
Practical Implementation Tips
Reducing WebGPU First-Run Latency
Shader compilation is the primary source of WebGPU cold-start delay. ONNX Runtime Web applies graph optimization automatically; as of Transformers.js v3.x, graphOptimizationLevel is not directly configurable via pipeline(). ONNX Runtime fuses operators internally, consolidating multiple operations into fewer shader dispatches. Pre-warming the pipeline by running a dummy inference during application initialization, before the user triggers actual inference, hides the compilation latency behind existing load time.
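A sketch of that pre-warming pattern, scheduled during browser idle time. The memoized-getter shape is one possible design, not a Transformers.js API; requestIdleCallback is unavailable in Safari, hence the setTimeout fallback:

```javascript
// Run work when the main thread is idle; requestIdleCallback is missing in
// Safari, so fall back to a macrotask.
function onIdle(fn) {
  if (typeof requestIdleCallback === 'function') {
    requestIdleCallback(fn, { timeout: 5000 });
  } else {
    setTimeout(fn, 0);
  }
}

// Memoize the pipeline promise and schedule one dummy inference so shader
// compilation happens before the user's first real request.
function makeWarmGetter(createPipeline, warmupInput) {
  let promise = null;
  const get = () => (promise ??= createPipeline());
  onIdle(async () => {
    const pipe = await get();
    await pipe(warmupInput); // first forward pass compiles the shaders
  });
  return get;
}

// Usage sketch with the pipelines from the earlier examples:
//
//   const getClassifier = makeWarmGetter(
//     () => pipeline('text-classification',
//       'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
//       { device: 'webgpu' }),
//     'warm-up');
//   // Later, on user action — shaders already compiled:
//   const result = await (await getClassifier())('Great product!');
```

Because the getter memoizes the promise rather than the pipeline, concurrent callers during loading share one in-flight creation instead of racing to build duplicates.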
Optimizing WASM Performance
SIMD and multi-threading deliver the largest WASM speedups but require specific conditions. Browsers gate SharedArrayBuffer behind two HTTP headers:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Without these headers, ONNX Runtime Web falls back to single-threaded WASM, which can be 3 to 4x slower for larger models. Many developers miss this because inference still works; it is just silently running single-threaded. Always verify in DevTools → Console and look for "SharedArrayBuffer is not available" warnings from ONNX Runtime.
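That console check can also be done programmatically at startup. A sketch (the helper name and messages are illustrative; it accepts an environment object so the logic is testable outside a browser):

```javascript
// List the conditions that will silently force ONNX Runtime Web into
// single-threaded WASM. An empty array means threading can work.
function checkThreadingPrereqs(env = globalThis) {
  const problems = [];
  if (typeof env.SharedArrayBuffer === 'undefined') {
    problems.push('SharedArrayBuffer unavailable — are the COOP/COEP headers set?');
  }
  if (!env.crossOriginIsolated) {
    problems.push('Page is not cross-origin isolated');
  }
  if ((env.navigator?.hardwareConcurrency ?? 1) < 2) {
    problems.push('Single logical core reported; threading will not help');
  }
  return problems;
}

// Surface misconfiguration loudly instead of silently running slow.
for (const p of checkThreadingPrereqs()) console.warn('[wasm-threads]', p);
```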
Shared Optimization: Model Caching with OPFS
Downloading a 50 to 500MB model on every page load is untenable. The Origin Private File System (OPFS) provides persistent, high-performance storage that has more predictable quota behavior than the Cache API for large binary files, though storage quotas still apply and vary by browser. OPFS requires a secure origin (HTTPS or localhost) and is supported in Chrome 86+, Safari 15.2+, and Firefox 111+.
async function cacheModel(url, fileName) {
const root = await navigator.storage.getDirectory();
try {
const existingHandle = await root.getFileHandle(fileName);
const file = await existingHandle.getFile();
// Validate cached file is not corrupted (e.g., partial write from prior crash).
if (file.size === 0) throw new DOMException('Empty cache file', 'NotFoundError');
return new Uint8Array(await file.arrayBuffer());
} catch (e) {
if (e.name !== 'NotFoundError') throw e;
}
// Fetch with timeout to avoid hanging on stalled connections.
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 120_000);
let buffer;
try {
const response = await fetch(url, { signal: controller.signal });
if (!response.ok) throw new Error(`Fetch failed: ${response.status}`);
buffer = await response.arrayBuffer();
} finally {
clearTimeout(timeoutId);
}
// Write a *copy* so the original ArrayBuffer is not detached by the stream.
const handle = await root.getFileHandle(fileName, { create: true });
const writable = await handle.createWritable();
await writable.write(buffer.slice(0));
await writable.close();
return new Uint8Array(buffer); // original buffer remains valid
}
Offloading inference to a Web Worker keeps the main thread responsive regardless of backend choice. This matters especially for WASM, which blocks the thread during synchronous computation segments, but also helps with WebGPU by isolating the async pipeline management from UI rendering.
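A sketch of that worker setup. The request/response correlation over postMessage is the testable core; the worker file name, message shape, and pipeline options are all illustrative choices, not a fixed API:

```javascript
// Main-thread side: promise-based request/response correlation over
// postMessage. Works with any worker-like object exposing the three methods.
function makeRequester(workerLike) {
  let nextId = 0;
  return function request(payload) {
    return new Promise((resolve, reject) => {
      const id = nextId++;
      const onMsg = ({ data }) => {
        if (data.id !== id) return; // not our reply
        workerLike.removeEventListener('message', onMsg);
        data.error ? reject(new Error(data.error)) : resolve(data.result);
      };
      workerLike.addEventListener('message', onMsg);
      workerLike.postMessage({ id, payload });
    });
  };
}

// Usage sketch (worker file name is illustrative):
//
//   const worker = new Worker(new URL('./inference-worker.js', import.meta.url),
//     { type: 'module' });
//   const embed = makeRequester(worker);
//   const vec = await embed('hello world');
//
// inference-worker.js would create the pipeline once and answer messages:
//
//   import { pipeline } from '@huggingface/transformers';
//   let pipePromise = null;
//   self.onmessage = async ({ data: { id, payload } }) => {
//     try {
//       pipePromise ??= pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
//       const pipe = await pipePromise;
//       const out = await pipe(payload, { pooling: 'mean', normalize: true });
//       self.postMessage({ id, result: Array.from(out.data) });
//     } catch (err) {
//       self.postMessage({ id, error: String(err) });
//     }
//   };
```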
Limitations and What's Coming Next
The WebGPU specification continues to evolve; features like subgroups and shader-f16 are in active development and would meaningfully change the performance calculus by enabling more efficient GPU-side computation. Today's performance characteristics will shift as browser vendors optimize their shader compilers and ONNX Runtime Web expands operator coverage for its WebGPU execution provider. Operator coverage gaps mean some model architectures silently fall back to CPU execution for unsupported operations, creating a mixed execution path that undermines WebGPU's advantages. WebNN, a third backend option that targets hardware-specific neural network accelerators (NPUs), is emerging in Chrome and could reshape this comparison entirely by offering dedicated ML silicon access through the browser.
Choosing Your Backend with Data, Not Guesswork
| Task Type | Hardware | Recommended Backend | Notes |
|---|---|---|---|
| Text embedding (small model) | Any | WASM | |
| Text generation (>100M params) | Discrete GPU | WebGPU | |
| Text generation (>100M params) | Integrated GPU (M-series, Xe) | WebGPU | Chrome/Edge; use WASM fallback for Safari on macOS |
| Text generation (>100M params) | Older/mobile CPU | WASM (or skip) | |
| Image classification (single) | Any | WASM | |
| Image classification (batched) | Discrete/integrated GPU | WebGPU | |
| Broad compatibility required | Mixed audience | WASM with WebGPU upgrade |
Run the self-contained benchmark HTML file above on your own hardware. The numbers will vary by GPU, driver version, and browser release. Measure on the devices your users actually have, not on your development workstation. The gap between the two backends is real and significant, but which side of that gap you land on depends entirely on the specifics of your model, your task, and your audience.