The Complete Guide to Local-First AI: WebGPU, Wasm, and Chrome's Built-in Model


Local-first AI in the browser has crossed the threshold from experimental curiosity to production-viable technology. This guide is a deep technical walkthrough for experienced JavaScript and web developers who want to understand the architecture beneath browser-based LLM inference, not just copy-paste a demo.
Table of Contents
- Why Local-First AI Is Having Its Moment
- The Case for Running LLMs in the Browser
- WebGPU Architecture: The GPU Compute Layer Browsers Were Missing
- WebAssembly's Role: The CPU Fallback and Orchestration Layer
- Chrome's Built-in Model: The Prompt API and Gemini Nano
- Bring Your Own Model: WebGPU-Accelerated Open-Source LLMs
- Building the Interactive WebGPU Inference Benchmark
- Performance Optimization and Production Considerations
- Security, Ethical, and Practical Caveats
- The Local-First AI Stack Is Ready
Why Local-First AI Is Having Its Moment
Three developments converged to make local-first AI viable: 4-bit quantized small language models (Q4_0, Q4_K_M) that fit in consumer GPU memory, WebGPU reaching stable status in Chrome 113 (May 2023) with real compute shader pipelines (Firefox and Safari support remains partial as of writing), and Chrome shipping a built-in Gemini Nano model accessible through the Prompt API. Together, these form a complete local-first AI stack that didn't exist twelve months ago.
This guide covers the WebGPU compute layer, how WebAssembly serves as the orchestration and fallback substrate, Chrome's Prompt API for zero-setup on-device inference, and the bring-your-own-model path using open-source frameworks. Working code accompanies every major section.
Prerequisites: solid JavaScript fundamentals, a basic understanding of how ML inference works (matrix multiplications, tokenization, autoregressive generation), and Chrome 128 or later with the #prompt-api-for-gemini-nano flag enabled (see the Prompt API setup instructions below).
The Case for Running LLMs in the Browser
Privacy by Architecture
Consider a therapist's note-taking app or a legal document reviewer. When inference runs entirely on a user's device, data never leaves that device. This isn't a policy promise or a contractual guarantee; it's an architectural fact. No HTTP request carries the prompt to an external server, and no response payload carries results back. For applications handling sensitive text (medical notes, legal drafts, personal journals), this eliminates an entire category of data-handling risk. On-device processing eliminates transfer risks, but GDPR obligations around consent, retention, and access rights remain. HIPAA-adjacent use cases become more tractable too, though developers should note that on-device processing alone doesn't constitute full HIPAA compliance without broader controls.
Latency and Cost Elimination
Cloud-based LLM inference carries two unavoidable costs: network round-trip latency and per-token billing. Local inference eliminates both. Once a model is loaded into GPU memory, generation happens at whatever speed the hardware allows, with zero network overhead. There's no metered API. No usage cap. No surprise invoice at the end of the month. For high-volume, latency-sensitive features like autocomplete, inline rewriting, or real-time summarization, the cost difference is stark: cloud inference for GPT-class models runs $0.15–$0.60 per million input tokens (varying by provider and model tier), while local inference has zero marginal cost after the initial model download.
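To make the cost gap concrete, the arithmetic is straightforward. The usage profile below (50,000 requests per day at 500 input tokens each) is hypothetical, and the price range comes from the figures quoted above:

```javascript
// Back-of-envelope monthly cloud cost for a token-heavy feature.
// Usage numbers are hypothetical; prices are per million input tokens.
function monthlyCloudCostUSD(requestsPerDay, tokensPerRequest, pricePerMillion) {
  const tokensPerMonth = requestsPerDay * tokensPerRequest * 30;
  return (tokensPerMonth / 1e6) * pricePerMillion;
}

// 50,000 autocomplete requests/day at 500 input tokens each:
console.log(monthlyCloudCostUSD(50_000, 500, 0.15).toFixed(2)); // → "112.50"
console.log(monthlyCloudCostUSD(50_000, 500, 0.60).toFixed(2)); // → "450.00"
```

At that volume, cloud inference costs between roughly $112 and $450 per month for input tokens alone; the local path pays only the one-time model download.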
Offline-First Capabilities
Browser-based inference works without a network connection. Combined with service workers and cached model weights, developers can build tools that function in genuinely offline environments: field data collection, embedded kiosks, airplane-mode productivity applications. The model ships with the app or downloads once, then runs indefinitely without phoning home.
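A minimal sketch of that caching layer, assuming model weights are served as .gguf/.onnx/.bin files; the cache name and extension filter are illustrative, and the typeof-self guard simply lets the file load outside a worker context:

```javascript
// sw.js — cache-first service worker for model weight files (illustrative).
const MODEL_CACHE = 'model-weights-v1';

// Only intercept large, immutable model artifacts (assumed extensions)
const isModelAsset = (pathname) => /\.(gguf|onnx|bin)$/.test(pathname);

async function cacheFirst(request) {
  const cache = await caches.open(MODEL_CACHE);
  const cached = await cache.match(request);
  if (cached) return cached; // Cache hit: serves the weights fully offline
  const response = await fetch(request);
  if (response.ok) cache.put(request, response.clone()); // Cache for next load
  return response;
}

if (typeof self !== 'undefined' && 'addEventListener' in self) {
  self.addEventListener('fetch', (event) => {
    const url = new URL(event.request.url);
    if (isModelAsset(url.pathname)) event.respondWith(cacheFirst(event.request));
  });
}
```

After the first successful download, subsequent page loads serve the multi-gigabyte weights straight from the Cache API, so the app works in airplane mode.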
WebGPU Architecture: The GPU Compute Layer Browsers Were Missing
From WebGL to WebGPU: What Changed
WebGL was designed for rasterization. It maps well to vertex and fragment shaders for drawing triangles, but it has no native concept of general-purpose compute. Developers who tried to repurpose WebGL for matrix math had to encode data into textures and abuse fragment shaders, a technique that works but is fragile, memory-inefficient, and miserable to debug.
WebGPU is fundamentally different. It exposes compute shaders as a first-class pipeline stage, provides storage buffers for arbitrary read/write data, and gives developers explicit control over GPU resource management. The programming model sits closer to Vulkan and Metal than to OpenGL.
| Capability | WebGL2 | WebGPU |
|---|---|---|
| Compute shaders | No | Yes |
| Storage buffers | No (texture workarounds) | Yes |
| Explicit memory management | No | Yes |
| Shader language | GLSL ES 3.0 | WGSL |
| Parallel dispatch control | Limited | Workgroups + dispatch |
| Suitable for ML inference | Barely | Purpose-built |
The WebGPU Pipeline for Matrix Operations
LLM inference is dominated by matrix multiplications: projecting embeddings, computing attention scores, applying feed-forward layers. Every transformer block repeats these operations, and every generated token triggers another full forward pass. WebGPU maps this work onto compute shaders executed across thousands of GPU threads.
The pipeline follows a specific sequence: request a GPU adapter, obtain a device, create shader modules written in WGSL, bind input and output buffers, encode compute commands, submit them to the GPU queue, and read back results. Workgroups define how threads are organized, and dispatch calls determine how many workgroups execute.
// Code Example 1: WebGPU device init + minimal matrix multiply compute shader
// ⚠️ Warning: This shader uses a bounds guard but is configured for 64×64 matrices.
// Production shaders should pass N as a uniform for flexibility.
// ⚠️ This example is illustrative — bufferA and bufferB are written to before
// dispatch (see writeBuffer calls below).
const BYTES_PER_F32 = 4;
const N = 64;

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU not supported on this browser/device.');
const device = await adapter.requestDevice();

const shaderModule = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read> a: array<f32>;
    @group(0) @binding(1) var<storage, read> b: array<f32>;
    @group(0) @binding(2) var<storage, read_write> result: array<f32>;

    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) gid: vec3u) {
      let N = 64u;
      let row = gid.x;
      let col = gid.y;
      if (row >= N || col >= N) { return; }
      var sum = 0.0;
      for (var k = 0u; k < N; k = k + 1u) {
        sum = sum + a[row * N + k] * b[k * N + col];
      }
      result[row * N + col] = sum;
    }
  `
});

const bufferSize = N * N * BYTES_PER_F32; // 64x64 f32 matrix
const bufferA = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
const bufferB = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
const bufferResult = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST
});
const readBuffer = device.createBuffer({
  size: bufferSize,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST
});

// Write input data into buffers before dispatch
const matrixAData = new Float32Array(N * N); // populate with your data
const matrixBData = new Float32Array(N * N); // populate with your data
device.queue.writeBuffer(bufferA, 0, matrixAData);
device.queue.writeBuffer(bufferB, 0, matrixBData);

const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: shaderModule, entryPoint: 'main' }
});
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: bufferA } },
    { binding: 1, resource: { buffer: bufferB } },
    { binding: 2, resource: { buffer: bufferResult } },
  ]
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(8, 8); // 64/8 = 8 workgroups per dimension
pass.end();
encoder.copyBufferToBuffer(bufferResult, 0, readBuffer, 0, bufferSize);
device.queue.submit([encoder.finish()]);

// Read back results — required to access GPU output in JavaScript
await readBuffer.mapAsync(GPUMapMode.READ);
const resultData = new Float32Array(readBuffer.getMappedRange().slice(0));
readBuffer.unmap();
console.log('Result[0,0]:', resultData[0]);
Memory Constraints and Model Quantization
Browser GPU memory is constrained. Discrete GPUs typically expose 6–8 GB of dedicated VRAM to WebGPU, while integrated GPUs and Apple Silicon share system RAM with the GPU, commonly making 4–6 GB available depending on total system memory and OS allocation. The browser itself consumes a portion of that. At these budgets, half-precision (FP16) weights are impractical for models beyond a few billion parameters, and full-precision FP32 is even less feasible.
This is why quantization is non-negotiable for browser inference. 4-bit quantized models (Q4_0, Q4_K_M in GGUF format) or INT8 ONNX models shrink memory requirements by approximately 4× compared to FP16, or 8× compared to FP32. Practically, this sets the in-browser model size ceiling at 3B to 7B parameters, depending on the quantization scheme and available memory. A 4-bit quantized 3B model occupies approximately 1.5 to 2 GB for weights alone (KV cache and runtime overhead add to this), which fits comfortably. A 7B model at 4-bit needs around 3.5 to 4 GB for weights, which works on machines with 8 GB available GPU memory but leaves little headroom.
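The arithmetic behind these figures is worth making explicit. The estimator below is a rough sketch: the ~4.5 effective bits per weight for Q4_K_M is an approximation that folds in scale and block metadata, not exact GGUF accounting:

```javascript
// Rough weight-memory estimate for a quantized model.
// bitsPerWeight values are approximations: 16 for FP16, ~4.5 for Q4_K_M
// once quantization scales/block metadata are included.
function estimateWeightMemoryGB(paramsBillions, bitsPerWeight) {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1024 ** 3; // GiB
}

console.log(estimateWeightMemoryGB(3, 16).toFixed(1));  // FP16 3B   → "5.6"
console.log(estimateWeightMemoryGB(3, 4.5).toFixed(1)); // Q4_K_M 3B → "1.6"
console.log(estimateWeightMemoryGB(7, 4.5).toFixed(1)); // Q4_K_M 7B → "3.7"
```

These weight-only numbers line up with the 1.5–2 GB and 3.5–4 GB figures above; remember that the KV cache and runtime overhead come on top.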
WebAssembly's Role: The CPU Fallback and Orchestration Layer
Why Wasm Still Matters When You Have GPU
GPU compute handles the heavy matrix math, but LLM inference involves more than matrix multiplications. Tokenization (converting text to token IDs), detokenization (converting IDs back to text), model file parsing, sampling logic (temperature, top-k, top-p), and session management are all CPU-bound tasks that don't benefit from GPU parallelism. WebAssembly handles these operations with near-native speed.
Wasm SIMD extensions can further accelerate tokenizer performance, depending on the implementation. BPE and SentencePiece tokenizers involve tight loops over byte sequences, and SIMD instructions can process multiple bytes per cycle. This helps keep tokenization from becoming a bottleneck in the inference pipeline.
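SIMD availability can be feature-detected by validating a tiny module that uses a v128 instruction. The byte sequence below is borrowed from the open-source wasm-feature-detect project; treat it as an assumption and verify it against that library (in practice you would simply depend on the package):

```javascript
// Feature-detect Wasm SIMD: WebAssembly.validate returns true only if the
// engine accepts the v128 instructions in this minimal module.
// Byte sequence assumed from the wasm-feature-detect project — verify there.
const SIMD_TEST_MODULE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3,
  2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

const simdSupported = WebAssembly.validate(SIMD_TEST_MODULE);
console.log('Wasm SIMD supported:', simdSupported);
```

Use the result to choose between a SIMD-enabled and a plain Wasm build of your tokenizer at load time.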
Wasm threading and SIMD optimizations require cross-origin isolation. Serve your page with Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers, or SharedArrayBuffer (which some threading models depend on) will be unavailable.
The Wasm + WebGPU Tandem Architecture
The data flow in a browser-based LLM follows a clear split: user input enters JavaScript, passes to a Wasm-compiled tokenizer that produces token IDs, those IDs feed into the WebGPU inference engine for the forward pass, the resulting logits return to Wasm (or JavaScript) for sampling, and the sampled token ID passes through the Wasm detokenizer to produce text. Libraries like the Wasm port of llama.cpp and ONNX Runtime Web implement exactly this architecture, using Wasm for everything except the core matrix operations.
// Code Example 2: Loading a Wasm tokenizer and generating token IDs
//
// This example uses `@bokuweb/sentencepiece-js`.
// Install with: npm install @bokuweb/sentencepiece-js
// Adjust the API surface if the package's exports differ from what is shown here;
// run: node -e "const m = require('@bokuweb/sentencepiece-js'); console.log(Object.keys(m))"
// to confirm the available exports.
import { SentencePieceProcessor } from '@bokuweb/sentencepiece-js';
const tokenizer = new SentencePieceProcessor();
await tokenizer.load('tokenizer.model'); // Load SentencePiece model file
const prompt = "Explain quantum entanglement in simple terms.";
const tokenIds = tokenizer.encodeIds(prompt);
console.log('Token IDs:', tokenIds); // e.g., [1, 12018, 13261, ...]
console.log('Token count:', tokenIds.length);
// These token IDs are then passed to the WebGPU inference pipeline
// as input_ids for the model's forward pass.
const decoded = tokenizer.decodeIds(tokenIds);
console.log('Round-trip:', decoded);
// Wasm tokenizer handles BPE merges and special tokens at near-native speed.
// The WebGPU pipeline only receives integer arrays.
Chrome's Built-in Model: The Prompt API and Gemini Nano
What Is the Prompt API?
Chrome ships Gemini Nano, a small on-device language model (exact parameter count not officially confirmed by Google), directly in the browser and exposes it to JavaScript through the Prompt API's ai.languageModel namespace, with no API keys or external dependencies. Earlier experimental Chrome builds used an informal window.ai namespace that is not equivalent to the current API. The Prompt API has been available via origin trial and behind feature flags since Chrome 128; check chromestatus.com for current availability before shipping.
The API surface is straightforward: ai.languageModel.create() produces a session, sessions accept prompts via .prompt() or .promptStreaming(), and system prompts configure behavior at session creation. The model runs entirely on-device using the user's hardware.
Setting Up and Feature-Detecting the Prompt API
For development and testing, the on-device model must be explicitly enabled. Navigate to chrome://flags/#optimization-guide-on-device-model and set it to "Enabled BypassPerfRequirement" (the bypass flag skips hardware checks that might otherwise block lower-spec machines). After restarting Chrome, the model downloads in the background as a separate component. Check chrome://components for "Optimization Guide On Device Model" to verify download status. The model is not bundled with the Chrome binary — it is downloaded separately after you enable the flag or meet eligibility criteria.
Feature detection should always precede usage, since the API may be unavailable on non-Chrome browsers or older versions.
// Code Example 3: Feature detection, session creation, and streaming prompt
async function runPromptAPI() {
  // Feature detection
  if (!('ai' in self) || !('languageModel' in self.ai)) {
    console.error('Prompt API not available in this browser.');
    return null;
  }

  // Check capabilities
  const capabilities = await ai.languageModel.capabilities();
  console.log('Available:', capabilities.available); // 'readily', 'after-download', or 'no'
  console.log('Default temperature:', capabilities.defaultTemperature);
  console.log('Default topK:', capabilities.defaultTopK);
  console.log('Max topK:', capabilities.maxTopK);
  if (capabilities.available === 'no') {
    console.error('Language model not available on this device.');
    return null;
  }

  // Create a session with system prompt, using capability defaults as fallbacks
  const session = await ai.languageModel.create({
    systemPrompt: 'You are a concise technical assistant. Answer in two sentences or fewer.',
    temperature: 0.7,
    topK: Math.min(5, capabilities.maxTopK),
  });

  // Streaming prompt
  const stream = await session.promptStreaming('What is WebGPU used for in ML inference?');
  let fullResponse = '';
  for await (const chunk of stream) {
    fullResponse = chunk; // Each chunk is the full accumulated text so far, not a delta
    document.getElementById('output').textContent = chunk; // Overwrite for live display
  }
  console.log('Final response:', fullResponse);
  return session;
}
Session Management and Context Windows
Sessions maintain conversational context, but that context consumes tokens from a fixed budget. The maxTokens, temperature, and topK parameters control generation behavior. Critically, countPromptTokens() lets developers measure how much of the context window a prompt will consume before sending it, enabling proactive truncation or summarization of conversation history.
Session cloning via .clone() creates a branching point: both sessions share the context up to the clone, then diverge. This is useful for exploring multiple response paths or implementing undo-style interactions. Sessions should be explicitly destroyed via .destroy() when no longer needed to free GPU and system memory.
// Code Example 4: Multi-turn conversation with token counting and cloning
// Note: This function does NOT destroy the session — the caller owns the session
// lifecycle and is responsible for calling session.destroy() when appropriate.
async function multiTurnDemo(session) {
  // First turn
  const r1 = await session.prompt('What is WGSL?');
  console.log('Turn 1:', r1);

  // Count tokens before sending a long follow-up
  const followUp = 'How does WGSL differ from GLSL for compute workloads? Include examples.';
  const tokenCount = await session.countPromptTokens(followUp);
  const remaining = session.maxTokens - session.tokensSoFar;
  console.log(`Follow-up will consume ${tokenCount} tokens`);
  console.log(`Tokens remaining: ${remaining}`);
  console.log(`Max tokens: ${session.maxTokens}`);
  if (tokenCount > remaining) {
    console.warn('Prompt too long for remaining context.');
    return { status: 'context_exceeded' }; // Caller decides whether to destroy or reset
  }

  const r2 = await session.prompt(followUp);
  console.log('Turn 2:', r2);

  // Clone to branch the conversation
  const branchedSession = await session.clone();
  const r3a = await session.prompt('Give me a code example.');
  const r3b = await branchedSession.prompt('What are the performance implications?');
  console.log('Branch A:', r3a);
  console.log('Branch B:', r3b);

  // Clean up the branched session (caller still owns the original session)
  branchedSession.destroy();
}
Limitations of the Built-in Model
Gemini Nano is a small model, and that places hard limits on capability. As of current Chrome stable releases at time of writing, the model is English-only with no multimodal input support (no image or audio processing). Check ai.languageModel.capabilities() for current feature flags, as these constraints may change with Chrome updates. There is no fine-tuning mechanism; the system prompt is the only customization lever. The model version is coupled to Chrome's update cycle, meaning developers cannot pin a specific model version or roll back if a Chrome update changes behavior. For tasks requiring domain-specific knowledge, multilingual support, or larger context windows, the bring-your-own-model approach is necessary.
Bring Your Own Model: WebGPU-Accelerated Open-Source LLMs
Frameworks for In-Browser LLM Inference
Three frameworks dominate the BYOM space for browser-based LLM inference:
Web-LLM (MLC AI) compiles models to WebGPU shaders via Apache TVM. It supports quantized versions of Llama, Mistral, Phi, Gemma, and other popular model families. Models are pre-compiled to WebGPU-optimized artifacts and cached locally after first download.
What sets Transformers.js (Hugging Face) apart is its familiar API: developers who have used the Python transformers library will recognize the patterns immediately. Under the hood, it uses ONNX Runtime Web as its backend, with WebGPU acceleration for supported operations, and provides access to a broad range of models from the Hugging Face Hub.
MediaPipe LLM Inference API is Google's task-focused approach. It wraps model loading and inference behind a simpler API designed for specific use cases rather than general-purpose chat.
Running a Llama 3.2 1B Model with Web-LLM
// Code Example 5: Complete Web-LLM setup with streaming chat
// npm install @mlc-ai/web-llm — verify the current version at npmjs.com/package/@mlc-ai/web-llm
import { CreateMLCEngine } from '@mlc-ai/web-llm';

async function runWebLLM() {
  // Verify this model ID against the current MLC model registry;
  // model IDs may change between library versions.
  const modelId = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';

  // Initialize with progress tracking
  const engine = await CreateMLCEngine(modelId, {
    initProgressCallback: (progress) => {
      console.log(`Loading: ${progress.text}`);
      // progress.progress is a float 0-1 during model download/compile
      if (progress.progress) {
        const pct = (progress.progress * 100).toFixed(1);
        document.getElementById('status').textContent = `${pct}% loaded`;
      }
    },
  });
  console.log('Model ready. Starting inference...');

  // Streaming chat completion
  const messages = [
    { role: 'system', content: 'You are a helpful coding assistant.' },
    { role: 'user', content: 'Write a JavaScript function to debounce any callback.' },
  ];
  const stream = await engine.chat.completions.create({
    messages,
    stream: true,
    temperature: 0.6,
    max_tokens: 256,
  });

  let output = '';
  const startTime = performance.now();
  let chunkCount = 0;
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content || '';
    output += delta;
    chunkCount++;
    document.getElementById('output').textContent = output;
  }
  const elapsed = performance.now() - startTime;
  console.log(`Generated ${chunkCount} chunks in ${elapsed.toFixed(0)}ms`);
  // For accurate tokens/sec, use engine.runtimeStatsText() which reports true token counts
  console.log(await engine.runtimeStatsText());

  // Return the engine for further use; caller is responsible for calling engine.unload()
  // when the engine is no longer needed.
  return engine;
}

// Usage:
// const engine = await runWebLLM();
// ... use engine for additional inference ...
// engine.unload(); // explicit teardown when done
When to Use Built-in vs. BYOM
The decision depends on several factors. Chrome's Prompt API offers zero-setup convenience and guaranteed availability on supported Chrome versions once the model has downloaded. It suits lightweight tasks: summarization, rewriting, simple Q&A, and text classification.
The BYOM path is appropriate when developers need a specific model architecture, larger parameter counts, non-English languages, or domain-specialized models. The tradeoff: model weights must be downloaded (often 1–4 GB), initial load takes seconds to minutes depending on connection speed and hardware, and the developer assumes responsibility for model version management and caching.
| Factor | Prompt API (Gemini Nano) | BYOM (Web-LLM / Transformers.js) |
|---|---|---|
| Setup complexity | Minimal | Moderate to high |
| Model choice | Fixed (Gemini Nano) | Llama, Mistral, Phi, Gemma, etc. |
| Download size | Downloaded separately (~1 GB+) via Chrome's background update mechanism; verify at chrome://components | 1-4 GB per model |
| Language support | English only (as of current stable) | Multilingual available |
| Offline | Yes (after model downloaded) | Yes (after model cached) |
| Fine-tuning | System prompt only | Not supported at runtime; use pre-fine-tuned model variants (e.g., instruction-tuned Llama) |
Building the Interactive WebGPU Inference Benchmark
Benchmark Architecture
A meaningful benchmark needs to measure three values: time to first token (TTFT), sustained token generation speed (tokens/second), and total inference time. Comparing across backends requires holding variables constant: same prompt text, same requested output length, and same hardware.
The benchmark harness below wraps any inference backend behind a common interface, captures timing data, and returns a standardized results object.
Each adapter function (promptApiInfer, webLlmInfer, transformersJsInfer) must conform to the (prompt, maxTokens, onTokenCallback) signature. The onToken callback must be invoked with the actual number of tokens generated per chunk (a numeric argument) for accurate tokens/second measurement. See each backend's documentation for how to obtain true token counts (e.g., Web-LLM's engine.runtimeStatsText()).
// Code Example 6: Benchmark harness for comparing inference backends
// Adapter stub signatures — implement these for each backend:
// async function promptApiInfer(prompt, maxTokens, onToken) { /* onToken(numTokens) per chunk */ }
// async function webLlmInfer(prompt, maxTokens, onToken) { /* onToken(numTokens) per chunk */ }
// async function transformersJsInfer(prompt, maxTokens, onToken) { /* onToken(numTokens) per chunk */ }
async function benchmarkInference(backendName, inferFn, prompt, maxTokens = 128) {
  // Note: gc() is unavailable in normal browser contexts; GC cannot be manually triggered from web pages
  await new Promise((r) => setTimeout(r, 500)); // Settle

  const t0 = performance.now();
  let firstTokenTime = null;
  let tokenCount = 0;
  const onToken = (tokenDelta) => {
    if (typeof tokenDelta !== 'number') {
      throw new Error('Adapter must pass token count as a numeric argument to onToken');
    }
    if (!firstTokenTime) firstTokenTime = performance.now();
    tokenCount += tokenDelta;
  };

  let result;
  try {
    // inferFn must accept (prompt, maxTokens, onTokenCallback) and return full text
    result = await inferFn(prompt, maxTokens, onToken);
  } catch (err) {
    console.error(`[${backendName}] Inference failed:`, err);
    throw err;
  }

  const totalMs = performance.now() - t0;
  const ttft = firstTokenTime ? firstTokenTime - t0 : null;
  const generationMs = firstTokenTime ? totalMs - (firstTokenTime - t0) : totalMs;
  let tokensPerSecond;
  if (generationMs < 10) {
    tokensPerSecond = null; // Insufficient resolution for meaningful measurement
  } else {
    tokensPerSecond = tokenCount > 0 ? (tokenCount / (generationMs / 1000)) : 0;
  }

  return {
    backend: backendName,
    ttft: ttft ? Math.round(ttft) : 'N/A',
    tokensPerSecond: tokensPerSecond !== null ? parseFloat(tokensPerSecond.toFixed(1)) : 'N/A (insufficient resolution)',
    totalMs: Math.round(totalMs),
    tokenCount,
    outputPreview: result.slice(0, 100),
  };
}

// Run across multiple backends
async function runComparison(prompt) {
  const results = [];
  if ('ai' in self && 'languageModel' in self.ai) {
    results.push(await benchmarkInference('Prompt API', promptApiInfer, prompt));
  }
  results.push(await benchmarkInference('Web-LLM (Phi-3.5)', webLlmInfer, prompt));
  results.push(await benchmarkInference('Transformers.js', transformersJsInfer, prompt));
  console.table(results);
  return results;
}
Rendering Results and Sharing
Note: This section is a design sketch. No code example is provided; the implementation will vary based on your UI framework and sharing requirements.
The benchmark results render into a simple HTML table or a lightweight Canvas-based chart. A shareable "results card" can be implemented as a styled div with fixed dimensions, designed to be screenshot-friendly for social media sharing. The interactive hosted version serves as the viral asset: developers run the benchmark on their own hardware and share their numbers, naturally generating engagement as different GPU configurations produce different results.
Performance varies significantly across hardware. Integrated GPUs (Intel Iris, Apple M-series) and discrete GPUs (NVIDIA, AMD) exhibit different throughput characteristics. Apple Silicon's unified memory architecture avoids the CPU-to-GPU transfer penalty that discrete GPU setups incur, which benefits WebGPU workloads where model weights and KV cache must remain GPU-accessible. Actual throughput depends on the specific chip, model size, and quantization scheme.
Performance Optimization and Production Considerations
GPU Memory Management
Model loading is the most expensive operation. Pre-warm the model on page load (or during an explicit "preparing" step) to avoid a multi-second stall when the user first triggers inference. Monitor the GPUDevice.lost promise to detect and recover from GPU context loss, which can happen if the system reclaims GPU memory under pressure.
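A sketch of device-loss monitoring follows; `initModel` is a hypothetical stand-in for your own model-loading routine, not a library API:

```javascript
// Hypothetical placeholder for your model-loading routine.
async function initModel(device) {
  // ... upload weights, build compute pipelines ...
}

// Create a GPUDevice and re-initialize if the system reclaims it.
async function createMonitoredDevice() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('WebGPU not available');
  const device = await adapter.requestDevice();

  device.lost.then((info) => {
    console.warn(`GPU device lost (${info.reason}): ${info.message}`);
    if (info.reason !== 'destroyed') {
      // Lost involuntarily (e.g., memory pressure): rebuild device and pipelines
      createMonitoredDevice().then((d) => initModel(d));
    }
  });

  return device;
}
```

The `reason !== 'destroyed'` check distinguishes deliberate teardown from involuntary loss, so cleanup paths don't trigger an unwanted reload.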
Run all inference work off the main thread. Web Workers prevent UI jank during model loading and token generation. The worker handles the Wasm/WebGPU pipeline and posts generated tokens back to the main thread via postMessage. Workers cannot share WebGPU device and buffer objects via postMessage; each worker must create its own GPUDevice. Consult the WebGPU spec section on multi-threading before implementing.
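A minimal sketch of the main-thread side of that pattern; the worker file name and the message shapes ({ type: 'generate' } / { type: 'token' } / { type: 'done' }) are our own conventions, not a library API:

```javascript
// main.js — delegate inference to a Web Worker so the UI stays responsive.
// 'inference-worker.js' (hypothetical) would run the Wasm/WebGPU pipeline
// and post each generated token back via postMessage.
function createInferenceWorker(onToken, onDone) {
  const worker = new Worker('inference-worker.js', { type: 'module' });
  worker.onmessage = (e) => {
    if (e.data.type === 'token') onToken(e.data.text);       // stream to UI
    else if (e.data.type === 'done') onDone(e.data.fullText); // final text
  };
  return {
    generate: (prompt) => worker.postMessage({ type: 'generate', prompt }),
    terminate: () => worker.terminate(),
  };
}
```

Because the worker must create its own GPUDevice, model weights load inside the worker; only prompts and generated text cross the postMessage boundary.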
Progressive Enhancement Strategy
Production applications should never assume a single backend is available. A cascading detection strategy provides resilience:
// Code Example 7: Unified inference factory with progressive enhancement
const BYOM_MODEL_ID = 'Phi-3.5-mini-instruct-q4f16_1-MLC'; // Change to your preferred model

async function createInferenceEngine() {
  // Tier 1: Chrome Prompt API (lowest friction)
  if ('ai' in self && 'languageModel' in self.ai) {
    const caps = await ai.languageModel.capabilities();
    if (caps.available === 'readily') {
      const session = await ai.languageModel.create();
      return {
        backend: 'prompt-api',
        infer: (prompt) => session.prompt(prompt),
        inferStream: (prompt) => session.promptStreaming(prompt),
        cleanup: () => session.destroy(),
      };
    }
  }

  // Tier 2: WebGPU BYOM (Web-LLM)
  if ('gpu' in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) {
      const { CreateMLCEngine } = await import('@mlc-ai/web-llm');
      const engine = await CreateMLCEngine(BYOM_MODEL_ID);
      return {
        backend: 'web-llm',
        infer: async (prompt) => {
          const r = await engine.chat.completions.create({
            messages: [{ role: 'user', content: prompt }],
          });
          return r.choices[0].message.content;
        },
        cleanup: () => engine.unload(),
      };
    } else {
      console.warn('WebGPU adapter unavailable, falling back to cloud');
    }
  }

  // Tier 3: Cloud API fallback
  // ⚠️ Warning: The cloud fallback shown here is a skeleton. Production implementations
  // must add authentication headers, rate-limit handling, and appropriate error reporting.
  return {
    backend: 'cloud-fallback',
    infer: async (prompt) => {
      const controller = new AbortController();
      const timeoutId = setTimeout(() => controller.abort(), 10_000);
      try {
        const res = await fetch('/api/infer', {
          method: 'POST',
          body: JSON.stringify({ prompt }),
          headers: { 'Content-Type': 'application/json' },
          signal: controller.signal,
        });
        if (!res.ok) throw new Error(`Cloud fallback failed: ${res.status}`);
        return (await res.json()).text;
      } finally {
        clearTimeout(timeoutId);
      }
    },
    cleanup: () => {},
  };
}
This factory returns a uniform infer(prompt) interface. The calling code doesn't need to know which backend was selected.
Security, Ethical, and Practical Caveats
Security Surface
Running inference locally does not eliminate prompt injection. Malicious input can still manipulate model outputs, and there are no server-side guardrails to intercept harmful generation. Client-side filtering is the developer's responsibility. Additionally, anyone can inspect model weights served to the browser by extracting them from the cache or network tab. Proprietary models should never be deployed this way.
Ethical Considerations
On-device inference consumes real hardware resources. GPU utilization spikes during generation. Battery life on laptops and mobile devices decreases. Users may not realize a webpage triggered a multi-gigabyte model download (Gemini Nano's downloaded component is approximately 1 GB or more depending on the Chrome version) or that their GPU is running at high utilization. Transparent disclosure and explicit consent for model downloads and inference execution should be standard practice, not an afterthought.
The Local-First AI Stack Is Ready
WebGPU provides the GPU compute substrate, WebAssembly handles tokenization and orchestration, and Chrome's Prompt API plus frameworks like Web-LLM and Transformers.js cover both zero-config and custom model paths. The gaps are concrete: Safari lacks WebGPU compute shader support as of June 2025, the Prompt API remains behind a flag in Chrome, and no cross-browser standard for on-device model access exists yet. But the core architecture works today, and production use cases are already shipping: privacy-preserving writing tools, offline-capable field applications, and latency-sensitive UX features that can't tolerate a network round-trip.
Build the benchmark. Try the Prompt API on a real feature. Load a quantized Llama model with Web-LLM and measure the tokens-per-second on your target hardware. The code examples in this guide are designed to be copied, modified, and deployed.