
The Definitive Guide to Local-First AI: Building Privacy-Centric Web Apps in 2026



Why Local-First AI Is the 2026 Default

Local-first AI means running inference directly on the client device, inside the browser, with no data ever leaving the user's machine. This is not a theoretical aspiration anymore. WebGPU compute shader pipelines, WebAssembly-optimized transformer runtimes, and Chrome's experimental window.ai API now make client-side inference practical for production applications in 2026, not just conference demos.

The pressure comes from multiple directions simultaneously. The EU AI Act is in active enforcement. US state privacy laws continue to expand. API costs for cloud-hosted inference have risen steadily, with per-token pricing from major providers increasing 15-30% year-over-year since 2024. Users increasingly expect offline capability as a baseline. For developers, the cost-benefit calculus has shifted: running a quantized 3B-parameter model in the browser now outperforms round-trip API calls for single-turn summarization under 2K tokens, text classification, form-field autocompletion, short translation tasks, and FAQ-style question answering.


This guide covers the full local-first AI stack, from raw WebGPU compute shaders through high-level window.ai abstractions, with illustrative code examples at every layer (complete runnable implementations are linked in the accompanying repository). By the end, you will have built a hybrid inference application and an interactive benchmarking tool that compares all three runtime paths on your own hardware.

You will need Node.js ≥ 18.0 and npm ≥ 9.0, comfort with modern JavaScript and TypeScript, a basic understanding of ML inference concepts, and Chrome Canary or Dev channel (version 130 or later).

The Local-First AI Architecture Stack

The browser's local-first AI architecture follows a layered model. The user interacts with the browser runtime, which dispatches inference tasks to one of three execution paths: WebGPU for GPU-accelerated workloads, WebAssembly for CPU-bound preprocessing or fallback, or window.ai for managed, zero-configuration language model tasks. The app stores models locally in IndexedDB or the Origin Private File System (OPFS), and all data remains on the device.

Which path you choose depends on the task, the target device, and how much control you need over the inference pipeline.
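The decision can be sketched as a small routing helper over a capabilities object (its shape matches the detection code shown later in this guide). The preference order here is one reasonable policy, not a prescription: managed window.ai for language tasks, WebGPU for anything custom, WASM as the universal fallback.

```javascript
// Minimal sketch: pick an execution path from a capabilities object.
// Preference order is an assumption, not a rule from any specification.
function selectExecutionPath(task, caps) {
  const languageTasks = new Set(["summarize", "translate", "chat"]);
  if (languageTasks.has(task) && caps.windowAI) return "window.ai";
  if (caps.webgpu) return "webgpu"; // full control: custom or fine-tuned models
  if (caps.wasm) return "wasm"; // CPU fallback, universally available
  return "none"; // no on-device option; caller decides on server fallback
}
```

A real application would extend this with per-device memory checks, but the three-way branch is the core of every hybrid architecture in this guide.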

WebGPU: GPU-Accelerated Inference in the Browser

WebGPU is the foundational layer for performant local AI. Unlike WebGL, which Khronos designed for graphics rendering and which required awkward workarounds to perform general compute, WebGPU exposes first-class compute shaders and storage buffers. These primitives map directly to the operations that transformer inference demands: large matrix multiplications, attention computations, and activation functions executed across thousands of GPU threads.

As of mid-2026, WebGPU is stable in Chrome and Edge, with Firefox shipping support behind a preference flag and Safari offering partial support in Technology Preview builds. For production targeting, Chrome and Chromium-based browsers represent the reliable deployment surface.

What should you measure for AI workloads? Memory bandwidth, precision support, and buffer size limits. WebGPU supports both f16 and f32 precision, with f16 offering roughly 2x throughput on hardware that supports it natively (this is hardware-dependent and should be verified on target devices). The per-binding storage buffer limit typically ranges from 128MB to 256MB on discrete GPUs; total VRAM available is separate and device-dependent. These per-binding limits constrain the model sizes you can load into a single buffer without partitioning.
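To make the per-binding constraint concrete, here is a small helper (illustrative, not from any library) that computes how many storage buffers a given set of weights must be split across under a device's reported binding limit:

```javascript
// Illustrative helper: number of separate storage-buffer partitions needed
// for a model's weights given maxStorageBufferBindingSize from the adapter.
function storagePartitions(modelBytes, maxBindingBytes) {
  if (maxBindingBytes <= 0) throw new Error("invalid binding limit");
  return Math.ceil(modelBytes / maxBindingBytes);
}

// ~1.8GB of INT4 weights against a 128MB per-binding limit
const partitions = storagePartitions(1.8 * 1024 ** 3, 128 * 1024 ** 2);
```

On a 128MB limit, a 1.8GB weight blob needs fifteen partitions, which is why production runtimes shard weights per-layer rather than loading one monolithic buffer.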

WebAssembly: CPU Fallback and Model Preprocessing

WASM fills two distinct roles in the local-first stack. First, it handles preprocessing tasks like tokenization, input normalization, and embedding lookups that run efficiently on the CPU and do not warrant GPU dispatch overhead. Second, it serves as the fallback execution path for devices without WebGPU support or with inadequate GPU memory.

WASM SIMD (Single Instruction, Multiple Data) and threads enable meaningful parallelism for CPU inference. Libraries like onnxruntime-web (the ONNX Runtime web build) and wonnx use these capabilities to run quantized models at 1-4 tokens per second on multi-core processors (see the benchmark table below for device-specific ranges).

CPU inference is substantially slower than GPU-accelerated paths for transformer models, but it remains essential for progressive enhancement and for devices like older smartphones where GPU compute is unavailable or unreliable.

Chrome's window.ai API: The High-Level Abstraction

Chrome's experimental window.ai API (available via Origin Trial and developer flags) provides a managed, high-level interface to on-device AI capabilities powered by a built-in Gemini Nano model. The API is not yet stable, and its surface may change before general availability. It includes window.ai.languageModel for general text generation, window.ai.summarizer for document summarization, and window.ai.translator for language translation.

The key distinction from raw WebGPU inference is that window.ai manages the entire model lifecycle: downloading, caching, loading, and executing the model. You get zero-setup AI capabilities with no model hosting, no VRAM management, and no compute shader authoring. The tradeoff is constraint. The built-in Gemini Nano model has fixed capabilities, a context window of roughly 1,024 tokens (query capabilities() at runtime for the exact limit on your build), and no support for custom or fine-tuned weights. For rapid prototyping, text summarization, and translation tasks, that tradeoff usually pays off. For domain-specific tasks like medical image classification or custom embedding generation, you will need raw WebGPU inference.

// Feature detection and capability checking
async function detectAICapabilities() {
  const capabilities = {
    webgpu: false,
    windowAI: false,
    wasm: false,
    wasmSimd: false,
    // Note: Do not retain a live GPUAdapter reference here.
    // Call initWebGPUDevice() when you need an actual device.
  };

  // Check WebGPU — request adapter only to probe capability; do not retain it
  if (navigator.gpu) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) {
      capabilities.webgpu = true;
      // adapter intentionally not stored; call initWebGPUDevice() when needed
    }
  }

  // Check window.ai (experimental; may be undefined even in Chrome)
  if (window.ai?.languageModel) {
    const status = await window.ai.languageModel.capabilities();
    capabilities.windowAI = status.available !== "no";
  }

  // WASM is broadly supported; also check SIMD
  capabilities.wasm = typeof WebAssembly === "object";

  if (capabilities.wasm) {
    try {
      // Minimal SIMD probe: validate a module containing a SIMD instruction
      const simdTest = new Uint8Array([
        0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0, 10,
        10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
      ]);
      capabilities.wasmSimd = WebAssembly.validate(simdTest);
    } catch {
      capabilities.wasmSimd = false;
    }
  }

  return capabilities;
}

Setting Up Your Development Environment

You need Chrome Dev or Canary channel for the latest API surfaces. For window.ai language model access, navigate to chrome://flags/#prompt-api-for-gemini-nano and set it to "Enabled". For the Summarizer API, also enable chrome://flags/#summarization-api-for-gemini-nano. For the Translator API, enable chrome://flags/#translation-api-for-gemini-nano.

Confirm WebGPU compute availability by verifying that navigator.gpu exists, that navigator.gpu.requestAdapter() resolves to a non-null adapter, and that adapter.requestDevice() succeeds. (The presence of navigator.gpu alone does not confirm compute shader support.)

Next, acquire your models. Download quantized models in GGUF (for llama.cpp-compatible runtimes) or ONNX (for ort-web) format and cache them locally. For persistence across sessions, OPFS provides a high-performance file system API that handles multi-gigabyte model files without the size limitations of IndexedDB blob storage. Note that OPFS requires a secure context (HTTPS or localhost).
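A minimal OPFS caching sketch might look like the following. The function names and the name-plus-version-hash naming scheme are assumptions for illustration; the file-handle calls (getDirectory, getFileHandle, createWritable) are the standard OPFS API:

```javascript
// Pure helper: derive a cache file name from model name + version hash so a
// new release never collides with a stale cached copy (scheme is an assumption).
function modelCacheKey(name, versionHash) {
  return `${name.replace(/[^a-zA-Z0-9_-]/g, "_")}-${versionHash}.bin`;
}

// Browser-only sketch; OPFS requires a secure context (HTTPS or localhost).
async function cacheModelInOPFS(name, versionHash, bytes) {
  const root = await navigator.storage.getDirectory();
  const handle = await root.getFileHandle(modelCacheKey(name, versionHash), {
    create: true,
  });
  const writable = await handle.createWritable();
  await writable.write(bytes);
  await writable.close();
}

async function loadModelFromOPFS(name, versionHash) {
  const root = await navigator.storage.getDirectory();
  try {
    const handle = await root.getFileHandle(modelCacheKey(name, versionHash));
    const file = await handle.getFile();
    return new Uint8Array(await file.arrayBuffer());
  } catch {
    return null; // not cached yet; caller falls back to a network fetch
  }
}
```

For multi-gigabyte files, prefer streaming the network response into the writable rather than buffering the whole model in memory as this sketch does.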

Project Scaffolding

Initialize a Vite + TypeScript project and install the core dependencies:

npm create vite@5 local-ai-app -- --template vanilla-ts
cd local-ai-app
npm install @huggingface/transformers@^3 onnxruntime-web
npm install -D vite typescript

Transformers.js v3 (published under the @huggingface/transformers package) provides the pipeline abstraction that manages model loading, tokenization, and inference with WebGPU backend support out of the box.

// WebGPU device initialization for inference workloads
async function initWebGPUDevice() {
  if (!navigator.gpu) {
    throw new Error("WebGPU not supported in this browser");
  }

  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: "high-performance",
  });

  if (!adapter) {
    throw new Error("No suitable GPU adapter found");
  }

  // GPUAdapter.info replaced the older requestAdapterInfo() method
  const adapterInfo = adapter.info;
  // adapterInfo.description may be empty on some platforms/browsers
  console.log(`GPU: ${adapterInfo.description || "(unknown)"}`);

  // Clamp requested limits to what the adapter actually supports
  const MAX_BUFFER = 256 * 1024 * 1024;
  const safeBufferSize = Math.min(MAX_BUFFER, adapter.limits.maxBufferSize);
  const safeStorageSize = Math.min(
    MAX_BUFFER,
    adapter.limits.maxStorageBufferBindingSize
  );
  const safeWorkgroupX = Math.min(
    256,
    adapter.limits.maxComputeWorkgroupSizeX
  );
  const safeInvocations = Math.min(
    256,
    adapter.limits.maxComputeInvocationsPerWorkgroup
  );

  const baseLimits = {
    maxBufferSize: safeBufferSize,
    maxStorageBufferBindingSize: safeStorageSize,
    maxComputeWorkgroupSizeX: safeWorkgroupX,
    maxComputeInvocationsPerWorkgroup: safeInvocations,
  };

  // Attempt f16 first; fall back to f32 on unsupported hardware
  let device;
  try {
    device = await adapter.requestDevice({
      requiredFeatures: ["shader-f16"],
      requiredLimits: baseLimits,
    });
  } catch {
    console.warn("shader-f16 not supported or limit error; falling back to f32");
    device = await adapter.requestDevice({
      requiredLimits: baseLimits,
    });
  }

  device.lost.then((info) => {
    console.error(`GPU device lost: ${info.message}`);
    // Trigger fallback to WASM — implement recovery logic here
  });

  return device;
}

The shader-f16 feature request will reject on hardware without f16 capability (many integrated Intel/AMD GPUs). The try/catch above ensures graceful degradation. All requested limits are clamped to the adapter's reported maximums, preventing OperationError on mid-range and integrated GPUs where defaults may be lower than 256MB.

Running an LLM in the Browser with WebGPU

Model selection is the single most consequential decision for browser-based inference. Quantized models are non-negotiable: a full-precision 3B-parameter model requires roughly 12GB of memory in f32 (weights only, excluding activations and KV-cache), but a 4-bit quantized (INT4/GGUF Q4_K_M) version of the same model's weights fits into approximately 1.8GB. Note that total VRAM consumption will be higher once KV-cache, activations, and runtime overhead are included. This brings it within reach of devices with 4-8GB of available VRAM, though 2GB integrated GPUs will likely be insufficient for a 3B model once all memory costs are accounted for.

Practical device VRAM ranges from 2GB on low-end integrated graphics to 8GB or more on discrete GPUs and Apple Silicon with unified memory. A quantized Phi-3-mini (3.8B parameters at INT4) or a similarly sized model represents the current sweet spot for browser inference: small enough to load reliably on mid-range hardware, large enough to produce useful outputs for summarization, classification, and conversational tasks.
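A back-of-envelope footprint estimate follows directly from parameter count and bytes per parameter. This sketch covers weights only; as noted above, real VRAM use is higher once KV-cache, activations, and runtime overhead are added:

```javascript
// Weight footprint only — excludes KV-cache, activations, runtime overhead.
const BYTES_PER_PARAM = { f32: 4, f16: 2, int8: 1, int4: 0.5 };

function weightFootprintGB(paramCountBillions, dtype) {
  const bpp = BYTES_PER_PARAM[dtype];
  if (bpp === undefined) throw new Error(`unknown dtype: ${dtype}`);
  return (paramCountBillions * 1e9 * bpp) / 1e9; // decimal gigabytes
}
```

For the 3B example used throughout this guide, this gives 12GB at f32 and 1.5GB of raw INT4 weights, which lands near the ~1.8GB quoted earlier once container-format overhead is included.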

Loading and Running a Quantized Transformer

The loading pipeline follows a predictable sequence: fetch the model files (weights, tokenizer, configuration), cache them in OPFS for subsequent visits, and then load the weights into GPU buffers. Transformers.js v3 abstracts most of this behind its pipeline API.

Streaming token generation is critical for interactive applications. Users perceive the model as responsive when tokens start appearing within 1-2 seconds, even if total generation takes 10-15 seconds. The TextStreamer callback mechanism in Transformers.js provides this.

import { pipeline, TextStreamer } from "@huggingface/transformers";

async function runBrowserLLM(prompt, outputElement) {
  const startTime = performance.now();
  let tokenCount = 0;
  let fullOutput = "";

  // Initialize pipeline with WebGPU backend
  // Verify the model exists at https://huggingface.co/Xenova/Phi-3-mini-4k-instruct-q4 before use.
  // For reproducibility, pin to a specific revision: { revision: '<commit-sha>' }
  const generator = await pipeline(
    "text-generation",
    "Xenova/Phi-3-mini-4k-instruct-q4",
    {
      device: "webgpu",
      dtype: "q4", // 4-bit quantization
    }
  );

  const loadTime = performance.now() - startTime;
  console.log(`Model loaded in ${(loadTime / 1000).toFixed(1)}s`);

  try {
    const streamer = new TextStreamer(generator.tokenizer, {
      skip_prompt: true,
      callback_function: (text) => {
        // Accumulate text; count tokens accurately from encoded length
        fullOutput += text;
        tokenCount = generator.tokenizer.encode(fullOutput).length;
        outputElement.textContent += text;
      },
    });

    const inferStart = performance.now();

    await generator(prompt, {
      max_new_tokens: 256,
      temperature: 0.7,
      do_sample: true,
      streamer,
    });

    const inferTime = (performance.now() - inferStart) / 1000;
    const tokensPerSec = tokenCount / inferTime;

    console.log(
      `Generated ${tokenCount} tokens at ${tokensPerSec.toFixed(1)} tok/s`
    );

    return { tokenCount, tokensPerSec, loadTime };
  } finally {
    // Release GPU resources — dispose() is the Transformers.js pipeline cleanup method.
    // The finally block ensures cleanup even if inference throws an error.
    await generator.dispose();
  }
}

Memory management deserves explicit attention. Calling dispose() on the pipeline releases GPU buffers. Wrapping the inference call in a try/finally block ensures GPU buffers are released even when inference throws an error, preventing VRAM leaks on repeated failures. Monitoring device.lost (shown in the initialization code above) allows graceful recovery when the browser reclaims GPU memory under pressure, for instance, when the user switches to a GPU-intensive tab.

Note that TextStreamer callbacks may deliver variable-length text chunks, not one token per call. To get an accurate token count, the code above encodes the accumulated output text using the tokenizer rather than simply incrementing a counter per callback invocation.

Optimizing Inference Performance

KV-cache management on the GPU is the primary bottleneck for longer sequences. The key-value cache grows linearly with sequence length and must persist in GPU memory across token generation steps. For a 3B model with 32 layers, 32 attention heads, and a head dimension of 96 at a 4096 context length, the KV-cache consumes approximately 1.5GB in f16 (calculated as: 2 × 32 layers × 32 heads × 96 head_dim × 4096 context_length × 2 bytes ≈ 1.6 × 10⁹ bytes). Limiting max_new_tokens and implementing cache eviction for multi-turn conversations directly impacts whether a model fits in available VRAM.
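The arithmetic is easy to encode as a helper, which is useful when deciding whether a given context length fits a device's VRAM budget:

```javascript
// KV-cache size: 2 (K and V) × layers × heads × head_dim × context × bytes/elem.
function kvCacheBytes({ layers, heads, headDim, contextLen, bytesPerElem }) {
  return 2 * layers * heads * headDim * contextLen * bytesPerElem;
}

// The 3B example from the text, in f16 (2 bytes per element)
const bytes = kvCacheBytes({
  layers: 32, heads: 32, headDim: 96, contextLen: 4096, bytesPerElem: 2,
});
const gib = bytes / 1024 ** 3; // 1.5 GiB
```

Halving the context length halves this number, which is often the cheapest lever when a model almost fits.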

Quantization tradeoffs are real and measurable. INT4 quantization roughly halves model size compared to INT8 but introduces quantization error that can degrade output quality, particularly for reasoning-heavy tasks. INT8 offers better quality-to-size balance for tasks requiring precision. FP16 provides near-full-precision quality but doubles memory requirements versus INT8. The right choice depends on the application: summarization and classification tolerate INT4 well, while code generation and mathematical reasoning benefit from INT8 or higher.
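As a heuristic (not a benchmark result), the guidance above can be captured in a small lookup, using only the task categories named in this section:

```javascript
// Rule-of-thumb mapping from task type to quantization level, per the
// tradeoffs discussed above. A heuristic sketch, not measured guidance.
function recommendedQuantization(task) {
  const int4Tolerant = new Set(["summarization", "classification"]);
  const precisionSensitive = new Set(["code-generation", "math-reasoning"]);
  if (int4Tolerant.has(task)) return "int4";
  if (precisionSensitive.has(task)) return "int8";
  return "int8"; // conservative default for unlisted tasks
}
```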


To illustrate what happens beneath the Transformers.js abstraction layer, here is a minimal WGSL compute shader performing matrix multiplication, the core operation repeated thousands of times during transformer inference:

// WGSL compute shader — matrix multiplication on storage buffers
// Both matA and matB are expected in row-major layout.
@group(0) @binding(0) var<storage, read> matA: array<f32>;
@group(0) @binding(1) var<storage, read> matB: array<f32>;
@group(0) @binding(2) var<storage, read_write> matC: array<f32>;

struct Dimensions {
  M: u32, N: u32, K: u32,
}
@group(0) @binding(3) var<uniform> dims: Dimensions;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let row = gid.x;
  let col = gid.y;

  if (row >= dims.M || col >= dims.N) { return; }

  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < dims.K; k = k + 1u) {
    sum = sum + matA[row * dims.K + k] * matB[k * dims.N + col];
  }
  matC[row * dims.N + col] = sum;
}

// Dispatching the matmul shader from JavaScript,
// including buffer creation, bind group setup, and workgroup dispatch.

async function dispatchMatMul(device, matAData, matBData, M, N, K) {
  const wgslShaderSource = `
    @group(0) @binding(0) var<storage, read> matA: array<f32>;
    @group(0) @binding(1) var<storage, read> matB: array<f32>;
    @group(0) @binding(2) var<storage, read_write> matC: array<f32>;
    struct Dimensions { M: u32, N: u32, K: u32 }
    @group(0) @binding(3) var<uniform> dims: Dimensions;
    @compute @workgroup_size(16, 16)
    fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
      let row = gid.x; let col = gid.y;
      if (row >= dims.M || col >= dims.N) { return; }
      var sum: f32 = 0.0;
      for (var k: u32 = 0u; k < dims.K; k++) {
        sum += matA[row * dims.K + k] * matB[k * dims.N + col];
      }
      matC[row * dims.N + col] = sum;
    }`;

  const byteSize = (n) => n * Float32Array.BYTES_PER_ELEMENT;

  const makeBuffer = (data, usage) => {
    const buf = device.createBuffer({
      size: byteSize(data.length),
      usage: usage | GPUBufferUsage.COPY_DST,
    });
    device.queue.writeBuffer(buf, 0, data);
    return buf;
  };

  const bufA = makeBuffer(matAData, GPUBufferUsage.STORAGE);
  const bufB = makeBuffer(matBData, GPUBufferUsage.STORAGE);
  const bufC = device.createBuffer({
    size: byteSize(M * N),
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
  const dimsData = new Uint32Array([M, N, K]);
  const bufDims = makeBuffer(dimsData, GPUBufferUsage.UNIFORM);

  const shaderModule = device.createShaderModule({ code: wgslShaderSource });
  const computePipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: shaderModule, entryPoint: "main" },
  });

  const bindGroup = device.createBindGroup({
    layout: computePipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: bufA } },
      { binding: 1, resource: { buffer: bufB } },
      { binding: 2, resource: { buffer: bufC } },
      { binding: 3, resource: { buffer: bufDims } },
    ],
  });

  const commandEncoder = device.createCommandEncoder();
  const pass = commandEncoder.beginComputePass();
  pass.setPipeline(computePipeline);
  pass.setBindGroup(0, bindGroup); // bind all four buffers declared in the shader
  pass.dispatchWorkgroups(Math.ceil(M / 16), Math.ceil(N / 16));
  pass.end();

  device.queue.submit([commandEncoder.finish()]);

  // Cleanup intermediate buffers
  bufA.destroy();
  bufB.destroy();
  bufDims.destroy();
  // bufC ownership transferred to caller; caller must destroy when done

  return bufC;
}

This is simplified, but the pattern is representative: create storage buffers containing weight matrices, create a bind group referencing all buffers, dispatch workgroups sized to tile the matrix, and accumulate dot products. Production transformer runtimes add tiling optimizations, shared memory usage, and fused activation kernels, but the fundamental compute shader dispatch model is the same.

Using Chrome's window.ai for Zero-Setup AI Features

When the goal is adding text summarization, translation, or general language understanding without managing models, window.ai is the pragmatic choice. It trades flexibility for simplicity: no model downloads to manage, no GPU buffer allocation, no quantization decisions.

The API works well for rapid prototyping and for applications where the built-in Gemini Nano capabilities align with the requirements, but it falls short for domain-specific tasks like medical image classification, custom embedding generation, or any workload requiring a fine-tuned model. For those, raw WebGPU inference remains necessary.

Building a Privacy-First Summarizer

// window.ai summarizer with streaming output
async function summarizeDocument(text, outputElement) {
  const canSummarize = await window.ai.summarizer.capabilities();

  if (canSummarize.available === "no") {
    throw new Error("Summarizer not available on this device");
  }

  const summarizer = await window.ai.summarizer.create({
    type: "key-points",
    format: "markdown",
    length: "medium",
    monitor(m) {
      m.addEventListener("downloadprogress", (e) => {
        const pct = ((e.loaded / e.total) * 100).toFixed(0);
        outputElement.textContent = `Downloading model: ${pct}%`;
      });
    },
  });

  try {
    // Wait for model readiness if still downloading
    if (canSummarize.available === "after-download") {
      await summarizer.ready;
    }

    outputElement.textContent = "";

    const stream = await summarizer.summarizeStreaming(text);
    for await (const chunk of stream) {
      // The streaming API returns cumulative text; each chunk replaces the previous content
      outputElement.textContent = chunk;
    }
  } finally {
    // destroy() is the window.ai session cleanup method (distinct from dispose() used by Transformers.js pipelines).
    // The finally block ensures the session is released even if streaming throws an error.
    summarizer.destroy();
  }
}

The monitor callback provides download progress for first-time users who need to wait for the Gemini Nano model to download. Subsequent visits use the cached model. The type parameter accepts "key-points", "tl;dr", "teaser", and "headline" values, each producing stylistically different summaries. These values are correct as of the current experimental API surface; verify against current Chrome documentation as the API is not yet stable.

Combining window.ai with Custom Models

The most powerful architecture pattern combines both approaches: window.ai for general text intelligence and custom WebGPU models for specialized tasks. A routing function dispatches work to the appropriate backend based on task type and device capabilities.

// Stub for custom ONNX model inference — implement per your model and task requirements
async function runCustomONNXModel(task, input, backend) {
  throw new Error(
    `runCustomONNXModel not implemented for task: ${task}, backend: ${backend}`
  );
}

// Non-blocking consent modal (replaces synchronous confirm() which blocks the main thread
// and is suppressed in cross-origin iframes)
async function showConsentModal(message) {
  // Replace with your application's modal UI component.
  // This default implementation uses confirm() as a placeholder.
  return Promise.resolve(confirm(message));
}

// Auth and CSRF token retrieval — implement per your authentication scheme
function getAuthToken() {
  // Return a Bearer token from your auth provider
  return localStorage.getItem("auth_token") || "";
}

function getCSRFToken() {
  // Return CSRF token from meta tag or cookie
  return (
    document.querySelector('meta[name="csrf-token"]')?.content || ""
  );
}

// Allowlist of recognized task types for server fallback
const ALLOWED_TASKS = new Set([
  "summarize",
  "translate",
  "chat",
  "classify-image",
  "embeddings",
]);

// Hybrid task router
async function routeAITask(task, input, capabilities) {
  switch (task) {
    case "summarize":
    case "translate":
    case "chat":
      if (capabilities.windowAI) {
        const session = await window.ai.languageModel.create();
        try {
          // Collect full streamed response before returning
          const stream = session.promptStreaming(input);
          let result = "";
          for await (const chunk of stream) {
            result = chunk; // API returns cumulative text per chunk
          }
          return result;
        } finally {
          session.destroy(); // Always release the language model session
        }
      }
      // Fall through to WebGPU LLM
      {
        const outputEl = document.getElementById("output");
        if (!outputEl) {
          throw new Error(
            'Output element with id "output" not found in the DOM'
          );
        }
        return runBrowserLLM(input, outputEl);
      }

    case "classify-image":
    case "embeddings":
      if (capabilities.webgpu) {
        return runCustomONNXModel(task, input, "webgpu");
      }
      return runCustomONNXModel(task, input, "wasm");

    default: {
      if (!capabilities.webgpu && !capabilities.windowAI) {
        // Validate task against allowlist before sending to server
        if (!ALLOWED_TASKS.has(task)) {
          throw new Error(`Unknown task type: ${task}`);
        }

        // ⚠️ Server fallback: this sends user data off-device.
        // In a production app, require explicit user consent before proceeding.
        const consent = await showConsentModal(
          "No on-device AI is available. This action requires sending data to a server. Continue?"
        );
        if (!consent) return null;

        const authToken = getAuthToken();
        const response = await fetch("/api/inference", {
          method: "POST",
          headers: {
            "Content-Type": "application/json",
            Authorization: `Bearer ${authToken}`,
            "X-CSRF-Token": getCSRFToken(),
          },
          body: JSON.stringify({ task, input }),
        });
        if (!response.ok) {
          throw new Error(`Server error: ${response.status}`);
        }
        return response.json();
      }
      throw new Error(
        `Unroutable task with available capabilities: ${task}`
      );
    }
  }
}

This pattern preserves privacy by default (all local paths first) and only falls back to a server API when no on-device option exists, with explicit user consent before any data leaves the device. The server fallback validates the task against an allowlist, includes Content-Type and authentication headers, and uses a non-blocking consent prompt instead of the synchronous confirm() API.

Building the Interactive WebGPU Performance Comparison Tool

The benchmark tool runs identical inference tasks across all three execution paths (WebGPU, WASM CPU, and window.ai) on the reader's own hardware, producing a direct comparison of tokens per second, time to first token, and memory usage.

Implementing the Benchmark Harness

Note: performance.measureUserAgentSpecificMemory() requires cross-origin isolation. Your server must send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. Verify isolation with self.crossOriginIsolated === true in the console before relying on memory measurements.
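For local development with the Vite setup from earlier, those headers can be added in vite.config.ts (a sketch; mirror the same headers on your production server):

```typescript
// vite.config.ts — send the cross-origin isolation headers that
// performance.measureUserAgentSpecificMemory() requires.
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    headers: {
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    },
  },
});
```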

The harness runs each backend sequentially to avoid resource contention, executes one discarded warm-up run followed by five measured runs per backend, and reports the median of measured runs to reduce variance from background system activity and cold-start model loading.

// Stub: implement to dispatch inference to the appropriate backend
// Example: if (backend === 'webgpu') return runBrowserLLM(prompt, outputEl);
async function executeInference(backend, prompt, onToken) {
  throw new Error(
    `executeInference not implemented for backend: ${backend}. Wire this to your backend runner.`
  );
}

// Benchmark harness core loop
const BENCHMARK_RUNS = 5;

async function runBenchmarkSuite(prompt, backends) {
  const results = {};

  for (const backend of backends) {
    const timings = [];

    // Warm-up run (discarded) — absorbs model load, shader compilation costs.
    // Errors are caught so a broken backend is reported, not silently skipped.
    try {
      await executeInference(backend, prompt, () => {});
    } catch (err) {
      console.warn(
        `Warm-up failed for backend "${backend}": ${err.message}. Skipping.`
      );
      results[backend] = { error: err.message };
      continue;
    }

    for (let run = 0; run < BENCHMARK_RUNS; run++) {
      const startTTFT = performance.now();
      let firstTokenTime = null;
      let totalTokens = 0;

      const onToken = () => {
        if (!firstTokenTime) firstTokenTime = performance.now() - startTTFT;
        totalTokens++;
      };

      const start = performance.now();
      await executeInference(backend, prompt, onToken);
      const elapsed = (performance.now() - start) / 1000;

      timings.push({
        tokensPerSec: totalTokens / elapsed,
        timeToFirstToken: firstTokenTime,
        totalTime: elapsed,
        tokenCount: totalTokens,
      });

      // Per-backend memory snapshot (if available) — measured immediately
      // after each backend's run for accuracy
      if (
        self.crossOriginIsolated &&
        performance.measureUserAgentSpecificMemory
      ) {
        timings[timings.length - 1].memory =
          await performance.measureUserAgentSpecificMemory();
      }
    }

    // Sort by tokensPerSec ascending and take the median (middle element).
    // For an odd run count the middle index is exact; for even counts,
    // the lower-median is used.
    timings.sort((a, b) => a.tokensPerSec - b.tokensPerSec);
    const medianIdx = Math.floor(timings.length / 2);
    results[backend] = timings[medianIdx];
  }

  return results;
}

Visualizing Results with Real-Time Metrics

For precise GPU timing beyond performance.now(), compute-pass timestampWrites (gated behind the "timestamp-query" device feature) allow querying actual GPU execution time separately from CPU-side dispatch overhead. This reveals cases where the CPU is the bottleneck (in buffer upload or tokenization) versus the GPU compute itself.

The benchmark results can be encoded as URL query parameters, enabling users to share their device-specific numbers. Be aware that sharing hardware identifiers in URLs (e.g., ?gpu=45.2&wasm=3.1&windowai=28.7&device=M3Pro) constitutes device fingerprinting. Consider allowing users to anonymize or omit device info before sharing, consistent with this guide's privacy-first principles.
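One sketch of a share-link encoder that omits the device identifier unless the user explicitly opts in (the parameter names mirror the example URL above):

```javascript
// Encode benchmark results as query parameters. The device field is excluded
// by default because it is a fingerprinting vector.
function encodeShareURL(baseURL, results, { includeDevice = false } = {}) {
  const params = new URLSearchParams();
  if (results.gpu != null) params.set("gpu", results.gpu.toFixed(1));
  if (results.wasm != null) params.set("wasm", results.wasm.toFixed(1));
  if (results.windowai != null) params.set("windowai", results.windowai.toFixed(1));
  if (includeDevice && results.device) params.set("device", results.device);
  return `${baseURL}?${params.toString()}`;
}
```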

Privacy, Security, and Production Considerations

Threat Model for Client-Side AI

Local-first AI eliminates server-side data exposure by architecture. User prompts, documents, and inputs never traverse a network, which satisfies GDPR and CCPA data minimization requirements without needing a privacy policy carve-out for AI processing.

What local-first does not solve: an attacker can download and extract model weights from the client. Model theft is a real concern for proprietary models, and obfuscation provides only superficial protection. Prompt injection also remains applicable, since the model runs the same inference regardless of where it executes: injected instructions can still produce harmful outputs locally. The exfiltration risk drops sharply when no network is involved, but hybrid architectures that include a server fallback reintroduce it.

In shared or virtualized GPU environments (not typical local consumer hardware), GPU side-channel timing attacks may allow an adversary sharing the same GPU to infer information about inference workloads through timing analysis. This risk is relevant primarily in cloud virtualization contexts, not the local consumer hardware deployments that are this guide's focus. Content Security Policy headers should restrict the origins from which model files can be loaded.
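An illustrative CSP header for an app that loads weights only from its own origin and a dedicated model CDN (models.example.com is a placeholder; the 'wasm-unsafe-eval' keyword is needed when a restrictive script-src would otherwise block WebAssembly compilation):

```http
Content-Security-Policy: default-src 'self'; script-src 'self' 'wasm-unsafe-eval'; connect-src 'self' https://models.example.com; worker-src 'self'
```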

Offline Support and Model Caching

A Service Worker strategy for model files treats them as long-lived cache entries. OPFS is preferable to the Cache API for multi-gigabyte model files because it provides a file-system-like interface with stream-based reads that avoid loading entire files into memory. A simple manifest file (listing model name, version hash, and file URLs) enables you to invalidate the cache when you publish updated weights.
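One way to act on such a manifest is a pure diff between the cached copy and a freshly fetched one. The manifest shape used here ({ version, files: [{ url, hash }] }) is an assumption for illustration:

```javascript
// Compare the cached manifest against a freshly fetched one and compute
// which model files to download and which stale entries to evict.
function diffManifests(cached, remote) {
  const cachedByURL = new Map((cached?.files ?? []).map(f => [f.url, f.hash]));
  const remoteURLs = new Set(remote.files.map(f => f.url));
  // Fetch anything new, or anything whose content hash changed.
  const toFetch = remote.files
    .filter(f => cachedByURL.get(f.url) !== f.hash)
    .map(f => f.url);
  // Evict files the new manifest no longer references.
  const toEvict = [...cachedByURL.keys()].filter(url => !remoteURLs.has(url));
  return { toFetch, toEvict };
}
```

The toFetch list feeds the OPFS download step, and toEvict names the stale entries to delete, so publishing updated weights only re-downloads the files whose hashes actually changed.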

Progressive Enhancement and Fallback

Graceful degradation follows the capability detection pattern established earlier: attempt WebGPU first, fall back to WASM, and offer a server API as the last resort. Communicating device requirements to users matters for UX. A model that requires a 2GB download and 4GB of VRAM should be clearly disclosed before the download begins, not surfaced as a loading spinner that stalls on insufficient hardware.
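A compact sketch of that detection chain. The env parameter exists only to make the function testable; in the browser you would pass globalThis (or nothing):

```javascript
// Capability-ordered backend selection: WebGPU first, WASM second,
// server API as the last resort.
async function selectBackend(env = globalThis) {
  if (env.navigator?.gpu) {
    // requestAdapter() can still resolve to null (e.g. blocklisted
    // drivers), so probe rather than trusting navigator.gpu alone.
    const adapter = await env.navigator.gpu.requestAdapter().catch(() => null);
    if (adapter) return 'webgpu';
  }
  if (typeof env.WebAssembly === 'object' &&
      typeof env.WebAssembly.instantiate === 'function') {
    return 'wasm';
  }
  return 'server'; // last resort: route the request to a hosted API
}
```

The result is a single string the task router can branch on, and the same probe can drive the up-front disclosure of download size and memory requirements for the chosen backend.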

Performance Benchmarks: What to Expect in 2026

Performance varies dramatically by device class. The following are indicative estimates for a quantized 3B-parameter model (INT4, Q4_K_M, Transformers.js v3 with WebGPU backend, 256-token generation). Actual results depend on model, quantization scheme, context length, browser version, and thermal conditions:

| Device Class | Example Hardware | Tokens/sec | Time to First Token | Notes |
| --- | --- | --- | --- | --- |
| High-end desktop | RTX 4070 / M3 Pro | 40-55 tok/s | 0.3-0.5s | Above 30 tok/s, users perceive output as real-time |
| Mid-range laptop | Intel Iris Xe / M2 Air | 12-20 tok/s | 0.8-1.5s | Usable for summarization, slower chat |
| Mobile (Android Chrome) | Snapdragon 8 Gen 3 | 5-10 tok/s | 1.5-3.0s | Limited by thermal throttling and VRAM |
| WASM CPU fallback | Any modern quad-core | 1-4 tok/s | 2.0-5.0s | ort-web, WASM SIMD + threads, INT4 3B model |

For comparison, a cloud API call to a hosted LLM typically adds an estimated 200-500ms of network latency before the first token arrives (varies by provider, region, and load), plus variable queue wait times. Local inference on a high-end desktop reaches first-token faster than most API calls and eliminates per-token billing entirely. The crossover point where local beats cloud is clear: latency-sensitive applications, offline-required scenarios, and any context where data privacy is non-negotiable.
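The crossover can be made concrete with a small break-even model: given tokens/sec and time-to-first-token for each backend, the generation length below which local finishes first falls out of simple algebra. All numbers here are illustrative, not measurements:

```javascript
// Generation length (in tokens) below which local inference completes
// before a cloud call. Setting total times equal,
//   localTTFT + n / localTPS = cloudTTFT + n / cloudTPS,
// and solving for n gives the break-even point. Returns Infinity when
// local streams at least as fast as cloud (local then wins at every
// length whenever its first token also arrives earlier).
function breakEvenTokens(local, cloud) {
  const perTokenGap = 1 / local.tokensPerSec - 1 / cloud.tokensPerSec;
  if (perTokenGap <= 0) return Infinity;
  return (cloud.firstTokenSec - local.firstTokenSec) / perTokenGap;
}
```

With the indicative desktop figures above (45 tok/s, 0.3s to first token) against a hypothetical cloud endpoint (80 tok/s, 0.5s of network and queue time before the first token), the break-even lands around 20 tokens: shorter generations finish sooner locally, while longer ones favor the faster cloud stream unless privacy or offline requirements rule it out.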

What's Next for Local-First AI

These three browser AI layers serve distinct purposes in 2026: WebGPU gives you maximum control and performance for custom models; WebAssembly ensures universal device coverage while handling CPU-bound preprocessing; and window.ai gets you from zero to working language AI without touching a model file. The hybrid architecture pattern described throughout this guide, routing tasks to the appropriate backend based on device capabilities and task requirements, represents the most robust production approach.

The trajectory points toward broader WebNN (Web Neural Network API) adoption, already partially available in Chrome since version 113, which will open hardware-specific acceleration paths beyond WebGPU. Techniques like 2-bit quantization and speculative decoding continue pushing larger effective model sizes into browser-viable memory footprints. And wider browser vendor adoption of built-in AI APIs beyond Chrome will expand the deployment surface. Developers who invest in understanding the local-first stack now are building on a foundation that will only become more capable.

The interactive benchmark tool described in this guide is available to fork and extend. Run it on your own hardware, share the results, and experiment with the hybrid routing pattern against your specific use cases. Start with the capability detection function, wire it into the task router, and measure what your target devices can actually do.