Case Study: Migrating a Cloud Wrapper to a Local-First AI PWA


Prerequisites

  • Browser: Chrome 113+ or Edge 113+ with WebGPU enabled by default; Safari 18+ (with limitations); Firefox support is experimental
  • OS: Windows 10+, macOS 12+, or Linux with Vulkan drivers
  • Node.js: 18+ recommended
  • Build tool: Vite with vite-plugin-pwa, or Webpack with workbox-webpack-plugin, to inject the self.__WB_MANIFEST precache manifest
  • Packages (pin versions in package.json): @mlc-ai/web-llm, workbox-routing, workbox-strategies, workbox-expiration, workbox-precaching
  • HTTPS: Service workers and navigator.gpu require a secure context
  • GPU drivers: Up-to-date drivers required for WebGPU
  • Storage quota: ~2 GB+ of browser storage; call navigator.storage.persist() to request persistent storage so cached model weights are not evicted under pressure
  • CORS: The CDN serving model weights must include Access-Control-Allow-Origin: * (or your app's origin) for the Cache API to store cross-origin responses. Verify with: curl -I [model-weight-url] | grep -i access-control
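
The storage-quota prerequisite can be checked programmatically before kicking off a model download. The sketch below is an illustration, not code from the original app: `hasSufficientQuota` is a hypothetical helper that evaluates the object returned by navigator.storage.estimate(), and the 10% safety margin is an assumption.

```javascript
// Pure helper: decide whether a StorageManager estimate leaves enough
// headroom for a ~1.5 GB model download. `estimate` is the shape of the
// object returned by navigator.storage.estimate(); the 10% safety
// margin is an assumption, not a browser requirement.
function hasSufficientQuota(estimate, requiredBytes) {
  const { quota = 0, usage = 0 } = estimate;
  const available = quota - usage;
  return available >= requiredBytes * 1.1;
}

// In the browser (not runnable in Node), this would sit alongside the
// persistence request mentioned above:
//   const persisted = await navigator.storage.persist();
//   const estimate = await navigator.storage.estimate();
//   if (!hasSufficientQuota(estimate, 1.5e9)) showStorageWarning();

console.log(hasSufficientQuota({ quota: 4e9, usage: 1e9 }, 1.5e9)); // true
console.log(hasSufficientQuota({ quota: 2e9, usage: 1e9 }, 1.5e9)); // false
```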

Why We Migrated Away from the Cloud

The app in question was a document summarization and Q&A tool. Users would paste or upload documents, and the app would generate summaries, answer questions about the content, and extract key points. Under the hood, it was an OpenAI wrapper: every inference call hit the GPT-3.5-turbo or GPT-4 API through a backend proxy, and the frontend displayed the results. A local-first migration seemed impractical until WebGPU browser support and quantized small language models matured enough to make it real.

The pain points were concrete. At roughly 50,000 monthly active users, OpenAI API costs had climbed to $2,400 per month. During peak hours, p90 response times exceeded 4 seconds. Rate limiting from OpenAI's API occasionally degraded the experience for concurrent users, and enterprise prospects kept walking away during sales conversations the moment they learned that user documents were being sent to a third-party API for processing.

The migration goal was unambiguous: move all AI inference into the browser, eliminate API calls entirely, ship the whole thing as an installable PWA capable of working offline, and cut operational costs dramatically. The outcome, documented in detail below, was a 94% reduction in monthly operational spend (from $2,400/month to $140/month, with the residual covering CDN hosting and authentication), full offline capability, and sub-2-second inference times on mid-range hardware.

The Original Architecture: Cloud Wrapper Anatomy

Tech Stack Before Migration

The pre-migration stack was a standard pattern for AI wrapper apps. The frontend was a React single-page application hosted on Vercel. A Node.js proxy server running on AWS Lambda sat between the frontend and OpenAI's REST API, handling token management, request batching, and API key security. The AI layer used GPT-3.5-turbo for most summarization tasks, with GPT-4 reserved for complex Q&A where users toggled a "high quality" mode. Authentication and billing ran through Stripe and Supabase.

Where the Money Went: Cost Breakdown

The monthly cost breakdown told a stark story:

Category                   Monthly Cost   % of Total
OpenAI API tokens          $1,870         78%
AWS Lambda invocations     $210           9%
Vercel hosting/bandwidth   $180           7%
Supabase (auth + DB)       $140           6%
Total                      $2,400         100%

Seventy-eight percent of total spend was OpenAI API tokens. The backend proxy existed solely to shuttle requests to OpenAI and manage API keys. The Lambda functions, Vercel bandwidth, and Supabase costs were all supporting infrastructure for a fundamentally cloud-dependent architecture.

[Before Architecture Diagram: React SPA → Vercel → AWS Lambda Proxy → OpenAI API, with Supabase for auth. Cost pie chart showing 78% API tokens.]

Evaluating Local AI Feasibility

Model Selection Criteria

The core tasks, document summarization and extractive Q&A, don't actually require GPT-4-class reasoning. They need competent text comprehension and generation within a constrained domain. That opened the door to small language models.

The primary constraint was browser memory. The target was sub-4 GB VRAM usage to cover integrated GPUs and laptops without discrete graphics cards. Quantization format mattered: MLC quantization levels (e.g., q4f16_1) trade model size for output quality, with lower-bit quantizations reducing memory at the cost of some generation fidelity. Note that GGUF is a separate format used by llama.cpp, not by Web-LLM; Web-LLM model artifacts are MLC-compiled (.wasm and .params files). The candidates evaluated were Phi-3-mini (3.8B parameters), Gemma 2B, SmolLM2, and Mistral 7B. Mistral 7B was rejected immediately: even aggressive 4-bit quantization couldn't reliably bring it under the 4 GB VRAM ceiling.
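
The 4 GB screening can be reproduced with a back-of-envelope estimate: weight memory is roughly parameter count times bits-per-weight over eight, plus runtime overhead for KV cache and activations. The helper below and its 1.35 overhead factor are illustrative assumptions, not measured values.

```javascript
// Rough VRAM estimate for a quantized model (illustrative heuristic).
// weights ≈ paramCount × bitsPerWeight / 8; the overhead factor for
// KV cache and activations is an assumed ballpark, not a measurement.
function estimateVramGB(paramsBillions, bitsPerWeight, overheadFactor = 1.35) {
  const weightBytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return (weightBytes * overheadFactor) / 1e9;
}

// Phi-3-mini (3.8B) at 4-bit lands well under the 4 GB ceiling;
// Mistral 7B at 4-bit does not.
console.log(estimateVramGB(3.8, 4)); // ≈ 2.6 GB
console.log(estimateVramGB(7.0, 4)); // ≈ 4.7 GB
```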

Browser Runtime Options

Two serious contenders exist for running LLMs in the browser: Transformers.js, which uses ONNX Runtime Web under the hood, and Web-LLM from MLC, which compiles models specifically for WebGPU execution.

Transformers.js has broader model format support and a familiar Hugging Face-style API. Web-LLM, however, is WebGPU-native from the ground up, supports streaming token generation out of the box, and ships pre-compiled model artifacts that eliminate the need for in-browser model compilation. Note that WebGPU shader pipelines are still compiled on first use, which may cause a brief pause; this is distinct from model weight compilation. For this use case, where WebGPU was the target runtime and streaming responses were critical for perceived performance, Web-LLM was the stronger fit.

Migration Strategy: The Three-Phase Approach

Phase 1: Dual-Mode with Cloud Fallback

We started the migration with a feature-flagged dual-mode setup. Local inference ran alongside existing API calls, controlled by a feature flag and a runtime capability check. The detection logic probed for WebGPU support before deciding which path to take:

async function selectInferenceRuntime() {
  if (!navigator.gpu) {
    return { mode: 'cloud', reason: 'WebGPU not supported' };
  }

  try {
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
      return { mode: 'cloud', reason: 'No GPU adapter available' };
    }

    // requestAdapterInfo() availability varies by browser;
    // feature-detect before calling.
    let adapterInfo = null;
    if (adapter.requestAdapterInfo) {
      adapterInfo = await adapter.requestAdapterInfo();
    }

    // IMPORTANT: No reliable VRAM query exists in WebGPU.
    // device.limits.maxBufferSize is a per-buffer allocation limit,
    // NOT available VRAM. On a machine with 8 GB VRAM, this value may
    // report 256 MB, routing users to cloud incorrectly.
    //
    // Recommended alternatives:
    //   1. Use adapter vendor/architecture strings from adapterInfo
    //      to map to known GPU capability tiers.
    //   2. Attempt model load and catch out-of-memory failures,
    //      falling back to cloud on error.
    //   3. Let users self-select ("Use local AI?" toggle) with
    //      guidance about hardware requirements.
    //
    // This function optimistically returns 'local'; option 2 (attempt
    // the model load, fall back to cloud on failure) is handled by the
    // caller, as described in Phase 1 below.
    return { mode: 'local', adapter: adapterInfo };
  } catch (err) {
    return { mode: 'cloud', reason: `Detection failed: ${err.message}` };
  }
}

If local inference failed or the device was underpowered, the app fell back to the existing cloud path without any visible disruption. Users never saw an error; they just got the cloud-backed experience they'd always had.

Phase 2: Local-Primary, Cloud as Escape Hatch

The second phase flipped the default. Local inference became the primary path, and we reserved cloud API calls for a single edge case: documents exceeding the local model's context window after chunking. A token-counting heuristic routed requests, estimating document length against the model's context limit and only escalating to the cloud when map-reduce chunking would produce unacceptable quality degradation.
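
A minimal sketch of that router, with illustrative thresholds — the 4-characters-per-token estimate and the 8-chunk ceiling below are assumptions for this example, not values from the production app:

```javascript
// Illustrative Phase 2 router: estimate token count from character
// length, prefer local inference, and escalate to cloud only when
// map-reduce chunking would go too deep to preserve quality.
const CHARS_PER_TOKEN = 4;    // rough English heuristic
const CONTEXT_TOKENS = 1500;  // assumed usable tokens per local chunk
const MAX_LOCAL_CHUNKS = 8;   // assumed ceiling before quality degrades

function routeRequest(documentText) {
  const estimatedTokens = Math.ceil(documentText.length / CHARS_PER_TOKEN);
  if (estimatedTokens <= CONTEXT_TOKENS) {
    return { mode: 'local', strategy: 'direct' };
  }
  const chunksNeeded = Math.ceil(estimatedTokens / CONTEXT_TOKENS);
  if (chunksNeeded <= MAX_LOCAL_CHUNKS) {
    return { mode: 'local', strategy: 'map-reduce', chunksNeeded };
  }
  return { mode: 'cloud', reason: 'map-reduce depth would degrade quality' };
}

console.log(routeRequest('a'.repeat(1000)).strategy);   // "direct"
console.log(routeRequest('a'.repeat(100000)).mode);     // "cloud"
```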

Phase 3: Full Local-First, Cloud Removed

The final phase stripped out the Lambda proxy, the OpenAI dependency, and all server-side token management. The final architecture: a static PWA served from a CDN, a service worker managing cached model weights, and zero backend compute.

[After Architecture Diagram: Static PWA on CDN → Service Worker → Cached Model Weights (Browser). Side-by-side cost comparison table showing $2,400/mo → $140/mo.]

Implementing the Local AI Engine

Loading and Caching Models with Web-LLM

The @mlc-ai/web-llm package handles engine initialization, model downloading, and inference. The first load pulls 1.5 GB of model weights for Phi-3-mini at Q4 quantization. Subsequent loads hit the Cache API, making initialization nearly instant. Model weight URLs must be served with Access-Control-Allow-Origin headers for the Cache API to store cross-origin responses; without this, the browser silently fails to cache and re-downloads the full 1.5 GB on every visit.

The initialization code uses a concurrency guard to ensure that only one engine instance is ever created, even if multiple parts of the application call initializeEngine() simultaneously (for example, on tab regain-focus during a download). A timeout prevents the initialization from hanging indefinitely if the CDN stalls or WebGPU initialization deadlocks.

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Verify this model ID against the current Web-LLM model registry at
// https://github.com/mlc-ai/web-llm before deploying.
// Registry contents change between releases.
export const DEFAULT_MODEL_ID = "Phi-3-mini-4k-instruct-q4f16_1-MLC";

const INIT_TIMEOUT_MS = 120_000; // 2 minutes; adjust for expected download size

let engine = null;
let engineInitPromise = null;
let engineLoadedFromCache = false;

async function initializeEngine(onProgress) {
  // If engine is already initialized, return immediately with cached status
  if (engine) return { success: true, cached: true };

  // If initialization is already in flight, return the existing promise
  // to prevent double-init and GPU/WASM resource leaks
  if (engineInitPromise) return engineInitPromise;

  engineInitPromise = _doInit(onProgress).finally(() => {
    engineInitPromise = null;
  });
  return engineInitPromise;
}

async function _doInit(onProgress) {
  engineLoadedFromCache = false;
  let progressCount = 0;

  const initProgressCallback = (progress) => {
    progressCount++;
    // Web-LLM emits a single 100% progress event for cache hits
    if (progressCount === 1 && progress.progress === 1) {
      engineLoadedFromCache = true;
    }
    if (onProgress) {
      onProgress({
        message: progress.text,
        percent: Number.isFinite(progress.progress)
          ? Math.round(progress.progress * 100)
          : 0,
      });
    }
  };

  const timeoutPromise = new Promise((_, reject) =>
    setTimeout(
      () => reject(new Error('Engine initialization timed out after 2 minutes')),
      INIT_TIMEOUT_MS
    )
  );

  try {
    engine = await Promise.race([
      CreateMLCEngine(DEFAULT_MODEL_ID, {
        initProgressCallback,
        logLevel:
          (typeof process !== 'undefined' && process.env && process.env.NODE_ENV
            ? process.env.NODE_ENV
            : 'development') === 'production'
            ? 'SILENT'
            : 'INFO',
      }),
      timeoutPromise,
    ]);
    return { success: true, cached: engineLoadedFromCache };
  } catch (err) {
    engine = null;
    console.error("Web-LLM init failed:", err);
    return { success: false, error: err.message };
  }
}

The CreateMLCEngine call checks the Cache API first. If the model artifacts are already stored, it skips the download entirely and reports near-instant progress. The cached field in the return value reflects whether the model was loaded from cache (detected by a single 100%-progress event on startup) so that callers can skip "first-load" download messaging on subsequent visits. The progress callback drives a download indicator in the UI, which is essential for the first-visit experience where users are waiting for a 1.5 GB download. The concurrency guard ensures that if two calls race (e.g., a tab regaining focus during an in-flight download), only one engine instance is created and both callers receive the same result.

Inference Pipeline: Summarization and Q&A

Smaller models need tighter prompts. Verbose system instructions that work fine with GPT-4 can confuse a 3.8B parameter model. We rewrote the prompts to be concise and explicit, with constrained output expectations. Streaming responses via async iteration kept the UI responsive.

The streamSummary function enforces a character-length guard before sending text to the model. Documents exceeding the limit should be split with chunkDocument() first. The async iterator is explicitly cleaned up on error to prevent leaking the engine's decode context:

async function streamSummary(documentText, onChunk) {
  if (!engine) {
    throw new Error('Engine not initialized. Call initializeEngine() first.');
  }

  // Guard: reject documents that exceed the model's usable context window.
  // Documents over ~6,000 chars (~1,500 tokens) should be chunked first
  // using chunkDocument() and processed via map-reduce.
  const CHAR_LIMIT = 6000;
  if (documentText.length > CHAR_LIMIT) {
    throw new Error(
      `Document too long for direct inference (${documentText.length} chars, limit ${CHAR_LIMIT}). ` +
      `Use chunkDocument() to split the document first.`
    );
  }

  const response = await engine.chat.completions.create({
    messages: [
      {
        role: "system",
        content: "Summarize the document in 3-5 bullet points. Be concise.",
      },
      { role: "user", content: documentText },
    ],
    temperature: 0.3,
    max_tokens: 512,
    stream: true,
  });

  let fullResponse = "";
  try {
    for await (const chunk of response) {
      const delta = chunk.choices[0]?.delta?.content ?? "";
      fullResponse += delta;
      onChunk(fullResponse);
    }
  } catch (err) {
    // Attempt to cancel the stream to release engine decode resources
    if (typeof response.cancel === 'function') response.cancel();
    throw err;
  }

  return fullResponse;
}

The stream: true parameter returns an async iterator. Each chunk contains a delta of generated text, and the UI updates incrementally. This makes even a 2-3 second total generation time feel responsive because the user sees tokens appearing immediately. The nullish coalescing operator (??) is used instead of logical OR (||) so that only null/undefined deltas are replaced with an empty string, avoiding silent data loss on other falsy values.

Handling Context Window Limitations

Phi-3-mini's 4K-token context window leaves roughly 1,500 usable tokens per chunk after reserving space for the system prompt, few-shot examples, and the reduce-phase summary prompt. That is far smaller than the 16K window of gpt-3.5-turbo-16k, the variant the original cloud architecture used (the base gpt-3.5-turbo model had a 4K window). Long documents that fit comfortably in the cloud model's context now require chunking. The strategy is map-reduce: split the document into overlapping chunks, summarize each chunk individually, then summarize the summaries. Adjust maxTokens below based on your actual prompt sizes.

The chunking function includes guards against empty input and degenerate parameter combinations. If overlapTokens is greater than or equal to maxTokens, the overlap is clamped to ensure forward progress and prevent an infinite loop:

function chunkDocument(text, maxTokens = 1500, overlapTokens = 200) {
  if (!text) return [];

  // Guard: overlap must be strictly less than chunk size to guarantee
  // forward progress. Clamp if caller passes degenerate values.
  const safeOverlap = Math.min(overlapTokens, maxTokens - 1);

  // Rough token estimate: 1 token ≈ 4 characters for English.
  // This heuristic breaks down for CJK, code, or emoji-dense content
  // where tokenization is denser. For non-English workloads, consider
  // passing a lower charsPerToken value or integrating a lightweight
  // tokenizer.
  const charLimit = maxTokens * 4;
  const overlapChars = safeOverlap * 4;
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + charLimit, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    const nextStart = end - overlapChars;
    // Guarantee forward progress regardless of parameters
    start = nextStart > start ? nextStart : start + 1;
  }

  return chunks;
}

The overlap strategy preserves context across chunk boundaries, preventing summaries from losing coherence where the text was split. For documents under roughly 6,000 characters (about 1,500 tokens), the text goes directly to the model. Above that threshold, the app activates the chunking pipeline automatically. Documents exceeding about 40,000 characters surface a "document too large for optimal local processing" warning, since multiple rounds of map-reduce summarization can degrade output quality noticeably.
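
The map-reduce pipeline described above can be sketched as a driver that recursively summarizes chunks until the result fits in a single window. This is an illustration, not the app's actual code: `summarizeFn` stands in for a call like streamSummary, and the inline splitter is a minimal stand-in for the chunkDocument shown earlier (overlap omitted for brevity).

```javascript
// Minimal stand-in for chunkDocument(), without overlap, so this
// sketch stays self-contained.
function splitByChars(text, charLimit) {
  const chunks = [];
  for (let start = 0; start < text.length; start += charLimit) {
    chunks.push(text.slice(start, start + charLimit));
  }
  return chunks;
}

// Map-reduce driver: summarize each chunk, then summarize the combined
// partial summaries; recurse if the combined text is still too long.
// Assumes summarizeFn produces output shorter than its input, which
// guarantees termination.
async function summarizeMapReduce(text, summarizeFn, charLimit = 6000) {
  if (text.length <= charLimit) return summarizeFn(text);
  const partials = [];
  for (const chunk of splitByChars(text, charLimit)) {
    partials.push(await summarizeFn(chunk));
  }
  return summarizeMapReduce(partials.join('\n'), summarizeFn, charLimit);
}
```

In the real pipeline, summarizeFn would wrap streamSummary and the splitter would preserve overlap across chunk boundaries as discussed above.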

Building the PWA Shell

Service Worker Strategy for AI Assets

This service worker requires workbox-webpack-plugin (Webpack) or vite-plugin-pwa (Vite) configured to inject self.__WB_MANIFEST at build time. Without the build plugin, self.__WB_MANIFEST is undefined and the service worker will fail to register. See the Workbox "Get Started" documentation for setup instructions.

The service worker uses a two-tier caching strategy. The app shell (HTML, CSS, JavaScript bundles) uses StaleWhileRevalidate so updates propagate on the next visit. On first visit, StaleWhileRevalidate fetches from the network; offline capability for app shell assets only applies after the initial load. Model weights and WASM artifacts use CacheFirst because they're versioned and immutable once downloaded.

The route matcher for model assets uses scoped matching to avoid caching unrelated third-party resources. CDN hostnames are extracted into a configuration constant so that future CDN migrations require a single-line change:

import { registerRoute } from "workbox-routing";
import { CacheFirst, StaleWhileRevalidate } from "workbox-strategies";
import { ExpirationPlugin } from "workbox-expiration";
import { precacheAndRoute } from "workbox-precaching";

// Precache app shell — requires build plugin to inject manifest.
// Guard against missing manifest to provide a clear error instead of
// a silent service worker install failure.
const manifest = self.__WB_MANIFEST;
if (!Array.isArray(manifest)) {
  console.error(
    '[SW] self.__WB_MANIFEST is not defined or not an array. ' +
    'Ensure vite-plugin-pwa or workbox-webpack-plugin is configured. ' +
    'App shell precaching is disabled.'
  );
} else {
  precacheAndRoute(manifest);
}

// CDN hostnames serving model weights — update this array when migrating CDNs
const MODEL_CDN_HOSTNAMES = ['huggingface.co'];
const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60;

// Model weights and WASM: CacheFirst, long-lived.
// Route matching is scoped to avoid caching unrelated assets:
//   - .wasm files are matched by extension
//   - /mlc-ai/ paths are matched only on the app's own origin
//   - Model CDN hostnames are matched exactly against the configured list
registerRoute(
  ({ url }) =>
    url.pathname.endsWith(".wasm") ||
    (url.pathname.startsWith("/mlc-ai/") && url.hostname === self.location.hostname) ||
    MODEL_CDN_HOSTNAMES.includes(url.hostname),
  new CacheFirst({
    cacheName: "ai-model-cache-v1",
    plugins: [
      new ExpirationPlugin({
        maxEntries: 10,
        maxAgeSeconds: THIRTY_DAYS_SECONDS,
        purgeOnQuotaError: true,
      }),
    ],
  })
);

// App assets: StaleWhileRevalidate for freshness
registerRoute(
  ({ request }) =>
    request.destination === "script" || request.destination === "style",
  new StaleWhileRevalidate({ cacheName: "app-assets-v1" })
);

Cache versioning matters when shipping model updates. Changing the cache name from ai-model-cache-v1 to v2 triggers a fresh download on the next visit. The purgeOnQuotaError: true flag is critical: if the browser's storage quota is exceeded, Workbox deletes the entire ai-model-cache-v1 cache to free space rather than leaving the service worker in a broken state, and the model re-downloads on next use.
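
Bumping the cache name to v2 leaves the old v1 cache on disk until something deletes it. A cleanup pass in the service worker's activate handler is one way to reclaim that space; the helper below is an illustrative sketch (the `staleModelCaches` name and the naming convention it matches are assumptions, not code from the original app).

```javascript
// Pure helper (illustrative): given every cache name present in the
// browser and the current model cache name, return the stale
// "ai-model-cache-*" generations that should be deleted.
function staleModelCaches(allNames, currentName) {
  return allNames.filter(
    (name) => name.startsWith('ai-model-cache-') && name !== currentName
  );
}

// In the service worker's activate handler (browser-only context):
//
// self.addEventListener('activate', (event) => {
//   event.waitUntil(
//     caches.keys().then((names) =>
//       Promise.all(
//         staleModelCaches(names, 'ai-model-cache-v2')
//           .map((name) => caches.delete(name))
//       )
//     )
//   );
// });
```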

Warning: Caching 1.5 GB or more in the Cache API is subject to browser storage pressure. Call navigator.storage.persist() at app startup to request persistent storage, reducing the risk of silent eviction. Verify cached assets in DevTools → Application → Cache Storage after first load; confirm only model artifacts (.wasm, .params) appear in ai-model-cache-v1.

Web App Manifest and Install Experience

The manifest configures standalone display mode for an app-like experience. The install prompt timing is deliberate: prompting users to install the PWA after their first successful local inference, not on the landing page. By that point, the model is cached and the user has seen the value of offline-capable AI.

Offline UX Patterns

A three-state model readiness indicator communicates clearly: "Downloading model" with a progress bar, "Model cached" confirming offline readiness, and "Ready" when the inference engine is initialized and ready to process requests without reloading model weights. Any telemetry or analytics that still require network connectivity use a queue-and-sync pattern, buffering events in IndexedDB and flushing them when connectivity returns.
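
The queue-and-sync pattern can be sketched with an in-memory buffer. The production version persists to IndexedDB as noted; `TelemetryQueue` and its API below are illustrative names, not the app's actual code.

```javascript
// Minimal queue-and-sync sketch. Events are buffered locally and
// flushed when connectivity returns; failed sends stay buffered for
// the next attempt.
class TelemetryQueue {
  constructor(sendFn) {
    this.sendFn = sendFn; // e.g. fetch() to an analytics endpoint
    this.buffer = [];
  }
  track(event) {
    this.buffer.push({ ...event, queuedAt: Date.now() });
  }
  // Call on connectivity regain (e.g. the window 'online' event).
  // Returns true when everything was delivered.
  async flush() {
    const pending = this.buffer;
    this.buffer = [];
    for (const event of pending) {
      try {
        await this.sendFn(event);
      } catch {
        this.buffer.push(event); // keep for next flush
      }
    }
    return this.buffer.length === 0;
  }
}
```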

Performance Results and Cost Impact

Inference Benchmarks: Cloud vs. Local

We measured latency across three device tiers for generating a 200-word summary from a 1,000-word document:

Device Tier           Example Hardware          Local Latency       Cloud Latency   Decode Tokens/sec (Local)
High-end              RTX 3060 / M1 Pro         ~1.2s               ~1.8s (p50)     ~45
Mid-range             Integrated GPU / M1 Air   ~2.8s               ~1.8s (p50)     ~18
Low-end / No WebGPU   Older integrated          Fallback to cloud   ~1.8s (p50)     N/A

On high-end hardware, local inference was actually faster than the cloud path because it eliminated network round-trips entirely. Mid-range devices were slower in absolute terms but still under the 3-second p95 target we set based on user testing. The cloud latency figure of ~1.8s represents a median; it varied significantly with network conditions and OpenAI API load. The local numbers were consistent.

Informal evaluation on a held-out set of representative documents showed Phi-3-mini Q4 outputs rated acceptable for the use case. Rigorous ROUGE benchmarking requires a labeled reference corpus and specified methodology (ROUGE variant, test set size, evaluation script) that were not available here; teams considering a similar migration should build a domain-specific evaluation set before cutting over.

Cost Comparison: Before and After

Category               Before    After
OpenAI API tokens      $1,870    $0
AWS Lambda             $210      $0
Vercel / CDN hosting   $180      $85
Supabase (auth only)   $140      $55
Monthly Total          $2,400    $140
Annual Savings                   $27,120

The $140/month residual covers CDN hosting ($85) and Supabase auth ($55). Annual savings of $27,120 reflects spend dropping by $2,260/month, not total elimination of costs. The 94% figure is the relative reduction: ($2,400 − $140) / $2,400 = 94.2%.
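
The arithmetic behind those figures, spelled out with the values from the cost table:

```javascript
// Savings math from the before/after cost table.
const beforeMonthly = 2400;
const afterMonthly = 140;

const monthlySavings = beforeMonthly - afterMonthly;         // 2260
const annualSavings = monthlySavings * 12;                   // 27120
const reductionPct = (monthlySavings / beforeMonthly) * 100;

console.log(annualSavings);           // 27120
console.log(reductionPct.toFixed(1)); // "94.2"
```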

The migration took three weeks of developer time. At the post-migration burn rate, the development investment paid for itself within the first month.

[Cost Comparison Chart: Side-by-side bar chart showing $2,400/mo before vs. $140/mo after, with 94% reduction callout.]

User Experience Wins

Offline capability opened up use cases the web app couldn't serve before: field workers processing documents without connectivity, enterprise users in air-gapped environments, and users in regions with unreliable internet. Cold starts disappeared. Rate limiting disappeared. And the privacy story changed in a concrete way: user documents never leave the device, which eliminated the primary objection enterprise prospects had raised during sales conversations.

Limitations and Honest Tradeoffs

What Was Lost

Model quality has a ceiling. GPT-4-level reasoning and nuance simply aren't available in a 3.8B-parameter model running in a browser, and pretending otherwise would be dishonest. The first-load experience requires downloading 1.5 GB of model weights, which takes roughly 2 minutes on a 100 Mbps connection and 20 minutes or more at 10 Mbps. As of mid-2024, WebGPU is enabled by default in Chrome 113+ and Edge 113+ on supported hardware; Safari 18+ ships it with limitations, and Firefox support remains experimental. Check caniuse.com/webgpu for current data. A meaningful percentage of users, especially on mobile, still won't get the local experience and need a fallback path. Server-side audit trails of AI outputs no longer exist, which some compliance-sensitive verticals require.

What Should Have Been Done Differently

Starting with Phase 1 (dual-mode with cloud as primary) added architectural complexity that had limited shelf life. In retrospect, jumping directly to Phase 2 (local-primary with cloud escape hatch) would have been more efficient, since we tore down the dual-mode infrastructure within weeks anyway. (That said, Takeaway #4 below, dual-mode with feature flags, remains valid for teams without an existing user base on the local path; this retrospective applies to this specific migration timeline where the local path was proven quickly.) Investing in domain-specific model fine-tuning earlier would have closed the quality gap with GPT-3.5-turbo faster than prompt engineering alone.

Key Takeaways and Migration Checklist

  1. Audit AI tasks for small-model feasibility. Summarization, extraction, and constrained Q&A are strong candidates. Open-ended reasoning and creative generation are not.
  2. Benchmark WebGPU on target user devices. Collect real hardware data from analytics before committing to a local-first approach.
  3. Choose the right runtime. Web-LLM fits WebGPU-native workloads with streaming; Transformers.js gives you broader model format support and ONNX compatibility.
  4. Implement dual-mode with feature flags as a safety net during transition, but plan to remove the cloud path.
  5. Build model caching into the service worker from day one. A 1.5 GB re-download on every visit is unacceptable.
  6. Measure quality parity using structured human evaluation with a rubric on a held-out test set before cutting the cloud dependency. ROUGE evaluation requires labeled reference outputs; for user-generated document tools, structured human evaluation is often more practical.
  7. Ship as a PWA with offline-first UX, including model readiness indicators and install prompt timing tied to user value.
  8. Monitor device fallback rates post-launch to understand what percentage of users can't run local inference. This data feeds directly back into whether you can fully remove the cloud path or need to maintain it indefinitely.

Local-first AI is not about replacing the cloud everywhere. It is about recognizing which AI tasks don't need a server, don't need an API key, and don't need user data leaving the device, then building accordingly.

Tests

// --- Unit Tests ---

// 1. chunkDocument: overlap >= maxTokens must not infinite-loop
test('chunkDocument terminates when overlapTokens >= maxTokens', () => {
  const result = chunkDocument('a'.repeat(1000), 100, 100);
  expect(result.length).toBeGreaterThan(0);
  expect(result.every(c => c.length <= 400)).toBe(true); // 100 tokens * 4 chars
});

// 2. chunkDocument: empty input
test('chunkDocument returns empty array for empty string', () => {
  expect(chunkDocument('')).toEqual([]);
});

// 3. chunkDocument: single chunk for short input
test('chunkDocument returns single chunk for short document', () => {
  const text = 'Hello world';
  const chunks = chunkDocument(text, 1500, 200);
  expect(chunks).toHaveLength(1);
  expect(chunks[0]).toBe(text);
});

// 4. initializeEngine: concurrent calls return same promise
test('concurrent initializeEngine calls do not double-initialize', async () => {
  const mockCreate = jest.fn().mockResolvedValue({});
  // inject mockCreate in place of CreateMLCEngine
  const [r1, r2] = await Promise.all([initializeEngine(), initializeEngine()]);
  expect(mockCreate).toHaveBeenCalledTimes(1);
  expect(r1).toEqual(r2);
});

// 5. streamSummary: rejects on oversized input before calling engine
test('streamSummary throws on input exceeding char limit', async () => {
  engine = {}; // mock initialized engine
  await expect(streamSummary('x'.repeat(7000), () => {}))
    .rejects.toThrow('Document too long');
});

// --- Integration Test ---

// 6. Full init + summarize round-trip (requires real WebGPU or wasm mock)
test('integration: initializeEngine then streamSummary returns non-empty string', async () => {
  const result = await initializeEngine((p) => {});
  expect(result.success).toBe(true);

  const chunks = [];
  const summary = await streamSummary(
    'The quick brown fox jumps over the lazy dog. '.repeat(20),
    (partial) => chunks.push(partial)
  );
  expect(typeof summary).toBe('string');
  expect(summary.length).toBeGreaterThan(0);
  expect(chunks.length).toBeGreaterThan(0); // streaming fired at least once
}, 60_000);

Verification Commands

# 1. Verify service worker installs without build plugin misconfiguration
# Expected: no error in SW install; manifest array logged
npx workbox-cli wizard && npx workbox-cli generateSW
# Then in browser: DevTools → Application → Service Workers → Status = "activated"

# 2. Verify CORS headers on model weight CDN (required for Cache API)
curl -I https://huggingface.co/mlc-ai/Phi-3-mini-4k-instruct-q4f16_1-MLC/resolve/main/params_shard_0.bin \
  | grep -i access-control
# Expected output contains:
# access-control-allow-origin: *

# 3. Confirm model weights appear in Cache API (not re-downloaded on second visit)
# In browser console after first load:
caches.open('ai-model-cache-v1').then(c => c.keys().then(k => console.log(k.length, 'entries')))
# Expected: number > 0 (params shards + wasm artifacts cached)

# 4. Detect infinite-loop regression in chunkDocument
node -e "
const { chunkDocument } = require('./src/chunker.js');
const start = Date.now();
chunkDocument('a'.repeat(500), 100, 100);
const elapsed = Date.now() - start;
console.assert(elapsed < 100, 'Hang detected: ' + elapsed + 'ms');
console.log('OK:', elapsed + 'ms');
"
# Expected: OK: <10ms

# 5. Confirm no model CDN hostname leakage into unrelated cache
# Instrument SW route matcher in a test build and observe:
# fetch('https://evil.example.com/mlc-ai/tracker.js')
# DevTools → Application → Cache Storage → ai-model-cache-v1
# Expected: tracker.js NOT present in cache after patched route matcher