UX Patterns for Local Inference: Handling Latency and Model Loading


Local AI inference runs directly in the browser or on a user's device through runtimes like WebLLM, Ollama, llama.cpp, MediaPipe, and Transformers.js. This introduces UX problems that cloud-hosted AI never faced, such as multi-gigabyte model downloads, WASM compilation stalls, and GPU context creation delays. This article walks through four reusable React component patterns that address each stage of the local inference lifecycle: model download progress, cold-start latency, streaming output, and error handling.
Table of Contents
- The Local Inference Lifecycle: Understanding the States
- Pattern 1: Model Download and Initialization Progress
- Pattern 2: Cold-Start and Warm-Up Latency
- Pattern 3: Streaming Response Output
- Pattern 4: Error States and Fallbacks
- Putting It All Together: The Full AI Interface Lifecycle
- Summary and Component Library Reference
The Local Inference Lifecycle: Understanding the States
A local inference session moves through a sequence of discrete phases, each with different duration characteristics and user expectations:
Download → Load/Initialize → Warm-up → Inference → Streaming Output → Idle
The download phase can take minutes for a 4GB model on a typical connection. Initialization involves WASM compilation and GPU context setup, often 2 to 15 seconds on hardware like an M1 MacBook Air or a 2022 Core i5 laptop with integrated graphics. When the runtime uses ahead-of-time compilation (as WebLLM does on first load), a warm-up compilation step follows initialization. Inference and streaming happen token by token. Idle is the resting state between queries.
A state machine governs these transitions: IDLE → DOWNLOADING → INITIALIZING → WARMING_UP → READY → GENERATING → ERROR → IDLE. Each state maps to a distinct UI treatment.
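The allowed transitions can be sketched as an explicit table. This is a hand-rolled illustration using the state names above, not tied to any state-machine library:

```javascript
// Lifecycle transition table: each state lists the states it may move to.
// IDLE is reachable from DOWNLOADING (user cancel) and from ERROR (recovery).
const TRANSITIONS = {
  IDLE: ['DOWNLOADING'],
  DOWNLOADING: ['INITIALIZING', 'ERROR', 'IDLE'],
  INITIALIZING: ['WARMING_UP', 'ERROR'],
  WARMING_UP: ['READY', 'ERROR'],
  READY: ['GENERATING'],
  GENERATING: ['READY', 'ERROR'],
  ERROR: ['IDLE'],
};

function canTransition(from, to) {
  return (TRANSITIONS[from] ?? []).includes(to);
}
```

Guarding dispatches with a check like this catches impossible jumps (for example, straight from READY back to DOWNLOADING) during development.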
How This Differs from Cloud AI UX
What does a cloud AI interface look like to a user? A single "waiting" spinner, then a response. Years of cloud AI products trained users to expect exactly that. Local inference breaks the mental model entirely. Some phases take minutes on first run (model download). Others take seconds (initialization). Others are near-instantaneous (cached model detection). Users need transparency into which phase they are in and why it is slow. A generic spinner communicating nothing about a 2GB download will drive users away; research on loading UX consistently shows abandonment rates spike after roughly 8 seconds of unexplained waiting.
Cloud APIs hide the complexity behind a network request: the user waits, and the API returns a response. Local inference, by contrast, forces the interface to manage all of that while keeping the user informed and engaged.
Pattern 1: Model Download and Initialization Progress
In-browser inference models range from roughly 100MB (quantized small models via Transformers.js) to 4GB or more (larger quantized LLMs via WebLLM). During download, users see nothing happening unless the interface explicitly communicates progress. The design principles here: show real byte-level progress, provide an estimated time remaining, and allow cancellation. The implementation below uses the Origin Private File System (OPFS) to stream model data directly to disk, avoiding the need to buffer the entire model in memory. This is critical for multi-GB models where accumulating chunks in a JavaScript array would cause out-of-memory crashes on most devices.
Building a ModelDownloadProgress Component
import { useState, useCallback, useRef, useEffect } from 'react';
function useModelLoader(modelUrl) {
const [status, setStatus] = useState('idle');
const [progress, setProgress] = useState({ loaded: 0, total: 0 });
const [eta, setEta] = useState(null);
const controllerRef = useRef(null);
const startTimeRef = useRef(null);
const start = useCallback(async () => {
try {
const cache = await caches.open('model-cache');
const cached = await cache.match(modelUrl);
if (cached) {
setStatus('cached');
return cached;
}
// Check available storage before attempting a large download
if (navigator.storage && navigator.storage.estimate) {
const { quota, usage } = await navigator.storage.estimate();
const available = quota - usage;
console.warn(`Storage available: ${(available / 1e9).toFixed(1)} GB`);
}
controllerRef.current = new AbortController();
setStatus('downloading');
startTimeRef.current = Date.now();
const response = await fetch(modelUrl, {
signal: controllerRef.current.signal,
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const contentLength = response.headers.get('content-length');
const total = contentLength ? parseInt(contentLength, 10) : 0;
const reader = response.body.getReader();
// Stream directly to OPFS to avoid buffering the entire model in memory.
// Accumulating chunks in a JS array would require 2× the model size in heap
// (once for the chunks, once for the Blob), causing OOM crashes for large models.
const opfsRoot = await navigator.storage.getDirectory();
const fileHandle = await opfsRoot.getFileHandle(
`model-${encodeURIComponent(modelUrl)}`,
{ create: true }
);
const writable = await fileHandle.createWritable();
let loaded = 0;
try {
while (true) {
const { done, value } = await reader.read();
if (done) break;
await writable.write(value);
loaded += value.length;
const elapsed = Math.max(
(Date.now() - startTimeRef.current) / 1000,
0.001
);
const rate = loaded / elapsed;
const remaining = total > 0 ? (total - loaded) / rate : null;
setProgress({ loaded, total });
setEta(remaining !== null ? Math.round(remaining) : null);
}
await writable.close();
} catch (writeErr) {
await writable.abort();
throw writeErr;
}
// Cache the completed file for fast retrieval on subsequent loads.
// Build headers explicitly so content-length reflects the actual written size.
const file = await fileHandle.getFile();
const safeHeaders = new Headers();
const contentType = response.headers.get('content-type');
if (contentType) safeHeaders.set('content-type', contentType);
safeHeaders.set('content-length', String(file.size));
const cacheResponse = new Response(file, { headers: safeHeaders });
await cache.put(modelUrl, cacheResponse);
setStatus('ready');
return fileHandle;
} catch (err) {
if (err.name === 'AbortError') {
setStatus('idle'); // Clean cancel — not an error
} else {
setStatus('error');
}
}
}, [modelUrl]);
const cancel = useCallback(() => {
controllerRef.current?.abort();
setStatus('idle');
}, []);
return { status, progress, eta, start, cancel };
}
function ModelDownloadProgress({ modelUrl, onComplete }) {
const { status, progress, eta, start, cancel } = useModelLoader(modelUrl);
const pct = progress.total
? Math.round((progress.loaded / progress.total) * 100)
: 0;
const sizeMB = progress.total ? (progress.total / 1e6).toFixed(0) : '??';
const indeterminate = !progress.total;
// Move onComplete to an effect so it does not fire during render.
// Calling a parent dispatch synchronously inside render violates React's rules
// and causes double-dispatch in Strict Mode.
useEffect(() => {
if (status === 'cached') {
onComplete?.();
}
}, [status, onComplete]);
if (status === 'cached') {
return <span className="badge badge-green">Cached</span>;
}
return (
<div className="download-progress">
{status === 'idle' && <button onClick={start}>Download Model</button>}
{status === 'downloading' && (
<>
<div className="progress-bar">
{indeterminate ? (
<div className="progress-fill progress-fill--indeterminate" />
) : (
<div className="progress-fill" style={{ width: `${pct}%` }} />
)}
</div>
<span>
{indeterminate
? `${(progress.loaded / 1e6).toFixed(1)} MB downloaded…`
: `${pct}% of ${sizeMB} MB — ~${eta}s remaining`}
</span>
<button onClick={cancel}>Cancel</button>
</>
)}
{status === 'ready' && <span className="badge badge-green">Ready</span>}
{status === 'error' && (
<div className="error-banner" role="alert">
<p>Download failed.</p>
<button onClick={start}>Retry</button>
</div>
)}
</div>
);
}
The useModelLoader hook streams downloaded bytes directly to OPFS via FileSystemWritableFileStream, keeping heap usage flat regardless of model size. It handles byte-level tracking, ETA calculation based on elapsed throughput, Cache API persistence, abort controller support, and distinguishes user cancellation (AbortError) from real network errors. If the server does not provide a content-length header (common with chunked transfer encoding and some CDNs), the component falls back to an indeterminate progress bar showing downloaded bytes. The onComplete callback fires from a useEffect rather than inline during render, preventing React Strict Mode double-dispatch issues.
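The ETA arithmetic inside the hook can be factored into a pure helper, which makes it easy to unit-test in isolation. This sketch mirrors the hook's throughput calculation:

```javascript
// Pure mirror of the hook's ETA math: derive bytes-per-second throughput
// from elapsed time, then project it over the remaining bytes.
function estimateEta(loaded, total, elapsedSeconds) {
  if (!total || loaded <= 0) return null; // unknown size → indeterminate UI
  const rate = loaded / Math.max(elapsedSeconds, 0.001);
  return Math.round((total - loaded) / rate);
}
```

Returning null for an unknown total keeps the caller's logic simple: null maps directly to the indeterminate progress bar branch.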
Communicating Model Size Before Download
Before a download begins, users deserve a disclosure card showing what they are about to commit to: model name, file size, estimated download time, and required disk space.
function ModelInfoCard({ model, cacheStatus, onDownload }) {
const estimatedMinutes = (model.sizeBytes / 625_000 / 60).toFixed(1); // 5 Mbps = 625 KB/s
if (cacheStatus === 'cached') {
return (
<div className="model-card model-card--ready">
<h3>{model.name}</h3>
<span className="badge badge-green">Ready to use</span>
</div>
);
}
return (
<div className="model-card">
<h3>{model.name}</h3>
<ul>
<li>Size: {(model.sizeBytes / 1e9).toFixed(1)} GB</li>
<li>Est. download: ~{estimatedMinutes} min on 5 Mbps</li>
<li>Disk space required: {(model.sizeBytes / 1e9).toFixed(1)} GB</li>
</ul>
<p className="model-card__note">
The model is stored locally after first download.
</p>
{onDownload && <button onClick={onDownload}>Download Model</button>}
</div>
);
}
The estimated time uses a 5 Mbps baseline (625 KB/s). Adjust 625_000 to match your target audience's connection speed. For applications targeting enterprise users on faster connections, this baseline should be increased. The conditional rendering based on cache status prevents unnecessary friction for returning users.
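The card's inline estimate generalizes to a small helper that takes the connection speed as a parameter, which makes the baseline assumption explicit rather than buried in a magic number:

```javascript
// Generalized form of the card's inline estimate: bytes and Mbps in, minutes out.
// 1 Mbps = 1,000,000 bits/s = 125,000 bytes/s, so 5 Mbps = 625,000 bytes/s.
function estimateDownloadMinutes(sizeBytes, mbps) {
  const bytesPerSecond = (mbps * 1e6) / 8;
  return sizeBytes / bytesPerSecond / 60;
}
```

For a 4 GB model this yields roughly 107 minutes at 5 Mbps but under 11 minutes at 50 Mbps, which is why the baseline matters so much for the disclosure card's credibility.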
Pattern 2: Cold-Start and Warm-Up Latency
After a model downloads (or loads from cache), initialization still involves WASM compilation, WebGPU context creation, and loading weights into memory. This phase typically takes 2 to 15 seconds on devices like an M1 MacBook Air or a 2022 Core i5 laptop with integrated graphics, even with cached models. That range is too long for a bare spinner to feel acceptable, but too unpredictable for a progress bar to be accurate. The right pattern is a skeleton UI with contextual, phased messaging.
Building a WarmUpIndicator Component
import { useState, useEffect, useRef } from 'react';
const PHASE_MESSAGES = [
{ threshold: 0, message: 'Loading model into memory…' }, // 0–3 s: typical GPU init
{ threshold: 3000, message: 'Optimizing for your device…' }, // 3–8 s: slower devices
{ threshold: 8000, message: 'Almost ready — this is a one-time setup' },
];
function WarmUpIndicator({ isWarmingUp, onReady }) {
const [elapsed, setElapsed] = useState(0);
const startRef = useRef(null);
useEffect(() => {
if (!isWarmingUp) return;
startRef.current = Date.now();
setElapsed(0);
const interval = setInterval(
() => setElapsed(Date.now() - startRef.current),
500
);
return () => clearInterval(interval);
}, [isWarmingUp]);
if (!isWarmingUp) return null;
const currentMessage = [...PHASE_MESSAGES]
.reverse()
.find((p) => elapsed >= p.threshold)?.message;
return (
<div className="warmup-indicator">
<div className="shimmer-skeleton" />
<p className="warmup-message">{currentMessage}</p>
</div>
);
}
The component rotates through status messages based on elapsed time thresholds. At 0 to 3 seconds, users see "Loading model into memory." If the wait stretches past 8 seconds, the messaging shifts to reassurance. Elapsed time is computed from a recorded start timestamp rather than incremented by a fixed amount, preventing drift caused by browser throttling in background tabs. The onReady callback prop is intended to notify the parent when initialization completes. It must be called by the consumer when the model runtime resolves its initialization promise, not by this component directly. For example, the parent component should listen for the model runtime's readiness signal and then invoke onReady to advance the lifecycle.
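The threshold lookup is simple enough to extract and test on its own; this standalone copy restates the component's message table for self-containment:

```javascript
// Standalone copy of the WarmUpIndicator lookup: pick the message with the
// highest threshold that does not exceed the elapsed time in milliseconds.
const PHASE_MESSAGES = [
  { threshold: 0, message: 'Loading model into memory…' },
  { threshold: 3000, message: 'Optimizing for your device…' },
  { threshold: 8000, message: 'Almost ready — this is a one-time setup' },
];

function messageForElapsed(elapsedMs, phases = PHASE_MESSAGES) {
  return [...phases].reverse().find((p) => elapsedMs >= p.threshold)?.message;
}
```

The spread-then-reverse avoids mutating the shared PHASE_MESSAGES array, which Array.prototype.reverse would otherwise do in place.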
Preloading and Eager Initialization Strategies
Several techniques reduce perceived cold-start time without making initialization itself faster. Preloading on route entry starts the model load as soon as a user navigates to an AI-enabled page. If the user hovers over or focuses on the AI feature trigger, you can begin initialization during the natural pause between intent and action. For applications where the AI feature is secondary but likely to be used, requestIdleCallback starts initialization during browser idle periods without competing for main-thread resources. The right choice depends on how central the AI feature is to the application's primary flow; these are UX-level decisions, not code-heavy implementations.
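A minimal sketch of the idle-time approach, assuming a hypothetical initModel entry point that kicks off the runtime's load sequence:

```javascript
// Sketch: defer model initialization to browser idle time. `initModel` is a
// hypothetical function that starts the runtime's load/warm-up sequence.
// Falls back to setTimeout where requestIdleCallback is unavailable (e.g. Safari).
function scheduleIdleInit(initModel, { timeout = 2000 } = {}) {
  let cancelled = false;
  const run = () => { if (!cancelled) initModel(); };
  if (typeof requestIdleCallback === 'function') {
    const id = requestIdleCallback(run, { timeout });
    return () => { cancelled = true; cancelIdleCallback(id); };
  }
  const id = setTimeout(run, 0);
  return () => { cancelled = true; clearTimeout(id); };
}
```

The returned function cancels a pending initialization, for example if the user navigates away before the idle callback fires; the timeout option bounds how long the browser may defer the work.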
Pattern 3: Streaming Response Output
Token-by-token generation in local models is actually a UX advantage. Unlike cloud APIs where network round-trip latency dominates perceived responsiveness, local streaming gives users something to read almost immediately. The design principles: stream tokens to the screen as they arrive, show a typing indicator during generation, and always provide a stop button.
Get these loading and streaming patterns wrong, and users will abandon the app before inference even begins.
Building a StreamingResponseDisplay Component
The model object passed to useStreamingInference is expected to implement the following interface:
interface LocalModel {
generate(prompt: string): AsyncIterable<string>;
}
Different runtimes require thin adapter wrappers. For example, a minimal WebLLM adapter:
// WebLLM adapter sketch (assumes a loaded ChatModule; the callback-based
// generate() API shown here varies across WebLLM versions — check your runtime's docs)
const webllmAdapter = {
  async *generate(prompt) {
    const queue = [];
    let notify = () => {};
    let lastLen = 0;
    // The callback receives the full message so far; push only the new delta
    const finished = chatModule.generate(prompt, (step, message) => {
      queue.push(message.slice(lastLen));
      lastLen = message.length;
      notify();
    });
    let settled = false;
    finished.finally(() => { settled = true; notify(); }).catch(() => {});
    while (!settled || queue.length) {
      if (queue.length) { yield queue.shift(); continue; }
      await new Promise((resolve) => { notify = resolve; });
    }
    await finished; // rethrow any generation error
  },
};
The exact adapter depends on your runtime. The hook itself is runtime-agnostic given any object that satisfies the generate() contract above.
import { useState, useCallback, useRef, useEffect } from 'react';
function useStreamingInference(model) {
const [tokens, setTokens] = useState('');
const [isGenerating, setIsGenerating] = useState(false);
const stopRef = useRef(false);
const generate = useCallback(async (prompt) => {
setTokens('');
setIsGenerating(true);
stopRef.current = false;
// model.generate() returns an AsyncIterable directly, not a Promise.
// Do not await it — await of a non-Promise is a no-op but masks the
// type contract and breaks if a runtime returns a plain iterable.
const stream = model.generate(prompt);
try {
for await (const token of stream) {
if (stopRef.current) break;
setTokens((prev) => prev + token);
}
} finally {
// Explicitly close the async iterator to release underlying resources
// (e.g., worker ports, GPU contexts) that may be held open.
await stream.return?.();
setIsGenerating(false);
}
}, [model]);
const stop = useCallback(() => { stopRef.current = true; }, []);
return { tokens, isGenerating, generate, stop };
}
function StreamingResponseDisplay({ tokens, isGenerating, onStop }) {
const containerRef = useRef(null);
useEffect(() => {
if (containerRef.current) {
containerRef.current.scrollTop = containerRef.current.scrollHeight;
}
}, [tokens]);
return (
<div className="response-display" ref={containerRef}>
<div className="response-text">
{tokens}
{isGenerating && <span className="blinking-cursor">▊</span>}
</div>
{isGenerating && (
<button className="stop-btn" onClick={onStop}>Stop generating</button>
)}
</div>
);
}
The useStreamingInference hook wraps the model's generate function, consuming an async iterable and accumulating tokens to state. The finally block ensures the async iterator is explicitly closed via stream.return?.() when generation completes or is stopped, preventing resource leaks from open worker ports or GPU contexts. The display component auto-scrolls and renders a blinking cursor during generation. For production use, a lightweight markdown renderer can be applied to the accumulated token buffer, though mid-stream markdown parsing requires careful handling of incomplete syntax.
Handling Slow Token Generation Gracefully
When inference runs on CPU without GPU acceleration, such as on Celeron/Pentium-class laptops or other devices without discrete GPUs, token generation speed can drop below 2 tokens per second. A Q4-quantized 7B model running on CPU typically generates only 1 to 2 tokens per second. At that rate, the drip-feed of individual characters looks stuttery and broken. The solution is to batch tokens into word-level chunks before flushing them to the UI.
async function* batchTokens(tokenStream) {
let buffer = '';
for await (const token of tokenStream) {
buffer += token;
if (buffer.includes(' ') || buffer.includes('\n') || buffer.length > 80) {
yield buffer;
buffer = '';
}
}
if (buffer) yield buffer;
}
This async generator groups tokens by whitespace boundaries before yielding, creating a more natural word-at-a-time cadence. The size-based flush at 80 characters ensures that long unbroken tokens (such as URLs or code) do not accumulate indefinitely in the buffer. It can be used directly in useStreamingInference by wrapping the model's stream: for await (const chunk of batchTokens(stream)). Pairing this with a subtle notice like "Generating slowly on CPU; GPU acceleration available in supported browsers" sets the right expectation without hiding the limitation.
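As a quick sanity check, the generator can be exercised against a mock token stream (batchTokens is restated here so the example runs on its own):

```javascript
// Self-contained demo of word-level token batching with a mock stream.
async function* batchTokens(tokenStream) {
  let buffer = '';
  for await (const token of tokenStream) {
    buffer += token;
    if (buffer.includes(' ') || buffer.includes('\n') || buffer.length > 80) {
      yield buffer;
      buffer = '';
    }
  }
  if (buffer) yield buffer; // flush whatever remains at end of stream
}

async function* mockStream(tokens) {
  for (const t of tokens) yield t; // simulates token-by-token generation
}

async function demo() {
  const chunks = [];
  for await (const chunk of batchTokens(mockStream(['Hel', 'lo ', 'wor', 'ld', '!']))) {
    chunks.push(chunk);
  }
  return chunks; // → ['Hello ', 'world!']
}
```

Sub-word fragments accumulate until a whitespace boundary appears, so the UI receives two readable chunks instead of five jittery ones, and nothing is lost: the chunks concatenate back to the original text.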
Pattern 4: Error States and Fallbacks
Local inference fails in ways cloud APIs never do: insufficient device memory, missing WebGPU support, interrupted model downloads, and unavailable WASM runtimes. WebGPU support is still missing or limited in Firefox and Safari as of mid-2025, so these are not edge cases. The design principle: detect capabilities before attempting to load a model, and fail gracefully with specific, recovery-oriented messages.
Capability Detection and Graceful Degradation
import { useState, useEffect } from 'react';
function useDeviceCapability() {
const [capabilities, setCapabilities] = useState(null);
useEffect(() => {
const detect = async () => {
// navigator.gpu is only available in Chrome 113+ and Edge 113+.
// Support in Safari and Firefox is still missing or limited as of mid-2025.
// Presence of navigator.gpu alone does not guarantee a usable adapter;
// call requestAdapter() for a definitive check.
let gpu = false;
if (navigator.gpu) {
try {
const adapter = await Promise.race([
navigator.gpu.requestAdapter(),
new Promise((resolve) => setTimeout(() => resolve(null), 3000)),
]);
gpu = adapter !== null;
} catch {
gpu = false;
}
}
const wasm = typeof WebAssembly === 'object';
// navigator.deviceMemory is Chromium-only (Chrome, Edge, Opera).
// It returns coarsened power-of-2 values: 0.25, 0.5, 1, 2, 4, or 8.
// Firefox and Safari return undefined.
const memory = navigator.deviceMemory || null;
setCapabilities({ gpu, wasm, memory });
};
detect();
}, []);
return capabilities;
}
function CompatibilityBanner({ capabilities, requiredMemoryGB }) {
if (!capabilities) return null;
const warnings = [];
if (!capabilities.gpu)
warnings.push('WebGPU not available — inference will use CPU (slower). WebGPU requires Chrome 113+ or Edge 113+.');
if (!capabilities.wasm)
warnings.push('WebAssembly not supported — local inference unavailable.');
if (capabilities.memory !== null && capabilities.memory < requiredMemoryGB)
warnings.push(`Device reports ${capabilities.memory} GB RAM (approximate — browsers round this value). ${requiredMemoryGB} GB+ recommended. Consider a smaller model.`);
if (capabilities.memory === null)
warnings.push('Unable to detect device memory (unsupported in this browser). Ensure your device has sufficient RAM for the selected model.');
if (!warnings.length) return null;
return (
<div className="compatibility-banner" role="alert">
{warnings.map((w) => <p key={w.slice(0, 40)}>{w}</p>)}
</div>
);
}
The useDeviceCapability hook checks for navigator.gpu (Chrome/Edge 113+ only; confirmed via requestAdapter()), WASM, and navigator.deviceMemory (Chromium only; returns coarsened power-of-2 values from 0.25 to 8 GB; unavailable in Firefox and Safari). The requestAdapter() call is wrapped in a Promise.race with a 3-second timeout so that a hung GPU process does not leave capabilities permanently unresolved. CompatibilityBanner turns these checks into specific suggestions: try a smaller model, switch to a supported browser, or verify available RAM. When deviceMemory is unavailable, a general RAM warning appears instead.
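One way to turn these capability signals into a concrete suggestion is a small selection helper. The variant shapes below are hypothetical, chosen only to illustrate the decision logic:

```javascript
// Hypothetical model-variant picker: given detected capabilities, choose the
// largest variant the device can plausibly run. Variants are ordered
// largest-first; shapes ({ name, minMemoryGB, needsGpu }) are assumptions.
function pickModelVariant(capabilities, variants) {
  return variants.find((v) => {
    if (v.needsGpu && !capabilities.gpu) return false;
    // deviceMemory is null in Firefox/Safari; treat unknown as permissive
    if (capabilities.memory !== null && capabilities.memory < v.minMemoryGB) {
      return false;
    }
    return true;
  }) ?? null;
}
```

Returning null when nothing fits gives the UI a clean signal to show the "local inference unavailable" fallback rather than attempting a download that is likely to fail.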
Putting It All Together: The Full AI Interface Lifecycle
All four patterns compose into a single lifecycle managed by useReducer:
import { useReducer, useState, useCallback, useRef, useEffect } from 'react';
const initialState = { phase: 'info', error: null };
function lifecycleReducer(state, action) {
switch (action.type) {
case 'START_DOWNLOAD':
return { phase: 'downloading', error: null };
case 'CANCEL_DOWNLOAD':
return { phase: 'info', error: null };
case 'DOWNLOAD_COMPLETE':
return { phase: 'warming_up', error: null };
case 'MODEL_READY':
return { phase: 'ready', error: null };
case 'START_GENERATING':
return { phase: 'generating', error: null };
case 'GENERATION_DONE':
return { phase: 'ready', error: null };
case 'ERROR':
return { phase: 'error', error: action.message };
default:
return state;
}
}
function LocalAIChat({ model, modelUrl }) {
const [state, dispatch] = useReducer(lifecycleReducer, initialState);
const capabilities = useDeviceCapability();
const { tokens, isGenerating, generate, stop } = useStreamingInference(model);
const [prompt, setPrompt] = useState('');
const mountedRef = useRef(true);
useEffect(() => {
mountedRef.current = true;
return () => { mountedRef.current = false; };
}, []);
const handleGenerate = useCallback(() => {
dispatch({ type: 'START_GENERATING' });
generate(prompt).then(() => {
if (mountedRef.current) dispatch({ type: 'GENERATION_DONE' });
}).catch((err) => {
if (mountedRef.current) dispatch({ type: 'ERROR', message: err.message || 'Generation failed.' });
});
}, [prompt, generate]);
return (
<div className="local-ai-chat">
<CompatibilityBanner
capabilities={capabilities}
requiredMemoryGB={model.requiredMemoryGB ?? 4}
/>
{state.phase === 'info' && (
<ModelInfoCard
model={model}
cacheStatus="pending"
onDownload={() => dispatch({ type: 'START_DOWNLOAD' })}
/>
)}
{state.phase === 'downloading' && (
<ModelDownloadProgress
modelUrl={modelUrl}
onComplete={() => dispatch({ type: 'DOWNLOAD_COMPLETE' })}
onCancel={() => dispatch({ type: 'CANCEL_DOWNLOAD' })}
/>
)}
{state.phase === 'warming_up' && (
<WarmUpIndicator
isWarmingUp
onReady={() => dispatch({ type: 'MODEL_READY' })}
/>
)}
{state.phase === 'ready' && (
<div>
<textarea
placeholder="Ask something…"
value={prompt}
onChange={(e) => setPrompt(e.target.value)}
/>
<button onClick={handleGenerate} disabled={!prompt.trim()}>
Generate
</button>
</div>
)}
{state.phase === 'generating' && (
<StreamingResponseDisplay
tokens={tokens}
isGenerating={isGenerating}
onStop={stop}
/>
)}
{state.phase === 'error' && (
<div className="error-banner" role="alert">
<p>Something went wrong: {state.error}</p>
<button onClick={() => dispatch({ type: 'MODEL_READY' })}>Try again</button>
</div>
)}
</div>
);
}
The state machine ensures transitions are explicit and the UI never lands in an ambiguous state. The CANCEL_DOWNLOAD action allows the user to abort a download and return to the info phase, keeping the reducer in sync with the useModelLoader hook's internal status. The ERROR state captures failures from any phase (network errors during download, WASM initialization failures, or inference crashes) and provides recovery options. A mounted-ref guard prevents dispatch calls after the component unmounts, for example if the user navigates away during generation. Each component handles its own visual treatment while the container orchestrates flow. Note that WarmUpIndicator does not call onReady internally; the parent must invoke it when the model runtime signals initialization is complete (e.g., after a promise returned by the model loader resolves).
Summary and Component Library Reference
Four patterns cover the local inference UX lifecycle. Download progress components handle multi-gigabyte model fetches with byte-level tracking and cancellation. Warm-up indicators give users phased feedback during cold-start latency. Streaming displays render token-by-token output with auto-scroll and stop controls. Capability detection checks hardware and browser support before the user commits to a download. Each component and hook is runtime-agnostic with appropriate adapter wrappers. They can wrap Transformers.js pipelines, WebLLM sessions, or Ollama local endpoints, provided the model object exposes a generate(prompt: string): AsyncIterable<string> interface. All CSS class names referenced in the components (badge-green, shimmer-skeleton, download-progress, etc.) are presentational placeholders; supply your own styles or design system tokens when integrating.