Building a Privacy-Preserving RAG System in the Browser


Retrieval-augmented generation (RAG) has become the standard architecture for building AI-powered search and Q&A over private document collections. The typical approach sends document chunks to a remote embedding API, stores vectors in a cloud-hosted database, and routes queries through a hosted LLM. Every step in that pipeline can leak data. A browser-based RAG system that runs entirely client-side, with local vector search and no outbound requests, eliminates that exposure. This tutorial walks through a fully zero-server implementation using WebAssembly-compiled vector databases, in-browser transformer models, and local LLM inference via WebGPU.
The regulatory pressure is real. GDPR's data-minimization principle, healthcare settings where HIPAA restricts sharing patient data with third-party servers, and enterprise data-governance policies all create scenarios where externalizing document contents is either banned outright or adds weeks of legal and compliance review per integration. The system built here ingests documents, chunks them, generates embeddings, stores vectors, retrieves relevant passages, and generates grounded answers, all without a single network request after the initial model downloads when operating in fully local mode. The optional API fallback described later does send data externally and should be used only with explicit user disclosure.
The complete working demo and source code will be published at the project repository upon article publication.
Table of Contents
- Architecture Overview of a Browser-Based RAG Pipeline
- Setting Up the Project
- Document Ingestion and Chunking on the Client
- Generating Embeddings with Transformers.js
- Storing and Querying Vectors Client-Side
- Generating Answers with a Local LLM
- Putting It All Together
- Testing
- Security, Limitations, and What's Next
- Key Takeaways
Architecture Overview of a Browser-Based RAG Pipeline
The pipeline follows the same logical stages as any server-side RAG system: ingestion, chunking, embedding, vector storage, query embedding, retrieval, and generation. The difference is that every stage maps to a client-side technology.
A plain JavaScript sentence-boundary splitter handles chunking. Transformers.js runs embedding using ONNX-format models like all-MiniLM-L6-v2 (384 dimensions, roughly 23 MB) directly in the browser. Voy, a Rust-compiled WebAssembly library that provides approximate nearest-neighbor indexing and ships at under 100 KB gzipped, handles vector storage and search. WebLLM loads quantized models like Phi-3-mini-4k-instruct via WebGPU for hardware-accelerated generation.
WebGPU handles LLM inference. WebAssembly powers the vector index. IndexedDB persists vectors and cached models across sessions. Web Workers keep embedding and generation off the main thread.
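All four primitives can be feature-detected up front so the app can warn users, or degrade gracefully, before any large download starts. A minimal sketch; `detectCapabilities` is an illustrative helper name, not a library API:

```javascript
// Illustrative capability check for the four browser primitives the
// pipeline relies on. The guards make it safe to call in any context.
function detectCapabilities() {
  const hasNavigator = typeof navigator !== 'undefined';
  return {
    webgpu: hasNavigator && 'gpu' in navigator,   // WebLLM generation
    wasm: typeof WebAssembly !== 'undefined',     // Voy vector index
    indexedDB: typeof indexedDB !== 'undefined',  // persistence
    workers: typeof Worker !== 'undefined'        // off-main-thread embedding
  };
}

const caps = detectCapabilities();
if (!caps.webgpu) {
  console.warn('WebGPU unavailable: local generation disabled, retrieval still works.');
}
```

Checking `navigator.gpu` early lets the UI offer the chunked API fallback (described later) instead of failing mid-pipeline.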
The trade-offs against server-side RAG are significant but bounded. Model size maxes out at what fits in browser memory (typically 2B to 4B parameters, quantized). First-load latency is substantial, since models must be downloaded once. Device requirements are real: the generation step needs WebGPU support and a GPU with at least 4 GB of VRAM (tested on Apple M2 integrated graphics and an NVIDIA RTX 3060). For document sets under 5,000 to 10,000 chunks on modern hardware, the pipeline holds up in practice.
Setting Up the Project
Prerequisites and Tooling
The build step requires Node.js 18 or later. Vite serves as the bundler because of its native support for WebAssembly imports and top-level await, both of which this stack depends on. The target browser needs WebGPU enabled: Chrome 113+ and Edge 113+ ship it by default, while Firefox (as of mid-2025) requires enabling the dom.webgpu.enabled flag in about:config.
Project Scaffold
// package.json (relevant dependencies)
{
  "dependencies": {
    "@huggingface/transformers": "3.0.0",
    "voy-search": "0.6.3",
    "@mlc-ai/web-llm": "0.2.62",
    "idb": "8.0.0",
    "pdfjs-dist": "4.0.0"
  },
  "devDependencies": {
    "vite": "5.4.0"
  }
}
Note: Dependencies are pinned to exact versions to prevent breaking API changes from entering builds silently. Confirm the latest stable version of @mlc-ai/web-llm at npmjs.com/package/@mlc-ai/web-llm before pinning. Verify that the CreateMLCEngine export exists in your chosen version. Always commit your package-lock.json and use npm ci in CI environments.
// vite.config.js
import { defineConfig } from 'vite';

export default defineConfig({
  optimizeDeps: {
    exclude: ['voy-search']
  },
  worker: {
    format: 'es'
  },
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp'
    }
  }
});
The optimizeDeps.exclude for voy-search prevents Vite from trying to pre-bundle the WASM module, which breaks the async initialization. The ES worker format ensures Web Workers can use static imports. The Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers are required for SharedArrayBuffer access, which some WASM modules depend on internally. These same headers must also be configured on your production host.
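As one example of production configuration, a Netlify-style static host declares the same COOP/COEP headers in a `_headers` file; the exact mechanism varies by host (Nginx, Apache, and Cloudflare Pages each have their own), so treat this as a sketch to adapt:

```text
/*
  Cross-Origin-Opener-Policy: same-origin
  Cross-Origin-Embedder-Policy: require-corp
```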
Document Ingestion and Chunking on the Client
Loading Files in the Browser
The File API and drag-and-drop events handle document input. For plain text files, FileReader.readAsText() is sufficient. For PDFs, pdfjs-dist (Mozilla's client-side PDF renderer) extracts raw text page by page without any server round-trip.
import * as pdfjsLib from 'pdfjs-dist';

// pdf.js requires an explicit worker path; adjust for your Vite build
pdfjsLib.GlobalWorkerOptions.workerSrc = new URL(
  'pdfjs-dist/build/pdf.worker.min.mjs',
  import.meta.url
).toString();

async function extractTextFromPDF(file) {
  const arrayBuffer = await file.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  const pages = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    try {
      const page = await pdf.getPage(i);
      const content = await page.getTextContent();
      pages.push(content.items.map(item => item.str).join(' '));
    } catch (err) {
      console.warn(`PDF page ${i} could not be extracted:`, err);
      pages.push(''); // preserve page alignment; skip corrupt page
    }
  }
  return pages.join('\n');
}
Note: The pdfjs-dist worker path must be configured explicitly in Vite builds. The default path assumptions break at bundle time. Verify the path resolves correctly in your project structure. A single corrupt or encrypted page will be skipped with a warning rather than aborting the entire extraction.
Chunking Strategy
This chunker splits on sentence boundaries with a word-count budget and overlap, optimizing for retrieval quality. Note: this is not the same as recursive character splitting (as in LangChain), which applies a separator hierarchy. The overlap ensures that concepts spanning a chunk boundary still appear in at least one complete chunk, preventing the retriever from missing relevant passages that happen to straddle a split point.
// chunkSizeWords and overlapWords are measured in words, not tokens.
// Word count is used as a rough approximation of token count; actual
// BPE token counts will be ~1.3–1.5× the word count.
// Lookbehind requires ES2018+. Verify target browser compatibility.
// Safari 16.4+ supports lookbehind; earlier Safari versions do not
// and will throw a SyntaxError.
function chunkText(text, chunkSizeWords = 300, overlapWords = 50) {
  const chunks = [];
  const sentences = text.split(/(?<=[.!?])\s+/);
  let currentChunk = '';
  for (const sentence of sentences) {
    // Avoid leading space when currentChunk is empty
    const combined = currentChunk ? currentChunk + ' ' + sentence : sentence;
    if (combined.split(/\s+/).length > chunkSizeWords) {
      if (currentChunk.trim()) chunks.push(currentChunk.trim());
      // Carry the last `overlapWords` words of the current chunk forward
      const words = currentChunk.split(/\s+/).filter(Boolean);
      currentChunk = words.slice(-overlapWords).join(' ') + (currentChunk ? ' ' : '') + sentence;
    } else {
      currentChunk = combined;
    }
  }
  if (currentChunk.trim()) chunks.push(currentChunk.trim());
  return chunks;
}
This splits on sentence boundaries first, accumulates until the word-count budget is reached, then carries the last 50 words forward into the next chunk. For a typical 10-page document, this produces 30 to 60 chunks depending on density.
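Since the budget is measured in words, a rough token estimate helps when sizing chunks against a model's context window. The 1.4 tokens-per-word ratio below is an assumed midpoint of the 1.3 to 1.5 range mentioned above, not a measured constant:

```javascript
// Rough token estimate from word count. English BPE tokenizers typically
// produce ~1.3-1.5 tokens per word; 1.4 is an assumed midpoint.
function estimateTokens(text, tokensPerWord = 1.4) {
  const words = text.split(/\s+/).filter(Boolean).length;
  return Math.round(words * tokensPerWord);
}

// A full 300-word chunk lands around 420 tokens:
console.log(estimateTokens(Array(300).fill('word').join(' '))); // 420
```

At roughly 420 tokens per chunk, three retrieved chunks plus the question fit comfortably inside Phi-3-mini's 4k-token context.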
Generating Embeddings with Transformers.js
Loading the Model in a Web Worker
Offloading embedding to a Web Worker is non-negotiable. The all-MiniLM-L6-v2 model, even at 23 MB, blocks the main thread for several seconds during both model loading (WASM compilation, model file parsing) and inference on a batch of chunks. That freezes the UI completely.
// embed-worker.js
import { pipeline } from '@huggingface/transformers';

let embedder = null;

self.onmessage = async (e) => {
  const { type, requestId } = e.data;
  if (type === 'init') {
    embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
      quantized: true
    });
    self.postMessage({ type: 'ready' });
    return;
  }
  if (type === 'embed') {
    const { texts, ids } = e.data;
    const results = [];
    try {
      for (let i = 0; i < texts.length; i++) {
        const output = await embedder(texts[i], { pooling: 'mean', normalize: true });
        results.push({ id: ids[i], embedding: Array.from(output.data) });
        self.postMessage({ type: 'progress', requestId, current: i + 1, total: texts.length });
      }
      self.postMessage({ type: 'embeddings', requestId, results });
    } catch (err) {
      self.postMessage({ type: 'error', requestId, message: err.message });
    }
  }
};
The pooling: 'mean' option performs mean pooling across token positions, and normalize: true applies L2 normalization, together producing unit vectors suitable for cosine similarity search. Each chunk is embedded individually to provide progress feedback to the UI. A requestId is included in every message to support multiplexed concurrent calls, and inference errors are caught and propagated back to the main thread rather than silently hanging.
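Because the outputs are unit vectors, cosine similarity between any two embeddings reduces to a plain dot product, which is what makes the later search step cheap. A quick illustration:

```javascript
// On L2-normalized vectors, dot product equals cosine similarity.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

const a = [0.6, 0.8]; // already unit length: 0.36 + 0.64 = 1
const b = [0.8, 0.6];
console.log(dot(a, a)); // 1    (identical direction)
console.log(dot(a, b)); // 0.96 (nearby direction)
```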
Communicating with the Embed Worker
The main thread creates the worker instance and provides a promise-based wrapper for embedding queries. Each call includes a unique requestId so that concurrent embedding requests (e.g., a query firing while chunk ingestion is in progress) resolve independently without cross-contamination:
// Guard against SSR/Node contexts where Worker is unavailable
let embedWorker;
if (typeof Worker !== 'undefined') {
  embedWorker = new Worker(
    new URL('./embed-worker.js', import.meta.url),
    { type: 'module' }
  );
}

let _reqCounter = 0;

// Initialize the model on app startup
function initEmbedWorker() {
  return new Promise((resolve) => {
    embedWorker.addEventListener('message', function handler(e) {
      if (e.data.type === 'ready') {
        embedWorker.removeEventListener('message', handler);
        resolve();
      }
    });
    embedWorker.postMessage({ type: 'init' });
  });
}

// Embed a single query string, returning the embedding array
function embedQuery(text) {
  const requestId = ++_reqCounter;
  return new Promise((resolve, reject) => {
    function handler(e) {
      if (e.data.requestId !== requestId) return;
      if (e.data.type === 'embeddings') {
        embedWorker.removeEventListener('message', handler);
        resolve(e.data.results[0].embedding);
      }
      if (e.data.type === 'error') {
        embedWorker.removeEventListener('message', handler);
        reject(new Error(e.data.message));
      }
    }
    embedWorker.addEventListener('message', handler);
    embedWorker.postMessage({ type: 'embed', requestId, texts: [text], ids: ['query'] });
  });
}

// Embed an array of chunks, returning [{id, embedding}, ...]
function embedChunks(texts, ids) {
  const requestId = ++_reqCounter;
  return new Promise((resolve, reject) => {
    function handler(e) {
      if (e.data.requestId !== requestId) return;
      if (e.data.type === 'embeddings') {
        embedWorker.removeEventListener('message', handler);
        resolve(e.data.results);
      }
      if (e.data.type === 'error') {
        embedWorker.removeEventListener('message', handler);
        reject(new Error(e.data.message));
      }
    }
    embedWorker.addEventListener('message', handler);
    embedWorker.postMessage({ type: 'embed', requestId, texts, ids });
  });
}
Caching Models with IndexedDB
Transformers.js automatically caches downloaded ONNX model files in the browser's Cache API. On first load, the 23 MB all-MiniLM-L6-v2 download takes a few seconds on broadband. Subsequent loads pull from cache and initialize in under a second. To verify caching, inspect Cache Storage in DevTools. To bust the cache (for model updates), call caches.delete() with the cache name shown in DevTools → Application → Cache Storage (verify the exact name, as it may differ by library version).
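Cache Storage can also be inspected programmatically. This sketch enumerates whatever caches actually exist rather than assuming specific names; the guard makes it a no-op outside a browser:

```javascript
// Enumerate Cache Storage entries and how many requests each holds.
// Cache Storage is a browser-only API: returns an empty list elsewhere.
async function listModelCaches() {
  if (typeof caches === 'undefined') return [];
  const names = await caches.keys();
  const report = [];
  for (const name of names) {
    const cache = await caches.open(name);
    const entries = await cache.keys();
    report.push({ name, entries: entries.length });
  }
  return report;
}

listModelCaches().then(r => console.table(r));
```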
Storing and Querying Vectors Client-Side
Initializing Voy (WASM Vector Search)
Voy is a Rust-compiled WASM library that provides approximate nearest-neighbor (ANN) vector search, trading some recall accuracy for query speed. It ships at under 100 KB gzipped, making it well-suited for browser deployment.
The chunkMap maintains a mapping from chunk IDs to their original text. The code persists both the HNSW index and the chunk map to IndexedDB so that retrieval works correctly after a page reload.
import { Voy } from 'voy-search';
import { openDB } from 'idb';

let voy = new Voy();

// Maintains a mapping from chunk ID to original text for retrieval.
// This MUST be persisted alongside the vector index; see saveIndex/loadIndex.
const chunkMap = new Map();

// Module-level singleton DB handle to avoid leaking IDBDatabase connections
let _db = null;

async function getDB() {
  if (_db) return _db;
  _db = await openDB('rag-store', 1, {
    upgrade(db) {
      db.createObjectStore('vectors');
    }
  });
  return _db;
}

function indexChunks(embeddedChunks, documentName) {
  const resource = {
    embeddings: embeddedChunks.map((chunk, i) => {
      const id = `${documentName}-${i}`;
      chunkMap.set(id, chunk.text);
      return {
        id,
        title: documentName,
        url: `chunk-${i}`,
        embeddings: chunk.embedding
      };
    })
  };
  voy.index(resource);
}

// Persist both the vector index and the chunkMap to IndexedDB
async function saveIndex() {
  const db = await getDB();
  const serializedIndex = voy.serialize();
  const serializedMap = JSON.stringify([...chunkMap.entries()]);
  const tx = db.transaction('vectors', 'readwrite');
  tx.objectStore('vectors').put(serializedIndex, 'main-index');
  tx.objectStore('vectors').put(serializedMap, 'chunk-map');
  await tx.done;
}

// Restore both the vector index and the chunkMap from IndexedDB on page load
async function loadIndex() {
  const db = await getDB();
  const tx = db.transaction('vectors', 'readonly');
  const [serializedIndex, serializedMap] = await Promise.all([
    tx.objectStore('vectors').get('main-index'),
    tx.objectStore('vectors').get('chunk-map')
  ]);
  await tx.done;
  if (serializedIndex) {
    if (serializedMap) {
      const entries = JSON.parse(serializedMap);
      entries.forEach(([k, v]) => chunkMap.set(k, v));
    }
    return Voy.deserialize(serializedIndex);
  }
  return new Voy();
}
The serialize() method exports the entire index as a transferable object. The chunk map is serialized alongside it so that retrieved chunk IDs resolve to their original text on subsequent page loads. The singleton getDB() function ensures only one IDBDatabase handle is opened for the lifetime of the page, preventing connection leaks that would block version-change events in other tabs.
Note: The indexChunks function expects each element of embeddedChunks to have both a text and an embedding property. The caller is responsible for merging the original chunk texts with the embeddings returned by embedChunks. See the integration example in "Putting It All Together" below.
Running a Similarity Search
async function searchChunks(queryText, topK = 3) {
  // Embed the query using the same worker
  const queryEmbedding = await embedQuery(queryText);
  // Voy's search API takes the query vector as a Float32Array; verify
  // the exact signature against your installed voy-search version.
  const results = voy.search(new Float32Array(queryEmbedding), topK);
  return results.neighbors.map(n => ({
    id: n.id,
    score: n.score,
    text: chunkMap.get(n.id) // retrieve original text from the Map
  }));
}
The query is embedded with the identical model and normalization used for document chunks. Voy returns neighbors sorted by similarity score.
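When validating retrieval quality, an exact brute-force pass over the same unit vectors makes a useful correctness baseline for the approximate index. This standalone sketch assumes the [{ id, embedding }] shape used during ingestion; `bruteForceTopK` is an illustrative helper, not part of Voy:

```javascript
// Exact top-K nearest neighbors by dot product (cosine on unit vectors).
// O(N * d) per query: fine for sanity checks, not large collections.
function bruteForceTopK(queryVec, entries, topK = 3) {
  return entries
    .map(({ id, embedding }) => {
      let score = 0;
      for (let i = 0; i < queryVec.length; i++) score += queryVec[i] * embedding[i];
      return { id, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const hits = bruteForceTopK([1, 0], [
  { id: 'a', embedding: [1, 0] },
  { id: 'b', embedding: [0, 1] },
  { id: 'c', embedding: [0.7071, 0.7071] }
], 2);
console.log(hits.map(h => h.id)); // ['a', 'c']
```

If the approximate results diverge sharply from this baseline on your data, revisit index parameters before blaming the embeddings.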
Performance Considerations
On Apple M-series hardware, embedding 100 chunks with all-MiniLM-L6-v2 takes approximately 4 to 8 seconds (~8s on M1, ~4s on M2). Search consistently completes in under 5 milliseconds. These benchmarks vary significantly by device, chip generation, and available memory. The practical ceiling for in-browser vector search sits around 5,000 to 10,000 chunks before memory pressure and indexing latency start degrading the experience. Beyond that threshold, a server-side vector database becomes the more appropriate choice.
Generating Answers with a Local LLM
Loading a Small LLM via WebLLM
WebLLM uses WebGPU (not WebAssembly) because GPUs parallelize the matrix multiplications that dominate transformer inference. The q4f16_1 suffix in the model ID refers to a 4-bit quantization format with 16-bit floating-point activations, which reduces model size while producing quality comparable to the full FP16 model on short-form Q&A, with measurable degradation on multi-hop reasoning. A Q4-quantized Phi-3-mini-4k-instruct model runs at 20 to 40 tokens per second (~20 tok/s on M2 integrated, ~40 tok/s on RTX 3060-class discrete GPUs; results will vary by GPU model and available VRAM).
Important: Before running, check that WebGPU is available in the browser. If navigator.gpu is undefined, the engine initialization will fail.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Verify this model ID against the prebuilt model list in the WebLLM repository
// (mlc-ai/web-llm/blob/main/src/config.ts) before use. Registry contents change
// across library versions.
if (!navigator.gpu) {
  console.error('WebGPU is not available in this browser. Local LLM generation will not work.');
}

const engine = await CreateMLCEngine('Phi-3-mini-4k-instruct-q4f16_1-MLC', {
  initProgressCallback: (progress) => {
    console.log(`Model loading: ${progress.text}`);
  }
});

async function generateAnswer(contextChunks, question) {
  const prompt = buildRAGPrompt(contextChunks, question);
  try {
    const response = await engine.chat.completions.create({
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        { role: 'user', content: prompt }
      ],
      temperature: 0.3,
      max_tokens: 512,
      stream: true
    });
    let answer = '';
    for await (const chunk of response) {
      const delta = chunk.choices[0]?.delta?.content || '';
      answer += delta;
      updateUI(answer); // stream to DOM — see GitHub repo for UI implementation
    }
    return answer;
  } catch (err) {
    console.error('LLM generation failed:', err);
    return 'Generation failed. This may be due to a WebGPU context loss. Please reload the page and try again.';
  }
}
The max_tokens: 512 parameter caps generation length, preventing runaway output from consuming unbounded memory if the model enters a repetition loop. The try/catch block handles WebGPU context loss (common on mobile when the browser backgrounds the tab) and surfaces a user-facing error rather than leaving the UI in an unrecoverable state.
Prompt Engineering for Grounded Answers
const SYSTEM_PROMPT = `You are a document assistant. Answer the user's question using ONLY the provided context chunks. Cite which chunk(s) support your answer using [Chunk N] notation. If the context does not contain sufficient information to answer, respond with "I don't have enough context to answer that question."`;

function buildRAGPrompt(chunks, question) {
  const context = chunks
    .map((c, i) => `[Chunk ${i + 1}] (score: ${(c.score ?? 0).toFixed(3)})\n${c.text}`)
    .join('\n\n');
  return `Context:\n${context}\n\nQuestion: ${question}\nAnswer:`;
}
The low temperature (0.3) reduces hallucination. Including similarity scores in the prompt is optional; I have not empirically validated whether LLMs use these numeric signals, and the effect is likely model-dependent. The c.score ?? 0 guard prevents a TypeError if a score is missing.
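Phi-3-mini's 4k-token context can overflow if topK or chunk sizes grow, so a simple guard can drop the lowest-scoring chunks until the estimated prompt fits. `fitToBudget` and the 1.4 tokens-per-word heuristic are illustrative assumptions, not part of any library:

```javascript
// Keep highest-scoring chunks while the estimated token budget holds.
// Assumes `chunks` is already sorted by descending score (as returned
// by retrieval) and uses a rough 1.4 tokens-per-word heuristic.
function fitToBudget(chunks, maxTokens = 3000, tokensPerWord = 1.4) {
  const kept = [];
  let used = 0;
  for (const c of chunks) {
    const words = c.text.split(/\s+/).filter(Boolean).length;
    const est = Math.ceil(words * tokensPerWord);
    if (used + est > maxTokens) break;
    kept.push(c);
    used += est;
  }
  return kept;
}
```

A 300-word chunk estimates to about 420 tokens, so the default budget holds roughly seven chunks plus the question and system prompt.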
Fallback: Using an External API Responsibly
For devices without WebGPU, an optional fallback sends only the retrieved chunks (not the full documents) to an external LLM API. This pattern minimizes exposure: the API sees 3 to 5 short text passages rather than entire document collections, a meaningful reduction in data surface even when full local processing is not possible.
Warning: This fallback sends document passages to an external server, which directly affects the privacy guarantees of the system. Implement explicit user disclosure and an opt-in consent mechanism before activating the API fallback. In GDPR-scoped deployments, ensure a Data Processing Agreement is in place with the API provider. Do not enable this path silently.
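A consent gate can enforce that rule in code. Everything here is an illustrative sketch: `generateWithFallback`, `generateLocally`, and `callExternalApi` are placeholder names, not real library functions:

```javascript
// Opt-in gate: the external path is unreachable until the user explicitly
// consents during the current session. The default is always "deny".
let apiConsent = false;

function grantApiConsent() { apiConsent = true; }

async function generateWithFallback(chunks, question, opts) {
  if (opts.localAvailable) {
    return opts.generateLocally(chunks, question);
  }
  if (!apiConsent) {
    throw new Error('External API fallback requires explicit opt-in consent.');
  }
  // Sends only the retrieved chunks off-device, never full documents.
  return opts.callExternalApi(chunks, question);
}
```

Wiring `grantApiConsent` to a clearly worded dialog keeps the disclosure requirement in one auditable place.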
Putting It All Together
The end-to-end flow: a user drops a file, watches chunking progress, types a question, sees retrieved chunks highlighted with similarity scores, and reads a streamed answer grounded in those chunks.
During ingestion, the caller must merge chunk texts with the embeddings returned by embedChunks so that indexChunks receives objects containing both text and embedding:
async function ingestDocument(file) {
  const rawText = await extractTextFromPDF(file);
  const chunks = chunkText(rawText);
  const ids = chunks.map((_, i) => `${file.name}-${i}`);
  const embedded = await embedChunks(chunks, ids);
  // Merge original text with embeddings for indexing
  const enriched = embedded.map((e, i) => ({ ...e, text: chunks[i] }));
  indexChunks(enriched, file.name);
  await saveIndex();
}
Querying uses searchChunks as the single source of truth for retrieval, avoiding duplicated logic:
async function askQuestion(query) {
  // Retrieve top-K relevant chunks (embeds the query internally)
  const topChunks = await searchChunks(query, 3);
  // Display retrieved chunks in the UI
  displayRetrievedChunks(topChunks); // See GitHub repo for UI implementation
  // Generate a grounded answer
  const answer = await generateAnswer(topChunks, query);
  return answer;
}
The full source, including the drag-and-drop UI, streaming display, and displayRetrievedChunks / updateUI implementations, will be published at the project repository upon article publication.
Testing
The following tests verify the core behaviors of the pipeline. They can be run with any standard test runner (e.g., Vitest, which integrates naturally with the Vite build).
// --- Unit Tests ---

// TEST 1: chunkText — no leading-space word inflation on first sentence
test('chunkText does not produce leading-space word inflation', () => {
  const longFirstSentence = Array(310).fill('word').join(' ') + '.';
  const chunks = chunkText(longFirstSentence, 300, 50);
  assert(chunks.length === 1);
  assert(!chunks[0].startsWith(' '));
});

// TEST 2: chunkText — overlap carries forward
test('chunkText carries last overlapWords words into next chunk', () => {
  // 60 ten-word sentences (600 words total) so that sentence-boundary
  // splitting actually produces multiple chunks
  const text = Array(60).fill(Array(10).fill('word').join(' ') + '.').join(' ');
  const chunks = chunkText(text, 300, 50);
  assert(chunks.length >= 2);
  const overlapWords = chunks[1].split(/\s+/).slice(0, 50);
  const tailWords = chunks[0].split(/\s+/).slice(-50);
  assert.deepEqual(overlapWords, tailWords);
});

// TEST 3: embedQuery / embedChunks — concurrent calls resolve independently
test('concurrent embedQuery and embedChunks do not cross-resolve', async () => {
  const [queryResult, chunkResults] = await Promise.all([
    embedQuery('test query'),
    embedChunks(['chunk one', 'chunk two'], ['id1', 'id2'])
  ]);
  assert(Array.isArray(queryResult) && queryResult.length === 384);
  assert(chunkResults.length === 2);
  assert(chunkResults[0].id === 'id1');
});

// TEST 4: loadIndex — chunkMap restored after round-trip
test('saveIndex/loadIndex round-trip restores chunkMap', async () => {
  chunkMap.set('doc-0', 'hello world');
  await saveIndex();
  chunkMap.clear();
  await loadIndex();
  assert.strictEqual(chunkMap.get('doc-0'), 'hello world');
});

// TEST 5: extractTextFromPDF — corrupt page does not abort
test('extractTextFromPDF returns partial text on corrupt page', async () => {
  const mockPdf = {
    numPages: 2,
    getPage: async (i) => {
      if (i === 2) throw new Error('corrupt page');
      return { getTextContent: async () => ({ items: [{ str: 'valid text' }] }) };
    }
  };
  const result = await extractTextFromPDFWithMock(mockPdf);
  assert(result.includes('valid text'));
});

// --- Integration Test ---
test('end-to-end: ingest doc → search → non-undefined chunk text', async () => {
  const text = 'The quick brown fox jumps over the lazy dog. '.repeat(20);
  const chunks = chunkText(text, 50, 10);
  const ids = chunks.map((_, i) => `doc-${i}`);
  const embedded = await embedChunks(chunks, ids);
  // Merge original text with embeddings
  const enriched = embedded.map((e, i) => ({ ...e, text: chunks[i] }));
  indexChunks(enriched, 'test-doc');
  const results = await searchChunks('fox jumps', 3);
  assert(results.length > 0);
  assert(typeof results[0].text === 'string' && results[0].text.length > 0);
});

// --- Sanity Check (browser DevTools console) ---
// Expected: array of 384 numbers, all finite, vector magnitude ≈ 1.0
embedQuery('hello world').then(v => {
  const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  console.assert(v.length === 384, 'wrong embedding dimension');
  console.assert(Math.abs(mag - 1.0) < 0.001, 'vector not normalized');
  console.log('Embedding sanity check PASSED, magnitude:', mag);
});
Security, Limitations, and What's Next
Actual Privacy Guarantees
This design protects against server-side data collection. It does not defend against compromised client devices, malicious browser extensions with DOM access, or side-channel attacks on WebGPU memory. For production deployment, a strict Content Security Policy that blocks inline scripts and restricts connect-src to 'none' (after model download) hardens the boundary.
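As a sketch, such a policy might look like the header below. The connect-src hosts are assumptions to verify against your own network tab (model files must be downloadable on first load), and 'wasm-unsafe-eval' is the CSP3 keyword Chromium requires for WebAssembly compilation:

```text
Content-Security-Policy: default-src 'self'; script-src 'self' 'wasm-unsafe-eval'; worker-src 'self'; connect-src 'self' https://huggingface.co https://cdn-lfs.huggingface.co; style-src 'self'
```

After models are cached, a stricter variant with connect-src 'none' can be served, at the cost of breaking cache misses.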
Current Limitations
Model quality has a ceiling. A 3B-parameter quantized model handles direct factual lookups well ("What is the retention policy described in section 4?") but fails on multi-step reasoning ("Compare the retention policies in sections 4 and 7 and identify contradictions"). It cannot match GPT-4-class synthesis. First-load latency is the biggest UX hurdle: the LLM download ranges from 1 to 2 GB for Phi-3-mini-4k-instruct-q4f16_1-MLC specifically, though subsequent loads pull from cache. Chrome, Edge, and Safari Technology Preview ship WebGPU as of mid-2025; Firefox requires the dom.webgpu.enabled flag in about:config and does not enable it by default.
Future Outlook
Chrome's experimental Prompt API could eliminate the multi-gigabyte download entirely if it reaches stable release. Larger quantized models (7B+) will become practical once browser VRAM ceilings rise above 8 GB for integrated GPUs. On-device fine-tuning remains blocked by the lack of efficient backward-pass support in WebGPU shader compilers.
Key Takeaways
- A fully private RAG pipeline runs in the browser today for document sets under roughly 10,000 chunks, using Transformers.js for embeddings, Voy for vector search, and WebLLM for generation.
- Running in fully local mode protects against server-side data collection. The optional API fallback sends data externally and requires explicit user consent. Client-side threats need separate mitigation.
- Budget for first-load latency: 1 to 2 GB of model downloads, cached after the first visit.
- No WebGPU, no local generation. A chunked-only API fallback provides graceful degradation with appropriate user disclosure.
- For large document collections or multi-step reasoning tasks, server-side RAG with proper access controls remains the stronger choice.