Client-Side RAG: Building Knowledge Graphs in the Browser with GitNexus


Table of Contents
- Why RAG Is Moving to the Client
- What GitNexus Does Differently
- The Client-Side RAG Pipeline, Step by Step
- Querying the Graph: Retrieval Meets Graph Traversal
- Limitations and When to Reach for the Server
- Key Takeaways for Building Local Agent Infrastructure
Why RAG Is Moving to the Client
Retrieval-augmented generation for code repositories has a fundamental tension: the most sensitive data in any organization, its source code, must be shipped to third-party servers for embedding, indexing, and retrieval. Every cloud-based RAG pipeline introduces a trust boundary. For teams working under NDA, in regulated industries, or simply protective of proprietary logic, that boundary violates their data handling requirements. Local agent infrastructure addresses this by keeping the entire pipeline on the user's machine, and increasingly, inside the browser itself.
Cost compounds the privacy problem. Cloud embedding APIs charge per token (OpenAI's ada-002 costs roughly $0.0001 per 1K tokens as of early 2025), vector database hosting adds monthly overhead, and round-trip latency to retrieval endpoints slows every query. For individual developers and small teams, this infrastructure tax is hard to justify for exploratory code understanding.
The browser is now capable enough to absorb this workload. WebAssembly runs near-native computation, and Web Workers provide concurrent execution off the main thread, preventing UI blocking without shared-memory parallelism. IndexedDB offers persistent structured storage measured in gigabytes. WebGPU, available in Chrome and Edge 113+ but still behind flags or unavailable in Firefox and Safari, promises GPU-accelerated inference directly in browser tabs. Together, these APIs form a runtime sufficient for small-to-medium-scale RAG without a single server process.
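A startup capability probe makes these dependencies explicit. The checks below use standard Web API globals; the shape of the returned object is just an illustration:

```javascript
// Capability probe for the browser APIs discussed above. WebGPU is the one
// most likely to be absent, so treat it as an optional acceleration path
// rather than a hard requirement.
function detectRuntimeCapabilities() {
  return {
    wasm:
      typeof WebAssembly === 'object' &&
      typeof WebAssembly.instantiate === 'function',
    workers: typeof Worker === 'function',
    indexedDb: typeof indexedDB !== 'undefined',
    webGpu: typeof navigator !== 'undefined' && 'gpu' in navigator,
  };
}
```

An app can fail fast when `wasm`, `workers`, or `indexedDb` is false, and simply fall back to WASM inference when `webGpu` is.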
What GitNexus Does Differently
GitNexus is a browser-based tool that constructs knowledge graphs from Git repositories entirely client-side. There is no backend server. No API keys are required for core graph functionality, which means parsing, chunking, embedding, storage, and retrieval all happen in the browser tab. (Verify the current feature set and version in the GitNexus repository README.)
The tradeoffs become clearest against server-side RAG tools like Cursor, Greptile, and Sourcegraph Cody, which, as of their architectures at the time of writing, rely on cloud infrastructure for embedding generation, vector storage, or both. (Verify current architecture in each product's documentation, as these tools evolve rapidly.) Those tools offer access to larger, higher-quality models and can handle massive codebases, but they require sending repository content off-device.
GitNexus makes an explicit tradeoff: constrained model size and compute budget in exchange for complete data sovereignty. The embedding models that fit in a browser are smaller and less expressive than server-side alternatives. But for many codebases, particularly those under 10,000 files, the quality is sufficient for meaningful code exploration and question answering.
The Client-Side RAG Pipeline, Step by Step
Prerequisites: The code examples below assume a bundler-based project (e.g., Vite or webpack 5+). Worker type: 'module' requires an HTTP server — the file:// protocol will not work. Some older browser versions (Firefox before 114, for example) do not support module-type Workers. You will also need an internet connection for the initial model download.
Parsing and Chunking Repository Files in the Browser
GitNexus ingests repository content directly in the client. Once a user points it at a repository (cloned locally or fetched via the GitHub API), the tool reads file contents and splits them into semantically meaningful chunks.
Note on GitHub API usage: The GitHub API requires authentication (a personal access token) for private repositories and enforces rate limits: 60 requests/hour unauthenticated, 5,000 requests/hour authenticated. CORS restrictions apply to cross-origin requests to api.github.com.
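As a hedged sketch of the fetch path (the endpoint and headers follow the GitHub REST API's git/trees route; the error-handling choices are illustrative):

```javascript
// Build the git/trees endpoint URL; recursive=1 returns the full tree in one call.
const treeUrl = (owner, repo, ref) =>
  `https://api.github.com/repos/${owner}/${repo}/git/trees/${ref}?recursive=1`;

// Fetch a repository's file list. Pass a personal access token for private
// repositories, or to raise the rate limit from 60 to 5,000 requests/hour.
async function fetchRepoTree(owner, repo, ref = 'HEAD', token) {
  const headers = { Accept: 'application/vnd.github+json' };
  if (token) headers.Authorization = `Bearer ${token}`;
  const res = await fetch(treeUrl(owner, repo, ref), { headers });
  if (res.status === 403 && res.headers.get('x-ratelimit-remaining') === '0') {
    throw new Error('GitHub API rate limit exceeded; wait for the reset window');
  }
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  const { tree, truncated } = await res.json();
  if (truncated) {
    // The API caps tree size; very large repos need per-subtree requests.
    console.warn('Tree truncated by the API; fetch subtrees individually');
  }
  return tree.filter((entry) => entry.type === 'blob'); // blobs are files
}
```

Each returned entry carries a `path` and a blob SHA, which the client then fetches and feeds into the chunker.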
Naive text splitting, breaking files at fixed character counts, destroys the structural meaning that makes code searchable. AST-aware chunking preserves it. By parsing source files into abstract syntax trees, the chunker can extract function declarations, class definitions, and module exports as discrete units, each carrying metadata about its file path, symbol name, and line range.
Handling multiple languages requires either language-specific parsers or a universal parsing library. Tree-sitter, compiled to WebAssembly, provides grammars for over 100 languages through community contributions, though browser-compatible WASM builds may require separate compilation. For JavaScript and TypeScript specifically, lighter-weight parsers work well:
// Requires @babel/parser v7.x: npm install @babel/parser@^7.24.0
import { parse } from '@babel/parser';

function chunkSourceFile(code, filePath) {
  const ast = parse(code, {
    sourceType: 'module',
    plugins: ['typescript', 'jsx'],
  });

  const chunks = [];
  const skipped = [];
  const HANDLED_TYPES = new Set([
    'FunctionDeclaration',
    'ClassDeclaration',
    'ExportNamedDeclaration',
  ]);

  for (const node of ast.program.body) {
    if (HANDLED_TYPES.has(node.type)) {
      // Guard against nodes lacking location data
      if (!node.loc) {
        console.warn(
          `Node of type ${node.type} in ${filePath} has no location info; skipping`
        );
        continue;
      }
      // Handle re-exports where node.declaration is null
      const name =
        node.id?.name ??
        node.declaration?.id?.name ??
        (node.specifiers?.length
          ? node.specifiers.map((s) => s.exported.name).join(',')
          : 'anonymous');
      chunks.push({
        filePath,
        symbolName: name,
        lineStart: node.loc.start.line,
        lineEnd: node.loc.end.line,
        content: code.slice(node.start, node.end),
        type: node.type,
      });
    } else {
      skipped.push(node.type);
    }
  }

  if (skipped.length > 0) {
    console.debug(
      `[chunkSourceFile] ${filePath}: skipped node types: ${[...new Set(skipped)].join(', ')}`
    );
  }

  return { chunks, skippedNodeTypes: [...new Set(skipped)] };
}

// Note: this example captures only FunctionDeclaration, ClassDeclaration,
// and ExportNamedDeclaration nodes. Arrow functions (const fn = () => {}),
// VariableDeclaration-bound functions, and ExportDefaultDeclaration require
// additional node type handling. In a modern JS/TS codebase, these patterns
// are dominant — without extending the node type list, large portions of
// your codebase will be skipped. The returned skippedNodeTypes array lets
// callers detect this situation rather than silently losing chunks.
//
// The `code` parameter must be the raw, unmodified source string passed to
// parse(). If code is pre-processed (e.g., BOM-stripped or re-encoded),
// node.start/node.end character offsets will misalign with string positions.
Each chunk object carries enough metadata to reconstruct its location and role in the codebase, which matters both for retrieval ranking and for building graph edges (relationships such as import references, function calls, and class inheritance, described in the "Storing the Knowledge Graph" section below). The returned skippedNodeTypes array makes it visible when node types are not handled, so callers can distinguish "empty file" from "parser matched nothing for this file's patterns."
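Extending coverage to the patterns the example skips mostly comes down to naming logic for more node types. The helper below is a hypothetical sketch (not part of GitNexus) showing how arrow-function bindings and default exports could be named; it operates on Babel-style AST node objects:

```javascript
// Hypothetical naming helper for node types the base chunker skips.
// Returns a symbol name, or null when the node is not chunk-worthy.
function symbolNameForNode(node) {
  switch (node.type) {
    case 'FunctionDeclaration':
    case 'ClassDeclaration':
      return node.id?.name ?? 'anonymous';
    case 'VariableDeclaration': {
      // const fn = () => {} or const fn = function () {}
      const fnDecl = (node.declarations ?? []).find(
        (d) =>
          d.init &&
          (d.init.type === 'ArrowFunctionExpression' ||
            d.init.type === 'FunctionExpression')
      );
      return fnDecl?.id?.name ?? null;
    }
    case 'ExportDefaultDeclaration':
      // export default function foo() {} keeps its name; anonymous
      // defaults fall back to the string 'default'
      return node.declaration?.id?.name ?? 'default';
    default:
      return null;
  }
}
```

Wiring this into `chunkSourceFile` in place of the inline `name` expression would shrink the `skippedNodeTypes` list considerably in arrow-function-heavy codebases.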
Generating Embeddings Without a Server
With chunks extracted, the next step is generating vector embeddings. Transformers.js, the browser-compatible port of Hugging Face's transformers library, runs ONNX-format models directly in JavaScript. The all-MiniLM-L6-v2 model is a common choice: approximately 22.7 million parameters, 384-dimension output vectors, and small enough to load quickly in a browser context. Quantized variants reduce the download further, trading marginal precision for faster startup.
Generating embeddings consumes heavy CPU time, so running the process on the main thread would freeze the UI. Web Workers solve this cleanly. The main thread sends chunks to a worker, which loads the model once and processes chunks in sequence or in batches:
// embedding-worker.js
// Requires @huggingface/transformers v3.x: npm install @huggingface/transformers@^3.0.0
import { pipeline } from '@huggingface/transformers';

let embedder = null;
let initInFlight = false;

self.onmessage = async (event) => {
  const { type, chunks } = event.data;

  if (type === 'init') {
    // Guard against double-initialization: if already initialized, just confirm ready
    if (embedder !== null) {
      self.postMessage({ type: 'ready' });
      return;
    }
    // Guard against concurrent init: if init is already in progress, ignore duplicate
    if (initInFlight) {
      console.warn('Init already in progress; ignoring duplicate init message');
      return;
    }
    initInFlight = true;
    try {
      embedder = await pipeline(
        'feature-extraction',
        'Xenova/all-MiniLM-L6-v2',
        { dtype: 'q8' } // v3 replaces the v2-era `quantized: true` option with `dtype`
      );
      self.postMessage({ type: 'ready' });
    } catch (err) {
      embedder = null; // allow retry on failure
      self.postMessage({ type: 'error', message: `Init failed: ${err.message}` });
    } finally {
      initInFlight = false;
    }
    return;
  }

  if (type === 'embed') {
    // Descriptive error if embed is called before init completes
    if (!embedder) {
      self.postMessage({
        type: 'error',
        message: 'Embedder not initialized. Send an init message first.',
      });
      return;
    }
    const results = [];
    const errors = [];
    const transferables = [];
    // Per-chunk error isolation: one bad chunk does not abort the entire batch
    for (const chunk of chunks) {
      try {
        const output = await embedder(chunk.content, {
          pooling: 'mean',
          normalize: true,
        });
        // Validate output shape before storing.
        // Transformers.js feature-extraction with pooling returns dims [1, 384].
        // output.data is always a flat typed array regardless of dims.
        if (
          !output ||
          !output.data ||
          !output.dims ||
          output.dims[output.dims.length - 1] !== 384
        ) {
          throw new Error(
            `Unexpected output shape: dims=${JSON.stringify(output?.dims)}`
          );
        }
        const embedding = new Float32Array(output.data);
        transferables.push(embedding.buffer);
        results.push({ ...chunk, embedding });
      } catch (err) {
        errors.push({
          chunkId: chunk.symbolName ?? chunk.filePath,
          message: err.message,
        });
      }
    }
    // Transfer Float32Array buffers zero-copy instead of structured-clone
    self.postMessage({ type: 'embeddings', results, errors }, transferables);
  }
};

// main.js
// Guard against file:// protocol — module Workers require HTTP/HTTPS
if (typeof location !== 'undefined' && location.protocol === 'file:') {
  throw new Error(
    'GitNexus must be served over HTTP/HTTPS. The file:// protocol is not supported.'
  );
}

// Bundler-safe worker path resolution using import.meta.url
const worker = new Worker(
  new URL('./embedding-worker.js', import.meta.url),
  { type: 'module' }
);

worker.onerror = (err) => {
  console.error(
    'Worker failed to load or threw an uncaught error:',
    err.message,
    err.filename,
    err.lineno
  );
};

worker.postMessage({ type: 'init' });

worker.onmessage = (e) => {
  if (e.data.type === 'ready') {
    worker.postMessage({ type: 'embed', chunks: codeChunks });
  }
  if (e.data.type === 'embeddings') {
    if (e.data.errors?.length) {
      console.warn('Partial embed failures:', e.data.errors);
    }
    storeEmbeddings(e.data.results);
  }
  if (e.data.type === 'error') {
    console.error('Worker error:', e.data.message);
  }
};
The dtype: 'q8' option loads the int8-quantized ONNX variant of the model; in Transformers.js v3 this option replaces the quantized: true flag from v2, and the model card lists the exact quantization formats available. Mean pooling with normalization produces unit vectors suitable for cosine similarity without further preprocessing.
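On the main thread, the raw onmessage handler can be wrapped in a small promise-based client so callers can simply await results. This wrapper is an illustrative addition, not GitNexus code, and it assumes the one-request-at-a-time message protocol shown above:

```javascript
// Promise wrapper over the worker's init/embed message protocol.
// Handles one in-flight request at a time; a production version would
// correlate messages with request ids.
function createEmbedderClient(worker) {
  let pending = null;
  worker.onmessage = (e) => {
    if (!pending) return;
    const p = pending;
    const { type } = e.data;
    if (type === 'ready') {
      pending = null;
      p.resolve();
    } else if (type === 'embeddings') {
      pending = null;
      p.resolve(e.data); // { results, errors }
    } else if (type === 'error') {
      pending = null;
      p.reject(new Error(e.data.message));
    }
  };
  const send = (msg) =>
    new Promise((resolve, reject) => {
      if (pending) return reject(new Error('A request is already in flight'));
      pending = { resolve, reject };
      worker.postMessage(msg);
    });
  return {
    init: () => send({ type: 'init' }),
    embed: (chunks) => send({ type: 'embed', chunks }),
  };
}
```

With this in place, the main-thread flow reads as `await client.init()` followed by `const { results, errors } = await client.embed(codeChunks)`.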
Storing the Knowledge Graph in IndexedDB
In-memory-only storage disappears when the tab closes. IndexedDB persists across sessions, supports structured queries via indexes, and in most browsers allows storage in the hundreds of megabytes to low gigabytes range before triggering quota prompts — though the exact quota varies by browser, device, and available disk space. Always specify a schema version in indexedDB.open(name, version) and handle the onupgradeneeded event for migrations to avoid corrupting existing user data.
GitNexus uses a schema with three primary object stores. Nodes represent code entities (functions, classes, modules), each keyed by a composite of file path and symbol name. Edges encode relationships between them: import references, function calls, class inheritance, and type usage. Vectors store the float arrays alongside node references, indexed for batch retrieval.
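A minimal sketch of such a schema follows; the store and index names here are illustrative, not GitNexus's actual schema:

```javascript
// Composite key for the nodes store, as described above.
const nodeKey = (filePath, symbolName) => `${filePath}#${symbolName}`;

// Open (or create) the database, defining the three object stores on first
// run or version bump inside onupgradeneeded.
function openGraphDb(name = 'code-graph', version = 1) {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(name, version);
    req.onupgradeneeded = (e) => {
      const db = e.target.result;
      if (!db.objectStoreNames.contains('nodes')) {
        const nodes = db.createObjectStore('nodes', { keyPath: 'id' });
        nodes.createIndex('byFile', 'filePath'); // fetch all symbols in a file
      }
      if (!db.objectStoreNames.contains('edges')) {
        const edges = db.createObjectStore('edges', { autoIncrement: true });
        edges.createIndex('bySource', 'sourceId'); // outgoing relationships
        edges.createIndex('byTarget', 'targetId'); // incoming relationships
      }
      if (!db.objectStoreNames.contains('vectors')) {
        db.createObjectStore('vectors', { keyPath: 'nodeId' });
      }
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}
```

The `bySource` and `byTarget` indexes make both directions of graph traversal (callers and callees) a single indexed lookup rather than a full scan.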
For a typical repository of 5,000 files producing 15,000 chunks with 384-dimensional vectors, the vector store alone occupies roughly 22 MB when stored as float32 (4 bytes per value); int8 quantized storage would reduce this to approximately 5.7 MB. Node and edge metadata adds comparatively little. This fits comfortably within browser storage limits for any single repository, though monorepos with hundreds of thousands of files push well beyond practical thresholds.
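The arithmetic behind those figures is simply chunks times dimensions times bytes per value:

```javascript
// Raw vector payload size: chunkCount × dimensions × bytes per value.
const vectorStoreBytes = (chunkCount, dims, bytesPerValue) =>
  chunkCount * dims * bytesPerValue;

const float32Size = vectorStoreBytes(15_000, 384, 4); // 23,040,000 bytes (the ~22 MB figure above)
const int8Size = vectorStoreBytes(15_000, 384, 1); // 5,760,000 bytes (~5.7 MB)
```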
Querying the Graph: Retrieval Meets Graph Traversal
Vector Similarity Search in the Browser
At query time, the user's natural-language question is embedded using the same model and compared against stored vectors. For repositories producing tens of thousands of chunks, brute-force cosine similarity is surprisingly viable.
For larger indexes, approximate nearest neighbor libraries like hnswlib-wasm (available on npm as hnswlib-wasm; verify maintenance status before production use) bring HNSW (Hierarchical Navigable Small World) indexing to the browser via WebAssembly, trading a small recall penalty for sublinear search time.
The following function demonstrates vector search using cursor-based retrieval instead of loading the entire store into memory. Because the embedding model runs inside a Web Worker (as shown above), the query embedding must be obtained via postMessage — the pipeline object cannot be called directly from the main thread. The simplified version below assumes the embedding has already been obtained from the worker:
// NOTE: In production, obtain queryVec by sending the question to the
// embedding worker via postMessage and awaiting the response.
// The embedder pipeline object is not transferable across thread boundaries.
async function queryCodebase(db, queryVec, topK = 5) {
  // Accept plain arrays and typed arrays alike: Array.isArray() returns
  // false for Float32Array, so check length directly instead.
  if (!queryVec || typeof queryVec.length !== 'number' || queryVec.length === 0) {
    throw new Error('queryVec must be a non-empty array or typed array');
  }
  return new Promise((resolve, reject) => {
    const tx = db.transaction('vectors', 'readonly');
    // Handle IndexedDB transaction errors explicitly
    tx.onerror = () => reject(tx.error);
    tx.onabort = () => reject(new Error('IDB transaction aborted'));
    const store = tx.objectStore('vectors');
    const candidates = []; // bounded candidate list
    // Cursor-based iteration avoids loading the entire store into the JS heap.
    // store.getAll() would cause OOM for repositories exceeding ~5,000 chunks.
    const request = store.openCursor();
    request.onerror = () => reject(request.error);
    request.onsuccess = (event) => {
      const cursor = event.target.result;
      if (cursor) {
        const item = cursor.value;
        // Dimension mismatch guard: skip corrupt or mismatched records
        if (item.embedding.length !== queryVec.length) {
          console.warn(
            `Dimension mismatch for node ${item.nodeId}: expected ${queryVec.length}, got ${item.embedding.length}`
          );
          cursor.continue();
          return;
        }
        let dot = 0;
        for (let i = 0; i < queryVec.length; i++) {
          dot += queryVec[i] * item.embedding[i];
        }
        candidates.push({ ...item, score: dot });
        // Keep only topK * 2 candidates in memory to bound the list's size
        if (candidates.length > topK * 2) {
          candidates.sort((a, b) => b.score - a.score);
          candidates.splice(topK * 2);
        }
        cursor.continue();
      } else {
        // Cursor exhausted — return final top-K results
        candidates.sort((a, b) => b.score - a.score);
        resolve(candidates.slice(0, topK));
      }
    };
  });
}

// db is an IDBDatabase instance returned by indexedDB.open(name, version).
// You must initialize the database and create object stores in the
// onupgradeneeded handler before calling this function.
Because both query and stored vectors are L2-normalized, the dot product equals cosine similarity, avoiding the need for separate magnitude computation. (This equivalence holds only when both vectors are L2-normalized; if normalization is skipped upstream, rankings will be incorrect.) The cursor-based approach iterates one record at a time, keeping memory usage proportional to topK rather than the total store size.
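A quick numeric check of that equivalence, using illustrative vectors rather than real embeddings:

```javascript
// L2-normalize a vector, then show that the dot product of the normalized
// vectors equals cosine similarity computed from the raw vectors.
const l2normalize = (v) => {
  const norm = Math.hypot(...v);
  return v.map((x) => x / norm);
};
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const cosine = (a, b) => dot(a, b) / (Math.hypot(...a) * Math.hypot(...b));

const a = [3, 4, 0];
const b = [0, 4, 3];
// dot(l2normalize(a), l2normalize(b)) and cosine(a, b) both come out to ≈ 0.64
```

Skipping `normalize: true` at embedding time while still ranking by raw dot product would silently bias results toward longer chunks, whose unnormalized vectors tend to have larger magnitudes.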
Graph-Augmented Retrieval
Vector similarity alone retrieves chunks that are textually or semantically close to the query. But code understanding often demands structural context: what calls this function, what does it import, what interface does it implement.
Without graph edges, the LLM receives isolated snippets and cannot trace call chains across files. After vector retrieval returns the top-k chunks, GitNexus walks the graph to pull in related symbols. If a retrieved function calls three utility functions, those utilities get added to the context window. If a class inherits from a base class, that base class definition is included. The traversal depth is typically bounded to one or two hops to keep context size manageable.
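A bounded traversal of that kind can be sketched as a breadth-first expansion. The adjacency map here (node id to neighbor ids) and its construction from the edges store are assumptions of this sketch:

```javascript
// Expand a set of vector-retrieved node ids by up to maxHops graph hops.
// adjacency: Map from node id to an array of neighbor ids (callers, callees,
// imports, base classes, etc., flattened into one neighbor list).
function expandContext(seedIds, adjacency, maxHops = 2) {
  const visited = new Set(seedIds);
  let frontier = [...seedIds];
  for (let hop = 0; hop < maxHops; hop++) {
    const next = [];
    for (const id of frontier) {
      for (const neighbor of adjacency.get(id) ?? []) {
        if (!visited.has(neighbor)) {
          visited.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return visited; // seeds plus their 1..maxHops-hop neighborhood
}
```

Keeping `maxHops` at one or two is what keeps the expanded context within the LLM's window; each additional hop can grow the set multiplicatively in well-connected codebases.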
GitNexus formats the combined context (vector-retrieved chunks plus graph-traversed neighbors) and passes it to an LLM for answer generation. In manual testing, this graph-augmented approach returned caller/callee context and dependency chains that flat vector retrieval missed entirely, giving the LLM the structural information it needs to reason about interactions between components rather than answering from isolated code fragments.
Limitations and When to Reach for the Server
Browser-based RAG has real ceilings. The embedding models that fit in a browser tab, constrained by WASM heap limits (typically 2-4 GB) and acceptable download/startup time, produce lower-quality vectors than server-side models like OpenAI's text-embedding-3-large or Cohere's Embed v3. On the MTEB retrieval benchmarks, all-MiniLM-L6-v2 scores meaningfully below these larger models; check the MTEB leaderboard for current numbers. For nuanced semantic queries, this quality gap matters.
Repository size is the harder constraint. IndexedDB storage, embedding throughput, and in-memory graph traversal all degrade as file counts climb. Repositories exceeding around 50,000 files, depending on chunk size and hardware, push against practical limits in both processing time and storage budget. Monorepos are generally not candidates for pure client-side RAG.
A hybrid architecture offers a pragmatic middle ground: build and query the knowledge graph client-side, but route LLM inference to a server-side model when answer quality demands it. This preserves the privacy benefit for the retrieval layer only; the client includes retrieved code chunks in the prompt it sends to the server-side LLM, and those chunks do leave the device at inference time. Teams should evaluate whether the code content in those retrieved chunks meets their data sensitivity requirements before adopting hybrid mode.
Key Takeaways for Building Local Agent Infrastructure
The browser is now a viable RAG runtime for repositories under roughly 10,000 to 50,000 files, depending on hardware. Client-side knowledge graphs eliminate both the infrastructure cost and the privacy exposure of server-side pipelines. The core pattern (parse, chunk, embed, store, retrieve, augment) generalizes beyond code to documentation, internal wikis, personal notes, and any corpus where data sensitivity discourages cloud transmission.
WebGPU, larger quantized models, and maturing libraries like Transformers.js and hnswlib-wasm will continue to push the boundary of what runs locally. But the tooling works today. The server is not going away, but for a growing class of use cases, it is already optional.