Local RAG for Agents: Integrating Private Knowledge Bases with Awesome-LLM-Apps


How to Build a Local RAG Pipeline for Coding Agents
- Clone the awesome-llm-apps repository and explore the RAG templates in rag_tutorials/.
- Install Ollama, pull nomic-embed-text, and start the ChromaDB server locally.
- Ingest your codebase using function-aware chunking that respects code boundaries.
- Generate embeddings locally so proprietary code never leaves your machine.
- Store vectors in ChromaDB with content-addressed IDs and file-path metadata.
- Connect retrieval to your coding agent via a middleware script that augments prompts with relevant chunks.
- Configure hallucination guardrails: distance thresholds, provenance tagging, and a grounded system prompt.
- Automate re-indexing with a file watcher to keep the knowledge base current.
Table of Contents
- Why Your Coding Agent Keeps Getting It Wrong
- What Is Local RAG and Why Does It Matter for Coding Agents?
- Prerequisites and Project Setup
- Building the Local RAG Pipeline for Your Codebase
- Connecting Your RAG Pipeline to Claude Code
- Preventing Hallucination in Autonomous Editing
- Implementation Checklist and Complete Pipeline Reference
- Common Pitfalls
- Next Step
Why Your Coding Agent Keeps Getting It Wrong
Autonomous coding agents like Claude Code, Aider, and Cursor promise to accelerate development by reading, writing, and refactoring code on a developer's behalf. In practice, these agents operating on large codebases routinely hit context window limits and hallucinate architectural decisions, API usage patterns, and internal conventions they have no way of knowing. The result is silent bugs, incorrect refactors that pass cursory review, and wasted cycles spent undoing damage the agent introduced with confidence.
A local RAG pipeline that integrates your private knowledge base, assembled from templates in the awesome-llm-apps repository, addresses this directly. Rather than relying on the agent's training data or cramming entire repositories into a context window, retrieval-augmented generation fetches only the relevant documentation and code before the agent acts. By the end of this tutorial, you will have a working local RAG system built from composable awesome-llm-apps templates, connected to a coding agent, and configured with guardrails against hallucination, removing the context window bottleneck that makes agents unreliable in real-world projects.
What Is Local RAG and Why Does It Matter for Coding Agents?
RAG in 60 Seconds
Retrieval-Augmented Generation combines information retrieval with language model generation. Instead of asking an LLM to answer purely from its parameters, the system first queries a knowledge base, retrieves relevant documents, and appends them to the prompt. A query hits a retriever that performs vector search over local documents, the retrieved content augments the prompt sent to the LLM, and the model generates a response grounded in actual source material rather than parametric memory alone.
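The loop described above fits in a few lines. This is a minimal illustration rather than part of the tutorial's pipeline: `embed`, `vectorSearch`, and `generate` are hypothetical placeholders for the concrete pieces built in later sections (Ollama embeddings, a ChromaDB query, and the agent call).

```javascript
// Minimal retrieve-then-generate sketch. The three injected functions are
// placeholders — swap in the real implementations from the pipeline below.
async function answerWithRAG(query, { embed, vectorSearch, generate }) {
  const queryVector = await embed(query);            // 1. embed the query
  const docs = await vectorSearch(queryVector, 5);   // 2. retrieve top-k documents
  const prompt = [                                   // 3. augment the prompt
    'Context:',
    ...docs.map(d => d.text),
    `Question: ${query}`,
  ].join('\n');
  return generate(prompt);                           // 4. grounded generation
}
```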
Local vs. Cloud RAG: The Privacy Tradeoff
Running RAG locally means the developer's machine generates embeddings and stores vectors in a local database, and no proprietary code leaves the network boundary. For organizations with compliance requirements or simply teams that prefer not to pipe internal source code through third-party embedding APIs, local RAG is the only viable option. The trade-offs are real: nomic-embed-text runs on CPU but benefits significantly from GPU acceleration for large codebase indexing. On a modern CPU, expect to index ~10,000 chunks in under 10 minutes. Embedding quality comparisons are workload-dependent; evaluate both nomic-embed-text and hosted models like OpenAI's text-embedding-3-small on a representative sample of your codebase before deciding. Query latency for local setups typically lands under 50ms once the model is loaded (no network round-trip), versus 100-300ms for a hosted API call, but indexing large codebases on CPU-only machines can be slow. Benchmark your own setup with console.time/console.timeEnd around the embedding and query calls.
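As a concrete version of that benchmarking advice, a thin timing wrapper is enough to compare setups; `embedFn` here stands in for whatever embedding function your pipeline exposes.

```javascript
// Benchmarking sketch: time a batch embedding call with console.time.
// Wrap the query path the same way to measure retrieval latency.
async function benchmarkEmbedding(embedFn, chunks) {
  const label = `embed ${chunks.length} chunks`;
  console.time(label);
  const out = await embedFn(chunks);
  console.timeEnd(label); // prints elapsed wall-clock time for the batch
  return out;
}
```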
How awesome-llm-apps Fits In
The Shubhamsaboo/awesome-llm-apps repository is a curated collection of open-source LLM application templates spanning RAG pipelines, autonomous agents, and multi-modal workflows. For this tutorial, the relevant resources are the local RAG pipeline templates in the rag_tutorials/ directory that pair document loaders with vector stores, and the agent integration patterns that demonstrate how to wire retrieval into autonomous workflows. Inspect the subdirectories within rag_tutorials/ to identify the templates closest to your use case. These templates are composable: developers can swap embedding models, vector databases, and chunking strategies without rewriting the pipeline from scratch.
Prerequisites and Project Setup
What You'll Need
You need Node.js 18 or later installed. You also need Python 3.10 or later, since several awesome-llm-apps utilities are Python-based. Install Ollama and start it with ollama serve, then pull the embedding model with ollama pull nomic-embed-text. Alternatively, you can use an OpenAI API key for hosted embeddings. Install the ChromaDB server (pip install chromadb==0.5.23) and confirm it runs before any script executes. Install the Claude Code CLI and authenticate it; verify non-interactive flags with claude --help. Finally, have a target codebase ready to index (your own project or a sample repository).
Note on runtimes: This tutorial uses both Python (for ChromaDB server, awesome-llm-apps utilities) and Node.js (for the custom RAG pipeline scripts). Python dependencies are installed in a virtual environment in the cloned repository root. Node.js code lives in a separate project directory you create alongside or inside the repo. Both Ollama and ChromaDB must be running as background services before executing any Node.js pipeline scripts.
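Because both services must be up before any pipeline script runs, a small preflight check lets scripts fail fast with a clear message instead of an opaque connection error. This is a sketch: the endpoint paths (`/api/tags` for Ollama, `/api/v1/heartbeat` for ChromaDB) correspond to the 0.5.x-era versions used in this tutorial — verify them against your installed versions.

```javascript
// Preflight sketch: confirm Ollama and ChromaDB are reachable before indexing.
async function isReachable(url, timeoutMs = 2000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false; // connection refused, timeout, or DNS failure
  }
}

async function preflight() {
  const checks = {
    ollama: await isReachable('http://localhost:11434/api/tags'),
    chromadb: await isReachable('http://localhost:8000/api/v1/heartbeat'),
  };
  for (const [name, up] of Object.entries(checks)) {
    console.log(`${name}: ${up ? 'OK' : 'NOT REACHABLE'}`);
  }
  return checks;
}
```

Call `preflight()` at the top of the ingestion and retrieval scripts and abort if either check fails.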
Cloning and Exploring awesome-llm-apps
Start by cloning the repository, setting up a Python virtual environment, and installing dependencies:
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
cd awesome-llm-apps
# Set up Python virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install Python dependencies (includes ChromaDB server and utilities)
pip install -r requirements.txt
pip list # Verify installed packages and versions
# Set up Node.js project for the custom pipeline code
mkdir rag-agent && cd rag-agent
npm init -y
npm install chromadb ollama  # pin the exact versions you test against
The rag_tutorials/ directory contains multiple pipeline templates. For this tutorial, the local RAG patterns using vector stores and document loaders for code are the starting point. The npm init and install commands set up the Node.js project that will house the custom pipeline code.
Version pinning: This tutorial was tested against specific versions of the chromadb and ollama npm clients, ChromaDB server 0.5.23, and the Ollama daemon 0.5.x. Pin these versions in your package.json and requirements.txt to avoid API drift. If you use different versions, verify API response shapes and method signatures against your installed version's documentation.
Building the Local RAG Pipeline for Your Codebase
Step 1: Ingesting and Chunking Your Codebase
Naive text splitting, where files are cut at fixed character counts, destroys the structural information that makes code useful as context. A function split across two chunks loses its meaning. Code-aware chunking strategies respect file boundaries, function boundaries, and import blocks. For JavaScript and TypeScript files, chunking at function-level granularity preserves callable units. Chunk size and overlap are critical parameters: too small and individual chunks lack sufficient context; too large and irrelevant code pollutes the retrieval results. A chunk size of 1500 to 2000 characters with 200 characters of overlap works well for source code: in testing on JS/TS repos, chunks below 1000 characters lost function-level context, while chunks above 2500 caused retrieval precision to drop because irrelevant code co-occurred in the same chunk.
const fs = require('fs');
const path = require('path');
const EXTENSIONS = ['.js', '.ts', '.md', '.json'];
const CHUNK_SIZE = 1800;
const OVERLAP = 200;
// Fail fast on misconfiguration that would cause an infinite loop
if (OVERLAP >= CHUNK_SIZE) {
throw new Error(`OVERLAP (${OVERLAP}) must be less than CHUNK_SIZE (${CHUNK_SIZE})`);
}
function getFiles(dir, files = [], visited = new Set()) {
let entries;
try {
entries = fs.readdirSync(dir, { withFileTypes: true });
} catch (err) {
console.warn(`Skipping unreadable directory ${dir}: ${err.message}`);
return files;
}
for (const entry of entries) {
const full = path.join(dir, entry.name);
if (entry.isDirectory() && !entry.name.startsWith('.') && entry.name !== 'node_modules') {
const real = (() => { try { return fs.realpathSync(full); } catch { return full; } })();
if (visited.has(real)) {
console.warn(`Cycle detected, skipping: ${full}`);
continue;
}
visited.add(real);
getFiles(full, files, visited);
} else if (entry.isFile() && EXTENSIONS.includes(path.extname(entry.name))) {
files.push(full);
}
}
return files;
}
function chunkCode(content, filePath) {
const isFunctionAware = /\.(js|ts)$/.test(filePath);
if (isFunctionAware) {
const funcRegex = /^(?:export\s+)?(?:async\s+)?function\s+\w+|^const\s+\w+\s*=\s*(?:async\s*)?\(/gm;
const boundaries = [...content.matchAll(funcRegex)].map(m => m.index);
if (boundaries.length > 1) {
return boundaries.map((start, i) => {
const end = boundaries[i + 1] || content.length;
return { text: content.slice(start, end).trim(), file: filePath };
}).filter(c => c.text.length > 50);
}
}
// Fallback: fixed-size chunking with overlap
const chunks = [];
for (let i = 0; i < content.length; i += CHUNK_SIZE - OVERLAP) {
chunks.push({ text: content.slice(i, i + CHUNK_SIZE), file: filePath });
}
return chunks;
}
function ingestCodebase(rootDir) {
const files = getFiles(rootDir);
const allChunks = files.flatMap(f => chunkCode(fs.readFileSync(f, 'utf-8'), f));
console.log(`Ingested ${allChunks.length} chunks from ${files.length} files`);
return allChunks;
}
module.exports = { ingestCodebase, getFiles, chunkCode };
This script recursively walks a project directory, filters by extension, and applies function-boundary chunking for function declarations and const-assigned function expressions in JS/TS files. Arrow functions, class methods, and object method shorthand are not detected by this regex and fall back to fixed-size chunking. For broader coverage of modern JavaScript and TypeScript patterns, consider a parser-based approach such as tree-sitter. For other file types, the script uses fixed-size chunking with overlap.
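If a full parser is more than you need, the boundary regex can be extended to catch some of those missing patterns. The sketch below adds single-parameter arrow functions and indented class/object methods, with a negative lookahead to avoid matching control-flow statements like `if (...) {`. It remains a heuristic — nested functions and unusual formatting will still slip through, which is where tree-sitter earns its keep.

```javascript
// Broader chunk-boundary heuristic than the tutorial's regex: also matches
// `const f = x =>` arrows and class/object methods. Still not a parser.
const BOUNDARY_REGEX = new RegExp(
  [
    String.raw`^(?:export\s+)?(?:async\s+)?function\s+\w+`,             // function declarations
    String.raw`^(?:export\s+)?const\s+\w+\s*=\s*(?:async\s*)?\(`,       // const fn = (...) => / function exprs
    String.raw`^(?:export\s+)?const\s+\w+\s*=\s*(?:async\s+)?\w+\s*=>`, // const fn = x =>
    // class/object methods, excluding control-flow keywords that look similar
    String.raw`^\s+(?!if\b|for\b|while\b|switch\b|catch\b)(?:async\s+)?\w+\s*\([^)]*\)\s*\{`,
  ].join('|'),
  'gm'
);

function findBoundaries(source) {
  return [...source.matchAll(BOUNDARY_REGEX)].map(m => m.index);
}
```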
Step 2: Generating Embeddings Locally
Local embedding generation ensures that proprietary code never leaves the developer's machine. The nomic-embed-text model running through Ollama produces 768-dimensional embeddings and handles code-related content well enough for retrieval over typical project codebases. For teams that already have an OpenAI API key and fewer privacy constraints, text-embedding-3-small is a viable alternative. Evaluate both on a representative sample of your code before committing.
Before running this step: Ensure the Ollama daemon is running (ollama serve) and the embedding model is pulled (ollama pull nomic-embed-text). If using the OpenAI path, store your API key in an environment variable (export OPENAI_API_KEY=...) and pass process.env.OPENAI_API_KEY rather than hardcoding it.
const { Ollama } = require('ollama');
const { withTimeout } = require('./utils');
const OLLAMA_HOST = process.env.OLLAMA_HOST || 'http://localhost:11434';
const ollama = new Ollama({ host: OLLAMA_HOST });
async function embedChunks(chunks, { useLocal = true, openaiKey = null } = {}) {
const embeddings = [];
const failures = [];
for (const chunk of chunks) {
try {
if (useLocal) {
const response = await withTimeout(
ollama.embeddings({
model: 'nomic-embed-text',
prompt: chunk.text,
}),
15000
);
// Ollama npm v0.5.x returns { embedding: number[] } (singular)
const vector = response.embedding;
if (!Array.isArray(vector) || vector.length === 0) {
throw new Error(`Unexpected embedding shape from Ollama: ${JSON.stringify(Object.keys(response))}`);
}
embeddings.push({ ...chunk, embedding: vector });
} else if (openaiKey) {
const res = await fetch('https://api.openai.com/v1/embeddings', {
method: 'POST',
headers: { 'Authorization': `Bearer ${openaiKey}`, 'Content-Type': 'application/json' },
body: JSON.stringify({ model: 'text-embedding-3-small', input: chunk.text }),
});
if (!res.ok) {
const body = await res.text();
throw new Error(`OpenAI API error ${res.status}: ${body.slice(0, 200)}`);
}
const data = await res.json();
if (!data?.data?.[0]?.embedding) {
throw new Error(`Unexpected OpenAI response shape: ${JSON.stringify(data).slice(0, 200)}`);
}
embeddings.push({ ...chunk, embedding: data.data[0].embedding });
}
} catch (err) {
console.error(`Failed to embed chunk from ${chunk.file}:`, err.message);
failures.push({ chunk, error: err.message });
}
}
if (failures.length > 0) {
throw new Error(
`Embedding failed for ${failures.length}/${chunks.length} chunks. ` +
`First failure: ${failures[0].error}. Aborting to prevent partial index.`
);
}
console.log(`Generated ${embeddings.length} embeddings`);
return embeddings;
}
module.exports = { embedChunks };
// utils.js — timeout helper for async operations
function withTimeout(promise, ms) {
let timer;
const timeout = new Promise((_, reject) => {
timer = setTimeout(() => reject(new Error(`Operation timed out after ${ms}ms`)), ms);
});
return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
module.exports = { withTimeout };
The function processes chunks sequentially to avoid overwhelming the local inference server. Each embedding call has a 15-second timeout to prevent the pipeline from hanging indefinitely if Ollama stalls. If any chunk fails to embed, the function throws an error rather than silently producing an incomplete index. For large codebases, sequential embedding will be slow; consider batching with a concurrency limit (e.g., processing 5 chunks in parallel) and adding a progress indicator.
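That batching suggestion can be sketched with a simple worker-pool pattern. Here `embedOne` is a stand-in for a single embedding call (e.g., the Ollama branch of `embedChunks`); results come back in the original chunk order regardless of completion order.

```javascript
// Concurrency-limited embedding sketch: `limit` workers pull indices from a
// shared counter. Index claiming is safe because the event loop is
// single-threaded — `next++` runs to completion before any await resumes.
async function embedWithConcurrency(chunks, embedOne, limit = 5) {
  const results = new Array(chunks.length);
  let next = 0;
  async function worker() {
    while (next < chunks.length) {
      const i = next++;                    // claim the next unprocessed chunk
      results[i] = await embedOne(chunks[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, chunks.length) }, worker)
  );
  return results;
}
```

Keep the limit modest (around 5) so the local Ollama server is not overwhelmed, and reuse the tutorial's `withTimeout` wrapper inside `embedOne`.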
Step 3: Storing Vectors in ChromaDB
ChromaDB requires zero external infrastructure, supports persistent storage, and offers both Python and JavaScript clients, making it a practical local vector store.
⚠️ Security warning: ChromaDB's default HTTP server has no authentication. Any local process can read your indexed codebase vectors by hitting http://localhost:8000. For private code, restrict the port with a firewall rule (e.g., ufw deny from any to any port 8000 then ufw allow from 127.0.0.1 to any port 8000), or use ChromaDB's ephemeral in-process client for local-only use.
Before running this step: Start ChromaDB locally with chroma run --path ./chroma_data. Verify this command against your installed ChromaDB version (chroma --help).
Collections in ChromaDB store embeddings alongside metadata, which is essential for tracing retrieved results back to their source files.
const { ChromaClient } = require('chromadb');
const crypto = require('crypto');
const CHROMA_HOST = process.env.CHROMA_HOST || 'http://localhost:8000';
const client = new ChromaClient({ path: CHROMA_HOST });
async function storeEmbeddings(embeddedChunks, collectionName = 'codebase') {
const collection = await client.getOrCreateCollection({ name: collectionName });
// Content-addressed IDs using 128-bit prefix to reduce collision risk
const ids = embeddedChunks.map(c =>
crypto.createHash('sha256').update(c.file + c.text).digest('hex').slice(0, 32)
);
const embeddings = embeddedChunks.map(c => c.embedding);
const documents = embeddedChunks.map(c => c.text);
const metadatas = embeddedChunks.map(c => ({
file: c.file,
language: c.file.split('.').pop(),
indexed_at: new Date().toISOString(),
}));
await collection.upsert({ ids, embeddings, documents, metadatas });
const count = await collection.count();
console.log(`Upserted ${embeddedChunks.length} chunks; collection '${collectionName}' now contains ${count} total chunks`);
return collection;
}
module.exports = { storeEmbeddings };
The upsert operation allows re-indexing without duplicating entries. Content-addressed IDs (based on file path + chunk text) ensure that re-indexing after adding or removing files does not corrupt existing entries. Positional IDs like chunk_0 silently overwrite wrong vectors when files change, because the same index maps to different content.
Step 4: Querying the Knowledge Base
The retrieval function performs similarity search against the vector store and returns ranked results. The k parameter controls how many chunks are returned; for code context, a value between 5 and 10 typically provides enough relevant material without flooding the agent's context window.
Both Ollama and ChromaDB must be running before calling this function.
// retrieve.js — module-scoped clients to avoid per-call re-instantiation
const { ChromaClient } = require('chromadb');
const { Ollama } = require('ollama');
const CHROMA_HOST = process.env.CHROMA_HOST || 'http://localhost:8000';
const OLLAMA_HOST = process.env.OLLAMA_HOST || 'http://localhost:11434';
const chromaClient = new ChromaClient({ path: CHROMA_HOST });
const ollamaClient = new Ollama({ host: OLLAMA_HOST });
async function retrieveContext(query, k = 5, collectionName = 'codebase') {
if (!query || typeof query !== 'string' || query.trim().length === 0) {
throw new Error('retrieveContext: query must be a non-empty string');
}
const collection = await chromaClient.getCollection({ name: collectionName });
const queryEmbedding = await ollamaClient.embeddings({ model: 'nomic-embed-text', prompt: query });
// Ollama npm v0.5.x returns { embedding: number[] } (singular)
const vector = queryEmbedding.embedding;
if (!Array.isArray(vector)) {
throw new Error(`Bad query embedding shape: ${JSON.stringify(Object.keys(queryEmbedding))}`);
}
const results = await collection.query({
queryEmbeddings: [vector],
nResults: k,
});
if (!results?.documents?.[0] || !results?.distances?.[0]) {
return [];
}
return results.documents[0].map((doc, i) => ({
text: doc,
file: results.metadatas[0][i]?.file ?? 'unknown',
distance: results.distances[0][i],
}));
}
module.exports = { retrieveContext };
This function embeds the query using the same model used for indexing (a requirement for consistent similarity measurement), queries ChromaDB, and returns results annotated with file paths and distance scores. ChromaDB defaults to L2 distance; lower values indicate higher similarity under this metric. If you configure cosine distance on your collection, interpretation is the same direction but the scale differs. Client instances are created at module scope to avoid connection overhead in agent loops that make many queries.
Connecting Your RAG Pipeline to Claude Code
Architecture Overview
The integration pattern uses a lightweight middleware script that sits between the developer and the coding agent. When a developer issues a task, the middleware intercepts it, runs a retrieval query against the local vector store, formats the results into a structured context block, and injects that context into the agent's prompt. The agent then operates with grounded knowledge of relevant internal code and documentation. The flow is: User Task → RAG Middleware → Vector Search → Context Injection → Agent (Claude Code) → Grounded Edit.
Integration with Claude Code
Claude Code supports custom instructions through CLAUDE.md files and system prompt configuration. The integration prepends retrieved context to the task description before Claude Code processes it.
Platform note: The shell script and execSync-based orchestrator target Unix shells (bash/zsh). On Windows, use WSL or use the spawnSync alternative shown in the orchestrator below.
retrieve-and-print.js — a helper script that reads the task from an environment variable, avoiding shell injection:
// retrieve-and-print.js
const { retrieveContext } = require('./retrieve');
const MAX_TASK_LENGTH = 8192;
(async () => {
const task = process.env.TASK;
if (!task) {
console.error('TASK environment variable is required');
process.exit(1);
}
if (task.length > MAX_TASK_LENGTH) {
console.error(`TASK too long (${task.length} chars, max ${MAX_TASK_LENGTH})`);
process.exit(1);
}
try {
const results = await retrieveContext(task, 7);
const formatted = results.map(r =>
`--- File: ${r.file} (distance: ${r.distance.toFixed(3)}) ---\n${r.text}`
).join('\n');
console.log(formatted);
} catch (err) {
console.error('Retrieval failed:', err.message);
process.exit(1);
}
})();
rag-claude.sh — RAG middleware for Claude Code:
#!/bin/bash
# rag-claude.sh — RAG middleware for Claude Code
# Uses env var to pass the task safely, avoiding shell injection
export TASK="$*"
CONTEXT=$(node retrieve-and-print.js) || { echo "Retrieval failed" >&2; exit 1; }
AUGMENTED_PROMPT="## Retrieved Codebase Context
The following code snippets were retrieved from the local codebase and are directly relevant to this task. Use them as your primary reference. Cite file paths when making changes.
${CONTEXT}
## Task
${TASK}"
# Verify the correct non-interactive flag for your Claude Code version with: claude --help
echo "$AUGMENTED_PROMPT" | claude --print
// orchestrator.js — Programmatic alternative (cross-platform safe)
const { retrieveContext } = require('./retrieve');
const { spawnSync } = require('child_process');
async function runWithRAG(task) {
const results = await retrieveContext(task, 7);
const contextBlock = results.map(r =>
`--- File: ${r.file} (distance: ${r.distance.toFixed(3)}) ---
${r.text}`
).join('\n');
const augmentedPrompt = `## Retrieved Codebase Context
${contextBlock}
## Task
${task}`;
// Using spawnSync avoids shell interpolation entirely — safe on all platforms
// Verify the correct non-interactive flag for your Claude Code version with: claude --help
const result = spawnSync('claude', ['--print'], {
input: augmentedPrompt,
encoding: 'utf8',
stdio: ['pipe', 'pipe', 'inherit'], // capture stdout for programmatic use
});
if (result.error) {
console.error('Failed to spawn claude:', result.error.message);
process.exit(1);
}
if (result.status !== 0) {
console.error('Claude Code exited with status', result.status);
process.exit(1);
}
console.log(result.stdout);
return result.stdout;
}
runWithRAG(process.argv.slice(2).join(' ')).catch(err => {
console.error('runWithRAG failed:', err.message);
process.exit(1);
});
The shell script and Node.js orchestrator accomplish the same thing: they retrieve relevant chunks, format them with file path provenance, and pipe the augmented prompt into Claude Code. The --print flag outputs the response without entering interactive mode. Verify this flag exists in your version with claude --help.
Preventing Hallucination in Autonomous Editing
Why RAG Alone Isn't Enough
Retrieval quality degrades when indexes go stale after recent code changes, surfacing outdated information. Poor chunk boundaries split critical logic across two chunks, leaving neither sufficiently informative. Similarity search can also surface syntactically similar but semantically irrelevant results, especially in codebases with repeated patterns. If retrieved context is ambiguous or insufficient, the agent will still hallucinate with the false confidence of having "looked something up."
Guardrail Strategies for Grounded Edits
Start by setting a maximum distance score for retrieved results. If no chunk falls below the threshold, inject an explicit statement: "No relevant internal documentation was found for this task." Injecting noise misleads the agent more than having no context at all.
Every injected chunk should include its source file path and the indexed timestamp. This lets both the agent and human reviewers verify where information came from and whether it is current.
After the agent produces edits, a post-processing step compares the diff against retrieved context and flags modifications that reference APIs, functions, or patterns absent from any retrieved chunk. This is not foolproof but catches obvious fabrications. For example, if the agent calls db.performQuery() but no retrieved chunk contains that method signature, the diff check surfaces the discrepancy.
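A minimal version of that diff check might look like the following. The identifier extraction is deliberately crude — a regex over added diff lines — so treat it as a starting point rather than a complete verifier; a real check would parse the diff and resolve symbols properly.

```javascript
// Fabrication-check sketch: flag method calls that appear in the agent's
// added diff lines but in none of the retrieved chunks.
function findUngroundedCalls(diffText, retrievedChunks) {
  const addedLines = diffText
    .split('\n')
    .filter(l => l.startsWith('+') && !l.startsWith('+++')); // skip file headers
  const callRegex = /\.(\w+)\s*\(/g;                         // matches `.method(`
  const corpus = retrievedChunks.map(c => c.text).join('\n');
  const flagged = new Set();
  for (const line of addedLines) {
    for (const m of line.matchAll(callRegex)) {
      const method = m[1];
      if (!corpus.includes(`${method}(`) && !corpus.includes(`.${method}`)) {
        flagged.add(method); // no retrieved chunk mentions this call
      }
    }
  }
  return [...flagged];
}
```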
Timestamp metadata on chunks enables warnings when retrieved documentation exceeds a configurable age (for example, 30 days). Instruct the agent to treat stale context with lower confidence and recommend verification.
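The threshold, provenance, and staleness guardrails compose naturally in the context-formatting step. A sketch, with two assumptions: the retrieval results carry an `indexedAt` field from ChromaDB metadata (the tutorial's `retrieveContext` would need to return it alongside `file` and `distance`), and the 1.2 threshold is an arbitrary starting value to tune against your own L2 distance scores.

```javascript
// Guardrail sketch: drop low-similarity chunks, tag provenance, flag stale ones.
const MAX_DISTANCE = 1.2;      // assumption — tune against real retrieval scores
const STALE_AFTER_DAYS = 30;

function buildContextBlock(results, now = Date.now()) {
  const grounded = results.filter(r => r.distance <= MAX_DISTANCE);
  if (grounded.length === 0) {
    // Explicitly tell the agent nothing relevant was found — better than noise
    return 'No relevant internal documentation was found for this task.';
  }
  return grounded.map(r => {
    const ageDays = r.indexedAt
      ? (now - Date.parse(r.indexedAt)) / 86_400_000
      : null;
    const staleFlag = ageDays !== null && ageDays > STALE_AFTER_DAYS
      ? ' [STALE: verify before relying on this]'
      : '';
    return `--- File: ${r.file} (distance: ${r.distance.toFixed(3)})${staleFlag} ---\n${r.text}`;
  }).join('\n\n');
}
```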
Practical Prompt Engineering for RAG-Augmented Agents
You are a coding assistant with access to retrieved context from the project's codebase.
Rules:
1. Base all code changes ONLY on the retrieved context provided above.
2. When referencing a function, type, or pattern, cite the source file path.
3. If the retrieved context does not contain information relevant to the task, state: "I do not have sufficient context to make this change confidently" and stop.
4. Do NOT invent API signatures, configuration options, or architectural patterns not present in the retrieved context.
5. If retrieved context is marked as older than 30 days, note this in your response and recommend the developer verify the information is current.
6. Do NOT edit files for which no context was retrieved unless the task explicitly requires creating new files.
7. When uncertain, ask a clarifying question rather than guessing.
This prompt template enforces grounded behavior by making the agent's obligations explicit. The instruction to refuse edits without sufficient context is particularly important for autonomous workflows where the agent might otherwise proceed with fabricated assumptions.
Implementation Checklist and Complete Pipeline Reference
Full Implementation Checklist
- Clone awesome-llm-apps and explore the local RAG pipeline templates in rag_tutorials/
- Set up a Python virtual environment (python -m venv .venv && source .venv/bin/activate) and install Python dependencies
- Install the Node.js dependencies (the chromadb and ollama npm clients), set up the ChromaDB server (pip install chromadb==0.5.23), and pull nomic-embed-text with Ollama
- Start both Ollama (ollama serve) and ChromaDB (chroma run --path ./chroma_data) as background services
- Configure file extension filters and chunk strategy for your codebase (function-aware for JS/TS function declarations and const-assigned expressions; fixed-size for other file types and patterns)
- Run the ingestion script against your target codebase and verify the chunk count in console output
- Generate embeddings locally using the Ollama endpoint and confirm embedding count matches chunk count
- Store embeddings in ChromaDB and verify the collection count with collection.count()
- Test retrieval with 3 to 5 sample queries that reflect real tasks in your codebase; inspect distance scores and file paths
- Set up the middleware wrapper script for Claude Code (or adapt for your agent of choice)
- Configure hallucination guardrails: set distance threshold, enable provenance tagging, add the grounded system prompt
- Run an end-to-end test by giving the agent a real task and reviewing whether its edits cite retrieved context accurately
- Set up automatic re-indexing for changed files (e.g., using chokidar: npm install chokidar and implement a file-watcher script, or use a cron job such as */5 * * * * node /path/to/reindex.js)
This checklist covers every step from initial setup to a working local integration. It does not cover production concerns like monitoring, logging, backup, or horizontal scaling. Each step is independently verifiable, making it easier to diagnose where a pipeline might be failing.
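The re-indexing item above benefits from debouncing, so a burst of saves triggers one re-index per file rather than many. A sketch: `reindexFile` is a hypothetical function standing in for your ingest → embed → upsert path for a single file, and chokidar must be installed separately (npm install chokidar).

```javascript
// Debounce sketch for file-watcher re-indexing: per-file timers are reset on
// each event, so rapid saves collapse into a single re-index call.
function makeDebounced(fn, delayMs = 500) {
  const timers = new Map();
  return (key) => {
    clearTimeout(timers.get(key));
    timers.set(key, setTimeout(() => {
      timers.delete(key);
      fn(key);
    }, delayMs));
  };
}

// Illustrative chokidar wiring (not executed here; assumes reindexFile exists):
// const chokidar = require('chokidar');
// const reindex = makeDebounced(filePath => reindexFile(filePath), 500);
// chokidar.watch('./src', { ignored: /node_modules/ }).on('change', reindex);

module.exports = { makeDebounced };
```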
Common Pitfalls
- Both Ollama and ChromaDB must be running as background processes before any pipeline script executes. There are no startup checks in the code. If either is down, scripts will throw connection errors.
- ChromaDB's JavaScript client API has changed across major versions. If you see a TypeError on getOrCreateCollection, upsert, or query, check that your installed chromadb npm version matches the pinned version in this tutorial.
- The ollama npm package's embeddings() method returns { embedding: number[] } (singular). If you see undefined embeddings downstream, log the raw response with console.log(response) and verify the field name.
- The bash middleware and pipe-based patterns do not work on Windows outside WSL. Use the spawnSync-based orchestrator for cross-platform support.
- The code samples include error handling for embedding failures, API errors, and missing results. Review the withTimeout utility and the failure-aggregation pattern in embedChunks as starting points; extend them to cover additional edge cases in your deployment.
Next Step
Add a file watcher (e.g., chokidar) to re-index on save and measure retrieval precision over a week of real usage. That feedback loop will tell you whether your chunk boundaries, embedding model, and distance thresholds are actually working for your codebase, or whether they need tuning. The awesome-llm-apps repository includes additional templates for multi-modal and agentic RAG patterns worth exploring once the base pipeline is stable.