
Local RAG Without the Cloud: Building Private Document Q&A Systems


Retrieval Augmented Generation has become the dominant pattern for building document Q&A systems, and for good reason: it grounds large language model responses in actual source material. But the standard cloud-based RAG pipeline, where document chunks get embedded via OpenAI's API, stored in Pinecone, and answered by GPT-4, sends every byte of your proprietary data through third-party servers. For organizations handling regulated health records, privileged legal documents, classified defense materials, or sensitive financial data, that pipeline is a non-starter. Local RAG eliminates that exposure entirely. This tutorial walks through building a complete private document Q&A system, from ingestion to answer generation, with zero cloud dependencies. By the end, readers will have a working architecture they can deploy on a single machine behind an air gap if needed.

Local RAG Architecture Overview

How Cloud RAG Differs from Local RAG

A typical cloud-dependent pipeline works like this: you chunk documents and send them to OpenAI's embedding API. A managed service like Pinecone or Weaviate Cloud stores the resulting vectors. GPT-4 then answers user queries using those vectors as context. Every step involves network calls to infrastructure outside your control.

A fully local pipeline replaces each of those components. Local embedding models from Hugging Face or Ollama handle vectorization. Chroma or LanceDB stores vectors on local disk. A local LLM running through Ollama (Llama 3, Mistral, or similar) generates answers. The trade-offs are real: cloud pipelines offer lower latency on lightweight hardware and require less operational setup, while local pipelines demand more RAM and ideally a GPU, but provide absolute data sovereignty, eliminate per-token API costs at scale, and work in air-gapped environments where HIPAA, GDPR, or ITAR compliance mandates zero external data transmission.

Component Breakdown

The local RAG stack has five layers:

  1. The document loader and chunker reads PDFs, Markdown, and plain text, then splits them into retrieval-friendly segments.
  2. A local embedding model converts text chunks into dense vectors using sentence-transformers or Ollama-hosted models.
  3. The vector database stores and indexes vectors for similarity search. Chroma and LanceDB are the two strongest options for embedded, local-first use.
  4. For LLM inference, Ollama serves models like Llama 3 (8B) and Mistral (7B) through a simple HTTP interface.
  5. LangChain acts as the orchestration layer, tying the pieces together.

Architecture Diagram:

Document Files (PDF/MD/TXT)
             │
             ▼
  ┌──────────────────────┐
  │ Chunker              │  (RecursiveCharacterTextSplitter)
  └──────────┬───────────┘
             ▼
  ┌──────────────────────┐
  │ Local Embedding Model│  (sentence-transformers / Ollama)
  └──────────┬───────────┘
             ▼
  ┌──────────────────────┐
  │ Vector Database      │  (Chroma or LanceDB)
  └──────────┬───────────┘
             ▼
  ┌──────────────────────┐
  │ Query Pipeline       │  Embed query → Retrieve top-k → Rerank
  └──────────┬───────────┘
             ▼
  ┌──────────────────────┐
  │ Local LLM (Ollama)   │  (Llama 3 / Mistral)
  └──────────┬───────────┘
             ▼
    Grounded Answer with Source References

Setting Up the Local Environment

Hardware Considerations

Local RAG is accessible on hardware most developers already own. Minimum viable specs: 16 GB system RAM, a modern multi-core CPU, and Python 3.9 or later. That handles embedding generation and vector storage without trouble, and can run quantized 7B-parameter LLMs, though expect 10-60 seconds per response on CPU-only 7B inference. For a responsive experience, a GPU with 8 GB or more of VRAM (NVIDIA RTX 3060 or better) cuts LLM inference time by roughly 3-10x depending on model size and quantization level. Apple Silicon users benefit from Metal GPU acceleration via Ollama's llama.cpp backend, making M1/M2/M3 Macs with 16 GB unified memory capable local RAG machines.

Installing Dependencies

Note: Llama 3 models require accepting Meta's community license at llama.meta.com before use.

# Verify Python version (3.9+ required)
python --version

# Create a virtual environment
python -m venv local-rag-env
source local-rag-env/bin/activate  # On Windows: .\local-rag-env\Scripts\activate

# Install Python packages (pin versions for reproducibility)
pip install langchain==0.2.16 langchain-community==0.2.16 \
    langchain-huggingface==0.0.6 langchain-ollama==0.1.3 \
    chromadb==0.5.5 lancedb==0.6.13 \
    sentence-transformers pypdf tiktoken pandas pyarrow unstructured

# Install Ollama (macOS/Linux)
# DO NOT run in production without verifying checksum.
# Review the install script before execution, or use the signed package
# installer from https://ollama.com/download for production environments.
# Windows users: download the installer from https://ollama.com/download.
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama daemon (starts automatically on macOS after installation)
# Bind to localhost only for security — see Security Considerations below.
OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Verify Ollama is running
curl http://127.0.0.1:11434

# Pull models for local inference — pin explicit size tags for reproducibility
ollama pull llama3.2:3b    # Llama 3.2 3B (~2 GB). For an 8B model, pull llama3.1:8b.
ollama pull nomic-embed-text

The ollama pull llama3.2:3b command downloads the Llama 3.2 3B model. Pinning to an explicit size tag (for example, :3b) prevents silent model substitution when Ollama updates its registry defaults. The nomic-embed-text model provides a strong local embedding alternative with a default output dimension of 768 (it supports reduced dimensions via Matryoshka truncation).

Document Ingestion and Chunking

Loading Documents Locally

Supporting multiple file formats is straightforward with LangChain's loader ecosystem. PyPDFLoader handles PDFs, TextLoader covers plain text, and UnstructuredMarkdownLoader parses Markdown (requires the unstructured package, included in the install command above). The critical design choice is chunking strategy: chunks that are too large dilute relevance during retrieval, while chunks that are too small lose context. A chunk_size of 512 characters with 50 characters of overlap provides a reliable starting point for most document types. To chunk by tokens instead, replace length_function=len with a tiktoken-based counter (e.g., length_function=lambda text: len(tiktoken.get_encoding("cl100k_base").encode(text))) and adjust the size value accordingly—512 characters is roughly 100–170 tokens.
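For capacity planning, the chunk count a corpus will produce can be estimated from these two parameters. This is a simplified fixed-width model — the real splitter respects separator boundaries, so actual counts will differ somewhat:

```python
import math

def estimate_chunk_count(total_chars: int, chunk_size: int = 512, overlap: int = 50) -> int:
    """Rough estimate of chunks from fixed-size splitting with overlap."""
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # each new chunk advances by size minus overlap
    return 1 + math.ceil((total_chars - chunk_size) / step)

print(estimate_chunk_count(100_000))  # → 217 chunks for a 100,000-character corpus
```

At the 512/50 defaults, each chunk advances the window by 462 characters, so storage and embedding time scale roughly linearly with corpus size.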

Implementing a Chunking Pipeline

import os
import logging
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

logger = logging.getLogger(__name__)

# Ensure the documents directory exists and contains files
# mkdir -p documents && cp /your/files/*.pdf documents/

# Load all PDFs and text files from a directory
pdf_loader = DirectoryLoader("./documents", glob="**/*.pdf", loader_cls=PyPDFLoader)
text_loader = DirectoryLoader("./documents", glob="**/*.txt", loader_cls=TextLoader)

documents = pdf_loader.load() + text_loader.load()
assert len(documents) > 0, "No documents loaded — check ./documents path and contents"
print(f"Loaded {len(documents)} raw document pages/files")

# Configure chunking (chunk_size and chunk_overlap are in characters when using len)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)
assert len(chunks) > 0, "Splitter produced zero chunks — check document contents"
print(f"Created {len(chunks)} chunks")
print(f"Sample chunk:\n{chunks[0].page_content[:200]}...")

The RecursiveCharacterTextSplitter tries to split on paragraph boundaries first, falling back to sentences and then words. This preserves semantic coherence within chunks far better than naive fixed-length splitting.
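The fallback order can be illustrated with a toy version. This is not LangChain's implementation — it drops the separators and skips the merge step the real splitter performs — but it shows the recursion:

```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " ", ""), chunk_size=512):
    """Toy recursive splitter: try the coarsest separator first, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard fixed-width split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces

doc = "Short intro paragraph.\n\n" + "A much longer paragraph sentence. " * 20
chunks = recursive_split(doc, chunk_size=200)
print(len(chunks), max(len(c) for c in chunks))  # → 21 32
```

The short paragraph survives intact as one chunk, while the oversized paragraph falls through to sentence-level splits — the behavior that keeps semantically related text together.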

Local Embedding Models: Choosing and Using Them

Model Options Compared

Model                     | Dimensions                                              | Relative Speed | MTEB Retrieval Score | Memory Footprint (model weights)
--------------------------|---------------------------------------------------------|----------------|----------------------|---------------------------------
all-MiniLM-L6-v2          | 384                                                     | Fast           | ~49                  | ~80 MB
nomic-embed-text (Ollama) | 768 (default; supports lower via Matryoshka truncation) | Moderate       | ~55                  | ~270 MB
BAAI/bge-base-en-v1.5     | 768                                                     | Moderate       | ~53                  | ~440 MB

MTEB scores are approximate and vary by benchmark version and dataset subset. Refer to the MTEB leaderboard for current rankings. Memory figures are model weights only; runtime RSS will be higher depending on batch size.

all-MiniLM-L6-v2 is the speed champion, suitable for prototyping or resource-constrained environments. nomic-embed-text, served through Ollama, delivers stronger retrieval quality at 768 dimensions and integrates cleanly with the same Ollama runtime used for LLM inference. BAAI/bge-base-en-v1.5 offers a solid balance and is widely benchmarked on MTEB (Massive Text Embedding Benchmark).

Generating Embeddings Locally

The embedding step must produce a vectors list regardless of which backend you choose. Select one embedding backend and run the corresponding block — both storage sections below depend on vectors and embeddings_model being defined.

import time

# ──────────────────────────────────────────────────────────────────────
# Select ONE embedding backend. Both Chroma and LanceDB sections below
# depend on `embeddings_model` and `vectors` being defined here.
# ──────────────────────────────────────────────────────────────────────
EMBEDDING_BACKEND = "huggingface"  # Change to "ollama" to use Ollama embeddings

if EMBEDDING_BACKEND == "huggingface":
    from langchain_huggingface import HuggingFaceEmbeddings
    embeddings_model = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},  # Change to "cuda" for GPU
    )

elif EMBEDDING_BACKEND == "ollama":
    from langchain_ollama import OllamaEmbeddings
    embeddings_model = OllamaEmbeddings(model="nomic-embed-text")

else:
    raise ValueError(f"Unknown EMBEDDING_BACKEND: {EMBEDDING_BACKEND!r}")

# Embed the full corpus for use in vector storage below
assert len(chunks) > 0, "No chunks to embed"
texts = [chunk.page_content for chunk in chunks]

start = time.time()
vectors = embeddings_model.embed_documents(texts)
elapsed = time.time() - start

print(f"Embedded {len(vectors)} chunks in {elapsed:.2f}s ({elapsed/len(vectors):.3f}s per chunk)")
print(f"Vector dimension: {len(vectors[0])}")

On a modern multi-core CPU, all-MiniLM-L6-v2 embeds roughly 100 chunks per second, though throughput varies with hardware and chunk length. The Ollama-based model is slower per call due to the HTTP overhead but produces higher-quality vectors for retrieval tasks.
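Both vector stores below rank chunks by cosine similarity (or its complement, cosine distance) between the query vector and each stored vector. The computation itself is simple:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(round(cosine_similarity([0.2, 0.7, 0.1], [0.2, 0.7, 0.1]), 6))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))                      # → 0.0
```

Cosine distance is 1 minus this value — which is why the Chroma query snippet below prints `1 - dist` as a relevance score.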

Vector Storage: Chroma vs. LanceDB

Chroma: Setup and Indexing

Chroma operates in-process, persists to disk via SQLite, and requires no external server. It handles collection creation, metadata storage, and approximate nearest neighbor search through a clean Python API. For moderate-scale document sets (tens of thousands to low hundreds of thousands of chunks), it works well and gets developers productive fast.

import os
import logging
import chromadb

logger = logging.getLogger(__name__)

# Disable telemetry via environment variable (stable across Chroma versions)
os.environ["ANONYMIZED_TELEMETRY"] = "False"

# Initialize persistent Chroma client
client = chromadb.PersistentClient(path="./vector_store/chroma")

collection = client.get_or_create_collection(
    name="local_docs",
    metadata={"hnsw:space": "cosine"}
)

# Batch insert — O(1) round-trips per batch instead of O(n) single-record adds
batch_size = 512  # modest batch size keeps per-call memory bounded; Chroma enforces a max batch size
for start in range(0, len(chunks), batch_size):
    batch_chunks = chunks[start:start + batch_size]
    batch_vectors = vectors[start:start + batch_size]

    collection.add(
        ids=[f"chunk_{start + j}" for j in range(len(batch_chunks))],
        embeddings=[v for v in batch_vectors],
        documents=[c.page_content for c in batch_chunks],
        metadatas=[
            {
                "source": c.metadata.get("source", "unknown"),
                "chunk_id": start + j,
            }
            for j, c in enumerate(batch_chunks)
        ],
    )

print(f"Indexed {collection.count()} chunks in Chroma")

# Query
query_vector = embeddings_model.embed_query("What are the return conditions?")
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
    include=["documents", "metadatas", "distances"]
)

for doc, meta, dist in zip(results["documents"][0], results["metadatas"][0], results["distances"][0]):
    print(f"[Score: {1 - dist:.4f}] Source: {meta['source']}\n{doc[:150]}...\n")

LanceDB: Setup and Indexing

LanceDB takes a different approach. It uses the columnar Lance format, is designed for larger-scale local workloads, supports IVF-PQ indexing for datasets exceeding one million vectors, and handles versioned datasets natively. The API is DataFrame-oriented, which feels natural for data engineering workflows.

import lancedb
import pyarrow as pa

# Create LanceDB connection
db = lancedb.connect("./vector_store/lancedb")

# Build data with explicit schema for cosine metric indexing
schema = pa.schema([
    pa.field("text", pa.utf8()),
    pa.field("vector", pa.list_(pa.float32(), len(vectors[0]))),
    pa.field("source", pa.utf8()),
    pa.field("chunk_id", pa.int64()),
])

data = [
    {
        "text": chunk.page_content,
        "vector": vector,
        "source": chunk.metadata.get("source", "unknown"),
        "chunk_id": i,
    }
    for i, (chunk, vector) in enumerate(zip(chunks, vectors))
]

# Create table (or overwrite existing)
# WARNING: "overwrite" drops existing table. Change to mode="append" to preserve prior data.
table = db.create_table("local_docs", data=data, schema=schema, mode="overwrite")
print(f"Indexed {table.count_rows()} chunks in LanceDB")

# Explicit cosine metric on index creation; replace=True allows safe re-runs
table.create_index(metric="cosine", num_partitions=256, num_sub_vectors=96, replace=True)

# Query with cosine metric explicitly
query_vector = embeddings_model.embed_query("What are the return conditions?")
results = (
    table.search(query_vector)
    .metric("cosine")
    .limit(5)
    .to_pandas()
)

for _, row in results.iterrows():
    print(f"[Cosine Distance: {row['_distance']:.4f}] Source: {row['source']}\n{row['text'][:150]}...\n")

Head-to-Head Comparison

Feature                     | Chroma                                                | LanceDB
----------------------------|-------------------------------------------------------|-------------------------------------------
Storage format              | SQLite + HNSW (Chroma ≥0.4; earlier versions used DuckDB) | Lance (columnar)
Max practical dataset size  | ~500K vectors                                         | 1M+ vectors with explicit IVF-PQ indexing
Query latency (10K vectors) | Sub-millisecond                                       | Sub-millisecond
Disk usage                  | Moderate                                              | Lower (columnar compression)
API style                   | Collection-oriented                                   | DataFrame-oriented
Versioning                  | Manual                                                | Built-in
Ecosystem maturity          | More established, larger community                    | Newer, growing rapidly

Chroma is the better pick for prototyping, moderate document sets, and teams already using LangChain's Chroma integration. LanceDB fits better when you're working with larger corpora, need dataset versioning, plan for multimodal use cases, or run production workloads where columnar compression matters.

Building the Query Pipeline

Retrieval with Reranking

Naive top-k vector retrieval returns the chunks closest in embedding space, but embedding distances don't reliably predict which chunks answer a specific question. A lightweight cross-encoder reranker like cross-encoder/ms-marco-MiniLM-L-6-v2 scores each query-chunk pair directly, producing more accurate ranking. This model runs locally. On modern CPU hardware with the model already loaded, reranking 20 candidates typically takes 100-500ms; a GPU reduces this to under 50ms. Benchmark on your target hardware.


Connecting to a Local LLM via Ollama

import logging
from langchain_ollama import OllamaLLM
from sentence_transformers import CrossEncoder

logger = logging.getLogger(__name__)

# Initialize components
# Ensure Ollama is running: OLLAMA_HOST=127.0.0.1:11434 ollama serve
# Verify with: curl http://127.0.0.1:11434
llm = OllamaLLM(model="llama3.2:3b", temperature=0.1, client_kwargs={"timeout": 120.0})
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def query_local_rag(
    question: str,
    collection,
    embeddings_model,
    llm,
    reranker,
    top_k: int = 5,
) -> dict:
    if not question or not question.strip():
        raise ValueError("question must be a non-empty string")

    # Step 1: Embed the question
    try:
        query_vector = embeddings_model.embed_query(question)
    except Exception as exc:
        logger.error("Embedding failed: %s", exc)
        raise

    # Step 2: Retrieve candidates from Chroma
    try:
        results = collection.query(
            query_embeddings=[query_vector],
            n_results=top_k * 2,  # Over-fetch for reranking
            include=["documents", "metadatas"],
        )
    except Exception as exc:
        logger.error("Vector retrieval failed: %s", exc)
        raise

    # For LanceDB alternative:
    # results_df = table.search(query_vector).metric("cosine").limit(top_k * 2).to_pandas()

    candidates = results["documents"][0]
    metadatas = results["metadatas"][0]

    if not candidates:
        logger.warning("No candidates returned for question: %r", question)
        return {"answer": "No relevant documents found.", "sources": []}

    assert len(candidates) == len(metadatas), (
        f"Candidate/metadata length mismatch: {len(candidates)} vs {len(metadatas)}"
    )

    # Step 3: Rerank
    pairs = [[question, doc] for doc in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(
        zip(candidates, metadatas, scores),
        key=lambda x: x[2],
        reverse=True,
    )[:top_k]

    # Step 4: Build prompt with context
    context = "\n\n---\n\n".join(
        f"[Source: {meta['source']}]\n{doc}" for doc, meta, _ in ranked
    )

    prompt = (
        "Answer the question based only on the following context.\n"
        "If the context doesn't contain enough information, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

    # Step 5: Generate answer with local LLM
    try:
        answer = llm.invoke(prompt)
    except Exception as exc:
        logger.error("LLM invocation failed: %s", exc)
        raise

    # Step 6: Return answer with ordered-deduplicated sources
    sources = list(dict.fromkeys(meta["source"] for _, meta, _ in ranked))
    return {"answer": answer, "sources": sources}


# Example usage
result = query_local_rag(
    "What is the policy for returning damaged items?",
    collection,
    embeddings_model,
    llm,
    reranker,
)

print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

This pipeline embeds the user question with the same model used during ingestion (critical for consistent similarity scoring), over-fetches candidates, reranks them with a cross-encoder, constructs a grounded prompt, and sends it to Llama 3.2 running locally through Ollama. The 120-second client timeout prevents a stalled Ollama daemon from blocking the pipeline indefinitely.

Optimizations and Production Hardening

Chunking and Retrieval Tuning

Chunk size has an outsized impact on retrieval quality. Smaller chunks (256 characters) improve precision for fact-lookup queries but lose surrounding context. Larger chunks (1024 characters) preserve more context but may dilute relevance. There is no shortcut here: test multiple sizes against representative queries. Hybrid search, combining BM25 keyword matching with vector similarity, catches cases where exact terminology matters more than semantic closeness, particularly when documents contain domain-specific acronyms or codes that embeddings may not distinguish. LangChain's EnsembleRetriever supports this pattern. Metadata filtering (by document source, date, or category) becomes essential for multi-document collections to prevent cross-contamination of answers.
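Reciprocal rank fusion is one simple way to combine keyword and vector rankings. This is a sketch of the general technique, not the EnsembleRetriever API; the document IDs are hypothetical:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; the constant k dampens the dominance of top ranks."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

bm25_hits   = ["chunk_acronym", "chunk_policy", "chunk_faq"]      # keyword matches
vector_hits = ["chunk_policy", "chunk_returns", "chunk_acronym"]  # semantic matches
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# A chunk ranked well by both lists (chunk_policy) rises to the top.
```

Because scores depend only on rank positions, fusion works even though BM25 scores and cosine similarities live on incomparable scales.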

Performance and Scalability

Quantized embedding models (ONNX runtime with INT8 quantization) can improve embedding throughput; benchmark on your hardware to confirm. LanceDB supports IVF-PQ indexing for large datasets. After ingestion, create the index explicitly with replace=True to allow safe re-runs:

table.create_index(metric="cosine", num_partitions=256, num_sub_vectors=96, replace=True)

This keeps query latency flat beyond one million vectors. Batch ingestion with bulk inserts rather than single-record adds reduces indexing time by eliminating per-record round-trips. For repeated queries, caching the embedding and retrieval results avoids redundant computation.
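Query-embedding caching needs nothing more than functools.lru_cache. In this sketch, fake_embed is a stub standing in for embeddings_model.embed_query:

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize: int = 1024):
    """Wrap an embedding function so identical queries skip recomputation."""
    @lru_cache(maxsize=maxsize)
    def cached(query: str) -> tuple:
        return tuple(embed_fn(query))  # tuples are hashable and safe to cache
    return cached

calls = 0

def fake_embed(query: str) -> list[float]:  # stand-in for embeddings_model.embed_query
    global calls
    calls += 1
    return [float(len(query)), 0.0]

embed = make_cached_embedder(fake_embed)
embed("return policy"); embed("return policy"); embed("shipping times")
print(calls)  # → 2: the repeated query was served from the cache
```

For a multi-process deployment, the same idea extends to an on-disk or key-value cache keyed on a hash of the query string and embedding model name.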

Security Considerations

Set vector database storage directories to 0700 permissions (or the OS equivalent). Running Ollama in a sandboxed process or container limits the blast radius of any model-level vulnerability. By default, Ollama binds to 0.0.0.0:11434, exposing its API on all network interfaces. For air-gapped or compliance deployments, set OLLAMA_HOST=127.0.0.1:11434 in the environment to restrict the API to localhost only. For compliance-sensitive deployments, implement audit logging that records every query, the chunks retrieved, and the generated response, without logging the full document corpus.
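Locking down the storage directory can be done at creation time. This applies to POSIX systems; Windows uses ACLs instead. The path here is a temporary demo directory, not the pipeline's ./vector_store:

```python
import os
import stat
import tempfile

store_dir = os.path.join(tempfile.mkdtemp(), "vector_store")
os.makedirs(store_dir, mode=0o700, exist_ok=True)
os.chmod(store_dir, 0o700)  # explicit chmod: the makedirs mode is subject to umask

mode = stat.S_IMODE(os.stat(store_dir).st_mode)
print(oct(mode))  # → 0o700: owner-only read/write/execute
```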


Starter Template Walkthrough

The starter template below organizes the local RAG pipeline into a clean, configurable structure:

local-rag/
├── ingest.py          # Document loading, chunking, embedding, indexing
├── query.py           # Query pipeline with reranking and LLM generation
├── config.yaml        # Swap vector DB, embedding model, LLM here
├── requirements.txt   # All Python dependencies (pinned versions)
├── documents/         # Drop PDF/TXT/MD files here
└── vector_store/      # Persistent vector DB storage

The config.yaml file controls which vector database (Chroma or LanceDB), embedding model, and LLM to use. Switching from Chroma to LanceDB or from all-MiniLM-L6-v2 to nomic-embed-text requires changing a single line, not rewriting code.
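A plausible config.yaml for this layout might look like the following. The field names are illustrative — align them with whatever ingest.py and query.py actually read:

```yaml
vector_db: chroma            # chroma | lancedb
vector_store_path: ./vector_store

embedding:
  backend: huggingface       # huggingface | ollama
  model: all-MiniLM-L6-v2

llm:
  provider: ollama
  model: llama3.2:3b
  temperature: 0.1

chunking:
  chunk_size: 512
  chunk_overlap: 50

retrieval:
  top_k: 5
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
```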

When Local RAG Is the Right Choice

Fully private, zero-cloud document Q&A is production-viable today with commodity hardware. The strongest use cases remain regulated industries (legal, healthcare, finance, defense) and internal knowledge bases where data cannot leave the network perimeter. This tutorial does not cover multimodal RAG (handling images alongside text), fine-tuning local models for domain-specific accuracy, or agentic retrieval patterns that decompose complex questions into sub-queries. Each of those warrants its own treatment.