Optimizing Token Usage: Context Compression Techniques

5 min read

How to Optimize Token Usage with Context Compression

  1. Count baseline tokens with TikToken using the correct encoding for your target model.
  2. Build a base RAG retrieval pipeline with chunked documents and a FAISS vector store.
  3. Apply extraction-based compression via LLMChainExtractor to select relevant sentences from each chunk.
  4. Apply selection-based compression via LLMChainFilter to keep or discard entire chunks.
  5. Compare token counts across baseline, extraction, and selection methods.
  6. Calculate real-dollar savings using a token-to-cost formula across your expected call volume.
  7. Layer selection before extraction for maximum reduction when retrieval returns many chunks.
  8. Benchmark answer quality on at least 50 representative queries before deploying to production.

This tutorial walks through two concrete compression strategies, extraction-based and selection-based, using LangChain's contextual compression retrievers and TikToken for precise token accounting. Readers will implement both approaches against the same retrieval pipeline, measure token reduction percentages, and calculate real-dollar savings at scale.

Why Token Optimization Matters Now

Token optimization is no longer a nice-to-have. GPT-4 (8K context) costs $30/M input tokens; GPT-4o costs $2.50/M input tokens (OpenAI pricing, verify current rates at platform.openai.com before budgeting — prices change frequently). In agentic workflows that loop repeatedly through their context windows, uncompressed context can quietly transform a promising AI product into one where context costs exceed the inference costs themselves. Context compression — reducing the number of tokens fed into an LLM's prompt without destroying the information needed for accurate responses — is the most direct lever engineers have for controlling these costs.

How Context Windows Drain Your Budget

Token Counting Fundamentals with TikToken

Tokenization is where cost accounting starts, and character count is a misleading proxy. A 4,000-character passage might tokenize to 900 tokens or 1,200 tokens depending on vocabulary density, whitespace, and special characters. OpenAI's tiktoken library provides exact counts using the same tokenizer the models use internally. The cl100k_base encoding covers GPT-4 and GPT-3.5-turbo. GPT-4o and GPT-4o-mini use o200k_base. Pass the correct encoding for your target model: tiktoken.encoding_for_model("gpt-4o") returns o200k_base.

import tiktoken
from functools import lru_cache

@lru_cache(maxsize=8)
def _get_encoding(encoding_name: str) -> tiktoken.Encoding:
    """Cache tokenizer instances — construction is expensive."""
    return tiktoken.get_encoding(encoding_name)

def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Count tokens for a given text.

    Args:
        text: The string to tokenize.
        encoding_name: Tiktoken encoding name.
                       Use 'cl100k_base' for GPT-4/GPT-3.5-turbo.
                       Use 'o200k_base' for GPT-4o/GPT-4o-mini.
    Returns:
        Exact integer token count.
    """
    if not isinstance(text, str):
        raise TypeError(f"Expected str, got {type(text).__name__}")
    encoding = _get_encoding(encoding_name)
    return len(encoding.encode(text))

# Sample retrieved RAG passage for demonstration
# Replace with: open("your_file.txt").read()
sample_context = (
    "This is a sample retrieved passage used to demonstrate token counting. "
    "Replace this with your actual retrieved document text."
)

baseline_tokens = count_tokens(sample_context)
print(f"Baseline token count: {baseline_tokens}")

This function returns the exact integer token count. Every cost calculation downstream depends on this number being precise, not estimated.

The Compounding Cost in Agentic Loops

The cost picture worsens fast in agentic architectures. A ReAct-style agent re-sends the full accumulated context with every reasoning step (the exact behavior depends on your agent framework — verify your framework's context handling). If retrieved context contributes 3,000 tokens per step and the agent takes 5 steps to resolve a query, that single invocation consumes roughly 15,000 context tokens (excluding system prompt and prior turns, which add further). Multiply by 1,000 daily queries and you hit $37.50 per day, over $1,100 per month, in input costs alone at GPT-4o rates — before accounting for output tokens or compression overhead. This compounding effect makes context compression essential rather than optional for reducing LLM costs.
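The compounding arithmetic is easy to sanity-check. The helper below is a hypothetical illustration (not part of the tutorial pipeline), plugging in this example's 3,000-token, 5-step, 1,000-query figures:

```python
def agent_input_cost(tokens_per_step: int, steps: int,
                     queries: int, price_per_m: float) -> float:
    """Input-token cost for an agent that re-sends its context each step.

    Simplification: assumes a flat tokens_per_step. Real agents grow
    context every step, so this is a lower bound.
    """
    total_tokens = tokens_per_step * steps * queries
    return total_tokens / 1_000_000 * price_per_m

# 3,000 context tokens/step, 5 steps, 1,000 queries/day, GPT-4o input rate
daily = agent_input_cost(3_000, 5, 1_000, 2.50)
print(f"${daily:.2f}/day, ~${daily * 30:,.0f}/month")  # → $37.50/day, ~$1,125/month
```

Swap in your own per-step token counts from production telemetry; the flat-context assumption understates real agentic loops, where context grows every step.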

Extraction vs. Selection: When to Use Which

Extraction-Based Compression

Extraction-based compression uses an LLM to identify and return verbatim relevant sentences from retrieved content before injecting them into the main prompt. The model reads the source material and selects the sentences most pertinent to the query. This works best with long-form or narrative source documents where important details scatter throughout the text. The main weakness: it requires an extra LLM call, adding both latency and its own token cost (a meta-cost that must be factored into savings calculations). For true summarization — where the LLM rewrites and paraphrases content into a shorter form — a custom summarization chain is required.

Selection-Based Compression

Selection-based compression operates at a coarser granularity. Instead of selecting sentences within documents, it filters retrieved document chunks, passing only those the LLM judges relevant, without modification. Selection preserves the original wording verbatim, which matters for citation-heavy applications, legal documents, or any context where exact phrasing counts. The trade-off: because the filter operates at the chunk level (not sentence level), it keeps or discards entire chunks, so it cannot trim a partially relevant chunk and it compresses less deeply overall. Sentence-level selection requires EmbeddingsFilter or a custom implementation.

Decision Framework

| Criteria | Extraction | Selection |
| --- | --- | --- |
| Latency | Higher (+1 LLM round-trip per chunk, with generated output) | Lower (keep/discard judgment only, no rewrite) |
| Fidelity to source | High (verbatim sentences) | High (verbatim whole chunks) |
| Best for | Narrative docs, broad questions | Factual lookups, citation-heavy apps |
| Extra LLM cost | Yes (one call per chunk) | Yes, one cheaper call per chunk; zero only with embedding-based selection (e.g., EmbeddingsFilter) |
| Token reduction depth | Higher (sentence-level selection) | Lower (chunk-level granularity) |

Extraction reduces tokens more aggressively because it selects at the sentence level, while chunk-level selection is constrained by the granularity of the source material. Actual reduction ratios vary significantly by document type, chunk size, and query distribution — measure against your own corpus before committing to either approach.

Implementing Context Compression with LangChain

Prerequisites

Before running any code below, set up your environment:

# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install dependencies (pin versions for reproducibility)
# Tested with langchain==0.2.16, langchain-community==0.2.16, langchain-openai==0.1.23
pip install tiktoken langchain langchain-community langchain-openai faiss-cpu openai python-dotenv

Note (Apple Silicon): faiss-cpu may require installation via conda or Rosetta on M-series Macs if pip installation fails.

Set your OpenAI API key as an environment variable. Never hardcode API keys in source files. Use a .env file or a secrets manager in production.

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # Reads from .env file; never commit .env to version control

if not os.environ.get("OPENAI_API_KEY"):
    raise EnvironmentError(
        "OPENAI_API_KEY not set. "
        "Create a .env file with OPENAI_API_KEY=<your-key> "
        "or set the variable in your shell environment."
    )

Setting Up the Base RAG Pipeline

The baseline pipeline loads a document, splits it into chunks, embeds them, retrieves the top-k most relevant chunks, and passes them to the model. Token counting before compression establishes the cost baseline for an efficient RAG system.

import hashlib
import tiktoken
from pathlib import Path
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Align tiktoken encoding with the actual model being called.
# GPT-4o-mini uses o200k_base, not cl100k_base.
COMPRESSION_MODEL = "gpt-4o-mini"
ENCODING_NAME = "o200k_base"  # correct for gpt-4o and gpt-4o-mini

# Use double newline to preserve chunk boundaries; avoids merging
# adjacent sentences from different chunks into a single run-on string.
CHUNK_SEPARATOR = "\n\n"

# Create a sample knowledge base if you don't have one.
# Replace this content with your own documents for production use.
if not Path("knowledge_base.txt").exists():
    Path("knowledge_base.txt").write_text(
        "Q3 2024 Performance Report

"
        "Revenue grew 12% year-over-year to $4.2 billion, driven primarily by "
        "expansion in the cloud services division. Operating margin improved to "
        "28%, up from 25% in the prior quarter. Customer acquisition cost "
        "decreased by 8% while lifetime value increased by 15%. The net "
        "promoter score reached 72, the highest in company history. Employee "
        "headcount grew to 15,000 across 12 global offices. R&D spending "
        "represented 18% of revenue, focused on AI-driven product features. "
        "Churn rate held steady at 3.2% monthly. Free cash flow reached "
        "$890 million, enabling accelerated share buybacks.
",
        encoding="utf-8",
    )

# Load and split
loader = TextLoader("knowledge_base.txt", encoding="utf-8")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = splitter.split_documents(docs)

# Embed and index
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)

# Optional: persist the FAISS index to avoid re-embedding on subsequent runs
# vectorstore.save_local("faiss_index")
#
# To reload safely, verify the index file's integrity before deserializing:
#
# def load_faiss_index(index_dir, embeddings, expected_sha256=None):
#     """Load a FAISS index only after optional integrity verification.
#
#     ⚠️ WARNING: allow_dangerous_deserialization=True enables pickle loading.
#     Only load indexes you created yourself or from a trusted source.
#     Use expected_sha256 to verify index integrity before loading.
#     Compute once with: hashlib.sha256(Path("faiss_index/index.faiss").read_bytes()).hexdigest()
#     """
#     index_path = Path(index_dir) / "index.faiss"
#     if not index_path.exists():
#         raise FileNotFoundError(f"FAISS index not found at {index_path}")
#     if expected_sha256 is not None:
#         actual = hashlib.sha256(index_path.read_bytes()).hexdigest()
#         if actual != expected_sha256:
#             raise ValueError(
#                 f"FAISS index integrity check failed.
"
#                 f"  Expected: {expected_sha256}
"
#                 f"  Actual:   {actual}
"
#                 "Do not load an index from an untrusted or modified source."
#             )
#     return FAISS.load_local(index_dir, embeddings, allow_dangerous_deserialization=True)
#
# vectorstore = load_faiss_index("faiss_index", embeddings, expected_sha256="<your-hash>")

base_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Retrieve and count baseline tokens
query = "What are the key performance metrics for the Q3 report?"
retrieved_docs = base_retriever.invoke(query)

if not retrieved_docs:
    raise ValueError("Retriever returned no documents — check your knowledge base and query.")

baseline_context = CHUNK_SEPARATOR.join(doc.page_content for doc in retrieved_docs)

encoding = tiktoken.get_encoding(ENCODING_NAME)
baseline_tokens = len(encoding.encode(baseline_context))
print(f"Baseline context tokens: {baseline_tokens}")

Adding LLM-Based Extraction Compression

LangChain's LLMChainExtractor wraps a base retriever and prompts the LLM to return the relevant sentences from each retrieved document verbatim. This is extraction, not summarization: it does not rewrite or paraphrase. For true summarization — where the LLM rewrites and paraphrases content into a shorter form — a custom summarization chain is required.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
import tiktoken

# Align tiktoken encoding with the actual model being called.
COMPRESSION_MODEL = "gpt-4o-mini"
ENCODING_NAME = "o200k_base"
CHUNK_SEPARATOR = "\n\n"

# Re-declare encoding in case this block is run independently
encoding = tiktoken.get_encoding(ENCODING_NAME)

# Guard: baseline_tokens must exist from the base pipeline block
try:
    baseline_tokens
except NameError:
    raise RuntimeError(
        "baseline_tokens is not defined. "
        "Run the base RAG pipeline block first to initialize it "
        "before running this block."
    )

if baseline_tokens == 0:
    raise ValueError(
        "baseline_tokens is zero — retriever returned no documents. "
        "Check your knowledge base and query."
    )

llm = ChatOpenAI(model=COMPRESSION_MODEL, temperature=0, max_tokens=512)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

compressed_docs = compression_retriever.invoke(query)
extracted_context = CHUNK_SEPARATOR.join(doc.page_content for doc in compressed_docs)
extracted_tokens = len(encoding.encode(extracted_context))

reduction_pct = (1 - extracted_tokens / baseline_tokens) * 100
print(f"Extracted context tokens: {extracted_tokens}")
print(f"Reduction: {reduction_pct:.1f}%")

Adding Selection-Based Compression

Swapping LLMChainExtractor for LLMChainFilter switches to selection mode. The filter asks the LLM to decide which retrieved document chunks are relevant and passes only those through in their entirety, without rewriting any content. The filter operates at the chunk level — it keeps or discards entire chunks.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain_openai import ChatOpenAI
import tiktoken

# Align tiktoken encoding with the actual model being called.
COMPRESSION_MODEL = "gpt-4o-mini"
ENCODING_NAME = "o200k_base"
CHUNK_SEPARATOR = "\n\n"

# Re-declare encoding in case this block is run independently
encoding = tiktoken.get_encoding(ENCODING_NAME)

# Guard: baseline_tokens must exist from the base pipeline block
try:
    baseline_tokens
except NameError:
    raise RuntimeError(
        "baseline_tokens is not defined. "
        "Run the base RAG pipeline block first to initialize it "
        "before running this block."
    )

if baseline_tokens == 0:
    raise ValueError(
        "baseline_tokens is zero — retriever returned no documents. "
        "Check your knowledge base and query."
    )

llm = ChatOpenAI(model=COMPRESSION_MODEL, temperature=0, max_tokens=512)
filter_compressor = LLMChainFilter.from_llm(llm)
filter_retriever = ContextualCompressionRetriever(
    base_compressor=filter_compressor,
    base_retriever=base_retriever
)

filtered_docs = filter_retriever.invoke(query)
selected_context = CHUNK_SEPARATOR.join(doc.page_content for doc in filtered_docs)
selected_tokens = len(encoding.encode(selected_context))

selection_reduction_pct = (1 - selected_tokens / baseline_tokens) * 100
print(f"Selected context tokens: {selected_tokens}")
print(f"Reduction: {selection_reduction_pct:.1f}%")

Comparing Results

print(f"{'Method':<20} {'Tokens':>8} {'Reduction':>10}")
print(f"{'Baseline':<20} {baseline_tokens:>8} {'—':>10}")
print(f"{'Extraction':<20} {extracted_tokens:>8} {reduction_pct:>9.1f}%")
print(f"{'Selection':<20} {selected_tokens:>8} {selection_reduction_pct:>9.1f}%")

Measuring Real-Dollar Savings

Token-to-Cost Formula

This formula covers input token costs only. For total cost, add output token costs separately: output_cost = (output_tokens / 1_000_000) * output_price_per_m * num_runs. For GPT-4o, output is $10.00/M tokens as of mid-2024. calculate_cost scales input costs across multiple runs and compares all three approaches against GPT-4o pricing.

def calculate_cost(tokens_per_call: int, price_per_m: float, num_runs: int) -> float:
    """Calculate input token cost. Does NOT include output token costs or
    compression LLM call costs — add those separately for production budgeting.

    Args:
        tokens_per_call: Number of input tokens per call. Must be > 0.
        price_per_m: Price per 1M input tokens (USD).
        num_runs: Number of calls. Must be > 0.

    Returns:
        Total cost in USD.

    Raises:
        ValueError: If tokens_per_call or num_runs is <= 0.
    """
    if tokens_per_call <= 0:
        raise ValueError(
            f"tokens_per_call must be > 0, got {tokens_per_call}. "
            "This likely means the retriever returned no documents."
        )
    if num_runs <= 0:
        raise ValueError(f"num_runs must be > 0, got {num_runs}.")
    return (tokens_per_call / 1_000_000) * price_per_m * num_runs

gpt4o_price = 2.50  # USD per 1M input tokens for GPT-4o (mid-2024 — verify at platform.openai.com)

runs = 1000
cost_baseline = calculate_cost(baseline_tokens, gpt4o_price, runs)
cost_extracted = calculate_cost(extracted_tokens, gpt4o_price, runs)
cost_selected = calculate_cost(selected_tokens, gpt4o_price, runs)

# Note: Extraction and selection both incur additional LLM call costs for compression
# that are not captured in this calculator. Check your OpenAI usage dashboard for
# actual compression overhead.

print(f"Cost over {runs} runs (GPT-4o input only):")
print(f"  Baseline:      ${cost_baseline:.4f}")
print(f"  Extraction:    ${cost_extracted:.4f}")
print(f"  Selection:     ${cost_selected:.4f}")
print(f"  Savings (extraction): ${cost_baseline - cost_extracted:.4f}")
print(f"  Savings (selection):  ${cost_baseline - cost_selected:.4f}")

Calculator: "Cost to Run This Agent 1,000 Times"

The table below shows input token costs only. Both extraction and selection incur additional LLM call costs for the compression step — check your provider's usage dashboard for actual totals. Output token costs (e.g., GPT-4o output at $10.00/M) must also be added for a complete cost model.

| Parameter | Baseline | Extraction | Selection |
| --- | --- | --- | --- |
| Avg context tokens/call | 2,000 | 500 | 1,000 |
| Agent steps | 5 | 5 | 5 |
| Total input tokens/run | 10,000 | 2,500 | 5,000 |
| Compression call cost | None | Additional (see note) | Additional (see note) |
| 1,000 runs (GPT-4o @ $2.50/M input) | $25.00 | $6.25 + compression | $12.50 + compression |
| 1,000 runs (GPT-4o-mini @ $0.15/M input) | $1.50 | $0.38 + compression | $0.75 + compression |
| 1,000 runs (gpt-3.5-turbo-0125 @ $0.50/M input; verify at platform.openai.com) | $5.00 | $1.25 + compression | $2.50 + compression |

Plug in your own agent's numbers. Replace the token counts and step counts with values from your production telemetry to get an accurate projection. Add output token costs and compression overhead for a complete budget estimate.
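Folding output tokens and compression overhead into a single number can be sketched as below. Every figure here is an illustrative assumption, not a measurement, and the compression call's own output tokens are ignored for simplicity:

```python
def total_cost(input_tokens: int, output_tokens: int, compression_tokens: int,
               input_price: float, output_price: float,
               compression_price: float, runs: int) -> float:
    """Per-deployment cost in USD: main-call input + output, plus the
    input side of the compression LLM call, scaled across runs."""
    per_run = (
        input_tokens / 1_000_000 * input_price
        + output_tokens / 1_000_000 * output_price
        + compression_tokens / 1_000_000 * compression_price
    )
    return per_run * runs

# 2,500 compressed input tokens and 400 output tokens per run on GPT-4o
# ($2.50/M in, $10.00/M out); the compression pass reads the 10,000-token
# uncompressed context on GPT-4o-mini ($0.15/M in).
print(f"${total_cost(2_500, 400, 10_000, 2.50, 10.00, 0.15, 1_000):.2f}")  # → $11.75
```

Note that even with compression overhead included, the compressed pipeline here costs less than half the $25.00 uncompressed baseline from the table.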

Best Practices and Pitfalls

Avoid Over-Compression

Aggressive compression can degrade answer accuracy. Before deploying any compression strategy to production, evaluate it against a minimum of 50 representative queries. Measure answer quality alongside token savings. A 70% token reduction that drops accuracy below acceptable thresholds is not a savings; it is a different kind of cost.

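A minimal harness for that pre-deployment check could look like the sketch below. Exact-match scoring and the stub answer function are placeholders; swap in your compressed RAG chain and a semantic or LLM-graded metric:

```python
from typing import Callable

def evaluate(queries: list[tuple[str, str]],
             answer_fn: Callable[[str], str]) -> float:
    """Score (query, expected_answer) pairs; return accuracy in [0, 1].

    Exact-match is a crude stand-in; use semantic similarity or
    LLM-graded evaluation for production sign-off.
    """
    if len(queries) < 50:
        print(f"Warning: only {len(queries)} queries; aim for at least 50.")
    hits = sum(
        1 for query, expected in queries
        if answer_fn(query).strip() == expected.strip()
    )
    return hits / len(queries)

# Stub standing in for the compressed RAG chain
def stub_answer(query: str) -> str:
    return "28%" if "margin" in query else "unknown"

accuracy = evaluate([("What was operating margin?", "28%")], stub_answer)
print(f"Accuracy: {accuracy:.0%}")  # → Accuracy: 100%
```

Run the same query set against baseline, extraction, and selection pipelines; accept a compression strategy only if its accuracy stays within your tolerance of the baseline.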
When Caching Pays Off

When the same documents surface repeatedly across queries, caching the compressed version eliminates redundant LLM calls. This matters most for extraction-based compression, where each compression invocation carries its own token cost. A simple key-value cache keyed on document source path (e.g., doc.metadata.get("source") or str(hash(doc.page_content))) and query hash can cut compression calls by a factor proportional to your cache hit rate — measure it. Note that LangChain Document objects do not always have a built-in id field — use metadata or assign stable IDs in your pipeline.

Layer Selection Before Extraction

When retrieval returns more than three chunks and extraction latency is acceptable, layering selection first as a cheap filter to discard irrelevant chunks, then extracting from the survivors, reduces tokens more than either approach alone. This layered pipeline keeps the expensive extraction step focused on a smaller input set. The trade-off: two LLM call stages instead of one. If your retrieval already returns only one or two highly relevant chunks, the extra filtering pass adds cost without meaningful compression.

Handle Edge Cases in Production

Add rate-limit retry logic around LLM calls for high-volume agentic loops. Persist your FAISS index (vectorstore.save_local("faiss_index")) to avoid re-embedding and re-incurring embedding API costs on every restart. Guard against empty retrieval results to prevent division-by-zero errors in reduction calculations.

Next Steps

Token compression remains the fastest lever to reduce LLM operating costs in retrieval-augmented and agentic systems. Extraction wins on compression depth; selection wins on speed and source fidelity. Benchmark your specific pipeline using the calculator and token counting patterns above, then integrate compression into your existing RAG or agent workflow. Next, try semantic caching of compressed results with a tool like GPTCache.