Context Compression: Fitting More Signal into Your LLM’s Context Window

You retrieve 10 documents. Each is 800 tokens. That’s 8,000 tokens of context before you’ve written a single word of your system prompt. And most of those tokens? Boilerplate, preamble, tangentially related sentences — noise, not signal.

Context compression solves this by extracting or distilling only the relevant parts of retrieved documents before passing them to the LLM. The result is typically comparable answer quality (sometimes better, since the model sees less distracting text) with roughly 50–80% fewer tokens, lower cost, and faster generation.

Why Context Compression Matters

Retrieved document (600 tokens):
"XYZ Corp was founded in 1985 and has grown to 50,000 employees worldwide.
The company operates in 30 countries and serves over 10 million customers.
Our customer service team is available 24/7 and we pride ourselves on
customer satisfaction scores in the top quartile for our industry.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Our return policy allows customers to return items within 30 days of purchase
for a full refund, provided the item is in original condition. Items purchased
during sale events may have a 15-day return window. For defective items,
we offer a 90-day return period regardless of sale status.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
We are committed to sustainability and have reduced carbon emissions by 30%..."

Query: "What is the return policy?"

Relevant portion (80 tokens):
"Our return policy allows customers to return items within 30 days of purchase
for a full refund, provided the item is in original condition. Items purchased
during sale events may have a 15-day return window. For defective items,
we offer a 90-day return period regardless of sale status."

87% token reduction, 100% of the answer preserved.
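
The arithmetic behind these figures is simple enough to check directly. The sketch below uses a hypothetical helper, `token_reduction`, and applies it to the 600-to-80-token example above and to the opening scenario of 10 documents at 800 tokens each; the 50% and 80% rates are the range quoted earlier, not measurements.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed by compression (0.0 = none removed)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1 - compressed_tokens / original_tokens


# The return-policy example: 600 tokens retrieved, 80 tokens kept.
print(f"{token_reduction(600, 80):.0%}")  # 87%

# The opening scenario: 10 retrieved documents of 800 tokens each.
retrieved_tokens = 10 * 800
for rate in (0.5, 0.8):
    remaining = round(retrieved_tokens * (1 - rate))
    print(f"{rate:.0%} reduction leaves {remaining} tokens")
```

At the quoted rates, the 8,000 retrieved tokens shrink to between 4,000 and 1,600 tokens of context.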

Extractive vs Abstractive Compression

Extractive compression selects sentences or passages directly from the document:

Fast
Minimal risk of hallucination (text is copied verbatim, though an LLM-based extractor can still misquote)
Preserves exact phrasing
May include redundant adjacent sentences

Abstractive compression generates a compressed summary:

More concise
Can synthesize multiple passages
Risks introducing paraphrase errors
Better for complex multi-part documents

Most production systems use extractive compression for precision-critical applications and abstractive for summary-oriented use cases.
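
To make the extractive side concrete, here is a minimal sketch that keeps only sentences sharing content words with the query. The stopword list, the `extract_relevant` helper, and the keyword-overlap scoring are all assumptions chosen for illustration; production extractors usually score with embeddings or an LLM instead.

```python
import re

# Hypothetical, deliberately tiny stopword list for this example.
STOPWORDS = {"the", "a", "an", "is", "are", "what", "of", "for", "to", "in", "and", "our", "we"}


def content_words(text: str) -> set[str]:
    """Lowercased words and numbers in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def extract_relevant(document: str, query: str, min_overlap: int = 1) -> str:
    """Keep sentences, verbatim and in order, that share words with the query."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    query_words = content_words(query)
    kept = [s for s in sentences if len(content_words(s) & query_words) >= min_overlap]
    return " ".join(kept)


document = (
    "XYZ Corp was founded in 1985 and has grown to 50,000 employees worldwide. "
    "Our return policy allows customers to return items within 30 days of purchase. "
    "For defective items, we offer a 90-day return period regardless of sale status. "
    "We are committed to sustainability."
)
print(extract_relevant(document, "What is the return policy?"))
```

Because every kept sentence is copied unchanged, the output cannot contain a paraphrase error, which is the property that makes extraction attractive for precision-critical applications.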

LangChain Contextual Compression Retriever

LangChain’s ContextualCompressionRetriever wraps any retriever with a compression step:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LLM-based extractor: asks LLM to extract relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,  # any retriever
)

# Returns only the relevant portions of each document
compressed_docs = compression_retriever.invoke(
    "What is the return policy for defective items?"
)

for doc in compressed_docs:
    print(doc.page_content)  # only the relevant extract

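EmbeddingsFilter: Fast Similarity Filtering

For low-latency use cases, filter by embedding similarity instead of LLM calls. EmbeddingsFilter scores each document (chunk) the base retriever returns and drops those below the threshold. It does not split text itself, so sentence-level filtering requires splitting the documents into sentences first (for example, with a text splitter placed ahead of the filter in a DocumentCompressorPipeline):

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings_filter = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(),
    similarity_threshold=0.76,  # keep documents with >0.76 cosine sim to query
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=base_retriever,
)

This requires no LLM calls, just embedding the retrieved text and filtering by similarity. Throughput is much higher, but precision is lower than with LLM-based extraction. Treat 0.76 as a starting point: useful thresholds vary by embedding model.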
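
The thresholding mechanism itself can be sketched without any library. The example below applies it to individual sentences and uses a bag-of-words `toy_embed` function as a hypothetical stand-in for a real embedding model, so the similarity values (and the 0.25 threshold) are not comparable to those of OpenAI embeddings.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_sentences(sentences: list[str], query: str, threshold: float) -> list[str]:
    """Keep sentences whose similarity to the query meets the threshold."""
    query_vec = toy_embed(query)
    return [s for s in sentences if cosine(toy_embed(s), query_vec) >= threshold]


sentences = [
    "XYZ Corp was founded in 1985.",
    "Our return policy allows returns within 30 days.",
    "We are committed to sustainability.",
]
kept = filter_sentences(sentences, "What is the return policy?", threshold=0.25)
print(kept)
```

Only the return-policy sentence clears the threshold here. Raising the threshold trades recall for precision, which is why the value needs tuning against real queries.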

LLMLingua: Prompt Compression at Token Level

LLMLingua (Microsoft Research, 2023) takes a different approach: it compresses the prompt itself, using a small language model to score token importance and then dropping low-importance tokens.

Original prompt (800 tokens):
"Please answer the following question based on the provided context.
The context is from our product documentation.
[Full 600-token document with all sentences]
Question: What is the return policy?"

LLMLingua compressed (200 tokens):
"Answer based on context.
[return policy allow customer return items 30 days full refund original condition
sale events 15-day return defective items 90-day]
Question: return policy?"

75% compression with ~95% answer accuracy preserved (approximate; results vary by task and compression rate)

pip install llmlingua

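from llmlingua import PromptCompressor

# LLMLingua-2: a small token-classification model labels each token keep/drop
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

# retrieved_docs: the documents returned by your retriever
context = "\n\n".join([doc.page_content for doc in retrieved_docs])

# question, rank_method="longllmlingua", and context_budget are options of the
# question-aware LongLLMLingua path; they are left out here because the
# LLMLingua-2 path is task-agnostic.
result = llm_lingua.compress_prompt(
    context,
    target_token=200,  # compress to roughly 200 tokens
)

# compress_prompt returns a dict; the compressed text is under "compressed_prompt"
compressed_prompt = result["compressed_prompt"]

final_prompt = f"{compressed_prompt}\n\nQuestion: What is the return policy?"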

LLMLingua compression adds latency of its own (on the order of 50–100ms, depending on hardware and prompt length) but can reduce total LLM costs by 3–5× for retrieval-heavy workflows.
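
The core idea of token-level compression, score every token and keep the highest-scoring ones in their original order, can be sketched in a few lines. The scoring here is a hypothetical stopword heuristic (`LOW_IMPORTANCE`), not LLMLingua's method: the real library derives importance from a small language model.

```python
import re

# Hypothetical importance heuristic: function words score 0, everything else 1.
LOW_IMPORTANCE = {"our", "the", "a", "an", "of", "for", "is", "in", "to", "we"}


def prune_tokens(text: str, keep_ratio: float) -> str:
    """Keep the highest-scoring tokens, preserving their original order."""
    tokens = text.split()
    scores = [
        0.0 if re.sub(r"\W", "", tok).lower() in LOW_IMPORTANCE else 1.0
        for tok in tokens
    ]
    budget = max(1, round(len(tokens) * keep_ratio))
    # Rank by score (highest first); ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in keep)


text = (
    "Our return policy allows customers to return items within 30 days "
    "of purchase for a full refund"
)
print(prune_tokens(text, keep_ratio=0.7))
# return policy allows customers return items within 30 days purchase full refund
```

The pruned text is no longer grammatical, but the facts a model needs (30 days, full refund) survive, which mirrors the compressed prompt shown earlier.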

Pipeline-Level Compression

In a full RAG pipeline, compression sits between retrieval and generation:

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

def format_compressed_docs(docs):
    return "\n\n---\n\n".join([doc.page_content for doc in docs])

# Full compressed RAG chain
rag_chain = (
    {
        "context": compression_retriever | format_compressed_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

answer = rag_chain.invoke("What is the return policy for defective items?")

Compression Strategy Comparison

Strategy	Latency	Cost	Quality	Use Case
LLM extraction (GPT-4o-mini)	+300ms	Medium	High	Precision-critical
Embeddings filter	+50ms	Low	Medium	High-throughput
LLMLingua	+80ms	Low	High	Token-budget constrained
Sentence window (manual)	< 10ms	None	Medium	Simple use cases
No compression	0ms	None	Lower	Small docs only

Measuring Compression Effectiveness

Track these metrics to ensure compression isn’t removing critical content:

from rouge_score import rouge_scorer


def evaluate_compression(
    original_docs: list[str],
    compressed_docs: list[str],
    answers_with_original: list[str],
    answers_with_compressed: list[str],
    ground_truth: list[str],
) -> dict:
    # Whitespace word counts only approximate token counts; use the model's
    # tokenizer if you need exact numbers.
    token_reduction = 1 - (
        sum(len(c.split()) for c in compressed_docs) /
        sum(len(d.split()) for d in original_docs)
    )

    # Use ROUGE-L to compare each answer against its ground truth
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

    def mean_rouge_l(answers: list[str]) -> float:
        return sum(
            scorer.score(gt, ans)['rougeL'].fmeasure
            for gt, ans in zip(ground_truth, answers)
        ) / len(ground_truth)

    original_quality = mean_rouge_l(answers_with_original)
    compressed_quality = mean_rouge_l(answers_with_compressed)

    return {
        "token_reduction": token_reduction,
        "quality_original": original_quality,
        "quality_compressed": compressed_quality,
        # Guard against division by zero when the baseline scores 0
        "quality_preservation": (
            compressed_quality / original_quality if original_quality else 0.0
        ),
    }

Target: > 50% token reduction with > 90% quality preservation.
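
One way to enforce this target is a small gate over the metrics dictionary, for example in a regression test. The sketch below assumes the `token_reduction` and `quality_preservation` keys used above; the `meets_target` helper and the sample values are hypothetical, made up to show a passing and a failing case.

```python
def meets_target(
    metrics: dict,
    min_reduction: float = 0.5,
    min_preservation: float = 0.9,
) -> bool:
    """True when compression both saves enough tokens and keeps answer quality."""
    return (
        metrics["token_reduction"] > min_reduction
        and metrics["quality_preservation"] > min_preservation
    )


# Illustrative values only, not measured results.
balanced = {"token_reduction": 0.72, "quality_preservation": 0.94}
too_aggressive = {"token_reduction": 0.85, "quality_preservation": 0.81}

print(meets_target(balanced))        # True
print(meets_target(too_aggressive))  # False
```

The second case shows the failure mode worth catching: a higher token reduction that is paid for with lost answer quality.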

2025 Trend: Query-Conditioned Compression

Newer compression systems condition not just on document-query relevance but on the type of question being asked. A “when did X happen?” question should retain temporal markers. A “how does X work?” question should retain procedural steps. These question-type signals can improve compression precision by roughly 10–20% over generic relevance filtering.
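
A rough sketch of the idea, assuming it runs as a second pass over sentences that already passed a relevance filter. The question classifier, the two marker patterns, and the `retain` helper are hypothetical simplifications; a real system would likely use a learned classifier rather than regular expressions.

```python
import re

# Hypothetical marker patterns for two question types.
TEMPORAL = re.compile(r"\b(\d{4}|\d+-day|before|after|since|until)\b", re.I)
PROCEDURAL = re.compile(r"\b(first|then|next|finally|step)\b", re.I)


def question_type(question: str) -> str:
    """Classify a question by its leading word."""
    q = question.strip().lower()
    if q.startswith("when"):
        return "temporal"
    if q.startswith("how"):
        return "procedural"
    return "generic"


def retain(sentences: list[str], question: str) -> list[str]:
    """Keep sentences carrying the markers this question type needs."""
    qtype = question_type(question)
    if qtype == "temporal":
        kept = [s for s in sentences if TEMPORAL.search(s)]
    elif qtype == "procedural":
        kept = [s for s in sentences if PROCEDURAL.search(s)]
    else:
        kept = list(sentences)
    # Fall back to the full input rather than returning nothing.
    return kept or list(sentences)


sentences = [
    "XYZ Corp was founded in 1985.",
    "First, request a return label, then ship the item back.",
    "The company serves over 10 million customers.",
]
print(retain(sentences, "When was XYZ Corp founded?"))
print(retain(sentences, "How do I return an item?"))
```

The same three sentences compress differently depending on the question: the temporal question keeps the founding date, and the procedural question keeps the return steps.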

Context compression is one of the most cost-effective optimizations in a production RAG system. Once retrieval quality is solid, compression reduces operational costs significantly without requiring changes to any other part of the pipeline.