Context Compression: Fitting More Signal into Your LLM’s Context Window
You retrieve 10 documents. Each is 800 tokens. That’s 8,000 tokens of context before you’ve written a single word of your system prompt. And most of those tokens? Boilerplate, preamble, tangentially related sentences — noise, not signal.
Context compression solves this by extracting or distilling only the relevant parts of retrieved documents before passing them to the LLM. The result: the same answer quality (sometimes better) with 50–80% fewer tokens, lower cost, and faster generation.
Why Context Compression Matters
Retrieved document (600 tokens):
"XYZ Corp was founded in 1985 and has grown to 50,000 employees worldwide.
The company operates in 30 countries and serves over 10 million customers.
Our customer service team is available 24/7 and we pride ourselves on
customer satisfaction scores in the top quartile for our industry.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Our return policy allows customers to return items within 30 days of purchase
for a full refund, provided the item is in original condition. Items purchased
during sale events may have a 15-day return window. For defective items,
we offer a 90-day return period regardless of sale status.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
We are committed to sustainability and have reduced carbon emissions by 30%..."

Query: "What is the return policy?"

Relevant portion (80 tokens):
"Our return policy allows customers to return items within 30 days of purchase
for a full refund, provided the item is in original condition. Items purchased
during sale events may have a 15-day return window. For defective items,
we offer a 90-day return period regardless of sale status."

87% token reduction, 100% of the answer preserved.

Extractive vs Abstractive Compression
Extractive compression selects sentences or passages directly from the document:
- Fast
- No risk of hallucination (words are copied verbatim)
- Preserves exact phrasing
- May include redundant adjacent sentences
Abstractive compression generates a compressed summary:
- More concise
- Can synthesize multiple passages
- Risks introducing paraphrase errors
- Better for complex multi-part documents
Most production systems use extractive compression for precision-critical applications and abstractive for summary-oriented use cases.
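To make the extractive side concrete, here is a minimal sketch that keeps only the sentences sharing content words with the query. The keyword-overlap scoring, the stopword list, and the helper names (`split_sentences`, `tokenize`, `extractive_compress`) are assumptions made for illustration; a production system would more likely use an LLM or embedding model to judge relevance.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real relevance model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def extractive_compress(document: str, query: str, min_overlap: int = 1) -> str:
    """Keep sentences that share at least `min_overlap` content words with the query."""
    # Hypothetical stopword list, just enough for this example
    stopwords = {"what", "is", "the", "a", "an", "of", "for", "to", "in", "are"}
    query_terms = tokenize(query) - stopwords
    kept = [
        sentence
        for sentence in split_sentences(document)
        if len(tokenize(sentence) & query_terms) >= min_overlap
    ]
    # Sentences are copied verbatim, so nothing is paraphrased
    return " ".join(kept)

document = (
    "XYZ Corp was founded in 1985. "
    "Our return policy allows customers to return items within 30 days. "
    "We are committed to sustainability."
)
compressed = extractive_compress(document, "What is the return policy?")
print(compressed)
```

Because every kept sentence is a verbatim substring of the source, the output can be checked against the original document, which is the property that makes extractive compression attractive for precision-critical uses.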
LangChain Contextual Compression Retriever
LangChain’s ContextualCompressionRetriever wraps any retriever with a compression step:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LLM-based extractor: asks LLM to extract relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,  # any retriever
)

# Returns only the relevant portions of each document
compressed_docs = compression_retriever.invoke(
    "What is the return policy for defective items?"
)

for doc in compressed_docs:
    print(doc.page_content)  # only the relevant extract

EmbeddingsFilter: Fast Sentence-Level Filtering
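For low-latency use cases, filter by embedding similarity instead of making LLM calls. EmbeddingsFilter scores whatever documents it receives as whole units, so filtering at the sentence level requires splitting the retrieved documents into sentences before the filter runs (for example, with a text splitter in a DocumentCompressorPipeline):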
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings_filter = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(),
    similarity_threshold=0.76,  # keep passages with > 0.76 cosine similarity to the query
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=base_retriever,
)

This requires no LLM calls, just embedding the retrieved text and filtering by similarity. Throughput is much higher, but precision is lower than with LLM-based extraction.
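The filtering logic itself is simple enough to sketch without any external service. The example below is an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 0.3 threshold is chosen for these toy vectors, so it is not comparable to the 0.76 used with OpenAI embeddings.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_sentences(sentences: list[str], query: str, threshold: float) -> list[str]:
    """Keep sentences whose similarity to the query meets the threshold."""
    query_vec = toy_embed(query)
    return [s for s in sentences if cosine(toy_embed(s), query_vec) >= threshold]

sentences = [
    "XYZ Corp was founded in 1985.",
    "Our return policy allows returns within 30 days.",
    "We have reduced carbon emissions by 30%.",
]
kept = filter_sentences(sentences, "return policy", threshold=0.3)
print(kept)
```

With a real embedding model the structure stays the same: embed the query once, embed each candidate, and keep the candidates above a threshold tuned on your own data.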
LLMLingua: Prompt Compression at Token Level
LLMLingua (Microsoft Research, 2023) takes a different approach: it compresses the prompt itself, using a small language model to score token importance and drop the low-importance tokens.
Original prompt (800 tokens):
"Please answer the following question based on the provided context.
The context is from our product documentation.
[Full 600-token document with all sentences]
Question: What is the return policy?"

LLMLingua compressed (200 tokens):
"Answer based on context.
[return policy allow customer return items 30 days full refund original condition
sale events 15-day return defective items 90-day]
Question: return policy?"
75% compression with ~95% answer accuracy preservation

pip install llmlingua

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

context = "\n\n".join([doc.page_content for doc in retrieved_docs])

# rank_method and context_budget are LongLLMLingua-style options and
# may have no effect when use_llmlingua2=True
result = llm_lingua.compress_prompt(
    context,
    question="What is the return policy?",
    target_token=200,  # compress to 200 tokens
    rank_method="longllmlingua",
    context_budget="+100",
)

# compress_prompt returns a dict; the compressed text is under "compressed_prompt"
compressed_prompt = result["compressed_prompt"]

final_prompt = f"{compressed_prompt}\n\nQuestion: What is the return policy?"

LLMLingua compression adds ~50–100ms of latency but can reduce total LLM costs by 3–5× for retrieval-heavy workflows.
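The idea of dropping low-importance tokens under a budget can be illustrated with a toy scorer. This is not LLMLingua's method: the importance score here (function words score zero, words containing digits score highest, everything else scores by length) is a hypothetical heuristic invented for the example, whereas LLMLingua uses a trained model.

```python
def drop_low_importance_tokens(text: str, target_tokens: int) -> str:
    """Keep the `target_tokens` highest-scoring words, preserving original order."""
    # Hypothetical function-word list; a trained model would replace this
    function_words = {"a", "an", "the", "of", "to", "for", "in", "is", "may", "have", "we"}
    words = text.split()

    def score(word: str) -> int:
        bare = word.strip(".,").lower()
        if bare in function_words:
            return 0
        if any(ch.isdigit() for ch in bare):
            return 10  # numbers usually carry facts, so rank them first
        return len(bare)

    # Rank word positions by score (ties go to the earlier word)
    ranked = sorted(range(len(words)), key=lambda i: (-score(words[i]), i))
    # Restore document order for the positions that fit the budget
    keep = sorted(ranked[:target_tokens])
    return " ".join(words[i] for i in keep)

text = "Items purchased during sale events may have a 15-day return window."
compressed = drop_low_importance_tokens(text, target_tokens=8)
print(compressed)
```

The output reads like the telegraphic compressed prompt shown earlier: the connective words disappear while the nouns and the number survive. A crude heuristic like this will drop important words on harder inputs, which is the gap a learned importance model is meant to close.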
Pipeline-Level Compression
In a full RAG pipeline, compression sits between retrieval and generation:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

def format_compressed_docs(docs):
    return "\n\n---\n\n".join([doc.page_content for doc in docs])

# Full compressed RAG chain
rag_chain = (
    {
        "context": compression_retriever | format_compressed_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

answer = rag_chain.invoke("What is the return policy for defective items?")

Compression Strategy Comparison
| Strategy | Latency | Cost | Quality | Use Case |
|---|---|---|---|---|
| LLM extraction (GPT-4o-mini) | +300ms | Medium | High | Precision-critical |
| Embeddings filter | +50ms | Low | Medium | High-throughput |
| LLMLingua | +80ms | Low | High | Token-budget constrained |
| Sentence window (manual) | < 10ms | None | Medium | Simple use cases |
| No compression | 0ms | None | Lower | Small docs only |
Measuring Compression Effectiveness
Track these metrics to ensure compression isn’t removing critical content:
from rouge_score import rouge_scorer

def evaluate_compression(
    original_docs: list[str],
    compressed_docs: list[str],
    answers_with_original: list[str],
    answers_with_compressed: list[str],
    ground_truth: list[str],
) -> dict:
    # Whitespace-separated words serve as a rough proxy for token counts
    token_reduction = 1 - (
        sum(len(c.split()) for c in compressed_docs)
        / sum(len(d.split()) for d in original_docs)
    )

    # Use ROUGE-L to compare answer quality
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

    original_quality = sum(
        scorer.score(gt, ans)['rougeL'].fmeasure
        for gt, ans in zip(ground_truth, answers_with_original)
    ) / len(ground_truth)

    compressed_quality = sum(
        scorer.score(gt, ans)['rougeL'].fmeasure
        for gt, ans in zip(ground_truth, answers_with_compressed)
    ) / len(ground_truth)

    return {
        "token_reduction": token_reduction,
        "quality_original": original_quality,
        "quality_compressed": compressed_quality,
        # Guard against division by zero when the baseline scores 0
        "quality_preservation": (
            compressed_quality / original_quality if original_quality else 0.0
        ),
    }

Target: > 50% token reduction with > 90% quality preservation.
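One way to act on those targets is a small gate that a CI job or evaluation script could call. The sketch below assumes a metrics dict with the `token_reduction` and `quality_preservation` keys returned above; the function name `meets_targets` and the sample numbers are hypothetical and are not measured results.

```python
def meets_targets(
    metrics: dict,
    min_reduction: float = 0.5,
    min_preservation: float = 0.9,
) -> bool:
    """Return True when both the token-reduction and quality targets are exceeded."""
    return (
        metrics["token_reduction"] > min_reduction
        and metrics["quality_preservation"] > min_preservation
    )

# Illustrative numbers only, not measurements
good_run = {"token_reduction": 0.62, "quality_preservation": 0.47 / 0.50}
aggressive_run = {"token_reduction": 0.80, "quality_preservation": 0.70}

print(meets_targets(good_run))
print(meets_targets(aggressive_run))
```

The second run compresses more but fails the quality check, which is the failure mode this gate is meant to catch before an over-aggressive setting reaches production.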
2025 Trend: Query-Conditioned Compression
Newer compression systems condition not just on document-query relevance but on the type of question being asked. A “when did X happen?” question should retain temporal markers. A “how does X work?” question should retain procedural steps. These question-type signals improve compression precision by 10–20% over generic relevance filtering.
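A rough sketch of the idea follows, under clearly simplified assumptions: question type is guessed from the first word of the query, and the temporal and procedural marker lists are small hypothetical regexes. A real system would combine a learned question classifier with a relevance filter rather than rely on these patterns alone.

```python
import re

# Hypothetical marker patterns, kept deliberately small for the example
TEMPORAL = re.compile(r"\b(\d{4}|since|until|ago)\b", re.IGNORECASE)
PROCEDURAL = re.compile(r"\b(first|then|next|finally|step)\b", re.IGNORECASE)

def question_type(query: str) -> str:
    """Guess the question type from its leading word."""
    q = query.strip().lower()
    if q.startswith("when"):
        return "temporal"
    if q.startswith("how"):
        return "procedural"
    return "generic"

def compress_for_question(sentences: list[str], query: str) -> list[str]:
    """Keep the sentences whose markers match the question type."""
    qtype = question_type(query)
    if qtype == "temporal":
        return [s for s in sentences if TEMPORAL.search(s)]
    if qtype == "procedural":
        return [s for s in sentences if PROCEDURAL.search(s)]
    return list(sentences)  # no type signal: leave the sentences untouched

sentences = [
    "XYZ Corp was founded in 1985.",
    "First, request a return label.",
    "Then ship the item back.",
    "We value sustainability.",
]
when_result = compress_for_question(sentences, "When was XYZ Corp founded?")
how_result = compress_for_question(sentences, "How do I return an item?")
print(when_result)
print(how_result)
```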
Context compression is one of the most cost-effective optimizations in a production RAG system. Once retrieval quality is solid, compression reduces operational costs significantly without requiring changes to any other part of the pipeline.