Context Compression: Trimming Retrieved Documents for Efficient RAG

Learn context compression for RAG — LLMLingua, extractive compression, contextual compression retriever, and reducing token costs without losing information.

Context Compression: Fitting More Signal into Your LLM’s Context Window

You retrieve 10 documents. Each is 800 tokens. That’s 8,000 tokens of context before you’ve written a single word of your system prompt. And most of those tokens? Boilerplate, preamble, tangentially related sentences — noise, not signal.

Context compression solves this by extracting or distilling only the relevant parts of retrieved documents before passing them to the LLM. The result: the same answer quality (sometimes better) with 50–80% fewer tokens, lower cost, and faster generation.

Why Context Compression Matters

Retrieved document (600 tokens):
"XYZ Corp was founded in 1985 and has grown to 50,000 employees worldwide.
The company operates in 30 countries and serves over 10 million customers.
Our customer service team is available 24/7 and we pride ourselves on
customer satisfaction scores in the top quartile for our industry.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Our return policy allows customers to return items within 30 days of purchase
for a full refund, provided the item is in original condition. Items purchased
during sale events may have a 15-day return window. For defective items,
we offer a 90-day return period regardless of sale status.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
We are committed to sustainability and have reduced carbon emissions by 30%..."

Query: "What is the return policy?"

Relevant portion (80 tokens):
"Our return policy allows customers to return items within 30 days of purchase
for a full refund, provided the item is in original condition. Items purchased
during sale events may have a 15-day return window. For defective items,
we offer a 90-day return period regardless of sale status."

87% token reduction, 100% of the answer preserved.
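The 87% figure follows directly from the two token counts. As a quick check, the arithmetic can be sketched as below; `token_reduction` is a hypothetical helper name, and the second calculation assumes (purely for illustration) that all 10 documents from the opening scenario compress at the same ratio.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed (0.0 = nothing removed, 1.0 = everything)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1 - compressed_tokens / original_tokens


# The return-policy example: 600 tokens in, 80 tokens out.
single_doc = token_reduction(600, 80)

# The opening scenario: 10 documents of 800 tokens each.
# Assumption for illustration: every document compresses at the same ratio.
context_before = 10 * 800
context_after = round(context_before * (1 - single_doc))

print(f"{single_doc:.0%} reduction per document")
print(f"{context_before} -> {context_after} context tokens")
```

In practice the ratio varies per document, so it is worth measuring on real retrievals rather than extrapolating from one example.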

Extractive vs Abstractive Compression

Extractive compression selects sentences or passages directly from the document:

  • Fast
  • No risk of hallucination (words are copied verbatim)
  • Preserves exact phrasing
  • May include redundant adjacent sentences

Abstractive compression generates a compressed summary:

  • More concise
  • Can synthesize multiple passages
  • Risks introducing paraphrase errors
  • Better for complex multi-part documents

Most production systems use extractive compression for precision-critical applications and abstractive for summary-oriented use cases.
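To make the extractive side concrete, here is a minimal sketch that keeps only sentences sharing a content word with the query. It is a toy heuristic, not how any particular library does it: the stopword list, the regex sentence splitter, and the names `content_words` and `extractive_compress` are all assumptions made for this example.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource)
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "for",
             "what", "and", "our", "we"}


def content_words(text: str) -> set[str]:
    """Lowercased words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def extractive_compress(document: str, query: str, min_overlap: int = 1) -> str:
    """Keep sentences verbatim if they share enough content words with the query."""
    query_words = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences
            if len(content_words(s) & query_words) >= min_overlap]
    return " ".join(kept)


doc = (
    "XYZ Corp was founded in 1985. "
    "Our return policy allows customers to return items within 30 days of purchase. "
    "We are committed to sustainability."
)
compressed = extractive_compress(doc, "What is the return policy?")
print(compressed)
```

Because the kept sentences are copied verbatim, the output cannot contain anything the source did not say, which is the property that makes extractive methods attractive for precision-critical use.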

LangChain Contextual Compression Retriever

LangChain’s ContextualCompressionRetriever wraps any retriever with a compression step:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LLM-based extractor: asks the LLM to extract the relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,  # any existing retriever
)

# Returns only the relevant portions of each document
compressed_docs = compression_retriever.invoke(
    "What is the return policy for defective items?"
)
for doc in compressed_docs:
    print(doc.page_content)  # only the relevant extract

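EmbeddingsFilter: Fast Embedding-Based Filtering

For low-latency use cases, filter by embedding similarity instead of making LLM calls. Note that EmbeddingsFilter scores each retrieved document (chunk) as a whole and drops those below the threshold; it does not trim sentences inside a document. For sentence-level filtering, split the retrieved documents into smaller pieces first, for example with a text splitter inside a DocumentCompressorPipeline.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings_filter = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(),
    similarity_threshold=0.76,  # keep documents with > 0.76 similarity to the query
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=base_retriever,
)

This requires no LLM calls, only embedding the query and the retrieved documents and filtering by similarity. Throughput is much higher, but precision is lower than with LLM-based extraction.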
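The thresholding step itself is simple enough to sketch without a framework. The example below uses hand-written 2-dimensional vectors as stand-ins for real embeddings, and `cosine` and `filter_by_similarity` are hypothetical helper names; only the idea (score each candidate against the query, keep those above the threshold) carries over to a real setup.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(
    query_vec: list[float],
    candidates: list[tuple[str, list[float]]],
    threshold: float = 0.76,
) -> list[str]:
    """Keep candidate texts whose vector is similar enough to the query vector."""
    return [text for text, vec in candidates if cosine(query_vec, vec) > threshold]


# Toy vectors standing in for real embeddings (illustrative values only)
query_vec = [1.0, 0.0]
candidates = [
    ("return policy paragraph", [0.9, 0.1]),
    ("company history paragraph", [0.1, 0.9]),
]

kept = filter_by_similarity(query_vec, candidates)
print(kept)
```

The right threshold depends on the embedding model, so a value such as 0.76 is best treated as a starting point to tune against your own queries.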

LLMLingua: Prompt Compression at Token Level

LLMLingua (Microsoft Research, 2023) takes a different approach: compress the prompt itself using a small language model that scores token importance, then drops low-importance tokens.

Original prompt (800 tokens):
"Please answer the following question based on the provided context.
The context is from our product documentation.
[Full 600-token document with all sentences]
Question: What is the return policy?"

LLMLingua compressed (200 tokens):
"Answer based on context.
[return policy allow customer return items 30 days full refund original condition
sale events 15-day return defective items 90-day]
Question: return policy?"

75% compression with ~95% answer accuracy preservation.

pip install llmlingua

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

# retrieved_docs: the documents returned by your retriever
context = "\n\n".join([doc.page_content for doc in retrieved_docs])

result = llm_lingua.compress_prompt(
    context,
    question="What is the return policy?",
    target_token=200,  # compress to roughly 200 tokens
    # LongLLMLingua-style options; they may have no effect when use_llmlingua2=True
    rank_method="longllmlingua",
    context_budget="+100",
)

# compress_prompt returns a dict; the compressed text is under "compressed_prompt"
compressed_prompt = result["compressed_prompt"]
final_prompt = f"{compressed_prompt}\n\nQuestion: What is the return policy?"

LLMLingua compression adds ~50–100ms of latency but can reduce total LLM costs by 3–5× for retrieval-heavy workflows.
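The core idea of token-level pruning (score every token, keep the most important ones in their original order) can be illustrated with a toy scorer. This is not LLMLingua's algorithm: the fixed word list below is a hypothetical stand-in for the importance scores a small language model would produce, and `importance` and `prune_tokens` are names invented for this sketch.

```python
# Hypothetical stand-in for learned importance scores: function words score 0,
# everything else scores 1. A real compressor would use a small language model.
LOW_INFORMATION = {"the", "a", "an", "of", "to", "is", "are", "for", "in",
                   "our", "may", "have", "be", "on", "what"}


def importance(token: str) -> float:
    return 0.0 if token.lower().strip(".,?!") in LOW_INFORMATION else 1.0


def prune_tokens(text: str, keep_ratio: float) -> str:
    """Keep the highest-scoring tokens, preserving their original order."""
    tokens = text.split()
    budget = max(1, round(len(tokens) * keep_ratio))
    # Stable sort: among equally important tokens, earlier ones win
    ranked = sorted(range(len(tokens)), key=lambda i: -importance(tokens[i]))
    keep = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in keep)


text = "Items purchased during sale events may have a 15-day return window."
pruned = prune_tokens(text, keep_ratio=0.7)
print(pruned)
```

The pruned text is no longer grammatical, which is fine for this technique: the target reader is the downstream LLM, not a person.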

Pipeline-Level Compression

In a full RAG pipeline, compression sits between retrieval and generation:

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])


def format_compressed_docs(docs):
    # Join the compressed extracts with a visible separator
    return "\n\n---\n\n".join([doc.page_content for doc in docs])


# Full compressed RAG chain
rag_chain = (
    {
        "context": compression_retriever | format_compressed_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

answer = rag_chain.invoke("What is the return policy for defective items?")
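Stripped of the framework, the data flow is retrieve, then compress, then format, then prompt. The sketch below shows that ordering with plain functions; `build_prompt`, `fake_retrieve`, and `fake_compress` are hypothetical stand-ins written for this example, and the stub compressor simply keeps lines that mention "return".

```python
from typing import Callable


def build_prompt(
    question: str,
    retrieve: Callable[[str], list[str]],
    compress: Callable[[str, str], str],
    separator: str = "\n\n---\n\n",
) -> str:
    """Retrieve documents, compress each one, and assemble the final prompt."""
    docs = retrieve(question)
    compressed = [compress(doc, question) for doc in docs]
    # Documents that compress to nothing are dropped entirely
    context = separator.join(c for c in compressed if c)
    return f"Context:\n{context}\n\nQuestion: {question}"


def fake_retrieve(question: str) -> list[str]:
    # Stub retriever returning fixed documents (illustration only)
    return [
        "XYZ Corp was founded in 1985.\nDefective items have a 90-day return period.",
        "We are committed to sustainability.",
    ]


def fake_compress(doc: str, question: str) -> str:
    # Stub compressor: keeps lines mentioning "return"; a real one would use the question
    return "\n".join(line for line in doc.split("\n") if "return" in line.lower())


final_prompt = build_prompt(
    "What is the return policy for defective items?", fake_retrieve, fake_compress
)
print(final_prompt)
```

Keeping compression as its own step means the retriever and the generator can be swapped or tuned without touching it.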

Compression Strategy Comparison

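Strategy                     | Latency | Cost   | Quality | Use Case
-----------------------------|---------|--------|---------|-------------------------
LLM extraction (GPT-4o-mini) | +300ms  | Medium | High    | Precision-critical
Embeddings filter            | +50ms   | Low    | Medium  | High-throughput
LLMLingua                    | +80ms   | Low    | High    | Token-budget constrained
Sentence window (manual)     | < 10ms  | None   | Medium  | Simple use cases
No compression               | 0ms     | None   | Lower   | Small docs only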
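The "sentence window" row is the only strategy not shown elsewhere in this article. One plausible reading is sketched below: find the sentence that best matches the query and keep it together with its neighbours. The word-overlap scoring, the regex sentence splitter, and the name `sentence_window` are assumptions made for this example.

```python
import re


def sentence_window(document: str, query: str, window: int = 1) -> str:
    """Return the best-matching sentence plus `window` sentences on each side."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))

    def overlap(sentence: str) -> int:
        # Number of distinct query words that appear in the sentence
        return len(query_words & set(re.findall(r"[a-z0-9]+", sentence.lower())))

    best = max(range(len(sentences)), key=lambda i: overlap(sentences[i]))
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])


doc = (
    "XYZ Corp was founded in 1985. "
    "The company operates in 30 countries. "
    "Items can be returned within 30 days. "
    "Defective items have a 90-day return period. "
    "We are committed to sustainability."
)
windowed = sentence_window(doc, "return period for defective items")
print(windowed)
```

This needs no model calls at all, which is why its latency is negligible, but the fixed window will sometimes pull in an unrelated neighbour, as the last sentence here shows.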

Measuring Compression Effectiveness

Track these metrics to ensure compression isn’t removing critical content:

from rouge_score import rouge_scorer


def evaluate_compression(
    original_docs: list[str],
    compressed_docs: list[str],
    answers_with_original: list[str],
    answers_with_compressed: list[str],
    ground_truth: list[str],
) -> dict:
    # Whitespace-separated word counts stand in for token counts here;
    # swap in a real tokenizer for exact numbers.
    token_reduction = 1 - (
        sum(len(c.split()) for c in compressed_docs)
        / sum(len(d.split()) for d in original_docs)
    )

    # Use ROUGE-L to compare each answer against the ground truth
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def mean_rouge_l(answers: list[str]) -> float:
        return sum(
            scorer.score(gt, ans)["rougeL"].fmeasure
            for gt, ans in zip(ground_truth, answers)
        ) / len(ground_truth)

    original_quality = mean_rouge_l(answers_with_original)
    compressed_quality = mean_rouge_l(answers_with_compressed)

    return {
        "token_reduction": token_reduction,
        "quality_original": original_quality,
        "quality_compressed": compressed_quality,
        # Guard against division by zero when the baseline scores 0
        "quality_preservation": (
            compressed_quality / original_quality if original_quality else 0.0
        ),
    }

Target: > 50% token reduction with > 90% quality preservation.
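These targets are easy to turn into an automated check, for example in a regression test that runs whenever the compressor or its threshold changes. The sketch below assumes a metrics dict with the `token_reduction` and `quality_preservation` keys returned by `evaluate_compression`; `meets_target` is a hypothetical helper name.

```python
def meets_target(
    metrics: dict,
    min_reduction: float = 0.5,
    min_preservation: float = 0.9,
) -> bool:
    """True when both the token-reduction and quality-preservation targets are met."""
    return (
        metrics["token_reduction"] > min_reduction
        and metrics["quality_preservation"] > min_preservation
    )


# Illustrative numbers only, not measured results
good_run = {"token_reduction": 0.72, "quality_preservation": 0.96}
too_aggressive = {"token_reduction": 0.85, "quality_preservation": 0.78}

print(meets_target(good_run))
print(meets_target(too_aggressive))
```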

2025 Trend: Query-Conditioned Compression

Newer compression systems condition not just on document-query relevance but on the type of question being asked. A “when did X happen?” question should retain temporal markers. A “how does X work?” question should retain procedural steps. These question-type signals improve compression precision by 10–20% over generic relevance filtering.
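A rough sketch of the idea follows, using hand-written rules in place of a learned system. Everything here is an assumption for illustration: the two regular expressions, the first-word question classifier, and the names `question_type` and `compress_for_question`. A real system would combine this signal with relevance scoring rather than use it alone.

```python
import re

# Illustrative markers only; real systems would learn these signals
TEMPORAL = re.compile(r"\b(?:\d{4}|\d+-(?:day|week|month|year))\b", re.IGNORECASE)
PROCEDURAL = re.compile(r"\b(?:first|then|next|finally|step \d+)\b", re.IGNORECASE)


def question_type(question: str) -> str:
    """Classify a question by its first word (a deliberately crude heuristic)."""
    q = question.strip().lower()
    if q.startswith("when"):
        return "temporal"
    if q.startswith("how"):
        return "procedural"
    return "generic"


def compress_for_question(sentences: list[str], question: str) -> list[str]:
    """Retain the sentences carrying the markers this question type needs."""
    kind = question_type(question)
    if kind == "temporal":
        return [s for s in sentences if TEMPORAL.search(s)]
    if kind == "procedural":
        return [s for s in sentences if PROCEDURAL.search(s)]
    return list(sentences)  # no type-specific signal: keep everything


sentences = [
    "XYZ Corp was founded in 1985.",
    "First, request a return label.",
    "Then ship the item back.",
    "We are committed to sustainability.",
]
when_result = compress_for_question(sentences, "When was XYZ Corp founded?")
how_result = compress_for_question(sentences, "How do I return an item?")
print(when_result)
print(how_result)
```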

Context compression is one of the most cost-effective optimizations in a production RAG system. Once retrieval quality is solid, compression reduces operational costs significantly without requiring changes to any other part of the pipeline.