Context Window in RAG: Operating Within Constraints
A context window is the maximum number of tokens (roughly, word pieces) that an LLM can process in a single request, counting both the input it reads and the output it generates. Every RAG system operates within this limit, so understanding and optimizing for it is critical to building effective systems.
What Is a Token?
Tokens are the fundamental units LLMs process. A token is roughly a word, but not exactly:
- English word ≈ 1-1.5 tokens
- “don’t” = 2 tokens: “don” + “’t”
- Mathematical symbols: often 1 token
- Whitespace: handled differently by each tokenizer; a leading space is often folded into the token that follows it
Examples (illustrative; exact splits depend on the tokenizer):
"Hello, world!" = 4 tokens: ["Hello", ",", "world", "!"]
"The quick brown fox" = 4 tokens
"GPT-4 is awesome!" = 6 tokens: ["GPT", "-", "4", "is", "awesome", "!"]
Token counting libraries:
import tiktoken
# Load the tokenizer that matches the target model
encoding = tiktoken.encoding_for_model("gpt-4")
# Encode the text into a list of token ids
tokens = encoding.encode("The quick brown fox")
print(len(tokens))  # 4
Context Window Sizes Across Models
LLM context windows have grown dramatically:
| Model | Context | Cost Impact |
|---|---|---|
| GPT-3.5 | 4K tokens | Baseline |
| GPT-4 | 8K tokens | Standard |
| GPT-4 Turbo | 128K tokens | ~4x baseline |
| Claude 3 Opus | 200K tokens | Higher per-token |
| Llama 2 | 4K tokens (longer in community fine-tunes) | Open weights |
| Llama 3 | 8K tokens (128K in Llama 3.1) | Open weights |
Larger windows enable:
- More retrieved documents in a single request
- Longer conversations with history
- More complex reasoning
- Less need for query preprocessing
The RAG Context Window Trade-off
In a typical RAG request:
Total tokens = System prompt + Retrieved docs + Query + Response space
Example (GPT-4, 8K context):
- System prompt: 200 tokens
- Retrieved documents (5 chunks × 500 tokens): 2500 tokens
- User query: 50 tokens
- Response buffer (must leave space): 1000 tokens
─────────────────────────────────
Total: 3750 tokens (safe margin)
The pressure: Every additional document chunk consumes tokens. More retrieval = less context space = tighter constraints.
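The budget arithmetic above can be turned into a small helper that reports how many chunks fit. This is a minimal sketch: the function name `max_chunks_that_fit` is made up for illustration, chunks are assumed to be equally sized, and 8192 is an assumed exact size for an "8K" window.

```python
def max_chunks_that_fit(context_window, system_tokens, query_tokens,
                        response_buffer, chunk_tokens):
    """Return how many equally sized chunks fit in the remaining token budget."""
    # Everything that is not a retrieved chunk is treated as fixed overhead.
    available = context_window - system_tokens - query_tokens - response_buffer
    if available <= 0:
        return 0
    return available // chunk_tokens


# Numbers from the example above; 8192 is an assumed size for an "8K" window.
chunk_budget = max_chunks_that_fit(
    context_window=8192,
    system_tokens=200,
    query_tokens=50,
    response_buffer=1000,
    chunk_tokens=500,
)
print(chunk_budget)
```

With these numbers the helper reports 13 chunks, so the 5-chunk example leaves a wide margin.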
Strategies for Context Window Management
Strategy 1: Compression
Remove unnecessary information before sending context to the LLM.
Techniques:
Document summarization: Instead of full retrieved documents, send summaries.
Full document: "The company's fiscal year ends in December... [500 tokens]"
Summary: "Fiscal year ends December, revenue $10B, margin 25%." [~15 tokens]
Extractive compression: Keep only relevant sentences from retrieved documents.
Retrieved document has 20 sentences.
Extract 3-4 most relevant to query.
Save roughly 80% of tokens.
Abstractive compression: Use another LLM to compress the context before feeding it to the main LLM (this adds a second call and its cost, but saves tokens on the main LLM).
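Extractive compression, as described above, can be sketched with nothing more than word overlap. This is an illustration under simple assumptions, not a production scorer: `extract_relevant_sentences` is a hypothetical helper, sentences are split on punctuation, and relevance is the number of distinct words shared with the query. A real system would more likely score sentences with embeddings or a reranker.

```python
import re


def extract_relevant_sentences(document, query, keep=3):
    """Keep the `keep` sentences that share the most words with the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_words = set(re.findall(r"\w+", query.lower()))

    def overlap(sentence):
        return len(query_words & set(re.findall(r"\w+", sentence.lower())))

    # Rank sentence indices by overlap (earlier sentences win ties),
    # then restore document order so the excerpt still reads naturally.
    ranked = sorted(range(len(sentences)), key=lambda i: (-overlap(sentences[i]), i))
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)


doc = ("The fiscal year ends in December. The office has a cafeteria. "
       "Revenue for the fiscal year was $10B. Parking is free on weekends.")
compressed = extract_relevant_sentences(doc, "When does the fiscal year end?", keep=2)
print(compressed)
```

Here the two fiscal-year sentences survive and the cafeteria and parking sentences are dropped.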
Strategy 2: Selective Retrieval
Retrieve fewer documents, but better quality.
Approach 1: Confidence-based filtering
Retrieve top 20 candidates
Filter to only those with similarity > 0.85
May drop to 3-5 documents instead of 20
Approach 2: Diversity-based retrieval
Retrieve 20 similar documents
Remove duplicates and very similar ones
Select diverse subset (cover different aspects)
Result: Fewer documents, more information coverage
Approach 3: Hierarchical retrieval
Retrieve top 5 "overview" documents first
Pass to LLM for answer extraction
If insufficient, retrieve detailed "detail" documents
Adaptive based on query complexity
Strategy 3: Prompt Optimization
Reduce system prompt token count.
Compress prose:
Original (verbose): "You are an AI assistant specializing in customer support..."
Compressed: "You are a support AI."
Or even better: Use the API's system role parameter, if available
Use structured formats:
Instead of: "Here are retrieved documents:
Document 1: The company...
Document 2: The product..."
Use:
{
  "docs": [
    {"id": "doc1", "source": "...", "content": "..."},
    {"id": "doc2", "source": "...", "content": "..."}
  ]
}
Structured formats make document boundaries explicit and often replace filler prose, but their punctuation also costs tokens, so measure both versions before assuming a saving.
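If retrieved documents are sent as JSON, the serialization settings affect how much space they take. The sketch below compares indented and compact output from Python's standard `json` module. The `source` values are hypothetical placeholders, and character count is only a rough proxy for token count; the model's tokenizer gives the real number.

```python
import json

# Hypothetical retrieved documents; the source names are placeholders.
docs = [
    {"id": "doc1", "source": "handbook.pdf", "content": "The company..."},
    {"id": "doc2", "source": "faq.md", "content": "The product..."},
]

# Indented output is easy to read but spends characters on whitespace.
pretty = json.dumps({"docs": docs}, indent=2)

# Compact separators drop the spaces after "," and ":".
compact = json.dumps({"docs": docs}, separators=(",", ":"))

print(len(pretty), len(compact))
```

Both strings decode to the same data, so the compact form is a safe default when the payload is meant for the model and not for a human reader.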
Strategy 4: Multi-Turn Processing
Break complex tasks into multiple LLM calls.
Example:
Turn 1: Query LLM with compressed context → intermediate answer
Turn 2: Based on gaps, retrieve more specific documents
Turn 3: Generate final answer with all context
Total tokens may be higher, but total cost can still be lower if the later calls are cheaper or use smaller models.
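The multi-turn flow above can be sketched as a loop that retrieves more only when the model reports a gap. Everything here is an assumption made for illustration: `retrieve` and `ask_llm` are hypothetical callables supplied by the caller, and the reply is assumed to be a dict with an `answer` and a `missing` field. The stubs stand in for a real retriever and model so the example runs on its own.

```python
def answer_in_turns(query, retrieve, ask_llm, max_rounds=3):
    """Start with a small context and fetch more only when the model reports a gap."""
    context = retrieve(query)
    for _ in range(max_rounds - 1):
        reply = ask_llm(query, context)
        if reply["missing"] is None:
            return reply["answer"]
        # The model named what it lacks, so retrieve specifically for that gap.
        context = context + retrieve(reply["missing"])
    # Final turn: answer with everything gathered so far.
    return ask_llm(query, context)["answer"]


def fake_retrieve(text):
    """Stub retriever: one targeted entry, otherwise a generic overview."""
    store = {"refund policy": ["Refunds are issued within 30 days."]}
    return store.get(text, ["General support overview."])


def fake_llm(query, context):
    """Stub model: answers only once the relevant fact is in the context."""
    if any("30 days" in chunk for chunk in context):
        return {"answer": "Refunds take up to 30 days.", "missing": None}
    return {"answer": "Not sure.", "missing": "refund policy"}


result = answer_in_turns("How long do refunds take?", fake_retrieve, fake_llm)
print(result)
```

With the stubs, the first call reports a gap, the second retrieval fills it, and the second call returns the answer.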
Advanced Context Management Techniques
Long-Context Models for Retrieval
With 128K+ context windows, you can retrieve comprehensively:
Before (4K context):
- Retrieve 3-5 documents
- Pray they contain the answer
- Often need expensive reranking
After (128K context):
- Retrieve 30-50 documents
- Include all possibly relevant information
- Let LLM do the synthesis
- Eliminates “what if the answer was in a different document” anxiety
This shifts the problem: from “compress to fit” to “retrieve broadly and synthesize.”
Sparse-Dense Retrieval Fusion
Use both keyword search (sparse) and semantic search (dense):
Sparse retrieval (BM25): 10 documents with keyword matches
Dense retrieval (embeddings): 10 documents with semantic similarity
Union/intersection/ranking: Select diverse set of 5-10 documents
Result: Cover both keyword specificity and semantic understanding
Cost: More retrieval computation, but often better results.
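The text above leaves the merging step open. One common option is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever and not comparable scores. The sketch below is a minimal version: the document ids are made up, and `k=60` is the conventional smoothing constant, used here as an assumption and not as a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge ranked lists of document ids; ids ranked high in several lists win."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for deterministic output.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


bm25_hits = ["d3", "d1", "d7"]    # hypothetical sparse (keyword) ranking
dense_hits = ["d1", "d9", "d3"]   # hypothetical dense (embedding) ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits], top_n=3)
print(fused)
```

Documents that appear in both lists (`d1`, `d3`) rise to the top, followed by the best document found by only one retriever.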
Hierarchical Chunking with Selective Retrieval
Store documents at multiple levels:
Level 1 (abstract): Document summaries (100 tokens each)
Level 2 (detailed): Full document chunks (500 tokens each)
Retrieval strategy:
1. Retrieve top documents at Level 1
2. Score for relevance
3. For top 3, retrieve Level 2 details
4. Feed selected detailed chunks to LLM
Fewer total tokens, better coverage.
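The two-level strategy above can be sketched with plain dictionaries. This is an illustration under simplifying assumptions: `score` is a toy word-overlap function standing in for a real relevance model, and `summaries` and `chunks` are hypothetical in-memory stores keyed by document id.

```python
import re


def words(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def score(query, text):
    """Toy relevance score: number of distinct query words found in the text."""
    return len(words(query) & words(text))


def hierarchical_retrieve(query, summaries, chunks, top_docs=3):
    """Rank documents by Level 1 summary, then return Level 2 chunks of the best ones."""
    ranked = sorted(summaries, key=lambda doc_id: (-score(query, summaries[doc_id]), doc_id))
    selected = []
    for doc_id in ranked[:top_docs]:
        selected.extend(chunks[doc_id])
    return selected


# Level 1: short summaries. Level 2: detailed chunks. Both are made-up examples.
summaries = {
    "billing": "How invoices and refunds work.",
    "setup": "Installing and configuring the product.",
}
chunks = {
    "billing": ["Refunds are processed in 30 days.", "Invoices are sent monthly."],
    "setup": ["Run the installer.", "Set the API key."],
}
selected = hierarchical_retrieve("refunds and invoices", summaries, chunks, top_docs=1)
print(selected)
```

Only the summaries are scored for every document, so the detailed chunks are loaded for the winning documents alone.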
Measuring Context Window Efficiency
Track these metrics:
Utilization Rate
tokens_used / tokens_available
Target: 60-80% (leave margin for variability)
Retrieval-to-Response Ratio
tokens_in_retrieved_docs / tokens_in_response
Monitor for outliers
Cost per Query
(tokens_used × cost_per_token) / queries
Optimize for throughput and quality balance
Common Context Window Mistakes
Wasteful system prompts: Overly long instructions when terse works equally well.
No compression: Retrieving 20 documents when 5 reranked documents work.
Ignoring token counting: Assuming documents are smaller than they are, then failing at runtime.
Not using longer-context models: Staying on 4K models when 128K would solve problems.
Over-optimization: Spending engineering effort to save 50 tokens when larger-context models are abundant and cheap.
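One way to avoid the token-counting mistake above is to check the budget before every request instead of discovering overruns at runtime. The sketch below is a minimal pre-flight check that also reports the utilization rate against the 60-80% target. The `preflight` name and the report fields are hypothetical, and the token counts are assumed to come from a real tokenizer.

```python
def preflight(prompt_tokens, response_buffer, context_window, target=(0.6, 0.8)):
    """Report utilization and whether the request fits before it is sent."""
    used = prompt_tokens + response_buffer
    utilization = used / context_window
    return {
        "utilization": round(utilization, 3),
        "fits": used <= context_window,
        "in_target_band": target[0] <= utilization <= target[1],
    }


# 2750 prompt tokens = 200 system + 2500 retrieved + 50 query, as in the earlier example.
# 8192 is an assumed exact size for an "8K" window.
report = preflight(prompt_tokens=2750, response_buffer=1000, context_window=8192)
print(report)
```

For the earlier 8K example the request fits at about 46% utilization, which is below the target band and leaves room for more retrieved context.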
Context Windows in 2024 and Beyond
Trend 1: Infinite context. Research on recurrent models and efficient attention suggests context limits will eventually become irrelevant.
Trend 2: Variable-cost models. Some models charge less per input token than per output token, changing optimization strategies.
Trend 3: Speculative retrieval. Models that suggest what additional context they need mid-generation, enabling dynamic retrieval.
Trend 4: Hierarchical LLMs. Small models filter for large models, optimizing retrieval before expensive processing.
Context window constraints are real but manageable. Understand your limits, measure your usage, and optimize strategically.