Context Window in RAG: Operating Within Constraints

A context window is the maximum number of tokens (roughly, word pieces) that an LLM can process in a single request, counting both the input and the generated output. Every RAG system operates within this limit, so understanding and optimizing for it is critical to building effective systems.

What Is a Token?

Tokens are the fundamental units LLMs process. A token is roughly a word, but not exactly:

English word ≈ 1-1.5 tokens
“don’t” = 2 tokens: “don” + “’t”
Mathematical symbols: often 1 token
Whitespace: usually merged into the neighboring token (e.g. " world") rather than counted on its own, though this varies by tokenizer

Examples:

"Hello, world!" = 4 tokens: ["Hello", ",", "world", "!"]
"The quick brown fox" = 4 tokens
"GPT-4 is awesome!" = 6 tokens: ["GPT", "-", "4", "is", "awesome", "!"]

Token counting libraries:

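import tiktoken

# Look up the tokenizer that matches the target model
encoding = tiktoken.encoding_for_model("gpt-4")
# encode() returns a list of integer token IDs
tokens = encoding.encode("The quick brown fox")
print(len(tokens))  # 4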

Context Window Sizes Across Models

LLM context windows have grown dramatically:

Model	Context	Cost Impact
GPT-3.5	4K tokens	Baseline
GPT-4	8K tokens	Standard
GPT-4 Turbo	128K tokens	~4x baseline
Claude 3 Opus	200K tokens	Higher per-token
Llama 2	4K base, 32K extended	Open source
Llama 3	8K base, 128K in Llama 3.1	Open source

Larger windows enable:

More retrieved documents in a single request
Longer conversations with history
More complex reasoning
Less need for query preprocessing

The RAG Context Window Trade-off

In a typical RAG request:

Total tokens = System prompt + Retrieved docs + Query + Response space

Example (GPT-4, 8K context):
- System prompt: 200 tokens
- Retrieved documents (5 chunks × 500 tokens): 2500 tokens
- User query: 50 tokens
- Response buffer (must leave space): 1000 tokens
─────────────────────────────────
Total: 3750 tokens (safe margin)
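
The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the `Budget` class and `pack_chunks` helper are hypothetical names, the 8192-token limit assumes the 8K GPT-4 variant, and chunk sizes are assumed to be pre-counted with a tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    context_window: int   # model limit in tokens
    system_prompt: int    # tokens used by the system prompt
    query: int            # tokens used by the user query
    response_buffer: int  # tokens reserved for the model's answer

    def available_for_docs(self) -> int:
        """Tokens left over for retrieved chunks (never negative)."""
        used = self.system_prompt + self.query + self.response_buffer
        return max(0, self.context_window - used)


def pack_chunks(chunk_token_counts: list[int], budget: Budget) -> list[int]:
    """Greedily keep chunks, in ranked order, until the budget is spent.

    Returns the indices of the chunks that fit.
    """
    remaining = budget.available_for_docs()
    kept = []
    for i, n in enumerate(chunk_token_counts):
        if n > remaining:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(i)
        remaining -= n
    return kept


budget = Budget(context_window=8192, system_prompt=200, query=50, response_buffer=1000)
print(budget.available_for_docs())           # 6942
print(len(pack_chunks([500] * 20, budget)))  # 13
```

With these numbers, 13 chunks of 500 tokens fit before the response buffer is threatened, which is why the 5-chunk example leaves a comfortable margin.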

The pressure: Every additional document chunk consumes tokens. More retrieved text leaves less room for the response and for conversation history, so each chunk has to earn its place.

Strategies for Context Window Management

Strategy 1: Compression

Remove unnecessary information before sending it to the LLM.

Techniques:

Document summarization: Instead of full retrieved documents, send summaries.

Full document: "The company's fiscal year ends in December... [500 tokens]"
Summary: "Fiscal year ends December, revenue $10B, margin 25%." [~15 tokens]

Extractive compression: Keep only relevant sentences from retrieved documents.

Retrieved document has 20 sentences.
Extract the 3-4 most relevant to the query.
Save roughly 80% of tokens.

Abstractive compression: Use another LLM to compress before feeding to main LLM (increases cost but saves main LLM tokens).
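
As a rough illustration of the extractive compression described above, the sketch below keeps the sentences that share the most words with the query. Word overlap is only a stand-in scoring rule chosen to keep the example dependency-free; a real system would more likely score sentences with embeddings or a reranker. The function name `extract_relevant` is hypothetical.

```python
import re


def extract_relevant(document: str, query: str, keep: int = 3) -> str:
    """Keep the `keep` sentences sharing the most words with the query."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_terms = set(re.findall(r"\w+", query.lower()))

    def score(sentence: str) -> int:
        return len(query_terms & set(re.findall(r"\w+", sentence.lower())))

    # Rank by overlap, then restore document order so the text still reads naturally.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "The fiscal year ends in December. The office has a cafeteria. "
    "Revenue was $10B last year. Parking is free."
)
print(extract_relevant(doc, "What was revenue in the fiscal year?", keep=2))
# The fiscal year ends in December. Revenue was $10B last year.
```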

Strategy 2: Selective Retrieval

Retrieve fewer documents, but better quality.

Approach 1: Confidence-based filtering

Retrieve top 20 candidates
Filter to only those with similarity > 0.85
May drop to 3-5 documents instead of 20
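
A minimal sketch of this filter, assuming the retriever returns `(doc_id, similarity)` pairs. The `min_keep` fallback is an addition not described above: it is there so a strict threshold never leaves the LLM with an empty context.

```python
def filter_by_similarity(candidates, threshold=0.85, min_keep=1):
    """Keep candidates scoring above `threshold`, best first.

    candidates: list of (doc_id, similarity) pairs.
    min_keep:   hypothetical safety net; if too few pass, return the top few anyway.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = [c for c in ranked if c[1] > threshold]
    return kept if len(kept) >= min_keep else ranked[:min_keep]


hits = [("a", 0.91), ("b", 0.62), ("c", 0.88), ("d", 0.84)]
print(filter_by_similarity(hits))  # [('a', 0.91), ('c', 0.88)]
```

The 0.85 cutoff is taken from the example above; a sensible value depends on the embedding model and should be tuned on real queries.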

Approach 2: Diversity-based retrieval

Retrieve 20 similar documents
Remove duplicates and very similar ones
Select diverse subset (cover different aspects)
Result: Fewer documents, more information coverage
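
One common way to implement this kind of selection is Maximal Marginal Relevance (MMR), which is named here as an illustrative choice rather than something the approach above prescribes. The sketch assumes documents and the query are already embedded as plain lists of floats; `lambda_` trades relevance (1.0) against diversity (0.0).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=5, lambda_=0.7):
    """Return indices of up to k documents, balancing relevance against redundancy."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.2]
docs = [[1.0, 0.1], [1.0, 0.12], [0.5, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2, lambda_=0.5))  # [1, 2]
```

With `lambda_=1.0` the same call degenerates to plain top-k by relevance and returns both near-duplicates.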

Approach 3: Hierarchical retrieval

Retrieve top 5 "overview" documents first
Pass to LLM for answer extraction
If insufficient, retrieve detailed "detail" documents
Adaptive based on query complexity

Strategy 3: Prompt Optimization

Reduce system prompt token count.

Compress prose:

Original (verbose): "You are an AI assistant specializing in customer support..."
Compressed: "You are a support AI."
Or even better: put standing instructions in the API's system role, if available, rather than repeating them in every user message

Use structured formats:

Instead of: "Here are retrieved documents:
Document 1: The company...
Document 2: The product..."

Use:
{
  "docs": [
    {"id": "doc1", "source": "...", "content": "..."},
    {"id": "doc2", "source": "...", "content": "..."}
  ]
}

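Structured formats give the model unambiguous document boundaries, but JSON punctuation also costs tokens, so count tokens for both versions before assuming a saving.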

Strategy 4: Multi-Turn Processing

Break complex tasks into multiple LLM calls.

Example:

Turn 1: Query LLM with compressed context → intermediate answer
Turn 2: Based on gaps, retrieve more specific documents
Turn 3: Generate final answer with all context

Total token usage may be higher, but total cost can still be lower if the later calls are cheaper or use smaller models.
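
The three turns can be sketched as a small loop. Everything here is an assumption made for illustration: `retrieve` and `ask_llm` are hypothetical callables standing in for your retriever and model client, and the "NEED:" reply convention is an invented protocol for letting the model report a gap.

```python
from typing import Callable, List


def build_prompt(query: str, context: List[str]) -> str:
    """Assemble context and question into one prompt string."""
    docs = "\n".join(context)
    return (
        f"Context:\n{docs}\n\n"
        f"Question: {query}\n"
        "Answer the question, or reply 'NEED: <missing topic>' "
        "if the context is insufficient."
    )


def answer_in_turns(
    query: str,
    retrieve: Callable[[str], List[str]],  # hypothetical retriever
    ask_llm: Callable[[str], str],         # hypothetical model call
    max_extra_rounds: int = 2,
) -> str:
    # Turn 1: answer from a small initial context.
    context = list(retrieve(query))
    reply = ask_llm(build_prompt(query, context))
    for _ in range(max_extra_rounds):
        if not reply.startswith("NEED:"):
            break
        # Turn 2: retrieve documents targeted at the reported gap.
        gap = reply[len("NEED:"):].strip()
        context.extend(retrieve(gap))
        # Turn 3: answer again with the enlarged context.
        reply = ask_llm(build_prompt(query, context))
    return reply
```

Capping the number of extra rounds keeps worst-case cost bounded; if the cap is reached, the last reply is returned as-is and the caller has to handle it.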

Advanced Context Management Techniques

Long-Context Models for Retrieval

With 128K+ context windows, you can retrieve comprehensively:

Before (4K context):

Retrieve 3-5 documents
Pray they contain the answer
Often need expensive reranking

After (128K context):

Retrieve 30-50 documents
Include all possibly relevant information
Let LLM do the synthesis
Eliminates “what if the answer was in a different document” anxiety

This shifts the problem: from “compress to fit” to “retrieve broadly and synthesize.”

Sparse-Dense Retrieval Fusion

Use both keyword search (sparse) and semantic search (dense):

Sparse retrieval (BM25): 10 documents with keyword matches
Dense retrieval (embeddings): 10 documents with semantic similarity
Union/intersection/ranking: Select diverse set of 5-10 documents

Result: Cover both keyword specificity and semantic understanding

Cost: More retrieval computation, but often better results.
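
For the ranking step, one widely used option is Reciprocal Rank Fusion (RRF), shown here as one possible way to merge the two lists rather than the only one. The sketch assumes each retriever returns document IDs ordered best first; `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Merge several ranked lists of doc IDs into one.

    Each document scores 1 / (k + rank) per list it appears in, so documents
    ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


sparse = ["a", "b", "c"]  # e.g. BM25 results
dense = ["c", "a", "d"]   # e.g. embedding results
print(reciprocal_rank_fusion([sparse, dense]))  # ['a', 'c', 'b', 'd']
```

RRF only uses ranks, so it avoids having to normalize BM25 scores against cosine similarities.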

Hierarchical Chunking with Selective Retrieval

Store documents at multiple levels:

Level 1 (abstract): Document summaries (100 tokens each)
Level 2 (detailed): Full document chunks (500 tokens each)

Retrieval strategy:
1. Retrieve top documents at Level 1
2. Score for relevance
3. For top 3, retrieve Level 2 details
4. Feed selected detailed chunks to LLM

Fewer total tokens, better coverage.
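
The retrieval strategy above can be sketched as follows. The data layout (a dict of summaries and a dict of chunk lists keyed by document ID) and the word-overlap score are assumptions made to keep the example self-contained; in practice both levels would typically be scored with embeddings.

```python
import re


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct words shared with the query."""
    return len(_words(query) & _words(text))


def two_level_retrieve(query, summaries, chunks, top_docs=3, chunks_per_doc=2):
    """summaries: {doc_id: summary}; chunks: {doc_id: [chunk, ...]}."""
    # Level 1: rank documents by their short summaries.
    best_docs = sorted(
        summaries, key=lambda d: overlap(query, summaries[d]), reverse=True
    )[:top_docs]
    # Level 2: pull detailed chunks only from the winning documents.
    selected = []
    for doc_id in best_docs:
        ranked = sorted(chunks[doc_id], key=lambda c: overlap(query, c), reverse=True)
        selected.extend(ranked[:chunks_per_doc])
    return selected


summaries = {
    "billing": "Invoices and refund policy",
    "setup": "Installation and setup guide",
}
chunks = {
    "billing": ["Refund requests take five days.", "Invoices are sent monthly."],
    "setup": ["Run the installer."],
}
print(two_level_retrieve("refund policy", summaries, chunks, top_docs=1, chunks_per_doc=1))
# ['Refund requests take five days.']
```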

Measuring Context Window Efficiency

Track these metrics:

Utilization Rate

tokens_used / tokens_available
Target: 60-80% (leave margin for variability)

Retrieval-to-Response Ratio

tokens_in_retrieved_docs / tokens_in_response
Monitor for outliers

Cost per Query

(total_tokens_used × cost_per_token) / number_of_queries
Track alongside answer quality; where input and output tokens are priced differently, compute the two separately
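
The three metrics are simple enough to compute inline, but small helpers make them easy to log per request. The function names are hypothetical, and the example inputs reuse the 8K budget numbers from earlier in this article.

```python
def utilization_rate(tokens_used: int, tokens_available: int) -> float:
    """Fraction of the context window consumed by a request."""
    return tokens_used / tokens_available


def retrieval_to_response_ratio(retrieved_tokens: int, response_tokens: int) -> float:
    """How many retrieved tokens were spent per generated token."""
    return retrieved_tokens / response_tokens if response_tokens else float("inf")


def cost_per_query(total_tokens: int, cost_per_token: float, queries: int) -> float:
    """Average spend per query over a batch, using a single blended token price."""
    return total_tokens * cost_per_token / queries


print(f"{utilization_rate(3750, 8192):.0%}")    # 46%
print(retrieval_to_response_ratio(2500, 1000))  # 2.5
```

At 46% utilization, the earlier example sits below the 60-80% target, which suggests there is room for a few more chunks.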

Common Context Window Mistakes

Wasteful system prompts: Overly long instructions when terse works equally well.

No filtering or compression: Sending all 20 retrieved documents when 5 reranked documents would answer the query.

Ignoring token counting: Assuming documents are smaller than they are, then failing at runtime.

Not using longer-context models: Staying on 4K models when 128K would solve problems.

Over-optimization: Spending engineering effort to save 50 tokens when tokens are abundant and cheap.

Context Windows in 2024 and Beyond

Trend 1: Infinite context. Research on recurrent models and efficient attention suggests context limits will eventually become irrelevant.

Trend 2: Variable-cost models. Some models charge less per input token than per output token, changing optimization strategies.

Trend 3: Speculative retrieval. Models that suggest what additional context they need mid-generation, enabling dynamic retrieval.

Trend 4: Hierarchical LLMs. Small models filter for large models, optimizing retrieval before expensive processing.

Context window constraints are real but manageable. Understand your limits, measure your usage, and optimize strategically.