Context Window
The context window is one of the most practically important constraints in LLM engineering. It determines how much information a model can “see” at once — and understanding it helps you design better prompts, cheaper architectures, and more reliable applications.
What Is the Context Window?
The context window is the maximum number of tokens a model can process in a single forward pass. It includes everything: the system prompt, conversation history, retrieved documents, tool outputs, and the response being generated.
┌─────────────────────────────────────────────────────────────┐│ CONTEXT WINDOW (128K tokens) ││ ││ System Prompt │ Chat History │ Retrieved Docs │ Response ││ (~500 tokens) │ (~2K tokens) │ (~20K tokens) │ (gen'd) │└─────────────────────────────────────────────────────────────┘Modern context windows (2024–2026):
- GPT-4o: 128K tokens
- Claude 3.5: 200K tokens
- Gemini 1.5 Pro: 1M tokens
- Gemini 1.5 Flash: 1M tokens
- LLaMA 3.1 (70B/405B): 128K tokens
How the KV Cache Works
When a model generates text token by token, it would be catastrophically inefficient to recompute the full attention over all previous tokens for every new token. The KV cache solves this.
During the forward pass, each Transformer layer computes Key and Value matrices for every token. These are cached in GPU memory and reused for all subsequent tokens in the generation.
Prompt (1000 tokens) → Compute + cache K,V for all 1000 tokens │ └── Generate token 1001: use cached K,V → much faster └── Generate token 1002: use cached K,V (now includes 1001) └── Generate token N: use cached K,V (grows by 1 each step)KV cache cost: For LLaMA 3 70B with a 128K context:
KV cache size = 2 × n_layers × n_kv_heads × d_head × context_len × bytes_per_param = 2 × 80 × 8 × 128 × 128,000 × 2 bytes ≈ ~42GBThis is why long contexts are expensive and why Grouped-Query Attention (GQA) matters — reducing n_kv_heads from 64 to 8 cuts KV cache memory by 8×.
The “Lost in the Middle” Problem
Longer context windows don’t mean LLMs use all of it equally well. Research published in 2023 (and repeatedly replicated since) shows a recency and primacy bias:
- Information at the beginning and end of the context is recalled reliably
- Information in the middle is often missed or underweighted
Recall accuracy by position in context:Position 0% (start) ██████████████████ ~95%Position 25% ████████████ ~70%Position 50% (middle) ███████ ~45%Position 75% █████████████ ~65%Position 100% (end) █████████████████ ~90%Practical implication: When stuffing retrieved documents into a context, put the most important information first or last. Don’t bury critical facts in the middle of 15 documents.
Context vs. Memory
A common confusion: the context window is not the model’s memory. It’s a buffer, not persistent storage.
| Property | Context Window | Long-term Memory |
|---|---|---|
| Persistence | Gone after session | Survives across sessions |
| Size | Tokens (8K–1M) | Effectively unlimited |
| Access speed | Immediate (in-context) | Requires retrieval |
| Cost | Charged per token | Storage cost |
| Implementation | Built into model | External DB / RAG |
For applications that need to remember users across sessions, RAG with a vector database is the standard solution — not extending the context window.
Strategies for Long-Document Handling
When your document exceeds the context window, you have several options:
1. Chunking + RAG (Most Common)
Split the document into chunks, embed them, store in a vector DB, retrieve the most relevant chunks at query time. Only the relevant portion enters the context.
2. Summarization Chains
Summarize segments of a long document sequentially, building a compressed representation that fits in context. Used for book-length summarization.
Chapter 1 → Summarize → 200 tokensChapter 2 → Summarize → 200 tokens...All summaries → Final synthesis → Full summary3. Map-Reduce
Apply the same operation (e.g., extract key claims) to each chunk in parallel, then reduce (combine) the results. More parallelizable than sequential chains.
4. Use a Long-Context Model
If Gemini 1.5 Pro’s 1M token context is large enough, just send the whole document. This is increasingly practical for medium-length corpora (500–1000 pages).
Context Window Economics
More context = more compute. The computational cost of attention scales quadratically with sequence length (O(N²)), though Flash Attention makes this more practical.
Rough cost guide (OpenAI GPT-4o pricing example):
| Scenario | Tokens | Cost (input) |
|---|---|---|
| Short query | 500 | $0.0025 |
| Document QA (10 pages) | 8,000 | $0.04 |
| Long analysis (100 pages) | 80,000 | $0.40 |
| Book summary (300 pages) | 240,000 | $1.20 |
For high-volume applications, context window size directly impacts operating costs. A well-implemented RAG system that retrieves 3K tokens instead of sending 80K tokens pays 26× less per query.
Practical Tips for Context Management
Track token counts proactively: Don’t wait for a context_length_exceeded error. Instrument your application to log token usage per request.
Implement sliding window for conversations: Keep only the last N turns in context, summarizing older turns. Many production chatbots do this automatically.
System prompt optimization: System prompts are in the context on every request. A 2,000-token system prompt multiplied by 10,000 daily requests = 20M tokens/day in overhead alone.
Structured context positioning: For RAG, place the question first, then supporting documents, then instructions for how to answer. Empirically better than the reverse.
Tool outputs are tokens too: In agentic workflows, tool call results accumulate in context. Long JSON responses from APIs can eat context fast — trim unnecessary fields before returning tool output to the model.
Where This Is Heading
The “context window arms race” of 2024–2025 has somewhat plateaued. Models now have enough context for most practical tasks. The frontier questions are:
- Quality at length: Can models actually use 1M token contexts well, or do they still lose things in the middle?
- KV cache offloading: Moving the KV cache to CPU or SSD to enable longer inference sessions at lower GPU cost
- Selective attention / memory: Systems that compress the context dynamically, keeping recent and high-attention tokens and summarizing the rest
For most practitioners in 2026, 128K–200K tokens is more than enough if you architect your application well. The limiting factor usually isn’t the context window — it’s the signal-to-noise ratio of what you put in it.