Context Window

The context window is one of the most practically important constraints in LLM engineering. It determines how much information a model can “see” at once — and understanding it helps you design better prompts, cheaper architectures, and more reliable applications.

What Is the Context Window?

The context window is the maximum number of tokens a model can process in a single forward pass. It includes everything: the system prompt, conversation history, retrieved documents, tool outputs, and the response being generated.

┌─────────────────────────────────────────────────────────────┐
│                    CONTEXT WINDOW (128K tokens)              │
│                                                             │
│  System Prompt  │  Chat History  │  Retrieved Docs  │  Response │
│  (~500 tokens)  │  (~2K tokens)  │  (~20K tokens)   │ (gen'd)   │
└─────────────────────────────────────────────────────────────┘

Modern context windows (2024–2026):

GPT-4o: 128K tokens
Claude 3.5: 200K tokens
Gemini 1.5 Pro: 1M tokens
Gemini 1.5 Flash: 1M tokens
LLaMA 3.1 (70B/405B): 128K tokens

How the KV Cache Works

When a model generates text token by token, it would be catastrophically inefficient to recompute the full attention over all previous tokens for every new token. The KV cache solves this.

During the forward pass, each Transformer layer computes Key and Value matrices for every token. These are cached in GPU memory and reused for all subsequent tokens in the generation.

Prompt (1000 tokens) → Compute + cache K,V for all 1000 tokens
  │
  └── Generate token 1001: use cached K,V → much faster
  └── Generate token 1002: use cached K,V (now includes 1001)
  └── Generate token N: use cached K,V (grows by 1 each step)

KV cache cost: For LLaMA 3 70B with a 128K context:

KV cache size = 2 × n_layers × n_kv_heads × d_head × context_len × bytes_per_param
             = 2 × 80 × 8 × 128 × 128,000 × 2 bytes  ≈ ~42GB

This is why long contexts are expensive and why Grouped-Query Attention (GQA) matters — reducing n_kv_heads from 64 to 8 cuts KV cache memory by 8×.

The “Lost in the Middle” Problem

Longer context windows don’t mean LLMs use all of it equally well. Research published in 2023 (and repeatedly replicated since) shows a recency and primacy bias:

Information at the beginning and end of the context is recalled reliably
Information in the middle is often missed or underweighted

Recall accuracy by position in context:
Position 0%  (start)   ██████████████████  ~95%
Position 25%           ████████████        ~70%
Position 50% (middle)  ███████             ~45%
Position 75%           █████████████       ~65%
Position 100% (end)    █████████████████   ~90%

Practical implication: When stuffing retrieved documents into a context, put the most important information first or last. Don’t bury critical facts in the middle of 15 documents.

Context vs. Memory

A common confusion: the context window is not the model’s memory. It’s a buffer, not persistent storage.

Property	Context Window	Long-term Memory
Persistence	Gone after session	Survives across sessions
Size	Tokens (8K–1M)	Effectively unlimited
Access speed	Immediate (in-context)	Requires retrieval
Cost	Charged per token	Storage cost
Implementation	Built into model	External DB / RAG

For applications that need to remember users across sessions, RAG with a vector database is the standard solution — not extending the context window.

Strategies for Long-Document Handling

When your document exceeds the context window, you have several options:

1. Chunking + RAG (Most Common)

Split the document into chunks, embed them, store in a vector DB, retrieve the most relevant chunks at query time. Only the relevant portion enters the context.

2. Summarization Chains

Summarize segments of a long document sequentially, building a compressed representation that fits in context. Used for book-length summarization.

Chapter 1 → Summarize → 200 tokens
Chapter 2 → Summarize → 200 tokens
...
All summaries → Final synthesis → Full summary

3. Map-Reduce

Apply the same operation (e.g., extract key claims) to each chunk in parallel, then reduce (combine) the results. More parallelizable than sequential chains.

4. Use a Long-Context Model

If Gemini 1.5 Pro’s 1M token context is large enough, just send the whole document. This is increasingly practical for medium-length corpora (500–1000 pages).

Context Window Economics

More context = more compute. The computational cost of attention scales quadratically with sequence length (O(N²)), though Flash Attention makes this more practical.

Rough cost guide (OpenAI GPT-4o pricing example):

Scenario	Tokens	Cost (input)
Short query	500	$0.0025
Document QA (10 pages)	8,000	$0.04
Long analysis (100 pages)	80,000	$0.40
Book summary (300 pages)	240,000	$1.20

For high-volume applications, context window size directly impacts operating costs. A well-implemented RAG system that retrieves 3K tokens instead of sending 80K tokens pays 26× less per query.

Practical Tips for Context Management

Track token counts proactively: Don’t wait for a context_length_exceeded error. Instrument your application to log token usage per request.

Implement sliding window for conversations: Keep only the last N turns in context, summarizing older turns. Many production chatbots do this automatically.

System prompt optimization: System prompts are in the context on every request. A 2,000-token system prompt multiplied by 10,000 daily requests = 20M tokens/day in overhead alone.

Structured context positioning: For RAG, place the question first, then supporting documents, then instructions for how to answer. Empirically better than the reverse.

Tool outputs are tokens too: In agentic workflows, tool call results accumulate in context. Long JSON responses from APIs can eat context fast — trim unnecessary fields before returning tool output to the model.

Where This Is Heading

The “context window arms race” of 2024–2025 has somewhat plateaued. Models now have enough context for most practical tasks. The frontier questions are:

Quality at length: Can models actually use 1M token contexts well, or do they still lose things in the middle?
KV cache offloading: Moving the KV cache to CPU or SSD to enable longer inference sessions at lower GPU cost
Selective attention / memory: Systems that compress the context dynamically, keeping recent and high-attention tokens and summarizing the rest

For most practitioners in 2026, 128K–200K tokens is more than enough if you architect your application well. The limiting factor usually isn’t the context window — it’s the signal-to-noise ratio of what you put in it.