Retrieval Quality in RAG: Metrics That Matter
A RAG system is fundamentally limited by the quality of its retrieval component. If the retriever doesn’t find relevant documents, the generator has nothing to work with. Yet retrieval quality is often overlooked in favor of flashier metrics about generation.
The Retrieval Quality Problem
Poor retrieval leads to:
- Missing information: Relevant documents weren’t retrieved, so the LLM can’t find the answer
- Noisy context: Irrelevant documents clutter the context window, distracting the generator
- Hallucinations: When retrieval fails, the LLM falls back to making up answers
- Wasted cost: Retrieving many documents just to surface a few relevant ones burns tokens and latency
Getting retrieval right is the foundation for everything else.
Key Metrics for Retrieval Quality
Metric 1: Recall@K
What percentage of relevant documents appear in the top K results?
Formula:
Recall@K = (# of relevant docs in top K) / (# of total relevant docs)
Example:
Document corpus has 10 relevant documents for a query
Top 5 retrieved results contain 3 of those relevant documents
Recall@5 = 3/10 = 0.30 (30%)
Interpretation:
- Recall@5 = 0.3: Only catching 30% of relevant documents
- Recall@10 = 0.7: Catching 70% of relevant documents
- Recall@50 = 0.95: Catching almost everything
Why it matters: If a relevant document exists but isn’t retrieved, the LLM can’t use it. Recall measures what you’re missing.
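As a minimal sketch of the Recall@K formula, assuming documents are identified by string IDs and that the full set of relevant IDs for a query is known (the function and variable names here are illustrative, not from any particular library):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results.

    retrieved: ranked list of document IDs (best first).
    relevant:  set of document IDs judged relevant for the query.
    """
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Mirrors the worked example: 10 relevant docs, 3 of them in the top 5.
relevant = {f"rel{i}" for i in range(10)}
retrieved = ["rel0", "x1", "rel1", "x2", "rel2", "rel3", "x3"]
print(recall_at_k(retrieved, relevant, k=5))  # 0.3
```

Note that the denominator is the total number of relevant documents, so Recall@K can only be computed on queries where that set has been labeled.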
Metric 2: Precision@K
What percentage of the top K results are actually relevant?
Formula:
Precision@K = (# of relevant docs in top K) / K
Example:
Retrieved top 5 documents
3 of them are relevant, 2 are irrelevant
Precision@5 = 3/5 = 0.60 (60%)
Interpretation:
- Precision@5 = 0.6: 60% of your top 5 results are relevant; the other 40% are noise
- Precision@10 = 0.8: 80% of your top 10 are useful
Why it matters: Noise in the context window hurts LLM performance. High precision means fewer tokens wasted on irrelevant documents.
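A minimal sketch of Precision@K under the same kind of setup (string document IDs, a labeled set of relevant IDs; names are illustrative). One assumption worth flagging: this version divides by K even when fewer than K results come back, which is one common convention but not the only one.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots that hold a relevant document.

    Assumption: the denominator is always k, so returning fewer
    than k results is penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Mirrors the worked example: 3 relevant and 2 irrelevant docs in the top 5.
relevant = {"a", "b", "c"}
retrieved = ["a", "noise1", "b", "noise2", "c"]
print(precision_at_k(retrieved, relevant, k=5))  # 0.6
```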
Metric 3: Mean Reciprocal Rank (MRR)
How high, on average, does the first relevant result rank?
Formula:
MRR = (1/rank of first relevant result) averaged across queries
Example:
Query 1: First relevant result at position 2 → score = 1/2 = 0.5
Query 2: First relevant result at position 1 → score = 1/1 = 1.0
Query 3: First relevant result at position 5 → score = 1/5 = 0.2
MRR = (0.5 + 1.0 + 0.2) / 3 = 0.57
Interpretation:
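- MRR = 1.0: Perfect, the first result is always relevant
- MRR = 0.5: Roughly equivalent to the first relevant result sitting at position 2
- MRR = 0.1: Roughly equivalent to the first relevant result sitting at position 10
Because MRR averages reciprocal ranks rather than ranks, these are typical positions, not true average positions.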
Why it matters: Users don’t scroll through 20 results. MRR captures how quickly you find what you need.
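The MRR calculation can be sketched as follows. The helper names are illustrative, and scoring a query with no relevant result as 0 is an assumed convention (a common one, but worth stating in any report).

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0  # assumed convention for "no relevant result retrieved"


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_doc_ids, relevant_id_set) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Mirrors the worked example: first relevant result at ranks 2, 1, and 5.
runs = [
    (["a", "hit", "b"], {"hit"}),
    (["hit", "a"], {"hit"}),
    (["a", "b", "c", "d", "hit"], {"hit"}),
]
print(round(mean_reciprocal_rank(runs), 2))  # 0.57
```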
Metric 4: NDCG (Normalized Discounted Cumulative Gain)
A more sophisticated metric that combines graded relevance scores with position discounting.
Formula (simplified):
For each position, assign relevance score (0-5)
Discount by position (position 1 worth more than position 10)
Normalize against perfect ranking
NDCG = 0 to 1
Interpretation:
- NDCG = 1.0: Perfect ranking
- NDCG = 0.8: Pretty good, some suboptimal ranking
- NDCG = 0.5: Many ranking mistakes
Why it matters: Captures nuance that Precision/Recall miss. Sometimes second-best is acceptable. NDCG reflects this.
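To make the three steps concrete, here is a hedged sketch of NDCG@K. It assumes the linear-gain form, where each grade is divided by log2(position + 1); another widely used variant uses 2^grade - 1 as the gain, and the two give different numbers. The function names and the optional `judged_grades` parameter are illustrative.

```python
import math


def dcg(grades):
    """Discounted cumulative gain: each grade divided by log2(position + 1)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(grades, start=1))


def ndcg_at_k(ranked_grades, k, judged_grades=None):
    """NDCG@k for one query.

    ranked_grades: relevance grades (e.g. 0-5) in the order the system ranked them.
    judged_grades: grades of all judged documents for the query, used to build
                   the ideal ranking. Assumption: if omitted, the ideal ranking
                   is built from ranked_grades alone.
    """
    pool = ranked_grades if judged_grades is None else judged_grades
    ideal_dcg = dcg(sorted(pool, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(ranked_grades[:k]) / ideal_dcg


print(ndcg_at_k([3, 2, 1], k=3))  # 1.0 (already in ideal order)
print(ndcg_at_k([1, 2, 3], k=3))  # below 1.0: best document ranked last
```

Passing `judged_grades` matters when highly relevant documents were never retrieved; without it, a system is only compared against the best ordering of what it happened to return.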
Metric 5: Success Rate / Hit Rate
Binary: did we find at least one relevant document in top K?
Formula:
Hit Rate@K = (% of queries with at least 1 relevant doc in top K)
Example:
100 test queries
85 of them have at least 1 relevant doc in top 5
Hit Rate@5 = 85%
Why it matters: Simple, interpretable. “85% of the time we have something useful to work with.”
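A short sketch of Hit Rate@K, assuming each test query is represented as a pair of (ranked document IDs, set of relevant IDs); the names are illustrative.

```python
def hit_rate_at_k(runs, k):
    """Share of queries with at least one relevant document in the top k.

    runs: list of (ranked_doc_ids, relevant_id_set) pairs, one per query.
    """
    if not runs:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for retrieved, relevant in runs
        if any(doc_id in relevant for doc_id in retrieved[:k])
    )
    return hits / len(runs)


runs = [
    (["a", "b", "c"], {"a"}),  # hit at rank 1
    (["a", "b", "c"], {"c"}),  # hit at rank 3
    (["a", "b", "c"], {"z"}),  # miss
    (["x", "y", "z"], {"y"}),  # hit at rank 2
]
print(hit_rate_at_k(runs, k=3))  # 0.75
```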
Building an Evaluation Dataset
To measure retrieval quality, you need test queries with known relevant documents.
Approach 1: Manual annotation
1. Collect representative queries (100-500)
2. Have humans mark which documents are relevant
3. Use as ground truth for evaluation
Cost: High effort, small scale
Benefit: Gold standard, custom to your domain
Approach 2: Existing benchmarks
Use standard datasets:
- MS MARCO: 500K queries over web documents
- Natural Questions: 300K queries with Wikipedia
- TREC: Classic IR benchmarks
- SQuAD: Reading comprehension based
Cost: Free, large scale
Benefit: Compare with other systems, reproducibility
Approach 3: Synthetic evaluation
1. Take existing QA pairs from documentation
2. Use LLM to generate variations
3. Auto-label with content retrieval
Cost: Cheap, large scale
Benefit: Fast iteration, easy generation
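One possible shape for the synthetic pipeline is sketched below. It rests on two assumptions that are not spelled out in the steps: each QA pair carries the ID of the document it came from, and every generated variation inherits that document as its only relevant label. The `paraphrase` callable is a hypothetical stand-in for an LLM call; `fake_paraphrase` exists only so the example runs.

```python
def build_synthetic_eval_set(qa_pairs, paraphrase):
    """Expand (question, source_doc_id) pairs into labeled test queries.

    paraphrase: callable taking a question and returning a list of
                reworded questions (hypothetical; an LLM in practice).
    """
    eval_set = []
    for question, doc_id in qa_pairs:
        for query in [question, *paraphrase(question)]:
            # Assumed auto-labeling rule: the source doc is the relevant doc.
            eval_set.append({"query": query, "relevant": {doc_id}})
    return eval_set


def fake_paraphrase(question):
    """Toy stand-in for an LLM paraphraser, for illustration only."""
    return [question.lower(), "Please explain: " + question]


qa_pairs = [
    ("How do I reset my password?", "doc-auth-01"),
    ("What is the refund window?", "doc-billing-07"),
]
eval_set = build_synthetic_eval_set(qa_pairs, fake_paraphrase)
print(len(eval_set))  # 6
```

Labels produced this way are only as good as the inheritance assumption: other documents may also answer a paraphrased question, so spot-checking a sample by hand is a reasonable safeguard.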
Approach 4: Hybrid approach (Recommended)
- Use existing benchmark for baseline
- Add manual annotations for your specific domain
- Add synthetic queries for coverage
Measuring End-to-End RAG Quality
Retrieval quality feeds into generation quality, but measuring the chain is complex.
Approach 1: Direct answer correctness
Test set: 100 questions with known answers
Run full RAG system (retrieval + generation)
Check if answer is correct
Accuracy = # correct / 100
Limitation: Doesn’t isolate retrieval from generation problems.
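How "correct" gets checked is left open above. One simple option, shown here as a sketch rather than a recommendation, is normalized exact match: lowercase both strings, strip punctuation and extra whitespace, then compare. It is strict and will undercount correct but differently worded answers; the function names are illustrative.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_accuracy(predictions, gold_answers):
    """Share of predictions that match the gold answer after normalization."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must align")
    if not gold_answers:
        raise ValueError("need at least one question")
    correct = sum(
        normalize(p) == normalize(g) for p, g in zip(predictions, gold_answers)
    )
    return correct / len(gold_answers)


predictions = ["Paris.", "  30 days ", "blue"]
gold_answers = ["paris", "30 days", "red"]
print(exact_match_accuracy(predictions, gold_answers))
```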
Approach 2: Component analysis
Run retrieval in isolation:
- Is relevant doc in top 5? → Retrieval quality
- Given perfect retrieval, can LLM answer? → Generation quality
Helps isolate failure modes
Approach 3: Attribution-based evaluation
Check if LLM's answer is supported by retrieved documents
High attribution = retrieval working
Low attribution = generation hallucinating
Setting Retrieval Quality Targets
Conservative (prioritize precision):
- Recall@5 > 70%
- Precision@5 > 80%
- MRR > 0.6
Result: Fewer docs retrieved, higher confidence in results
Balanced:
- Recall@10 > 80%
- Precision@10 > 70%
- MRR > 0.5
Result: Reasonable coverage, acceptable noise
Aggressive (prioritize recall):
- Recall@20 > 90%
- Precision@20 > 50%
- MRR > 0.3
Result: Find almost everything, but more noise
Choose based on your use case:
- Customer support (quality critical): Conservative
- Internal research tools: Balanced
- Exploratory search: Aggressive
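The three profiles can be encoded as data and checked automatically after each evaluation run. The sketch below is one way to do that; the dictionary layout, metric key names, and `unmet_targets` function are all illustrative choices, while the thresholds are the ones listed above.

```python
TARGET_PROFILES = {
    "conservative": {"recall@5": 0.70, "precision@5": 0.80, "mrr": 0.60},
    "balanced": {"recall@10": 0.80, "precision@10": 0.70, "mrr": 0.50},
    "aggressive": {"recall@20": 0.90, "precision@20": 0.50, "mrr": 0.30},
}


def unmet_targets(measured, profile):
    """Return {metric: (measured_value, threshold)} for every missed target.

    Targets are strict ("> threshold"), so a value equal to the threshold
    counts as unmet. A metric missing from `measured` is reported as None.
    """
    targets = TARGET_PROFILES[profile]
    return {
        name: (measured.get(name), threshold)
        for name, threshold in targets.items()
        if measured.get(name) is None or measured[name] <= threshold
    }


measured = {"recall@10": 0.85, "precision@10": 0.65, "mrr": 0.55}
print(unmet_targets(measured, "balanced"))  # {'precision@10': (0.65, 0.7)}
```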
Improving Retrieval Quality
If recall is low:
- Retrieve more documents (increase K)
- Improve chunking (chunks too small/large?)
- Try a different embedding model
- Check knowledge base coverage
If precision is low:
- Use smaller K (less noise)
- Implement reranking (cross-encoders)
- Filter low-similarity results
- Improve chunking (fewer irrelevant chunks)
If MRR is low:
- Implement two-stage retrieval (coarse then fine)
- Use dense + sparse hybrid retrieval
- Rerank top results aggressively
- Debug query encoding
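The list above mentions dense + sparse hybrid retrieval without saying how the two result lists get merged. One common option is reciprocal rank fusion (RRF), sketched here as an illustration rather than as the method this article prescribes. It needs only the ranks from each retriever, not comparable scores; `k=60` is the constant commonly used with RRF, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken alphabetically by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_results = ["a", "b", "c"]   # e.g. from embedding similarity
sparse_results = ["b", "d", "a"]  # e.g. from keyword search
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# ['b', 'a', 'd', 'c']
```

Documents that both retrievers agree on ("a" and "b" here) rise to the top, which tends to help MRR when either retriever alone ranks the best document too low.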
Reranking: Boosting Precision
A common technique: retrieve broadly (recall-optimized), then rerank (precision-optimized).
Stage 1: Dense retrieval → top 50 documents
Using: embedding similarity
Goal: high recall, don't miss anything
Stage 2: Reranking → top 5 documents
Using: cross-encoder (specialized relevance model)
Goal: high precision, filter noise
Result: Best of both worlds
Popular reranking models:
- Cohere Reranker
- BGE Reranker
- mxbai-rerank
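Cost: A cross-encoder scores every query-document pair, so it is only practical on a small candidate set (such as the top 50). That is also what makes a broader first-stage retrieval affordable.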
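The control flow of the two stages can be sketched independently of any particular model. In the example below the two scoring functions are toy stand-ins so the code runs without dependencies: `overlap_score` plays the role of embedding similarity and `jaccard_score` plays the role of a cross-encoder. Neither is a real retrieval model, and all names are illustrative.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score,
                         n_candidates=50, n_final=5):
    """Stage 1: keep the n_candidates best docs by a cheap score.
    Stage 2: reorder only those candidates with a costlier score."""
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:n_candidates]
    return sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )[:n_final]


def overlap_score(query, doc):
    """Toy first-stage scorer: number of shared tokens."""
    return len(set(query.split()) & set(doc.split()))


def jaccard_score(query, doc):
    """Toy reranker: shared tokens divided by all distinct tokens."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


corpus = [
    "reset password steps",
    "password",
    "reset your password by following these steps in the admin console settings page",
    "billing faq",
]
top = retrieve_then_rerank("reset password", corpus, overlap_score,
                           jaccard_score, n_candidates=3, n_final=2)
print(top)  # ['reset password steps', 'password']
```

The long third document survives stage 1 on raw overlap but is pushed out in stage 2, which is the precision gain reranking is meant to deliver.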
Production Monitoring
Track retrieval quality in production:
Metrics to monitor:
- User clicks (implicit feedback)
- Explicit user ratings
- Citation success (did the user verify the sources?)
- Time-to-answer (did the user have to dig deeper?)
Set up alerts:
- When Recall@10 on your labeled evaluation set drops below 0.7
- When user satisfaction scores drop
- When retrieval latency exceeds threshold
Monthly reviews:
- Compare current vs. baseline metrics
- Identify degrading queries
- Test new models/approaches
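For the monthly review, comparing per-query scores against a stored baseline makes degrading queries easy to spot. The sketch below assumes you keep a dictionary of query to metric value (reciprocal rank, for instance) for both the baseline and the current run; the function name and the `min_drop` threshold are illustrative.

```python
def degraded_queries(baseline, current, min_drop=0.1):
    """List (query, drop) pairs where the score fell by at least min_drop.

    baseline, current: dicts mapping query text to a per-query metric.
    Queries missing from `current` are skipped. Worst regressions come first.
    """
    drops = []
    for query, base_score in baseline.items():
        if query not in current:
            continue
        drop = base_score - current[query]
        if drop >= min_drop:
            drops.append((query, drop))
    return sorted(drops, key=lambda item: item[1], reverse=True)


baseline = {"reset password": 1.0, "refund window": 0.5, "api limits": 0.5}
current = {"reset password": 0.5, "refund window": 0.5, "api limits": 0.25}
print(degraded_queries(baseline, current))
# [('reset password', 0.5), ('api limits', 0.25)]
```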
Common Retrieval Quality Mistakes
Optimizing the wrong metric: High recall isn’t useful if precision is terrible (noisy context).
Not measuring at all: Assuming “it seems to work” and shipping.
Biased test set: Evaluating on easy queries, missing hard cases.
Static evaluation: Testing once, then assuming quality is stable (it degrades).
Over-engineering: Chasing 0.99 NDCG when 0.75 would ship sooner and be validated with users.
Retrieval Quality in 2024
Trends:
- Shift toward the quality of the retrieved context over sheer retrieval volume
- Reranking becoming standard (second-stage ranking)
- Hybrid retrieval (dense + sparse) gaining adoption
- User feedback loops improving optimization
RAG systems live or die by retrieval quality. Measuring it carefully, optimizing it systematically, and monitoring it continuously are essential infrastructure work.