Retrieval Quality in RAG: Metrics That Matter

A RAG system is fundamentally limited by the quality of its retrieval component. If the retriever doesn’t find relevant documents, the generator has nothing to work with. Yet retrieval quality is often overlooked in favor of flashier metrics about generation.

The Retrieval Quality Problem

Poor retrieval leads to:

Missing information: Relevant documents weren’t retrieved, so the LLM can’t find the answer
Noisy context: Irrelevant documents clutter the context window, distracting the generator
Hallucinations: When retrieval fails, the LLM falls back to making up answers
Cost waste: Retrieving many documents to find a few relevant ones inflates token usage and latency

Getting retrieval right is the foundation for everything else.

Key Metrics for Retrieval Quality

Metric 1: Recall@K

What percentage of relevant documents appear in the top K results?

Formula:

Recall@K = (# of relevant docs in top K) / (# of total relevant docs)

Example:

Document corpus has 10 relevant documents for a query
Top 5 retrieved results contain 3 of those relevant documents
Recall@5 = 3/10 = 0.30 (30%)

Interpretation:

Recall@5 = 0.3: Only catching 30% of relevant documents
Recall@10 = 0.7: Catching 70% of relevant documents
Recall@50 = 0.95: Catching almost everything

Why it matters: If a relevant document exists but isn’t retrieved, the LLM can’t use it. Recall measures what you’re missing.
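A minimal Python sketch of the formula above, assuming documents are identified by string IDs. The function name `recall_at_k` and the sample IDs are illustrative and not taken from any particular library.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when no relevant documents exist")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Ten relevant documents exist; three of them appear in the top 5.
relevant = {f"doc{i}" for i in range(10)}
retrieved = ["doc0", "x1", "doc1", "x2", "doc2", "doc3", "x3"]
print(recall_at_k(retrieved, relevant, 5))  # 0.3
```

Raising on an empty relevant set is a design choice here: a query with no relevant documents should be excluded from the average rather than silently scored.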

Metric 2: Precision@K

What percentage of the top K results are actually relevant?

Formula:

Precision@K = (# of relevant docs in top K) / K

Example:

Retrieved top 5 documents
3 of them are relevant, 2 are irrelevant
Precision@5 = 3/5 = 0.60 (60%)

Interpretation:

Precision@5 = 0.6: 40% of your top 5 results are noise
Precision@10 = 0.8: 80% of your top 10 are useful

Why it matters: Noise in the context window hurts LLM performance. High precision means fewer tokens wasted on irrelevant documents.
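The precision formula can be sketched the same way. This assumes string document IDs, and it divides by K even when fewer than K results come back, which matches the formula as written above. The name `precision_at_k` is illustrative.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots that hold a relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


# Five results retrieved: three relevant, two irrelevant.
retrieved = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "c"}
print(precision_at_k(retrieved, relevant, 5))  # 0.6
```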

Metric 3: Mean Reciprocal Rank (MRR)

How high does the first relevant result rank, on average? MRR averages the reciprocal of that rank, not the rank itself.

Formula:

MRR = (1/rank of first relevant result) averaged across queries

Example:

Query 1: First relevant result at position 2 → score = 1/2 = 0.5
Query 2: First relevant result at position 1 → score = 1/1 = 1.0
Query 3: First relevant result at position 5 → score = 1/5 = 0.2
MRR = (0.5 + 1.0 + 0.2) / 3 = 0.57

Interpretation:

MRR = 1.0: Perfect, relevant answer always first
MRR = 0.5: Equivalent to the first relevant result always sitting at position 2
MRR = 0.1: Equivalent to the first relevant result always sitting at position 10

Why it matters: Users don’t scroll through 20 results. MRR captures how quickly you find what you need.
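The three-query example can be reproduced with a short sketch. It assumes each query is a pair of (ranked result IDs, set of relevant IDs) and scores a query with no relevant result as 0, which is a common convention rather than something stated above. Function names are illustrative.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["x", "a", "y"], {"a"}),            # first relevant at position 2
    (["b", "x"], {"b"}),                 # first relevant at position 1
    (["x", "y", "z", "w", "c"], {"c"}),  # first relevant at position 5
]
print(round(mean_reciprocal_rank(runs), 2))  # 0.57
```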

Metric 4: NDCG (Normalized Discounted Cumulative Gain)

A rank-aware metric that combines graded relevance scores with position discounting.

Formula (simplified):

For each position, assign a graded relevance score (for example, 0-5)
Discount by position (position 1 is worth more than position 10)
Normalize against the perfect ranking
NDCG ranges from 0 to 1

Interpretation:

NDCG = 1.0: Perfect ranking
NDCG = 0.8: Pretty good, some suboptimal ranking
NDCG = 0.5: Many ranking mistakes

Why it matters: Captures nuance that Precision/Recall miss. Sometimes second-best is acceptable. NDCG reflects this.
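A minimal sketch of the simplified steps above, under two assumptions that are not from the text: it uses linear gain with a log2 position discount (another common variant uses 2^rel - 1 as the gain), and it builds the ideal ranking from the grades of the retrieved documents only. A fuller evaluation would build the ideal ranking from every judged document for the query.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each grade divided by log2(position + 1)."""
    return sum(
        rel / math.log2(pos + 1)
        for pos, rel in enumerate(relevances, start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG of the actual top-k order divided by DCG of the best possible order."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(relevances[:k]) / ideal_dcg


# Grades of the returned documents, in ranked order.
print(ndcg_at_k([3, 2, 1, 0], 4))  # already in ideal order
print(ndcg_at_k([3, 2, 0, 1], 4))  # last two swapped, slightly lower
```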

Metric 5: Success Rate / Hit Rate

Binary: did we find at least one relevant document in top K?

Formula:

Hit Rate@K = (% of queries with at least 1 relevant doc in top K)

Example:

100 test queries
85 of them have at least 1 relevant doc in top 5
Hit Rate@5 = 85%

Why it matters: Simple, interpretable. “85% of the time we have something useful to work with.”
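Hit rate is the simplest of the five to compute. The sketch below assumes the same (ranked result IDs, relevant IDs) pairs per query; the name `hit_rate_at_k` is illustrative.

```python
def hit_rate_at_k(runs, k):
    """Share of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for retrieved, relevant in runs
        if any(doc in relevant for doc in retrieved[:k])
    )
    return hits / len(runs)


runs = [
    (["a", "x"], {"a"}),       # hit at position 1
    (["x", "y", "b"], {"b"}),  # hit at position 3
    (["x", "y", "z"], {"c"}),  # miss
    (["x", "d"], {"d"}),       # hit at position 2
]
print(hit_rate_at_k(runs, 3))  # 0.75
```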

Building an Evaluation Dataset

To measure retrieval quality, you need test queries with known relevant documents.

Approach 1: Manual annotation

1. Collect representative queries (100-500)
2. Have humans mark which documents are relevant
3. Use as ground truth for evaluation

Cost: High effort, small scale
Benefit: Gold standard, custom to your domain

Approach 2: Existing benchmarks

Use standard datasets:
- MS MARCO: 500K queries over web documents
- Natural Questions: 300K queries with Wikipedia
- TREC: Classic IR benchmarks
- SQuAD: Reading comprehension based

Cost: Free, large scale
Benefit: Compare with other systems, reproducibility

Approach 3: Synthetic evaluation

1. Take existing QA pairs from documentation
2. Use LLM to generate variations
3. Auto-label relevance by tracing each question back to the document its answer came from

Cost: Cheap, large scale
Benefit: Fast iteration, easy generation
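One possible shape for the synthetic approach, sketched under stated assumptions: each QA pair records the ID of the document it came from, and variations are produced by a callable you supply. In practice that callable would wrap an LLM call; `toy_paraphrase` below is a hypothetical stand-in so the example runs without one, and all field names are illustrative.

```python
def build_synthetic_eval(qa_pairs, paraphrase):
    """Expand QA pairs into labeled queries, reusing the source doc as the label."""
    examples = []
    for pair in qa_pairs:
        questions = [pair["question"], *paraphrase(pair["question"])]
        for question in questions:
            examples.append(
                {"query": question, "relevant_doc_ids": [pair["doc_id"]]}
            )
    return examples


def toy_paraphrase(question):
    """Hypothetical stand-in for an LLM that rewrites the question."""
    return [question.lower(), "Please explain: " + question]


qa_pairs = [{"question": "How do I rotate an API key?", "doc_id": "kb-12"}]
eval_set = build_synthetic_eval(qa_pairs, toy_paraphrase)
print(len(eval_set))  # 3
```

Labels produced this way mark only the source document as relevant, so other documents that also answer the question will be counted as misses.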

Approach 4: Hybrid approach (Recommended)

- Use existing benchmark for baseline
- Add manual annotations for your specific domain
- Add synthetic queries for coverage

Measuring End-to-End RAG Quality

Retrieval quality feeds into generation quality, but measuring the chain is complex.

Approach 1: Direct answer correctness

Test set: 100 questions with known answers
Run full RAG system (retrieval + generation)
Check if answer is correct
Accuracy = # correct / 100

Limitation: Doesn’t isolate retrieval from generation problems.

Approach 2: Component analysis

Run retrieval in isolation:
- Is relevant doc in top 5? → Retrieval quality
- Given perfect retrieval, can LLM answer? → Generation quality
Helps isolate failure modes
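The component analysis can be turned into a small triage function. This is a sketch, assuming you have three boolean observations per test question; the labels and the name `classify_failure` are illustrative.

```python
from collections import Counter


def classify_failure(relevant_in_top_k, correct_end_to_end, correct_with_gold_docs):
    """Attribute a wrong answer to retrieval, generation, or both."""
    if correct_end_to_end:
        return "ok"
    if relevant_in_top_k:
        # The evidence was retrieved, yet the answer is still wrong.
        return "generation"
    if correct_with_gold_docs:
        # The LLM can answer when handed the right document.
        return "retrieval"
    return "retrieval_and_generation"


# (relevant_in_top_k, correct_end_to_end, correct_with_gold_docs) per question
results = [
    (True, True, True),
    (False, False, True),
    (True, False, True),
    (False, False, False),
]
summary = Counter(classify_failure(*row) for row in results)
print(summary)
```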

Approach 3: Attribution-based evaluation

Check if LLM's answer is supported by retrieved documents
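High attribution = retrieval is working and the answer is grounded in the retrieved documents
Low attribution = the generator is adding claims the documents do not support (hallucinating)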

Setting Retrieval Quality Targets

Conservative (prioritize precision):

Recall@5 > 70%
Precision@5 > 80%
MRR > 0.6

Result: Fewer docs retrieved, higher confidence in results

Balanced:

Recall@10 > 80%
Precision@10 > 70%
MRR > 0.5

Result: Reasonable coverage, acceptable noise

Aggressive (prioritize recall):

Recall@20 > 90%
Precision@20 > 50%
MRR > 0.3

Result: Find almost everything, but more noise

Choose based on your use case:

Customer support (quality critical): Conservative
Internal research tools: Balanced
Exploratory search: Aggressive
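One way to make these targets checkable is to store them as data and compare measured metrics against them. The thresholds below are copied from the three profiles above; the dictionary layout, the metric key names, and `unmet_targets` are assumptions made for illustration.

```python
TARGETS = {
    "conservative": {"recall@5": 0.70, "precision@5": 0.80, "mrr": 0.60},
    "balanced": {"recall@10": 0.80, "precision@10": 0.70, "mrr": 0.50},
    "aggressive": {"recall@20": 0.90, "precision@20": 0.50, "mrr": 0.30},
}


def unmet_targets(measured, profile):
    """Return {metric: (measured, threshold)} for each target not exceeded."""
    return {
        name: (measured.get(name), threshold)
        for name, threshold in TARGETS[profile].items()
        # A missing metric counts as unmet; targets are strict (">").
        if measured.get(name, 0.0) <= threshold
    }


measured = {"recall@10": 0.84, "precision@10": 0.66, "mrr": 0.55}
failures = unmet_targets(measured, "balanced")
print(failures)  # only precision@10 falls short here
```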

Improving Retrieval Quality

If recall is low:

Retrieve more documents (increase K)
Improve chunking (chunks too small/large?)
Try different embedding model
Check knowledge base coverage

If precision is low:

Use smaller K (less noise)
Implement reranking (cross-encoders)
Filter low-similarity results
Improve chunking (fewer irrelevant chunks)

If MRR is low:

Implement two-stage retrieval (coarse then fine)
Use dense + sparse hybrid retrieval
Rerank top results aggressively
Debug query encoding
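For the dense + sparse hybrid item, one common way to merge two ranked lists is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat this as one option among several. The constant k=60 is the value commonly used for RRF; the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["a", "b", "c"]   # e.g. ranked by embedding similarity
sparse = ["b", "d", "a"]  # e.g. ranked by keyword score
print(reciprocal_rank_fusion([dense, sparse]))  # ['b', 'a', 'd', 'c']
```

RRF uses only ranks, so the dense and sparse scores never need to be put on a common scale.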

Reranking: Boosting Precision

A common technique: retrieve broadly (recall-optimized), then rerank (precision-optimized).

Stage 1: Dense retrieval → top 50 documents
  Using: embedding similarity
  Goal: high recall, don't miss anything

Stage 2: Reranking → top 5 documents
  Using: cross-encoder (specialized relevance model)
  Goal: high precision, filter noise

Result: Best of both worlds
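The two stages can be sketched as a function that takes both scorers as parameters. The scorers used here are toy word-overlap functions standing in for embedding similarity and a cross-encoder; they only demonstrate the control flow and say nothing about real model quality. All names are illustrative.

```python
def two_stage_retrieve(query, corpus, first_stage, reranker, first_k=50, final_k=5):
    """Score everything cheaply, keep first_k, then rerank down to final_k."""
    candidates = sorted(
        corpus, key=lambda doc: first_stage(query, doc), reverse=True
    )[:first_k]
    return sorted(
        candidates, key=lambda doc: reranker(query, doc), reverse=True
    )[:final_k]


def word_overlap(query, doc):
    """Toy stand-in for embedding similarity: count of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def jaccard(query, doc):
    """Toy stand-in for a cross-encoder: shared words over all words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


corpus = [
    "how to reset your password",
    "password policy and password rotation rules for administrators",
    "billing and invoices overview",
    "reset the router to factory settings",
]
top = two_stage_retrieve(
    "reset password", corpus, word_overlap, jaccard, first_k=3, final_k=2
)
print(top)
```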

Popular reranking models:

Cohere Reranker
BGE Reranker
mxbai-rerank

Cost: Reasonable when you rerank only the top-K candidates, and it lets you retrieve more documents in the first stage.

Production Monitoring

Track retrieval quality in production:

Metrics to monitor:

User clicks (implicit feedback)
Explicit user ratings
Citation success (did the user open or verify the cited sources?)
Time-to-answer (did the user have to dig deeper?)

Set up alerts:

When Recall@10 on your evaluation set drops below 0.7
When user satisfaction scores drop
When retrieval latency exceeds threshold

Monthly reviews:

Compare current vs. baseline metrics
Identify degrading queries
Test new models/approaches
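For the monthly review, degrading queries can be found by comparing a per-query metric between the baseline run and the current run. A sketch, assuming both runs are dictionaries mapping a query ID to a score such as reciprocal rank; the `min_drop` threshold is an arbitrary illustrative value.

```python
def degrading_queries(baseline, current, min_drop=0.2):
    """Query IDs whose score fell by at least min_drop, largest drop first."""
    drops = {
        query: baseline[query] - current[query]
        for query in baseline
        if query in current and baseline[query] - current[query] >= min_drop
    }
    return sorted(drops, key=drops.get, reverse=True)


baseline = {"q1": 1.0, "q2": 0.5, "q3": 1.0}
current = {"q1": 0.5, "q2": 0.5, "q3": 0.0}
print(degrading_queries(baseline, current))  # ['q3', 'q1']
```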

Common Retrieval Quality Mistakes

Optimizing the wrong metric: High recall isn’t useful if precision is terrible (noisy context).

Not measuring at all: Assuming “it seems to work” and shipping.

Biased test set: Evaluating on easy queries, missing hard cases.

Static evaluation: Testing once, then assuming quality is stable (it tends to degrade as the corpus and the query mix change).

Over-engineering: Chasing 0.99 NDCG when a system at 0.75 could ship sooner and be validated with real users.

Retrieval Quality in 2024

Trends:

Shift toward the quality of what lands in the context window over raw recall volume
Reranking becoming standard (second-stage ranking)
Hybrid retrieval (dense + sparse) gaining adoption
User feedback loops improving optimization

RAG systems live or die by retrieval quality. Measuring it carefully, optimizing systematically, and monitoring continuously are essential infrastructure.