RAG Evaluation: How Do You Know If Your RAG Is Actually Working?

“It feels like it’s working” is not a measurement. Neither is running five queries by hand and deciding the answers look reasonable. Production RAG systems need rigorous evaluation frameworks that catch regressions, guide optimization decisions, and give stakeholders confidence in system quality.

RAG evaluation is complex because there are two systems to evaluate: the retriever and the generator. Failures can happen at either stage, and distinguishing between “bad retrieval” and “bad generation” is essential for knowing which to fix.

The RAG Evaluation Components

RAG System Components and Failure Modes:

Query → Retriever → Retrieved Docs → Generator → Answer
           ↑                              ↑
    Context Quality              Generation Quality

Context failures:                Generation failures:
  - Wrong documents retrieved      - Hallucination (unsupported claims)
  - Relevant docs missed           - Answer doesn't address query
  - Low-relevance documents        - Misses key information in context
  - Too much noise                 - Incorrect information extraction

Each component needs separate metrics. Overall end-to-end accuracy tells you that something is broken; component metrics tell you where.

RAGAS: The Standard Evaluation Framework

RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted RAG evaluation frameworks. It provides four core metrics that together cover retrieval quality (context precision, context recall) and generation quality (faithfulness, answer relevancy):

Faithfulness

Measures whether every claim in the generated answer is supported by the retrieved context. Detects hallucination.

Answer: "The model was trained on 570GB of text data and achieves 94% accuracy."
Context: "GPT-3 was trained on approximately 570GB of filtered text."

Claim 1: "trained on 570GB of text data" → supported ✓
Claim 2: "achieves 94% accuracy" → NOT in context → hallucination ✗

Faithfulness = 1/2 = 0.5
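
The arithmetic above can be sketched in a few lines. This is an illustration only, not the RAGAS implementation: RAGAS uses an LLM to extract the claims and judge each one, whereas here the verdicts are hard-coded and faithfulness_score is a hypothetical helper.

```python
def faithfulness_score(claim_verdicts: dict[str, bool]) -> float:
    """Fraction of answer claims that are supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0  # assumption: an answer with no extractable claims scores 0
    supported = sum(claim_verdicts.values())
    return supported / len(claim_verdicts)


# Verdicts for the two claims in the example answer (hard-coded for illustration)
verdicts = {
    "trained on 570GB of text data": True,
    "achieves 94% accuracy": False,
}
print(faithfulness_score(verdicts))  # 0.5
```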

Answer Relevancy

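Measures whether the answer addresses the question being asked. Detects off-topic or vague answers.

All four core metrics, including the two context metrics described further down, can be computed in a single evaluate() call: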

from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate
from datasets import Dataset

test_data = {
    "question": ["What is the refund policy?", "How do I reset my password?"],
    "answer": ["You can return items within 30 days.", "Visit account settings and click 'Forgot Password'."],
    "contexts": [
        ["Our return policy allows 30-day returns for full refund..."],
        ["Password reset: Go to Settings > Account > Security > Reset Password..."],
    ],
    "ground_truth": [
        "Items can be returned within 30 days for a full refund.",
        "Click 'Forgot Password' in account settings to reset.",
    ]
}

dataset = Dataset.from_dict(test_data)

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
# Illustrative output (actual scores vary with the judge LLM and the data):
# {'faithfulness': 0.95, 'answer_relevancy': 0.87,
#  'context_precision': 0.82, 'context_recall': 0.91}

Context Precision

Measures whether retrieved documents are actually useful — are they relevant to the query? High precision = few irrelevant chunks in the retrieved set.

Retrieved chunks for "What is the return policy?":
  Chunk 1: "30-day return policy for full refund" → relevant ✓
  Chunk 2: "Company founded in 2015 in San Francisco" → not relevant ✗
  Chunk 3: "Defective items: 90-day return period" → relevant ✓
  Chunk 4: "Annual report 2022 shows 18% growth" → not relevant ✗
  Chunk 5: "Sale items: 15-day return window" → relevant ✓

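Context Precision = 3/5 = 0.60 (3 relevant out of 5 retrieved)

This is the simple, unranked ratio. The RAGAS context_precision metric is additionally rank-aware: it rewards retrievers that place relevant chunks near the top of the list.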
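
To make the difference between the unranked ratio and a rank-aware score concrete, here is a minimal sketch. It assumes relevance labels are already known (RAGAS obtains them from an LLM judge), and the rank-weighted formula follows the published RAGAS definition as I understand it: the mean of precision@k over the positions k that hold a relevant chunk. Both function names are hypothetical.

```python
def simple_precision(relevance: list[bool]) -> float:
    """Unranked precision: relevant chunks divided by retrieved chunks."""
    return sum(relevance) / len(relevance) if relevance else 0.0


def rank_weighted_precision(relevance: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# Relevance of chunks 1-5 from the example above, in retrieval order
relevance = [True, False, True, False, True]
print(simple_precision(relevance))                   # 0.6
print(round(rank_weighted_precision(relevance), 3))  # 0.756

# Same chunks, but with the relevant ones ranked first
reordered = [True, True, True, False, False]
print(rank_weighted_precision(reordered))            # 1.0
```

The unranked ratio is identical for both orderings, while the rank-weighted score rises when relevant chunks move to the top, which matters when the generator only attends closely to the first few chunks.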

Context Recall

Measures whether all the information needed to answer the question was retrieved. High recall = nothing important was missed.

Ground truth answer requires: [30-day policy, sale item exception, defective item exception]
Retrieved: [30-day policy ✓, defective item exception ✓]
Missing: [sale item exception ✗]

Context Recall = 2/3 = 0.67
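
The same calculation as a set operation, as a minimal sketch. It assumes the required facts have already been enumerated by hand; RAGAS instead derives them from the ground-truth answer with an LLM. The helper name is hypothetical.

```python
def context_recall_score(required: set[str], retrieved: set[str]) -> float:
    """Fraction of required facts that appear in the retrieved context."""
    if not required:
        return 1.0  # assumption: if nothing is required, nothing was missed
    return len(required & retrieved) / len(required)


required = {"30-day policy", "sale item exception", "defective item exception"}
retrieved = {"30-day policy", "defective item exception"}

print(round(context_recall_score(required, retrieved), 2))  # 0.67
print(sorted(required - retrieved))  # ['sale item exception']
```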

Building a Comprehensive Evaluation Pipeline

# Written against the ragas 0.1-style API used throughout this article;
# dataset column names differ in later ragas releases.
from ragas import evaluate as ragas_evaluate  # aliased to avoid confusion with RAGEvaluator.evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from datasets import Dataset
import pandas as pd

# Metrics shared by the aggregate evaluation and the per-question report
METRICS = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
]

class RAGEvaluator:
    def __init__(self, rag_pipeline, test_dataset_path: str):
        # rag_pipeline must expose run(question) returning a dict with
        # "answer" (str) and "retrieved_texts" (list of str)
        self.pipeline = rag_pipeline
        # JSON records with "question" and "ground_truth" fields
        self.test_data = pd.read_json(test_dataset_path)

    def run_pipeline_on_testset(self) -> Dataset:
        """Run every test question through the pipeline and collect the outputs."""
        results = []
        for _, row in self.test_data.iterrows():
            output = self.pipeline.run(row["question"])
            results.append({
                "question": row["question"],
                "answer": output["answer"],
                "contexts": output["retrieved_texts"],
                "ground_truth": row["ground_truth"],
            })
        return Dataset.from_list(results)

    def evaluate(self):
        """Return the ragas result object with aggregate scores for the test set."""
        dataset = self.run_pipeline_on_testset()
        return ragas_evaluate(dataset=dataset, metrics=METRICS)

    def generate_report(self) -> pd.DataFrame:
        """Return a per-question breakdown: the inputs plus one column per metric."""
        # The ragas result object exposes per-row scores through to_pandas()
        return self.evaluate().to_pandas()

Building a Golden Dataset

A golden evaluation dataset is the foundation of good RAG evaluation. One way to bootstrap it is to have an LLM draft question-answer pairs from your own documents:

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_test_questions(documents: list[str], n_questions: int = 100) -> list[dict]:
    """Use an LLM to generate question-answer pairs with ground truth from documents."""
    test_cases = []

    # First 20 documents only; 5 pairs each yields up to 100 questions
    for doc in documents[:20]:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=600,
            messages=[{
                "role": "user",
                "content": f"""Generate 5 question-answer pairs from this document.
Each pair should test different aspects of the content.
Return as JSON array: [{{"question": "...", "answer": "...", "source_quote": "..."}}]

Document:
{doc[:1000]}"""
            }]
        )
        try:
            pairs = json.loads(response.content[0].text)
        except json.JSONDecodeError:
            continue  # skip documents whose response is not valid JSON
        for pair in pairs:
            # Keep the full source so reviewers can check each pair against it
            pair["source_document"] = doc
            test_cases.append(pair)

    return test_cases[:n_questions]

Human review of generated Q&A pairs is essential — LLMs sometimes generate questions that are ambiguous or have multiple valid answers.
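
Before human review, an automated pre-filter can cut the workload by discarding pairs whose source_quote does not actually appear in the source document, a cheap signal that the model invented the quote. The sketch below assumes test cases shaped like the output of generate_test_questions; filter_grounded_pairs and its whitespace normalization are illustrative choices, not part of any library.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())


def filter_grounded_pairs(test_cases: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split test cases into (grounded, rejected) by checking the quoted evidence."""
    grounded, rejected = [], []
    for case in test_cases:
        quote = _normalize(case.get("source_quote", ""))
        document = _normalize(case.get("source_document", ""))
        if quote and quote in document:
            grounded.append(case)
        else:
            rejected.append(case)  # missing or unverifiable quote
    return grounded, rejected


cases = [
    {"question": "How long is the return window?",
     "answer": "30 days",
     "source_quote": "returns within 30 days",
     "source_document": "We accept returns  within 30 days of purchase."},
    {"question": "When was the company founded?",
     "answer": "2015",
     "source_quote": "founded in 2015",
     "source_document": "We accept returns within 30 days of purchase."},
]
grounded, rejected = filter_grounded_pairs(cases)
print(len(grounded), len(rejected))  # 1 1
```

A pair that passes this check can still be ambiguous or have several valid answers, so the filter narrows the review queue rather than replacing it.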

Component-Level Debugging

When RAGAS scores drop, use component metrics to locate the problem:

Score dropped after pipeline change:

Faithfulness: 0.95 → 0.91 (slight drop)
Context Recall: 0.88 → 0.61 (large drop!)
Context Precision: 0.79 → 0.82 (slight improvement)

Diagnosis: Retrieval is returning slightly more precise results (better precision)
but is now MISSING relevant documents (recall dropped sharply).
Likely root cause: Chunk size increased → fewer chunks → lower recall coverage.
Fix to try: Reduce chunk size or add sliding-window overlap, then re-run the evaluation to confirm.
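
A comparison like this is easy to automate. The following is a minimal sketch, assuming scores are available as plain metric-to-float dicts; find_regressions and the 0.05 tolerance are hypothetical choices for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return metrics that dropped by more than `tolerance`, mapped to the drop size."""
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            continue  # metric not measured in the current run
        drop = old_score - new_score
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


baseline = {"faithfulness": 0.95, "context_recall": 0.88, "context_precision": 0.79}
current = {"faithfulness": 0.91, "context_recall": 0.61, "context_precision": 0.82}
print(find_regressions(baseline, current))  # {'context_recall': 0.27}
```

With this tolerance the 0.04 faithfulness drop is treated as noise and only context recall is flagged, which points the investigation at the retriever.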

LLM-as-Judge for Nuanced Metrics

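RAGAS uses LLMs internally to compute faithfulness and relevancy. For custom metrics, implement your own LLM-as-judge. The example reuses the Anthropic client created earlier; parse_scores is a helper you supply to extract the three numbers from the reply: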

def llm_judge_answer_quality(
    question: str,
    answer: str,
    context: str,
    ground_truth: str,
) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Score this RAG system answer on three dimensions (1-5 scale):

Question: {question}
Context provided: {context[:500]}
System answer: {answer}
Ground truth: {ground_truth}

Score and explain:
1. Correctness (is the answer factually correct?)
2. Completeness (does it cover all important aspects?)
3. Conciseness (is it appropriately brief without missing key info?)

Format: Correctness: X/5, Completeness: X/5, Conciseness: X/5"""
        }]
    )
    return parse_scores(response.content[0].text)
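
The parse_scores helper is not defined in the snippet above. One possible implementation is sketched below; it assumes the model follows the requested "Name: X/5" format and raises an error rather than guessing when a score is missing. The regular expression and the error behavior are my own choices.

```python
import re

# Matches e.g. "Correctness: 4/5" (case-insensitive, tolerant of extra spaces)
SCORE_PATTERN = re.compile(
    r"(Correctness|Completeness|Conciseness)\s*:\s*([1-5])\s*/\s*5",
    re.IGNORECASE,
)
EXPECTED_DIMENSIONS = {"correctness", "completeness", "conciseness"}


def parse_scores(text: str) -> dict[str, int]:
    """Extract the three 1-5 scores from the judge's reply."""
    scores = {name.lower(): int(value) for name, value in SCORE_PATTERN.findall(text)}
    missing = EXPECTED_DIMENSIONS - scores.keys()
    if missing:
        raise ValueError(f"Could not parse scores for: {sorted(missing)}")
    return scores


reply = "Correctness: 4/5, Completeness: 3/5, Conciseness: 5/5\nThe answer omits the sale-item rule."
print(parse_scores(reply))
# {'correctness': 4, 'completeness': 3, 'conciseness': 5}
```

Failing loudly on a malformed reply keeps unparseable judgments out of the averages; a production job would typically catch the ValueError and retry or log the case.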

Continuous Evaluation in Production

Don’t just evaluate at deployment time — monitor continuously:

import random

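def sample_for_evaluation(production_logs: list[dict], sample_rate: float = 0.05) -> list[dict]:
    """Randomly sample a fraction (5% by default) of production queries for evaluation."""
    sample_size = int(len(production_logs) * sample_rate)
    return random.sample(production_logs, sample_size)

# Weekly evaluation job.
# last_weeks_queries, evaluator.evaluate_sample, and alert are placeholders for
# your own log store, evaluation wrapper, and alerting hook.
weekly_sample = sample_for_evaluation(last_weeks_queries)
scores = evaluator.evaluate_sample(weekly_sample)

if scores["faithfulness"] < 0.85:
    alert("Faithfulness degraded: possible retrieval or model change")
# context_recall needs a reference answer, so sampled queries must be labeled first
if scores["context_recall"] < 0.80:
    alert("Context recall dropped: check for index issues or chunk size changes")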

2025 Trend: Task-Specific Evaluation Suites

Rather than one generic evaluation, production teams maintain task-specific evaluation suites: one for factual Q&A, one for comparative analysis, one for summarization, one for instruction following. Each suite has tailored metrics and different quality thresholds. A RAG system optimized for factual Q&A shouldn’t be evaluated the same way as one optimized for document summarization.
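
One way to represent such suites is as named threshold sets that a single checker can apply. This is a sketch under stated assumptions: EvalSuite, the suite names, and every threshold value are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSuite:
    """A named set of minimum acceptable scores for one task type."""
    name: str
    thresholds: dict[str, float]

    def failing_metrics(self, scores: dict[str, float]) -> list[str]:
        """Return metrics that are missing or below their minimum."""
        return [
            metric
            for metric, minimum in self.thresholds.items()
            if scores.get(metric, 0.0) < minimum
        ]


# Placeholder thresholds; tune them against your own baseline runs
SUITES = {
    "factual_qa": EvalSuite("factual_qa", {"faithfulness": 0.90, "context_recall": 0.85}),
    "summarization": EvalSuite("summarization", {"faithfulness": 0.85, "answer_relevancy": 0.80}),
}

scores = {"faithfulness": 0.88, "context_recall": 0.90, "answer_relevancy": 0.82}
print(SUITES["factual_qa"].failing_metrics(scores))     # ['faithfulness']
print(SUITES["summarization"].failing_metrics(scores))  # []
```

The same set of scores passes one suite and fails the other, which is the point of keeping thresholds per task rather than global.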

Good evaluation is what separates teams that improve their RAG systematically from those that guess. Invest in your golden dataset early — it pays dividends throughout the product lifecycle.