LLM Evaluation

Shipping an LLM application without systematic evaluation is like deploying software without tests. You don’t know if it works, you can’t measure improvements, and you’ll be surprised by regressions. Evaluation is the discipline that separates production-grade AI from demos.

Why LLM Evaluation Is Hard

Traditional software testing: run the function, check if output equals expected output. Binary pass/fail.

LLM evaluation: generate text, judge if it’s “good enough.” What even is good?

The fundamental challenges:

No ground truth for many tasks (summarization, creative writing, reasoning)
Multiple valid answers — different phrasings can all be equally correct
Subjective quality — helpful, clear, accurate are all in the eye of the beholder
Distribution shift — a model that works on your 50-example eval might fail on real traffic
Hallucinations are hard to detect — wrong answers often look exactly like right ones

Types of Evaluation

Automated Metrics

Exact match and substring match: For classification, extraction, and factual QA where there’s a correct answer.

def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
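The text also mentions substring match, which the snippet above does not cover. A minimal sketch, assuming the same strip-and-lowercase normalization as exact_match; the function name substring_match is illustrative and not from any library:

```python
def substring_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized reference appears anywhere in the prediction."""
    pred = prediction.strip().lower()
    ref = reference.strip().lower()
    # An empty reference would match every prediction, so treat it as a miss.
    if not ref:
        return 0.0
    return 1.0 if ref in pred else 0.0
```

Substring match is more forgiving when the model wraps the answer in a full sentence, at the cost of occasional false positives on short references.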

ROUGE (for summarization): Measures overlap between generated and reference summaries. ROUGE-1, ROUGE-2, ROUGE-L are commonly reported. Cheap to compute, correlates weakly with human preference but useful as a signal.
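To show what "overlap" means here, the sketch below computes a simplified ROUGE-1 F1 score from unigram counts. It assumes whitespace tokenization and no stemming, so its numbers will differ from those of a maintained ROUGE library; treat it as an illustration of the idea rather than a reference implementation.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```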

BERTScore: Computes semantic similarity between generated and reference text using contextual embeddings. Better than exact match for paraphrase-tolerant evaluation.

Code correctness: Run the generated code against test cases. The result is a binary pass/fail, and it is the most reliable metric for code generation because it checks behavior rather than surface text.
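A minimal sketch of such a check, assuming the model output defines a single function and each test case is an (arguments, expected result) pair. The helper name passes_tests and the test-case format are assumptions made for illustration. Note that exec runs arbitrary code, so real pipelines should run model output in a sandbox.

```python
from typing import Any

def passes_tests(
    generated_code: str,
    func_name: str,
    test_cases: list[tuple[tuple, Any]],
) -> bool:
    """Return True only if the generated function passes every test case."""
    namespace: dict[str, Any] = {}
    try:
        # WARNING: exec runs arbitrary code; sandbox untrusted model output.
        exec(generated_code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False
```

For example, passes_tests(code, "add", [((1, 2), 3)]) returns True only when the generated add function returns 3 for the arguments 1 and 2.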

LLM-as-Judge

Use a capable LLM (GPT-4, Claude 3.5, or Gemini 1.5 Pro) to evaluate another LLM’s outputs. This is the dominant approach for production evaluation in 2025–2026.

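import json
from typing import Optional

import anthropic

# Client for the judge model; reads ANTHROPIC_API_KEY from the environment.
judge_client = anthropic.Anthropic()

def llm_judge(
    question: str,
    model_response: str,
    reference_answer: Optional[str] = None,
    criteria: str = "accuracy, helpfulness, and factual correctness",
) -> dict:
    # Include the reference line only when a reference answer is provided.
    reference_line = f"Reference answer: {reference_answer}\n" if reference_answer else ""

    prompt = f"""Evaluate the following AI response on {criteria}.

Question: {question}
{reference_line}AI Response: {model_response}

Rate on a scale of 1-5 where:
1 = Completely wrong or unhelpful
3 = Partially correct with issues
5 = Excellent, accurate, and helpful

Respond as JSON: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # Raises json.JSONDecodeError if the judge returns anything other than bare JSON.
    return json.loads(response.content[0].text)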

Pairwise comparison is often more reliable than absolute scoring:

# Instead of "is response A good?", ask "is response A better than B?"
prompt = f"""Which response better answers the question?

Question: {question}
Response A: {response_a}
Response B: {response_b}

Respond: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}"""
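LLM judges are often reported to favor whichever response appears in a particular position. One common mitigation is to run the comparison in both orders and accept a winner only when the two verdicts agree. The sketch below assumes a hypothetical judge callable that takes (question, first, second) and returns "A", "B", or "tie"; in practice it would wrap the prompt above.

```python
from typing import Callable

def debiased_pairwise(
    question: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str, str], str],
) -> str:
    """Run the comparison in both orders and keep only consistent verdicts.

    `judge(question, first, second)` is a hypothetical callable that returns
    "A" if the first response wins, "B" if the second wins, or "tie".
    """
    forward = judge(question, response_a, response_b)
    backward = judge(question, response_b, response_a)
    # Map the swapped-order verdict back onto the original A/B labels.
    backward_unswapped = {"A": "B", "B": "A"}.get(backward, "tie")
    # A verdict that flips with presentation order is treated as a tie.
    return forward if forward == backward_unswapped else "tie"
```

This doubles the judge cost per comparison, which is usually acceptable for an offline eval set.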

RAGAS: RAG-Specific Evaluation

For RAG systems, RAGAS provides a standardized evaluation framework. Four of its metrics cover the core of a RAG pipeline:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": ["What is the refund policy?", ...],
    "contexts": [["Policy doc chunk 1...", "Policy doc chunk 2..."], ...],
    "answer": ["Our refund policy allows returns within 30 days...", ...],
    "ground_truth": ["Customers can return items within 30 days...", ...]
})

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(results)
# Example output (values are illustrative):
# {faithfulness: 0.89, answer_relevancy: 0.92, context_recall: 0.85, context_precision: 0.78}

Metric	What It Measures	Score Range
Faithfulness	Does the answer stay within retrieved context?	0–1
Answer Relevancy	Does the answer address the question?	0–1
Context Recall	Did retrieval find all necessary information?	0–1
Context Precision	Are relevant chunks ranked above irrelevant ones in the retrieved context?	0–1
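To make the 0–1 range concrete: faithfulness is, roughly, the share of claims in the answer that the retrieved context supports. The sketch below assumes the claims have already been extracted and checked against the context (RAGAS uses an LLM for both steps). The function name and input format are illustrative, not the library's API.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        # No claims to verify; returning 0.0 here is a convention of this sketch.
        return 0.0
    return sum(claim_supported) / len(claim_supported)
```

An answer with four claims, three of them supported by the context, scores 0.75.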

Human Evaluation

For high-stakes applications, human judgment remains the gold standard. But it’s slow and expensive.

When to use human eval:

Initial validation of evaluation methodology
Detecting issues your automated metrics miss
Final acceptance testing before a major release
Calibrating LLM judges

Practical approach: Use human eval to create a “golden set” of 100–500 high-quality question/answer pairs. Run automated evaluation against this set continuously. Periodically sample production traffic for human review to catch distribution drift.
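Calibrating an LLM judge against the golden set comes down to measuring how often the judge and the human labels agree. One common choice is Cohen's kappa, which corrects raw agreement for chance. The sketch below is a plain-Python version, assuming both raters labeled the same examples in the same order:

```python
from collections import Counter

def cohens_kappa(human_labels: list, judge_labels: list) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Agreement expected if each rater labeled at random with their own label frequencies.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 suggests the judge can stand in for human review on this task. A kappa near 0 means its agreement with humans is no better than chance.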

Building an Evaluation Pipeline

┌─────────────────────────────────────────────────┐
│              Evaluation Pipeline                 │
├─────────────────────────────────────────────────┤
│                                                  │
│  1. TEST SET                                     │
│     • Curated golden examples                    │
│     • Representative of real traffic             │
│     • Versioned alongside your model             │
│                                                  │
│  2. METRICS                                      │
│     • Task-specific (exact match, ROUGE, etc.)   │
│     • LLM judge scores                           │
│     • RAG-specific (RAGAS) if applicable         │
│                                                  │
│  3. RUNNER                                       │
│     • Automated on every PR / model change       │
│     • Run against current prod model as baseline │
│                                                  │
│  4. TRACKING                                     │
│     • Store results per version                  │
│     • Alert on regression >X%                    │
│     • Dashboard for historical trends            │
│                                                  │
└─────────────────────────────────────────────────┘
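The tracking step's "alert on regression" rule can be sketched as a comparison between baseline and candidate scores. This assumes higher is better for every metric. The 2% default threshold is a placeholder for the X in the diagram, not a recommendation, and the function name is hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop_pct: float = 2.0,  # placeholder threshold; tune per metric
) -> dict[str, float]:
    """Return metrics whose relative drop versus the baseline exceeds max_drop_pct."""
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in candidate or base_score <= 0:
            continue  # skip metrics that cannot be compared
        drop_pct = (base_score - candidate[metric]) / base_score * 100
        if drop_pct > max_drop_pct:
            regressions[metric] = round(drop_pct, 2)
    return regressions
```

In CI, the runner would fail the job whenever this returns a non-empty dict and store both score sets under the candidate's version.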

Evaluation Tools and Frameworks

Tool	Best For
Braintrust	End-to-end LLM eval platform, CI integration
RAGAS	RAG evaluation metrics
Langfuse	LLM observability + built-in eval
TruLens	RAG triad evaluation
Promptflow (Azure)	Enterprise eval pipelines
Weave (Weights & Biases)	Experiment tracking + eval
DeepEval	Open-source eval framework, many metrics

Standard Benchmarks (For Model Selection)

When choosing a model, standard benchmarks provide a starting point:

Benchmark	Tests	Notes
MMLU	57-subject knowledge test	Broad knowledge coverage
HumanEval / MBPP	Code generation	Pass@k on function writing
MATH	Competition math problems	Multi-step mathematical reasoning
GPQA	Graduate-level science	Expert-level knowledge
BIG-Bench Hard	Reasoning challenges	Harder cognitive tasks
MT-Bench	Multi-turn conversation	Chat quality
LMSYS Chatbot Arena	Human preference	Real-user ratings

Caveat: Benchmark scores are easy to game, for example through test-set contamination in training data or benchmark-specific tuning. A model with top benchmark scores may still underperform on your specific task. Always evaluate on your own data.
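The Pass@k figure reported for HumanEval and MBPP is usually computed with an unbiased estimator: generate n samples per problem, count the c samples that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes. A short sketch of the per-problem formula, 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems.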

The Eval Mindset

Effective LLM evaluation is a product discipline, not just a technical one:

Define failure modes first: What does “bad” look like? Hallucinations? Wrong format? Inappropriate tone? These should be explicitly tested.
Use real traffic: Synthetic test sets miss real edge cases. Sample production queries weekly for your eval set.
Measure what matters to users: A model that scores well on ROUGE but frustrates users is a failure. User satisfaction signals (thumbs up/down, session length, return visits) are the closest thing to ground truth you have.
Evaluate the system, not the model: In a RAG pipeline, a bad retrieval step will produce bad answers even with a great model. Test each component individually and the pipeline end to end.
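Defining failure modes first can be as simple as a table of named checks that every output must pass. The sketch below assumes the application expects JSON output. The check names and the refusal phrase are hypothetical examples, to be replaced with the failure modes that matter for your product.

```python
import json
from typing import Callable

def is_valid_json(output: str) -> bool:
    """Format check: the output must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# Each named check returns True when the output is acceptable (hypothetical examples).
FAILURE_CHECKS: dict[str, Callable[[str], bool]] = {
    "valid_json": is_valid_json,
    "non_empty": lambda out: bool(out.strip()),
    "no_refusal": lambda out: "i cannot help" not in out.lower(),
}

def failed_checks(output: str) -> list[str]:
    """Return the names of all checks the output fails."""
    return [name for name, check in FAILURE_CHECKS.items() if not check(output)]
```

Tracking the failure rate per check over time shows which failure mode regressed, not just that the overall score dropped.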