LLM Evaluation
Shipping an LLM application without systematic evaluation is like deploying software without tests. You don’t know if it works, you can’t measure improvements, and you’ll be surprised by regressions. Evaluation is the discipline that separates production-grade AI from demos.
Why LLM Evaluation Is Hard
Traditional software testing: run the function, check if output equals expected output. Binary pass/fail.
LLM evaluation: generate text, judge if it’s “good enough.” What even is good?
The fundamental challenges:
- No ground truth for many tasks (summarization, creative writing, reasoning)
- Multiple valid answers — different phrasings can all be equally correct
- Subjective quality — helpful, clear, accurate are all in the eye of the beholder
- Distribution shift — a model that works on your 50-example eval might fail on real traffic
- Hallucinations are hard to detect — wrong answers often look exactly like right ones
Types of Evaluation
Automated Metrics
Exact match and substring match: For classification, extraction, and factual QA where there’s a correct answer.
```python
def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
```

ROUGE (for summarization): Measures overlap between generated and reference summaries. ROUGE-1, ROUGE-2, ROUGE-L are commonly reported. Cheap to compute, correlates weakly with human preference but useful as a signal.
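To make the overlap idea concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L as F1 scores. It assumes naive whitespace tokenization and skips the stemming that production ROUGE packages apply, so its numbers will not match a reference implementation exactly. The function names are illustrative.

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Naive whitespace tokenization (an assumption of this sketch).
    return text.lower().split()


def rouge_n(prediction: str, reference: str, n: int = 1) -> float:
    """F1 over overlapping n-grams, using clipped counts."""
    def ngrams(tokens: list[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(_tokens(prediction)), ngrams(_tokens(reference))
    overlap = sum((pred & ref).values())  # Counter intersection clips counts.
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge_l(prediction: str, reference: str) -> float:
    """F1 based on the longest common subsequence (LCS) of tokens."""
    a, b = _tokens(prediction), _tokens(reference)
    if not a or not b:
        return 0.0
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)
```

ROUGE-N rewards shared word sequences of a fixed length, while ROUGE-L tolerates gaps as long as word order is preserved.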
BERTScore: Computes semantic similarity between generated and reference text using contextual embeddings. Better than exact match for paraphrase-tolerant evaluation.
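The matching step behind BERTScore can be sketched without the model itself. The example below assumes the token embeddings have already been computed by some contextual encoder and are passed in as plain lists of floats; `greedy_match_f1` and its inputs are hypothetical names for illustration, and the sketch leaves out refinements such as IDF weighting and baseline rescaling.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand_vecs: list[list[float]], ref_vecs: list[list[float]]) -> dict:
    """Greedy-matching precision, recall, and F1 over token embeddings.

    cand_vecs / ref_vecs hold one embedding per token (assumed inputs;
    in practice they would come from a contextual encoder such as BERT).
    """
    if not cand_vecs or not ref_vecs:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because matching happens in embedding space, a paraphrase whose tokens sit close to the reference tokens can score well even with no exact word overlap.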
Code correctness: Run the generated code against test cases. Each test either passes or fails, which makes this the most reliable metric for code generation.
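One way to run generated code against test cases is to write both to a temporary file and execute it in a separate Python process with a timeout. This is a sketch under the assumption that the tests are plain `assert` statements. A subprocess is not a security sandbox, so untrusted model output should run inside a container or similar isolation. The `passes_tests` helper is a hypothetical name.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(generated_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if the generated code plus its assert-based tests exits cleanly."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # Treat hangs and infinite loops as failures.
    # A failed assert or any uncaught exception produces a non-zero exit code.
    return result.returncode == 0
```

Averaging this boolean over a test suite gives a pass rate that can be tracked like any other metric.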
LLM-as-Judge
Use a capable LLM (GPT-4, Claude 3.5, or Gemini 1.5 Pro) to evaluate another LLM’s outputs. This is the dominant approach for production evaluation in 2025–2026.
```python
import json

import anthropic

# Assumed setup: the Anthropic SDK, with ANTHROPIC_API_KEY set in the environment.
judge_client = anthropic.Anthropic()


def llm_judge(
    question: str,
    model_response: str,
    reference_answer: str | None = None,
    criteria: str = "accuracy, helpfulness, and factual correctness",
) -> dict:
    # Build the optional reference line outside the f-string to avoid nested quotes.
    reference_block = f"Reference answer: {reference_answer}\n" if reference_answer else ""
    prompt = f"""Evaluate the following AI response on {criteria}.

Question: {question}
{reference_block}AI Response: {model_response}

Rate on a scale of 1-5 where:
1 = Completely wrong or unhelpful
3 = Partially correct with issues
5 = Excellent, accurate, and helpful

Respond as JSON: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # Raises json.JSONDecodeError if the judge wraps the JSON in extra text.
    return json.loads(response.content[0].text)
```

Pairwise comparison is often more reliable than absolute scoring:
```python
# Instead of "is response A good?", ask "is response A better than B?"
prompt = f"""Which response better answers the question?

Question: {question}
Response A: {response_a}
Response B: {response_b}

Respond: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}"""
```

RAGAS: RAG-Specific Evaluation
For RAG systems, RAGAS provides a standardized evaluation framework with four metrics:
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": ["What is the refund policy?", ...],
    "contexts": [["Policy doc chunk 1...", "Policy doc chunk 2..."], ...],
    "answer": ["Our refund policy allows returns within 30 days...", ...],
    "ground_truth": ["Customers can return items within 30 days...", ...]
})

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(results)
# {faithfulness: 0.89, answer_relevancy: 0.92, context_recall: 0.85, context_precision: 0.78}
```

| Metric | What It Measures | Score range |
|---|---|---|
| Faithfulness | Does the answer stay within retrieved context? | 0–1 |
| Answer Relevancy | Does the answer address the question? | 0–1 |
| Context Recall | Did retrieval find all necessary information? | 0–1 |
| Context Precision | Is retrieved context free of irrelevant noise? | 0–1 |
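To build intuition for where these 0–1 scores come from, the sketch below computes two of them from hand-labeled verdicts. It is a simplification, not the RAGAS implementation: the library derives the per-claim and per-chunk verdicts with an LLM, while here they are assumed to be given as booleans. Both function names are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant chunk.

    chunk_relevant is ordered by retrieval rank, so relevant chunks that
    appear early raise the score and early noise lowers it.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

For example, an answer with four claims of which three are supported scores 0.75 on faithfulness, and a retriever that returns relevant, irrelevant, relevant chunks in that order scores about 0.83 on context precision.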
Human Evaluation
For high-stakes applications, human judgment remains the gold standard. But it’s slow and expensive.
When to use human eval:
- Initial validation of evaluation methodology
- Detecting issues your automated metrics miss
- Final acceptance testing before a major release
- Calibrating LLM judges
Practical approach: Use human eval to create a “golden set” of 100–500 high-quality question/answer pairs. Run automated evaluation against this set continuously. Periodically sample production traffic for human review to catch distribution drift.
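Calibrating an LLM judge against the golden set usually comes down to measuring how often the judge agrees with human labels beyond what chance would produce. A minimal sketch using Cohen's kappa follows; it assumes both raters assign one categorical label per example, such as "pass" or "fail".

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own label frequencies.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used the same single label throughout.
    return (observed - expected) / (1 - expected)
```

A kappa near 1 suggests the judge can stand in for human review on this task; a value near 0 means its agreement is no better than chance and the judge prompt or rubric needs work.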
Building an Evaluation Pipeline
```
┌─────────────────────────────────────────────────┐
│               Evaluation Pipeline               │
├─────────────────────────────────────────────────┤
│                                                 │
│ 1. TEST SET                                     │
│    • Curated golden examples                    │
│    • Representative of real traffic             │
│    • Versioned alongside your model             │
│                                                 │
│ 2. METRICS                                      │
│    • Task-specific (exact match, ROUGE, etc.)   │
│    • LLM judge scores                           │
│    • RAG-specific (RAGAS) if applicable         │
│                                                 │
│ 3. RUNNER                                       │
│    • Automated on every PR / model change       │
│    • Run against current prod model as baseline │
│                                                 │
│ 4. TRACKING                                     │
│    • Store results per version                  │
│    • Alert on regression >X%                    │
│    • Dashboard for historical trends            │
│                                                 │
└─────────────────────────────────────────────────┘
```

Evaluation Tools and Frameworks
| Tool | Best For |
|---|---|
| Braintrust | End-to-end LLM eval platform, CI integration |
| RAGAS | RAG evaluation metrics |
| Langfuse | LLM observability + built-in eval |
| TruLens | RAG triad evaluation |
| Promptflow (Azure) | Enterprise eval pipelines |
| Weave (Weights & Biases) | Experiment tracking + eval |
| DeepEval | Open-source eval framework, many metrics |
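Whichever tool runs the evaluation, the regression check at the end of the pipeline is simple enough to sketch by hand. The example below compares per-metric mean scores of a candidate against a baseline. The `max_drop` threshold is an assumed placeholder expressed as an absolute drop in the mean, not a value from any of the tools above.

```python
from statistics import mean


def find_regressions(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    max_drop: float = 0.02,
) -> dict[str, float]:
    """Return {metric: drop} for every metric whose mean fell by more than max_drop."""
    regressions = {}
    for metric, base_scores in baseline.items():
        if metric not in candidate:
            continue  # Metric missing for the candidate; handle separately.
        drop = mean(base_scores) - mean(candidate[metric])
        if drop > max_drop:
            regressions[metric] = drop
    return regressions
```

In CI, a non-empty result can fail the build or post a warning on the pull request, depending on how strict the gate should be.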
Standard Benchmarks (For Model Selection)
When choosing a model, standard benchmarks provide a starting point:
| Benchmark | Tests | Notes |
|---|---|---|
| MMLU | 57-subject knowledge test | Broad knowledge coverage |
| HumanEval / MBPP | Code generation | Pass@k on function writing |
| MATH | Competition math problems | Multi-step mathematical reasoning |
| GPQA | Graduate-level science | Expert-level knowledge |
| BIG-Bench Hard | Reasoning challenges | Harder cognitive tasks |
| MT-Bench | Multi-turn conversation | Chat quality |
| LMSYS Chatbot Arena | Human preference | Real-user ratings |
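The Pass@k figure reported for code benchmarks is usually computed with an unbiased estimator: generate n samples per problem, count the c samples that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the average of this value over all problems. With 10 samples and 3 passing, pass@1 is 0.3, while pass@5 is much higher because only one of the five draws needs to pass.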
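Caveat: Benchmark scores are easy to game, for example through test data leaking into training sets or through tuning aimed at a specific leaderboard. A model with top benchmark scores may underperform on your specific task. Always evaluate on your own data.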
The Eval Mindset
Effective LLM evaluation is a product discipline, not just a technical one:
- Define failure modes first: What does “bad” look like? Hallucinations? Wrong format? Inappropriate tone? These should be explicitly tested.
- Use real traffic: Synthetic test sets miss real edge cases. Sample production queries weekly for your eval set.
- Measure what matters to users: A model that scores well on ROUGE but that users hate is a failure. User satisfaction metrics (thumbs up/down, session length, return visits) are the closest thing to ground truth.
- Evaluate the system, not the model: In a RAG pipeline, a bad retrieval step will fail even with a great model. Test each component and the end-to-end system.