
Generative AI 26 guides · updated 2026

From transformer foundations to production RAG, tool-using agents, and the Model Context Protocol — the GenAI stack as it's actually being built in 2026.

LLM Evaluation

Shipping an LLM application without systematic evaluation is like deploying software without tests. You don’t know if it works, you can’t measure improvements, and you’ll be surprised by regressions. Evaluation is the discipline that separates production-grade AI from demos.


Why LLM Evaluation Is Hard

Traditional software testing: run the function, check if output equals expected output. Binary pass/fail.

LLM evaluation: generate text, judge if it’s “good enough.” What even is good?

The fundamental challenges:


Types of Evaluation

Automated Metrics

Exact match and substring match: For classification, extraction, and factual QA where there’s a correct answer.

def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
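Substring match is named above but not shown. A minimal sketch follows, assuming a simple normalization step (lowercasing, dropping ASCII punctuation, collapsing whitespace); the `normalize` helper and its rules are illustrative choices, not a standard.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def substring_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized reference appears inside the prediction."""
    ref = normalize(reference)
    # An empty reference would match everything, so treat it as a miss.
    if not ref:
        return 0.0
    return 1.0 if ref in normalize(prediction) else 0.0
```

Plain substring checks can produce false positives (for example, "paris" inside "comparison"), so word-boundary matching may be a better fit for short references.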

ROUGE (for summarization): Measures n-gram overlap between generated and reference summaries. ROUGE-1, ROUGE-2, and ROUGE-L are commonly reported. It is cheap to compute and correlates only weakly with human preference, but it remains useful as a coarse signal.
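To make the overlap idea concrete, here is a simplified ROUGE-1 sketch using whitespace tokenization. It omits stemming and other preprocessing that established implementations (such as the `rouge-score` package) apply, so its numbers will not match published scores exactly.

```python
from collections import Counter


def rouge_1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (no stemming or stopword handling)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat", "the cat sat on the mat")
```

Here every predicted word appears in the reference (precision 1.0), but only half of the reference is covered (recall 0.5), which is why summarization results usually report the F1 or recall variant explicitly.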

BERTScore: Computes semantic similarity between generated and reference text using contextual embeddings. Better than exact match for paraphrase-tolerant evaluation.
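The matching step behind BERTScore can be sketched without a model. The example below assumes token embeddings are already available and uses hypothetical 2-dimensional vectors; it shows only the greedy cosine matching and leaves out the contextual encoder and optional IDF weighting that the real metric uses.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand_vecs: list, ref_vecs: list) -> float:
    """BERTScore-style F1: match each token to its most similar counterpart."""
    if not cand_vecs or not ref_vecs:
        return 0.0
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# Hypothetical toy "embeddings"; a real setup would take them from a contextual encoder.
toy = {"car": [1.0, 0.1], "automobile": [0.9, 0.2], "banana": [0.0, 1.0]}
paraphrase = greedy_match_f1([toy["car"]], [toy["automobile"]])
unrelated = greedy_match_f1([toy["car"]], [toy["banana"]])
```

With these toy vectors the paraphrase pair scores far higher than the unrelated pair, even though the two have no surface tokens in common and exact match would score both as 0.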

Code correctness: Run the generated code against test cases. The result is binary (pass or fail), which makes this the most reliable automated metric for code generation.
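A minimal sketch of an execution-based check, assuming each test case is an `(args, expected)` pair and the generated code defines a function with a known name. The `passes_tests` helper is hypothetical, and `exec` here is for illustration only: untrusted model output should run in a sandbox with timeouts.

```python
def passes_tests(generated_code: str, func_name: str, test_cases: list) -> bool:
    """Return True only if the generated function passes every (args, expected) case."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. Use a sandbox for untrusted model output.
        exec(generated_code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0)]
```

This sketch does not guard against infinite loops or resource exhaustion; a production harness would typically run each candidate in a separate process with a time limit.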


LLM-as-Judge

Use a capable LLM (GPT-4, Claude 3.5, or Gemini 1.5 Pro) to evaluate another LLM’s outputs. This is the dominant approach for production evaluation in 2025–2026.

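import json
from typing import Optional

import anthropic

# Assumed setup (the original leaves judge_client undefined):
# the Anthropic SDK client reads ANTHROPIC_API_KEY from the environment.
judge_client = anthropic.Anthropic()


def llm_judge(
    question: str,
    model_response: str,
    reference_answer: Optional[str] = None,
    criteria: str = "accuracy, helpfulness, and factual correctness",
) -> dict:
    # Include the reference line only when a reference answer is provided.
    reference_line = f"Reference answer: {reference_answer}\n" if reference_answer else ""
    prompt = f"""Evaluate the following AI response on {criteria}.
Question: {question}
{reference_line}AI Response: {model_response}
Rate on a scale of 1-5 where:
1 = Completely wrong or unhelpful
3 = Partially correct with issues
5 = Excellent, accurate, and helpful
Respond as JSON: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""
    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    # json.loads raises if the judge wraps the JSON in any extra text.
    return json.loads(response.content[0].text)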
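Judges do not always return bare JSON; replies sometimes arrive wrapped in prose or a code fence. A minimal sketch of a more defensive parser, assuming the `score` and `reasoning` fields from the prompt above; the `parse_judge_output` helper is hypothetical.

```python
import json
import re


def parse_judge_output(raw: str) -> dict:
    """Extract and validate the JSON object embedded in a judge reply."""
    # Grab the span from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(match.group(0))
    score = data.get("score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}
```

Rejecting out-of-range or malformed scores, rather than silently coercing them, keeps bad judge outputs from skewing aggregate metrics.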

Pairwise comparison is often more reliable than absolute scoring:

# Instead of "is response A good?", ask "is response A better than B?"
prompt = f"""Which response better answers the question?
Question: {question}
Response A: {response_a}
Response B: {response_b}
Respond: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}"""
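LLM judges can favor whichever response is shown first. One common mitigation is to ask twice with the order swapped and keep a winner only when both runs agree. The sketch below assumes a hypothetical `judge` callable that returns "A", "B", or "tie", where "A" means the first response shown.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_response, second_response) -> "A" | "B" | "tie",
# where "A" refers to the first response shown and "B" to the second.
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, response_a: str, response_b: str) -> str:
    """Ask the judge twice with the order swapped; keep a winner only if both runs agree."""
    first = judge(question, response_a, response_b)
    swapped = judge(question, response_b, response_a)
    # Map the swapped verdict back to the original labels.
    second = {"A": "B", "B": "A"}.get(swapped, "tie")
    return first if first == second else "tie"
```

A judge that always picks the first position ends up producing ties under this scheme, which exposes the bias instead of letting it decide the comparison. The cost is two judge calls per pair.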

RAGAS: RAG-Specific Evaluation

For RAG systems, RAGAS provides a standardized evaluation framework with four metrics:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

# The "..." entries stand for additional examples elided here.
eval_dataset = Dataset.from_dict({
    "question": ["What is the refund policy?", ...],
    "contexts": [["Policy doc chunk 1...", "Policy doc chunk 2..."], ...],
    "answer": ["Our refund policy allows returns within 30 days...", ...],
    "ground_truth": ["Customers can return items within 30 days...", ...]
})

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(results)
# {faithfulness: 0.89, answer_relevancy: 0.92, context_recall: 0.85, context_precision: 0.78}

| Metric | What It Measures | Score Range |
| --- | --- | --- |
| Faithfulness | Does the answer stay within retrieved context? | 0–1 |
| Answer Relevancy | Does the answer address the question? | 0–1 |
| Context Recall | Did retrieval find all necessary information? | 0–1 |
| Context Precision | Is retrieved context free of irrelevant noise? | 0–1 |
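To give a feel for what a 0–1 context precision score captures, here is a simplified, rank-aware sketch. It assumes binary relevance labels for each retrieved chunk are already available; RAGAS derives such judgments with an LLM, and its exact formula may differ by version, so treat this as an illustration, not the library's implementation.

```python
def context_precision(relevance: list[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    """
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0
```

Under this definition, the same set of chunks scores higher when the relevant ones are ranked first: `[1, 1, 0]` scores above `[0, 1, 1]`.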

Human Evaluation

For high-stakes applications, human judgment remains the gold standard. But it’s slow and expensive.

When to use human eval:

Practical approach: Use human eval to create a “golden set” of 100–500 high-quality question/answer pairs. Run automated evaluation against this set continuously. Periodically sample production traffic for human review to catch distribution drift.
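Running automated evaluation against a golden set can be as small as the sketch below. The `GOLDEN_SET` contents, the `run_golden_set` helper, and the stand-in model are all hypothetical; a real set would be loaded from a versioned file and the model function would call your application.

```python
from statistics import mean
from typing import Callable

# Hypothetical golden set; in practice this would be loaded from a versioned file.
GOLDEN_SET = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]


def run_golden_set(model_fn: Callable[[str], str],
                   metric: Callable[[str, str], float],
                   examples: list) -> float:
    """Average a per-example metric over the golden set."""
    scores = [metric(model_fn(ex["question"]), ex["answer"]) for ex in examples]
    return mean(scores) if scores else 0.0


def exact(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


# Stand-in for a real model call: answers one question correctly, one incorrectly.
fake_model = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}.get
accuracy = run_golden_set(fake_model, exact, GOLDEN_SET)
```

Keeping the metric as a parameter lets the same runner serve exact match, an LLM judge, or any other scorer with a `(prediction, reference)` signature.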


Building an Evaluation Pipeline

┌──────────────────────────────────────────────────┐
│               Evaluation Pipeline                │
├──────────────────────────────────────────────────┤
│                                                  │
│  1. TEST SET                                     │
│     • Curated golden examples                    │
│     • Representative of real traffic             │
│     • Versioned alongside your model             │
│                                                  │
│  2. METRICS                                      │
│     • Task-specific (exact match, ROUGE, etc.)   │
│     • LLM judge scores                           │
│     • RAG-specific (RAGAS) if applicable         │
│                                                  │
│  3. RUNNER                                       │
│     • Automated on every PR / model change       │
│     • Run against current prod model as baseline │
│                                                  │
│  4. TRACKING                                     │
│     • Store results per version                  │
│     • Alert on regression >X%                    │
│     • Dashboard for historical trends            │
│                                                  │
└──────────────────────────────────────────────────┘
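The "alert on regression" step in the tracking stage can be sketched as a comparison between baseline and candidate scores. The `find_regressions` helper, the metric names, and the 2% default threshold are assumptions for illustration; the diagram leaves the threshold as X.

```python
def find_regressions(baseline: dict, candidate: dict, max_drop_pct: float = 2.0) -> dict:
    """Return metrics whose score fell by more than max_drop_pct relative to baseline."""
    regressions = {}
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or base_score <= 0:
            continue  # metric missing or baseline unusable; skip rather than guess
        drop_pct = (base_score - new_score) / base_score * 100
        if drop_pct > max_drop_pct:
            regressions[name] = round(drop_pct, 2)
    return regressions


# Hypothetical scores for the current production model and a candidate change.
baseline = {"exact_match": 0.80, "judge_score": 4.2}
candidate = {"exact_match": 0.72, "judge_score": 4.2}
failed = find_regressions(baseline, candidate)
```

In CI, a non-empty result would typically fail the check. On small test sets, a fixed percentage threshold can trigger on noise, so it may be worth pairing it with a minimum sample size or a significance test.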

Evaluation Tools and Frameworks

| Tool | Best For |
| --- | --- |
| Braintrust | End-to-end LLM eval platform, CI integration |
| RAGAS | RAG evaluation metrics |
| Langfuse | LLM observability + built-in eval |
| TruLens | RAG triad evaluation |
| Promptflow (Azure) | Enterprise eval pipelines |
| Weave (Weights & Biases) | Experiment tracking + eval |
| DeepEval | Open-source eval framework, many metrics |

Standard Benchmarks (For Model Selection)

When choosing a model, standard benchmarks provide a starting point:

| Benchmark | Tests | Notes |
| --- | --- | --- |
| MMLU | 57-subject knowledge test | Broad knowledge coverage |
| HumanEval / MBPP | Code generation | Pass@k on function writing |
| MATH | Competition math problems | Multi-step mathematical reasoning |
| GPQA | Graduate-level science | Expert-level knowledge |
| BIG-Bench Hard | Reasoning challenges | Harder cognitive tasks |
| MT-Bench | Multi-turn conversation | Chat quality |
| LMSYS Chatbot Arena | Human preference | Real-user ratings |
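The pass@k figure reported for HumanEval-style benchmarks is usually computed with an unbiased estimator: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes. A short sketch (the function name is a local choice):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per problem, of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail
```

The benchmark score is the mean of this value across problems. For k = 1 it reduces to the plain pass rate c / n.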

Caveat: Benchmark scores are easy to inflate, for example through test-set contamination or tuning aimed at the benchmark itself. A model with top benchmark scores may still underperform on your specific task, so always evaluate on your own data.


The Eval Mindset

Effective LLM evaluation is a product discipline, not just a technical one: