Corrective RAG (CRAG): Self-Correcting Retrieval with Quality Assessment

Implement Corrective RAG (CRAG): retrieval quality evaluation, web search fallback, and document refinement, combined into a self-correcting RAG pipeline that handles low-quality retrievals.

Corrective RAG: What If Your Retrieval Is Wrong?

Standard RAG pipelines are optimistic — they retrieve, they generate, and they trust that retrieval found the right documents. But what happens when the relevant documents aren’t in your corpus? Or when the retrieved chunks are only tangentially related to the query? The LLM hallucinates, or gives a vague, hedge-filled non-answer.

Corrective RAG (CRAG) addresses this by evaluating retrieval quality before generation and taking corrective action when retrieval is poor. Instead of blindly generating from whatever was retrieved, CRAG’s retrieval evaluator decides whether the retrieved documents are actually useful — and if not, either refines the search or falls back to external sources.

The CRAG Architecture

CRAG Pipeline:

User Query
  ↓
Vector Search → Retrieved Documents
  ↓
Retrieval Evaluator
├── CORRECT (confidence ≥ upper threshold)
│     → Refine documents (extract key knowledge)
│     → Generate answer
├── INCORRECT (confidence < lower threshold)
│     → Discard local results
│     → Web search for up-to-date/broader information
│     → Generate answer from web results
└── AMBIGUOUS (confidence between the two thresholds)
      → Use both local + web results
      → Generate with combined context

The key insight: acknowledging when retrieval failed and doing something about it is better than silently generating from irrelevant documents.
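
The three-way branch in the diagram can be sketched as a small routing function. This is a minimal illustration, not the evaluator itself: the names `Action` and `choose_action` are hypothetical, and the 0.7 and 0.3 defaults are assumptions that mirror the threshold parameters used by the evaluator in this article.

```python
from enum import Enum


class Action(str, Enum):
    CORRECT = "CORRECT"      # trust local documents
    INCORRECT = "INCORRECT"  # discard local documents, search the web
    AMBIGUOUS = "AMBIGUOUS"  # combine local documents and web results


def choose_action(confidence: float, upper: float = 0.7, lower: float = 0.3) -> Action:
    """Map an evaluator confidence in [0, 1] to one of the three CRAG branches."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if confidence >= upper:
        return Action.CORRECT
    if confidence < lower:
        return Action.INCORRECT
    return Action.AMBIGUOUS


print(choose_action(0.9).value, choose_action(0.5).value, choose_action(0.1).value)
```

Widening the gap between the two thresholds sends more queries down the combined path, which costs a web search but reduces the risk of trusting a weak retrieval.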

Building the Retrieval Evaluator

The evaluator scores how relevant retrieved documents are to the query:

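import anthropic
from pydantic import BaseModel

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

VALID_SCORES = {"CORRECT", "INCORRECT", "AMBIGUOUS"}


class RelevanceScore(BaseModel):
    score: str  # "CORRECT", "INCORRECT", "AMBIGUOUS"
    confidence: float  # 0.0 to 1.0
    reasoning: str


def evaluate_retrieval(
    query: str,
    retrieved_docs: list[str],
    threshold_correct: float = 0.7,
    threshold_incorrect: float = 0.3,
) -> RelevanceScore:
    # Nothing retrieved: no need to spend an LLM call on the verdict.
    if not retrieved_docs:
        return RelevanceScore(score="INCORRECT", confidence=0.0, reasoning="No documents retrieved")

    # Only the top 3 documents are graded, to keep the prompt short.
    docs_text = "\n\n---\n\n".join(retrieved_docs[:3])
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Evaluate whether these retrieved documents are relevant and sufficient
to answer the user's query.
Query: {query}
Retrieved Documents:
{docs_text}
Rate the retrieval quality:
- CORRECT: Documents clearly contain information to answer the query
- INCORRECT: Documents are not relevant to the query
- AMBIGUOUS: Documents are tangentially related but may not fully answer the query
Provide a confidence score from 0.0 to 1.0 that the documents can answer the query
(1.0 = clearly sufficient, 0.0 = irrelevant), and brief reasoning.
Respond in format:
Score: [CORRECT/INCORRECT/AMBIGUOUS]
Confidence: [0.0-1.0]
Reasoning: [brief explanation]"""
        }]
    )
    text = response.content[0].text
    lines = [line.strip() for line in text.strip().split('\n')]
    # Fall back to neutral defaults if the model omits a field.
    score_line = next((line for line in lines if line.startswith('Score:')), "Score: AMBIGUOUS")
    conf_line = next((line for line in lines if line.startswith('Confidence:')), "Confidence: 0.5")
    reason_line = next((line for line in lines if line.startswith('Reasoning:')), "Reasoning: unclear")
    score = score_line.split(':', 1)[1].strip().strip('[]').upper()
    reasoning = reason_line.split(':', 1)[1].strip()
    try:
        confidence = float(conf_line.split(':', 1)[1].strip().strip('[]'))
        confidence = min(1.0, max(0.0, confidence))
    except ValueError:
        # No usable number: keep the model's own label if it is a valid one.
        confidence = 0.5
        if score not in VALID_SCORES:
            score = "AMBIGUOUS"
    else:
        # Apply the thresholds so the label always agrees with the confidence.
        if confidence >= threshold_correct:
            score = "CORRECT"
        elif confidence < threshold_incorrect:
            score = "INCORRECT"
        else:
            score = "AMBIGUOUS"
    return RelevanceScore(score=score, confidence=confidence, reasoning=reasoning)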
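
The line-based parsing above expects each field to start exactly with `Score:`, `Confidence:`, or `Reasoning:`. Models sometimes wrap these in markdown or change the casing, so a more tolerant parser may be worth having. The sketch below is one possible approach using regular expressions; `ParsedEvaluation` and `parse_evaluation` are hypothetical names, and the fallback values (AMBIGUOUS, 0.5, "unclear") are assumptions carried over from the defaults used above.

```python
import re
from typing import NamedTuple


class ParsedEvaluation(NamedTuple):
    score: str
    confidence: float
    reasoning: str


def parse_evaluation(text: str) -> ParsedEvaluation:
    """Parse evaluator output, tolerating markdown emphasis and mixed case."""
    label = re.search(r"score\W*(CORRECT|INCORRECT|AMBIGUOUS)\b", text, re.IGNORECASE)
    conf = re.search(r"confidence\W*?(\d*\.?\d+)", text, re.IGNORECASE)
    reason = re.search(r"reasoning\W*?:\W*(.+)", text, re.IGNORECASE)

    score = label.group(1).upper() if label else "AMBIGUOUS"
    confidence = float(conf.group(1)) if conf else 0.5
    confidence = min(1.0, max(0.0, confidence))  # clamp to [0, 1]
    reasoning = reason.group(1).strip() if reason else "unclear"
    return ParsedEvaluation(score, confidence, reasoning)


print(parse_evaluation("**Score:** incorrect\n**Confidence:** 0.2\n**Reasoning:** Off topic."))
```

A stricter alternative is to ask the model for JSON and validate it with the Pydantic model; the regex route is shown here only because it keeps the original prompt format unchanged.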

Knowledge Refinement (CORRECT Path)

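When retrieval is CORRECT, CRAG still refines the documents before generation: it keeps only the key "knowledge strips" (the short passages that actually bear on the query) and drops the rest. The version below approximates this with a single LLM extraction call: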

def refine_documents(query: str, documents: list[str]) -> str:
    """Extract the most relevant knowledge from documents for the query."""
    docs_text = "\n\n".join(documents)
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Extract only the most relevant information from these documents
to answer the following query. Remove irrelevant sentences.
Be precise and factual.
Query: {query}
Documents:
{docs_text}
Extracted relevant knowledge:"""
        }]
    )
    return response.content[0].text.strip()
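
For comparison, the strip-level idea can also be sketched without an LLM call: split each document into short strips, score every strip against the query, and keep only those above a cutoff. Everything here is illustrative. The function names are hypothetical, and `overlap_score` is a deliberately crude keyword-overlap stand-in for a real relevance scorer (in practice you would plug in the retrieval evaluator or a cross-encoder through the `scorer` parameter).

```python
import re
from typing import Callable, List


def split_into_strips(document: str, sentences_per_strip: int = 2) -> List[str]:
    """Split a document into short 'knowledge strips' of a few sentences each."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_strip])
        for i in range(0, len(sentences), sentences_per_strip)
    ]


def overlap_score(query: str, strip: str) -> float:
    """Toy relevance score: fraction of query words that appear in the strip."""
    query_words = set(re.findall(r"\w+", query.lower()))
    strip_words = set(re.findall(r"\w+", strip.lower()))
    return len(query_words & strip_words) / len(query_words) if query_words else 0.0


def filter_strips(
    query: str,
    documents: List[str],
    scorer: Callable[[str, str], float] = overlap_score,
    min_score: float = 0.3,
) -> str:
    """Keep only strips whose score reaches min_score, then rejoin them."""
    kept = [
        strip
        for doc in documents
        for strip in split_into_strips(doc)
        if scorer(query, strip) >= min_score
    ]
    return "\n".join(kept)


doc = ("PyTorch is a deep learning framework. Its release date was in 2016. "
       "Bananas are yellow. Cats sleep a lot.")
print(filter_strips("PyTorch release date", [doc]))
```

Scoring strips individually makes it possible to see exactly which passages were dropped, which is harder to audit with a single extraction prompt.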

Web Search Fallback (INCORRECT Path)

When retrieval fails, fall back to web search:

import os

from tavily import TavilyClient  # or use SerpAPI, Bing API

# Read the key from the environment rather than hardcoding it in source.
tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])


def web_search_fallback(query: str) -> list[str]:
    """Search the web when local retrieval fails."""
    results = tavily_client.search(
        query=query,
        search_depth="advanced",
        max_results=5,
        include_answer=True,
    )
    documents = []
    # Tavily can return a synthesized answer alongside the raw results.
    if results.get("answer"):
        documents.append(f"Direct answer: {results['answer']}")
    for result in results.get("results", []):
        # Keep the URL so the generator can cite it; cap each snippet at 500 characters.
        documents.append(
            f"Source: {result['url']}\n{result['content'][:500]}"
        )
    return documents
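
A web search call can fail (network errors, rate limits, a missing API key) or come back empty, and the fallback path should not take the whole pipeline down with it. The sketch below wraps any search function and degrades to the local documents in those cases. `safe_web_fallback` is a hypothetical helper, and returning the local documents on failure is a design assumption; some applications may prefer to answer "I don't know" instead.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def safe_web_fallback(
    query: str,
    search_fn: Callable[[str], List[str]],
    local_docs: List[str],
) -> List[str]:
    """Run search_fn; if it raises or returns nothing usable, return local_docs."""
    try:
        results = search_fn(query)
    except Exception as exc:  # network errors, rate limits, bad API key, ...
        logger.warning("Web search failed, falling back to local documents: %s", exc)
        return list(local_docs)

    # Drop empty and duplicate snippets while preserving order.
    seen = set()
    cleaned = []
    for text in results:
        snippet = text.strip()
        if snippet and snippet not in seen:
            seen.add(snippet)
            cleaned.append(snippet)
    return cleaned or list(local_docs)
```

With the function defined earlier, a call would look like `safe_web_fallback(query, web_search_fallback, retrieved_texts)`.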

Complete CRAG Pipeline

from openai import OpenAI

openai_client = OpenAI()


def crag_pipeline(query: str, vectorstore) -> dict:
    # Step 1: Initial retrieval
    retrieved = vectorstore.similarity_search(query, k=5)
    retrieved_texts = [doc.page_content for doc in retrieved]

    # Step 2: Evaluate retrieval quality
    evaluation = evaluate_retrieval(query, retrieved_texts)
    context_sources = []
    if evaluation.score == "CORRECT":
        # Refine local docs
        refined = refine_documents(query, retrieved_texts)
        context_sources = [refined]
        source_type = "local_knowledge_base"
    elif evaluation.score == "INCORRECT":
        # Discard local, use web
        web_results = web_search_fallback(query)
        context_sources = web_results
        source_type = "web_search"
    else:  # AMBIGUOUS
        # Combine local and web
        refined_local = refine_documents(query, retrieved_texts[:2])
        web_results = web_search_fallback(query)[:2]
        context_sources = [refined_local] + web_results
        source_type = "combined"

    # Step 3: Generate answer
    context = "\n\n".join(context_sources)
    answer_response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. Be accurate and cite your sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ]
    )
    return {
        "answer": answer_response.choices[0].message.content,
        "retrieval_quality": evaluation.score,
        "retrieval_confidence": evaluation.confidence,
        "source_type": source_type,
        "evaluator_reasoning": evaluation.reasoning,
    }

LangGraph CRAG Implementation

For production CRAG, LangGraph provides cleaner state management:

from typing import List, TypedDict

from langgraph.graph import StateGraph, END

# Assumes `evaluate_retrieval` and `web_search_fallback` from the sections above,
# a module-level `vectorstore`, and a `generate_answer(query, context)` helper
# that wraps the generation call used in crag_pipeline.


class CRAGState(TypedDict):
    query: str
    documents: List[str]
    generation: str
    retrieval_score: str


# Each node returns only the state keys it updates; LangGraph merges them.
def retrieve(state: CRAGState) -> dict:
    docs = vectorstore.similarity_search(state["query"], k=4)
    return {"documents": [d.page_content for d in docs]}


def grade_documents(state: CRAGState) -> dict:
    evaluation = evaluate_retrieval(state["query"], state["documents"])
    return {"retrieval_score": evaluation.score}


def web_search(state: CRAGState) -> dict:
    web_results = web_search_fallback(state["query"])
    if state["retrieval_score"] == "AMBIGUOUS":
        # Borderline retrieval: keep the local documents and add web results.
        return {"documents": state["documents"] + web_results}
    # INCORRECT: discard the local documents entirely.
    return {"documents": web_results}


def generate(state: CRAGState) -> dict:
    context = "\n\n".join(state["documents"])
    answer = generate_answer(state["query"], context)
    return {"generation": answer}


def route_retrieval(state: CRAGState) -> str:
    # Only CORRECT skips the web; INCORRECT and AMBIGUOUS both search.
    if state["retrieval_score"] == "CORRECT":
        return "generate"
    return "web_search"


# Build graph
workflow = StateGraph(CRAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("web_search", web_search)
workflow.add_node("generate", generate)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    route_retrieval,
    {"web_search": "web_search", "generate": "generate"},
)
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)
app = workflow.compile()

result = app.invoke({"query": "What is the latest version of PyTorch?"})

When CRAG Adds Real Value

CRAG is most valuable when:

  • Your corpus has coverage gaps (time-sensitive queries, niche topics)
  • Users ask about recent events not in your indexed documents
  • False confidence in bad retrievals causes more harm than “I don’t know”
  • You have access to a reliable web search API

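CRAG adds overhead: the evaluator LLM call alone typically adds 200–500 ms per query, and the refinement and web search steps add more on the paths that use them. Skip it when: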

  • Your corpus is comprehensive and regularly updated
  • Latency is critical
  • External web search raises data governance concerns
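
Whether the overhead is acceptable depends on your own model, prompt size, and network, so it is worth measuring rather than assuming. The helper below is a minimal timing sketch; `measure_latency` is a hypothetical name, and the `time.sleep` call only stands in for a real evaluator call.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def measure_latency(fn: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Call fn `runs` times and report median and worst-case latency in ms."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return {"median_ms": median(timings), "max_ms": max(timings)}


# Stand-in workload; replace the lambda with a real evaluator call.
stats = measure_latency(lambda: time.sleep(0.01), runs=3)
print(f"median {stats['median_ms']:.1f} ms, max {stats['max_ms']:.1f} ms")
```

To time the evaluator from this article, pass something like `lambda: evaluate_retrieval(query, retrieved_texts)` and compare the median against your latency budget.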

2025 Trend: Fine-Tuned Retrieval Evaluators

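Instead of calling a general-purpose LLM for every evaluation, you can fine-tune a small BERT-class model as the retrieval evaluator, which can cut evaluation latency from roughly 200 ms (and about 500 tokens per call) to under 20 ms. Labeled data for fine-tuning can be bootstrapped by having a stronger model such as GPT-4 label query-document pairs from your own domain. Treat these as approximate labels and spot-check a sample by hand, since they are model-generated rather than true ground truth.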
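
The bootstrapping step can be sketched as a small dataset builder. This is an illustration under stated assumptions: `build_training_rows` and `to_jsonl` are hypothetical helpers, `label_fn` stands for whatever labeling call you use (for example, a wrapper around an LLM evaluator), and the JSONL layout is just one convenient format for a later fine-tuning script.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

LABELS = ("CORRECT", "INCORRECT", "AMBIGUOUS")


def build_training_rows(
    pairs: Iterable[Tuple[str, str]],
    label_fn: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Label (query, document) pairs and keep only rows with a valid label."""
    rows = []
    for query, document in pairs:
        label = label_fn(query, document)
        if label not in LABELS:
            continue  # skip unusable labels rather than guessing
        rows.append({"query": query, "document": document, "label": label})
    return rows


def to_jsonl(rows: List[Dict[str, str]]) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Stub labeler for demonstration only; a real one would call a model.
def stub_labeler(query: str, document: str) -> str:
    return "CORRECT" if query.lower() in document.lower() else "INCORRECT"


pairs = [("pytorch", "PyTorch is a deep learning framework."), ("pytorch", "Bananas are yellow.")]
print(to_jsonl(build_training_rows(pairs, stub_labeler)))
```

Holding out a hand-labeled subset of the same pairs gives a way to check both the bootstrapped labels and the fine-tuned model against human judgment.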

CRAG is a meaningful architectural upgrade for RAG systems where retrieval failures cause user-visible errors. Knowing when you don’t know is a form of intelligence — and CRAG builds that into your pipeline.