Corrective RAG: What If Your Retrieval Is Wrong?

Standard RAG pipelines are optimistic — they retrieve, they generate, and they trust that retrieval found the right documents. But what happens when the relevant documents aren’t in your corpus? Or when the retrieved chunks are only tangentially related to the query? The LLM hallucinates, or gives a vague, hedge-filled non-answer.

Corrective RAG (CRAG) addresses this by evaluating retrieval quality before generation and taking corrective action when retrieval is poor. Instead of blindly generating from whatever was retrieved, CRAG’s retrieval evaluator decides whether the retrieved documents are actually useful — and if not, either refines the search or falls back to external sources.

The CRAG Architecture

CRAG Pipeline:

User Query
    ↓
Vector Search → Retrieved Documents
    ↓
Retrieval Evaluator
    ├── CORRECT (confidence ≥ upper threshold)
    │   → Refine documents (extract key knowledge)
    │   → Generate answer
    │
    ├── INCORRECT (confidence < lower threshold)
    │   → Discard local results
    │   → Web search for up-to-date/broader information
    │   → Generate answer from web results
    │
    └── AMBIGUOUS (confidence between the two thresholds)
        → Use both local + web results
        → Generate with combined context

The key insight: acknowledging when retrieval failed and doing something about it is better than silently generating from irrelevant documents.
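
The three branches reduce to a small routing function. The sketch below is illustrative only: the `Action` enum, the `route` name, and the default thresholds (0.7 and 0.3, matching the evaluator defaults used later in this article) are assumptions rather than a fixed part of CRAG.

```python
from enum import Enum


class Action(Enum):
    USE_LOCAL = "refine local documents, then generate"
    USE_WEB = "discard local documents, search the web, then generate"
    USE_BOTH = "combine local and web results, then generate"


def route(confidence: float, upper: float = 0.7, lower: float = 0.3) -> Action:
    """Map an evaluator confidence in [0, 1] to one of the three CRAG branches."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= upper:
        return Action.USE_LOCAL   # CORRECT
    if confidence < lower:
        return Action.USE_WEB     # INCORRECT
    return Action.USE_BOTH        # AMBIGUOUS


for c in (0.9, 0.5, 0.1):
    print(c, route(c).name)
```

Keeping the thresholds as parameters makes it easy to tune how aggressively the pipeline falls back to the web.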

Building the Retrieval Evaluator

The evaluator scores how relevant retrieved documents are to the query:

import anthropic
from pydantic import BaseModel

client = anthropic.Anthropic()

class RelevanceScore(BaseModel):
    score: str  # "CORRECT", "INCORRECT", "AMBIGUOUS"
    confidence: float  # 0.0 to 1.0
    reasoning: str

def evaluate_retrieval(
    query: str,
    retrieved_docs: list[str],
    threshold_correct: float = 0.7,
    threshold_incorrect: float = 0.3,
) -> RelevanceScore:
    docs_text = "\n\n---\n\n".join(retrieved_docs[:3])

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Evaluate whether these retrieved documents are relevant and sufficient
to answer the user's query.

Query: {query}

Retrieved Documents:
{docs_text}

Rate the retrieval quality:
- CORRECT: Documents clearly contain information to answer the query
- INCORRECT: Documents are not relevant to the query
- AMBIGUOUS: Documents are tangentially related but may not fully answer the query

Provide a confidence score from 0.0 to 1.0 and brief reasoning.

Respond in format:
Score: [CORRECT/INCORRECT/AMBIGUOUS]
Confidence: [0.0-1.0]
Reasoning: [brief explanation]"""
        }]
    )

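    text = response.content[0].text

    # Parse "Key: value" lines, tolerating case differences and markdown emphasis
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(" *").lower(), value.strip(" *[]"))

    label = fields.get("score", "AMBIGUOUS").upper()
    reasoning = fields.get("reasoning", "unclear")
    try:
        confidence = min(max(float(fields.get("confidence", "0.5")), 0.0), 1.0)
    except ValueError:
        confidence = 0.5  # unparseable confidence: treat as borderline

    # Apply the thresholds. The model reports confidence in its own label, so
    # convert it to a relevance estimate first, then bucket that estimate.
    relevance = {"CORRECT": confidence, "INCORRECT": 1.0 - confidence}.get(label, 0.5)
    if relevance >= threshold_correct:
        score = "CORRECT"
    elif relevance < threshold_incorrect:
        score = "INCORRECT"
    else:
        score = "AMBIGUOUS"

    return RelevanceScore(score=score, confidence=confidence, reasoning=reasoning)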
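
The evaluator above grades the top documents as a single batch. A variant is to score each document on its own and then aggregate, so that one strong document is not diluted by weak neighbors. The sketch below assumes per-document relevance scores in [0, 1] are already available (for example, from one evaluator call per document); `aggregate_scores` and its aggregation rule are illustrative assumptions, not part of the code above.

```python
from typing import Sequence


def aggregate_scores(
    doc_scores: Sequence[float],
    upper: float = 0.7,
    lower: float = 0.3,
) -> str:
    """Collapse per-document relevance scores into one retrieval verdict."""
    if not doc_scores:
        return "INCORRECT"  # nothing retrieved: treat as a retrieval failure
    if any(s >= upper for s in doc_scores):
        return "CORRECT"    # at least one document clearly answers the query
    if all(s < lower for s in doc_scores):
        return "INCORRECT"  # every document is clearly irrelevant
    return "AMBIGUOUS"      # mixed or borderline evidence


print(aggregate_scores([0.9, 0.1, 0.2]))
print(aggregate_scores([0.5, 0.1]))
print(aggregate_scores([0.2, 0.1]))
```

Per-document scoring costs more evaluator calls, so it trades latency for a finer-grained verdict.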

When retrieval is CORRECT, CRAG refines the documents before generation, keeping only the key “knowledge strips” (the passages that actually bear on the query):

def refine_documents(query: str, documents: list[str]) -> str:
    """Extract the most relevant knowledge from documents for the query."""
    docs_text = "\n\n".join(documents)

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Extract only the most relevant information from these documents
to answer the following query. Remove irrelevant sentences.
Be precise and factual.

Query: {query}

Documents:
{docs_text}

Extracted relevant knowledge:"""
        }]
    )

    return response.content[0].text.strip()
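
The same refinement idea can be sketched without an LLM call: split each document into sentence-level strips, score each strip against the query, and keep only the strips that pass. Everything here is an assumption made for illustration. The word-overlap scorer is a crude stand-in for a real relevance model, and the function names and the 0.25 cutoff are arbitrary.

```python
import re


def split_into_strips(document: str) -> list[str]:
    """Split a document into sentence-level strips (naive punctuation split)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def overlap_score(query: str, strip: str) -> float:
    """Stand-in relevance score: fraction of query words that appear in the strip."""
    query_words = set(re.findall(r"\w+", query.lower()))
    strip_words = set(re.findall(r"\w+", strip.lower()))
    if not query_words:
        return 0.0
    return len(query_words & strip_words) / len(query_words)


def refine_by_strips(query: str, documents: list[str], min_score: float = 0.25) -> str:
    """Keep only strips scoring at least min_score, preserving original order."""
    kept = [
        strip
        for doc in documents
        for strip in split_into_strips(doc)
        if overlap_score(query, strip) >= min_score
    ]
    return " ".join(kept)


docs = [
    "Acme Widget v2 shipped in March. The v2 release added offline mode. "
    "Our office has a new coffee machine."
]
print(refine_by_strips("What did the Acme v2 release add?", docs))
```

In practice the scorer would be the retrieval evaluator itself or an embedding similarity; lexical overlap is used here only so the example runs with no external dependencies.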

Web Search Fallback (INCORRECT Path)

When retrieval fails, fall back to web search:

import os

from tavily import TavilyClient  # or use SerpAPI, Bing API

# Read the key from the environment instead of hardcoding it in source
tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def web_search_fallback(query: str) -> list[str]:
    """Search the web when local retrieval fails."""
    results = tavily_client.search(
        query=query,
        search_depth="advanced",
        max_results=5,
        include_answer=True,
    )

    documents = []
    if results.get("answer"):
        documents.append(f"Direct answer: {results['answer']}")

    for result in results.get("results", []):
        documents.append(
            f"Source: {result['url']}\n{result['content'][:500]}"
        )

    return documents
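
Web search is an external dependency, so it can fail or come back empty. A hedged sketch of a guard around the fallback follows. `safe_web_search` and its `search_fn` parameter are hypothetical names; `search_fn` stands for any callable with the same shape as `web_search_fallback`.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def safe_web_search(
    query: str,
    search_fn: Callable[[str], list[str]],
    local_docs: list[str],
) -> tuple[list[str], str]:
    """Try web search; fall back to local documents if it fails or is empty.

    Returns (documents, source_type) so the caller can report where context came from.
    """
    try:
        web_docs = search_fn(query)
    except Exception as exc:  # network errors, quota errors, bad API key, ...
        logger.warning("Web search failed for %r: %s", query, exc)
        web_docs = []
    if web_docs:
        return web_docs, "web_search"
    # Degraded mode: return local docs, flagged so the caller can warn the user
    # or decline to answer instead of trusting weak context.
    return local_docs, "local_fallback"


def broken_search(query: str) -> list[str]:
    """Simulates an unavailable search API for the example."""
    raise TimeoutError("search API unavailable")


print(safe_web_search("example query", broken_search, ["local doc"]))
```

Returning the source type alongside the documents keeps the degraded path visible to the caller rather than hiding it.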

Complete CRAG Pipeline

from openai import OpenAI

openai_client = OpenAI()

def crag_pipeline(query: str, vectorstore) -> dict:
    # Step 1: Initial retrieval
    retrieved = vectorstore.similarity_search(query, k=5)
    retrieved_texts = [doc.page_content for doc in retrieved]

    # Step 2: Evaluate retrieval quality
    evaluation = evaluate_retrieval(query, retrieved_texts)

    context_sources = []

    if evaluation.score == "CORRECT":
        # Refine local docs
        refined = refine_documents(query, retrieved_texts)
        context_sources = [refined]
        source_type = "local_knowledge_base"

    elif evaluation.score == "INCORRECT":
        # Discard local, use web
        web_results = web_search_fallback(query)
        context_sources = web_results
        source_type = "web_search"

    else:  # AMBIGUOUS
        # Combine local and web
        refined_local = refine_documents(query, retrieved_texts[:2])
        web_results = web_search_fallback(query)[:2]
        context_sources = [refined_local] + web_results
        source_type = "combined"

    # Step 3: Generate answer
    context = "\n\n".join(context_sources)
    answer_response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. Be accurate and cite your sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ]
    )

    return {
        "answer": answer_response.choices[0].message.content,
        "retrieval_quality": evaluation.score,
        "retrieval_confidence": evaluation.confidence,
        "source_type": source_type,
        "evaluator_reasoning": evaluation.reasoning,
    }

LangGraph CRAG Implementation

For production CRAG, LangGraph provides cleaner state management:

from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class CRAGState(TypedDict):
    query: str
    documents: List[str]
    generation: str
    retrieval_score: str

def retrieve(state: CRAGState) -> CRAGState:
    # Assumes a module-level `vectorstore` (the same object passed to crag_pipeline)
    docs = vectorstore.similarity_search(state["query"], k=4)
    return {"documents": [d.page_content for d in docs]}

def grade_documents(state: CRAGState) -> CRAGState:
    evaluation = evaluate_retrieval(state["query"], state["documents"])
    return {"retrieval_score": evaluation.score}

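def web_search(state: CRAGState) -> CRAGState:
    web_results = web_search_fallback(state["query"])
    if state["retrieval_score"] == "AMBIGUOUS":
        # Borderline retrieval: keep the local documents and add web results
        return {"documents": state["documents"] + web_results}
    # INCORRECT: discard the local documents entirely
    return {"documents": web_results}

def generate_answer(query: str, context: str) -> str:
    # Same generation call as in crag_pipeline above
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. Be accurate and cite your sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

def generate(state: CRAGState) -> CRAGState:
    context = "\n\n".join(state["documents"])
    answer = generate_answer(state["query"], context)
    return {"generation": answer}

def route_retrieval(state: CRAGState) -> str:
    # Only CORRECT retrievals skip the web; INCORRECT and AMBIGUOUS both search it
    if state["retrieval_score"] == "CORRECT":
        return "generate"
    return "web_search"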

# Build graph
workflow = StateGraph(CRAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("web_search", web_search)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges("grade_documents", route_retrieval)
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)

app = workflow.compile()
result = app.invoke({"query": "What is the latest version of PyTorch?"})

When CRAG Adds Real Value

CRAG is most valuable when:

Your corpus has coverage gaps (time-sensitive queries, niche topics)
Users ask about recent events not in your indexed documents
False confidence in bad retrievals causes more harm than “I don’t know”
You have access to a reliable web search API

CRAG adds overhead: the evaluator LLM call alone adds roughly 200–500ms, and the refinement and web search steps add more on top. Skip it when:

Your corpus is comprehensive and regularly updated
Latency is critical
External web search raises data governance concerns
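
One way to judge whether CRAG is earning its overhead is to log the pipeline's output and see how often each branch fires. The sketch below assumes each logged record is the dict returned by `crag_pipeline` (only its `source_type` key is used); `summarize_routes` is an illustrative helper, not part of the pipeline above.

```python
from collections import Counter


def summarize_routes(results: list[dict]) -> dict[str, float]:
    """Return the share of queries handled by each CRAG branch."""
    counts = Counter(r["source_type"] for r in results)
    total = sum(counts.values())
    return {route: n / total for route, n in counts.items()} if total else {}


logged = [
    {"source_type": "local_knowledge_base"},
    {"source_type": "local_knowledge_base"},
    {"source_type": "local_knowledge_base"},
    {"source_type": "web_search"},
]
print(summarize_routes(logged))
# → {'local_knowledge_base': 0.75, 'web_search': 0.25}
```

If nearly every query routes to `local_knowledge_base`, the evaluator call is mostly overhead. A large `web_search` share points to coverage gaps in the corpus.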

2025 Trend: Fine-Tuned Retrieval Evaluators

Instead of calling a large LLM for every evaluation, you can fine-tune a small BERT-class model as the retrieval evaluator. This cuts evaluation latency from the 200–500ms (and several hundred tokens) of an LLM call to under 20ms. Training labels can be bootstrapped by having a strong LLM such as GPT-4 label query-document pairs from your own domain. Because these labels are model-generated rather than true ground truth, spot-check a sample before training on them.
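
A minimal sketch of the bootstrapping step, assuming a labeling function is available. `label_fn` is a hypothetical stand-in for an LLM call that returns 1 (relevant) or 0 (not relevant), `toy_label_fn` exists only so the example runs offline, and the JSONL field names are arbitrary choices.

```python
import json
from typing import Callable, Iterable


def build_training_rows(
    pairs: Iterable[tuple[str, str]],
    label_fn: Callable[[str, str], int],
) -> list[str]:
    """Label (query, document) pairs and serialize them as JSONL rows."""
    rows = []
    for query, document in pairs:
        label = label_fn(query, document)
        if label not in (0, 1):
            continue  # skip anything the labeler could not classify cleanly
        rows.append(json.dumps({"query": query, "document": document, "label": label}))
    return rows


def toy_label_fn(query: str, document: str) -> int:
    """Placeholder labeler for the example; a real one would call an LLM."""
    return int(any(word in document.lower() for word in query.lower().split()))


pairs = [
    ("reset password", "To reset your password, open Settings."),
    ("reset password", "Our office is closed on public holidays."),
]
for row in build_training_rows(pairs, toy_label_fn):
    print(row)
```

Rows in this shape can feed a standard sentence-pair classification fine-tuning setup for a small encoder model.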

CRAG is a meaningful architectural upgrade for RAG systems where retrieval failures cause user-visible errors. Knowing when you don’t know is a form of intelligence — and CRAG builds that into your pipeline.
