Agentic RAG: When One Retrieval Step Isn’t Enough

Standard RAG is a single-shot pipeline: retrieve once, generate once. For simple factual questions, that works well. But a complex query such as “Compare the pricing models of our three enterprise plans, identify which best suits a customer with 500 users who needs SSO and must stay under $10K/year, and explain the upgrade path” can’t be answered from a single retrieval step.

Agentic RAG lets the LLM decide when to retrieve, what to retrieve, whether the retrieved information is sufficient, and whether to retrieve again. Instead of a fixed pipeline, you have an autonomous loop, usually bounded by an iteration limit.

The Agentic Mindset: Plan, Retrieve, Evaluate, Repeat

User query: "Compare our enterprise plans and find the best fit for 500 users,
             SSO required, budget under $10K/year"

Agent Planning:
  Step 1: Retrieve enterprise plan details (features, pricing)
  Step 2: Retrieve SSO documentation and requirements
  Step 3: Retrieve pricing calculator or per-user pricing information
  Step 4: If budget information found, evaluate against $10K constraint
  Step 5: If upgrade path needed, retrieve upgrade documentation
  Step 6: Synthesize comparison and recommendation

Agent Execution:
  Iteration 1: "search('enterprise plans features pricing')" → gets plan overview
  Iteration 2: "search('SSO authentication setup enterprise')" → gets SSO docs
  Iteration 3: "search('per user pricing 500 users calculation')" → gets pricing
  Evaluation: "Do I have enough to answer?" → Yes
  Generation: Final comparative answer with recommendation
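
The plan, retrieve, evaluate, repeat cycle above can be sketched in plain Python. This is a minimal illustration, not a production agent: in a real system the LLM would produce the plan and judge sufficiency, whereas here `TOY_INDEX`, `toy_search`, and `toy_is_sufficient` are hypothetical stand-ins so the control flow can run on its own.

```python
from typing import Callable, List


def agentic_retrieve(
    question: str,
    plan: List[str],
    search: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    max_iterations: int = 8,
) -> List[str]:
    """Run planned searches one at a time, stopping once the evidence suffices."""
    evidence: List[str] = []
    for iteration, query in enumerate(plan):
        if iteration >= max_iterations:
            break  # hard cap so the loop always terminates
        evidence.extend(search(query))
        if is_sufficient(question, evidence):
            break  # evaluation step: stop retrieving early
    return evidence


# Hypothetical stand-ins: a dict-backed "index" and a keyword-based check.
TOY_INDEX = {
    "enterprise plans features pricing": ["Plan overview: three enterprise plans."],
    "SSO authentication setup enterprise": ["SSO docs: SAML setup guide."],
    "per user pricing 500 users calculation": ["Pricing: per-user rates by plan."],
    "upgrade path": ["Upgrade docs: moving between plans."],
}


def toy_search(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


def toy_is_sufficient(question: str, evidence: List[str]) -> bool:
    # Pretend the answer needs a plan overview, SSO docs, and pricing.
    text = " ".join(evidence)
    return all(key in text for key in ("Plan overview", "SSO docs", "Pricing"))


evidence = agentic_retrieve(
    "Compare our enterprise plans and find the best fit",
    list(TOY_INDEX),
    toy_search,
    toy_is_sufficient,
)
print(evidence)
```

With these stand-ins the loop stops after the third search and never issues the fourth, which mirrors the "Do I have enough to answer?" step in the execution trace.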

ReAct Framework for RAG Agents

ReAct (Reasoning + Acting) is a widely used pattern for agentic RAG. The LLM alternates between reasoning steps (Thought), tool calls (Action), and the tool results it reads back (Observation):

Thought: I need to find enterprise pricing to compare plans.
Action: search("enterprise plan pricing features comparison")
Observation: [Plan A: $199/user/mo, Plan B: $149/user/mo, Plan C: $99/user/mo...]

Thought: I have pricing. Now I need SSO availability per plan.
Action: search("SSO SAML authentication which enterprise plans")
Observation: [SSO available on Plan A and Plan B. Plan C has OAuth only...]

Thought: Plans A and B have SSO. Prices are per user per month, so I need annual costs for 500 users.
Action: calculate("500 * 199 * 12")
Observation: $1,194,000

Thought: That is Plan A. Now Plan B.
Action: calculate("500 * 149 * 12")
Observation: $894,000

Thought: Both plans exceed $10K/year. I need to check if annual contracts reduce cost.
Action: search("annual contract discount enterprise plans")
Observation: [Annual billing: 20% discount on all plans...]

Thought: With 20% annual discount:
  Plan A: $1,194,000 * 0.8 = $955,200 → still over $10K/year
  Plan B: $894,000 * 0.8 = $715,200 → still over $10K/year

I should check if there's a smaller tier or startup pricing.
Action: search("startup pricing small team under 10000 annual")
Observation: [Startup plan: $8,000/year for up to 500 users with SSO...]

Thought: Found the answer. The Startup plan meets all criteria.
Answer: The Startup plan at $8,000/year supports up to 500 users with SSO included...
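
The trace calls a `calculate` tool, which the LangChain example in the next section does not define. Below is one possible sketch of such a tool, assuming it only needs basic arithmetic. It walks the parsed expression with Python's `ast` module instead of calling `eval()`, so model-generated text cannot run arbitrary code. The function name and the set of supported operators are choices made for this illustration.

```python
import ast
import operator

# Only these binary operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"Unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


# Prices in the trace are per user per month, so an annual figure
# needs the factor of 12 before it is compared with a yearly budget.
plan_a_monthly = calculate("500 * 199")
plan_a_annual = calculate("500 * 199 * 12")
print(plan_a_monthly, plan_a_annual)
```

Anything other than numbers and the four operators, such as a function call or an attribute access, raises `ValueError`, which the agent can read back as an Observation and correct.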

Implementation with LangChain Agents

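# Import paths below follow the LangChain 0.2/0.3 layout and may differ in other releases.
from langchain.agents import create_react_agent, AgentExecutor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.tools import Tool
from langchain_community.vectorstores import FAISS
from langchain import hub

# Load prebuilt indexes. "kb_index" and "policy_index" are placeholder paths;
# build them beforehand with FAISS.from_documents(...).save_local(path).
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local(
    "kb_index", embeddings, allow_dangerous_deserialization=True
)
policy_store = FAISS.load_local(
    "policy_index", embeddings, allow_dangerous_deserialization=True
)

def format_docs(docs) -> str:
    # Tools must return a string for the agent to read as an Observation
    return "\n\n".join(doc.page_content for doc in docs)

# Define retrieval tools
def search_knowledge_base(query: str) -> str:
    return format_docs(vectorstore.similarity_search(query, k=3))

def search_policies(query: str) -> str:
    return format_docs(policy_store.similarity_search(query, k=2))

tools = [
    Tool(
        name="search",
        func=search_knowledge_base,
        description="Search the company knowledge base for information about products, pricing, policies, and documentation. Use this to find specific facts.",
    ),
    Tool(
        name="search_policies",
        func=search_policies,
        description="Search specifically for company policies, terms of service, and compliance documents.",
    ),
]

# ReAct prompt from LangChain Hub
prompt = hub.pull("hwchase17/react")

llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=8,  # prevent infinite loops
    handle_parsing_errors=True,
)

result = agent_executor.invoke({
    "input": "Compare enterprise plans for 500 users with SSO under $10K/year"
})
print(result["output"])  # the agent's final answer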

Tool Design for Agentic RAG

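The tools available to the agent determine what it can do. A tool with a typed input schema, a specific description, and source-labelled output gives the model more to work with than a bare search function: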

from langchain.tools import StructuredTool
from pydantic import BaseModel, Field

# Assumes `vectorstore` is an existing LangChain vector store whose documents
# carry "doc_type" and "source" metadata.

class SearchInput(BaseModel):
    query: str = Field(description="The search query")
    doc_type: str = Field(
        default="all",
        description="Filter by document type: 'pricing', 'technical', 'policy', 'all'"
    )
    # ge/le enforce the 1-10 range instead of only documenting it
    max_results: int = Field(
        default=3, ge=1, le=10, description="Number of results to return (1-10)"
    )

def structured_search(query: str, doc_type: str = "all", max_results: int = 3) -> str:
    # Pass no filter at all for "all" rather than an empty dict
    metadata_filter = None if doc_type == "all" else {"doc_type": doc_type}
    results = vectorstore.similarity_search(query, k=max_results, filter=metadata_filter)
    if not results:
        return "No results found for this query."
    # Label each chunk with its source so the agent can cite it
    return "\n\n---\n\n".join([
        f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
        for doc in results
    ])

search_tool = StructuredTool.from_function(
    func=structured_search,
    name="search_knowledge_base",
    description="Search the knowledge base, optionally filtered by document type and limited to a number of results",
    args_schema=SearchInput,
)

Verification and Self-Critique Loop

A key capability of agentic RAG is that the agent verifies its own answer before returning it:

VERIFICATION_PROMPT = """
You just generated this answer:
{answer}

Based on the retrieved context:
{context}

User question:
{question}

Evaluate:
1. Does the answer directly address the question?
2. Is every factual claim supported by the retrieved context?
3. Are there any claims that seem uncertain or unsupported?

If the answer is complete and accurate, respond with: VERIFIED
If the answer needs more information, respond with: NEED_MORE_INFO: [what specifically is missing]
If the answer contains unsupported claims, respond with: INCORRECT: [what's wrong]
"""

This self-critique loop can catch unsupported claims before they reach the user and triggers additional retrieval when needed. It is a filter rather than a guarantee, since the verifier is itself an LLM call.
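
One way to wire the verifier's three verdicts into a loop is sketched below. It assumes the verifier replies in the format the prompt asks for. The helpers `generate`, `verify`, and `retrieve` are hypothetical callables that would wrap your LLM and retriever; the `fake_*` versions exist only so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    status: str  # "VERIFIED", "NEED_MORE_INFO", or "INCORRECT"
    detail: str = ""


def parse_verdict(raw: str) -> Verdict:
    text = raw.strip()
    if text.startswith("VERIFIED"):
        return Verdict("VERIFIED")
    for status in ("NEED_MORE_INFO", "INCORRECT"):
        if text.startswith(status):
            detail = text[len(status):].lstrip(": ").strip()
            return Verdict(status, detail)
    # Unparseable verifier output counts as a failed check, not a pass.
    return Verdict("INCORRECT", f"Unrecognized verifier output: {text[:80]}")


def answer_with_verification(
    question: str,
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, List[str], str], str],
    retrieve: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Return (answer, verified). Retrieves again when the verifier asks for more."""
    context = retrieve(question)
    answer = generate(question, context)
    for _ in range(max_rounds):
        verdict = parse_verdict(verify(question, context, answer))
        if verdict.status == "VERIFIED":
            return answer, True
        if verdict.status == "NEED_MORE_INFO":
            # Use the verifier's description of the gap as the next query.
            context = context + retrieve(verdict.detail)
        answer = generate(question, context)
    return answer, False  # caller decides how to handle an unverified answer


# Hypothetical stand-ins so the loop can run without an LLM.
queries: List[str] = []
replies = iter(["NEED_MORE_INFO: SSO availability", "VERIFIED"])


def fake_retrieve(query: str) -> List[str]:
    queries.append(query)
    return [f"doc for {query}"]


def fake_generate(question: str, context: List[str]) -> str:
    return f"answer from {len(context)} docs"


def fake_verify(question: str, context: List[str], answer: str) -> str:
    return next(replies)


final_answer, verified = answer_with_verification(
    "Which plan fits?", fake_generate, fake_verify, fake_retrieve
)
print(final_answer, verified)
```

Returning the `verified` flag instead of raising lets the caller choose between showing the answer with a caveat, falling back to a simpler pipeline, or declining to answer.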

Agent Architectures Comparison

Architecture	Use Case	Complexity	Latency
Single-hop RAG	Simple factual Q&A	Low	~1s
ReAct agent	Multi-step research	Medium	5–30s
Plan-and-Execute	Complex structured tasks	High	15–60s
Multi-agent (specialist agents)	Large-scale enterprise	Very High	Variable

Multi-Agent RAG for Complex Workflows

For enterprise use cases, multiple specialized agents collaborate:

Orchestrator Agent
├── Retrieval Agent (search and fetch documents)
├── Analysis Agent (analyze retrieved information)
├── Calculation Agent (numerical reasoning)
├── Synthesis Agent (combine information from multiple sources)
└── Verification Agent (fact-check the final answer)

The orchestrator routes subtasks to specialist agents and combines their outputs. This separation of concerns can improve reliability because each agent has a narrow task with its own prompt and tools, at the cost of extra coordination between agents.
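
The routing idea can be sketched with plain functions standing in for agents. Everything here is illustrative: in practice each agent would wrap its own LLM call and tools, and the orchestrator's plan would come from a model rather than a hard-coded list. The agent names and the shared `state` dictionary are assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Tuple

# An agent takes a task plus the shared state and returns its result as text.
Agent = Callable[[str, Dict[str, str]], str]


def retrieval_agent(task: str, state: Dict[str, str]) -> str:
    return f"[documents about {task}]"


def calculation_agent(task: str, state: Dict[str, str]) -> str:
    return f"[numbers for {task}]"


def synthesis_agent(task: str, state: Dict[str, str]) -> str:
    # Combines whatever earlier agents wrote into the shared state.
    parts = [state[name] for name in ("retrieval", "calculation") if name in state]
    return " + ".join(parts)


def verification_agent(task: str, state: Dict[str, str]) -> str:
    return "VERIFIED" if state.get("synthesis") else "NEED_MORE_INFO"


def orchestrate(plan: List[Tuple[str, str]], agents: Dict[str, Agent]) -> Dict[str, str]:
    """Run each (agent_name, task) step in order, sharing results via state."""
    state: Dict[str, str] = {}
    for name, task in plan:
        if name not in agents:
            raise KeyError(f"No agent registered for {name!r}")
        state[name] = agents[name](task, state)
    return state


agents: Dict[str, Agent] = {
    "retrieval": retrieval_agent,
    "calculation": calculation_agent,
    "synthesis": synthesis_agent,
    "verification": verification_agent,
}
plan = [
    ("retrieval", "enterprise plan pricing"),
    ("calculation", "500-user annual cost"),
    ("synthesis", "compare plans"),
    ("verification", "check final answer"),
]
state = orchestrate(plan, agents)
print(state["synthesis"], state["verification"])
```

Keeping agents behind one shared signature means a specialist can be replaced or tested in isolation without touching the orchestrator.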

2025 Trend: Long-Context vs Agentic RAG

With Claude 3.5’s 200K context and GPT-4o’s 128K context, some argue that long-context models reduce the need for agentic retrieval — just stuff everything in the context window.

In practice, long-context models still benefit from selective retrieval:

Cost: Processing 200K tokens on every query is expensive
Accuracy: LLMs still show “lost in the middle” degradation at extreme lengths
Scale and freshness: Large corpora can’t fit in any context window, and a retrieval index can be updated as documents change

The sweet spot: use long context for within-document comprehension, and use agentic retrieval for across-document synthesis.
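
As a rough illustration of that rule of thumb, the sketch below picks a strategy from token counts. The function name, the 128K default limit, and the 8K reserve for the prompt and answer are assumptions chosen for the example, not recommended values. Token counts are taken as input, so use whatever tokenizer matches your model.

```python
from typing import List


def choose_strategy(
    doc_token_counts: List[int],
    context_limit: int = 128_000,   # assumed model limit
    reserved_tokens: int = 8_000,   # assumed room for prompt and answer
) -> str:
    """Return "long_context" for one document that fits, else "retrieval"."""
    budget = context_limit - reserved_tokens
    if len(doc_token_counts) == 1 and doc_token_counts[0] <= budget:
        return "long_context"  # within-document comprehension
    return "retrieval"         # across-document synthesis, or too large to fit


print(choose_strategy([50_000]))
print(choose_strategy([50_000, 20_000]))
print(choose_strategy([500_000]))
```

A real router would likely also weigh per-query cost and latency, which this sketch leaves out.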

Agentic RAG is the architecture for hard problems. Start simple (single-hop RAG) and escalate to agents only when the complexity of your queries genuinely requires multi-step reasoning. Agents are powerful but add latency, cost, and unpredictability that simpler pipelines don’t have.