Multi-Hop Retrieval: Answering Complex Questions Across Multiple Documents

Implement multi-hop retrieval for RAG — iterative evidence chains, bridge entity retrieval, HotpotQA-style reasoning, and handling questions that require connecting multiple facts.

Multi-Hop Retrieval: Following the Evidence Chain

“Which investor funded both the company that acquired DeepMind and OpenAI in its early days?”

This question can’t be answered by retrieving a single document. You need to:

  1. Find who acquired DeepMind (Google)
  2. Find Google’s investors (Sequoia, Kleiner Perkins, etc.)
  3. Find who funded early OpenAI (Y Combinator, Peter Thiel, Reid Hoffman, etc.)
  4. Find the intersection

Each retrieval step depends on the result of the previous one. That’s multi-hop retrieval.
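The four steps above can be sketched with a toy in-memory lookup. The dictionaries are hypothetical stand-ins for a retriever, and the investor sets are abbreviated from the examples in the list rather than complete data, so the result only illustrates the control flow.

```python
# Toy stand-ins for a retriever: hypothetical lookup tables, not real data sources.
ACQUIRER = {"DeepMind": "Google"}
INVESTORS = {
    "Google": {"Sequoia", "Kleiner Perkins"},
    "OpenAI": {"Y Combinator", "Peter Thiel", "Reid Hoffman"},
}


def shared_investors(acquired: str, startup: str) -> set[str]:
    # Hop 1: resolve the bridge entity (who acquired the company?)
    acquirer = ACQUIRER[acquired]
    # Hop 2: this lookup is only possible once hop 1 has returned
    acquirer_investors = INVESTORS[acquirer]
    # Hop 3: does not depend on hops 1-2, so it could run in parallel
    startup_investors = INVESTORS[startup]
    # Hop 4: combine the evidence from both branches
    return acquirer_investors & startup_investors


print(shared_investors("DeepMind", "OpenAI"))  # prints set() for these abbreviated lists
```

The empty result reflects the truncated toy data, not a real answer. The point is that hop 2 cannot be issued until hop 1 has produced "Google".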

What Makes a Query Multi-Hop

Single-hop (standard RAG):
  Q: "What is the capital of France?"
  Retrieval: one document → answer

Two-hop:
  Q: "What are the capitals of the countries hosting the FIFA World Cup 2030?"
  Hop 1: Find FIFA World Cup 2030 hosts → Spain, Portugal, Morocco
  Hop 2: Find capitals of Spain, Portugal, Morocco → Madrid, Lisbon, Rabat

Four-hop:
  Q: "Who founded the company that makes the software used by the
      hospital that treated the president of the entity that first deployed GPT-4?"
  Hop 1: Find who first deployed GPT-4 → Microsoft
  Hop 2: Find hospital that treated Microsoft's president → [specific hospital]
  Hop 3: Find software used by that hospital → [specific medical software]
  Hop 4: Find founder of that software company → [answer]

Iterative Retrieval: The Core Algorithm

The core multi-hop algorithm uses the result of each retrieval to formulate the next query:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def multi_hop_retrieve(
    initial_query: str,
    vectorstore,
    max_hops: int = 4,
) -> tuple[list[str], list[str]]:
    """Returns (all_retrieved_docs, reasoning_chain)"""
    all_docs = []
    reasoning_chain = []
    current_query = initial_query

    for hop in range(max_hops):
        # Retrieve for the current query, skipping documents seen in earlier hops
        results = vectorstore.similarity_search(current_query, k=3)
        new_docs = [r.page_content for r in results if r.page_content not in all_docs]
        all_docs.extend(new_docs)

        # Build context from all retrieved docs so far
        context = "\n\n".join(all_docs)

        # Ask LLM: do we have enough to answer? If not, what's the next hop?
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Original question: {initial_query}
Retrieved information so far:
{context}
Can you answer the original question with the current information?
- If YES: respond with ANSWER: [your answer]
- If NO: respond with NEXT_QUERY: [what to search for next to complete the answer]""",
            }],
        )
        response_text = response.content[0].text.strip()
        reasoning_chain.append(f"Hop {hop+1}: queried '{current_query}'\n{response_text[:200]}")

        if response_text.startswith("ANSWER:"):
            break  # the answer is recorded as the last entry of reasoning_chain
        elif response_text.startswith("NEXT_QUERY:"):
            # Strip only the leading marker, not later occurrences in the text
            current_query = response_text.removeprefix("NEXT_QUERY:").strip()
        else:
            break  # unrecognized format: stop rather than search on garbage

    return all_docs, reasoning_chain
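The loop above relies on the model starting its reply with the exact marker. In practice a reply may open with a sentence of preamble or use different casing, which would end the loop early. A minimal sketch of a more tolerant parser follows; `HopDecision` and `parse_hop_response` are hypothetical names introduced here, not part of any library.

```python
import re
from typing import NamedTuple


class HopDecision(NamedTuple):
    kind: str  # "answer", "next_query", or "unknown"
    text: str


# Match a marker at the start of any line, ignoring case; capture the rest of the reply.
_MARKER = re.compile(
    r"^\s*(ANSWER|NEXT_QUERY)\s*:\s*(.*)",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)


def parse_hop_response(response_text: str) -> HopDecision:
    match = _MARKER.search(response_text)
    if match is None:
        return HopDecision("unknown", response_text.strip())
    marker = match.group(1).upper()
    rest = match.group(2).strip()
    if marker == "ANSWER":
        return HopDecision("answer", rest)
    # A search query should be a single line; drop any trailing explanation.
    first_line = rest.splitlines()[0].strip() if rest else ""
    return HopDecision("next_query", first_line)


decision = parse_hop_response("Not yet.\nnext_query: who acquired DeepMind\nThis is needed first.")
print(decision.kind, "|", decision.text)  # next_query | who acquired DeepMind
```

Inside the loop, the two `startswith` branches could then switch on `decision.kind` instead.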

Bridge Entity Extraction

A common multi-hop pattern involves a “bridge entity” — an intermediate entity that connects the question’s subject to its answer:

Q: "What nationality is the CEO of the company that makes Claude?"
Bridge entity: "the company that makes Claude" = Anthropic
Hop 1: Who makes Claude? → Anthropic
Hop 2: Who is the CEO of Anthropic? → Dario Amodei
Hop 3: What nationality is Dario Amodei? → American

Explicitly extracting bridge entities and issuing a dedicated search for each one can improve retrieval precision:

def extract_bridge_entities(query: str, initial_docs: list[str]) -> list[str]:
    # Only the first two documents are used, to keep the prompt short
    context = "\n".join(initial_docs[:2])
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Given this question and partial context, identify the key intermediate entities
that need to be looked up to fully answer the question.
List them one per line.
Question: {query}
Context: {context}
Bridge entities to look up:""",
        }],
    )
    # One entity per line; drop blank lines
    entities = response.content[0].text.strip().split("\n")
    return [e.strip() for e in entities if e.strip()]


# Usage (initial_docs holds the page contents from a first retrieval pass)
bridge_entities = extract_bridge_entities(
    "What is the headquarters city of the company that acquired DeepMind?",
    initial_docs,
)
# → e.g. ["company that acquired DeepMind", "Google"]
# → second hop: search("Google headquarters city")
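Models often return the entity list with bullets, numbering, quotes, or duplicates, which would then leak into the second-hop search queries. The following is a small sketch of a cleanup step that could run on the raw response text; `clean_entity_lines` is a hypothetical helper, not part of the code above.

```python
import re

# Leading list decoration such as "- ", "* ", "• ", "1. " or "2) "
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]+|\d+[.)])\s*")


def clean_entity_lines(raw: str) -> list[str]:
    """Strip list markers and quotes, drop blanks and case-insensitive duplicates."""
    seen = set()
    entities = []
    for line in raw.splitlines():
        entity = _LIST_PREFIX.sub("", line).strip().strip("\"'")
        key = entity.casefold()
        if entity and key not in seen:
            seen.add(key)
            entities.append(entity)
    return entities


raw_reply = '1. Google\n- DeepMind\n\n* google\n"Alphabet"'
print(clean_entity_lines(raw_reply))  # ['Google', 'DeepMind', 'Alphabet']
```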

LangChain Multi-Hop with MRKL

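MRKL (Modular Reasoning, Knowledge, and Language) systems route each step of a complex question to a specialized module, or tool. In LangChain, this means giving an agent a search tool it can call once per hop: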

from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool
from langchain_openai import ChatOpenAI

# `vectorstore` is assumed to be an already-initialized LangChain vector store.


def search_with_context(query_and_context: str) -> str:
    """Search that accepts context from previous hops."""
    # Split on the first separator only, so "|||" inside the context is preserved
    query, _, prev_context = query_and_context.partition("|||")
    query = query.strip()
    prev_context = prev_context.strip()

    # Use previous context to refine search if available
    if prev_context:
        enhanced_query = f"{query} (context: {prev_context[:200]})"
    else:
        enhanced_query = query

    results = vectorstore.similarity_search(enhanced_query, k=3)
    return "\n".join(r.page_content for r in results)


multi_hop_tool = Tool(
    name="contextual_search",
    func=search_with_context,
    description="""Search the knowledge base. For multi-hop queries, pass context
from previous searches using format: 'current query ||| previous context'""",
)

# To run the agent, pass [multi_hop_tool] to create_react_agent and AgentExecutor
# together with a ChatOpenAI model and a ReAct prompt (wiring omitted here).

Comparison: Single-Hop vs Multi-Hop Performance

HotpotQA Benchmark Results (requiring 2-hop reasoning):

Approach                    | Exact Match | F1 Score
----------------------------|-------------|----------
Standard single-hop RAG     | 31.2%       | 43.8%
Multi-hop retrieval (2 hop) | 48.7%       | 61.3%
Graph RAG                   | 52.1%       | 64.9%
Multi-hop + reranking       | 54.3%       | 67.2%
LLM + internet search       | 61.8%       | 74.1%

Multi-hop retrieval improves exact match over single-hop by about 56% in relative terms (31.2% → 48.7%) on complex questions.
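The roughly 56% figure is a relative gain in exact match, not a difference in percentage points. A quick check against the table's first two rows, which also gives the corresponding F1 gain:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100


em_gain = relative_gain(31.2, 48.7)  # exact match: single-hop vs 2-hop
f1_gain = relative_gain(43.8, 61.3)  # F1: single-hop vs 2-hop
print(f"EM: +{em_gain:.1f}%  F1: +{f1_gain:.1f}%")  # EM: +56.1%  F1: +40.0%
```

In absolute terms both metrics rise by 17.5 percentage points.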

Failure Modes and Mitigations

Error propagation: If Hop 1 retrieves the wrong entity, Hop 2 compounds the error. A wrong “bridge” leads to an entirely wrong answer chain.

Mitigation: Retrieve top-3 candidates at each hop and maintain parallel reasoning paths. Prune paths where intermediate results are low-confidence.
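One way to read this mitigation is as a small beam search over candidate bridge entities. The sketch below assumes a retriever that returns `(entity, confidence)` pairs. That signature, the `Path` type, and the confidence values in the toy index are all illustrative assumptions.

```python
from typing import Callable, NamedTuple


class Path(NamedTuple):
    entities: tuple[str, ...]  # chain of entities followed so far
    score: float               # product of per-hop confidences


# Hypothetical retriever signature: entity -> [(candidate entity, confidence in 0..1)]
Retriever = Callable[[str], list[tuple[str, float]]]


def expand_paths(paths: list[Path], retrieve: Retriever, top_k: int = 3,
                 min_score: float = 0.2, beam_width: int = 3) -> list[Path]:
    """Advance every path by one hop, pruning low-confidence branches."""
    expanded = []
    for path in paths:
        candidates = sorted(retrieve(path.entities[-1]), key=lambda c: c[1], reverse=True)
        for entity, confidence in candidates[:top_k]:
            score = path.score * confidence
            if score >= min_score:  # prune paths whose cumulative confidence is too low
                expanded.append(Path(path.entities + (entity,), score))
    expanded.sort(key=lambda p: p.score, reverse=True)
    return expanded[:beam_width]


# Toy index with made-up confidences, for illustration only.
toy_index = {
    "DeepMind": [("Google", 0.9), ("Alphabet", 0.6), ("Meta", 0.1)],
    "Google": [("Mountain View", 0.95)],
    "Alphabet": [("Mountain View", 0.9)],
}
paths = [Path(("DeepMind",), 1.0)]
for _ in range(2):
    paths = expand_paths(paths, lambda e: toy_index.get(e, []))
best = paths[0]
print(best.entities)  # ('DeepMind', 'Google', 'Mountain View')
```

The low-confidence "Meta" branch is dropped after the first hop, while the two plausible bridges are both carried forward and converge on the same answer.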

Infinite loops: The agent keeps searching because it can’t find the answer, cycling through similar queries.

Mitigation: Hard iteration limit (max_hops=4 in practice), detect repeated queries, and implement a “best effort” fallback when the hop limit is reached.
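Detecting repeated queries can be as simple as comparing token sets. The sketch below uses Jaccard similarity over lowercase word tokens. The 0.8 threshold and the function names are assumptions chosen for illustration, and an embedding-based comparison would catch paraphrases that this misses.

```python
import re


def normalize_query(query: str) -> frozenset[str]:
    """Lowercase word tokens, ignoring punctuation and word order."""
    return frozenset(re.findall(r"[a-z0-9]+", query.lower()))


def is_repeat(query: str, history: list[str], threshold: float = 0.8) -> bool:
    """True if the query overlaps heavily (Jaccard similarity) with any earlier query."""
    tokens = normalize_query(query)
    for past in history:
        past_tokens = normalize_query(past)
        union = tokens | past_tokens
        if union and len(tokens & past_tokens) / len(union) >= threshold:
            return True
    return False


history = ["Who acquired DeepMind?"]
print(is_repeat("who acquired deepmind", history))      # True
print(is_repeat("Google headquarters city", history))   # False
```

When `is_repeat` fires, or when the hop limit is reached, the loop can stop and generate a best-effort answer from the documents collected so far.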

Context window explosion: 4 hops × 3 documents × 500 tokens = 6,000 tokens of context before generation. At many hops, you exceed the LLM’s effective reasoning capacity.

Mitigation: Apply context compression to each hop’s retrieved documents before accumulating them. Only carry forward the most relevant sentences from each hop.
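As a rough sketch of per-hop compression, the function below keeps only the sentences that share the most terms with the current query. Keyword overlap is a deliberately crude stand-in for the relevance scoring a real system would use (embedding similarity or an LLM-based compressor), and `compress_doc` is a hypothetical helper.

```python
import re


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_doc(doc: str, query: str, max_sentences: int = 2) -> str:
    """Keep up to max_sentences sentences that overlap with the query, in document order."""
    query_terms = _terms(query)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    scored = [
        (len(_terms(sentence) & query_terms), position, sentence)
        for position, sentence in enumerate(sentences)
    ]
    # Highest overlap first; ties broken by original position
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    # Restore document order and drop sentences with no overlap at all
    kept = [s for overlap, _, s in sorted(top, key=lambda t: t[1]) if overlap > 0]
    return " ".join(kept)


doc = "Google was founded in 1998. Google acquired DeepMind in 2014. The weather was nice."
print(compress_doc(doc, "who acquired DeepMind"))  # Google acquired DeepMind in 2014.
```

Applied to each of the three documents per hop before they are appended to the context, this keeps the accumulated context growing by a few sentences per hop instead of by about 1,500 tokens.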

2025 Trend: Learned Multi-Hop Planners

Rather than having the LLM decide dynamically whether another hop is needed, newer systems train a lightweight “hop planner” model that predicts the query decomposition upfront:

Input: "Which investor funded both Google and Tesla's early stage?"
Plan: [
  {"hop": 1, "query": "Google early investors founding round"},
  {"hop": 2, "query": "Tesla early investors Series A"},
  {"hop": 3, "query": "intersection of Google and Tesla investors"}
]

This pre-planned approach reduces per-query LLM calls and produces more predictable execution paths. It’s being developed by several research groups as part of broader “structured reasoning” for RAG frameworks.
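Once a plan exists, executing it needs no LLM call between hops. A minimal sketch of such an executor follows, assuming the plan format shown above and a hypothetical `retrieve` callable that maps a query string to a list of document texts. How the planner itself is trained is outside the scope of this sketch.

```python
from typing import Callable


def execute_plan(plan: list[dict], retrieve: Callable[[str], list[str]]) -> dict[int, list[str]]:
    """Run each planned query in hop order and collect the evidence per hop."""
    evidence = {}
    for step in sorted(plan, key=lambda s: s["hop"]):
        evidence[step["hop"]] = retrieve(step["query"])
    return evidence


plan = [
    {"hop": 1, "query": "Google early investors founding round"},
    {"hop": 2, "query": "Tesla early investors Series A"},
    {"hop": 3, "query": "intersection of Google and Tesla investors"},
]

calls = []


def fake_retrieve(query: str) -> list[str]:
    # Stand-in for a vector store lookup; records the query for inspection
    calls.append(query)
    return [f"doc for: {query}"]


evidence = execute_plan(plan, fake_retrieve)
print(len(calls), sorted(evidence))  # 3 [1, 2, 3]
```

The number of retrieval calls is fixed by the plan length, which is what makes execution paths predictable. A single LLM call over the collected evidence then produces the final answer.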

Multi-hop retrieval is essential for knowledge-intensive applications — legal research, scientific literature review, business intelligence — where single retrievals consistently fail to connect disparate facts. Design your RAG system with multi-hop capability when your query distribution includes complex compound questions.