Multi-Query Retrieval: Why One Query Is Never Enough
Here’s a quiet failure mode in RAG systems: the user’s question is correct, the answer is in the corpus, but the way the user phrased the question doesn’t match the way the document phrases the answer. The semantic gap is small enough that the answer seems like it should be retrieved, but just large enough that it falls outside the top-K results.
Multi-query retrieval addresses this by generating multiple semantically diverse queries from the original, retrieving for each one in parallel, and merging the results. The expanded coverage reduces these “near-miss” failures, because a document only has to rank in the top K for one of the variants to be retrieved.
The Core Insight
Any question can be asked in multiple ways. Each phrasing activates slightly different neighborhoods in embedding space:
    Original question: "How does attention work in transformers?"

    Perspective 1 (mechanism): "What is the mathematical formula for scaled dot-product attention?"
    Perspective 2 (intuition): "Why do transformers use self-attention instead of RNNs?"
    Perspective 3 (comparison): "Difference between attention and convolution in neural networks"
    Perspective 4 (application): "How does attention help transformers understand long documents?"

    Each retrieves different but relevant documents.
    Union of results → comprehensive coverage of the topic.

Generating Query Variants
The key is generating diverse variants, not just paraphrases. Diverse means covering different perspectives, different abstraction levels, and different vocabulary:
    import asyncio
    import re

    import anthropic

    client = anthropic.Anthropic()

    def generate_query_variants(
        original_query: str,
        n_variants: int = 4,
        temperature: float = 1.0,  # 1.0 is the API default
    ) -> list[str]:
        """Return the original query followed by up to n_variants rewrites."""
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": f"""Generate {n_variants} different search queries for retrieving
    information relevant to the following question. Each query should approach
    the topic from a different angle or use different vocabulary.
    Return ONLY the queries, one per line.

    Original question: {original_query}

    Alternative queries:"""
            }]
        )

        lines = response.content[0].text.strip().split("\n")
        # Strip list markers such as "1.", "2)", "-" or "•" without eating
        # digits that belong to the query itself.
        variants = [re.sub(r"^\s*(?:\d+[.)]|[•\-*])\s*", "", line).strip() for line in lines]
        variants = [v for v in variants if v]
        return [original_query] + variants[:n_variants]

Parallel Retrieval with Asyncio
The efficiency of multi-query retrieval comes from running all searches concurrently:
    import asyncio

    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings  # used when building the FAISS index

    async def multi_query_retrieve(
        original_query: str,
        vectorstore: FAISS,
        k: int = 5,
        n_queries: int = 4,
    ) -> list:
        # Step 1: Generate query variants. The LLM call is blocking, so run it
        # in a worker thread to keep the event loop free.
        queries = await asyncio.to_thread(
            generate_query_variants, original_query, n_variants=n_queries
        )

        # Step 2: Retrieve for all queries in parallel
        async def single_retrieve(query: str):
            return await asyncio.to_thread(
                vectorstore.similarity_search,
                query=query,
                k=k,
            )

        all_results = await asyncio.gather(*[single_retrieve(q) for q in queries])

        # Step 3: Deduplicate, keeping the first occurrence of each chunk.
        # Keying on "source" alone would collapse every chunk from the same
        # file, so the key also includes the chunk text.
        seen_ids = set()
        unique_results = []
        for result_list in all_results:
            for doc in result_list:
                doc_id = (doc.metadata.get("source"), doc.page_content)
                if doc_id not in seen_ids:
                    seen_ids.add(doc_id)
                    unique_results.append(doc)

        return unique_results

LangChain MultiQueryRetriever
LangChain provides a built-in implementation:
    import logging

    from langchain.retrievers.multi_query import MultiQueryRetriever
    from langchain_openai import ChatOpenAI
    from langchain_community.vectorstores import Chroma

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
    # `vectorstore` is assumed to be an existing Chroma instance
    base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=base_retriever,
        llm=llm,
        include_original=True,  # include the original query too
    )

    # Automatic multi-query with logging
    logging.basicConfig()
    logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

    results = multi_query_retriever.invoke("How does attention work in transformers?")
    # Logs show the generated query variants

Result Fusion Strategies
After parallel retrieval, you have multiple ranked lists. How you merge them matters:
Reciprocal Rank Fusion (RRF) — Recommended
    from collections import defaultdict

    def rrf_fusion(ranked_lists: list[list], k: int = 60) -> list:
        """Merge ranked lists; each document scores the sum of 1 / (k + rank)."""
        scores = defaultdict(float)
        doc_map = {}

        for result_list in ranked_lists:
            for rank, doc in enumerate(result_list, start=1):
                # Key on source plus chunk text so that different chunks of
                # the same file are scored separately.
                doc_id = (doc.metadata.get("source"), doc.page_content)
                scores[doc_id] += 1.0 / (k + rank)
                doc_map[doc_id] = doc

        sorted_ids = sorted(scores, key=scores.get, reverse=True)
        return [doc_map[doc_id] for doc_id in sorted_ids]

Union with Score Averaging
    from collections import defaultdict

    def score_averaged_fusion(results_with_scores: list[list[tuple]]) -> list[tuple]:
        """results_with_scores: list of [(doc, score), ...] for each query.

        Assumes higher scores mean more relevant; convert distances first.
        """
        doc_scores = defaultdict(list)
        doc_map = {}

        for result_list in results_with_scores:
            for doc, score in result_list:
                doc_id = (doc.metadata.get("source"), doc.page_content)
                doc_scores[doc_id].append(score)
                doc_map[doc_id] = doc

        # Documents appearing in multiple results get a 10% boost per extra hit.
        # Documents appearing once keep their original score.
        averaged = {
            doc_id: sum(scores) / len(scores) * (1 + 0.1 * (len(scores) - 1))
            for doc_id, scores in doc_scores.items()
        }
        sorted_items = sorted(averaged.items(), key=lambda x: x[1], reverse=True)
        return [(doc_map[doc_id], score) for doc_id, score in sorted_items]

Multi-Query for Decomposed Questions
Some queries naturally decompose into independent sub-questions. Multi-query retrieval can be adapted for this:
    Complex query: "Compare the computational complexity and practical performance of HNSW and IVF-PQ for billion-scale vector search"

    Decomposed sub-queries:
      1. "HNSW time complexity and memory requirements"
      2. "IVF-PQ indexing algorithm performance benchmarks"
      3. "billion-scale vector search approximate nearest neighbor comparison"
      4. "HNSW vs IVF query throughput production benchmarks"

Each sub-query focuses on a different aspect of the compound question. The union of results gives the LLM enough material to compare both approaches comprehensively.
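The per-sub-query retrieval and union can be sketched as follows. This is a toy illustration only: `CORPUS`, `toy_retrieve`, and `retrieve_for_subqueries` are hypothetical names, and word overlap stands in for a real vector search.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase, split hyphenated terms, and return the set of words."""
    return set(text.lower().replace("-", " ").split())

# Hypothetical toy corpus; a real system would query a vector store instead.
CORPUS = [
    "HNSW graph search has logarithmic time complexity but high memory use",
    "IVF-PQ compresses vectors with product quantization to save memory",
    "Benchmarks of approximate nearest neighbor search at billion scale",
    "A recipe for sourdough bread",
]

def toy_retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in CORPUS]
    scored = [(score, doc) for score, doc in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def retrieve_for_subqueries(sub_queries: list[str], k: int = 2) -> list[str]:
    """Retrieve for each sub-query and return the order-preserving union."""
    merged: list[str] = []
    seen: set[str] = set()
    for sub_query in sub_queries:
        for doc in toy_retrieve(sub_query, k=k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

sub_queries = [
    "HNSW time complexity and memory requirements",
    "IVF-PQ indexing algorithm performance benchmarks",
]
context = retrieve_for_subqueries(sub_queries)
```

Neither sub-query alone surfaces all three relevant documents at `k=2`, but their union does, while the unrelated document is never retrieved.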
Controlling Generation Quality
Query variants should be diverse but relevant. Watch for these degenerate cases:
    BAD variants (too similar, not diverse):
      "How does attention work?"
      "How does the attention mechanism work?"
      "How do attention mechanisms function?"
      "Explain how attention works"
      # All retrieve essentially identical documents

    GOOD variants (diverse perspectives):
      "Mathematical formulation of scaled dot-product attention"
      "Attention vs recurrence in sequence modeling"
      "Computational complexity of multi-head attention"
      "Intuition behind transformer attention weights"
      # Each opens different document neighborhoods

Add a diversity check that regenerates at a higher temperature whenever a generated variant has more than 0.95 cosine similarity to the original query:
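    def generate_diverse_variants(query: str, min_diversity: float = 0.05) -> list[str]:
        # `embed` and `cosine_sim` are assumed helpers: an embedding call and a
        # cosine similarity between two vectors.
        query_embedding = embed(query)

        for temperature in [0.3, 0.5, 0.7, 1.0]:
            # Assumes generate_query_variants forwards `temperature` to the model
            variants = generate_query_variants(query, temperature=temperature)
            # The first element is the original query itself (similarity 1.0),
            # so only the generated variants are checked.
            variant_embeddings = [embed(v) for v in variants[1:]]
            similarities = [cosine_sim(query_embedding, ve) for ve in variant_embeddings]

            if similarities and max(similarities) < (1.0 - min_diversity):
                return variants

        return variants  # fall back to the last (highest-temperature) attempt

Performance Impact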
Typical improvements from multi-query retrieval (4 variants):
| Metric | Single Query | Multi-Query | Relative change |
|---|---|---|---|
| Recall@5 | 71% | 83% | +17% |
| Recall@10 | 79% | 91% | +15% |
| NDCG@10 | 0.61 | 0.72 | +18% |
| Latency (parallel) | 120ms | 140ms | +17% overhead |
| Latency (sequential) | 120ms | 480ms | 4× latency |
The parallel implementation adds only 20ms of overhead (LLM variant generation), whereas running the same searches sequentially takes 4× as long as a single query. Always parallelize retrieval.
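The gap between the two latency rows comes from overlapping I/O waits rather than from faster searches. A minimal sketch of that effect, assuming a hypothetical `fake_search` that sleeps for a fixed 50 ms instead of calling a real vector store (the absolute numbers are illustrative, not measurements):

```python
import asyncio
import time

SIMULATED_LATENCY_S = 0.05  # assumed per-search latency, for illustration only

async def fake_search(query: str) -> str:
    """Stand-in for a vector search call; sleeps to mimic I/O latency."""
    await asyncio.sleep(SIMULATED_LATENCY_S)
    return f"results for {query!r}"

async def run_sequential(queries: list[str]) -> list[str]:
    # Each search starts only after the previous one has finished.
    return [await fake_search(q) for q in queries]

async def run_parallel(queries: list[str]) -> list[str]:
    # All searches are started together and awaited as a group.
    return list(await asyncio.gather(*(fake_search(q) for q in queries)))

async def compare(queries: list[str]) -> tuple[float, float]:
    """Return (sequential_seconds, parallel_seconds) for the same queries."""
    start = time.perf_counter()
    await run_sequential(queries)
    sequential_s = time.perf_counter() - start

    start = time.perf_counter()
    await run_parallel(queries)
    parallel_s = time.perf_counter() - start
    return sequential_s, parallel_s

queries = ["q1", "q2", "q3", "q4"]
sequential_s, parallel_s = asyncio.run(compare(queries))
print(f"sequential: {sequential_s * 1000:.0f} ms, parallel: {parallel_s * 1000:.0f} ms")
```

Sequential time grows with the number of queries, while parallel time stays close to the latency of the slowest single search.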
2025 Trend: Adaptive Query Count
Rather than always generating 4 variants, adaptive systems estimate query ambiguity and generate more variants for ambiguous queries, fewer for clear ones. A simple heuristic: if the original query is > 50 tokens and contains specific technical terms, it’s probably clear enough for 2 variants. If it’s < 15 tokens or contains pronouns, generate 5+ variants.
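The heuristic above could be sketched as a small function. Everything here is an assumption made for illustration: the name `choose_variant_count`, the pronoun list, whitespace splitting as a rough proxy for model tokens, the order in which the two rules are checked, and the fallback of 4 variants for queries that match neither rule.

```python
# Hypothetical pronoun list; extend it for your own query distribution.
PRONOUNS = {"it", "this", "that", "they", "them", "these", "those", "he", "she"}

def choose_variant_count(query: str, technical_terms: set[str]) -> int:
    """Pick how many query variants to generate from simple surface cues."""
    # Whitespace tokens are only a rough proxy for model tokens.
    tokens = [t.strip("?.,!") for t in query.lower().split()]
    has_pronoun = any(t in PRONOUNS for t in tokens)
    has_technical = any(t in technical_terms for t in tokens)

    if len(tokens) > 50 and has_technical:
        return 2  # long and specific: probably clear enough
    if len(tokens) < 15 or has_pronoun:
        return 5  # short or referential: likely ambiguous
    return 4      # assumed default, matching the fixed count used earlier

n = choose_variant_count("How does it work?", technical_terms={"hnsw", "ivf-pq"})
```

The returned count can then be passed as `n_variants` to the variant generator, so clear queries cost fewer LLM tokens and searches.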
Multi-query retrieval is one of the highest-ROI improvements for RAG systems with diverse query distributions. The parallel implementation keeps latency overhead acceptable while delivering consistent recall improvements.