Multi-Query Retrieval: Why One Query Is Never Enough

Here’s a quiet failure mode in RAG systems: the user’s question is reasonable and the answer is in the corpus, but the user’s phrasing doesn’t match the way the document phrases the answer. The semantic gap is small enough that retrieval looks like it should succeed, yet large enough that the relevant passage falls just outside the top-K results.

Multi-query retrieval addresses this by generating several semantically diverse queries from the original, running a retrieval for each one in parallel, and merging the results. The broader coverage reduces these “near-miss” failures.

The Core Insight

Any question can be asked in multiple ways. Each phrasing activates slightly different neighborhoods in embedding space:

Original question: "How does attention work in transformers?"

Perspective 1 (mechanism): "What is the mathematical formula for scaled dot-product attention?"
Perspective 2 (intuition):  "Why do transformers use self-attention instead of RNNs?"
Perspective 3 (comparison): "Difference between attention and convolution in neural networks"
Perspective 4 (application): "How does attention help transformers understand long documents?"

Each retrieves different but relevant documents.
Union of results → comprehensive coverage of the topic.
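A toy illustration of the idea above, using word overlap as a crude stand-in for embedding similarity. The three documents, the two variant queries, and the `top_k` helper are invented for this sketch; a real system would query a vector store instead.

```python
def tokens(text: str) -> set[str]:
    """Lowercase whitespace tokenization (a stand-in for an embedding model)."""
    return set(text.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_k(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the IDs of the k documents most similar to the query."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda doc_id: jaccard(q, tokens(docs[doc_id])), reverse=True)
    return ranked[:k]

# Hypothetical mini-corpus: each document covers a different angle on attention.
docs = {
    "formula": "scaled dot-product attention formula with softmax",
    "rnn": "self-attention replaces recurrence in transformers",
    "long": "attention lets transformers relate distant tokens in long documents",
}

original = "how does attention work in transformers"
variants = [
    "formula for scaled dot-product attention",
    "why self-attention replaces recurrence",
]

# With k=1 the original query alone reaches a single document...
single = set(top_k(original, docs))

# ...while the union over all phrasings reaches every relevant one.
union = set()
for q in [original] + variants:
    union.update(top_k(q, docs))

print(single, union)
```

The effect is exaggerated here because `k` is 1 and the corpus is tiny, but the mechanism is the same at scale: each phrasing ranks a different document first.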

Generating Query Variants

The key is generating diverse variants, not just paraphrases. Diverse means covering different perspectives, different abstraction levels, and different vocabulary:

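import re

import anthropic
import asyncio

client = anthropic.Anthropic()

def generate_query_variants(
    original_query: str,
    n_variants: int = 4,
    temperature: float = 1.0,
) -> list[str]:
    """Return the original query followed by up to n_variants generated rewrites."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=400,
        temperature=temperature,  # callers can raise this to get more varied rewrites
        messages=[{
            "role": "user",
            "content": f"""Generate {n_variants} different search queries for retrieving
information relevant to the following question. Each query should approach
the topic from a different angle or use different vocabulary.
Return ONLY the queries, one per line.

Original question: {original_query}

Alternative queries:"""
        }]
    )

    lines = response.content[0].text.strip().split('\n')
    # Strip a leading bullet ("•", "-", "*") or list number ("1.", "2)") without
    # eating digits that belong to the query itself (e.g. "3D convolution").
    marker = re.compile(r'^\s*(?:[•\-*]|\d+[.)])\s+')
    variants = [marker.sub('', l).strip() for l in lines if l.strip()]
    return [original_query] + variants[:n_variants]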

Parallel Retrieval with Asyncio

The efficiency of multi-query retrieval comes from running all searches concurrently:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

async def multi_query_retrieve(
    original_query: str,
    vectorstore: FAISS,
    k: int = 5,
    n_queries: int = 4,
) -> list:
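    # Step 1: Generate query variants. The LLM call is blocking, so run it in a
    # worker thread to avoid stalling the event loop.
    queries = await asyncio.to_thread(generate_query_variants, original_query, n_queries)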

    # Step 2: Retrieve for all queries in parallel
    async def single_retrieve(query: str):
        return await asyncio.to_thread(
            vectorstore.similarity_search,
            query=query,
            k=k,
        )

    all_results = await asyncio.gather(*[single_retrieve(q) for q in queries])

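    # Step 3: Deduplicate, keeping the first occurrence of each document.
    # similarity_search returns no scores, so there is no "best" copy to prefer.
    # Keying on "source" alone would collapse different chunks of the same file,
    # so the chunk text is part of the key.
    seen_ids = set()
    unique_results = []
    for result_list in all_results:
        for doc in result_list:
            doc_id = (doc.metadata.get("source"), doc.page_content)
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                unique_results.append(doc)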

    return unique_results
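The concurrency pattern can be tried without a vector store or an API key. The sketch below replaces the search call with a hypothetical `fake_search` that just sleeps, then times the same batch of queries run concurrently and run one after another. All names here are illustrative and not part of any library.

```python
import asyncio
import time

def fake_search(query: str, k: int = 5) -> list[str]:
    """Stand-in for a blocking vector store call (hypothetical)."""
    time.sleep(0.1)  # simulate I/O latency
    return [f"{query}-doc{i}" for i in range(k)]

async def retrieve_all(queries: list[str], k: int = 5) -> list[list[str]]:
    """Run every search in a worker thread and wait for all of them."""
    tasks = [asyncio.to_thread(fake_search, q, k) for q in queries]
    return await asyncio.gather(*tasks)

def timed_parallel(queries: list[str]) -> tuple[list[list[str]], float]:
    start = time.perf_counter()
    results = asyncio.run(retrieve_all(queries))
    return results, time.perf_counter() - start

def timed_sequential(queries: list[str]) -> tuple[list[list[str]], float]:
    start = time.perf_counter()
    results = [fake_search(q) for q in queries]
    return results, time.perf_counter() - start

if __name__ == "__main__":
    qs = ["q1", "q2", "q3", "q4"]
    _, parallel_s = timed_parallel(qs)
    _, sequential_s = timed_sequential(qs)
    print(f"parallel: {parallel_s:.2f}s, sequential: {sequential_s:.2f}s")
```

`asyncio.gather` returns results in the order the tasks were passed in, so the per-query rankings stay aligned with the query list even though the searches finish in arbitrary order.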

LangChain MultiQueryRetriever

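LangChain provides a built-in implementation. The import path below matches the 0.2 and 0.3 releases; module locations have moved between LangChain versions, so check the one you have installed: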

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma

# Assumes `vectorstore` is an existing, populated vector store (e.g. a Chroma instance).
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=llm,
    include_original=True,  # include the original query too
)

# Automatic multi-query with logging
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

results = multi_query_retriever.invoke("How does attention work in transformers?")
# Logs show the generated query variants

Result Fusion Strategies

After parallel retrieval, you have multiple ranked lists. How you merge them matters:

Reciprocal Rank Fusion (RRF) — Recommended

from collections import defaultdict

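def rrf_fusion(ranked_lists: list[list], k: int = 60) -> list:
    """Merge ranked lists by summing 1 / (k + rank) for each document.

    k is the RRF smoothing constant (not the top-K cutoff); larger values
    flatten the difference between adjacent ranks.
    """
    scores = defaultdict(float)
    doc_map = {}

    for result_list in ranked_lists:
        for rank, doc in enumerate(result_list, start=1):
            # Include the chunk text so chunks sharing a "source" stay distinct
            doc_id = (doc.metadata.get("source"), doc.page_content)
            scores[doc_id] += 1.0 / (k + rank)
            doc_map[doc_id] = doc

    # Highest fused score first
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [doc_map[doc_id] for doc_id in sorted_ids]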
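To see how the scoring behaves, here is a small worked example on plain string IDs rather than document objects. The three rankings are made up, and `rrf_scores` is a simplified helper for this sketch, not the function above.

```python
from collections import defaultdict

def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every list in which a document appears."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

# Hypothetical results for three query variants
rankings = [
    ["A", "B", "C"],
    ["B", "D", "A"],
    ["B", "A", "E"],
]

scores = rrf_scores(rankings)
order = sorted(scores, key=scores.get, reverse=True)

# B: 1/62 + 1/61 + 1/61  (ranks 2, 1, 1)
# A: 1/61 + 1/63 + 1/62  (ranks 1, 3, 2)
# D: 1/62                (rank 2, one list only)
for doc_id in order:
    print(doc_id, round(scores[doc_id], 5))
```

B and A appear in all three lists and end up far ahead of D, C, and E, which each appear once. With `k = 60`, consistency across queries counts for much more than a single high rank.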

Union with Score Averaging

from collections import defaultdict

def score_averaged_fusion(results_with_scores: list[list[tuple]]) -> list[tuple]:
    """results_with_scores: list of [(doc, score), ...] for each query.

    Scores are assumed to be non-negative similarities (higher is better).
    Convert distance-style scores, where lower is better, before calling this.
    """
    doc_scores = defaultdict(list)
    doc_map = {}

    for result_list in results_with_scores:
        for doc, score in result_list:
            # Include the chunk text so chunks sharing a "source" stay distinct
            doc_id = (doc.metadata.get("source"), doc.page_content)
            doc_scores[doc_id].append(score)
            doc_map[doc_id] = doc

    # Documents retrieved by several queries get their average score plus a
    # 10% consensus boost per extra appearance.
    # Documents retrieved once keep their original score.
    averaged = {
        doc_id: sum(scores) / len(scores) * (1 + 0.1 * (len(scores) - 1))
        for doc_id, scores in doc_scores.items()
    }
    sorted_items = sorted(averaged.items(), key=lambda x: x[1], reverse=True)
    return [(doc_map[doc_id], score) for doc_id, score in sorted_items]

Multi-Query for Decomposed Questions

Some queries naturally decompose into independent sub-questions. Multi-query retrieval can be adapted for this:

Complex query: "Compare the computational complexity and practical performance
               of HNSW and IVF-PQ for billion-scale vector search"

Decomposed sub-queries:
  1. "HNSW time complexity and memory requirements"
  2. "IVF-PQ indexing algorithm performance benchmarks"
  3. "billion-scale vector search approximate nearest neighbor comparison"
  4. "HNSW vs IVF query throughput production benchmarks"

Each sub-query focuses on a different aspect of the compound question. The union of results gives the LLM enough material to compare both approaches comprehensively.
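One way to produce such sub-queries is to ask an LLM for them and parse the reply. The sketch below is a rough outline under stated assumptions: `complete` is a hypothetical callable that sends a prompt to whatever model you use and returns its text, and the prompt wording is only a starting point.

```python
import re
from typing import Callable

DECOMPOSE_PROMPT = """Break the following question into at most {n} independent
search queries, each covering one aspect of the question.
Return ONLY the queries, one per line.

Question: {question}

Sub-queries:"""

# Leading bullet or list number, e.g. "- ", "* ", "1. ", "2) "
_LIST_MARKER = re.compile(r'^\s*(?:[•\-*]|\d+[.)])\s+')

def decompose_question(
    question: str,
    complete: Callable[[str], str],  # hypothetical: prompt in, model text out
    max_sub_queries: int = 4,
) -> list[str]:
    """Ask the model for sub-queries and return them cleaned and de-duplicated."""
    raw = complete(DECOMPOSE_PROMPT.format(n=max_sub_queries, question=question))
    sub_queries: list[str] = []
    for line in raw.splitlines():
        cleaned = _LIST_MARKER.sub('', line).strip().strip('"')
        if cleaned and cleaned not in sub_queries:
            sub_queries.append(cleaned)
    return sub_queries[:max_sub_queries]
```

The sub-queries can then be passed through the same parallel retrieval and fusion steps as ordinary query variants.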

Controlling Generation Quality

Query variants should be diverse but relevant. Watch for these degenerate cases:

BAD variants (too similar, not diverse):
  "How does attention work?"
  "How does the attention mechanism work?"
  "How do attention mechanisms function?"
  "Explain how attention works"
  # All retrieve essentially identical documents

GOOD variants (diverse perspectives):
  "Mathematical formulation of scaled dot-product attention"
  "Attention vs recurrence in sequence modeling"
  "Computational complexity of multi-head attention"
  "Intuition behind transformer attention weights"
  # Each opens different document neighborhoods

Add a diversity check: if any generated variant has a cosine similarity above 0.95 to the original query, regenerate at a higher temperature:

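def generate_diverse_variants(query: str, min_diversity: float = 0.05) -> list[str]:
    # embed() and cosine_sim() are assumed helpers: text -> vector, and
    # (vector, vector) -> cosine similarity.
    query_embedding = embed(query)
    best_variants, best_max_sim = [], float("inf")

    for temperature in [0.3, 0.5, 0.7, 1.0]:
        # generate_query_variants returns the original query first; skip it,
        # or its similarity of 1.0 to itself would make the check always fail.
        variants = generate_query_variants(query, temperature=temperature)[1:]
        if not variants:
            continue
        similarities = [cosine_sim(query_embedding, embed(v)) for v in variants]

        if max(similarities) < (1.0 - min_diversity):
            return [query] + variants
        if max(similarities) < best_max_sim:
            best_variants, best_max_sim = variants, max(similarities)

    return [query] + best_variants  # nothing passed; return the most diverse attempt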
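The `cosine_sim` helper used above is not defined in this article. A minimal pure-Python version might look like the sketch below; `embed` still has to come from your embedding model. The second function, `max_pairwise_similarity`, is an optional extension not covered above: it checks variants against each other, since the "BAD variants" case is mostly about variants that resemble one another.

```python
import math
from itertools import combinations
from typing import Sequence

def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def max_pairwise_similarity(vectors: Sequence[Sequence[float]]) -> float:
    """Highest cosine similarity between any two distinct vectors in the list."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return max(cosine_sim(a, b) for a, b in pairs)
```

For large batches a vectorized NumPy implementation would be faster, but for a handful of query variants the pure-Python version is enough.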

Performance Impact

Typical improvements from multi-query retrieval (4 variants):

Metric	Single Query	Multi-Query	Relative Change
Recall@5	71%	83%	+17%
Recall@10	79%	91%	+15%
NDCG@10	0.61	0.72	+18%
Latency (parallel)	120 ms	140 ms	+17% overhead
Latency (sequential)	120 ms	480 ms	4× total (+300% overhead)

In these figures, the parallel implementation adds only 20 ms over the single-query baseline (attributed to LLM variant generation), whereas running the searches sequentially takes 4× the baseline latency. Always parallelize retrieval.

2025 Trend: Adaptive Query Count

Rather than always generating 4 variants, adaptive systems estimate query ambiguity and generate more variants for ambiguous queries and fewer for clear ones. A simple heuristic: if the original query is longer than 50 tokens and contains specific technical terms, it is probably clear enough for 2 variants. If it is shorter than 15 tokens or contains pronouns, generate 5 or more.
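The heuristic can be sketched in a few lines. Several details here are assumptions rather than part of the rule as stated: tokens are approximated by whitespace-separated words, "specific technical terms" are looked up in a caller-supplied vocabulary, the pronoun list is a small illustrative set, and the long-query rule is checked first so that pronouns only matter for queries of 50 tokens or fewer.

```python
import re

# Illustrative pronoun list (assumption); extend for your domain.
PRONOUNS = {"it", "its", "this", "that", "these", "those", "they", "them"}

def choose_variant_count(
    query: str,
    technical_terms: set[str],  # hypothetical domain vocabulary, lowercase
    default: int = 4,
) -> int:
    """Pick how many query variants to generate based on a rough ambiguity estimate."""
    # Word count as a cheap proxy for model tokens
    words = re.findall(r"[\w\-]+", query.lower())
    has_technical = any(w in technical_terms for w in words)
    has_pronoun = any(w in PRONOUNS for w in words)

    if len(words) > 50 and has_technical:
        return 2  # long and specific: probably clear enough
    if len(words) < 15 or has_pronoun:
        return 5  # short or referential: likely ambiguous
    return default

print(choose_variant_count("how does it work", {"hnsw"}))
```

The returned count can be passed as `n_variants` to the variant generator. The thresholds are starting points to tune against your own query logs.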

Multi-query retrieval is one of the highest-ROI improvements for RAG systems with diverse query distributions. The parallel implementation keeps latency overhead acceptable while delivering consistent recall improvements.