Query Expansion: Teaching Your RAG System to Think Broader

A user asks: “How do I fix memory leaks in my app?”

Your corpus has a fantastic article titled “Managing Heap Allocation and Garbage Collection in Production Applications.” That article answers the question perfectly, but keyword search misses it because the article never uses the phrase “memory leaks,” and vector search ranks it low because the query is phrased so differently from the article.

Query expansion is the art of enriching a user’s original query with additional terms or alternative phrasings before retrieval, so more relevant documents get found. Done well, it can raise recall substantially while keeping the cost in precision small.

The Core Idea

Original query: "How do I fix memory leaks in my app?"

Expanded query set:
  Original:    "How do I fix memory leaks in my app?"
  Synonyms:    "memory management problems", "heap allocation issues"
  Related:     "garbage collection", "out of memory errors"
  Rephrased:   "application memory management troubleshooting"
  HyDE:        "Memory leaks occur when objects are allocated but not freed.
                To fix them, use profiling tools like Valgrind or Chrome
                DevTools. Check for event listener accumulation..."

Run all these against the index → merge results → more comprehensive retrieval

Synonym and Term Expansion

The simplest form of expansion: add synonyms and closely related terms to the query.

from openai import OpenAI

client = OpenAI()

def expand_query_with_synonyms(query: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate 4 alternative phrasings or synonym expansions for this search query.
Return only the alternatives, one per line, no explanations.

Query: {query}"""
        }],
        temperature=0.3,
    )
    alternatives = response.choices[0].message.content.strip().split('\n')
    return [query] + [a.strip() for a in alternatives if a.strip()]

# Result: the original query plus up to 4 variants for retrieval
queries = expand_query_with_synonyms("memory leak fix")
# Example output (LLM output varies between runs):
# → ["memory leak fix", "heap memory deallocation", "garbage collection issue",
#    "object reference retention", "application memory management"]

Run retrieval for each expanded query, then merge the ranked result lists with Reciprocal Rank Fusion (RRF), which rewards documents that rank highly across several lists.
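The merge step can be sketched as follows. This is a minimal RRF implementation, assuming each retrieval returns a ranked list of document IDs. The function name and the sample IDs are illustrative, and k=60 is a commonly used default rather than a value taken from this article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],  # results for the original query
    ["doc_b", "doc_d"],           # results for one expanded variant
    ["doc_b", "doc_a"],           # results for another variant
])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b wins because it appears in all three lists, even though it is ranked first in only two of them. RRF only needs ranks, so it works even when the per-query similarity scores are not comparable.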

HyDE: Hypothetical Document Embeddings

HyDE (Hypothetical Document Embeddings) is one of the most effective query expansion techniques. Instead of expanding the query with more query terms, you use an LLM to generate a hypothetical document that would answer the query — then embed that document as the search vector.

User query: "How do I fix memory leaks in Python?"

LLM generates hypothetical answer:
"Python memory leaks often occur due to circular references, global variables
holding references to large objects, or using mutable default arguments in
functions. To diagnose memory leaks, use the tracemalloc module built into
Python 3.4+. The objgraph library draws object reference graphs showing which
objects hold references to what. Fix circular references by using weakref.ref()
for back-references. Use __slots__ to reduce per-object overhead..."

Embed this hypothetical document → search with document embedding (not query embedding)
→ Finds real documents that discuss these same concepts

The intuition: a hypothetical answer lives in “document space” rather than “query space.” Documents containing the actual answer will be closer to a hypothetical answer than to the original short query.

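from typing import Callable

import anthropic

client = anthropic.Anthropic()

# embed is whatever function turns text into a vector for your vectorstore
def hyde_retrieval(query: str, vectorstore, embed: Callable[[str], list[float]]) -> list: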
    # Step 1: Generate hypothetical document
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Write a short passage that directly answers this question as if from a technical documentation article:\n\n{query}"
        }]
    )
    hypothetical_doc = response.content[0].text

    # Step 2: Embed the hypothetical document (not the original query)
    hyp_embedding = embed(hypothetical_doc)

    # Step 3: Search with the hypothetical embedding
    return vectorstore.similarity_search_by_vector(hyp_embedding, k=10)

HyDE typically improves retrieval on complex, multi-concept questions where the original query is too sparse or ambiguous.

Pseudo-Relevance Feedback (PRF)

PRF assumes the top-K retrieved documents are relevant (the “pseudo-relevant” set) and uses them to expand the query with additional terms:

Step 1: Initial retrieval with original query → top 5 documents
Step 2: Extract key terms from top 5 documents (high TF-IDF weight terms)
Step 3: Add extracted terms to original query
Step 4: Re-retrieve with expanded query

Original: "solar panel efficiency"
Top 5 terms from initial results: "photovoltaic", "conversion rate",
  "monocrystalline", "perovskite", "fill factor"
Expanded: "solar panel efficiency photovoltaic conversion rate monocrystalline perovskite fill factor"

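This works well for highly technical queries where the user doesn’t know the domain-specific vocabulary. The first retrieval teaches the system the right terminology. The risk is query drift: if the initial top results are off-topic, the added terms pull the second retrieval further off course.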

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def pseudo_relevance_expansion(
    query: str,
    initial_results: list[str],  # text of top retrieved docs
    n_terms: int = 5,
) -> str:
    if not initial_results:
        return query

    # Find terms with high TF-IDF weight in retrieved docs
    vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(initial_results)
    feature_names = vectorizer.get_feature_names_out()

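    # Average TF-IDF weight across retrieved docs, highest first
    avg_weights = np.mean(tfidf_matrix.toarray(), axis=0)
    ranked_terms = [feature_names[i] for i in np.argsort(avg_weights)[::-1]]

    # Skip terms the query already contains so the expansion adds new vocabulary
    query_terms = set(query.lower().split())
    expansion_terms = [t for t in ranked_terms if t not in query_terms][:n_terms]

    return " ".join([query, *expansion_terms])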

Back-Translation for Multilingual RAG

For multilingual corpora, back-translation is a powerful expansion technique: translate the query to another language, then translate it back. The round-trip produces paraphrases that can improve retrieval:

Original (English): "software deployment automation"
→ Translate to German: "Software-Bereitstellungsautomatisierung"
→ Translate back:      "automation of software deployment"
→ Also back from French: "automated software deployments"

Three query variants → better coverage across paraphrase space
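A minimal sketch of the round trip is shown below. The `translate(text, source_lang, target_lang)` callable is a hypothetical stand-in for whatever machine translation client you use, and the lookup-table translator (including its French string) exists only to make the example runnable.

```python
from typing import Callable

# Hypothetical signature: translate(text, source_lang, target_lang) -> str
Translate = Callable[[str, str, str], str]


def back_translation_variants(
    query: str,
    translate: Translate,
    pivot_langs: tuple[str, ...] = ("de", "fr"),
    source_lang: str = "en",
) -> list[str]:
    """Return the original query plus one round-trip paraphrase per pivot language."""
    variants = [query]
    for pivot in pivot_langs:
        pivoted = translate(query, source_lang, pivot)
        round_trip = translate(pivoted, pivot, source_lang).strip()
        # Keep only paraphrases that differ from the variants collected so far
        if round_trip and round_trip.lower() not in {v.lower() for v in variants}:
            variants.append(round_trip)
    return variants


# Stub translator for demonstration only; swap in a real MT client.
_FAKE_MT = {
    ("software deployment automation", "en", "de"): "Software-Bereitstellungsautomatisierung",
    ("Software-Bereitstellungsautomatisierung", "de", "en"): "automation of software deployment",
    ("software deployment automation", "en", "fr"): "automatisation du déploiement de logiciels",
    ("automatisation du déploiement de logiciels", "fr", "en"): "automated software deployments",
}


def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    return _FAKE_MT[(text, source_lang, target_lang)]


variants = back_translation_variants("software deployment automation", fake_translate)
print(variants)
```

Round trips that come back identical to the original are dropped, so the function never spends a retrieval on a duplicate query.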

Multi-Step Expansion Pipeline

In production, these techniques combine well:

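import asyncio

# generate_hyde_document, generate_synonym_expansions, embed_batch,
# reciprocal_rank_fusion, and vectorstore.async_search are assumed to be
# helpers you supply; all except reciprocal_rank_fusion must be coroutines.

async def expanded_retrieval(query: str, vectorstore, k: int = 10) -> list:
    # Step 1: Generate the HyDE document and synonym variants concurrently
    hyde_doc, synonym_variants = await asyncio.gather(
        generate_hyde_document(query),        # returns str
        generate_synonym_expansions(query),   # returns list[str]
    )

    # Step 2: Embed the original query and all variants in one batch
    embeddings = await embed_batch([query, hyde_doc, *synonym_variants])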

    # Step 3: Retrieve for each variant
    all_results = await asyncio.gather(*[
        vectorstore.async_search(emb, k=k)
        for emb in embeddings
    ])

    # Step 4: Merge with RRF
    return reciprocal_rank_fusion([
        [r.id for r in result_list]
        for result_list in all_results
    ])[:k]

When Not to Use Query Expansion

Query expansion increases latency (multiple LLM calls + multiple retrievals). Skip it or limit it when:

Queries are already long and detailed
Latency SLAs are tight (< 500 ms end-to-end)
The corpus is small and retrieval recall is already high
Queries are highly specific technical lookups (BM25 exact match handles these well)

For interactive applications, consider running expansion only when initial retrieval confidence is low — a dynamic decision based on the top-K similarity scores.
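One way to sketch that decision is a simple gate on the first-pass scores. The function name, the 0.75 threshold, and the minimum of three strong hits are placeholder assumptions. The right values depend on your embedding model and should be tuned on real queries.

```python
def should_expand(
    top_scores: list[float],
    score_threshold: float = 0.75,
    min_strong_hits: int = 3,
) -> bool:
    """Decide whether to run query expansion after a first retrieval pass.

    top_scores holds the similarity scores of the top-K hits, assumed to be
    on a scale where higher means more similar (e.g. cosine similarity).
    """
    strong_hits = sum(1 for score in top_scores if score >= score_threshold)
    # Too few confident matches: pay the extra latency and expand
    return strong_hits < min_strong_hits


print(should_expand([0.91, 0.88, 0.80, 0.60]))  # False: three strong hits
print(should_expand([0.78, 0.55, 0.50]))        # True: only one strong hit
```

Queries that pass the gate skip the extra LLM calls entirely, so the added latency is paid only where the first pass looks weak.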

2025 Trend: RAG-Fusion with LLM-Generated Subqueries

RAG-Fusion (Adrian Raudaschl, 2023) generates multiple semantically diverse sub-queries from the original query, retrieves for each, then fuses the ranked result lists with Reciprocal Rank Fusion. Unlike simple synonym expansion, it decomposes complex multi-part questions into focused atomic queries:

Original: "What are the trade-offs between SQL and NoSQL databases for e-commerce?"

Generated sub-queries:
  1. "SQL ACID properties e-commerce transactions"
  2. "NoSQL horizontal scalability product catalogs"
  3. "database performance comparison high-volume retail"
  4. "real-time inventory consistency database options"

→ 4 focused retrievals → RRF merge → comprehensive answer context
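The sub-query generation step can be sketched as a prompt builder plus a parser for the model's reply. The prompt wording and both function names are illustrative assumptions, not part of RAG-Fusion itself. The parser exists because LLMs often return numbered or quoted lines even when asked not to.

```python
import re

SUBQUERY_PROMPT = (
    "Break the question below into {n} focused search queries, each covering "
    "one aspect of the question. Return one query per line.\n\n"
    "Question: {question}"
)


def build_subquery_prompt(question: str, n: int = 4) -> str:
    return SUBQUERY_PROMPT.format(n=n, question=question)


def parse_subqueries(llm_output: str, max_queries: int = 4) -> list[str]:
    """Turn raw LLM output into a clean, de-duplicated list of sub-queries."""
    queries: list[str] = []
    seen: set[str] = set()
    for line in llm_output.splitlines():
        # Strip list markers such as "1.", "2)", "-", "*" and surrounding quotes
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip().strip("\"'“”")
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            queries.append(cleaned)
    return queries[:max_queries]


raw_reply = '''1. "SQL ACID properties e-commerce transactions"
2. "NoSQL horizontal scalability product catalogs"

- database performance comparison high-volume retail
3) NoSQL horizontal scalability product catalogs
'''
subqueries = parse_subqueries(raw_reply)
print(subqueries)
```

Each parsed sub-query then goes through retrieval on its own, and the resulting ranked lists are merged with RRF as in the diagram above.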

Query expansion is a high-value addition to any RAG system where retrieval recall is the bottleneck. Start with HyDE for complex queries and synonym expansion for vocabulary mismatches. Each costs roughly one extra LLM call per query, so measure recall and latency before and after to confirm the trade is worth it.