Query Expansion: Teaching Your RAG System to Think Broader
A user asks: “How do I fix memory leaks in my app?”
Your corpus has a fantastic article titled “Managing Heap Allocation and Garbage Collection in Production Applications.” That article answers the question perfectly — but keyword search misses it because it doesn’t mention “memory leaks,” and the query vector doesn’t quite match because the phrasing is different.
Query expansion is the art of enriching a user’s original query with additional terms or alternative phrasings before retrieval, so that more relevant documents get found. Done well, it substantially improves recall. Done carelessly, the added terms pull in off-topic documents and hurt precision.
The Core Idea
Original query: "How do I fix memory leaks in my app?"
Expanded query set:
- Original: "How do I fix memory leaks in my app?"
- Synonyms: "memory management problems", "heap allocation issues"
- Related: "garbage collection", "out of memory errors"
- Rephrased: "application memory management troubleshooting"
- HyDE: "Memory leaks occur when objects are allocated but not freed. To fix them, use profiling tools like Valgrind or Chrome DevTools. Check for event listener accumulation..."

Run all these against the index → merge results → more comprehensive retrieval

Synonym and Term Expansion
The simplest form of expansion: add synonyms and closely related terms to the query.
```python
from openai import OpenAI

client = OpenAI()

def expand_query_with_synonyms(query: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate 4 alternative phrasings or synonym expansions for this search query.
Return only the alternatives, one per line, no explanations.

Query: {query}""",
        }],
        temperature=0.3,
    )
    alternatives = response.choices[0].message.content.strip().split("\n")
    # Keep the original query first, then every non-empty alternative
    return [query] + [a.strip() for a in alternatives if a.strip()]

# Result: 5 query variants for retrieval (exact wording varies between LLM calls)
queries = expand_query_with_synonyms("memory leak fix")
# e.g. → ["memory leak fix", "heap memory deallocation", "garbage collection issue",
#         "object reference retention", "application memory management"]
```

Run retrieval for each expanded query, then merge the ranked lists with Reciprocal Rank Fusion (RRF).
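RRF gives each document a score equal to the sum of 1 / (k + rank) over every ranked list it appears in. Documents that rank well for several query variants therefore rise to the top. The sketch below is minimal and assumes each retrieval returns an ordered list of document IDs. The constant k = 60 is the conventional default from the original RRF paper; this pipeline does not require that particular value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document's fused score is the sum of 1 / (k + rank) over every list
    that contains it, with ranks starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable)
    return sorted(scores, key=scores.__getitem__, reverse=True)


# Three query variants retrieved overlapping documents
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_e"],
])
print(fused)  # → ['doc_b', 'doc_a', 'doc_e', 'doc_c', 'doc_d']
```

doc_b wins because it ranks near the top of all three lists, even though it is first in only two of them. RRF uses only ranks, not raw scores, so it can merge results from retrievers whose score scales are not comparable.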
HyDE: Hypothetical Document Embeddings
HyDE (Hypothetical Document Embeddings) is one of the most effective query expansion techniques. Instead of expanding the query with more query terms, you use an LLM to generate a hypothetical document that would answer the query — then embed that document as the search vector.
User query: "How do I fix memory leaks in Python?"
LLM generates hypothetical answer:
"Python memory leaks often occur due to circular references, global variables holding references to large objects, or using mutable default arguments in functions. To diagnose memory leaks, use the tracemalloc module built into Python 3.4+. The objgraph library draws reference graphs showing which objects hold references to which. Fix circular references by using weakref.ref() for back-references. Use __slots__ to reduce per-object overhead..."
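Embed this hypothetical document → search with the document embedding (not the query embedding)
→ Finds real documents that discuss these same concepts

The intuition: a hypothetical answer lives in “document space” rather than “query space.” In embedding space, documents containing the actual answer tend to lie closer to a hypothetical answer than to the original short query. The hypothetical answer does not need to be factually correct. It only needs to use the vocabulary and structure that a real answer would.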
```python
import anthropic

client = anthropic.Anthropic()

def hyde_retrieval(query: str, vectorstore) -> list:
    # Step 1: Generate a hypothetical document that answers the query
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Write a short passage that directly answers this question "
                f"as if from a technical documentation article:\n\n{query}"
            ),
        }],
    )
    hypothetical_doc = response.content[0].text

    # Step 2: Embed the hypothetical document (not the original query).
    # `embed` is a placeholder for the embedding model used to index the corpus;
    # the search vector must come from that same model.
    hyp_embedding = embed(hypothetical_doc)

    # Step 3: Search with the hypothetical embedding
    # (e.g. LangChain's VectorStore.similarity_search_by_vector)
    return vectorstore.similarity_search_by_vector(hyp_embedding, k=10)
```

HyDE typically improves retrieval on complex, multi-concept questions where the original query is too sparse or ambiguous.
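The original HyDE paper (Gao et al., 2022) goes a step further than the code above. It samples several hypothetical documents and averages their embeddings together with the embedding of the query itself, which smooths out the quirks of any single generation. The NumPy sketch below shows only that combination step and is a minimal version. It assumes all vectors come from the same embedding model used for the corpus. `hyde_query_vector` is a hypothetical helper name. The final L2 normalization is a convenience for cosine-similarity indexes and is not part of the paper's averaging formula.

```python
import numpy as np


def hyde_query_vector(query_emb: np.ndarray, hyp_doc_embs: np.ndarray) -> np.ndarray:
    """Average N hypothetical-document embeddings with the query embedding.

    query_emb: shape (d,); hyp_doc_embs: shape (N, d).
    Returns a vector of shape (d,) to use as the search vector.
    """
    # Stack the N hypothetical documents and the query -> shape (N + 1, d)
    all_embs = np.vstack([hyp_doc_embs, query_emb[np.newaxis, :]])
    mean = all_embs.mean(axis=0)
    # L2-normalize so a dot-product index behaves like cosine similarity
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean


# Toy 3-dimensional embeddings: one query, three hypothetical documents
query_emb = np.array([1.0, 0.0, 0.0])
hyp_doc_embs = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
vec = hyde_query_vector(query_emb, hyp_doc_embs)
print(vec.round(3))
```

The query still contributes, but the direction shared by most hypothetical documents dominates the final vector. Generating several documents multiplies the LLM cost, so a small N is the usual trade-off.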
Pseudo-Relevance Feedback (PRF)
PRF assumes the top-K retrieved documents are relevant (the “pseudo-relevant” set) and uses them to expand the query with additional terms:
Step 1: Initial retrieval with original query → top 5 documents
Step 2: Extract key terms from top 5 documents (high TF-IDF weight terms)
Step 3: Add extracted terms to original query
Step 4: Re-retrieve with expanded query
Original: "solar panel efficiency"
Top 5 terms from initial results: "photovoltaic", "conversion rate", "monocrystalline", "perovskite", "fill factor"
Expanded (top 3 terms appended): "solar panel efficiency photovoltaic conversion rate monocrystalline"

This works well for highly technical queries where the user doesn’t know the domain-specific vocabulary: the first retrieval teaches the system the right terminology. The catch is query drift. If the initial top results are off-topic, their terms pull the expanded query further off-topic.
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def pseudo_relevance_expansion(
    query: str,
    initial_results: list[str],  # text of top retrieved docs
    n_terms: int = 5,
) -> str:
    if not initial_results:
        return query

    # Find terms with high TF-IDF weight in retrieved docs
    vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(initial_results)
    feature_names = vectorizer.get_feature_names_out()

    # Average TF-IDF weight across retrieved docs, highest first
    avg_weights = np.mean(tfidf_matrix.toarray(), axis=0)
    ranked_idx = np.argsort(avg_weights)[::-1]

    # Skip terms the query already contains (TfidfVectorizer lowercases by default)
    query_terms = set(query.lower().split())
    expansion_terms = [
        feature_names[i] for i in ranked_idx if feature_names[i] not in query_terms
    ][:n_terms]

    return query + " " + " ".join(expansion_terms)
```

Back-Translation for Multilingual RAG
For multilingual corpora, back-translation is a powerful expansion technique: translate the query to another language, then translate it back. The round-trip produces paraphrases that can improve retrieval:
Original (English): "software deployment automation"
→ Translate to German: "Software-Bereitstellungsautomatisierung"
→ Translate back: "automation of software deployment"
→ Also back from French: "automated software deployments"

Three query variants → better coverage across paraphrase space

Multi-Step Expansion Pipeline
In production, these techniques combine well:
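```python
import asyncio

async def expanded_retrieval(query: str, vectorstore, k: int = 10) -> list:
    # Step 1: Generate variants concurrently.
    # generate_hyde_document and generate_synonym_expansions are assumed async
    # helpers; the latter is assumed to return a list of alternatives only.
    hyde_doc, synonym_variants = await asyncio.gather(
        generate_hyde_document(query),
        generate_synonym_expansions(query),
    )
    variants = [query, hyde_doc, *synonym_variants]

    # Step 2: Embed all variants in one batched call (embed_batch is an assumed helper)
    embeddings = await embed_batch(variants)

    # Step 3: Retrieve for each variant in parallel
    # (vectorstore.async_search is an assumed async search method)
    all_results = await asyncio.gather(*[
        vectorstore.async_search(emb, k=k) for emb in embeddings
    ])

    # Step 4: Merge the ranked ID lists with RRF
    # (reciprocal_rank_fusion is assumed to return IDs sorted by fused score)
    return reciprocal_rank_fusion([
        [r.id for r in result_list] for result_list in all_results
    ])[:k]
```

When Not to Use Query Expansion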
Query expansion increases latency (multiple LLM calls + multiple retrievals). Skip it or limit it when:
- Queries are already long and detailed
- Latency SLAs are tight (< 500ms end-to-end)
- The corpus is small and retrieval recall is already high
- Queries are highly specific technical lookups (BM25 exact match handles these well)
For interactive applications, consider running expansion only when initial retrieval confidence is low — a dynamic decision based on the top-K similarity scores.
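The gating decision can be sketched as follows. This is a minimal version that assumes the first-pass retriever returns similarity scores where higher means more relevant, such as cosine similarity. `should_expand` is a hypothetical helper. The thresholds `min_top_score` and `min_mean_score` are illustrative placeholders, not recommended values. They would need tuning on your own queries, because score scales differ between embedding models.

```python
def should_expand(
    scores: list[float],
    min_top_score: float = 0.75,   # illustrative threshold, tune per model
    min_mean_score: float = 0.6,   # illustrative threshold, tune per model
    top_k: int = 5,
) -> bool:
    """Return True when first-pass retrieval looks weak enough to justify expansion."""
    if not scores:
        return True  # nothing retrieved at all
    top = sorted(scores, reverse=True)[:top_k]
    if top[0] < min_top_score:
        return True  # even the best match is weak
    mean_top = sum(top) / len(top)
    return mean_top < min_mean_score  # one strong hit, but the rest are weak


print(should_expand([0.91, 0.88, 0.86, 0.84, 0.80]))  # False: confident first pass
print(should_expand([0.62, 0.55, 0.51]))              # True: weak best match
print(should_expand([0.9, 0.3, 0.2, 0.1]))            # True: weak top-K overall
```

Only queries that fail this check pay for the extra LLM calls and retrievals. Confident queries keep the latency of a single retrieval.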
2025 Trend: RAG-Fusion with LLM-Generated Subqueries
RAG-Fusion (Adrian Raudaschl, 2023) generates multiple semantically diverse sub-queries from the original query, retrieves for each, then re-ranks the merged results. Unlike simple synonym expansion, it decomposes complex multi-part questions into focused atomic queries:
Original: "What are the trade-offs between SQL and NoSQL databases for e-commerce?"
Generated sub-queries:
1. "SQL ACID properties e-commerce transactions"
2. "NoSQL horizontal scalability product catalogs"
3. "database performance comparison high-volume retail"
4. "real-time inventory consistency database options"
→ 4 focused retrievals → RRF merge → comprehensive answer context

Query expansion is a high-value addition to any RAG system where retrieval recall is the bottleneck. Start with HyDE for complex queries and synonym expansion for vocabulary mismatches. Both tend to improve recall meaningfully at a manageable latency cost.