Sparse Retrieval: The Foundation of Information Retrieval
While dense retrieval captures semantics, sparse retrieval uses keywords and statistics to find documents. Despite being older, sparse retrieval remains essential because it excels at things dense retrieval struggles with.
What Is Sparse Retrieval?
Sparse retrieval represents documents as high-dimensional vectors where most values are zero (hence “sparse”). Each dimension represents a word or term.
Example (simplified):
Document 1: "The cat sat on the mat"Sparse vector: [0, 1, 0, 1, 1, 0, 1, 0, ...] ↑ "the" ↑ "cat" ↑ "sat" ↑ "mat" dimension indices of words present
Document 2: "Dogs play in the park"Sparse vector: [0, 0, 1, 0, 1, 1, 0, 1, ...] ↑ "dog" ↑ "play" ↑ "park"
Similarity = overlap of words with weightsIn contrast, dense retrieval uses learned representations across all dimensions.
BM25: The Standard Sparse Retrieval Algorithm
BM25 (Best Matching 25) is the industry standard for keyword-based retrieval. It’s been beating neural models at certain tasks for 30 years.
How BM25 Works
BM25 scores documents based on term frequency and inverse document frequency.
Simplified formula:
BM25(query, doc) = Σ IDF(term) × (tf(term, doc) × (k₁ + 1)) / (tf(term, doc) + k₁ × (1 - b + b × len(doc)/avglen))Breaking it down:
-
TF (Term Frequency): How often does the query term appear in the document?
- More mentions = higher score
- But diminishing returns (5 mentions only slightly better than 4)
-
IDF (Inverse Document Frequency): How rare is this term across the corpus?
- Rare terms (specific to documents) matter more than common terms
- “Pneumonia” more informative than “the”
-
Document length normalization: Longer documents shouldn’t automatically score higher
- Balance between term frequency and document length
- Prevents bias toward longer documents
BM25 Example
Query: "how to fix a broken window"Words: ["fix", "broken", "window"]
Document 1: "Steps to fix a broken window frame using glass and tools"- "fix": appears 1 time, IDF = 3.5 (moderately rare)- "broken": appears 1 time, IDF = 2.8 (less rare)- "window": appears 1 time, IDF = 2.2 (common)- BM25 score: ~8.5
Document 2: "How to fix fix fix windows and doors (repeated for emphasis)"- "fix": appears 3 times, IDF = 3.5- "broken": appears 0 times, IDF = 0- "window": appears 1 time, IDF = 2.2- BM25 score: ~12.0
Result: Document 2 ranks higher despite being less semantically relevant(This is a weakness of BM25: keyword repetition gaming)Strengths of Sparse Retrieval
1. Exact Match Handling
Perfect for queries where exact terminology matters.
Query: "Python 3.11 release notes"Sparse retrieval: Matches documents with exact termsDense retrieval: Might match "programming language updates" (less relevant)2. Interpretability
Easy to understand why a document was retrieved.
Query matched because:- "Python" appeared 5 times (TF = 5)- "3.11" appeared 2 times (TF = 2)- "release" appeared 1 time (TF = 1)- Document is 400 words (reasonable length)
Clear and explainable.3. Scalability
BM25 works at massive scale with simple infrastructure.
Millions of documents: Still fastInfrastructure: Standard database indexing (Elasticsearch, PostgreSQL)Cost: Minimal4. Zero Cold-Start
Any document with text is immediately retrievable.
Add new documentImmediately indexedQueryableNo embedding computation needed5. Domain Vocabulary
Works well with domain-specific terminology.
Medical query: "myocardial infarction treatment"Sparse retrieval: Matches "heart attack therapy" if indexedDense embedding: Might understand semantic similarity
Both useful, sparse is more precise for terminologyWeaknesses of Sparse Retrieval
1. Paraphrase Blindness
Can’t match equivalent statements with different words.
Query: "How much do cars cost?"Matches: "car pricing", "vehicle cost", "automobile expenses"Misses: "What's the price of a sedan?" (same meaning, no word overlap)2. Vocabulary Mismatch
Query terms must exist in documents.
Query uses: "smartphone"Document uses: "mobile device"No match, despite semantic equivalence
Misspellings cause complete misses.3. Context Blindness
Doesn’t understand meaning, only word occurrence.
Document 1: "The company is not profitable anymore"Document 2: "The company is profitable"
Query: "profitable company"Both match equally (BM25), though opposite meanings4. Semantic Relationships
Can’t understand that “dog” and “puppy” are related.
Query: "puppy behavior"Won't match "dog training" automaticallyRequires query expansion or thesaurusSparse Retrieval Methods
BM25 (Standard)
Most common, well-understood, proven.
When to use:
- Exact terminology important
- Fast retrieval critical
- Small team/budget
Implementation:
from rank_bm25 import BM25Okapi
# Corpus: list of tokenized documentscorpus = [ ["how", "to", "fix", "window"], ["window", "repair", "guide"], ["broken", "glass", "replacement"],]
bm25 = BM25Okapi(corpus)query = ["fix", "broken", "window"]scores = bm25.get_scores(query)top_doc_index = argmax(scores) # 0 or 1TF-IDF
Simpler variant, still effective.
Score = TF(term, doc) × IDF(term)
TF = frequency of term in documentIDF = log(total_docs / docs_containing_term)
Simpler than BM25, less tuning neededQuery Expansion
Overcome vocabulary mismatch by expanding queries.
Original query: "smartphone"Expansion: ["smartphone", "mobile", "phone", "device", "cellular"]Retrieve with expanded query
Benefit: Match more documentsCost: Possible noise from overly broad expansionSparse vs. Dense Comparison
| Aspect | Sparse (BM25) | Dense |
|---|---|---|
| Paraphrases | ✗ Poor | ✓ Excellent |
| Exact match | ✓ Excellent | ✗ Can miss |
| Speed | ✓ Very fast | ✓ Fast with index |
| Interpretability | ✓ Clear | ✗ Black box |
| Infrastructure | ✓ Simple | ✗ Complex |
| New documents | ✓ Instant | ✗ Needs embedding |
| Scale | ✓ Scales easily | ✓ Scales well |
| Quality | ✓ Proven | ✓ Better |
Hybrid Retrieval: The Best of Both
Key insight: Sparse and dense retrieval excel at different things.
Hybrid approach:
Stage 1: Parallel retrieval Run BM25 (sparse) → Top 10 Run embedding similarity (dense) → Top 10
Stage 2: Merge and rerank Combine results (union or intersection) Rerank by ensemble method Return top 5
Result: Covers both exact terms and semantic similarityImplementation example:
# Get results from both methodsbm25_results = bm25_retrieve(query) # Top 10dense_results = dense_retrieve(query_embedding) # Top 10
# Normalize scores to [0, 1]bm25_scores = normalize(bm25_results.scores)dense_scores = normalize(dense_results.scores)
# Combinecombined_scores = {}for doc_id, score in bm25_scores.items(): combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 0.5 * scorefor doc_id, score in dense_scores.items(): combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 0.5 * score
# Return top 5return sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:5]Sparse Retrieval in Production
Common implementations:
Elasticsearch:
Most popular production systemBM25 built-inScales to billions of documentsPostgreSQL with full-text search:
Simpler infrastructureGood for moderate scaleIntegrated with application databaseLucene/Solr:
Classic choicePowerful configurationSteeper learning curveSparse Retrieval in 2024
Trends:
- Hybrid retrieval becoming standard
- Sparse + dense in single vector space
- Learned sparse models improving
- ColBERT: Combining sparse and dense concepts
Most production RAG systems now use hybrid retrieval rather than pure sparse or pure dense.
Decision: When to Use Sparse
Use sparse retrieval (BM25) when:
- Exact terminology is critical
- Query language matches document language exactly
- Budget is tight
- Infrastructure is limited
- Need immediate new document indexing
Use dense retrieval when:
- Semantic understanding matters
- Query paraphrasing is expected
- Quality is paramount
- You can afford the infrastructure
Use hybrid (both) when:
- You want the best of both
- You have resources to manage both
- Your use case has both exact matches and semantic queries
Most modern RAG systems use hybrid retrieval as the sweet spot.