Sparse Retrieval for RAG: Keyword-Based Search Methods

Master sparse retrieval using BM25 and keyword search. Learn when to use sparse retrieval and hybrid approaches for RAG.

Sparse Retrieval: The Foundation of Information Retrieval

While dense retrieval captures semantics, sparse retrieval uses keywords and statistics to find documents. Despite being older, sparse retrieval remains essential because it excels at things dense retrieval struggles with.

What Is Sparse Retrieval?

Sparse retrieval represents documents as high-dimensional vectors where most values are zero (hence “sparse”). Each dimension represents a word or term.

Example (simplified):

Document 1: "The cat sat on the mat"
Sparse vector: [0, 1, 0, 1, 1, 0, 1, 0, ...]
↑ "the" ↑ "cat" ↑ "sat" ↑ "mat"
dimension indices of words present
Document 2: "Dogs play in the park"
Sparse vector: [0, 0, 1, 0, 1, 1, 0, 1, ...]
↑ "dog" ↑ "play" ↑ "park"
Similarity = overlap of words with weights

In contrast, dense retrieval uses learned representations across all dimensions.

BM25: The Standard Sparse Retrieval Algorithm

BM25 (Best Matching 25) is the industry standard for keyword-based retrieval. It’s been beating neural models at certain tasks for 30 years.

How BM25 Works

BM25 scores documents based on term frequency and inverse document frequency.

Simplified formula:

BM25(query, doc) = Σ IDF(term) × (tf(term, doc) × (k₁ + 1)) /
(tf(term, doc) + k₁ × (1 - b + b × len(doc)/avglen))

Breaking it down:

  1. TF (Term Frequency): How often does the query term appear in the document?

    • More mentions = higher score
    • But diminishing returns (5 mentions only slightly better than 4)
  2. IDF (Inverse Document Frequency): How rare is this term across the corpus?

    • Rare terms (specific to documents) matter more than common terms
    • “Pneumonia” more informative than “the”
  3. Document length normalization: Longer documents shouldn’t automatically score higher

    • Balance between term frequency and document length
    • Prevents bias toward longer documents

BM25 Example

Query: "how to fix a broken window"
Words: ["fix", "broken", "window"]
Document 1: "Steps to fix a broken window frame using glass and tools"
- "fix": appears 1 time, IDF = 3.5 (moderately rare)
- "broken": appears 1 time, IDF = 2.8 (less rare)
- "window": appears 1 time, IDF = 2.2 (common)
- BM25 score: ~8.5
Document 2: "How to fix fix fix windows and doors (repeated for emphasis)"
- "fix": appears 3 times, IDF = 3.5
- "broken": appears 0 times, IDF = 0
- "window": appears 1 time, IDF = 2.2
- BM25 score: ~12.0
Result: Document 2 ranks higher despite being less semantically relevant
(This is a weakness of BM25: keyword repetition gaming)

Strengths of Sparse Retrieval

1. Exact Match Handling

Perfect for queries where exact terminology matters.

Query: "Python 3.11 release notes"
Sparse retrieval: Matches documents with exact terms
Dense retrieval: Might match "programming language updates" (less relevant)

2. Interpretability

Easy to understand why a document was retrieved.

Query matched because:
- "Python" appeared 5 times (TF = 5)
- "3.11" appeared 2 times (TF = 2)
- "release" appeared 1 time (TF = 1)
- Document is 400 words (reasonable length)
Clear and explainable.

3. Scalability

BM25 works at massive scale with simple infrastructure.

Millions of documents: Still fast
Infrastructure: Standard database indexing (Elasticsearch, PostgreSQL)
Cost: Minimal

4. Zero Cold-Start

Any document with text is immediately retrievable.

Add new document
Immediately indexed
Queryable
No embedding computation needed

5. Domain Vocabulary

Works well with domain-specific terminology.

Medical query: "myocardial infarction treatment"
Sparse retrieval: Matches "heart attack therapy" if indexed
Dense embedding: Might understand semantic similarity
Both useful, sparse is more precise for terminology

Weaknesses of Sparse Retrieval

1. Paraphrase Blindness

Can’t match equivalent statements with different words.

Query: "How much do cars cost?"
Matches: "car pricing", "vehicle cost", "automobile expenses"
Misses: "What's the price of a sedan?" (same meaning, no word overlap)

2. Vocabulary Mismatch

Query terms must exist in documents.

Query uses: "smartphone"
Document uses: "mobile device"
No match, despite semantic equivalence
Misspellings cause complete misses.

3. Context Blindness

Doesn’t understand meaning, only word occurrence.

Document 1: "The company is not profitable anymore"
Document 2: "The company is profitable"
Query: "profitable company"
Both match equally (BM25), though opposite meanings

4. Semantic Relationships

Can’t understand that “dog” and “puppy” are related.

Query: "puppy behavior"
Won't match "dog training" automatically
Requires query expansion or thesaurus

Sparse Retrieval Methods

BM25 (Standard)

Most common, well-understood, proven.

When to use:

  • Exact terminology important
  • Fast retrieval critical
  • Small team/budget

Implementation:

from rank_bm25 import BM25Okapi
# Corpus: list of tokenized documents
corpus = [
["how", "to", "fix", "window"],
["window", "repair", "guide"],
["broken", "glass", "replacement"],
]
bm25 = BM25Okapi(corpus)
query = ["fix", "broken", "window"]
scores = bm25.get_scores(query)
top_doc_index = argmax(scores) # 0 or 1

TF-IDF

Simpler variant, still effective.

Score = TF(term, doc) × IDF(term)
TF = frequency of term in document
IDF = log(total_docs / docs_containing_term)
Simpler than BM25, less tuning needed

Query Expansion

Overcome vocabulary mismatch by expanding queries.

Original query: "smartphone"
Expansion: ["smartphone", "mobile", "phone", "device", "cellular"]
Retrieve with expanded query
Benefit: Match more documents
Cost: Possible noise from overly broad expansion

Sparse vs. Dense Comparison

AspectSparse (BM25)Dense
Paraphrases✗ Poor✓ Excellent
Exact match✓ Excellent✗ Can miss
Speed✓ Very fast✓ Fast with index
Interpretability✓ Clear✗ Black box
Infrastructure✓ Simple✗ Complex
New documents✓ Instant✗ Needs embedding
Scale✓ Scales easily✓ Scales well
Quality✓ Proven✓ Better

Hybrid Retrieval: The Best of Both

Key insight: Sparse and dense retrieval excel at different things.

Hybrid approach:

Stage 1: Parallel retrieval
Run BM25 (sparse) → Top 10
Run embedding similarity (dense) → Top 10
Stage 2: Merge and rerank
Combine results (union or intersection)
Rerank by ensemble method
Return top 5
Result: Covers both exact terms and semantic similarity

Implementation example:

# Get results from both methods
bm25_results = bm25_retrieve(query) # Top 10
dense_results = dense_retrieve(query_embedding) # Top 10
# Normalize scores to [0, 1]
bm25_scores = normalize(bm25_results.scores)
dense_scores = normalize(dense_results.scores)
# Combine
combined_scores = {}
for doc_id, score in bm25_scores.items():
combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 0.5 * score
for doc_id, score in dense_scores.items():
combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 0.5 * score
# Return top 5
return sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:5]

Sparse Retrieval in Production

Common implementations:

Elasticsearch:

Most popular production system
BM25 built-in
Scales to billions of documents

PostgreSQL with full-text search:

Simpler infrastructure
Good for moderate scale
Integrated with application database

Lucene/Solr:

Classic choice
Powerful configuration
Steeper learning curve

Sparse Retrieval in 2024

Trends:

  • Hybrid retrieval becoming standard
  • Sparse + dense in single vector space
  • Learned sparse models improving
  • ColBERT: Combining sparse and dense concepts

Most production RAG systems now use hybrid retrieval rather than pure sparse or pure dense.

Decision: When to Use Sparse

Use sparse retrieval (BM25) when:

  • Exact terminology is critical
  • Query language matches document language exactly
  • Budget is tight
  • Infrastructure is limited
  • Need immediate new document indexing

Use dense retrieval when:

  • Semantic understanding matters
  • Query paraphrasing is expected
  • Quality is paramount
  • You can afford the infrastructure

Use hybrid (both) when:

  • You want the best of both
  • You have resources to manage both
  • Your use case has both exact matches and semantic queries

Most modern RAG systems use hybrid retrieval as the sweet spot.