Keyword Search: The Unsung Hero of RAG Retrieval

The AI community has a strange relationship with keyword search. Everyone is excited about semantic embeddings and vector databases, yet consistently research shows that BM25 — an algorithm from 1994 — outperforms dense retrieval on many real-world benchmarks, particularly for out-of-domain queries and queries with precise technical terms.

Understanding keyword search isn’t just historical context. For hybrid retrieval (which beats both approaches alone), you need to understand what keyword search does well and why it complements semantic search.

What Keyword Search Actually Does

At its heart, keyword search builds an inverted index — a mapping from every term in your corpus to the documents that contain it:

Inverted Index (simplified):

"transformer"   → [doc_14, doc_67, doc_234, doc_891, ...]
"attention"     → [doc_14, doc_92, doc_234, doc_445, ...]
"neural"        → [doc_14, doc_67, doc_92, ...]
"Python"        → [doc_67, doc_108, doc_445, ...]
"BERT"          → [doc_14, doc_234, ...]

Query: "transformer attention mechanism"
→ Find documents containing "transformer" AND "attention" AND "mechanism"
→ Rank by relevance score (BM25)

BM25: Better Than TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) was the standard for decades. BM25 (Best Match 25) improved on it in two key ways:

1. Term frequency saturation: TF-IDF’s term frequency component grows unboundedly. A document mentioning “Python” 50 times scores 5× higher than one mentioning it 10 times, even though practical relevance doesn’t scale linearly. BM25 saturates the term frequency contribution:

TF-IDF: score ∝ tf(t,d)
BM25:   score ∝ tf(t,d) × (k1 + 1) / (tf(t,d) + k1 × (1 - b + b × |d|/avgdl))

k1 = 1.5 (term frequency saturation — higher = less saturation)
b   = 0.75 (length normalization — higher = more length penalty)

2. Document length normalization: Long documents naturally contain more term occurrences. BM25 normalizes by document length (|d|/avgdl) so longer documents don’t unfairly dominate.

The full BM25 score for a multi-term query:

BM25(d, Q) = Σ IDF(qi) × tf(qi, d) × (k1 + 1)
                                    ─────────────────────────────────
                                    tf(qi, d) + k1 × (1 - b + b × |d|/avgdl)

IDF(q) = log((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
  N = total documents
  n(q) = documents containing term q

When Keyword Search Wins

Semantic search sometimes fails where keyword search succeeds:

Exact technical terms: “CVE-2024-12345”, “RFC 7519”, “OWASP A03:2021” — these are not in most embedding models’ training distribution. BM25 matches them exactly.

Named entities: Product names, person names, company names, model numbers (“iPhone 15 Pro Max”, “GPT-4o”, “B2B-XY7-REV2”). Semantic search might return similar products instead of the exact one.

Code snippets: A query containing RuntimeError: CUDA out of memory is better matched by keyword search against error logs.

Short, precise queries: “GDPR Article 17” — three tokens that should return exactly the right document. Semantic search might return related privacy articles instead.

Implementing BM25 in Python

# Option 1: rank_bm25 library (pure Python, no server needed)
from rank_bm25 import BM25Okapi
import nltk
nltk.download('punkt')

class BM25Index:
    def __init__(self, documents: list[str]):
        tokenized = [nltk.word_tokenize(doc.lower()) for doc in documents]
        self.bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)
        self.documents = documents

    def search(self, query: str, k: int = 10) -> list[tuple[str, float]]:
        tokenized_query = nltk.word_tokenize(query.lower())
        scores = self.bm25.get_scores(tokenized_query)
        top_k = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:k]
        return [(self.documents[i], score) for i, score in top_k if score > 0]

# Usage
index = BM25Index(document_texts)
results = index.search("transformer attention mechanism", k=5)

# Option 2: Elasticsearch (production-grade, distributed)
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index documents
es.index(index="documents", body={
    "text": "BERT uses bidirectional transformer encoders...",
    "source": "ml_paper_2018.pdf",
})

# BM25 search
results = es.search(index="documents", body={
    "query": {
        "match": {
            "text": {
                "query": "bidirectional transformer encoder",
                "operator": "or",
            }
        }
    },
    "size": 10
})

Tokenization Matters More Than You Think

BM25 performance depends significantly on tokenization and pre-processing:

import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text: str) -> list[str]:
    # Lowercase
    text = text.lower()
    # Remove punctuation (but keep hyphens in compound terms)
    text = re.sub(r'[^a-z0-9\s\-]', '', text)
    # Tokenize
    tokens = text.split()
    # Remove stopwords (optional — sometimes stopwords carry meaning)
    tokens = [t for t in tokens if t not in stop_words]
    # Stem (optional — improves recall, reduces precision)
    tokens = [stemmer.stem(t) for t in tokens]
    return tokens

Stemming trade-off: Stemming improves recall (“running” → “run” matches “runs”, “runner”) but reduces precision. For technical domains with precise terminology, skip stemming — “running” and “run” may mean different things.

Sparse vs Dense: Performance on BEIR Benchmarks

The BEIR benchmark (2021) was a wake-up call for the NLP community. BM25 outperforms most dense retrieval models on out-of-domain datasets:

BEIR Benchmark (NDCG@10, out-of-domain retrieval):

Dataset          | BM25  | DPR   | SBERT | ColBERT | OpenAI-ada-002
-----------------|-------|-------|-------|---------|---------------
MS MARCO         | 0.229 | 0.313 | 0.372 | 0.397   | 0.408
TREC-COVID       | 0.656 | 0.332 | 0.594 | 0.677   | 0.604
NFCorpus         | 0.325 | 0.189 | 0.325 | 0.338   | 0.336
NQ               | 0.329 | 0.475 | 0.529 | 0.524   | 0.530
HotpotQA         | 0.602 | 0.391 | 0.599 | 0.590   | 0.538

BM25 wins on TREC-COVID (medical) and HotpotQA (complex reasoning).
Dense models win on NQ (factual) and standard QA.

The takeaway: neither approach dominates universally. Hybrid is better than either alone.

2025 Trend: Learned Sparse Retrieval

SPLADE (Sparse Lexical and Expansion Model) and similar learned sparse models combine the exact-match properties of keyword search with some semantic expansion. They produce sparse vectors in vocabulary space, but with learned weights that account for semantics:

"What is backpropagation?" (SPLADE sparse representation)
→ {"backprop": 2.1, "gradient": 1.8, "neural": 1.4, "training": 1.2, "chain_rule": 0.9, ...}

Most terms are still 0 (sparse), but semantically related terms get non-zero weights.
This bridges the gap between keyword matching and semantic search.

SPLADE is increasingly used as the sparse component in hybrid retrieval systems, often outperforming classic BM25 as the sparse leg of the search.

The Right Role for Keyword Search in RAG

Don’t think of keyword search as “old technology to be replaced.” Think of it as a complementary retrieval signal that is:

Fast (inverted index lookup is O(log n))
Exact for precise terms
Strong on out-of-domain queries
Cheap to maintain (no GPU needed, no embedding API costs)

In a hybrid RAG system, keyword search catches what semantic search misses, and semantic search catches what keyword search misses. Together, they cover far more of your query distribution than either does alone.