Dense Retrieval: Semantic Vector Search in RAG

Dense retrieval uses neural embeddings to search for semantically similar documents. Unlike keyword-based methods, dense retrieval understands meaning, handling paraphrases, synonyms, and conceptual connections that keyword matching misses.

How Dense Retrieval Works

User Query
    ↓
[Embedding Model]  → Query Vector (e.g., 768 dimensions)
    ↓
[Vector Database]  → Find similar document vectors
    ↓
[Ranking]          → Sort by similarity
    ↓
Top K Documents

The core insight: similar documents have similar embeddings.

The Dense Retrieval Pipeline

Step 1: Encoding Queries and Documents

Document encoding (offline, done once):

For each document chunk:
  embedding = embedding_model(text)
  store(chunk_id, embedding, text, metadata)

Query encoding (online, done per search):

query_embedding = embedding_model(query)

Both use the same embedding model to ensure vector space consistency.
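
A minimal sketch of the two encoding paths follows. It assumes a toy hashed bag-of-words function (`toy_embed`, hypothetical) in place of a real neural embedding model, and an in-memory dict in place of a vector database. It only illustrates the data flow; the vectors it produces carry no semantic meaning.

```python
import hashlib
import math

DIM = 16  # real models typically use hundreds of dimensions

def toy_embed(text, dim=DIM):
    """Stand-in for embedding_model(text): hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

# Offline: encode every chunk once and store it with its text and metadata.
store = {}
chunks = {
    "c1": "how to repair a doorbell",
    "c2": "espresso brewing guide",
}
for chunk_id, text in chunks.items():
    store[chunk_id] = {"embedding": toy_embed(text), "text": text, "metadata": {}}

# Online: encode each incoming query with the same function.
query_embedding = toy_embed("doorbell repair")
```

Because the query and the documents pass through the same function, their vectors share dimensions and scale and can be compared directly.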

Step 2: Similarity Computation

Given a query embedding, compute similarity with all document embeddings.

Cosine similarity (most common):

similarity(query, doc) = (query · doc) / (||query|| × ||doc||)

For 1M documents with 768-dim embeddings:

Raw computation: 1M dot products (about 768M multiply-add operations)
Time without indexing: up to ~5 seconds, depending on hardware
Time with indexing (next step): ~10-100ms
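
To make the formula and the brute-force cost concrete, here is a small sketch in plain Python. The function name `cosine_similarity` and the sample vectors are illustrative. A full scan performs one dot product per document, and each dot product takes one multiply-add per dimension.

```python
import math

def cosine_similarity(a, b):
    """(a · b) / (||a|| × ||b||); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # longer vector, same direction
orthogonal = [0.0, 2.0]

sim_same = cosine_similarity(query, same_direction)
sim_orth = cosine_similarity(query, orthogonal)

# Brute-force cost for the 1M-document example:
# one dot product per document, one multiply-add per dimension.
n_docs, dim = 1_000_000, 768
multiply_adds = n_docs * dim
```

Cosine similarity ignores vector length, so `same_direction` scores as a perfect match for `query` even though it is three times longer.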

Step 3: Retrieval and Ranking

Return top K by similarity score.

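import numpy as np

def retrieve(query, top_k=5):
    # model, compute_similarity, all_embeddings, documents are defined elsewhere
    query_embedding = model.encode(query)
    similarities = compute_similarity(query_embedding, all_embeddings)
    # argsort is ascending: reverse so the best match comes first
    top_k_indices = np.argsort(similarities)[::-1][:top_k]
    return [documents[i] for i in top_k_indices]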
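
A self-contained variant of the function above is sketched here. It assumes the embeddings are already L2-normalized (so the dot product equals cosine similarity) and uses `heapq.nlargest` to avoid sorting the full score list. The name `retrieve_top_k` and the sample data are illustrative.

```python
import heapq

def retrieve_top_k(query_embedding, doc_embeddings, documents, top_k=5):
    """Return (score, document) pairs for the top_k highest dot products, best first."""
    scored = (
        (sum(q * d for q, d in zip(query_embedding, emb)), idx)
        for idx, emb in enumerate(doc_embeddings)
    )
    best = heapq.nlargest(top_k, scored)
    return [(score, documents[idx]) for score, idx in best]

documents = ["doorbell repair", "coffee guide", "wiring basics"]
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]

results = retrieve_top_k([1.0, 0.0], doc_embeddings, documents, top_k=2)
```

Returning the score alongside each document makes it easier to apply a similarity threshold or to merge results with another retriever later.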

Dense Retrieval Indexing

Brute-force similarity computation doesn’t scale. Indexing enables fast retrieval.

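Common index options (covered in detail later):

FAISS (Facebook AI Similarity Search): Widely used library; implements IVF, HNSW, and other index types
Annoy: Simple, performant tree-based library
HNSW: Graph-based hierarchical index; fast at query time
IVF: Inverted file index; clusters vectors and searches only the nearest clusters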

All share a design: precompute and organize embeddings for fast nearest-neighbor search.
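
As a rough illustration of the "precompute and organize" idea, the sketch below implements a toy inverted-file (IVF) style index in plain Python. It is a simplification and not how FAISS or any other library is implemented: centroids are supplied by hand rather than learned with k-means, and names such as `ToyIVFIndex` and `n_probe` are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyIVFIndex:
    """Buckets vectors by nearest centroid; a query scans only a few buckets."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]  # one inverted list per centroid

    def _nearest_centroids(self, vector, count):
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: squared_distance(vector, self.centroids[i]),
        )
        return order[:count]

    def add(self, doc_id, vector):
        bucket = self._nearest_centroids(vector, 1)[0]
        self.buckets[bucket].append((doc_id, vector))

    def search(self, query, top_k=5, n_probe=1):
        # Scan only the n_probe buckets closest to the query (approximate search).
        candidates = []
        for bucket in self._nearest_centroids(query, n_probe):
            candidates.extend(self.buckets[bucket])
        candidates.sort(key=lambda item: dot(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in candidates[:top_k]]

index = ToyIVFIndex(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("a", [0.9, 0.1])
index.add("b", [0.8, 0.3])
index.add("c", [0.1, 0.95])

hits = index.search([1.0, 0.0], top_k=2, n_probe=1)
```

Raising `n_probe` scans more buckets, trading speed for recall; with `n_probe` equal to the number of centroids the search degenerates to a full scan.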

Strengths of Dense Retrieval

1. Semantic Understanding

Captures meaning beyond keywords.

Query: "How do I fix my broken doorbell?"
Dense retrieval matches:
- Documents about doorbell repair ✓
- Documents about electrical troubleshooting ✓
- Documents about home maintenance ✓

Keyword retrieval might match only the exact phrase "broken doorbell".

2. Paraphrase Handling

Understands equivalent statements.

Query: "smartphone prices"
Matches:
- "how much do phones cost" ✓
- "mobile device pricing" ✓
- "cell phone rates" ✓

3. Cross-Language Similarity

Multilingual embeddings find semantically similar documents across languages.

Query in English: "best coffee shops in Rome"
Matches documents in Italian about Roman coffee culture

Weaknesses of Dense Retrieval

1. Semantic Drift

Embeddings can lock onto the wrong sense of an ambiguous term, returning documents that sit close to the query in vector space but are not what the user wants.

Query: "python snake facts"
Model: May interpret "python" as the programming language
Incorrectly retrieves:
- Python programming tutorials
- Software engineering blogs
- Code examples

Solution: Hybrid retrieval (combine with keyword search).

2. Temporal Sensitivity

Embeddings reflect patterns in the model's training data and may miss recent events.

Query: "latest iPhone model"
Embedding trained on 2023 data
May not understand 2024 releases

Solution: Update the embedding model and reindex periodically, or use hybrid retrieval.

3. Computational Cost

Requires:

Embedding model inference (GPUs helpful)
Vector similarity computation
Index updates as documents change

4. Cold Start Problem

New documents must be embedded and indexed before they can be retrieved.

Add new document
Wait for embedding computation
Then queryable
Delay: Seconds to minutes for large documents

Dense Retrieval Optimization Techniques

Technique 1: Query Reformulation

Transform the user query for better retrieval.

Original query: "What's the best way to cook chicken?"
Reformulated: "cooking methods for chicken recipes preparation"

Result: Matches more cooking documentation

Tools: GPT-4, fine-tuned models for query expansion.

Technique 2: Negative Sampling

Improve ranking by teaching the model what NOT to match.

Training data: (query, relevant_doc, irrelevant_doc)
Learn: Make relevant_doc embedding closer to query
       Make irrelevant_doc embedding far from query

Result: Better distinction between similar-looking documents
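
One common way to express this objective is a triplet margin loss. The text does not name a specific loss, so treat this as one illustrative choice; the function names and the `margin` value are assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def triplet_loss(query, relevant_doc, irrelevant_doc, margin=0.2):
    """Zero once the relevant doc beats the irrelevant one by at least `margin`."""
    pos = cosine_similarity(query, relevant_doc)
    neg = cosine_similarity(query, irrelevant_doc)
    return max(0.0, margin - pos + neg)

query = [1.0, 0.0]
# Relevant doc aligned with the query, irrelevant doc orthogonal to it.
well_separated = triplet_loss(query, [1.0, 0.0], [0.0, 1.0])
# The reverse situation: the irrelevant doc looks like the better match.
confused = triplet_loss(query, [0.0, 1.0], [1.0, 0.0])
```

During training, minimizing this loss pulls the relevant document's embedding toward the query and pushes the irrelevant one away; triplets that are already well separated contribute nothing.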

Technique 3: Hard Negative Mining

Focus training on documents that are easy to confuse.

Model predictions:
- Query matches 100 documents at 0.95 similarity
- True relevant doc is at 0.92 similarity
- Problem: Relevant doc not in top 5

Solution: Include these confusing documents in training
Result: Better discrimination
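
A minimal sketch of the mining step, assuming similarity scores from the current model are already available and the set of truly relevant document ids is known. `mine_hard_negatives` is a hypothetical helper, not part of any library.

```python
def mine_hard_negatives(scores, relevant_ids, num_negatives=2):
    """Pick the highest-scoring documents that are NOT relevant.

    scores: dict mapping doc_id -> similarity to the query under the current model.
    """
    non_relevant = [
        (score, doc_id)
        for doc_id, score in scores.items()
        if doc_id not in relevant_ids
    ]
    non_relevant.sort(reverse=True)
    return [doc_id for _, doc_id in non_relevant[:num_negatives]]

# d3 is the true answer but is outscored by d1 and d2.
scores = {"d1": 0.95, "d2": 0.94, "d3": 0.92, "d4": 0.30}
hard_negatives = mine_hard_negatives(scores, relevant_ids={"d3"})
```

The mined ids can then serve as the irrelevant documents in (query, relevant_doc, irrelevant_doc) training triplets. Low-scoring documents such as `d4` are skipped because the model already separates them from the query.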

Technique 4: Re-ranking

Retrieve broadly, then rerank with a specialized model.

Stage 1: Dense retrieval → Top 50 documents
         Fast but potentially noisy

Stage 2: Cross-encoder reranking → Top 5 documents
         Slower but more accurate

Result: Recall of broad retrieval + precision of specialized model
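
The two stages can be sketched as follows. Both scoring functions are passed in as callables; the toy scorers used here (`cheap_score`, `careful_score`) are hypothetical stand-ins for a bi-encoder similarity and a cross-encoder, not real models.

```python
def retrieve_then_rerank(query, documents, first_stage, second_stage,
                         candidates=50, top_k=5):
    """Stage 1 narrows the corpus cheaply; stage 2 rescores only the survivors."""
    shortlist = sorted(documents, key=lambda d: first_stage(query, d), reverse=True)
    shortlist = shortlist[:candidates]
    reranked = sorted(shortlist, key=lambda d: second_stage(query, d), reverse=True)
    return reranked[:top_k]

def cheap_score(query, doc):
    # Toy first stage: number of shared words.
    return len(set(query.split()) & set(doc.split()))

def careful_score(query, doc):
    # Toy second stage: shared words, penalized by document length.
    return cheap_score(query, doc) / len(doc.split())

docs = [
    "fix doorbell",
    "fix doorbell wiring at home with basic tools",
    "coffee in rome",
]
top = retrieve_then_rerank("fix doorbell", docs, cheap_score, careful_score,
                           candidates=2, top_k=1)
```

The expensive scorer runs on at most `candidates` documents, which keeps its cost bounded regardless of corpus size.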

Dense Retrieval vs. Traditional IR

Aspect	Dense Retrieval	Keyword (BM25)
Understanding	Semantic	Lexical
Speed	Fast with indexing	Fastest
Paraphrases	Excellent	Poor
Recall on paraphrased queries	High	Lower
Interpretability	Black box	Clear
Indexing	Complex	Simple
Cost	Higher	Lower

Hybrid Approach (Recommended)

Combine dense retrieval with keyword search:

Stage 1: Parallel retrieval
  Dense search: Top 10 by semantic similarity
  Keyword search: Top 10 by BM25 score

Stage 2: Merge and rerank
  Combine results
  Rerank by ensemble score
  Return top 5

Result: Covers semantic understanding + keyword specificity
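
One simple way to implement the merge step is reciprocal rank fusion (RRF), which combines rankings without requiring the two score scales to be comparable. The text above only says "ensemble score", so RRF is an assumption here, as are the function name and the commonly used constant `k=60`.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of doc ids; each list is ordered best first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)
    return ordered[:top_n]

dense_hits = ["d1", "d2", "d3"]     # top results by semantic similarity
keyword_hits = ["d2", "d4", "d1"]   # top results by BM25 score

merged = reciprocal_rank_fusion([dense_hits, keyword_hits], top_n=3)
```

Documents that appear in both lists (`d1`, `d2`) accumulate score from each and rise above documents found by only one retriever.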

Implementation: Using Vector Databases

Many teams use a vector database built for this purpose:

Pinecone (managed):

from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("documents")

# Upsert embeddings
index.upsert(vectors=[
    ("doc1", embedding1, {"text": "...", "source": "..."}),
    ("doc2", embedding2, {"text": "...", "source": "..."}),
])

# Query (include_metadata=True returns the stored text and source with each match)
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

Weaviate (self-hosted, v3 Python client syntax):

from weaviate import Client

client = Client("http://localhost:8080")

# Query
result = client.query.get("Document",
    ["content", "source"]
).with_near_vector({
    "vector": query_embedding
}).with_limit(5).do()

Dense Retrieval Performance

Rough figures for modern dense retrieval (actual numbers depend on the model, hardware, and index settings):

Speed:

1M documents, top 5 retrieval: 10-50ms
With batch processing: 1000 queries/second

Quality (TREC-DL evaluation):

nDCG@10: 0.55-0.65 (very good)
Recall@100: 0.80-0.90
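
For reference, these two metrics can be computed as in the sketch below. It uses the linear-gain form of DCG (relevance divided by log2(rank + 1)); some evaluations use an exponential gain instead. As a simplification, the ideal ranking is derived from the judged list passed in. Function names and sample judgments are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """relevances: graded relevance of each result, in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k):
    if not relevant_ids:
        return 0.0
    found = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return found / len(relevant_ids)

perfect = ndcg_at_k([3, 2, 1, 0], k=10)   # already in ideal order
swapped = ndcg_at_k([0, 1, 2, 3], k=10)   # best results ranked last
recall = recall_at_k(["d1", "d7", "d3"], relevant_ids={"d1", "d3", "d9", "d4"}, k=3)
```

nDCG rewards placing the most relevant results near the top, while recall only asks whether relevant documents appear anywhere in the first k results.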

Cost:

GPU inference: ~$0.001 per embedding (self-hosted)
Vector database: $100-1000/month depending on scale

Dense Retrieval in 2024

Trends:

Multimodal dense retrieval (text + images)
Query-document asymmetric embeddings
Dense retrieval + LLM reranking becoming standard
Real-time indexing (new documents become searchable in under a second)
Sparse vectors + dense vectors (hybrid representation)

Dense retrieval is the foundation of modern RAG. Understanding it thoroughly is essential for building effective systems.