Dense Retrieval: Semantic Vector Search in RAG
Dense retrieval uses neural embeddings to search for semantically similar documents. Unlike keyword-based methods, dense retrieval understands meaning, handling paraphrases, synonyms, and conceptual connections that keyword matching misses.
How Dense Retrieval Works

    User Query
        ↓
    [Embedding Model] → Query Vector (e.g., 768 dimensions)
        ↓
    [Vector Database] → Find similar document vectors
        ↓
    [Ranking] → Sort by similarity
        ↓
    Top K Documents

The core insight: similar documents have similar embeddings.
The Dense Retrieval Pipeline
Step 1: Encoding Queries and Documents
Document encoding (offline, done once):

    For each document chunk:
        embedding = embedding_model(text)
        store(chunk_id, embedding, text, metadata)

Query encoding (online, done per search):

    query_embedding = embedding_model(query)

Both use the same embedding model to ensure vector space consistency.
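
The two encoding paths can be sketched as follows. This is a minimal illustration, not a real embedding model: `toy_embed` is a hypothetical hashed bag-of-words stand-in for `embedding_model`, and `DIM`, `chunks`, and `index` are names invented for the example. It captures no semantics; it only shows that documents and queries pass through the same function.

```python
import hashlib
import math

DIM = 16  # toy dimensionality; real models commonly use hundreds of dimensions


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for embedding_model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Offline: encode every chunk once and store it next to its text and metadata.
chunks = {
    "c1": ("Replace the doorbell transformer", {"source": "repairs.md"}),
    "c2": ("Espresso bars near the Pantheon", {"source": "rome.md"}),
}
index = [
    {"chunk_id": cid, "embedding": toy_embed(text), "text": text, "metadata": meta}
    for cid, (text, meta) in chunks.items()
]

# Online: encode each incoming query with the same function.
query_embedding = toy_embed("doorbell transformer replacement")
```

Swapping `toy_embed` for a trained model changes the vectors but not the shape of the pipeline.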
Step 2: Similarity Computation
Given a query embedding, compute similarity with all document embeddings.
Cosine similarity (most common):

    similarity(query, doc) = (query · doc) / (||query|| × ||doc||)

For 1M documents with 768-dim embeddings:
- Raw computation: 1M dot products, about 768M multiply-add operations per query
- Time without indexing: ~5 seconds
- Time with indexing (next step): ~10-100ms
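
As a worked example of the formula and the cost estimate above, here is a plain-Python sketch. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this example; production code would typically use a vectorized library instead.

```python
import math


def cosine_similarity(query: list[float], doc: list[float]) -> float:
    """(query · doc) / (||query|| × ||doc||); returns 0.0 if either vector is all zeros."""
    dot = sum(q * d for q, d in zip(query, doc))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum(d * d for d in doc))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


# Vectors pointing the same way score ~1.0 regardless of length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Brute-force cost for the corpus size quoted above:
# one dot product per document, one multiply-add per dimension.
num_docs, dim = 1_000_000, 768
multiply_adds = num_docs * dim  # 768,000,000 per query
```

If embeddings are normalized to unit length at indexing time, the denominator is always 1 and the similarity reduces to a plain dot product.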
Step 3: Retrieval and Ranking
Return top K by similarity score.
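
    import numpy as np

    # Assumes `model`, `all_embeddings` (one L2-normalized row per document),
    # and `documents` were built during offline encoding.
    def retrieve(query, top_k=5):
        query_embedding = model.encode(query)
        # Dot product ranks like cosine similarity when rows are normalized.
        similarities = all_embeddings @ query_embedding
        # argsort is ascending, so reverse it to put the best match first.
        top_k_indices = np.argsort(similarities)[::-1][:top_k]
        return [documents[i] for i in top_k_indices]

Dense Retrieval Indexing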
Brute-force similarity computation doesn’t scale. Indexing enables fast retrieval.
Dense indexes (covered in detail later):
- FAISS (Facebook AI Similarity Search): library widely used in production; implements IVF, HNSW, and other index types
- Annoy: simple, performant tree-based library
- HNSW: fast hierarchical graph index
- IVF: inverted file index that searches only the closest clusters
All share a design: precompute and organize embeddings for fast, usually approximate, nearest-neighbor search.
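
To make the shared idea concrete, here is a toy inverted-file (IVF) style index in plain Python. It is a sketch, not how FAISS is implemented: the class name `ToyIVFIndex`, its parameters, and the choice of sampled data points as centroids (instead of k-means) are all assumptions made for brevity.

```python
import math
import random


def normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ToyIVFIndex:
    """Hypothetical IVF-style index over unit-length vectors."""

    def __init__(self, vectors: list[list[float]], n_lists: int, seed: int = 0):
        self.vectors = vectors
        # Assumption: sample centroids from the data instead of running k-means.
        self.centroids = random.Random(seed).sample(vectors, n_lists)
        # Offline step: file every vector under its most similar centroid.
        self.lists: list[list[int]] = [[] for _ in range(n_lists)]
        for i, vec in enumerate(vectors):
            self.lists[self._closest_lists(vec, 1)[0]].append(i)

    def _closest_lists(self, vec: list[float], n: int) -> list[int]:
        by_similarity = sorted(
            range(len(self.centroids)),
            key=lambda c: dot(vec, self.centroids[c]),
            reverse=True,
        )
        return by_similarity[:n]

    def search(self, query: list[float], top_k: int, n_probe: int) -> list[int]:
        # Online step: score only the vectors filed under the n_probe closest centroids.
        candidates = [i for c in self._closest_lists(query, n_probe) for i in self.lists[c]]
        candidates.sort(key=lambda i: dot(query, self.vectors[i]), reverse=True)
        return candidates[:top_k]


rng = random.Random(42)
data = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(200)]
ivf = ToyIVFIndex(data, n_lists=10)
query = data[0]

approx = ivf.search(query, top_k=5, n_probe=2)   # scans only 2 of the 10 lists
exact = ivf.search(query, top_k=5, n_probe=10)   # probing every list equals brute force
```

The `n_probe` parameter is the speed/recall dial: fewer probed lists means fewer comparisons but a higher chance of missing a true neighbor that was filed under a different centroid.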
Strengths of Dense Retrieval
1. Semantic Understanding
Captures meaning beyond keywords.

    Query: "How do I fix my broken doorbell?"
    Dense retrieval matches:
    - Documents about doorbell repair ✓
    - Documents about electrical troubleshooting ✓
    - Documents about home maintenance ✓

    Keyword retrieval might match only the exact phrase "broken doorbell"

2. Paraphrase Handling
Understands equivalent statements.

    Query: "smartphone prices"
    Matches:
    - "how much do phones cost" ✓
    - "mobile device pricing" ✓
    - "cell phone rates" ✓

3. Cross-Language Similarity
Multilingual embeddings find semantically similar documents across languages.

    Query in English: "best coffee shops in Rome"
    Matches documents in Italian about Roman coffee culture

Weaknesses of Dense Retrieval
1. Semantic Drift
Embeddings may prioritize surface similarity over what users want.

    Query: "python snake facts"
    Model: Interprets as Python programming language
    Incorrectly retrieves:
    - Python programming tutorials
    - Software engineering blogs
    - Code examples

Solution: Hybrid retrieval (combine with keyword search).
2. Temporal Sensitivity
Embeddings reflect patterns in the model's training data and may miss recent events.

    Query: "latest iPhone model"
    Embedding trained on 2023 data
    May not understand 2024 releases

Solution: Reindex periodically or use hybrid retrieval.

3. Computational Cost
Requires:
- Embedding model inference (GPUs helpful)
- Vector similarity computation
- Index updates as documents change
4. Cold Start Problem
New documents need embedding before retrieval.

    Add new document
    Wait for embedding computation
    Then queryable
    Delay: Seconds to minutes for large documents

Dense Retrieval Optimization Techniques
Technique 1: Query Reformulation
Transform user query for better retrieval.

    Original query: "What's the best way to cook chicken?"
    Reformulated: "cooking methods for chicken recipes preparation"

Result: Matches more cooking documentation

Tools: GPT-4, fine-tuned models for query expansion.
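
A rule-based sketch of query expansion, assuming a small hand-written synonym table. `EXPANSIONS`, `STOPWORDS`, and `reformulate` are hypothetical names; the tools named above would generate rewrites with a language model rather than fixed rules.

```python
import re

# Hypothetical expansion table; an LLM or fine-tuned model would produce
# these related terms dynamically instead of reading them from a dict.
EXPANSIONS = {
    "cook": ["cooking", "recipes", "preparation"],
    "fix": ["repair", "troubleshooting"],
}
STOPWORDS = {"what's", "the", "best", "way", "to", "how", "do", "i", "my", "a"}


def reformulate(query: str) -> str:
    """Drop filler words and append related terms for each remaining token."""
    tokens = re.findall(r"[a-z']+", query.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    expanded = list(kept)
    for token in kept:
        for extra in EXPANSIONS.get(token, []):
            if extra not in expanded:
                expanded.append(extra)
    return " ".join(expanded)


reformulated = reformulate("What's the best way to cook chicken?")
# "cook chicken cooking recipes preparation"
```

The reformulated string is then embedded in place of the raw query, so the rest of the pipeline stays unchanged.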
Technique 2: Negative Sampling
Improve ranking by teaching the model what NOT to match.

    Training data: (query, relevant_doc, irrelevant_doc)
    Learn: Make relevant_doc embedding closer to query
           Make irrelevant_doc embedding far from query
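
One common way to express this objective is a triplet margin loss. The description above does not fix a particular loss, so treat the function below, its `margin` default, and the toy vectors as assumptions for illustration. In real training the loss would be backpropagated through the embedding model; here it is only evaluated.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_margin_loss(query, relevant_doc, irrelevant_doc, margin: float = 0.2) -> float:
    """Zero once the relevant doc beats the irrelevant one by at least `margin`."""
    positive = cosine(query, relevant_doc)
    negative = cosine(query, irrelevant_doc)
    return max(0.0, margin - positive + negative)


query = [1.0, 0.0]
relevant_doc = [0.9, 0.1]     # nearly parallel to the query
irrelevant_doc = [0.0, 1.0]   # orthogonal to the query

# Already well separated: nothing left to learn from this triplet.
well_separated = triplet_margin_loss(query, relevant_doc, irrelevant_doc)
# Roles swapped to simulate a model that ranks the wrong doc first: large loss.
confused = triplet_margin_loss(query, irrelevant_doc, relevant_doc)
```

Triplets that already satisfy the margin contribute zero loss, which is why the choice of negatives matters so much for training signal.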
Result: Better distinction between similar-looking documents

Technique 3: Hard Negative Mining
Focus training on documents that are easy to confuse.

    Model predictions:
    - Query matches 100 documents at 0.95 similarity
    - True relevant doc is at 0.92 similarity
    - Problem: Relevant doc not in top 5

Solution: Include these confusing documents in training

Result: Better discrimination

Technique 4: Re-ranking
Retrieve broadly, then rerank with a specialized model.

    Stage 1: Dense retrieval → Top 50 documents
             Fast but potentially noisy

    Stage 2: Cross-encoder reranking → Top 5 documents
             Slower but more accurate
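
A minimal sketch of the two-stage pattern, assuming pluggable scoring functions. `token_overlap` stands in for dense similarity and `phrase_match` for a cross-encoder; both are toy scorers invented for this example, as is the `retrieve_then_rerank` helper.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(query: str, documents: list[str], first_stage: Scorer,
                         reranker: Scorer, n_candidates: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1: cheap score over the whole corpus, keep a broad candidate set.
    candidates = sorted(documents, key=lambda d: first_stage(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: expensive score over the candidates only.
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)[:top_k]


def token_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: fraction of query tokens found in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)


def phrase_match(query: str, doc: str) -> float:
    """Toy reranker: looks at query and document together, rewarding an exact phrase."""
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return token_overlap(query, doc) + bonus


docs = [
    "a router is a device you can reset",
    "how to reset a router safely",
    "router buying guide",
    "password reset for email accounts",
]
results = retrieve_then_rerank("reset a router", docs, token_overlap, phrase_match,
                               n_candidates=3, top_k=2)
# ["how to reset a router safely", "a router is a device you can reset"]
```

The first two documents tie in stage 1; only the reranker, which scores the query and document jointly, moves the better answer to the top.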
Result: Recall of broad retrieval + precision of specialized model

Dense Retrieval vs. Traditional IR
| Aspect | Dense Retrieval | Keyword (BM25) |
|---|---|---|
| Understanding | Semantic | Lexical |
| Speed | Fast with indexing | Fastest |
| Paraphrases | Excellent | Poor |
| Recall | High | Lower |
| Interpretability | Black box | Clear |
| Indexing | Complex | Simple |
| Cost | Higher | Lower |
Hybrid Approach (Recommended)
Combine dense retrieval with keyword search:

    Stage 1: Parallel retrieval
             Dense search: Top 10 by semantic similarity
             Keyword search: Top 10 by BM25 score

    Stage 2: Merge and rerank
             Combine results
             Rerank by ensemble score
             Return top 5
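
One simple way to build the ensemble score is reciprocal rank fusion (RRF), which needs only each retriever's ranked list. RRF is an assumption here, since the pipeline above does not say how the scores are combined; `reciprocal_rank_fusion` is a hypothetical helper, `k=60` is a commonly used constant, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Score each doc by the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_n]


dense_top = ["d3", "d1", "d7", "d2"]     # hypothetical dense results, best first
keyword_top = ["d1", "d9", "d3", "d4"]   # hypothetical BM25 results, best first

merged = reciprocal_rank_fusion([dense_top, keyword_top], top_n=3)
# ["d1", "d3", "d9"]
```

Documents found by both retrievers (`d1`, `d3`) rise to the top, and because only ranks are used, the incomparable raw scores from BM25 and cosine similarity never need to be normalized.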
Result: Covers semantic understanding + keyword specificity

Implementation: Using Vector Databases
Many production systems use a vector database built for this workload:
Pinecone (managed):

    from pinecone import Pinecone

    pc = Pinecone(api_key="...")
    index = pc.Index("documents")

    # Upsert embeddings
    index.upsert(vectors=[
        ("doc1", embedding1, {"text": "...", "source": "..."}),
        ("doc2", embedding2, {"text": "...", "source": "..."}),
    ])

    # Query
    results = index.query(vector=query_embedding, top_k=5)

Weaviate (self-hosted):

    from weaviate import Client  # weaviate-client v3 API

    client = Client("http://localhost:8080")

    # Query
    result = client.query.get("Document", ["content", "source"]).with_near_vector({
        "vector": query_embedding
    }).with_limit(5).do()

Dense Retrieval Performance
Typical metrics for modern dense retrieval:
Speed:
- 1M documents, top 5 retrieval: 10-50ms
- With batch processing: 1000 queries/second
Quality (TREC-DL evaluation):
- nDCG@10: 0.55-0.65 (very good)
- Recall@100: 0.80-0.90
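
For reference, both metrics can be computed in a few lines. This sketch assumes graded relevance labels in a dict and uses linear gain in the DCG sum (some definitions use 2^rel - 1 instead); the function names and sample judgments are illustrative only.

```python
import math


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking; unlabeled ids count as 0."""
    def dcg(gains: list[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


ranked = ["a", "b", "c", "d"]               # hypothetical system output, best first
relevance = {"a": 3, "c": 2, "e": 1}        # hypothetical graded judgments

recall = recall_at_k(ranked, set(relevance), k=3)   # 2 of 3 relevant docs retrieved
ndcg = ndcg_at_k(ranked, relevance, k=3)            # penalized for missing "e" and ranking "c" third
```

Recall only asks whether relevant documents were found; nDCG also rewards putting the most relevant ones near the top, which is why the two are usually reported together.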
Cost:
- GPU inference: ~$0.001 per embedding (self-hosted)
- Vector database: $100-1000/month depending on scale
Dense Retrieval in 2024
Trends:
- Multimodal dense retrieval (text + images)
- Query-document asymmetric embeddings
- Dense retrieval + LLM reranking becoming standard
- Real-time indexing (new documents searchable with sub-second latency)
- Sparse vectors + dense vectors (hybrid representation)
Dense retrieval is the foundation of modern RAG. Understanding it thoroughly is essential for building effective systems.