Dense Retrieval for RAG: Semantic Vector Search Methods

Dense Retrieval: Semantic Vector Search in RAG

Dense retrieval uses neural embeddings to search for semantically similar documents. Unlike keyword-based methods, dense retrieval understands meaning, handling paraphrases, synonyms, and conceptual connections that keyword matching misses.

How Dense Retrieval Works

User Query
    ↓
[Embedding Model] → Query Vector (e.g., 768 dimensions)
    ↓
[Vector Database] → Find similar document vectors
    ↓
[Ranking] → Sort by similarity
    ↓
Top K Documents

The core insight: similar documents have similar embeddings.

The Dense Retrieval Pipeline

Step 1: Encoding Queries and Documents

Document encoding (offline, done once):

for chunk in document_chunks:
    embedding = embedding_model(chunk.text)
    store(chunk.id, embedding, chunk.text, chunk.metadata)

Query encoding (online, done per search):

query_embedding = embedding_model(query)

Both use the same embedding model to ensure vector space consistency.
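The offline/online split can be sketched in a few lines. This is a minimal illustration, not a real pipeline: toy_embed is a hypothetical stand-in that hashes words into buckets, so it captures word overlap only, and the chunks and the in-memory store are made up for the example. A real system would call a neural embedding model at the same two points.

```python
import hashlib
import math

DIM = 64  # toy dimensionality, chosen for the example

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a neural embedding model.

    Hashes each lowercase token into one of DIM buckets and L2-normalizes.
    It captures word overlap only, not meaning.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

# Offline: encode every chunk once and store it with its text and metadata.
chunks = [
    {"id": "c1", "text": "Replace the doorbell transformer", "metadata": {"source": "manual"}},
    {"id": "c2", "text": "Brew espresso at nine bars", "metadata": {"source": "blog"}},
]
store = {
    c["id"]: {"embedding": toy_embed(c["text"]), "text": c["text"], "metadata": c["metadata"]}
    for c in chunks
}

# Online: encode the query with the SAME function so the vectors are comparable.
query_embedding = toy_embed("doorbell transformer replacement")
```

Swapping in a different encoder for queries than for documents would place the two sets of vectors in unrelated spaces, which is why both paths go through one function here.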

Step 2: Similarity Computation

Given a query embedding, compute similarity with all document embeddings.

Cosine similarity (most common):

similarity(query, doc) = (query · doc) / (||query|| × ||doc||)
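As a minimal sketch, the formula translates directly into plain Python. The function name and the choice to return 0.0 for zero-length vectors are assumptions made for this example.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
```

When embeddings are L2-normalized ahead of time, both norms equal 1 and cosine similarity reduces to a plain dot product, which is why many systems normalize once at indexing time.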

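For 1M documents with 768-dim embeddings:

  • Raw computation: 1M dot products (about 768M multiply-adds)
  • Time without indexing: ~5 seconds in a naive implementation (vectorized code is faster, but cost still grows linearly with corpus size)
  • Time with indexing (covered below): ~10-100ms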
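The brute-force workload can be checked with quick arithmetic: each of the 1M documents needs one 768-dimensional dot product, and each dot product is 768 multiply-add operations. The memory figure below assumes float32 storage (4 bytes per value), which is an assumption of this sketch rather than something stated above.

```python
num_docs = 1_000_000
dim = 768

# One dot product per document, `dim` multiply-adds per dot product.
multiply_adds = num_docs * dim

# Assumption: embeddings are stored as float32 (4 bytes per value).
bytes_per_value = 4
index_bytes = num_docs * dim * bytes_per_value
index_gb = index_bytes / 1e9

print(f"{multiply_adds:,} multiply-adds per query")
print(f"{index_gb:.3f} GB of raw vectors")
```

Both numbers scale linearly with corpus size, which is the motivation for the approximate indexes discussed later.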

Step 3: Retrieval and Ranking

Return top K by similarity score.

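import numpy as np

def retrieve(query, top_k=5):
    # model, all_embeddings, and documents are assumed to be defined elsewhere
    query_embedding = model.encode(query)
    similarities = compute_similarity(query_embedding, all_embeddings)
    # argsort is ascending: take the last top_k, then reverse for best-first order
    top_k_indices = np.argsort(similarities)[-top_k:][::-1]
    return [documents[i] for i in top_k_indices]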

Dense Retrieval Indexing

Brute-force similarity computation doesn’t scale. Indexing enables fast retrieval.

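Common index libraries and algorithms (covered in detail later):

  • FAISS (Facebook AI Similarity Search): Widely used library that implements several index types, including IVF and HNSW
  • Annoy: Simple, performant tree-based library
  • HNSW: Fast hierarchical graph-based index
  • IVF: Inverted file index that clusters vectors and searches only the nearest clusters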

All share a design: precompute and organize embeddings for fast nearest-neighbor search.
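To make the "precompute and organize" idea concrete, here is a toy inverted-file (IVF) index in plain Python. It is a sketch under stated assumptions: the class name ToyIVFIndex is hypothetical, the centroids are fixed by hand (real IVF indexes learn them, typically with k-means), and scoring uses a plain dot product.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyIVFIndex:
    """Minimal inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.cells = [[] for _ in centroids]  # one posting list per centroid

    def _nearest_cells(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(vec, self.centroids[i]))
        return order[:n]

    def add(self, doc_id, vec):
        cell = self._nearest_cells(vec, 1)[0]
        self.cells[cell].append((doc_id, vec))

    def search(self, query, top_k=5, nprobe=1):
        # Scan only the nprobe cells closest to the query, not the whole corpus.
        candidates = []
        for cell in self._nearest_cells(query, nprobe):
            candidates.extend(self.cells[cell])
        scored = sorted(candidates, key=lambda item: dot(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in scored[:top_k]]

index = ToyIVFIndex(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("a", [0.9, 0.1])
index.add("b", [0.8, 0.3])
index.add("c", [0.1, 0.95])
hits = index.search([1.0, 0.0], top_k=2, nprobe=1)
```

Raising nprobe scans more cells, trading speed for recall: a relevant vector that landed in a neighboring cell is only found if that cell is probed.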

Strengths of Dense Retrieval

1. Semantic Understanding

Captures meaning beyond keywords.

Query: "How do I fix my broken doorbell?"
Dense retrieval matches:
- Documents about doorbell repair ✓
- Documents about electrical troubleshooting ✓
- Documents about home maintenance ✓
Keyword retrieval might match only documents that contain the literal terms "broken" or "doorbell"

2. Paraphrase Handling

Understands equivalent statements.

Query: "smartphone prices"
Matches:
- "how much do phones cost" ✓
- "mobile device pricing" ✓
- "cell phone rates" ✓

3. Cross-Language Similarity

Multilingual embeddings find semantically similar documents across languages.

Query in English: "best coffee shops in Rome"
Matches documents in Italian about Roman coffee culture

Weaknesses of Dense Retrieval

1. Semantic Drift

Embeddings can resolve an ambiguous term toward the wrong sense, returning documents that are close in vector space but not what the user wants.

Query: "python snake facts"
Model: May interpret "python" as the programming language
Incorrectly retrieves:
- Python programming tutorials
- Software engineering blogs
- Code examples

Solution: Hybrid retrieval (combine with keyword search).

2. Temporal Sensitivity

Embeddings reflect patterns in their training data and may miss recent events.

Query: "latest iPhone model"
Embedding trained on 2023 data
May not understand 2024 releases

Solution: Refresh the embedding model and reindex periodically, or use hybrid retrieval.

3. Computational Cost

Requires:

  • Embedding model inference (GPUs helpful)
  • Vector similarity computation
  • Index updates as documents change

4. Cold Start Problem

New documents need embedding before retrieval.

Add new document
Wait for embedding computation
Then queryable
Delay: Seconds to minutes for large documents

Dense Retrieval Optimization Techniques

Technique 1: Query Reformulation

Transform user query for better retrieval.

Original query: "What's the best way to cook chicken?"
Reformulated: "cooking methods for chicken recipes preparation"
Result: Matches more cooking documentation

Tools: GPT-4, fine-tuned models for query expansion.

Technique 2: Negative Sampling

Improve ranking by teaching the model what NOT to match.

Training data: (query, relevant_doc, irrelevant_doc)
Learn: Make relevant_doc embedding closer to query
       Make irrelevant_doc embedding far from query
Result: Better distinction between similar-looking documents
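One common way to express this objective is a triplet margin loss. The sketch below is one possible formulation, not necessarily the one any particular model uses: the function name, the margin of 0.2, and the example vectors are all assumptions, and similarity is a dot product over unit-length vectors.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triplet_margin_loss(query, positive, negative, margin=0.2):
    """Hinge loss on similarity: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - dot(query, positive) + dot(query, negative))

query = [1.0, 0.0]
relevant_doc = [0.8, 0.6]

# The negative scores higher than the relevant doc, so the loss is positive.
confusing_negative = [0.96, 0.28]
loss_confusing = triplet_margin_loss(query, relevant_doc, confusing_negative)

# The negative is already far from the query, so there is nothing to learn.
easy_negative = [0.0, 1.0]
loss_easy = triplet_margin_loss(query, relevant_doc, easy_negative)
```

During training, gradients of this loss pull relevant_doc toward the query and push the negative away until the gap exceeds the margin. Triplets whose loss is already zero contribute no learning signal.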

Technique 3: Hard Negative Mining

Focus training on documents that are easy to confuse.

Model predictions:
- Query matches 100 documents at 0.95 similarity
- True relevant doc is at 0.92 similarity
- Problem: Relevant doc not in top 5
Solution: Include these confusing documents in training
Result: Better discrimination
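A simple mining pass can be sketched as follows: score the corpus with the current model, then keep the top-ranked documents that are not labeled relevant. The function name, the score dictionary, and the document ids are hypothetical and exist only for this example.

```python
def mine_hard_negatives(scores, relevant_ids, max_negatives=3):
    """Pick the highest-scoring documents that are NOT labeled relevant.

    scores: dict mapping doc_id -> similarity under the current model.
    relevant_ids: set of doc_ids labeled relevant for this query.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in relevant_ids][:max_negatives]

# d1 and d2 outrank the true relevant document d3, so they are the hard negatives.
scores = {"d1": 0.95, "d2": 0.94, "d3": 0.92, "d4": 0.40, "d5": 0.10}
hard_negatives = mine_hard_negatives(scores, relevant_ids={"d3"}, max_negatives=2)
```

One caveat: a high-scoring document without a relevance label may simply be unlabeled rather than irrelevant, so mined negatives are usually worth filtering or spot-checking before training on them.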

Technique 4: Re-ranking

Retrieve broadly, then rerank with a specialized model.

Stage 1: Dense retrieval → Top 50 documents
         Fast but potentially noisy
Stage 2: Cross-encoder reranking → Top 5 documents
         Slower but more accurate
Result: Recall of broad retrieval + precision of specialized model
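The two-stage shape can be sketched with pluggable scorers. Everything here is illustrative: fast_score and slow_score are hypothetical callables standing in for a bi-encoder similarity and a cross-encoder, and the toy scorers below use word overlap rather than any neural model.

```python
def retrieve_then_rerank(query, corpus, fast_score, slow_score,
                         candidates=50, top_k=5):
    """Two-stage search: a cheap scorer narrows the corpus, a costly one orders the shortlist.

    fast_score and slow_score are callables (query, text) -> float.
    """
    shortlist = sorted(corpus, key=lambda doc: fast_score(query, doc),
                       reverse=True)[:candidates]
    return sorted(shortlist, key=lambda doc: slow_score(query, doc),
                  reverse=True)[:top_k]

# Toy scorers for illustration only.
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def exact_phrase_bonus(query, doc):
    return word_overlap(query, doc) + (10 if query.lower() in doc.lower() else 0)

corpus = [
    "password for the router",
    "router password reset steps",
    "the router needs a new password",
    "baking bread at home",
]
results = retrieve_then_rerank("router password", corpus, word_overlap,
                               exact_phrase_bonus, candidates=3, top_k=2)
```

The expensive scorer runs only on the shortlist, so its cost is bounded by the candidates parameter rather than by corpus size.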

Dense Retrieval vs. Traditional IR

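Aspect           | Dense Retrieval    | Keyword (BM25)
-----------------|--------------------|---------------
Understanding    | Semantic           | Lexical
Speed            | Fast with indexing | Fastest
Paraphrases      | Excellent          | Poor
Recall           | High               | Lower
Interpretability | Black box          | Clear
Indexing         | Complex            | Simple
Cost             | Higher             | Lower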

Because the two approaches have complementary strengths, combine dense retrieval with keyword search:

Stage 1: Parallel retrieval
    Dense search: Top 10 by semantic similarity
    Keyword search: Top 10 by BM25 score
Stage 2: Merge and rerank
    Combine results
    Rerank by ensemble score
    Return top 5
Result: Covers semantic understanding + keyword specificity
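The merge step needs some way to combine two rankings whose raw scores are not comparable. The text above does not say which ensemble score to use; the sketch below uses reciprocal rank fusion (RRF) as one widely used option. The function name and document ids are made up for the example, and k=60 is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)[:top_n]

dense_hits = ["d3", "d1", "d7"]  # best-first results from dense search
bm25_hits = ["d1", "d9", "d3"]   # best-first results from keyword search
merged = reciprocal_rank_fusion([dense_hits, bm25_hits], top_n=3)
```

RRF uses only rank positions, so cosine similarities and BM25 scores never need to be put on a common scale. Documents that appear in both lists, such as d1 and d3 here, rise to the top.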

Implementation: Using Vector Databases

Many production systems use purpose-built vector databases:

Pinecone (managed):

from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("documents")

# Upsert embeddings as (id, vector, metadata) tuples
index.upsert(vectors=[
    ("doc1", embedding1, {"text": "...", "source": "..."}),
    ("doc2", embedding2, {"text": "...", "source": "..."}),
])

# Query; include_metadata=True returns the stored text with each match
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

Weaviate (self-hosted):

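# weaviate-client v3 syntax; the v4 client uses a different, collection-based API
from weaviate import Client

client = Client("http://localhost:8080")

# Query the "Document" class for the 5 nearest vectors
result = (
    client.query.get("Document", ["content", "source"])
    .with_near_vector({"vector": query_embedding})
    .with_limit(5)
    .do()
)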

Dense Retrieval Performance

Ballpark figures for modern dense retrieval (actual numbers depend on hardware, model, and index configuration):

Speed:

  • 1M documents, top 5 retrieval: 10-50ms
  • With batch processing: 1000 queries/second

Quality (TREC-DL evaluation):

  • nDCG@10: 0.55-0.65 (very good)
  • Recall@100: 0.80-0.90
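For readers who want to compute these metrics on their own data, here is a minimal sketch of Recall@k and nDCG@k. It assumes graded relevance labels with linear gain. Some evaluation tools use an exponential gain instead, so absolute values can differ from published numbers. The function names and example ids are hypothetical.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    found = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return found / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k):
    """nDCG with linear gain; `gains` maps doc_id -> graded relevance (missing = 0)."""
    def dcg(values):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(values, start=1))
    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

ranked = ["d2", "d5", "d1", "d9"]  # system output, best first
gains = {"d1": 3, "d2": 1}         # graded relevance judgments

recall_2 = recall_at_k(ranked, set(gains), k=2)
ndcg_3 = ndcg_at_k(ranked, gains, k=3)
```

Recall@k only asks whether relevant documents made the cut, while nDCG@k also rewards placing the most relevant ones first, which is why the two are usually reported together.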

Cost:

  • GPU inference: ~$0.001 per embedding (self-hosted)
  • Vector database: $100-1000/month depending on scale

Dense Retrieval in 2024

Trends:

  • Multimodal dense retrieval (text + images)
  • Query-document asymmetric embeddings
  • Dense retrieval + LLM reranking becoming standard
  • Real-time indexing (sub-second latency from new documents)
  • Sparse vectors + dense vectors (hybrid representation)

Dense retrieval is the foundation of modern RAG. Understanding it thoroughly is essential for building effective systems.