Similarity Search: The Math That Powers RAG Retrieval
When your RAG system receives a query, it converts it to a vector and searches for the most “similar” vectors in your index. But what does “similar” actually mean mathematically? The answer depends on which distance (or similarity) metric you’re using — and the choice matters more than most tutorials let on.
This guide covers the three primary metrics in production use, when each one is appropriate, and the practical implications for your RAG system.
The Geometry of Vector Similarity
High-dimensional embeddings represent meaning as points in space. Similar meanings cluster together; dissimilar meanings are far apart. “Similarity search” is finding the nearest neighbors of your query point in that space.
Simplified 2D representation of embedding space:

"machine learning" ●
"deep learning" ● ← these cluster together
"neural networks" ●

"contract law" ●
"legal compliance" ● ← these cluster together
"regulatory filings" ●

Query: "What is backpropagation?"
→ Closest cluster: machine learning / neural networks
→ Farthest cluster: legal/compliance

Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their magnitudes:
cos(θ) = (A · B) / (|A| × |B|)
Range: -1 to +1, whether or not the vectors are normalized (the formula already divides out their lengths)
1.0 = identical direction (maximum similarity)
0.0 = perpendicular (no similarity)
-1.0 = opposite direction

In practice, scores from many text embedding models cluster in a narrower, mostly positive band, so strongly negative values are rare.

Why it’s widely used: Cosine similarity is magnitude-invariant. A long document and a short document that discuss the same topic can score similarly against a query even if their raw vector lengths differ significantly. This makes it robust for comparing texts of different lengths.
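A minimal NumPy sketch of the formula above makes the magnitude invariance concrete. The vectors are made up for illustration: scaling the document vector by 10 leaves its cosine score unchanged, and a unit-normalized vector compared with its own negation still scores -1.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(θ) = (A · B) / (|A| × |B|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 2.0, 3.0])
doc = np.array([2.0, 1.0, 4.0])

# Scaling a vector changes its length but not its direction,
# so the cosine similarity stays the same.
print(round(cosine_similarity(query, doc), 4))       # 0.9331
print(round(cosine_similarity(query, 10 * doc), 4))  # 0.9331

# Normalizing does not remove negative scores: opposite directions still give -1.
unit = query / np.linalg.norm(query)
print(round(cosine_similarity(unit, -unit), 4))      # -1.0
```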
When it’s the right choice:
- Comparing documents of different lengths
- Embeddings are not unit-normalized
- Semantic similarity (not relevance scoring) is the goal
- You’re using a transformer-based embedding model (most produce vectors well-suited to cosine similarity)
Most vector databases default to cosine distance or allow normalization + dot product (equivalent to cosine similarity for unit vectors).
Dot Product (Inner Product)
The dot product reflects both the angle between two vectors and their magnitudes:
A · B = Σ(aᵢ × bᵢ) = |A| × |B| × cos(θ)
Range: none fixed; it depends on vector magnitudes
Higher magnitude + closer angle = higher score

For unit-normalized vectors, dot product and cosine similarity give identical scores, and therefore identical rankings. The difference emerges when vectors have varying magnitudes.
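To see where the two metrics part ways, here is a two-document toy example in NumPy (the vectors are invented for illustration). One document points almost the same way as the query but is short. The other is 45° off but long.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],  # doc 0: nearly the same direction as the query, small magnitude
    [3.0, 3.0],  # doc 1: 45° away from the query, large magnitude
])

dot_scores = docs @ query
cos_scores = dot_scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print(np.argsort(-dot_scores))  # [1 0]: dot product favors the long vector
print(np.argsort(-cos_scores))  # [0 1]: cosine favors the closer angle

# After unit-normalizing both sides, the dot product equals the cosine similarity.
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
print(np.allclose(unit_docs @ unit_query, cos_scores))  # True
```

This is why normalization is the deciding factor. Once every vector has length 1, magnitude can no longer influence the ranking.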
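When magnitudes carry meaning: Some embedding models are trained in ways that encode relevance or confidence in vector magnitude. This sometimes happens with metric-learning or contrastive objectives. For those models, the dot product keeps a signal that cosine similarity discards. OpenAI’s embedding models (text-embedding-3-small, text-embedding-3-large) are not in this group. They are designed to be used with cosine similarity and return vectors normalized to length 1, so cosine similarity and dot product give identical scores.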
When to use dot product:
- Your embedding model documentation explicitly recommends it
- Vectors are already unit-normalized (equivalent to cosine)
- You need raw similarity scores for custom reranking
Euclidean Distance (L2)
Euclidean distance is the straight-line distance between two points in embedding space:
L2(A, B) = sqrt(Σ(aᵢ - bᵢ)²)
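Range: 0 to ∞
0 = identical vectors
Larger = more dissimilar

Euclidean in high dimensions: The curse of dimensionality shows up as distance concentration. For random high-dimensional data, the ratio between the nearest and farthest neighbor distances approaches 1 as dimensionality grows, making “near” and “far” less informative. That effect hits angle-based measures like cosine too, so it is not on its own an argument against L2. Real text embeddings are far from random, which is why nearest-neighbor search still works in 1536 dimensions. The more practical problem is magnitude. On unnormalized vectors, L2 distance mixes differences in vector length with differences in direction, and for text embeddings it is mostly the direction that carries meaning.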
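The concentration effect is easy to reproduce with a small simulation. This sketch uses random Gaussian vectors, not real embeddings, so treat it as intuition rather than a measurement of any model. It compares 2 and 1536 dimensions, reporting the nearest/farthest L2 ratio alongside the spread of cosine scores.

```python
import numpy as np

rng = np.random.default_rng(42)

def concentration(d: int, n: int = 1000):
    """For one random query vs. n random Gaussian points in d dimensions, return
    (nearest / farthest L2 distance, max - min cosine similarity)."""
    points = rng.standard_normal((n, d))
    query = rng.standard_normal(d)
    l2 = np.linalg.norm(points - query, axis=1)
    cos = (points @ query) / (np.linalg.norm(points, axis=1) * np.linalg.norm(query))
    return l2.min() / l2.max(), cos.max() - cos.min()

for d in (2, 1536):
    ratio, cos_spread = concentration(d)
    print(f"d={d:>4}: nearest/farthest L2 ratio = {ratio:.2f}, cosine spread = {cos_spread:.2f}")
```

In high dimensions the L2 ratio moves toward 1, and the range of cosine scores narrows as well. Both metrics are squeezed by the same geometry.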
When Euclidean works:
- Low-dimensional embeddings (< 100 dimensions)
- Image features rather than text (some CNN feature vectors work well with L2)
- When explicit magnitude differences should penalize similarity
For text embeddings at 384D, 768D, 1536D and beyond, cosine similarity is a safer default than raw Euclidean distance on unnormalized vectors. If you do normalize, L2 produces exactly the same ranking as cosine.
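Check one nuance before comparing the two. On unit-normalized vectors, squared Euclidean distance is a fixed function of cosine similarity: |A - B|² = 2 - 2·cos(θ). The two metrics therefore return the same ranking. This sketch verifies the identity on random unit vectors, which stand in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(7)
docs = rng.standard_normal((500, 384))
query = rng.standard_normal(384)

# Unit-normalize the documents and the query
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cos = docs @ query                           # cosine similarity (unit vectors)
sq_l2 = np.sum((docs - query) ** 2, axis=1)  # squared Euclidean distance

# For unit vectors: |A - B|² = 2 - 2·cos(θ)
print(np.allclose(sq_l2, 2 - 2 * cos))  # True

# So the highest-cosine results are exactly the smallest-L2 results
print(np.array_equal(np.argsort(-cos)[:10], np.argsort(sq_l2)[:10]))  # True
```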
Practical Comparison
| Metric | Normalization Needed | Best For | Default In |
|---|---|---|---|
| Cosine | No | Text retrieval | Most RAG stacks; Pinecone (default) |
| Dot Product | Ideally yes | Trained relevance | Opt-in (e.g., Pinecone `dotproduct`) |
| Euclidean (L2) | No | Image features, low-D | FAISS (default) |
| Manhattan (L1) | No | Sparse vectors | Rarely used |

Setting Up in Python with FAISS
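```python
import faiss
import numpy as np

d = 1536  # embedding dimension

# Placeholder data; replace with your real document and query embeddings
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Cosine similarity via normalization + inner product
index_cosine = faiss.IndexFlatIP(d)  # inner-product index
# Normalize vectors before adding/querying
vectors_unit = (vectors / np.linalg.norm(vectors, axis=1, keepdims=True)).astype(np.float32)
index_cosine.add(vectors_unit)

# Euclidean distance on the raw vectors (IndexFlatL2 returns squared L2 distances)
index_l2 = faiss.IndexFlatL2(d)  # L2 index
index_l2.add(vectors)

# Query: normalize for cosine; FAISS expects a 2D float32 array of shape (n_queries, d)
query_unit = (query / np.linalg.norm(query)).reshape(1, -1).astype(np.float32)
scores, indices = index_cosine.search(query_unit, 10)  # top-10 cosine similarities
```

Similarity Thresholds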
An often-overlooked aspect of similarity search: scores below a minimum threshold represent noise, not relevant results. Hard k-NN retrieval always returns k results even if none are truly similar to the query.
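```python
def similarity_search_with_threshold(
    vectorstore,              # assumed: a LangChain-style vector store
    query: str,
    k: int = 10,
    min_score: float = 0.75,  # minimum relevance score; tune per embedding model
):
    # Relevance scores are normalized so that higher = more similar. Raw
    # similarity_search_with_score may return distances (lower = better)
    # depending on the backend, which would invert the filter below.
    results = vectorstore.similarity_search_with_relevance_scores(query, k=k)
    # Filter out low-confidence results
    return [(doc, score) for doc, score in results if score >= min_score]
```

Setting a minimum threshold prevents your RAG system from retrieving and sending irrelevant context to the LLM when no good match exists. This is especially important for out-of-domain queries.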
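If you are not using a vector-store framework, here is a framework-free NumPy sketch of the same idea. The name `top_k_with_threshold` and the toy 3-D vectors are hypothetical. The sketch shows the behavior described above: plain top-k always selects k candidates, and the threshold then drops the weak ones, returning an empty list for an out-of-domain query.

```python
import numpy as np

def top_k_with_threshold(doc_vectors, query_vector, k=10, min_score=0.75):
    """Exact top-k cosine search, then drop hits below min_score.
    Returns (index, score) pairs, best first; may return fewer than k."""
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query_vector / np.linalg.norm(query_vector)
    scores = docs @ q
    top = np.argsort(-scores)[:k]  # plain k-NN: always k candidates
    return [(int(i), float(scores[i])) for i in top if scores[i] >= min_score]

# Toy 3-D "embeddings": two on-topic documents and one off-topic document
docs = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.0, 1.0]])

print(top_k_with_threshold(docs, np.array([1.0, 0.0, 0.0]), k=3))  # on-topic: docs 0 and 1
print(top_k_with_threshold(docs, np.array([0.0, 1.0, 0.0]), k=3))  # out-of-domain: []
```

The 0.75 cutoff is carried over from the function above. A sensible value depends on your embedding model’s score distribution, so calibrate it on real queries.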
2025 Trend: Matryoshka Representation Learning
OpenAI’s text-embedding-3 models use Matryoshka Representation Learning (MRL), where embeddings at different dimensionalities (e.g., 256, 512, 1536 dimensions) all preserve semantic structure. You can truncate the vector for faster, cheaper search while maintaining reasonable recall:
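```python
import numpy as np

# Stand-in for a real 1536-dim text-embedding-3 vector (random placeholder)
full_embedding = np.random.default_rng(0).standard_normal(1536).astype(np.float32)

# Full 1536-dim embedding: high recall, higher cost
full_embedding /= np.linalg.norm(full_embedding)

# Truncated 512-dim: ~96% recall of full, 3× cheaper storage
# Use "/" rather than "/=": slices are views, so in-place division would modify full_embedding
compact_embedding = full_embedding[:512] / np.linalg.norm(full_embedding[:512])  # re-normalize

# Truncated 256-dim: ~92% recall, 6× cheaper
tiny_embedding = full_embedding[:256] / np.linalg.norm(full_embedding[:256])
```

This enables tiered retrieval: use 256-dim for a fast first pass, then 1536-dim for reranking the top candidates.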
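Here is a minimal sketch of that two-stage pattern, assuming Matryoshka-style embeddings where the first 256 values can be used on their own. The function names are hypothetical, and the random data only demonstrates the mechanics, not recall. In a real system, stage 1 would typically run against an approximate-nearest-neighbor index built on the 256-dim vectors rather than the brute-force scan shown here.

```python
import numpy as np

def unit_rows(x: np.ndarray) -> np.ndarray:
    """Scale each vector (last axis) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def tiered_search(doc_full, query_full, coarse_dims=256, n_candidates=100, k=10):
    """Two-stage cosine search over Matryoshka-style embeddings.

    Stage 1 scores every document on its first `coarse_dims` values;
    stage 2 re-scores only the top `n_candidates` with the full vectors.
    """
    # Stage 1: cheap pass on truncated, re-normalized prefixes
    coarse = unit_rows(doc_full[:, :coarse_dims]) @ unit_rows(query_full[:coarse_dims])
    candidates = np.argsort(-coarse)[:n_candidates]
    # Stage 2: rerank the candidates with full-dimension cosine similarity
    fine = unit_rows(doc_full[candidates]) @ unit_rows(query_full)
    order = np.argsort(-fine)[:k]
    return candidates[order], fine[order]

# Random stand-in data; in practice these would be 1536-dim text-embedding-3 vectors
rng = np.random.default_rng(0)
docs = rng.standard_normal((2_000, 1536)).astype(np.float32)
query = rng.standard_normal(1536).astype(np.float32)

ids, scores = tiered_search(docs, query)
print(ids)     # indices of the top-10 documents after reranking
print(scores)  # their full-dimension cosine similarities, highest first
```

The `n_candidates` knob trades speed for recall. If it equals the corpus size, the result is identical to an exact full-dimension search.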
Choosing Your Metric: The Practical Rule
For the vast majority of RAG systems using transformer-based text embeddings:
- Use cosine similarity as your default
- Check your embedding model’s documentation — follow their recommendation
- For unit-normalized embeddings, dot product is equivalent and often faster
- Only use Euclidean if you have a specific reason grounded in how your embeddings were trained
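Before relying on “dot product is equivalent,” confirm that your stored vectors really are unit length. Here is a minimal check to run on a sample of embeddings. The helper name and the 1e-3 tolerance are assumptions; loosen the tolerance if vectors are stored at reduced precision.

```python
import numpy as np

def is_unit_normalized(embeddings, atol=1e-3):
    """True if every row has an L2 norm within `atol` of 1.0."""
    norms = np.linalg.norm(np.asarray(embeddings, dtype=np.float64), axis=1)
    return bool(np.allclose(norms, 1.0, atol=atol))

sample = np.random.default_rng(0).standard_normal((5, 768))
print(is_unit_normalized(sample))  # False
print(is_unit_normalized(sample / np.linalg.norm(sample, axis=1, keepdims=True)))  # True
```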
The metric choice rarely makes or breaks a RAG system — retrieval quality depends much more on chunking, embedding model quality, and query formulation.