Similarity Search: The Math That Powers RAG Retrieval

When your RAG system receives a query, it converts the query to a vector and searches for the most “similar” vectors in your index. But what does “similar” actually mean mathematically? The answer depends on which distance (or similarity) metric you’re using, and the choice matters more than most tutorials let on.

This guide covers the three primary metrics in production use, when each one is appropriate, and the practical implications for your RAG system.

The Geometry of Vector Similarity

High-dimensional embeddings represent meaning as points in space. Similar meanings cluster together; dissimilar meanings are far apart. “Similarity search” is finding the nearest neighbors of your query point in that space.

Simplified 2D representation of embedding space:

  "machine learning" ●
  "deep learning"    ●  ← these cluster together
  "neural networks"  ●

  "contract law"           ●
  "legal compliance"       ●  ← these cluster together
  "regulatory filings"     ●

Query: "What is backpropagation?"
→ Closest cluster: machine learning / neural networks
→ Farthest cluster: legal/compliance

Cosine Similarity

Cosine similarity measures the angle between two vectors, ignoring their magnitudes:

cos(θ) = (A · B) / (|A| × |B|)

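Range: -1 to +1 (normalizing vectors does not change this; only vectors with all non-negative components are confined to 0 to 1)
 1.0 = identical direction (maximum similarity)
 0.0 = perpendicular (no similarity)
-1.0 = opposite direction (possible whenever components can be negative, as in most dense embeddings)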

Why it’s widely used: Cosine similarity is magnitude-invariant. A long document and a short document that discuss the same topic will have similar cosine similarity to a query, even if their raw vector lengths differ significantly. This makes it robust for comparing texts of different lengths.
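To make the magnitude-invariance concrete, here is a minimal NumPy sketch. The vectors are made-up three-dimensional toy values rather than real embeddings, and cosine_similarity is a helper defined only for this illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b (both must be non-zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings
query = np.array([1.0, 2.0, 3.0])
doc = np.array([2.0, 1.0, 3.0])

base = cosine_similarity(query, doc)
scaled = cosine_similarity(query, 10.0 * doc)  # same direction, 10x longer
opposite = cosine_similarity(query, -query)    # exactly opposite direction

print(base, scaled, opposite)
```

Scaling doc by 10 leaves the score unchanged, while flipping the sign of a vector drives the score to -1.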

When it’s the right choice:

Comparing documents of different lengths
Embeddings are not unit-normalized
Semantic similarity (not relevance scoring) is the goal
Most transformer-based embedding models produce vectors well-suited to cosine similarity

Most vector databases default to cosine distance or allow normalization + dot product (equivalent to cosine similarity for unit vectors).

Dot Product (Inner Product)

The dot product reflects both the angle between two vectors and their magnitudes:

A · B = Σ(ai × bi) = |A| × |B| × cos(θ)

No fixed range — depends on vector magnitudes
Higher magnitude + closer angle = higher score

For unit-normalized vectors, dot product and cosine similarity give identical rankings. The difference emerges when vectors have varying magnitudes.
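A small sketch of where the two metrics diverge, using two hand-picked 2D toy documents (not real embeddings): one short but well aligned with the query, one long but less aligned.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],  # closely aligned with the query, small magnitude
    [3.0, 3.0],  # 45 degrees off the query, large magnitude
])

# Raw dot product rewards magnitude as well as alignment
dot_scores = docs @ query
# Cosine divides the magnitudes back out
cos_scores = dot_scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

dot_ranking = np.argsort(-dot_scores)  # best match first
cos_ranking = np.argsort(-cos_scores)

# After unit-normalizing the documents, dot product ranks like cosine
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
unit_dot_ranking = np.argsort(-(unit_docs @ query))

print(dot_ranking, cos_ranking, unit_dot_ranking)
```

The raw dot product puts the long document first, cosine puts the aligned one first, and normalizing before the dot product recovers the cosine ordering.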

When magnitudes carry meaning: Some embedding models are trained so that vector magnitude carries signal, such as relevance or confidence. OpenAI’s embedding models (text-embedding-3-small, text-embedding-3-large) are not among them: they return unit-length vectors, so cosine similarity and dot product rank results identically. Models trained with metric learning or contrastive objectives, by contrast, sometimes encode confidence in magnitude.

When to use dot product:

Your embedding model documentation explicitly recommends it
Vectors are already unit-normalized (equivalent to cosine)
You need raw similarity scores for custom reranking

Euclidean Distance (L2)

Euclidean distance is the straight-line distance between two points in embedding space:

L2(A, B) = sqrt(Σ(ai - bi)²)

Range: 0 to ∞
0 = identical vectors
Larger = more dissimilar

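The problem with Euclidean in high dimensions: The curse of dimensionality strikes hard here. As dimensionality grows, distances tend to concentrate: for data without strong structure, the ratio between the nearest and farthest neighbor distance approaches 1, making “near” and “far” harder and harder to tell apart. Trained embeddings have structure that softens this effect, but at 1536 dimensions raw Euclidean distances still tend to fall in a narrow band, which makes them difficult to interpret or threshold.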
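The concentration effect is easy to reproduce on synthetic data. The sketch below uses i.i.d. Gaussian points, which is an assumption made purely for illustration; real embeddings are more structured and will concentrate less severely. The helper name nearest_to_farthest_ratio is likewise invented here.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(dim: int, n_points: int = 1000) -> float:
    """Ratio of the smallest to the largest L2 distance from a random query."""
    points = rng.standard_normal((n_points, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return float(dists.min() / dists.max())

low = nearest_to_farthest_ratio(2)      # 2D: nearest is far closer than farthest
high = nearest_to_farthest_ratio(1536)  # 1536D: nearest and farthest nearly equal

print(f"2D ratio: {low:.3f}, 1536D ratio: {high:.3f}")
```

A ratio close to 1 means the nearest neighbor is barely closer than the farthest one, which is the situation described above.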

When Euclidean works:

Low-dimensional embeddings (< 100 dimensions)
Image features rather than text (some CNN feature vectors work well with L2)
When explicit magnitude differences should penalize similarity

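For text embeddings at 384D, 768D, or 1536D and above, cosine similarity is the safer default. On unit-normalized vectors, Euclidean distance and cosine similarity produce the same ranking, so the two only differ when vector magnitudes vary.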
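For unit-length vectors, squared L2 distance and cosine similarity are tied together by ‖A − B‖² = 2 − 2·cos(θ). A minimal sketch that checks the identity numerically, assuming random placeholder vectors at 384 dimensions instead of real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder vectors, normalized to unit length
vectors = rng.standard_normal((100, 384))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.standard_normal(384)
query /= np.linalg.norm(query)

cos = vectors @ query                            # cosine similarity
sq_l2 = np.sum((vectors - query) ** 2, axis=1)   # squared Euclidean distance

# Check the identity and the resulting rankings
identity_holds = bool(np.allclose(sq_l2, 2.0 - 2.0 * cos))
same_ranking = bool(np.array_equal(np.argsort(-cos), np.argsort(sq_l2)))

print(identity_holds, same_ranking)
```

Because one score is a decreasing function of the other, sorting by highest cosine and sorting by lowest L2 return the same neighbors once the vectors are normalized.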

Practical Comparison

Metric          | Normalization Needed | Best For              | Default In
----------------|----------------------|-----------------------|--------------------------
Cosine          | No                   | Text retrieval        | Most RAG stacks, Pinecone
Dot Product     | Ideally yes          | Trained relevance     | Usually opt-in
Euclidean (L2)  | No                   | Image features, low-D | FAISS
Manhattan (L1)  | No                   | Sparse vectors        | Rarely used

Setting Up in Python with FAISS

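import faiss
import numpy as np

d = 1536  # embedding dimension

# Placeholder data so the snippet runs; replace with your real embeddings
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Cosine similarity via normalization + inner product
index_cosine = faiss.IndexFlatIP(d)  # Inner Product index
# Normalize vectors before adding/querying
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
index_cosine.add(unit_vectors)

# Euclidean distance (uses the raw, un-normalized vectors)
index_l2 = faiss.IndexFlatL2(d)  # L2 index
index_l2.add(vectors)

# Query: normalize for cosine; FAISS expects a 2D float32 array
unit_query = (query / np.linalg.norm(query)).reshape(1, -1)
# With IndexFlatIP the returned values are similarities (higher = closer)
scores, indices = index_cosine.search(unit_query, 10)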

Similarity Thresholds

An often-overlooked aspect of similarity search: results that score below a minimum threshold are usually noise, not relevant matches. Hard k-NN retrieval always returns k results, even if none are truly similar to the query.

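def similarity_search_with_threshold(
    vectorstore,
    query_vector,
    k: int = 10,
    min_score: float = 0.75,  # minimum cosine similarity
):
    # Assumes the store returns (doc, score) pairs where score is a cosine
    # similarity (higher = better). Some stores return distances instead,
    # in which case the comparison below must be flipped.
    results = vectorstore.similarity_search_with_score(query_vector, k=k)
    # Filter out low-confidence results
    return [(doc, score) for doc, score in results if score >= min_score]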

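Setting a minimum threshold prevents your RAG system from retrieving and sending irrelevant context to the LLM when no good match exists. This is especially important for out-of-domain queries. The right cutoff depends on the embedding model, since score distributions differ from model to model, so calibrate min_score on labeled queries rather than treating 0.75 as universal.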
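The same idea can be sketched without a vector store. The function name top_k_above_threshold and the 2D toy vectors below are assumptions made for this illustration; the scores are plain cosine similarities computed with NumPy.

```python
import numpy as np

def top_k_above_threshold(query, vectors, k=10, min_score=0.75):
    """Return (index, cosine score) pairs for the top-k rows scoring >= min_score."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    top = np.argsort(-scores)[:k]  # best k candidates, highest score first
    return [(int(i), float(scores[i])) for i in top if scores[i] >= min_score]

# Toy index of three 2D vectors
vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

# A query close to the first two vectors keeps both of them
in_domain = top_k_above_threshold(np.array([1.0, 0.1]), vectors, k=3)
# A query pointing away from everything returns nothing instead of k noisy hits
out_of_domain = top_k_above_threshold(np.array([-1.0, -1.0]), vectors, k=3)

print(in_domain, out_of_domain)
```

An empty result is a useful signal in its own right: the caller can skip retrieval-augmented prompting or tell the user that nothing relevant was found.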

2025 Trend: Matryoshka Representation Learning

OpenAI’s text-embedding-3 models use Matryoshka Representation Learning (MRL), where embeddings at different dimensionalities (e.g., 256, 512, 1536 dimensions) all preserve semantic structure. You can truncate the vector for faster, cheaper search while maintaining reasonable recall:

import numpy as np

# Full 1536-dim embedding: high recall, higher cost
# embed() is a placeholder for your embedding call
full_embedding = np.asarray(embed(text))        # 1536 dims

# Truncated 512-dim: ~96% recall of full, 3× cheaper storage
# Divide into a new array; an in-place /= on the slice would also modify full_embedding
compact_embedding = full_embedding[:512] / np.linalg.norm(full_embedding[:512])  # re-normalize

# Truncated 256-dim: ~92% recall, 6× cheaper
tiny_embedding = full_embedding[:256] / np.linalg.norm(full_embedding[:256])

This enables tiered retrieval: use 256-dim for a fast first pass, then 1536-dim for reranking the top candidates.
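A rough sketch of that two-stage flow in NumPy follows. The names tiered_search and normalize are invented for this example, and the data is random Gaussian noise rather than MRL-trained embeddings, so it demonstrates the mechanics only, not the recall you would see in practice.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def tiered_search(query, full_vectors, coarse_dim=256, n_candidates=50, k=10):
    """Two-stage search: truncated vectors shortlist, full vectors rerank."""
    # Stage 1: cheap scores on truncated, re-normalized prefixes
    coarse_docs = normalize(full_vectors[:, :coarse_dim])
    coarse_query = normalize(query[:coarse_dim])
    candidates = np.argsort(-(coarse_docs @ coarse_query))[:n_candidates]
    # Stage 2: full-dimensional cosine similarity on the shortlist only
    fine_scores = normalize(full_vectors[candidates]) @ normalize(query)
    order = np.argsort(-fine_scores)[:k]
    return candidates[order], fine_scores[order]

rng = np.random.default_rng(0)
full_vectors = rng.standard_normal((1000, 1536))  # placeholder, not MRL-trained
query = full_vectors[42] + 0.1 * rng.standard_normal(1536)  # noisy copy of row 42

ids, scores = tiered_search(query, full_vectors)
print(ids[0], scores[0])
```

In a real system the truncated vectors would be precomputed and stored in their own index, so stage 1 never touches the full-dimensional data.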

Choosing Your Metric: The Practical Rule

For the vast majority of RAG systems using transformer-based text embeddings:

Use cosine similarity as your default
Check your embedding model’s documentation — follow their recommendation
For unit-normalized embeddings, dot product is equivalent and often faster
Only use Euclidean if you have a specific reason grounded in how your embeddings were trained
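If the documentation does not say whether the model returns unit-length vectors, a quick empirical check on a sample of embeddings is one way to find out. The helper name is_unit_normalized and the tolerance are assumptions for this sketch.

```python
import numpy as np

def is_unit_normalized(vectors: np.ndarray, atol: float = 1e-3) -> bool:
    """True if every row has an L2 norm within atol of 1."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.allclose(norms, 1.0, atol=atol))

# Toy samples: the first set is unit-length, the second is not
normalized_sample = np.array([[0.6, 0.8], [1.0, 0.0]])
raw_sample = np.array([[3.0, 4.0], [1.0, 0.0]])

print(is_unit_normalized(normalized_sample), is_unit_normalized(raw_sample))
```

If the check passes, dot product and cosine are interchangeable for ranking; if it fails, normalize explicitly or pick the metric the model was trained for.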

The metric choice rarely makes or breaks a RAG system — retrieval quality depends much more on chunking, embedding model quality, and query formulation.