Approximate Nearest Neighbor Search: The Key to Scaling Vector Retrieval
Without indexing, finding the K nearest neighbors among millions of vectors requires checking every vector—computationally prohibitive. Approximate Nearest Neighbor (ANN) algorithms solve this by finding good-enough results fast.
The Exact vs. Approximate Trade-off
Exact Nearest Neighbor (Linear Search)
Algorithm: Check every vector, compute similarity, return top K
For 1M vectors:- Queries needed: 1M similarity computations- Time: ~1-2 seconds per query- Space: All vectors in memory- Accuracy: 100% (truly finds nearest)Use when:
- Small corpus (< 100K vectors)
- Accuracy critical
- Speed not critical
- Interactive response not needed
Approximate Nearest Neighbor
Algorithm: Use indexing structure to skip most vectors
For 1M vectors:- Queries needed: ~1K-10K similarity computations- Time: ~10-100ms per query- Space: Index structure overhead- Accuracy: 90-99% (finds approximately nearest)Use when:
- Large corpus (1M+ vectors)
- Speed critical (< 100ms latency)
- Acceptable recall loss (1-10%)
- Interactive systems
Why ANN Works
ANN exploits the structure of high-dimensional vector space:
Property 1: ClusteringVectors naturally cluster in semantic space.Similar vectors are near each other.→ Don't need to check distant regions
Property 2: LocalityNearest neighbors are likely to be near each other.If vector A is near B and B is near C,A and C are probably near each other.→ Can propagate through the structure
Property 3: Dimensionality benefitsHigh-dimensional spaces have properties that makeapproximate search surprisingly effective.→ Can afford small approximation errorThese properties enable indexing structures that skip search space.
Popular ANN Algorithms
1. HNSW (Hierarchical Navigable Small World)
How it works:
Build a multi-level graph structureLayer 0 (bottom): All vectorsLayer 1: ~50% of vectors with long-range connectionsLayer 2: ~25% of vectors, even sparser... (layers decrease until single entry point)
Search:1. Start at top layer entry point2. Navigate toward query3. Move to next layer4. Repeat until bottom layer5. Fine-grained search in layer 0Characteristics:
Recall: 99%+Query time: Sub-millisecond (1M vectors)Build time: MinutesMemory overhead: ~10% above raw vectorsWhen to use:
✓ Production systems needing high recall✓ Real-time search requirements✓ Medium-large scale (1M-100M vectors)Popular implementations:
- Hnswlib (C++, Python wrapper)- Milvus- Weaviate- Pinecone (uses HNSW internally)2. IVF (Inverted File Index)
How it works:
1. Partition vectors into clusters (k-means)2. For each cluster, maintain list of vector IDs3. Build inverted index: cluster → [vectors in cluster]
Search:1. Encode query2. Find nearest clusters (via centroids)3. Search vectors in those clusters4. Return top KCharacteristics:
Recall: 90-95% (depending on how many clusters checked)Query time: MillisecondsBuild time: SecondsMemory overhead: ~5% above raw vectorsScalability: Excellent for very large corpusExample:
1M vectors, 768 dimensionsK clusters: 16384Vectors per cluster: ~60
Query:1. Find nearest 4 clusters: 16KB lookups2. Search ~240 vectors: 240K operationsvs. 1M vectors: 1M operations
Speed: 4-5× faster than linear searchWhen to use:
✓ Very large scale (10M-1B vectors)✓ Acceptable recall trade-off✓ Limited memoryPopular implementations:
- FAISS (Facebook AI Similarity Search)- Milvus- Elasticsearch- Vespa3. LSH (Locality Sensitive Hashing)
How it works:
1. Hash vectors using special functions2. Hash function property: Similar vectors hash to same/nearby buckets3. Store vectors by hash value
Search:1. Hash query2. Look in that bucket and nearby buckets3. Compare only candidates in bucketsCharacteristics:
Recall: Variable (tunable)Query time: FastBuild time: Very fastMemory: TunableSimplicity: Easiest to understandWhen to use:
✓ Extremely fast requirements✓ Streaming data (easy to add new vectors)✓ Limited memory4. Product Quantization
How it works:
1. Split vectors into segments2. Quantize each segment (reduce to fewer bits)3. Build index on quantized vectors4. Trade accuracy for speed and memory
Example:Original: 768 dimensions × 32-bit float = 3KB per vectorQuantized: 768 dimensions × 2 bits (4 clusters per segment) = 12 bytes
100M vectors:Original: 300GBQuantized: 1.2GBSpeed: 10-100× fasterComparison: Which ANN Algorithm?
| Algorithm | Recall | Speed | Memory | Scalability | Ease |
|---|---|---|---|---|---|
| HNSW | 99%+ | <1ms | 10% | 1M-100M | Medium |
| IVF | 90-95% | 1-10ms | 5% | 10M-1B | Medium |
| LSH | Variable | <1ms | Low | Streaming | Easy |
| PQ | 85-90% | <1ms | Tiny | 100M-1B | Hard |
| Linear | 100% | 1-2s | 0% | < 100K | Trivial |
Practical recommendations:
Starting a RAG system: Use HNSW(Good balance of recall and simplicity)
Extremely large scale: Try IVF(Better memory efficiency, still good recall)
Streaming vectors: Consider LSH(Easy to add new vectors incrementally)
Ultra-low latency + massive scale: Combine PQ + IVF(Requires more tuning)ANN in Popular Vector Databases
FAISS (Facebook AI Similarity Search)
import faissimport numpy as np
# Create IVF indexdimension = 768nlist = 1024 # number of clustersquantizer = faiss.IndexFlatL2(dimension)index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# Add vectorsvectors = np.random.randn(1000000, 768).astype('float32')index.train(vectors[:100000]) # Train on subsetindex.add(vectors)
# Searchquery = np.random.randn(1, 768).astype('float32')distances, indices = index.search(query, k=10)print(f"Top 10 results: {indices[0]}")print(f"Distances: {distances[0]}")Pinecone (Cloud Service)
import pinecone
pinecone.init(api_key="xxx")index = pinecone.Index("my-index")
# Upsert vectors (automatically indexed with HNSW)index.upsert(vectors=[ ("id1", embedding1), ("id2", embedding2),])
# Queryresults = index.query(vector=query_embedding, top_k=10)print(results) # Returns matched IDs and scoresWeaviate
from weaviate import Client
client = Client("http://localhost:8080")
# Create class with HNSW indexingclient.schema.create_class({ "class": "Document", "vectorizer": "text2vec-transformers", "vectorIndexConfig": { "name": "hnsw", "ef": 64, "efConstruction": 200, }})
# Queryresult = client.query.get("Document", ["content"]).with_near_vector({ "vector": query_embedding}).with_limit(10).do()Tuning ANN Parameters
HNSW Hyperparameters
- ef: Search parameter (higher = more thorough, slower) Default: 40 Trade-off: Recall vs. speed
- efConstruction: Construction parameter Default: 200 Trade-off: Build time vs. quality
- M: Number of bidirectional links Default: 5-48 (usually 16) Trade-off: Memory vs. recallTuning strategy:
def find_best_parameters(ground_truth, recall_target=0.95): best_config = None
for ef in [20, 40, 60, 100]: for efConstruction in [100, 200, 400]: for M in [4, 8, 16, 32]: index = hnswlib.Index(...) index.init_index(max_elements=1000000, ef_construction=efConstruction, M=M) recall = evaluate_recall(index, ground_truth)
if recall >= recall_target: if (best_config is None or ef < best_config['ef']): # Prefer speed best_config = {'ef': ef, 'efConstruction': efConstruction, 'M': M}
return best_configIVF Hyperparameters
- nlist: Number of clusters Default: √N (for N vectors) Trade-off: Search space vs. cluster quality
- nprobe: Number of clusters to search Default: 1-10 (usually 10) Trade-off: Recall vs. speedANN Quality Metrics
Recall
What percentage of true top-K results were found?
Recall@1 = Is the true nearest neighbor found?Recall@10 = Are the true top-10 found?Query Latency
Time to retrieve K vectorsTypical targets: < 50ms for interactive applicationsThroughput
Queries per second the system can handleDepends on: Recall setting, latency SLA, hardwareCommon ANN Mistakes
Using linear search at scale:
Trying to search 10M vectors with brute forceShould use HNSW or IVF immediatelyWrong recall target:
Optimizing for 100% recall (defeats the purpose)Most applications accept 95%+ recall gladlyNot retraining index:
Vectors added but index outdatedRetrieval quality degrades over timeSolution: Reindex periodically or incrementallyForgetting to normalize:
ANN algorithms assume normalized vectorsDenormalized vectors give wrong resultsANN in 2024
Trends:
- Learned indices (ML-based optimization)
- GPU-accelerated ANN
- Streaming ANN (update without reindex)
- Hybrid ANN (combining multiple algorithms)
- Multi-stage ranking (ANN → reranker)
ANN algorithms are the invisible foundation of modern vector search. Understanding them helps you tune systems for latency and recall trade-offs.