IVF Index: Scaling Vector Search to Billions
IVF (Inverted File) indexing is designed for massive-scale vector search, enabling billion-scale indexes where HNSW would be prohibitive. It achieves this through clustering and disk-efficient design.
Core Concept: Clustering for Scale
Rather than building graph connections between all vectors (HNSW’s approach), IVF partitions vectors into clusters.
The idea:
Traditional search (linear):For each query, check all 1B vectorsTime: ~seconds per query
IVF search:1. Cluster vectors: Group 1B vectors into 64K clusters2. Find nearest clusters: Check 64 cluster centroids3. Search within clusters: Check ~1K vectors in top clusters4. Return top K
Time: ~milliseconds per querySpeedup: 1000×How IVF Works
Step 1: Training Phase (Off-line)
Input: 1B vectors1. Select sample vectors (e.g., 100K random vectors)2. Run k-means clustering: Find K cluster centroids K typically = √N (for N vectors) For 1B vectors: K = 31,623 clusters
3. Store cluster centroids4. Assign each vector to nearest centroid5. Create inverted index: centroid_id → [vector_ids in cluster]Example:
Vectors: [v1, v2, v3, ..., v1B]K-means: Find 31K cluster centersIndex: C1 → [v5, v234, v1000, ..., v999] (1K vectors) C2 → [v2, v100, v456, ..., v888] (1K vectors) ... C31K → [v10, v2000, v50000, ..., v1B] (1K vectors)Step 2: Search Phase (On-line)
Query q arrives:
1. Compute similarity(q, all centroids) 31K operations (fast)
2. Find nprobe nearest centroids Default nprobe = 10 Time: O(K) where K is small
3. Look up vectors in those clusters 10 clusters × 1K vectors = 10K candidates
4. Compute similarity(q, candidates) 10K operations (fast compared to 1B)
5. Return top 10IVF Parameters
nlist (Number of Clusters)
nlist = √N is the typical formulaFor different scales:
1M vectors: nlist = 1K10M vectors: nlist = 3K100M vectors: nlist = 10K1B vectors: nlist = 31KTrade-off:
Higher nlist: ✓ Each cluster smaller (fewer false positives) ✓ Easier to find exact nearest neighbors ✗ Centroid search slower (more centroids)
Lower nlist: ✓ Fewer centroids to check ✗ Larger clusters (more false positives) ✗ Need to search more candidatesnprobe (Clusters to Search)
nprobe = number of nearest clusters to search
nprobe = 1: Fastest, lowest recallnprobe = 10: Balanced (typical)nprobe = 100: Slowest, highest recallTrade-off at query time:
Query with nprobe = 1: Latency: 1ms Recall: 0.65
Query with nprobe = 10: Latency: 10ms Recall: 0.90
Query with nprobe = 100: Latency: 50ms Recall: 0.98
Can change nprobe per query! (Unlike HNSW where ef is fixed)IVF + Product Quantization (IVF-PQ)
Combine IVF with quantization for extreme memory efficiency:
Standard IVF storage:100M vectors × 768 dimensions × 4 bytes = 300GB
IVF-PQ storage:Split 768 dimensions into 32 segments (24 dims each)Each segment quantized to 8 bits100M vectors × 32 × 1 byte = 3.2GBSavings: 100×
Speed benefit:Similarity computation on quantized vectors is fasterTrade-off: Slight recall loss (usually recoverable with nprobe)Implementation:
import faiss
# Create IVF-PQ indexdimension = 768nlist = 10000 # Number of clustersm = 32 # Number of segments for PQnbits = 8 # Bits per segment (256 possible values)
quantizer = faiss.IndexFlatL2(dimension)index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
# Trainvectors = np.random.randn(100000000, 768).astype('float32')index.train(vectors[:1000000]) # Train on subset
# Indexindex.add(vectors)
# Searchquery = np.random.randn(1, 768).astype('float32')index.nprobe = 10 # Search 10 clustersdistances, indices = index.search(query, k=10)IVF in FAISS
FAISS is optimized for IVF indexing at massive scale:
import faissimport numpy as np
# Create indexdimension = 768nlist = 10000 # Number of clustersindex = faiss.IndexFlatL2(dimension) # Quantizerindex = faiss.IndexIVFFlat(index, dimension, nlist)
# Or shorthand:index = faiss.index_factory(dimension, "IVF10000,Flat")
# Trainvectors = np.random.randn(1000000, 768).astype('float32')index.train(vectors)
# Add all dataindex.add(vectors)
# Search with varying recall levelsfor nprobe in [1, 5, 10, 50, 100]: index.nprobe = nprobe distances, indices = index.search(query, k=10) recall = calculate_recall(indices, ground_truth) print(f"nprobe={nprobe}: latency={time_ms}, recall={recall:.3f}")Benchmarks: IVF Performance
Recall vs. Latency Trade-off
1B vector index (IVF):
nprobe Latency Recall Throughput1 0.5ms 0.60 2000 qps5 2ms 0.78 500 qps10 4ms 0.88 250 qps20 8ms 0.94 125 qps50 20ms 0.97 50 qps100 40ms 0.99 25 qps
Single machine can process 25-2000 queries/seconddepending on desired recallMemory Efficiency
Vectors: 768 dimensionsCount: 1 billion
Raw vectors: 3TBIVF index: ~3.2TB (includes centroids, IDs)IVF-PQ index: ~30GB (100× reduction!)IVF-OPQ (optimized): ~20GB
Storage cost at cloud:3TB × $0.023/GB/month = $69/month30GB × $0.023/GB/month = $0.69/monthIVF vs. HNSW: When to Use Each
Use HNSW when:
✓ 1-100M vectors✓ <10ms latency requirement✓ 95%+ recall needed✓ Real-time recall tuning not needed✓ Can afford graph memory overhead
Example: Product search for e-commerce (10M products)Use IVF when:
✓ 100M-1B+ vectors✓ 10-100ms latency acceptable✓ 90%+ recall sufficient✓ Memory critical✓ Can adjust nprobe per query
Example: Web-scale search (billions of documents)Tuning IVF Parameters
Finding Optimal nlist
def find_optimal_nlist(data, ground_truth_k=100): """Find nlist that balances accuracy and speed""" results = []
for nlist in [1000, 3000, 10000, 31000, 100000]: index = faiss.IndexIVFFlat( faiss.IndexFlatL2(data.shape[1]), data.shape[1], nlist ) index.train(data) index.add(data)
# Test with different nprobe values for nprobe in [1, 5, 10, 50]: index.nprobe = nprobe distances, indices = index.search(queries, k=10) recall = calculate_recall(indices, ground_truth) latency = measure_latency(index, queries)
results.append({ 'nlist': nlist, 'nprobe': nprobe, 'recall': recall, 'latency': latency })
# Find best trade-off return [r for r in results if r['recall'] > 0.90 and r['latency'] < 10][0]Memory-Latency Trade-off
Start with: nlist = √N nprobe = 10
If too slow: Reduce nprobe (faster but lower recall)If too much memory: Increase nlist (distribute vectors)If both OK: Keep current settings
Example tuning:1B vectors → nlist=31K, nprobe=10Memory: 3.2TB, Latency: 4ms, Recall: 0.88
Still too much memory?Try IVF-PQ: Memory drops to 32GB, latency stays 4msPractical Deployment: IVF Index
Building Large Index
# Don't load all vectors at oncedef build_massive_index(data_source, dimension=768): index = faiss.IndexFlatL2(dimension) index = faiss.IndexIVFFlat(index, dimension, 31000)
# Train on sample print("Sampling for training...") sample = load_sample(data_source, 1000000) index.train(sample)
# Add incrementally print("Adding vectors...") batch_size = 100000 for i, batch in enumerate(data_source.batches(batch_size)): embeddings = np.array(batch) index.add(embeddings) print(f"Added {i * batch_size} vectors")
return indexSaving and Loading
# Savefaiss.write_index(index, "index.bin")
# Loadindex = faiss.read_index("index.bin")
# Index sizeimport ossize_gb = os.path.getsize("index.bin") / (1024**3)print(f"Index size: {size_gb:.1f}GB")Common IVF Mistakes
Wrong nlist:
Using nlist=1000 for 1B vectorsResults in clusters with 1M vectors eachSlow search, poor recall
Solution: Use nlist = √N = 31K for 1B vectorsNot retraining:
Adding new vectors without retraining centroidsCentroids become outdatedRecall degrades
Solution: Periodic retraining every 10-20% new dataIgnoring cluster imbalance:
K-means might create very unbalanced clustersSome clusters have 100K vectors, others have 1K
Solution: Use balanced clustering (k-means++)IVF in 2024
Trends:
- Hybrid IVF + PQ becoming standard for billion-scale
- GPU-accelerated IVF training
- Learned clustering (replacing k-means with ML)
- Streaming IVF (incremental addition)
- Sharding IVF across distributed systems
Conclusion
IVF indexing enables vector search at scales HNSW can’t reach. By clustering vectors and searching only relevant clusters, IVF achieves reasonable latency and recall even for billions of vectors. While HNSW is easier to tune and better for medium-scale (millions), IVF is essential for truly massive scale.
Understand the nlist/nprobe trade-off, consider PQ quantization for memory efficiency, and tune empirically based on your latency/recall requirements.