IVF Index: Inverted File Indexing for Large-Scale Vector Search

Master IVF (Inverted File) indexing for massive-scale RAG. Learn clustering, quantization, and how to scale to billions of vectors.

IVF Index: Scaling Vector Search to Billions

IVF (Inverted File) indexing is designed for massive-scale vector search, enabling billion-scale indexes where HNSW would be prohibitive. It achieves this through clustering and disk-efficient design.

Core Concept: Clustering for Scale

Rather than building graph connections between all vectors (HNSW’s approach), IVF partitions vectors into clusters.

The idea:

Traditional search (linear):
For each query, check all 1B vectors
Time: ~seconds per query
IVF search:
1. Cluster vectors: Group 1B vectors into 64K clusters
2. Find nearest clusters: Check 64 cluster centroids
3. Search within clusters: Check ~1K vectors in top clusters
4. Return top K
Time: ~milliseconds per query
Speedup: 1000×

How IVF Works

Step 1: Training Phase (Off-line)

Input: 1B vectors
1. Select sample vectors (e.g., 100K random vectors)
2. Run k-means clustering: Find K cluster centroids
K typically = √N (for N vectors)
For 1B vectors: K = 31,623 clusters
3. Store cluster centroids
4. Assign each vector to nearest centroid
5. Create inverted index: centroid_id → [vector_ids in cluster]

Example:

Vectors: [v1, v2, v3, ..., v1B]
K-means: Find 31K cluster centers
Index:
C1 → [v5, v234, v1000, ..., v999] (1K vectors)
C2 → [v2, v100, v456, ..., v888] (1K vectors)
...
C31K → [v10, v2000, v50000, ..., v1B] (1K vectors)

Step 2: Search Phase (On-line)

Query q arrives:
1. Compute similarity(q, all centroids)
31K operations (fast)
2. Find nprobe nearest centroids
Default nprobe = 10
Time: O(K) where K is small
3. Look up vectors in those clusters
10 clusters × 1K vectors = 10K candidates
4. Compute similarity(q, candidates)
10K operations (fast compared to 1B)
5. Return top 10

IVF Parameters

nlist (Number of Clusters)

nlist = √N is the typical formula
For different scales:
1M vectors: nlist = 1K
10M vectors: nlist = 3K
100M vectors: nlist = 10K
1B vectors: nlist = 31K

Trade-off:

Higher nlist:
✓ Each cluster smaller (fewer false positives)
✓ Easier to find exact nearest neighbors
✗ Centroid search slower (more centroids)
Lower nlist:
✓ Fewer centroids to check
✗ Larger clusters (more false positives)
✗ Need to search more candidates
nprobe = number of nearest clusters to search
nprobe = 1: Fastest, lowest recall
nprobe = 10: Balanced (typical)
nprobe = 100: Slowest, highest recall

Trade-off at query time:

Query with nprobe = 1:
Latency: 1ms
Recall: 0.65
Query with nprobe = 10:
Latency: 10ms
Recall: 0.90
Query with nprobe = 100:
Latency: 50ms
Recall: 0.98
Can change nprobe per query! (Unlike HNSW where ef is fixed)

IVF + Product Quantization (IVF-PQ)

Combine IVF with quantization for extreme memory efficiency:

Standard IVF storage:
100M vectors × 768 dimensions × 4 bytes = 300GB
IVF-PQ storage:
Split 768 dimensions into 32 segments (24 dims each)
Each segment quantized to 8 bits
100M vectors × 32 × 1 byte = 3.2GB
Savings: 100×
Speed benefit:
Similarity computation on quantized vectors is faster
Trade-off: Slight recall loss (usually recoverable with nprobe)

Implementation:

import faiss
# Create IVF-PQ index
dimension = 768
nlist = 10000 # Number of clusters
m = 32 # Number of segments for PQ
nbits = 8 # Bits per segment (256 possible values)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
# Train
vectors = np.random.randn(100000000, 768).astype('float32')
index.train(vectors[:1000000]) # Train on subset
# Index
index.add(vectors)
# Search
query = np.random.randn(1, 768).astype('float32')
index.nprobe = 10 # Search 10 clusters
distances, indices = index.search(query, k=10)

IVF in FAISS

FAISS is optimized for IVF indexing at massive scale:

import faiss
import numpy as np
# Create index
dimension = 768
nlist = 10000 # Number of clusters
index = faiss.IndexFlatL2(dimension) # Quantizer
index = faiss.IndexIVFFlat(index, dimension, nlist)
# Or shorthand:
index = faiss.index_factory(dimension, "IVF10000,Flat")
# Train
vectors = np.random.randn(1000000, 768).astype('float32')
index.train(vectors)
# Add all data
index.add(vectors)
# Search with varying recall levels
for nprobe in [1, 5, 10, 50, 100]:
index.nprobe = nprobe
distances, indices = index.search(query, k=10)
recall = calculate_recall(indices, ground_truth)
print(f"nprobe={nprobe}: latency={time_ms}, recall={recall:.3f}")

Benchmarks: IVF Performance

Recall vs. Latency Trade-off

1B vector index (IVF):
nprobe Latency Recall Throughput
1 0.5ms 0.60 2000 qps
5 2ms 0.78 500 qps
10 4ms 0.88 250 qps
20 8ms 0.94 125 qps
50 20ms 0.97 50 qps
100 40ms 0.99 25 qps
Single machine can process 25-2000 queries/second
depending on desired recall

Memory Efficiency

Vectors: 768 dimensions
Count: 1 billion
Raw vectors: 3TB
IVF index: ~3.2TB (includes centroids, IDs)
IVF-PQ index: ~30GB (100× reduction!)
IVF-OPQ (optimized): ~20GB
Storage cost at cloud:
3TB × $0.023/GB/month = $69/month
30GB × $0.023/GB/month = $0.69/month

IVF vs. HNSW: When to Use Each

Use HNSW when:

✓ 1-100M vectors
✓ <10ms latency requirement
✓ 95%+ recall needed
✓ Real-time recall tuning not needed
✓ Can afford graph memory overhead
Example: Product search for e-commerce (10M products)

Use IVF when:

✓ 100M-1B+ vectors
✓ 10-100ms latency acceptable
✓ 90%+ recall sufficient
✓ Memory critical
✓ Can adjust nprobe per query
Example: Web-scale search (billions of documents)

Tuning IVF Parameters

Finding Optimal nlist

def find_optimal_nlist(data, ground_truth_k=100):
"""Find nlist that balances accuracy and speed"""
results = []
for nlist in [1000, 3000, 10000, 31000, 100000]:
index = faiss.IndexIVFFlat(
faiss.IndexFlatL2(data.shape[1]),
data.shape[1],
nlist
)
index.train(data)
index.add(data)
# Test with different nprobe values
for nprobe in [1, 5, 10, 50]:
index.nprobe = nprobe
distances, indices = index.search(queries, k=10)
recall = calculate_recall(indices, ground_truth)
latency = measure_latency(index, queries)
results.append({
'nlist': nlist,
'nprobe': nprobe,
'recall': recall,
'latency': latency
})
# Find best trade-off
return [r for r in results if r['recall'] > 0.90
and r['latency'] < 10][0]

Memory-Latency Trade-off

Start with: nlist = √N
nprobe = 10
If too slow: Reduce nprobe (faster but lower recall)
If too much memory: Increase nlist (distribute vectors)
If both OK: Keep current settings
Example tuning:
1B vectors → nlist=31K, nprobe=10
Memory: 3.2TB, Latency: 4ms, Recall: 0.88
Still too much memory?
Try IVF-PQ: Memory drops to 32GB, latency stays 4ms

Practical Deployment: IVF Index

Building Large Index

# Don't load all vectors at once
def build_massive_index(data_source, dimension=768):
index = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(index, dimension, 31000)
# Train on sample
print("Sampling for training...")
sample = load_sample(data_source, 1000000)
index.train(sample)
# Add incrementally
print("Adding vectors...")
batch_size = 100000
for i, batch in enumerate(data_source.batches(batch_size)):
embeddings = np.array(batch)
index.add(embeddings)
print(f"Added {i * batch_size} vectors")
return index

Saving and Loading

# Save
faiss.write_index(index, "index.bin")
# Load
index = faiss.read_index("index.bin")
# Index size
import os
size_gb = os.path.getsize("index.bin") / (1024**3)
print(f"Index size: {size_gb:.1f}GB")

Common IVF Mistakes

Wrong nlist:

Using nlist=1000 for 1B vectors
Results in clusters with 1M vectors each
Slow search, poor recall
Solution: Use nlist = √N = 31K for 1B vectors

Not retraining:

Adding new vectors without retraining centroids
Centroids become outdated
Recall degrades
Solution: Periodic retraining every 10-20% new data

Ignoring cluster imbalance:

K-means might create very unbalanced clusters
Some clusters have 100K vectors, others have 1K
Solution: Use balanced clustering (k-means++)

IVF in 2024

Trends:

  • Hybrid IVF + PQ becoming standard for billion-scale
  • GPU-accelerated IVF training
  • Learned clustering (replacing k-means with ML)
  • Streaming IVF (incremental addition)
  • Sharding IVF across distributed systems

Conclusion

IVF indexing enables vector search at scales HNSW can’t reach. By clustering vectors and searching only relevant clusters, IVF achieves reasonable latency and recall even for billions of vectors. While HNSW is easier to tune and better for medium-scale (millions), IVF is essential for truly massive scale.

Understand the nlist/nprobe trade-off, consider PQ quantization for memory efficiency, and tune empirically based on your latency/recall requirements.