IVF Index: Scaling Vector Search to Billions

IVF (Inverted File) indexing is designed for massive-scale vector search, enabling billion-scale indexes where HNSW would be prohibitive. It achieves this through clustering and disk-efficient design.

Core Concept: Clustering for Scale

Rather than building graph connections between all vectors (HNSW’s approach), IVF partitions vectors into clusters.

The idea:

Traditional search (linear):
For each query, check all 1B vectors
Time: ~seconds per query

IVF search:
1. Cluster vectors: Group 1B vectors into 64K clusters
2. Find nearest clusters: Check 64 cluster centroids
3. Search within clusters: Check ~1K vectors in top clusters
4. Return top K

Time: ~milliseconds per query
Speedup: 1000×

How IVF Works

Step 1: Training Phase (Off-line)

Input: 1B vectors
1. Select sample vectors (e.g., 100K random vectors)
2. Run k-means clustering: Find K cluster centroids
   K typically = √N (for N vectors)
   For 1B vectors: K = 31,623 clusters

3. Store cluster centroids
4. Assign each vector to nearest centroid
5. Create inverted index: centroid_id → [vector_ids in cluster]

Example:

Vectors:    [v1, v2, v3, ..., v1B]
K-means:    Find 31K cluster centers
Index:
  C1 → [v5, v234, v1000, ..., v999]     (1K vectors)
  C2 → [v2, v100, v456, ..., v888]      (1K vectors)
  ...
  C31K → [v10, v2000, v50000, ..., v1B] (1K vectors)

Step 2: Search Phase (On-line)

Query q arrives:

1. Compute similarity(q, all centroids)
   31K operations (fast)

2. Find nprobe nearest centroids
   Default nprobe = 10
   Time: O(K) where K is small

3. Look up vectors in those clusters
   10 clusters × 1K vectors = 10K candidates

4. Compute similarity(q, candidates)
   10K operations (fast compared to 1B)

5. Return top 10

IVF Parameters

nlist (Number of Clusters)

nlist = √N is the typical formula
For different scales:

1M vectors:   nlist = 1K
10M vectors:  nlist = 3K
100M vectors: nlist = 10K
1B vectors:   nlist = 31K

Trade-off:

Higher nlist:
  ✓ Each cluster smaller (fewer false positives)
  ✓ Easier to find exact nearest neighbors
  ✗ Centroid search slower (more centroids)

Lower nlist:
  ✓ Fewer centroids to check
  ✗ Larger clusters (more false positives)
  ✗ Need to search more candidates

nprobe (Clusters to Search)

nprobe = number of nearest clusters to search

nprobe = 1:   Fastest, lowest recall
nprobe = 10:  Balanced (typical)
nprobe = 100: Slowest, highest recall

Trade-off at query time:

Query with nprobe = 1:
  Latency: 1ms
  Recall: 0.65

Query with nprobe = 10:
  Latency: 10ms
  Recall: 0.90

Query with nprobe = 100:
  Latency: 50ms
  Recall: 0.98

Can change nprobe per query! (Unlike HNSW where ef is fixed)

IVF + Product Quantization (IVF-PQ)

Combine IVF with quantization for extreme memory efficiency:

Standard IVF storage:
100M vectors × 768 dimensions × 4 bytes = 300GB

IVF-PQ storage:
Split 768 dimensions into 32 segments (24 dims each)
Each segment quantized to 8 bits
100M vectors × 32 × 1 byte = 3.2GB
Savings: 100×

Speed benefit:
Similarity computation on quantized vectors is faster
Trade-off: Slight recall loss (usually recoverable with nprobe)

Implementation:

import faiss

# Create IVF-PQ index
dimension = 768
nlist = 10000  # Number of clusters
m = 32  # Number of segments for PQ
nbits = 8  # Bits per segment (256 possible values)

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

# Train
vectors = np.random.randn(100000000, 768).astype('float32')
index.train(vectors[:1000000])  # Train on subset

# Index
index.add(vectors)

# Search
query = np.random.randn(1, 768).astype('float32')
index.nprobe = 10  # Search 10 clusters
distances, indices = index.search(query, k=10)

IVF in FAISS

FAISS is optimized for IVF indexing at massive scale:

import faiss
import numpy as np

# Create index
dimension = 768
nlist = 10000  # Number of clusters
index = faiss.IndexFlatL2(dimension)  # Quantizer
index = faiss.IndexIVFFlat(index, dimension, nlist)

# Or shorthand:
index = faiss.index_factory(dimension, "IVF10000,Flat")

# Train
vectors = np.random.randn(1000000, 768).astype('float32')
index.train(vectors)

# Add all data
index.add(vectors)

# Search with varying recall levels
for nprobe in [1, 5, 10, 50, 100]:
    index.nprobe = nprobe
    distances, indices = index.search(query, k=10)
    recall = calculate_recall(indices, ground_truth)
    print(f"nprobe={nprobe}: latency={time_ms}, recall={recall:.3f}")

Benchmarks: IVF Performance

Recall vs. Latency Trade-off

1B vector index (IVF):

nprobe  Latency  Recall  Throughput
1       0.5ms    0.60    2000 qps
5       2ms      0.78    500 qps
10      4ms      0.88    250 qps
20      8ms      0.94    125 qps
50      20ms     0.97    50 qps
100     40ms     0.99    25 qps

Single machine can process 25-2000 queries/second
depending on desired recall

Memory Efficiency

Vectors:     768 dimensions
Count:       1 billion

Raw vectors:           3TB
IVF index:             ~3.2TB (includes centroids, IDs)
IVF-PQ index:          ~30GB (100× reduction!)
IVF-OPQ (optimized):   ~20GB

Storage cost at cloud:
3TB × $0.023/GB/month = $69/month
30GB × $0.023/GB/month = $0.69/month

IVF vs. HNSW: When to Use Each

Use HNSW when:

✓ 1-100M vectors
✓ <10ms latency requirement
✓ 95%+ recall needed
✓ Real-time recall tuning not needed
✓ Can afford graph memory overhead

Example: Product search for e-commerce (10M products)

Use IVF when:

✓ 100M-1B+ vectors
✓ 10-100ms latency acceptable
✓ 90%+ recall sufficient
✓ Memory critical
✓ Can adjust nprobe per query

Example: Web-scale search (billions of documents)

Tuning IVF Parameters

Finding Optimal nlist

def find_optimal_nlist(data, ground_truth_k=100):
    """Find nlist that balances accuracy and speed"""
    results = []

    for nlist in [1000, 3000, 10000, 31000, 100000]:
        index = faiss.IndexIVFFlat(
            faiss.IndexFlatL2(data.shape[1]),
            data.shape[1],
            nlist
        )
        index.train(data)
        index.add(data)

        # Test with different nprobe values
        for nprobe in [1, 5, 10, 50]:
            index.nprobe = nprobe
            distances, indices = index.search(queries, k=10)
            recall = calculate_recall(indices, ground_truth)
            latency = measure_latency(index, queries)

            results.append({
                'nlist': nlist,
                'nprobe': nprobe,
                'recall': recall,
                'latency': latency
            })

    # Find best trade-off
    return [r for r in results if r['recall'] > 0.90
            and r['latency'] < 10][0]

Memory-Latency Trade-off

Start with: nlist = √N
            nprobe = 10

If too slow:   Reduce nprobe (faster but lower recall)
If too much memory: Increase nlist (distribute vectors)
If both OK:    Keep current settings

Example tuning:
1B vectors → nlist=31K, nprobe=10
Memory: 3.2TB, Latency: 4ms, Recall: 0.88

Still too much memory?
Try IVF-PQ: Memory drops to 32GB, latency stays 4ms

Practical Deployment: IVF Index

Building Large Index

# Don't load all vectors at once
def build_massive_index(data_source, dimension=768):
    index = faiss.IndexFlatL2(dimension)
    index = faiss.IndexIVFFlat(index, dimension, 31000)

    # Train on sample
    print("Sampling for training...")
    sample = load_sample(data_source, 1000000)
    index.train(sample)

    # Add incrementally
    print("Adding vectors...")
    batch_size = 100000
    for i, batch in enumerate(data_source.batches(batch_size)):
        embeddings = np.array(batch)
        index.add(embeddings)
        print(f"Added {i * batch_size} vectors")

    return index

Saving and Loading

# Save
faiss.write_index(index, "index.bin")

# Load
index = faiss.read_index("index.bin")

# Index size
import os
size_gb = os.path.getsize("index.bin") / (1024**3)
print(f"Index size: {size_gb:.1f}GB")

Common IVF Mistakes

Wrong nlist:

Using nlist=1000 for 1B vectors
Results in clusters with 1M vectors each
Slow search, poor recall

Solution: Use nlist = √N = 31K for 1B vectors

Not retraining:

Adding new vectors without retraining centroids
Centroids become outdated
Recall degrades

Solution: Periodic retraining every 10-20% new data

Ignoring cluster imbalance:

K-means might create very unbalanced clusters
Some clusters have 100K vectors, others have 1K

Solution: Use balanced clustering (k-means++)

IVF in 2024

Trends:

Hybrid IVF + PQ becoming standard for billion-scale
GPU-accelerated IVF training
Learned clustering (replacing k-means with ML)
Streaming IVF (incremental addition)
Sharding IVF across distributed systems

Conclusion

IVF indexing enables vector search at scales HNSW can’t reach. By clustering vectors and searching only relevant clusters, IVF achieves reasonable latency and recall even for billions of vectors. While HNSW is easier to tune and better for medium-scale (millions), IVF is essential for truly massive scale.

Understand the nlist/nprobe trade-off, consider PQ quantization for memory efficiency, and tune empirically based on your latency/recall requirements.