Vector Indexing: The Machinery Behind Fast Similarity Search
When your vector store has 1,000 vectors, brute-force search works fine — compare your query against every vector, sort, return top K. At 1,000,000 vectors, brute-force takes seconds. At 1 billion vectors, it’s completely impractical.
Vector indexes solve this by building data structures that let you find approximate nearest neighbors in milliseconds, trading a small amount of accuracy for massive speed gains. Understanding how they work helps you tune them correctly and interpret the accuracy-speed trade-offs you’ll encounter in production.
The Fundamental Problem: Exact vs Approximate
In high-dimensional spaces (384D, 1536D, 3072D), there’s no efficient equivalent of a B-tree for range queries. The “curse of dimensionality” means distance-based data structures that work in 2D or 3D don’t scale to hundreds of dimensions.
The solution: Approximate Nearest Neighbor (ANN) algorithms that sacrifice a small percentage of accuracy to achieve logarithmic or sub-linear query time.
A typical production system might accept 95–99% recall (meaning 95–99 of the true 100 nearest neighbors are returned) in exchange for 100× speedup over exact search.
HNSW: The Workhorse of Modern RAG
Hierarchical Navigable Small World (HNSW) graphs are the most widely deployed index type in production RAG systems. Qdrant, Weaviate, Pinecone, and pgvector all use HNSW by default.
HNSW Structure (simplified):
Layer 2 (sparse): A ──── E │Layer 1 (medium): A ──── B ──── E ──── G │ │Layer 0 (dense): A─B─C─D─E─F─G─H─I─J (all vectors)
Search: start at top layer, greedily navigate towards query vector, drop to next layer when local minimum found, repeat until Layer 0, then perform beam search.Key parameters:
M: Number of bidirectional connections per node. Higher M = better recall + slower build + more memory. Typical range: 16–64.ef_construction: Size of the candidate set during index build. Higher = better quality index but slower build. Typical range: 100–400.ef_search: Size of the candidate set during query time. Higher = better recall but slower search. Can be tuned without rebuilding index.
# HNSW configuration in Qdrantfrom qdrant_client.models import VectorParams, HnswConfigDiff
client.create_collection( collection_name="documents", vectors_config=VectorParams( size=1536, distance="Cosine", hnsw_config=HnswConfigDiff( m=16, ef_construct=200, ) ))
# Tune ef at query time without rebuildingclient.search( collection_name="documents", query_vector=query_embedding, limit=10, search_params={"hnsw_ef": 128}, # higher ef → better recall)HNSW pros: Excellent recall/speed balance, supports incremental updates without full rebuild, good memory efficiency.
HNSW cons: High memory usage (entire graph must be in RAM for best performance), slow index builds at billion scale.
IVF: Cluster-Based Indexing
Inverted File Index (IVF) partitions the vector space into clusters using k-means. At query time, only the nearest clusters are searched.
IVF Structure:
Training phase: All vectors → k-means clustering → n_list centroids Each vector assigned to nearest centroid
Index structure: Centroid 0: [vec_42, vec_891, vec_2203, ...] Centroid 1: [vec_17, vec_445, vec_1002, ...] ... Centroid n: [vec_99, vec_334, vec_876, ...]
Query: 1. Find nearest n_probe centroids to query vector 2. Search only vectors in those clusters 3. Return top K
n_probe = 10 means search 10/n_list fraction of the databaseIVF variants:
- IVF-Flat: Exact search within each cluster. Good recall but high memory (stores full vectors).
- IVF-PQ: Uses Product Quantization to compress vectors inside clusters. 10–32× memory reduction with modest recall loss.
- IVF-HNSW: Uses HNSW to navigate between clusters. Best of both approaches.
Key parameters:
n_list: Number of clusters. Rule of thumb:n_list = sqrt(n_vectors)for balanced datasets.n_probe: Clusters searched at query time. Higher = better recall, slower query.
# FAISS IVF-PQ indeximport faiss
d = 1536 # vector dimensionn_list = 1024 # number of clustersm = 64 # number of sub-quantizers (for PQ)
quantizer = faiss.IndexFlatL2(d)index = faiss.IndexIVFPQ(quantizer, d, n_list, m, 8) # 8 bits per sub-quantizer
# Must train before adding vectorsindex.train(training_vectors)index.add(all_vectors)
# Queryindex.nprobe = 64 # search 64 clustersdistances, indices = index.search(query_vector, k=10)DiskANN: Billion-Scale on Disk
DiskANN (developed at Microsoft) addresses HNSW’s memory limitation by storing the graph on SSD while keeping a small compressed version in RAM. This enables billion-scale indexes on machines with hundreds of GB of RAM instead of terabytes.
DiskANN Architecture:RAM: Compressed vector representations (PQ-compressed, ~32GB for 1B vectors)SSD: Full-precision vectors + graph edges (~300GB for 1B vectors)
Search: 1. Navigate graph using RAM-resident compressed vectors (fast) 2. Re-score top candidates using full-precision SSD vectors 3. Return final top K
Result: ~100ms queries on 1B vector indexes on a single machineMilvus and Azure AI Search support DiskANN. For most RAG systems (< 100M vectors), HNSW is simpler and faster.
Scalar Quantization and Binary Indexes
For extreme memory efficiency, scalar quantization (SQ) and binary indexes compress vectors dramatically:
- SQ8: Each float32 → 1 byte. 4× memory reduction, ~2% recall loss.
- Binary: Each float → 1 bit (via sign). 32× compression, significant recall loss (mitigated by larger k).
Binary indexes use Hamming distance and bitwise operations — extremely fast on modern CPUs. Used for first-pass candidate generation in two-stage retrieval.
Index Build Time vs Query Performance
Index Type | Build Time | RAM Usage | Recall@10 | QPS (single thread)------------|------------|------------|-----------|--------------------Flat | Seconds | 6GB/1M | 100% | ~500IVF-Flat | Minutes | 6GB/1M | 97% | ~50,000IVF-PQ | Minutes | 0.2GB/1M | 92% | ~80,000HNSW | Hours | 8GB/1M | 99% | ~30,000DiskANN | Hours | 1GB/1M RAM | 98% | ~5,000
(approximate values, varies by dimension and hardware)2025 Trend: GPU-Accelerated Indexing
NVIDIA’s RAFT library brings GPU-accelerated ANN algorithms (IVF-Flat, IVF-PQ, CAGRA — a GPU-native graph index) to RAPIDS. For batch ingestion of hundreds of millions of vectors, GPU indexing reduces build times from hours to minutes.
Milvus 2.4+ supports GPU-accelerated indexing. This is becoming a meaningful differentiator for teams with high ingestion throughput requirements.
Choosing Your Index Type
| Scenario | Recommended Index |
|---|---|
| < 1M vectors, dev/test | Flat (exact) |
| 1M–100M vectors, standard RAG | HNSW |
| 100M–1B vectors, memory-constrained | IVF-PQ or DiskANN |
| 1B+ vectors | DiskANN + GPU build |
| Need ultra-fast query, some recall loss OK | IVF-PQ + GPU |
For the vast majority of RAG systems (< 10M chunks), HNSW with default parameters works well. Only optimize when you measure a performance bottleneck, not before.