Vector Indexing: The Machinery Behind Fast Similarity Search

When your vector store has 1,000 vectors, brute-force search works fine — compare your query against every vector, sort, return top K. At 1,000,000 vectors, brute-force takes seconds. At 1 billion vectors, it’s completely impractical.

Vector indexes solve this by building data structures that let you find approximate nearest neighbors in milliseconds, trading a small amount of accuracy for massive speed gains. Understanding how they work helps you tune them correctly and interpret the accuracy-speed trade-offs you’ll encounter in production.

The Fundamental Problem: Exact vs Approximate

In high-dimensional spaces (384D, 1536D, 3072D), there’s no efficient equivalent of a B-tree for range queries. The “curse of dimensionality” means distance-based data structures that work in 2D or 3D don’t scale to hundreds of dimensions.

The solution: Approximate Nearest Neighbor (ANN) algorithms that sacrifice a small percentage of accuracy to achieve logarithmic or sub-linear query time.

A typical production system might accept 95–99% recall (meaning 95–99 of the true 100 nearest neighbors are returned) in exchange for 100× speedup over exact search.

HNSW: The Workhorse of Modern RAG

Hierarchical Navigable Small World (HNSW) graphs are the most widely deployed index type in production RAG systems. Qdrant, Weaviate, Pinecone, and pgvector all use HNSW by default.

HNSW Structure (simplified):

Layer 2 (sparse):     A ──── E
                      │
Layer 1 (medium):     A ──── B ──── E ──── G
                      │      │
Layer 0 (dense):      A─B─C─D─E─F─G─H─I─J  (all vectors)

Search: start at top layer, greedily navigate towards query vector,
        drop to next layer when local minimum found,
        repeat until Layer 0, then perform beam search.

Key parameters:

M: Number of bidirectional connections per node. Higher M = better recall + slower build + more memory. Typical range: 16–64.
ef_construction: Size of the candidate set during index build. Higher = better quality index but slower build. Typical range: 100–400.
ef_search: Size of the candidate set during query time. Higher = better recall but slower search. Can be tuned without rebuilding index.

# HNSW configuration in Qdrant
from qdrant_client.models import VectorParams, HnswConfigDiff

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance="Cosine",
        hnsw_config=HnswConfigDiff(
            m=16,
            ef_construct=200,
        )
    )
)

# Tune ef at query time without rebuilding
client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10,
    search_params={"hnsw_ef": 128},  # higher ef → better recall
)

HNSW pros: Excellent recall/speed balance, supports incremental updates without full rebuild, good memory efficiency.

HNSW cons: High memory usage (entire graph must be in RAM for best performance), slow index builds at billion scale.

IVF: Cluster-Based Indexing

Inverted File Index (IVF) partitions the vector space into clusters using k-means. At query time, only the nearest clusters are searched.

IVF Structure:

Training phase:
  All vectors → k-means clustering → n_list centroids
  Each vector assigned to nearest centroid

Index structure:
  Centroid 0: [vec_42, vec_891, vec_2203, ...]
  Centroid 1: [vec_17, vec_445, vec_1002, ...]
  ...
  Centroid n: [vec_99, vec_334, vec_876, ...]

Query:
  1. Find nearest n_probe centroids to query vector
  2. Search only vectors in those clusters
  3. Return top K

n_probe = 10 means search 10/n_list fraction of the database

IVF variants:

IVF-Flat: Exact search within each cluster. Good recall but high memory (stores full vectors).
IVF-PQ: Uses Product Quantization to compress vectors inside clusters. 10–32× memory reduction with modest recall loss.
IVF-HNSW: Uses HNSW to navigate between clusters. Best of both approaches.

Key parameters:

n_list: Number of clusters. Rule of thumb: n_list = sqrt(n_vectors) for balanced datasets.
n_probe: Clusters searched at query time. Higher = better recall, slower query.

# FAISS IVF-PQ index
import faiss

d = 1536      # vector dimension
n_list = 1024  # number of clusters
m = 64        # number of sub-quantizers (for PQ)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, n_list, m, 8)  # 8 bits per sub-quantizer

# Must train before adding vectors
index.train(training_vectors)
index.add(all_vectors)

# Query
index.nprobe = 64  # search 64 clusters
distances, indices = index.search(query_vector, k=10)

DiskANN: Billion-Scale on Disk

DiskANN (developed at Microsoft) addresses HNSW’s memory limitation by storing the graph on SSD while keeping a small compressed version in RAM. This enables billion-scale indexes on machines with hundreds of GB of RAM instead of terabytes.

DiskANN Architecture:
RAM: Compressed vector representations (PQ-compressed, ~32GB for 1B vectors)
SSD: Full-precision vectors + graph edges (~300GB for 1B vectors)

Search:
  1. Navigate graph using RAM-resident compressed vectors (fast)
  2. Re-score top candidates using full-precision SSD vectors
  3. Return final top K

Result: ~100ms queries on 1B vector indexes on a single machine

Milvus and Azure AI Search support DiskANN. For most RAG systems (< 100M vectors), HNSW is simpler and faster.

Scalar Quantization and Binary Indexes

For extreme memory efficiency, scalar quantization (SQ) and binary indexes compress vectors dramatically:

SQ8: Each float32 → 1 byte. 4× memory reduction, ~2% recall loss.
Binary: Each float → 1 bit (via sign). 32× compression, significant recall loss (mitigated by larger k).

Binary indexes use Hamming distance and bitwise operations — extremely fast on modern CPUs. Used for first-pass candidate generation in two-stage retrieval.

Index Build Time vs Query Performance

Index Type  | Build Time | RAM Usage  | Recall@10 | QPS (single thread)
------------|------------|------------|-----------|--------------------
Flat        | Seconds    | 6GB/1M     | 100%      | ~500
IVF-Flat    | Minutes    | 6GB/1M     | 97%       | ~50,000
IVF-PQ     | Minutes    | 0.2GB/1M   | 92%       | ~80,000
HNSW        | Hours      | 8GB/1M     | 99%       | ~30,000
DiskANN     | Hours      | 1GB/1M RAM | 98%       | ~5,000

(approximate values, varies by dimension and hardware)

2025 Trend: GPU-Accelerated Indexing

NVIDIA’s RAFT library brings GPU-accelerated ANN algorithms (IVF-Flat, IVF-PQ, CAGRA — a GPU-native graph index) to RAPIDS. For batch ingestion of hundreds of millions of vectors, GPU indexing reduces build times from hours to minutes.

Milvus 2.4+ supports GPU-accelerated indexing. This is becoming a meaningful differentiator for teams with high ingestion throughput requirements.

Choosing Your Index Type

Scenario	Recommended Index
< 1M vectors, dev/test	Flat (exact)
1M–100M vectors, standard RAG	HNSW
100M–1B vectors, memory-constrained	IVF-PQ or DiskANN
1B+ vectors	DiskANN + GPU build
Need ultra-fast query, some recall loss OK	IVF-PQ + GPU

For the vast majority of RAG systems (< 10M chunks), HNSW with default parameters works well. Only optimize when you measure a performance bottleneck, not before.