Approximate Nearest Neighbor Search: The Key to Scaling Vector Retrieval

Without indexing, finding the K nearest neighbors among millions of vectors requires checking every vector—computationally prohibitive. Approximate Nearest Neighbor (ANN) algorithms solve this by finding good-enough results fast.

The Exact vs. Approximate Trade-off

Exact Nearest Neighbor (Linear Search)

Algorithm: Check every vector, compute similarity, return top K

For 1M vectors:
- Queries needed: 1M similarity computations
- Time: ~1-2 seconds per query
- Space: All vectors in memory
- Accuracy: 100% (truly finds nearest)

Use when:

Small corpus (< 100K vectors)
Accuracy critical
Speed not critical
Interactive response not needed

Approximate Nearest Neighbor

Algorithm: Use indexing structure to skip most vectors

For 1M vectors:
- Queries needed: ~1K-10K similarity computations
- Time: ~10-100ms per query
- Space: Index structure overhead
- Accuracy: 90-99% (finds approximately nearest)

Use when:

Large corpus (1M+ vectors)
Speed critical (< 100ms latency)
Acceptable recall loss (1-10%)
Interactive systems

Why ANN Works

ANN exploits the structure of high-dimensional vector space:

Property 1: Clustering
Vectors naturally cluster in semantic space.
Similar vectors are near each other.
→ Don't need to check distant regions

Property 2: Locality
Nearest neighbors are likely to be near each other.
If vector A is near B and B is near C,
A and C are probably near each other.
→ Can propagate through the structure

Property 3: Dimensionality benefits
High-dimensional spaces have properties that make
approximate search surprisingly effective.
→ Can afford small approximation error

These properties enable indexing structures that skip search space.

Popular ANN Algorithms

1. HNSW (Hierarchical Navigable Small World)

How it works:

Build a multi-level graph structure
Layer 0 (bottom): All vectors
Layer 1: ~50% of vectors with long-range connections
Layer 2: ~25% of vectors, even sparser
... (layers decrease until single entry point)

Search:
1. Start at top layer entry point
2. Navigate toward query
3. Move to next layer
4. Repeat until bottom layer
5. Fine-grained search in layer 0

Characteristics:

Recall: 99%+
Query time: Sub-millisecond (1M vectors)
Build time: Minutes
Memory overhead: ~10% above raw vectors

When to use:

✓ Production systems needing high recall
✓ Real-time search requirements
✓ Medium-large scale (1M-100M vectors)

Popular implementations:

- Hnswlib (C++, Python wrapper)
- Milvus
- Weaviate
- Pinecone (uses HNSW internally)

2. IVF (Inverted File Index)

How it works:

1. Partition vectors into clusters (k-means)
2. For each cluster, maintain list of vector IDs
3. Build inverted index: cluster → [vectors in cluster]

Search:
1. Encode query
2. Find nearest clusters (via centroids)
3. Search vectors in those clusters
4. Return top K

Characteristics:

Recall: 90-95% (depending on how many clusters checked)
Query time: Milliseconds
Build time: Seconds
Memory overhead: ~5% above raw vectors
Scalability: Excellent for very large corpus

Example:

1M vectors, 768 dimensions
K clusters: 16384
Vectors per cluster: ~60

Query:
1. Find nearest 4 clusters: 16KB lookups
2. Search ~240 vectors: 240K operations
vs. 1M vectors: 1M operations

Speed: 4-5× faster than linear search

When to use:

✓ Very large scale (10M-1B vectors)
✓ Acceptable recall trade-off
✓ Limited memory

Popular implementations:

- FAISS (Facebook AI Similarity Search)
- Milvus
- Elasticsearch
- Vespa

3. LSH (Locality Sensitive Hashing)

How it works:

1. Hash vectors using special functions
2. Hash function property: Similar vectors hash to same/nearby buckets
3. Store vectors by hash value

Search:
1. Hash query
2. Look in that bucket and nearby buckets
3. Compare only candidates in buckets

Characteristics:

Recall: Variable (tunable)
Query time: Fast
Build time: Very fast
Memory: Tunable
Simplicity: Easiest to understand

When to use:

✓ Extremely fast requirements
✓ Streaming data (easy to add new vectors)
✓ Limited memory

4. Product Quantization

How it works:

1. Split vectors into segments
2. Quantize each segment (reduce to fewer bits)
3. Build index on quantized vectors
4. Trade accuracy for speed and memory

Example:
Original: 768 dimensions × 32-bit float = 3KB per vector
Quantized: 768 dimensions × 2 bits (4 clusters per segment) = 12 bytes

100M vectors:
Original: 300GB
Quantized: 1.2GB
Speed: 10-100× faster

Comparison: Which ANN Algorithm?

Algorithm	Recall	Speed	Memory	Scalability	Ease
HNSW	99%+	<1ms	10%	1M-100M	Medium
IVF	90-95%	1-10ms	5%	10M-1B	Medium
LSH	Variable	<1ms	Low	Streaming	Easy
PQ	85-90%	<1ms	Tiny	100M-1B	Hard
Linear	100%	1-2s	0%	< 100K	Trivial

Practical recommendations:

Starting a RAG system: Use HNSW
(Good balance of recall and simplicity)

Extremely large scale: Try IVF
(Better memory efficiency, still good recall)

Streaming vectors: Consider LSH
(Easy to add new vectors incrementally)

Ultra-low latency + massive scale: Combine PQ + IVF
(Requires more tuning)

ANN in Popular Vector Databases

FAISS (Facebook AI Similarity Search)

import faiss
import numpy as np

# Create IVF index
dimension = 768
nlist = 1024  # number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Add vectors
vectors = np.random.randn(1000000, 768).astype('float32')
index.train(vectors[:100000])  # Train on subset
index.add(vectors)

# Search
query = np.random.randn(1, 768).astype('float32')
distances, indices = index.search(query, k=10)
print(f"Top 10 results: {indices[0]}")
print(f"Distances: {distances[0]}")

Pinecone (Cloud Service)

import pinecone

pinecone.init(api_key="xxx")
index = pinecone.Index("my-index")

# Upsert vectors (automatically indexed with HNSW)
index.upsert(vectors=[
    ("id1", embedding1),
    ("id2", embedding2),
])

# Query
results = index.query(vector=query_embedding, top_k=10)
print(results)  # Returns matched IDs and scores

Weaviate

from weaviate import Client

client = Client("http://localhost:8080")

# Create class with HNSW indexing
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-transformers",
    "vectorIndexConfig": {
        "name": "hnsw",
        "ef": 64,
        "efConstruction": 200,
    }
})

# Query
result = client.query.get("Document", ["content"]).with_near_vector({
    "vector": query_embedding
}).with_limit(10).do()

Tuning ANN Parameters

HNSW Hyperparameters

- ef: Search parameter (higher = more thorough, slower)
  Default: 40
  Trade-off: Recall vs. speed

- efConstruction: Construction parameter
  Default: 200
  Trade-off: Build time vs. quality

- M: Number of bidirectional links
  Default: 5-48 (usually 16)
  Trade-off: Memory vs. recall

Tuning strategy:

def find_best_parameters(ground_truth, recall_target=0.95):
    best_config = None

    for ef in [20, 40, 60, 100]:
        for efConstruction in [100, 200, 400]:
            for M in [4, 8, 16, 32]:
                index = hnswlib.Index(...)
                index.init_index(max_elements=1000000,
                                ef_construction=efConstruction, M=M)
                recall = evaluate_recall(index, ground_truth)

                if recall >= recall_target:
                    if (best_config is None or
                        ef < best_config['ef']):  # Prefer speed
                        best_config = {'ef': ef, 'efConstruction': efConstruction, 'M': M}

    return best_config

IVF Hyperparameters

- nlist: Number of clusters
  Default: √N (for N vectors)
  Trade-off: Search space vs. cluster quality

- nprobe: Number of clusters to search
  Default: 1-10 (usually 10)
  Trade-off: Recall vs. speed

ANN Quality Metrics

Recall

What percentage of true top-K results were found?

Recall@1 = Is the true nearest neighbor found?
Recall@10 = Are the true top-10 found?

Query Latency

Time to retrieve K vectors
Typical targets: < 50ms for interactive applications

Throughput

Queries per second the system can handle
Depends on: Recall setting, latency SLA, hardware

Common ANN Mistakes

Using linear search at scale:

Trying to search 10M vectors with brute force
Should use HNSW or IVF immediately

Wrong recall target:

Optimizing for 100% recall (defeats the purpose)
Most applications accept 95%+ recall gladly

Not retraining index:

Vectors added but index outdated
Retrieval quality degrades over time
Solution: Reindex periodically or incrementally

Forgetting to normalize:

ANN algorithms assume normalized vectors
Denormalized vectors give wrong results

ANN in 2024

Trends:

Learned indices (ML-based optimization)
GPU-accelerated ANN
Streaming ANN (update without reindex)
Hybrid ANN (combining multiple algorithms)
Multi-stage ranking (ANN → reranker)

ANN algorithms are the invisible foundation of modern vector search. Understanding them helps you tune systems for latency and recall trade-offs.