Approximate Nearest Neighbor (ANN): Fast Vector Indexing for RAG

Master ANN algorithms for RAG vector search. Learn how approximate methods achieve sub-millisecond retrieval from millions of vectors.

Approximate Nearest Neighbor Search: The Key to Scaling Vector Retrieval

Without indexing, finding the K nearest neighbors among millions of vectors requires checking every vector—computationally prohibitive. Approximate Nearest Neighbor (ANN) algorithms solve this by finding good-enough results fast.

The Exact vs. Approximate Trade-off

Algorithm: Check every vector, compute similarity, return top K

For 1M vectors:
- Queries needed: 1M similarity computations
- Time: ~1-2 seconds per query
- Space: All vectors in memory
- Accuracy: 100% (truly finds nearest)

Use when:

  • Small corpus (< 100K vectors)
  • Accuracy critical
  • Speed not critical
  • Interactive response not needed

Approximate Nearest Neighbor

Algorithm: Use indexing structure to skip most vectors

For 1M vectors:
- Queries needed: ~1K-10K similarity computations
- Time: ~10-100ms per query
- Space: Index structure overhead
- Accuracy: 90-99% (finds approximately nearest)

Use when:

  • Large corpus (1M+ vectors)
  • Speed critical (< 100ms latency)
  • Acceptable recall loss (1-10%)
  • Interactive systems

Why ANN Works

ANN exploits the structure of high-dimensional vector space:

Property 1: Clustering
Vectors naturally cluster in semantic space.
Similar vectors are near each other.
→ Don't need to check distant regions
Property 2: Locality
Nearest neighbors are likely to be near each other.
If vector A is near B and B is near C,
A and C are probably near each other.
→ Can propagate through the structure
Property 3: Dimensionality benefits
High-dimensional spaces have properties that make
approximate search surprisingly effective.
→ Can afford small approximation error

These properties enable indexing structures that skip search space.

1. HNSW (Hierarchical Navigable Small World)

How it works:

Build a multi-level graph structure
Layer 0 (bottom): All vectors
Layer 1: ~50% of vectors with long-range connections
Layer 2: ~25% of vectors, even sparser
... (layers decrease until single entry point)
Search:
1. Start at top layer entry point
2. Navigate toward query
3. Move to next layer
4. Repeat until bottom layer
5. Fine-grained search in layer 0

Characteristics:

Recall: 99%+
Query time: Sub-millisecond (1M vectors)
Build time: Minutes
Memory overhead: ~10% above raw vectors

When to use:

✓ Production systems needing high recall
✓ Real-time search requirements
✓ Medium-large scale (1M-100M vectors)

Popular implementations:

- Hnswlib (C++, Python wrapper)
- Milvus
- Weaviate
- Pinecone (uses HNSW internally)

2. IVF (Inverted File Index)

How it works:

1. Partition vectors into clusters (k-means)
2. For each cluster, maintain list of vector IDs
3. Build inverted index: cluster → [vectors in cluster]
Search:
1. Encode query
2. Find nearest clusters (via centroids)
3. Search vectors in those clusters
4. Return top K

Characteristics:

Recall: 90-95% (depending on how many clusters checked)
Query time: Milliseconds
Build time: Seconds
Memory overhead: ~5% above raw vectors
Scalability: Excellent for very large corpus

Example:

1M vectors, 768 dimensions
K clusters: 16384
Vectors per cluster: ~60
Query:
1. Find nearest 4 clusters: 16KB lookups
2. Search ~240 vectors: 240K operations
vs. 1M vectors: 1M operations
Speed: 4-5× faster than linear search

When to use:

✓ Very large scale (10M-1B vectors)
✓ Acceptable recall trade-off
✓ Limited memory

Popular implementations:

- FAISS (Facebook AI Similarity Search)
- Milvus
- Elasticsearch
- Vespa

3. LSH (Locality Sensitive Hashing)

How it works:

1. Hash vectors using special functions
2. Hash function property: Similar vectors hash to same/nearby buckets
3. Store vectors by hash value
Search:
1. Hash query
2. Look in that bucket and nearby buckets
3. Compare only candidates in buckets

Characteristics:

Recall: Variable (tunable)
Query time: Fast
Build time: Very fast
Memory: Tunable
Simplicity: Easiest to understand

When to use:

✓ Extremely fast requirements
✓ Streaming data (easy to add new vectors)
✓ Limited memory

4. Product Quantization

How it works:

1. Split vectors into segments
2. Quantize each segment (reduce to fewer bits)
3. Build index on quantized vectors
4. Trade accuracy for speed and memory
Example:
Original: 768 dimensions × 32-bit float = 3KB per vector
Quantized: 768 dimensions × 2 bits (4 clusters per segment) = 12 bytes
100M vectors:
Original: 300GB
Quantized: 1.2GB
Speed: 10-100× faster

Comparison: Which ANN Algorithm?

AlgorithmRecallSpeedMemoryScalabilityEase
HNSW99%+<1ms10%1M-100MMedium
IVF90-95%1-10ms5%10M-1BMedium
LSHVariable<1msLowStreamingEasy
PQ85-90%<1msTiny100M-1BHard
Linear100%1-2s0%< 100KTrivial

Practical recommendations:

Starting a RAG system: Use HNSW
(Good balance of recall and simplicity)
Extremely large scale: Try IVF
(Better memory efficiency, still good recall)
Streaming vectors: Consider LSH
(Easy to add new vectors incrementally)
Ultra-low latency + massive scale: Combine PQ + IVF
(Requires more tuning)
import faiss
import numpy as np
# Create IVF index
dimension = 768
nlist = 1024 # number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# Add vectors
vectors = np.random.randn(1000000, 768).astype('float32')
index.train(vectors[:100000]) # Train on subset
index.add(vectors)
# Search
query = np.random.randn(1, 768).astype('float32')
distances, indices = index.search(query, k=10)
print(f"Top 10 results: {indices[0]}")
print(f"Distances: {distances[0]}")

Pinecone (Cloud Service)

import pinecone
pinecone.init(api_key="xxx")
index = pinecone.Index("my-index")
# Upsert vectors (automatically indexed with HNSW)
index.upsert(vectors=[
("id1", embedding1),
("id2", embedding2),
])
# Query
results = index.query(vector=query_embedding, top_k=10)
print(results) # Returns matched IDs and scores

Weaviate

from weaviate import Client
client = Client("http://localhost:8080")
# Create class with HNSW indexing
client.schema.create_class({
"class": "Document",
"vectorizer": "text2vec-transformers",
"vectorIndexConfig": {
"name": "hnsw",
"ef": 64,
"efConstruction": 200,
}
})
# Query
result = client.query.get("Document", ["content"]).with_near_vector({
"vector": query_embedding
}).with_limit(10).do()

Tuning ANN Parameters

HNSW Hyperparameters

- ef: Search parameter (higher = more thorough, slower)
Default: 40
Trade-off: Recall vs. speed
- efConstruction: Construction parameter
Default: 200
Trade-off: Build time vs. quality
- M: Number of bidirectional links
Default: 5-48 (usually 16)
Trade-off: Memory vs. recall

Tuning strategy:

def find_best_parameters(ground_truth, recall_target=0.95):
best_config = None
for ef in [20, 40, 60, 100]:
for efConstruction in [100, 200, 400]:
for M in [4, 8, 16, 32]:
index = hnswlib.Index(...)
index.init_index(max_elements=1000000,
ef_construction=efConstruction, M=M)
recall = evaluate_recall(index, ground_truth)
if recall >= recall_target:
if (best_config is None or
ef < best_config['ef']): # Prefer speed
best_config = {'ef': ef, 'efConstruction': efConstruction, 'M': M}
return best_config

IVF Hyperparameters

- nlist: Number of clusters
Default: √N (for N vectors)
Trade-off: Search space vs. cluster quality
- nprobe: Number of clusters to search
Default: 1-10 (usually 10)
Trade-off: Recall vs. speed

ANN Quality Metrics

Recall

What percentage of true top-K results were found?
Recall@1 = Is the true nearest neighbor found?
Recall@10 = Are the true top-10 found?

Query Latency

Time to retrieve K vectors
Typical targets: < 50ms for interactive applications

Throughput

Queries per second the system can handle
Depends on: Recall setting, latency SLA, hardware

Common ANN Mistakes

Using linear search at scale:

Trying to search 10M vectors with brute force
Should use HNSW or IVF immediately

Wrong recall target:

Optimizing for 100% recall (defeats the purpose)
Most applications accept 95%+ recall gladly

Not retraining index:

Vectors added but index outdated
Retrieval quality degrades over time
Solution: Reindex periodically or incrementally

Forgetting to normalize:

ANN algorithms assume normalized vectors
Denormalized vectors give wrong results

ANN in 2024

Trends:

  • Learned indices (ML-based optimization)
  • GPU-accelerated ANN
  • Streaming ANN (update without reindex)
  • Hybrid ANN (combining multiple algorithms)
  • Multi-stage ranking (ANN → reranker)

ANN algorithms are the invisible foundation of modern vector search. Understanding them helps you tune systems for latency and recall trade-offs.