HNSW Index: High-Performance Vector Search Algorithm
HNSW (Hierarchical Navigable Small World) is arguably the best all-around algorithm for RAG systems, offering excellent recall, sub-millisecond latency, and reasonable memory overhead.
How HNSW Works: The Core Idea
HNSW builds a multi-layered graph structure inspired by “small world” networks (where any two nodes reach each other quickly through few hops).
The Layers
Layer 3 (top): • ← Entry point, sparsest | (long-range connections)
Layer 2: • — • — • | \ / \ |
Layer 1: • - • - • - • - • |\ /|\ /|\ /|
Layer 0 (bottom): All vectors Dense local connectionsKey insight:
- Top layers: Long-range connections, quick coarse-grained search- Lower layers: Local connections, fine-grained search- Allows efficient navigation from any query to its neighborsVector Storage in HNSW
Each vector appears in one or more layers:
Vector A: Layer 3: No Layer 2: Yes, connected to vectors B, C Layer 1: Yes, connected to B, C, D, E, F Layer 0: Yes, connected to all nearby vectors
Probability of appearing in layer L: exp(-L / ln(m_L))Result: Sparse top layers, dense bottom layerSearch Algorithm: The Navigation Process
Finding nearest neighbors to query Q:
Step 1: Start at top layer entry pointStep 2: For current layer: - Compute similarity to neighbors - Move toward most similar neighbor - Repeat until converged (local minimum)Step 3: Move to next lower layerStep 4: Repeat from Step 2Step 5: Return top K from Layer 0
Result: Approximates the true nearest neighborsVisual navigation:
Top layer: Jump to approximate region Q ——→ nearest region in layer
Middle layers: Refine location Q ——→ ——→ getting closer
Bottom layer: Find exact neighbors Q ——→ ——→ ——→ found! Within true k-NNHNSW Parameters: Understanding and Tuning
M (Maximum Connections)
Number of bidirectional connections each node maintains.
M = 16 (typical)
Higher M: ✓ Better recall (more connections = better graph) ✗ Slower search (more neighbors to check) ✗ More memory (more edges to store)
Lower M: ✓ Faster search ✓ Less memory ✗ Worse recall
Recommendation: Start with M=16, tune based on latency constraintsef_construction (Candidate Set Size During Building)
How many candidates to consider when adding a vector.
ef_construction = 200 (typical)
Higher ef_construction: ✓ Better index quality (takes time to find good neighbors) ✗ Longer build time
Lower ef_construction: ✓ Faster build ✗ Worse final index
Recommendation: Set high during build (it's one-time cost) Lower at query time to balance speedef (Query Time Candidate Set Size)
How many candidates to explore during search.
ef = 40 (typical at query time)
Higher ef: ✓ Better recall (search more thoroughly) ✗ Slower queries
Lower ef: ✓ Faster queries ✗ Worse recall
Recommendation: This is the main knob to tune speed/qualityImplementation: Using HNSW in Practice
Python with hnswlib
import hnswlibimport numpy as np
# Initialize indexdimension = 768max_elements = 1000000M = 16 # max connections per nodeef_construction = 200 # size of dynamic list
index = hnswlib.Index(space='l2', dim=dimension)index.init_index(max_elements=max_elements, ef_construction=ef_construction, M=M)
# Add vectors to indexembeddings = np.random.randn(1000000, 768).astype('float32')labels = np.arange(1000000)index.add_items(embeddings, labels)
# Queryquery = np.random.randn(768).astype('float32').reshape(1, -1)ef = 40 # Size of candidate list (can change per query)labels, distances = index.knn_query(query, k=10, ef=ef)
print(f"Top 10 neighbors: {labels[0]}")print(f"Distances: {distances[0]}")
# Save/Load indexindex.save_index("hnsw.bin")index_loaded = hnswlib.Index(space='l2', dim=dimension)index_loaded.load_index("hnsw.bin")Using HNSW in Production Databases
Milvus:
from pymilvus import CollectionSchema, FieldSchema, DataTypefrom pymilvus import Collection, connections
# Connect to Milvusconnections.connect("default", host="localhost", port="19530")
# Define schemafields = [ FieldSchema(name="id", dtype=DataType.INT64, is_primary=True), FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)]schema = CollectionSchema(fields=fields, description="Documents")
# Create collectioncollection = Collection(name="documents", schema=schema)
# Create HNSW indexcollection.create_index( field_name="embedding", index_params={ "index_type": "HNSW", "metric_type": "L2", "params": { "M": 16, "efConstruction": 200, "ef": 64 } })
# Insert vectorsmr = collection.insert([ids, embeddings])
# Searchsearch_params = {"metric_type": "L2", "params": {"ef": 40}}results = collection.search( data=[query_embedding], anns_field="embedding", param=search_params, limit=10, expr=None)Weaviate:
from weaviate import Client
client = Client("http://localhost:8080")
# Create class with HNSW indexingclient.schema.create_class({ "class": "Document", "vectorizer": "text2vec-transformers", "vectorIndexConfig": { "name": "hnsw", "ef": 64, # Query time parameter "efConstruction": 200, # Build time parameter "maxConnections": 16, # M parameter "pq": {"enabled": False} # PQ compression (optional) }})
# Add documentsdata_object = {"title": "RAG Fundamentals"}client.data_object.create(data_object, class_name="Document")
# Queryresult = client.query.get("Document", ["title"]).with_near_vector({ "vector": query_embedding}).with_limit(10).do()Performance Characteristics
Recall vs. Speed Trade-off
Testing on 1M vector dataset:
ef=20: Latency: 2ms, Recall@10: 0.85ef=40: Latency: 5ms, Recall@10: 0.94ef=60: Latency: 8ms, Recall@10: 0.97ef=100: Latency: 15ms, Recall@10: 0.98ef=200: Latency: 30ms, Recall@10: 0.995
Typical target: ef=40-60 for 90-95% recall, <10ms latencyMemory Overhead
Raw vectors: 1M × 768 × 4 bytes = 3GB
HNSW overhead:- Graph edges: 1M × 16 × 8 bytes = 128MB- Layer assignments: 1M × 1 byte = 1MB- Metadata: ~50MB
Total HNSW: 3.2GB (6% overhead)Build Time
1M vectors, 768 dimensions:
Single-threaded: ~5-10 minutesMulti-threaded (8 cores): ~1-2 minutesWith GPU: Would need custom implementation
Key: Build is one-time cost, worth optimizing for qualityTuning HNSW for Your Workload
For Maximum Recall
# Use generous parametersindex = hnswlib.Index(space='l2', dim=768)index.init_index(max_elements=1000000, ef_construction=400, # High construction effort M=32) # More connections
# Query with high eflabels, distances = index.knn_query(query, k=10, ef=200)
Trade-off: 99%+ recall but slower queriesUse for: Small result sets, accuracy-critical applicationsFor Balanced Performance
# Typical production setupindex = hnswlib.Index(space='l2', dim=768)index.init_index(max_elements=1000000, ef_construction=200, # Standard M=16) # Standard
# Query with moderate eflabels, distances = index.knn_query(query, k=10, ef=50)
Trade-off: 95-98% recall, ~5-10ms latencyUse for: Most production systemsFor Speed-Optimized
# Minimal parametersindex = hnswlib.Index(space='l2', dim=768)index.init_index(max_elements=1000000, ef_construction=100, # Faster build M=8) # Fewer connections
# Query with low eflabels, distances = index.knn_query(query, k=10, ef=20)
Trade-off: 85-90% recall, ~1-2ms latencyUse for: High-throughput, lower-latency systemsUpdating HNSW Indexes
Adding New Vectors
# Dynamic addition (vectors can be added after index creation)new_embeddings = np.random.randn(1000, 768).astype('float32')new_labels = np.arange(1000000, 1001000)
index.add_items(new_embeddings, new_labels)
Benefit: No need to rebuild entire indexLimitation: Incremental additions might have slightly lower qualitySolution: Periodic full reindexing if neededDeleting Vectors
# HNSW doesn't support deletion directly# Options:# 1. Mark as deleted (filter during search)# 2. Rebuild index periodically# 3. Use Milvus/Weaviate which handle deletion
# Mark-as-deleted approach:deleted_ids = {123, 456, 789}
labels, distances = index.knn_query(query, k=100)filtered_results = [l for l in labels[0] if l not in deleted_ids][:10]Advanced: HNSW + Quantization
Combining HNSW with quantization for extreme scale:
# Quantized storage (lower precision) + HNSW indexingdef quantize_vector(embedding): # Scale to [-127, 127] int8 range return np.int8(np.clip(embedding * 127, -127, 127))
# Build index on quantized vectorsquantized_embeddings = np.array([quantize_vector(e) for e in embeddings])index.add_items(quantized_embeddings, labels)
Trade-off:- Memory: 4x reduction (float32 → int8)- Speed: Faster computation- Recall: Slight reduction (usually negligible)Comparison to Other Algorithms
| Aspect | HNSW | IVF | PQ |
|---|---|---|---|
| Recall | 99%+ | 90-95% | 85-90% |
| Latency | <1ms | 1-10ms | <1ms |
| Memory | ~10% overhead | ~5% overhead | <5% raw |
| Scalability | 1M-100M | 10M-1B | 100M-1B |
| Ease | Easy | Medium | Hard |
Recommendation: Start with HNSW unless you have extreme scale or latency requirements.
Conclusion
HNSW is the gold standard for RAG vector indexing. It provides excellent recall (99%+) with sub-millisecond latency, is easy to tune, and scales to hundreds of millions of vectors. Most modern vector databases (Milvus, Weaviate, Pinecone) use HNSW as their default indexing method.
Understand its parameters, tune conservatively, and monitor latency/recall trade-offs in your specific workload.