HNSW Index: High-Performance Vector Search Algorithm

HNSW (Hierarchical Navigable Small World) is arguably the best all-around algorithm for RAG systems, offering excellent recall, sub-millisecond latency, and reasonable memory overhead.

How HNSW Works: The Core Idea

HNSW builds a multi-layered graph structure inspired by “small world” networks (where any two nodes reach each other quickly through few hops).

The Layers

Layer 3 (top):      • ← Entry point, sparsest
                    | (long-range connections)

Layer 2:           • — • — •
                   | \ / \ |

Layer 1:          • - • - • - • - •
                  |\ /|\ /|\ /|

Layer 0 (bottom): All vectors
                  Dense local connections

Key insight:

- Top layers: Long-range connections, quick coarse-grained search
- Lower layers: Local connections, fine-grained search
- Allows efficient navigation from any query to its neighbors

Vector Storage in HNSW

Each vector appears in one or more layers:

Vector A:
  Layer 3: No
  Layer 2: Yes, connected to vectors B, C
  Layer 1: Yes, connected to B, C, D, E, F
  Layer 0: Yes, connected to all nearby vectors

Probability of appearing in layer L: exp(-L / ln(m_L))
Result: Sparse top layers, dense bottom layer

Finding nearest neighbors to query Q:

Step 1: Start at top layer entry point
Step 2: For current layer:
        - Compute similarity to neighbors
        - Move toward most similar neighbor
        - Repeat until converged (local minimum)
Step 3: Move to next lower layer
Step 4: Repeat from Step 2
Step 5: Return top K from Layer 0

Result: Approximates the true nearest neighbors

Visual navigation:

Top layer: Jump to approximate region
  Q ——→ nearest region in layer

Middle layers: Refine location
  Q ——→ ——→ getting closer

Bottom layer: Find exact neighbors
  Q ——→ ——→ ——→ found! Within true k-NN

HNSW Parameters: Understanding and Tuning

M (Maximum Connections)

Number of bidirectional connections each node maintains.

M = 16 (typical)

Higher M:
  ✓ Better recall (more connections = better graph)
  ✗ Slower search (more neighbors to check)
  ✗ More memory (more edges to store)

Lower M:
  ✓ Faster search
  ✓ Less memory
  ✗ Worse recall

Recommendation: Start with M=16, tune based on latency constraints

ef_construction (Candidate Set Size During Building)

How many candidates to consider when adding a vector.

ef_construction = 200 (typical)

Higher ef_construction:
  ✓ Better index quality (takes time to find good neighbors)
  ✗ Longer build time

Lower ef_construction:
  ✓ Faster build
  ✗ Worse final index

Recommendation: Set high during build (it's one-time cost)
                 Lower at query time to balance speed

ef (Query Time Candidate Set Size)

How many candidates to explore during search.

ef = 40 (typical at query time)

Higher ef:
  ✓ Better recall (search more thoroughly)
  ✗ Slower queries

Lower ef:
  ✓ Faster queries
  ✗ Worse recall

Recommendation: This is the main knob to tune speed/quality

Implementation: Using HNSW in Practice

Python with hnswlib

import hnswlib
import numpy as np

# Initialize index
dimension = 768
max_elements = 1000000
M = 16  # max connections per node
ef_construction = 200  # size of dynamic list

index = hnswlib.Index(space='l2', dim=dimension)
index.init_index(max_elements=max_elements,
                ef_construction=ef_construction, M=M)

# Add vectors to index
embeddings = np.random.randn(1000000, 768).astype('float32')
labels = np.arange(1000000)
index.add_items(embeddings, labels)

# Query
query = np.random.randn(768).astype('float32').reshape(1, -1)
ef = 40  # Size of candidate list (can change per query)
labels, distances = index.knn_query(query, k=10, ef=ef)

print(f"Top 10 neighbors: {labels[0]}")
print(f"Distances: {distances[0]}")

# Save/Load index
index.save_index("hnsw.bin")
index_loaded = hnswlib.Index(space='l2', dim=dimension)
index_loaded.load_index("hnsw.bin")

Using HNSW in Production Databases

Milvus:

from pymilvus import CollectionSchema, FieldSchema, DataType
from pymilvus import Collection, connections

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
]
schema = CollectionSchema(fields=fields, description="Documents")

# Create collection
collection = Collection(name="documents", schema=schema)

# Create HNSW index
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {
            "M": 16,
            "efConstruction": 200,
            "ef": 64
        }
    }
)

# Insert vectors
mr = collection.insert([ids, embeddings])

# Search
search_params = {"metric_type": "L2", "params": {"ef": 40}}
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=10,
    expr=None
)

Weaviate:

from weaviate import Client

client = Client("http://localhost:8080")

# Create class with HNSW indexing
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-transformers",
    "vectorIndexConfig": {
        "name": "hnsw",
        "ef": 64,           # Query time parameter
        "efConstruction": 200,  # Build time parameter
        "maxConnections": 16,   # M parameter
        "pq": {"enabled": False}  # PQ compression (optional)
    }
})

# Add documents
data_object = {"title": "RAG Fundamentals"}
client.data_object.create(data_object, class_name="Document")

# Query
result = client.query.get("Document", ["title"]).with_near_vector({
    "vector": query_embedding
}).with_limit(10).do()

Performance Characteristics

Recall vs. Speed Trade-off

Testing on 1M vector dataset:

ef=20:   Latency: 2ms,   Recall@10: 0.85
ef=40:   Latency: 5ms,   Recall@10: 0.94
ef=60:   Latency: 8ms,   Recall@10: 0.97
ef=100:  Latency: 15ms,  Recall@10: 0.98
ef=200:  Latency: 30ms,  Recall@10: 0.995

Typical target: ef=40-60 for 90-95% recall, <10ms latency

Memory Overhead

Raw vectors: 1M × 768 × 4 bytes = 3GB

HNSW overhead:
- Graph edges: 1M × 16 × 8 bytes = 128MB
- Layer assignments: 1M × 1 byte = 1MB
- Metadata: ~50MB

Total HNSW: 3.2GB (6% overhead)

Build Time

1M vectors, 768 dimensions:

Single-threaded: ~5-10 minutes
Multi-threaded (8 cores): ~1-2 minutes
With GPU: Would need custom implementation

Key: Build is one-time cost, worth optimizing for quality

Tuning HNSW for Your Workload

For Maximum Recall

# Use generous parameters
index = hnswlib.Index(space='l2', dim=768)
index.init_index(max_elements=1000000,
                ef_construction=400,  # High construction effort
                M=32)  # More connections

# Query with high ef
labels, distances = index.knn_query(query, k=10, ef=200)

Trade-off: 99%+ recall but slower queries
Use for: Small result sets, accuracy-critical applications

For Balanced Performance

# Typical production setup
index = hnswlib.Index(space='l2', dim=768)
index.init_index(max_elements=1000000,
                ef_construction=200,  # Standard
                M=16)  # Standard

# Query with moderate ef
labels, distances = index.knn_query(query, k=10, ef=50)

Trade-off: 95-98% recall, ~5-10ms latency
Use for: Most production systems

For Speed-Optimized

# Minimal parameters
index = hnswlib.Index(space='l2', dim=768)
index.init_index(max_elements=1000000,
                ef_construction=100,  # Faster build
                M=8)  # Fewer connections

# Query with low ef
labels, distances = index.knn_query(query, k=10, ef=20)

Trade-off: 85-90% recall, ~1-2ms latency
Use for: High-throughput, lower-latency systems

Updating HNSW Indexes

Adding New Vectors

# Dynamic addition (vectors can be added after index creation)
new_embeddings = np.random.randn(1000, 768).astype('float32')
new_labels = np.arange(1000000, 1001000)

index.add_items(new_embeddings, new_labels)

Benefit: No need to rebuild entire index
Limitation: Incremental additions might have slightly lower quality
Solution: Periodic full reindexing if needed

Deleting Vectors

# HNSW doesn't support deletion directly
# Options:
# 1. Mark as deleted (filter during search)
# 2. Rebuild index periodically
# 3. Use Milvus/Weaviate which handle deletion

# Mark-as-deleted approach:
deleted_ids = {123, 456, 789}

labels, distances = index.knn_query(query, k=100)
filtered_results = [l for l in labels[0] if l not in deleted_ids][:10]

Advanced: HNSW + Quantization

Combining HNSW with quantization for extreme scale:

# Quantized storage (lower precision) + HNSW indexing
def quantize_vector(embedding):
    # Scale to [-127, 127] int8 range
    return np.int8(np.clip(embedding * 127, -127, 127))

# Build index on quantized vectors
quantized_embeddings = np.array([quantize_vector(e)
                                 for e in embeddings])
index.add_items(quantized_embeddings, labels)

Trade-off:
- Memory: 4x reduction (float32 → int8)
- Speed: Faster computation
- Recall: Slight reduction (usually negligible)

Comparison to Other Algorithms

Aspect	HNSW	IVF	PQ
Recall	99%+	90-95%	85-90%
Latency	<1ms	1-10ms	<1ms
Memory	~10% overhead	~5% overhead	<5% raw
Scalability	1M-100M	10M-1B	100M-1B
Ease	Easy	Medium	Hard

Recommendation: Start with HNSW unless you have extreme scale or latency requirements.

Conclusion

HNSW is the gold standard for RAG vector indexing. It provides excellent recall (99%+) with sub-millisecond latency, is easy to tune, and scales to hundreds of millions of vectors. Most modern vector databases (Milvus, Weaviate, Pinecone) use HNSW as their default indexing method.

Understand its parameters, tune conservatively, and monitor latency/recall trade-offs in your specific workload.

HNSW Index: Hierarchical Navigation for Fast Vector Retrieval