Vector Database Benchmarking: Stop Choosing by Brand, Start Choosing by Data
The “best vector database” debates online are almost entirely useless. They argue about benchmarks run on synthetic datasets with query distributions that look nothing like your actual workload. What matters is performance on your data, your queries, and your constraints.
This guide teaches you how to benchmark properly — what metrics to measure, how to design fair tests, and what the existing public benchmarks actually tell you.
The Four Core Metrics
Every vector database benchmark should measure these:
1. Recall@K
The percentage of true nearest neighbors (from exact brute-force search) that appear in the ANN results.
Recall@10 = |ANN_top10 ∩ Exact_top10| / 10
Example:

Exact top 10 IDs: [42, 891, 3, 7, 204, 615, 56, 12, 780, 45]
ANN top 10 IDs:   [42, 891, 3, 7, 204, 999, 56, 12, 780, 45]   ← 999 is wrong
Recall@10 = 9/10 = 90%

Recall@10 of 95%+ is generally acceptable for RAG. Below 90% you’ll notice degraded answer quality.
2. Queries Per Second (QPS)
How many search requests the system can handle per second at a given recall target.
QPS at 95% recall is more meaningful than peak QPS with no recall constraint. High QPS achieved by sacrificing recall is not useful in production.
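As a rough sketch of what "QPS at a recall target" means in practice: sweep a search-time parameter, record recall and QPS for each setting, then report the fastest setting that still meets the target. The `Run` type, the `ef_search` labels, and all numbers below are made up for illustration and are not measurements.

```python
from typing import NamedTuple, Optional


class Run(NamedTuple):
    label: str      # hypothetical config label, e.g. a search-time parameter
    recall: float   # measured Recall@10, in the range 0..1
    qps: float      # measured queries per second


def qps_at_recall(runs: list[Run], target: float = 0.95) -> Optional[Run]:
    """Return the fastest run whose recall meets the target, or None."""
    eligible = [r for r in runs if r.recall >= target]
    return max(eligible, key=lambda r: r.qps) if eligible else None


# Illustrative numbers only (not real measurements).
sweep = [
    Run("ef_search=16", 0.89, 21000.0),
    Run("ef_search=64", 0.96, 9500.0),
    Run("ef_search=256", 0.995, 2800.0),
]

best = qps_at_recall(sweep, target=0.95)
peak = max(sweep, key=lambda r: r.qps)
print(f"peak QPS: {peak.label}; QPS at 95% recall: {best.label}")
```

In this toy sweep the peak-QPS setting misses the recall target, so the number worth reporting is the slower `ef_search=64` run.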
3. Latency (p50, p95, p99)
Median latency matters for user experience. p99 latency matters for reliability SLAs.
A system with p50=10ms and p99=5000ms is not a 10ms system — it’s a system that occasionally takes 5 seconds, which breaks user-facing applications.
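A small sketch with synthetic numbers shows why the median alone hides this. The latency array below is invented for illustration: 985 requests at 10 ms and 15 requests at 5 seconds.

```python
import numpy as np

# Synthetic latencies in ms: 985 fast requests plus 15 slow outliers.
latencies_ms = np.concatenate([np.full(985, 10.0), np.full(15, 5000.0)])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
mean = latencies_ms.mean()
print(f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms mean={mean:.0f}ms")
```

With only 1.5% of requests being slow, both p50 and p95 still report 10 ms. The problem only becomes visible at p99, which is why tail percentiles belong in every report.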
4. Index Build Time
How long does it take to build the index from scratch? This matters for:
- Initial deployment (can you wait hours?)
- Re-indexing after bulk updates
- Disaster recovery time
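Build time can be measured with a simple wall-clock wrapper. The sketch below assumes you can express the build as a zero-argument function. `time_build` and `fake_build` are hypothetical names, and `fake_build` is only a stand-in workload so the example runs on its own.

```python
import time
from typing import Any, Callable


def time_build(build_index: Callable[[], Any], repeats: int = 3) -> dict:
    """Time a zero-argument index-build callable; report best and mean wall time."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        build_index()
        durations.append(time.perf_counter() - start)
    return {
        "best_s": min(durations),
        "mean_s": sum(durations) / len(durations),
        "runs": repeats,
    }


# Stand-in workload; replace with a real build, for example a function that
# creates a fresh collection and inserts every vector.
def fake_build() -> list:
    return sorted(range(200_000), reverse=True)


build_stats = time_build(fake_build, repeats=2)
print(build_stats)
```

For a real database, make sure each repeat starts from an empty collection and that the timer stops only once the index is fully built and queryable, not when the upload call returns.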
ANN-Benchmarks: The Public Standard
ANN-Benchmarks is the standard reference for comparing ANN algorithms. It provides:
- Standardized datasets (SIFT-1M, GIST-1M, GloVe, deep-image)
- Consistent hardware (single machine)
- Pareto curves of recall vs QPS
From 2024 ANN-Benchmarks results on SIFT-1M (1 million 128-dimensional float vectors):
Algorithm          | Recall@10 | QPS     | Notes
-------------------|-----------|---------|--------------------------------
hnswlib (M=16)     | 99.3%     | 62,000  | In-memory, no filtering
Qdrant HNSW        | 99.1%     | 58,000  | With payload index overhead
FAISS HNSW         | 99.2%     | 55,000  | Pure FAISS, no server overhead
ScaNN              | 99.0%     | 95,000  | Google's optimized ANN
FAISS IVF-PQ       | 97.2%     | 120,000 | Higher QPS, lower recall

(All run on same hardware; single-thread, no concurrent queries)

Critical caveat: ANN-Benchmarks tests single-threaded, non-concurrent search with no filtering. Production workloads are concurrent, often filtered, and at much larger scale. These numbers don’t transfer directly.
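One way to read a recall-vs-QPS table is to keep only the entries that no other entry beats on both axes. The sketch below applies that idea to the five rows above. ANN-Benchmarks itself draws Pareto curves over many parameter settings per algorithm, so treating one row per algorithm as a point is a simplification, and `pareto_frontier` is a hypothetical helper name.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Names of entries not dominated on (recall, QPS); higher is better for both."""
    frontier = []
    for name, (recall, qps) in points.items():
        dominated = any(
            r >= recall and q >= qps and (r > recall or q > qps)
            for other, (r, q) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (Recall@10 in %, QPS) copied from the SIFT-1M table above.
results = {
    "hnswlib (M=16)": (99.3, 62_000),
    "Qdrant HNSW": (99.1, 58_000),
    "FAISS HNSW": (99.2, 55_000),
    "ScaNN": (99.0, 95_000),
    "FAISS IVF-PQ": (97.2, 120_000),
}

print(pareto_frontier(results))
# ['hnswlib (M=16)', 'ScaNN', 'FAISS IVF-PQ']
```

The two entries that drop out are not bad choices. Their extra overhead buys features such as payload indexing that this single-threaded, unfiltered comparison does not reward.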
Designing Your Own Benchmark
Public benchmarks won’t tell you how a database performs on your specific use case. Run your own.
Step 1: Prepare Your Test Dataset
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Use real embeddings from your corpus.
# load_your_embeddings / load_your_queries are placeholders for your own loaders.
test_vectors = load_your_embeddings()   # (n_docs, embedding_dim)
query_vectors = load_your_queries()     # (n_queries, embedding_dim)

# Generate ground truth with exact search (small subset for speed)
def compute_ground_truth(queries, corpus, k=10):
    ground_truth = []
    for q in queries:
        sims = cosine_similarity([q], corpus)[0]
        top_k = np.argsort(sims)[::-1][:k]
        ground_truth.append(top_k.tolist())
    return ground_truth

# Use 1,000 query vectors for benchmarking
ground_truth = compute_ground_truth(query_vectors[:1000], test_vectors, k=10)
```

Step 2: Measure Recall@K
```python
def measure_recall(db_results: list[list[int]], ground_truth: list[list[int]], k: int = 10):
    recalls = []
    for retrieved, true_top_k in zip(db_results, ground_truth):
        hits = len(set(retrieved[:k]) & set(true_top_k[:k]))
        recalls.append(hits / k)
    return np.mean(recalls)
```

Step 3: Measure Latency Under Load
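```python
import asyncio
import time

import numpy as np

async def benchmark_concurrent(client, queries, concurrency=32):
    # client.async_search is a placeholder for your database client's async search call
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def single_query(q):
        async with semaphore:
            start = time.perf_counter()
            await client.async_search(q, k=10)
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*[single_query(q) for q in queries])
    wall_time = time.perf_counter() - wall_start

    return {
        "p50": np.percentile(latencies, 50) * 1000,  # ms
        "p95": np.percentile(latencies, 95) * 1000,
        "p99": np.percentile(latencies, 99) * 1000,
        # Throughput is queries over wall-clock time; dividing by the sum of
        # per-query latencies would undercount QPS when requests overlap.
        "qps": len(queries) / wall_time,
    }
```

Step 4: Test With Filters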
Many benchmarks skip this, but filtered search performance varies enormously between databases:
```python
# Test with 5% selectivity filter (only 5% of corpus matches)
# Test with 50% selectivity filter
# Test with 99% selectivity filter (nearly the whole corpus)
# Compare recall degradation across selectivity levels
```

Qdrant’s payload-indexed filtering maintains recall across selectivity levels. Post-filtering approaches lose recall when the filter is restrictive (few vectors match, as in the 5% case), because most of the unfiltered top-K candidates are discarded after the search.
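The selectivity test can be sketched without any database. The example below uses a synthetic corpus and random boolean masks as a stand-in for metadata filters. It computes exact filtered ground truth, then scores a deliberately naive post-filter (search first, filter afterwards) against it. Both helper functions are hypothetical, and `post_filter` is a simplified model, not how any particular database implements filtering.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5_000, 32)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = corpus[:50] + 0.1 * rng.normal(size=(50, 32)).astype(np.float32)


def filtered_ground_truth(q, corpus, mask, k=10):
    """Exact top-k by cosine similarity, restricted to rows where mask is True."""
    ids = np.flatnonzero(mask)
    sims = corpus[ids] @ q  # corpus rows are unit-norm, so this ranks by cosine
    return ids[np.argsort(-sims)[:k]]


def post_filter(q, corpus, mask, k=10):
    """Naive post-filtering: take the unfiltered top-k, then drop non-matches."""
    top = np.argsort(-(corpus @ q))[:k]
    return top[mask[top]]


post_filter_recall = {}
for selectivity in (0.05, 0.50, 0.99):
    mask = rng.random(len(corpus)) < selectivity  # True = passes the filter
    recalls = []
    for q in queries:
        truth = filtered_ground_truth(q, corpus, mask)
        got = post_filter(q, corpus, mask)
        recalls.append(len(set(got) & set(truth)) / len(truth))
    post_filter_recall[selectivity] = float(np.mean(recalls))
    print(f"{selectivity:.0%} match: post-filter recall@10 = {post_filter_recall[selectivity]:.2f}")
```

To benchmark a real system, keep `filtered_ground_truth` as the reference and replace `post_filter` with the database's filtered search call, using your actual metadata fields instead of random masks.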
Benchmark Results Comparison (Internal Testing, 2025)
Testing on 1M vectors, 1536 dimensions (OpenAI embedding space), concurrent load:
Database   | Recall@10 | p50 (ms) | p99 (ms) | Build Time | RAM (GB)
-----------|-----------|----------|----------|------------|---------
Qdrant     | 98.9%     | 8        | 45       | 22 min     | 9.2
Weaviate   | 98.6%     | 11       | 62       | 31 min     | 11.4
Milvus     | 99.1%     | 7        | 38       | 18 min     | 8.8
Pinecone   | 98.8%     | 12       | 55       | 5 min*     | N/A**
pgvector   | 99.8%     | 28       | 180      | 8 min      | 7.1

* Pinecone build time is upload time (managed indexing)
** RAM managed by Pinecone; not visible to user

Results are approximate; your mileage will vary with hardware, config, and data.

What Actually Matters for RAG
For most RAG use cases, the performance bottleneck is not the vector database. It’s:
- The embedding model latency (100–300ms for API calls)
- The LLM generation latency (1–30 seconds)
A vector search that takes 15ms vs 8ms is irrelevant compared to a 10-second LLM generation. Don’t over-optimize vector database performance in isolation.
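The arithmetic is easy to check. The sketch below takes 200 ms for embedding (the midpoint of the range quoted above) and the 10-second generation from the example, then compares the end-to-end time for an 8 ms and a 15 ms vector search. The figures are illustrative, not measured.

```python
# Illustrative latency budget, all values in milliseconds.
embedding_ms = 200.0
llm_ms = 10_000.0


def end_to_end(search_ms: float) -> float:
    """Total request time: embed the query, search, then generate."""
    return embedding_ms + search_ms + llm_ms


fast, slow = end_to_end(8.0), end_to_end(15.0)
saving_pct = (slow - fast) / slow * 100
print(f"{slow:.0f} ms -> {fast:.0f} ms, a {saving_pct:.2f}% improvement")
```

Nearly halving vector search latency shortens the whole request by well under a tenth of a percent in this scenario.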
Focus your benchmarking effort on:
- Recall quality (directly impacts answer quality)
- Filtered search behavior (if you use metadata filters heavily)
- Ingestion throughput (if you have continuous document updates)
- Cost (often more constraining than raw performance)
2025 Trend: VectorDBBench
VectorDBBench (maintained by the Zilliz/Milvus team) provides a standardized benchmarking framework that runs against real databases in Docker. It covers Qdrant, Milvus, Weaviate, Pinecone, and pgvector with reproducible methodology. It is worth running against your specific data before making a final database choice.
Run your benchmarks on hardware identical to your production environment. Cloud instance type, memory bandwidth, and storage IOPS all significantly affect results.
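Because results depend on the machine, it helps to store a description of the host next to every result file. A minimal sketch using only the Python standard library follows. `environment_snapshot` is a hypothetical helper, and details such as cloud instance type or storage IOPS are not captured here and would need to be recorded separately.

```python
import json
import os
import platform


def environment_snapshot() -> dict:
    """Collect basic host facts to store alongside each benchmark result."""
    return {
        "machine": platform.machine(),
        "system": platform.system(),
        "release": platform.release(),
        "python": platform.python_version(),
        "logical_cpus": os.cpu_count(),
    }


snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2))
```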