Vector Database Benchmarking: Stop Choosing by Brand, Start Choosing by Data

The “best vector database” debates online are almost entirely useless. They argue about benchmarks run on synthetic datasets with query distributions that look nothing like your actual workload. What matters is performance on your data, your queries, and your constraints.

This guide teaches you how to benchmark properly — what metrics to measure, how to design fair tests, and what the existing public benchmarks actually tell you.

The Four Core Metrics

Every vector database benchmark should measure these:

1. Recall@K

The percentage of true nearest neighbors (from exact brute-force search) that appear in the ANN results.

Recall@10 = |ANN_top10 ∩ Exact_top10| / 10

Example:
Exact top 10 IDs:  [42, 891, 3, 7, 204, 318, 56, 12, 780, 45]
ANN top 10 IDs:    [42, 891, 3, 7, 204, 999, 56, 12, 780, 45]
                                         ↑ wrong
Recall@10 = 9/10 = 90%

As a rule of thumb, Recall@10 of 95% or higher is acceptable for RAG. Below about 90%, relevant chunks go missing often enough that answer quality tends to degrade noticeably.

2. Queries Per Second (QPS)

How many search requests the system can handle per second at a given recall target.

QPS at 95% recall is more meaningful than peak QPS with no recall constraint. High QPS achieved by sacrificing recall is not useful in production.
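
One way to report QPS at a fixed recall target is to sweep a search-time parameter and keep the fastest setting that still meets the target. The sketch below assumes you already have a list of measured runs; the `Run` type, the `ef_search` parameter name, and every number in it are illustrative placeholders, not measurements.

```python
from typing import NamedTuple, Optional

class Run(NamedTuple):
    ef_search: int   # hypothetical search-time parameter being swept
    recall: float    # measured Recall@10, from 0.0 to 1.0
    qps: float       # measured queries per second

def best_qps_at_recall(runs: list[Run], target: float = 0.95) -> Optional[Run]:
    """Return the highest-QPS run whose recall meets the target, or None."""
    eligible = [r for r in runs if r.recall >= target]
    return max(eligible, key=lambda r: r.qps) if eligible else None

# Placeholder numbers for illustration only.
runs = [
    Run(ef_search=16, recall=0.88, qps=9000.0),
    Run(ef_search=64, recall=0.96, qps=4000.0),
    Run(ef_search=256, recall=0.995, qps=1200.0),
]
best = best_qps_at_recall(runs, target=0.95)
print(best)
```

The fastest run (ef_search=16) is excluded because it misses the recall target, which is the point of reporting QPS this way.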

3. Latency (p50, p95, p99)

Median latency matters for user experience. p99 latency matters for reliability SLAs.

A system with p50=10ms and p99=5000ms is not a 10ms system — it’s a system that occasionally takes 5 seconds, which breaks user-facing applications.
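
A small synthetic example makes the gap between median and tail visible. The sketch below assumes a made-up latency sample in which 2% of requests take 5 seconds; the numbers are chosen to mirror the example above and are not measurements.

```python
import numpy as np

# Synthetic latencies in ms: 980 fast requests and 20 slow ones (2% of traffic).
latencies_ms = np.array([10.0] * 980 + [5000.0] * 20)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
mean = latencies_ms.mean()

print(f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms mean={mean:.1f}ms")
# prints: p50=10ms p95=10ms p99=5000ms mean=109.8ms
```

Here p50 and even p95 look healthy, and the mean hides the problem too. Only p99 exposes the 5-second requests.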

4. Index Build Time

How long does it take to build the index from scratch? This matters for:

Initial deployment (can you wait hours?)
Re-indexing after bulk updates
Disaster recovery time
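
Build time can be measured with a thin timing wrapper around whatever call creates the index. This is a minimal sketch: `time_build` is a hypothetical helper, and the sorted list is only a stand-in for a real index build.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def time_build(build_fn: Callable[[], T]) -> tuple[T, float]:
    """Run build_fn once and return (index, elapsed_seconds)."""
    start = time.perf_counter()
    index = build_fn()
    return index, time.perf_counter() - start

# Stand-in "index" for illustration: a list sorted once up front.
# Replace the lambda with the call that builds your real index.
index, seconds = time_build(lambda: sorted(range(100_000, 0, -1)))
print(f"build took {seconds:.3f}s for {len(index)} items")
```

If your database indexes in the background after upload, the timed call should block until the index is reported ready. Otherwise you measure upload time only.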

ANN-Benchmarks: The Public Standard

ANN-Benchmarks is the standard reference for comparing ANN algorithms. It provides:

Standardized datasets (SIFT-1M, GIST-1M, GloVe, deep-image)
Consistent hardware (single machine)
Pareto curves of recall vs QPS
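
A Pareto curve keeps only the configurations that no other configuration beats on both recall and QPS. The sketch below shows one way to compute that frontier from your own measurements; `pareto_front` is a hypothetical helper and the points are placeholder values.

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (recall, qps) points that no other point beats on both axes."""
    front = []
    best_qps = float("-inf")
    # Walk from highest recall down; keep a point only if it is faster
    # than every point with equal or higher recall seen so far.
    for recall, qps in sorted(points, key=lambda p: (-p[0], -p[1])):
        if qps > best_qps:
            front.append((recall, qps))
            best_qps = qps
    return front[::-1]  # ascending recall

# Placeholder (recall, qps) measurements for illustration only.
points = [(0.90, 5000.0), (0.95, 4000.0), (0.95, 3000.0), (0.99, 1000.0), (0.97, 900.0)]
print(pareto_front(points))
# prints: [(0.9, 5000.0), (0.95, 4000.0), (0.99, 1000.0)]
```

The point (0.97, 900.0) drops out because (0.99, 1000.0) has both higher recall and higher QPS.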

From 2024 ANN-Benchmarks results on SIFT-1M (1 million 128-dimensional float vectors):

Algorithm          | Recall@10 | QPS     | Notes
-------------------|-----------|---------|--------------------------------
hnswlib (M=16)     | 99.3%     | 62,000  | In-memory, no filtering
Qdrant HNSW        | 99.1%     | 58,000  | With payload index overhead
FAISS HNSW         | 99.2%     | 55,000  | Pure FAISS, no server overhead
ScaNN              | 99.0%     | 95,000  | Google's optimized ANN
FAISS IVF-PQ       | 97.2%     | 120,000 | Higher QPS, lower recall

(All run on same hardware; single-thread, no concurrent queries)

Critical caveat: ANN-Benchmarks tests single-threaded, non-concurrent search with no filtering. Production workloads are concurrent, often filtered, and at much larger scale. These numbers don’t transfer directly.

Designing Your Own Benchmark

Public benchmarks won’t tell you how a database performs on your specific use case. Run your own.

Step 1: Prepare Your Test Dataset

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Use real embeddings from your corpus
test_vectors = load_your_embeddings()   # (n_docs, embedding_dim)
query_vectors = load_your_queries()     # (n_queries, embedding_dim)

# Generate ground truth with exact search (small subset for speed)
def compute_ground_truth(queries, corpus, k=10):
    ground_truth = []
    for q in queries:
        sims = cosine_similarity([q], corpus)[0]
        top_k = np.argsort(sims)[::-1][:k]
        ground_truth.append(top_k.tolist())
    return ground_truth

# Use 1,000 query vectors for benchmarking
ground_truth = compute_ground_truth(query_vectors[:1000], test_vectors, k=10)

Step 2: Measure Recall@K

def measure_recall(db_results: list[list[int]], ground_truth: list[list[int]], k: int = 10):
    recalls = []
    for retrieved, true_top_k in zip(db_results, ground_truth):
        hits = len(set(retrieved[:k]) & set(true_top_k[:k]))
        recalls.append(hits / k)
    return np.mean(recalls)

Step 3: Measure Latency Under Load

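import asyncio
import time

import numpy as np

async def benchmark_concurrent(client, queries, concurrency=32):
    # Cap the number of in-flight requests at `concurrency`
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def single_query(q):
        async with semaphore:
            start = time.perf_counter()
            # async_search is a placeholder for your client's async search call
            await client.async_search(q, k=10)
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*[single_query(q) for q in queries])
    wall_elapsed = time.perf_counter() - wall_start

    return {
        "p50": np.percentile(latencies, 50) * 1000,  # ms
        "p95": np.percentile(latencies, 95) * 1000,
        "p99": np.percentile(latencies, 99) * 1000,
        # Throughput is requests over wall-clock time; summing per-request
        # latencies would double-count time that overlaps across requests.
        "qps": len(queries) / wall_elapsed,
    }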

Step 4: Test With Filters

Many benchmarks skip this, but filtered search performance varies enormously between databases:

# Test with 5% selectivity filter (only 5% of corpus matches)
# Test with 50% selectivity filter
# Test with 99% selectivity filter (nearly the whole corpus)
# Compare recall degradation across selectivity levels

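Qdrant’s payload-indexed filtering is designed to maintain recall across selectivity levels. Post-filtering approaches lose recall when the filter is restrictive (only a small fraction of the corpus matches), because most of the unfiltered top-K candidates are discarded after the search.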
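
The filter test can be sketched without any database. In the example below, exact search stands in for the ANN index, so the only recall loss comes from filtering after the search. The random data, the helper names `filtered_ground_truth` and `post_filter_search`, and the `fetch` parameter are all assumptions made for illustration.

```python
import numpy as np

def filtered_ground_truth(query, corpus, mask, k=10):
    """Exact top-k by cosine similarity among rows where mask is True."""
    ids = np.flatnonzero(mask)
    sub = corpus[ids]
    sims = sub @ query / (np.linalg.norm(sub, axis=1) * np.linalg.norm(query))
    return ids[np.argsort(-sims)[:k]].tolist()

def post_filter_search(query, corpus, mask, k=10, fetch=10):
    """Post-filtering: take the unfiltered top-`fetch`, then drop non-matching rows."""
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    candidates = np.argsort(-sims)[:fetch]
    return [int(i) for i in candidates if mask[i]][:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5000, 32))   # random stand-in for real embeddings
queries = rng.normal(size=(50, 32))

recall_by_fraction = {}
for match_fraction in (0.05, 0.50, 0.99):
    # Random metadata filter: roughly match_fraction of the corpus passes.
    mask = rng.random(len(corpus)) < match_fraction
    recalls = []
    for q in queries:
        truth = filtered_ground_truth(q, corpus, mask)
        found = post_filter_search(q, corpus, mask)
        recalls.append(len(set(found) & set(truth)) / 10)
    recall_by_fraction[match_fraction] = float(np.mean(recalls))

print(recall_by_fraction)
```

Recall should fall as the matching fraction shrinks, since fewer of the fetched candidates survive the filter. Raising `fetch` recovers recall at the cost of extra work per query, which is the trade-off post-filtering systems have to tune.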

Benchmark Results Comparison (Internal Testing, 2025)

Testing on 1M vectors, 1536 dimensions (OpenAI embedding space), concurrent load:

Database   | Recall@10 | p50 (ms) | p99 (ms) | Build Time | RAM (GB)
-----------|-----------|----------|----------|------------|---------
Qdrant     | 98.9%     | 8        | 45       | 22 min     | 9.2
Weaviate   | 98.6%     | 11       | 62       | 31 min     | 11.4
Milvus     | 99.1%     | 7        | 38       | 18 min     | 8.8
Pinecone   | 98.8%     | 12       | 55       | 5 min*     | N/A**
pgvector   | 99.8%     | 28       | 180      | 8 min      | 7.1

* Pinecone build time is upload time (managed indexing)
** RAM managed by Pinecone; not visible to user
Results are approximate — your mileage will vary with hardware, config, and data.

What Actually Matters for RAG

For most RAG use cases, the performance bottleneck is not the vector database. It’s:

The embedding model latency (100–300ms for API calls)
The LLM generation latency (1–30 seconds)

A vector search that takes 15ms vs 8ms is irrelevant compared to a 10-second LLM generation. Don’t over-optimize vector database performance in isolation.
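
The arithmetic is worth doing once with your own numbers. The sketch below assumes a simple sequential embed, search, generate pipeline and picks illustrative values from the ranges quoted above.

```python
# Illustrative values from the ranges quoted above; use your own measurements.
embedding_ms = 200.0      # within the 100-300 ms range for API embedding calls
generation_ms = 10_000.0  # a 10-second LLM generation

def end_to_end_ms(search_ms: float) -> float:
    """Total request latency for a sequential embed -> search -> generate pipeline."""
    return embedding_ms + search_ms + generation_ms

slow, fast = end_to_end_ms(15.0), end_to_end_ms(8.0)
saving_pct = 100.0 * (slow - fast) / slow
print(f"{slow:.0f} ms vs {fast:.0f} ms: {saving_pct:.2f}% faster end to end")
# prints: 10215 ms vs 10208 ms: 0.07% faster end to end
```

Under these assumptions, nearly halving vector search latency improves the end-to-end response time by less than a tenth of a percent.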

Focus your benchmarking effort on:

Recall quality (directly impacts answer quality)
Filtered search behavior (if you use metadata filters heavily)
Ingestion throughput (if you have continuous document updates)
Cost (often more constraining than raw performance)

2025 Trend: VectorDBBench

VectorDBBench (maintained by the Zilliz/Milvus team) provides a standardized benchmarking framework that runs against real databases in Docker. It covers Qdrant, Milvus, Weaviate, Pinecone, and pgvector with a reproducible methodology. It is worth running against your own data before making a final database choice.

Run your benchmarks on hardware identical to your production environment. Cloud instance type, memory bandwidth, and storage IOPS all significantly affect results.