Reranking: The Second Opinion That Makes RAG Accurate

Vector similarity search is fast and scalable, but it has a subtle problem: the query and each document are encoded independently, and each is compressed into a single embedding. This bi-encoder architecture is great for recall, since it can scan millions of documents quickly, but it is weaker on precision. The top-ranked result isn’t always the most relevant one.

Reranking adds a second, slower, more accurate stage. After the vector search retrieves 20–50 candidate documents, a reranker evaluates each candidate in the context of the query — not with separate embeddings, but by jointly processing query and document together. This cross-encoder architecture is significantly more accurate at identifying which candidates actually answer the question.

The Two-Stage Retrieval Architecture

Stage 1: Bi-Encoder Retrieval (Fast, Approximate)
  Query → Embed → Vector Search → Top-50 Candidates
  Speed: ~10ms for 1M vectors
  Accuracy: Good recall, imperfect precision

Stage 2: Cross-Encoder Reranking (Slow, Precise)
  Query + Candidate1 → Cross-Encoder → Score 0.94
  Query + Candidate2 → Cross-Encoder → Score 0.67
  Query + Candidate3 → Cross-Encoder → Score 0.88
  ...repeat for all 50 candidates...
  Sort by score → Top-5 for LLM context
  Speed: ~50-200ms for 50 candidates
  Accuracy: High precision

Total pipeline: ~60-210ms (Stage 1 + Stage 2) for high-quality top-5 results
vs.
Stage 1 alone: ~10ms but lower precision

The bi-encoder’s job is to reduce the search space from millions to dozens. The cross-encoder’s job is to order those dozens accurately.
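To make the division of labour concrete, here is a minimal sketch of the two stages in plain Python. Everything in it is a stand-in: `embed` is a toy bag-of-words "bi-encoder", `toy_pair_scorer` is a hypothetical joint scorer that rewards matching word order, and the three-document corpus is invented for illustration. A real pipeline would swap in an embedding model, a vector index, and a cross-encoder.

```python
import math
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """Stage 1: cheap, independent scoring over the whole corpus."""
    q_vec = embed(query)
    return sorted(corpus, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:k]

def rerank(query: str, candidates: list[str],
           pair_scorer: Callable[[str, str], float], top_n: int) -> list[str]:
    """Stage 2: more expensive joint scoring, run only on the candidates."""
    return sorted(candidates, key=lambda d: pair_scorer(query, d), reverse=True)[:top_n]

def toy_pair_scorer(query: str, doc: str) -> float:
    """Hypothetical joint scorer: counts query bigrams that appear in the document."""
    q, d = query.lower().split(), doc.lower().split()
    return float(len(set(zip(q, q[1:])) & set(zip(d, d[1:]))))

corpus = ["dog bites man", "man bites dog", "cat sleeps all day"]
query = "man bites dog"

candidates = retrieve(query, corpus, k=2)                  # bag-of-words cannot tell the first two apart
top = rerank(query, candidates, toy_pair_scorer, top_n=1)  # the joint scorer can
print(candidates, top)
```

The first two documents tie in stage 1 because word order is lost in the independent encoding; the joint scorer in stage 2 sees query and document together and breaks the tie.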

Cross-Encoder vs Bi-Encoder

The architectural difference explains why cross-encoders are more accurate:

Bi-Encoder:
  Query ──→ Encoder ──→ Query Vector [0.12, -0.34, ...]
  Doc   ──→ Encoder ──→ Doc Vector   [0.10, -0.31, ...]
  Score = cosine_similarity(query_vec, doc_vec)
  ↑ Independent encoding — no cross-attention between query and document

Cross-Encoder:
  [CLS] Query [SEP] Document [SEP]
         ↓ Full transformer, self-attention over the concatenated pair ↓
  Relevance Score: 0.94
  ↑ Joint encoding — every token in query can attend to every token in document

The cross-encoder’s joint attention allows it to understand subtle interactions: “The document mentions Python version 3.10, and the query asks about a function introduced in Python 3.10” — the kind of cross-reference that bi-encoders can’t capture.
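The bi-encoder half of the diagram can be sketched in a few lines. The point it illustrates is that document vectors are computed once, ahead of time, and the query only meets them through a cosine score. The three-dimensional vectors and the `doc_a` / `doc_b` ids below are made-up illustration values, not output from any real model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Document vectors are produced at index time, without ever seeing the query.
doc_index = {
    "doc_a": [0.10, -0.31, 0.50],
    "doc_b": [-0.40, 0.22, 0.05],
}

# At query time, only the query is encoded; scoring is a cheap vector comparison.
query_vec = [0.12, -0.34, 0.48]
scores = {doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_index.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A cross-encoder has no equivalent of `doc_index`: because query and document are processed together, nothing can be precomputed per document, which is why it only runs on the short candidate list.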

Cohere Rerank API

The simplest way to add reranking is Cohere’s Rerank endpoint:

import cohere

co = cohere.Client("your-api-key")

def rerank_documents(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-english-v3.0",  # or rerank-multilingual-v3.0
    )

    results = []
    for item in response.results:
        results.append({
            "document": documents[item.index],
            "relevance_score": item.relevance_score,
            "original_rank": item.index + 1,
        })
    return results

# Usage in RAG pipeline (vector_search is a placeholder for your own retriever)
candidate_docs = vector_search(query, k=20)
reranked = rerank_documents(query, [doc.page_content for doc in candidate_docs])
top_5_for_llm = reranked  # already limited to 5 results by top_n

Open Source Rerankers

For self-hosted deployments, several open-source cross-encoder models are available:

BGE Reranker (Beijing Academy of Artificial Intelligence)

from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# Provide query-document pairs
pairs = [[query, doc] for doc in candidate_texts]
scores = reranker.compute_score(pairs, normalize=True)

# Sort and return top candidates
ranked = sorted(zip(candidate_texts, scores), key=lambda x: x[1], reverse=True)
top_5 = [doc for doc, _ in ranked[:5]]

FlashRank: Fast Local Reranking

from flashrank import Ranker, RerankRequest

# Fast, lightweight — designed for sub-10ms reranking
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt/cache/")

passages = [
    {"id": i, "text": doc, "meta": {"source": src}}
    for i, (doc, src) in enumerate(zip(docs, sources))
]

rerank_request = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerank_request)

# results is sorted by relevance score
for r in results[:5]:
    print(f"Score: {r['score']:.3f} | {r['text'][:80]}")

Jina Reranker

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

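model_name = "jinaai/jina-reranker-v2-base-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# This checkpoint ships custom model code, so trust_remote_code=True is needed to load it
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)
model.eval()  # inference mode: disables dropout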

def jina_rerank(query, documents):
    pairs = [[query, doc] for doc in documents]
    features = tokenizer(
        pairs,
        padding=True, truncation=True,
        return_tensors="pt", max_length=512
    )
    with torch.no_grad():
        scores = model(**features).logits.flatten().tolist()
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
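One practical wrinkle: a function like `jina_rerank` returns raw logits, which are unbounded and hard to threshold. A common way to map them into the 0 to 1 range is a sigmoid, which is also, as far as I know, what `normalize=True` does in the BGE example. The sketch below assumes the input is a list of `(document, logit)` pairs; the sample values are invented.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def normalize_scores(ranked: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Map raw logits to (0, 1). Sigmoid is monotonic, so the order is unchanged."""
    return [(doc, sigmoid(score)) for doc, score in ranked]

# Hypothetical reranker output: (document, raw logit), already sorted
ranked = [("doc A", 2.0), ("doc B", 0.0), ("doc C", -3.0)]
normalized = normalize_scores(ranked)
for doc, score in normalized:
    print(f"{doc}: {score:.3f}")
```

Normalized scores make it easier to apply a cutoff (for example, dropping candidates below some threshold you have validated on your own data), but they are not calibrated probabilities.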

LangChain Reranking Integration

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS

# Cohere reranker as a compression step
cohere_reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,
)

# Wrap base retriever with reranking (vectorstore is assumed to be a FAISS index built elsewhere)
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

# Returns top-5 after reranking 20 candidates
results = reranking_retriever.invoke("How does attention work in transformers?")

Reranking Model Benchmarks

From MTEB Reranking Leaderboard (2025):

Model                                     | MAP@10 | Latency* | Size
------------------------------------------|--------|----------|------
Cohere rerank-english-v3.0                | 0.714  | API      | API
BAAI/bge-reranker-large                   | 0.702  | 85ms     | 560MB
BAAI/bge-reranker-v2-m3                   | 0.721  | 90ms     | 568MB
cross-encoder/ms-marco-MiniLM-L-6-v2      | 0.680  | 45ms     | 140MB
flashrank ms-marco-MiniLM-L-12-v2         | 0.672  | 8ms      | 34MB
jinaai/jina-reranker-v2-base-multilingual | 0.695  | 75ms     | 278MB

* Per 20-document batch, GPU

FlashRank is the speed champion for latency-sensitive applications. BGE v2-m3 is best for multilingual use cases. Cohere is simplest to deploy.
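For readers unfamiliar with the MAP@10 column, here is a small sketch of how mean average precision at k can be computed from binary relevance labels. It assumes one common definition, where each query's average precision is normalized by min(k, number of relevant documents); leaderboards may use a slightly different variant, so treat this as an illustration of the idea rather than the exact scoring script.

```python
def average_precision_at_k(relevance: list[int], k: int = 10) -> float:
    """relevance: 0/1 labels in ranked order, covering all relevant docs for the query."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denom = min(k, sum(relevance))
    return precision_sum / denom if denom else 0.0

def map_at_k(runs: list[list[int]], k: int = 10) -> float:
    """Mean of per-query average precision."""
    return sum(average_precision_at_k(r, k) for r in runs) / len(runs)

# Two hypothetical queries: relevant docs at ranks 1 and 3, then at rank 2
runs = [[1, 0, 1], [0, 1]]
print(round(map_at_k(runs), 4))
```

The metric rewards putting relevant documents early: moving a relevant document from rank 3 to rank 2 raises the score even though the same documents are returned.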

When Reranking Pays Off

Reranking improves results most when:

Top-K vector search results are good but not perfectly ordered
Queries are complex (multi-condition, multi-hop)
Documents are long and contain relevant content scattered throughout

Reranking adds less value when:

Vector search already achieves high precision (simple factual queries on clean corpora)
You need sub-100ms total latency (cross-encoder adds 50–200ms)
Your candidate pool after vector search is small (< 10 documents)
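Two of these conditions are easy to check in code before paying for a reranker call. The helper below is a rough sketch: `should_rerank` and its parameters are hypothetical names, and the defaults (10 candidates, 10ms retrieval, 50ms reranking) are simply the figures quoted in this article, which you would replace with measurements from your own system. Whether vector search alone is already precise enough is something only an offline evaluation can tell you, so it is not modelled here.

```python
def should_rerank(
    num_candidates: int,
    latency_budget_ms: float,
    retrieval_ms: float = 10.0,   # assumed stage-1 cost
    rerank_ms: float = 50.0,      # assumed stage-2 cost (low end of 50-200ms)
    min_candidates: int = 10,
) -> bool:
    """Return True when reranking fits the latency budget and has enough candidates to reorder."""
    if num_candidates < min_candidates:
        return False  # too few candidates for reordering to matter much
    return retrieval_ms + rerank_ms <= latency_budget_ms

print(should_rerank(num_candidates=50, latency_budget_ms=300))  # fits the budget
print(should_rerank(num_candidates=5, latency_budget_ms=300))   # pool too small
print(should_rerank(num_candidates=50, latency_budget_ms=40))   # budget too tight
```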

2025 Trend: Listwise Reranking

Traditional cross-encoders score documents independently against the query (pointwise). Newer listwise rerankers consider the relative relevance of all candidates together, which improves diversity and prevents redundancy in the final set. ListT5 and RankVicuna are early examples of this approach showing consistent improvements over pointwise methods.
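The difference between the two views can be illustrated with a toy example. To be clear about what this is: the `listwise_rank` function below is a simple greedy, MMR-style selection with a token-overlap penalty, used here only as a stand-in for "looking at the candidates together". It is not how ListT5 or RankVicuna work internally (those use a language model to produce an ordering), and the scores and documents are invented.

```python
def pointwise_rank(scores: dict[str, float], top_n: int) -> list[str]:
    """Each document is judged on its own score; near-duplicates can fill the top slots."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def listwise_rank(scores: dict[str, float], top_n: int, redundancy_weight: float = 0.5) -> list[str]:
    """Greedy selection that penalizes overlap with documents already chosen."""
    selected: list[str] = []
    remaining = list(scores)
    while remaining and len(selected) < top_n:
        def adjusted(doc: str) -> float:
            penalty = max((token_overlap(doc, s) for s in selected), default=0.0)
            return scores[doc] - redundancy_weight * penalty
        best = max(remaining, key=adjusted)
        selected.append(best)
        remaining.remove(best)
    return selected

scores = {
    "attention lets each token weigh other tokens": 0.90,
    "attention lets each token weigh all other tokens": 0.88,
    "positional encodings inject token order": 0.70,
}
pointwise_top = pointwise_rank(scores, top_n=2)
listwise_top = listwise_rank(scores, top_n=2)
print(pointwise_top)
print(listwise_top)
```

The pointwise ranking spends both slots on two near-identical sentences, while the set-aware version keeps the best one and uses the second slot for new information.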

Reranking is the single highest-ROI improvement for most RAG pipelines that already have basic vector search working. A 50–200ms investment in reranking typically produces 15–30% precision improvements, a trade-off that’s almost always worth making.