Reranking: The Second Opinion That Makes RAG Accurate
Vector similarity search is fast and scalable, but it has a subtle problem: the query and each document are encoded separately, each compressed into a single embedding, and relevance is reduced to the distance between those vectors. This bi-encoder architecture is great for recall (it can scan millions of documents quickly), but it’s not great for precision. The top-ranked result isn’t always the most relevant one.
Reranking adds a second, slower, more accurate stage. After the vector search retrieves 20–50 candidate documents, a reranker evaluates each candidate in the context of the query — not with separate embeddings, but by jointly processing query and document together. This cross-encoder architecture is significantly more accurate at identifying which candidates actually answer the question.
The Two-Stage Retrieval Architecture
Stage 1: Bi-Encoder Retrieval (Fast, Approximate)
  Query → Embed → Vector Search → Top-50 Candidates
  Speed: ~10ms for 1M vectors
  Accuracy: Good recall, imperfect precision

Stage 2: Cross-Encoder Reranking (Slow, Precise)
  Query + Candidate1 → Cross-Encoder → Score 0.94
  Query + Candidate2 → Cross-Encoder → Score 0.67
  Query + Candidate3 → Cross-Encoder → Score 0.88
  ...repeat for all 50 candidates...
  Sort by score → Top-5 for LLM context
  Speed: ~50-200ms for 50 candidates
  Accuracy: High precision

Total pipeline: ~60-210ms for high-quality top-5 results
vs.
Stage 1 alone: ~10ms but lower precision

The bi-encoder’s job is to reduce the search space from millions to dozens. The cross-encoder’s job is to order those dozens accurately.
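The two stages can be sketched as a single function that takes both scorers as arguments. This is a minimal illustration, not a production pipeline: `two_stage_retrieve`, `word_overlap`, and `phrase_match` are hypothetical names, and the two toy scorers only stand in for a real vector index lookup and a real cross-encoder call.

```python
from typing import Callable, Sequence

# Both scorers are hypothetical stand-ins: in a real pipeline, stage 1 would be
# a vector index lookup and stage 2 a cross-encoder model call.
Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    corpus: Sequence[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    k_candidates: int = 50,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Shortlist with a cheap scorer, then reorder the shortlist precisely."""
    # Stage 1: score the whole corpus cheaply and keep the best k_candidates.
    shortlist = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:k_candidates]
    # Stage 2: run the expensive scorer only on the shortlist.
    rescored = [(doc, precise_score(query, doc)) for doc in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


def word_overlap(query: str, doc: str) -> float:
    """Toy stage-1 scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Toy stage-2 scorer: word overlap plus a bonus for an exact phrase hit."""
    bonus = 5.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


corpus = [
    "attention heads and attention masks",
    "works on attention in transformers",
    "cooking pasta at home",
    "how attention works in transformers",
]
top = two_stage_retrieve(
    "attention works", corpus, word_overlap, phrase_match, k_candidates=3, top_n=2
)
for doc, score in top:
    print(f"{score:.1f} | {doc}")
```

In this toy run, stage 1 cannot separate the two documents that share both query words, while the stage-2 scorer moves the one containing the exact phrase to the top. The expensive scorer only ever sees `k_candidates` documents, which is what keeps the total latency bounded.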
Cross-Encoder vs Bi-Encoder
The architectural difference explains why cross-encoders are more accurate:
Bi-Encoder:
  Query ──→ Encoder ──→ Query Vector [0.12, -0.34, ...]
  Doc   ──→ Encoder ──→ Doc Vector   [0.10, -0.31, ...]
  Score = cosine_similarity(query_vec, doc_vec)
      ↑
  Independent encoding: no attention between query tokens and document tokens

Cross-Encoder:
  [CLS] Query [SEP] Document [SEP]
      ↓
  Full transformer, with self-attention over the concatenated sequence
      ↓
  Relevance Score: 0.94
      ↑
  Joint encoding: every token in query can attend to every token in document

The cross-encoder’s joint attention lets it pick up subtle interactions, such as “The document mentions Python version 3.10, and the query asks about a function introduced in Python 3.10”. That kind of cross-reference is hard for a bi-encoder to capture, because the query and the document never see each other during encoding.
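To make the bi-encoder scoring step concrete, here is a small pure-Python sketch of cosine similarity over precomputed vectors. The vector values and document IDs are made up for illustration; a real system would get them from an embedding model and store them in a vector index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Document vectors are computed once, offline (illustrative values).
doc_vectors = {
    "doc_a": [0.10, -0.31, 0.25],
    "doc_b": [-0.40, 0.22, 0.05],
}

# At query time only the query has to be embedded (illustrative values).
query_vec = [0.12, -0.34, 0.20]

scores = {
    doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_vectors.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the document vectors exist before any query arrives, stage 1 is cheap at query time. A cross-encoder has no equivalent shortcut: its input is the query and document together, so nothing can be precomputed per document.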
Cohere Rerank API
The simplest way to add reranking is Cohere’s Rerank endpoint:
import cohere

co = cohere.Client("your-api-key")

def rerank_documents(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-english-v3.0",  # or rerank-multilingual-v3.0
    )

    results = []
    for item in response.results:
        results.append({
            "document": documents[item.index],
            "relevance_score": item.relevance_score,
            "original_rank": item.index + 1,
        })
    return results

# Usage in RAG pipeline
candidate_docs = vector_search(query, k=20)
reranked = rerank_documents(query, [doc.page_content for doc in candidate_docs])
top_5_for_llm = reranked[:5]

Open Source Rerankers
For self-hosted deployments, several open-source cross-encoder models are available:
BGE Reranker (Beijing Academy of AI)
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# Provide query-document pairs
pairs = [[query, doc] for doc in candidate_texts]
scores = reranker.compute_score(pairs, normalize=True)

# Sort and return top candidates
ranked = sorted(zip(candidate_texts, scores), key=lambda x: x[1], reverse=True)
top_5 = [doc for doc, _ in ranked[:5]]

FlashRank: Fast Local Reranking
from flashrank import Ranker, RerankRequest

# Fast, lightweight: designed for sub-10ms reranking
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt/cache/")

passages = [
    {"id": i, "text": doc, "meta": {"source": src}}
    for i, (doc, src) in enumerate(zip(docs, sources))
]

rerank_request = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerank_request)

# results is sorted by relevance score
for r in results[:5]:
    print(f"Score: {r['score']:.3f} | {r['text'][:80]}")

Jina Reranker
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = "jinaai/jina-reranker-v2-base-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# This checkpoint ships custom model code, so trust_remote_code=True is required
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True
)

def jina_rerank(query, documents):
    pairs = [[query, doc] for doc in documents]
    features = tokenizer(
        pairs, padding=True, truncation=True,
        return_tensors="pt", max_length=512
    )
    # Inference only: skip gradient tracking
    with torch.no_grad():
        scores = model(**features).logits.flatten().tolist()
    # Raw logits, higher means more relevant
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

LangChain Reranking Integration
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS

# Cohere reranker as a compression step
cohere_reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,
)

# Wrap base retriever with reranking
# (vectorstore is assumed to be an existing FAISS index built elsewhere)
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

# Returns top-5 after reranking 20 candidates
results = reranking_retriever.invoke("How does attention work in transformers?")

Reranking Model Benchmarks
From MTEB Reranking Leaderboard (2025):
Model                          | MAP@10 | Speed | Size
-------------------------------|--------|-------|------
Cohere rerank-english-v3.0     | 0.714  | API   | API
BAAI/bge-reranker-large        | 0.702  | 85ms* | 560MB
BAAI/bge-reranker-v2-m3        | 0.721  | 90ms* | 568MB
cross-encoder/ms-marco-L-6-v2  | 0.680  | 45ms* | 140MB
flashrank ms-marco-L-12        | 0.672  | 8ms*  | 34MB
jinaai/jina-reranker-v2-base   | 0.695  | 75ms* | 278MB

* Per 20-document batch, GPU

FlashRank is the speed champion for latency-sensitive applications. BGE v2-m3 is best for multilingual use cases. Cohere is simplest to deploy.
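For readers who want to reproduce a MAP@10-style number on their own labeled queries, the metric can be sketched as follows. The function names are hypothetical, and the normalization used here (dividing by the smaller of the number of relevant documents and k) is one common convention; the leaderboard's exact computation may differ.

```python
def average_precision_at_k(
    ranked_ids: list[str], relevant_ids: set[str], k: int = 10
) -> float:
    """Average of precision values at each rank where a relevant doc appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    # Assumed convention: normalize by the best achievable number of hits.
    return precision_sum / min(len(relevant_ids), k)


def map_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Mean of average precision over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Two toy queries with hand-labeled relevant documents.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # hits at ranks 1 and 3
    (["x", "a"], {"a"}),                 # hit at rank 2
]
print(round(map_at_k(runs), 4))
```

Running the same function on the vector-search ordering and on the reranked ordering gives a like-for-like comparison on your own corpus, which matters more than leaderboard numbers measured on other datasets.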
When Reranking Pays Off
Reranking improves results most when:
- Top-K vector search results are good but not perfectly ordered
- Queries are complex (multi-condition, multi-hop)
- Documents are long and contain relevant content scattered throughout
Reranking adds less value when:
- Vector search already achieves high precision (simple factual queries on clean corpora)
- You need sub-100ms total latency (cross-encoder adds 50–200ms)
- Your candidate pool after vector search is small (< 10 documents)
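These rules of thumb can be encoded as a small gate in front of the reranker. The sketch below is hypothetical: `RerankPolicy` and `should_rerank` are illustrative names, and the default thresholds simply mirror the numbers quoted above, so they should be replaced with latencies measured on your own system.

```python
from dataclasses import dataclass


@dataclass
class RerankPolicy:
    # Defaults mirror the rules of thumb in the text; tune them per system.
    min_candidates: int = 10          # smaller pools gain little from reranking
    latency_budget_ms: float = 100.0  # total budget for both retrieval stages
    retrieval_ms: float = 10.0        # measured stage-1 latency
    rerank_ms: float = 50.0           # measured stage-2 latency


def should_rerank(num_candidates: int, policy: RerankPolicy) -> bool:
    """Return True when reranking is both worthwhile and affordable."""
    if num_candidates < policy.min_candidates:
        return False
    return policy.retrieval_ms + policy.rerank_ms <= policy.latency_budget_ms


print(should_rerank(20, RerankPolicy()))                 # fits the budget
print(should_rerank(5, RerankPolicy()))                  # pool too small
print(should_rerank(20, RerankPolicy(rerank_ms=200.0)))  # too slow
```

A gate like this does not capture query complexity or document length, which are harder to measure cheaply; it only covers the two conditions that can be checked from counts and timings.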
2025 Trend: Listwise Reranking
Traditional cross-encoders score each document against the query independently of the other candidates (pointwise). Newer listwise rerankers consider the relative relevance of all candidates together, which can improve diversity and reduce redundancy in the final set. ListT5 and RankVicuna are early examples of this approach, with reported improvements over pointwise methods.
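One practical wrinkle is that a listwise model can only read a limited number of passages at once. LLM-based listwise rerankers often handle this by ranking overlapping windows from the back of the candidate list to the front. The sketch below illustrates that control flow only: `sliding_window_rerank` and the `rank_window` callback are hypothetical, and the toy `by_length` ranker stands in for a real model that would read the whole window jointly.

```python
from typing import Callable, Sequence

# Hypothetical interface: takes a window of passages, returns them reordered.
# A real listwise reranker would read the whole window jointly (for example
# in a single LLM prompt) and output a permutation.
WindowRanker = Callable[[list[str]], list[str]]


def sliding_window_rerank(
    candidates: Sequence[str],
    rank_window: WindowRanker,
    window: int = 4,
    step: int = 2,
) -> list[str]:
    """Reorder candidates by ranking overlapping windows from back to front."""
    if window <= 0 or step <= 0 or step > window:
        raise ValueError("need 0 < step <= window")
    items = list(candidates)
    end = len(items)
    while end > 0:
        start = max(0, end - window)
        # Rank this window as a unit; strong passages move toward the front
        # and are carried into the next, overlapping window.
        items[start:end] = rank_window(items[start:end])
        if start == 0:
            break
        end -= step
    return items


def by_length(window_items: list[str]) -> list[str]:
    """Toy stand-in: pretend longer passages are more relevant."""
    return sorted(window_items, key=len, reverse=True)


docs = ["a", "bbbbb", "cc", "dddddddd", "eee", "fffffffff"]
reranked = sliding_window_rerank(docs, by_length, window=4, step=2)
print(reranked[:2])
```

A single back-to-front pass reliably surfaces the best few candidates (roughly `window - step` of them) rather than producing a full sort, which is usually enough when only the top handful go into the LLM context.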
Reranking is the single highest-ROI improvement for most RAG pipelines that already have basic vector search working. An extra 50–200ms of reranking typically produces 15–30% precision improvements, a trade-off that’s almost always worth making.
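Whether a given pipeline lands in that range is worth checking on a small labeled set. A minimal sketch, assuming you have document IDs ranked before and after reranking plus a set of known-relevant IDs per query (`precision_at_k` and `relative_lift` are illustrative names):

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def relative_lift(before: float, after: float) -> float:
    """Relative change from before to after, e.g. 0.25 means +25%."""
    if before == 0.0:
        raise ValueError("baseline precision is zero; relative lift is undefined")
    return (after - before) / before


relevant = {"a", "b", "c"}
vector_order = ["a", "x", "y", "b", "z"]    # stage 1 only
reranked_order = ["a", "b", "c", "x", "y"]  # after reranking

p_before = precision_at_k(vector_order, relevant)
p_after = precision_at_k(reranked_order, relevant)
print(f"P@5 {p_before:.2f} -> {p_after:.2f}, lift {relative_lift(p_before, p_after):+.0%}")
```

Averaging these values over a few dozen representative queries gives a far more trustworthy estimate than any single example like the one above.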