Cross Encoder Models: The Precision Engine Behind RAG Reranking
If you’ve used a reranker in your RAG pipeline, you’ve used a cross encoder. But understanding what these models actually do — and how they differ architecturally from the bi-encoders used in vector search — reveals why they’re so much more accurate and when you should consider training your own.
Architecture Deep Dive
A cross encoder takes a pair of texts as input and outputs a single relevance score. It processes query and document jointly, with full transformer attention between every token in both sequences.
Input:[CLS] How does attention work in transformers? [SEP]Transformers use scaled dot-product attention to computedependencies between tokens. The attention score is computedas softmax(QK^T / sqrt(d_k)) * V, where Q, K, V arelearned projections of the input. [SEP]
Architecture:┌─────────────────────────────────────────────────────┐│ [CLS] Q1 Q2 Q3 [SEP] D1 D2 D3 D4 ... [SEP] ││ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ││ 12-layer BERT (or RoBERTa, DeBERTa) ││ Full cross-attention ││ Q1 attends to D1, D2... Dn simultaneously ││ D1 attends to Q1, Q2... Qn simultaneously │└─────────────────────────────────────────────────────┘ ↓ [CLS] representation Linear(768 → 1) → Relevance Score: 0.92Every word in the query can directly attend to every word in the document. This allows the model to understand things like:
- “The query asks about ‘Python 3.10 walrus operator’ and this document introduces the walrus operator in Python 3.8”
- Entity coreference: “its” in the query refers to “BERT” mentioned in the document
- Numeric comparison: the query asks for “less than 5ms” and the document says “average latency is 3.2ms”
None of this is possible with separate bi-encoder embeddings.
Standard Backbone Models
DeBERTa: The Preferred Backbone
DeBERTa (Decoupled Enhanced BERT) is the current preferred backbone for cross encoders because its disentangled attention mechanism handles relative positions better than BERT:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load DeBERTa-based cross encodermodel_name = "cross-encoder/ms-marco-MiniLM-L-12-v2" # small, fast# model_name = "BAAI/bge-reranker-v2-m3" # multilingual# model_name = "cross-encoder/ms-marco-electra-base" # ELECTRA backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)model = AutoModelForSequenceClassification.from_pretrained(model_name)model.eval()Model Family Comparison
| Model | Backbone | Params | Latency (20 docs) | NDCG@10 |
|---|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | BERT-mini | 22M | 25ms | 0.694 |
| ms-marco-MiniLM-L-12-v2 | BERT-mini | 33M | 42ms | 0.700 |
| bge-reranker-base | BERT | 278M | 85ms | 0.697 |
| bge-reranker-large | BERT-large | 560M | 140ms | 0.702 |
| bge-reranker-v2-m3 | multilingual | 568M | 145ms | 0.721 |
| ms-marco-deberta-base | DeBERTa | 184M | 70ms | 0.708 |
Training Cross Encoders
Cross encoders are trained on labeled query-document pairs with binary or graded relevance labels. MS MARCO (Microsoft Machine Reading Comprehension) is the dominant training dataset.
MS MARCO Training Format
from datasets import load_datasetfrom transformers import AutoTokenizer, AutoModelForSequenceClassificationfrom transformers import TrainingArguments, Trainerimport torch
dataset = load_dataset("ms_marco", "v1.1")
# Training examples: (query, positive_doc, negative_doc) triplets# Cross encoder trained to score positive > negative for same query
class CrossEncoderDataset(torch.utils.data.Dataset): def __init__(self, pairs, labels, tokenizer, max_length=512): self.encodings = tokenizer( [p[0] for p in pairs], # queries [p[1] for p in pairs], # documents truncation=True, padding=True, max_length=max_length, return_tensors="pt" ) self.labels = torch.tensor(labels, dtype=torch.float)
def __getitem__(self, idx): return { "input_ids": self.encodings["input_ids"][idx], "attention_mask": self.encodings["attention_mask"][idx], "labels": self.labels[idx], }
def __len__(self): return len(self.labels)Pairwise vs Pointwise Training
Pointwise: Train to predict binary label (relevant=1, not relevant=0):
# Loss: Binary Cross Entropyloss = F.binary_cross_entropy_with_logits(score, label)Pairwise: Train so that positive_doc score > negative_doc score by margin:
# Loss: Margin ranking lossloss = F.margin_ranking_loss(pos_score, neg_score, target=1, margin=1.0)Pairwise training often produces better ranking quality because the model directly optimizes the ordering relationship rather than absolute relevance scores.
Fine-Tuning for Domain Adaptation
A general MS MARCO cross encoder may not perform well on your domain. Fine-tuning on domain-specific data often produces 10–30% improvement:
from sentence_transformers import CrossEncoder, InputExamplefrom torch.utils.data import DataLoader
# Your domain-specific training datatraining_examples = [ InputExample(texts=["What is our API rate limit?", "The API rate limit is 1000 requests per minute."], label=1), InputExample(texts=["What is our API rate limit?", "Our company was founded in 2015."], label=0), # ... hundreds more examples]
# Start from pretrained cross encodermodel = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", num_labels=1)
# Fine-tunetrain_dataloader = DataLoader(training_examples, shuffle=True, batch_size=16)model.fit( train_dataloader=train_dataloader, epochs=3, warmup_steps=100, output_path="./models/domain-reranker",)Weak Supervision for Fine-Tuning Data
Generating labeled training data is expensive. Weak supervision reduces this cost:
- Collect query-document pairs where users clicked (implicit positive labels)
- Use BM25 to generate hard negatives (documents that match keywords but aren’t relevant)
- Use an LLM to label a small subset for quality anchoring
- Train on the weakly labeled dataset, validate on the LLM-labeled set
Batched Inference for Production
import torchfrom transformers import AutoTokenizer, AutoModelForSequenceClassification
class CrossEncoderReranker: def __init__(self, model_name: str, device: str = "cuda", batch_size: int = 32): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained(model_name) self.model.to(device) self.model.eval() self.device = device self.batch_size = batch_size
def rerank(self, query: str, documents: list[str], top_n: int = 5) -> list[tuple]: pairs = [[query, doc] for doc in documents] all_scores = []
for i in range(0, len(pairs), self.batch_size): batch = pairs[i:i + self.batch_size] features = self.tokenizer( [p[0] for p in batch], [p[1] for p in batch], padding=True, truncation=True, max_length=512, return_tensors="pt" ).to(self.device)
with torch.no_grad(): scores = self.model(**features).logits.flatten().cpu().tolist() all_scores.extend(scores)
ranked = sorted( zip(documents, all_scores), key=lambda x: x[1], reverse=True ) return ranked[:top_n]2025 Trend: LLM-as-Reranker
Large language models themselves are being used as rerankers, leveraging their deep language understanding for relevance assessment:
import anthropic
client = anthropic.Anthropic()
def llm_rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]: doc_list = "\n\n".join([f"[{i+1}] {doc[:300]}" for i, doc in enumerate(candidates)])
response = client.messages.create( model="claude-haiku-4-5-20251001", max_tokens=100, messages=[{ "role": "user", "content": f"""Rank these documents by relevance to the query.Return ONLY the document numbers in order of relevance, most relevant first.
Query: {query}
Documents:{doc_list}
Ranking (numbers only, comma separated):""" }] )
ranking_text = response.content[0].text.strip() indices = [int(x.strip()) - 1 for x in ranking_text.split(",")] return [candidates[i] for i in indices[:top_n] if i < len(candidates)]LLM-as-reranker is slower and more expensive than cross encoders but shows superior performance on complex reasoning tasks where domain understanding matters more than semantic similarity. It’s increasingly used as a final reranking stage after cross-encoder prefiltering.
Cross encoder models are the precision layer that turns good RAG into great RAG. For most teams, using a pretrained cross encoder (FlashRank for speed, BGE-reranker-large for quality) is the right starting point before considering fine-tuning.