Retrieval-Augmented Generation (RAG)
Large language models know a lot. But they don’t know your internal documents, your latest product specs, last week’s meeting notes, or anything published after their training cutoff. RAG is the architecture that bridges this gap.
The Problem RAG Solves
Without RAG, you have two bad options for knowledge injection:
- Include everything in the system prompt — bloated context, expensive, and usually impossible (your knowledge base is thousands of documents)
- Fine-tune on your knowledge — slow, expensive, stale the moment anything changes, and doesn’t reliably inject specific facts (models don’t “learn facts” well from fine-tuning)
RAG offers a third way: retrieve the relevant knowledge at query time and include only that in the context. The model gets exactly the information it needs, nothing more.
How RAG Works: The Full Pipeline
    Document Corpus                 Query Time
    ────────────────                ──────────────────────────────────
    Your PDFs, docs,                User question
    spreadsheets, wikis                   │
            │                             ▼
            ▼                       [Embed Query]
    [Chunk & Embed]                       │
            │                             │
            ▼                             │
    Vector Database ←────Search───────────┘
    (indexed embeddings)
            │
            ▼
    Top-K relevant chunks
            │
            ▼
    [Augment Prompt with chunks]
            │
            ▼
    LLM generates response
    grounded in retrieved context

Two phases: indexing (done once, then incrementally) and retrieval + generation (done at query time).
Phase 1: Indexing
Chunking
Documents must be split into smaller pieces that fit meaningfully within the context window. Chunking strategy significantly affects retrieval quality.
Fixed-size chunking: Split every N tokens (e.g., 512 tokens with 50-token overlap between chunks). Simple, predictable, but ignores document structure.
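A minimal sketch of fixed-size chunking with overlap may make the sliding window concrete. It assumes whitespace-separated words stand in for real tokenizer tokens, and the function name `chunk_fixed` is hypothetical, not taken from any library.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: words stand in for tokens; a real pipeline would use the
# tokenizer that matches its embedding model.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(words, size=512, overlap=50)
print([len(c) for c in chunks])
```

The last 50 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.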
Semantic chunking: Split on natural boundaries — paragraphs, sections, sentences. Preserves meaning but produces variable-size chunks.
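One simple way to approximate boundary-aware chunking is to pack whole paragraphs greedily up to a size limit. The sketch below assumes blank lines mark paragraph boundaries and counts words rather than tokens; `chunk_by_paragraph` and `max_words` are illustrative names.

```python
import re


def chunk_by_paragraph(text, max_words=200):
    """Greedily pack whole paragraphs into chunks of at most `max_words` words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))  # close the chunk at a paragraph boundary
            current, count = [], 0
        current.append(para)  # an oversized paragraph becomes its own chunk
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "Alpha beta gamma.\n\nDelta epsilon.\n\nZeta eta theta iota."
parts = chunk_by_paragraph(doc, max_words=5)
print(parts)
```

Chunks never cut a paragraph in half, which is the point of the approach, but their sizes vary with the source text.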
Hierarchical chunking: Index at multiple granularities. Store both paragraph-level chunks (for specific retrieval) and document-level summaries (for broad context). At query time, retrieve specific chunks but include section-level context.
Document structure for hierarchical chunking:

    Document summary (500 tokens)          ← for broad context
    ├── Section 1 summary (100 tokens)
    │   ├── Paragraph 1 (150 tokens)       ← retrieved
    │   ├── Paragraph 2 (140 tokens)       ← retrieved
    │   └── Paragraph 3 (160 tokens)
    └── Section 2 summary (120 tokens)

Embedding
Each chunk is converted into a high-dimensional vector that captures its semantic meaning. Similar topics produce vectors that are close together in the vector space.
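To illustrate what "close together" means, here is a small sketch of cosine similarity. The three-dimensional vectors are made up for the example; real embedding models produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented purely for illustration.
refund_policy = [0.9, 0.1, 0.0]
return_rules = [0.8, 0.2, 0.1]
gpu_specs = [0.0, 0.2, 0.9]

similar = cosine_similarity(refund_policy, return_rules)
unrelated = cosine_similarity(refund_policy, gpu_specs)
print(round(similar, 3), round(unrelated, 3))
```

The two policy-related vectors score close to 1, while the unrelated pair scores close to 0.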
Common embedding models:
- text-embedding-3-large (OpenAI): Strong multilingual, 3072 dimensions
- voyage-large-2 (Voyage AI): Retrieval-focused model; Voyage AI has since released newer generations, so compare against its current lineup before choosing
- Cohere embed-v3: Strong multilingual retrieval
- BGE-M3 (open-source): Excellent, free to run locally
Phase 2: Retrieval
Dense Retrieval (Semantic Search)
The query is embedded, then compared against all stored chunk embeddings using cosine similarity or dot product. The search returns the top-K most semantically similar chunks.
Strengths: Handles paraphrase, finds conceptually related content, multilingual
Weaknesses: Can miss exact keyword matches, struggles with rare terms and product codes
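As a rough sketch of dense retrieval, the brute-force search below normalizes vectors once at indexing time so that a plain dot product equals cosine similarity at query time. The vectors and chunk IDs are invented, and a production system would typically use a vector database with an approximate nearest-neighbor index instead of scanning every chunk.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dense_search(query_vec, index, top_k=2):
    """index: list of (chunk_id, unit_vector). Returns top_k (chunk_id, score), best first."""
    q = normalize(query_vec)
    scored = [(cid, sum(a * b for a, b in zip(q, vec))) for cid, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical chunk embeddings, made up for the example.
raw = {
    "refunds": [0.9, 0.1, 0.0],
    "shipping": [0.2, 0.9, 0.1],
    "gpu_specs": [0.0, 0.2, 0.9],
}
index = [(cid, normalize(vec)) for cid, vec in raw.items()]
hits = dense_search([0.8, 0.3, 0.0], index, top_k=2)
print([cid for cid, _ in hits])
```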
Sparse Retrieval (BM25 / Keyword Search)
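Traditional TF-IDF-style search: exact keyword matching with statistical weighting for term frequency and document length. Keyword search of this kind has been the backbone of most search engines for 30+ years, and BM25 is the default ranking function in Elasticsearch.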
Strengths: Reliable exact-match, efficient, interpretable
Weaknesses: Doesn’t understand synonyms or paraphrase
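For intuition, here is a compact sketch of BM25 scoring. It assumes lowercase whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))`; a real deployment would rely on a search engine rather than this loop.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact match only: synonyms contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "reset the router model XR-500 by holding the power button",
    "our refund policy covers hardware returns within 30 days",
    "the router setup guide explains wireless configuration",
]
scores = bm25_scores("XR-500 router", docs)
print([round(s, 3) for s in scores])
```

The rare product code "XR-500" carries a higher IDF weight than the common word "router", which is why the first document ranks well above the third.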
Hybrid Search (Best of Both)
Combine both approaches and fuse the scores. Standard practice in production systems:
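    # Hybrid search with Reciprocal Rank Fusion (RRF).
    # vector_db and elasticsearch stand for whatever search clients you use.
    semantic_results = vector_db.search(query_embedding, top_k=20)
    keyword_results = elasticsearch.search(query_text, top_k=20)

    # RRF score for a document = Σ 1/(k + rank_i), summed over each result
    # list i in which the document appears (rank_i is its position in list i).
    def rrf_score(doc, semantic_rank, keyword_rank, k=60):
        score = 0.0
        if semantic_rank is not None:  # None: the document is absent from that list
            score += 1 / (k + semantic_rank)
        if keyword_rank is not None:
            score += 1 / (k + keyword_rank)
        return score

Reranking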
After retrieval, a cross-encoder reranker scores each candidate against the query. More computationally expensive than embedding similarity, but significantly more accurate.
    Initial retrieval: top-50 chunks (cheap embeddings)
            │
    Reranker model (cross-encoder)
            │
    Reranked top-5 chunks (high quality)
            │
    Context for LLM

Cohere Rerank, Voyage Rerank, and BGE-Reranker-v2-M3 are popular options.
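The rerank step can be sketched as a function that scores every (query, candidate) pair with a pluggable scorer and keeps the best few. The `overlap_score` function here is only a toy stand-in so the example runs without a model; in practice a cross-encoder such as one of the rerankers named above would take its place.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Score each (query, candidate) pair jointly and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    q_words = set(query.lower().split())
    return len(q_words & set(text.lower().split())) / len(q_words)


candidates = [
    "Shipping takes five business days",
    "To reset your password open account settings",
    "Password rules require twelve characters",
]
top = rerank("how to reset password", candidates, overlap_score, top_n=2)
print(top)
```

Because the scorer sees the query and the candidate together, it can weigh their interaction directly, which is what distinguishes a cross-encoder from comparing two independently computed embeddings.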
Phase 3: Generation
The retrieved chunks and the original question are combined into a prompt:
    System: You are a helpful assistant. Answer the user's question using only the provided context. If the answer is not in the context, say so.

    Context:
    [DOCUMENT 1]
    {retrieved chunk 1 content}

    [DOCUMENT 2]
    {retrieved chunk 2 content}

    [DOCUMENT 3]
    {retrieved chunk 3 content}

    Question: {user_question}

    Answer:

The critical instruction is “using only the provided context.” Without it, the model tends to blend retrieved facts with its own (potentially outdated or wrong) knowledge.
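Assembling this prompt is mostly string formatting. The sketch below mirrors the template above; `build_prompt` is a hypothetical helper name, and many chat APIs would take the system text as a separate message rather than inline.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the user's question using only the "
    "provided context. If the answer is not in the context, say so."
)


def build_prompt(question, chunks):
    """Number each retrieved chunk and place it between the instruction and the question."""
    context = "\n\n".join(
        f"[DOCUMENT {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"System: {SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 days."],
)
print(prompt)
```

Numbering the documents also gives the model stable labels to cite, which helps with the source-citation requirement in production.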
Advanced RAG Patterns (2025–2026)
Query Rewriting
Expand or decompose the user’s query before retrieval. A vague query becomes multiple precise sub-queries.
    Original:  "How do I integrate with our CRM?"
    Rewritten: ["CRM integration API documentation",
                "Salesforce connection setup steps",
                "webhook configuration CRM"]

Self-RAG
The model retrieves documents, decides whether they’re sufficient, critiques its own answer against them, and iterates if needed. More expensive but higher accuracy.
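The control flow described here can be sketched as a retrieve, generate, check loop. Everything below is an assumption for illustration: the `retrieve`, `generate`, and `is_supported` callables are toy stubs standing in for a retriever and LLM calls, and widening `top_k` on failure is just one possible retry strategy.

```python
def self_rag(question, retrieve, generate, is_supported, max_rounds=3):
    """Retrieve, answer, self-check, and retry with a wider search if unsupported."""
    top_k = 3
    docs = []
    for _ in range(max_rounds):
        docs = retrieve(question, top_k)
        answer = generate(question, docs)
        if is_supported(answer, docs):
            return answer, docs
        top_k *= 2  # not grounded yet: widen the search and try again
    return "I could not find a supported answer.", docs


# Toy stubs; real versions would call a retriever and an LLM.
corpus = ["a", "b", "c", "d", "The warranty lasts 24 months."]


def retrieve(question, top_k):
    return corpus[:top_k]


def generate(question, docs):
    hits = [d for d in docs if "warranty" in d]
    return hits[0] if hits else "unknown"


def is_supported(answer, docs):
    return answer in docs


answer, used = self_rag("How long is the warranty?", retrieve, generate, is_supported)
print(answer)
```

Each extra round costs another retrieval and generation call, which is where the added expense comes from.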
HyDE (Hypothetical Document Embeddings)
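Generate a hypothetical answer to the query, embed that answer, and use it for retrieval. This is counterintuitive, but it often outperforms embedding the query directly, because a hypothetical answer tends to resemble the stored documents more closely than a short question does.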
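A minimal sketch of the HyDE flow, with all three components passed in as functions. The word-set "embedding", the Jaccard search, and `fake_llm` are toy stand-ins chosen so the example runs offline; they are not how a real embedding model or LLM behaves.

```python
def hyde_search(query, generate_hypothetical, embed, search, top_k=5):
    """Embed a hypothetical answer instead of the raw query, then search with it."""
    hypothetical = generate_hypothetical(query)
    return search(embed(hypothetical), top_k)


docs = [
    "invoices increase when usage exceeds the plan quota",
    "reset your password from the account settings page",
]


def embed(text):
    return set(text.lower().split())  # toy embedding: a bag of words


def search(query_vec, top_k):
    def jaccard(doc):
        doc_vec = embed(doc)
        return len(query_vec & doc_vec) / len(query_vec | doc_vec)

    return sorted(docs, key=jaccard, reverse=True)[:top_k]


def fake_llm(query):
    # Stand-in for an LLM call that drafts a plausible answer.
    return "your invoices increase because usage exceeds the plan quota"


# The raw query shares no words with either document; the draft answer does.
results = hyde_search("why is my bill higher", fake_llm, embed, search, top_k=1)
print(results)
```

The hypothetical answer does not need to be correct. It only needs to use the vocabulary and phrasing of the documents that contain the real answer.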
Parent Document Retrieval
Retrieve small chunks for precision, then expand to the parent section for context. Balances retrieval granularity with generation context richness.
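The expansion step can be sketched as a lookup from child chunk to parent section, deduplicated so that two hits from the same section do not send it to the model twice. The IDs and the `expand_to_parents` name are hypothetical.

```python
def expand_to_parents(hit_chunk_ids, chunk_to_parent, parents):
    """Map retrieved child chunks to their parent sections, deduplicated, in hit order."""
    seen, expanded = set(), []
    for chunk_id in hit_chunk_ids:
        parent_id = chunk_to_parent[chunk_id]
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parents[parent_id])
    return expanded


# Hypothetical index: small chunks point at the section they came from.
chunk_to_parent = {"c1": "sec-install", "c2": "sec-install", "c3": "sec-billing"}
parents = {
    "sec-install": "Full text of the installation section...",
    "sec-billing": "Full text of the billing section...",
}
context = expand_to_parents(["c2", "c3", "c1"], chunk_to_parent, parents)
print(context)
```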
RAG Evaluation Metrics
| Metric | What It Measures | Tool |
|---|---|---|
| Context Recall | % of relevant info retrieved | RAGAS |
| Context Precision | % of retrieved info that’s relevant | RAGAS |
| Faithfulness | Does the answer stay within the retrieved context? | RAGAS, TruLens |
| Answer Relevancy | Does the answer address the question? | RAGAS |
| End-to-end accuracy | Is the final answer correct? | Human eval or LLM judge |
RAGAS (Retrieval Augmented Generation Assessment) is a widely used open-source framework for automated RAG evaluation.
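When you have labeled data saying which chunks are relevant to a question, the two context metrics reduce to set arithmetic. The sketch below is a simplified ID-based version for intuition only; evaluation frameworks estimate these quantities in more sophisticated ways, and the chunk IDs are invented.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def context_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that were retrieved."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


retrieved = ["c1", "c2", "c3", "c4"]
relevant = ["c2", "c4", "c9"]
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(precision, round(recall, 3))
```

Here two of the four retrieved chunks are relevant (precision 0.5), and two of the three relevant chunks were found (recall about 0.667).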
Production Checklist
- Chunking strategy validated on representative documents
- Embedding model evaluated on your domain (generic benchmarks aren’t enough)
- Hybrid search implemented (pure semantic search misses too much)
- Reranker added for top-K refinement
- Source citations included in responses (for user trust and auditability)
- Retrieval evaluation metrics instrumented (recall, precision)
- Staleness handling for re-indexing when documents change
- Guardrails to prevent hallucination outside retrieved context
- Latency budget: retrieval + reranking + generation should be < 3s for most apps
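For the staleness item in the checklist, one common approach is to store a content hash per document at indexing time and re-embed only what has changed. This is a sketch under that assumption; `plan_reindex` and the dictionary layout are illustrative, not tied to any particular vector database.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents, indexed_hashes):
    """documents: {doc_id: text}; indexed_hashes: {doc_id: hash at last indexing}.

    Returns (ids to embed and upsert, ids to delete from the index).
    """
    to_upsert = sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in documents)
    return to_upsert, to_delete


indexed = {
    "faq": content_hash("old faq text"),
    "pricing": content_hash("pricing v1"),
    "legacy": content_hash("retired page"),
}
current = {"faq": "new faq text", "pricing": "pricing v1", "roadmap": "2026 plans"}
to_upsert, to_delete = plan_reindex(current, indexed)
print(to_upsert, to_delete)
```

Unchanged documents are skipped entirely, which keeps incremental indexing cheap even for a large corpus.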