Retrieval-Augmented Generation (RAG)

Large language models know a lot. But they don’t know your internal documents, your latest product specs, last week’s meeting notes, or anything published after their training cutoff. RAG is the architecture that bridges this gap.

The Problem RAG Solves

Without RAG, you have two bad options for knowledge injection:

Include everything in the system prompt — bloated context, expensive, and usually impossible (your knowledge base is thousands of documents)
Fine-tune on your knowledge — slow, expensive, stale the moment anything changes, and doesn’t reliably inject specific facts (models don’t “learn facts” well from fine-tuning)

RAG offers a third way: retrieve the relevant knowledge at query time and include only that in the context. When retrieval works well, the model sees the information it needs and little else.

How RAG Works: The Full Pipeline

Document Corpus                Query Time
────────────────               ──────────────────────────────────
Your PDFs, docs,               User question
spreadsheets, wikis
        │                              │
        ▼                              ▼
   [Chunk & Embed]               [Embed Query]
        │                              │
        ▼                              │
   Vector Database ←────Search─────────┘
   (indexed embeddings)        │
                                ▼
                         Top-K relevant chunks
                                │
                                ▼
                    [Augment Prompt with chunks]
                                │
                                ▼
                         LLM generates response
                         grounded in retrieved context

The pipeline splits into an offline indexing phase (done once, then updated incrementally) and two query-time phases: retrieval and generation.

Phase 1: Indexing

Chunking

Documents must be split into smaller pieces that fit meaningfully within the context window. Chunking strategy significantly affects retrieval quality.

Fixed-size chunking: Split every N tokens (e.g., 512 tokens with 50-token overlap between chunks). Simple, predictable, but ignores document structure.
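
A minimal sketch of fixed-size chunking with overlap. It assumes the text is already tokenized; here whitespace-separated words stand in for real tokenizer output, and the function name chunk_fixed is made up for illustration.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Assumption: plain words stand in for tokens from a real tokenizer.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(words, size=512, overlap=50)
```

With 1,200 tokens this yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next one, so a sentence cut at a boundary still shows up whole in at least one chunk.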

Semantic chunking: Split on natural boundaries — paragraphs, sections, sentences. Preserves meaning but produces variable-size chunks.
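
One simple way to approximate semantic chunking is to split on blank lines and pack whole paragraphs into chunks up to a word budget. This is a sketch under that assumption; the name chunk_by_paragraph and the word-count budget are illustrative, and real systems often count tokens and also split on headings or sentences.

```python
def chunk_by_paragraph(text, max_words=200):
    """Pack whole paragraphs into chunks of at most `max_words` words where possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Intro paragraph one.\n\nSecond paragraph here.\n\nThird and final paragraph."
para_chunks = chunk_by_paragraph(doc, max_words=6)
```

Note that a single paragraph longer than the budget is kept intact as its own oversized chunk, which is the variable-size trade-off described above.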

Hierarchical chunking: Index at multiple granularities. Store both paragraph-level chunks (for specific retrieval) and document-level summaries (for broad context). At query time, retrieve specific chunks but include section-level context.

Document structure for hierarchical chunking:
  Document summary (500 tokens) ← for broad context
    ├── Section 1 summary (100 tokens)
    │     ├── Paragraph 1 (150 tokens) ← retrieved
    │     ├── Paragraph 2 (140 tokens) ← retrieved
    │     └── Paragraph 3 (160 tokens)
    └── Section 2 summary (120 tokens)

Embedding

Each chunk is converted into a high-dimensional vector that captures its semantic meaning. Similar topics produce vectors that are close together in the vector space.

Common embedding models:

text-embedding-3-large (OpenAI): Strong multilingual, 3072 dimensions
voyage-large-2 (Voyage AI): Retrieval-focused; among the stronger options as of 2025, but re-check current benchmarks before committing
Cohere embed-v3: Strong multilingual retrieval
BGE-M3 (open-source): Excellent, free to run locally

Phase 2: Retrieval

Dense Retrieval (Semantic Search)

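The query is embedded, then compared to the stored chunk embeddings using cosine similarity or dot product, and the top-K most similar chunks are returned. At scale, vector databases typically use an approximate nearest-neighbor index rather than comparing against every stored vector.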

Strengths: Handles paraphrase, finds conceptually related content, multilingual
Weaknesses: Can miss exact keyword matches, struggles with rare terms and product codes
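
To make the similarity step concrete, here is a brute-force sketch of dense retrieval in plain Python. The three-dimensional vectors and chunk IDs are invented for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dense_top_k(query_vec, index, k=3):
    """index maps chunk ID -> embedding; returns (chunk_id, score) pairs, best first."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, made up for this example.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-auth": [0.0, 0.2, 0.9],
}
hits = dense_top_k([0.8, 0.2, 0.1], index, k=2)
```

The query vector points mostly along the first axis, so "refund-policy" ranks first and "shipping-times" second.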

Sparse Retrieval (BM25 / Keyword Search)

BM25 is a TF-IDF-style ranking function: exact keyword matching, weighted by term frequency, term rarity, and document length. It dates back to the 1990s and is the default scoring function in Lucene-based engines such as Elasticsearch.

Strengths: Reliable exact-match, efficient, interpretable
Weaknesses: Doesn’t understand synonyms or paraphrase
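
For intuition, here is a compact sketch of BM25 scoring in plain Python. It assumes lowercase whitespace tokenization and the commonly used defaults k1=1.5 and b=0.75, with the non-negative IDF variant; a production system would rely on a search engine rather than this loop, and the sample documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` (a list of strings) against `query` with BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "reset your password from the account settings page",
    "error code E4512 means the payment gateway timed out",
    "contact support if the payment fails twice",
]
scores = bm25_scores("E4512 payment", docs)
```

The second document wins because it contains the rare term "E4512", which is exactly the kind of product code that dense retrieval tends to miss.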

Hybrid Search (Best of Both)

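Run both retrievers and fuse their ranked results. This is standard practice in production systems. Reciprocal Rank Fusion (RRF) is a common choice because it uses only ranks, so the two score scales never have to be normalized: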

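# Hybrid search with Reciprocal Rank Fusion (RRF).
# vector_db, elasticsearch, query_embedding, and query_text are placeholders
# for your own clients and inputs; each search returns results best first.
semantic_results = vector_db.search(query_embedding, top_k=20)
keyword_results = elasticsearch.search(query_text, top_k=20)

# RRF score of a document = Σ 1/(k + rank_r) over every ranker r that
# returned it, where rank_r is the document's 1-based rank in that list.
def rrf_score(semantic_rank, keyword_rank, k=60):
    score = 0.0
    if semantic_rank is not None:  # None means the ranker did not return the doc
        score += 1 / (k + semantic_rank)
    if keyword_rank is not None:
        score += 1 / (k + keyword_rank)
    return score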
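
The snippet above scores one document at a time. A self-contained sketch of fusing two complete ranked lists might look like the following; the document IDs are hypothetical, and k=60 is the commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs (best first) into one list sorted by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

semantic_ids = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search output
keyword_ids = ["doc_b", "doc_d", "doc_a"]    # hypothetical keyword-search output
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
```

Here doc_b comes out on top because it ranks well in both lists, while doc_c and doc_d, each found by only one retriever, fall to the bottom.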

Reranking

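After retrieval, a cross-encoder reranker scores each candidate against the query. Because it reads the query and the chunk together instead of comparing two independently computed vectors, it is slower than embedding similarity but usually more accurate, which is why it is applied only to a short candidate list.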

Initial retrieval: top-50 chunks (cheap embeddings)
                          │
                   Reranker model (cross-encoder)
                          │
            Reranked top-5 chunks (high quality)
                          │
                    Context for LLM

Cohere Rerank, Voyage Rerank, and BGE-Reranker-v2-M3 are popular options.
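
The two-stage shape can be sketched without committing to any particular reranker. In the example below, score_fn is the plug-in point for a cross-encoder; the word-overlap scorer is only a placeholder so the code runs on its own, and it is not how a real reranker scores text.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score every candidate against the query and keep the best `top_n`."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in scored[:top_n]]

def overlap_score(query, chunk):
    """Placeholder scorer: fraction of query words found in the chunk.
    A real system would call a cross-encoder model here instead."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

candidates = [
    "Invoices are emailed monthly.",
    "To rotate an API key, open Settings and choose Regenerate key.",
    "API rate limits reset every minute.",
]
top = rerank("how do I rotate my API key", candidates, overlap_score, top_n=2)
```

Swapping in a real model means replacing overlap_score with a function that sends the (query, chunk) pair to the reranker and returns its relevance score.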

Phase 3: Generation

The retrieved chunks and the original question are combined into a prompt:

System: You are a helpful assistant. Answer the user's question
        using only the provided context. If the answer is not
        in the context, say so.

Context:
[DOCUMENT 1]
{retrieved chunk 1 content}

[DOCUMENT 2]
{retrieved chunk 2 content}

[DOCUMENT 3]
{retrieved chunk 3 content}

Question: {user_question}

Answer:

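The critical instruction is “using only the provided context.” Without it, the model tends to blend retrieved facts with its own (potentially outdated or wrong) knowledge. Even with it, grounding is not guaranteed, so treat the instruction as necessary rather than sufficient.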
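
Assembling that prompt is plain string formatting. A minimal sketch, assuming the chunks arrive as a list of strings in ranked order (the function name build_prompt and the sample chunks are illustrative):

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the user's question "
    "using only the provided context. If the answer is not "
    "in the context, say so."
)

def build_prompt(chunks, user_question):
    """Return (system, user) strings laid out like the template above."""
    blocks = [f"[DOCUMENT {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)]
    context = "\n\n".join(blocks)
    user = f"Context:\n{context}\n\nQuestion: {user_question}\n\nAnswer:"
    return SYSTEM_PROMPT, user

system, user = build_prompt(
    ["Refunds are issued within 14 days.", "Refunds go to the original payment method."],
    "How long do refunds take?",
)
```

Numbering the documents also gives the model a handle for citations, since it can be asked to reference [DOCUMENT 2] when it uses that chunk.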

Advanced RAG Patterns (2025–2026)

Query Rewriting

Expand or decompose the user’s query before retrieval. A vague query becomes multiple precise sub-queries.

Original: "How do I integrate with our CRM?"
Rewritten: ["CRM integration API documentation",
            "Salesforce connection setup steps",
            "webhook configuration CRM"]

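Once the sub-queries exist, each one is retrieved separately and the results are merged. The sketch below covers only that merge step; the rewriting itself would be an LLM call, which is not shown. The retriever is a hard-coded stand-in and the chunk IDs are invented.

```python
def multi_query_retrieve(sub_queries, retrieve, per_query_k=5):
    """Run each sub-query through `retrieve` and merge results, dropping duplicates."""
    seen, merged = set(), []
    for q in sub_queries:
        for chunk_id in retrieve(q, per_query_k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged

# Hypothetical retriever backed by a lookup table instead of a real index.
FAKE_INDEX = {
    "CRM integration API documentation": ["api-ref", "auth-guide"],
    "Salesforce connection setup steps": ["sf-setup", "auth-guide"],
    "webhook configuration CRM": ["webhooks", "api-ref"],
}

def fake_retrieve(query, k):
    return FAKE_INDEX.get(query, [])[:k]

merged = multi_query_retrieve(list(FAKE_INDEX), fake_retrieve)
```

Chunks that several sub-queries agree on appear only once, so the final context is not padded with duplicates.
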
Self-RAG

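The model retrieves documents, decides whether they’re sufficient, critiques its own answer against them, and iterates if needed. Each extra round costs additional retrieval and LLM calls, but the self-check can catch answers that the retrieved context does not support.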
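
The control flow can be sketched as a simple loop. This is a simplification: the published Self-RAG method trains the model to emit special reflection tokens, whereas the code below just wires together three callables. All three stubs, the tiny corpus, and the "fetch more each round" policy are assumptions made for illustration.

```python
def self_rag(question, retrieve, generate, is_supported, max_rounds=3):
    """Retrieve, answer, self-check, and widen retrieval until the check passes."""
    answer, chunks = None, []
    for round_no in range(1, max_rounds + 1):
        chunks = retrieve(question, k=3 * round_no)   # fetch more each round
        answer = generate(question, chunks)
        if is_supported(answer, chunks):
            return answer, chunks, round_no
    return answer, chunks, max_rounds

# Stubs standing in for a vector store and two LLM calls.
CORPUS = ["Plan A costs $10.", "Plan B costs $25.", "Billing is monthly.",
          "Plan C costs $40."]

def stub_retrieve(question, k):
    return CORPUS[:k]

def stub_generate(question, chunks):
    return "Plan C costs $40." if "Plan C costs $40." in chunks else "Not in context."

def stub_is_supported(answer, chunks):
    return answer in chunks

answer, used, rounds = self_rag("What does Plan C cost?", stub_retrieve,
                                stub_generate, stub_is_supported)
```

In this toy run the first round retrieves too little, the check fails, and the second round pulls in the chunk that supports the answer.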

HyDE (Hypothetical Document Embeddings)

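Generate a hypothetical answer to the query, embed that answer, and use the embedding for retrieval. This is counterintuitive, but it often outperforms embedding the query directly: even when the hypothetical answer is wrong, it is phrased like the documents you are searching for, so it tends to land closer to them in vector space than a short question does.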
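
A sketch of the HyDE flow with the LLM, embedding model, and vector search passed in as callables. The toy stand-ins below use a bag-of-words set as the "embedding" purely so the example runs on its own; none of them reflect a real model or database API.

```python
def hyde_retrieve(question, generate_answer, embed, search, k=5):
    """Embed a hypothetical answer instead of the question, then search with it."""
    hypothetical = generate_answer(question)   # may be factually wrong; that is fine
    return search(embed(hypothetical), k)

# Toy stand-ins: a bag-of-words "embedding" and overlap-based search.
DOCS = {
    "vpn-guide": "install the vpn client then sign in with sso",
    "expense-policy": "submit expense reports within thirty days",
}

def toy_generate(question):
    return "to connect install the vpn client and sign in"

def toy_embed(text):
    return set(text.lower().split())

def toy_search(query_vec, k):
    ranked = sorted(DOCS, key=lambda d: len(query_vec & toy_embed(DOCS[d])), reverse=True)
    return ranked[:k]

hyde_hits = hyde_retrieve("how do I get on the corporate network?", toy_generate,
                          toy_embed, toy_search, k=1)
```

The question never mentions "vpn", but the hypothetical answer does, which is what lets the search find the right document.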

Parent Document Retrieval

Retrieve small chunks for precision, then expand to the parent section for context. Balances retrieval granularity with generation context richness.
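
A minimal sketch of the expansion step, assuming every chunk stores the ID of its parent section at indexing time. The Chunk dataclass, the IDs, and the section texts are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str
    text: str

def expand_to_parents(hit_ids, chunks_by_id, parents_by_id):
    """Map retrieved chunk IDs to their parent sections, one copy per parent."""
    parent_ids = []
    for cid in hit_ids:
        pid = chunks_by_id[cid].parent_id
        if pid not in parent_ids:      # keep first-hit order, skip repeats
            parent_ids.append(pid)
    return [parents_by_id[pid] for pid in parent_ids]

chunks_by_id = {
    "c1": Chunk("c1", "sec-1", "Paragraph 1"),
    "c2": Chunk("c2", "sec-1", "Paragraph 2"),
    "c3": Chunk("c3", "sec-2", "Paragraph 3"),
}
parents_by_id = {"sec-1": "Full text of section 1", "sec-2": "Full text of section 2"}
context = expand_to_parents(["c2", "c1"], chunks_by_id, parents_by_id)
```

Two hits from the same section collapse into a single parent, so the prompt carries the section once rather than twice.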

RAG Evaluation Metrics

Metric	What It Measures	Tool
Context Recall	% of relevant info retrieved	RAGAS
Context Precision	% of retrieved info that’s relevant	RAGAS
Faithfulness	Does the answer stay within the retrieved context?	RAGAS, TruLens
Answer Relevancy	Does the answer address the question?	RAGAS
End-to-end accuracy	Is the final answer correct?	Human eval or LLM judge

RAGAS (Retrieval Augmented Generation Assessment) is a widely used open-source framework for automated RAG evaluation.
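
When you have hand-labeled relevant chunks for a set of test questions, the two retrieval metrics reduce to simple set arithmetic. The sketch below is that label-based simplification, not the RAGAS implementation, which estimates these quantities with an LLM; the chunk IDs are made up.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    return sum(1 for c in retrieved_ids if c in relevant_ids) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

retrieved = ["c1", "c4", "c7", "c9"]     # what the retriever returned
relevant = {"c1", "c2", "c7"}            # hand-labeled ground truth
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
```

Here two of the four retrieved chunks are relevant (precision 0.5) and two of the three relevant chunks were found (recall of about 0.67).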

Production Checklist

Chunking strategy validated on representative documents
Embedding model evaluated on your domain (generic benchmarks aren’t enough)
Hybrid search implemented (pure semantic search misses too much)
Reranker added for top-K refinement
Source citations included in responses (for user trust and auditability)
Retrieval evaluation metrics instrumented (recall, precision)
Staleness handling for re-indexing when documents change
Guardrails to prevent hallucination outside retrieved context
Latency budget: retrieval + reranking + generation should be < 3s for most apps
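
For the staleness item, one common approach is to store a content hash per document and re-embed only what changed. A sketch under that assumption, with invented document IDs and a hypothetical plan_reindex helper:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, indexed_hashes):
    """Compare current documents with stored hashes; return (to_upsert, to_delete)."""
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))
    return to_upsert, to_delete

# Hashes recorded at the last indexing run versus the documents as they are now.
indexed = {"faq": content_hash("old faq"), "pricing": content_hash("pricing v1"),
           "legacy": content_hash("retired page")}
current = {"faq": "new faq", "pricing": "pricing v1", "onboarding": "welcome"}
to_upsert, to_delete = plan_reindex(current, indexed)
```

Changed and new documents ("faq", "onboarding") get re-chunked and re-embedded, the removed one ("legacy") is deleted from the index, and the unchanged "pricing" page is left alone.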