
Generative AI 26 guides · updated 2026

From transformer foundations to production RAG, tool-using agents, and the Model Context Protocol — the GenAI stack as it's actually being built in 2026.

Retrieval-Augmented Generation (RAG)

Large language models know a lot. But they don’t know your internal documents, your latest product specs, last week’s meeting notes, or anything published after their training cutoff. RAG is the architecture that bridges this gap.


The Problem RAG Solves

Without RAG, you have two bad options for knowledge injection:

  1. Include everything in the system prompt — bloated context, expensive, and usually impossible (your knowledge base is thousands of documents)
  2. Fine-tune on your knowledge — slow, expensive, stale the moment anything changes, and doesn’t reliably inject specific facts (models don’t “learn facts” well from fine-tuning)

RAG offers a third way: retrieve the relevant knowledge at query time and include only that in the context. The model gets exactly the information it needs, nothing more.


How RAG Works: The Full Pipeline

Document Corpus                 Query Time
────────────────                ──────────────────────────────────
Your PDFs, docs,                User question
spreadsheets, wikis
       │                               │
       ▼                               ▼
[Chunk & Embed]                 [Embed Query]
       │                               │
       ▼                               │
Vector Database ←────Search────────────┘
(indexed embeddings)
       │
       ▼
Top-K relevant chunks
       │
       ▼
[Augment Prompt with chunks]
       │
       ▼
LLM generates response
grounded in retrieved context

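The pipeline runs in two stages: indexing (done once up front, then incrementally as documents change) and retrieval plus generation (done at query time). The sections below treat retrieval and generation as separate phases.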


Phase 1: Indexing

Chunking

Documents must be split into smaller pieces that fit meaningfully within the context window. Chunking strategy significantly affects retrieval quality.

Fixed-size chunking: Split every N tokens (e.g., 512 tokens with 50-token overlap between chunks). Simple, predictable, but ignores document structure.
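A minimal sketch of fixed-size chunking with overlap, assuming the text is already tokenized. The function name `chunk_fixed` is illustrative, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens, each sharing
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Whitespace splitting is an assumption; a real pipeline would use the
# embedding model's tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk starts on the last token of the previous one, so a sentence that straddles a boundary appears in both neighbors.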

Semantic chunking: Split on natural boundaries — paragraphs, sections, sentences. Preserves meaning but produces variable-size chunks.
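One simple way to respect natural boundaries is to pack whole paragraphs into chunks up to a size budget. The sketch below assumes paragraphs are separated by blank lines and counts words rather than tokens; `chunk_by_paragraph` and `max_words` are illustrative names.

```python
def chunk_by_paragraph(text, max_words=200):
    """Pack whole paragraphs into chunks of at most `max_words` words.
    A single paragraph longer than the limit becomes its own chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "Intro paragraph here.\n\nSecond paragraph with more words in it.\n\nThird."
para_chunks = chunk_by_paragraph(doc, max_words=8)
```

Chunk sizes vary, but no paragraph is ever cut in half.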

Hierarchical chunking: Index at multiple granularities. Store both paragraph-level chunks (for specific retrieval) and document-level summaries (for broad context). At query time, retrieve specific chunks but include section-level context.

Document structure for hierarchical chunking:

Document summary (500 tokens)          ← for broad context
├── Section 1 summary (100 tokens)
│   ├── Paragraph 1 (150 tokens)       ← retrieved
│   ├── Paragraph 2 (140 tokens)       ← retrieved
│   └── Paragraph 3 (160 tokens)
└── Section 2 summary (120 tokens)

Embedding

Each chunk is converted into a high-dimensional vector that captures its semantic meaning. Similar topics produce vectors that are close together in the vector space.

Common embedding models:


Phase 2: Retrieval

Semantic Search

The query is embedded, then compared against the stored chunk embeddings using cosine similarity or dot product. The top-K most semantically similar chunks are returned.

Strengths: Handles paraphrase, finds conceptually related content, multilingual
Weaknesses: Can miss exact keyword matches, struggles with rare terms and product codes
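The similarity search can be sketched with plain Python and toy vectors. The three-dimensional "embeddings" and chunk IDs below are made up for illustration, and a production vector database would normally use an approximate nearest-neighbor index instead of scoring every chunk.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors; real embedding models emit hundreds or thousands of dimensions.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-auth": [0.0, 0.1, 0.9],
}
results = top_k([0.8, 0.2, 0.0], index, k=2)
```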

Keyword Search

Traditional TF-IDF-style search: exact keyword matching with statistical term weighting. This approach has powered most search engines for 30+ years and is the backbone of Elasticsearch.

Strengths: Reliable exact-match, efficient, interpretable
Weaknesses: Doesn’t understand synonyms or paraphrase
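To make the statistical weighting concrete, here is a compact sketch of BM25, a widely used TF-IDF-style ranking function (choosing BM25 is an assumption; the text above only says "TF-IDF style"). Tokenization is plain lowercase whitespace splitting, and the documents are invented examples.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact-match only: no synonyms, no paraphrase
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "reset your password from the account page",
    "error code E4021 means the license expired",
    "contact support to change billing details",
]
scores = bm25_scores("E4021 license", docs)
```

Only the second document contains the product code, so it is the only one with a non-zero score. This is the exact-match behavior that pure vector search can miss.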

Hybrid Search (Best of Both)

Combine both approaches and fuse the scores. Standard practice in production systems:

# Hybrid search with Reciprocal Rank Fusion (RRF)
# vector_db and elasticsearch are placeholders for your own search clients.
semantic_results = vector_db.search(query_embedding, top_k=20)
keyword_results = elasticsearch.search(query_text, top_k=20)

# RRF score of a document = Σ 1/(k + rank_r), summed over each ranker r
# that returned it. Ranks are 1-based; None means "not returned".
def rrf_score(doc, semantic_rank, keyword_rank, k=60):
    score = 0.0
    if semantic_rank is not None:
        score += 1 / (k + semantic_rank)
    if keyword_rank is not None:
        score += 1 / (k + keyword_rank)
    return score
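The same formula can be applied to whole result lists at once. The sketch below assumes each retriever returns a ranked list of document IDs; `rrf_fuse` and the sample IDs are illustrative.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.
    Ranks are 1-based; a document missing from a list contributes nothing."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


semantic_ids = ["doc_a", "doc_b", "doc_c"]
keyword_ids = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([semantic_ids, keyword_ids])
```

Documents found by both retrievers (`doc_b`, `doc_a`) rise above those found by only one. Because RRF uses ranks rather than raw scores, the two retrievers do not need comparable score scales.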

Reranking

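After retrieval, a cross-encoder reranker scores each candidate against the query. Because it reads the query and the chunk together instead of comparing two precomputed vectors, it is more computationally expensive than embedding similarity but usually significantly more accurate. That cost is why it runs only on a short candidate list.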

Initial retrieval: top-50 chunks (cheap embeddings)
        │
        ▼
Reranker model (cross-encoder)
        │
        ▼
Reranked top-5 chunks (high quality)
        │
        ▼
Context for LLM

Cohere Rerank, Voyage Rerank, and BGE-Reranker-v2-M3 are popular options.
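The retrieve-then-rerank flow can be sketched independently of any particular reranker. Here `score_fn` is a placeholder for a cross-encoder call, and `overlap_score` is a deliberately crude stand-in (word overlap) so the example runs without a model; neither is a real library API.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score every (query, chunk) pair and keep the best top_n chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def overlap_score(query, chunk):
    """Stand-in scorer: fraction of query words found in the chunk.
    A real cross-encoder would run the pair through a model instead."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = [
    "Invoices are emailed monthly.",
    "To rotate an API key, open Settings and click Rotate.",
    "API rate limits reset every minute.",
]
best = rerank("how to rotate an api key", candidates, overlap_score, top_n=2)
```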


Phase 3: Generation

The retrieved chunks and the original question are combined into a prompt:

System: You are a helpful assistant. Answer the user's question
        using only the provided context. If the answer is not
        in the context, say so.

Context:
[DOCUMENT 1]
{retrieved chunk 1 content}

[DOCUMENT 2]
{retrieved chunk 2 content}

[DOCUMENT 3]
{retrieved chunk 3 content}

Question: {user_question}
Answer:

The critical instruction is “using only the provided context.” Without it, the model tends to blend retrieved facts with its own (potentially outdated or wrong) knowledge.
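Assembling that prompt is plain string formatting. A minimal sketch, where `build_prompt` and `SYSTEM_PROMPT` are illustrative names and the system text is copied from the template above:

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the user's question "
    "using only the provided context. If the answer is not "
    "in the context, say so."
)


def build_prompt(question, chunks):
    """Assemble the context block and question in the template's layout."""
    parts = ["Context:"]
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"[DOCUMENT {i}]")
        parts.append(chunk)
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days."],
)
```

Numbering the documents also makes it easy to ask the model to cite which chunk supports each claim.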


Advanced RAG Patterns (2025–2026)

Query Rewriting

Expand or decompose the user’s query before retrieval. A vague query becomes multiple precise sub-queries.

Original:  "How do I integrate with our CRM?"
Rewritten: ["CRM integration API documentation",
            "Salesforce connection setup steps",
            "webhook configuration CRM"]

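Once the sub-queries exist, each one is retrieved separately and the results are merged. The sketch below covers only that merge step; the rewriting itself would be an LLM call, and `FAKE_INDEX` is a hypothetical lookup table standing in for a real retriever.

```python
def retrieve_multi(sub_queries, retrieve, per_query_k=3):
    """Run retrieval once per sub-query and merge the results,
    keeping first-seen order and dropping duplicates."""
    seen, merged = set(), []
    for q in sub_queries:
        for chunk_id in retrieve(q)[:per_query_k]:
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Hypothetical stand-in for a real retriever: a fixed lookup table.
FAKE_INDEX = {
    "CRM integration API documentation": ["api-ref", "auth-guide"],
    "Salesforce connection setup steps": ["sf-setup", "auth-guide"],
    "webhook configuration CRM": ["webhooks", "api-ref"],
}
merged = retrieve_multi(list(FAKE_INDEX), FAKE_INDEX.get)
```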
Self-RAG

The model retrieves documents, decides whether they’re sufficient, critiques its own answer against them, and iterates if needed. More expensive but higher accuracy.
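The control flow can be sketched as a retrieve, generate, critique loop. This is a simplified illustration of the idea described above, not the published Self-RAG method; `retrieve`, `generate`, and `is_supported` are hypothetical callables, and the lambdas below are toy stand-ins so the loop can run without a model.

```python
def answer_with_self_check(question, retrieve, generate, is_supported, max_rounds=3):
    """Retrieve, draft an answer, and retry with a wider search until the
    draft is judged supported by the retrieved chunks or rounds run out."""
    answer = None
    for round_no in range(1, max_rounds + 1):
        chunks = retrieve(question, top_k=5 * round_no)  # widen the search each round
        answer = generate(question, chunks)
        if is_supported(answer, chunks):
            return answer, round_no
    return answer, max_rounds


# Toy stand-ins (assumptions, not real model calls).
corpus = ["a", "b", "c", "d", "e", "f", "the sky is blue"]
retrieve = lambda q, top_k: corpus[:top_k]
generate = lambda q, chunks: chunks[-1]
is_supported = lambda ans, chunks: ans == "the sky is blue"

answer, rounds = answer_with_self_check(
    "What colour is the sky?", retrieve, generate, is_supported
)
```

Each extra round adds at least one retrieval and one generation call, which is where the additional cost comes from.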

HyDE (Hypothetical Document Embeddings)

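Generate a hypothetical answer to the query, embed that answer, and use its embedding for retrieval. This is counterintuitive, since the hypothetical answer may be wrong, but it often outperforms embedding the query directly: an answer-shaped passage tends to sit closer in embedding space to real answer passages than a short question does.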
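A minimal sketch of the HyDE flow, with every external piece passed in as a callable. The toy `embed` (a bag of words), `search`, and `generate_hypothetical` functions are assumptions standing in for an embedding model, a vector store, and an LLM call.

```python
def hyde_retrieve(question, generate_hypothetical, embed, search, top_k=5):
    """HyDE: embed a model-written hypothetical answer instead of the question."""
    hypothetical = generate_hypothetical(question)
    return search(embed(hypothetical), top_k)


def embed(text):
    # Toy "embedding": the set of lowercase words.
    return set(text.lower().split())


DOCS = [
    "restart the router to restore the connection",
    "invoices are sent monthly",
]


def search(query_vec, top_k):
    # Rank documents by word overlap with the query "vector".
    ranked = sorted(DOCS, key=lambda d: len(query_vec & embed(d)), reverse=True)
    return ranked[:top_k]


def generate_hypothetical(question):
    # Stand-in for an LLM call that drafts a plausible answer.
    return "you should restart the router"


hits = hyde_retrieve("wifi is down, what now?", generate_hypothetical, embed, search, top_k=1)
```

In this toy setup the question shares no words with either document, while the hypothetical answer overlaps heavily with the right one.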

Parent Document Retrieval

Retrieve small chunks for precision, then expand to the parent section for context. Balances retrieval granularity with generation context richness.
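The expansion step is a lookup from chunk to parent. A sketch, assuming the index stores a parent ID alongside each chunk; the `parent_of` and `sections` mappings are illustrative data.

```python
def expand_to_parents(hit_ids, parent_of, sections):
    """Map retrieved chunk IDs to their parent sections, de-duplicated in order."""
    seen, out = set(), []
    for chunk_id in hit_ids:
        parent_id = parent_of[chunk_id]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(sections[parent_id])
    return out


sections = {"s1": "Full text of section 1", "s2": "Full text of section 2"}
parent_of = {"s1-p1": "s1", "s1-p2": "s1", "s2-p1": "s2"}
context = expand_to_parents(["s1-p2", "s1-p1", "s2-p1"], parent_of, sections)
```

Two hits from the same section yield that section only once, which keeps the prompt from filling up with duplicated context.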


RAG Evaluation Metrics

Metric | What It Measures | Tool
Context Recall | % of relevant info retrieved | RAGAS
Context Precision | % of retrieved info that’s relevant | RAGAS
Faithfulness | Does the answer stay within the retrieved context? | RAGAS, TruLens
Answer Relevancy | Does the answer address the question? | RAGAS
End-to-end accuracy | Is the final answer correct? | Human eval or LLM judge

RAGAS (Retrieval Augmented Generation Assessment) is a widely used open-source framework for automated RAG evaluation.
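The two retrieval metrics reduce to set arithmetic when you have labeled relevant chunks. The sketch below is a simplified, set-based version for building intuition; it is not how RAGAS computes its metrics, and the chunk IDs are made up.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall of retrieved chunks against labeled relevant ones."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = context_precision_recall(
    ["c1", "c2", "c3", "c4"],   # what the retriever returned
    ["c2", "c4", "c9"],         # what a human labeled as relevant
)
```

Here half of the retrieved chunks are relevant (precision 0.5) and two of the three relevant chunks were found (recall about 0.67).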


Production Checklist