Attention Mechanism

If you’ve ever wondered how a language model can write coherently about something mentioned 10,000 tokens ago, the answer is attention. It’s the mechanism that lets every position in a sequence draw directly on every other position, and it is arguably the most important architectural idea behind modern language models.


The Intuition Before the Math

Imagine you’re reading: “The trophy didn’t fit in the suitcase because it was too big.”

What does “it” refer to? The trophy. You figured that out by “attending” to “trophy” and “big” simultaneously while processing “it”, even though they’re separated by several words.

That’s exactly what attention does in a Transformer. For every token it’s processing, the model asks: which other tokens in this sequence are most relevant right now? It then aggregates information from those tokens, weighted by relevance.


The Three Vectors: Q, K, V

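Attention is computed using three learned projections of each token’s representation:

Query (Q): what this token is looking for in the rest of the sequence.
Key (K): what this token offers, used to match against queries.
Value (V): the content this token contributes once it has been matched.

The analogy is a fuzzy database lookup: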

Query: "I need information about the subject of this sentence"
Keys: [token1: "noun phrase", token2: "verb phrase", token3: "pronoun"]
→ Match query against keys to find relevance scores
Values: [actual content of each token]
→ Weighted sum of values based on relevance scores
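
To make the lookup analogy concrete, here is a minimal sketch in plain Python. The helper name `soft_lookup` and the toy 2-dimensional vectors are invented for illustration; in a real model the queries, keys, and values come from learned projections rather than hand-written lists.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Hypothetical helper: return a relevance-weighted blend of values."""
    # Relevance score: dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors.
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy vectors, invented for illustration only.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]  # points in the same direction as the first key

weights, blended = soft_lookup(query, keys, values)
print(weights)  # the first key gets the largest weight
print(blended)  # the result leans toward the first value
```

Unlike a real database lookup, nothing is ever an exact hit or miss: every value contributes a little, in proportion to how well its key matches the query.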

Scaled Dot-Product Attention

The formal operation:

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

Breaking this down:

  1. Q·Kᵀ — Dot product of queries with all keys. Gives a raw similarity score between every pair of tokens (an N×N matrix).

  2. / √d_k — Scale by the square root of key dimension. Without this, dot products in high dimensions get very large, pushing softmax into regions with near-zero gradients. Scaling keeps gradients healthy.

  3. softmax(…) — Convert each row of scores into weights between 0 and 1 that sum to 1. The highest scores become the dominant weights.

  4. · V — Weighted sum of all value vectors. Each output token is a blend of all input tokens, weighted by how much attention it paid to each.
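
The four steps can be sketched in a few lines of dependency-free Python. This is a teaching sketch, not a production implementation: it uses lists of rows instead of tensors, skips the learned projections, and the names `softmax_rows` and `scaled_dot_product_attention` as well as the toy inputs are assumptions made for the example.

```python
import math

def softmax_rows(matrix):
    """Apply a numerically stable softmax to each row."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def scaled_dot_product_attention(q, k, v):
    """q and k are N x d_k, v is N x d_v, all stored as lists of rows."""
    d_k = len(k[0])
    # Steps 1 and 2: dot product of every query with every key, then scale.
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in k] for q_row in q]
    # Step 3: each row becomes a set of weights that sums to 1.
    weights = softmax_rows(scores)
    # Step 4: each output row is a weighted sum of the value rows.
    output = [[sum(w * v_row[i] for w, v_row in zip(w_row, v))
               for i in range(len(v[0]))] for w_row in weights]
    return output, weights

# Toy inputs, invented for illustration: 3 tokens, d_k = 4, d_v = 2.
q = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
k = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Because each query here lines up with exactly one key, every token puts most of its weight on itself and the output rows stay close to the matching value rows.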

Step 1: Attention scores (raw)
          Token 1  Token 2  Token 3
Token 1 [   8.2,     2.1,     1.3   ]
Token 2 [   0.4,     9.1,     3.2   ]
Token 3 [   1.1,     4.3,     7.8   ]
Step 2: After row-wise softmax (each row sums to 1)
Token 1 [  0.997,   0.002,   0.001  ] ← almost entirely attends to itself
Token 2 [  0.000,   0.997,   0.003  ]
Token 3 [  0.001,   0.029,   0.970  ]
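
If you want to check a softmax step like this yourself, the sketch below applies a row-wise softmax to the Step 1 scores. The function name `softmax` is just a local helper. Softmax is exponential, so a gap of a few units between scores is already enough for one weight to take nearly all of the mass.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)  # subtracting the max does not change the result
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# The raw scores from Step 1 above.
scores = [
    [8.2, 2.1, 1.3],
    [0.4, 9.1, 3.2],
    [1.1, 4.3, 7.8],
]

weights = [softmax(row) for row in scores]
for row in weights:
    print([round(w, 3) for w in row])
```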

Multi-Head Attention

A single attention head produces one set of attention weights per token, so it can only express one pattern of relevance at a time. Multi-head attention runs several attention operations in parallel, each with its own Q, K, V projections.

Input
├── Head 1: Q₁, K₁, V₁ → Attention₁ (syntactic relationships)
├── Head 2: Q₂, K₂, V₂ → Attention₂ (semantic similarity)
├── Head 3: Q₃, K₃, V₃ → Attention₃ (co-reference resolution)
└── Head h: Qₕ, Kₕ, Vₕ → Attentionₕ (positional patterns)
Concatenate all heads
Output projection (Wₒ)
Output

Each head might specialize in a different kind of linguistic or semantic relationship. Research has found heads that track subject-verb agreement, heads that detect named entity types, and heads that handle long-range dependencies.
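
Here is a rough sketch of the split, attend, concatenate, and project flow from the diagram. The projection matrices are random stand-ins for learned weights, and every name here (`random_matrix`, `multi_head_attention`, the sizes) is an assumption made for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e * v_row[i] for e, v_row in zip(exps, v)) / total
                    for i in range(len(v[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) projection triples, one per head."""
    head_outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        head_outputs.append(attention(q, k, v))
    # Concatenate the head outputs along the feature dimension.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(x))]
    # Output projection mixes information across heads.
    return matmul(concat, w_o)

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads  # each head works in a smaller subspace
x = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(d_model, d_model, rng)

out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))  # 4 8
```

Each head works in a `d_model / n_heads` subspace, so adding heads changes how the computation is divided rather than how much of it there is.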

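The number of heads is a hyperparameter, typically chosen so that the model dimension divides evenly across heads. The original Transformer base model, for example, used 8 heads of 64 dimensions each for a model dimension of 512.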


Causal Masking for Decoder-Only Models

In a generative model (GPT, Claude, LLaMA), the model should only be able to attend to previous tokens — not future ones it hasn’t generated yet. This is enforced with a causal mask: a triangular matrix that sets future positions to -∞ before the softmax.

Attention mask, added to the scores before softmax (0 = allowed, -∞ = masked):
     t1   t2   t3   t4
t1 [  0,  -∞,  -∞,  -∞ ]
t2 [  0,   0,  -∞,  -∞ ]
t3 [  0,   0,   0,  -∞ ]
t4 [  0,   0,   0,   0 ]

When -∞ is passed through softmax, it becomes 0 — effectively hiding those positions. This ensures the model generates text autoregressively without “cheating” by looking ahead.
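
A small sketch of causal masking in plain Python follows. It uses the additive convention (0 for allowed positions, -∞ for future ones), so adding the mask leaves allowed scores unchanged. The names `causal_mask` and `masked_softmax` and the all-ones scores are assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply softmax to each row."""
    result = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(x - peak) for x in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

# Uniform toy scores, so any difference in the output comes from the mask alone.
scores = [[1.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, token 1 can only attend to itself, token 2 splits its weight evenly over two positions, and so on down the triangle.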


Cross-Attention (Encoder-Decoder)

In models that have both an encoder and a decoder (like T5 or translation models), the decoder uses cross-attention to look at the encoder’s output: the queries (Q) come from the decoder’s current states, while the keys (K) and values (V) come from the encoder’s output.

This is how encoder-decoder translation models work: the decoder “looks up” relevant source language information when generating each target word.
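
In code, the only difference from self-attention is where the inputs come from. The sketch below leaves out the learned projections to stay short, and the names and toy vectors are invented for illustration.

```python
import math

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = len(k[0])
    output, weights = [], []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        w_row = [e / total for e in exps]
        weights.append(w_row)
        output.append([sum(w * v_row[i] for w, v_row in zip(w_row, v))
                       for i in range(len(v[0]))])
    return output, weights

def cross_attention(decoder_states, encoder_output):
    # Queries come from the decoder; keys and values come from the encoder.
    return attention(decoder_states, encoder_output, encoder_output)

# Toy data: 3 source positions from the encoder, 2 target positions in the decoder.
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]

output, weights = cross_attention(decoder_states, encoder_output)
print(len(weights), len(weights[0]))  # 2 3
```

The weight matrix is no longer square: it has one row per target position and one column per source position.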


Flash Attention: Making It Practical

The naive attention computation has O(N²) memory complexity. For a 128K-token context, a single head’s N×N attention matrix takes about 32 GiB in 16-bit precision, and across dozens of heads that reaches terabytes. Flash Attention (2022, then v2 in 2023) solved this with a tiling approach that computes attention in blocks and never materializes the full N×N matrix.
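
The arithmetic behind that memory cost is quick to verify. The sketch assumes 16-bit (2-byte) scores, takes 128K to mean 131,072 tokens, and uses a hypothetical count of 32 heads; none of these figures refer to a specific model.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2, num_heads=1):
    """Bytes needed to store the full N x N score matrix for each head."""
    return seq_len * seq_len * bytes_per_element * num_heads

n = 128 * 1024  # assumed meaning of "128K tokens"

one_head_gib = attention_matrix_bytes(n) / 2**30
all_heads_tib = attention_matrix_bytes(n, num_heads=32) / 2**40  # 32 heads is an assumption

print(one_head_gib)   # 32.0
print(all_heads_tib)  # 1.0
```

That is per layer, and before counting activations or gradients, which is why avoiding the full matrix matters so much at long context lengths.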

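Results: roughly 2–4× faster than standard attention on modern GPUs, with memory that grows linearly rather than quadratically in sequence length. The output is exact, not an approximation. It is now the default or a built-in option in most major deep learning frameworks.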

Standard attention: Load Q, K, V → Compute full N×N matrix → Write back
Flash Attention: Tile Q, K, V → Compute in SRAM blocks → Never write full matrix
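
The trick that makes block-by-block computation possible is an "online" softmax that keeps a running maximum and a running normalizer. The sketch below shows only that idea, for a single query in plain Python. It is not the Flash Attention kernel itself, which is a fused GPU implementation that also tiles the queries and keeps blocks in on-chip SRAM. All function names are assumptions made for the example.

```python
import math
import random

def attention_one_query(q, keys, values):
    """Reference version: materializes every score for one query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(len(values[0]))]

def blockwise_attention_one_query(q, keys, values, block_size):
    """Same result, visiting keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[i] for e, v in zip(exps, v_block))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

rng = random.Random(0)
keys = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(10)]
values = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(10)]
q = [rng.gauss(0, 1) for _ in range(4)]

full = attention_one_query(q, keys, values)
tiled = blockwise_attention_one_query(q, keys, values, block_size=3)
print(max(abs(a - b) for a, b in zip(full, tiled)))  # tiny floating-point difference
```

Only one block of scores exists at any time, yet the final answer matches the full computation up to rounding.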

Modern Variants You Should Know

Variant                          Key Idea                                              Used In
Multi-Query Attention (MQA)      All query heads share one K, V head                   Falcon, PaLM
Grouped-Query Attention (GQA)    Groups of query heads share a K, V head               LLaMA 3, Mistral, Gemma
Sliding Window Attention         Only attend to a local window of W tokens             Mistral 7B, Longformer
Ring Attention                   Distribute attention across GPUs                      1M+ context models
Linear Attention                 Replace O(N²) attention with an O(N) approximation    RWKV, Mamba-adjacent models
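
Two of these ideas are simple enough to sketch directly: how query heads map onto shared K, V heads in MQA and GQA, and what a causal sliding-window mask looks like. The function names and the head counts below are illustrative assumptions, not the configuration of any listed model.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the shared K/V head used by a given query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def sliding_window_mask(n, window):
    """True where token i may attend to token j: causal and within the last `window` tokens."""
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]

# 8 query heads with 8, 2, and 1 K/V heads respectively.
print([kv_head_for_query_head(h, 8, 8) for h in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
print([kv_head_for_query_head(h, 8, 2) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
print([kv_head_for_query_head(h, 8, 1) for h in range(8)])  # [0, 0, 0, 0, 0, 0, 0, 0]

mask = sliding_window_mask(5, window=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Sharing K, V heads shrinks the key-value cache that has to be kept during generation, and the sliding window caps how many past tokens each position reads.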

Why Attention Produces Intelligence

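There’s something profound about what attention enables. By allowing every token to directly influence every other token, the model can resolve references such as “it” back to “the trophy”, connect information separated by thousands of tokens, and track several kinds of relationships at once through its different heads.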

This long-range, flexible information routing is what makes Transformers qualitatively different from earlier recurrent architectures, which had to pass information along one step at a time. It is also why “Attention Is All You Need” turned out to be a prescient title.