Attention Mechanism

If you’ve ever wondered how a language model can write coherently about something mentioned 10,000 tokens ago, the answer is attention. It’s the mechanism that lets every part of a sequence directly influence every other part, and it is arguably the central innovation behind modern language models.

The Intuition Before the Math

Imagine you’re reading: “The trophy didn’t fit in the suitcase because it was too big.”

What does “it” refer to? The trophy. You figured that out by “attending” to “trophy” and “big” simultaneously while processing “it”, even though they’re separated by several words.

That’s exactly what attention does in a Transformer. For every token it’s processing, the model asks: which other tokens in this sequence are most relevant right now? It then aggregates information from those tokens, weighted by relevance.

The Three Vectors: Q, K, V

Attention is computed using three learned projections of each token’s representation:

Query (Q): What this token is looking for
Key (K): What each token offers (its “label”)
Value (V): The actual information each token carries

The analogy is a fuzzy database lookup:

Query: "I need information about the subject of this sentence"
Keys:  [token1: "noun phrase", token2: "verb phrase", token3: "pronoun"]
       → Match query against keys to find relevance scores
Values: [actual content of each token]
       → Weighted sum of values based on relevance scores
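The three projections can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's code: the names W_q, W_k, W_v and the sizes d_model and d_k are assumptions chosen for readability, and random matrices stand in for weights that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_k = 3, 8, 4   # illustrative sizes (assumption)

# Token representations: one row per token.
X = rng.normal(size=(n_tokens, d_model))

# Random stand-ins for the learned projection matrices (assumption:
# in a trained model these are parameters, not random draws).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token offers
V = X @ W_v   # the information each token carries

# Relevance of every key to every query: row i holds token i's scores.
scores = Q @ K.T
print(Q.shape, K.shape, V.shape, scores.shape)
```

Each entry scores[i, j] is the dot product of token i's query with token j's key, which is the "match query against keys" step in the analogy.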

Scaled Dot-Product Attention

The formal operation:

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

Breaking this down:

Q·Kᵀ — Dot product of queries with all keys. Gives a raw similarity score between every pair of tokens (an N×N matrix).
/ √d_k — Scale by the square root of key dimension. Without this, dot products in high dimensions get very large, pushing softmax into regions with near-zero gradients. Scaling keeps gradients healthy.
softmax(…) — Convert scores to probabilities (0 to 1, summing to 1). High scores become dominant weights.
· V — Weighted sum of all value vectors. Each output token is a blend of all input tokens, weighted by how much attention it paid to each.
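The four steps above can be written as a short NumPy function. This is a sketch for a single head with no batching or masking; the helper names softmax and scaled_dot_product_attention are chosen here for illustration, and Q, K, V are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum does not change the result but
    # avoids overflow in exp().
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (N, N) similarities, scaled
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of value vectors

rng = np.random.default_rng(0)
N, d_k = 4, 8
Q = rng.normal(size=(N, d_k))
K = rng.normal(size=(N, d_k))
V = rng.normal(size=(N, d_k))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)           # one output vector per token
print(weights.sum(axis=-1))   # row sums of the attention matrix
```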

Step 1: Attention scores (illustrative values, taken as already scaled)
        Token 1  Token 2  Token 3
Token 1 [ 8.2,    2.1,    1.3  ]
Token 2 [ 0.4,    9.1,    3.2  ]
Token 3 [ 1.1,    4.3,    7.8  ]

Step 2: After softmax (each row becomes a probability distribution)
Token 1 [ 0.997,  0.002,  0.001 ]  ← almost entirely attends to itself
Token 2 [ 0.000,  0.997,  0.003 ]
Token 3 [ 0.001,  0.029,  0.970 ]
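The softmax step can be checked directly by applying it row by row to the Step 1 scores. The snippet below is a small verification sketch assuming NumPy; it treats the scores as given and applies no further scaling.

```python
import numpy as np

# Scores copied from Step 1 above.
scores = np.array([
    [8.2, 2.1, 1.3],
    [0.4, 9.1, 3.2],
    [1.1, 4.3, 7.8],
])

# Row-wise softmax, shifted by the row maximum for numerical stability.
shifted = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(np.round(weights, 3))
print(weights.sum(axis=1))      # row sums
print(weights.argmax(axis=1))   # most-attended token per row
```

Because softmax exponentiates the scores, a gap of a few points between the largest score and the rest is enough to put nearly all of the weight on the largest one.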

Multi-Head Attention

A single attention head produces one attention pattern per token, so it tends to capture only a limited kind of relationship at a time. Multi-head attention runs several attention operations in parallel, each with its own Q, K, V projections.

Input
  │
  ├── Head 1: Q₁, K₁, V₁ → Attention₁  (syntactic relationships)
  ├── Head 2: Q₂, K₂, V₂ → Attention₂  (semantic similarity)
  ├── Head 3: Q₃, K₃, V₃ → Attention₃  (co-reference resolution)
  └── Head h: Qₕ, Kₕ, Vₕ → Attentionₕ  (positional patterns)
                                  │
                           Concatenate all heads
                                  │
                         Output projection (Wₒ)
                                  │
                               Output

Each head might specialize in a different kind of linguistic or semantic relationship. Research has found heads that track subject-verb agreement, heads that detect named entity types, and heads that handle long-range dependencies.
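The split, attend, concatenate, project flow in the diagram can be sketched as follows. This is a simplified NumPy version under a few assumptions: d_model is divisible by the number of heads, one large projection matrix is reshaped into per-head slices (equivalent to separate per-head projections), and random matrices stand in for learned weights. The function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    N, d_model = X.shape
    d_head = d_model // n_heads   # assumes d_model % n_heads == 0

    def split_heads(M):
        # (N, d_model) -> (n_heads, N, d_head)
        return M.reshape(N, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head computes its own (N, N) attention pattern independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                          # (n_heads, N, d_head)

    # Concatenate the heads, then mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(N, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
N, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(N, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)       # (N, d_model)
print(weights.shape)   # (n_heads, N, N): one attention pattern per head
```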

The number of heads is a hyperparameter:

GPT-2 small: 12 heads
LLaMA 3 8B: 32 heads
GPT-4: not publicly disclosed (for reference, GPT-3 175B uses 96 heads)

Causal Masking for Decoder-Only Models

In a generative model (GPT, Claude, LLaMA), the model should only be able to attend to previous tokens — not future ones it hasn’t generated yet. This is enforced with a causal mask: a triangular matrix that sets future positions to -∞ before the softmax.

Additive attention mask (0 = allowed, -∞ = masked), added to the scores before softmax:
         t1   t2   t3   t4
    t1 [  0,  -∞,  -∞,  -∞ ]
    t2 [  0,   0,  -∞,  -∞ ]
    t3 [  0,   0,   0,  -∞ ]
    t4 [  0,   0,   0,   0 ]

When -∞ is passed through softmax, it becomes 0 — effectively hiding those positions. This ensures the model generates text autoregressively without “cheating” by looking ahead.
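A causal mask can be built and applied in a few lines. The sketch below assumes NumPy and an additive mask that holds 0 at allowed positions and -∞ at future positions; the random scores are a stand-in for Q·Kᵀ / √d_k.

```python
import numpy as np

N = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(N, N))   # stand-in for scaled attention scores

# Additive mask: 0 on and below the diagonal, -inf strictly above it.
mask = np.triu(np.full((N, N), -np.inf), k=1)

masked = scores + mask
masked = masked - masked.max(axis=-1, keepdims=True)   # stability shift
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

Every entry above the diagonal comes out as exactly 0, and the first token can only attend to itself, so its row is a single weight of 1.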

Cross-Attention (Encoder-Decoder)

In models that have both an encoder and decoder (like T5 or translation models), the decoder uses cross-attention to look at the encoder’s output:

Q comes from the decoder’s current hidden state (what the decoder wants to know)
K and V come from the encoder’s output (what the encoder knows about the input)

This is how encoder-decoder translation models work: the decoder “looks up” relevant source language information when generating each target word.
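The only structural difference from self-attention is where Q, K, and V come from, which a small sketch makes concrete. It assumes NumPy, random stand-ins for the encoder output, decoder state, and learned projections, and illustrative names such as encoder_out and decoder_state.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
n_src, n_tgt = 6, 3   # source (encoder) and target (decoder) lengths

encoder_out = rng.normal(size=(n_src, d_model))     # placeholder encoder output
decoder_state = rng.normal(size=(n_tgt, d_model))   # placeholder decoder states

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = decoder_state @ W_q   # queries come from the decoder
K = encoder_out @ W_k     # keys come from the encoder
V = encoder_out @ W_v     # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n_tgt, n_src)
context = weights @ V                       # (n_tgt, d_k)
print(weights.shape, context.shape)
```

The attention matrix is rectangular here: one row per target position, one column per source position.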

Flash Attention: Making It Practical

The naive attention computation has O(N²) memory complexity, because it materializes an N×N score matrix for every head. For a 128K token context, the score matrices for all heads of a single layer can add up to terabytes of memory. Flash Attention (2022, then v2 in 2023) addressed this with a tiling approach that computes attention in blocks and keeps running softmax statistics, so the full N×N matrix is never materialized.
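To put the quadratic cost in numbers, here is a back-of-the-envelope calculation. The 2-byte score size (fp16/bf16) and the head count of 32 are assumptions picked for illustration, not figures for any specific model.

```python
n_tokens = 128 * 1024      # 128K-token context
bytes_per_score = 2        # assumption: fp16/bf16 attention scores
n_heads = 32               # assumption: illustrative head count

per_head = n_tokens ** 2 * bytes_per_score   # one N x N score matrix
per_layer = per_head * n_heads               # all heads of one layer

print(per_head / 2**30, "GiB per head")
print(per_layer / 2**40, "TiB per layer")
```

Under these assumptions a single head needs 32 GiB and one layer needs 1 TiB just for the score matrices, before counting activations, weights, or the other layers.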

Results: roughly 2–4× faster than standard attention on modern GPUs, with memory that grows linearly rather than quadratically with sequence length. It’s now widely supported across production frameworks.

Standard attention: Load Q, K, V → Compute full N×N matrix → Write back
Flash Attention:    Tile Q, K, V → Compute in SRAM blocks → Never write full matrix
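The core trick, processing keys and values in blocks while keeping a running softmax, can be sketched in NumPy. This is a simplified illustration of the idea only: it tiles over keys but not queries, runs on the CPU, and does not model GPU SRAM, so it is not the actual Flash Attention kernel. The function names and block size are assumptions.

```python
import numpy as np

def naive_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # full (N, N) matrix
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((N, 1), -np.inf)   # largest score seen so far, per query
    running_sum = np.zeros((N, 1))           # softmax denominator so far
    acc = np.zeros((N, V.shape[-1]))         # unnormalized output so far

    for start in range(0, K.shape[0], block):
        K_blk = K[start:start + block]
        V_blk = V[start:start + block]
        s = (Q @ K_blk.T) * scale            # only an (N, block) slice of scores

        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)   # rescale earlier partial results
        p = np.exp(s - new_max)

        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ V_blk
        running_max = new_max

    return acc / running_sum

rng = np.random.default_rng(0)
N, d = 10, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

print(np.allclose(tiled_attention(Q, K, V, block=4), naive_attention(Q, K, V)))
```

The blocked version matches the naive one up to floating-point error, which is the key property: tiling changes how the result is computed, not what is computed.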

Modern Variants You Should Know

Variant	Key Idea	Used In
Multi-Query Attention (MQA)	All query heads share one K, V	Falcon, PaLM
Grouped-Query Attention (GQA)	Groups of queries share K, V	LLaMA 3, Mistral, Gemma
Sliding Window Attention	Only attend to last W tokens	Mistral 7B, Longformer
Ring Attention	Distribute attention across GPUs	1M+ context models
Linear Attention	Replace O(N²) softmax attention with an O(N) approximation	RWKV, Mamba-adjacent models
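As one example from the table, grouped-query attention can be sketched by letting several query heads reuse the same K and V head. This is a minimal NumPy illustration with made-up sizes and no masking; the function name grouped_query_attention and the head counts are assumptions, not values from any listed model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    # Q: (n_q_heads, N, d_head); K, V: (n_kv_heads, N, d_head)
    n_q_heads, n_kv_heads = Q.shape[0], K.shape[0]
    group_size = n_q_heads // n_kv_heads   # assumes an even split

    # Each K/V head is shared by `group_size` consecutive query heads.
    K_rep = np.repeat(K, group_size, axis=0)
    V_rep = np.repeat(V, group_size, axis=0)

    scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_rep

rng = np.random.default_rng(0)
N, d_head = 6, 8
n_q_heads, n_kv_heads = 8, 2   # 1 KV head would be MQA; 8 would be standard multi-head

Q = rng.normal(size=(n_q_heads, N, d_head))
K = rng.normal(size=(n_kv_heads, N, d_head))
V = rng.normal(size=(n_kv_heads, N, d_head))

out = grouped_query_attention(Q, K, V)
print(out.shape)                            # (n_q_heads, N, d_head)
print(K.size / (n_q_heads * N * d_head))    # K size relative to full multi-head
```

The output keeps one slice per query head, while the stored keys and values shrink by the group size, which is where the savings in key-value cache memory come from.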

Why Attention Produces Intelligence

There’s something profound about what attention enables. By allowing every token to directly influence every other token, the model can:

Track pronouns back to their referents across thousands of words
Recognize that “Apple” in a finance document refers to the company, not the fruit
Answer questions by retrieving the relevant passage from a long document
Generate code that correctly uses a variable defined 200 lines earlier

This long-range, flexible information routing is what makes Transformers qualitatively different from previous architectures — and why “attention is all you need” turned out to be a prescient title.

Written by NPBlue AI Team: AI/ML engineers who build and ship production GenAI systems, not just demo notebooks.

Reviewed for technical accuracy. Spot an error? Let us know.