Attention Mechanism Explained: How Models Learn What to Focus On
The attention mechanism solved a specific, concrete problem — the fixed-size context vector bottleneck in sequence-to-sequence models, covered in Sequence-to-Sequence Models — and in doing so, it introduced an idea so fundamentally useful that it eventually became the core building block of the transformer architecture, covered next in Transformers. Understanding attention at the mechanical level — queries, keys, values, and how they combine — is essential groundwork before tackling transformers directly.
The Core Idea: Let the Model Choose What to Focus On
Instead of forcing a decoder to rely on a single, fixed-size summary of the entire input, attention lets it look back at every position in the input at each generation step, computing a weighted combination where more relevant positions get higher weight.
Generating the French word "vous" (for "you"):

Attention weights over the English input "How are you":

| Input word | Attention weight |
|---|---|
| "How" | 0.05 |
| "are" | 0.10 |
| "you" | 0.85 (the model learns to focus heavily on the relevant word) |

These attention weights aren’t hand-specified. They’re computed dynamically, differently for every output word being generated, and the mechanism for computing them is learned during training just like any other part of the network.
Queries, Keys, and Values: The Mechanism Behind the Weights
Attention computes its weights from three sets of vectors, usually produced by learned projections, whose names are borrowed from information retrieval: a query (what am I looking for right now), keys (a label for each piece of available information), and values (the actual content of that information).
```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

def attention(query, keys, values):
    # Step 1: compute similarity scores between the query and every key
    scores = np.dot(keys, query)       # how relevant is each key to this query

    # Step 2: convert scores into a valid probability distribution
    weights = softmax(scores)          # attention weights, summing to 1

    # Step 3: compute a weighted sum of the values, using those weights
    output = np.dot(weights, values)   # the attended output
    return output, weights
```

The dot product between the query and each key (step 1) measures similarity: a query vector that closely aligns with a particular key vector produces a high score, which softmax (covered in Probability Distributions) then converts into a correspondingly large attention weight for that position.
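The same three steps can be written compactly as equations. The symbols are notation chosen here for illustration rather than taken from the code: $q$ is the query, $k_i$ and $v_i$ are the $i$-th key and value, $s_i$ is the score, and $\alpha_i$ is the attention weight.

```latex
\begin{aligned}
s_i &= k_i \cdot q && \text{(step 1: similarity score)} \\
\alpha_i &= \frac{\exp(s_i)}{\sum_j \exp(s_j)} && \text{(step 2: softmax)} \\
\text{output} &= \sum_i \alpha_i \, v_i && \text{(step 3: weighted sum of values)}
\end{aligned}
```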
A Concrete Numeric Example
```python
query = np.array([1.0, 0.5])
keys = np.array([
    [1.0, 0.5],    # very similar to the query
    [0.1, 0.9],    # less similar
    [-1.0, -0.5],  # dissimilar
])
values = np.array([
    [10, 20],
    [30, 40],
    [50, 60],
])

output, weights = attention(query, keys, values)
print(weights)  # approx. [0.63, 0.31, 0.05]: largest weight on the first, most similar key
print(output)   # approx. [18.37, 28.37]: a weighted blend of the values, pulled toward the first one
```

The output is a blend of the available values, weighted most heavily toward the value whose key was most relevant to the current query. Here the first pair receives about 63% of the weight, and larger gaps between the scores would concentrate the weight further. This is exactly the “focus on what matters right now” behavior attention is designed to produce.
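To see where those numbers come from, the arithmetic can be traced step by step. The sketch below recomputes the example with its intermediate values and then, as an added illustration that is not part of the original example, multiplies the query by a hypothetical factor of 4 to show how larger score gaps sharpen the focus.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

query = np.array([1.0, 0.5])
keys = np.array([[1.0, 0.5], [0.1, 0.9], [-1.0, -0.5]])
values = np.array([[10, 20], [30, 40], [50, 60]])

# Step 1: dot products, e.g. 1.0*1.0 + 0.5*0.5 = 1.25 for the first key
scores = keys @ query
print(scores)                  # approx. 1.25, 0.55, -1.25

# Step 2: softmax turns the scores into weights that sum to 1
weights = softmax(scores)
print(weights.round(2))        # approx. 0.63, 0.31, 0.05

# Step 3: weighted sum of the value rows
output = weights @ values
print(output.round(2))         # approx. 18.37, 28.37

# Assumed variation: scaling the query by 4 widens the score gaps,
# so softmax puts almost all of the weight on the first key.
sharp_weights = softmax(keys @ (4 * query))
print(sharp_weights.round(2))  # approx. 0.94, 0.06, 0.00
```

The second result is a reminder that how concentrated the weights are depends on the magnitude of the scores, not only on which key ranks first.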
Attention in the Encoder-Decoder Context
Applied to the sequence-to-sequence setting, the decoder’s current hidden state acts as the query, and the encoder’s hidden states at every input position act as both keys and values — at each decoding step, the decoder computes a fresh set of attention weights over the entire input, rather than relying on one fixed summary computed once.
```python
# Conceptual: at each decoding step, recompute attention over the whole input.
# decoder_hidden_state, encoder_hidden_states, and output_length come from the
# surrounding sequence-to-sequence model and are not defined here.
for decoding_step in range(output_length):
    query = decoder_hidden_state
    context, attention_weights = attention(query, encoder_hidden_states, encoder_hidden_states)
    # context is now a dynamically weighted summary, different for each decoding step,
    # rather than one fixed vector used for the entire output sequence
```

This directly addresses the bottleneck described in Sequence-to-Sequence Models: no matter how long the input sequence is, the decoder can access any part of it directly at every generation step, rather than everything being forced through one compressed vector.
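A runnable toy version of this loop can make the idea concrete. It is a sketch under stated assumptions: the encoder states are random placeholders, and `toy_decoder_step` is a hypothetical stand-in for a real recurrent decoder update, included only so that the query changes from step to step.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

def attention(query, keys, values):
    weights = softmax(keys @ query)
    return weights @ values, weights

def toy_decoder_step(hidden, context):
    # Hypothetical update rule: a real decoder would use a learned RNN cell here
    return np.tanh(hidden + context)

rng = np.random.default_rng(0)
input_length, hidden_size, output_length = 5, 4, 3

# Placeholder encoder outputs: one hidden vector per input position
encoder_hidden_states = rng.normal(size=(input_length, hidden_size))
decoder_hidden_state = np.zeros(hidden_size)

all_weights = []
for decoding_step in range(output_length):
    # The current decoder state is the query; encoder states are keys and values
    context, attention_weights = attention(
        decoder_hidden_state, encoder_hidden_states, encoder_hidden_states
    )
    all_weights.append(attention_weights)
    decoder_hidden_state = toy_decoder_step(decoder_hidden_state, context)

all_weights = np.array(all_weights)  # shape: (output_length, input_length)
print(all_weights.round(2))
```

Because the initial decoder state is all zeros, the first row of weights is uniform; later rows differ because the query changes at every step.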
Why This Idea Generalizes Far Beyond Translation
The query-key-value mechanism doesn’t require an encoder-decoder setup at all. It can be applied within a single sequence, letting every position attend to every other position in the same sequence. This generalization, called self-attention, is the mechanism that powers the transformer architecture, covered fully in Transformers. It allows a model to relate any two words in a sentence directly, regardless of how far apart they are: a word at the very start of a document can influence the interpretation of a word at the very end in a single computation, without passing information through many sequential recurrent steps.
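A minimal sketch of self-attention follows, assuming random matrices `W_q`, `W_k`, and `W_v` in place of trained projection weights (these names and sizes are illustrative, not from the text). Every row of `X` acts as a query against every row of the same `X`, so the weight matrix has one row per position. Transformers additionally scale the scores by the square root of the key dimension; that step is omitted here to stay close to the simpler dot-product form used earlier.

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax: each row becomes a distribution over positions
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # project the same sequence three ways
    weights = softmax_rows(Q @ K.T)       # (n, n): position i attending to position j
    return weights @ V, weights

rng = np.random.default_rng(0)
n_positions, d_model = 4, 3
X = rng.normal(size=(n_positions, d_model))  # placeholder token vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)  # (4, 4)
print(output.shape)   # (4, 3)
```

Each row of `weights` sums to 1, and the first position can attend to the last one directly, with no intermediate recurrent steps.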
Visualizing Attention: A Genuinely Useful Interpretability Tool
Because attention weights are explicit, interpretable numbers, visualizing them — which input words a model attended to most strongly when producing a specific output — has become a widely used, genuinely informative tool for understanding what a trained model is actually doing, beyond just its final predictions.
Attention heatmap example (translation task):

| | How | are | you |
|---|---|---|---|
| "Comment" | 0.7 | 0.2 | 0.1 |
| "allez" | 0.1 | 0.8 | 0.1 |
| "vous" | 0.1 | 0.1 | 0.8 |

This kind of visualization directly shows the model correctly learning word-level alignment between the source and target languages, entirely from data, without any explicit alignment supervision provided during training.
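One simple way to read an alignment off such a matrix is to take, for each output word, the input word with the largest weight. The sketch below does this for the illustrative numbers in the heatmap; the variable names are chosen here for the example.

```python
import numpy as np

source_words = ["How", "are", "you"]
target_words = ["Comment", "allez", "vous"]

# Rows are output (target) words, columns are input (source) words
attention_matrix = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# For each target word, pick the source position with the highest weight
best_source = attention_matrix.argmax(axis=1)
alignment = [(t, source_words[i]) for t, i in zip(target_words, best_source)]
print(alignment)
# [('Comment', 'How'), ('allez', 'are'), ('vous', 'you')]
```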
Summary
| Component | Role |
|---|---|
| Query | What the model is currently looking for |
| Key | A label/representation for each available piece of information |
| Value | The actual content retrieved, weighted by relevance |
| Attention weights | Per-step relevance scores, computed from query-key similarity and normalized with softmax |
Attention’s core contribution — letting a model dynamically decide what to focus on, rather than relying on a fixed, one-size-fits-all summary — turned out to be far more broadly useful than its original translation use case, becoming the foundational mechanism behind the transformer architecture covered next.