Attention Mechanism Explained: How Models Learn What to Focus On

How the attention mechanism works step by step — queries, keys, values, and attention weights — and why it fixed seq2seq's bottleneck.

The attention mechanism solved a specific, concrete problem: the fixed-size context vector bottleneck in sequence-to-sequence models, covered in Sequence-to-Sequence Models. In doing so, it introduced an idea useful enough to become the core building block of the transformer architecture, covered next in Transformers. Understanding attention at the mechanical level (queries, keys, values, and how they combine) is essential groundwork before tackling transformers directly.


The Core Idea: Let the Model Choose What to Focus On

Instead of forcing a decoder to rely on a single, fixed-size summary of the entire input, attention lets it look back at every position in the input at each generation step, computing a weighted combination where more relevant positions get higher weight.

Generating the French word "vous" (for "you"):
Attention weights over the English input "How are you":
"How" → 0.05
"are" → 0.10
"you" → 0.85 ← the model learns to focus heavily on the relevant word

These attention weights aren’t hand-specified — they’re computed dynamically, differently for every single output word being generated, and the mechanism for computing them is learned during training just like any other part of the network.
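As a minimal sketch of what "weighted combination" means here, the weights from the example can be applied to one vector per input word. The 3-dimensional encoder vectors below are made-up placeholders chosen for readability, not values from a real model.

```python
import numpy as np

# Hypothetical encoder representations, one row per input word.
# The numbers are invented purely for illustration.
encoder_states = np.array([
    [1.0, 0.0, 0.0],  # "How"
    [0.0, 1.0, 0.0],  # "are"
    [0.0, 0.0, 1.0],  # "you"
])

# Attention weights from the example, used when generating "vous".
weights = np.array([0.05, 0.10, 0.85])

# The context vector is the weighted sum of the encoder representations.
context = weights @ encoder_states
print(context)
```

Because the placeholder vectors are one-hot, the context vector simply reproduces the weights, which makes it easy to see that "you" contributes most to what the decoder receives at this step.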


Queries, Keys, and Values: The Mechanism Behind the Weights

Attention computes its weights using three learned projections of the input, borrowed conceptually from information retrieval terminology: a query (what am I looking for right now), keys (a label for each piece of available information), and values (the actual content of that information).

import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

def attention(query, keys, values):
    # Step 1: compute similarity scores between the query and every key
    scores = np.dot(keys, query)  # how relevant is each key to this query
    # Step 2: convert scores into a valid probability distribution
    weights = softmax(scores)  # attention weights, summing to 1
    # Step 3: compute a weighted sum of the values, using those weights
    output = np.dot(weights, values)  # the attended output
    return output, weights

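The dot product between the query and each key (step 1) measures similarity: a query vector that closely aligns with a particular key vector produces a high score. Softmax (covered in Probability Distributions) then converts the scores into weights that are positive and sum to 1, so a higher score always yields a larger attention weight for that position. Because softmax exponentiates the scores, gaps between scores are amplified rather than preserved proportionally.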
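The attention function above takes the query, keys, and values as ready-made arrays. The sketch below shows one way the three learned projections could produce them. The matrices W_q, W_k, and W_v, the sizes, and the random inputs are all assumptions for illustration; in a real model the matrices would be trained rather than sampled.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

rng = np.random.default_rng(0)

d_model, d_k = 4, 3                     # hypothetical sizes for the example
inputs = rng.normal(size=(5, d_model))  # 5 input positions
state = rng.normal(size=d_model)        # whatever is "asking", e.g. a decoder state

# Stand-ins for learned parameters; a trained model would supply these.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

query = state @ W_q     # shape (d_k,)
keys = inputs @ W_k     # shape (5, d_k)
values = inputs @ W_v   # shape (5, d_k)

scores = keys @ query       # one similarity score per input position
weights = softmax(scores)   # positive, sums to 1
output = weights @ values   # shape (d_k,)
print(weights.round(3), output.round(3))
```

Keys and values come from the same inputs but through different matrices, which is what lets a model use one representation for matching and another for the content it returns.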


A Concrete Numeric Example

query = np.array([1.0, 0.5])
keys = np.array([
    [1.0, 0.5],    # very similar to the query
    [0.1, 0.9],    # less similar
    [-1.0, -0.5],  # dissimilar
])
values = np.array([
    [10, 20],
    [30, 40],
    [50, 60],
])
output, weights = attention(query, keys, values)
print(weights)  # roughly [0.63, 0.31, 0.05]: the first key/value pair gets the most weight
print(output)   # roughly [18.4, 28.4]: a weighted blend of the values, pulled toward the first one

The output is a blend of the available values, weighted most heavily toward the value whose key best matches the current query. Here the first key receives about 63% of the weight, the second about 31%, and the third about 5%, so the focus is soft rather than all-or-nothing. This is the “focus on what matters right now” behavior attention is designed to produce.
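To see where those numbers come from, the same example can be unrolled into its three steps. This is just the arithmetic of the attention function written out inline, with intermediate values rounded in the comments.

```python
import numpy as np

query = np.array([1.0, 0.5])
keys = np.array([[1.0, 0.5], [0.1, 0.9], [-1.0, -0.5]])
values = np.array([[10, 20], [30, 40], [50, 60]])

# Step 1: dot product of the query with each key.
scores = keys @ query                        # [1.25, 0.55, -1.25]

# Step 2: softmax, subtracting the max first for numerical stability.
exp_scores = np.exp(scores - scores.max())
weights = exp_scores / exp_scores.sum()      # roughly [0.63, 0.31, 0.05]

# Step 3: weighted sum of the values.
output = weights @ values                    # roughly [18.4, 28.4]

print(np.round(scores, 2), np.round(weights, 2), np.round(output, 1))
```

The second key still receives about a third of the weight because its score (0.55) is not far below the best score (1.25). Larger gaps between scores would produce a sharper focus.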


Attention in the Encoder-Decoder Context

Applied to the sequence-to-sequence setting, the decoder’s current hidden state acts as the query, and the encoder’s hidden states at every input position act as both keys and values — at each decoding step, the decoder computes a fresh set of attention weights over the entire input, rather than relying on one fixed summary computed once.

# Conceptual: at each decoding step, recompute attention over the whole input.
# output_length, decoder_hidden_state, and encoder_hidden_states are placeholders
# supplied by the surrounding seq2seq model; they are not defined here.
for decoding_step in range(output_length):
    query = decoder_hidden_state  # updated by the decoder at every step
    context, attention_weights = attention(query, encoder_hidden_states, encoder_hidden_states)
    # context is now a dynamically weighted summary, different for each decoding step,
    # rather than one fixed vector used for the entire output sequence

This directly addresses the bottleneck described in Sequence-to-Sequence Models: however long the input sequence is, the decoder can read any encoder position directly at every generation step, rather than receiving everything through one compressed vector.
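The loop can be made runnable with stand-in data. In the sketch below, the encoder states are random numbers and the decoder update is a toy tanh layer with a hypothetical weight matrix W, used only so the query changes between steps. A real decoder would use a trained recurrent cell.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

def attention(query, keys, values):
    weights = softmax(keys @ query)
    return weights @ values, weights

rng = np.random.default_rng(0)
hidden_size = 4
encoder_hidden_states = rng.normal(size=(6, hidden_size))  # 6 input positions
decoder_hidden_state = rng.normal(size=hidden_size)
W = rng.normal(size=(2 * hidden_size, hidden_size)) * 0.5  # hypothetical toy weights

contexts, all_weights = [], []
for decoding_step in range(3):
    # Encoder states serve as both keys and values.
    context, attention_weights = attention(
        decoder_hidden_state, encoder_hidden_states, encoder_hidden_states
    )
    contexts.append(context)
    all_weights.append(attention_weights)
    # Toy stand-in for a recurrent cell: mix the old state with the new context.
    decoder_hidden_state = np.tanh(
        np.concatenate([decoder_hidden_state, context]) @ W
    )

contexts = np.array(contexts)
all_weights = np.array(all_weights)
print(all_weights.round(2))  # one row of weights per decoding step
```

Each row of all_weights sums to 1 and covers all six input positions, and the rows differ from step to step because the query changes.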


Why This Idea Generalizes Far Beyond Translation

The query-key-value mechanism doesn’t require an encoder-decoder setup at all. It can be applied within a single sequence, letting every position attend to every other position in the same sequence. This generalization, called self-attention, is the mechanism that powers the transformer architecture, covered fully in Transformers. It allows a model to directly relate any two words in a sentence regardless of how far apart they are: a word at the very start of a document can influence the interpretation of a word at the very end in a single computation, without passing information through many sequential recurrent steps.
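A stripped-down sketch of self-attention follows. For brevity it assumes the input rows act as queries, keys, and values all at once, with no learned projections; the 4-by-2 input is invented for the example. A full transformer layer adds details that this sketch omits.

```python
import numpy as np

def softmax_rows(scores):
    # Apply softmax independently to each row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def self_attention(x):
    # Simplification: x serves as queries, keys, and values.
    scores = x @ x.T                # (n, n): every position scored against every other
    weights = softmax_rows(scores)  # row i holds position i's attention weights
    return weights @ x, weights

# Four positions in one sequence, each a 2-dimensional vector (made-up values).
x = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.5],
])
output, weights = self_attention(x)
print(weights.round(2))
```

Entry weights[0, 3] is the weight the first position places on the last one. It is computed in the same matrix product as every other pair, so distance within the sequence adds no extra steps.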


Visualizing Attention: A Genuinely Useful Interpretability Tool

Because attention weights are explicit, interpretable numbers, visualizing them — which input words a model attended to most strongly when producing a specific output — has become a widely used, genuinely informative tool for understanding what a trained model is actually doing, beyond just its final predictions.

Attention heatmap example (translation task):
             How    are    you
"Comment"    0.7    0.2    0.1
"allez"      0.1    0.8    0.1
"vous"       0.1    0.1    0.8

This kind of visualization directly shows the model correctly learning word-level alignment between the source and target languages, entirely from data, without any explicit alignment supervision provided during training.
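One simple way to read an alignment off such a heatmap is to take, for each output word, the input word with the largest weight. The sketch below does this for the illustrative numbers in the heatmap above, which are example values rather than measurements from a trained model.

```python
import numpy as np

source_words = ["How", "are", "you"]
target_words = ["Comment", "allez", "vous"]

# Attention weights from the heatmap: one row per target word,
# one column per source word.
heatmap = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# For each target word, pick the source word with the largest weight.
alignment = {
    target: source_words[int(np.argmax(row))]
    for target, row in zip(target_words, heatmap)
}
print(alignment)
```

Taking the argmax discards the rest of each row, so it is a summary of the heatmap rather than a replacement for looking at the full weights.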

Summary

Component            Role
Query                What the model is currently looking for
Key                  A label/representation for each available piece of information
Value                The actual content retrieved, weighted by relevance
Attention weights    Per-step relevance scores, computed via query-key similarity and normalized by softmax

Attention’s core contribution — letting a model dynamically decide what to focus on, rather than relying on a fixed, one-size-fits-all summary — turned out to be far more broadly useful than its original translation use case, becoming the foundational mechanism behind the transformer architecture covered next.