The Attention Mechanism, Explained Through a Worked Numeric Example

Before attention existed, sequence-to-sequence models had a structural limitation that no amount of clever training could fully work around: the encoder had to compress the entire input sentence, however long, into a single fixed-size vector, and the decoder had to generate the entire output from that one summary. A five-word sentence and a fifty-word sentence got exactly the same amount of “memory,” so more and more information was lost as inputs grew longer. Attention removed this bottleneck, and the idea proved flexible enough to become the core building block of the transformer architecture. Everything covered later in this track, including the Transformers guide, builds directly on the mechanism explained here.

The Core Idea, Before Any Math

Instead of forcing a decoder to work from one fixed summary of the entire input, attention lets it look back at every position in the input at every single generation step, and decide — freshly, each time — which positions matter most right now.

Translating English "How are you" to French, word by word:

Generating "Comment": attention weights over "How", "are", "you"
  How  → 0.70   ← heavily focused on "How"
  are  → 0.20
  you  → 0.10

Generating "vous" (the French word for "you"):
  How  → 0.05
  are  → 0.10
  you  → 0.85   ← now heavily focused on "you" instead

These weights aren’t hand-programmed rules — they’re computed dynamically, differently for every single word the model generates, using a mechanism that’s learned during training just like every other part of the network.

The Mechanism: Query, Key, Value

Attention computes these weights from three kinds of vectors, each produced in a real model by a learned projection, with terminology borrowed from information retrieval: a query (what am I looking for, right now), a set of keys (a label attached to each available piece of information), and a set of values (the actual content associated with each key). The simplified code below skips the learned projections and takes the query, keys, and values as ready-made arrays.

import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))   # subtract max for numerical stability
    return exp_x / np.sum(exp_x)

def attention(query, keys, values):
    scores = np.dot(keys, query)          # step 1: similarity between query and each key
    weights = softmax(scores)             # step 2: turn scores into a probability distribution
    output = np.dot(weights, values)      # step 3: weighted sum of the values
    return output, weights

Tracing Real Numbers Through Every Step

Rather than trusting the code abstractly, walk an actual example through it end to end.

query = np.array([1.0, 0.5])

keys = np.array([
    [1.0, 0.5],     # position A: identical to the query
    [0.1, 0.9],     # position B: somewhat different
    [-1.0, -0.5],   # position C: exactly opposite
])

values = np.array([
    [10, 20],   # the content at position A
    [30, 40],   # the content at position B
    [50, 60],   # the content at position C
])

output, weights = attention(query, keys, values)

print("Raw similarity scores:", np.dot(keys, query))
print("Attention weights (after softmax):", weights)
print("Final output (weighted blend of values):", output)

Running this, the raw dot-product scores come out as [1.25, 0.55, -1.25]: position A scores highest because its key is identical to the query, and position C scores lowest because its key points in exactly the opposite direction. Softmax converts these into a proper probability distribution, approximately [0.63, 0.31, 0.05]: position A dominates, position B contributes a meaningful but smaller amount, and position C is almost entirely ignored. The final output is a weighted blend of the three value vectors, approximately [18.4, 28.4]. That is closer to position A’s value [10, 20] than to either of the others, because that position’s key was the best match for what the query was “looking for,” but position B’s 31% share still pulls the blend noticeably toward [30, 40].

This is the entire mechanism, laid bare in numbers: similarity in, weighted relevance out, with the weighting recomputed fresh for every different query.
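
The function above handles one query at a time. As a minimal sketch, the same three steps can be applied to several queries at once with matrix multiplication. The names `softmax_rows` and `batched_attention`, and the second query, are illustrative assumptions rather than part of the original code.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax: each row becomes its own probability distribution.
    shifted = x - np.max(x, axis=-1, keepdims=True)   # subtract row max for stability
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def batched_attention(queries, keys, values):
    scores = queries @ keys.T          # (num_queries, num_keys) similarity matrix
    weights = softmax_rows(scores)     # one distribution per query
    output = weights @ values          # (num_queries, value_dim) blended values
    return output, weights

keys = np.array([[1.0, 0.5], [0.1, 0.9], [-1.0, -0.5]])
values = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])

# Row 0 is the query from the worked example. Row 1 is a made-up query
# pointing the opposite way, so it should favor position C instead.
queries = np.array([[1.0, 0.5], [-1.0, -0.5]])

output, weights = batched_attention(queries, keys, values)
print("Weights per query:\n", np.round(weights, 2))
print("Output per query:\n", np.round(output, 2))
```

Each row of `weights` is an independent distribution over the same three positions, which is the "recomputed fresh for every query" behavior in matrix form.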

A Detail the Simplified Version Glosses Over: Scaling

The code shown above computes similarity as a raw dot product between query and key vectors. In practice, real implementations divide that score by the square root of the key vector’s dimensionality before applying softmax:

def scaled_attention(query, keys, values):
    d_k = keys.shape[-1]                          # dimensionality of the key vectors
    scores = np.dot(keys, query) / np.sqrt(d_k)   # scaled similarity
    weights = softmax(scores)
    output = np.dot(weights, values)
    return output, weights

The reason this matters: as vector dimensionality grows, dot products between vectors tend to grow larger in magnitude simply as a statistical consequence of summing more terms, not because the vectors are more “similar” in any meaningful sense. If the components of the query and key are independent with zero mean and unit variance, their dot product has variance d_k, so its typical magnitude grows like sqrt(d_k). Feed softmax scores that are too large in magnitude, and it produces an extremely peaked, nearly one-hot distribution (almost all weight on a single position, everything else near zero). That in turn produces extremely small gradients during backpropagation, since softmax’s gradient flattens out sharply near its extremes. This is the same kind of saturation that underlies the vanishing gradient problem discussed in Vanishing Gradient Problem. Dividing by sqrt(d_k) brings the scores back to roughly unit variance regardless of how large the vectors are, which keeps softmax’s output, and therefore the gradient flowing back through it, well-behaved throughout training. This scaling factor is exactly why the mechanism transformers use is called “scaled dot-product attention,” not simply “dot-product attention.”
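
The effect of dimensionality can be checked empirically. The sketch below assumes queries and keys drawn from a standard normal distribution (real learned vectors will not be exactly like this) and measures, for a few values of `d_k`, how spread out the scores are and how peaked the resulting softmax is, with and without scaling. The variable names and trial counts are arbitrary choices made for illustration.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))   # subtract max for numerical stability
    return exp_x / np.sum(exp_x)

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable
num_keys = 10
num_trials = 500

results = {}
for d_k in (4, 64, 1024):
    # ASSUMPTION: i.i.d. standard-normal components for queries and keys.
    queries = rng.standard_normal((num_trials, d_k))
    keys = rng.standard_normal((num_trials, num_keys, d_k))

    raw = np.einsum("tkd,td->tk", keys, queries)   # raw dot-product scores
    scaled = raw / np.sqrt(d_k)                    # scaled scores

    # Average of the largest attention weight: near 1.0 means nearly one-hot.
    raw_peak = float(np.mean([softmax(row).max() for row in raw]))
    scaled_peak = float(np.mean([softmax(row).max() for row in scaled]))

    results[d_k] = (float(raw.std()), float(scaled.std()), raw_peak, scaled_peak)
    print(f"d_k={d_k:5d}  raw std={raw.std():6.2f}  scaled std={scaled.std():4.2f}  "
          f"raw peak={raw_peak:.2f}  scaled peak={scaled_peak:.2f}")
```

Under these assumptions the raw score spread should track sqrt(d_k) and the unscaled softmax should drift toward one-hot as d_k grows, while the scaled version stays roughly the same at every dimensionality.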

Placing This Inside an Encoder-Decoder

Applied to the original sequence-to-sequence setting, the decoder’s current hidden state becomes the query, and the encoder’s hidden states across the entire input sequence serve as both the keys and the values.

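# Conceptual sketch: recompute attention fresh at every decoding step.
# Random arrays stand in for real hidden states; in a real decoder each
# state would be produced in sequence, using the previous step's context.
rng = np.random.default_rng(0)
encoder_hidden_states = rng.normal(size=(5, 4))   # 5 input positions, hidden size 4
decoder_hidden_states = rng.normal(size=(3, 4))   # one state per output step

for decoder_hidden_state in decoder_hidden_states:
    query = decoder_hidden_state
    context, attention_weights = attention(query, encoder_hidden_states, encoder_hidden_states)
    # `context` is a fresh, dynamically weighted summary of the ENTIRE input,
    # recomputed at every decoding step, not one fixed vector reused across
    # the whole output sequence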

This directly dissolves the original bottleneck described earlier: no matter how long the input sequence gets, the decoder can reach any part of it directly, at full strength, at every single generation step — rather than everything being squeezed through one fixed-size vector regardless of the input’s actual length.
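
One way to see this concretely: the context vector keeps the same size however long the input is, while the attention weights always cover every input position. The sketch below uses random arrays as stand-ins for real encoder and decoder states, and the hidden size of 8 is an arbitrary assumption.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))   # subtract max for numerical stability
    return exp_x / np.sum(exp_x)

def attention(query, keys, values):
    weights = softmax(keys @ query)   # one weight per input position
    return weights @ values, weights

rng = np.random.default_rng(1)
hidden_size = 8                              # ASSUMPTION: arbitrary size
query = rng.standard_normal(hidden_size)     # stand-in for a decoder hidden state

shapes = {}
for input_length in (5, 50):
    # Stand-in for encoder hidden states over an input of this length.
    encoder_states = rng.standard_normal((input_length, hidden_size))
    context, weights = attention(query, encoder_states, encoder_states)
    shapes[input_length] = (context.shape, weights.shape)
    print(input_length, context.shape, weights.shape)
# prints:
# 5 (8,) (5,)
# 50 (8,) (50,)
```

The context is still a fixed-size vector, but it is rebuilt from all positions for each query instead of being computed once and reused.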

Why This Idea Generalizes Far Beyond Translation

The query-key-value mechanism doesn’t actually require two separate sequences (an encoder and a decoder) at all — it works just as well applied within a single sequence, letting every position attend to every other position in that same sequence. This generalization is called self-attention, and it’s precisely the mechanism that powers the transformer architecture. Self-attention is what lets a model directly relate the first word of a long document to the last word, in a single computational step, without needing to pass information sequentially through many intermediate recurrent steps the way older architectures did — a meaningful structural advantage that shows up repeatedly once we get to Transformers.
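
The "single computational step" claim can be illustrated with a small experiment: change only the last position of a sequence and check whether the first position's output moves after one self-attention pass. This is a sketch under assumptions: the weight matrices `w_q`, `w_k`, `w_v` are random stand-ins for learned parameters, and the sizes are arbitrary.

```python
import numpy as np

def softmax_rows(x):
    shifted = x - np.max(x, axis=-1, keepdims=True)   # row-wise stability shift
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Queries, keys, and values are all projections of the same sequence x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax_rows(q @ k.T / np.sqrt(k.shape[-1]))   # (seq_len, seq_len)
    return weights @ v, weights

rng = np.random.default_rng(2)
seq_len, d_model, d_k = 6, 8, 4                       # ASSUMPTION: arbitrary sizes
x = rng.standard_normal((seq_len, d_model))
# Random stand-ins for learned projections, scaled to keep scores moderate.
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)

# Perturb ONLY the last position, then rerun the same single layer.
x_changed = x.copy()
x_changed[-1] += 1.0
output_changed, _ = self_attention(x_changed, w_q, w_k, w_v)

first_position_shift = float(np.abs(output_changed[0] - output[0]).max())
print("Weight matrix shape:", weights.shape)
print("First position's output moved:", first_position_shift > 0)
```

In a recurrent model, information from the last position would have to travel through every intermediate step to influence the first; here the dependence shows up after one layer.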

Visualizing Attention: More Than a Pretty Picture

Because attention weights are explicit, human-readable numbers rather than opaque internal states, visualizing them — which input positions a model weighted most heavily when producing a specific output — has become one of the field’s genuinely useful interpretability tools, not just a decorative diagram.

Attention heatmap for an English-to-French translation task:

              How    are    you
"Comment"    0.70   0.20   0.10
"allez"      0.10   0.80   0.10
"vous"       0.10   0.10   0.80

This heatmap directly shows the model having learned correct word-level alignment between the two languages entirely from data — nobody told the model that “vous” corresponds to “you”; it discovered that correspondence purely by learning which input positions were useful to attend to when generating each output word, guided only by translation examples during training and nothing more explicit than that.
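
Reading an alignment off a heatmap like this one amounts to taking the highest-weighted input position in each row. The sketch below copies the illustrative numbers from the heatmap above. Treating the argmax as "the" alignment is a simplifying assumption, since real attention weights are often more diffuse.

```python
import numpy as np

source_words = ["How", "are", "you"]
target_words = ["Comment", "allez", "vous"]

# Rows: generated French words. Columns: English input positions.
heatmap = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
])

# For each output word, pick the input word with the largest weight.
alignment = {
    target: source_words[index]
    for target, index in zip(target_words, heatmap.argmax(axis=1))
}
print(alignment)
# prints: {'Comment': 'How', 'allez': 'are', 'vous': 'you'}
```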

A Preview of Multi-Head Attention

One more detail worth knowing before moving on, since it comes up immediately in every real transformer implementation: rather than computing attention once, transformers compute it several times in parallel — using several independent sets of learned query, key, and value projections, called “heads” — and then combine the results.

The intuition behind this is worth internalizing even before seeing the full mechanics later: a single attention computation can only really capture one notion of “relevance” at a time — but language (and many other sequential domains) has multiple, simultaneously useful notions of relevance. One head might learn to track grammatical subject-verb relationships, another might learn to track which pronoun refers to which earlier noun, and another might track something less interpretable but still useful for the task. Running several heads in parallel and combining their outputs gives the model several independent “attention perspectives” on the same input at once, rather than being forced to compress every kind of relevant relationship into a single weighting scheme. This is exactly the mechanism used inside every transformer block, covered fully in Transformers.
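
A minimal sketch of the idea follows. It assumes the combination step used in standard transformer implementations: concatenate the head outputs, then apply one more linear projection (`w_o` here). All weight matrices are random stand-ins for learned parameters, and the function and variable names are illustrative.

```python
import numpy as np

def softmax_rows(x):
    shifted = x - np.max(x, axis=-1, keepdims=True)   # row-wise stability shift
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def multi_head_attention(x, head_params, w_o):
    head_outputs = []
    for w_q, w_k, w_v in head_params:
        # Each head has its own projections, so it forms its own weighting.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax_rows(q @ k.T / np.sqrt(k.shape[-1]))
        head_outputs.append(weights @ v)
    # Combine: concatenate along the feature axis, then mix with w_o.
    concatenated = np.concatenate(head_outputs, axis=-1)
    return concatenated @ w_o

rng = np.random.default_rng(3)
seq_len, d_model, num_heads = 5, 8, 2                 # ASSUMPTION: arbitrary sizes
d_head = d_model // num_heads

x = rng.standard_normal((seq_len, d_model))
head_params = [
    tuple(rng.standard_normal((d_model, d_head)) / np.sqrt(d_model) for _ in range(3))
    for _ in range(num_heads)
]
w_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

output = multi_head_attention(x, head_params, w_o)
print(output.shape)   # (5, 8)
```

Splitting `d_model` across heads keeps the total work comparable to one full-width attention while giving each head its own weighting scheme.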

A Common Point of Confusion, Addressed Directly

People new to attention sometimes assume the query, key, and value vectors must come from different sources or have obviously different meanings. In cross-attention (encoder-decoder), that’s partly true: queries come from the decoder, while keys and values both come from the encoder. But in self-attention, all three are computed from the same input sequence through three separate learned linear projections (three different weight matrices applied to the same input). The distinct names describe three different roles the same underlying data plays, not three different pieces of data. This is worth internalizing now, because it’s exactly how self-attention is implemented inside every transformer block covered next.
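
In code, the difference between the two cases comes down to which arrays are passed in. The sketch below uses one hypothetical function, `qkv_attention`, for both. The weight matrices are random stand-ins for learned parameters and the sequence lengths are arbitrary.

```python
import numpy as np

def softmax_rows(x):
    shifted = x - np.max(x, axis=-1, keepdims=True)   # row-wise stability shift
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def qkv_attention(query_source, memory_source, w_q, w_k, w_v):
    q = query_source @ w_q     # queries: projected from query_source
    k = memory_source @ w_k    # keys: projected from memory_source
    v = memory_source @ w_v    # values: also projected from memory_source
    weights = softmax_rows(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(4)
d_model, d_k = 8, 4                                   # ASSUMPTION: arbitrary sizes
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

encoder_states = rng.standard_normal((7, d_model))    # 7 input positions
decoder_states = rng.standard_normal((3, d_model))    # 3 output positions

# Cross-attention: queries from the decoder, keys and values from the encoder.
_, cross_weights = qkv_attention(decoder_states, encoder_states, w_q, w_k, w_v)

# Self-attention: the SAME sequence supplies queries, keys, and values.
_, self_weights = qkv_attention(encoder_states, encoder_states, w_q, w_k, w_v)

print(cross_weights.shape)   # (3, 7)
print(self_weights.shape)    # (7, 7)
```

Even in the self-attention call, `q`, `k`, and `v` are three different arrays, because each comes from its own projection matrix.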

Summary

Component	Role
Query	What the model is currently looking for
Key	A learned representation attached to each available piece of information
Value	The actual content retrieved, weighted according to relevance
Attention weights	Per-position relevance scores, computed from query-key similarity and normalized by softmax
Self-attention	The same query-key-value mechanism applied within one sequence, rather than between two
Multi-head attention	Several independent attention computations run in parallel, then combined

Attention’s real contribution was letting a model decide dynamically what to focus on, rather than being forced to rely on one fixed, one-size-fits-all summary regardless of input length. That single idea turned out to generalize far beyond its original translation use case, becoming the foundational mechanism inside the transformer architecture covered in full next — and the scaling and multi-head extensions covered above are exactly the details that turn this basic mechanism into something powerful enough to underpin models operating at genuinely large scale.

Written by the NPBlue Engineering Team, practitioners who write every guide from hands-on production experience, not paraphrased documentation.
