Transformers Explained: Self-Attention, Multi-Head Attention, and Positional Encoding

How transformers work — self-attention, multi-head attention, and positional encoding — and why they replaced RNNs for sequence modeling.

The transformer architecture, introduced in the 2017 paper “Attention Is All You Need,” made a radical claim: you don’t need recurrence to model sequences, because attention applied within a single sequence is sufficient. That architectural shift underlies essentially every modern large language model. Understanding its three core components (self-attention, multi-head attention, and positional encoding) is what makes the current era of deep learning legible rather than mysterious.


Self-Attention: Every Position Attends to Every Other Position

Self-attention applies the query-key-value mechanism covered in Attention Mechanism within a single sequence — every word’s representation is updated by attending to every other word in the same sequence, including itself.

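import numpy as np

def softmax_rows(scores):
    # Not defined in the original listing; a standard row-wise softmax is assumed.
    # Subtracting the row max first keeps np.exp from overflowing.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q  # queries: shape (seq_len, d_k)
    K = X @ W_k  # keys: shape (seq_len, d_k)
    V = X @ W_v  # values: shape (seq_len, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product similarity
    weights = softmax_rows(scores)  # attention weights, per row
    output = weights @ V  # weighted combination of values
    return output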

Dividing by sqrt(d_k) keeps the dot products from growing in magnitude as the dimensionality increases: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k. Without the scaling, large scores would push softmax into a region with extremely small gradients, a direct, practical instance of the numerical stability concerns covered in Numerical Computation.
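The effect of the scaling can be checked numerically. The sketch below is illustrative only: it assumes query and key components are independent standard normal values, and the helper names `score_std` and `softmax` are introduced here, not taken from the article.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def score_std(d_k, scaled, num_samples=5000, seed=0):
    """Empirical standard deviation of query-key dot products for dimension d_k."""
    rng = np.random.default_rng(seed)
    # Assumption: components are independent, zero mean, unit variance
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    scores = np.sum(q * k, axis=-1)
    if scaled:
        scores = scores / np.sqrt(d_k)
    return scores.std()

# Unscaled spread grows roughly like sqrt(d_k); scaled spread stays near 1
for d_k in (4, 64, 256):
    print(d_k, round(score_std(d_k, scaled=False), 2), round(score_std(d_k, scaled=True), 2))

# Effect on softmax: one query scored against 16 keys at d_k = 256
rng = np.random.default_rng(1)
d_k = 256
q = rng.standard_normal(d_k)
K = rng.standard_normal((16, d_k))
raw_scores = K @ q
peak_unscaled = softmax(raw_scores).max()                 # largest weight without scaling
peak_scaled = softmax(raw_scores / np.sqrt(d_k)).max()    # largest weight with scaling
print(peak_unscaled, peak_scaled)
```

Under these assumptions the unscaled softmax tends to put almost all of its weight on a single key, which is the saturated regime where gradients become very small.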

For a sentence like “The cat sat because it was tired,” self-attention lets the model connect “it” to “cat” in a single computation, no matter how many words separate them. An RNN would instead need to carry that information forward through every intermediate time step, which is exactly the long-range dependency limitation covered in Recurrent Neural Networks.
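To make the "single computation" concrete, here is a toy sketch for that sentence. The embeddings and projection matrices are random stand-ins (nothing is trained), so the printed weights carry no linguistic meaning. The point is only that the weight matrix has one entry for every pair of positions, including the ("it", "cat") pair.

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax with max subtraction for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(X, W_q, W_k):
    # Returns the (seq_len, seq_len) matrix of attention weights
    Q = X @ W_q
    K = X @ W_k
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))

tokens = ["The", "cat", "sat", "because", "it", "was", "tired"]
d_model, d_k = 8, 4  # illustrative sizes, chosen arbitrarily
rng = np.random.default_rng(0)
X = rng.standard_normal((len(tokens), d_model))  # random stand-in embeddings
W_q = rng.standard_normal((d_model, d_k))        # random stand-in projections
W_k = rng.standard_normal((d_model, d_k))

weights = attention_weights(X, W_q, W_k)
it_row = weights[tokens.index("it")]  # how "it" distributes attention over all tokens
for token, w in zip(tokens, it_row):
    print(f"it -> {token}: {w:.3f}")
```

In a trained model, the weight from "it" to "cat" would be learned. Here it is simply whatever the random projections produce.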


Multi-Head Attention: Multiple Relationships at Once

A single attention computation produces just one set of attention weights per position, so it tends to capture only one kind of relationship between words at a time. Multi-head attention runs several attention computations in parallel, each with its own separately learned query/key/value projections, letting the model capture multiple different kinds of relationships simultaneously.

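def multi_head_attention(X, head_weights, W_output):
    # head_weights: one (W_q, W_k, W_v) tuple per head, each matrix of shape
    # (d_model, d_k) with d_k = d_model // num_heads. The weights are passed in
    # explicitly here because the original listing relied on an undefined
    # get_head_weights helper and an undefined W_output.
    head_outputs = []
    for W_q, W_k, W_v in head_weights:  # each head has its own weights
        head_output = self_attention(X, W_q, W_k, W_v)
        head_outputs.append(head_output)
    concatenated = np.concatenate(head_outputs, axis=-1)  # (seq_len, num_heads * d_k)
    return concatenated @ W_output  # a final learned linear projection combines all heads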

In practice, different attention heads have been observed to specialize in different linguistic relationships: one head might consistently attend to syntactic dependencies (such as subject-verb agreement), another to coreference (pronouns to their referents). This specialization emerges from training rather than being explicitly designed.
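A self-contained sketch of the shapes involved may help. It assumes the per-head weights are passed in as explicit arrays and uses random values in place of trained parameters. The sizes (5 tokens, d_model = 16, 4 heads) are arbitrary choices for illustration.

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax with max subtraction for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

def multi_head_attention(X, head_weights, W_output):
    # head_weights: list of (W_q, W_k, W_v) tuples, one per head
    head_outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in head_weights]
    concatenated = np.concatenate(head_outputs, axis=-1)  # (seq_len, num_heads * d_k)
    return concatenated @ W_output                        # back to (seq_len, d_model)

seq_len, d_model, num_heads = 5, 16, 4
d_k = d_model // num_heads  # each head works in a smaller 4-dimensional subspace
rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d_model))
head_weights = [
    tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))  # W_q, W_k, W_v
    for _ in range(num_heads)
]
W_output = rng.standard_normal((num_heads * d_k, d_model))

out = multi_head_attention(X, head_weights, W_output)
print(out.shape)  # (5, 16)
```

Splitting d_model across heads keeps the total cost close to that of one full-width attention computation, while giving each head its own projections.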


Positional Encoding: Restoring the Order Information Self-Attention Discards

Self-attention, as described so far, has a real limitation: it computes relationships between all pairs of positions without any inherent notion of order. Shuffling the words in a sentence would produce the exact same set of pairwise attention computations, just reordered. An RNN, by contrast, processes tokens strictly in sequence and so naturally encodes their order. Positional encoding fixes this by adding position-specific information directly to each token’s input representation before self-attention is applied.
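The order-blindness can be verified directly. In the sketch below (random stand-in inputs and weights, sizes chosen arbitrarily), reversing the token order simply reverses the rows of the output. Once a sinusoidal positional encoding is added to the inputs, that is no longer the case.

```python
import numpy as np

def softmax_rows(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding; d_model is assumed even in this sketch
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
perm = np.arange(seq_len)[::-1]  # reverse the token order

# Without positions: shuffled input gives the same outputs, just reordered
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)
order_blind = np.allclose(out_shuffled, out[perm])

# With positions: the same tokens in a different order now produce different outputs
pe = positional_encoding(seq_len, d_model)
out_pos = self_attention(X + pe, W_q, W_k, W_v)
out_pos_shuffled = self_attention(X[perm] + pe, W_q, W_k, W_v)
order_aware = not np.allclose(out_pos_shuffled, out_pos[perm])

print(order_blind, order_aware)
```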

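def positional_encoding(seq_len, d_model):
    position = np.arange(seq_len)[:, np.newaxis]  # shape (seq_len, 1)
    # One frequency per sin/cos pair: 1 / 10000^(2i / d_model)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # even dimensions
    # Slicing div_term avoids a shape mismatch when d_model is odd
    pe[:, 1::2] = np.cos(position * div_term[: d_model // 2])  # odd dimensions
    return pe

# Added directly to the token embeddings before they enter the transformer layers
# (token_embeddings is assumed to have shape (seq_len, d_model))
input_embeddings = token_embeddings + positional_encoding(seq_len, d_model)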

The sine/cosine formula gives each position its own pattern of values across the encoding dimensions, and it has a useful mathematical property: for any fixed offset k, the encoding for position p + k can be expressed as a linear function of the encoding for position p. The original authors hypothesized that this would make it easy for the model to learn to attend by relative position.
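The linear relationship follows from the angle-addition identities: each sin/cos pair at position p + k is a rotation of the pair at position p, and the rotation depends only on k. The sketch below checks this numerically. `offset_matrix` is a hypothetical helper written for this illustration, and d_model is assumed to be even.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

def offset_matrix(k, d_model):
    """Block-diagonal matrix M_k such that pe[p + k] == M_k @ pe[p] for every p."""
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(div_term):
        c, s = np.cos(w * k), np.sin(w * k)
        # sin(w(p+k)) =  cos(wk) * sin(wp) + sin(wk) * cos(wp)
        # cos(w(p+k)) = -sin(wk) * sin(wp) + cos(wk) * cos(wp)
        M[2 * i, 2 * i], M[2 * i, 2 * i + 1] = c, s
        M[2 * i + 1, 2 * i], M[2 * i + 1, 2 * i + 1] = -s, c
    return M

seq_len, d_model, k = 50, 16, 7
pe = positional_encoding(seq_len, d_model)
M = offset_matrix(k, d_model)  # depends on the offset k only, not on the position p

shifted = pe[:-k] @ M.T        # apply M to the encoding of every position p
matches = np.allclose(shifted, pe[k:])
print(matches)  # True
```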


The Complete Transformer Block

1. Input embeddings + positional encoding
2. Multi-head self-attention
3. Add & normalize (residual + layer norm)
4. Feedforward network (a small MLP, applied per position)
5. Add & normalize (residual + layer norm)
6. Output

The same block in PyTorch:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); query, key, and value are all x (self-attention)
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)  # residual connection + layer norm
        ff_output = self.feedforward(x)
        x = self.norm2(x + ff_output)  # another residual connection + layer norm
        return x

Notice the residual connections around both the attention and feedforward sublayers. They provide the gradient-flow benefits covered in Vanishing Gradient Problem and Popular CNN Architectures, applied here to make it possible to stack dozens of transformer blocks. Layer normalization is used instead of batch normalization because it doesn’t depend on batch composition, as covered in Batch Normalization.
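The batch-independence point can be illustrated with a small NumPy sketch. Both normalizations below are simplified: they have no learned scale or shift, and the batch norm uses training-mode batch statistics only. `add_and_norm` is a hypothetical helper mirroring the "Add & Normalize" step.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each example over its own features (last axis)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch (axis 0), using batch statistics
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer norm, as in the block above
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
example = rng.standard_normal((1, 8))
# The same example placed in two batches with very different companions
batch_a = np.concatenate([example, rng.standard_normal((3, 8))])
batch_b = np.concatenate([example, 10 * rng.standard_normal((3, 8))])

ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
print(ln_same, bn_same)  # True False

# np.tanh stands in for an attention or feedforward sublayer
out = add_and_norm(batch_a, np.tanh)
print(out.shape)  # (4, 8)
```

Layer norm gives the example the same output in both batches, while the batch norm output changes with whatever else happens to be in the batch.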


Why Transformers Replaced RNNs

Parallelization. Self-attention computes relationships between all pairs of positions simultaneously rather than one time step at a time, which maps far more efficiently onto GPU parallel computation than an RNN’s inherently sequential processing.

Direct long-range dependencies. Any two positions in a sequence, however far apart, are connected through exactly one attention computation rather than a chain of sequential steps that gradients must survive intact. This largely sidesteps the long-range dependency problem that motivated LSTM/GRU in the first place.
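Both points show up in the structure of the code itself. The sketch below contrasts a minimal tanh RNN (an illustrative stand-in, not a specific model from this series) with the attention score computation. All weights are random and the sizes are arbitrary.

```python
import numpy as np

def rnn_forward(X, W_x, W_h):
    """Minimal tanh RNN: step t cannot start until step t - 1 has finished."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:  # seq_len strictly sequential steps
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

def attention_scores(X, W_q, W_k):
    """All pairwise scores come from one matrix product, with no loop over time."""
    Q, K = X @ W_q, X @ W_k
    return Q @ K.T / np.sqrt(K.shape[-1])

rng = np.random.default_rng(0)
seq_len, d_model, d_h, d_k = 10, 8, 8, 4
X = rng.standard_normal((seq_len, d_model))
W_x = rng.standard_normal((d_model, d_h))
W_h = rng.standard_normal((d_h, d_h))
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))

states = rnn_forward(X, W_x, W_h)       # information from X[0] reaches the last state
                                        # only by passing through every step in between
scores = attention_scores(X, W_q, W_k)  # scores[-1, 0] links the last and first positions
                                        # directly

# Each score depends only on its own pair of positions, so all pairs are independent
direct = (X[-1] @ W_q) @ (X[0] @ W_k) / np.sqrt(d_k)
print(states.shape, scores.shape, np.isclose(scores[-1, 0], direct))
```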

Summary

| Component | Purpose |
| --- | --- |
| Self-attention | Every position attends to every other position within the same sequence |
| Multi-head attention | Multiple parallel attention computations, capturing different relationship types |
| Positional encoding | Restores order information that pure self-attention otherwise discards |
| Residual connections + layer norm | Enables stacking many transformer blocks deep, stably |

The transformer’s genuinely radical claim — that recurrence isn’t necessary, and attention alone suffices — has proven correct at a scale few predicted, and it’s the direct architectural foundation for the large language models covered next in Large Language Models.