Transformer Architecture

In 2017, a team at Google published a paper titled Attention Is All You Need. It introduced the Transformer, which within five years became the foundation of most major AI breakthroughs: language models, image generators, protein structure predictors, and speech systems. If you work with modern AI, understanding Transformers isn’t optional.

Why Not Just Use RNNs?

Before Transformers, sequence models (RNNs, LSTMs, GRUs) processed tokens one at a time, passing a hidden state forward. This created two problems:

Sequential bottleneck — You can’t parallelize training because step N depends on step N-1. Training is slow.
Long-range memory — Information from early in a sequence degrades as it propagates through many steps. The model “forgets” the beginning of a long document.

Transformers solve both problems by processing all tokens simultaneously and using attention to directly connect any two positions in a sequence, regardless of distance.
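
To make "directly connect any two positions" concrete, here is a minimal sketch of scaled dot-product attention for a single head in plain Python. The function names and toy inputs are illustrative, and the learned query/key/value projections of a real model are left out; treat it as a sketch of the idea rather than a production implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys, values: lists of vectors, one per token.
    Returns one output vector per query token.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, regardless of distance.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: 3 tokens of dimension 2 (arbitrary values), attending to themselves.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
```

Note that nothing in the loop over queries depends on the previous query's result, which is why all positions can be computed in parallel on real hardware.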

The High-Level Architecture

The original Transformer had two main components: an encoder and a decoder. Later models specialize into one or the other.

Input Sequence                     Output Sequence
      │                                  ↑
      ▼                                  │
┌───────────────┐             ┌──────────────────────┐
│   ENCODER     │             │      DECODER         │
│               │             │                      │
│  Embedding    │             │  Embedding           │
│  + Pos. Enc.  │             │  + Pos. Enc.         │
│               │             │                      │
│  Self-Attn    │             │  Masked Self-Attn    │
│  Feed-Forward │──Context──▶ │  Cross-Attention     │
│  (×N layers)  │             │  Feed-Forward        │
│               │             │  (×N layers)         │
└───────────────┘             └──────────────────────┘

Encoder-only (BERT): Great for understanding and classification. Reads the whole input at once.
Decoder-only (GPT, Claude, LLaMA): Great for text generation. Predicts the next token autoregressively, attending only to earlier positions.
Encoder-decoder (T5, BART, original MT models): Translation, summarization, structured generation.
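
In practice, the main mechanical difference between "reads the whole input at once" and "predicts the next token autoregressively" is the attention mask. The sketch below (function name and uniform toy scores are assumptions for illustration) applies a row-wise softmax with and without a causal mask.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over an n x n score matrix.

    If causal is True, position i may only attend to positions j <= i
    (decoder-style). Otherwise every position sees every other (encoder-style).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            masked = causal and j > i
            row.append(float("-inf") if masked else scores[i][j])
        m = max(row)  # the diagonal is never masked, so m is finite
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # uniform scores, for illustration only
encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)
```

With uniform scores, each row of encoder_style spreads its weight evenly over all three tokens, while row i of decoder_style spreads it only over tokens 0 through i.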

Inside a Transformer Block

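Every Transformer layer (block) has the same structure. BERT-base stacks 12 of them; GPT-3 (175B) stacks 96. The diagram shows the pre-norm arrangement used by most recent models; the original paper applied layer norm after each residual addition instead.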

Input (residual stream)
        │
        ▼
  ┌─────────────┐
  │ Layer Norm  │
  └─────────────┘
        │
        ▼
  ┌─────────────────┐
  │  Multi-Head     │
  │  Attention      │
  └─────────────────┘
        │
        ▼ (+ residual connection)
  ┌─────────────┐
  │ Layer Norm  │
  └─────────────┘
        │
        ▼
  ┌─────────────────┐
  │  Feed-Forward   │
  │  Network (FFN)  │
  └─────────────────┘
        │
        ▼ (+ residual connection)
     Output

Two key sub-layers, both wrapped in residual connections:

Multi-Head Attention — lets tokens “talk to” each other
Feed-Forward Network — processes each token independently (two linear layers with a GELU in between)
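
The wiring in the diagram can be sketched in a few lines. This is a simplified pre-norm block under stated assumptions: layer norm has no learned scale or shift, and the two sub-layers are passed in as functions. The stand-ins mean_mixer and zero_ffn are hypothetical placeholders, not real attention or FFN weights.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def transformer_block(tokens, attention_fn, ffn_fn):
    """Pre-norm block: x = x + Attn(LN(x)), then x = x + FFN(LN(x)).

    tokens: list of token vectors (the residual stream).
    attention_fn: maps a list of vectors to a list of vectors (mixes tokens).
    ffn_fn: maps one vector to one vector (applied to each token separately).
    """
    normed = [layer_norm(t) for t in tokens]
    attn_out = attention_fn(normed)
    tokens = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn_out)]

    normed = [layer_norm(t) for t in tokens]
    ffn_out = [ffn_fn(t) for t in normed]
    return [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, ffn_out)]

def mean_mixer(vectors):
    """Placeholder for attention: every token receives the average vector."""
    n = len(vectors)
    avg = [sum(col) / n for col in zip(*vectors)]
    return [avg[:] for _ in vectors]

def zero_ffn(vector):
    """Placeholder FFN that contributes nothing to the residual stream."""
    return [0.0 for _ in vector]

out = transformer_block([[1.0, 2.0], [3.0, 5.0]], mean_mixer, zero_ffn)
```

Because each sub-layer only adds to the residual stream, a sub-layer that outputs zeros leaves the input untouched. That property is a large part of why very deep stacks remain trainable.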

Positional Encoding

Transformers have no built-in sense of order — all tokens are processed in parallel. To tell the model where each token sits in the sequence, you add positional information to the token embeddings.

Absolute sinusoidal (original paper): Fixed sine/cosine patterns of different frequencies.
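
As a rough sketch, the sinusoidal scheme can be written directly from the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name is illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Positional encoding vector for one position.

    Even indices hold sines and odd indices hold cosines; each pair uses a
    lower frequency than the one before it.
    """
    pe = []
    for i in range(0, d_model, 2):  # i is the even index (2i in the paper's notation)
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# The encoding is added element-wise to the token embedding (toy values here).
embedding = [0.1, 0.2, 0.3, 0.4]
with_position = [e + p for e, p in zip(embedding, sinusoidal_encoding(3, 4))]
```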

Learned absolute (BERT, GPT-2): A learned embedding for each position index.

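Rotary Position Embedding, RoPE (LLaMA, Mistral, GPT-NeoX): Encodes position as a rotation applied to query/key vectors, so attention scores depend on the relative offset between tokens. It extends to longer sequences more gracefully than learned absolute embeddings, though usually with some frequency scaling or fine-tuning. Dominant in 2024–2026 models.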
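
A minimal sketch of the rotation, assuming an even vector length and the convention of rotating consecutive pairs (some implementations pair index i with i + d/2 instead). The vectors and positions are arbitrary toy values.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vector)
    out = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vector[i], vector[i + 1]
        out.append(x0 * c - x1 * s)
        out.append(x0 * s + x1 * c)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (3 positions apart) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error: after rotation, the query-key dot product depends only on how far apart the two tokens are, not on where they sit in the sequence.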

ALiBi (MPT, BLOOM): Adds a position-dependent bias to attention scores. Simple, effective, good extrapolation.
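
The bias itself is simple enough to sketch in a few lines. This assumes a power-of-two number of heads, for which the ALiBi paper uses the geometric slopes 2^(-8h/n), and it folds the causal mask into the same matrix for convenience (an illustrative choice, not a requirement).

```python
def alibi_slopes(num_heads):
    """Per-head slopes: head h (1-indexed) gets 2 ** (-8 * h / num_heads)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores before the softmax.

    Key j at distance (i - j) behind query i is penalized by slope * distance.
    Future positions (j > i) are masked out with -inf.
    """
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
bias = alibi_bias(3, slopes[0])   # last row: [-1.0, -0.5, 0.0]
```

No position embedding is added to the tokens at all; the only positional signal is that more distant keys start with a lower score.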

The Feed-Forward Network: More Than a Detail

The FFN is often overlooked, but it accounts for roughly two-thirds of a Transformer’s parameters. Modern analysis suggests the FFN layers act as a kind of “knowledge store” — encoding factual associations learned during pre-training.

FFN(x) = W₂ · GELU(W₁ · x + b₁) + b₂

The dimension expansion inside the FFN (usually 4× the model dimension) creates an intermediate representation rich enough to store and recall complex patterns.
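
Here is a small sketch of the formula above, plus the arithmetic behind the "two-thirds" figure. The parameter count considers only the weight matrices in one block (four d×d attention projections versus two d×4d FFN matrices) and ignores biases, layer norms, and embeddings. Helper names are illustrative.

```python
import math

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 · GELU(W1 · x + b1) + b2, applied to a single token vector."""
    hidden = [gelu(h + b) for h, b in zip(matvec(W1, x), b1)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

def ffn_param_share(d_model, expansion=4):
    """Fraction of one block's weight-matrix parameters that sit in the FFN."""
    attn_params = 4 * d_model * d_model              # W_Q, W_K, W_V, W_O
    ffn_params = 2 * expansion * d_model * d_model   # W1 and W2
    return ffn_params / (attn_params + ffn_params)

share = ffn_param_share(768)  # 8 / (4 + 8) = 2/3
```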

In recent architectures like LLaMA and Mistral, this has been upgraded to a Gated Linear Unit (SwiGLU):

FFN_SwiGLU(x) = W₂ · (SiLU(W₁ · x) ⊙ (W₃ · x))

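This uses three weight matrices instead of two, so the hidden dimension is usually shrunk (to about 8/3× the model dimension rather than 4×) to keep the parameter count the same. At that matched parameter count, SwiGLU tends to give better performance.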
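
A sketch of the gated variant, with W₁ producing the gate, W₃ the "up" projection, and W₂ applied last. Biases are omitted, as is common in these architectures. The helper matched_hidden_dim is a hypothetical name showing how the hidden size is typically reduced so three matrices cost the same as two; real models usually round the result to a hardware-friendly multiple.

```python
import math

def silu(x):
    """SiLU (swish): x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn_swiglu(x, W1, W3, W2):
    """Gate path SiLU(W1 · x) multiplies the up path W3 · x; W2 projects back down."""
    gate = [silu(g) for g in matvec(W1, x)]
    up = matvec(W3, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)

def matched_hidden_dim(d_model, expansion=4):
    """Hidden size h with 3 * d * h equal to the 2 * d * (expansion * d) of a plain FFN."""
    return round(2 * expansion * d_model / 3)

h = matched_hidden_dim(768)  # 2048, versus 3072 for a plain 4x FFN
```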

Model Variants You’ll Encounter

Model Family	Architecture	Speciality
BERT / RoBERTa	Encoder-only	Text classification, NER, retrieval
GPT / LLaMA / Mistral	Decoder-only	Text generation, chat, code
T5 / BART / mBART	Encoder-decoder	Translation, summarization
Vision Transformer (ViT)	Encoder-only	Image classification
CLIP	Dual encoder	Image-text alignment
Whisper	Encoder-decoder	Speech recognition

Modern Efficiency Improvements (2024–2026)

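The vanilla Transformer is expensive: self-attention cost grows quadratically with sequence length, and the key-value (KV) cache grows with every generated token. Nearly a decade of research has produced several important improvements: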

Grouped-Query Attention (GQA)

Groups of query heads share a single key-value head, instead of each query head having its own. With 4 to 8 query heads per group, this reduces KV cache memory by 4–8× without meaningful accuracy loss. Used in LLaMA 3, Mistral, and Gemma 2.
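
The savings follow from simple arithmetic. The sketch below uses a hypothetical configuration (32 layers, 32 query heads of dimension 128, 8192 tokens, 2 bytes per value for fp16/bf16); none of these numbers are taken from a specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 8192-token context.
mha = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=8192)  # one KV head per query head
gqa = kv_cache_bytes(32, num_kv_heads=8, head_dim=128, seq_len=8192)   # 4 query heads per KV head

print(mha / 2**30, "GiB vs", gqa / 2**30, "GiB")  # 4.0 GiB vs 1.0 GiB
```

The number of query heads never enters the cache formula, which is why sharing KV heads cuts memory without shrinking the attention computation itself.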

Sliding Window Attention

Instead of attending to all previous tokens, each token only attends to a fixed window. Long-range information flows through successive layers. Used in Mistral 7B.
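
A sketch of the mask this implies, together with a rough upper bound on how far back information can travel once layers are stacked. Both function names are illustrative.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    A key is visible if it is not in the future (j <= i) and lies within
    the most recent `window` positions (i - j < window).
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def max_reach(num_layers, window):
    """Farthest distance information can propagate: (window - 1) per layer."""
    return num_layers * (window - 1)

mask = sliding_window_mask(4, window=2)
# mask[3] == [False, False, True, True]: token 3 sees only tokens 2 and 3.
```

Each layer only looks back window - 1 positions, but the tokens it looks at have themselves already absorbed information from further back in the layer below, so the effective reach grows with depth.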

Mixture of Experts (MoE)

Replace each dense FFN with a set of “expert” FFNs, routing each token to 2–4 experts out of 8–64. Dramatically increases model capacity without proportionally increasing compute per token. Used in Mixtral 8×7B, GPT-4 (speculated), and Gemini 1.5.

MoE Layer:
         Token → Router → Expert 1
                        ↘ Expert 3 (out of 8 total)
                        Only 2 experts fire per token
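
The routing step can be sketched as follows. This assumes top-k selection followed by a softmax over only the selected experts' router logits, which is one common formulation; other routers normalize differently. The toy experts and logits are made up for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:k]
    m = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - m) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    out = [0.0] * len(x)
    for e, weight in top_k_route(router_logits, k):
        y = experts[e](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# 8 toy experts: expert e simply scales its input by e.
experts = [lambda x, s=e: [s * xi for xi in x] for e in range(8)]
logits = [0.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]  # router prefers experts 1 and 3
out = moe_layer([1.0], logits, experts)             # 0.5 * 1 + 0.5 * 3
```

All 8 experts contribute parameters, but each token pays the compute cost of only 2 of them.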

Multi-head Latent Attention (MLA)

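Introduced in DeepSeek-V2: keys and values are compressed into a shared low-rank latent vector, which is what gets cached, and expanded back per head at attention time. This significantly reduces the KV cache, and DeepSeek reports that it avoids GQA’s accuracy trade-off.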

How This Connects to LLMs

Nearly every LLM you use (GPT-4, Claude, Gemini, LLaMA) is, at its core, a stack of Transformer decoder blocks. When you send a prompt:

The text is tokenized into integer IDs
Each ID is looked up in an embedding table
Positional information is added to the embeddings (or, with RoPE, applied inside each attention layer)
The sequence flows through N Transformer blocks
The final hidden states are projected to vocabulary logits
A token is sampled from those logits
That token is appended and the process repeats
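
The steps above can be sketched as a short loop. Here model_fn is a hypothetical stand-in for everything between token IDs and logits (embedding lookup, the Transformer blocks, and the final projection), and toy_model is a made-up function that strongly prefers the next integer; neither reflects a real library API.

```python
import math
import random

def sample(logits, temperature=1.0, rng=random):
    """Sample a token id from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def generate(model_fn, prompt_ids, max_new_tokens, eos_id=None, rng=random):
    """Autoregressive loop: run the model, sample one token, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_fn(ids)       # logits for the next token only
        next_id = sample(logits, rng=rng)
        ids.append(next_id)
        if next_id == eos_id:        # stop early on an end-of-sequence token
            break
    return ids

def toy_model(ids, vocab_size=5):
    """Hypothetical model: puts almost all probability on (last id + 1) mod vocab_size."""
    logits = [0.0] * vocab_size
    logits[(ids[-1] + 1) % vocab_size] = 50.0
    return logits

tokens = generate(toy_model, [0], max_new_tokens=4, rng=random.Random(0))
```

Each new token requires another full pass through the model over a sequence that is one token longer, which is why the KV cache and context length dominate inference cost.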

Understanding this loop is essential for understanding why LLMs behave the way they do — why they’re fluent, why they hallucinate, and why context length matters so much.