Sequence-to-Sequence Models Explained: Encoder-Decoder Architecture
Translating a sentence, summarizing a document, or generating a response to a question all share a structural property that plain classification doesn’t: the input is a sequence, and the output is also a sequence — often of a different length entirely. Sequence-to-sequence (seq2seq) models were built specifically to handle this input-sequence-to-output-sequence structure, introducing the encoder-decoder pattern that remains foundational even in today’s transformer-based systems.
The Encoder-Decoder Structure
A seq2seq model consists of two components: an encoder that reads the entire input sequence and compresses it into a fixed-size representation, and a decoder that generates the output sequence one element at a time, conditioned on that representation.
```
Input sequence: "How are you"
        │
        ▼
   [ Encoder ]     (an RNN/LSTM, reading the input)
        │
        ▼
  context vector   (fixed-size summary)
        │
        ▼
   [ Decoder ]     (generates output one word at a time)
        │
        ▼
Output sequence: "Comment allez-vous"
```

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(embedded)
        return hidden, cell  # the final hidden/cell state becomes the "context"

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output)
        return prediction, hidden, cell
```

The encoder here is built on LSTM (covered in LSTM and GRU), specifically for its improved ability to retain information across a full input sequence before compressing it into the context vector the decoder relies on.
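To make the hand-off between the two components concrete, the sketch below runs a single decoder step starting from an encoder's final state. It is a shape check on untrained layers, not a working translator: the vocabulary sizes, dimensions, and the token id used for the start symbol are illustrative assumptions, and it uses bare `nn.Embedding` and `nn.LSTM` layers so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not values from the article)
src_vocab, tgt_vocab, embed_size, hidden_size = 50, 60, 16, 32

# Encoder side: embed the source tokens and run them through an LSTM
src_embedding = nn.Embedding(src_vocab, embed_size)
encoder_lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

# Decoder side: its own embedding, an LSTM, and a projection to the target vocabulary
tgt_embedding = nn.Embedding(tgt_vocab, embed_size)
decoder_lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, tgt_vocab)

src = torch.randint(0, src_vocab, (1, 3))   # one source sentence of 3 token ids
_, (hidden, cell) = encoder_lstm(src_embedding(src))

start = torch.tensor([[1]])                 # hypothetical <start> token id
output, (hidden, cell) = decoder_lstm(tgt_embedding(start), (hidden, cell))
logits = fc(output)                         # one score per target-vocabulary word

print(hidden.shape)  # torch.Size([1, 1, 32])
print(logits.shape)  # torch.Size([1, 1, 60])
```

The decoder's LSTM is initialized with the encoder's `(hidden, cell)` pair, so that pair is the only route by which information about the source sentence reaches the decoder in this design.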
Generating the Output One Token at a Time
The decoder generates its output sequence autoregressively — each generated word becomes part of the input for generating the next one.
```python
import torch

def generate_sequence(decoder, hidden, cell, start_token, max_length, vocab):
    generated = [start_token]
    current_input = torch.tensor([[start_token]])  # shape (batch=1, seq_len=1)

    for _ in range(max_length):
        prediction, hidden, cell = decoder(current_input, hidden, cell)
        next_token = prediction.argmax(dim=-1).item()  # greedy: take the most likely token
        generated.append(next_token)

        if next_token == vocab["<end>"]:
            break
        current_input = torch.tensor([[next_token]])  # feed the prediction back in

    return generated
```

This autoregressive generation pattern (predict one token, feed it back in, predict the next) is the same fundamental generation mechanism used by modern large language models, covered in Large Language Models, even though the underlying architecture has since shifted from LSTM-based encoder-decoders to transformers.
The Fundamental Bottleneck: Compressing Everything Into One Vector
The original seq2seq design has a well-documented limitation: the encoder has to compress the entire input sequence, regardless of length, into a single fixed-size context vector. For short sentences, this works reasonably well. For long sentences or documents, information from early in the sequence tends to get diluted or lost by the time the encoder finishes processing everything, which is the same long-range dependency problem covered in Recurrent Neural Networks.
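A quick way to see the bottleneck is to compare tensor shapes for a short and a long input. The sketch below uses an untrained LSTM with illustrative sizes (the vocabulary size, dimensions, and the helper name `context_for` are assumptions made for this example), so it demonstrates only the size constraint, not the resulting loss of quality.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_size, hidden_size = 1000, 32, 64  # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

def context_for(length):
    """Encode a random sequence of `length` token ids; return all outputs and the final hidden state."""
    tokens = torch.randint(0, vocab_size, (1, length))
    with torch.no_grad():
        outputs, (hidden, cell) = lstm(embedding(tokens))
    return outputs, hidden

short_outputs, short_context = context_for(3)
long_outputs, long_context = context_for(500)

print(short_outputs.shape, short_context.shape)  # [1, 3, 64]   and [1, 1, 64]
print(long_outputs.shape, long_context.shape)    # [1, 500, 64] and [1, 1, 64]
```

The per-position `outputs` grow with the input, but the final hidden state is 64 numbers in both cases. A plain encoder-decoder passes only that final state (plus the matching cell state) to the decoder and discards the per-position outputs.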
```
Short input: "How are you" -- compresses reasonably well into one vector

Long input: a 500-word paragraph -- forcing all of this
information through one fixed-size vector loses substantial detail,
especially information from earlier in the paragraph
```

The Fix: Attention
This bottleneck, losing information by forcing an entire variable-length sequence through one fixed-size vector, is what motivated the attention mechanism, covered in full in Attention Mechanism. Instead of relying solely on one compressed context vector, attention lets the decoder look back at all of the encoder’s intermediate states when generating each output word, dynamically focusing on whichever parts of the input are most relevant at that generation step.
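The "look back and weight" step can be sketched in a few lines. The example below uses dot-product scoring, which is one common choice and an assumption here, since the text does not commit to a particular scoring function. The tensors are random stand-ins for real encoder and decoder states.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size, src_len = 64, 5                              # illustrative sizes
encoder_outputs = torch.randn(1, src_len, hidden_size)    # one state per input position
decoder_state = torch.randn(1, hidden_size)               # decoder state at the current step

# Score each encoder position against the decoder state (dot product)
scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)  # (1, src_len)

# Normalize the scores into weights that sum to 1 across input positions
weights = F.softmax(scores, dim=1)                                          # (1, src_len)

# Weighted sum of encoder states: a context vector rebuilt for this step
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)       # (1, hidden_size)

print(weights.shape, context.shape)
```

Because `weights` is recomputed from the current decoder state at every step, each output word gets its own context vector instead of sharing one fixed summary.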
```
Without attention: decoder relies only on one final compressed context vector

With attention: decoder can "look back" at every encoder position,
weighting them differently depending on what it's currently generating
```

Why Seq2Seq’s Structure Still Matters, Even Post-Transformers
Transformers, covered next in Transformers, replaced the LSTM-based encoder and decoder with self-attention-based ones, but the overall encoder-decoder pattern — read the entire input, then generate the output conditioned on it — persists directly in modern translation and summarization systems built on transformer architectures. Understanding seq2seq’s original design, and precisely why it needed attention to overcome its bottleneck, is what makes the transformer’s design choices feel like a natural, motivated evolution rather than an unrelated architecture appearing from nowhere.
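As a rough illustration of how the same read-then-generate interface survives, the sketch below passes a source sequence and a target prefix through PyTorch's built-in `nn.Transformer`. The sizes are illustrative assumptions, the inputs are random tensors standing in for embedded tokens, and the model is untrained, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 32                                   # illustrative size
model = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=64, batch_first=True,
)

src = torch.randn(1, 7, d_model)   # stand-in for an embedded source sequence (7 positions)
tgt = torch.randn(1, 4, d_model)   # stand-in for an embedded target prefix (4 positions)

# Causal mask: each target position may attend only to itself and earlier target positions
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([1, 4, 32])
```

The call still takes a full source sequence and a partial target and returns one vector per target position, which is the encoder-decoder contract described above with the recurrent layers swapped out.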
Teacher Forcing: A Practical Training Technique
A subtlety worth knowing about how seq2seq models are actually trained: during training, rather than feeding the decoder’s own (possibly wrong, especially early in training) predictions back in as the next input, a technique called “teacher forcing” instead feeds the true target sequence’s previous token, regardless of what the model actually predicted.
```python
# Teacher forcing: use the true previous token as input, not the model's own prediction
# Start at t = 1 so true_target[t - 1] is always a real previous token
# (true_target[0] is assumed to be the start token)
loss = 0
for t in range(1, target_length):
    prediction, hidden, cell = decoder(true_target[t - 1], hidden, cell)
    loss += compute_loss(prediction, true_target[t])
```

This substantially stabilizes and speeds up training, since the decoder isn’t compounding its own early mistakes across an entire sequence. It does, however, create a mismatch between training (always given the correct previous token) and actual inference (must rely on its own, potentially imperfect, previous predictions), which is why techniques that gradually reduce teacher forcing over training are sometimes used to narrow this train/inference gap.
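One way to reduce teacher forcing gradually is to choose, at each step, between the true token and the model's own prediction with a probability that shrinks over training (often called scheduled sampling). The sketch below is a minimal version under stated assumptions: the sizes, the linear decay schedule, the zero initial state standing in for an encoder's output, and the helper name `decode_loss` are all hypothetical, and no optimizer step is taken.

```python
import random
import torch
import torch.nn as nn

torch.manual_seed(0)
random.seed(0)

vocab_size, embed_size, hidden_size = 20, 8, 16   # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()

def decode_loss(target, hidden, cell, teacher_forcing_ratio):
    """Average loss over one target sequence, mixing true and predicted inputs."""
    loss = 0.0
    current_input = target[:, 0:1]                # first token is assumed to be <start>
    steps = target.size(1) - 1
    for t in range(1, target.size(1)):
        output, (hidden, cell) = lstm(embedding(current_input), (hidden, cell))
        logits = fc(output.squeeze(1))            # (batch, vocab_size)
        loss = loss + criterion(logits, target[:, t])
        if random.random() < teacher_forcing_ratio:
            current_input = target[:, t:t + 1]    # feed the true token
        else:
            # feed the model's own guess instead
            current_input = logits.argmax(dim=-1, keepdim=True).detach()
    return loss / steps

target = torch.randint(0, vocab_size, (1, 6))
hidden = torch.zeros(1, 1, hidden_size)           # stand-in for the encoder's final state
cell = torch.zeros(1, 1, hidden_size)

# Hypothetical linear schedule: full teacher forcing at first, less each epoch
for epoch in range(3):
    ratio = max(0.0, 1.0 - 0.5 * epoch)
    loss = decode_loss(target, hidden, cell, ratio)
    print(epoch, ratio, round(loss.item(), 3))
```

With a ratio of 1.0 this reduces to plain teacher forcing, and with 0.0 the decoder sees only its own predictions, as it would at inference time.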
Summary
| Component | Role |
|---|---|
| Encoder | Reads the full input sequence, produces a representation |
| Decoder | Generates the output sequence, one token at a time, conditioned on the encoder’s output |
| Context vector | The original (limited) way the decoder accessed encoder information |
| Attention | The direct fix for the fixed-size context vector’s information bottleneck |
Sequence-to-sequence models introduced the encoder-decoder pattern that remains foundational to translation, summarization, and generation tasks today — and their specific, well-understood bottleneck is exactly what makes attention’s contribution, covered next, genuinely necessary rather than an arbitrary added complexity.