Recurrent Neural Networks (RNNs) Explained: Handling Sequential Data
Neither an MLP nor a CNN carries state from one sequence element to the next. An MLP expects a fixed-size input, and a CNN sees order only within its local receptive field, so without special handling neither can track that “the” came before “cat” across a sentence of arbitrary length. Recurrent Neural Networks were built specifically to process sequences: they read one element at a time and maintain a “memory” of what came before.
The Core Idea: A Hidden State Carried Across Time Steps
An RNN processes a sequence one element at a time, maintaining a hidden state that gets updated at each step and carries forward information from everything seen so far.
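
    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        # New hidden state: combine the current input with the previous hidden state
        h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
        return h_t

    # Example setup so the loop below runs (sizes and random values are illustrative)
    rng = np.random.default_rng(0)
    input_size, hidden_size = 8, 64
    W_xh = rng.standard_normal((input_size, hidden_size)) * 0.1
    W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
    b_h = np.zeros(hidden_size)
    sequence = rng.standard_normal((5, input_size))

    # Processing a sequence of 5 time steps
    h = np.zeros(hidden_size)  # initial hidden state, all zeros
    for t in range(5):
        x_t = sequence[t]  # the input at this time step
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        # h now encodes info from steps 0 through t

The critical detail: the same weight matrices (W_xh, W_hh) are reused at every single time step. This weight sharing across time, analogous to the spatial weight sharing in convolution covered in Convolutional Neural Networks, is what lets an RNN process sequences of arbitrary length with a fixed number of parameters.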
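To make the weight-sharing point concrete, the sketch below runs one fixed set of parameters over sequences of two different lengths. The helper `run_rnn`, the sizes, and the random inputs are assumptions chosen for illustration, not part of the original example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def run_rnn(sequence, W_xh, W_hh, b_h):
    """Hypothetical helper: apply rnn_step over a whole sequence, return the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 64  # illustrative sizes
W_xh = rng.standard_normal((input_size, hidden_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

# The parameter count depends only on the layer sizes, never on sequence length
n_params = W_xh.size + W_hh.size + b_h.size

short_seq = rng.standard_normal((3, input_size))   # 3 time steps
long_seq = rng.standard_normal((50, input_size))   # 50 time steps
h_short = run_rnn(short_seq, W_xh, W_hh, b_h)
h_long = run_rnn(long_seq, W_xh, W_hh, b_h)
print(n_params, h_short.shape, h_long.shape)
```

Both calls return a hidden state of the same size, and no parameter was added to handle the longer sequence.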
Unrolling an RNN Through Time
Visualizing an RNN “unrolled” across time steps makes its structure clearer — it’s the same cell, applied repeatedly, with the hidden state flowing forward between applications.

    h_init → [RNN cell] → h_0 → [RNN cell] → h_1 → [RNN cell] → h_2 → ...
                 ↑                  ↑                  ↑
                x_0                x_1                x_2

Here h_init is the initial hidden state (all zeros in the code above). Each box applies the exact same weights. Training an RNN by backpropagating through this unrolled structure is called “backpropagation through time”: a direct extension of the standard backpropagation covered in Backpropagation, applied across the time dimension instead of (or in addition to) network depth.
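As a rough illustration of backpropagation through time, the sketch below runs the forward pass, then walks the time steps in reverse and accumulates gradients into the shared matrices. The toy loss (the sum of the final hidden state), the function names, and the sizes are assumptions made for this example. A finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def forward(xs, W_xh, W_hh, b_h):
    """Run the RNN and keep every hidden state for the backward pass."""
    hs = [np.zeros(W_hh.shape[0])]  # hs[0] is the initial hidden state
    for x_t in xs:
        hs.append(np.tanh(x_t @ W_xh + hs[-1] @ W_hh + b_h))
    return hs

def bptt(xs, hs, W_xh, W_hh):
    """Gradients of loss = sum(final hidden state) w.r.t. the shared weights."""
    dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
    db_h = np.zeros(W_hh.shape[0])
    dh = np.ones(W_hh.shape[0])  # d(loss)/d(h_T) for the toy loss
    for t in reversed(range(len(xs))):
        dz = dh * (1.0 - hs[t + 1] ** 2)  # back through tanh at step t
        dW_xh += np.outer(xs[t], dz)      # shared weights collect one term per step
        dW_hh += np.outer(hs[t], dz)
        db_h += dz
        dh = dz @ W_hh.T                  # hand the gradient to the previous step
    return dW_xh, dW_hh, db_h

rng = np.random.default_rng(0)
xs = rng.standard_normal((6, 3))          # 6 time steps, 3 input features
W_xh = rng.standard_normal((3, 4)) * 0.5
W_hh = rng.standard_normal((4, 4)) * 0.5
b_h = np.zeros(4)

hs = forward(xs, W_xh, W_hh, b_h)
dW_xh, dW_hh, db_h = bptt(xs, hs, W_xh, W_hh)

# Finite-difference check on a single entry of W_hh
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (forward(xs, W_xh, W_plus, b_h)[-1].sum()
           - forward(xs, W_xh, W_minus, b_h)[-1].sum()) / (2 * eps)
grad_error = abs(numeric - dW_hh[1, 2])
print(grad_error)
```

Because the weights are shared, each matrix receives one gradient contribution per time step, which is the main difference from backpropagation in a feedforward network.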
A Complete RNN in PyTorch

    import torch.nn as nn

    class SimpleRNN(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super().__init__()
            self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, output_size)

        def forward(self, x):
            # output: hidden states at all time steps; hidden: final hidden state
            output, hidden = self.rnn(x)
            final_output = self.fc(hidden[-1])
            return final_output

This setup suits tasks like sentiment classification (read an entire sentence, produce one output at the end). Taking the output at every time step instead of just the last suits tasks like part-of-speech tagging (one output per input word).
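The per-time-step variant mentioned above might look like the following sketch. The class name `SimpleTagger`, the `num_tags` parameter, and the tensor sizes are hypothetical choices for illustration. The only structural change from the model above is that the linear layer is applied to `output` (all time steps) rather than to the final hidden state.

```python
import torch
import torch.nn as nn

class SimpleTagger(nn.Module):
    """Hypothetical many-to-many variant: one prediction per time step."""

    def __init__(self, input_size, hidden_size, num_tags):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_tags)

    def forward(self, x):
        output, _ = self.rnn(x)   # output: (batch, seq_len, hidden_size)
        return self.fc(output)    # scores: (batch, seq_len, num_tags)

model = SimpleTagger(input_size=10, hidden_size=32, num_tags=5)
x = torch.randn(4, 7, 10)         # batch of 4 sequences, 7 steps, 10 features each
tag_scores = model(x)
print(tag_scores.shape)           # torch.Size([4, 7, 5])
```

`nn.Linear` acts on the last dimension, so the same classifier is applied independently at every position.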
Why RNNs Struggle With Long Sequences
Because the same weight matrices are applied repeatedly across every time step, backpropagation through time involves multiplying the same gradient-related terms together many times — for a sequence of length 100, that’s roughly 100 repeated multiplications, making RNNs an especially acute, well-documented case of both the Vanishing Gradient Problem and Exploding Gradient Problem.

    Information from time step 1 has to survive being multiplied
    by the same weight matrix roughly 99 more times before it can
    influence the output at time step 100. In practice, this
    information typically fades to near-nothing well before that point.

This means a plain RNN struggles to learn dependencies between elements that are far apart in a sequence. Remembering a sentence’s subject, introduced in the first word, by the time the network reaches the twentieth word is genuinely difficult for this architecture.
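The fading effect can be observed numerically. The sketch below multiplies together the step-to-step Jacobians of a tanh RNN and tracks the norm of the running product, which bounds how strongly an early hidden state can influence a later one. The function name `jacobian_norms`, the choice of a scaled orthogonal recurrent matrix (so every singular value equals `scale`), and the omission of inputs are simplifying assumptions for this illustration.

```python
import numpy as np

def jacobian_norms(scale, steps=50, hidden_size=16, seed=0):
    """Spectral norm of d h_t / d h_0 after each of `steps` time steps."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix times `scale`: all singular values equal `scale`
    q, _ = np.linalg.qr(rng.standard_normal((hidden_size, hidden_size)))
    W_hh = scale * q
    h = rng.standard_normal(hidden_size) * 0.1
    total = np.eye(hidden_size)   # running product of Jacobians
    norms = []
    for _ in range(steps):
        h = np.tanh(h @ W_hh)                  # inputs omitted for simplicity
        J = np.diag(1.0 - h ** 2) @ W_hh.T     # Jacobian of this step
        total = J @ total
        norms.append(np.linalg.norm(total, 2))
    return norms

norms = jacobian_norms(scale=0.5)
print(norms[0], norms[9], norms[-1])
```

With `scale=0.5`, each step shrinks the product by at least half, so the norm after 50 steps is bounded by 0.5 to the power 50. With sufficiently large recurrent weights the same product can grow instead of shrink, which is the exploding-gradient side of the problem.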
The Direct Motivation for LSTM and GRU
This specific, well-understood limitation — plain RNNs losing long-range information due to repeated multiplicative gradient decay — is exactly what motivated the LSTM and GRU architectures, covered in full in LSTM and GRU. Both introduce gating mechanisms specifically designed to let important information persist across many time steps without being multiplicatively degraded the way a plain RNN’s hidden state is.
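To give a rough feel for what a gate does, the sketch below adds a single update gate to the plain RNN step. It is a deliberately simplified illustration, not the full LSTM or GRU formulation. The names `gated_step`, `W_xz`, `W_hz`, and `b_z` are assumptions, and the gate bias is set by hand to a large negative value to show the "keep the old state" behavior rather than learned.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def plain_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def gated_step(x_t, h_prev, W_xh, W_hh, b_h, W_xz, W_hz, b_z):
    """Simplified gated update: z near 0 copies h_prev forward almost unchanged."""
    candidate = plain_step(x_t, h_prev, W_xh, W_hh, b_h)
    z = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)   # update gate, values in (0, 1)
    return (1.0 - z) * h_prev + z * candidate

rng = np.random.default_rng(0)
n_in, n_h = 8, 16                                    # illustrative sizes
W_xh = rng.standard_normal((n_in, n_h)) * 0.5
W_hh = rng.standard_normal((n_h, n_h)) * 0.5
b_h = np.zeros(n_h)
W_xz = rng.standard_normal((n_in, n_h)) * 0.01
W_hz = rng.standard_normal((n_h, n_h)) * 0.01
b_z = np.full(n_h, -10.0)                            # hand-set: gate almost closed

h0 = rng.uniform(-0.9, 0.9, n_h)                     # "information" to remember
h_plain, h_gated = h0.copy(), h0.copy()
for _ in range(100):
    x_t = rng.standard_normal(n_in)
    h_plain = plain_step(x_t, h_plain, W_xh, W_hh, b_h)
    h_gated = gated_step(x_t, h_gated, W_xh, W_hh, b_h, W_xz, W_hz, b_z)

plain_drift = np.max(np.abs(h_plain - h0))
gated_drift = np.max(np.abs(h_gated - h0))
print(plain_drift, gated_drift)
```

The gated state is carried forward by an additive blend rather than being squashed through the weight matrix at every step, so it stays close to its starting value after 100 steps while the plain state does not.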
Where Plain RNNs Are Still Used Today
Plain (vanilla) RNNs are rarely the first choice for new projects today — LSTM, GRU, and increasingly transformer-based architectures (covered in Transformers) have largely superseded them for tasks involving longer sequences. Plain RNNs remain reasonable for:
- Very short sequences, where the long-range dependency problem simply doesn’t have room to manifest.
- Extremely resource-constrained settings, where their computational simplicity relative to LSTM/GRU offers a genuine practical advantage.
- Educational purposes — understanding the plain RNN thoroughly, as covered here, is what makes LSTM and GRU’s specific design choices (covered next) make complete sense as targeted fixes rather than arbitrary added complexity.
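One way to put a number on the "computational simplicity" point is to compare parameter counts for single-layer PyTorch modules of the same size. The helper `count_params` and the sizes below are assumptions for illustration. Parameter count is only a rough proxy for compute and memory cost.

```python
import torch.nn as nn

def count_params(module):
    """Hypothetical helper: total number of trainable values in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 32, 64   # illustrative sizes
counts = {
    "RNN": count_params(nn.RNN(input_size, hidden_size)),
    "GRU": count_params(nn.GRU(input_size, hidden_size)),
    "LSTM": count_params(nn.LSTM(input_size, hidden_size)),
}
print(counts)
```

At equal sizes, the GRU has three times and the LSTM four times as many parameters as the plain RNN, because each gate brings its own set of input and recurrent weights.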
Many-to-One, One-to-Many, and Many-to-Many: RNN Input/Output Patterns
RNNs are flexible in how inputs and outputs are structured, not limited to a single fixed pattern:
- Many-to-one (sentiment classification): read an entire sequence, produce one output. This is the pattern shown earlier in this guide.
- One-to-many (image captioning): take one input and generate a sequence of words describing it.
- Many-to-many (translation, part-of-speech tagging): produce an output at every input position, or a full output sequence potentially of a different length. The latter connects directly to the encoder-decoder pattern covered in Sequence-to-Sequence Models.
Recognizing which of these patterns a given task actually needs is one of the first architectural decisions to make before writing any code. It directly determines where in the network outputs should be taken from, and how the training loss should be computed.
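As a sketch of how the chosen pattern changes the loss computation, the example below contrasts a many-to-one loss with a many-to-many loss in PyTorch. Random tensors stand in for model outputs and labels, and all sizes and variable names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, num_classes = 4, 7, 5   # illustrative sizes

# Many-to-one: one prediction per sequence, so one label per sequence
final_logits = torch.randn(batch, num_classes)
labels = torch.randint(0, num_classes, (batch,))
loss_many_to_one = F.cross_entropy(final_logits, labels)

# Many-to-many: one prediction per time step, so one label per time step.
# Flatten batch and time so every position counts as its own example.
step_logits = torch.randn(batch, seq_len, num_classes)
step_labels = torch.randint(0, num_classes, (batch, seq_len))
loss_many_to_many = F.cross_entropy(
    step_logits.reshape(-1, num_classes), step_labels.reshape(-1)
)

# Same quantity, computed as the mean of the per-position losses
per_position = F.cross_entropy(
    step_logits.reshape(-1, num_classes), step_labels.reshape(-1), reduction="none"
).reshape(batch, seq_len)
print(loss_many_to_one.item(), loss_many_to_many.item())
```

In the many-to-many case the loss averages over every position in every sequence. Real data with padded sequences of unequal length would usually also need the padded positions excluded from that average.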
Summary
| Concept | Detail |
|---|---|
| Hidden state | Carries information forward across time steps |
| Weight sharing across time | Same weights reused at every step, enabling variable-length sequences |
| Backpropagation through time | Standard backprop applied across the unrolled time dimension |
| Key limitation | Long sequences cause vanishing/exploding gradients, losing long-range dependencies |
RNNs introduced the essential idea of maintaining state across a sequence — the specific mechanism for how that state is maintained is exactly what LSTM and GRU improve on next, directly targeting the long-range dependency problem described here.