Recurrent Neural Networks (RNNs) Explained: Handling Sequential Data

Neither an MLP nor a CNN has any built-in notion of order — feed a sentence’s words into either architecture without special handling, and it has no way to know “the” came before “cat,” or that word order even matters. Recurrent Neural Networks were built specifically to process sequences, maintaining a “memory” of what came before as they read through a sequence one element at a time.

The Core Idea: A Hidden State Carried Across Time Steps

An RNN processes a sequence one element at a time, maintaining a hidden state that gets updated at each step and carries forward information from everything seen so far.

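import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # One recurrence step: mix the current input with the previous hidden state
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
    return h_t

# Illustrative setup: sizes and random values are placeholders, not a trained model
rng = np.random.default_rng(0)
input_size, hidden_size = 8, 64
sequence = rng.normal(size=(5, input_size))                     # 5 time steps of input vectors
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))    # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

# Processing a sequence of 5 time steps
h = np.zeros(hidden_size)   # initial hidden state, all zeros
for t in range(5):
    x_t = sequence[t]                          # the input at this time step
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)       # h now encodes info from steps 0 through t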

The critical detail: the same weight matrices (W_xh, W_hh) are reused at every single time step — this weight sharing across time, analogous to the spatial weight sharing in convolution covered in Convolutional Neural Networks, is what lets an RNN process sequences of arbitrary length with a fixed number of parameters.
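A minimal sketch of that point, assuming NumPy and the same rnn_step form shown above. The sizes, the random weights, and the helper name run_rnn are illustrative assumptions. The same three arrays process a 3-step and a 50-step sequence, and the parameter count does not depend on sequence length.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def run_rnn(sequence, W_xh, W_hh, b_h):
    """Hypothetical helper: apply the same weights at every step, return the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16   # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

# Parameter count is fixed by the layer sizes alone: 8*16 + 16*16 + 16 = 400
n_params = W_xh.size + W_hh.size + b_h.size

short_seq = rng.normal(size=(3, input_size))    # 3 time steps
long_seq = rng.normal(size=(50, input_size))    # 50 time steps

h_short = run_rnn(short_seq, W_xh, W_hh, b_h)
h_long = run_rnn(long_seq, W_xh, W_hh, b_h)

print(n_params, h_short.shape, h_long.shape)    # 400 (16,) (16,)
```

Both calls return a hidden state of the same shape, which is what allows a fixed-size output layer to sit on top of sequences of any length.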

Unrolling an RNN Through Time

Visualizing an RNN “unrolled” across time steps makes its structure clearer — it’s the same cell, applied repeatedly, with the hidden state flowing forward between applications.

h_init → [RNN cell] → h_0 → [RNN cell] → h_1 → [RNN cell] → h_2 → ...
            ↑                    ↑                   ↑
           x_0                  x_1                 x_2

Each box applies the exact same weights. Training an RNN by backpropagating through this unrolled structure is called “backpropagation through time”: it is a direct extension of the standard backpropagation covered in Backpropagation, applied across the time dimension instead of (or in addition to) network depth, with the gradient contributions from every time step summed into the shared weights.
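To make backpropagation through time concrete, here is a rough NumPy sketch that walks the unrolled steps in reverse and accumulates gradients into the shared matrices. The toy loss (half the squared norm of the final hidden state), the sizes, and the function names are assumptions chosen for illustration. A finite-difference check on one weight confirms the hand-written gradient.

```python
import numpy as np

def forward(xs, W_xh, W_hh, b_h):
    """Run the RNN and keep every hidden state; hs[0] is the all-zero initial state."""
    hs = [np.zeros(W_hh.shape[0])]
    for x_t in xs:
        hs.append(np.tanh(x_t @ W_xh + hs[-1] @ W_hh + b_h))
    return hs

def loss_fn(xs, W_xh, W_hh, b_h):
    # Toy loss for illustration only: half the squared norm of the final hidden state
    h_last = forward(xs, W_xh, W_hh, b_h)[-1]
    return 0.5 * np.sum(h_last ** 2)

def bptt(xs, W_xh, W_hh, b_h):
    hs = forward(xs, W_xh, W_hh, b_h)
    dW_xh, dW_hh, db_h = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(b_h)
    dh = hs[-1].copy()                       # dL/dh at the final step for the toy loss
    for t in reversed(range(len(xs))):
        da = dh * (1.0 - hs[t + 1] ** 2)     # back through tanh at step t
        dW_xh += np.outer(xs[t], da)         # every step adds into the SAME shared matrices
        dW_hh += np.outer(hs[t], da)
        db_h += da
        dh = da @ W_hh.T                     # pass the gradient one step back in time
    return dW_xh, dW_hh, db_h

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))                 # 6 time steps, 3 features each
W_xh = rng.normal(scale=0.5, size=(3, 4))
W_hh = rng.normal(scale=0.5, size=(4, 4))
b_h = np.zeros(4)

_, dW_hh, _ = bptt(xs, W_xh, W_hh, b_h)

# Central finite-difference check on a single entry of W_hh
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss_fn(xs, W_xh, W_plus, b_h) - loss_fn(xs, W_xh, W_minus, b_h)) / (2 * eps)
grad_error = abs(numeric - dW_hh[1, 2])
print(grad_error)
```

In practice a framework's autograd does this bookkeeping automatically; the manual loop is only meant to show where the "through time" part comes from.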

A Complete RNN in PyTorch

import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
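        # output: hidden state at every time step, shape (batch, seq_len, hidden_size)
        # hidden: final hidden state per layer, shape (num_layers, batch, hidden_size)
        output, hidden = self.rnn(x)
        final_output = self.fc(hidden[-1])   # last layer's final state -> one output per sequence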
        return final_output

This is used for tasks like sentiment classification (read an entire sentence, produce one output at the end) or, with output taken at every time step instead of just the last, tasks like part-of-speech tagging (one output per input word).
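The per-time-step variant mentioned above might look like the following sketch, assuming PyTorch. The class name TaggerRNN, the sizes, and the random tags are hypothetical. The only real change from the many-to-one model is that the linear layer is applied to the full output tensor instead of the final hidden state.

```python
import torch
import torch.nn as nn

class TaggerRNN(nn.Module):
    """Hypothetical many-to-many variant: one prediction per time step."""
    def __init__(self, input_size, hidden_size, num_tags):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_tags)

    def forward(self, x):
        output, _ = self.rnn(x)      # output: (batch, seq_len, hidden_size)
        return self.fc(output)       # logits: (batch, seq_len, num_tags)

model = TaggerRNN(input_size=10, hidden_size=32, num_tags=5)
x = torch.randn(4, 7, 10)            # 4 sequences, 7 steps each, 10 features per step
logits = model(x)
print(logits.shape)                  # torch.Size([4, 7, 5])

# The loss is computed over every position, so batch and time are flattened together
tags = torch.randint(0, 5, (4, 7))   # random placeholder labels, one per time step
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 5), tags.reshape(-1))
```

Flattening before the loss is one common way to handle per-step labels; padded batches would additionally need the padding positions masked out.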

Why RNNs Struggle With Long Sequences

Because the same weight matrices are applied at every time step, backpropagation through time multiplies the gradient by the same recurrent weight matrix (together with the tanh derivative) once per step. For a sequence of length 100, that is roughly 100 repeated multiplications: if those factors are mostly smaller than 1 the gradient shrinks toward zero, and if they are larger than 1 it blows up. This makes RNNs an especially acute, well-documented case of both the Vanishing Gradient Problem and the Exploding Gradient Problem.

Information from time step 1 has to survive being multiplied
by the same weight matrix roughly 99 more times before it can
influence the output at time step 100. In practice, this
information typically fades to near-nothing well before that point.

This means a plain RNN struggles to learn dependencies between elements that are far apart in a sequence. Remembering a subject introduced in a sentence’s first word by the time the network reaches the twentieth word, for instance, is genuinely difficult for this architecture.
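The fading can be observed numerically. The sketch below, assuming NumPy and small random weights (all sizes and scales are illustrative), tracks the Jacobian of the current hidden state with respect to the initial hidden state as the chain rule multiplies one per-step factor after another. It shows only the vanishing case; whether gradients vanish or explode in a real model depends on the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, steps = 32, 8, 100      # illustrative sizes
W_xh = rng.normal(scale=0.05, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.05, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
xs = rng.normal(size=(steps, input_size))

h = np.zeros(hidden_size)
J = np.eye(hidden_size)          # running Jacobian of the current state w.r.t. the initial state
norms = []
for x_t in xs:
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    step_jac = (1.0 - h ** 2)[:, None] * W_hh.T   # Jacobian of h_t w.r.t. h_{t-1}
    J = step_jac @ J                              # chain rule across one more time step
    norms.append(np.linalg.norm(J))               # Frobenius norm

for t in (1, 10, 50, 100):
    print(t, norms[t - 1])
```

The printed norms drop by many orders of magnitude as t grows, which is the numerical face of "information from early steps barely influences later ones."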

The Direct Motivation for LSTM and GRU

This specific, well-understood limitation — plain RNNs losing long-range information due to repeated multiplicative gradient decay — is exactly what motivated the LSTM and GRU architectures, covered in full in LSTM and GRU. Both introduce gating mechanisms specifically designed to let important information persist across many time steps without being multiplicatively degraded the way a plain RNN’s hidden state is.
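As a rough intuition for why gating helps, the sketch below compares a plain recurrence with a heavily simplified gated update, assuming NumPy. This is not an LSTM or GRU: the gate here is a single hand-set constant (update_gate, a hypothetical name), whereas real gates are learned and depend on the input and state. Inputs are omitted to isolate the recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 16, 50
W_hh = rng.normal(scale=0.05, size=(hidden_size, hidden_size))
h_start = rng.normal(size=hidden_size)     # stands in for information stored early in the sequence

# Plain RNN update: the whole state is overwritten through W_hh and tanh at every step
h_plain = h_start.copy()
for _ in range(steps):
    h_plain = np.tanh(h_plain @ W_hh)

# Simplified gated update: only a small fraction of the state is overwritten per step
update_gate = 0.02                          # hand-set for illustration, not learned
h_gated = h_start.copy()
for _ in range(steps):
    candidate = np.tanh(h_gated @ W_hh)
    h_gated = (1.0 - update_gate) * h_gated + update_gate * candidate

plain_kept = np.linalg.norm(h_plain) / np.linalg.norm(h_start)
gated_kept = np.linalg.norm(h_gated) / np.linalg.norm(h_start)
print(plain_kept, gated_kept)
```

With the gate mostly closed, the old state is carried forward largely additively instead of being pushed through the weight matrix each step, so a meaningful fraction of it survives all 50 steps while the plain state collapses toward zero.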

Where Plain RNNs Are Still Used Today

Plain (vanilla) RNNs are rarely the first choice for new projects today — LSTM, GRU, and increasingly transformer-based architectures (covered in Transformers) have largely superseded them for tasks involving longer sequences. Plain RNNs remain reasonable for:

Very short sequences, where the long-range dependency problem simply doesn’t have room to manifest.
Extremely resource-constrained settings, where their computational simplicity relative to LSTM/GRU offers a genuine practical advantage.
Educational purposes — understanding the plain RNN thoroughly, as covered here, is what makes LSTM and GRU’s specific design choices (covered next) make complete sense as targeted fixes rather than arbitrary added complexity.

Many-to-One, One-to-Many, and Many-to-Many: RNN Input/Output Patterns

RNNs are flexible in how inputs and outputs are structured, not limited to a single fixed pattern. Many-to-one (sentiment classification: read an entire sequence, produce one output) is the pattern shown earlier in this guide. One-to-many (image captioning: one input, generate a sequence of words describing it) reverses this. Many-to-many comes in two forms: an output at every input position (part-of-speech tagging), or a full output sequence potentially of a different length (translation), which connects directly to the encoder-decoder pattern covered in Sequence-to-Sequence Models. Recognizing which of these patterns a given task needs is one of the first architectural decisions to make before writing any code, because it determines where in the network outputs are taken from and how the training loss is computed.
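The three patterns differ mainly in where outputs are read from the same recurrent loop. A minimal NumPy sketch under stated assumptions: the weights are random, biases are omitted for brevity, the function names are hypothetical, and the one-to-many version feeds each output back in as the next input, which is one common approach and not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 4   # output_size == input_size so outputs can be fed back in
W_xh = rng.normal(scale=0.3, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.3, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.3, size=(hidden_size, output_size))

def step(x_t, h):
    return np.tanh(x_t @ W_xh + h @ W_hh)

def many_to_one(xs):
    h = np.zeros(hidden_size)
    for x_t in xs:
        h = step(x_t, h)
    return h @ W_hy                      # single output read from the final hidden state

def many_to_many(xs):
    h, ys = np.zeros(hidden_size), []
    for x_t in xs:
        h = step(x_t, h)
        ys.append(h @ W_hy)              # one output per input position
    return np.stack(ys)

def one_to_many(x, n_steps):
    h, ys = np.zeros(hidden_size), []
    for _ in range(n_steps):
        h = step(x, h)
        y = h @ W_hy
        ys.append(y)
        x = y                            # feed each output back in as the next input
    return np.stack(ys)

xs = rng.normal(size=(6, input_size))    # a 6-step input sequence
print(many_to_one(xs).shape)             # (4,)
print(many_to_many(xs).shape)            # (6, 4)
print(one_to_many(xs[0], 3).shape)       # (3, 4)
```

The loss follows the same split: a many-to-one model is scored once per sequence, while the other two are scored at every emitted step.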

Summary

Concept	Detail
Hidden state	Carries information forward across time steps
Weight sharing across time	Same weights reused at every step, enabling variable-length sequences
Backpropagation through time	Standard backprop applied across the unrolled time dimension
Key limitation	Long sequences cause vanishing/exploding gradients, losing long-range dependencies

RNNs introduced the essential idea of maintaining state across a sequence — the specific mechanism for how that state is maintained is exactly what LSTM and GRU improve on next, directly targeting the long-range dependency problem described here.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.