Recurrent Neural Networks: Sequential Data and Temporal Dependencies

Master recurrent neural networks — hidden state, BPTT, vanishing gradients, RNN for sequence modeling, time series, text processing, and comparison with Transformers.

Recurrent Neural Networks

Recurrent Neural Networks process sequences by maintaining a hidden state — a memory that carries information from previous time steps. This makes them natural for language, time series, and any data where order and context matter.


How RNNs Work

At each time step t, the RNN takes the current input xₜ and the previous hidden state hₜ₋₁, and produces a new hidden state hₜ:

hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b)
yₜ = Wᵧhₜ
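Where:
xₜ = input at time t
hₜ = hidden state at time t (the "memory")
hₜ₋₁ = hidden state from the previous step (h₀ is usually initialized to zeros)
Wₓ = input weights (shared across all time steps)
Wₕ = recurrent weights (shared across all time steps)
Wᵧ = output weights (shared across all time steps)
b = bias vector
yₜ = output at time t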

The same weights are used at every time step, so a single cell with a fixed number of parameters can be applied to sequences of any length.
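
The recurrence above can be sketched by hand and checked against PyTorch's built-in cell. This is a minimal illustration, assuming small made-up sizes and reusing the parameters of an `nn.RNNCell` so the two computations are comparable. PyTorch stores two bias vectors, and their sum plays the role of b here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim, seq_len = 4, 3, 5      # illustrative sizes (assumption)
cell = nn.RNNCell(input_dim, hidden_dim)      # tanh nonlinearity by default

# Reuse the cell's parameters so both computations share the same weights.
W_x, W_h = cell.weight_ih, cell.weight_hh     # (hidden, input), (hidden, hidden)
b = cell.bias_ih + cell.bias_hh               # combined bias

xs = torch.randn(seq_len, 1, input_dim)       # (time, batch=1, features)
h_manual = torch.zeros(1, hidden_dim)         # h_0
h_cell = torch.zeros(1, hidden_dim)

for x_t in xs:
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b), same weights at every step
    h_manual = torch.tanh(x_t @ W_x.T + h_manual @ W_h.T + b)
    h_cell = cell(x_t, h_cell)

match = torch.allclose(h_manual, h_cell, atol=1e-6)
print(match)
```

The loop body never changes with the step index, which is what "shared across all time steps" means in practice.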


Unrolled Computation Graph

h₀ → [RNN cell] → h₁ → [RNN cell] → h₂ → [RNN cell] → h₃ → output
         ↑                 ↑                 ↑
         x₁                x₂                x₃

For many-to-one tasks (sentiment classification): use only the last hidden state hₙ
For many-to-many (sequence tagging): use hidden states at every step h₁, h₂, …, hₙ
For one-to-many (text generation): pass a single input and generate step by step
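
The three patterns can be sketched with a single `nn.RNN`. The sizes are arbitrary, and `to_input` is a hypothetical projection layer introduced only so that an output can be fed back in as the next input. Real generators usually sample a token and embed it instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, feat, hidden = 2, 6, 8, 16    # illustrative sizes (assumption)
rnn = nn.RNN(input_size=feat, hidden_size=hidden, batch_first=True)
x = torch.randn(batch, seq_len, feat)

output, h_n = rnn(x)    # output: (batch, seq_len, hidden), h_n: (1, batch, hidden)

# Many-to-one: one vector per sequence, taken from the last time step.
many_to_one = output[:, -1, :]     # equals h_n[-1] for a unidirectional RNN

# Many-to-many: one vector per time step, e.g. input to a per-token tagger.
many_to_many = output              # (batch, seq_len, hidden)

# One-to-many: start from a single input and feed each output back in.
to_input = nn.Linear(hidden, feat)  # hypothetical projection back to input size
step_in, h = x[:, :1, :], None      # one starting step per sequence
generated = []
for _ in range(4):
    out, h = rnn(step_in, h)        # out: (batch, 1, hidden)
    step_in = to_input(out)         # next input: (batch, 1, feat)
    generated.append(step_in)
one_to_many = torch.cat(generated, dim=1)   # (batch, 4, feat)
```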


PyTorch RNN

import torch
import torch.nn as nn


class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,    # Input shape: (batch, seq_len, features)
            dropout=0.3,         # Applied between stacked layers (needs num_layers > 1)
            bidirectional=True,  # Process sequence both forward and backward
        )
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        # x shape: (batch, seq_len)
        embedded = self.embedding(x)         # (batch, seq_len, embed_dim)
        output, hidden = self.rnn(embedded)  # output: (batch, seq_len, hidden*2)
        # hidden: (num_layers*2, batch, hidden); the last two entries are the
        # top layer's final forward and final backward states.
        # output[:, -1, :] would pair the forward final state with a backward
        # state that has seen only the last token, so use hidden instead.
        final_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, hidden*2)
        return self.classifier(final_hidden)
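
One caveat with padded batches: a classifier like the one above embeds index 0 as padding, but the RNN still steps through those padded positions. A common way to handle this in PyTorch is `pack_padded_sequence`, sketched below for a small unidirectional RNN. The token ids and sizes are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
embed = nn.Embedding(10, 4, padding_idx=0)
rnn = nn.RNN(input_size=4, hidden_size=5, batch_first=True)

# Two sequences of true length 4 and 2, right-padded with index 0.
tokens = torch.tensor([[3, 7, 2, 5],
                       [4, 1, 0, 0]])
lengths = torch.tensor([4, 2])   # lengths must be a CPU tensor (or a list)

packed = pack_padded_sequence(
    embed(tokens), lengths, batch_first=True, enforce_sorted=False
)
_, h_packed = rnn(packed)   # (1, batch, hidden): state at each sequence's last real token

# Reference: run the short sequence on its own, without padding.
_, h_alone = rnn(embed(tokens[1:2, :2]))
same = torch.allclose(h_packed[0, 1], h_alone[0, 0], atol=1e-6)
print(same)
```

With packing, the final hidden state for each sequence comes from its last real token rather than from a padding step.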

Vanishing Gradient Problem

Standard RNNs struggle with long sequences because gradients can shrink exponentially as they flow backward through time during Backpropagation Through Time (BPTT):

∂Loss/∂hₜ = (∂Loss/∂hₙ) × ∏ (∂hᵢ/∂hᵢ₋₁) for i = t+1, …, n
For long sequences (n = 100+), the product contains many factors:
If the factors are consistently smaller than 1 in norm: gradients → 0 (vanishing; no learning from early inputs)
If the factors are consistently larger than 1 in norm: gradients → ∞ (exploding; training instability)
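
A scalar toy version makes the product of factors concrete. The sketch below assumes a one-dimensional recurrence with no input (h_t = tanh(w·h_{t-1}) for the vanishing case, and a linear h_t = w·h_{t-1} for the exploding case). The weights 0.5 and 1.5 are arbitrary illustrative choices.

```python
import math


def tanh_rnn_gradient(w, h0, steps):
    """Return d h_steps / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain-rule factor dh_t / dh_{t-1}
    return grad


def linear_rnn_gradient(w, steps):
    """Return d h_steps / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    return w ** steps


vanishing = tanh_rnn_gradient(w=0.5, h0=0.5, steps=100)
exploding = linear_rnn_gradient(w=1.5, steps=100)
print(f"vanishing: {vanishing:.3e}, exploding: {exploding:.3e}")
```

Each tanh factor is at most |w|, so with w = 0.5 the gradient after 100 steps is bounded by 0.5¹⁰⁰. In the linear case with w = 1.5 it grows as 1.5¹⁰⁰.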

Gradient clipping (for exploding gradients):

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
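
For context, the sketch below shows where the clipping call typically sits in a training step: after `backward()` and before `optimizer.step()`. The model, the random data, and the 1000× loss scaling are stand-ins chosen only to produce large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 50, 8)          # (batch, seq_len, features), random stand-in data
target = torch.randn(4, 50, 16)

optimizer.zero_grad()
output, _ = model(x)
loss = 1000.0 * nn.functional.mse_loss(output, target)   # scaled up to force large gradients
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()

print(f"before: {norm_before.item():.3f}, after: {norm_after.item():.3f}")
```

Clipping rescales all gradients by one common factor, so the update direction is preserved and only its magnitude is capped.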

The standard remedy for vanishing gradients is a gated architecture such as LSTM or GRU, which uses gating mechanisms to preserve information and gradient flow over long ranges.
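
In PyTorch, the gated layers are close to drop-in replacements for `nn.RNN`. The sketch below uses arbitrary sizes and shows the one difference in return values, plus the parameter cost of the extra gates.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 10, 8)   # (batch, seq_len, features)

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

out_rnn, h_rnn = rnn(x)
out_gru, h_gru = gru(x)                # same return structure as nn.RNN
out_lstm, (h_lstm, c_lstm) = lstm(x)   # LSTM additionally returns a cell state


def count_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


n_rnn, n_gru, n_lstm = count_params(rnn), count_params(gru), count_params(lstm)
print(n_rnn, n_gru, n_lstm)
```

For a single layer of the same size, the GRU holds three times and the LSTM four times the parameters of the plain RNN, one set of weights per gate.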


Bidirectional RNNs

For tasks like NER or machine translation (on the encoder side), where the whole sequence is available up front and context from both directions matters:

birnn = nn.RNN(input_size=128, hidden_size=256, bidirectional=True, batch_first=True)
x = torch.randn(32, 50, 128)  # example batch: (batch, seq_len, input_size)
# Output has 2×hidden_size channels (forward + backward concatenated)
output, hidden = birnn(x)  # output: (32, 50, 512), hidden: (2, 32, 256)
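
It helps to know how the two directions are laid out in the returned tensors. The sketch below, with an arbitrary batch of random inputs, checks that the forward direction's final state sits at the last time step of the first half of `output`, while the backward direction's final state sits at the first time step of the second half.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 256
birnn = nn.RNN(input_size=128, hidden_size=hidden, bidirectional=True, batch_first=True)
x = torch.randn(4, 20, 128)           # (batch, seq_len, input_size)
output, h_n = birnn(x)                # output: (4, 20, 512), h_n: (2, 4, 256)

fwd, bwd = output[..., :hidden], output[..., hidden:]

# Forward direction finishes at the last time step ...
fwd_final_matches = torch.allclose(h_n[0], fwd[:, -1, :], atol=1e-6)
# ... while the backward direction finishes at the first time step.
bwd_final_matches = torch.allclose(h_n[1], bwd[:, 0, :], atol=1e-6)

# A whole-sequence summary combines both final states.
sequence_summary = torch.cat([h_n[0], h_n[1]], dim=1)   # (4, 512)
```

This is why slicing `output[:, -1, :]` is a weak sequence summary for a bidirectional model: its backward half has processed only the final token.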

When to Use RNNs (2026)

Plain RNNs have largely been replaced by:

  • LSTMs/GRUs for moderate-length sequences with complex dependencies
  • Transformers for most NLP tasks — better parallelization and long-range attention
  • Temporal Convolutional Networks (TCNs) for many time series problems

RNNs are still useful for:

  • Embedded systems where LSTM/Transformer weights are too large
  • Online/streaming prediction (process one step at a time)
  • Learning about sequence models before LSTMs or Transformers
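
The streaming case can be sketched by carrying the hidden state from one call to the next. The example assumes a small `nn.RNN` and a random stand-in stream, and checks that step-by-step processing reproduces the result of processing the whole sequence at once.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)
stream = torch.randn(1, 12, 3)   # one stream of 12 readings, 3 features each

with torch.no_grad():
    # Offline: the whole sequence in one call.
    full_output, _ = rnn(stream)

    # Online: one reading at a time, keeping only the hidden state between calls.
    h = None
    step_outputs = []
    for t in range(stream.size(1)):
        x_t = stream[:, t:t + 1, :]      # (1, 1, 3): the newest reading
        out_t, h = rnn(x_t, h)
        step_outputs.append(out_t)
    streamed_output = torch.cat(step_outputs, dim=1)

matches = torch.allclose(full_output, streamed_output, atol=1e-5)
print(matches)
```

Memory use per step is constant, since only `h` is kept between readings. This applies to unidirectional models only, because a bidirectional RNN needs the full sequence before its backward pass can start.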

In production NLP, Transformers have largely superseded RNNs, but the sequence-processing intuition transfers directly.