Recurrent Neural Networks
Recurrent Neural Networks process sequences by maintaining a hidden state — a memory that carries information from previous time steps. This makes them natural for language, time series, and any data where order and context matter.
How RNNs Work
At each time step t, the RNN takes the current input xₜ and the previous hidden state hₜ₋₁, and produces a new hidden state hₜ:
hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b)
yₜ = Wᵧhₜ
Where:
- xₜ = input at time t
- hₜ = hidden state at time t (the "memory")
- Wₓ = input weights (shared across all time steps)
- Wₕ = recurrent weights (shared across all time steps)
- Wᵧ = output weights (shared across all time steps)
- b = bias vector
- yₜ = output at time t
The same weights are used at every time step, so a single set of parameters can process sequences of any length.
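The recurrence above can be written out in a few lines of plain Python. This is a minimal sketch rather than an efficient implementation: the function names `rnn_step` and `rnn_forward`, the toy weight values, and the all-zero initial hidden state are assumptions made for the example, and the output projection yₜ = Wᵧhₜ is left out for brevity.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One time step: h_t = tanh(W_x x_t + W_h h_prev + b), using plain lists."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def rnn_forward(xs, W_x, W_h, b):
    """Apply the same weights at every time step and return all hidden states."""
    h = [0.0] * len(b)  # assumption: the initial hidden state h_0 is all zeros
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# Hypothetical toy weights: 2 input features, 3 hidden units.
W_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = short_seq * 5  # the same weights handle a longer sequence
states_short = rnn_forward(short_seq, W_x, W_h, b)
states_long = rnn_forward(long_seq, W_x, W_h, b)
print(len(states_short), len(states_long))
```

Because the loop reuses `W_x`, `W_h`, and `b` at every step, the 2-step and the 10-step sequence run through the same function unchanged.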
Unrolled Computation Graph
x₁ → [RNN cell] → h₁ → [RNN cell] → h₂ → [RNN cell] → h₃ → output
                           ↑                 ↑
                           x₂                x₃
For many-to-one tasks (sentiment classification): use only the last hidden state hₙ
For many-to-many (sequence tagging): use hidden states at every step h₁, h₂, …, hₙ
For one-to-many (text generation): pass a single input and generate step by step
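The three patterns above differ only in which hidden states are turned into outputs. The sketch below illustrates this with a toy scalar cell. The names `cell`, `many_to_one`, `many_to_many`, and `one_to_many`, the weight values, and the choice to feed each output straight back in as the next input are all assumptions for illustration.

```python
import math

def cell(x_t, h_prev, w_x=0.8, w_h=0.5, b=0.0):
    """Toy scalar RNN cell: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def hidden_states(xs, h0=0.0):
    """Run the cell over a sequence and collect h_1 ... h_n."""
    states, h = [], h0
    for x_t in xs:
        h = cell(x_t, h)
        states.append(h)
    return states

def many_to_one(xs, w_y=2.0):
    """Single output computed from the last hidden state only."""
    return w_y * hidden_states(xs)[-1]

def many_to_many(xs, w_y=2.0):
    """One output per time step, computed from every hidden state."""
    return [w_y * h for h in hidden_states(xs)]

def one_to_many(x_first, steps, w_y=2.0):
    """Generate step by step from a single starting input."""
    outputs, h, x_t = [], 0.0, x_first
    for _ in range(steps):
        h = cell(x_t, h)
        y_t = w_y * h
        outputs.append(y_t)
        x_t = y_t  # assumption: the output is reused directly as the next input
    return outputs

xs = [1.0, -0.5, 0.25]
print(many_to_one(xs))      # one number for the whole sequence
print(many_to_many(xs))     # three numbers, one per input
print(one_to_many(1.0, 4))  # four generated numbers from a single input
```

In a real text generator the fed-back value would typically be a token sampled from the output distribution, not the raw output itself.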
PyTorch RNN
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,    # Input shape: (batch, seq_len, features)
            dropout=0.3,         # Applied between stacked layers (needs num_layers > 1)
            bidirectional=True   # Process sequence both forward and backward
        )
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        # x shape: (batch, seq_len)
        embedded = self.embedding(x)         # (batch, seq_len, embed_dim)
        output, hidden = self.rnn(embedded)  # output: (batch, seq_len, hidden*2)
                                             # hidden: (num_layers*2, batch, hidden)

        # Use final hidden state for classification.
        # hidden[-2] and hidden[-1] are the top layer's forward and backward final states.
        # output[:, -1, :] would give the backward direction only one token of context.
        final_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, hidden*2)
        return self.classifier(final_hidden)
Vanishing Gradient Problem
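Standard RNNs struggle with long sequences because gradients shrink or grow exponentially as they flow backward through time during Backpropagation Through Time (BPTT). The gradient reaching an early step t is the loss gradient at the final step multiplied by one factor per intervening step: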
∂Loss/∂hₜ = (∂Loss/∂hₙ) × ∏ᵢ (∂hᵢ/∂hᵢ₋₁), with the product taken over i = t+1, …, n
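To get a feel for how quickly such a product shrinks or grows, the sketch below replaces each Jacobian factor with a single hypothetical scalar and multiplies it over many steps. The function name `backprop_factor_product` and the factor values 0.9, 1.0, and 1.1 are assumptions chosen for illustration. Real factors are matrices that vary from step to step.

```python
def backprop_factor_product(factor, num_steps):
    """Multiply the same scalar factor num_steps times, as a stand-in for the Jacobian product."""
    product = 1.0
    for _ in range(num_steps):
        product *= factor
    return product

for factor in (0.9, 1.0, 1.1):
    for num_steps in (10, 100):
        value = backprop_factor_product(factor, num_steps)
        print(f"factor={factor}, steps={num_steps}, product={value:.3e}")
```

With a factor of 0.9 the product over 100 steps is on the order of 1e-5, while a factor of 1.1 gives a product above 1e4, so a small deviation from 1 per step is enough to lose or blow up the gradient signal.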
For long sequences (T=100+):
- If the factors are consistently smaller than 1 in norm: gradients → 0 (vanishing: no learning from early inputs)
- If the factors are consistently larger than 1 in norm: gradients → ∞ (exploding: training instability)
Gradient clipping (for exploding gradients):
# Call after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
The standard remedy for vanishing gradients is a gated architecture such as LSTM or GRU, which uses gating mechanisms to maintain long-range memory.
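The idea behind norm-based gradient clipping can be sketched in plain Python: compute one L2 norm over all gradients and, if it exceeds the threshold, rescale every gradient by the same factor so the direction is preserved. The function `clip_by_global_norm` and the gradient values are hypothetical. PyTorch's `clip_grad_norm_` works in place on parameter gradients and handles details this sketch ignores.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]  # hypothetical gradients with global norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Clipping bounds the size of an update but does nothing for vanishing gradients, since it only ever scales gradients down.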
Bidirectional RNNs
For tasks like NER or machine translation where context from both directions matters:
birnn = nn.RNN(input_size=128, hidden_size=256, bidirectional=True, batch_first=True)
x = torch.randn(32, 50, 128)  # illustrative batch: (batch, seq_len, input_size)
# Output has 2×hidden_size channels (forward + backward concatenated)
output, hidden = birnn(x)  # output: (batch, seq_len, 512)
When to Use RNNs (2026)
Plain RNNs have largely been replaced by:
- LSTMs/GRUs for moderate-length sequences with complex dependencies
- Transformers for most NLP tasks — better parallelization and long-range attention
- Temporal Convolutional Networks (TCNs) for many time series problems
RNNs are still useful for:
- Embedded systems where LSTM/Transformer weights are too large
- Online/streaming prediction (process one step at a time)
- Learning about sequence models before LSTMs or Transformers
In production NLP, Transformers have largely superseded RNNs, but the sequence-processing intuition transfers directly.