Recurrent Neural Networks

Recurrent Neural Networks process sequences by maintaining a hidden state: a memory that carries information forward from previous time steps. This makes them a natural fit for language, time series, and any data where order and context matter.

How RNNs Work

At each time step t, the RNN takes the current input xₜ and the previous hidden state hₜ₋₁, and produces a new hidden state hₜ:

hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b)
yₜ = Wᵧhₜ

Where:
  xₜ = input at time t
  hₜ = hidden state at time t (the "memory")
  Wₓ = input weights (shared across all time steps)
  Wₕ = recurrent weights (shared across all time steps)
  Wᵧ = output weights (shared across all time steps)
  b = bias vector (shared across all time steps)
  yₜ = output at time t

The same weights are used at every time step, so the number of parameters does not depend on sequence length and one network can be applied to sequences of any length.
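The recurrence above can be sketched as an explicit loop. This is a minimal illustration rather than a reference implementation: the sizes and the helper name `manual_rnn_forward` are assumptions made for the example, and the weights are borrowed from a single-layer `nn.RNN` so the two results can be compared. PyTorch stores two bias vectors (`bias_ih_l0` and `bias_hh_l0`); their sum plays the role of b.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim, seq_len = 4, 3, 5   # illustrative sizes (assumed)
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_dim, batch_first=True)

def manual_rnn_forward(x, rnn):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) one step at a time.

    x: (batch, seq_len, input_dim).
    Returns every hidden state, shape (batch, seq_len, hidden_dim).
    """
    W_x, W_h = rnn.weight_ih_l0, rnn.weight_hh_l0
    b = rnn.bias_ih_l0 + rnn.bias_hh_l0
    h = torch.zeros(x.size(0), rnn.hidden_size)   # h_0 = 0
    states = []
    for t in range(x.size(1)):
        # The same W_x, W_h and b are reused at every step (weight sharing).
        h = torch.tanh(x[:, t] @ W_x.T + h @ W_h.T + b)
        states.append(h)
    return torch.stack(states, dim=1)

x = torch.randn(2, seq_len, input_dim)
with torch.no_grad():
    manual_states = manual_rnn_forward(x, rnn)
    reference_states, _ = rnn(x)   # PyTorch's own forward pass
```

If the loop matches the formula, `manual_states` and `reference_states` agree up to floating-point error.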

Unrolled Computation Graph

h₀ → [RNN cell] → h₁ → [RNN cell] → h₂ → [RNN cell] → h₃ → output
         ↑                 ↑                 ↑
         x₁                x₂                x₃

For many-to-one tasks (sentiment classification): use only the last hidden state hₙ
For many-to-many (sequence tagging): use hidden states at every step h₁, h₂, …, hₙ
For one-to-many (text generation): pass a single input and generate step by step
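The three patterns differ only in which hidden states are kept and what is fed in. The sketch below assumes a single-layer, unidirectional `nn.RNN` with made-up sizes. For the one-to-many case it also assumes a hypothetical linear layer `to_input` that maps a hidden state back to input space. Real generators usually sample a token and embed it instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, feat_dim, hidden_dim = 2, 6, 8, 16   # illustrative sizes
rnn = nn.RNN(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
x = torch.randn(batch, seq_len, feat_dim)

with torch.no_grad():
    output, h_n = rnn(x)          # output: (batch, seq_len, hidden_dim)

    # Many-to-one: keep only the final hidden state.
    many_to_one = h_n[-1]         # (batch, hidden_dim)

    # Many-to-many: keep the hidden state at every step.
    many_to_many = output         # (batch, seq_len, hidden_dim)

    # One-to-many: start from a single input, then feed each result back in.
    to_input = nn.Linear(hidden_dim, feat_dim)   # hypothetical feedback layer
    step_input, hidden, generated = x[:, :1], None, []
    for _ in range(4):            # generate 4 steps
        step_out, hidden = rnn(step_input, hidden)
        step_input = to_input(step_out)          # next input: (batch, 1, feat_dim)
        generated.append(step_input)
    one_to_many = torch.cat(generated, dim=1)    # (batch, 4, feat_dim)
```

For a unidirectional network, `h_n[-1]` and `output[:, -1]` hold the same values, so either can serve as the many-to-one summary.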

PyTorch RNN

import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,     # Input shape: (batch, seq_len, features)
            dropout=0.3,          # Applied between stacked layers; no effect if num_layers=1
            bidirectional=True    # Process sequence both forward and backward
        )
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        # x shape: (batch, seq_len)
        embedded = self.embedding(x)          # (batch, seq_len, embed_dim)
        output, hidden = self.rnn(embedded)   # output: (batch, seq_len, hidden*2)

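        # Use the final hidden states for classification. With bidirectional=True,
        # output[:, -1, :] would pair the forward state with a backward state that
        # has seen only the last token, so take both directions from `hidden`.
        # hidden: (num_layers * 2, batch, hidden); the last two entries are the
        # top layer's forward and backward final states.
        final_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, hidden*2)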
        return self.classifier(final_hidden)
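One caveat with padded batches: `padding_idx=0` gives padded positions a zero embedding, but the RNN still steps through them, so the final state of a short sequence is computed after its padding. A common remedy in PyTorch is `pack_padded_sequence`. The sketch below uses made-up sizes and a bare `nn.RNN` rather than the classifier above, and assumes the true lengths are known.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)

# Two sequences padded to length 5; the second has only 3 real steps.
x = torch.randn(2, 5, 4)
x[1, 3:] = 0.0
lengths = torch.tensor([5, 3])     # true lengths, kept on the CPU

with torch.no_grad():
    packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    _, h_packed = rnn(packed)      # h_packed: (1, batch, hidden)

    # Reference: run the short sequence alone, without any padding.
    _, h_short = rnn(x[1:2, :3])
```

With packing, `h_packed[0, 1]` is the state after the third real step, the same value as `h_short[0, 0]`.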

Vanishing Gradient Problem

Standard RNNs struggle with long sequences because gradients decay exponentially as they flow backward through time (BPTT — Backpropagation Through Time):

∂Loss/∂hₜ = (∂Loss/∂hₙ) × ∏ᵢ (∂hᵢ/∂hᵢ₋₁),  i = t+1, …, n

For long sequences (n − t of 100 or more):
  If each factor has norm < 1: gradients → 0 (vanishing: no learning from early inputs)
  If each factor has norm > 1: gradients can grow without bound (exploding: training instability)
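A little arithmetic shows how quickly the product runs away. The sketch below treats each factor as a scalar, which is a simplification because the real factors are Jacobian matrices. Both helper names are made up for illustration. The second helper uses a scalar recurrence hₜ = tanh(w·hₜ₋₁) with no input, so the tanh derivative shrinks each factor further.

```python
import math

def product_of_factors(factor, steps):
    """Gradient scale when the same scalar factor is applied at every step."""
    return factor ** steps

def tanh_rnn_gradient(w, steps, h0=0.5):
    """d h_steps / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule: d tanh(w*h)/dh = w * (1 - tanh^2)
    return grad

vanishing = product_of_factors(0.9, 100)   # roughly 2.7e-5
exploding = product_of_factors(1.1, 100)   # roughly 1.4e4
scalar_rnn = tanh_rnn_gradient(0.9, 100)   # below 0.9**100, since 1 - tanh^2 < 1
```

Factors only 10% away from 1 already change the gradient by four to five orders of magnitude over 100 steps.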

Gradient clipping (for exploding gradients):

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
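Where the call sits in a training step matters: after `backward()` has filled in the gradients and before the optimizer uses them. The following is a sketch with a toy model, random data, and targets scaled up on purpose to make large gradients likely. None of these values come from a real task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 50, 4)
target = torch.randn(16, 50, 8) * 100   # exaggerated targets (illustrative)

optimizer.zero_grad()
output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()

# Rescales all gradients in place so their combined L2 norm is at most
# max_norm; returns the combined norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the combined norm to confirm the effect.
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))

optimizer.step()
```

Logging `norm_before` during training is a cheap way to see how often clipping is active.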

The standard remedy for vanishing gradients is a gated architecture such as LSTM or GRU, whose gates let the network preserve information over long ranges.

Bidirectional RNNs

For tasks like NER, or the encoder side of machine translation, where the whole input is available up front and context from both directions matters:

birnn = nn.RNN(input_size=128, hidden_size=256, bidirectional=True, batch_first=True)

x = torch.randn(32, 50, 128)  # Example input: (batch, seq_len, features)

# Output has 2×hidden_size channels (forward + backward concatenated)
output, hidden = birnn(x)  # output: (32, 50, 512); hidden: (2, 32, 256)
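The layout of the concatenated output is easy to misread: the forward direction finishes at the last position, while the backward direction finishes at position 0. The sketch below checks this on random stand-in data for a single-layer bidirectional `nn.RNN`. The batch size and sequence length are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 256
birnn = nn.RNN(input_size=128, hidden_size=hidden_size,
               bidirectional=True, batch_first=True)
x = torch.randn(4, 10, 128)            # (batch, seq_len, features), random stand-in

with torch.no_grad():
    output, hidden = birnn(x)          # hidden: (2, batch, hidden_size)

# First half of the channels: forward direction, complete at the last step.
forward_final = output[:, -1, :hidden_size]
# Second half: backward direction, complete at the first step.
backward_final = output[:, 0, hidden_size:]

# Concatenating both final states summarizes the whole sequence.
summary = torch.cat([hidden[0], hidden[1]], dim=1)   # (batch, 2 * hidden_size)
```

So for a whole-sequence summary, `hidden` is a better source than `output[:, -1, :]`, whose backward half has seen only the final token.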

When to Use RNNs (2026)

Plain RNNs have largely been replaced by:

LSTMs/GRUs for moderate-length sequences with complex dependencies
Transformers for most NLP tasks — better parallelization and long-range attention
Temporal Convolutional Networks (TCNs) for many time series problems

RNNs are still useful for:

Embedded systems where LSTM/Transformer weights are too large
Online/streaming prediction (process one step at a time)
Learning about sequence models before LSTMs or Transformers
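The streaming case can be sketched by carrying only the hidden state between calls. The example assumes a unidirectional `nn.RNN` (a bidirectional one needs the full sequence) and uses random data as a stand-in for observations arriving over time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)   # unidirectional
stream = torch.randn(1, 20, 4)        # stand-in for data arriving over time

with torch.no_grad():
    hidden = None                     # None means "start from zeros"
    streamed = []
    for t in range(stream.size(1)):
        step = stream[:, t:t + 1]     # one new observation: (1, 1, 4)
        out, hidden = rnn(step, hidden)   # only the state is carried forward
        streamed.append(out)
    streamed = torch.cat(streamed, dim=1)

    full, _ = rnn(stream)             # same sequence processed in one call
```

Step-by-step processing gives the same outputs as a single pass over the full sequence, and memory use stays constant however long the stream runs.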

In production NLP, Transformers have largely superseded RNNs, but the sequence-processing intuition transfers directly.