Long Short-Term Memory (LSTM)

LSTMs address a core problem of standard RNNs: gradients vanish as they are backpropagated through long sequences, so standard RNNs struggle to learn long-range dependencies. LSTMs use a gating mechanism to selectively remember and forget information, which lets them learn dependencies spanning hundreds of time steps.
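
To make the vanishing-gradient argument concrete, here is a small arithmetic sketch. It assumes a deliberately simplified scalar model in which backpropagating through each time step multiplies the gradient by one constant factor. The function name and the factor values 0.9 and 0.999 are illustrative choices, not measurements from a real network.

```python
# Simplified scalar model (an assumption for illustration): backpropagating
# through each time step multiplies the gradient by one constant factor.

def gradient_after(steps, per_step_factor):
    """Gradient magnitude left after `steps` multiplications."""
    return per_step_factor ** steps

# Plain RNN-like path: the per-step factor is noticeably below 1.
rnn_like = gradient_after(100, 0.9)

# LSTM-like cell-state path: a forget gate close to 1 keeps the factor near 1.
lstm_like = gradient_after(100, 0.999)

print(f"factor 0.9   over 100 steps: {rnn_like:.2e}")
print(f"factor 0.999 over 100 steps: {lstm_like:.3f}")
```

In this toy model the first gradient shrinks by several orders of magnitude while the second stays close to its original size. That is the intuition behind keeping a cell state that changes only slightly from step to step.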

The Cell State

The key innovation is that LSTMs maintain two state vectors instead of one:

Cell state (cₜ): Long-term memory — flows with minimal modification, like a conveyor belt
Hidden state (hₜ): Short-term working memory — used for output at each step

Cell state:   c₀ ──── × ──── + ──────────────── c₁
                      ↑      ↑         │
                    forget  input    tanh
                                       ↓
Hidden state: h₀ (read by all gates)   × ─────── h₁
                                       ↑
                                     output

The Three Gates

Each gate is a sigmoid-activated layer that outputs values between 0 and 1:

Forget Gate

“What should we forget from long-term memory?”

fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)
fₜ × cₜ₋₁   (element-wise multiplication; the retained part of cₜ₋₁, which becomes the first term of cₜ)
fₜ = 0: completely forget   fₜ = 1: completely remember

Input Gate

“What new information should we store?”

iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi)    (how much to update)
c̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc)  (candidate values)
cₜ = fₜ × cₜ₋₁ + iₜ × c̃ₜ         (update cell state: kept memory plus gated new values)

Output Gate

“What should we output from the updated cell state?”

oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
hₜ = oₜ × tanh(cₜ)
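
The three gates can be put together into a single time step. The following is a minimal NumPy sketch of that step, not how PyTorch implements it internally. The function name `lstm_step`, the `params` dictionary layout, the toy sizes, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    x_t: (input_size,); h_prev and c_prev: (hidden_size,).
    params: dict with weights "W*" of shape (hidden_size, hidden_size + input_size)
            and biases "b*" of shape (hidden_size,). This layout is a choice
            made for this sketch.
    """
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(params["Wf"] @ z + params["bf"])       # forget gate
    i_t = sigmoid(params["Wi"] @ z + params["bi"])       # input gate
    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])   # candidate values
    o_t = sigmoid(params["Wo"] @ z + params["bo"])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde                   # kept memory + new info
    h_t = o_t * np.tanh(c_t)                             # exposed hidden state
    return h_t, c_t

# Toy sizes and random weights, purely for demonstration.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for name in ("f", "i", "c", "o"):
    params["W" + name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["b" + name] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

Note that the output gate never modifies the cell state. It only controls how much of tanh(cₜ) is exposed as hₜ.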

PyTorch LSTM

import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        lstm_out, (hidden, cell) = self.lstm(x)

        # Use last time step for forecasting
        last_out = lstm_out[:, -1, :]  # (batch, hidden_size)
        return self.fc(last_out)

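# Time series forecasting: 5 features per time step → predict 1 value one step ahead
# (the lookback length comes from the input tensor's seq_len, not from the model)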
model = LSTMForecaster(input_size=5, hidden_size=128, num_layers=2, output_size=1)
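
The forward pass above relies on the shapes that nn.LSTM returns. A small sketch of those shapes follows. The batch size, sequence length, and random input are stand-in values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=5, hidden_size=128, num_layers=2, batch_first=True)

# Random stand-in input: (batch=8, seq_len=10, input_size=5)
x = torch.randn(8, 10, 5)
lstm_out, (hidden, cell) = lstm(x)

print(lstm_out.shape)   # (batch, seq_len, hidden_size)
print(hidden.shape)     # (num_layers, batch, hidden_size), even with batch_first=True
print(cell.shape)       # (num_layers, batch, hidden_size)

# For a unidirectional LSTM, the last time step of the output is the final
# hidden state of the top layer.
same = torch.allclose(lstm_out[:, -1, :], hidden[-1])
print(same)
```

So for a unidirectional model, taking lstm_out[:, -1, :] and taking hidden[-1] are interchangeable. With bidirectional=True this no longer holds, because the backward direction finishes at the first time step.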

Time Series Forecasting Example

import numpy as np

def create_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+seq_len])
    return np.array(X), np.array(y)

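# data: NumPy array of shape (T, n_features), assumed to be loaded elsewhere
X, y = create_sequences(data, seq_len=30)  # 30-step lookback window
# X shape: (T - 30, 30, n_features); y shape: (T - 30, n_features)
# For a single-target model (output_size=1), keep one column, e.g. y[:, [0]] if the target is column 0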

# Train/test split (time-based, no shuffling!)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
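
The windowing and time-based split can be tried end to end on synthetic data. The sketch below also standardizes the features using statistics from the training period only, which is one common way to avoid leaking test-period information. The random-walk data, the array sizes, and the choice of standardization are assumptions made for illustration.

```python
import numpy as np

def create_sequences(data, seq_len):
    # Same windowing as above: seq_len past rows -> the next row as target.
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len])
    return np.array(X), np.array(y)

# Synthetic stand-in for real data: a random walk of shape (T=200, n_features=5).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)).cumsum(axis=0)

seq_len = 30
n_windows = len(data) - seq_len
split = int(0.8 * n_windows)

# Rows touched by any training window or training target.
train_rows = data[:split + seq_len]
mean, std = train_rows.mean(axis=0), train_rows.std(axis=0)
scaled = (data - mean) / std        # training-period statistics only

X, y = create_sequences(scaled, seq_len)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

The first test windows still contain rows from the end of the training period as inputs. That is expected with a sliding window, since those rows are already known by the time the test targets are predicted.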

GRU: Gated Recurrent Unit

GRU simplifies the LSTM by merging the cell and hidden states into a single state and using only two gates (reset and update). It often matches LSTM performance with fewer parameters:

gru = nn.GRU(
    input_size=10,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    dropout=0.2
)

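# Example input (random stand-in): batch of 32 sequences, 30 steps, 10 features
x = torch.randn(32, 30, 10)

# GRU returns (output, hidden); LSTM returns (output, (hidden, cell))
output, hidden = gru(x)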

LSTM vs GRU rule of thumb: Start with GRU (fewer params, faster), switch to LSTM if performance is insufficient on long sequences.
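
The "fewer params" part of this rule of thumb can be checked with simple arithmetic. The sketch below counts parameters for the GRU configuration above and for an LSTM of the same size. It assumes the PyTorch parameter layout, where each layer has an input weight matrix, a hidden weight matrix, and two bias vectors. The helper name `recurrent_param_count` is hypothetical.

```python
def recurrent_param_count(n_gates, input_size, hidden_size, num_layers):
    """Parameter count of a stacked, unidirectional recurrent layer.

    Assumes the PyTorch layout: per layer, an input weight matrix, a hidden
    weight matrix, and two bias vectors, each stacked over `n_gates` blocks.
    """
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden state as input.
        in_features = input_size if layer == 0 else hidden_size
        per_block = (hidden_size * in_features
                     + hidden_size * hidden_size
                     + 2 * hidden_size)
        total += n_gates * per_block
    return total

# LSTM has 4 weight blocks (forget, input, candidate, output);
# GRU has 3 (reset, update, candidate).
lstm_params = recurrent_param_count(4, input_size=10, hidden_size=128, num_layers=2)
gru_params = recurrent_param_count(3, input_size=10, hidden_size=128, num_layers=2)

print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU has three quarters of the parameters of an LSTM with the same sizes, whatever the input and hidden dimensions are.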

Stacked LSTM

# Multiple LSTM layers for higher capacity
stacked_lstm = nn.LSTM(
    input_size=5,         # features per time step (example value)
    hidden_size=256,
    num_layers=3,         # 3 stacked LSTM layers
    batch_first=True,
    dropout=0.3           # Applied between layers (not after last layer)
)

LSTM vs Transformers for Sequences (2026)

Aspect	LSTM	Transformer
Long-range dependencies	Good	Excellent
Training speed	Slower (sequential over time steps)	Faster (parallel over time steps)
Inference (streaming)	Excellent (constant cost per step)	Slower (attention over a growing context)
Small datasets	Competitive	Needs more data
Time series	Strong	Competitive

LSTMs remain competitive for time series forecasting and for streaming inference where Transformer latency or memory cost is too high. For most NLP and long-document tasks, Transformers have taken over.
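
The streaming advantage comes from the fixed-size state: an LSTM can consume one time step at a time and carry (h, c) forward, so each new step costs the same no matter how long the history is. A minimal sketch of that pattern follows, using a small randomly initialized nn.LSTM and random stand-in data. The sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=5, hidden_size=16, batch_first=True)
lstm.eval()

x = torch.randn(1, 20, 5)   # one sequence of 20 steps, random stand-in data

with torch.no_grad():
    # Offline: process the whole sequence at once.
    full_out, _ = lstm(x)

    # Streaming: feed one step at a time, carrying (h, c) forward.
    state = None             # None means a zero initial state
    step_outputs = []
    for t in range(x.shape[1]):
        out_t, state = lstm(x[:, t:t + 1, :], state)
        step_outputs.append(out_t)
    stream_out = torch.cat(step_outputs, dim=1)

# Both routes should agree up to floating-point error.
print(torch.allclose(full_out, stream_out, atol=1e-5))
```

In a real streaming service the loop body would run once per incoming observation, with `state` kept between calls.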