Long Short-Term Memory (LSTM)
LSTMs address the core weakness of standard RNNs: plain RNNs struggle to learn long-range dependencies because gradients vanish as they are propagated back through many time steps. LSTMs use a gating mechanism to selectively remember and forget information, allowing them to learn dependencies spanning hundreds of time steps.
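To make the vanishing-gradient claim concrete, here is a minimal sketch. It assumes a toy scalar RNN, h_t = tanh(w * h_{t-1}), with a single recurrent weight `w` and no input; the function name `gradient_through_time` is made up for this illustration. The gradient of the last state with respect to the first is a product of per-step factors w * (1 - h_t^2), each smaller than 1 in magnitude when |w| < 1, so it shrinks geometrically with sequence length.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
    return grad

# The gradient reaching the first step decays as the sequence gets longer
for steps in (1, 10, 50, 100):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```

A real RNN replaces the scalar `w` with a weight matrix, but the same repeated multiplication is what starves early time steps of learning signal.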
The Cell State
The key innovation is that LSTMs maintain two state vectors instead of one:
- Cell state (cₜ): Long-term memory — flows with minimal modification, like a conveyor belt
- Hidden state (hₜ): Short-term working memory — used for output at each step

    Cell state:   c₀ ─── × ─── + ─── × ─── c₁
                         ↑     ↑     ↑
                     forget  input output
    Hidden state: h₀ ─────────────────────── h₁

The Three Gates
Each gate is a sigmoid-activated layer that outputs values between 0 and 1:
Forget Gate
“What should we forget from long-term memory?”

    fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)
    cₜ = fₜ × cₜ₋₁    (element-wise multiplication; this is the forget step only, before the input gate adds new information)

    fₜ = 0: completely forget
    fₜ = 1: completely remember

Input Gate
“What new information should we store?”

    iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi)       (how much to update)
    c̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc)    (candidate values)
    cₜ = fₜ × cₜ₋₁ + iₜ × c̃ₜ           (update cell state: retained memory plus gated candidate)

Output Gate
“What should we output from the updated cell state?”

    oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
    hₜ = oₜ × tanh(cₜ)

PyTorch LSTM
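Before relying on the built-in layer, the three gate equations can be sketched as a single time step in NumPy. This is a minimal illustration, not PyTorch's implementation: the function name `lstm_step`, the parameter dictionary layout, the sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; each W has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f = sigmoid(params["Wf"] @ z + params["bf"])         # forget gate
    i = sigmoid(params["Wi"] @ z + params["bi"])         # input gate
    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])   # candidate values
    o = sigmoid(params["Wo"] @ z + params["bo"])         # output gate
    c_t = f * c_prev + i * c_tilde                       # element-wise cell update
    h_t = o * np.tanh(c_t)                               # gated output
    return h_t, c_t

# Hypothetical sizes and randomly initialised (untrained) weights
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b{name}"] = np.zeros(hidden_size)

# Run a 5-step toy sequence, carrying both states forward
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Note that the cell state is only ever scaled by the forget gate and added to, which is what lets information and gradients pass through many steps with little distortion.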

    import torch
    import torch.nn as nn

    class LSTMForecaster(nn.Module):
        def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
            super().__init__()
            self.lstm = nn.LSTM(
                input_size=input_size,
                hidden_size=hidden_size,
                num_layers=num_layers,
                batch_first=True,
                dropout=dropout if num_layers > 1 else 0,
                bidirectional=False
            )
            self.fc = nn.Linear(hidden_size, output_size)

        def forward(self, x):
            # x shape: (batch, seq_len, input_size)
            lstm_out, (hidden, cell) = self.lstm(x)

            # Use last time step for forecasting
            last_out = lstm_out[:, -1, :]  # (batch, hidden_size)
            return self.fc(last_out)

    # Time series forecasting: 10 past steps → predict 1 step ahead
    model = LSTMForecaster(input_size=5, hidden_size=128, num_layers=2, output_size=1)

Time Series Forecasting Example

    import numpy as np

    def create_sequences(data, seq_len):
        """Slide a window of length seq_len over data; the target is the step that follows it."""
        X, y = [], []
        for i in range(len(data) - seq_len):
            X.append(data[i:i+seq_len])
            y.append(data[i+seq_len])
        return np.array(X), np.array(y)

    # data shape: (T, n_features)
    # Random placeholder so the snippet runs; replace with a real series
    data = np.random.randn(500, 5)
    X, y = create_sequences(data, seq_len=30)  # 30-step lookback window

    # Train/test split (time-based, no shuffling!)
    split = int(0.8 * len(X))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

GRU: Gated Recurrent Unit
GRU simplifies LSTM by merging the cell and hidden state into one and using only two gates (reset and update). It often matches LSTM performance with fewer parameters:

    import torch
    import torch.nn as nn

    gru = nn.GRU(
        input_size=10,
        hidden_size=128,
        num_layers=2,
        batch_first=True,
        dropout=0.2
    )

    x = torch.randn(32, 30, 10)  # placeholder batch: (batch, seq_len, input_size)

    # GRU outputs: (output, hidden) vs LSTM: (output, (hidden, cell))
    output, hidden = gru(x)

LSTM vs GRU rule of thumb: Start with GRU (fewer params, faster), switch to LSTM if performance is insufficient on long sequences.
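The "fewer params" part of this rule of thumb can be checked with a quick count. The sketch below assumes PyTorch's parameter layout for unidirectional layers (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors per gate block); the helper name `recurrent_param_count` is hypothetical.

```python
def recurrent_param_count(input_size, hidden_size, num_layers, gate_blocks):
    """Count weights and biases for a stacked, unidirectional recurrent layer.

    gate_blocks is 4 for an LSTM (forget, input, candidate, output)
    and 3 for a GRU (reset, update, candidate).
    """
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden state as input
        layer_input = input_size if layer == 0 else hidden_size
        weights = hidden_size * (layer_input + hidden_size)
        biases = 2 * hidden_size  # assumed: separate input and hidden bias vectors
        total += gate_blocks * (weights + biases)
    return total

# Same sizes as the GRU example: input_size=10, hidden_size=128, num_layers=2
lstm_params = recurrent_param_count(10, 128, 2, gate_blocks=4)
gru_params = recurrent_param_count(10, 128, 2, gate_blocks=3)
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU carries three quarters of the parameters of an LSTM with the same sizes, since it has three gate blocks instead of four.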
Stacked LSTM

    # Multiple LSTM layers for higher capacity
    stacked_lstm = nn.LSTM(
        input_size=input_size,
        hidden_size=256,
        num_layers=3,      # 3 stacked LSTM layers
        batch_first=True,
        dropout=0.3        # Applied between layers (not after last layer)
    )

LSTM vs Transformers for Sequences (2026)
| | LSTM | Transformer |
|---|---|---|
| Long-range dependencies | Good | Excellent |
| Training speed | Sequential | Parallelizable |
| Inference (streaming) | Excellent | Slower (self-attention) |
| Small datasets | Competitive | Needs more data |
| Time series | Strong | Competitive |
LSTMs remain competitive for time series forecasting and streaming inference where Transformers are too slow. For most NLP and long document tasks, Transformers have taken over.
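The streaming advantage comes from the fixed-size recurrent state: each new observation needs only the previous (hidden, cell) pair, not the full history. A minimal sketch, assuming an untrained `nn.LSTM` with sizes chosen purely for illustration, feeds a sequence one step at a time and checks that the result matches a single full-sequence pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=5, hidden_size=16, num_layers=1, batch_first=True)
lstm.eval()

x = torch.randn(1, 20, 5)  # (batch, seq_len, input_size)

with torch.no_grad():
    # Offline: process the whole sequence at once
    full_out, _ = lstm(x)

    # Streaming: feed one step at a time, carrying (hidden, cell) forward
    state = None  # None makes the layer start from zero states
    step_outputs = []
    for t in range(x.size(1)):
        out_t, state = lstm(x[:, t:t + 1, :], state)
        step_outputs.append(out_t)
    stream_out = torch.cat(step_outputs, dim=1)

# The two passes should agree up to floating-point tolerance
print(torch.allclose(full_out, stream_out, atol=1e-5))
```

Each streaming step costs the same regardless of how many steps came before, whereas self-attention has to attend over a growing context (or a cache of it).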