LSTM: Long Short-Term Memory Networks for Sequence Modeling

Understand LSTM networks — forget gate, input gate, output gate, cell state, GRU comparison, and practical use for time series forecasting and NLP tasks.

Long Short-Term Memory (LSTM)

Standard RNNs struggle to learn long-range dependencies because gradients vanish as they are propagated back through long sequences. LSTMs address this with a gating mechanism that selectively remembers and forgets information, which lets them learn dependencies spanning hundreds of time steps.
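As a rough illustration of why gradients vanish, backpropagation through time multiplies the gradient by roughly one factor per step. The sketch below treats that factor as a constant, which is a simplification; the values 0.9 and 0.99 are assumed for illustration and are not measurements from a real network.

```python
# Illustrative only: model the per-step gradient scaling as a constant factor.
def gradient_scale(per_step_factor: float, num_steps: int) -> float:
    """Overall gradient scaling after num_steps repeated multiplications."""
    return per_step_factor ** num_steps

# Plain RNN with an assumed per-step factor of 0.9
rnn_scale = gradient_scale(0.9, 100)
# LSTM cell-state path with an assumed forget gate value of 0.99
lstm_scale = gradient_scale(0.99, 100)

print(f"factor 0.90 over 100 steps: {rnn_scale:.2e}")
print(f"factor 0.99 over 100 steps: {lstm_scale:.2f}")
```

With these assumed numbers the first signal shrinks to roughly 2.7e-5 of its original size, while the second retains about 0.37. A forget gate close to 1 is what keeps the cell-state path in the second regime.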


The Cell State

The key innovation is that LSTMs maintain two state vectors instead of one:

  • Cell state (cₜ): long-term memory that flows with minimal modification, like a conveyor belt
  • Hidden state (hₜ): short-term working memory, used for output at each step

Cell state:    c₀ ──── × ──── + ────┬────────── c₁
                       ↑      ↑     │
                    forget  input  tanh
                                    │
Hidden state:  h₀ (feeds the gates) × ───────── h₁
                                    ↑
                                 output
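In PyTorch, both state vectors are returned by nn.LSTM alongside the per-step outputs. The following sketch shows their shapes; the sizes (batch of 4, 10 steps, 3 features, 8 hidden units) are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes are arbitrary illustration values
lstm = nn.LSTM(input_size=3, hidden_size=8, batch_first=True)
x = torch.randn(4, 10, 3)  # (batch, seq_len, input_size)

output, (h_n, c_n) = lstm(x)

print(output.shape)  # hidden state at every time step: (4, 10, 8)
print(h_n.shape)     # final hidden state: (num_layers=1, 4, 8)
print(c_n.shape)     # final cell state:   (num_layers=1, 4, 8)

# The hidden state is a gate value times tanh(cell state),
# so its entries stay inside (-1, 1); the cell state has no such bound.
hidden_is_bounded = h_n.abs().max().item() < 1.0
print(hidden_is_bounded)
```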

The Three Gates

Each gate is a sigmoid-activated layer that outputs values between 0 and 1:

Forget Gate

“What should we forget from long-term memory?”

fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)
fₜ × cₜ₋₁ (element-wise multiplication: the part of the old cell state that is kept)
fₜ = 0: completely forget
fₜ = 1: completely remember

Input Gate

“What new information should we store?”

iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi) (how much to update)
c̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc) (candidate values)
cₜ = fₜ × cₜ₋₁ + iₜ × c̃ₜ (update cell state)

Output Gate

“What should we output from the updated cell state?”

oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
hₜ = oₜ × tanh(cₜ)
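The gate equations above can be sketched as a single time step in NumPy. This is a minimal illustration, not PyTorch's implementation: the function name lstm_step, the dictionary layout of the weights, and the sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (hypothetical helper for illustration).

    W maps gate name -> weight matrix of shape (hidden, hidden + input)
    b maps gate name -> bias vector of shape (hidden,)
    """
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])   # candidate values
    o_t = sigmoid(W["o"] @ concat + b["o"])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde            # new cell state
    h_t = o_t * np.tanh(c_t)                      # new hidden state
    return h_t, c_t

hidden_size, input_size = 4, 3   # assumed sizes
rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
     for g in "fico"}
b = {g: np.zeros(hidden_size) for g in "fico"}

# Run 5 time steps of random input, carrying (h, c) forward
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

PyTorch stores separate input-to-hidden and hidden-to-hidden weight matrices with two bias vectors rather than one concatenated matrix per gate, but the computation has the same form.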

PyTorch LSTM

import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            # nn.LSTM only applies dropout between layers, so disable it for a single layer
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        lstm_out, (hidden, cell) = self.lstm(x)
        # Use last time step for forecasting
        last_out = lstm_out[:, -1, :]  # (batch, hidden_size)
        return self.fc(last_out)

# Time series forecasting: 5 features per time step → predict 1 value one step ahead
model = LSTMForecaster(input_size=5, hidden_size=128, num_layers=2, output_size=1)

Time Series Forecasting Example

import numpy as np

def create_sequences(data, seq_len):
    """Slide a window over data: X[i] holds seq_len steps, y[i] is the step that follows."""
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+seq_len])
    return np.array(X), np.array(y)

# data: NumPy array of shape (T, n_features), assumed to be loaded already
X, y = create_sequences(data, seq_len=30)  # 30-step lookback window
# X shape: (T - 30, 30, n_features); y shape: (T - 30, n_features)

# Train/test split (time-based, no shuffling!)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
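To show how the windowing and the time-based split feed into training, here is a minimal end-to-end sketch. Everything specific in it is an assumption for illustration: the noisy sine wave stands in for real data, SmallForecaster is a hypothetical compact model, and the hyperparameters (16 hidden units, 20 full-batch steps, learning rate 0.01) are not tuned.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for real data: a noisy sine wave with one feature
t = np.arange(300, dtype=np.float32)
noise = np.random.default_rng(0).normal(scale=0.05, size=300)
data = (np.sin(0.1 * t) + noise).astype(np.float32).reshape(-1, 1)  # (T, 1)

def create_sequences(data, seq_len):
    n = len(data) - seq_len
    X = np.stack([data[i:i + seq_len] for i in range(n)])
    y = np.stack([data[i + seq_len] for i in range(n)])
    return X, y

X, y = create_sequences(data, seq_len=30)   # X: (270, 30, 1), y: (270, 1)
split = int(0.8 * len(X))                   # time-based split, no shuffling
X_train, y_train = torch.from_numpy(X[:split]), torch.from_numpy(y[:split])
X_test, y_test = torch.from_numpy(X[split:]), torch.from_numpy(y[split:])

class SmallForecaster(nn.Module):  # hypothetical compact model
    def __init__(self, input_size=1, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])   # predict from the last time step

model = SmallForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(20):                 # full-batch training steps
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    test_loss = loss_fn(model(X_test), y_test).item()
print(f"train MSE: {loss.item():.4f}, test MSE: {test_loss:.4f}")
```

The sine wave already lies in a small range, so no scaling is applied here. With real data, features are usually normalized using statistics computed on the training portion only, so that no information from the test period leaks into training.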

GRU: Gated Recurrent Unit

GRU simplifies LSTM by merging the cell and hidden state into one and using only two gates (reset and update). It often matches LSTM performance with fewer parameters:

gru = nn.GRU(
    input_size=10,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    dropout=0.2
)
# x shape: (batch, seq_len, 10)
# GRU outputs: (output, hidden) vs LSTM: (output, (hidden, cell))
output, hidden = gru(x)

LSTM vs GRU rule of thumb: Start with GRU (fewer params, faster), switch to LSTM if performance is insufficient on long sequences.
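The "fewer params" claim can be checked directly by counting parameters. The sketch below uses the same sizes as the GRU example (10 input features, 128 hidden units, 2 layers); count_params is a hypothetical helper written for this illustration.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=10, hidden_size=128, num_layers=2, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=128, num_layers=2, batch_first=True)

lstm_params = count_params(lstm)
gru_params = count_params(gru)

print(f"LSTM: {lstm_params:,}  GRU: {gru_params:,}  ratio: {gru_params / lstm_params:.2f}")
```

An LSTM layer holds four sets of weights (three gates plus the candidate) while a GRU layer holds three, so for matching sizes the GRU has three quarters of the parameters.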


Stacked LSTM

# Multiple LSTM layers for higher capacity
stacked_lstm = nn.LSTM(
    input_size=input_size,  # number of features per time step
    hidden_size=256,
    num_layers=3,           # 3 stacked LSTM layers
    batch_first=True,
    dropout=0.3             # Applied between layers (not after last layer)
)
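With stacked layers, the returned output covers only the top layer, while the final states contain one entry per layer. The sketch below checks this on a small stacked LSTM; the sizes (6 features, 32 hidden units, batch of 2, 15 steps) are assumed values chosen to keep the example fast.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

stacked = nn.LSTM(input_size=6, hidden_size=32, num_layers=3,
                  batch_first=True, dropout=0.3)
stacked.eval()  # turn off inter-layer dropout for a deterministic check

x = torch.randn(2, 15, 6)  # (batch, seq_len, input_size)
with torch.no_grad():
    output, (h_n, c_n) = stacked(x)

print(output.shape)  # (2, 15, 32): hidden states of the top layer only
print(h_n.shape)     # (3, 2, 32): final hidden state of each layer
print(c_n.shape)     # (3, 2, 32): final cell state of each layer

# The last time step of the output is the top layer's final hidden state
top_matches = torch.allclose(output[:, -1, :], h_n[-1])
print(top_matches)
```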

LSTM vs Transformers for Sequences (2026)

Aspect                     LSTM          Transformer
-------------------------  ------------  -----------------------
Long-range dependencies    Good          Excellent
Training speed             Sequential    Parallelizable
Inference (streaming)      Excellent     Slower (self-attention)
Small datasets             Competitive   Needs more data
Time series                Strong        Competitive

LSTMs remain competitive for time series forecasting and for streaming inference where Transformers can be too slow. For most NLP and long-document tasks, Transformers have taken over.
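The streaming advantage comes from the fixed-size state: an LSTM can consume one time step at a time, carrying only (h, c) forward, so the per-step cost does not grow with the length of the history. The sketch below feeds a sequence step by step and compares the result with processing the whole sequence at once; the sizes are assumed values for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=4, hidden_size=16, batch_first=True)
lstm.eval()

sequence = torch.randn(1, 50, 4)  # (batch, seq_len, input_size)

with torch.no_grad():
    # Offline: process the whole sequence in one call
    full_out, _ = lstm(sequence)

    # Streaming: one step at a time, carrying the (h, c) state forward
    state = None  # None means start from zero states
    stream_outs = []
    for t in range(sequence.size(1)):
        step = sequence[:, t:t + 1, :]   # (1, 1, 4)
        out, state = lstm(step, state)
        stream_outs.append(out)
    stream_out = torch.cat(stream_outs, dim=1)

max_diff = (full_out - stream_out).abs().max().item()
print(f"max difference between full and streaming outputs: {max_diff:.2e}")
```

Both paths produce the same outputs up to floating-point error, so a deployed model only needs to keep the latest state between incoming samples.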