LSTM and GRU Explained: Solving the RNN Long-Range Dependency Problem
Plain RNNs, covered in Recurrent Neural Networks, lose long-range information because their hidden state gets repeatedly overwritten and multiplicatively degraded at every time step. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) solve this with a specific architectural mechanism — gates — that let the network learn what to remember, what to forget, and what to output, rather than being forced to blend everything together the same way at every step.
The Core Innovation: Gates as Learned, Selective Filters
A gate in an LSTM or GRU is a small neural network component (a sigmoid layer) that outputs values between 0 and 1, acting as a learned “how much to let through” filter. A gate output near 0 blocks information, while an output near 1 lets it pass freely. The gate’s weights are trained by backpropagation like any other weights in the network, so the network learns the right gate behavior for a given context.
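As a rough numeric illustration of that filtering behavior, the snippet below passes a signal through a sigmoid gate. It is only a sketch: the pre-activation values are hand-picked rather than learned, and the names `apply_gate` and `pre_activations` are made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apply_gate(signal, pre_activations):
    """Scale each signal element by a sigmoid gate value in (0, 1)."""
    return [s * sigmoid(a) for s, a in zip(signal, pre_activations)]

signal = [2.0, 2.0, 2.0]
# Hypothetical pre-activations: strongly negative, zero, strongly positive.
pre_activations = [-10.0, 0.0, 10.0]

filtered = apply_gate(signal, pre_activations)
print([round(v, 3) for v in filtered])  # [0.0, 1.0, 2.0]
```

The same input value is almost fully blocked, halved, or almost fully passed depending only on the gate's pre-activation, which in a trained network is computed from the current input and the previous hidden state.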
    def gate(x, h_prev, W_x, W_h, b):
        return sigmoid(x @ W_x + h_prev @ W_h + b)  # output between 0 and 1

LSTM: Three Gates and a Separate Cell State
LSTM introduces a separate “cell state” — a kind of conveyor belt that can carry information across many time steps largely unchanged — alongside the regular hidden state, controlled by three distinct gates.
    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, weights):
        combined = np.concatenate([x_t, h_prev])

        forget_gate = sigmoid(combined @ weights["W_f"])  # what to forget from cell state
        input_gate = sigmoid(combined @ weights["W_i"])   # what new info to add
        candidate = np.tanh(combined @ weights["W_c"])    # candidate new information
        output_gate = sigmoid(combined @ weights["W_o"])  # what to output based on cell state

        c_t = forget_gate * c_prev + input_gate * candidate  # update cell state
        h_t = output_gate * np.tanh(c_t)                     # compute new hidden state

        return h_t, c_t

Forget gate: decides what to discard from the existing cell state. This is crucial for letting the network deliberately drop irrelevant older information rather than being forced to retain everything.
Input gate: decides what new information from the current input should be added to the cell state.
Output gate: decides what part of the (updated) cell state should actually be exposed as the hidden state output for this time step.
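To see the three gate roles in isolation, here is a minimal scalar sketch of the same update with the gate values set by hand. This is an idealization: real sigmoid gates never reach exactly 0 or 1, and `lstm_cell_update` and its arguments are hypothetical names introduced only for this example.

```python
import math

def lstm_cell_update(c_prev, forget_gate, input_gate, candidate, output_gate):
    """Scalar version of the cell-state and hidden-state updates."""
    c_t = forget_gate * c_prev + input_gate * candidate
    h_t = output_gate * math.tanh(c_t)
    return h_t, c_t

# Case 1: forget gate open, input gate closed, output gate closed.
# The cell state is carried unchanged for 50 steps and nothing is exposed.
c = 0.8
for _ in range(50):
    h, c = lstm_cell_update(c, forget_gate=1.0, input_gate=0.0,
                            candidate=-0.5, output_gate=0.0)
carried_c, hidden_while_closed = c, h

# Case 2: forget gate closed, input gate open, output gate open.
# The old value is dropped, the candidate is written, and tanh(c_t) is exposed.
h, c = lstm_cell_update(carried_c, forget_gate=0.0, input_gate=1.0,
                        candidate=-0.5, output_gate=1.0)
replaced_c, exposed_h = c, h

print(carried_c, hidden_while_closed, replaced_c)  # 0.8 0.0 -0.5
```

In the first case the cell state behaves like the conveyor belt described above; in the second, a single step is enough to overwrite it.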
The cell state’s update (forget_gate * c_prev + input_gate * candidate) is additive: the previous cell state is scaled elementwise by the forget gate and then added to the new contribution, instead of being pushed through a weight matrix and a squashing nonlinearity at every step. This additive path is what allows gradients to flow backward across many time steps without vanishing nearly as severely as in a plain RNN.
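The effect on gradients can be sketched numerically. The comparison below is a simplification under stated assumptions: it follows only the direct path from one cell state to the next (where the gradient is the product of the forget gates), treats the gates as constants, and uses a scalar tanh RNN with an assumed weight of 0.9 as the baseline. The function names and the chosen values are illustrative.

```python
import math

def cell_path_gradient(forget_gates):
    """d c_T / d c_0 along the direct cell-state path: the product of the forget gates."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

def plain_rnn_gradient(w, h0, steps):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule: w * tanh'(w * h_prev)
    return grad

steps = 100
lstm_grad = cell_path_gradient([0.99] * steps)           # assumed forget gates near 1
rnn_grad = plain_rnn_gradient(w=0.9, h0=0.5, steps=steps)

print(f"cell-state path: {lstm_grad:.3f}")
print(f"plain RNN:       {rnn_grad:.2e}")
```

With forget gates held near 1, roughly a third of the gradient survives 100 steps along the cell-state path, while the plain RNN's gradient shrinks by several orders of magnitude. A trained LSTM also has indirect paths through the gates and the hidden state, which this sketch ignores.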
    import torch.nn as nn

    lstm_layer = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

GRU: A Simpler Alternative With Fewer Gates
GRU simplifies LSTM’s design, using just two gates and merging the cell state and hidden state into a single state. The result has fewer parameters and is computationally cheaper, while retaining most of LSTM’s practical benefit for mitigating long-range gradient decay.
    def gru_step(x_t, h_prev, weights):
        combined = np.concatenate([x_t, h_prev])

        update_gate = sigmoid(combined @ weights["W_z"])  # how much of the state to replace with the candidate
        reset_gate = sigmoid(combined @ weights["W_r"])   # how much of the past state feeds into the candidate

        combined_reset = np.concatenate([x_t, reset_gate * h_prev])
        candidate = np.tanh(combined_reset @ weights["W_h"])

        # update_gate near 0 keeps the old state; near 1 replaces it with the candidate
        h_t = (1 - update_gate) * h_prev + update_gate * candidate
        return h_t

    gru_layer = nn.GRU(input_size=128, hidden_size=256, batch_first=True)

LSTM vs. GRU: Which to Choose
| | LSTM | GRU |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (update, reset) |
| Separate cell state | Yes | No — merged with hidden state |
| Parameter count | Higher | Lower (roughly 25% fewer) |
| Training speed | Slower | Faster |
| Typical performance | Slightly better on some tasks with very long sequences | Often comparable, faster to train |
In practice, the performance difference between the two is usually small and task-dependent — GRU is often a reasonable default for faster experimentation given its lower parameter count, while LSTM remains a strong choice when maximum capacity for capturing long-range dependencies is worth the added computational cost.
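The “roughly 25% fewer” parameter figure can be checked with a short count. This sketch assumes a single layer in which every gate or candidate block has one weight matrix over the concatenated input and hidden state plus one bias vector; libraries may store biases differently, so exact totals can vary slightly. The helper name `gated_rnn_params` is made up for this example.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Weights over [x_t, h_prev] plus one bias vector, for each gate/candidate block."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

# LSTM has four blocks: forget, input, output, candidate.
lstm_params = gated_rnn_params(128, 256, num_blocks=4)
# GRU has three blocks: update, reset, candidate.
gru_params = gated_rnn_params(128, 256, num_blocks=3)

print(lstm_params, gru_params, gru_params / lstm_params)  # 394240 295680 0.75
```

Because both architectures use blocks of the same shape, the ratio is 3/4 regardless of the chosen input and hidden sizes.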
Where LSTM/GRU Still Matter Today
Transformers, covered in Transformers, have largely replaced LSTM/GRU for large-scale language modeling and many sequence tasks, since self-attention handles long-range dependencies even more directly and parallelizes far better across modern hardware. LSTM and GRU remain genuinely useful for:
- Smaller-scale sequence tasks where a transformer’s computational overhead isn’t justified.
- Time-series forecasting, where LSTM/GRU-based architectures remain common and competitive.
- Resource-constrained or streaming/online settings, where processing one time step at a time with a small, fixed hidden state (rather than attending over an entire sequence at once) has genuine practical advantages.
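The streaming case in the last bullet can be sketched with a scalar GRU that consumes readings one at a time. This is a toy under stated assumptions: the weights are untrained, hand-picked values, biases are omitted, and `scalar_gru_step`, `stream_states`, and the weight keys are hypothetical names for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scalar_gru_step(x_t, h_prev, w):
    """One GRU step with a scalar input and a scalar state (biases omitted)."""
    z = sigmoid(w["z_x"] * x_t + w["z_h"] * h_prev)  # update gate
    r = sigmoid(w["r_x"] * x_t + w["r_h"] * h_prev)  # reset gate
    candidate = math.tanh(w["c_x"] * x_t + w["c_h"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * candidate

def stream_states(readings, w, h0=0.0):
    """Yield the state after each reading; only one float is carried between steps."""
    h = h0
    for x_t in readings:
        h = scalar_gru_step(x_t, h, w)
        yield h

# Hypothetical, untrained weights.
weights = {"z_x": 1.0, "z_h": 0.5, "r_x": -0.5, "r_h": 1.0, "c_x": 2.0, "c_h": 1.5}

# An iterator stands in for a live source: past readings are never revisited.
sensor_stream = iter([0.1, 0.4, -0.2, 0.9])
states = list(stream_states(sensor_stream, weights))
print(len(states))  # 4
```

Memory use stays constant no matter how long the stream runs, since each step needs only the current reading and the previous state.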
Bidirectional Variants: Reading Sequences in Both Directions
Both LSTM and GRU have bidirectional variants, which process a sequence in both the forward and backward direction simultaneously, then combine the two resulting representations. This matters for tasks where the full sequence is available upfront (unlike real-time generation, where future tokens genuinely aren’t known yet) — for tasks like named entity recognition or sentiment classification of a complete sentence, knowing what comes after a given word can be just as informative as knowing what came before it.
    import torch.nn as nn

    bidirectional_lstm = nn.LSTM(input_size=128, hidden_size=256, bidirectional=True, batch_first=True)
    # Output hidden size effectively doubles, since forward and backward
    # representations are concatenated together at each position

This bidirectional processing is only appropriate when the entire sequence is available before generating any output. It’s not compatible with autoregressive, left-to-right generation of the kind covered in Sequence-to-Sequence Models and Large Language Models, where future tokens are, by definition, not yet known at generation time.
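The mechanics of combining the two directions can be sketched without any library. The example below is an illustration under stated assumptions, not how a framework implements it: a scalar tanh RNN stands in for the LSTM or GRU, the weights are arbitrary, and `run_rnn` and `run_bidirectional` are made-up names.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Plain scalar tanh RNN, used here as a stand-in for an LSTM/GRU layer."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def run_bidirectional(inputs, fwd_weights, bwd_weights):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs, *fwd_weights)
    # Process the reversed sequence, then flip back so index i lines up with input i.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))

fwd_weights = (0.5, 0.8)  # arbitrary (w_x, w_h) for the forward pass
bwd_weights = (0.3, 0.6)  # separate weights for the backward pass

outputs = run_bidirectional([1.0, -2.0, 0.5], fwd_weights, bwd_weights)
print(len(outputs), len(outputs[0]))  # 3 2
```

Each position ends up with two values, which is why the output feature size doubles. The backward value at the first position depends on every later input, which is exactly the information that is unavailable during left-to-right generation.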
Summary
| Concept | Purpose |
|---|---|
| Gates | Learned, selective filters controlling what information flows through |
| Cell state (LSTM specifically) | A more stable, largely additive pathway for long-range information |
| GRU simplification | Fewer parameters, comparable performance on many tasks |
LSTM and GRU didn’t just add complexity to RNNs arbitrarily — every gate exists to solve the exact, specific long-range dependency problem described in Recurrent Neural Networks, giving the network explicit, learned control over what to remember and what to forget.