LSTM and GRU Explained: Solving the RNN Long-Range Dependency Problem

How LSTM and GRU gates work internally, why they preserve long-range information better than plain RNNs, and when to choose one over the other.

Plain RNNs, covered in Recurrent Neural Networks, lose long-range information because their hidden state gets repeatedly overwritten and multiplicatively degraded at every time step. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) solve this with a specific architectural mechanism — gates — that let the network learn what to remember, what to forget, and what to output, rather than being forced to blend everything together the same way at every step.


The Core Innovation: Gates as Learned, Selective Filters

A gate in an LSTM or GRU is a small neural network component (a sigmoid layer) that outputs values between 0 and 1, acting as a learned “how much to let through” filter. A gate output near 0 blocks information, and an output near 1 lets it pass freely. The gate's weights are trained by backpropagation like any other weights in the network, which is how the network learns the right gate behavior for a given context.

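def gate(x, h_prev, W_x, W_h, b):
    # sigmoid is the logistic function 1 / (1 + exp(-z)); it is defined in the LSTM code below
    return sigmoid(x @ W_x + h_prev @ W_h + b)  # every output lies between 0 and 1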
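To make the filtering behavior concrete, here is a minimal NumPy sketch. The pre-activation values and the `signal` vector are made up for illustration; in a real network the pre-activations would come from learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical pre-activations for a 3-unit gate (illustrative values only).
pre_activation = np.array([-6.0, 0.0, 6.0])
gate_values = sigmoid(pre_activation)  # roughly [0.002, 0.5, 0.998]

# The information being filtered, one value per unit.
signal = np.array([2.0, 2.0, 2.0])
filtered = gate_values * signal  # elementwise: mostly blocked, halved, mostly passed

print(np.round(gate_values, 3))
print(np.round(filtered, 3))
```

The gate never makes a hard yes/no decision. Because the sigmoid is smooth, each unit is scaled by a continuous amount, which keeps the whole operation differentiable and trainable.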

LSTM: Three Gates and a Separate Cell State

LSTM introduces a separate “cell state” — a kind of conveyor belt that can carry information across many time steps largely unchanged — alongside the regular hidden state, controlled by three distinct gates.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, weights):
    # Each weight matrix has shape (input_size + hidden_size, hidden_size); biases are omitted for brevity
    combined = np.concatenate([x_t, h_prev])
    forget_gate = sigmoid(combined @ weights["W_f"])  # what to forget from cell state
    input_gate = sigmoid(combined @ weights["W_i"])   # what new info to add
    candidate = np.tanh(combined @ weights["W_c"])    # candidate new information
    output_gate = sigmoid(combined @ weights["W_o"])  # what to output based on cell state
    c_t = forget_gate * c_prev + input_gate * candidate  # update cell state
    h_t = output_gate * np.tanh(c_t)                     # compute new hidden state
    return h_t, c_t

Forget gate: decides what to discard from the existing cell state — crucial for letting the network deliberately drop irrelevant older information rather than being forced to retain everything.

Input gate: decides what new information from the current input should be added to the cell state.

Output gate: decides what part of the (updated) cell state should actually be exposed as the hidden state output for this time step.

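The cell state’s update (forget_gate * c_prev + input_gate * candidate) is additive: the previous cell state is scaled elementwise by the forget gate and then added to the new information, instead of being pushed through a weight matrix and a squashing nonlinearity at every step as in a plain RNN. Along this direct path, the gradient with respect to the previous cell state is scaled only by the forget gate. When the forget gate stays close to 1, gradients can flow backward across many time steps without vanishing nearly as severely as in a plain RNN.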
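A rough numerical illustration of that difference follows. It is a scalar toy comparison, not a measurement from a trained model: the forget-gate value, the recurrent weight `w`, and the initial state are all assumed values chosen for illustration, and only the direct cell-state path of the LSTM is considered.

```python
import numpy as np

T = 100  # number of time steps to backpropagate through

# LSTM cell-state path: d c_t / d c_{t-1} = forget_gate (direct path only).
# Assumed constant forget-gate value; a trained network would produce varying values.
forget_gate = 0.99
lstm_grad = forget_gate ** T

# Plain scalar RNN: h_t = tanh(w * h_{t-1}), so d h_t / d h_{t-1} = w * (1 - h_t**2).
w = 0.9   # assumed recurrent weight
h = 0.5   # assumed initial hidden state
rnn_grad = 1.0
for _ in range(T):
    h = np.tanh(w * h)
    rnn_grad *= w * (1.0 - h ** 2)

print(f"LSTM cell-state path: {lstm_grad:.3f}")
print(f"plain RNN path:       {rnn_grad:.3e}")
```

With these values the LSTM path keeps roughly a third of the gradient after 100 steps, while the plain RNN product shrinks by several orders of magnitude. The key point is that the LSTM can learn to hold its forget gate near 1 for information worth keeping.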

import torch.nn as nn
lstm_layer = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
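As a usage sketch for the layer above, the following passes a random batch through it and inspects the returned shapes. The batch size and sequence length are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm_layer = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

# Hypothetical batch: 4 sequences, 10 time steps, 128 features per step.
x = torch.randn(4, 10, 128)
output, (h_n, c_n) = lstm_layer(x)

print(output.shape)  # hidden state at every time step: (batch, seq_len, hidden_size)
print(h_n.shape)     # final hidden state: (num_layers, batch, hidden_size)
print(c_n.shape)     # final cell state:   (num_layers, batch, hidden_size)

# For a single-layer, unidirectional LSTM, the last output step is the final hidden state.
assert torch.allclose(output[:, -1, :], h_n[0])
```

Note that the layer returns both the hidden state and the cell state, mirroring the `h_t, c_t` pair returned by `lstm_step`.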

GRU: A Simpler Alternative With Fewer Gates

GRU simplifies LSTM’s design: it uses just two gates and merges the cell state and hidden state into a single state. The result has fewer parameters and is computationally cheaper, while retaining most of LSTM’s practical benefit for mitigating long-range gradient decay.

def gru_step(x_t, h_prev, weights):
    combined = np.concatenate([x_t, h_prev])
    update_gate = sigmoid(combined @ weights["W_z"])  # how much of the state to replace with the candidate
    reset_gate = sigmoid(combined @ weights["W_r"])   # how much past info to ignore for the candidate
    combined_reset = np.concatenate([x_t, reset_gate * h_prev])
    candidate = np.tanh(combined_reset @ weights["W_h"])
    # update_gate near 0 keeps the old state; near 1 overwrites it with the candidate
    h_t = (1 - update_gate) * h_prev + update_gate * candidate
    return h_t

# PyTorch's built-in GRU layer (nn is torch.nn, imported earlier)
gru_layer = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
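The last line of `gru_step` is an elementwise interpolation between the old state and the candidate. The sketch below isolates that blend for three update-gate values. The helper name `blend` and the state values are made up for illustration. Some formulations, including the one documented for PyTorch's `nn.GRU`, swap the roles of `z` and `1 - z`, which is the same mechanism under a different naming convention.

```python
import numpy as np

def blend(update_gate, h_prev, candidate):
    # Same interpolation as the final line of gru_step.
    return (1 - update_gate) * h_prev + update_gate * candidate

h_prev = np.array([1.0, -1.0])    # assumed previous hidden state
candidate = np.array([0.0, 0.5])  # assumed candidate state

for z in (0.0, 0.5, 1.0):
    print(z, blend(z, h_prev, candidate))
```

With `z = 0.0` the old state is copied through unchanged, with `z = 1.0` it is fully replaced by the candidate, and values in between mix the two. Copying the state through unchanged is what lets a GRU carry information across many steps.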

LSTM vs. GRU: Which to Choose

| | LSTM | GRU |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (update, reset) |
| Separate cell state | Yes | No (merged with hidden state) |
| Parameter count | Higher | Lower (roughly 25% fewer) |
| Training speed | Slower | Faster |
| Typical performance | Slightly better on some tasks with very long sequences | Often comparable, faster to train |

In practice, the performance difference between the two is usually small and task-dependent — GRU is often a reasonable default for faster experimentation given its lower parameter count, while LSTM remains a strong choice when maximum capacity for capturing long-range dependencies is worth the added computational cost.
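The "roughly 25% fewer" figure follows from counting weight blocks: an LSTM has four (three gates plus the candidate) and a GRU has three (two gates plus the candidate). The sketch below does the arithmetic for the layer sizes used earlier. The helper `recurrent_param_count` is hypothetical, and it assumes one weight matrix over the concatenated input and hidden state plus one bias vector per block. Real frameworks may store biases differently (PyTorch, for example, keeps two bias vectors per block), which changes the totals but not the 4:3 ratio.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Parameters for num_blocks blocks, each with one weight matrix and one bias (assumed layout)."""
    weights = (input_size + hidden_size) * hidden_size
    biases = hidden_size
    return num_blocks * (weights + biases)

# LSTM: forget, input, and output gates plus the candidate -> 4 blocks.
lstm_params = recurrent_param_count(128, 256, 4)
# GRU: update and reset gates plus the candidate -> 3 blocks.
gru_params = recurrent_param_count(128, 256, 3)

savings = 1 - gru_params / lstm_params
print(lstm_params, gru_params, f"{savings:.0%} fewer")
```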


Where LSTM/GRU Still Matter Today

Transformers, covered in Transformers, have largely replaced LSTM/GRU for large-scale language modeling and many sequence tasks, since self-attention handles long-range dependencies even more directly and parallelizes far better across modern hardware. LSTM and GRU remain genuinely useful for:

  • Smaller-scale sequence tasks where a transformer’s computational overhead isn’t justified.
  • Time-series forecasting, where LSTM/GRU-based architectures remain common and competitive.
  • Resource-constrained or streaming/online settings, where processing one time step at a time with a small, fixed hidden state (rather than attending over an entire sequence at once) has genuine practical advantages.
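The streaming case in the last bullet can be sketched as follows. This is a minimal illustration with random, untrained weights: `sensor_stream` is a hypothetical stand-in for a live data feed, and the GRU step repeats the update rule shown earlier with biases omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, weights):
    # Same update rule as the GRU step shown earlier (biases omitted).
    combined = np.concatenate([x_t, h_prev])
    z = sigmoid(combined @ weights["W_z"])
    r = sigmoid(combined @ weights["W_r"])
    candidate = np.tanh(np.concatenate([x_t, r * h_prev]) @ weights["W_h"])
    return (1 - z) * h_prev + z * candidate

def sensor_stream(num_steps, input_size, rng):
    # Hypothetical data source standing in for a live feed.
    for _ in range(num_steps):
        yield rng.standard_normal(input_size)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 8
# Random, untrained weights, for illustration only.
weights = {
    name: rng.standard_normal((input_size + hidden_size, hidden_size)) * 0.1
    for name in ("W_z", "W_r", "W_h")
}

h = np.zeros(hidden_size)
for x_t in sensor_stream(1000, input_size, rng):
    # Only the current input and the fixed-size state h are held in memory.
    h = gru_step(x_t, h, weights)

print(h.shape)
```

However long the stream runs, the state stays at `hidden_size` numbers and each step costs the same amount of work. That is the practical advantage over attending to a growing history.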

Bidirectional Variants: Reading Sequences in Both Directions

Both LSTM and GRU have bidirectional variants, which process a sequence in both the forward and backward direction simultaneously, then combine the two resulting representations. This matters for tasks where the full sequence is available upfront (unlike real-time generation, where future tokens genuinely aren’t known yet) — for tasks like named entity recognition or sentiment classification of a complete sentence, knowing what comes after a given word can be just as informative as knowing what came before it.

import torch.nn as nn
bidirectional_lstm = nn.LSTM(input_size=128, hidden_size=256, bidirectional=True, batch_first=True)
# The output feature size doubles to 2 * hidden_size (512 here), since the forward
# and backward representations are concatenated together at each position

This bidirectional processing is only appropriate when the entire sequence is available before generating any output — it’s not compatible with autoregressive, left-to-right generation of the kind covered in Sequence-to-Sequence Models and Large Language Models, where future tokens are, by definition, not yet known at generation time.
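To show what "combine the two resulting representations" means mechanically, here is a small NumPy sketch. It uses a toy tanh recurrence (`rnn_step`) as a stand-in for an LSTM or GRU step, with random untrained weights; the helper names and sizes are assumptions for illustration, not PyTorch's internal implementation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W):
    # Toy tanh recurrence standing in for an LSTM or GRU step.
    return np.tanh(np.concatenate([x_t, h_prev]) @ W)

def run_direction(sequence, W, hidden_size):
    # Run the recurrence over the sequence in the order given, keeping every state.
    h = np.zeros(hidden_size)
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 5, 4, 3
sequence = rng.standard_normal((seq_len, input_size))
W_fwd = rng.standard_normal((input_size + hidden_size, hidden_size)) * 0.5
W_bwd = rng.standard_normal((input_size + hidden_size, hidden_size)) * 0.5

forward_states = run_direction(sequence, W_fwd, hidden_size)
# Run on the reversed sequence, then flip back so index t lines up with token t.
backward_states = run_direction(sequence[::-1], W_bwd, hidden_size)[::-1]

# Concatenate per position: each token gets left context and right context.
combined = np.concatenate([forward_states, backward_states], axis=1)
print(combined.shape)  # (seq_len, 2 * hidden_size)
```

The backward pass cannot start until the last token exists, which is why this setup does not fit left-to-right generation.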

Summary

| Concept | Purpose |
|---|---|
| Gates | Learned, selective filters controlling what information flows through |
| Cell state (LSTM specifically) | A more stable, largely-additive pathway for long-range information |
| GRU simplification | Fewer parameters, comparable performance on many tasks |

LSTM and GRU didn’t just add complexity to RNNs arbitrarily — every gate exists to solve the exact, specific long-range dependency problem described in Recurrent Neural Networks, giving the network explicit, learned control over what to remember and what to forget.