Deep Learning Optimizers Compared: SGD, Momentum, RMSProp, Adam, and AdamW

Plain gradient descent, covered in Gradient Descent, uses one fixed learning rate applied identically to every parameter, every step. Modern optimizers improve on this by adapting how updates are computed — accumulating momentum, scaling per-parameter learning rates based on gradient history, or both at once. Understanding what each one actually adds is what turns “just use Adam” from cargo-culted advice into an informed, occasionally-overridden default.

SGD: The Baseline

Plain stochastic gradient descent updates weights using only the current gradient, scaled by a fixed learning rate.

weight = weight - learning_rate * gradient

This update is simple and predictable, but slow on landscapes with steep curvature in some directions and shallow curvature in others. That shape is common in real loss landscapes, and it makes plain SGD zigzag across the steep direction instead of moving directly toward the minimum.
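The zigzag is easy to reproduce on a toy problem. The sketch below is an illustration, not a benchmark: the quadratic loss, the starting point, and the learning rate of 0.019 are all assumptions chosen to make the effect visible.

```python
def gradient(x, y):
    # Gradient of the toy loss f(x, y) = 0.5 * (100 * x**2 + y**2):
    # curvature is 100 along x (steep) and 1 along y (shallow).
    return 100.0 * x, y


def run_sgd(learning_rate, num_steps, start=(1.0, 1.0)):
    """Apply the plain SGD update and return every visited point."""
    x, y = start
    path = [(x, y)]
    for _ in range(num_steps):
        grad_x, grad_y = gradient(x, y)
        x = x - learning_rate * grad_x
        y = y - learning_rate * grad_y
        path.append((x, y))
    return path


sgd_path = run_sgd(learning_rate=0.019, num_steps=50)

# Count how often x jumps to the other side of the valley between steps.
sign_flips = sum(1 for (x0, _), (x1, _) in zip(sgd_path, sgd_path[1:]) if x0 * x1 < 0)
final_x, final_y = sgd_path[-1]
print(f"x changed sign on {sign_flips} of 50 steps")
print(f"after 50 steps: x = {final_x:.4f}, y = {final_y:.4f}")
```

On this toy loss, the steep direction overshoots on every step while the shallow direction has barely moved. Raising the learning rate to speed up y is not an option here, because anything above 2 / 100 = 0.02 makes x diverge.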

Momentum: Accumulating Velocity

Momentum adds a “velocity” term that accumulates an exponentially decaying sum of past gradients (a running average, up to a constant scale factor). This lets the optimizer build up speed in a consistent direction and dampens oscillation in directions where the gradient keeps flipping sign.

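# weight, num_steps, and compute_gradient are placeholders for your own model and loss
learning_rate = 0.01          # illustrative value
velocity = 0.0
momentum_coefficient = 0.9    # fraction of the previous velocity kept each step

for step in range(num_steps):
    gradient = compute_gradient(weight)
    velocity = momentum_coefficient * velocity + gradient   # decaying sum of past gradients
    weight = weight - learning_rate * velocity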

The intuition is physical: imagine a ball rolling down a valley — momentum lets it build speed through consistently-sloped regions and smooths through small bumps and narrow oscillations, rather than reacting to every single local gradient independently.
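To see the rolling-ball effect in numbers, the following sketch runs the update with and without momentum on a toy quadratic loss. The loss function, the starting point, the learning rate of 0.005, and the 100-step budget are assumptions made for illustration; setting the momentum coefficient to 0 recovers plain SGD.

```python
def gradient(x, y):
    # Gradient of the toy loss f(x, y) = 0.5 * (100 * x**2 + y**2).
    return 100.0 * x, y


def loss(x, y):
    return 0.5 * (100.0 * x ** 2 + y ** 2)


def final_loss(learning_rate, momentum_coefficient, num_steps, start=(1.0, 1.0)):
    """Run the momentum update on both coordinates and return the final loss."""
    x, y = start
    velocity_x, velocity_y = 0.0, 0.0
    for _ in range(num_steps):
        grad_x, grad_y = gradient(x, y)
        velocity_x = momentum_coefficient * velocity_x + grad_x
        velocity_y = momentum_coefficient * velocity_y + grad_y
        x = x - learning_rate * velocity_x
        y = y - learning_rate * velocity_y
    return loss(x, y)


# Same learning rate and step budget; only the momentum coefficient differs.
sgd_loss = final_loss(learning_rate=0.005, momentum_coefficient=0.0, num_steps=100)
momentum_loss = final_loss(learning_rate=0.005, momentum_coefficient=0.9, num_steps=100)
print(f"plain SGD loss after 100 steps:     {sgd_loss:.6f}")
print(f"SGD + momentum loss after 100 steps: {momentum_loss:.6f}")
```

In this setup, plain SGD is held back by the shallow y direction, where each step is tiny. Momentum keeps adding those small, consistently signed gradients to the velocity, so progress along y accelerates.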

RMSProp: Adaptive Per-Parameter Learning Rates

RMSProp tracks a running average of the squared gradient for each parameter individually, and uses it to scale that parameter’s effective learning rate — parameters with historically large gradients get a smaller effective step, and parameters with historically small gradients get a relatively larger one.

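import numpy as np

# weight, num_steps, and compute_gradient are placeholders for your own model and loss
learning_rate = 0.001         # illustrative value
squared_grad_avg = 0.0
decay_rate = 0.9
epsilon = 1e-8                # keeps the division finite when the average is near zero

for step in range(num_steps):
    gradient = compute_gradient(weight)
    squared_grad_avg = decay_rate * squared_grad_avg + (1 - decay_rate) * gradient ** 2
    weight = weight - (learning_rate / (np.sqrt(squared_grad_avg) + epsilon)) * gradient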

This per-parameter adaptivity is genuinely useful when different parameters in a network have very different typical gradient scales — a single global learning rate that works well for one parameter can be badly wrong for another, and RMSProp compensates for this automatically.
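A small sketch makes the compensation concrete. It feeds two hypothetical parameters constant gradients that differ by a factor of 10,000 and records the size of each RMSProp update. The constant gradients and the helper name rmsprop_updates are assumptions for illustration; real gradients vary from step to step.

```python
import math


def rmsprop_updates(gradients, learning_rate=0.01, decay_rate=0.9, epsilon=1e-8):
    """Return the size of the RMSProp update at each step for one parameter."""
    squared_grad_avg = 0.0
    updates = []
    for gradient in gradients:
        squared_grad_avg = decay_rate * squared_grad_avg + (1 - decay_rate) * gradient ** 2
        updates.append(learning_rate * gradient / (math.sqrt(squared_grad_avg) + epsilon))
    return updates


large_gradient_updates = rmsprop_updates([100.0] * 50)   # parameter with large gradients
small_gradient_updates = rmsprop_updates([0.01] * 50)    # parameter with tiny gradients

# For comparison, the plain SGD step sizes with the same learning rate.
print(f"SGD steps:     {0.01 * 100.0:.6f} vs {0.01 * 0.01:.6f}")
print(f"RMSProp steps: {large_gradient_updates[-1]:.6f} vs {small_gradient_updates[-1]:.6f}")
```

Under plain SGD the two steps differ by the same factor as the gradients. Under RMSProp both settle close to the learning rate itself, because each gradient is divided by its own running magnitude.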

Adam: Combining Momentum and Adaptive Learning Rates

Adam (Adaptive Moment Estimation) combines both ideas — it tracks a running average of the gradient itself (like momentum) and a running average of the squared gradient (like RMSProp), using both to compute each update.

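import numpy as np

# weight, num_steps, and compute_gradient are placeholders for your own model and loss
learning_rate = 0.001   # illustrative value
m = 0.0   # first moment (like momentum)
v = 0.0   # second moment (like RMSProp)
beta1, beta2 = 0.9, 0.999
epsilon = 1e-8          # keeps the division finite when v is near zero

for t in range(1, num_steps + 1):
    gradient = compute_gradient(weight)
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2

    # m and v start at zero, so early values are biased toward zero; rescale them
    m_corrected = m / (1 - beta1 ** t)   # bias correction for early steps
    v_corrected = v / (1 - beta2 ** t)

    weight = weight - learning_rate * m_corrected / (np.sqrt(v_corrected) + epsilon)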

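In practice you would use a library implementation rather than hand-writing this loop. In PyTorch, with model standing in for any existing torch.nn.Module:

import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.001)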

Adam has become the default optimizer choice for the vast majority of deep learning projects, largely because it works reasonably well across a wide range of architectures and problems without extensive per-problem tuning — a genuinely practical advantage even when a more carefully-tuned SGD-with-momentum setup might eventually outperform it on a specific problem.
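The bias correction in Adam's update is easy to skim past, so here is a minimal sketch of what it does during the first few steps. The constant gradient of 5.0 and the helper name adam_first_steps are assumptions for illustration only.

```python
import math


def adam_first_steps(gradient, num_steps=3, learning_rate=0.001,
                     beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Run Adam on a constant gradient and record (t, m, m_corrected, update)."""
    m, v = 0.0, 0.0
    rows = []
    for t in range(1, num_steps + 1):
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient ** 2
        m_corrected = m / (1 - beta1 ** t)
        v_corrected = v / (1 - beta2 ** t)
        update = learning_rate * m_corrected / (math.sqrt(v_corrected) + epsilon)
        rows.append((t, m, m_corrected, update))
    return rows


rows = adam_first_steps(gradient=5.0)
for t, m, m_corrected, update in rows:
    print(f"t={t}  raw m={m:.4f}  corrected m={m_corrected:.4f}  update={update:.6f}")
```

Because m starts at zero, the raw first moment after one step is only a tenth of the true gradient. The correction rescales it back to the gradient itself, and the resulting update is close to the learning rate regardless of the gradient's scale.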

AdamW: Fixing Adam’s Interaction With Weight Decay

AdamW is a small but meaningful fix to how Adam handles weight decay (covered in Regularization). With plain SGD, adding an L2 penalty to the loss and shrinking the weights directly on each step are equivalent up to a rescaling of the coefficient. In Adam they are not: the penalty's gradient is folded into the gradient before the adaptive scaling is applied, so it gets divided by the same per-parameter factor as everything else, and parameters with historically large gradients end up regularized less than parameters with small ones. AdamW applies weight decay directly to the weights, separately from the gradient-based update, which has been found to generalize better in practice.

optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

AdamW has become the standard choice for training transformer-based architectures, covered in Transformers, largely because of this improved interaction with weight decay at the scale at which these models are typically trained.
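The difference between the two ways of applying decay shows up even in a single step. The sketch below is a simplified single-parameter version written for illustration, not a library's implementation: it sets the task gradient to zero so that only the decay term moves the weight, and the helper names adam_step and first_step_shrink are hypothetical.

```python
import math


def adam_step(weight, gradient, m, v, t, learning_rate, weight_decay, decoupled,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step with either L2-in-gradient decay or decoupled decay."""
    if not decoupled:
        # Adam with L2: the penalty's gradient passes through the adaptive scaling.
        gradient = gradient + weight_decay * weight
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    new_weight = weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    if decoupled:
        # AdamW style: shrink the weight directly, outside the adaptive scaling.
        new_weight = new_weight - learning_rate * weight_decay * weight
    return new_weight, m, v


def first_step_shrink(weight_decay, decoupled):
    """How far a weight of 2.0 moves on step 1 when the task gradient is zero."""
    new_weight, _, _ = adam_step(2.0, 0.0, 0.0, 0.0, 1, 0.001, weight_decay, decoupled)
    return 2.0 - new_weight


for weight_decay in (0.01, 0.1):
    l2 = first_step_shrink(weight_decay, decoupled=False)
    decoupled = first_step_shrink(weight_decay, decoupled=True)
    print(f"weight_decay={weight_decay}: L2 shrink={l2:.6f}, decoupled shrink={decoupled:.6f}")
```

With the L2 form, a tenfold increase in the decay coefficient leaves the first step almost unchanged, because the adaptive scaling normalizes the penalty away. With the decoupled form, the shrink scales directly with the coefficient.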

Comparing the Optimizers

Optimizer	Tracks	Best for
SGD	Nothing extra — just the current gradient	Simple problems, or when combined with careful manual tuning
SGD + Momentum	Running average of gradients	Faster, smoother convergence than plain SGD
RMSProp	Running average of squared gradients	Problems with very different gradient scales across parameters
Adam	Both gradient and squared gradient averages	The practical default for most deep learning problems
AdamW	Same as Adam, with weight decay decoupled from the gradient update	Transformer training, and generally preferred over Adam when weight decay is used

Optimizer Choice Interacts With Learning Rate Scheduling

In practice, none of these optimizers is typically run with a single fixed learning rate for an entire training run. Even Adam, which adapts per-parameter step sizes internally, still benefits substantially from an external learning rate schedule layered on top, covered in Learning Rate Scheduling. The optimizer determines how a given base learning rate gets applied and adapted per parameter; the schedule determines how that base learning rate itself changes over the course of training. Treating these as two independent, complementary decisions, rather than assuming a sufficiently sophisticated optimizer removes the need for scheduling, reflects how most successful real-world training recipes are configured.
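As a rough sketch of that division of labor, the function below computes a base learning rate per step, which any of the optimizers above would then apply and adapt per parameter. The schedule shape (linear warmup followed by cosine decay), the function name, and every constant are assumptions chosen for illustration; the text above does not prescribe a particular schedule.

```python
import math


def scheduled_learning_rate(step, total_steps, peak_lr=0.001, warmup_steps=100):
    """Base learning rate for a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


total_steps = 1000
for step in (0, 50, 99, 550, 1000):
    # In a training loop, this value would replace the fixed learning_rate
    # passed to the optimizer update on each step.
    print(f"step {step:4d}: base learning rate = {scheduled_learning_rate(step, total_steps):.6f}")
```

The schedule knows nothing about momentum or squared-gradient averages, and the optimizer knows nothing about where training is in its run, which is why the two can be chosen separately.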

Summary

Modern optimizers aren’t arbitrary alternatives to plain gradient descent — each one addresses a specific, well-understood limitation: momentum smooths noisy updates, RMSProp adapts to per-parameter gradient scale, and Adam/AdamW combine both. Starting with Adam or AdamW as a default, and only reaching for plain SGD with momentum when you have a specific reason (some research has found well-tuned SGD generalizes slightly better on certain vision tasks), is a reasonable, well-justified practical strategy.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.