Gradient Descent: The Engine Behind Machine Learning Optimization

Learn gradient descent — batch, stochastic, and mini-batch variants, learning rate, convergence, saddle points, and modern optimizers like Adam and AdamW.

Gradient Descent

Gradient descent is the algorithm behind most of modern machine learning. Models from logistic regression to GPT-4 are trained by computing the gradient of a loss function with respect to the model's parameters and updating those parameters in the direction that reduces the loss.


The Core Idea

Starting at a random point on the loss surface:
1. Compute the gradient at the current position: ∇L(θ)
2. Move in the opposite direction: θ ← θ - η × ∇L(θ)
3. Repeat until the loss stops decreasing

η (eta) = learning rate: how large a step to take
∇L(θ) = gradient: points in the direction of steepest ascent, so we step against it to descend
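
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production optimizer: the quadratic loss L(θ) = (θ - 3)², the function names, and the step count are all assumptions chosen so that the minimum (θ = 3) is known in advance.

```python
def gradient_descent(grad, theta0, lr=0.1, num_steps=100):
    """Minimize a 1-D function, given its gradient, with fixed-step gradient descent."""
    theta = theta0
    for _ in range(num_steps):
        theta = theta - lr * grad(theta)  # θ ← θ - η × ∇L(θ)
    return theta


# Hypothetical loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3) and whose minimum is θ = 3.
def grad_L(theta):
    return 2.0 * (theta - 3.0)


theta_final = gradient_descent(grad_L, theta0=0.0, lr=0.1, num_steps=100)
print(theta_final)  # very close to 3.0
```

Here the loop runs for a fixed number of steps instead of checking whether the loss has stopped decreasing; a real training loop would usually monitor the loss as well.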

Variants by Batch Size

Batch Gradient Descent

Compute gradient over the entire dataset:

θ ← θ - η × (1/N) Σᵢ ∇Lᵢ(θ)
+ Exact gradient, stable convergence
- Slow for large datasets (entire dataset per update)
- Memory-intensive

Stochastic Gradient Descent (SGD)

Compute gradient on a single sample:

θ ← θ - η × ∇Lᵢ(θ)
+ Fast updates, can escape local minima (noisy)
- Very noisy gradient estimates, oscillates

Mini-Batch SGD (The Standard)

Compute gradient on a small batch (typically 32–512 samples):

for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:  # dataloader provides mini-batches
        loss = criterion(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()   # Compute gradients
        optimizer.step()  # Update: θ ← θ - η × ∇L_batch
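
The three variants differ only in how many samples feed each gradient estimate. The sketch below makes that explicit in dependency-free Python rather than PyTorch: one hypothetical function, `minibatch_gd`, becomes batch gradient descent when `batch_size` equals the dataset size and SGD when it equals 1. The toy data (y = 2x + 1, noise-free) and all hyperparameters are made up for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, lr=0.05, num_epochs=500, seed=0):
    """Fit y ≈ w*x + b with mean-squared error, using batches of the given size."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-sample gradients of (w*x + b - y)² over the batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free toy data from y = 2x + 1 (an assumption for this example).
xs = [i / 10 for i in range(-10, 11)]
ys = [2 * x + 1 for x in xs]

w_batch, b_batch = minibatch_gd(xs, ys, batch_size=len(xs))  # batch gradient descent
w_sgd, b_sgd = minibatch_gd(xs, ys, batch_size=1)            # stochastic gradient descent
w_mini, b_mini = minibatch_gd(xs, ys, batch_size=4)          # mini-batch SGD
```

Because the toy data contains no noise, all three settings recover w ≈ 2 and b ≈ 1. On real data the single-sample variant would keep fluctuating around the minimum unless the learning rate is decayed.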

The Learning Rate

Usually the most important hyperparameter in gradient descent:

Too small:   θ ─────────────────────────────── minimum
             (converges, but takes forever)
Too large:   θ → → → ← → → ← (diverges or oscillates)
Just right:  θ ───────── minimum
             (converges in reasonable time)

import torch.optim as optim

# Typical learning rates by optimizer (assumes `model` is an existing torch.nn.Module)
sgd = optim.SGD(model.parameters(), lr=0.01)      # SGD: 0.001–0.1
adam = optim.Adam(model.parameters(), lr=1e-3)    # Adam: 1e-4–1e-2
adamw = optim.AdamW(model.parameters(), lr=1e-3)  # AdamW: 1e-4–1e-2
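
The three regimes are easy to reproduce on a toy problem. The sketch below assumes the loss L(θ) = θ², for which each update multiplies θ by (1 - 2η); the specific learning rates and the helper name `run_gd` are illustrative choices, not recommendations.

```python
def run_gd(lr, theta0=1.0, num_steps=20):
    """Run gradient descent on L(θ) = θ² (gradient 2θ) and return the final θ."""
    theta = theta0
    for _ in range(num_steps):
        theta -= lr * 2.0 * theta
    return theta


too_small = run_gd(lr=0.001)  # θ shrinks by a factor 0.998 per step: barely moves
just_right = run_gd(lr=0.3)   # θ shrinks by a factor 0.4 per step: converges quickly
too_large = run_gd(lr=1.1)    # θ is multiplied by -1.2 per step: oscillates and diverges

print(too_small, just_right, too_large)
```

For this particular loss, any learning rate above 1.0 makes the factor |1 - 2η| exceed 1, so the iterates flip sign and grow on every step.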

Modern Optimizers

Adam (Adaptive Moment Estimation)

Adapts the learning rate for each parameter individually based on:

  • First moment: exponential moving average of gradients (momentum)
  • Second moment: exponential moving average of squared gradients (adaptive scaling)

adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),  # Decay rates for the first- and second-moment averages
    eps=1e-8,            # Numerical stability
    weight_decay=0,      # No L2 penalty (use AdamW for decoupled weight decay)
)
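
To show how the two moments combine, here is a sketch of a single Adam update for one scalar parameter, written in plain Python. It follows the standard formulation with bias correction; the function name `adam_step` and the way state is passed in and out are assumptions made for this example, not PyTorch's internal API.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero-initialized EMAs
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# The first step has size close to lr regardless of the gradient's scale.
theta_small, _, _ = adam_step(theta=0.0, grad=0.01, m=0.0, v=0.0, t=1)
theta_large, _, _ = adam_step(theta=0.0, grad=100.0, m=0.0, v=0.0, t=1)
print(theta_small, theta_large)
```

Both calls move θ by roughly the same amount even though the gradients differ by four orders of magnitude, which is the per-parameter scaling described above.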

AdamW (Adam with Decoupled Weight Decay)

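In Adam, weight_decay is implemented as an L2 penalty added to the gradient, so the penalty is rescaled by the adaptive per-parameter step size. AdamW fixes this by applying weight decay directly to the weights rather than to the gradients. It is the usual default optimizer for Transformers and much of modern deep learning: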

adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01,  # Decoupled weight decay (unlike Adam's L2-style weight_decay)
)
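
The difference between the two forms of weight decay shows up even in a single update. The sketch below compares the first step of Adam with an L2 penalty against the first step of AdamW for one scalar weight. The helper names and the numbers (θ = 1, gradient 10, lr = 0.1, weight decay 0.1) are assumptions picked to make the effect visible.

```python
import math


def adam_direction(grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction for the very first step (moments start at zero)."""
    m_hat = (1 - beta1) * grad / (1 - beta1)
    v_hat = (1 - beta2) * grad * grad / (1 - beta2)
    return m_hat / (math.sqrt(v_hat) + eps)


def adam_l2_first_step(theta, grad, lr, weight_decay):
    # L2 penalty: the decay term is folded into the gradient, then rescaled by Adam.
    return theta - lr * adam_direction(grad + weight_decay * theta)


def adamw_first_step(theta, grad, lr, weight_decay):
    # Decoupled decay: shrink the weight directly, then apply the plain Adam step.
    theta = theta * (1 - lr * weight_decay)
    return theta - lr * adam_direction(grad)


theta_l2 = adam_l2_first_step(theta=1.0, grad=10.0, lr=0.1, weight_decay=0.1)
theta_w = adamw_first_step(theta=1.0, grad=10.0, lr=0.1, weight_decay=0.1)
print(theta_l2, theta_w)  # roughly 0.9 and 0.89
```

With the L2 form, the decay term is absorbed into the normalized gradient and has almost no effect on the step. With the decoupled form, the weight shrinks by the factor (1 - lr × weight_decay) in addition to the Adam step.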

SGD with Momentum

Accumulates a velocity vector across steps, which dampens oscillations and accelerates progress along directions where successive gradients agree:

sgd_momentum = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,   # Standard: 0.9
    nesterov=True,  # Nesterov momentum: often slightly better in practice
    weight_decay=1e-4,
)
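
The velocity update itself is short. The sketch below implements classical (non-Nesterov) momentum in the form v ← μv + g, θ ← θ - ηv, which matches the convention PyTorch's SGD uses when dampening is zero. The toy loss L(θ) = θ² and the function name are assumptions for illustration.

```python
def sgd_momentum(grad, theta0, lr=0.1, momentum=0.9, num_steps=2):
    """SGD with classical momentum: v ← μ·v + g, θ ← θ - η·v. Returns every iterate."""
    theta, velocity = theta0, 0.0
    path = [theta]
    for _ in range(num_steps):
        velocity = momentum * velocity + grad(theta)  # accumulate past gradients
        theta = theta - lr * velocity
        path.append(theta)
    return path


# Hypothetical loss L(θ) = θ², gradient 2θ.
path = sgd_momentum(lambda theta: 2.0 * theta, theta0=1.0)
print(path)  # approximately [1.0, 0.8, 0.46]
```

The second step (0.8 to 0.46) is larger than plain gradient descent would take from the same point (0.8 to 0.64), because the velocity still carries most of the first gradient.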

Gradient Clipping

Widely used when training RNNs and Transformers to keep exploding gradients from destabilizing the updates:

import torch

for X_batch, y_batch in dataloader:
    loss = criterion(model(X_batch), y_batch)
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients after backward() and before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
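
Norm clipping rescales all gradients by a common factor whenever their combined L2 norm exceeds the threshold, so the update direction is preserved. The following is a simplified plain-Python sketch of that idea, not PyTorch's implementation (which operates on tensors and adds a small epsilon to the denominator); the function name is an assumption.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(clipped, norm_before)  # roughly [0.6, 0.8] and 5.0
print(unchanged)             # [0.3, 0.4]: already within the limit
```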

Challenges in Non-Convex Optimization

Neural network loss surfaces are non-convex with:

  • Local minima: Points where gradient = 0 but not the global minimum
  • Saddle points: Gradient = 0, but the surface still slopes downhill in some directions; the noise in SGD helps escape these
  • Flat regions / plateaus: Tiny gradients, slow progress
  • Sharp minima: Training converges, but generalization is often poorer; flatter minima tend to generalize better

The good news: in practice, the local minima that large neural networks reach tend to have similar loss values, so reaching any one of them usually gives a good solution.
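
The saddle-point behaviour listed above can be seen on the textbook function f(x, y) = x² - y², which has a saddle at the origin. The sketch below is a toy illustration under that assumption: a deterministic run started exactly on the x-axis converges to the saddle, while a tiny perturbation in y (standing in for gradient noise) is amplified until the iterate escapes.

```python
def descend_saddle(x, y, lr=0.1, num_steps=100):
    """Gradient descent on f(x, y) = x² - y², which has a saddle point at (0, 0)."""
    for _ in range(num_steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y


stuck = descend_saddle(x=1.0, y=0.0)     # exactly on the ridge: converges to the saddle
escaped = descend_saddle(x=1.0, y=1e-6)  # tiny perturbation: y grows and the iterate escapes

print(stuck)
print(escaped)
```

This function is unbounded below, so the escaping run keeps moving away indefinitely. On a real loss surface, the iterate would instead continue downhill toward a lower region.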


Optimizer Selection Guide

Most deep learning tasks:            → AdamW (lr=1e-4 to 1e-3)
Fine-tuning pretrained models:       → AdamW with smaller lr (1e-5 to 1e-4)
Computer vision from scratch:        → SGD + momentum (often better than Adam at scale)
Transformers (BERT, GPT training):   → AdamW with warmup schedule
Simple tabular models:               → Adam or AdamW
Sparse features (NLP bag-of-words):  → Adagrad or Adam
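
The warmup schedule mentioned for Transformers ramps the learning rate up from zero over the first training steps. A minimal sketch of linear warmup follows. The function name, the 1000-step warmup length, and holding the rate constant afterwards are assumptions for illustration; real schedules commonly follow the warmup with some form of decay.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup from 0 to base_lr over warmup_steps, then hold base_lr constant."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


schedule = [warmup_lr(s) for s in (0, 500, 1000, 5000)]
print(schedule)  # ramps from 0.0 to base_lr, then stays flat
```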