Gradient Descent
Gradient descent is the algorithm behind most of modern machine learning. Models from logistic regression to GPT-4 are trained by computing the gradient of a loss function with respect to their parameters and updating those parameters in the direction that reduces the loss.
The Core Idea
Starting at a random point on the loss surface:
1. Compute the gradient at the current position: ∇L(θ)
2. Move in the opposite direction: θ ← θ - η × ∇L(θ)
3. Repeat until the loss stops decreasing
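The three steps above can be sketched in plain Python. This is a minimal illustration rather than a training recipe: the quadratic loss L(θ) = (θ - 3)², the starting point, and the name `gradient_descent` are all assumptions made for the example, `lr` plays the role of η, and a fixed step count stands in for a proper "loss stopped decreasing" check.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    """Repeatedly step against the gradient and return the final theta."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)  # θ ← θ - η × ∇L(θ)
    return theta


# Hypothetical loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)


theta_small = gradient_descent(grad, theta=0.0, lr=0.001)  # tiny steps: still far from 3
theta_good = gradient_descent(grad, theta=0.0, lr=0.1)     # lands very close to 3
theta_large = gradient_descent(grad, theta=0.0, lr=1.1)    # overshoots further each step
```

On this toy loss the same loop either crawls, converges, or blows up depending only on `lr`.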
η (eta) = learning rate: how large a step to take
∇L(θ) = gradient: the direction of steepest ascent (we want descent)

Variants by Batch Size
Batch Gradient Descent
Compute gradient over the entire dataset:
θ ← θ - η × (1/N) Σᵢ ∇Lᵢ(θ)
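As a rough sketch of that update, the snippet below fits a single weight `w` in the model y ≈ w·x with a squared-error loss per sample. The four data points, the learning rate, and the helper name `mean_gradient` are assumptions chosen for illustration.

```python
# Hypothetical toy dataset generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mean_gradient(w):
    """(1/N) Σᵢ ∇Lᵢ(w) for the per-sample loss Lᵢ(w) = (w·xᵢ - yᵢ)²."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


w, lr = 0.0, 0.01
for _ in range(200):
    w = w - lr * mean_gradient(w)  # one update touches every sample
```

Every update here reads all N samples, which is what makes the approach costly once N is large.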
+ Exact gradient, stable convergence
- Slow for large datasets (entire dataset per update)
- Memory-intensive

Stochastic Gradient Descent (SGD)
Compute gradient on a single sample:
θ ← θ - η × ∇Lᵢ(θ)
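A minimal sketch of the per-sample update, assuming the one-weight model y ≈ w·x with a squared-error loss. The data, seed, and learning rate are made up for the example, and because the toy data is noiseless the characteristic oscillation of SGD is much milder here than on real data.

```python
import random

# Hypothetical toy dataset generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
w, lr = 0.0, 0.01
for epoch in range(50):
    order = list(range(len(xs)))
    rng.shuffle(order)  # visit the samples in a fresh random order each epoch
    for i in order:
        grad_i = 2.0 * (w * xs[i] - ys[i]) * xs[i]  # ∇Lᵢ(w) for a single sample
        w = w - lr * grad_i                          # update immediately
```

Each pass over the data now performs N cheap updates instead of one expensive one.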
+ Fast updates, can escape local minima (noisy)
- Very noisy gradient estimates, oscillates

Mini-Batch SGD (The Standard)
Compute gradient on a small batch (typically 32–512 samples):
    for epoch in range(num_epochs):
        for X_batch, y_batch in dataloader:  # dataloader provides mini-batches
            loss = criterion(model(X_batch), y_batch)
            optimizer.zero_grad()
            loss.backward()   # Compute gradients
            optimizer.step()  # Update: θ ← θ - η × ∇L_batch

The Learning Rate
The most important hyperparameter in gradient descent:
Too small: θ ─────────────────────────────── minimum (converges, but takes forever)
Too large: θ → → → ← → → ← (diverges or oscillates)
Just right: θ ───────── minimum (converges in reasonable time)

    import torch.optim as optim

    # Typical learning rates by optimizer
    sgd = optim.SGD(model.parameters(), lr=0.01)      # SGD: 0.001–0.1
    adam = optim.Adam(model.parameters(), lr=1e-3)    # Adam: 1e-4–1e-2
    adamw = optim.AdamW(model.parameters(), lr=1e-3)  # AdamW: 1e-4–1e-2

Modern Optimizers
Adam (Adaptive Moment Estimation)
Adapts the learning rate for each parameter individually based on:
- First moment: exponential moving average of gradients (momentum)
- Second moment: exponential moving average of squared gradients (adaptive scaling)
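The two moving averages above can be sketched for a single scalar parameter as follows. This is a hand-written illustration of the standard Adam update with bias correction, not the library's implementation; the function name `adam_step` and the quadratic test loss are assumptions.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Hypothetical loss L(θ) = (θ - 3)² with gradient 2(θ - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2.0 * (theta - 3.0), m, v, t, lr=0.01)

# The very first step has size ≈ lr regardless of the gradient's magnitude.
first_theta, _, _ = adam_step(0.0, -6.0, 0.0, 0.0, t=1, lr=0.01)
```

Dividing by the square root of the second moment is what gives each parameter its own effective step size.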
    adam = optim.Adam(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),  # Momentum coefficients
        eps=1e-8,            # Numerical stability
        weight_decay=0       # No L2 regularization (use AdamW for that)
    )

AdamW (Adam with Decoupled Weight Decay)
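Fixes a flaw in how Adam handles L2 regularization. Adam adds the weight-decay term to the gradient, where it gets rescaled by the adaptive per-parameter step size; AdamW instead applies the decay directly to the weights. AdamW is the standard optimizer for Transformers and much of modern deep learning.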
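To make the difference concrete, here is a scalar sketch that takes one step with each variant. The helper names (`adam_direction`, `adam_l2_step`, `adamw_step`) are hypothetical and the code is a simplified illustration, not the library source. The loss gradient is set to zero so that only the decay term moves the weight.

```python
import math


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's normalized update direction and the updated moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, grad, m, v, t, lr=1e-3, weight_decay=0.01):
    # L2 penalty folded into the gradient, so the adaptive term rescales it.
    direction, m, v = adam_direction(grad + weight_decay * theta, m, v, t)
    return theta - lr * direction, m, v


def adamw_step(theta, grad, m, v, t, lr=1e-3, weight_decay=0.01):
    # Decoupled decay: shrink the weight directly, then take a plain Adam step.
    theta = theta - lr * weight_decay * theta
    direction, m, v = adam_direction(grad, m, v, t)
    return theta - lr * direction, m, v


theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
```

In this one-step example the L2 variant moves the weight by roughly `lr`, because the normalization cancels the size of the penalty, while the decoupled variant shrinks it by exactly `lr × weight_decay × θ`.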
    adamw = optim.AdamW(
        model.parameters(),
        lr=1e-3,
        weight_decay=0.01  # Proper weight decay (unlike Adam's weight_decay)
    )

SGD with Momentum
Accumulates a velocity vector from past gradients, which dampens oscillations and accelerates progress in directions where the gradient stays consistent.
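The velocity update can be written by hand in a few lines. The sketch below uses the formulation v ← μ·v + g, θ ← θ - η·v on a scalar parameter; the function name `sgd_momentum_run` and the quadratic loss are assumptions for illustration, and Nesterov momentum is not covered.

```python
def sgd_momentum_run(grad_fn, theta, lr=0.01, momentum=0.9, steps=200):
    """Gradient descent with a velocity term; momentum=0.0 gives plain descent."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad_fn(theta)  # accumulate past gradients
        theta = theta - lr * velocity                    # step along the velocity
    return theta


# Hypothetical loss L(θ) = (θ - 3)² with gradient 2(θ - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)


theta_plain = sgd_momentum_run(grad, 0.0, momentum=0.0)
theta_momentum = sgd_momentum_run(grad, 0.0, momentum=0.9)
```

With the same learning rate and step budget, the momentum run ends much closer to the minimum at 3 on this toy problem.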
    sgd_momentum = optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.9,      # Standard: 0.9
        nesterov=True,     # Nesterov momentum: slightly better in practice
        weight_decay=1e-4
    )

Gradient Clipping
Essential for RNNs and Transformers to prevent exploding gradients:
    import torch

    for X_batch, y_batch in dataloader:
        loss = criterion(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()

        # Clip gradients before optimizer step
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()

Challenges in Non-Convex Optimization
Neural network loss surfaces are non-convex with:
- Local minima: Points where gradient = 0 but not the global minimum
- Saddle points: Gradient = 0 but downhill in some directions — SGD’s noise helps escape these
- Flat regions / plateaus: Tiny gradients, slow progress
- Sharp minima: Training converges, but such solutions often generalize poorly; flat minima tend to generalize better
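The saddle-point item in the list above can be illustrated on the toy function f(x, y) = x² - y², which has a saddle at the origin. The function, starting point, noise level, and the name `descend` are all assumptions for the sketch. Because this f is unbounded below, "escaping" here simply means y moves away from 0 and keeps going.

```python
import random


def descend(x, y, lr=0.1, steps=100, noise=0.0, seed=0):
    """Gradient descent on f(x, y) = x² - y² with optional Gaussian gradient noise."""
    rng = random.Random(seed)
    for _ in range(steps):
        gx = 2.0 * x + noise * rng.gauss(0.0, 1.0)   # ∂f/∂x plus noise
        gy = -2.0 * y + noise * rng.gauss(0.0, 1.0)  # ∂f/∂y plus noise
        x, y = x - lr * gx, y - lr * gy
    return x, y


x_clean, y_clean = descend(1.0, 0.0)              # exact gradients: slides into the saddle
x_noisy, y_noisy = descend(1.0, 0.0, noise=0.01)  # noisy gradients: y drifts off 0 and escapes
```

With exact gradients the run starting on the line y = 0 never leaves it, while a small amount of gradient noise is enough to find the downhill direction.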
The good news: in practice, most local minima that training reaches in large neural networks tend to have similar loss values, so reaching any one of them usually gives a good solution.
Optimizer Selection Guide
Most deep learning tasks: → AdamW (lr=1e-3 to 1e-4)
Fine-tuning pretrained models: → AdamW with smaller lr (1e-5 to 1e-4)
Computer vision from scratch: → SGD + momentum (often better than Adam at scale)
Transformers (BERT, GPT training): → AdamW with warmup schedule
Simple tabular models: → Adam or AdamW
Sparse features (NLP bag-of-words): → Adagrad or Adam
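The "warmup schedule" mentioned for Transformers usually means ramping the learning rate up from near zero over the first training steps. A minimal sketch of a linear warmup multiplier follows; the function name `warmup_factor`, the step counts, and holding the rate constant after warmup are assumptions (real schedules often decay the rate afterwards).

```python
def warmup_factor(step, warmup_steps=1000):
    """Multiplier on the base learning rate: linear ramp up to 1, then constant."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return 1.0


base_lr = 1e-3
# Learning rate for the first six steps with a 4-step warmup.
lrs = [base_lr * warmup_factor(s, warmup_steps=4) for s in range(6)]
```

A function of this shape, returning a multiplicative factor per step, is the kind of callable that PyTorch's `torch.optim.lr_scheduler.LambdaLR` accepts as its `lr_lambda` argument.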