Gradient Descent

Gradient descent is the optimization algorithm behind most of modern machine learning. Models from logistic regression to GPT-4 are trained by computing the gradient of a loss function with respect to their parameters and updating those parameters in the direction that reduces the loss.

The Core Idea

Starting at a random point on the loss surface:
  1. Compute the gradient at current position: ∇L(θ)
  2. Move in the opposite direction: θ ← θ - η × ∇L(θ)
  3. Repeat until loss stops decreasing

η (eta) = learning rate: how large a step to take
∇L(θ)  = gradient: the direction of steepest ascent (we want descent)
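
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimizer: the toy loss L(θ) = (θ - 3)² and the names `gradient_descent` and `toy_grad` are assumptions made for the example, and a fixed step count stands in for "repeat until loss stops decreasing".

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Hypothetical toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


theta_final = gradient_descent(toy_grad, theta0=0.0, lr=0.1, steps=100)
print(theta_final)  # approaches the minimizer at theta = 3
```

With this loss and learning rate, the distance to the minimum shrinks by a factor of 0.8 on every step, so 100 steps are far more than enough.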

Variants by Batch Size

Batch Gradient Descent

Compute gradient over the entire dataset:

θ ← θ - η × (1/N) Σᵢ ∇Lᵢ(θ)

+ Exact gradient, stable convergence
- Slow for large datasets (entire dataset per update)
- Memory-intensive
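
As a rough sketch of the full-batch update, the example below fits a line to a tiny dataset in plain Python. The data (generated from y = 2x + 1), the squared-error loss, and the names `full_batch_gradient`, `w`, and `b` are all assumptions chosen for illustration.

```python
# Hypothetical noise-free data from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def full_batch_gradient(w, b):
    """Average gradient of the squared error (w*x + b - y)^2 over all N samples."""
    n = len(xs)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    grad_w, grad_b = full_batch_gradient(w, b)  # touches every sample
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # close to the true slope 2 and intercept 1
```

Every update loops over the whole dataset, which is where the cost noted above comes from once N is large.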

Stochastic Gradient Descent (SGD)

Compute gradient on a single sample:

θ ← θ - η × ∇Lᵢ(θ)

+ Fast updates, can escape local minima (noisy)
- Very noisy gradient estimates, oscillates
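
The sketch below illustrates why single-sample gradients are called noisy estimates. It assumes a toy one-parameter model y_hat = w * x on made-up data from y = 2x; `sample_gradient` is a hypothetical helper, and the fixed random seed is only there to make the run repeatable.

```python
import random

# Hypothetical noise-free data from y = 2x; the model is y_hat = w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]


def sample_gradient(w, i):
    """Gradient of the squared error (w*x_i - y_i)^2 for one sample."""
    return 2.0 * (w * xs[i] - ys[i]) * xs[i]


# At w = 0 the per-sample gradients differ a lot, but their mean is the
# full-batch gradient, so each one is a noisy estimate of it.
per_sample = [sample_gradient(0.0, i) for i in range(len(xs))]
full_batch = sum(per_sample) / len(per_sample)

rng = random.Random(0)  # fixed seed so the run is repeatable
w, lr = 0.0, 0.02
for _ in range(500):
    i = rng.randrange(len(xs))       # pick one sample at random
    w -= lr * sample_gradient(w, i)  # theta <- theta - lr * grad L_i(theta)

print(per_sample, full_batch, w)
```

Because this toy data has no label noise, every per-sample gradient vanishes at w = 2 and SGD settles there. On real data the per-sample gradients disagree even at the optimum, which produces the oscillation listed above.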

Mini-Batch SGD (The Standard)

Compute gradient on a small batch (typically 32–512 samples):

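# Assumes model, criterion, optimizer, dataloader, and num_epochs are already defined
for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:  # dataloader provides mini-batches
        optimizer.zero_grad()            # Clear gradients from the previous step
        loss = criterion(model(X_batch), y_batch)
        loss.backward()                  # Compute gradients
        optimizer.step()                 # Update: θ ← θ - η × ∇L_batch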
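
To make the batching itself concrete, here is a minimal sketch of how a dataset can be split into shuffled mini-batches. `iterate_minibatches` is a hypothetical helper written for this example; it only mimics the general idea of a shuffling data loader and is not how PyTorch's `DataLoader` is implemented.

```python
import random


def iterate_minibatches(n_samples, batch_size, rng):
    """Yield lists of indices that cover the dataset once, in shuffled order."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)
batches = list(iterate_minibatches(n_samples=10, batch_size=4, rng=rng))
print([len(batch) for batch in batches])  # [4, 4, 2]: the last batch is smaller
```

One pass over all the batches is one epoch. Reshuffling at the start of each epoch changes which samples share a batch.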

The Learning Rate

Often the single most important hyperparameter in gradient descent:

Too small: θ ─────────────────────────────── minimum
           (converges, but takes forever)

Too large: θ → → → ← → → ← (diverges or oscillates)

Just right: θ ───────── minimum
            (converges in reasonable time)
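
The three regimes in the diagram can be reproduced on a toy problem. This sketch assumes the loss L(θ) = θ², for which each step multiplies θ by (1 - 2η), so any η above 1 makes the iterates grow. The function name `run` and the specific learning rates are illustrative choices.

```python
def run(lr, steps=20, theta=1.0):
    """Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta -= lr * 2.0 * theta
    return theta


too_small = run(lr=0.001)   # barely moves away from the starting point
too_large = run(lr=1.1)     # overshoots; |theta| grows on every step
reasonable = run(lr=0.3)    # shrinks quickly toward the minimum at 0

print(too_small, too_large, reasonable)
```

The thresholds here are specific to this quadratic. On a real loss surface the usable range depends on the curvature, which is why learning rates are usually tuned empirically.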

import torch.optim as optim

# Typical learning rates by optimizer
sgd = optim.SGD(model.parameters(), lr=0.01)       # SGD: 0.001–0.1
adam = optim.Adam(model.parameters(), lr=1e-3)     # Adam: 1e-4–1e-2
adamw = optim.AdamW(model.parameters(), lr=1e-3)   # AdamW: 1e-4–1e-2

Modern Optimizers

Adam (Adaptive Moment Estimation)

Adapts the learning rate for each parameter individually based on:

First moment: exponential moving average of gradients (momentum)
Second moment: exponential moving average of squared gradients (adaptive scaling)
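
The two moving averages combine into an update roughly as sketched below for a single scalar parameter. The bias-correction terms are part of the standard Adam algorithm even though they are not described above. `adam_step` is a hypothetical helper for illustration, not PyTorch's implementation.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# On the first step the update is about lr in size, whatever the gradient scale.
small, _, _ = adam_step(theta=0.0, grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(theta=0.0, grad=100.0, m=0.0, v=0.0, t=1)
print(small, large)
```

Dividing by the square root of the second moment is what makes the step size per-parameter: a parameter with consistently large gradients gets a proportionally smaller multiplier.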

adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),   # Decay rates for the first- and second-moment averages
    eps=1e-8,             # Numerical stability
    weight_decay=0        # No L2 penalty (for decoupled weight decay, use AdamW)
)

AdamW (Adam with Decoupled Weight Decay)

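Fixes a flaw in how Adam handles L2 regularization: in Adam the decay term is added to the gradient and then rescaled by the adaptive step size, whereas AdamW applies weight decay directly to the weights. It is the usual default optimizer for Transformers and much of modern deep learning: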

adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01     # Decoupled weight decay (unlike Adam's weight_decay, which acts as an L2 penalty)
)
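
The difference between the two forms of weight decay shows up clearly when the loss gradient is zero, so that only the decay term acts. The sketch below uses scalar, from-scratch updates. The helper names `adam_direction`, `adam_l2_step`, and `adamw_step` and the numbers are assumptions for illustration, not library code.

```python
import math


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam step direction and the updated moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, grad, m, v, t, lr, wd):
    """Adam with an L2 penalty: the decay term is added to the gradient."""
    direction, m, v = adam_direction(grad + wd * theta, m, v, t)
    return theta - lr * direction, m, v


def adamw_step(theta, grad, m, v, t, lr, wd):
    """AdamW: the decay shrinks the weight directly, outside the adaptive scaling."""
    direction, m, v = adam_direction(grad, m, v, t)
    return theta - lr * wd * theta - lr * direction, m, v


# With a zero loss gradient, only the decay term acts on the weight.
coupled, _, _ = adam_l2_step(theta=10.0, grad=0.0, m=0.0, v=0.0, t=1, lr=0.1, wd=0.01)
decoupled, _, _ = adamw_step(theta=10.0, grad=0.0, m=0.0, v=0.0, t=1, lr=0.1, wd=0.01)
print(coupled, decoupled)  # about 9.9 versus about 9.99
```

In the coupled version the decay term is normalized like any other gradient, so the weight moves by roughly lr no matter how small the decay coefficient is. In the decoupled version the weight shrinks by lr × weight_decay × θ, which is what the coefficient is meant to control.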

SGD with Momentum

Accumulates a velocity vector — dampens oscillations and accelerates in consistent directions:

sgd_momentum = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,         # Standard: 0.9
    nesterov=True,        # Nesterov momentum: often slightly better in practice
    weight_decay=1e-4
)
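
A from-scratch sketch of the velocity update makes both effects visible. It uses the form v ← μv + g followed by θ ← θ - ηv, without dampening or the Nesterov correction. `momentum_step` and `run` are hypothetical helpers for this example.

```python
def momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update in the v <- mu*v + g form."""
    velocity = momentum * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity


def run(grads):
    """Apply the gradients in order from theta = 0 and record each velocity."""
    theta, velocity, history = 0.0, 0.0, []
    for grad in grads:
        theta, velocity = momentum_step(theta, grad, velocity)
        history.append(velocity)
    return theta, history


_, consistent = run([1.0, 1.0, 1.0])    # velocity grows: about 1.0, 1.9, 2.71
_, oscillating = run([1.0, -1.0, 1.0])  # signs partly cancel: about 1.0, -0.1, 0.91
print(consistent, oscillating)
```

When gradients keep the same sign the velocity accumulates and steps get longer. When they flip sign the accumulated velocity cancels most of each new gradient.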

Gradient Clipping

Widely used when training RNNs and Transformers to guard against exploding gradients:

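import torch

# Assumes model, criterion, optimizer, and dataloader are already defined
for X_batch, y_batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(X_batch), y_batch)
    loss.backward()

    # Clip the global gradient norm after backward() and before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()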
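
The idea behind norm clipping can be sketched without any library: treat all gradients as one vector and, if that vector is too long, rescale it. `clip_by_global_norm` is a hypothetical helper that works on a flat list of numbers and omits details of the library implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]: direction kept, length reduced to 1
```

Because every gradient is scaled by the same factor, clipping changes the size of the update but not its direction. Gradients already within the limit pass through unchanged.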

Challenges in Non-Convex Optimization

Neural network loss surfaces are non-convex with:

Local minima: Points where the gradient is 0 but the loss is not the global minimum
Saddle points: Gradient = 0, but the loss still decreases in some directions; SGD’s noise helps escape these
Flat regions / plateaus: Tiny gradients, slow progress
Sharp minima: Training converges, but generalization is often worse; flatter minima tend to generalize better

The good news: empirically, the local minima that optimizers reach in large neural networks tend to have similar loss values, so reaching any one of them usually gives a good solution.
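
The saddle-point behaviour listed above can be seen on a toy surface. This sketch assumes f(x, y) = x² - y², which has a saddle at the origin. A tiny fixed offset in y stands in for the perturbation that gradient noise would provide. The function is unbounded below, so it only illustrates the escape, not convergence.

```python
def descend(x, y, lr=0.1, steps=100):
    """Gradient descent on f(x, y) = x^2 - y^2, which has a saddle at (0, 0)."""
    for _ in range(steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y


stuck = descend(x=1.0, y=0.0)      # y never leaves 0: the run converges to the saddle
escaped = descend(x=1.0, y=1e-6)   # a tiny offset in y is amplified on every step
print(stuck, escaped)
```

Started exactly on the ridge, plain gradient descent sees a zero gradient in y and never moves off it. Any small displacement along the downhill direction grows geometrically, which is why noisy updates rarely stay at a saddle.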

Optimizer Selection Guide

Most deep learning tasks:              → AdamW (lr=1e-4 to 1e-3)
Fine-tuning pretrained models:        → AdamW with smaller lr (1e-5 to 1e-4)
Computer vision from scratch:         → SGD + momentum (often reported to generalize better than Adam)
Transformers (BERT, GPT training):    → AdamW with warmup schedule
Simple tabular models:                → Adam or AdamW
Sparse features (NLP bag-of-words):   → Adagrad or Adam
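
The guide mentions a warmup schedule without spelling one out. One common form is a linear ramp, sketched below. The function name `warmup_lr`, the 1000-step warmup, and holding the rate constant afterwards are assumptions for illustration; in practice the rate is often decayed after warmup.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup: ramp from near 0 to base_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step))
```

The value returned for each step would be set as the optimizer's learning rate before that step's update.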