Learning Rate Scheduling: Step Decay, Cosine Annealing, and Warmup
A single fixed learning rate for an entire training run is rarely the optimal choice — early in training, larger steps help the model make fast progress across the loss landscape; later in training, smaller steps help it settle precisely into a good minimum without overshooting. Learning rate scheduling formalizes this intuition, adjusting the learning rate systematically over the course of training rather than committing to one value throughout.
Why a Fixed Learning Rate Is a Compromise
Recall from Learning Rate that too high a learning rate causes oscillation and divergence, while too low causes painfully slow convergence. A fixed learning rate has to be a single compromise value that’s reasonable throughout the entire run — but the ideal learning rate genuinely changes as training progresses, since the loss landscape’s local shape near the current weights evolves as those weights update.
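The too-high and too-low failure modes can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional loss f(w) = w², which is an illustrative assumption and not a real model; the function name and the three learning rates are likewise hypothetical.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with a fixed learning rate; return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # plain gradient descent update: w <- w * (1 - 2 * lr)
    return abs(w)

too_high = run_gradient_descent(lr=1.1)    # each step multiplies w by -1.2, so |w| grows
too_low = run_gradient_descent(lr=0.001)   # each step multiplies w by 0.998, so w barely moves
moderate = run_gradient_descent(lr=0.3)    # each step multiplies w by 0.4, so w shrinks quickly
```

On this toy loss a single moderate value happens to work throughout, which real loss landscapes rarely allow; the example only illustrates the two failure modes a fixed value must stay between.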
- Early training: Large steps help cover distance quickly across the loss landscape
- Late training: Small steps help settle precisely into a good minimum without overshooting past it
Step Decay: Reduce at Fixed Intervals
Step decay reduces the learning rate by a fixed factor at predetermined intervals — a simple, widely used schedule.

    import torch.optim as optim
    from torch.optim.lr_scheduler import StepLR

    optimizer = optim.Adam(model.parameters(), lr=0.1)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # reduce by 10x every 10 epochs

    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer)
        scheduler.step()
        print(f"Epoch {epoch}, current LR: {scheduler.get_last_lr()}")

    # LR progression: 0.1 -> 0.1 -> ... (10 epochs) -> 0.01 -> ... -> 0.001 -> ...

This produces a “staircase” pattern: the learning rate stays constant for a while, then drops sharply, then stays constant again at the new lower value.
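The staircase can also be written in closed form: the base rate multiplied by gamma once for every completed interval. The following is a minimal sketch in plain Python, not PyTorch's implementation; the function name `step_decay_lr` is hypothetical and the defaults mirror the values used above.

```python
def step_decay_lr(epoch, base_lr=0.1, step_size=10, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    num_drops = epoch // step_size       # how many intervals have fully elapsed
    return base_lr * gamma ** num_drops  # multiply by gamma once per elapsed interval

# Sample the schedule just before and just after each drop.
lrs = [step_decay_lr(e) for e in (0, 9, 10, 19, 20)]
```

Epochs 0 and 9 share the base rate, epochs 10 and 19 share a rate ten times smaller, and epoch 20 starts the next stair.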
Cosine Annealing: Smooth, Gradual Decay
Cosine annealing decreases the learning rate smoothly following a cosine curve, starting high and gradually decreasing to near zero by the end of training, without the abrupt jumps of step decay.

    from torch.optim.lr_scheduler import CosineAnnealingLR

    optimizer = optim.Adam(model.parameters(), lr=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer)
        scheduler.step()

Learning rate over training (cosine annealing):

    LR
    │╲
    │ ╲___
    │     ╲___
    │         ╲______
    │                ╲___________
    └────────────────────────────── epochs

The smooth, continuous decrease avoids the sudden shock a step-decay drop can introduce, and cosine annealing has become a particularly popular default schedule for training modern computer vision and transformer architectures.
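The curve follows the standard cosine annealing formula: half a cosine period stretched over the run, scaled between the base rate and a floor. The sketch below evaluates that formula in plain Python; the function name and the defaults (`t_max=100`, `eta_min=0.0`) are illustrative assumptions rather than values taken from the training loop above.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.1, t_max=100, eta_min=0.0):
    """Cosine-annealed learning rate at `epoch`, decaying from base_lr to eta_min."""
    # Goes from 1.0 at epoch 0 down to 0.0 at epoch t_max.
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / t_max))
    return eta_min + (base_lr - eta_min) * cosine

start = cosine_annealing_lr(0)       # full base rate
midpoint = cosine_annealing_lr(50)   # halfway between base rate and floor
end = cosine_annealing_lr(100)       # the floor, eta_min
```

The rate changes slowly near the start and end of the run and fastest in the middle, which is what gives the schedule its smooth shape.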
Warmup: Starting Small Before Ramping Up
Counter-intuitively, many modern architectures — especially transformers, covered in Transformers — benefit from starting with a small learning rate and gradually increasing it over the first several hundred or thousand steps, before applying decay for the rest of training.

    import numpy as np

    def warmup_then_decay_lr(step, warmup_steps, base_lr, total_steps):
        if step < warmup_steps:
            return base_lr * (step / warmup_steps)  # linear warmup
        else:
            # cosine decay after warmup
            progress = (step - warmup_steps) / (total_steps - warmup_steps)
            return base_lr * 0.5 * (1 + np.cos(np.pi * progress))

Learning rate over training (warmup + cosine decay):

    LR
    │        ╱‾╲___
    │       ╱      ╲___
    │      ╱           ╲______
    │     ╱                   ╲______
    │____╱
    └────────────────────────────── warmup then decay

Warmup exists specifically because, early in training, weights are freshly initialized and gradients can be unreliable or unusually large. Jumping straight to a large learning rate at this exact moment risks instability, exactly the kind covered in Exploding Gradient Problem. A brief warmup period lets the model’s activations and gradients settle into a more stable range before the full learning rate is applied.
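To make the shape concrete, the sketch below expresses the same warmup-then-cosine schedule as a multiplier between 0 and 1 and samples it at a few steps. The function name and the step counts (1,000 warmup steps out of 10,000 total) are illustrative assumptions.

```python
import math

def warmup_cosine_factor(step, warmup_steps=1000, total_steps=10000):
    """Multiplier on the base learning rate: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return step / warmup_steps  # ramps linearly from 0 toward 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0

# Sample the start, mid-warmup, the peak, mid-decay, and the final step.
samples = {s: warmup_cosine_factor(s) for s in (0, 500, 1000, 5500, 10000)}
```

Returning a multiplier rather than an absolute rate is a deliberate choice here: PyTorch's `torch.optim.lr_scheduler.LambdaLR` scales the optimizer's initial learning rate by the value of the function it is given, so a factor like this could be passed to it with `scheduler.step()` called once per batch.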
Reduce-on-Plateau: Adapting Based on Actual Progress
Rather than following a fixed, predetermined schedule, this approach monitors validation loss and reduces the learning rate only when progress has genuinely stalled.

    from torch.optim.lr_scheduler import ReduceLROnPlateau

    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer)
        val_loss = evaluate(model, val_loader)
        scheduler.step(val_loss)  # reduces LR if val_loss hasn't improved for 'patience' epochs

This is particularly useful when you’re not certain in advance how many epochs training will need, adapting the schedule to the model’s actual observed progress rather than committing to a fixed schedule upfront.
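The bookkeeping behind this kind of scheduler is small enough to sketch directly. The class below is a simplified illustration, not PyTorch's implementation: it omits the improvement threshold, cooldown, and minimum learning rate that `ReduceLROnPlateau` also supports, and the name `PlateauTracker` is hypothetical.

```python
class PlateauTracker:
    """Simplified reduce-on-plateau logic (no threshold, cooldown, or minimum LR)."""

    def __init__(self, lr, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")  # lowest validation loss seen so far
        self.bad_epochs = 0       # consecutive epochs without a new best

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Tolerate `patience` stalled epochs; reduce on the one after that.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

tracker = PlateauTracker(lr=0.1)
val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8]
history = [tracker.step(loss) for loss in val_losses]
```

In this run the loss stalls at 0.9 for four epochs after its last improvement. With `patience=3` the first three stalled epochs are tolerated and the rate is halved on the fourth.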
Choosing a Schedule
| Schedule | Best for |
|---|---|
| Step decay | Simple, well-understood default; works reasonably across many architectures |
| Cosine annealing | Smooth decay, popular default for modern vision and transformer training |
| Warmup + decay | Essential for transformer training specifically; helps early training stability |
| Reduce-on-plateau | When training duration isn’t fixed in advance, adapting to actual progress |
Scheduling Per-Iteration vs. Per-Epoch
A detail worth being precise about: some schedules (like the warmup shown above) are typically defined in terms of training iterations (individual batch updates), while others (like step decay’s “every 10 epochs”) are defined in terms of epochs — directly connecting to the distinction covered in Epochs, Batch Size, and Iterations. Mixing these up is a real, common source of misconfigured training runs — a warmup meant to last 1,000 iterations, accidentally configured to last 1,000 epochs instead, would keep the learning rate artificially suppressed for what could be the entire training run. Always check which unit a given scheduler implementation actually expects, since frameworks vary in their conventions and defaults here.
Double-checking this single detail before a long, expensive training run begins is a cheap way to avoid discovering, many hours in, that the schedule never actually behaved as intended.
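One quick way to do that check is to convert between the two units before launching the run. The sketch below assumes a hypothetical dataset of 50,000 samples, a batch size of 128, 30 epochs, and a final partial batch that is kept rather than dropped; none of these numbers come from the text above.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Iterations in one epoch, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)

num_samples, batch_size, num_epochs = 50_000, 128, 30
iters_per_epoch = steps_per_epoch(num_samples, batch_size)
total_steps = iters_per_epoch * num_epochs

warmup_steps = 1_000
warmup_in_epochs = warmup_steps / iters_per_epoch  # the same warmup, expressed in epochs
```

Under these assumptions a 1,000-iteration warmup covers roughly two and a half epochs out of thirty. If the scheduler were stepped once per epoch instead, a warmup length of 1,000 would never complete.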
Summary
Learning rate scheduling exists because the best learning rate is not constant throughout training. Starting appropriately (sometimes with warmup) and decreasing over time (via step decay, cosine annealing, or adaptive reduction) typically produces better final models than committing to one fixed value for an entire training run.