Learning Rate Scheduling: Step Decay, Cosine Annealing, and Warmup

Why a fixed learning rate is rarely optimal for the entire training run, and how step decay, cosine annealing, and warmup schedules fix this.

A single fixed learning rate is rarely the best choice for an entire training run. Early in training, larger steps help the model make fast progress across the loss landscape; later, smaller steps help it settle into a good minimum without overshooting. Learning rate scheduling formalizes this intuition by adjusting the learning rate systematically over the course of training rather than committing to one value throughout.


Why a Fixed Learning Rate Is a Compromise

Recall from Learning Rate that too high a learning rate causes oscillation and divergence, while too low causes painfully slow convergence. A fixed learning rate has to be a single compromise value that’s reasonable throughout the entire run — but the ideal learning rate genuinely changes as training progresses, since the loss landscape’s local shape near the current weights evolves as those weights update.

Early training: Large steps help cover distance quickly across the loss landscape
Late training: Small steps help settle precisely into a good minimum without overshooting past it
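
To make the compromise concrete, here is a minimal sketch of plain gradient descent on the one-dimensional quadratic f(w) = w². The toy problem, the function name `gradient_descent`, and the three learning rates are illustrative assumptions, not something taken from a real training run.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # standard gradient descent update
    return abs(w)


# Each update multiplies w by (1 - 2 * lr), so the outcome depends only on lr.
too_high = gradient_descent(lr=1.1)    # factor -1.2: sign flips and magnitude grows
too_low = gradient_descent(lr=0.001)   # factor 0.998: shrinks, but very slowly
moderate = gradient_descent(lr=0.4)    # factor 0.2: shrinks quickly

print(f"lr=1.1   -> |w| = {too_high:.3e}")
print(f"lr=0.001 -> |w| = {too_low:.3e}")
print(f"lr=0.4   -> |w| = {moderate:.3e}")
```

On this toy problem a single moderate value works well throughout, because the curvature never changes. In a real network the local curvature shifts as the weights move, which is why one fixed value ends up as a compromise.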

Step Decay: Reduce at Fixed Intervals

Step decay reduces the learning rate by a fixed factor at predetermined intervals — a simple, widely used schedule.

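import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# model, train_loader, num_epochs, and train_one_epoch are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # multiply the LR by 0.1 every 10 epochs

for epoch in range(num_epochs):
    # Report the LR that is actually used during this epoch.
    print(f"Epoch {epoch}, current LR: {scheduler.get_last_lr()[0]}")
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # advance the schedule once per epoch, after the optimizer updates

# LR progression: 0.1 for epochs 0-9 -> 0.01 for epochs 10-19 -> 0.001 for epochs 20-29 -> ...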

This produces a “staircase” pattern — the learning rate stays constant for a while, then drops sharply, then stays constant again at the new lower value.
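
The staircase can also be written in closed form, which is handy for checking what a configuration will do before launching a run. The sketch below is a standalone helper (the name `step_decay_lr` is hypothetical) that mirrors the settings used above; it does not call PyTorch.

```python
def step_decay_lr(epoch, base_lr=0.1, step_size=10, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    # Integer division counts how many drops have happened so far.
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 9, 10, 19, 20):
    print(f"epoch {epoch:2d}: lr = {step_decay_lr(epoch):.4f}")
```

Epochs 0 through 9 share the initial rate, epochs 10 through 19 share one tenth of it, and so on.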


Cosine Annealing: Smooth, Gradual Decay

Cosine annealing decreases the learning rate smoothly following a cosine curve, starting high and gradually decreasing to near zero by the end of training, without the abrupt jumps of step decay.

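import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# model, train_loader, num_epochs, and train_one_epoch are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)  # minimum LR (eta_min) defaults to 0

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # one call per epoch, so T_max is measured in epochs

Learning rate over training (cosine annealing):
LR
│╲
│ ╲___
│     ╲___
│         ╲______
│                ╲___________
└──────────────────────────────
             epochs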

The smooth, continuous decrease avoids the sudden shock a step-decay drop can introduce, and cosine annealing has become a particularly popular default schedule for training modern computer vision and transformer architectures.
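
The curve itself is a half cosine wave scaled between a maximum and a minimum learning rate. The following sketch evaluates that closed form directly; the function name `cosine_annealing_lr` and its parameter names are assumptions for illustration, and the example values are arbitrary.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Learning rate at `epoch` on a half-cosine curve from base_lr to min_lr."""
    # The cosine term goes from 1 (at epoch 0) to 0 (at total_epochs).
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_annealing_lr(epoch, total_epochs=100):.4f}")
```

The rate changes slowly near the start and the end and fastest around the midpoint, where it equals the average of the maximum and minimum.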


Warmup: Starting Small Before Ramping Up

Counter-intuitively, many modern architectures — especially transformers, covered in Transformers — benefit from starting with a small learning rate and gradually increasing it over the first several hundred or thousand steps, before applying decay for the rest of training.

import numpy as np

def warmup_then_decay_lr(step, warmup_steps, base_lr, total_steps):
    """Return the learning rate for a given training step (iteration)."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)  # linear warmup from 0 up to base_lr
    else:
        # cosine decay from base_lr down to 0 after warmup
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return base_lr * 0.5 * (1 + np.cos(np.pi * progress))

Learning rate over training (warmup + cosine decay):
LR
│        ╱‾╲___
│       ╱      ╲___
│      ╱           ╲______
│     ╱                   ╲______
│____╱
└─────────────────────────────────
  warmup         then decay

Warmup exists because, early in training, weights are freshly initialized and gradients can be unreliable or unusually large. Jumping straight to a large learning rate at that moment risks instability of the kind covered in Exploding Gradient Problem. A brief warmup period lets the model’s activations and gradients settle into a more stable range before the full learning rate is applied.
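
One common way to package a warmup schedule is as a multiplier on the base learning rate rather than as an absolute value. The sketch below is a variant written under a few assumptions: the name `warmup_cosine_factor` is hypothetical, the `+ 1` in the warmup branch is a design choice so the very first update does not use a learning rate of exactly zero, and the step counts and `base_lr` are made up.

```python
import math


def warmup_cosine_factor(step, warmup_steps, total_steps):
    """Multiplier in [0, 1] applied to the base learning rate at `step`."""
    if step < warmup_steps:
        # Linear warmup; the +1 keeps the first update from using lr = 0.
        return (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to 1.0 so that
    # steps past total_steps keep the final (zero) multiplier.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 3e-4  # hypothetical peak learning rate
for step in (0, 499, 1000, 5500, 10000):
    factor = warmup_cosine_factor(step, warmup_steps=1000, total_steps=10000)
    print(f"step {step:5d}: lr = {base_lr * factor:.2e}")
```

A function of this shape, returning a multiplier and taking a step count, matches what PyTorch’s `LambdaLR` scheduler expects for its `lr_lambda` argument, provided `scheduler.step()` is called once per batch so that the count is in iterations.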


Reduce-on-Plateau: Adapting Based on Actual Progress

Rather than following a fixed, predetermined schedule, this approach monitors validation loss and reduces the learning rate only when progress has genuinely stalled.

from torch.optim.lr_scheduler import ReduceLROnPlateau

# optimizer, model, the data loaders, train_one_epoch, and evaluate are assumed to be defined elsewhere.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    # Halves the LR once val_loss has failed to improve for more than 'patience' epochs in a row.
    scheduler.step(val_loss)

This is particularly useful when you’re not certain in advance how many epochs training will need, adapting the schedule to the model’s actual observed progress rather than committing to a fixed schedule upfront.
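
The underlying bookkeeping is small enough to sketch in plain Python. The class below is a simplified illustration, not PyTorch’s implementation: the name `PlateauTracker` is hypothetical, it omits options such as improvement thresholds, cooldown, and a minimum learning rate, and the validation losses are made up.

```python
class PlateauTracker:
    """Simplified reduce-on-plateau logic for a metric that should decrease."""

    def __init__(self, lr, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # Strict improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # More than `patience` epochs without improvement: cut the LR.
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


tracker = PlateauTracker(lr=0.1, factor=0.5, patience=3)
losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.7, 0.72, 0.69]  # made-up validation losses
history = [tracker.step(loss) for loss in losses]
print(history)
```

With `patience=3`, three stalled epochs are tolerated and the reduction happens on the fourth, which is the seventh entry in this made-up sequence.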


Choosing a Schedule

Schedule             Best for
Step decay           Simple, well-understood default; works reasonably across many architectures
Cosine annealing     Smooth decay; popular default for modern vision and transformer training
Warmup + decay       Widely used for transformer training; helps early training stability
Reduce-on-plateau    Training duration isn’t fixed in advance; adapts to actual progress

Scheduling Per-Iteration vs. Per-Epoch

A detail worth being precise about: some schedules (like the warmup shown above) are typically defined in terms of training iterations (individual batch updates), while others (like step decay’s “every 10 epochs”) are defined in terms of epochs, the distinction covered in Epochs, Batch Size, and Iterations. Mixing these up is a common source of misconfigured training runs. A warmup meant to last 1,000 iterations but accidentally configured to last 1,000 epochs would keep the learning rate suppressed for what could be the entire run. Always check which unit a given scheduler implementation expects, since frameworks vary in their conventions and defaults. PyTorch’s built-in schedulers, for instance, simply count calls to scheduler.step(), so the unit is whatever the training loop makes it: once per epoch in the examples above, or once per batch for an iteration-based schedule.

Double-checking this single detail before a long, expensive training run begins is a cheap way to avoid discovering, many hours in, that the schedule never actually behaved as intended.
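
A quick way to run that check is to convert everything into iterations before configuring the scheduler. The sketch below does the arithmetic for a hypothetical setup; the function name `schedule_lengths`, the dataset size, the batch size, and the 5% warmup fraction are all assumptions chosen for illustration.

```python
import math


def schedule_lengths(num_samples, batch_size, num_epochs, warmup_fraction=0.05):
    """Return (steps_per_epoch, total_steps, warmup_steps), all in iterations."""
    # Assumes the final partial batch of each epoch is kept, not dropped.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(warmup_fraction * total_steps)
    return steps_per_epoch, total_steps, warmup_steps


steps_per_epoch, total_steps, warmup_steps = schedule_lengths(
    num_samples=50_000, batch_size=128, num_epochs=30
)
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps} "
      f"(about {warmup_steps / steps_per_epoch:.1f} epochs)")
```

Seeing the warmup length expressed in both units makes a mix-up obvious: a warmup of a few hundred iterations is a little over one epoch in this setup, whereas the same number read as epochs would exceed the whole run.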

Summary

Learning rate scheduling exists because the best learning rate is not constant throughout training. Starting appropriately (sometimes with warmup) and decreasing over time (via step decay, cosine annealing, or adaptive reduction) usually produces better final models than committing to one fixed value for the entire run.