Learning Rate Scheduling: Dynamic Optimization for Deep Learning

Master learning rate scheduling — warmup, cosine annealing, step decay, ReduceLROnPlateau, cyclic schedules, and how to pick the right schedule for your training run.

Learning Rate Scheduling

A fixed learning rate is rarely optimal throughout training. A rate that is too high late in training causes overshooting; one that is too low early on slows convergence. Learning rate schedules adapt the learning rate over time, which is often the difference between a model that converges and one that doesn’t.


Why Scheduling Matters

Training stages:
Early: Large LR → fast progress toward a loss basin
Middle: Medium LR → steady refinement within the basin
Late: Small LR → settle into a minimum without overshooting

A well-designed schedule can improve final accuracy by 1–3% on standard benchmarks — meaningful in competitive settings.


Common Schedules in PyTorch

Step Decay

Multiply LR by a factor every N epochs:

import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
# `model`, the data loaders, `train`, and `validate` are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# LR by epoch (0-indexed): 0.1 (epochs 0-29) → 0.01 (epochs 30-59) → 0.001 (epochs 60-89)
for epoch in range(90):
    train(model, train_loader, optimizer)
    validate(model, val_loader)
    scheduler.step()  # Once per epoch, AFTER that epoch's optimizer.step() calls
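To make the arithmetic behind step decay explicit, here is a minimal pure-Python sketch of the closed-form rule. The function name `step_decay_lr` is made up for illustration and is not part of PyTorch; it assumes 0-indexed epochs and one scheduler step per epoch.

```python
def step_decay_lr(base_lr: float, epoch: int, step_size: int = 30, gamma: float = 0.1) -> float:
    """LR in effect during `epoch` (0-indexed) under step decay.

    The LR is multiplied by `gamma` once for every full block of
    `step_size` epochs that has already been completed.
    """
    return base_lr * gamma ** (epoch // step_size)


# Print the LR at the boundaries of each 30-epoch block.
for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_decay_lr(0.1, epoch))
```

Because the exponent uses integer division, the LR stays constant inside each block and drops only at multiples of `step_size`.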

Cosine Annealing

Smoothly reduces LR following a cosine curve:

from torch.optim.lr_scheduler import CosineAnnealingLR
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,     # Scheduler steps to go from the initial LR down to eta_min (half a cosine period)
    eta_min=1e-6   # Minimum LR
)
# LR decreases smoothly from its initial value to eta_min over T_max scheduler steps
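The curve itself is a half cosine between the initial LR and `eta_min`. The sketch below writes that formula out in plain Python; `cosine_annealing_lr` is a hypothetical helper, and holding the LR at `eta_min` past `t_max` is a choice made for this sketch rather than a statement about any library's behavior.

```python
import math


def cosine_annealing_lr(base_lr: float, step: int, t_max: int, eta_min: float = 0.0) -> float:
    """Half-cosine decay from `base_lr` (step 0) to `eta_min` (step `t_max`)."""
    step = min(step, t_max)  # sketch-only choice: hold at eta_min once t_max is reached
    cosine = math.cos(math.pi * step / t_max)  # goes from +1 down to -1
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Sample the schedule at the start, quarter, middle, and end of the run.
for step in (0, 25, 50, 100):
    print(step, cosine_annealing_lr(0.1, step, t_max=100, eta_min=1e-6))
```

The decay is slow near the start and end and fastest in the middle, which is what distinguishes it from a linear ramp.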

Cosine Annealing with Warm Restarts (SGDR)

Periodically “restarts” the learning rate — helps escape local minima:

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,        # First cycle length
    T_mult=2,      # Each cycle is 2× as long as the previous
    eta_min=1e-6
)
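To see where the restarts fall with `T_0=10` and `T_mult=2`, the following pure-Python sketch tracks the position inside the current cycle and applies the cosine formula to it. The helper `sgdr_lr` is illustrative only, and it assumes an integer `t_mult` of at least 1.

```python
import math


def sgdr_lr(base_lr: float, epoch: int, t_0: int, t_mult: int = 1, eta_min: float = 0.0) -> float:
    """Cosine annealing with warm restarts, evaluated at a 0-indexed epoch."""
    t_cur, t_i = epoch, t_0
    # Walk past every completed cycle; each new cycle is t_mult times longer.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


# A restart is any epoch where the LR jumps up relative to the previous epoch.
restarts = [e for e in range(1, 80) if sgdr_lr(0.1, e, 10, 2) > sgdr_lr(0.1, e - 1, 10, 2)]
print(restarts)
```

With these settings the cycles cover 10, 20, and 40 epochs, so each restart arrives later than the last.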

ReduceLROnPlateau

Reduces the LR when a monitored metric stops improving. It is adaptive, so you don't need to know the total epoch count in advance:

from torch.optim.lr_scheduler import ReduceLROnPlateau
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',    # Lower is better (e.g. validation loss)
    factor=0.5,    # Multiply LR by 0.5 when triggered
    patience=10,   # Tolerate 10 epochs with no improvement before reducing
    min_lr=1e-6    # Never reduce below this LR
)
for epoch in range(num_epochs):
    train_loss = train(...)
    val_loss = validate(...)
    scheduler.step(val_loss)  # Pass the metric to monitor
    # Log the LR yourself; the verbose flag is deprecated in recent PyTorch releases
    print(epoch, optimizer.param_groups[0]['lr'])
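The patience logic is easy to misread, so here is a simplified pure-Python sketch of it. `PlateauTracker` is a hypothetical class, not the PyTorch implementation: it treats any strict decrease as an improvement and leaves out the improvement threshold and cooldown that the real scheduler supports.

```python
class PlateauTracker:
    """Simplified reduce-on-plateau rule for a metric that should decrease."""

    def __init__(self, lr: float, factor: float = 0.5, patience: int = 10, min_lr: float = 1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.num_bad_epochs = 0

    def step(self, metric: float) -> float:
        if metric < self.best:
            # New best value: remember it and reset the counter.
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        # Reduce only once the number of bad epochs EXCEEDS patience.
        if self.num_bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.num_bad_epochs = 0
        return self.lr


tracker = PlateauTracker(lr=0.1, factor=0.5, patience=2)
history = [tracker.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
print(history)
```

In this run the loss stalls at 0.9 for three epochs in a row; with `patience=2` the first two stalls are tolerated and the LR is halved on the third.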

Linear Warmup

Transformers and large models train poorly with a large LR from the start — the randomly initialized weights produce large, noisy gradients. Linear warmup gradually ramps up the LR over the first few epochs:

from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
# Warmup for first 10% of training, then cosine annealing
# `optimizer` and `total_epochs` are assumed to be defined elsewhere
warmup_epochs = max(1, int(0.1 * total_epochs))
warmup_scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0,
                            total_iters=warmup_epochs)
main_scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs)
scheduler = SequentialLR(optimizer,
                         schedulers=[warmup_scheduler, main_scheduler],
                         milestones=[warmup_epochs])
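The combined shape (a linear ramp from 10% of the base LR, then a cosine decay) can also be written as a single function of the epoch. The sketch below is a pure-Python approximation of that composition; `warmup_cosine_lr` and its parameter names are illustrative, not library API.

```python
import math


def warmup_cosine_lr(base_lr: float, epoch: int, total_epochs: int,
                     warmup_frac: float = 0.1, start_factor: float = 0.1,
                     eta_min: float = 0.0) -> float:
    """Linear warmup for the first `warmup_frac` of training, then cosine decay."""
    warmup_epochs = max(1, int(warmup_frac * total_epochs))
    if epoch < warmup_epochs:
        # Ramp the multiplier linearly from start_factor up to 1.0.
        factor = start_factor + (1.0 - start_factor) * epoch / warmup_epochs
        return base_lr * factor
    # After warmup, run a half cosine over the remaining epochs.
    decay_epochs = max(1, total_epochs - warmup_epochs)
    progress = min(1.0, (epoch - warmup_epochs) / decay_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule across a 100-epoch run.
for epoch in (0, 5, 10, 55, 100):
    print(epoch, warmup_cosine_lr(1e-3, epoch, total_epochs=100))
```

The LR peaks exactly at the end of warmup, which is the hand-off point between the two phases.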

Hugging Face Warmup (Transformers)

from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
# `total_steps` is the number of optimizer steps in the whole run (assumed defined elsewhere)
# Option A: linear warmup + linear decay (BERT-style)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=total_steps
)
# Option B: linear warmup + cosine decay (GPT-style); create only one of the two
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=total_steps
)
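These schedules work per optimizer step and scale the base LR by a multiplier. As a rough illustration of the linear warmup plus linear decay shape, the sketch below computes such a multiplier in plain Python. The function `linear_warmup_linear_decay` and the values of `steps_per_epoch` and `num_epochs` are assumptions for this example; it reflects the general shape of the schedule rather than the library's exact code.

```python
def linear_warmup_linear_decay(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base LR at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp from 0 up to 1 over the warmup steps.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Hypothetical run: 500 batches per epoch for 20 epochs, no gradient accumulation.
steps_per_epoch, num_epochs = 500, 20
total_steps = steps_per_epoch * num_epochs

for step in (0, 500, 1000, 5500, total_steps):
    print(step, linear_warmup_linear_decay(step, 1000, total_steps))
```

Note that `total_steps` counts optimizer updates, so with gradient accumulation it would be smaller than the number of batches.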

OneCycleLR: State-of-the-Art for Fast Training

Combines warmup + annealing into one cycle. Often achieves competitive results in fewer epochs:

import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR
# OneCycleLR overwrites the optimizer's lr: it starts at max_lr / div_factor (default 25)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
num_epochs = 25
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,                         # Peak LR
    steps_per_epoch=len(train_loader),
    epochs=num_epochs,
    pct_start=0.3,                      # First 30% of steps ramp the LR up
    anneal_strategy='cos'               # Cosine annealing
)
# Call every BATCH (not every epoch!), for exactly epochs * steps_per_epoch steps
for epoch in range(num_epochs):
    for X_batch, y_batch in train_loader:
        loss = criterion(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # Per-step for OneCycleLR
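To show what the two phases look like numerically, here is a pure-Python sketch of a one-cycle schedule with cosine interpolation in both phases. It is an approximation for illustration: `one_cycle_lr` is a hypothetical helper, the defaults `div_factor=25` and `final_div_factor=1e4` are assumed to match the PyTorch defaults, and it requires `pct_start * total_steps` to be greater than 1.

```python
import math


def one_cycle_lr(step: int, total_steps: int, max_lr: float, pct_start: float = 0.3,
                 div_factor: float = 25.0, final_div_factor: float = 1e4) -> float:
    """LR at a 0-indexed optimizer step under a two-phase one-cycle policy."""
    initial_lr = max_lr / div_factor          # LR at the very first step
    min_lr = initial_lr / final_div_factor    # LR at the very last step
    peak_step = pct_start * total_steps - 1   # last step of the ramp-up phase
    last_step = total_steps - 1

    def cos_anneal(start: float, end: float, pct: float) -> float:
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= peak_step:
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    return cos_anneal(max_lr, min_lr, (step - peak_step) / (last_step - peak_step))


# Sample a 100-step run: start, peak, midway through the decay, final step.
for step in (0, 29, 64, 99):
    print(step, one_cycle_lr(step, total_steps=100, max_lr=0.1))
```

The final LR ends up far below the starting LR, which is why the schedule is tied to a fixed total number of steps.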

Schedule Selection Guide

Scenario | Recommended Schedule
Training from scratch (CNNs) | Cosine annealing or OneCycleLR
Transformer pretraining | Linear warmup + cosine decay
Fine-tuning pretrained model | Linear warmup + linear decay or ReduceLROnPlateau
Don’t know training length | ReduceLROnPlateau
Quick experiments | StepLR

The right schedule won’t rescue a bad architecture or bad data, but it can meaningfully accelerate training and improve final performance on well-designed models.