Learning Rate Scheduling
A fixed learning rate is rarely optimal throughout training. Too high at the end causes overshooting; too low at the start slows convergence. Learning rate schedules adapt the learning rate over time — often the difference between a model that converges and one that doesn’t.
Why Scheduling Matters
Training stages:

- Early: large LR → fast progress toward the loss basin
- Middle: medium LR → refinement within the basin
- Late: small LR → settles into a minimum without overshooting

A well-designed schedule can improve final accuracy by 1–3% on standard benchmarks, which is meaningful in competitive settings.
Common Schedules in PyTorch
Step Decay
Multiply LR by a factor every N epochs:
```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# model, train_loader, val_loader, train() and validate() are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# LR: 0.1 (epochs 0-29) → 0.01 (epochs 30-59) → 0.001 (epochs 60-89)

for epoch in range(90):
    train(model, train_loader, optimizer)
    validate(model, val_loader)
    scheduler.step()  # Once per epoch, AFTER that epoch's optimizer.step() calls
```

Cosine Annealing
Smoothly reduces LR following a cosine curve:
```python
from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,     # Scheduler steps to go from the initial LR to eta_min (half a cosine period)
    eta_min=1e-6   # Minimum LR
)
# LR decreases smoothly from initial to eta_min over T_max steps
```

Cosine Annealing with Warm Restarts (SGDR)
Periodically “restarts” the learning rate — helps escape local minima:
```python
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,       # First cycle length
    T_mult=2,     # Each cycle is 2× as long as the previous
    eta_min=1e-6
)
```

ReduceLROnPlateau
Reduces LR when a metric stops improving — adaptive, no need to set epoch count:
```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',    # Minimize validation loss
    factor=0.5,    # Multiply LR by 0.5 when triggered
    patience=10,   # Wait 10 epochs with no improvement
    min_lr=1e-6
)
# verbose=True is deprecated in recent PyTorch releases;
# read optimizer.param_groups[0]['lr'] to log the current LR instead.

for epoch in range(num_epochs):
    train_loss = train(...)
    val_loss = validate(...)
    scheduler.step(val_loss)  # Pass the metric to monitor
```

Linear Warmup
Transformers and large models train poorly with a large LR from the start — the randomly initialized weights produce large, noisy gradients. Linear warmup gradually ramps up the LR over the first few epochs:
```python
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Warmup for first 10% of training, then cosine annealing
warmup_epochs = max(1, int(0.1 * total_epochs))
warmup_scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_epochs)
main_scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs)

scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, main_scheduler],
    milestones=[warmup_epochs]
)
```

HuggingFace Warmup (Transformers)
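The Hugging Face helpers in this section count optimizer updates rather than epochs, so `total_steps` has to be worked out before the scheduler is built. A minimal sketch of that arithmetic follows. The names `count_training_steps`, `batches_per_epoch`, and `grad_accum_steps` are hypothetical and not part of the `transformers` API. The rounding assumes a leftover partial accumulation group at the end of an epoch still triggers an update, which depends on how the training loop is written.

```python
import math


def count_training_steps(batches_per_epoch: int, num_epochs: int,
                         grad_accum_steps: int = 1) -> int:
    """Number of optimizer updates over the whole run (hypothetical helper)."""
    # Assumption: a final, partially filled accumulation group still steps the optimizer.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * num_epochs


# Example numbers chosen for illustration only.
total_steps = count_training_steps(batches_per_epoch=500, num_epochs=3)
warmup_steps = total_steps // 10  # 10% warmup, mirroring the Linear Warmup example
```

With a step-based scheduler, `scheduler.step()` is then called once after every `optimizer.step()`, not once per epoch.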
```python
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup

# Linear warmup + linear decay (BERT-style)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=total_steps
)

# Linear warmup + cosine decay (GPT-style)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=total_steps
)
```

OneCycleLR: State-of-the-Art for Fast Training
Combines warmup + annealing into one cycle. Often achieves competitive results in fewer epochs:
```python
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR

# OneCycleLR overrides the optimizer's lr: training starts at max_lr / div_factor (default 25)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

num_epochs = 25
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,                         # Peak LR
    steps_per_epoch=len(train_loader),
    epochs=num_epochs,
    pct_start=0.3,                      # 30% of training is warmup
    anneal_strategy='cos'               # Cosine annealing
)

# Call every BATCH (not every epoch!)
for epoch in range(num_epochs):
    for X_batch, y_batch in train_loader:
        loss = criterion(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # Per-step for OneCycleLR
```

Schedule Selection Guide
| Scenario | Recommended Schedule |
|---|---|
| Training from scratch (CNNs) | Cosine annealing or OneCycleLR |
| Transformer pretraining | Linear warmup + cosine decay |
| Fine-tuning pretrained model | Linear warmup + linear decay or ReduceLROnPlateau |
| Don’t know training length | ReduceLROnPlateau |
| Quick experiments | StepLR |
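
When two rows of the table look equally plausible, it can help to print each candidate's LR at a few epochs before committing to a run. The sketch below does this in plain Python, with no PyTorch dependency. The function names are hypothetical, and the formulas are meant to mirror the closed forms documented for `StepLR`, `CosineAnnealingLR`, and a `LinearLR` warmup chained into cosine annealing. They cover a single run only (epochs up to `t_max`) and are not a substitute for the real schedulers.

```python
import math


def step_decay(base_lr, epoch, step_size=30, gamma=0.1):
    """LR after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing(base_lr, epoch, t_max, eta_min=0.0):
    """LR on a half cosine from base_lr (epoch 0) down to eta_min (epoch t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


def warmup_then_cosine(base_lr, epoch, warmup_epochs, total_epochs, start_factor=0.1):
    """Linear warmup from start_factor * base_lr, then cosine annealing to zero."""
    if epoch < warmup_epochs:
        factor = start_factor + (1.0 - start_factor) * epoch / warmup_epochs
        return base_lr * factor
    # The cosine phase restarts its own clock at the end of warmup.
    return cosine_annealing(base_lr, epoch - warmup_epochs, total_epochs - warmup_epochs)


# Print a few sample points for a 90-epoch run with base LR 0.1.
for epoch in (0, 10, 30, 60, 89):
    print(
        f"epoch {epoch:2d}  "
        f"step={step_decay(0.1, epoch):.5f}  "
        f"cosine={cosine_annealing(0.1, epoch, t_max=90):.5f}  "
        f"warmup+cosine={warmup_then_cosine(0.1, epoch, 9, 90):.5f}"
    )
```

Reading the printed columns side by side shows where each schedule spends its time at a high LR, which is usually what separates the options for a given training budget.
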
The right schedule won’t rescue a bad architecture or bad data, but it can meaningfully accelerate training and improve final performance on well-designed models.