Learning Rate Scheduling

A fixed learning rate is rarely optimal throughout training. Too high at the end causes overshooting; too low at the start slows convergence. Learning rate schedules adapt the learning rate over time, and a good one is often the difference between a model that converges and one that doesn’t.

Why Scheduling Matters

Training stages:
  Early:  Large LR → fast progress toward loss basin
  Middle: Medium LR → steady refinement within the basin
  Late:   Small LR → settle into a minimum without overshooting

A well-designed schedule can improve final accuracy by roughly 1–3 percentage points on standard benchmarks, though the gain depends on the model and dataset. That margin is meaningful in competitive settings.

Common Schedules in PyTorch

Step Decay

Multiply LR by a factor every N epochs:

import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# LR by epoch: 0.1 (epochs 0–29) → 0.01 (epochs 30–59) → 0.001 (epochs 60–89)

for epoch in range(90):
    train(model, train_loader, optimizer)
    validate(model, val_loader)
    scheduler.step()  # Once per epoch, AFTER the optimizer.step() calls inside train()
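
The decay rule above can be written in closed form: lr = base_lr × gamma^floor(epoch / step_size). The following is a minimal pure-Python sketch of that formula, useful for checking a schedule before training. It is not PyTorch's implementation, and step_decay_lr is a hypothetical helper name.

```python
def step_decay_lr(base_lr: float, epoch: int, step_size: int = 30, gamma: float = 0.1) -> float:
    """Closed-form step decay: multiply by gamma once per completed block of step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Print the LR in effect during selected epochs of a 90-epoch run
for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_decay_lr(0.1, epoch))
```

Printing the values this way makes it easy to confirm that the last drop happens at an epoch the training loop actually reaches.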

Cosine Annealing

Smoothly reduces LR following a cosine curve:

from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,     # Epochs to go from the initial LR to eta_min (half a cosine period)
    eta_min=1e-6   # Minimum LR
)
# LR decreases smoothly from initial to eta_min over T_max steps
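
The curve can be sketched in a few lines of plain Python: lr = eta_min + (base_lr - eta_min) × (1 + cos(π · t / T_max)) / 2. This is an illustrative helper (cosine_annealing_lr is a hypothetical name), and it makes one simplifying assumption: the LR is held at eta_min once t exceeds T_max, whereas a real scheduler may keep following the cosine.

```python
import math


def cosine_annealing_lr(base_lr: float, step: int, t_max: int, eta_min: float = 0.0) -> float:
    """Cosine decay from base_lr (step 0) down to eta_min (step t_max)."""
    step = min(step, t_max)  # Assumption: stay at eta_min after t_max steps
    cosine = (1 + math.cos(math.pi * step / t_max)) / 2  # Goes from 1 down to 0
    return eta_min + (base_lr - eta_min) * cosine


# LR at the start, midpoint, and end of a 100-epoch schedule
for epoch in (0, 50, 100):
    print(epoch, cosine_annealing_lr(0.1, epoch, t_max=100, eta_min=1e-6))
```

At the midpoint the LR sits halfway between the initial value and eta_min; most of the decay happens in the middle of training, with gentle slopes at both ends.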

Cosine Annealing with Warm Restarts (SGDR)

Periodically “restarts” the learning rate at its initial value, which can help the optimizer escape poor local minima:

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,     # First cycle length
    T_mult=2,   # Each cycle is 2× as long as the previous
    eta_min=1e-6
)
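
To see where the restarts land, the schedule can be sketched in plain Python. This is a simplified illustration evaluated at whole epochs, not the library's code, and sgdr_lr is a hypothetical helper name. With T_0=10 and T_mult=2 the cycles last 10, 20, and 40 epochs, so restarts fall at epochs 10, 30, and 70.

```python
import math


def sgdr_lr(base_lr: float, epoch: int, t_0: int = 10, t_mult: int = 2, eta_min: float = 0.0) -> float:
    """Cosine annealing with warm restarts, evaluated at an integer epoch."""
    t_i, t_cur = t_0, epoch
    # Skip over completed cycles; each cycle is t_mult times longer than the previous one
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Cosine decay within the current cycle (t_cur epochs into a cycle of length t_i)
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i)) / 2


# The LR jumps back to base_lr at each restart
for epoch in (0, 9, 10, 29, 30, 69, 70):
    print(epoch, sgdr_lr(0.1, epoch, eta_min=1e-6))
```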

ReduceLROnPlateau

Reduces LR when a monitored metric stops improving. Because it reacts to the metric rather than the epoch number, you don’t need to fix the training length in advance:

from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',       # Minimize validation loss
    factor=0.5,       # Multiply LR by 0.5 when triggered
    patience=10,      # Wait 10 epochs with no improvement
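    min_lr=1e-6,      # Never reduce below this LR
    # verbose=True is deprecated in recent PyTorch releases; print
    # optimizer.param_groups[0]['lr'] yourself to see when the LR changes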
)

for epoch in range(num_epochs):
    train_loss = train(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)  # Must return the metric to monitor
    scheduler.step(val_loss)  # Pass the metric to monitor
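
The patience logic can be illustrated with a small pure-Python simulation. This is a simplified sketch under stated assumptions: any strict decrease counts as an improvement, and there is no improvement threshold or cooldown, both of which PyTorch's scheduler supports. plateau_schedule is a hypothetical helper name.

```python
def plateau_schedule(val_losses, lr=0.1, factor=0.5, patience=10, min_lr=1e-6):
    """Return the LR in effect after each epoch under a simplified reduce-on-plateau rule."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss          # New best: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # No improvement this epoch
        if bad_epochs > patience:
            lr = max(lr * factor, min_lr)  # Cut the LR, but respect the floor
            bad_epochs = 0
        history.append(lr)
    return history


# With patience=2, the LR is halved on the third epoch without improvement
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]
print(plateau_schedule(losses, patience=2))
```

Note that the reduction fires only after patience epochs have passed without improvement and one more bad epoch arrives, so patience=10 tolerates ten flat epochs before acting on the eleventh.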

Linear Warmup

Transformers and other large models often train poorly with a large LR from the start, because randomly initialized weights produce large, noisy gradients. Linear warmup gradually ramps the LR up over the first few epochs (or steps):

from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Warmup for first 10% of training, then cosine annealing
warmup_epochs = max(1, int(0.1 * total_epochs))
warmup_scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0,
                             total_iters=warmup_epochs)
main_scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs)

scheduler = SequentialLR(optimizer,
                          schedulers=[warmup_scheduler, main_scheduler],
                          milestones=[warmup_epochs])
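
The combined shape (linear ramp, then cosine decay) can be sketched as a single function of the epoch. This is an approximation for intuition only: it mirrors the parameters used above but is not guaranteed to match the exact values PyTorch produces at the hand-off between the two schedulers. warmup_cosine_lr is a hypothetical helper name.

```python
import math


def warmup_cosine_lr(base_lr, epoch, total_epochs, warmup_frac=0.1,
                     start_factor=0.1, eta_min=0.0):
    """Linear warmup from start_factor * base_lr to base_lr, then cosine decay to eta_min."""
    warmup_epochs = max(1, int(warmup_frac * total_epochs))
    if epoch < warmup_epochs:
        # Linear ramp of the multiplier from start_factor up to 1.0
        factor = start_factor + (1.0 - start_factor) * epoch / warmup_epochs
        return base_lr * factor
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * progress)) / 2


# Start of warmup, mid-warmup, peak, mid-decay, end of training
for epoch in (0, 5, 10, 55, 100):
    print(epoch, warmup_cosine_lr(0.1, epoch, total_epochs=100))
```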

Hugging Face Warmup (Transformers)

from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup

# total_steps = num_epochs * len(train_loader); call scheduler.step() once per batch
# Linear warmup + linear decay (BERT-style)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=total_steps
)

# Linear warmup + cosine decay (GPT-style)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=total_steps
)
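
Both helpers scale the optimizer's base LR by a multiplier that depends on the step count. As a rough sketch of the warmup + linear decay shape (written from the documented behavior, so treat the exact boundary handling as an assumption), the multiplier rises from 0 to 1 over the warmup steps and then falls linearly back to 0. linear_warmup_decay_factor is a hypothetical helper name.

```python
def linear_warmup_decay_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base LR: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Multiplier at a few points of a 10,000-step run with 1,000 warmup steps
for step in (0, 500, 1000, 5500, 10000):
    print(step, linear_warmup_decay_factor(step, 1000, 10000))
```

One practical consequence of this shape is that the very first optimizer step uses an LR of zero, so the warmup length should be small relative to the total number of steps.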

OneCycleLR: Fast Training in a Single Cycle

Combines warmup + annealing into one cycle. Often achieves competitive results in fewer epochs:

from torch.optim.lr_scheduler import OneCycleLR

# OneCycleLR overrides this lr: it starts at max_lr / div_factor (default div_factor=25)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,                    # Peak LR
    steps_per_epoch=len(train_loader),
    epochs=25,
    pct_start=0.3,                 # 30% of training is warmup
    anneal_strategy='cos'          # Cosine annealing
)

# Call every BATCH (not every epoch!)
for epoch in range(25):  # Must match epochs=25 passed to OneCycleLR
    for X_batch, y_batch in train_loader:
        loss = criterion(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # Per-step for OneCycleLR
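
The shape of the cycle can be approximated in plain Python. This is a simplified two-phase sketch assuming the default div_factor=25 and final_div_factor=1e4 with cosine annealing; it ignores momentum cycling and is meant for intuition, not as a replica of the library's code. one_cycle_lr and _cos_interp are hypothetical helper names.

```python
import math


def _cos_interp(start: float, end: float, pct: float) -> float:
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))


def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Approximate one-cycle LR: ramp up to max_lr, then anneal far below the starting LR."""
    initial_lr = max_lr / div_factor          # Starting LR
    final_lr = initial_lr / final_div_factor  # LR at the last step
    peak_step = pct_start * total_steps - 1   # Step at which max_lr is reached
    last_step = total_steps - 1
    if step <= peak_step:
        return _cos_interp(initial_lr, max_lr, step / peak_step)
    return _cos_interp(max_lr, final_lr, (step - peak_step) / (last_step - peak_step))


# Example: 25 epochs of 100 batches each
total_steps = 25 * 100
for step in (0, 749, 1500, 2499):
    print(step, one_cycle_lr(step, total_steps))
```

The total step count is epochs × steps_per_epoch, which is why the scheduler must be stepped once per batch and why the training loop has to run for exactly the number of epochs declared.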

Schedule Selection Guide

Scenario	Recommended Schedule
Training from scratch (CNNs)	Cosine annealing or OneCycleLR
Transformer pretraining	Linear warmup + cosine decay
Fine-tuning pretrained model	Linear warmup + linear decay or ReduceLROnPlateau
Don’t know training length	ReduceLROnPlateau
Quick experiments	StepLR

The right schedule won’t rescue a bad architecture or bad data, but it can meaningfully accelerate training and improve final performance on well-designed models.