Learning Rate Explained: The Single Most Important Hyperparameter
If you could only tune one hyperparameter before training a neural network, the learning rate would be the obvious choice — it has a larger, more immediate impact on whether training succeeds or fails than almost anything else you could adjust. Get it wrong, and no amount of additional training time or model capacity will fix the problem; get it right, and even a modest architecture trains efficiently.
What the Learning Rate Actually Controls
The learning rate scales how large a step gradient descent takes in the direction of the computed gradient, covered in Gradient Descent.
learning_rate = 0.01weight_update = learning_rate * gradientnew_weight = old_weight - weight_updateA larger learning rate means bigger steps per update — potentially faster progress, but with a real risk of overshooting the minimum entirely. A smaller learning rate means smaller, more cautious steps — safer, but potentially painfully slow to converge, or prone to getting stuck in a shallow local minimum.
Learning Rate Too High: Divergence
When the learning rate is too large, weight updates overshoot the minimum so dramatically that the loss doesn’t decrease steadily — it oscillates wildly, or diverges to increasingly large values, sometimes producing NaN loss within just a few steps.
Loss over training steps, learning rate too high:
Loss │ ╱╲ ╱╲╲ │ ╱ ╲ ╱ ╲╲ │ ╱ ╲ ╱ ╲╲___(diverges or oscillates wildly) └────────────────────────── stepsThis is one of the most common and most immediately diagnosable training failures — if loss is erratic or increasing from the very first few steps, the learning rate is almost always the first thing to check and reduce.
Learning Rate Too Low: Painfully Slow Convergence
When the learning rate is too small, each weight update makes only a tiny amount of progress — training might technically be working correctly, but so slowly that it would take an impractical number of epochs to reach a good solution, or it can get stuck in a shallow local minimum without enough “momentum” to escape it.
Loss over training steps, learning rate too low:
Loss │╲ │ ╲___ │ ╲______ │ ╲__________________ (barely decreasing, extremely slowly) └────────────────────────── stepsA Well-Chosen Learning Rate
Loss over training steps, learning rate well-chosen:
Loss │╲ │ ╲╲ │ ╲╲__ │ ╲╲___ │ ╲╲______ (steady, efficient decrease toward a good minimum) └────────────────────────── stepsThis is the pattern you’re aiming for — a smooth, steady decrease without wild oscillation, converging to a low loss value in a reasonable number of steps.
Practical Techniques for Choosing a Learning Rate
Start with framework defaults or established values from similar tasks. For Adam (covered in Optimizers), 0.001 is a common, reasonable starting point for many problems.
The learning rate finder technique. Run a short training pass while gradually increasing the learning rate from a very small to a very large value, plotting loss against learning rate — the ideal learning rate is typically found just before the loss curve starts increasing sharply.
import numpy as npimport matplotlib.pyplot as plt
learning_rates = np.logspace(-6, 0, 100) # from 0.000001 to 1.0losses = []
for lr in learning_rates: # train for a single step or few steps at this learning rate loss = train_one_step(model, lr) losses.append(loss)
plt.plot(learning_rates, losses)plt.xscale('log')# Look for the point where loss starts decreasing fastest,# and choose a learning rate somewhat before it starts increasing againWatch the first few hundred steps closely. If loss diverges or oscillates immediately, the learning rate is too high — reduce it by a factor of 10 and try again. If loss barely moves after many steps, it’s likely too low.
Why the Learning Rate Interacts With Everything Else
The “correct” learning rate isn’t a fixed, universal number — it interacts with batch size (larger batches often tolerate, and sometimes require, a proportionally larger learning rate, covered in Epochs, Batch Size, and Iterations), with the specific optimizer used (Adam and SGD often need meaningfully different learning rates for the same problem), and with the network’s depth (deeper networks are often more sensitive to learning rate choice). This is exactly why Learning Rate Scheduling exists — rather than committing to one fixed value for the entire training run, adjusting the learning rate over time captures the benefit of larger steps early and smaller, more precise steps later.
Learning Rate and Optimizer Choice Are Linked, Not Independent
A commonly overlooked detail: the “right” learning rate for a given problem depends heavily on which optimizer is being used, covered fully in Optimizers. Plain SGD often needs a meaningfully larger learning rate (sometimes 0.01 to 0.1) than Adam, which typically works well starting around 0.001 — this is a direct consequence of Adam’s internal adaptive per-parameter scaling, which already adjusts the effective step size somewhat automatically. Copying a learning rate value from a tutorial using a different optimizer than the one you’re actually using is a common, easy-to-make mistake that produces confusingly different training behavior than expected, even with an otherwise identical setup.
Summary
| Learning rate | Symptom | Fix |
|---|---|---|
| Too high | Loss oscillates or diverges, possibly to NaN | Reduce by 10x, try again |
| Too low | Loss decreases extremely slowly | Increase gradually, or use a learning rate finder |
| Well-chosen | Smooth, steady loss decrease | Continue, consider a schedule for further refinement |
No other single hyperparameter has as much immediate, visible impact on whether training succeeds at all — when a model isn’t training well and you’re not sure where to start debugging, the learning rate should be the very first thing you check.