Learning Rate Explained: The Single Most Important Hyperparameter

Why the learning rate has more impact on training success than almost any other hyperparameter, with practical guidance for choosing one.

If you could only tune one hyperparameter before training a neural network, the learning rate would be the obvious choice — it has a larger, more immediate impact on whether training succeeds or fails than almost anything else you could adjust. Get it wrong, and no amount of additional training time or model capacity will fix the problem; get it right, and even a modest architecture trains efficiently.


What the Learning Rate Actually Controls

The learning rate scales the size of the step that gradient descent takes against the computed gradient, that is, in the direction that reduces the loss (gradient descent itself is covered in Gradient Descent).

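learning_rate = 0.01
old_weight = 0.5  # example value; in practice this is a model parameter
gradient = 2.0    # example value; in practice this comes from backpropagation
weight_update = learning_rate * gradient
new_weight = old_weight - weight_update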

A larger learning rate means bigger steps per update — potentially faster progress, but with a real risk of overshooting the minimum entirely. A smaller learning rate means smaller, more cautious steps — safer, but potentially painfully slow to converge, or prone to getting stuck in a shallow local minimum.
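To make the trade-off concrete, here is a minimal sketch that applies the update rule repeatedly to a toy one-dimensional loss, f(w) = w**2, whose minimum sits at w = 0. The function name `run_gradient_descent`, the starting weight, and the two learning rates are illustrative assumptions, not values taken from any framework.

```python
def run_gradient_descent(learning_rate, steps=10, start_weight=5.0):
    """Apply the update rule to the toy loss f(w) = w**2 and return the final weight."""
    weight = start_weight
    for _ in range(steps):
        gradient = 2.0 * weight  # derivative of w**2
        weight = weight - learning_rate * gradient
    return weight


small_step = run_gradient_descent(learning_rate=0.01)
large_step = run_gradient_descent(learning_rate=0.1)

# After the same number of updates, the larger learning rate ends closer to the minimum at 0.
print(f"lr=0.01 -> weight {small_step:.4f}")
print(f"lr=0.1  -> weight {large_step:.4f}")
```

On a surface this simple, the larger rate is strictly better. Real loss surfaces are not this forgiving, which is where the overshooting risk comes in.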


Learning Rate Too High: Divergence

When the learning rate is too large, weight updates overshoot the minimum so dramatically that the loss doesn’t decrease steadily — it oscillates wildly, or diverges to increasingly large values, sometimes producing NaN loss within just a few steps.

Loss over training steps, learning rate too high:
Loss
│     ╱╲      ╱╲╲
│    ╱  ╲    ╱  ╲╲
│   ╱    ╲  ╱    ╲╲___ (diverges or oscillates wildly)
└──────────────────────────
  steps

This is one of the most common and most immediately diagnosable training failures — if loss is erratic or increasing from the very first few steps, the learning rate is almost always the first thing to check and reduce.
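One way to see why overshooting compounds: on the toy loss f(w) = w**2, each update multiplies the weight by (1 - 2 * learning_rate). Once the magnitude of that factor exceeds 1, every step lands farther from the minimum than the one before. The sketch below is illustrative only; `loss_history` and the chosen rates are assumptions, and real loss surfaces do not have a single clean threshold like this.

```python
def loss_history(learning_rate, steps=50, start_weight=1.0):
    """Run gradient descent on the toy loss f(w) = w**2, recording the loss before each update."""
    weight = start_weight
    history = []
    for _ in range(steps):
        history.append(weight ** 2)
        weight -= learning_rate * 2.0 * weight  # gradient of w**2 is 2w
    return history


stable = loss_history(learning_rate=0.4)     # per-step factor 1 - 0.8 = 0.2
diverging = loss_history(learning_rate=1.1)  # per-step factor 1 - 2.2 = -1.2

print("lr=0.4, first five losses:", [f"{v:.3g}" for v in stable[:5]])
print("lr=1.1, first five losses:", [f"{v:.3g}" for v in diverging[:5]])
```

With the larger rate the weight flips sign on every update while growing in magnitude, so the recorded loss rises at every step instead of falling.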


Learning Rate Too Low: Painfully Slow Convergence

When the learning rate is too small, each weight update makes only a tiny amount of progress. Training may technically be working correctly, but so slowly that reaching a good solution would take an impractical number of epochs. Very small steps can also leave the optimizer stuck in a shallow local minimum that a larger step would have carried it past.

Loss over training steps, learning rate too low:
Loss
│╲
│ ╲___
│     ╲______
│            ╲__________________ (barely decreasing, extremely slowly)
└──────────────────────────
  steps
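A rough sense of how quickly the cost of a small learning rate adds up can be had by counting updates on a toy problem. The sketch below counts how many gradient descent updates the loss f(w) = w**2 needs to fall below a target value; `steps_to_reach`, the target, and the starting weight are all assumptions chosen for illustration.

```python
def steps_to_reach(target_loss, learning_rate, start_weight=5.0, max_steps=1_000_000):
    """Count gradient descent updates on f(w) = w**2 until the loss drops below target_loss."""
    weight = start_weight
    for step in range(max_steps):
        if weight ** 2 < target_loss:
            return step
        weight -= learning_rate * 2.0 * weight  # gradient of w**2 is 2w
    return None  # target not reached within max_steps


for lr in (0.1, 0.01, 0.001, 0.0001):
    print(f"lr={lr}: {steps_to_reach(1e-6, lr)} updates")
```

On this toy loss, each tenfold reduction in the learning rate costs roughly ten times as many updates to reach the same loss.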

A Well-Chosen Learning Rate

Loss over training steps, learning rate well-chosen:
Loss
│╲
│ ╲╲
│   ╲╲__
│       ╲╲___
│            ╲╲______ (steady, efficient decrease toward a good minimum)
└──────────────────────────
  steps

This is the pattern you’re aiming for — a smooth, steady decrease without wild oscillation, converging to a low loss value in a reasonable number of steps.


Practical Techniques for Choosing a Learning Rate

Start with framework defaults or established values from similar tasks. For Adam (covered in Optimizers), 0.001 is a common, reasonable starting point for many problems.

The learning rate finder technique. Run a short training pass while gradually increasing the learning rate from a very small value to a very large one, and plot loss against learning rate. A good choice typically lies in the region where the loss is falling fastest, comfortably below the point where the curve turns upward and starts increasing sharply.

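import numpy as np
import matplotlib.pyplot as plt

# `model` and `train_one_step` are placeholders for your own model and training code;
# train_one_step should apply one update (or a few) at the given rate and return the loss.
learning_rates = np.logspace(-6, 0, 100)  # from 0.000001 to 1.0
losses = []
for lr in learning_rates:
    # keep training the same model, raising the learning rate at each step
    loss = train_one_step(model, lr)
    losses.append(loss)

plt.plot(learning_rates, losses)
plt.xscale('log')
plt.xlabel('learning rate')
plt.ylabel('loss')
plt.show()
# Look for the region where loss decreases fastest,
# and choose a learning rate comfortably below where it starts increasing again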
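The snippet above leaves `model` and `train_one_step` undefined, so here is a self-contained sketch of the same idea on a toy linear regression, using only NumPy. The data, the helper names (`mse_and_gradients`, `lr_range_test`), the early-stop threshold, and the "largest single-step drop" selection rule are all assumptions made for illustration; learning rate finders in real libraries usually smooth the curve and pick the rate more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 3x + 2 plus a little noise (purely illustrative).
x = rng.normal(size=256)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=256)


def mse_and_gradients(w, b):
    """Mean squared error of the line w*x + b, with its gradients for w and b."""
    error = w * x + b - y
    loss = np.mean(error ** 2)
    return loss, 2.0 * np.mean(error * x), 2.0 * np.mean(error)


def lr_range_test(learning_rates):
    """Take one update per learning rate on the same parameters, recording the loss."""
    w, b = 0.0, 0.0
    losses = []
    for lr in learning_rates:
        loss, grad_w, grad_b = mse_and_gradients(w, b)
        losses.append(loss)
        if not np.isfinite(loss) or loss > 4 * losses[0]:
            break  # stop once the loss has clearly blown up
        w -= lr * grad_w
        b -= lr * grad_b
    return np.array(losses)


learning_rates = np.logspace(-4, 1, 60)
losses = lr_range_test(learning_rates)

# Crude heuristic: pick the rate at which the loss fell the most in a single step.
steepest = int(np.argmin(np.diff(losses)))
suggested_lr = learning_rates[steepest]
print(f"suggested learning rate ~ {suggested_lr:.4g}")
```

Because the parameters are carried over from one learning rate to the next, the curve reflects a single short training run rather than many independent ones, which is what keeps the test cheap.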

Watch the first few hundred steps closely. If loss diverges or oscillates immediately, the learning rate is too high — reduce it by a factor of 10 and try again. If loss barely moves after many steps, it’s likely too low.
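These checks are easy to automate as a first-pass sanity test. The sketch below is a deliberately simple heuristic: `diagnose_early_loss`, its 1% threshold, and the example loss values are assumptions, and because it only compares the first and last recorded values it will not catch every oscillation pattern.

```python
import math


def diagnose_early_loss(losses, min_relative_drop=0.01):
    """Classify the first stretch of training from a list of recorded loss values."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    if any(not math.isfinite(value) for value in losses):
        return "too high: loss became NaN or infinite"
    first, last = losses[0], losses[-1]
    if last > first:
        return "too high: loss is increasing"
    if first - last < min_relative_drop * abs(first):
        return "possibly too low: loss has barely moved"
    return "plausible: loss is decreasing"


def next_learning_rate_after_divergence(learning_rate):
    """Apply the 'reduce by a factor of 10' retry rule."""
    return learning_rate / 10.0


print(diagnose_early_loss([2.30, 2.90, float("nan")]))
print(diagnose_early_loss([2.30, 2.299, 2.298]))
print(diagnose_early_loss([2.30, 1.70, 1.20]))
```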


Why the Learning Rate Interacts With Everything Else

The “correct” learning rate isn’t a fixed, universal number — it interacts with batch size (larger batches often tolerate, and sometimes require, a proportionally larger learning rate, covered in Epochs, Batch Size, and Iterations), with the specific optimizer used (Adam and SGD often need meaningfully different learning rates for the same problem), and with the network’s depth (deeper networks are often more sensitive to learning rate choice). This is exactly why Learning Rate Scheduling exists — rather than committing to one fixed value for the entire training run, adjusting the learning rate over time captures the benefit of larger steps early and smaller, more precise steps later.
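Two of these interactions are simple enough to sketch directly: scaling the learning rate in proportion to batch size, and dropping it in steps as training progresses. Both functions below are hypothetical helpers with illustrative defaults, not a framework API; a step schedule is also only one of several common schedule shapes.

```python
def scale_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def step_decay(initial_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    """Step schedule: multiply the learning rate by drop_factor every epochs_per_drop epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


# A rate tuned at batch size 256, reused at batch size 1024.
lr = scale_learning_rate(base_lr=0.1, base_batch_size=256, batch_size=1024)
for epoch in (0, 29, 30, 60):
    print(f"epoch {epoch}: learning rate {step_decay(lr, epoch):.4g}")
```

The proportional rule is a starting point rather than a guarantee: it tends to break down at very large batch sizes, so the scaled value still deserves a quick check against the loss curve.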

Learning Rate and Optimizer Choice Are Linked, Not Independent

A commonly overlooked detail: the “right” learning rate for a given problem depends heavily on which optimizer is being used, covered fully in Optimizers. Plain SGD often needs a meaningfully larger learning rate (sometimes 0.01 to 0.1) than Adam, which typically works well starting around 0.001 — this is a direct consequence of Adam’s internal adaptive per-parameter scaling, which already adjusts the effective step size somewhat automatically. Copying a learning rate value from a tutorial using a different optimizer than the one you’re actually using is a common, easy-to-make mistake that produces confusingly different training behavior than expected, even with an otherwise identical setup.
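The effect of Adam's per-parameter scaling is easiest to see on the very first update. The sketch below is a hand-written version of that first step for a single parameter, using the commonly cited default values for `beta1`, `beta2`, and `eps`; it is meant as an illustration of the arithmetic, not a reproduction of any framework's implementation.

```python
import numpy as np


def sgd_step(gradient, learning_rate):
    """Plain SGD: the step is the gradient scaled by the learning rate."""
    return learning_rate * gradient


def adam_first_step(gradient, learning_rate, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's very first update, starting from zero moment estimates."""
    m = (1 - beta1) * gradient       # first-moment estimate after one gradient
    v = (1 - beta2) * gradient ** 2  # second-moment estimate after one gradient
    m_hat = m / (1 - beta1)          # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return learning_rate * m_hat / (np.sqrt(v_hat) + eps)


for gradient in (0.001, 1.0, 1000.0):
    print(f"gradient={gradient:g}  "
          f"SGD step={sgd_step(gradient, 0.001):.3g}  "
          f"Adam step={adam_first_step(gradient, 0.001):.3g}")
```

The SGD step grows and shrinks with the gradient, while Adam's first step stays close to the learning rate itself across a millionfold range of gradient sizes. That is why the same numeric learning rate means something quite different under the two optimizers.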

Summary

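Learning rate | Symptom                                      | Fix
--------------|----------------------------------------------|-----------------------------------------------------
Too high      | Loss oscillates or diverges, possibly to NaN | Reduce by 10x, try again
Too low       | Loss decreases extremely slowly              | Increase gradually, or use a learning rate finder
Well-chosen   | Smooth, steady loss decrease                 | Continue, consider a schedule for further refinement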

No other single hyperparameter has as much immediate, visible impact on whether training succeeds at all — when a model isn’t training well and you’re not sure where to start debugging, the learning rate should be the very first thing you check.