Numerical Computation in Deep Learning: Precision, Overflow, and Stability
A model that trains perfectly for 50 epochs and then suddenly produces NaN loss out of nowhere is one of the most confusing failures in deep learning — and it’s almost never a bug in your model’s logic. It’s a numerical computation problem: floating-point precision limits, overflow, or underflow, quietly accumulating until they cross a threshold and break everything at once. Understanding these limits is what separates “restart training and hope” from actually fixing the root cause.
Floating-Point Precision: Computers Don’t Represent Real Numbers Exactly
Computers store real numbers using a fixed number of bits — commonly 32-bit (single precision) or 16-bit (half precision) in deep learning. This means most real numbers are actually stored as the closest representable approximation, not their exact value.
    import numpy as np

    # Shown in float64: 0.1, 0.2, and 0.3 have no exact binary representation.
    # (The float32 version of this particular check happens to round to True.)
    a = np.float64(0.1) + np.float64(0.2)
    print(a)                     # 0.30000000000000004
    print(a == np.float64(0.3))  # False: tiny representation error

This tiny imprecision is usually harmless in isolation, but deep learning chains millions of these approximate operations together during training, so small errors can accumulate, especially across many layers or many training steps.
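As a rough illustration of how rounding errors accumulate, the sketch below adds the same small value 10,000 times, once with a float16 running total and once with a float32 running total. The value 0.01, the step count, and the variable names are arbitrary choices for this illustration, not taken from any particular training setup.

```python
import numpy as np

step = np.float16(0.01)   # stored as roughly 0.010002 in float16
n_steps = 10_000          # the exact sum would be about 100

total16 = np.float16(0.0)
total32 = np.float32(0.0)
for _ in range(n_steps):
    total16 = np.float16(total16 + step)   # rounded to float16 after every add
    total32 = total32 + np.float32(step)   # same values, float32 accumulator

print("float16 running total:", total16)
print("float32 running total:", total32)
```

The float16 total stops growing once the increment is smaller than half the gap between adjacent float16 values near the running total, so it ends far below 100, while the float32 total stays close to it. This is the motivation for accumulating sensitive sums in higher precision.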
Overflow: When Numbers Get Too Large to Represent
Overflow happens when a computed value exceeds the maximum number a given precision can represent, and the result becomes inf (infinity) instead of the actual (larger) number.
    x = np.float32(1e38)
    result = x * np.float32(10)  # keep both operands float32 so the result stays float32
    print(result)  # inf: exceeded the maximum float32 value (about 3.4e38); NumPy also emits an overflow RuntimeWarning

A classic real-world trigger is computing exp() of a large number directly, which is exactly what a naive softmax implementation does.
    def naive_softmax(scores):
        exp_scores = np.exp(scores)  # exp(1000) overflows to inf
        return exp_scores / np.sum(exp_scores)  # inf / inf gives nan

    # The first entry is nan and the rest collapse to 0, not the intended distribution.
    print(naive_softmax(np.array([1000.0, 1.0, 2.0])))

Underflow: When Numbers Get Too Small to Represent
Underflow is the opposite problem: a value so close to zero that it rounds to exactly zero, losing all information.

    x = np.float32(1e-46)  # below the smallest positive float32 value (about 1.4e-45)
    print(x)  # 0.0: underflowed to exactly zero

This is how the vanishing gradient problem can show up numerically, not just conceptually: a gradient that is mathematically very small (but not truly zero) can underflow to exactly zero in floating-point representation. For as long as the gradient stays that small, the weight receives no update at all, regardless of how many more training steps run.
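To make the overflow and underflow limits above concrete, NumPy's np.finfo reports the representable range of each floating-point type. The loop below is a small sketch; the output formatting and the float16 test values are illustrative choices.

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(
        f"{dtype.__name__:8s} "
        f"max={float(info.max):.4g}  "               # above this: overflow to inf
        f"smallest_normal={float(info.tiny):.4g}  "  # below this: precision degrades, then zero
        f"eps={float(info.eps):.4g}"                 # gap between 1.0 and the next value
    )

# float16 overflows just above 65504 and rounds very small values to zero.
with np.errstate(over="ignore"):   # silence the overflow warning for the demo
    too_big = np.float16(60000) * np.float16(2)
too_small = np.float16(1e-8)
print(too_big, too_small)
```

The much narrower float16 range is what makes half-precision training sensitive to both problems.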
The Fix: Numerically Stable Softmax
The standard fix for softmax overflow is subtracting the maximum value from every score before exponentiating. This is mathematically equivalent to the original formula, because the common factor exp(-max) cancels between the numerator and the denominator, but it is numerically safe: the largest exponent computed is exp(0) = 1 rather than exp(large_number).
    def stable_softmax(scores):
        shifted_scores = scores - np.max(scores)  # largest value becomes 0
        exp_scores = np.exp(shifted_scores)       # every exponent is <= 0, so no overflow
        return exp_scores / np.sum(exp_scores)

    print(stable_softmax(np.array([1000.0, 1.0, 2.0])))  # a valid distribution: [1. 0. 0.]

Production deep learning frameworks apply this kind of shift inside their softmax implementations. It is an essential fix rather than an optional optimization, which is why a naively hand-written softmax is a common early mistake worth knowing how to avoid.
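In practice softmax is usually applied to a batch of score vectors at once. The following is a minimal sketch of the same shift applied row by row; the function name softmax_rows and the axis/keepdims handling are choices made for this illustration, not taken from any particular framework.

```python
import numpy as np

def softmax_rows(scores):
    """Numerically stable softmax over the last axis of a 2-D array."""
    scores = np.asarray(scores, dtype=np.float64)
    # Subtract each row's own maximum so every row's largest exponent is exp(0).
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

batch = np.array([
    [1000.0, 1.0, 2.0],   # would overflow a naive implementation
    [0.1, 0.2, 0.3],      # ordinary-sized scores
])
probs = softmax_rows(batch)
print(probs)
print(probs.sum(axis=-1))  # each row sums to 1 (up to rounding)
```

Using each row's own maximum, rather than one global maximum, keeps rows with small scores from underflowing just because another row in the batch contains a very large one.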
Log-Sum-Exp: The General-Purpose Version of This Trick
The same shift-before-exponentiate trick generalizes to any computation involving sums of exponentials, commonly needed when computing cross-entropy loss combined with softmax in one numerically stable step.
    def log_sum_exp(x):
        max_x = np.max(x)
        # The remaining exponents are all <= 0, so exp cannot overflow.
        return max_x + np.log(np.sum(np.exp(x - max_x)))

This is why frameworks provide combined functions like PyTorch’s F.cross_entropy, which internally fuses log-softmax and negative log-likelihood into one numerically stable operation, rather than expecting you to chain separate softmax and log calls yourself. The combined version avoids the unstable intermediate result (a probability that has already rounded to zero before the log is taken) that the separated version can produce.
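As a sketch of what that fusion buys, the example below computes cross-entropy for a single example directly from raw scores using the log-sum-exp identity, and compares it with chaining softmax and log separately. The names cross_entropy_from_logits and target_index are hypothetical helpers for this illustration, not PyTorch's implementation.

```python
import numpy as np

def log_sum_exp(x):
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

def cross_entropy_from_logits(logits, target_index):
    # -log(softmax(logits)[target]) == log_sum_exp(logits) - logits[target]
    return log_sum_exp(logits) - logits[target_index]

logits = np.array([1000.0, 1.0, 2.0])
target_index = 1  # the true class has a very low score

# Fused version: stays finite.
fused = cross_entropy_from_logits(logits, target_index)

# Separated version: the probability underflows to 0 first, so the log is infinite.
shifted = logits - np.max(logits)
probs = np.exp(shifted) / np.sum(np.exp(shifted))
with np.errstate(divide="ignore"):   # silence the log(0) warning for the demo
    separated = -np.log(probs[target_index])

print(fused)      # 999.0
print(separated)  # inf
```

Even though the softmax itself was computed stably, taking the log afterwards still fails, which is why the two steps are best combined.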
Mixed Precision Training: A Practical Tradeoff
Modern training increasingly uses 16-bit floats for most computations (faster, less memory) while keeping certain sensitive operations — like the accumulation of gradients — in 32-bit precision, specifically to avoid the underflow/overflow risks of doing everything in lower precision.
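    import torch

    # Assumes model, optimizer, criterion, input_data, and target already exist on a CUDA device.
    # Newer PyTorch releases spell these torch.amp.GradScaler("cuda") and torch.amp.autocast("cuda").
    scaler = torch.cuda.amp.GradScaler()  # helps prevent underflow in 16-bit gradients

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(input_data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()  # scales the loss up before the backward pass
    scaler.step(optimizer)         # unscales gradients before the actual weight update
    scaler.update()                # adjusts the scale factor for the next iteration

The GradScaler here exists primarily to combat underflow: float16 has a much smaller representable range than float32, so gradients that would be fine in 32-bit can silently underflow to zero in 16-bit without this compensating scale factor.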
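The effect of the scale factor can be sketched without a GPU. The NumPy example below only illustrates the idea: the gradient value 1e-8 and the scale factor 65536 are assumptions chosen for the demonstration, and a real GradScaler adjusts its scale dynamically during training.

```python
import numpy as np

true_grad = 1e-8        # a small but nonzero gradient (illustrative value)
loss_scale = 65536.0    # 2**16, an assumed scale factor

# Without scaling: the value is below float16's smallest positive number.
unscaled_fp16 = np.float16(true_grad)

# With scaling: the gradient comes from a scaled loss, so the value that
# has to fit in float16 is true_grad * loss_scale.
scaled_fp16 = np.float16(true_grad * loss_scale)

# Unscale in float32 before the weight update.
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)

print(unscaled_fp16)  # 0.0
print(recovered)      # close to 1e-8
```

Multiplying the loss by a constant multiplies every gradient by the same constant, so dividing it back out in float32 recovers the intended update.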
Debugging NaN Losses Systematically
When a NaN loss appears, a systematic checklist is more productive than guessing:

1. Check the input data itself for missing or infinite values that might have slipped through preprocessing.
2. Check the learning rate, since an excessively high one is a common independent cause (covered in Learning Rate).
3. Add gradient norm logging (covered in Exploding Gradient Problem) to see whether gradients were growing abnormally before the failure.
4. Check any custom loss function code for an unguarded division or logarithm that could produce inf or NaN directly.

Working through this checklist in order, rather than randomly changing hyperparameters and rerunning, turns a frustrating, opaque failure into a specific, traceable root cause in most cases.
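The data, gradient, and loss checks in the checklist above can be sketched as small helpers. This is a minimal NumPy sketch under stated assumptions: the function names report_non_finite, global_grad_norm, and safe_log, and the epsilon value, are hypothetical choices for illustration rather than any framework's API.

```python
import numpy as np

def report_non_finite(name, array):
    """Return True (and print a message) if the array contains NaN or inf."""
    array = np.asarray(array, dtype=np.float64)
    n_nan = int(np.isnan(array).sum())
    n_inf = int(np.isinf(array).sum())
    if n_nan or n_inf:
        print(f"{name}: {n_nan} NaN, {n_inf} inf values")
        return True
    return False

def global_grad_norm(grads):
    """L2 norm over a list of gradient arrays, suitable for logging each step."""
    return float(np.sqrt(sum(np.sum(np.square(g, dtype=np.float64)) for g in grads)))

def safe_log(p, eps=1e-12):
    """Logarithm guarded against log(0); eps is an assumed floor."""
    return np.log(np.clip(p, eps, None))

inputs = np.array([0.5, np.nan, 2.0, np.inf])
bad_inputs = report_non_finite("inputs", inputs)

grads = [np.array([3.0, 4.0]), np.array([0.0])]
norm = global_grad_norm(grads)              # sqrt(9 + 16 + 0) = 5.0

guarded = safe_log(np.array([0.0, 1.0]))    # finite instead of -inf
print(bad_inputs, norm, guarded)
```

Logging the gradient norm every step makes it possible to see whether it was climbing in the steps before the NaN appeared, which helps separate an exploding-gradient problem from a bad batch or an unguarded operation.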
Summary
| Problem | Cause | Common Fix |
|---|---|---|
| Overflow | Values exceed max representable range | Numerically stable formulations (shift-before-exp) |
| Underflow | Values too small, round to exactly zero | Mixed-precision scaling, careful loss formulation |
| Precision loss | Approximate representation of real numbers | Accumulate sensitive sums in higher precision |
| NaN losses | Usually overflow/underflow propagating through the network | Gradient clipping, stable loss functions, lower learning rate |
A NaN loss appearing out of nowhere is almost always traceable to one of the three underlying issues above (overflow, underflow, or precision loss). Understanding them turns an intimidating, seemingly random failure into a specific, diagnosable, and fixable problem.