Numerical Computation in Deep Learning: Precision, Overflow, and Stability

A model that trains cleanly for 50 epochs and then suddenly produces a NaN loss is one of the most confusing failures in deep learning, and it is often not a bug in your model’s logic at all. It is frequently a numerical computation problem: floating-point precision limits, overflow, or underflow quietly accumulating until they cross a threshold and break everything at once. Understanding these limits is what separates “restart training and hope” from actually fixing the root cause.

Floating-Point Precision: Computers Don’t Represent Real Numbers Exactly

Computers store real numbers using a fixed number of bits — commonly 32-bit (single precision) or 16-bit (half precision) in deep learning. This means most real numbers are actually stored as the closest representable approximation, not their exact value.

import numpy as np

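# Even 64-bit floats cannot store 0.1, 0.2, or 0.3 exactly.
# (In float32 this particular sum happens to round to float32(0.3), so the comparison would print True.)
a = np.float64(0.1) + np.float64(0.2)
print(a)                    # 0.30000000000000004
print(a == np.float64(0.3)) # False! -- tiny representation error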

This tiny imprecision is usually harmless in isolation, but deep learning involves millions of these approximate operations chained together during training — small errors can accumulate, especially across many layers or many training steps.
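To make the accumulation effect concrete, here is a minimal sketch that adds 0.1 one hundred thousand times, once with a running float32 total and once with a float64 accumulator. The helper name sequential_sum_float32 and the choice of 100,000 terms are illustrative assumptions, not part of any library.

```python
import numpy as np

def sequential_sum_float32(values):
    """Add values one at a time, rounding the running total to float32 each step."""
    # Explicit loop on purpose: np.sum uses pairwise summation, which hides most of this drift.
    total = np.float32(0.0)
    for v in values:
        total = np.float32(total + v)
    return total

values = np.full(100_000, 0.1, dtype=np.float32)

naive_total = sequential_sum_float32(values)          # float32 running total
reference_total = np.sum(values, dtype=np.float64)    # same inputs, float64 accumulator

print(naive_total, reference_total)
```

The float64 accumulator lands very close to 10000, while the float32 running total drifts visibly, because every addition rounds to the nearest float32 value and those rounding errors do not cancel out.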

Overflow: When Numbers Get Too Large to Represent

Overflow happens when a computed value exceeds the maximum number a given precision can represent, and the result becomes inf (infinity) instead of the actual (larger) number.

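x = np.float32(1e38)
result = x * np.float32(10)   # keep both operands float32 so the product stays float32
print(result)   # inf -- overflowed the maximum representable float32 value (NumPy may also emit a RuntimeWarning)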
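The exact limits depend on the format. As a quick sketch, np.finfo reports the largest value, the smallest normal value, and the machine epsilon for each floating-point type; the variable name float16_max is only there for illustration.

```python
import numpy as np

# Print the representable range and precision of common floating-point formats.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__:>8}: max={float(info.max):.4e}  "
          f"smallest normal={float(info.tiny):.4e}  eps={float(info.eps):.4e}")

# float16 tops out at 65504, which is why it overflows so much sooner than float32.
float16_max = float(np.finfo(np.float16).max)
print(float16_max)   # 65504.0
```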

A classic real-world trigger: computing exp() of a large number directly, which is exactly what a naive softmax implementation does.

def naive_softmax(scores):
    exp_scores = np.exp(scores)              # exp(1000) overflows to inf
    return exp_scores / np.sum(exp_scores)

print(naive_softmax(np.array([1000, 1, 2])))  # produces nan, not the intended distribution

Underflow: When Numbers Get Too Small to Represent

The opposite problem — a value so close to zero that it rounds down to exactly zero, losing all information.

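# float32's smallest positive (subnormal) value is about 1.4e-45; values well below that round to zero.
x = np.float32(1e-46)
print(x)   # 0.0 -- underflowed to exactly zero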

This is one way the vanishing gradient problem shows up numerically, not just conceptually: a gradient that is mathematically very small (but not truly zero) can underflow to exactly zero in floating-point representation. When that happens, the weight receives no update at all on that step, and if the gradient keeps underflowing, the weight stays frozen no matter how many more training steps run.
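Underflow also appears when many small probabilities are multiplied together. The sketch below uses 200 hypothetical probabilities of 0.001 each (made-up values for illustration) and compares the direct product with a sum of logarithms, which is the usual way to keep such quantities representable.

```python
import numpy as np

# 200 hypothetical per-item probabilities, each 0.001 (illustrative values only).
item_probs = np.full(200, 1e-3, dtype=np.float32)

# The true product is 1e-600, far below anything float32 (or float64) can represent.
direct_product = np.prod(item_probs)

# Working in log space keeps the same information in a comfortably representable number.
log_likelihood = np.sum(np.log(item_probs))

print(direct_product)   # 0.0
print(log_likelihood)   # roughly 200 * log(0.001)
```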

The Fix: Numerically Stable Softmax

The standard fix for softmax overflow is subtracting the maximum value from every score before exponentiating — mathematically equivalent to the original formula, but numerically safe, since it guarantees the largest exponent computed is exp(0) = 1 rather than exp(large_number).

def stable_softmax(scores):
    shifted_scores = scores - np.max(scores)   # largest value becomes 0
    exp_scores = np.exp(shifted_scores)        # no overflow possible
    return exp_scores / np.sum(exp_scores)

print(stable_softmax(np.array([1000, 1, 2])))  # correctly produces a valid distribution

Mainstream deep learning frameworks apply this kind of shift inside their softmax implementations. It is an essential fix, not an optional optimization, which is why hand-implementing softmax naively is a common early mistake worth knowing how to avoid.
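In practice softmax is usually applied to a whole batch of score vectors at once. A minimal batched sketch, assuming scores are stored one row per example (the function name stable_softmax_batch and the axis argument are illustrative choices), might look like this. Note that the shift also protects rows where every score is very negative, which would otherwise underflow to 0 / 0.

```python
import numpy as np

def stable_softmax_batch(scores, axis=-1):
    """Softmax along `axis`, shifting each slice by its own maximum first."""
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - np.max(scores, axis=axis, keepdims=True)   # per-row max becomes 0
    exp_scores = np.exp(shifted)                                  # every value is in (0, 1]
    return exp_scores / np.sum(exp_scores, axis=axis, keepdims=True)

batch = np.array([[1000.0, 1.0, 2.0],            # would overflow in the naive version
                  [-1000.0, -1001.0, -1002.0]])  # would underflow to 0 / 0 in the naive version
probs = stable_softmax_batch(batch)
print(probs)
print(probs.sum(axis=-1))   # each row sums to approximately 1
```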

Log-Sum-Exp: The General-Purpose Version of This Trick

The same shift-before-exponentiate trick generalizes to any computation involving sums of exponentials, commonly needed when computing cross-entropy loss combined with softmax in one numerically stable step.

def log_sum_exp(x):
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

This is why frameworks provide combined functions like PyTorch’s F.cross_entropy (which internally fuses log-softmax and negative log-likelihood into one numerically stable operation) rather than expecting you to chain a separate softmax and log call yourself — the combined version avoids an unstable intermediate computation that the separated version wouldn’t.
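To illustrate why the fused form matters, here is a small NumPy sketch of log-softmax and cross-entropy built on the same shift. It is not PyTorch’s actual implementation; the function names and the target_index parameter are assumptions made for this example. With these logits, a separate softmax followed by a log would compute log(0) = -inf for the target class, while the fused version returns a finite loss.

```python
import numpy as np

def log_softmax(logits):
    """log(softmax(logits)) computed without forming the softmax itself."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class."""
    return -log_softmax(logits)[target_index]

logits = np.array([1000.0, 1.0, 2.0])
loss = cross_entropy(logits, target_index=1)
print(loss)   # 999.0
```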

Mixed Precision Training: A Practical Tradeoff

Modern training increasingly uses 16-bit floats for most computations (faster, less memory) while keeping certain sensitive operations — like the accumulation of gradients — in 32-bit precision, specifically to avoid the underflow/overflow risks of doing everything in lower precision.

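import torch

# model, criterion, optimizer, input_data, and target are assumed to be defined elsewhere.
# Newer PyTorch releases also expose these as torch.amp.GradScaler("cuda") and
# torch.autocast(device_type="cuda").
scaler = torch.cuda.amp.GradScaler()   # helps prevent underflow in 16-bit gradients

optimizer.zero_grad()                  # clear gradients left over from the previous step

with torch.cuda.amp.autocast():        # runs eligible ops in 16-bit precision
    output = model(input_data)
    loss = criterion(output, target)

scaler.scale(loss).backward()          # scales the loss up before backward pass
scaler.step(optimizer)                 # unscales gradients; skips the step if any are inf/NaN
scaler.update()                        # adjusts the scale factor for the next iteration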

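The GradScaler here exists mainly to combat underflow: float16 has a much smaller representable range than float32, so gradients that would be fine in 32-bit can silently underflow to zero in 16-bit without this compensating scale factor. As a side benefit, it skips the optimizer step when it finds inf or NaN gradients. bfloat16 keeps roughly the same exponent range as float32, so loss scaling is usually unnecessary with it.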
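The effect of loss scaling can be reproduced with plain NumPy. In this sketch the gradient value 1e-8 and the scale factor 65536 are hypothetical numbers chosen for illustration; it mimics the idea behind loss scaling, not GradScaler’s internals.

```python
import numpy as np

true_grad = 1e-8        # hypothetical gradient value, representable in float32
loss_scale = 65536.0    # hypothetical scale factor (a power of two)

# Stored directly in float16, the gradient is below the smallest subnormal (about 6e-8).
unscaled_fp16 = np.float16(true_grad)

# Scaling first moves the value into float16's representable range.
scaled_fp16 = np.float16(true_grad * loss_scale)

# Unscaling happens in float32, where the small value is representable again.
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)

print(unscaled_fp16)   # 0.0
print(recovered)       # close to 1e-8
```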

Debugging NaN Losses Systematically

When a NaN loss appears, a systematic checklist is more productive than guessing: first, check the input data itself for missing or infinite values that might have slipped through preprocessing; second, check the learning rate, since an excessively high one is a common independent cause covered in Learning Rate; third, add gradient norm logging (covered in Exploding Gradient Problem) to see whether gradients were growing abnormally before the failure; fourth, check any custom loss function code for an unguarded division or logarithm that could produce inf or NaN directly. Working through this checklist in order, rather than randomly changing hyperparameters and rerunning, turns a frustrating, opaque failure into a specific, traceable root cause in most cases.
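Parts of this checklist can be turned into small helpers. The following is a framework-agnostic NumPy sketch; the function names count_non_finite, safe_log, and global_grad_norm and the eps default are assumptions made for illustration, and in a real training loop the same checks would run on your framework’s tensors.

```python
import numpy as np

def count_non_finite(array):
    """Return how many entries are NaN or +/-inf (step one: check the input data)."""
    array = np.asarray(array, dtype=np.float64)
    return int(np.size(array) - np.count_nonzero(np.isfinite(array)))

def global_grad_norm(grads):
    """L2 norm over a list of gradient arrays (step three: log this every step)."""
    return float(np.sqrt(sum(np.sum(np.square(g, dtype=np.float64)) for g in grads)))

def safe_log(x, eps=1e-12):
    """Logarithm with inputs clamped to at least eps (step four: guard custom losses)."""
    return np.log(np.clip(x, eps, None))

inputs = np.array([0.5, np.nan, 2.0, np.inf])
print(count_non_finite(inputs))                                # 2
print(global_grad_norm([np.array([3.0]), np.array([4.0])]))    # 5.0
print(safe_log(np.array([0.0, 1.0])))                          # finite values, no -inf
```

Logging the gradient norm at every step makes it possible to see whether it was climbing before the first NaN appeared, rather than only observing the failure after the fact.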

Summary

Problem	Cause	Common Fix
Overflow	Values exceed max representable range	Numerically stable formulations (shift-before-exp)
Underflow	Values too small, round to exactly zero	Mixed-precision scaling, careful loss formulation
Precision loss	Approximate representation of real numbers	Accumulate sensitive sums in higher precision
NaN losses	Usually overflow/underflow propagating through the network	Gradient clipping, stable loss functions, lower learning rate

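A NaN loss that seems to appear out of nowhere is usually traceable to one of these three issues (precision loss, overflow, or underflow), often triggered by bad input data or an excessive learning rate. Understanding them turns an intimidating, seemingly random failure into a specific, diagnosable, and fixable problem.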

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.