Backpropagation Explained From Scratch: What loss.backward() Really Computes

There’s a specific moment that happens to almost everyone learning deep learning: you write loss.backward(), it runs successfully, your model trains, and you move on without ever really knowing what that one line of code just did. That’s a completely reasonable thing to do when you’re trying to ship something — but it leaves a gap that eventually catches up with you, usually at the worst possible time, when a gradient turns into NaN, a layer mysteriously stops learning, or an experiment behaves in a way you can’t explain.

Backpropagation is the specific algorithm hiding behind that single line of code. It’s the reason deep learning became computationally practical at all, and it’s genuinely learnable at a mechanical level — not a mysterious black box, but a specific, repeatable application of one calculus rule you probably learned in school and forgot about entirely: the chain rule.

The Problem Backpropagation Actually Solves

A network’s final loss depends on every weight in every layer, but the relationship between an early layer’s weight and the final loss is indirect: it has to pass through every layer in between. Naively, you might think computing each weight’s contribution requires a separate calculation per weight. For a network with millions of parameters, that would mean millions of expensive calculations just to take one training step. Backpropagation’s actual achievement is computing the gradient for every weight at a cost that is a small constant multiple of one forward pass, however many weights the network has. That efficiency is not a minor optimization; it is the reason training networks with millions or billions of parameters is computationally feasible at all.

The Two-Pass Structure, Visualized

The forward pass, covered in Forward Propagation, computes a prediction and the resulting loss by moving data through the network left to right. The backward pass then runs in reverse: starting from the loss, it computes how much each layer’s output contributed to that loss, then uses that to compute how much each layer’s weights contributed, working backward one layer at a time until every single weight in the network has an associated gradient.
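To make the two-pass structure concrete for any depth, here is a minimal sketch of the same idea as two loops. It assumes a plain stack of weight matrices with ReLU after every layer except the last, no biases, and a mean squared error loss; the function names forward and backward and the three-layer shapes are illustrative choices, not part of any framework.

```python
import numpy as np

def forward(X, weights):
    """Forward pass: ReLU after every layer except the last, caching as we go."""
    cache = []                      # one (layer_input, pre_activation) pair per layer
    a = X
    for i, W in enumerate(weights):
        z = a @ W
        cache.append((a, z))
        a = z if i == len(weights) - 1 else np.maximum(0, z)
    return a, cache

def backward(d_loss_d_output, weights, cache):
    """Backward pass: visit layers last to first, reusing the upstream gradient."""
    grads = [None] * len(weights)
    d_z = d_loss_d_output           # the last layer is linear, so dLoss/dz = dLoss/doutput
    for i in reversed(range(len(weights))):
        layer_input, _ = cache[i]
        grads[i] = layer_input.T @ d_z          # this layer's weight gradient
        if i > 0:
            d_a = d_z @ weights[i].T            # gradient w.r.t. this layer's input
            _, z_prev = cache[i - 1]
            d_z = d_a * (z_prev > 0)            # back through the previous layer's ReLU
    return grads

# Hypothetical three-layer network with arbitrary random weights (an assumption).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 4)), rng.normal(size=(4, 3)), rng.normal(size=(3, 1))]
X = np.array([[1.0, 2.0]])
target = np.array([[1.0]])

prediction, cache = forward(X, weights)
d_loss_d_pred = 2 * (prediction - target) / target.size   # mean squared error
grads = backward(d_loss_d_pred, weights, cache)

for W, g in zip(weights, grads):
    print(W.shape, g.shape)   # each gradient has the same shape as its weight matrix
```

The backward loop never recomputes anything from the forward pass: it only reads the cached values and the gradient handed to it by the layer after it.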

A Concrete, Fully Worked Example

Abstract diagrams only get you so far — the real understanding comes from tracing actual numbers through actual computations. Here’s a tiny two-layer network, worked by hand in code.

import numpy as np

# --- Forward pass ---
X = np.array([[1.0, 2.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])
W2 = np.array([[0.6], [-0.4]])
target = np.array([[1.0]])

z1 = X @ W1                     # (1x2) @ (2x2) = (1x2)
a1 = np.maximum(0, z1)          # ReLU activation
z2 = a1 @ W2                    # (1x2) @ (2x1) = (1x1)
prediction = z2

loss = np.mean((prediction - target) ** 2)

print("z1:", z1)
print("a1 (after ReLU):", a1)
print("prediction:", prediction)
print("loss:", loss)

Running this gives z1 = [0.9, 1.3]. Both values are positive, so ReLU leaves them unchanged and a1 = [0.9, 1.3]. Then prediction = 0.9*0.6 + 1.3*(-0.4) = 0.54 - 0.52 = 0.02. The target was 1.0, so the model is currently quite wrong, and the loss reflects that: (0.02 - 1.0)^2 = 0.9604.

Now the backward pass, computing each gradient explicitly, one chain-rule step at a time:

# --- Backward pass ---
d_loss_d_pred = 2 * (prediction - target) / target.size    # dLoss/dPrediction
print("dLoss/dPrediction:", d_loss_d_pred)

d_pred_d_W2 = a1.T                                           # dPrediction/dW2
d_loss_d_W2 = d_pred_d_W2 @ d_loss_d_pred                    # chain rule
print("dLoss/dW2:", d_loss_d_W2)

d_pred_d_a1 = W2.T
d_loss_d_a1 = d_loss_d_pred @ d_pred_d_a1                    # dLoss/dActivation1
print("dLoss/dActivation1:", d_loss_d_a1)

relu_derivative = (z1 > 0).astype(float)                     # 1 where positive, 0 otherwise
d_loss_d_z1 = d_loss_d_a1 * relu_derivative
print("dLoss/dZ1:", d_loss_d_z1)

d_loss_d_W1 = X.T @ d_loss_d_z1                              # chain rule, all the way back
print("dLoss/dW1:", d_loss_d_W1)

Each of these lines is doing exactly one thing: taking a gradient that’s already been computed one step closer to the loss, and multiplying it by the local derivative of the next operation back — exactly the chain rule introduced in Calculus for Deep Learning. Trace it line by line: d_loss_d_pred measures how the loss changes as the prediction changes. d_loss_d_W2 reuses that value, multiplied by how the prediction changes as W2 changes. d_loss_d_a1 reuses d_loss_d_pred again, this time combined with how the prediction depends on the activation. Notice the reuse — d_loss_d_pred appears in two separate later computations, computed only once. That reuse of intermediate results, repeated at every layer, is precisely what makes backpropagation efficient instead of prohibitively expensive.
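One way to convince yourself these gradients mean something is to take a single small step against them and check that the loss drops. The sketch below repackages the worked example into two helper functions and applies one plain gradient descent update. The helper names and the learning rate of 0.1 are assumptions made for illustration.

```python
import numpy as np

def forward_loss(X, W1, W2, target):
    """Forward pass of the worked example, returning the scalar loss."""
    a1 = np.maximum(0, X @ W1)
    prediction = a1 @ W2
    return np.mean((prediction - target) ** 2)

def gradients(X, W1, W2, target):
    """Backward pass of the worked example, returning (dLoss/dW1, dLoss/dW2)."""
    z1 = X @ W1
    a1 = np.maximum(0, z1)
    prediction = a1 @ W2
    d_pred = 2 * (prediction - target) / target.size   # computed once, used twice below
    d_W2 = a1.T @ d_pred
    d_z1 = (d_pred @ W2.T) * (z1 > 0)
    d_W1 = X.T @ d_z1
    return d_W1, d_W2

X = np.array([[1.0, 2.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])
W2 = np.array([[0.6], [-0.4]])
target = np.array([[1.0]])
learning_rate = 0.1   # assumed value, chosen only for this illustration

loss_before = forward_loss(X, W1, W2, target)
d_W1, d_W2 = gradients(X, W1, W2, target)

# Step each weight a little in the direction that lowers the loss.
W1_new = W1 - learning_rate * d_W1
W2_new = W2 - learning_rate * d_W2
loss_after = forward_loss(X, W1_new, W2_new, target)

print("dLoss/dW2:", d_W2)    # approximately [[-1.764], [-2.548]]
print("dLoss/dW1:", d_W1)    # approximately [[-1.176, 0.784], [-2.352, 1.568]]
print("loss before:", loss_before)
print("loss after one step:", loss_after)
```

The negative entries in dLoss/dW2 say that increasing those weights would lower the loss, which is what the update does.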

Why Not Just Compute Gradients by Perturbation?

It’s worth understanding the alternative you’re avoiding, because it makes the efficiency argument concrete rather than abstract. You could, in principle, estimate a gradient numerically: nudge one weight by a tiny amount, rerun the entire forward pass, see how much the loss changed, and divide.

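def numerical_gradient_check(W, X, target, epsilon=1e-5):
    # compute_loss is assumed to be defined elsewhere: it runs the full
    # forward pass with weights W and returns the scalar loss.
    original_value = W[0, 0]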

    W[0, 0] = original_value + epsilon
    loss_plus = compute_loss(W, X, target)

    W[0, 0] = original_value - epsilon
    loss_minus = compute_loss(W, X, target)

    W[0, 0] = original_value  # restore
    estimated_gradient = (loss_plus - loss_minus) / (2 * epsilon)
    return estimated_gradient

This works, and it’s a standard technique for sanity-checking a backpropagation implementation during debugging. But look at the cost: this snippet computes the gradient for exactly one weight, and it needs two full forward passes to do it. A network with ten million weights would need twenty million forward passes to get every gradient this way, which is completely impractical. Backpropagation gets you the gradient for every weight in the network in a single backward pass, at a cost comparable to the forward pass itself (a small constant multiple, not a multiple of the number of weights). That’s not a small efficiency gain; it’s the difference between “trainable” and “not trainable” for any network of realistic size.
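As a sketch of how that sanity check might look on the worked example, the code below estimates every entry of W1 and W2 by central differences and compares the result with the chain-rule gradients. The compute_loss and numerical_gradient helpers here are illustrative stand-ins written for this two-layer network, not a general-purpose tool.

```python
import numpy as np

def compute_loss(W1, W2, X, target):
    """Full forward pass of the two-layer example, returning the scalar loss."""
    a1 = np.maximum(0, X @ W1)
    return np.mean((a1 @ W2 - target) ** 2)

def numerical_gradient(param, loss_fn, epsilon=1e-5):
    """Central-difference estimate for every entry of param (two forward passes each)."""
    grad = np.zeros_like(param)
    for index in np.ndindex(param.shape):
        original = param[index]
        param[index] = original + epsilon
        loss_plus = loss_fn()
        param[index] = original - epsilon
        loss_minus = loss_fn()
        param[index] = original                 # restore before moving on
        grad[index] = (loss_plus - loss_minus) / (2 * epsilon)
    return grad

X = np.array([[1.0, 2.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])
W2 = np.array([[0.6], [-0.4]])
target = np.array([[1.0]])

# The closure reads W1 and W2, which numerical_gradient perturbs in place.
loss_fn = lambda: compute_loss(W1, W2, X, target)
numeric_W1 = numerical_gradient(W1, loss_fn)    # 4 weights -> 8 forward passes
numeric_W2 = numerical_gradient(W2, loss_fn)    # 2 weights -> 4 forward passes

# Analytic gradients from the chain rule, as in the worked example.
z1 = X @ W1
a1 = np.maximum(0, z1)
d_pred = 2 * (a1 @ W2 - target) / target.size
analytic_W2 = a1.T @ d_pred
analytic_W1 = X.T @ ((d_pred @ W2.T) * (z1 > 0))

print("W1 gradients agree:", np.allclose(numeric_W1, analytic_W1, atol=1e-6))
print("W2 gradients agree:", np.allclose(numeric_W2, analytic_W2, atol=1e-6))
```

Even on this six-weight network the numerical route costs twelve forward passes, while the analytic route costs one forward and one backward pass.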

Seeing the Computation Graph Directly

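It helps to visualize the exact graph our worked example built, since “automatic differentiation” is really just “walk this specific graph backward.” In the forward direction, X and W1 produce z1, ReLU turns z1 into a1, a1 and W2 produce z2 (the prediction), and the prediction and target produce the loss.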

Every solid arrow is something that happened during the forward pass; every dotted arrow is a gradient computed during the backward pass, flowing in the opposite direction. W1 and W2 are the two nodes we actually care about — they’re the trainable parameters, and the backward pass exists specifically to compute how the loss changes with respect to each of them. Everything else in the graph (z1, a1, z2) is an intermediate value that exists only to make that final calculation possible, and PyTorch is tracking every one of these intermediate nodes silently, the entire time your forward pass runs, specifically so it can walk this exact structure backward the moment you call .backward().
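To see what “tracking nodes and walking them backward” can look like in code, here is a deliberately tiny scalar sketch. The Value class is a hypothetical teaching aid, not how PyTorch is implemented internally: each node records its parents and the local derivative with respect to each parent, and backward() visits nodes from the loss toward the inputs.

```python
class Value:
    """A scalar that remembers which operation produced it (hypothetical helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents            # nodes this value was computed from
        self.local_grads = local_grads    # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Order nodes so that each one is handled before the nodes it was built from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                   # dLoss/dLoss
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local   # chain rule, accumulated per parent

# The worked example, one scalar at a time.
x = [Value(1.0), Value(2.0)]
w1 = [[Value(0.5), Value(-0.3)], [Value(0.2), Value(0.8)]]
w2 = [Value(0.6), Value(-0.4)]
target = Value(1.0)

z1 = [x[0] * w1[0][j] + x[1] * w1[1][j] for j in range(2)]
a1 = [z.relu() for z in z1]
prediction = a1[0] * w2[0] + a1[1] * w2[1]
error = prediction - target
loss = error * error

loss.backward()
print("dLoss/dW2:", [w.grad for w in w2])
print("dLoss/dW1:", [[w.grad for w in row] for row in w1])
```

The intermediate nodes (z1, a1, prediction, error) end up with gradients too; they are the values that get reused on the way back to the weights.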

What loss.backward() Is Actually Doing

Modern frameworks automate the exact manual process shown above through a system called automatic differentiation. As your model runs its forward pass, the framework silently builds a graph recording every operation performed. Calling .backward() walks that graph in reverse, applying the chain rule at each recorded operation automatically.

import torch

X = torch.tensor([[1.0, 2.0]], requires_grad=False)
W1 = torch.tensor([[0.5, -0.3], [0.2, 0.8]], requires_grad=True)
W2 = torch.tensor([[0.6], [-0.4]], requires_grad=True)
target = torch.tensor([[1.0]])

a1 = torch.relu(X @ W1)
prediction = a1 @ W2
loss = ((prediction - target) ** 2).mean()

loss.backward()

print("W1.grad:", W1.grad)
print("W2.grad:", W2.grad)

Run this alongside the manual NumPy version above, and the gradients match — because .backward() is executing precisely the same chain-rule computation, just automated and generalized to work for any computation graph, not one specific hand-derived example. This is genuinely worth verifying for yourself at least once: it turns “the framework does something magical” into “the framework does the exact arithmetic I just did by hand, faster and for arbitrarily complex graphs.”

Two Mistakes That Silently Produce Wrong Results

Forgetting to zero gradients between steps. PyTorch accumulates gradients into .grad by default across multiple .backward() calls, rather than overwriting them each time. Skip optimizer.zero_grad(), and the gradients from every previous batch quietly add into the current batch’s gradients.

optimizer.zero_grad()   # clear gradients accumulated from any previous step
loss.backward()          # compute this step's gradients fresh
optimizer.step()         # apply them to update the weights

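Forgetting the zero_grad() call doesn’t crash anything. The model appears to train, but each step uses a running sum of gradients rather than the current batch’s gradient, so both the magnitude and the direction of the update are silently wrong. That is exactly the kind of bug that’s hardest to catch, because nothing looks obviously broken.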
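The accumulation is easy to observe directly. This small sketch uses a made-up one-tensor “model” and loss (assumptions for illustration) and calls .backward() repeatedly, resetting the gradient by hand where an optimizer would normally do it.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_fn(w):
    return (w ** 2).sum()          # gradient of this loss is 2 * w

loss_fn(w).backward()
first = w.grad.clone()             # gradient from one backward call

loss_fn(w).backward()              # second call, nothing cleared in between
accumulated = w.grad.clone()       # the two gradients have been summed

w.grad = None                      # manual reset, standing in for optimizer.zero_grad()
loss_fn(w).backward()
fresh = w.grad.clone()             # back to the gradient of a single call

print("first:      ", first)
print("accumulated:", accumulated)
print("fresh:      ", fresh)
```

With w = [1, 2] the single-call gradient is [2, 4], so the accumulated value is [4, 8]: twice as large, with no warning raised.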

Treating a None or NaN gradient as mysterious rather than traceable. Once you understand that every gradient is the product of a specific chain of local derivatives, a None gradient usually means a variable was detached from the computation graph somewhere upstream (often via .detach() or a raw NumPy conversion mid-computation), and a NaN gradient usually traces back to a specific numerically unstable operation (a division by a near-zero value, or a log of a near-zero value) at some point in the chain. Both become debuggable, not mysterious, once you think in terms of “which link in this specific chain broke” — the same graph-walking mental model from the diagram above, just applied in reverse to hunt down the one specific broken node responsible.
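Both failure modes can be reproduced in a few lines. The sketch below reuses the worked example’s tensors for the None case and a made-up scalar expression, p * log(p), for the NaN case. The epsilon of 1e-8 in the repaired version is an assumed value, and adding a small constant inside the log is only one of several possible fixes.

```python
import torch

# Case 1: a detached intermediate severs the link between W1 and the loss.
X = torch.tensor([[1.0, 2.0]])
W1 = torch.tensor([[0.5, -0.3], [0.2, 0.8]], requires_grad=True)
W2 = torch.tensor([[0.6], [-0.4]], requires_grad=True)
target = torch.tensor([[1.0]])

a1 = torch.relu(X @ W1).detach()          # the graph stops recording here
loss = ((a1 @ W2 - target) ** 2).mean()
loss.backward()
print("W1.grad:", W1.grad)                # None: no path from the loss back to W1
print("W2.grad:", W2.grad)                # still populated

# Case 2: a log evaluated at exactly zero poisons the chain with 0 / 0.
p = torch.tensor(0.0, requires_grad=True)
(p * torch.log(p)).backward()
print("p.grad:", p.grad)                  # nan

# One possible repair: keep the log's argument away from zero.
eps = 1e-8                                # assumed small constant
q = torch.tensor(0.0, requires_grad=True)
(q * torch.log(q + eps)).backward()
print("q.grad:", q.grad)                  # finite
```

In both cases the question to ask is the same: which operation in the chain stopped passing a usable gradient to the one before it.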

The Direct Link to Two Famous Training Failures

The chain rule multiplies many local derivatives together as it moves backward through a deep network’s layers. If most of those individual derivatives are smaller than 1 in magnitude — a common situation with sigmoid or tanh activations, which flatten out at extreme input values — the running product shrinks toward zero as it travels through many layers. That’s the vanishing gradient problem, covered in full in Vanishing Gradient Problem. If instead those derivatives are consistently larger than 1, the product grows without bound instead, producing the exploding gradient problem, covered in Exploding Gradient Problem. Neither of these is a separate, mysterious phenomenon — both are direct, predictable mathematical consequences of the exact repeated multiplication shown in this article’s worked example, just repeated across many more layers than our toy two-layer network.
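The arithmetic behind both failures fits in a few lines. This sketch assumes, for simplicity, that every layer contributes the same local derivative: 0.25 is the largest value the sigmoid’s derivative can take, and 1.5 is an arbitrary value above 1 chosen for contrast. Real networks have varying derivatives and weight matrices in the product, so treat this as an illustration of the trend only.

```python
def backprop_scale(local_derivative, num_layers):
    """Running product of one local derivative repeated across num_layers layers."""
    product = 1.0
    for _ in range(num_layers):
        product *= local_derivative
    return product

for depth in (2, 10, 50):
    shrinking = backprop_scale(0.25, depth)   # sigmoid's best case per layer
    growing = backprop_scale(1.5, depth)      # assumed derivative above 1
    print(f"{depth:>2} layers: 0.25^n = {shrinking:.3e}   1.5^n = {growing:.3e}")
```

At two layers, the depth of the worked example, neither product is alarming. By ten layers the shrinking case is already below one millionth of the original gradient.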

A Habit Worth Building

Whenever a training run behaves strangely — a loss that refuses to decrease, a specific layer whose weights never seem to move, a gradient that suddenly turns into NaN partway through training — resist the urge to treat it as a black-box mystery to be fixed by trial and error alone. Instead, mentally walk the computation graph the way the diagram above shows: what operation produced this value, what’s the local derivative of that operation, and does that derivative make sense given the actual numbers flowing through it at this point in training. This habit, built once and reused constantly, turns backpropagation from an intimidating piece of machinery into the most reliable diagnostic tool you have for understanding why a network is or isn’t learning.

Summary

Step	What actually happens
Forward pass	Computes the prediction and the resulting loss, layer by layer, left to right
Backward pass	Applies the chain rule layer by layer, in reverse, starting from the loss
Gradient reuse	Each layer’s gradient computation reuses results already computed closer to the loss
Automatic differentiation	What `.backward()` does — automates exactly the manual process shown in this article

Backpropagation is not a separate algorithm bolted onto neural networks as an afterthought — it’s the systematic, efficient application of the chain rule to a layered computation, made tractable by reusing intermediate results as it works its way backward. Once you’ve traced through the worked example above by hand at least once, loss.backward() stops being a black box and becomes exactly what it is: a fast, general version of arithmetic you can do yourself, on a graph you can now actually picture rather than only imagine.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy. Spot an error? Let us know.