Backpropagation Explained Step by Step: How Neural Networks Actually Learn
Backpropagation is the algorithm that made deep learning practical — without an efficient way to compute how every single weight in a large network contributed to the final error, training anything beyond a shallow network would be computationally hopeless. It’s also one of the most misunderstood concepts in the field, often treated as a black box even by people who use it every day through loss.backward(). This guide walks through exactly what it computes and why.
The Problem Backpropagation Solves
A network’s final loss depends on every weight in every layer, but the relationship between an early-layer weight and the final loss is indirect — it passes through every layer in between. Backpropagation efficiently computes the gradient of the loss with respect to every weight, using the chain rule covered in Calculus for Deep Learning, without needing to recompute the entire forward pass separately for each individual weight.
The Two-Pass Structure
Forward pass: Input → Layer 1 → Layer 2 → Layer 3 → Loss
Backward pass: Input ← Layer 1 ← Layer 2 ← Layer 3 ← Loss
The forward pass, covered in Forward Propagation, computes the prediction and the resulting loss. The backward pass then works in reverse, starting from the loss and propagating gradient information backward, layer by layer, until every weight has a computed gradient.
A Concrete Walkthrough: A Two-Layer Network
import numpy as np

# Forward pass
X = np.array([[1.0, 2.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])
W2 = np.array([[0.6], [-0.4]])
target = np.array([[1.0]])

z1 = X @ W1
a1 = np.maximum(0, z1)  # ReLU
z2 = a1 @ W2
prediction = z2

loss = np.mean((prediction - target) ** 2)

# Backward pass -- computing gradients using the chain rule
d_loss_d_pred = 2 * (prediction - target) / target.size  # dLoss/dPrediction
d_pred_d_W2 = a1.T  # dPrediction/dW2
d_loss_d_W2 = d_pred_d_W2 @ d_loss_d_pred  # chain rule: dLoss/dW2

d_pred_d_a1 = W2.T
d_loss_d_a1 = d_loss_d_pred @ d_pred_d_a1  # dLoss/dActivation1

relu_derivative = (z1 > 0).astype(float)
d_loss_d_z1 = d_loss_d_a1 * relu_derivative  # dLoss/dZ1

d_loss_d_W1 = X.T @ d_loss_d_z1  # chain rule: dLoss/dW1

print("Gradient for W1:", d_loss_d_W1)
print("Gradient for W2:", d_loss_d_W2)

Each line here is a direct application of the chain rule: to find how much W1 (an earlier layer’s weight) affected the final loss, you multiply together the local derivative at every intermediate step between W1 and the loss. This is exactly the chained multiplication introduced conceptually in Calculus for Deep Learning.
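For readers who prefer equations, the same walkthrough can be written out as a worked derivation. The symbols here are notation introduced only for this sketch and do not appear in the original code: \hat{y} stands for prediction, y for target, N for target.size, and \odot for elementwise multiplication. The numbers are the exact arithmetic results for the inputs above; printed floating-point output may differ in the last few digits.

```latex
\begin{aligned}
z_1 &= X W_1 = \begin{bmatrix} 0.9 & 1.3 \end{bmatrix}, \qquad
a_1 = \mathrm{ReLU}(z_1) = \begin{bmatrix} 0.9 & 1.3 \end{bmatrix}, \qquad
\hat{y} = a_1 W_2 = 0.02 \\
L &= \frac{1}{N} \sum (\hat{y} - y)^2 = (0.02 - 1.0)^2 = 0.9604 \\
\frac{\partial L}{\partial \hat{y}} &= \frac{2(\hat{y} - y)}{N} = 2(0.02 - 1.0) = -1.96 \\
\frac{\partial L}{\partial W_2} &= a_1^{\top} \frac{\partial L}{\partial \hat{y}}
  = \begin{bmatrix} 0.9 \\ 1.3 \end{bmatrix} (-1.96)
  = \begin{bmatrix} -1.764 \\ -2.548 \end{bmatrix} \\
\frac{\partial L}{\partial a_1} &= \frac{\partial L}{\partial \hat{y}} W_2^{\top}
  = \begin{bmatrix} -1.176 & 0.784 \end{bmatrix} \\
\frac{\partial L}{\partial z_1} &= \frac{\partial L}{\partial a_1} \odot \mathbf{1}[z_1 > 0]
  = \begin{bmatrix} -1.176 & 0.784 \end{bmatrix} \\
\frac{\partial L}{\partial W_1} &= X^{\top} \frac{\partial L}{\partial z_1}
  = \begin{bmatrix} -1.176 & 0.784 \\ -2.352 & 1.568 \end{bmatrix}
\end{aligned}
```

Both entries of z_1 are positive, so the ReLU mask is all ones and passes the gradient through unchanged; with a negative pre-activation, the corresponding column of the W1 gradient would be zero.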
Why This Is Efficient: Reusing Intermediate Computations
The key insight that makes backpropagation practical rather than prohibitively expensive is reuse. The gradient for W1 is built from d_loss_d_a1, which is itself built from d_loss_d_pred, the same quantity already computed for W2’s gradient, so nothing upstream is ever derived twice. Without this reuse, computing each weight’s gradient independently (by slightly perturbing it and re-running the entire forward pass) would require at least one separate forward pass per weight, which is computationally infeasible for a network with millions of parameters. Backpropagation instead computes every gradient in a single backward pass whose cost is a small constant multiple of one forward pass, however many weights the network has.
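The perturb-and-rerun alternative described above is still useful as a correctness check on small networks. The following is a minimal sketch, assuming the same two-layer network as the walkthrough; the helper names forward_loss and numerical_gradient are hypothetical and introduced only for this illustration. It uses central differences, so it runs two forward passes for each of the six weights, and compares the result with the chain-rule gradients.

```python
import numpy as np

X = np.array([[1.0, 2.0]])
W1 = np.array([[0.5, -0.3], [0.2, 0.8]])
W2 = np.array([[0.6], [-0.4]])
target = np.array([[1.0]])


def forward_loss(W1, W2):
    """Full forward pass of the two-layer network, returning the MSE loss."""
    a1 = np.maximum(0, X @ W1)
    prediction = a1 @ W2
    return np.mean((prediction - target) ** 2)


def numerical_gradient(W, loss_fn, eps=1e-6):
    """Central-difference estimate: two forward passes per individual weight."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        loss_plus = loss_fn()
        W[idx] = original - eps
        loss_minus = loss_fn()
        W[idx] = original  # restore the weight before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad


# Analytic gradients: one forward pass, then one backward pass.
z1 = X @ W1
a1 = np.maximum(0, z1)
prediction = a1 @ W2
d_loss_d_pred = 2 * (prediction - target) / target.size
analytic_W2 = a1.T @ d_loss_d_pred
d_loss_d_z1 = (d_loss_d_pred @ W2.T) * (z1 > 0)
analytic_W1 = X.T @ d_loss_d_z1

# Numerical gradients: the loop above re-runs the forward pass for every weight.
numeric_W1 = numerical_gradient(W1, lambda: forward_loss(W1, W2))
numeric_W2 = numerical_gradient(W2, lambda: forward_loss(W1, W2))

print("W1 gradients agree:", np.allclose(analytic_W1, numeric_W1, atol=1e-6))
print("W2 gradients agree:", np.allclose(analytic_W2, numeric_W2, atol=1e-6))
```

The numerical version needs a number of forward passes proportional to the number of weights, which is why it is typically reserved for spot-checking a hand-written backward pass rather than for training.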
Automatic Differentiation: Why You Rarely Hand-Derive This
Modern frameworks track every operation performed during the forward pass and automatically construct the correct backward computation, so you rarely write the manual chain-rule code shown above for a real model.
import torch

X = torch.tensor([[1.0, 2.0]], requires_grad=False)
W1 = torch.tensor([[0.5, -0.3], [0.2, 0.8]], requires_grad=True)
W2 = torch.tensor([[0.6], [-0.4]], requires_grad=True)
target = torch.tensor([[1.0]])

a1 = torch.relu(X @ W1)
prediction = a1 @ W2
loss = ((prediction - target) ** 2).mean()

loss.backward()  # computes every gradient automatically
print(W1.grad)
print(W2.grad)

loss.backward() is executing precisely the manual process shown in the earlier code block. The framework has simply automated tracking and computing it, freeing you to focus on architecture rather than gradient derivation.
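To make the idea of "tracking every operation" less abstract, here is a toy reverse-mode sketch in plain Python. It is an illustration under simplifying assumptions (scalars only, four operations) and is not how PyTorch is implemented internally; the class name Value and its attributes are hypothetical. Each result remembers its inputs and the local derivative with respect to each one, and backward() walks that recorded graph in reverse.

```python
class Value:
    """Hypothetical toy scalar that records how it was computed."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        local = 1.0 if self.data > 0 else 0.0
        return Value(max(0.0, self.data), (self,), (local,))

    def backward(self):
        # Topologically order the recorded graph, then walk it in reverse so
        # each node's gradient is complete before it is passed to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dLoss/dLoss
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # chain rule, accumulated


# loss = (relu(w * x) - t) ** 2, written with the operations defined above
w, x, t = Value(0.5), Value(2.0), Value(3.0)
diff = (w * x).relu() - t
loss = diff * diff
loss.backward()

print(loss.data, w.grad, x.grad)  # 4.0 -8.0 -2.0
```

The printed gradients match the hand derivation: dLoss/dw = 2(wx - t)x = -8 and dLoss/dx = 2(wx - t)w = -2. Note that gradients are added into parent.grad rather than assigned, which is the same accumulation convention that makes zeroing gradients necessary in a training loop.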
Why Backpropagation Sometimes Fails: A Preview
The chain rule multiplies many derivatives together across a deep network’s layers. If most of those derivatives are consistently smaller than 1 in magnitude (common with sigmoid/tanh activations), the product shrinks toward zero as it propagates through many layers, causing the Vanishing Gradient Problem. If they’re consistently larger than 1 in magnitude, the product grows explosively, causing the Exploding Gradient Problem. Both of these well-known training failures are direct mathematical consequences of the chained multiplication shown in this guide’s walkthrough, not separate, unrelated phenomena.
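A quick numerical sketch shows how fast such a product moves. It makes a strong simplifying assumption: each layer contributes a single scalar factor and the weight matrices are ignored, whereas in a real network the weights also multiply into the product. The factor 1.5 used for the growing case is an arbitrary illustrative value.

```python
import numpy as np


def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid, s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)


depths = [1, 5, 10, 20]

# The sigmoid derivative peaks at 0.25 (at z = 0), so this is its best case.
best_case = sigmoid_derivative(0.0)
shrinking = [best_case ** depth for depth in depths]

# Arbitrary per-layer factor above 1, chosen only to illustrate growth.
growing = [1.5 ** depth for depth in depths]

for depth, small, large in zip(depths, shrinking, growing):
    print(f"{depth:>2} layers: 0.25^n = {small:.3e}   1.5^n = {large:.3e}")
```

Even in the sigmoid’s best case, twenty such factors multiply to less than 1e-12, while twenty factors of 1.5 exceed 3000.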
Common Mistakes When Reasoning About Backpropagation
Assuming gradients are computed independently per weight. In reality, gradients for earlier-layer weights depend directly on gradients already computed for later layers — this dependency chain is exactly why backpropagation must proceed strictly backward through the network, layer by layer, rather than computing every weight’s gradient in an arbitrary order.
Forgetting to zero gradients between training steps. PyTorch accumulates gradients by default across multiple .backward() calls rather than overwriting them. Unless you are deliberately accumulating gradients across batches, optimizer.zero_grad() must be called before each new backward pass, or gradients from previous batches will incorrectly add into the current step’s computation, a subtle bug that produces plausible-looking but incorrect training behavior.
optimizer.zero_grad()  # clear old gradients first
loss.backward()  # compute fresh gradients
optimizer.step()  # apply them

Treating .backward() as a black box. Understanding that it’s executing the exact chain-rule computation shown manually earlier in this guide is what makes debugging a None gradient, an unexpected NaN, or a frozen (non-updating) layer tractable, rather than mysterious.
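Two of the behaviors mentioned in this list can be observed directly in a few lines. The sketch below is a toy setup assumed for illustration, not a realistic training loop: the tensors w and unused and the SGD optimizer with lr=0.1 are arbitrary choices, and optimizer.step() is deliberately omitted so that w stays fixed and the gradients are comparable.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
unused = torch.tensor([3.0], requires_grad=True)  # never used in the loss
optimizer = torch.optim.SGD([w], lr=0.1)

# First backward pass: d/dw of sum(w ** 2) is 2 * w.
loss = (w ** 2).sum()
loss.backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old.
loss = (w ** 2).sum()
loss.backward()
second = w.grad.clone()

# Zeroing first gives a fresh gradient again.
optimizer.zero_grad()
loss = (w ** 2).sum()
loss.backward()
third = w.grad.clone()

print("first: ", first)
print("second:", second)   # twice the first, because of accumulation
print("third: ", third)    # same as the first
print("unused.grad is None:", unused.grad is None)
```

The second result is double the first because nothing cleared the stored gradient in between. The unused tensor ends up with a gradient of None because it never took part in computing the loss, which is one common cause of the None gradients mentioned above.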
Summary
| Step | What Happens |
|---|---|
| Forward pass | Computes the prediction and resulting loss |
| Backward pass | Applies the chain rule layer by layer, in reverse |
| Gradient reuse | Makes computing all gradients cost only a small constant multiple of one forward pass |
| Automatic differentiation | What .backward() does — automates the manual process shown here |
Backpropagation isn’t a separate algorithm bolted onto neural networks — it’s the direct, systematic application of the chain rule to a layered computation, made efficient by reusing intermediate results as it works backward through the network.