Calculus for Deep Learning: Derivatives, Gradients, and the Chain Rule
If linear algebra describes what a neural network computes, calculus describes how it learns. Every weight update during training is the direct result of a derivative calculation — specifically, the chain rule applied repeatedly across every layer. You don’t need to solve calculus problems by hand to train a model, but understanding what a gradient actually represents transforms backpropagation from a mysterious framework internal into something you can reason about and debug.
Derivatives: The Rate of Change
A derivative measures how much a function’s output changes when its input changes slightly — the slope of the function at a given point. For a simple function like f(x) = x², the derivative f'(x) = 2x tells you the slope at any point x.
def f(x): return x ** 2
def f_derivative(x): return 2 * x
# At x = 3, the function is increasing at a rate of 6 units per unit of x
print(f_derivative(3))  # 6
In deep learning, the function in question is almost always the loss function, and the derivative tells you how much the loss changes as you nudge a specific weight. That single number is what tells the optimizer which direction to move the weight.
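One way to make "changes slightly" concrete is to estimate the slope numerically and compare it with the analytic derivative. The sketch below uses a central difference; the helper name `numerical_derivative` and the step size `eps` are assumptions chosen for illustration, not part of the original example.

```python
def f(x):
    return x ** 2

def numerical_derivative(func, x, eps=1e-5):
    # Central difference: slope of the line through two points straddling x.
    return (func(x + eps) - func(x - eps)) / (2 * eps)

approx = numerical_derivative(f, 3.0)
print(approx)  # very close to the analytic value 2 * 3 = 6
```

The estimate agrees with f'(3) = 6 up to floating-point error, which is the same kind of comparison a gradient check performs on a real model.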
Partial Derivatives: One Variable at a Time
A neural network’s loss depends on thousands or millions of weights simultaneously, not just one variable. A partial derivative measures how the loss changes with respect to one specific weight, holding all others constant.
# Loss as a function of two weights: L(w1, w2) = w1^2 + 3*w1*w2 + w2^2
# Partial derivative with respect to w1: dL/dw1 = 2*w1 + 3*w2
# Partial derivative with respect to w2: dL/dw2 = 3*w1 + 2*w2
def partial_w1(w1, w2): return 2 * w1 + 3 * w2
def partial_w2(w1, w2): return 3 * w1 + 2 * w2
Computing every weight’s partial derivative separately is exactly what training a network requires: each weight gets its own signal telling it which direction, and by how much, it should move to reduce the loss.
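"Holding all others constant" can be shown directly by nudging one weight while leaving the rest untouched. This is a minimal sketch; `loss_fn`, `numerical_partial`, and `eps` are hypothetical names introduced here, and the loss is the two-weight example from this section.

```python
def loss_fn(weights):
    w1, w2 = weights
    return w1 ** 2 + 3 * w1 * w2 + w2 ** 2

def numerical_partial(func, weights, index, eps=1e-5):
    # Nudge only weights[index]; every other weight stays fixed.
    up = list(weights)
    down = list(weights)
    up[index] += eps
    down[index] -= eps
    return (func(up) - func(down)) / (2 * eps)

point = [1.0, 2.0]
approx_dw1 = numerical_partial(loss_fn, point, 0)  # analytic: 2*1 + 3*2 = 8
approx_dw2 = numerical_partial(loss_fn, point, 1)  # analytic: 3*1 + 2*2 = 7
print(approx_dw1, approx_dw2)
```

Both estimates match the analytic partial derivatives up to floating-point error.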
Gradients: All Partial Derivatives, Bundled Together
A gradient is simply the vector of all partial derivatives of a function with respect to every one of its variables. For a network with a million weights, the gradient is a vector with a million entries — one number per weight, each telling that specific weight how to change.
import numpy as np
def gradient(w1, w2): return np.array([partial_w1(w1, w2), partial_w2(w1, w2)])
grad = gradient(1.0, 2.0)  # array([8., 7.])
Wherever it is nonzero, the gradient points in the direction of steepest increase of the function. Since training wants to minimize loss, gradient descent moves weights in the opposite direction of the gradient. This is the mechanical basis of how gradient-trained neural networks learn, covered in depth in Gradient Descent.
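A single update step shows what "moving opposite to the gradient" means in practice. This sketch reuses the two-weight loss from this section; the `learning_rate` of 0.1 is an assumed value picked for illustration, and a real optimizer would repeat the step many times.

```python
def loss_fn(w1, w2):
    return w1 ** 2 + 3 * w1 * w2 + w2 ** 2

def gradient(w1, w2):
    return (2 * w1 + 3 * w2, 3 * w1 + 2 * w2)

learning_rate = 0.1  # assumed step size
w1, w2 = 1.0, 2.0

g1, g2 = gradient(w1, w2)          # (8.0, 7.0)
new_w1 = w1 - learning_rate * g1   # step against the gradient
new_w2 = w2 - learning_rate * g2

loss_before = loss_fn(w1, w2)          # 11.0
loss_after = loss_fn(new_w1, new_w2)   # about 2.51
print(loss_before, loss_after)
```

One small step against the gradient lowers the loss from 11 to roughly 2.51. Too large a step could overshoot, which is why the learning rate matters.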
The Chain Rule: Why Backpropagation Works at All
A deep network is a composition of functions — the output of layer 1 feeds into layer 2, whose output feeds into layer 3, and so on. The chain rule is the calculus rule for differentiating a composition of functions, and it’s the entire mathematical justification for backpropagation.
If y = f(g(x)), then dy/dx = f'(g(x)) * g'(x)
Applied to a network, this means: to know how a weight in an early layer affects the final loss, you multiply together the derivatives at every layer between that weight and the output.
# Simplified 2-layer chain rule example
def layer1(x, w1): return x * w1
def layer2(h, w2): return h * w2
def loss(pred, target): return (pred - target) ** 2
# dLoss/dw1 = dLoss/dpred * dpred/dh * dh/dw1
def gradient_w1(x, w1, w2, target):
    h = layer1(x, w1)
    pred = layer2(h, w2)
    dloss_dpred = 2 * (pred - target)
    dpred_dh = w2
    dh_dw1 = x
    return dloss_dpred * dpred_dh * dh_dw1
This chained multiplication, applied automatically and efficiently across every layer, is exactly what Backpropagation does. It is also the direct cause of the Vanishing Gradient Problem, where many small derivatives multiplied together shrink toward zero in very deep networks.
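The shrinking effect is easy to see with numbers. The sketch below assumes a deliberately simplified network: each layer is a single sigmoid unit with weight 1 and pre-activation `z = 0`, so every layer contributes the sigmoid's largest possible local derivative, 0.25. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never exceeds 0.25

def chained_gradient(depth, z=0.0):
    # Multiply one local derivative per layer, as the chain rule prescribes.
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_derivative(z)
    return grad

for depth in (1, 5, 20):
    print(depth, chained_gradient(depth))
```

Even in this best case for the sigmoid, 20 layers leave a factor of 0.25 ** 20, which is below 1e-12, so a weight near the input receives almost no learning signal.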
Why Frameworks Compute This Automatically
PyTorch’s autograd and TensorFlow’s GradientTape compute every one of these derivatives automatically via a technique called automatic differentiation — you never hand-derive the chain rule for a real network.
import torch
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = w * x
y.backward()
print(w.grad)  # tensor(2.) -- dy/dw = x = 2
Knowing what .backward() is actually computing (the chain rule, applied automatically across every operation in your network’s computation graph) is what turns a training script from “code I copied that works” into something you can actually debug when loss stops decreasing or gradients misbehave.
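To make the idea of a computation graph concrete, here is a toy reverse-mode sketch in plain Python. The `Value` class is hypothetical and written only for this illustration: it handles scalars, supports just multiplication and subtraction, and is not a description of how PyTorch is implemented internally.

```python
class Value:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def backward(self):
        # Order the graph so each node is handled after everything that uses it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            # Chain rule: upstream gradient times the local derivative.
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local


# Same computation as the PyTorch snippet: y = w * x
x = Value(2.0)
w = Value(3.0)
y = w * x
y.backward()
print(w.grad)  # 2.0, matching dy/dw = x

# The two-layer example: loss = (x * w1 * w2 - target) ** 2
x_in, w1, w2, target = Value(2.0), Value(3.0), Value(0.5), Value(1.0)
diff = (x_in * w1) * w2 - target
loss = diff * diff
loss.backward()
print(w1.grad)  # 4.0 = 2 * (pred - target) * w2 * x
```

Each operation stores only its own local derivatives, and `backward()` multiplies them together along every path from the output back to each input. That is the same chained product written out by hand in `gradient_w1`.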
Second-Order Derivatives: A Brief, Useful Mention
Standard gradient descent uses only first derivatives. Some optimization methods also use second-order derivatives (the derivative of the derivative), which describe the curvature of the loss landscape rather than just its slope. The Hessian matrix contains all second-order partial derivatives, and its eigenvalues, covered in Eigenvalues and Eigenvectors, describe the shape of the loss landscape at a point where the gradient is zero: positive curvature in every direction indicates a minimum (a sharp one if the values are large), near-zero curvature in some directions indicates a flat region, and a mix of positive and negative curvature indicates a saddle point. The full Hessian has one entry per pair of weights, so computing it is generally too expensive for large neural networks. That is why nearly all practical deep learning optimizers rely on first-order gradient information, occasionally supplemented with cheaper approximations of curvature.
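For a function small enough to write down, the Hessian and its eigenvalues can be computed directly. As a rough illustration, the sketch below uses the two-weight loss from the partial-derivatives section, L(w1, w2) = w1^2 + 3*w1*w2 + w2^2, whose second derivatives are constants. The variable names are assumptions made for this example.

```python
import numpy as np

# Second partial derivatives of L(w1, w2) = w1^2 + 3*w1*w2 + w2^2:
# d2L/dw1^2 = 2, d2L/dw1dw2 = 3, d2L/dw2^2 = 2
hessian = np.array([[2.0, 3.0],
                    [3.0, 2.0]])

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
curvatures = np.linalg.eigvalsh(hessian)
print(curvatures)  # approximately [-1.  5.]

# One negative and one positive eigenvalue: curvature has mixed signs.
is_saddle = bool(curvatures.min() < 0 < curvatures.max())
print(is_saddle)  # True
```

The loss curves upward along one direction and downward along another, so the point where its gradient is zero, (0, 0), is a saddle rather than a minimum. For a network with n weights the Hessian has n * n entries, which is what makes this direct approach impractical at scale.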
Summary
| Concept | Role |
|---|---|
| Derivative | How much output changes per unit change in input |
| Partial derivative | The derivative with respect to one specific weight |
| Gradient | The vector of all partial derivatives — the full “which direction to move” signal |
| Chain rule | How derivatives combine across layers, enabling backpropagation |
Calculus in deep learning isn’t abstract theory sitting alongside the code — it is the training loop. Every loss.backward() call is the chain rule executing automatically, and every weight update is a direct consequence of a gradient calculated exactly the way shown above.