Gradient Descent Explained: Batch, Stochastic, and Mini-Batch Variants
Every neural network you’ll ever train learns through the same core mechanism: compute how wrong the current predictions are, calculate the gradient of that error with respect to every weight, and nudge each weight slightly in the direction that reduces the error. This process, repeated thousands or millions of times, is gradient descent — and the specific variant you use (batch, stochastic, or mini-batch) has a real, practical effect on training speed, stability, and final model quality.
The Core Algorithm
Gradient descent updates each weight by subtracting a fraction (controlled by the learning rate) of the gradient of the loss with respect to that weight.
learning_rate = 0.01
for weight in model.parameters():
    gradient = compute_gradient(loss, weight)
    weight = weight - learning_rate * gradient

The gradient, covered mathematically in Calculus for Deep Learning, always points in the direction of steepest increase. Subtracting it moves the weight in the direction that decreases the loss, which is the entire mechanical basis of training.
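As a minimal sketch of this update rule in action, the following applies it to a toy one-parameter loss, f(w) = (w - 3)^2, whose gradient is 2(w - 3). The loss, the starting point, and the helper names `grad_f` and `gradient_descent` are illustrative assumptions, not part of any library.

```python
def grad_f(w: float) -> float:
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0: float, learning_rate: float, num_steps: int) -> float:
    """Repeatedly subtract learning_rate * gradient, starting from w0."""
    w = w0
    for _ in range(num_steps):
        w = w - learning_rate * grad_f(w)
    return w


# Start far from the minimum at w = 3 and take 100 small steps toward it.
w_final = gradient_descent(w0=0.0, learning_rate=0.1, num_steps=100)
print(w_final)
```

With a learning rate of 0.1, each step shrinks the distance to the minimum at w = 3 by a factor of 0.8, so 100 steps is ample for this toy problem.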
Batch Gradient Descent: Using the Entire Dataset Per Update
Batch gradient descent computes the gradient using the entire training dataset before making a single weight update.
for epoch in range(num_epochs):
    predictions = model(X_train_full)  # entire dataset at once
    loss = compute_loss(predictions, y_train_full)
    gradients = compute_gradients(loss, model.parameters())
    update_weights(model.parameters(), gradients, learning_rate)

This produces the most accurate possible gradient estimate at each step, but it’s often computationally infeasible for large datasets: computing a single update requires a full pass over potentially millions of examples, and it means the model only gets to update its weights once per epoch, which is extremely slow to converge in practice.
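To make the loop above concrete, here is a small runnable sketch that fits a line with full-batch gradient descent in plain Python. The five-point dataset, the model `w * x + b`, and the hyperparameter values are all assumptions chosen for illustration.

```python
# Toy dataset (an assumption for illustration): points on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0  # model: prediction = w * x + b
learning_rate = 0.05
num_epochs = 2000
num_updates = 0

for epoch in range(num_epochs):
    # Gradient of the mean squared error, averaged over the ENTIRE dataset.
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2.0 / n) * sum(errors)

    # Exactly one weight update per epoch.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    num_updates += 1

print(f"w={w:.4f}, b={b:.4f}, updates={num_updates}")
```

The number of updates equals the number of epochs, which is the cost described above: every single step requires touching every example.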
Stochastic Gradient Descent (SGD): One Example at a Time
Stochastic gradient descent updates the weights after computing the gradient from just a single training example.
for epoch in range(num_epochs):
    for x_i, y_i in zip(X_train, y_train):  # one example at a time
        prediction = model(x_i)
        loss = compute_loss(prediction, y_i)
        gradient = compute_gradient(loss, model.parameters())
        update_weights(model.parameters(), gradient, learning_rate)

This makes far more frequent updates (one per example, rather than one per full dataset pass), but each individual gradient estimate is noisy, since it’s based on just a single data point rather than the true average error across the whole dataset. This noise isn’t purely a downside: it can actually help the optimizer escape shallow local minima, connecting back to the non-convex optimization landscape discussed in Optimization Basics.
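The same idea can be sketched as runnable plain Python on a toy line-fitting problem. The dataset, the model `w * x + b`, the per-epoch reshuffle, and the hyperparameter values are assumptions made for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy dataset (an assumption for illustration): points on the line y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

w, b = 0.0, 0.0  # model: prediction = w * x + b
learning_rate = 0.02
num_epochs = 500
num_updates = 0

for epoch in range(num_epochs):
    rng.shuffle(data)  # visit the examples in a new order each epoch
    for x_i, y_i in data:
        # Gradient of the squared error on this ONE example only.
        error = (w * x_i + b) - y_i
        grad_w = 2.0 * error * x_i
        grad_b = 2.0 * error

        # One weight update per example.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        num_updates += 1

print(f"w={w:.4f}, b={b:.4f}, updates={num_updates}")
```

Here 500 epochs over five examples yields 2,500 updates. Because this toy data lies exactly on a line, the noise fades as the fit improves; with real, noisy data the weights would typically keep jittering around the minimum.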
Mini-Batch Gradient Descent: The Practical Default
Mini-batch gradient descent splits the difference — computing the gradient from a small batch of examples (commonly 32, 64, or 128) rather than either the full dataset or a single example.
batch_size = 64
for epoch in range(num_epochs):
    for batch_X, batch_y in get_batches(X_train, y_train, batch_size):
        predictions = model(batch_X)
        loss = compute_loss(predictions, batch_y)
        gradients = compute_gradients(loss, model.parameters())
        update_weights(model.parameters(), gradients, learning_rate)

This is what virtually every real deep learning training loop actually uses. It balances gradient accuracy (averaging over enough examples to reduce noise substantially) against update frequency (still making many updates per epoch, unlike full-batch gradient descent). Critically, mini-batches also map efficiently onto GPU parallel computation, processing dozens or hundreds of examples simultaneously rather than one at a time.
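The loop above relies on a `get_batches` helper that the article does not define. One possible version is sketched below, together with a runnable mini-batch training loop on a toy line-fitting problem. The extra `rng` argument, the dataset, and the hyperparameter values are assumptions for illustration only.

```python
import random


def get_batches(X, y, batch_size, rng):
    """Yield (batch_X, batch_y) pairs covering the data once, in shuffled order."""
    indices = list(range(len(X)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [X[i] for i in chunk], [y[i] for i in chunk]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy dataset (an assumption for illustration): points on the line y = 3x - 0.5.
X_train = [rng.uniform(-1.0, 1.0) for _ in range(256)]
y_train = [3.0 * x - 0.5 for x in X_train]

w, b = 0.0, 0.0  # model: prediction = w * x + b
learning_rate = 0.1
batch_size = 32
num_epochs = 200
num_updates = 0

for epoch in range(num_epochs):
    for batch_X, batch_y in get_batches(X_train, y_train, batch_size, rng):
        # Gradient of the mean squared error, averaged over this batch only.
        n = len(batch_X)
        errors = [(w * x + b) - t for x, t in zip(batch_X, batch_y)]
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, batch_X))
        grad_b = (2.0 / n) * sum(errors)

        # One weight update per batch.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        num_updates += 1

print(f"w={w:.4f}, b={b:.4f}, updates={num_updates}")
```

With 256 examples and a batch size of 32, each epoch makes 8 updates, which sits between the single update of full-batch gradient descent and the 256 updates of per-example SGD.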
Comparing the Three Variants
| Variant | Gradient computed from | Update frequency | Gradient noise | GPU efficiency |
|---|---|---|---|---|
| Batch GD | Entire dataset | Once per epoch | Very low | Poor for large datasets |
| Stochastic GD | Single example | Once per example | High | Poor — underutilizes parallelism |
| Mini-batch GD | Small batch (32–512) | Once per batch | Moderate | Excellent |
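The "Update frequency" column can be made concrete with a little arithmetic. The sketch below counts weight updates per epoch for each variant, assuming a hypothetical dataset of 50,000 examples and a mini-batch size of 64; `updates_per_epoch` is an illustrative helper, not a library function.

```python
import math


def updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of weight updates in one pass over the data (last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)


num_examples = 50_000  # hypothetical dataset size

batch_gd_updates = updates_per_epoch(num_examples, batch_size=num_examples)
sgd_updates = updates_per_epoch(num_examples, batch_size=1)
mini_batch_updates = updates_per_epoch(num_examples, batch_size=64)

print(batch_gd_updates, sgd_updates, mini_batch_updates)  # prints: 1 50000 782
```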
Why Batch Size Is a Real Hyperparameter, Not Just a Memory Constraint
Batch size affects more than just how much GPU memory a training run consumes. It directly affects gradient noise and, through that, both training stability and final generalization quality. Very large batch sizes produce smoother, more accurate gradients but can converge to sharper minima that generalize slightly worse; smaller batches introduce useful noise that can act as a mild regularizer, at the cost of noisier, less stable individual updates. Batch size also interacts directly with the learning rate: a common rule of thumb is that increasing the batch size often calls for a proportionally larger learning rate to maintain similar training dynamics, a topic covered further in Learning Rate and in Epochs, Batch Size, and Iterations.
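The link between batch size and gradient noise can be checked empirically. The following sketch measures how much a mini-batch gradient estimate varies from batch to batch at a fixed weight, for three batch sizes. The synthetic dataset, the one-weight model, and the `gradient_noise` helper are assumptions made for this illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical noisy dataset: y = 2x + Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(2048)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

w = 0.0  # evaluate every gradient at the same fixed weight

# Per-example gradient of the squared error (w * x - y)^2 with respect to w.
per_example_grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]


def gradient_noise(batch_size: int, num_trials: int = 300) -> float:
    """Standard deviation of the mini-batch gradient estimate across random batches."""
    estimates = []
    for _ in range(num_trials):
        batch = rng.sample(per_example_grads, batch_size)
        estimates.append(sum(batch) / batch_size)
    return statistics.pstdev(estimates)


noise = {size: gradient_noise(size) for size in (1, 16, 256)}
for size, value in noise.items():
    print(f"batch_size={size:>3}  gradient std={value:.4f}")
```

For randomly sampled batches, the spread should shrink roughly in proportion to one over the square root of the batch size, which is why very large batches give smooth gradients and very small ones give noisy gradients.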
What “Modern” Optimizers Add on Top
Plain gradient descent (in any of its three variants) uses a single, fixed learning rate for every weight and every step — modern optimizers like Adam and RMSProp, covered in Optimizers, build on top of mini-batch gradient descent by adaptively adjusting the effective learning rate per parameter, based on the recent history of that parameter’s gradients. These aren’t a replacement for the core gradient descent mechanism described here — they’re a refinement of exactly this process.
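To illustrate the difference, here is a plain-Python sketch of a single plain gradient descent step next to a single Adam step, following the commonly published Adam update rule. It is a teaching sketch rather than the implementation used by any framework; the function names, the example gradients, and the learning rate of 0.01 are assumptions.

```python
import math


def sgd_step(params, grads, learning_rate):
    """Plain gradient descent: one shared, fixed learning rate for every parameter."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def adam_step(params, grads, m, v, t, learning_rate=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running averages of the gradient and its square."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g      # running mean of gradients
        v_i = beta2 * v_i + (1.0 - beta2) * g * g  # running mean of squared gradients
        m_hat = m_i / (1.0 - beta1 ** t)           # bias correction for early steps
        v_hat = v_i / (1.0 - beta2 ** t)
        # Dividing by sqrt(v_hat) rescales the step separately for each parameter.
        new_params.append(p - learning_rate * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


params = [1.0, 1.0]
grads = [0.001, 100.0]  # two gradients with very different magnitudes

sgd_params = sgd_step(params, grads, learning_rate=0.01)
adam_params, m, v = adam_step(params, grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)

print("plain GD:", sgd_params)
print("Adam:    ", adam_params)
```

With plain gradient descent the two parameters move by wildly different amounts (about 0.00001 versus 1.0), while the first Adam step moves both by roughly the learning rate, because each gradient is normalized by its own recent magnitude.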
Shuffling Data Between Epochs
A detail that’s easy to overlook but genuinely matters: mini-batch gradient descent assumes each batch is a reasonably representative, independent sample of the overall dataset. If data isn’t shuffled between epochs, the model sees batches in the exact same fixed order every single epoch, which can introduce subtle biases — for instance, if the dataset happens to be sorted by class, early batches in every epoch would be dominated by one class, skewing gradient estimates during that portion of training in a way that doesn’t reflect the true data distribution.
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset, batch_size=64, shuffle=True)  # reshuffled every epoch

This is why shuffle=True is close to a universal default for training data loaders. It ensures each epoch’s sequence of mini-batches presents a genuinely different, representative sampling order, keeping gradient estimates less biased by any incidental ordering in how the raw dataset happens to be stored.
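The sorted-by-class problem is easy to reproduce without any framework. The sketch below batches a hypothetical list of 16 labels stored in class order, first as-is and then after reshuffling each epoch; `make_batches` and the label list are illustrative assumptions.

```python
import random


def make_batches(items, batch_size):
    """Split a list into consecutive batches (the last one may be smaller)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical labels for a dataset stored sorted by class: eight 0s, then eight 1s.
labels = [0] * 8 + [1] * 8
batch_size = 4

# Without shuffling, every batch contains a single class.
unshuffled_batches = make_batches(labels, batch_size)

# With shuffling, the order is re-drawn each epoch before batching.
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled_epochs = []
for epoch in range(3):
    order = labels[:]  # copy so the stored dataset order is untouched
    rng.shuffle(order)
    shuffled_epochs.append(make_batches(order, batch_size))

print("unshuffled:", unshuffled_batches)
for epoch, batches in enumerate(shuffled_epochs):
    print(f"epoch {epoch}:", batches)
```

In the unshuffled case the first two batches of every epoch are all class 0 and the last two are all class 1, which is exactly the skew described above. Shuffling keeps the same examples but changes which ones land in each batch.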
Summary
| Concept | Practical Takeaway |
|---|---|
| Batch GD | Most accurate gradient, rarely practical at scale |
| Stochastic GD | Noisiest gradient, most frequent updates |
| Mini-batch GD | The practical default — balances accuracy, speed, and GPU efficiency |
| Batch size | A real hyperparameter affecting both stability and generalization, not just memory usage |
Nearly every training loop you’ll write or use starts from mini-batch gradient descent as the foundation — everything else, from momentum to adaptive learning rates, is a refinement layered on top of this same core update rule.