Epochs, Batch Size, and Iterations: How Training Loops Are Actually Structured
“Train for 50 epochs with a batch size of 32” is a sentence that appears in nearly every deep learning tutorial. It is worth understanding precisely what epoch, batch size, and the closely related term iteration mean, and how they relate. Getting the arithmetic wrong (assuming an “iteration” and an “epoch” are the same thing, for instance) is a common source of misconfigured training runs and confusing logs.
Epoch: One Complete Pass Through the Entire Training Dataset
An epoch is one full pass through the entire training dataset — every example has been seen by the model exactly once by the end of an epoch.
```python
for epoch in range(num_epochs):
    for batch in training_data_batches:
        train_on_batch(model, batch)
    print(f"Completed epoch {epoch + 1}")
```

Training for multiple epochs means the model sees the same data multiple times, refining its weights further each pass. A single epoch is rarely enough for a model to converge to good performance, which is why training runs typically span dozens or hundreds of epochs.
Batch Size: How Many Examples Are Processed Per Weight Update
Batch size is the number of training examples processed together before the model’s weights are updated once. (How batch size affects gradient estimation is covered in Gradient Descent.)
```python
batch_size = 32
dataset_size = 10000

num_batches_per_epoch = dataset_size // batch_size  # 312 full batches per epoch (with 16 leftover examples)
```

A larger batch size means fewer updates per epoch, each of which is more computationally efficient (better GPU utilization per update). The trade-off is that each update averages over more examples, so the optimizer gets fewer opportunities to correct course within a single epoch.
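Whether those 16 leftover examples form a batch of their own changes the count by one. The sketch below shows both conventions; the function name `batches_per_epoch` is hypothetical, and the `drop_last` flag is named after the PyTorch `DataLoader` option discussed later in this article.

```python
import math

def batches_per_epoch(dataset_size: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches (and therefore weight updates) in one pass over the data."""
    if drop_last:
        # The incomplete final batch is discarded: floor division.
        return dataset_size // batch_size
    # The incomplete final batch is kept as a smaller batch: ceiling division.
    return math.ceil(dataset_size / batch_size)

print(batches_per_epoch(10_000, 32, drop_last=True))   # 312
print(batches_per_epoch(10_000, 32, drop_last=False))  # 313
```

The rest of this article uses the floor-division count of 312, which corresponds to dropping the partial batch.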
Iterations: How Many Weight Updates Actually Happen
An iteration (sometimes called a “step”) is a single weight update — one full pass through a single batch, including forward propagation, loss computation, backpropagation, and the resulting weight update.
```python
num_epochs = 50

iterations_per_epoch = dataset_size // batch_size  # same as num_batches_per_epoch
total_iterations = iterations_per_epoch * num_epochs

# Example: 10,000 examples, batch size 32, 50 epochs
# iterations_per_epoch = 312
# total_iterations = 312 * 50 = 15,600 total weight updates across the entire training run
```

This is the number that matters most for understanding “how much learning has happened”. Two training runs with the same number of epochs but different batch sizes perform a very different number of weight updates, which is a common source of confusion when comparing training configurations. Note that the floor division counts only full batches: if the 16-example final batch is kept, each epoch has 313 iterations and the run has 15,650 in total.
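To make the point about update counts tangible, here is a toy sketch that fits a single slope with mini-batch gradient descent and counts the weight updates. Everything in it is an illustrative assumption rather than part of the article's setup: the function name `fit_slope`, the synthetic noise-free data, the fixed learning rate, and the choice to skip any incomplete batch.

```python
import random

def fit_slope(xs, ys, batch_size, num_epochs, lr=0.05, seed=0):
    """Fit y = w * x with mini-batch SGD; return (w, number_of_weight_updates)."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    w, updates = 0.0, 0
    for _ in range(num_epochs):
        rng.shuffle(order)  # new batch composition each epoch
        # Step through full batches only (an incomplete tail would be skipped).
        for start in range(0, len(order) - batch_size + 1, batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the mean squared error with respect to w, averaged over the batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1  # one iteration = one weight update
    return w, updates

xs = [i / 100 for i in range(100)]
ys = [2.0 * x for x in xs]  # true slope is 2.0

w_small, updates_small = fit_slope(xs, ys, batch_size=10, num_epochs=5)
w_large, updates_large = fit_slope(xs, ys, batch_size=50, num_epochs=5)
print(updates_small, updates_large)  # 50 10
```

Both runs last five epochs, but the batch-size-10 run performs five times as many updates and, with the same learning rate on this toy problem, ends closer to the true slope.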
The Relationship, Made Concrete
```
Dataset size:          10,000 examples
Batch size:            32
─────────────────────────────────
Iterations per epoch:  312     (10,000 / 32, rounded down)
Epochs:                50
─────────────────────────────────
Total iterations:      15,600  (312 * 50)
```

```python
def training_loop(dataset, batch_size, num_epochs, model):
    # Floor division: only full batches are counted.
    iterations_per_epoch = len(dataset) // batch_size
    total_iterations = 0

    for epoch in range(num_epochs):
        for batch_idx in range(iterations_per_epoch):
            # get_batch and train_step are placeholders for your own data and update code.
            batch = get_batch(dataset, batch_idx, batch_size)
            train_step(model, batch)
            total_iterations += 1

        print(f"Epoch {epoch+1}/{num_epochs} complete, "
              f"total iterations so far: {total_iterations}")
```

Why This Distinction Matters in Practice
Learning rate schedules are often defined per iteration, not per epoch. A warmup schedule that says “increase the learning rate over the first 1,000 steps” means 1,000 iterations, not 1,000 epochs. With 312 iterations per epoch, 1,000 steps is only a little over three epochs. Misreading the unit can produce a schedule that is far too short or far too long relative to what was intended (see Learning Rate Scheduling).
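A step-based warmup can be sketched in a few lines. This is a minimal illustration assuming a linear ramp; the function name `warmup_lr` and the base learning rate are hypothetical, and real schedules usually continue with a decay phase after warmup.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup: ramp up to base_lr over the first warmup_steps iterations."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # after warmup, hold the base rate (no decay in this sketch)

iterations_per_epoch = 10_000 // 32  # 312, as in the running example
warmup_steps = 1_000

# Express the warmup length in epochs to see how the two units relate.
print(warmup_steps / iterations_per_epoch)  # roughly 3.2 epochs
```

The schedule is indexed by `step`, the global iteration counter, so it behaves the same regardless of how many iterations happen to make up an epoch.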
Comparing training runs fairly requires accounting for total iterations, not just epochs. A run with batch size 32 for 50 epochs performs four times as many weight updates as one with batch size 128 for 50 epochs (15,600 versus 3,900 on the 10,000-example dataset above). Comparing their final performance without accounting for this difference can lead to incorrect conclusions about which configuration is better.
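One way to account for the difference is to ask how many epochs the larger-batch run would need to reach the same number of weight updates. The helper below is a hypothetical sketch that assumes incomplete batches are dropped; matching update counts is only one possible basis for comparison, not the only fair one.

```python
import math

def epochs_to_match(dataset_size, ref_batch_size, ref_epochs, new_batch_size):
    """Return (reference update count, epochs the new batch size needs to match it)."""
    ref_updates = (dataset_size // ref_batch_size) * ref_epochs
    new_updates_per_epoch = dataset_size // new_batch_size
    return ref_updates, math.ceil(ref_updates / new_updates_per_epoch)

ref_updates, epochs_needed = epochs_to_match(10_000, 32, 50, 128)
print(ref_updates, epochs_needed)  # 15600 200
```

With 78 full batches per epoch at batch size 128, it takes 200 epochs to reach the 15,600 updates that batch size 32 performs in 50.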
Very large datasets make “epoch” a less useful unit. For datasets with hundreds of millions of examples (common in large language model pretraining, covered in Large Language Models), practitioners often report progress in total tokens or total iterations processed rather than epochs, since a single full epoch may not even be completed during the entire training run.
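When progress is logged in steps, the equivalent fraction of an epoch is a one-line conversion. The numbers below are made up for illustration and do not describe any particular model or dataset; `epochs_completed` is a hypothetical helper name.

```python
def epochs_completed(steps: int, batch_size: int, dataset_size: int) -> float:
    """Fraction of the dataset seen so far, expressed in epochs."""
    return steps * batch_size / dataset_size

# Hypothetical run: 100,000 steps at batch size 1,024 over 500 million examples.
print(epochs_completed(100_000, 1_024, 500_000_000))  # about 0.2 of one epoch
```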
Choosing a Batch Size in Practice
| Consideration | Effect of larger batch size |
|---|---|
| GPU memory usage | Higher — more examples held in memory simultaneously |
| GPU compute efficiency per update | Higher — better parallelization |
| Gradient noise | Lower — averaged over more examples |
| Updates per epoch | Fewer |
| Generalization (empirically) | Sometimes slightly worse at very large batch sizes without adjustment |
The practical approach: choose the largest batch size your GPU memory comfortably allows, then adjust the learning rate accordingly (larger batches often benefit from a proportionally larger learning rate) rather than treating batch size and learning rate as independent choices.
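The "proportionally larger learning rate" idea is often called the linear scaling heuristic. A minimal sketch follows, assuming you already have a learning rate that works at some reference batch size; the function name and the example values are hypothetical, and the result is a starting point for tuning rather than a guaranteed setting.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate in proportion to the change in batch size."""
    return base_lr * new_batch_size / base_batch_size

# Hypothetical example: a rate tuned at batch size 32, moved to batch size 128.
print(scale_learning_rate(0.001, 32, 128))  # four times the base rate
```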
The Last Batch of an Epoch: A Small but Real Detail
When a dataset’s size isn’t evenly divisible by the batch size, the final batch of each epoch will contain fewer examples than the rest — 10,000 examples with a batch size of 32 leaves 16 examples in a final, smaller batch (or it can be dropped entirely, depending on configuration). This matters slightly for batch normalization, covered in Batch Normalization, since a smaller final batch produces a noisier estimate of batch statistics than the regular-sized batches throughout the rest of the epoch.
```python
from torch.utils.data import DataLoader

# PyTorch's DataLoader lets you choose whether to drop this smaller final batch
train_loader = DataLoader(dataset, batch_size=32, drop_last=True)  # drops the incomplete batch
```

This is a minor detail in most cases, but worth knowing about when debugging why the last logged batch of an epoch shows unusually different loss or gradient statistics compared to the rest.
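To see the effect without installing PyTorch, the batching behavior can be imitated in plain Python. The generator below is a hypothetical stand-in that mirrors what the `drop_last` option does; it is not how `DataLoader` is implemented.

```python
def iterate_batches(examples, batch_size, drop_last=False):
    """Yield consecutive batches; optionally skip a final batch that is too small."""
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return  # discard the incomplete final batch
        yield batch

examples = list(range(10_000))
kept = [len(b) for b in iterate_batches(examples, 32, drop_last=False)]
dropped = [len(b) for b in iterate_batches(examples, 32, drop_last=True)]

print(len(kept), kept[-1])        # 313 16
print(len(dropped), dropped[-1])  # 312 32
```

Keeping the final batch gives 313 iterations per epoch with a 16-example tail; dropping it gives 312 uniform batches and leaves 16 examples unseen in that epoch.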
Summary
| Term | Definition |
|---|---|
| Epoch | One full pass through the entire training dataset |
| Batch size | Number of examples processed per weight update |
| Iteration | A single weight update (one batch processed) |
Getting these three terms straight is what makes training logs, learning rate schedules, and comparisons between different training configurations actually interpretable, rather than a source of quiet, easily-made mistakes.