Epochs, Batch Size, and Iterations: How Training Loops Are Actually Structured

“Train for 50 epochs with a batch size of 32” is a sentence that appears in nearly every deep learning tutorial, and it’s worth understanding precisely what each of those three terms means and how they relate — because getting the arithmetic wrong (assuming an “iteration” and an “epoch” are the same thing, for instance) is a common source of misconfigured training runs and confusing logs.

Epoch: One Complete Pass Through the Entire Training Dataset

An epoch is one full pass through the entire training dataset — every example has been seen by the model exactly once by the end of an epoch.

for epoch in range(num_epochs):
    for batch in training_data_batches:
        train_on_batch(model, batch)
    print(f"Completed epoch {epoch + 1}")

Training for multiple epochs means the model sees the same data multiple times, refining its weights further each pass — a single epoch is rarely enough for a model to converge to good performance, which is why training runs typically span dozens or hundreds of epochs.

Batch Size: How Many Examples Are Processed Per Weight Update

Batch size is the number of training examples processed together before the model’s weights are updated once. Its role in gradient estimation is covered in Gradient Descent.

batch_size = 32
dataset_size = 10000

num_batches_per_epoch = dataset_size // batch_size   # 312 batches per epoch (with 16 leftover examples)

A larger batch size means fewer, more computationally efficient updates per epoch (better GPU utilization per update), but each update is based on averaging over more examples, which can mean less frequent opportunities for the optimizer to correct course within a single epoch.

Iterations: How Many Weight Updates Actually Happen

An iteration (sometimes called a “step”) is a single weight update — one full pass through a single batch, including forward propagation, loss computation, backpropagation, and the resulting weight update.

iterations_per_epoch = dataset_size // batch_size   # same as num_batches_per_epoch
total_iterations = iterations_per_epoch * num_epochs

# Example: 10,000 examples, batch size 32, 50 epochs
# iterations_per_epoch = 312 (floor division drops the 16 leftover examples;
#                             keeping them as a smaller final batch gives 313)
# total_iterations = 312 * 50 = 15,600 total weight updates across the entire training run

This is the number that actually matters most for understanding “how much learning has happened” — two training runs with the same number of epochs but different batch sizes perform a very different number of actual weight updates, which is a common source of confusion when comparing training configurations.
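
To make that comparison concrete, here is a minimal sketch that counts weight updates for two batch sizes at the same epoch count. The helper `count_updates` and its `drop_last` parameter are hypothetical names chosen for illustration (the parameter mirrors the PyTorch DataLoader option of the same name, but the function is not part of any library).

```python
import math


def count_updates(dataset_size, batch_size, num_epochs, drop_last=True):
    """Return (iterations_per_epoch, total_iterations) for a training run.

    Hypothetical helper: drop_last=True counts only full batches (floor),
    drop_last=False also counts the smaller final batch (ceiling).
    """
    if drop_last:
        per_epoch = dataset_size // batch_size
    else:
        per_epoch = math.ceil(dataset_size / batch_size)
    return per_epoch, per_epoch * num_epochs


small = count_updates(10_000, 32, 50)    # (312, 15600)
large = count_updates(10_000, 128, 50)   # (78, 3900)
print(small, large)
```

Same dataset, same 50 epochs, but the batch-size-32 run performs four times as many weight updates as the batch-size-128 run.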

The Relationship, Made Concrete

Dataset size:        10,000 examples
Batch size:                32
─────────────────────────────────
Iterations per epoch:     312   (10,000 / 32, rounded down)
Epochs:                    50
─────────────────────────────────
Total iterations:      15,600   (312 * 50)

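def training_loop(dataset, batch_size, num_epochs, model):
    # Floor division counts full batches only; leftover examples are skipped each epoch
    iterations_per_epoch = len(dataset) // batch_size
    total_iterations = 0

    for epoch in range(num_epochs):
        for batch_idx in range(iterations_per_epoch):
            # get_batch and train_step are placeholders for your own data and update code
            batch = get_batch(dataset, batch_idx, batch_size)
            train_step(model, batch)   # one iteration = one weight update
            total_iterations += 1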

        print(f"Epoch {epoch+1}/{num_epochs} complete, "
              f"total iterations so far: {total_iterations}")
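
The loop above leaves `get_batch` and `train_step` undefined. The sketch below fills them in with stand-ins so the iteration count can be checked end to end. Both helpers, the dictionary "model", and the 100-element list dataset are assumptions made purely for illustration; a real loop would also shuffle the data each epoch and perform an actual gradient update.

```python
def get_batch(dataset, batch_idx, batch_size):
    """Hypothetical helper: return the batch_idx-th contiguous slice."""
    start = batch_idx * batch_size
    return dataset[start:start + batch_size]


def train_step(model, batch):
    """Stand-in for a real update; here it only records the batch size."""
    model["batch_sizes_seen"].append(len(batch))


model = {"batch_sizes_seen": []}   # placeholder "model" for illustration
dataset = list(range(100))
batch_size, num_epochs = 32, 2

for epoch in range(num_epochs):
    for batch_idx in range(len(dataset) // batch_size):
        train_step(model, get_batch(dataset, batch_idx, batch_size))

print(len(model["batch_sizes_seen"]))   # 6 updates: 3 full batches x 2 epochs
```

With 100 examples and a batch size of 32, floor division yields 3 batches per epoch, so the last 4 examples are never used in this sketch.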

Why This Distinction Matters in Practice

Learning rate schedules are often defined per iteration, not per epoch. A warmup schedule that says “increase the learning rate over the first 1,000 steps” means 1,000 iterations, not 1,000 epochs. Misreading this produces a schedule that is far too short or far too long relative to what was intended. Schedules are covered further in Learning Rate Scheduling.
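
As a rough illustration, the sketch below implements a linear warmup keyed on the iteration count and converts the warmup length into epochs for the 10,000-example, batch-size-32 setup used in this guide. The function `warmup_lr` and its default values are assumptions for the example, not a specific library's scheduler.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1_000):
    """Linear warmup: ramp up to base_lr over warmup_steps iterations.

    `step` is the zero-based iteration index, not the epoch index.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


iterations_per_epoch = 10_000 // 32            # 312
warmup_epochs = 1_000 / iterations_per_epoch   # a little over 3 epochs

print(warmup_lr(0), warmup_lr(499), warmup_lr(5_000), warmup_epochs)
```

Here a 1,000-step warmup finishes a little after the third epoch. Reading "1,000" as epochs would stretch it far beyond the whole 50-epoch run.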

Comparing training runs fairly requires accounting for total iterations, not just epochs. A run with batch size 32 for 50 epochs performs about four times as many weight updates as one with batch size 128 for 50 epochs (15,600 versus 3,900 on the 10,000-example dataset used above). Comparing their final performance without accounting for this difference can lead to incorrect conclusions about which configuration is actually better.

Very large datasets make “epoch” a less useful unit. For datasets with hundreds of millions of examples (common in large language model pretraining, covered in Large Language Models), practitioners often report progress in total tokens or total iterations processed rather than epochs, since a single full epoch may not even be completed during the entire training run.
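
When epochs stop being a meaningful unit, progress can be derived from the step counter instead. The sketch below converts a step count into examples seen and the fraction of one epoch completed. The function name and all of the numbers are made up for illustration.

```python
def progress(step, batch_size, dataset_size):
    """Return (examples_seen, fraction_of_one_epoch) after `step` iterations."""
    examples_seen = step * batch_size
    return examples_seen, examples_seen / dataset_size


# Illustrative numbers only: 100,000 steps at batch size 1,024 on 500M examples
seen, epochs_done = progress(step=100_000, batch_size=1_024, dataset_size=500_000_000)
print(seen, epochs_done)   # 102,400,000 examples, roughly a fifth of one epoch
```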

Choosing a Batch Size in Practice

Consideration	Effect of larger batch size
GPU memory usage	Higher — more examples held in memory simultaneously
GPU compute efficiency per update	Higher — better parallelization
Gradient noise	Lower — averaged over more examples
Updates per epoch	Fewer
Generalization (empirically)	Sometimes slightly worse at very large batch sizes without adjustment

The practical approach: choose the largest batch size your GPU memory comfortably allows, then adjust the learning rate accordingly (larger batches often benefit from a proportionally larger learning rate) rather than treating batch size and learning rate as independent choices.
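
One common heuristic for the adjustment mentioned above is linear scaling: multiply the learning rate by the same factor as the batch size. The sketch below assumes that heuristic. It is a starting point for tuning rather than a guarantee, and the function name is hypothetical.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling heuristic: keep lr / batch_size roughly constant."""
    return base_lr * new_batch_size / base_batch_size


# Moving from batch size 32 to 128 (4x larger) suggests a 4x larger learning rate
new_lr = scale_learning_rate(base_lr=0.001, base_batch_size=32, new_batch_size=128)
print(new_lr)
```

The scaled value is best treated as an initial guess to validate on a held-out set, especially at very large batch sizes.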

The Last Batch of an Epoch: A Small but Real Detail

When a dataset’s size isn’t evenly divisible by the batch size, the final batch of each epoch will contain fewer examples than the rest — 10,000 examples with a batch size of 32 leaves 16 examples in a final, smaller batch (or it can be dropped entirely, depending on configuration). This matters slightly for batch normalization, covered in Batch Normalization, since a smaller final batch produces a noisier estimate of batch statistics than the regular-sized batches throughout the rest of the epoch.

from torch.utils.data import DataLoader

# PyTorch's DataLoader lets you choose whether to drop this smaller final batch
train_loader = DataLoader(dataset, batch_size=32, drop_last=True)   # drops the incomplete batch

This is a minor detail in most cases, but worth knowing about specifically when debugging why a training run’s very last logged batch of an epoch shows unusually different loss or gradient statistics compared to the rest.
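
To see the effect without any framework, the following pure-Python sketch lists the size of every batch in one epoch, with and without dropping the final partial batch. The generator `batch_sizes` is a hypothetical helper. Its `drop_last` flag imitates the DataLoader option of the same name but is not the library's implementation.

```python
def batch_sizes(dataset_size, batch_size, drop_last=False):
    """Yield the number of examples in each batch of one epoch."""
    full, leftover = divmod(dataset_size, batch_size)
    for _ in range(full):
        yield batch_size
    if leftover and not drop_last:
        yield leftover   # the smaller final batch


kept = list(batch_sizes(10_000, 32))
dropped = list(batch_sizes(10_000, 32, drop_last=True))

print(len(kept), kept[-1])         # 313 16
print(len(dropped), dropped[-1])   # 312 32
```

Keeping the partial batch adds one iteration per epoch (313 instead of 312), which is worth remembering when reconciling logged step counts with the arithmetic earlier in this guide.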

Summary

Term	Definition
Epoch	One full pass through the entire training dataset
Batch size	Number of examples processed per weight update
Iteration	A single weight update (one batch processed)

Getting these three terms straight is what makes training logs, learning rate schedules, and comparisons between different training configurations actually interpretable, rather than a source of quiet, easily-made mistakes.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.