Building Your First Neural Network: A Complete, Working PyTorch Example
Every concept covered so far in this series — linear algebra, gradients, activation functions, forward propagation, loss functions, backpropagation, weight initialization, and learning rate — comes together in one place: an actual, complete training loop. This guide builds one from scratch in PyTorch, explaining what every line does and why, so the pieces you’ve learned individually finally connect into something you can run yourself.
The Task: Classifying Handwritten Digits
We’ll build a classifier for MNIST, a dataset of handwritten digit images (0–9) — small enough to train quickly, complex enough to be genuinely representative of a real classification task.
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Data loading and preprocessing -- directly connects to Feature Engineering
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))  # normalize using known dataset statistics
    ])

    train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

The Normalize step here is exactly the statistical normalization covered in Statistics for Deep Learning: it subtracts the pre-computed mean (0.1307) and divides by the pre-computed standard deviation (0.3081) for this specific dataset.
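To make the normalization concrete, here is a minimal sketch of the arithmetic that Normalize applies, written with plain tensor operations. The helper name normalize_pixels and the three sample pixel values are made up for illustration; only the mean and standard deviation come from the code above.

```python
import torch

MNIST_MEAN = 0.1307  # mean passed to Normalize in the tutorial code
MNIST_STD = 0.3081   # standard deviation passed to Normalize in the tutorial code


def normalize_pixels(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: the per-pixel (x - mean) / std that Normalize computes."""
    return (x - MNIST_MEAN) / MNIST_STD


# ToTensor scales pixel intensities to [0, 1], so these are the two extremes plus the mean.
pixels = torch.tensor([0.0, MNIST_MEAN, 1.0])
normalized = normalize_pixels(pixels)
print(normalized)  # roughly -0.42, 0.00, 2.82
```

A black background pixel ends up slightly below zero and a fully white pixel ends up near 2.8, so the inputs are centered around zero instead of sitting entirely in [0, 1].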
Defining the Network Architecture
    class DigitClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.flatten = nn.Flatten()
            self.layer1 = nn.Linear(28 * 28, 128)
            self.layer2 = nn.Linear(128, 64)
            self.layer3 = nn.Linear(64, 10)  # 10 output classes

        def forward(self, x):
            x = self.flatten(x)
            x = torch.relu(self.layer1(x))  # linear + ReLU activation
            x = torch.relu(self.layer2(x))
            x = self.layer3(x)  # raw logits, no activation here
            return x

    model = DigitClassifier()

This forward() method is a direct, explicit instance of the forward propagation process covered in Forward Propagation: each hidden layer is a linear transformation followed by an activation function, exactly as described there, while the output layer stays linear and returns raw logits. On initialization, the nn.Linear default in current PyTorch releases is a Kaiming-uniform call with a = sqrt(5), which works out to sampling weights from U(-1/sqrt(fan_in), 1/sqrt(fan_in)). That is related to, but not the same as, the He initialization for ReLU covered in Weight Initialization. To get He initialization proper, apply nn.init.kaiming_normal_ or nn.init.kaiming_uniform_ with nonlinearity="relu" to each layer's weight.
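A quick way to sanity-check an architecture like this before training is to push a fake batch through it and count the parameters. The sketch below rebuilds the same layer sizes with nn.Sequential so that it runs on its own; the names sketch_model and fake_images are illustrative, and the input is random noise rather than real digits.

```python
import torch
import torch.nn as nn

# Same layer sizes as DigitClassifier, written with nn.Sequential for brevity.
sketch_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# A fake batch of 4 grayscale 28x28 images (random values, not real digits).
fake_images = torch.randn(4, 1, 28, 28)
logits = sketch_model(fake_images)
print(logits.shape)  # torch.Size([4, 10]): one row of 10 logits per image

# Count trainable parameters: weights plus biases for each Linear layer.
param_count = sum(p.numel() for p in sketch_model.parameters() if p.requires_grad)
print(param_count)  # 109386 = (784*128 + 128) + (128*64 + 64) + (64*10 + 10)
```

Almost all of the parameters sit in the first layer, because flattening a 28x28 image produces 784 inputs.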
Defining the Loss Function and Optimizer
    criterion = nn.CrossEntropyLoss()  # matches the multi-class classification task, per Loss Functions
    optimizer = optim.Adam(model.parameters(), lr=0.001)  # covered in Optimizers and Learning Rate

CrossEntropyLoss in PyTorch combines log-softmax and negative log-likelihood into one numerically stable operation, connecting directly to the numerical stability discussion in Numerical Computation. This is exactly why the model's forward() method above returns raw logits rather than applying softmax itself: adding a softmax layer before this loss would apply softmax twice.
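The equivalence can be checked directly. The following sketch uses the functional forms of these operations on made-up logits and labels, then shows what happens when softmax is computed naively on a very large logit. The tensor values are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)              # made-up batch: 5 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1, 7])   # made-up class indices

# The fused loss and the explicit two-step version give the same value.
loss_combined = F.cross_entropy(logits, labels)
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss_combined, loss_two_step, atol=1e-6))  # True

# Naive log(softmax(x)) overflows in float32 for large logits: exp(1000) is inf.
big_logits = torch.tensor([[1000.0, 0.0]])
exps = torch.exp(big_logits)
naive = torch.log(exps / exps.sum(dim=1, keepdim=True))
stable = F.log_softmax(big_logits, dim=1)
print(naive)   # contains nan
print(stable)  # finite values
```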
The Training Loop
    num_epochs = 5

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0

        for batch_idx, (images, labels) in enumerate(train_loader):
            optimizer.zero_grad()              # clear gradients from the previous iteration
            outputs = model(images)            # forward propagation
            loss = criterion(outputs, labels)  # compute the loss
            loss.backward()                    # backpropagation -- computes all gradients
            optimizer.step()                   # apply the gradient update to every weight

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

This loop directly implements the epoch/iteration structure covered in Epochs, Batch Size, and Iterations: the outer loop is epochs, the inner loop is iterations over batches, and each iteration performs exactly the forward-pass-then-backward-pass sequence covered in Backpropagation.
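The numbers behind that structure are easy to work out by hand. This sketch assumes the standard MNIST training split of 60,000 images and the DataLoader default of keeping the final partial batch (drop_last=False); the variable names are illustrative.

```python
import math

train_size = 60_000   # number of images in the MNIST training split
batch_size = 64       # matches the DataLoader in the tutorial code
num_epochs = 5        # matches the training loop in the tutorial code

# With drop_last=False, the final partial batch is kept, so round up.
iterations_per_epoch = math.ceil(train_size / batch_size)
last_batch_size = train_size - (iterations_per_epoch - 1) * batch_size
total_updates = iterations_per_epoch * num_epochs

print(iterations_per_epoch)  # 938
print(last_batch_size)       # 32
print(total_updates)         # 4690
```

So len(train_loader) is 938, which is the divisor used for avg_loss, and the optimizer takes 4,690 steps over the full run.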
Evaluating on the Test Set
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():  # disables gradient tracking -- not needed for evaluation
        for images, labels in test_loader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")

This directly applies the accuracy metric covered in Evaluation Metrics. It is evaluated specifically on the held-out test set, following the train/test separation discipline covered in Dataset Preparation, and never on data the model trained on.
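Because the same measurement is useful on more than one dataset, it can help to wrap it in a function. The helper below, accuracy_percent, is a hypothetical refactoring of the evaluation code and not part of PyTorch. It is checked here on a four-sample toy dataset where nn.Identity stands in for a model, so the expected result can be worked out by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def accuracy_percent(model: nn.Module, loader: DataLoader) -> float:
    """Hypothetical helper: percentage of samples whose argmax matches the label."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            predicted = model(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total


# Toy check: nn.Identity passes its input through, so the inputs act as the logits.
toy_logits = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 1.0], [0.0, 1.0]])
toy_labels = torch.tensor([0, 1, 1, 1])
toy_loader = DataLoader(TensorDataset(toy_logits, toy_labels), batch_size=2)

toy_accuracy = accuracy_percent(nn.Identity(), toy_loader)
print(toy_accuracy)  # 75.0 (the third sample is predicted as class 0 but labeled 1)
```

With the objects from this guide, calling accuracy_percent(model, train_loader) and accuracy_percent(model, test_loader) gives two numbers on the same scale, which makes the train/test comparison straightforward.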
What to Watch For When You Run This Yourself
- Training loss should decrease steadily across epochs. If it doesn’t, revisit Learning Rate — this is the most common first thing to check.
- Compare training performance to test performance. Very low training loss (or very high training accuracy) paired with noticeably worse test accuracy is the overfitting signature covered in Overfitting and Underfitting.
- This simple architecture (a Multi-Layer Perceptron) typically reaches 95–98% accuracy on MNIST — a good baseline, though the convolutional architectures covered in Convolutional Neural Networks push meaningfully higher by exploiting the spatial structure of image data that a flattened MLP discards entirely.
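The first two checks in the list above can be written down as a small function. This is only a sketch: check_training_run is a hypothetical helper, the 5-point gap threshold is an assumption rather than a standard value, and the loss and accuracy numbers in the example calls are invented for illustration, not results from running this guide's code.

```python
def check_training_run(epoch_losses, train_acc, test_acc, max_gap=5.0):
    """Hypothetical helper: return a list of warnings for one training run.

    max_gap is an assumed threshold in percentage points, not a standard value.
    """
    warnings = []
    # Flag any epoch whose average loss did not improve on the previous one.
    for i in range(1, len(epoch_losses)):
        if epoch_losses[i] >= epoch_losses[i - 1]:
            warnings.append(f"loss did not decrease at epoch {i + 1}")
    # Flag a large gap between training and test accuracy.
    if train_acc - test_acc > max_gap:
        warnings.append("train/test gap suggests overfitting")
    return warnings


# Invented numbers, for illustration only.
healthy = check_training_run([0.30, 0.13, 0.09, 0.07, 0.06], train_acc=98.5, test_acc=97.4)
suspect = check_training_run([0.30, 0.35, 0.33], train_acc=99.9, test_acc=90.0)
print(healthy)  # []
print(suspect)  # two warnings: one for epoch 2, one for the accuracy gap
```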
Extending This Example Yourself
Once this baseline runs successfully, a few small, informative modifications are worth trying deliberately rather than moving straight to a more complex architecture:
- Change the learning rate up and down by 10x and observe the effect described in Learning Rate.
- Remove the normalization step and observe how much slower or less stable training becomes.
- Reduce the training set size drastically and watch the gap between training and test accuracy widen, directly reproducing the overfitting signature covered in Overfitting and Underfitting.
Deliberately breaking a working example in controlled, specific ways like this is one of the fastest ways to build real intuition for concepts that otherwise stay abstract. Seeing a learning rate that's too high actually diverge, on your own machine, teaches the lesson far more durably than reading about it, and the same holds for each of the other failure modes covered across this module.
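Before running the learning rate experiment on the full model, the effect can be previewed on a toy problem. The sketch below is not the MNIST setup: it minimizes f(w) = w**2 with plain gradient descent (no Adam, no neural network), and the function name and the three learning rates are chosen purely for illustration. Each rate is 10x the previous one.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # plain gradient descent update
    return abs(w)


# Each step multiplies w by (1 - 2 * lr), so the outcome depends on that factor's size.
for lr in (0.011, 0.11, 1.1):
    print(lr, run_gradient_descent(lr))
```

The smallest rate converges slowly, the middle one converges quickly, and the largest one moves further from the minimum on every step. Adam on a real network behaves differently in the details, but the overall pattern (too small is slow, too large diverges) is the one to look for.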
Summary
This complete example ties together data preprocessing, network architecture, loss functions, the training loop, and evaluation — every concept from this module, expressed as runnable code rather than isolated theory. The next module builds on this exact foundation, introducing deeper architectures and the specific challenges (vanishing gradients, the need for normalization and regularization) that come with training them.