Building Your First Neural Network: A Complete, Working PyTorch Example

A complete, working example of building, training, and evaluating your first neural network in PyTorch, tying together every prior concept.

Every concept covered so far in this series — linear algebra, gradients, activation functions, forward propagation, loss functions, backpropagation, weight initialization, and learning rate — comes together in one place: an actual, complete training loop. This guide builds one from scratch in PyTorch, explaining what every line does and why, so the pieces you’ve learned individually finally connect into something you can run yourself.


The Task: Classifying Handwritten Digits

We’ll build a classifier for MNIST, a dataset of handwritten digit images (0–9) — small enough to train quickly, complex enough to be genuinely representative of a real classification task.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data loading and preprocessing -- directly connects to Feature Engineering
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # normalize using known dataset statistics
])

train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

The Normalize step here is exactly the statistical normalization covered in Statistics for Deep Learning — using pre-computed mean and standard deviation values for this specific dataset.
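To make the Normalize step concrete, here is a minimal sketch of the arithmetic it applies to each pixel, (x - mean) / std. It uses a random stand-in tensor rather than real MNIST images, an assumption made so the snippet runs without a download; the names fake_images and normalized are illustrative only.

```python
import torch

torch.manual_seed(0)

# Stand-in for a batch of images after ToTensor(): shape (N, 1, 28, 28), values in [0, 1].
fake_images = torch.rand(256, 1, 28, 28)

# Dataset-level statistics: one mean and one std for the single grayscale channel.
mean = fake_images.mean()
std = fake_images.std()

# The same arithmetic that transforms.Normalize((mean,), (std,)) applies to each image.
normalized = (fake_images - mean) / std

print(f"before: mean={mean.item():.4f}, std={std.item():.4f}")
print(f"after:  mean={normalized.mean().item():.4f}, std={normalized.std().item():.4f}")
```

After the transformation the data has mean close to 0 and standard deviation close to 1. For real MNIST, 0.1307 and 0.3081 are the commonly quoted mean and standard deviation of the training pixels once ToTensor has scaled them to the [0, 1] range.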


Defining the Network Architecture

class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layer1 = nn.Linear(28 * 28, 128)
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 10)  # 10 output classes

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.layer1(x))  # linear + ReLU activation
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)  # raw logits, no activation here
        return x

model = DigitClassifier()

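This forward() method is a direct, explicit instance of the process covered in Forward Propagation: each hidden layer is a linear transformation followed by an activation function, and the output layer is a linear transformation alone. PyTorch’s default nn.Linear initialization is a Kaiming (He) uniform variant that scales the weight range by each layer’s fan-in, connecting to Weight Initialization. It is not the textbook ReLU-gain He initialization, which you can apply explicitly with torch.nn.init.kaiming_normal_ if you want it.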
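A quick way to check an architecture like this is to push a dummy batch through it and inspect the shapes and parameter count. The sketch below rebuilds the same layer sizes with nn.Sequential for brevity (the names mlp and dummy_batch are illustrative, not part of the tutorial code) and also checks the fan-in-scaled range of the default weights.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer sizes as DigitClassifier, written with nn.Sequential for brevity.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# A dummy batch shaped like MNIST: 64 grayscale images of 28x28 pixels.
dummy_batch = torch.randn(64, 1, 28, 28)
logits = mlp(dummy_batch)
print(logits.shape)  # torch.Size([64, 10])

# Weights plus biases: (784*128 + 128) + (128*64 + 64) + (64*10 + 10) = 109,386
num_params = sum(p.numel() for p in mlp.parameters())
print(num_params)  # 109386

# Default nn.Linear weights are drawn uniformly within +/- 1/sqrt(fan_in).
first_linear = mlp[1]
bound = 1 / math.sqrt(first_linear.in_features)
print(first_linear.weight.abs().max().item() <= bound + 1e-6)
```

Almost all of the parameters sit in the first layer, which is a direct consequence of flattening each 28x28 image into 784 inputs.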


Defining the Loss Function and Optimizer

criterion = nn.CrossEntropyLoss() # matches the multi-class classification task, per Loss Functions
optimizer = optim.Adam(model.parameters(), lr=0.001) # covered in Optimizers and Learning Rate

CrossEntropyLoss in PyTorch combines log-softmax and negative log-likelihood into one numerically stable operation, connecting directly to the numerical stability discussion in Numerical Computation. This is exactly why the model’s forward() method above returns raw logits rather than applying softmax itself.
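The following sketch illustrates both points on small hand-written logits: the fused loss matches log-softmax followed by negative log-likelihood, and a hand-rolled softmax breaks down on extreme logits where the fused version does not. The tensors are made-up values chosen for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# What nn.CrossEntropyLoss computes: log-softmax, then negative log-likelihood.
combined = F.cross_entropy(logits, targets)
two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(combined, two_step))

# With extreme logits, exp() overflows in a hand-rolled softmax,
# while the fused version stays finite.
extreme = torch.tensor([[1000.0, 0.0, -1000.0]])
target = torch.tensor([0])
naive_probs = torch.exp(extreme) / torch.exp(extreme).sum(dim=1, keepdim=True)
naive_loss = -torch.log(naive_probs[0, 0])
stable_loss = F.cross_entropy(extreme, target)
print(naive_loss.item(), stable_loss.item())
```

The naive path produces nan because exp(1000) overflows to infinity in float32, whereas the fused loss works in log space and returns a finite value.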


The Training Loop

num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch_idx, (images, labels) in enumerate(train_loader):
        optimizer.zero_grad()  # clear gradients from the previous iteration
        outputs = model(images)  # forward propagation
        loss = criterion(outputs, labels)  # compute the loss
        loss.backward()  # backpropagation -- computes all gradients
        optimizer.step()  # apply the gradient update to every weight
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

This loop directly implements the epoch/iteration structure covered in Epochs, Batch Size, and Iterations — the outer loop is epochs, the inner loop is iterations over batches, and each iteration performs exactly the forward-pass-then-backward-pass sequence covered in Backpropagation.
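Two details of this loop are easy to verify in isolation: how many iterations make up one epoch, and why zero_grad() has to be called on every iteration. The sketch below uses a placeholder dataset of 60,000 rows (the size of the MNIST training set) and a tiny stand-in layer; all names in it are illustrative.

```python
import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Iterations per epoch: 60,000 samples in batches of 64 (the last batch is smaller).
placeholder = TensorDataset(torch.zeros(60000, 1))
loader = DataLoader(placeholder, batch_size=64)
iterations_per_epoch = len(loader)
print(iterations_per_epoch, math.ceil(60000 / 64))

# Why zero_grad() is needed: backward() adds to .grad instead of overwriting it.
layer = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))

criterion(layer(inputs), labels).backward()
first_grad = layer.weight.grad.clone()

criterion(layer(inputs), labels).backward()  # no zero_grad() in between
accumulated_grad = layer.weight.grad.clone()

layer.zero_grad()
criterion(layer(inputs), labels).backward()
fresh_grad = layer.weight.grad.clone()

print(torch.allclose(accumulated_grad, 2 * first_grad))  # gradients piled up
print(torch.allclose(fresh_grad, first_grad))            # reset restores a clean gradient
```

len(train_loader) in the loop above is this same batch count, which is why dividing total_loss by it gives the average loss per batch.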


Evaluating on the Test Set

model.eval()
correct = 0
total = 0

with torch.no_grad():  # disables gradient tracking -- not needed for evaluation
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")

This directly applies the accuracy metric covered in Evaluation Metrics — evaluated specifically on the held-out test set, following the train/test separation discipline covered in Dataset Preparation, never on data the model trained on.
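The prediction and counting steps can be checked by hand on a tiny example. The sketch below uses made-up logits for four samples and three classes, so every number can be verified by eye.

```python
import torch

# Hand-written logits for 4 samples and 3 classes.
outputs = torch.tensor([[2.0, 1.0, 0.0],
                        [0.0, 3.0, 1.0],
                        [1.0, 0.0, 0.5],
                        [0.0, 0.0, 4.0]])
labels = torch.tensor([0, 1, 2, 2])

# torch.max along dim 1 returns (largest value, its index); the index is the predicted class.
_, predicted = torch.max(outputs, 1)
print(predicted)  # tensor([0, 1, 0, 2])

correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)
print(f"{accuracy:.2f}%")  # 75.00%

# Inside torch.no_grad(), results are not tracked for backpropagation.
weights = torch.ones(3, requires_grad=True)
with torch.no_grad():
    untracked = (outputs * weights).sum()
print(untracked.requires_grad)  # False
```

The third sample is the single miss: its largest logit is at index 0, but its label is 2.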


What to Watch For When You Run This Yourself

  • Training loss should decrease steadily across epochs. If it doesn’t, revisit Learning Rate — this is the most common first thing to check.
  • Compare training accuracy to test accuracy. The loop above reports only training loss, so compute accuracy on the training set as well. A large gap between excellent training performance and poor test accuracy is the overfitting signature covered in Overfitting and Underfitting.
  • This simple architecture (a Multi-Layer Perceptron) typically reaches 95–98% accuracy on MNIST — a good baseline, though the convolutional architectures covered in Convolutional Neural Networks push meaningfully higher by exploiting the spatial structure of image data that a flattened MLP discards entirely.
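To compare training and test performance on the same scale, the evaluation code can be wrapped in a reusable function. The sketch below is one possible shape for such a helper; evaluate_accuracy is a hypothetical name, and the identity "model" and one-hot inputs exist only to make the check deterministic.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(model, loader):
    """Return the percentage of samples in `loader` that `model` classifies correctly."""
    was_training = model.training
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            predicted = model(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    model.train(was_training)  # restore whichever mode the model was in
    return 100 * correct / total


# Tiny deterministic check: an identity "model" predicts the index of the largest input.
toy_model = nn.Identity()
toy_inputs = torch.eye(3).repeat(2, 1)  # 6 one-hot rows: classes 0, 1, 2, 0, 1, 2
right_labels = torch.tensor([0, 1, 2, 0, 1, 2])
half_right_labels = torch.tensor([0, 1, 2, 1, 2, 0])

all_right = evaluate_accuracy(
    toy_model, DataLoader(TensorDataset(toy_inputs, right_labels), batch_size=4))
half_right = evaluate_accuracy(
    toy_model, DataLoader(TensorDataset(toy_inputs, half_right_labels), batch_size=4))
print(all_right, half_right)  # 100.0 50.0
```

With the objects defined earlier in this guide, evaluate_accuracy(model, train_loader) minus evaluate_accuracy(model, test_loader) gives the train/test gap directly.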

Extending This Example Yourself

Once this baseline runs successfully, a few small, informative modifications are worth trying before moving to a more complex architecture. Change the learning rate up and down by 10x and observe the effect described in Learning Rate. Remove the normalization step and observe how much slower or less stable training becomes. Reduce the training set size drastically and watch the gap between training and test accuracy widen, directly reproducing the overfitting signature covered in Overfitting and Underfitting.

Deliberately breaking a working example in controlled, specific ways like this is one of the fastest ways to build real intuition for concepts that otherwise stay abstract. Seeing a learning rate that’s too high actually diverge, on your own machine, teaches the lesson far more durably than reading about any of these failure modes in isolation.
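The learning-rate experiment can be previewed in a few seconds on a toy problem before running it on MNIST. The sketch below is a deliberately simplified stand-in: it minimizes f(w) = w^2 with plain SGD rather than training the digit classifier with Adam, so the exact thresholds do not carry over, only the qualitative pattern. The function name final_distance is illustrative.

```python
import torch


def final_distance(lr, steps=20):
    """Run plain SGD on f(w) = w^2 from w = 1.0 and return |w| after `steps` updates."""
    w = torch.tensor([1.0], requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        optimizer.step()
    return w.abs().item()


# Each update multiplies w by (1 - 2 * lr): slow shrinkage, fast shrinkage, or growth.
for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: |w| after 20 steps = {final_distance(lr):.4f}")
```

The smallest rate converges slowly, the middle one converges quickly, and the largest one overshoots the minimum on every step and moves further away, which is the divergence pattern to look for in the MNIST loss curve.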

Summary

This complete example ties together data preprocessing, network architecture, loss functions, the training loop, and evaluation — every concept from this module, expressed as runnable code rather than isolated theory. The next module builds on this exact foundation, introducing deeper architectures and the specific challenges (vanishing gradients, the need for normalization and regularization) that come with training them.