Building Your First Neural Network: A Complete PyTorch Walkthrough, Line by Line

Every concept covered so far in this track (how a neuron combines inputs, why nonlinearity matters, how a prediction flows forward through layers, how a loss measures error, how backpropagation computes gradients, why weight initialization matters, and how the learning rate governs training) has been explained mostly in isolation, one idea at a time. This article connects all of it. Nothing here is new theory; it is the previous nine articles in this track expressed as one runnable program that you can execute on your own machine and watch train in real time.

The Task: Recognizing Handwritten Digits

We’ll build a classifier for MNIST — a well-known dataset of 70,000 small grayscale images of handwritten digits, 0 through 9. It’s a genuinely good first project: small enough to train quickly on ordinary hardware without a powerful GPU, and complex enough to be a real classification task rather than a contrived toy example.

Step 1: Loading and Preparing the Data

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

print(f"Training examples: {len(train_dataset)}")
print(f"Test examples: {len(test_dataset)}")
image, label = train_dataset[0]
print(f"Single image shape: {image.shape}")   # torch.Size([1, 28, 28])
print(f"Its label: {label}")

Two details here directly connect to earlier concepts. shuffle=True on the training loader matters more than it might seem — presenting examples in the same fixed order every epoch can let a model pick up on spurious ordering patterns instead of genuine features, so shuffling is standard practice. And Normalize((0.1307,), (0.3081,)) is the statistical normalization technique covered in Statistics for Deep Learning — these two numbers are the pre-computed mean and standard deviation of MNIST’s pixel values, and normalizing input this way generally helps training converge faster and more reliably.
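
To make the Normalize step concrete, here is a minimal sketch of the arithmetic it performs: subtract the mean, then divide by the standard deviation. It uses a random stand-in tensor rather than a real MNIST image, and the helper name normalize is hypothetical (it is not part of torchvision).

```python
import torch

# Mean and standard deviation quoted in the transform above
MNIST_MEAN, MNIST_STD = 0.1307, 0.3081

def normalize(image: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper mirroring the per-pixel arithmetic of Normalize."""
    return (image - MNIST_MEAN) / MNIST_STD

# Stand-in for one ToTensor() output: values in [0, 1], shape [1, 28, 28]
fake_image = torch.rand(1, 28, 28)
normalized = normalize(fake_image)

# The two extremes of the input range after normalization
black_pixel = normalize(torch.tensor(0.0))   # about -0.42
white_pixel = normalize(torch.tensor(1.0))   # about  2.82

print(normalized.shape, black_pixel.item(), white_pixel.item())
```

The shape is unchanged; only the value range shifts, so that typical inputs are centered near zero with roughly unit spread.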

Step 2: Defining the Network’s Architecture

class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layer1 = nn.Linear(28 * 28, 128)
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 10)   # 10 output scores, one per digit class

    def forward(self, x):
        x = self.flatten(x)                  # 28x28 image -> 784-length vector
        x = torch.relu(self.layer1(x))       # linear transform, then ReLU
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)                    # raw scores (logits) -- no activation here
        return x

model = DigitClassifier()
print(model)

forward() is a direct instance of the process covered in Forward Propagation: after flattening, each hidden layer performs a linear transformation immediately followed by an activation function. That is exactly the pattern discussed there, and it is why stacking these layers can represent nonlinear relationships rather than collapsing into one linear operation, as explained in Activation Functions. One deliberate choice worth noticing: layer3’s output has no activation function applied. That’s intentional, and the next section explains why.
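
A quick sanity check before training is to push a fake batch through the network and confirm the shapes. The sketch below repeats the class definition so it runs on its own; the random batch is a stand-in for real images, and the names fake_batch and num_params are illustrative.

```python
import torch
import torch.nn as nn

class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layer1 = nn.Linear(28 * 28, 128)
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        return self.layer3(x)

model = DigitClassifier()

# Stand-in batch: 64 random "images" with the same shape as MNIST tensors
fake_batch = torch.randn(64, 1, 28, 28)
logits = model(fake_batch)

# Count every learnable weight and bias
num_params = sum(p.numel() for p in model.parameters())

print(logits.shape)   # torch.Size([64, 10])
print(num_params)     # 109386
```

One row of ten scores per image is what the loss function in the next step expects.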

Also worth knowing: PyTorch’s default initialization for nn.Linear already applies a Kaiming-style uniform scheme scaled by the layer’s input size, which connects directly to Weight Initialization, so you get reasonable starting weights without doing anything extra yourself.
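
The default initialization can be inspected directly. This sketch assumes the behavior of recent PyTorch releases, where nn.Linear draws its weights uniformly from about ±1/sqrt(in_features); that range is an implementation detail and could differ in other versions.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so repeated runs print the same numbers

layer = nn.Linear(28 * 28, 128)

# Assumed bound for the default scheme: 1 / sqrt(fan_in)
bound = 1 / math.sqrt(layer.in_features)

weight_max = layer.weight.abs().max().item()
weight_mean = layer.weight.mean().item()

print(f"bound={bound:.4f}  max|w|={weight_max:.4f}  mean={weight_mean:.5f}")
```

Small weights centered on zero keep the first forward pass from saturating or exploding, which is the property Weight Initialization argued for.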

Step 3: Choosing a Loss Function and an Optimizer

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

This is exactly why layer3 above had no activation function: nn.CrossEntropyLoss in PyTorch internally combines a log-softmax operation with negative log-likelihood loss into a single, numerically stable computation. Applying softmax yourself before passing values into CrossEntropyLoss would apply it twice. No error is raised, but the loss is then computed on squashed values between 0 and 1 rather than on the logits, which weakens the gradients and slows or degrades training. It is a common mistake for people building their first classifier. Feeding in raw, unactivated scores (logits) directly is the correct usage.
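
The equivalence and the double-softmax pitfall can both be checked on a single hand-made example. The logits and target below are made-up values chosen only for illustration.

```python
import torch
import torch.nn.functional as F

# One example with three classes; the correct class is index 0
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])

# Built-in cross entropy on raw logits (the correct usage)
loss_builtin = F.cross_entropy(logits, target)

# The same value assembled by hand: log-softmax, then negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# The mistake: softmax applied first, so cross entropy applies it a second time
loss_double = F.cross_entropy(F.softmax(logits, dim=1), target)

print(loss_builtin.item(), loss_manual.item(), loss_double.item())
```

The first two values match, while the third is noticeably larger even though the prediction is correct, because the second softmax flattens a confident distribution.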

optim.Adam is one of several gradient descent variants available; it adapts its effective step size per parameter automatically, which generally makes it a more forgiving default starting point than plain stochastic gradient descent, especially while you’re still getting a feel for how learning rate affects training.
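
What "adapts its effective step size per parameter" means can be seen in a toy comparison. In this sketch, two parameters receive gradients of very different size (100 and 0.01, chosen arbitrarily), and each optimizer takes one step. The helper first_step_sizes is hypothetical.

```python
import torch
import torch.optim as optim

def first_step_sizes(optimizer_cls, lr=0.001):
    """Return how far each of two parameters moves on the first update."""
    w = torch.zeros(2, requires_grad=True)
    optimizer = optimizer_cls([w], lr=lr)
    # Gradient is 100 for w[0] and 0.01 for w[1]
    loss = 100.0 * w[0] + 0.01 * w[1]
    loss.backward()
    optimizer.step()
    return w.detach().abs().tolist()

sgd_steps = first_step_sizes(optim.SGD)    # proportional to each gradient
adam_steps = first_step_sizes(optim.Adam)  # both close to lr

print("SGD :", sgd_steps)
print("Adam:", adam_steps)
```

Plain SGD moves the two parameters by amounts that differ by a factor of 10,000, while Adam moves both by roughly the learning rate. That rescaling is a large part of why Adam tolerates a poorly chosen learning rate better.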

Step 4: The Training Loop, Explained Line by Line

num_epochs = 5

for epoch in range(num_epochs):
    model.train()          # tells layers like BatchNorm/Dropout they're in training mode
    total_loss = 0

    for batch_idx, (images, labels) in enumerate(train_loader):
        optimizer.zero_grad()                # clear gradients from the previous step
        outputs = model(images)               # forward pass
        loss = criterion(outputs, labels)     # compute the loss
        loss.backward()                        # backward pass -- computes every gradient
        optimizer.step()                       # apply the gradient update to every weight

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

This five-line sequence (zero gradients, forward, compute loss, backward, step) is the heartbeat of virtually every PyTorch training loop you’ll ever write, regardless of the model’s size or complexity. The outer loop over num_epochs and the inner loop over train_loader together implement the epoch/iteration structure covered in Epochs, Batch Size, and Iterations. The loss.backward() line performs the same chain-rule computation walked through step by step in Backpropagation, here automated across all of the network’s roughly 109,000 weights and biases rather than the tiny hand-derived example used to explain the concept.
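
The reason optimizer.zero_grad() comes first is that PyTorch adds new gradients to whatever is already stored instead of overwriting it. A minimal sketch with a single scalar parameter shows the effect; the function 3 * w is arbitrary.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3.0 * w).backward()
first = w.grad.item()

# Second backward pass WITHOUT clearing: the new 3 is added to the old 3
(3.0 * w).backward()
accumulated = w.grad.item()

# Clearing the stored gradient restores the expected value
w.grad.zero_()
(3.0 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)   # 3.0 6.0 3.0
```

Skipping the reset in a training loop means every update is driven by a running sum of all past gradients, which usually makes the loss behave erratically.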

Step 5: Evaluating on Data the Model Has Never Seen

model.eval()          # switches to inference mode
correct = 0
total = 0

with torch.no_grad():          # disables gradient tracking -- not needed for evaluation
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")

Two details here matter beyond the obvious accuracy calculation. model.eval() changes nothing in a network this simple, because none of its layers behave differently between training and inference, but it becomes critical the moment you add layers like batch normalization or dropout, so it is worth making a habit now. torch.no_grad() tells PyTorch not to build a computation graph for these operations at all, since we have no intention of calling .backward() during evaluation. This saves both memory and time, and it’s good practice any time you’re only running a forward pass. This evaluation applies the accuracy metric discussed in Evaluation Metrics, computed on data the model never trained on, following the strict train/test separation covered in Dataset Preparation. Evaluating on training data would tell you almost nothing about how the model performs on genuinely new inputs.
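
Both switches can be observed in isolation. The sketch below uses a tiny stand-in layer and a standalone nn.Dropout (which the digit classifier itself does not contain) purely to show what each mode changes.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.ones(1, 4)

# Normal forward pass: the result is attached to a computation graph
tracked = layer(x)

# Under no_grad, the same call produces a tensor with no graph behind it
with torch.no_grad():
    untracked = layer(x)

print(tracked.requires_grad, untracked.requires_grad)   # True False

# Dropout is a layer whose behavior depends on the train/eval mode
dropout = nn.Dropout(p=0.5)

dropout.eval()
eval_out = dropout(x)    # inference mode: input passes through unchanged

dropout.train()
train_out = dropout(x)   # training mode: each value is zeroed or scaled to 2.0

print(eval_out, train_out)
```

Forgetting eval() on a model that contains dropout means predictions are made with randomly silenced units, which shows up as noisy, lower accuracy.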

What Realistic Output Looks Like

Running this exact code typically produces output resembling:

Epoch 1/5, Average Loss: 0.3842
Epoch 2/5, Average Loss: 0.1689
Epoch 3/5, Average Loss: 0.1213
Epoch 4/5, Average Loss: 0.0961
Epoch 5/5, Average Loss: 0.0798
Test Accuracy: 97.31%

The loss dropping steadily across epochs, without spiking or stalling, is the visible sign of healthy training — the same signal the “what to watch for” section below builds on. A test accuracy in the mid-to-high 90s for this exact simple architecture on MNIST is a completely normal, expected result — not a sign you’ve done anything special, just solid confirmation that every piece described above is correctly wired together and functioning as intended.

What to Watch For When You Run This Yourself

Training loss should decrease steadily, epoch over epoch. If it doesn’t move at all, or moves erratically, revisit Learning Rate first — an incorrectly set learning rate is the most common root cause of a training loop that simply refuses to learn.
Compare training performance to test performance. A large gap (excellent accuracy on the training data alongside disappointing test accuracy) is the classic overfitting signature covered in Overfitting and Underfitting.
This architecture, a plain Multi-Layer Perceptron, typically reaches roughly 95–98% accuracy on MNIST. That’s a solid baseline, though the convolutional architectures covered in Convolutional Neural Networks push meaningfully higher, specifically because they exploit the two-dimensional spatial structure of image data that flattening into a single 784-length vector, as this example does, throws away entirely.
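
Spotting an overfitting gap requires the same metric on both splits. One way to get it is to wrap the evaluation loop in a reusable function, sketched here with a hypothetical helper named accuracy and random stand-in data so the block runs without downloading MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader):
    """Hypothetical helper: percentage of correct predictions over a loader."""
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in loader:
            predicted = model(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return 100 * correct / total

# Stand-in data and model, for illustration only
torch.manual_seed(0)
fake_images = torch.randn(32, 1, 28, 28)
fake_labels = torch.randint(0, 10, (32,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)
stand_in_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

acc = accuracy(stand_in_model, fake_loader)
print(f"Accuracy: {acc:.2f}%")
```

With the real objects from this article, calling accuracy(model, train_loader) and accuracy(model, test_loader) after training gives two directly comparable numbers.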

Saving Your Trained Model So the Work Isn’t Lost

One practical step worth adding before moving on: nothing above saves the trained weights anywhere, which means closing your Python session throws away everything the training loop just accomplished.

# Save just the learned weights (the recommended approach)
torch.save(model.state_dict(), "digit_classifier.pth")

# Later, in a new session: recreate the architecture, then load the weights into it
loaded_model = DigitClassifier()
loaded_model.load_state_dict(torch.load("digit_classifier.pth"))
loaded_model.eval()   # remember this before running inference

Saving state_dict() — a dictionary of the model’s learned parameter values — rather than the entire model object is the standard, recommended practice in the PyTorch community, because it doesn’t tie the saved file to the exact class definition or file structure you happened to have at save time. Whenever you reload a saved model specifically to make predictions rather than continue training, call model.eval() immediately afterward — a habit worth building now, since it becomes essential the moment your architecture includes layers like batch normalization that behave differently in training versus inference mode, a distinction covered in more depth elsewhere in this series.
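
A round trip is an easy way to confirm that a saved file really reproduces the model. This sketch uses a small stand-in network and a temporary directory (both assumptions made to keep the example self-contained) and checks that the restored copy gives identical outputs.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    """Stand-in architecture; the real code would construct DigitClassifier()."""
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

torch.manual_seed(0)
original = build_model()
sample = torch.randn(4, 1, 28, 28)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in.pth")
    torch.save(original.state_dict(), path)

    # A fresh instance starts with different random weights...
    restored = build_model()
    # ...until the saved parameters are loaded into it
    restored.load_state_dict(torch.load(path))

original.eval()
restored.eval()
with torch.no_grad():
    outputs_match = torch.equal(original(sample), restored(sample))

print(outputs_match)   # True
```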

Deliberately Breaking Things: The Fastest Way to Actually Learn This

Once this baseline trains successfully, a handful of small, deliberate modifications teach far more than reading about them ever could:

Push the learning rate to an extreme. Set lr=1.0 instead of 0.001 and watch the loss diverge or oscillate wildly instead of decreasing — a direct, visible demonstration of the instability described in Learning Rate, happening on your own screen instead of in a textbook diagram.
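
The mechanism behind that divergence is visible even without a neural network. The sketch below runs plain gradient descent (not Adam, a deliberate simplification) on the one-variable loss w squared, with one modest and one oversized learning rate. The function name descend and both rates are illustrative choices.

```python
def descend(lr, steps=20, w=1.0):
    """Plain gradient descent on loss(w) = w**2; returns the loss after each step."""
    history = []
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # the update rule
        history.append(w * w)
    return history

stable = descend(lr=0.1)     # each step shrinks w by a factor of 0.8
unstable = descend(lr=1.1)   # each step overshoots: w is multiplied by -1.2

print(f"lr=0.1 final loss: {stable[-1]:.6f}")
print(f"lr=1.1 final loss: {unstable[-1]:.2f}")
```

With the large rate, every update jumps past the minimum and lands farther away than it started, so the loss grows without bound. Adam rescales its steps, so the failure on MNIST looks different in detail, but the root cause is the same overshooting.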

Remove the normalization step. Delete the Normalize transform and compare the loss curve against the baseline. On MNIST the difference may be modest, because ToTensor already scales pixels to the range 0 to 1, but any slowdown or instability you do see is concrete evidence that the statistical normalization covered earlier in this series isn’t a formality: it has a measurable effect on real training dynamics.

Shrink the training set artificially. Take only the first 500 training examples instead of the full 60,000, and watch the gap between training and test accuracy widen dramatically — directly reproducing the overfitting signature from Overfitting and Underfitting, and giving you a live, tunable example to experiment with rather than an abstract description.
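
One way to take the first 500 examples is torch.utils.data.Subset, which wraps an existing dataset and exposes only the chosen indices. The sketch uses a random TensorDataset as a stand-in for train_dataset so it runs without the MNIST download.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in for train_dataset: 2,000 random "images" with random labels
full_dataset = TensorDataset(
    torch.randn(2000, 1, 28, 28),
    torch.randint(0, 10, (2000,)),
)

# Keep only the first 500 examples
small_dataset = Subset(full_dataset, range(500))
small_loader = DataLoader(small_dataset, batch_size=64, shuffle=True)

print(len(small_dataset))   # 500
print(len(small_loader))    # 8 batches: seven of 64 and one of 52
```

In the article's code, replacing full_dataset with train_dataset and training on small_loader is the only change the experiment needs.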

Each of these experiments takes only a couple of lines of changed code and a few minutes of retraining, and each one converts an abstract warning (“a too-high learning rate causes instability”) into something you’ve watched happen on your own screen with your own model.

Summary

This complete example ties together data preprocessing, network architecture, loss functions, the training loop, and evaluation: every concept covered across this module, expressed as one piece of runnable code rather than isolated theory. The next module builds directly on this foundation, introducing deeper architectures and the challenges that come with training networks meaningfully larger than the three-layer example built here, namely vanishing gradients and the need for normalization and regularization at greater depth. The training loop itself, the five-line heartbeat covered above, stays essentially unchanged no matter how large or sophisticated the architecture eventually becomes.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.