Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are among the most widely used architectures for image processing tasks. By exploiting spatial locality and weight sharing, they learn visual hierarchies, from edges and textures in early layers to objects and scenes in later layers.

The Convolution Operation

A convolution slides a small filter (kernel) across the input image, computing dot products at each position:

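Input (5×5):          Filter (3×3):         Output (3×3):
1 0 1 0 1             1 0 1                 5 0 5
0 1 0 1 0      ⊗      0 1 0         →       0 5 0
1 0 1 0 1             1 0 1                 5 0 5
0 1 0 1 0
1 0 1 0 1

At each position, the output value is the sum of the element-wise product of the filter and the image patch beneath it. In the top-left position the patch is identical to the filter, so the five overlapping 1s give 5. One pixel to the right, every 1 in the filter lines up with a 0 in the patch, so the result is 0.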

The filter weights are learned during training — the network learns what patterns to detect.
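
The sliding-window sum can be checked numerically. The following is a minimal sketch, assuming PyTorch is installed; `naive_conv2d` is a hypothetical helper written for illustration, and `torch.nn.functional.conv2d` serves as the reference. Note that deep-learning libraries implement this operation as cross-correlation (the kernel is not flipped), which matches the description above.

```python
import torch
import torch.nn.functional as F

def naive_conv2d(image: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = (patch * kernel).sum()
    return out

# Checkerboard input and X-shaped filter from the example above
image = torch.tensor([[1., 0., 1., 0., 1.],
                      [0., 1., 0., 1., 0.],
                      [1., 0., 1., 0., 1.],
                      [0., 1., 0., 1., 0.],
                      [1., 0., 1., 0., 1.]])
kernel = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 0., 1.]])

manual = naive_conv2d(image, kernel)

# F.conv2d expects (batch, channels, H, W) input and (out_ch, in_ch, kH, kW) weights
reference = F.conv2d(image[None, None], kernel[None, None])[0, 0]

print(manual)
print(torch.allclose(manual, reference))  # the two implementations should agree
```

The explicit loops are only for readability; in practice the library call is far faster.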

Key Components

Convolution Layer

import torch.nn as nn

# 32 filters of size 3×3, taking a 1-channel (grayscale) input
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
# With padding=1, spatial dimensions are preserved: (H, W) → (H, W)
# Without padding: (H, W) → (H-2, W-2) for 3×3 filter
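
The shape comments above follow from the usual output-size rule for one spatial axis: floor((n + 2·padding − kernel_size) / stride) + 1, with dilation 1. Here is a small sketch that checks the rule against `nn.Conv2d`; `conv_output_size` is a hypothetical helper, and the 28×28 input is an arbitrary choice.

```python
import torch
import torch.nn as nn

def conv_output_size(n: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Output length along one spatial axis (dilation assumed to be 1)."""
    return (n + 2 * padding - kernel_size) // stride + 1

x = torch.zeros(1, 1, 28, 28)  # (batch, channels, H, W)

same = nn.Conv2d(1, 32, kernel_size=3, padding=1)
valid = nn.Conv2d(1, 32, kernel_size=3)  # padding defaults to 0

print(same(x).shape)   # torch.Size([1, 32, 28, 28])
print(valid(x).shape)  # torch.Size([1, 32, 26, 26])
print(conv_output_size(28, 3, padding=1), conv_output_size(28, 3))  # 28 26
```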

Pooling Layer

Reduces spatial dimensions while retaining important features:

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # Halves spatial dimensions
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
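
To see the difference between the two pooling layers, here is a small sketch on a hand-written 4×4 input. The values are made up for illustration. Each non-overlapping 2×2 window is reduced to its maximum or its mean.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [0., 0., 1., 1.],
                  [0., 4., 1., 1.]]).reshape(1, 1, 4, 4)  # (batch, channels, H, W)

max_out = nn.MaxPool2d(kernel_size=2, stride=2)(x)
avg_out = nn.AvgPool2d(kernel_size=2, stride=2)(x)

print(max_out[0, 0].tolist())  # [[4.0, 8.0], [4.0, 1.0]]
print(avg_out[0, 0].tolist())  # [[2.5, 6.5], [1.0, 1.0]]
```

Max pooling keeps the strongest activation in each window (the isolated 4 in the bottom-left window survives), while average pooling smooths it to 1.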

Receptive Field

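Each neuron in a deep CNN “sees” a larger region of the original image as depth increases. With stacked 3×3 convolutions at stride 1, each layer widens that region by 2 pixels in each dimension:

Layer 1 neuron: sees 3×3 patch
Layer 2 neuron: sees 5×5 patch
Layer 3 neuron: sees 7×7 patch
Deep network neurons: can see most or all of the image, especially once pooling or strided layers are added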
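
The growth of the receptive field can be computed layer by layer with a simple recurrence: the field grows by (kernel_size − 1) times the current spacing between neighbouring neurons, measured in input pixels, and that spacing is multiplied by each layer's stride. This is a sketch in plain Python; `receptive_field` is a hypothetical helper that ignores dilation.

```python
def receptive_field(layers):
    """Return the receptive-field size after each (kernel_size, stride) layer."""
    r, jump = 1, 1  # one input pixel sees itself; neighbours are 1 pixel apart
    sizes = []
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump  # each extra kernel tap reaches `jump` pixels further
        jump *= stride                 # striding spreads later taps further apart
        sizes.append(r)
    return sizes

# Three stacked 3×3 convolutions, stride 1
print(receptive_field([(3, 1), (3, 1), (3, 1)]))          # [3, 5, 7]

# conv 3×3, pool 2×2 (stride 2), conv 3×3, pool 2×2 (stride 2)
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))  # [3, 4, 8, 10]
```

The second call shows why pooling matters: after the first stride-2 layer, every later 3×3 convolution adds 4 input pixels rather than 2.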

Building a CNN in PyTorch

import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extraction
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   # 28×28 → 28×28 (32 channels)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 28×28 → 14×14

            nn.Conv2d(32, 64, 3, padding=1),  # 14×14 → 14×14 (64 channels)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 14×14 → 7×7
        )

        # Classification head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

model = CNN(num_classes=10)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
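
A quick way to confirm the shape comments in a model like this is to push a dummy batch through it and record each layer's output shape. The sketch below assumes a hypothetical helper, `trace_shapes`, and uses a stand-in `features` block that mirrors the first convolutional stage above so that it runs on its own.

```python
import torch
import torch.nn as nn

def trace_shapes(layers: nn.Sequential, x: torch.Tensor):
    """Run x through each layer in turn and collect (layer name, output shape)."""
    shapes = []
    with torch.no_grad():
        for layer in layers:
            x = layer(x)
            shapes.append((type(layer).__name__, tuple(x.shape)))
    return shapes

# Stand-in for the first conv stage above, so this block is self-contained
features = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
)
features.eval()  # use BatchNorm running statistics rather than batch statistics

dummy = torch.zeros(1, 1, 28, 28)  # one grayscale 28×28 image
for name, shape in trace_shapes(features, dummy):
    print(f"{name:12s} {shape}")
```

With the class above, `trace_shapes(model.features, dummy)` should end at `(1, 64, 7, 7)`, which is where the `64 * 7 * 7` input size of the first linear layer comes from.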

Transfer Learning with Pretrained CNNs

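Training a large CNN from scratch typically requires a very large labeled dataset (ImageNet, for example, has over a million training images). Transfer learning instead starts from a model pretrained on such a dataset: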

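import torch
import torch.nn as nn
import torchvision.models as models

# Load ResNet-50 pretrained on ImageNet
# (torchvision >= 0.13 uses `weights=`; the older `pretrained=True` is deprecated)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)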

# Freeze all layers
for param in resnet.parameters():
    param.requires_grad = False

# Replace the final classification layer
num_features = resnet.fc.in_features
resnet.fc = nn.Linear(num_features, 5)  # 5-class problem

# Only train the new final layer
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=1e-3)
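
After freezing, it is worth confirming that only the new head will receive gradients. The sketch below uses a small stand-in network instead of ResNet-50 so that it runs without downloading weights. `TinyNet` and `count_parameters` are hypothetical names, and the helper should work unchanged on `resnet`.

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total

class TinyNet(nn.Module):
    """Hypothetical stand-in with a ResNet-style `fc` head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, 1000)

    def forward(self, x):
        return self.fc(self.backbone(x))

net = TinyNet()
for param in net.parameters():
    param.requires_grad = False            # freeze everything

net.fc = nn.Linear(net.fc.in_features, 5)  # new layers are trainable by default

trainable, total = count_parameters(net)
print(f"trainable: {trainable:,} / total: {total:,}")
```

For the ResNet-50 setup above, the trainable count should be 2048 × 5 + 5 = 10,245, since `resnet.fc.in_features` is 2048.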

Fine-tuning (unfreeze later layers):

# Unfreeze the last two residual stages (layer3 and layer4)
for name, param in resnet.named_parameters():
    if 'layer4' in name or 'layer3' in name:
        param.requires_grad = True

optimizer = torch.optim.Adam([
    {'params': resnet.layer3.parameters(), 'lr': 1e-5},  # Low lr for pretrained
    {'params': resnet.layer4.parameters(), 'lr': 1e-5},
    {'params': resnet.fc.parameters(), 'lr': 1e-3},      # High lr for new layer
])
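
Each dictionary passed to the optimizer becomes one entry in `optimizer.param_groups`, carrying its own learning rate. The sketch below inspects those groups on a stand-in model. `TinyBackbone` is hypothetical and only mimics the `layer3` / `layer4` / `fc` attribute names of torchvision's ResNet.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in exposing ResNet-style attribute names."""
    def __init__(self):
        super().__init__()
        self.layer3 = nn.Linear(4, 4)
        self.layer4 = nn.Linear(4, 4)
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        x = torch.relu(self.layer3(x))
        x = torch.relu(self.layer4(x))
        return self.fc(x)

net = TinyBackbone()
optimizer = torch.optim.Adam([
    {'params': net.layer3.parameters(), 'lr': 1e-5},
    {'params': net.layer4.parameters(), 'lr': 1e-5},
    {'params': net.fc.parameters(), 'lr': 1e-3},
])

# Groups keep the order in which they were passed to the optimizer
for name, group in zip(['layer3', 'layer4', 'fc'], optimizer.param_groups):
    n_params = sum(p.numel() for p in group['params'])
    print(f"{name}: lr={group['lr']:g}, parameters={n_params}")
```

Printing the groups like this is a cheap check that the pretrained layers really did get the smaller learning rate before a long training run starts.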

Pretrained Models Available

from torchvision import models

# torchvision >= 0.13: pass `weights=`; the older `pretrained=True` is deprecated
resnet50        = models.resnet50(weights="DEFAULT")            # General purpose
efficientnet_b0 = models.efficientnet_b0(weights="DEFAULT")     # Efficient, mobile
vgg16           = models.vgg16(weights="DEFAULT")               # Simple architecture
mobilenet       = models.mobilenet_v3_small(weights="DEFAULT")  # Lightweight mobile
vit_b_16        = models.vit_b_16(weights="DEFAULT")            # Vision Transformer (not a CNN)

Beyond Images

CNNs apply to any data with local spatial structure:

1D convolutions: time series, text sequences, audio waveforms
3D convolutions: video, volumetric medical imaging (CT/MRI)
Graph convolutions: molecular structures, social networks (specialized variants)
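
For the 1D and 3D cases, PyTorch offers `nn.Conv1d` and `nn.Conv3d`, which take the same arguments as `nn.Conv2d` and differ only in the number of spatial axes. The sketch below shows the expected tensor layouts; the channel counts and sizes are arbitrary assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

# 1D: (batch, channels, length), e.g. 8 sensor channels over 100 time steps
series = torch.zeros(4, 8, 100)
conv1d = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=5, padding=2)
print(conv1d(series).shape)  # torch.Size([4, 16, 100])

# 3D: (batch, channels, depth, height, width), e.g. 16 RGB frames of 32×32 pixels
clip = torch.zeros(2, 3, 16, 32, 32)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
print(conv3d(clip).shape)    # torch.Size([2, 8, 16, 32, 32])
```

Graph convolutions are not covered by these layers and usually need a dedicated library.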

Transfer learning makes CNNs practical for most computer vision problems — you rarely need to train from scratch unless you have millions of domain-specific images.