Convolutional Neural Networks: Deep Learning for Images and Sequences

Learn convolutional neural networks — convolution layers, pooling, receptive fields, transfer learning, and how CNNs achieve state-of-the-art results on image tasks.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are the dominant architecture for image processing tasks. By exploiting spatial locality and weight sharing, they learn visual hierarchies — from edges and textures in early layers to objects and scenes in later layers.


The Convolution Operation

A convolution slides a small filter (kernel) across the input image, computing dot products at each position:

Input (5×5): Filter (3×3): Output (3×3):
1 0 1 0 1 1 0 1 4 3 4
0 1 0 1 0 ⊗ 0 1 0 → 3 4 3
1 0 1 0 1 1 0 1 4 3 4
0 1 0 1 0
1 0 1 0 1
At each position: sum(element-wise multiplication of filter with patch)

The filter weights are learned during training — the network learns what patterns to detect.


Key Components

Convolution Layer

import torch.nn as nn
# 32 filters of size 3×3, taking a 1-channel (grayscale) input
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
# With padding=1, spatial dimensions are preserved: (H, W) → (H, W)
# Without padding: (H, W) → (H-2, W-2) for 3×3 filter

Pooling Layer

Reduces spatial dimensions while retaining important features:

pool = nn.MaxPool2d(kernel_size=2, stride=2) # Halves spatial dimensions
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

Receptive Field

Each neuron in a deep CNN “sees” a larger region of the original image as depth increases:

Layer 1 neuron: sees 3×3 patch
Layer 2 neuron: sees 5×5 patch
Layer 3 neuron: sees 7×7 patch
Deep network neurons: see the entire image

Building a CNN in PyTorch

import torch
import torch.nn as nn
class CNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
# Feature extraction
self.features = nn.Sequential(
nn.Conv2d(1, 32, 3, padding=1), # 28×28 → 28×28 (32 channels)
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 28×28 → 14×14
nn.Conv2d(32, 64, 3, padding=1), # 14×14 → 14×14 (64 channels)
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 14×14 → 7×7
)
# Classification head
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(64 * 7 * 7, 512),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(512, num_classes)
)
def forward(self, x):
x = self.features(x)
return self.classifier(x)
model = CNN(num_classes=10)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Transfer Learning with Pretrained CNNs

Training a CNN from scratch requires millions of images. Transfer learning uses pretrained models:

import torchvision.models as models
# Load ResNet-50 pretrained on ImageNet
resnet = models.resnet50(pretrained=True)
# Freeze all layers
for param in resnet.parameters():
param.requires_grad = False
# Replace the final classification layer
num_features = resnet.fc.in_features
resnet.fc = nn.Linear(num_features, 5) # 5-class problem
# Only train the new final layer
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=1e-3)

Fine-tuning (unfreeze later layers):

# Unfreeze last two residual blocks
for name, param in resnet.named_parameters():
if 'layer4' in name or 'layer3' in name:
param.requires_grad = True
optimizer = torch.optim.Adam([
{'params': resnet.layer3.parameters(), 'lr': 1e-5}, # Low lr for pretrained
{'params': resnet.layer4.parameters(), 'lr': 1e-5},
{'params': resnet.fc.parameters(), 'lr': 1e-3}, # High lr for new layer
])

Pretrained Models Available

from torchvision import models
resnet50 = models.resnet50(pretrained=True) # General purpose
efficientnet_b0 = models.efficientnet_b0(pretrained=True) # Efficient, mobile
vgg16 = models.vgg16(pretrained=True) # Simple architecture
mobilenet = models.mobilenet_v3_small(pretrained=True) # Lightweight mobile
vit_b_16 = models.vit_b_16(pretrained=True) # Vision Transformer

Beyond Images

CNNs apply to any data with local spatial structure:

  • 1D convolutions: time series, text sequences, audio waveforms
  • 3D convolutions: video, volumetric medical imaging (CT/MRI)
  • Graph convolutions: molecular structures, social networks (specialized variants)

Transfer learning makes CNNs practical for most computer vision problems — you rarely need to train from scratch unless you have millions of domain-specific images.