Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are the dominant architecture for image processing tasks. By exploiting spatial locality and weight sharing, they learn visual hierarchies — from edges and textures in early layers to objects and scenes in later layers.
The Convolution Operation
A convolution slides a small filter (kernel) across the input image, computing dot products at each position:
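Input (5×5):      Filter (3×3):     Output (3×3):
1 0 1 0 1         1 0 1             5 0 5
0 1 0 1 0    ⊗    0 1 0       →     0 5 0
1 0 1 0 1         1 0 1             5 0 5
0 1 0 1 0
1 0 1 0 1

At each position: sum(element-wise multiplication of filter with patch)

For example, the top-left output is 5 because the filter's five ones all line up with ones in the top-left 3×3 patch; one step to the right they all line up with zeros, giving 0.

The filter weights are learned during training: the network learns what patterns to detect.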
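The sliding-window computation can be sketched in plain Python. This is a minimal illustration, not how a library implements it: the helper name `conv2d_valid` is made up for this example, it assumes a square kernel and no padding, and, like the layers in deep learning libraries, it does not flip the kernel (strictly speaking this is cross-correlation). It applies the 5×5 input and 3×3 filter from the example.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (lists of lists), no padding."""
    k = len(kernel)
    # Output size along each axis: (input - kernel) // stride + 1
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product of the kernel with the k×k patch at (top, left)
            total = sum(
                image[top + a][left + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = conv2d_valid(image, kernel)
for row in result:
    print(row)
# [5, 0, 5]
# [0, 5, 0]
# [5, 0, 5]
```

The output is 3×3 because a 3×3 filter fits into a 5×5 input at only 5 - 3 + 1 = 3 positions along each axis.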
Key Components
Convolution Layer
import torch.nn as nn

# 32 filters of size 3×3, taking a 1-channel (grayscale) input
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
# With padding=1, spatial dimensions are preserved: (H, W) → (H, W)
# Without padding: (H, W) → (H-2, W-2) for 3×3 filter

Pooling Layer
A pooling layer reduces spatial dimensions while retaining the most important features, keeping either the maximum or the average of each window:
pool = nn.MaxPool2d(kernel_size=2, stride=2)      # Halves spatial dimensions
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

Receptive Field
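With stacked 3×3 convolutions at stride 1, each neuron in a deep CNN “sees” a larger region of the original image as depth increases, because every extra layer widens the receptive field by 2 pixels:
Layer 1 neuron: sees 3×3 patch
Layer 2 neuron: sees 5×5 patch
Layer 3 neuron: sees 7×7 patch
Deep network neurons: can see the entire image (pooling and strides enlarge the receptive field faster)

Building a CNN in PyTorch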
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extraction
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   # 28×28 → 28×28 (32 channels)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 28×28 → 14×14

            nn.Conv2d(32, 64, 3, padding=1),  # 14×14 → 14×14 (64 channels)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 14×14 → 7×7
        )

        # Classification head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

model = CNN(num_classes=10)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Transfer Learning with Pretrained CNNs
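Training a large CNN from scratch typically requires a very large labeled dataset (ImageNet, for example, has about 1.2 million training images). Transfer learning instead starts from a model pretrained on such a dataset and adapts it to the new task: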
import torch
import torch.nn as nn
import torchvision.models as models

# Load ResNet-50 pretrained on ImageNet
# (torchvision >= 0.13 uses `weights=`; the older `pretrained=True` is deprecated)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all layers
for param in resnet.parameters():
    param.requires_grad = False

# Replace the final classification layer (a new layer is trainable by default)
num_features = resnet.fc.in_features
resnet.fc = nn.Linear(num_features, 5)  # 5-class problem

# Only train the new final layer
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=1e-3)

Fine-tuning (unfreeze later layers):
# Unfreeze the last two residual stages (layer3 and layer4)
for name, param in resnet.named_parameters():
    if 'layer4' in name or 'layer3' in name:
        param.requires_grad = True

optimizer = torch.optim.Adam([
    {'params': resnet.layer3.parameters(), 'lr': 1e-5},  # Low lr for pretrained
    {'params': resnet.layer4.parameters(), 'lr': 1e-5},
    {'params': resnet.fc.parameters(), 'lr': 1e-3},      # High lr for new layer
])

Pretrained Models Available
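from torchvision import models

# torchvision >= 0.13 selects pretrained weights via `weights=`
# (the older `pretrained=True` is deprecated)
resnet50 = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)                       # General purpose
efficientnet_b0 = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)  # Efficient, mobile
vgg16 = models.vgg16(weights=models.VGG16_Weights.DEFAULT)                                # Simple architecture
mobilenet = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)  # Lightweight mobile
vit_b_16 = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)                       # Vision Transformer (not a CNN)

Beyond Images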
CNNs apply to any data with local spatial structure:
- 1D convolutions: time series, text sequences, audio waveforms
- 3D convolutions: video, volumetric medical imaging (CT/MRI)
- Graph convolutions: molecular structures, social networks (specialized variants)
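As a rough illustration of the 1D and 3D cases, the sketch below runs PyTorch's `nn.Conv1d` and `nn.Conv3d` on random tensors. The tensor sizes and channel counts are arbitrary choices made for this example, not values from any particular dataset. Graph convolutions are left out because core PyTorch has no graph convolution layer; they need a separate library.

```python
import torch
import torch.nn as nn

# 1D: a batch of 4 single-channel signals, each 100 samples long
signal = torch.randn(4, 1, 100)           # (batch, channels, length)
conv1d = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5, padding=2)
signal_out = conv1d(signal)
print(signal_out.shape)   # torch.Size([4, 8, 100])

# 3D: a batch of 2 single-channel volumes, e.g. 16 slices of 32×32 pixels
volume = torch.randn(2, 1, 16, 32, 32)    # (batch, channels, depth, height, width)
conv3d = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
volume_out = conv3d(volume)
print(volume_out.shape)   # torch.Size([2, 4, 16, 32, 32])
```

The only thing that changes between the variants is the number of axes the filter slides along; the padding values here were picked so that each spatial or temporal size is preserved.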
Transfer learning makes CNNs practical for most computer vision problems — you rarely need to train from scratch unless you have millions of domain-specific images.