Popular CNN Architectures: LeNet, AlexNet, VGG, ResNet, and EfficientNet

How LeNet, AlexNet, VGG, ResNet, and EfficientNet each solved a specific limitation of their predecessor, tracing the evolution of CNN design.

Each major CNN architecture in this progression exists because it addressed a specific, well-documented limitation of the designs that came before it. Understanding why each architecture was introduced, not just its name, is what makes the sequence useful rather than a list to memorize.


LeNet (1998): Proving the Concept

LeNet, designed for recognizing handwritten digits on checks, was one of the first successful demonstrations that convolution, pooling, and gradient-based training could work together end to end for a real vision task.

LeNet-5 structure (simplified):
Input (32x32) → Conv → Pool → Conv → Pool → FC → FC → Output (10 classes)

Small by modern standards (roughly 60,000 parameters), LeNet established the core architectural pattern — convolution and pooling layers, followed by fully-connected layers — that every subsequent CNN in this list still follows in some form.
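The simplified structure above can be sketched in PyTorch as follows. This is a minimal illustration, not a faithful reproduction of the 1998 network: the class name `LeNet5Sketch` is hypothetical, and plain `Tanh` activations and average pooling are assumed stand-ins for the original's activation and subsampling layers.

```python
import torch
import torch.nn as nn

class LeNet5Sketch(nn.Module):
    """Simplified LeNet-5: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28, 6 feature maps
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10, 16 feature maps
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5Sketch()
logits = model(torch.zeros(1, 1, 32, 32))  # one grayscale 32x32 image
param_count = sum(p.numel() for p in model.parameters())
print(tuple(logits.shape), param_count)
```

With these layer sizes the sketch has 61,706 parameters, in line with the "roughly 60,000" figure; almost all of them sit in the first fully-connected layer.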


AlexNet (2012): The Deep Learning Breakthrough

AlexNet’s decisive win in the 2012 ImageNet competition is widely credited as the moment that triggered the modern deep learning boom — it demonstrated that a much deeper network, trained on GPUs with a much larger dataset, dramatically outperformed every non-deep-learning approach that came before it.

AlexNet used, notably for its time:
- ReLU activations (instead of sigmoid/tanh) — faster training, covered in Activation Functions
- Dropout for regularization — covered in Dropout
- GPU-based training — made training a network this large computationally feasible

AlexNet’s specific architectural choices were themselves practical validations of concepts covered earlier in this series — ReLU’s advantage over sigmoid for gradient flow, and dropout’s effectiveness at reducing overfitting, were both demonstrated concretely and convincingly at scale for the first time by this specific architecture.
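The gradient-flow point can be illustrated with a few lines of arithmetic. This is a deliberately simplified sketch: it looks only at the activation derivatives along one path and ignores the weights, and the `depth` value is an arbitrary assumption chosen for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z == 0

def relu_grad(z: float) -> float:
    return 1.0 if z > 0 else 0.0

depth = 8  # hypothetical number of stacked activations

# Best case for sigmoid: every pre-activation sits exactly at 0.
sigmoid_factor = sigmoid_grad(0.0) ** depth
# Active ReLU units pass the gradient through unchanged.
relu_factor = relu_grad(1.0) ** depth

print(sigmoid_factor, relu_factor)
```

Even in its best case the sigmoid path scales the gradient by 0.25 per layer, while an active ReLU path scales it by 1, which is one reason ReLU networks train faster as depth grows.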


VGG (2014): Simplicity and Depth

VGG’s key contribution was demonstrating that a very simple, uniform design (stacking many small 3×3 convolutional filters rather than mixing larger filter sizes) could achieve excellent results, and that depth itself was an important factor for accuracy: the best-known VGG variants have 16 and 19 weight layers.

import torch.nn as nn

# VGG-style block: multiple small 3x3 convolutions before pooling
vgg_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

Two stacked 3×3 filters have the same effective receptive field as one 5×5 filter, but with fewer parameters and an additional nonlinearity in between — VGG’s insight was that this combination of small filters, stacked deeply, was both more parameter-efficient and more expressive than fewer, larger filters.
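The receptive-field and parameter claims can be checked with simple arithmetic. The helper names below are hypothetical, and the sketch assumes stride-1 convolutions with the same number of input and output channels (64, as in the VGG-style block above).

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases for one square 2D convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def stacked_receptive_field(kernel_sizes: list[int]) -> int:
    """Receptive field of a stack of stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each layer extends the field by (k - 1) pixels
    return field

channels = 64
two_3x3 = 2 * conv_params(3, channels, channels)
one_5x5 = conv_params(5, channels, channels)

print(stacked_receptive_field([3, 3]), stacked_receptive_field([5]))  # 5 5
print(two_3x3, one_5x5)  # 73856 102464
```

Both options see a 5×5 window of the input, but the stacked version needs about 28% fewer parameters at this channel count.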


ResNet (2015): Solving the Degradation Problem With Residual Connections

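As networks got deeper following VGG’s lead, a new, counterintuitive problem emerged: very deep networks sometimes performed worse than shallower ones, and not because of overfitting, since they had higher training error too. This degradation problem is related to, but not the same as, the Vanishing Gradient Problem: the ResNet authors observed it even in networks that used batch normalization to keep gradients healthy, and described it as an optimization difficulty that grows with depth.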

ResNet’s solution was the residual (skip) connection — instead of forcing each block of layers to learn a completely new transformation, it learns only the residual (the difference) relative to the input, which is then added back.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Both convolutions keep the channel count and spatial size, so the
        # input can be added to the output without any reshaping.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = x
        out = torch.relu(self.conv1(x))
        out = self.conv2(out)
        return torch.relu(out + residual)  # skip connection: add input back directly

This skip connection gives gradients a direct, unimpeded path backward through the network, which eases optimization and mitigates vanishing gradients even at depths of 50, 100, or more than 150 layers. Plain networks of that depth were impractical to train well, so residual connections are what made the depth-efficiency argument from Deep Feedforward Networks usable in practice.
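The "direct path" can be made visible with an extreme case. In the sketch below, the `Block` class and both helper functions are hypothetical, and all convolution weights are zeroed on purpose so the learned layers contribute nothing; the only question is whether any gradient still reaches the input.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Two 3x3 convolutions, with an optional skip connection."""

    def __init__(self, channels: int, use_skip: bool):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.use_skip = use_skip

    def forward(self, x):
        out = self.conv2(torch.relu(self.conv1(x)))
        if self.use_skip:
            out = out + x  # identity path around the convolutions
        return torch.relu(out)

def input_gradient(block: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Gradient of the summed block output with respect to the input."""
    x = x.clone().requires_grad_(True)
    block(x).sum().backward()
    return x.grad

def zero_weights(block: nn.Module) -> None:
    """Extreme case: the convolutions contribute nothing at all."""
    for p in block.parameters():
        nn.init.zeros_(p)

x = torch.rand(1, 4, 8, 8) + 0.1  # strictly positive so the final ReLU is active
plain, residual = Block(4, use_skip=False), Block(4, use_skip=True)
zero_weights(plain)
zero_weights(residual)

grad_plain = input_gradient(plain, x)
grad_residual = input_gradient(residual, x)
print(grad_plain.abs().max().item(), grad_residual.abs().max().item())
```

Without the skip connection the input receives no gradient at all; with it, the gradient arrives unchanged through the identity path, regardless of what the convolutions are doing.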


EfficientNet (2019): Principled Scaling

Rather than proposing a fundamentally new building block, EfficientNet’s contribution was a systematic study of how to scale a network up — depth (more layers), width (more channels per layer), and input resolution (larger images) had historically been scaled somewhat arbitrarily and independently. EfficientNet demonstrated that scaling all three dimensions together, according to a specific, empirically-derived ratio, produced substantially better accuracy-per-parameter and accuracy-per-FLOP than scaling any single dimension alone.

EfficientNet family: B0 (baseline) through B7 (largest),
each scaling depth, width, and resolution together
according to a fixed compound scaling formula

This made EfficientNet models particularly attractive for resource-constrained deployment scenarios, covered further in Deep Learning Deployment, where achieving strong accuracy with fewer total parameters and less compute directly translates to faster, cheaper inference.
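The compound scaling rule described above can be sketched as a single coefficient, here called `phi`, that raises three fixed bases to the same power. The base values 1.2, 1.1, and 1.15 are the ones reported in the EfficientNet paper; the function name and dictionary layout are assumptions for illustration, and the published B1 to B7 configurations use rounded values, so they will not match this formula exactly.

```python
# Depth, width, and resolution bases reported in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: float) -> dict[str, float]:
    """Multipliers relative to the B0 baseline for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
        # FLOPs grow roughly with depth * width^2 * resolution^2.
        "flops": (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi,
    }

for phi in range(4):
    scale = compound_scale(phi)
    print(phi, {name: round(value, 2) for name, value in scale.items()})
```

The bases were chosen so that `ALPHA * BETA**2 * GAMMA**2` is close to 2, which means each unit increase in `phi` roughly doubles the compute budget while keeping the three dimensions in balance.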


The Throughline: Each Architecture Solved a Specific Problem

| Architecture | Year | Key Innovation | Problem It Solved |
|---|---|---|---|
| LeNet | 1998 | Convolution + pooling + gradient training | Proved the core CNN concept works |
| AlexNet | 2012 | ReLU, dropout, GPU training at scale | Demonstrated deep learning’s dramatic advantage |
| VGG | 2014 | Deep, uniform stacks of small filters | Showed depth and filter simplicity both matter |
| ResNet | 2015 | Residual/skip connections | Solved degradation at extreme depth |
| EfficientNet | 2019 | Principled compound scaling | Optimal accuracy per parameter/compute |

Transfer Learning: Reusing These Architectures Directly

A hugely practical consequence of this architectural history: pretrained versions of ResNet, EfficientNet, and similar architectures — trained on massive datasets like ImageNet — are freely available and can be fine-tuned on a much smaller, task-specific dataset rather than training from scratch.

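import torch.nn as nn
import torchvision.models as models

num_classes_for_my_task = 10  # placeholder: set to the number of classes in your dataset
# `weights=` replaces the deprecated `pretrained=True` argument (torchvision 0.13+)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Replace the final classification layer for your specific number of classes
model.fc = nn.Linear(model.fc.in_features, num_classes_for_my_task)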

This transfer learning approach — reusing a pretrained architecture’s learned features and only fine-tuning the final layers, or the whole network at a lower learning rate — is standard practice for most real-world image classification projects today, since training one of these architectures from scratch requires far more data and compute than most individual projects have access to.
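The two fine-tuning strategies mentioned above can be sketched as follows. To keep the example self-contained and download-free, `TinyNet` is a hypothetical stand-in for a pretrained network; it only mirrors the `fc` attribute name that torchvision's ResNet uses. The learning rates and the choice of SGD are illustrative assumptions, not recommended settings.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained model: a backbone plus an `fc` head."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(self.backbone(x))

num_classes_for_my_task = 5  # placeholder value
model = TinyNet()
model.fc = nn.Linear(model.fc.in_features, num_classes_for_my_task)

# Option 1: train only the new head and freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc.")
head_only = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2, momentum=0.9
)

# Option 2: fine-tune the whole network, with a lower learning rate for the
# pretrained layers than for the freshly initialized head.
for param in model.parameters():
    param.requires_grad = True
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
full_finetune = torch.optim.SGD(
    [
        {"params": backbone_params, "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```

Freezing the backbone is the cheaper option and tends to suit very small datasets; per-group learning rates let the pretrained features adapt slowly without being overwritten by large early updates from the new head.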

Summary

This progression isn’t a list of interchangeable options. Residual connections specifically addressed the degradation problem in very deep networks, and compound scaling specifically addressed inefficient allocation of depth, width, and resolution. Understanding what problem each architecture was built to solve is what makes their design choices, and the choices in whatever architecture comes after them, make sense rather than feeling arbitrary.