Convolutional Neural Networks (CNNs) Explained: Filters, Padding, and Pooling
A fully-connected MLP applied directly to an image, covered in Multi-Layer Perceptrons, treats every pixel as an independent feature — completely discarding the fact that nearby pixels are related, and that a “cat ear” pattern should be recognizable regardless of where in the image it appears. Convolutional Neural Networks are built specifically to preserve and exploit this spatial structure, and understanding their core operations — convolution, filters, padding, stride, and pooling — is what makes CNN architecture diagrams actually readable.
The Convolution Operation
A convolution slides a small filter (also called a kernel) across the image, computing a weighted sum at each position — effectively detecting a specific pattern wherever it appears in the image.
    import numpy as np

    def convolve_2d(image, kernel):
        # "Valid" convolution: no padding, stride 1. Like most deep learning
        # libraries, this does not flip the kernel (strictly, cross-correlation).
        kh, kw = kernel.shape
        ih, iw = image.shape
        output_h, output_w = ih - kh + 1, iw - kw + 1
        output = np.zeros((output_h, output_w))

        for i in range(output_h):
            for j in range(output_w):
                region = image[i:i+kh, j:j+kw]
                output[i, j] = np.sum(region * kernel)
        return output

    # A simple vertical edge detector filter
    edge_filter = np.array([[-1, 0, 1],
                            [-1, 0, 1],
                            [-1, 0, 1]])

The key property: this same 3×3 filter is applied at every position across the entire image, using the exact same weights. This weight sharing is what makes CNNs dramatically more parameter-efficient than a fully-connected layer for image data. It also makes convolution translation-equivariant: a pattern learned in one part of the image is detected anywhere else it appears, with the filter's response shifting along with the pattern.
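To make the sliding-window behavior concrete, here is a minimal sketch that runs the edge filter over a tiny synthetic image. The 5×6 test image and the one-pixel shift are assumptions chosen for illustration, and the function is restated so the snippet runs on its own.

```python
import numpy as np

def convolve_2d(image, kernel):
    """Valid sliding-window convolution, same logic as the function above."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return output

edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])

# Hypothetical test image: dark (0) on the left, bright (1) from column 3 onward.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
response = convolve_2d(image, edge_filter)
print(response)  # every row is [0. 3. 3. 0.]

# Move the edge one pixel to the right and apply the same filter again.
shifted = np.zeros((5, 6))
shifted[:, 4:] = 1.0
shifted_response = convolve_2d(shifted, edge_filter)
print(shifted_response)  # every row is [0. 0. 3. 3.]
```

The filter responds only where dark meets bright, and moving the edge by one pixel moves the response by one pixel, with no change to the weights.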
Filters: What a Convolutional Layer Actually Learns
In a trained CNN, filters aren’t hand-designed edge detectors like the example above — they’re learned automatically via backpropagation, exactly like any other weight in the network.
    import torch.nn as nn

    conv_layer = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
    # Learns 32 different filters, each 3x3 spatially and spanning all 3 input
    # channels, each potentially detecting a different pattern

Early layers in a trained CNN typically learn simple filters: edges, color gradients, simple textures. Later layers combine these into progressively more complex, abstract patterns, directly connecting to the hierarchical feature learning discussed in Deep Feedforward Networks.
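A rough parameter count shows how small this layer is. The sketch below uses two hypothetical helper functions (not part of PyTorch) and an assumed 32×32 RGB input for the fully-connected comparison.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Weights plus biases of a square-kernel 2D convolution (hypothetical helper)."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

def dense_param_count(in_features, out_features, bias=True):
    """Weights plus biases of a fully-connected layer (hypothetical helper)."""
    return in_features * out_features + (out_features if bias else 0)

# The layer above: 32 filters of shape 3 x 3 x 3, plus one bias per filter.
conv_params = conv2d_param_count(3, 32, 3)
print(conv_params)  # 896

# Assumed comparison: a dense layer mapping a 32x32 RGB image to an output
# with as many values as the convolution produces (32 channels at 32x32).
dense_params = dense_param_count(32 * 32 * 3, 32 * 32 * 32)
print(dense_params // conv_params)  # how many times larger the dense layer is
```

The convolution's parameter count does not depend on the image size at all, while the dense layer's grows with the product of input and output sizes.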
Padding: Controlling Output Size and Edge Information
Without padding, each convolution shrinks the output slightly (a 3×3 filter on a 5×5 image produces a 3×3 output), and pixels near the image’s edge get involved in fewer convolution computations than pixels near the center — a genuine loss of edge information over many layers.
    # "Same" padding: adds zeros around the border so output size matches input size
    conv_same = nn.Conv2d(3, 32, kernel_size=3, padding=1)  # padding=1 preserves spatial dimensions

    # "Valid" padding: no padding added, output shrinks with each convolution
    conv_valid = nn.Conv2d(3, 32, kernel_size=3, padding=0)

“Same” padding is the more common choice in deep architectures specifically to avoid the spatial dimensions shrinking uncontrollably across many stacked convolutional layers.
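The edge-information point can be checked by counting how many filter windows touch each pixel. The following is a small sketch under stated assumptions: `coverage_counts` is a hypothetical helper, the image is square, and the stride is 1.

```python
import numpy as np

def coverage_counts(size, k, padding=0):
    """Count how many k x k windows (stride 1) include each real pixel
    of a size x size image. Hypothetical helper for illustration."""
    padded = size + 2 * padding
    counts = np.zeros((padded, padded), dtype=int)
    out = padded - k + 1  # number of window positions per axis
    for i in range(out):
        for j in range(out):
            counts[i:i+k, j:j+k] += 1
    # Crop away the zero-padding border so only real pixels are reported.
    return counts[padding:padded - padding, padding:padded - padding]

valid = coverage_counts(5, 3, padding=0)
same = coverage_counts(5, 3, padding=1)

print(valid[0, 0], valid[2, 2])  # 1 9 -> a corner pixel is seen once, the center nine times
print(same[0, 0], same[2, 2])    # 4 9 -> padding lets the corner contribute to four windows
```

Padding does not fully equalize corner and center pixels, but it narrows the gap while keeping the output the same size as the input.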
Stride: Controlling How Far the Filter Moves
Stride determines how many pixels the filter moves between each computation — a stride of 1 moves one pixel at a time (maximum overlap, largest output); a stride of 2 skips every other position, producing a smaller output and reducing computation.
    conv_stride1 = nn.Conv2d(3, 32, kernel_size=3, stride=1)  # dense, overlapping computation
    conv_stride2 = nn.Conv2d(3, 32, kernel_size=3, stride=2)  # output roughly half the spatial size

A larger stride is sometimes used as an alternative to pooling for downsampling the spatial dimensions while extracting features simultaneously, trading some spatial resolution for computational efficiency.
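Kernel size, padding, and stride together determine the output size along each axis as floor((n + 2p - k) / s) + 1. The sketch below wraps that formula in a hypothetical helper, `conv_output_size`, and assumes no dilation; the 32-pixel input is an assumed example size.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output length along one axis for a convolution without dilation."""
    return (n + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(5, 3))                        # 3: the 5x5 -> 3x3 example above
print(conv_output_size(32, 3, stride=1))             # 30
print(conv_output_size(32, 3, stride=2))             # 15: roughly half
print(conv_output_size(32, 3, stride=2, padding=1))  # 16: exactly half with padding=1
```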
Pooling: Downsampling While Preserving Important Information
Pooling layers reduce the spatial dimensions of the data (height and width) while attempting to preserve the most important information — max pooling, the most common variant, keeps only the largest value within each small region.
    pooling_layer = nn.MaxPool2d(kernel_size=2, stride=2)

    # Conceptually, for a 2x2 pooling window:
    region = np.array([[1, 3], [2, 4]])
    max_pooled_value = np.max(region)  # 4 -- keeps only the strongest activation

Pooling serves two purposes: it reduces computational cost for subsequent layers (fewer spatial positions to process), and it introduces a degree of translation invariance at a local level. A feature detected slightly shifted within the pooling window still produces the same pooled output.
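Extending the single-window example to a whole feature map, here is a minimal NumPy sketch of max pooling. The function name `max_pool_2d` and the 4×4 feature map values are assumptions for illustration, not PyTorch's implementation.

```python
import numpy as np

def max_pool_2d(feature_map, size=2, stride=2):
    """Naive max pooling over a 2D array (illustrative sketch, no padding)."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r+size, c:c+size].max()
    return pooled

feature_map = np.array([[1, 3, 0, 0],
                        [2, 4, 0, 9],
                        [0, 0, 5, 0],
                        [7, 0, 0, 0]])
pooled = max_pool_2d(feature_map)
print(pooled)  # [[4. 9.] [7. 5.]]

# Move the strong activation (9) to another cell of the same 2x2 window.
shifted = feature_map.copy()
shifted[1, 3] = 0
shifted[0, 2] = 9
print(np.array_equal(max_pool_2d(shifted), pooled))  # True
```

The invariance is only local: if the activation moved across a window boundary, it would land in a different pooled cell.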
A Complete Minimal CNN Architecture
    import torch
    import torch.nn as nn

    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
            self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
            self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
            # Assumes the input is reduced to 8x8 spatially
            # (e.g. 32x32 inputs: two 2x2 pools give 32 -> 16 -> 8)
            self.fc = nn.Linear(64 * 8 * 8, num_classes)

        def forward(self, x):
            x = self.pool(torch.relu(self.conv1(x)))
            x = self.pool(torch.relu(self.conv2(x)))
            x = x.view(x.size(0), -1)  # flatten before the final fully-connected layer
            return self.fc(x)

Notice the pattern: convolution + activation + pooling, repeated, and only flattened into a fully-connected layer at the very end. The convolutional layers extract spatial features, and the final fully-connected layer (see Multi-Layer Perceptrons) makes the classification decision from those extracted features.
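The `64 * 8 * 8` in the final layer follows from tracing shapes through the network. The sketch below does that arithmetic in plain Python, without PyTorch. The helper names are hypothetical, and the 32×32 input size is an assumption consistent with the 8×8 comment in the model.

```python
def out_size(n, kernel_size, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (n + 2 * padding - kernel_size) // stride + 1

def trace_simple_cnn(size=32):
    """Return (stage, channels, spatial size) after each stage of SimpleCNN."""
    shapes = [("input", 3, size)]
    size = out_size(size, 3, padding=1)   # conv1 with padding=1 keeps the size
    shapes.append(("conv1 + relu", 32, size))
    size = out_size(size, 2, stride=2)    # 2x2 max pool halves it
    shapes.append(("pool", 32, size))
    size = out_size(size, 3, padding=1)   # conv2 with padding=1 keeps the size
    shapes.append(("conv2 + relu", 64, size))
    size = out_size(size, 2, stride=2)    # second pool halves it again
    shapes.append(("pool", 64, size))
    return shapes

shapes = trace_simple_cnn(32)
for name, channels, size in shapes:
    print(f"{name:14s} {channels:3d} x {size} x {size}")

_, channels, size = shapes[-1]
flattened = channels * size * size
print(flattened)  # 4096 = 64 * 8 * 8, the in_features of self.fc
```

A different input resolution changes the flattened size, so the `nn.Linear` input dimension would need to change with it.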
Receptive Field: What a Single Output Neuron “Sees”
A useful concept when reasoning about CNN depth: the receptive field of a given neuron is the region of the original input image that ultimately influences its value, after passing through all the preceding convolutional and pooling layers. A single 3×3 convolution gives each output neuron a receptive field of just 3×3 pixels, but stacking several such layers compounds this: two stacked 3×3 convolutions give a receptive field of 5×5, three give 7×7, and so on, directly connecting to the VGG design philosophy covered in Popular CNN Architectures. Pooling layers grow the receptive field even faster, since they compress spatial information before the next convolution is applied: after one 2×2, stride-2 pooling step, each further 3×3 convolution widens the receptive field by 4 input pixels instead of 2.

Understanding receptive field size is genuinely practical: as a rule of thumb, a network needs a receptive field at least as large as the objects it’s trying to recognize, which is part of why deeper networks (with correspondingly larger receptive fields) tend to handle larger, more complex objects and scenes more effectively than shallow ones.
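The growth pattern described here can be computed layer by layer. The sketch below uses the standard recurrence: each layer adds (kernel_size - 1) times the current spacing between neighboring outputs, and each stride multiplies that spacing. The function name and the layer lists are assumptions for illustration; padding does not affect receptive field size, so it is omitted.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) after a stack of layers.

    layers: list of (kernel_size, stride) tuples in forward order.
    """
    rf, jump = 1, 1  # a single input pixel; adjacent outputs are 1 pixel apart
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

conv3 = (3, 1)  # 3x3 convolution, stride 1
pool2 = (2, 2)  # 2x2 max pooling, stride 2

print(receptive_field([conv3]))                       # 3
print(receptive_field([conv3, conv3]))                # 5
print(receptive_field([conv3, conv3, conv3]))         # 7
print(receptive_field([conv3, pool2, conv3, pool2]))  # 10: conv-pool-conv-pool, as in SimpleCNN
```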
Summary
| Component | Purpose |
|---|---|
| Convolution + filters | Detects local patterns, with shared weights across spatial positions |
| Padding | Controls output size, preserves edge information |
| Stride | Controls how much the filter moves, affecting output size and computation |
| Pooling | Downsamples spatially, adds local translation invariance |
CNNs aren’t a fundamentally different kind of neural network — they’re MLPs with a specific, deliberate architectural constraint (shared, spatially-local weights) that dramatically improves both parameter efficiency and generalization specifically for image-like data with meaningful spatial structure.