Convolutional Neural Networks (CNNs) Explained: Filters, Padding, and Pooling

How CNNs work — convolution, filters, padding, stride, and pooling explained with code, and why they're built specifically for image data.

Convolutional Neural Networks (CNNs) Explained: Filters, Padding, and Pooling

A fully-connected MLP applied directly to an image, covered in Multi-Layer Perceptrons, treats every pixel as an independent feature — completely discarding the fact that nearby pixels are related, and that a “cat ear” pattern should be recognizable regardless of where in the image it appears. Convolutional Neural Networks are built specifically to preserve and exploit this spatial structure, and understanding their core operations — convolution, filters, padding, stride, and pooling — is what makes CNN architecture diagrams actually readable.


The Convolution Operation

A convolution slides a small filter (also called a kernel) across the image, computing a weighted sum at each position — effectively detecting a specific pattern wherever it appears in the image.

import numpy as np
def convolve_2d(image, kernel):
kh, kw = kernel.shape
ih, iw = image.shape
output_h, output_w = ih - kh + 1, iw - kw + 1
output = np.zeros((output_h, output_w))
for i in range(output_h):
for j in range(output_w):
region = image[i:i+kh, j:j+kw]
output[i, j] = np.sum(region * kernel)
return output
# A simple vertical edge detector filter
edge_filter = np.array([[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]])

The key property: this same 3×3 filter is applied at every position across the entire image, using the exact same weights — this weight sharing is what makes CNNs dramatically more parameter-efficient than a fully-connected layer for image data, and it’s also what gives CNNs translation invariance: a pattern learned in one part of the image is automatically recognized anywhere else it appears too.


Filters: What a Convolutional Layer Actually Learns

In a trained CNN, filters aren’t hand-designed edge detectors like the example above — they’re learned automatically via backpropagation, exactly like any other weight in the network.

import torch.nn as nn
conv_layer = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
# Learns 32 different 3x3 filters, each potentially detecting a different pattern

Early layers in a trained CNN typically learn simple filters — edges, color gradients, simple textures. Later layers combine these into progressively more complex, abstract patterns, directly connecting to the hierarchical feature learning discussed in Deep Feedforward Networks.


Padding: Controlling Output Size and Edge Information

Without padding, each convolution shrinks the output slightly (a 3×3 filter on a 5×5 image produces a 3×3 output), and pixels near the image’s edge get involved in fewer convolution computations than pixels near the center — a genuine loss of edge information over many layers.

# "Same" padding: adds zeros around the border so output size matches input size
conv_same = nn.Conv2d(3, 32, kernel_size=3, padding=1) # padding=1 preserves spatial dimensions
# "Valid" padding: no padding added, output shrinks with each convolution
conv_valid = nn.Conv2d(3, 32, kernel_size=3, padding=0)

“Same” padding is the more common choice in deep architectures specifically to avoid the spatial dimensions shrinking uncontrollably across many stacked convolutional layers.


Stride: Controlling How Far the Filter Moves

Stride determines how many pixels the filter moves between each computation — a stride of 1 moves one pixel at a time (maximum overlap, largest output); a stride of 2 skips every other position, producing a smaller output and reducing computation.

conv_stride1 = nn.Conv2d(3, 32, kernel_size=3, stride=1) # dense, overlapping computation
conv_stride2 = nn.Conv2d(3, 32, kernel_size=3, stride=2) # output roughly half the spatial size

A larger stride is sometimes used as an alternative to pooling for downsampling the spatial dimensions while extracting features simultaneously, trading some spatial resolution for computational efficiency.


Pooling: Downsampling While Preserving Important Information

Pooling layers reduce the spatial dimensions of the data (height and width) while attempting to preserve the most important information — max pooling, the most common variant, keeps only the largest value within each small region.

pooling_layer = nn.MaxPool2d(kernel_size=2, stride=2)
# Conceptually, for a 2x2 pooling window:
region = np.array([[1, 3], [2, 4]])
max_pooled_value = np.max(region) # 4 -- keeps only the strongest activation

Pooling serves two purposes: it reduces computational cost for subsequent layers (fewer spatial positions to process), and it introduces a degree of translation invariance at a local level — a feature detected slightly shifted within the pooling window still produces the same pooled output.


A Complete Minimal CNN Architecture

import torch.nn as nn
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
self.fc = nn.Linear(64 * 8 * 8, num_classes) # assuming input reduced to 8x8 spatially
def forward(self, x):
x = self.pool(torch.relu(self.conv1(x)))
x = self.pool(torch.relu(self.conv2(x)))
x = x.view(x.size(0), -1) # flatten before the final fully-connected layer
return self.fc(x)

Notice the pattern: convolution + activation + pooling, repeated, and only flattened into a fully-connected layer at the very end — the convolutional layers extract spatial features, and the final MLP (covered in Multi-Layer Perceptrons) makes the final classification decision from those extracted features.

Receptive Field: What a Single Output Neuron “Sees”

A useful concept when reasoning about CNN depth: the receptive field of a given neuron is the region of the original input image that ultimately influences its value, after passing through all the preceding convolutional and pooling layers. A single 3×3 convolution gives each output neuron a receptive field of just 3×3 pixels — but stacking several such layers compounds this: two stacked 3×3 convolutions give a receptive field of 5×5, three give 7×7, and so on, directly connecting to the VGG design philosophy covered in Popular CNN Architectures. Pooling layers grow the receptive field even faster, since they compress spatial information before the next convolution is applied. Understanding receptive field size is genuinely practical: a network needs a receptive field at least as large as the objects it’s trying to recognize, which is part of why deeper networks (with correspondingly larger receptive fields) tend to handle larger, more complex objects and scenes more effectively than shallow ones.

Summary

ComponentPurpose
Convolution + filtersDetects local patterns, with shared weights across spatial positions
PaddingControls output size, preserves edge information
StrideControls how much the filter moves, affecting output size and computation
PoolingDownsamples spatially, adds local translation invariance

CNNs aren’t a fundamentally different kind of neural network — they’re MLPs with a specific, deliberate architectural constraint (shared, spatially-local weights) that dramatically improves both parameter efficiency and generalization specifically for image-like data with meaningful spatial structure.