Multi-Layer Perceptrons (MLPs): Architecture, Code, and When to Use Them
The Multi-Layer Perceptron is the direct solution to the exact limitation covered in The Perceptron — a single perceptron can only draw one straight decision boundary, but stack several layers of them together with nonlinear activations between, and the combined network can represent arbitrarily complex decision boundaries. It’s also the architectural foundation every other network type in this series builds on or departs from in a specific, motivated way.
The Architecture: Input, Hidden, and Output Layers
An MLP consists of an input layer, one or more hidden layers, and an output layer, with every neuron in one layer connected to every neuron in the next (a “fully connected” or “dense” architecture).
Input layer Hidden layer(s) Output layer ○ ○ ○ ○ ────weights────▶ ○ ───weights───▶ ○ ○ ○ ○ ○ ○import torch.nn as nn
model = nn.Sequential( nn.Linear(784, 256), # input layer -> hidden layer 1 nn.ReLU(), nn.Linear(256, 128), # hidden layer 1 -> hidden layer 2 nn.ReLU(), nn.Linear(128, 10) # hidden layer 2 -> output layer)This is precisely the architecture built in Building Your First Neural Network — an MLP is the direct, practical embodiment of forward propagation and backpropagation as covered throughout Module 3.
Why Hidden Layers Solve the XOR Problem
A single hidden layer with just two neurons is sufficient to solve the XOR problem that stumped a single perceptron — each hidden neuron learns a different linear decision boundary, and the output layer learns to combine them into a correctly non-linear overall boundary.
import torchimport torch.nn as nn
class XORSolver(nn.Module): def __init__(self): super().__init__() self.hidden = nn.Linear(2, 2) # 2 hidden neurons is enough for XOR self.output = nn.Linear(2, 1)
def forward(self, x): h = torch.relu(self.hidden(x)) return torch.sigmoid(self.output(h))
model = XORSolver()# Trained on the 4 XOR examples, this network can learn to solve# the exact problem that a single perceptron mathematically cannotEach hidden neuron effectively learns one of two useful linear boundaries, and the output layer learns to combine them — the network has discovered a non-linear solution built from simple linear pieces, which is the general principle behind why depth adds representational power.
The Universal Approximation Theorem
A mathematical result with real practical relevance: a feedforward network with even a single hidden layer, given enough neurons, can approximate any continuous function to arbitrary precision. This is a genuinely important theoretical guarantee, but it comes with an important caveat worth understanding: “enough neurons” can, for complex functions, mean an impractically large number — the theorem guarantees a wide-enough shallow network can represent a solution, but says nothing about whether that solution is learnable via gradient descent, or how much data and compute it would take to find. This is exactly why deep networks (many layers, each modest in width) tend to be more practical than very wide shallow ones — depth allows complex functions to be represented far more efficiently than width alone, as covered further in Deep Feedforward Networks.
When MLPs Are Still the Right Choice Today
Despite far more sophisticated architectures existing for images (CNNs) and sequences (RNNs, transformers), plain MLPs remain a genuinely good default choice for:
- Tabular, structured data — customer records, sensor readings, financial features — where there’s no meaningful spatial or sequential structure for a more specialized architecture to exploit.
- The final layers of nearly every other architecture. A CNN’s convolutional layers extract features, but its final classification layers are almost always a small MLP. Transformers’ feedforward sublayers are also, structurally, small MLPs applied at each position.
- Quick baselines. Before reaching for a more complex architecture, training a simple MLP is a fast, informative way to establish a performance baseline and confirm the basic training pipeline (data loading, loss, optimizer) works correctly.
MLP Limitations That Motivate More Specialized Architectures
No spatial awareness for images. An MLP applied directly to a flattened image treats every pixel as an independent feature, discarding the fact that nearby pixels are related — this is exactly the gap Convolutional Neural Networks fill, by explicitly modeling local spatial structure.
No sequential awareness for ordered data. An MLP has no built-in notion of “this input came before that one” — feeding a sentence’s words into an MLP loses all information about word order, which is exactly why Recurrent Neural Networks and later Transformers were developed specifically to handle sequences.
Parameter inefficiency at scale. A fully-connected layer between two reasonably large layers requires a weight for every possible pair of neurons — this grows extremely quickly as layer width increases, which is part of why specialized architectures that share weights (convolutional filters, attention mechanisms) scale more efficiently to large inputs.
Sizing an MLP: A Practical Starting Point
When building an MLP for a new tabular problem, a reasonable starting architecture is 2-3 hidden layers, with each layer’s width somewhere between the input dimension and the output dimension, often shrinking gradually toward the output — for a 50-feature input predicting a binary outcome, hidden layers of size 64, then 32, then a single output neuron is a sensible, unremarkable default worth starting from before investing time in more elaborate architecture search. This isn’t a rule with deep theoretical justification — it’s simply a well-tested, reasonable starting point that tends to avoid both the underfitting risk of too few/narrow layers and the unnecessary complexity of an oversized network for a straightforward tabular problem, which can then be adjusted based on the diagnostic signals covered in Overfitting and Underfitting.
Summary
| Aspect | Detail |
|---|---|
| Architecture | Fully connected layers, each followed by a nonlinear activation |
| Key capability | Can represent non-linear decision boundaries (unlike a single perceptron) |
| Best suited for | Tabular data, and as a component within more specialized architectures |
| Key limitation | No built-in awareness of spatial or sequential structure in the input |
The MLP isn’t an outdated architecture superseded entirely by CNNs and transformers — it’s the foundational building block those architectures are built from and extend, and it remains the correct, practical choice whenever your data doesn’t have spatial or sequential structure worth specially exploiting.