Activation Functions Explained: Sigmoid, ReLU, GELU, and When to Use Each
Without an activation function, a neural network with a hundred layers would be mathematically equivalent to a single-layer linear model — stacking linear operations produces just another linear operation, no matter how many layers you add. Activation functions inject the nonlinearity that lets deep networks represent genuinely complex patterns, and choosing the right one for a given layer is a real, practical decision with measurable consequences for training.
Why Nonlinearity Is Non-Negotiable
# Without activation functions, layers collapse into one equivalent linear operation
# y = W3(W2(W1*x)) = (W3*W2*W1)*x = W_combined * x -- just one layer's worth of expressiveness
This is exactly the limitation illustrated by the XOR problem in The Perceptron: a purely linear system, no matter how many layers deep, can only ever represent linear relationships and linearly separable decision boundaries. Every activation function’s real job is breaking this collapse.
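The collapse can be checked numerically. The following is a minimal NumPy sketch; the three small weight matrices and the input are hand-picked, purely illustrative values, not taken from any real model.

```python
import numpy as np

# Illustrative weights for a 2 -> 2 -> 2 -> 1 stack of linear layers (assumed values).
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
W2 = np.array([[1.0, 2.0], [3.0, 4.0]])
W3 = np.array([[1.0, 1.0]])
x = np.array([1.0, 0.0])

# Applying the layers one at a time, with no activation in between...
layered = W3 @ (W2 @ (W1 @ x))

# ...gives the same result as one precomputed matrix applied once.
W_combined = W3 @ W2 @ W1
collapsed = W_combined @ x
stack_collapses = np.allclose(layered, collapsed)

def relu(z):
    return np.maximum(0, z)

# Putting a nonlinearity between the layers breaks the equivalence.
nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
nonlinear_differs = not np.allclose(nonlinear, collapsed)

print(stack_collapses, nonlinear_differs)  # True True
```

With these values the purely linear stack and the single combined matrix both produce -2, while the version with ReLU between layers produces 4, so no single matrix reproduces it for this input.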
Sigmoid: Historically Important, Rarely the Right Default Today
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Sigmoid squashes any input into the range (0, 1), making it a natural fit for binary classification output layers, where the output is read directly as a probability. Its major practical drawback: for inputs of large magnitude, positive or negative, its gradient becomes extremely small, which directly contributes to the vanishing gradient problem in deep networks, covered in Vanishing Gradient Problem.
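How small the gradient gets can be seen from the derivative, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), which peaks at 0.25 when x = 0. The sketch below evaluates it at a few sample inputs; the helper name `sigmoid_grad` and the 10-layer figure are illustrative assumptions, and the depth estimate ignores the weight terms that also appear in a real backward pass.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

xs = np.array([0.0, 2.0, 5.0, 10.0])
grads = sigmoid_grad(xs)
print(grads)  # largest at x = 0 (0.25), shrinking rapidly as |x| grows

# Best case for a chain of 10 sigmoid layers: every unit sits exactly at x = 0.
# Even then the product of local derivatives is tiny.
best_case_10_layers = 0.25 ** 10
print(best_case_10_layers)
```

Because backpropagation multiplies these local derivatives layer by layer, even the best case shrinks the signal by a factor of four per sigmoid layer.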
Tanh: Zero-Centered, Same Underlying Problem
def tanh(x):
    return np.tanh(x)
Tanh squashes inputs into the range (-1, 1). Being zero-centered, unlike sigmoid, generally helps gradient flow slightly, but it suffers from the same vanishing gradient issue at extreme input values.
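Both points can be checked in a few lines. This sketch relies on the identity tanh(x) = 2 * sigmoid(2x) - 1 and on the derivative 1 - tanh(x)^2; the helper name `tanh_grad` and the sample grid are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1 - np.tanh(x) ** 2

xs = np.linspace(-3, 3, 7)  # inputs placed symmetrically around zero

# tanh is a rescaled and shifted sigmoid.
same_shape = np.allclose(np.tanh(xs), 2 * sigmoid(2 * xs) - 1)

# On symmetric inputs, tanh outputs average to 0 while sigmoid outputs average to 0.5.
mean_tanh = np.tanh(xs).mean()
mean_sigmoid = sigmoid(xs).mean()

# The gradient peaks at 1.0 (versus 0.25 for sigmoid) but still collapses in the tails.
peak_grad = tanh_grad(0.0)
tail_grad = tanh_grad(5.0)
print(same_shape, mean_tanh, mean_sigmoid, peak_grad, tail_grad)
```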
ReLU: The Modern Default for Hidden Layers
def relu(x):
    return np.maximum(0, x)
ReLU (Rectified Linear Unit) outputs the input directly if positive, and zero otherwise. Its gradient is either exactly 1 (for positive inputs) or exactly 0 (for negative inputs), so positive activations do not suffer from vanishing gradients, and it is trivial to compute, which matters enormously at the scale of modern networks. ReLU’s main drawback is the “dying ReLU” problem: a neuron whose input is consistently negative outputs zero and has zero gradient, meaning it can get permanently stuck and stop learning entirely.
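One way to spot dying ReLU units is to look for units that output zero for every example in a batch. The sketch below assumes pre-activations arranged as a (batch, units) array; `relu_grad`, `dead_fraction`, and the sample values are hypothetical names and data introduced for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (0 is used at x == 0 by convention here).
    return (x > 0).astype(float)

def dead_fraction(pre_activations):
    # A unit counts as "dead" on this batch if it outputs zero for every example.
    outputs = relu(pre_activations)
    dead_units = np.all(outputs == 0, axis=0)
    return dead_units.mean()

# Hypothetical batch: 4 examples, 3 units. The middle unit is negative everywhere.
z = np.array([
    [0.5, -1.0, 2.0],
    [-0.3, -0.2, 1.5],
    [1.2, -3.0, -0.1],
    [0.1, -0.7, 0.4],
])

print(relu_grad(z))      # the middle column is all zeros: no gradient reaches that unit
print(dead_fraction(z))  # one of the three units is dead on this batch
```

A unit that is dead on one batch is not necessarily dead for good, so in practice the check is more informative when repeated across many batches.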
Leaky ReLU: Fixing the Dying ReLU Problem
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
Leaky ReLU allows a small, non-zero gradient (alpha) for negative inputs instead of exactly zero, giving “dead” neurons a chance to recover during training rather than staying permanently stuck.
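The recovery mechanism shows up in a single gradient step. This sketch uses a hypothetical one-input neuron y = f(w * x + b) with a squared-error loss and a strongly negative bias; all numbers are made up for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Slope is 1 for positive inputs and alpha for everything else.
    return np.where(x > 0, 1.0, alpha)

# Hypothetical neuron with loss 0.5 * (y - target)^2, stuck in the negative region.
x, w, b, target, lr = 1.0, 1.0, -5.0, 1.0, 1.0
z = w * x + b  # -4.0

# Bias update = -lr * dLoss/dy * dy/dz (and dz/db = 1).
relu_step = -lr * (max(0.0, z) - target) * float(z > 0)
leaky_step = -lr * (leaky_relu(z) - target) * leaky_relu_grad(z)

print(relu_step)          # zero: plain ReLU gets no learning signal here
print(float(leaky_step))  # small positive nudge that moves b back toward the active region
```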
GELU: The Smooth, Modern Choice for Transformers
def gelu(x):
    # tanh-based approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
GELU (Gaussian Error Linear Unit) weights each input by the probability that a standard Gaussian variable falls below it, GELU(x) = x · Φ(x), which gives a smooth, probabilistic gate rather than ReLU’s hard cutoff at zero. The code above is the widely used tanh approximation of that function. GELU has become the standard activation function in transformer architectures (covered in Transformers), including GPT and BERT-family models, largely because its smoothness empirically improves training stability and final performance for these specific large-scale architectures.
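To see how close the tanh form is to the exact definition x * Φ(x), the two can be compared on a grid. The sketch below writes Φ with the standard library's `math.erf`; the function names `gelu_exact` and `gelu_tanh` and the grid range are choices made for this illustration.

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with the standard normal CDF Phi written via the error function.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

def gelu_tanh(x):
    # The tanh approximation of the same function.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-5, 5, 1001)

# Largest disagreement between the two forms on this grid.
max_abs_error = np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs)))

# Unlike ReLU, GELU dips slightly below zero for small negative inputs.
min_value = gelu_exact(xs).min()

print(max_abs_error, min_value)
```

The minimum lands near -0.17, which is why GELU is bounded below rather than ranging over all real numbers.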
Swish: Self-Gated and Empirically Strong
def swish(x, beta=1.0):
    # Reuses sigmoid() as defined in the Sigmoid section
    return x * sigmoid(beta * x)
Swish, found in part through an automated search over candidate activation functions, multiplies the input by its own sigmoid, a self-gating mechanism. With beta = 1 it is also known as SiLU. It has shown consistent, if modest, empirical improvements over ReLU on several deep architectures, particularly at greater depths.
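The beta parameter controls how sharp the gate is, which a quick check makes visible. In this sketch the specific beta values (0 and 50) and the sample grid are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

xs = np.linspace(-4, 4, 9)

# beta = 0: the gate is constant (sigmoid(0) = 0.5), so Swish is just x / 2.
half_linear = swish(xs, beta=0.0)

# Large beta: the gate saturates to 0 or 1, so Swish approaches max(0, x).
near_relu = swish(xs, beta=50.0)

print(np.allclose(half_linear, xs / 2))                       # True
print(np.allclose(near_relu, np.maximum(0, xs), atol=1e-8))   # True
```

Between those extremes, the default beta = 1 gives a smooth curve that dips to roughly -0.28 for small negative inputs before rising.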
Comparing Activation Functions
| Function | Range | Vanishing gradient risk | Common use today |
|---|---|---|---|
| Sigmoid | (0, 1) | High | Binary classification output layers; gates inside LSTMs/GRUs |
| Tanh | (-1, 1) | High | State updates inside RNNs/LSTMs |
| ReLU | [0, ∞) | Low (for positive inputs) | Default for CNN/MLP hidden layers |
| Leaky ReLU | (-∞, ∞) | Low | When dying ReLU is observed to be a problem |
| GELU | ≈ [-0.17, ∞) | Low | Standard in transformer architectures |
| Swish | ≈ [-0.28, ∞) for beta = 1 | Low | Some modern CNN and mobile architectures |
A Practical Decision Framework
Hidden layers in a standard feedforward or convolutional network: start with ReLU — it’s fast, well-understood, and works well in the overwhelming majority of cases.
Building or fine-tuning a transformer: use GELU, matching what BERT and GPT-family models use internally. When fine-tuning, keep whatever activation the pretrained model was trained with, since some newer architectures use other variants.
Observing many “dead” neurons during training (activations stuck at exactly zero across many examples): switch to Leaky ReLU or GELU as a direct fix.
Binary classification output layer: sigmoid remains the correct, standard choice, since its output directly maps to a probability.
Multi-class classification output layer: softmax (covered in Probability Distributions), not any of the activation functions listed above — softmax is a distinct function that normalizes an entire output vector into a valid probability distribution, not a per-neuron activation applied independently.
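The difference between softmax and a per-neuron activation is easy to see side by side. This is a minimal sketch with made-up logits; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    # Shifting by the max leaves the output unchanged but avoids overflow in exp.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for three classes

probs = softmax(logits)     # normalized jointly: entries sum to 1
per_unit = sigmoid(logits)  # each entry is in (0, 1), but they are not normalized together

print(probs, probs.sum())
print(per_unit, per_unit.sum())
```

Applying sigmoid to each logit independently yields values whose sum exceeds 1 here, so they cannot be read as a distribution over mutually exclusive classes.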
A Note on Activation Functions in Output Layers vs. Hidden Layers
It’s worth being explicit that the guidance above applies to hidden layers — output layers follow entirely different rules driven by the task’s output distribution, covered in Probability Distributions. A regression output typically uses no activation at all (a raw linear output), a binary classifier’s output layer uses sigmoid, and a multi-class classifier’s output layer uses softmax — none of these are chosen for the same reasons ReLU or GELU are chosen in hidden layers. Mixing these up (applying ReLU to a regression output layer, for instance, which would incorrectly clip all negative predictions to zero) is a subtle but real mistake worth checking for explicitly when reviewing an unfamiliar model’s architecture.
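The three output-layer rules, and the ReLU-on-regression mistake, can be sketched as follows. The function `apply_output_head` and its task labels are hypothetical names invented for this example, and the raw outputs are made-up values.

```python
import numpy as np

def apply_output_head(raw, task):
    """Map a final linear layer's raw outputs to predictions for a given task."""
    if task == "regression":
        return raw  # no activation: any real value is a valid prediction
    if task == "binary":
        return 1 / (1 + np.exp(-raw))  # sigmoid: probability of the positive class
    if task == "multiclass":
        shifted = raw - np.max(raw, axis=-1, keepdims=True)
        exps = np.exp(shifted)
        return exps / np.sum(exps, axis=-1, keepdims=True)  # softmax over classes
    raise ValueError(f"unknown task: {task}")

# Hypothetical raw outputs for a target that can legitimately be negative.
raw = np.array([-2.5, -0.3, 0.0, 1.7])

regression_out = apply_output_head(raw, "regression")
relu_out = np.maximum(0, raw)  # the mistake: ReLU applied to a regression output

# Count the negative predictions that the ReLU head replaced with zero.
num_clipped = int(np.sum((raw < 0) & (relu_out == 0)))
print(regression_out, relu_out, num_clipped)
```

When reviewing an unfamiliar model, a check like the last line is a quick way to confirm whether a regression head is silently discarding negative predictions.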
Summary
Activation function choice isn’t a minor implementation detail — it directly determines whether gradients flow healthily through a deep network or vanish into uselessness. ReLU remains the sensible default for most hidden layers, GELU is the standard for transformer-based architectures, and sigmoid/softmax remain correctly reserved for output layers where their probabilistic interpretation is actually needed.