Activation Functions
Without activation functions, stacking multiple linear layers still produces a linear transformation — no matter how deep the network. Activation functions introduce nonlinearity at each layer, enabling neural networks to approximate complex functions. Choosing the right activation function affects training speed, gradient flow, and final performance.
Why Nonlinearity Matters
Linear composition:
W₃(W₂(W₁x)) = (W₃W₂W₁)x = Wx
→ Equivalent to a single linear layer, regardless of depth
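A quick numeric check of this collapse, as a minimal sketch in plain Python. The 2×2 integer matrices and the helper names `matmul` and `matvec` are illustrative assumptions, not part of the original text.

```python
# Illustrative 2x2 integer matrices (arbitrary values chosen for this sketch).
W1 = [[1, 2], [3, 4]]
W2 = [[0, 1], [1, 1]]
W3 = [[2, 0], [1, 3]]
x = [5, -7]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, v):
    """Multiply a matrix by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


# Three stacked linear layers applied one after another.
stacked = matvec(W3, matvec(W2, matvec(W1, x)))

# The same map expressed as one matrix W = W3 W2 W1.
W = matmul(W3, matmul(W2, W1))
collapsed = matvec(W, x)

print(stacked == collapsed)  # integer arithmetic, so the comparison is exact
```

Because the arithmetic is exact, the two results match element for element: the extra layers add parameters but no expressive power.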
With activation function:
W₂(ReLU(W₁x)) ≠ Wx
→ Now the network can represent nonlinear boundaries
Common Activation Functions
ReLU (Rectified Linear Unit)
The default choice for hidden layers in most modern networks:
f(x) = max(0, x)
x < 0: output = 0 (neuron is inactive)
x ≥ 0: output = x (linear pass-through)
import torch
import torch.nn as nn
nn.ReLU()       # as a module
# or
torch.relu(x)   # as a function
Pros: Simple, fast, doesn’t saturate for positive inputs, sparse activations
Cons: Dying ReLU problem. A neuron that outputs 0 receives zero gradient, so one whose input is negative for every example can get stuck permanently.
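The zero-gradient behaviour behind this drawback can be seen directly with autograd. This is a minimal sketch, assuming PyTorch is installed; the input values are arbitrary.

```python
import torch

# Arbitrary inputs: two negative, two positive.
x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)

y = torch.relu(x)
y.sum().backward()  # d(sum)/dx_i is the ReLU derivative at x_i

print(y.detach().tolist())  # ReLU outputs
print(x.grad.tolist())      # gradient of the sum with respect to each input
```

The negative inputs produce both zero output and zero gradient, so for those inputs no learning signal reaches the weights that feed the unit.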
Leaky ReLU
Fixes dying ReLU by allowing a small gradient for negative inputs:
f(x) = x if x > 0
f(x) = 0.01x if x ≤ 0
nn.LeakyReLU(negative_slope=0.01)
ELU (Exponential Linear Unit)
Smooth at x = 0 when α = 1; its negative outputs push mean activations toward zero:
f(x) = x if x > 0
f(x) = α(eˣ − 1) if x ≤ 0
nn.ELU(alpha=1.0)
GELU (Gaussian Error Linear Unit)
The activation of choice in modern Transformers (BERT, GPT, ViT):
f(x) = x × Φ(x), where Φ is the standard Gaussian CDF
f(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
nn.GELU()  # exact form by default; nn.GELU(approximate='tanh') uses the tanh approximation
GELU is smooth everywhere, non-monotonic for negative values, and is often reported to outperform ReLU in language model pretraining.
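To see how close the tanh approximation is to the exact definition, and what "non-monotonic for negative values" looks like numerically, here is a small sketch using only the standard library. The function names and the evaluation grid are assumptions made for this example.

```python
import math


def gelu_exact(x):
    """x * Phi(x), with the standard normal CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """The tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Evaluation grid from -5 to 5 in steps of 0.01 (an arbitrary choice).
grid = [i / 100.0 for i in range(-500, 501)]

max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"largest gap on the grid: {max_gap:.2e}")

# Non-monotonic on the negative side: the curve dips below zero and then
# climbs back toward zero as x becomes more negative.
print(gelu_exact(-3.0), gelu_exact(-0.75), gelu_exact(0.0))
```

On this grid the two forms stay very close, and the value at x = -0.75 is lower than the value at x = -3, which is the dip that makes the function non-monotonic.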
Swish / SiLU
f(x) = x × sigmoid(x)
nn.SiLU()  # SiLU = Swish
Used in EfficientNet and modern vision models. Similar to GELU in practice.
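The similarity to GELU can be checked numerically. The sketch below uses only the standard library; the helper names and the grid are assumptions for illustration, and it compares curve shapes, not training results.

```python
import math


def silu(x):
    """x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x):
    """x * Phi(x), with the standard normal CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Evaluation grid from -6 to 6 in steps of 0.01 (an arbitrary choice).
grid = [i / 100.0 for i in range(-600, 601)]

# Both curves dip slightly below zero for negative inputs...
silu_min = min(silu(x) for x in grid)
gelu_min = min(gelu(x) for x in grid)
print(f"min SiLU on grid: {silu_min:.3f}, min GELU on grid: {gelu_min:.3f}")

# ...and both approach the identity for large positive inputs.
print(f"at x = 6: SiLU = {silu(6.0):.3f}, GELU = {gelu(6.0):.3f}")
```

Both functions are smooth, have a shallow negative dip, and behave like the identity for large positive inputs; SiLU's dip is somewhat deeper and wider.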
Sigmoid
f(x) = 1 / (1 + e⁻ˣ)
Output range: (0, 1)
nn.Sigmoid()
Use for: Output layers that need values in (0, 1), such as binary or multi-label classification. Avoid in hidden layers: it saturates for large |x| and causes vanishing gradients.
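The saturation argument can be made concrete with the sigmoid derivative, σ'(x) = σ(x)(1 − σ(x)). This is a minimal standard-library sketch; the sample points and the 10-layer depth are arbitrary choices, and the product ignores the weight matrices that also appear in a real backward pass.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:>4}: sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Backpropagating through n sigmoid layers multiplies n such factors.
# Even in the best case (every unit at x = 0) the product decays as 0.25**n.
best_case_10_layers = sigmoid_grad(0.0) ** 10
print(f"best-case gradient factor through 10 layers: {best_case_10_layers:.2e}")
```

Since the derivative never exceeds 0.25, each sigmoid layer can only shrink the activation part of the gradient, which is why deep stacks of sigmoid layers train poorly.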
Tanh
f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
Output range: (-1, 1)
nn.Tanh()
Use for: LSTM/GRU cell state (built-in), some output layers where a (-1, 1) range is needed. Like sigmoid, avoid in deep hidden layers.
Choosing the Output Activation
| Task | Output Activation | Loss Function |
|---|---|---|
| Binary classification | None during training (sigmoid at inference) | BCEWithLogitsLoss |
| Multi-class classification | None during training (softmax at inference) | CrossEntropyLoss |
| Regression | None (linear) | MSELoss / HuberLoss |
| Multi-label classification | None during training (sigmoid at inference) | BCEWithLogitsLoss |
| Output in [0, 1] | Sigmoid | MSELoss |
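Note: nn.CrossEntropyLoss in PyTorch applies log-softmax internally, so pass it raw logits and don’t add softmax before it. nn.BCEWithLogitsLoss likewise applies sigmoid internally, so the model should output raw logits and sigmoid is applied only when probabilities are needed at inference.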
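The note about built-in activations can be verified with a few lines. This is a sketch under stated assumptions: PyTorch is installed, and the random logits and the targets are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw model outputs: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

ce = nn.CrossEntropyLoss()

# Correct usage: raw logits go straight into the loss.
loss_logits = ce(logits, targets)

# Equivalent formulation: log-softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Mistake: softmax before the loss, so softmax is effectively applied twice.
loss_double = ce(F.softmax(logits, dim=1), targets)

print(loss_logits.item(), loss_manual.item(), loss_double.item())

# The binary case works the same way: BCEWithLogitsLoss expects raw logits.
bin_logits = torch.randn(4)
bin_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])
loss_bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
loss_bce_manual = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)
print(loss_bce_logits.item(), loss_bce_manual.item())
```

The first two multi-class values agree, while the double-softmax value differs. The extra softmax squeezes the inputs into [0, 1], which flattens the predicted distribution and weakens the gradient signal without raising any error.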
The Dying ReLU Problem
When a neuron’s input is negative for every example, ReLU always outputs 0, the gradient through it is always 0, and its weights never update: the neuron is permanently dead.
Causes:
- Large negative biases after initialization
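- Very high learning rates, where a single large update can push a neuron’s weights into a region where its input is negative for every example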
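One way to spot dead units in practice is to check which units output zero for every sample in a batch. The sketch below assumes PyTorch is installed and forces the situation with hand-picked biases; the layer sizes, bias values, and batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 3)
with torch.no_grad():
    # Hypothetical biases chosen for illustration: unit 0 gets a large
    # negative bias, unit 1 a neutral one, unit 2 a large positive one.
    layer.bias.copy_(torch.tensor([-100.0, 0.0, 100.0]))

x = torch.randn(256, 4)          # a batch of random inputs
out = torch.relu(layer(x))

# A unit is "dead" on this batch if it outputs 0 for every sample.
dead = (out == 0).all(dim=0)
print("dead units:", dead.tolist())

# Dead units receive no gradient, so their weights cannot recover.
out.sum().backward()
print("weight-gradient norm per unit:", layer.weight.grad.norm(dim=1).tolist())
```

Unit 0 is flagged as dead and its weight and bias gradients are exactly zero, while the other two units still receive gradient. Running the same check on real activations during training gives a rough estimate of how many units have died.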
Fixes:
# Option 1: Leaky ReLU
nn.LeakyReLU(0.01)
# Option 2: ELU (smooth negative region)
nn.ELU(alpha=1.0)
# Option 3: Better initialization
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)
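# Option 4: Lower learning rate + gradient clipping
# (model is a placeholder and max_norm=1.0 an illustrative value; call after loss.backward())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Activation Selection Guide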
Hidden layers in CNN/MLP: → ReLU (fast default)
                            Leaky ReLU (if dying ReLU is a problem)
Hidden layers in Transformer: → GELU
Vision models (EfficientNet): → SiLU/Swish
LSTM/GRU internals: → Tanh + Sigmoid (built-in, don't change)
Output (binary): → None (use BCEWithLogitsLoss)
Output (multi-class): → None (use CrossEntropyLoss)
GELU and SiLU have become increasingly popular as they often outperform plain ReLU in large-scale pretraining. For new architectures, start with ReLU for simplicity and switch to GELU/SiLU if you see performance benefits.
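To make "start with ReLU, then switch" cheap to try, the activation can be a constructor argument. The sketch below is one possible arrangement, not a prescribed pattern: `ACTIVATIONS`, `make_mlp`, and the layer sizes are hypothetical names and values introduced for this example, and PyTorch is assumed to be installed.

```python
import torch
import torch.nn as nn

# Hypothetical lookup table for this sketch; the keys are arbitrary labels.
ACTIVATIONS = {
    "relu": nn.ReLU,
    "leaky_relu": nn.LeakyReLU,
    "gelu": nn.GELU,
    "silu": nn.SiLU,
}


def make_mlp(in_dim, hidden_dim, out_dim, activation="relu"):
    """Two hidden layers with the chosen activation and a raw-logit output."""
    act = ACTIVATIONS[activation]
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        act(),
        nn.Linear(hidden_dim, hidden_dim),
        act(),
        nn.Linear(hidden_dim, out_dim),  # no output activation: pair with CrossEntropyLoss
    )


baseline = make_mlp(16, 32, 3)                      # ReLU default
candidate = make_mlp(16, 32, 3, activation="gelu")  # one-argument swap

x = torch.randn(8, 16)
print(baseline(x).shape, candidate(x).shape)
```

With the activation isolated behind one argument, comparing ReLU against GELU or SiLU becomes a matter of rerunning the same training script with a different setting.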