Activation Functions: Nonlinearity in Neural Networks

Learn activation functions — ReLU, GELU, Sigmoid, Tanh, Swish, choosing activations for hidden vs output layers, and the dying ReLU problem with fixes.


Without activation functions, stacking multiple linear layers still produces a linear transformation — no matter how deep the network. Activation functions introduce nonlinearity at each layer, enabling neural networks to approximate complex functions. Choosing the right activation function affects training speed, gradient flow, and final performance.


Why Nonlinearity Matters

Linear composition: W₃(W₂(W₁x)) = (W₃W₂W₁)x = Wx
→ Equivalent to a single linear layer, regardless of depth
With activation function:
W₂(ReLU(W₁x)) cannot, in general, be written as Wx for any single matrix W
→ Now the network can represent nonlinear boundaries
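Both claims can be checked numerically. The sketch below uses plain Python lists and arbitrary 2×2 integer matrices; the matrix values and the helper names (matvec, matmul, relu, net) are illustrative assumptions, not library functions. It first confirms that three stacked linear maps equal one collapsed matrix, then shows that inserting ReLU breaks additivity, a property every linear map must satisfy.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Elementwise max(0, x)."""
    return [max(0, vi) for vi in v]

# Arbitrary illustrative weights (no biases, as in the formulas above)
W1 = [[1, -2], [3, 1]]
W2 = [[2, 0], [-1, 1]]
W3 = [[1, 1], [0, -1]]
x = [1, 2]

# Three stacked linear layers vs. one pre-multiplied matrix
stacked = matvec(W3, matvec(W2, matvec(W1, x)))
collapsed = matvec(matmul(W3, matmul(W2, W1)), x)
linear_collapses = stacked == collapsed

# With ReLU in between, the network is no longer additive
def net(v):
    return matvec(W2, relu(matvec(W1, v)))

a, b = [1, 0], [0, 1]
sum_of_outputs = [p + q for p, q in zip(net(a), net(b))]
output_of_sum = net([p + q for p, q in zip(a, b)])
relu_breaks_additivity = sum_of_outputs != output_of_sum
```

Any linear map f satisfies f(a + b) = f(a) + f(b), so a single counterexample is enough to show that the ReLU network is not equivalent to any one matrix W.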

Common Activation Functions

ReLU (Rectified Linear Unit)

The default choice for hidden layers in most modern networks:

f(x) = max(0, x)
x < 0: output = 0 (neuron is inactive)
x ≥ 0: output = x (linear pass-through)
nn.ReLU()
# or
torch.relu(x)

Pros: Simple, fast, doesn’t saturate for positive inputs, sparse activations
Cons: Dying ReLU problem. A neuron whose input is negative outputs 0 and receives zero gradient, so a neuron that is inactive for every input can get stuck permanently.

Leaky ReLU

Fixes dying ReLU by allowing a small gradient for negative inputs:

f(x) = x if x > 0
f(x) = 0.01x if x ≤ 0
nn.LeakyReLU(negative_slope=0.01)
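To see the difference in gradient flow, the following sketch runs autograd on one negative and one positive input. The input values are arbitrary, and the functional forms F.relu and F.leaky_relu are used in place of the module forms only for brevity.

```python
import torch
import torch.nn.functional as F

# One negative and one positive input (arbitrary illustrative values)
x = torch.tensor([-2.0, 3.0], requires_grad=True)

# ReLU: the gradient is 0 for the negative input
F.relu(x).sum().backward()
relu_grad = x.grad.clone()

# Reset the accumulated gradient before the second backward pass
x.grad.zero_()

# Leaky ReLU: the negative input still receives a gradient of negative_slope
F.leaky_relu(x, negative_slope=0.01).sum().backward()
leaky_grad = x.grad.clone()
```

Here relu_grad is zero for the negative input, while leaky_grad carries the small slope 0.01, which is what lets an inactive neuron keep learning.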

ELU (Exponential Linear Unit)

Smooth at x = 0 when α = 1, and its negative outputs push mean activations toward zero:

f(x) = x if x > 0
f(x) = α(eˣ - 1) if x ≤ 0
nn.ELU(alpha=1.0)
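As a rough illustration, here is a scalar sketch of ELU and its derivative in plain Python, assuming the standard definition (x for positive inputs, α(eˣ - 1) otherwise). The function names elu and elu_grad are hypothetical helpers, not PyTorch APIs.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Derivative of ELU, using the left-hand branch at x = 0."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# With alpha = 1 the left-hand slope at 0 equals the right-hand slope (1.0)
slope_at_zero = elu_grad(0.0)

# With alpha = 0.5 the slopes no longer match, so there is a kink at 0
slope_at_zero_half_alpha = elu_grad(0.0, alpha=0.5)

# Large negative inputs saturate toward -alpha instead of being cut to 0
saturated = elu(-20.0)
```

Unlike Leaky ReLU, the negative branch is bounded: outputs approach -α, which limits how far a strongly negative input can pull the activation.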

GELU (Gaussian Error Linear Unit)

The activation of choice in modern Transformers (BERT, GPT, ViT):

f(x) = x × Φ(x) where Φ is the Gaussian CDF
≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
nn.GELU() # exact form by default; nn.GELU(approximate='tanh') uses the tanh approximation

GELU is smooth everywhere and non-monotonic for negative inputs: it dips slightly below zero before returning toward zero. It is often reported to outperform ReLU in language model pretraining.
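The two GELU formulas above can be compared directly. This is a plain-Python sketch using math.erf for the Gaussian CDF; the function names gelu_exact and gelu_tanh and the evaluation grid are illustrative choices.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the Gaussian CDF written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute gap between the two forms on a grid over [-5, 5]
xs = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

# Non-monotonic region: the output at -1 is lower than at both -3 and 0
dips_below = gelu_exact(-1.0) < gelu_exact(-3.0) < 0.0
```

The gap between the exact and approximate forms stays small across the grid, which is why the tanh version is a common drop-in when erf is expensive or unavailable.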

Swish / SiLU

f(x) = x × sigmoid(x)
nn.SiLU() # SiLU = Swish with β = 1

Used in EfficientNet and modern vision models. Similar to GELU in practice.
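To make the comparison with GELU concrete, here is a plain-Python sketch that evaluates both functions on the same grid. The helper names silu and gelu and the grid are assumptions made for illustration.

```python
import math

def silu(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GELU: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

xs = [i / 10 for i in range(-50, 51)]

# Both curves approach x for large positive inputs and 0 for large negative ones
silu_far_right = silu(10.0)
silu_far_left = silu(-10.0)

# Both dip below zero for moderately negative inputs
silu_dip = silu(-1.0)
gelu_dip = gelu(-1.0)

# Largest pointwise difference between the two on the grid
max_gap = max(abs(silu(x) - gelu(x)) for x in xs)
```

The two functions share the same overall shape, and on this grid their outputs differ by less than 0.2, with SiLU making the transition from 0 to x more gradually.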

Sigmoid

f(x) = 1 / (1 + e⁻ˣ)
Output range: (0, 1)
nn.Sigmoid()

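Use for: Output layers that need a value in (0, 1), such as binary or multi-label classification, and the gates inside LSTM/GRU cells. Avoid in hidden layers: it saturates for large |x| and causes vanishing gradients.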
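The saturation argument can be quantified with the sigmoid derivative, σ'(x) = σ(x)(1 - σ(x)). The plain-Python sketch below uses hypothetical helper names and treats a 10-layer chain purely as an illustration; it ignores the weight matrices, which also scale the gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 with value 0.25
peak = sigmoid_grad(0.0)

# For a saturated input the derivative is nearly zero
saturated = sigmoid_grad(10.0)

# Best case for the activation factors alone across 10 sigmoid layers
upper_bound_10_layers = peak ** 10
```

Even in the best case each sigmoid layer multiplies the backpropagated gradient by at most 0.25, so ten such factors already shrink it below one millionth.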

Tanh

f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
Output range: (-1, 1)
nn.Tanh()

Use for: LSTM/GRU internals such as the candidate cell state (built-in), and some output layers where a (-1, 1) range is needed. Like sigmoid, it saturates, so avoid it in deep hidden layers.


Choosing the Output Activation

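Task | Output Activation | Loss Function
Binary classification | Sigmoid (applied inside the loss; model outputs logits) | BCEWithLogitsLoss
Multi-class classification | Softmax (applied inside the loss; model outputs logits) | CrossEntropyLoss
Regression | None (linear) | MSELoss / HuberLoss
Multi-label classification | Sigmoid (applied inside the loss; model outputs logits) | BCEWithLogitsLoss
Output in [0, 1] | Sigmoid | MSELoss

Note: In PyTorch, nn.CrossEntropyLoss applies log-softmax internally and nn.BCEWithLogitsLoss applies sigmoid internally, so pass raw logits to both and don’t add softmax or sigmoid before them. Apply softmax or sigmoid yourself only at inference time, when you need probabilities.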
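The note about losses that work on raw logits can be verified directly. In this sketch the batch size, class count, and random inputs are arbitrary; it compares each logits-based loss with the equivalent two-step computation.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class: 4 samples, 3 classes, raw logits straight from the model
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

ce_from_logits = nn.CrossEntropyLoss()(logits, targets)
# Same result when log-softmax is applied by hand and followed by NLL loss
ce_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Binary: 4 samples, one raw logit each, float targets
binary_logits = torch.randn(4)
binary_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])

bce_from_logits = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
# Same result when sigmoid is applied by hand and followed by plain BCE
bce_manual = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)
```

Since the normalization already happens inside these losses, adding an extra softmax or sigmoid layer in the model would apply it twice and distort the loss.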


The Dying ReLU Problem

When a neuron’s pre-activation is negative for every input in the training data, ReLU always outputs 0, the gradient through it is always 0, and its incoming weights never update. The neuron is effectively dead and rarely recovers.

Causes:

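  • Large negative biases after initialization
  • Very high learning rates, where one large update can push the weights so far that the pre-activation is negative for all inputs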
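A dead unit can be reproduced and detected in a few lines. The sketch below forces one unit of a small nn.Linear layer to be dead by setting a large negative bias; the layer sizes, the bias value of -100, and the batch size are arbitrary choices made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(4, 3)
with torch.no_grad():
    # Large negative bias: unit 0's pre-activation stays negative on this batch
    layer.bias[0] = -100.0

x = torch.randn(64, 4)
out = torch.relu(layer(x))
out.sum().backward()

# A unit is flagged as dead on this batch if it outputs 0 for every sample
dead_mask = (out == 0).all(dim=0)

# Gradients for the dead unit's weights and bias are exactly zero
weight_grad_unit0 = layer.weight.grad[0]
bias_grad_unit0 = layer.bias.grad[0]
```

Tracking a mask like dead_mask over several batches during training is one simple way to check whether dying ReLU is actually occurring before switching activations.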

Fixes:

from torch import nn

# Option 1: Leaky ReLU
nn.LeakyReLU(0.01)
# Option 2: ELU, with a smooth negative region
nn.ELU(alpha=1.0)
# Option 3: Better initialization
layer = nn.Linear(128, 64)  # illustrative layer; the sizes are arbitrary
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)
# Option 4: Lower learning rate + gradient clipping
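Option 4 has no code above, so here is a minimal sketch of one training step with a modest learning rate and gradient-norm clipping. The model shape, lr=1e-3, max_norm=1.0, and the random data are all assumptions chosen for illustration, not recommended values.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small illustrative model and random regression data
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 8), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale gradients in place so their combined L2 norm is at most max_norm;
# the return value is the total norm measured before clipping
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the norm after clipping to confirm the bound
clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
```

Clipping bounds the size of any single update, which reduces the chance that one oversized step pushes many units into the permanently negative region.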

Activation Selection Guide

Hidden layers in CNN/MLP: → ReLU (fast default), or Leaky ReLU if dying ReLU is a problem
Hidden layers in Transformer: → GELU
Vision models (EfficientNet): → SiLU/Swish
LSTM/GRU internals: → Tanh + Sigmoid (built-in, don't change)
Output (binary): → None (use BCEWithLogitsLoss)
Output (multi-class): → None (use CrossEntropyLoss)

GELU and SiLU have become increasingly popular, and they are often reported to match or outperform plain ReLU in large-scale pretraining. For new architectures, start with ReLU for simplicity and switch to GELU/SiLU if they improve validation performance on your task.
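One way to make that switch cheap is to keep the activation as a configurable argument. The sketch below is a hypothetical helper (make_mlp and the ACTIVATIONS mapping are not PyTorch APIs) that builds a small MLP with a swappable hidden activation and no output activation, so it can be paired with a logits-based loss.

```python
import torch
from torch import nn

# Hypothetical name-to-class mapping for the hidden-layer activation
ACTIVATIONS = {
    "relu": nn.ReLU,
    "leaky_relu": nn.LeakyReLU,
    "gelu": nn.GELU,
    "silu": nn.SiLU,
}

def make_mlp(in_dim, hidden_dim, out_dim, activation="relu"):
    """Build a two-layer MLP whose hidden activation is chosen by name."""
    act_cls = ACTIVATIONS[activation]
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        act_cls(),
        nn.Linear(hidden_dim, out_dim),  # raw logits; no output activation
    )

# Usage: same architecture, different hidden activation
baseline = make_mlp(8, 16, 3, activation="relu")
candidate = make_mlp(8, 16, 3, activation="gelu")
logits = candidate(torch.randn(2, 8))
```

With the activation isolated like this, comparing ReLU against GELU or SiLU becomes a one-argument change in an otherwise identical training run.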