Activation Functions

Without activation functions, a stack of linear layers still computes a single linear transformation, no matter how deep the network is. Activation functions introduce nonlinearity at each layer, which is what lets neural networks approximate complex functions. The choice of activation function affects training speed, gradient flow, and final performance.

Why Nonlinearity Matters

Linear composition: W₃(W₂(W₁x)) = (W₃W₂W₁)x = Wx
→ Equivalent to a single linear layer, regardless of depth

With activation function:
W₂(ReLU(W₁x)) ≠ Wx for any single matrix W (in general)
→ Now the network can represent nonlinear boundaries
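The collapse described above can be checked numerically. The following is a minimal sketch with small hand-picked matrices (the values are arbitrary and chosen only so the arithmetic is exact); it tests the property that any linear map must satisfy, f(x) + f(-x) = 0.

```python
import torch

# Small bias-free weight matrices; the numbers are illustrative only.
W1 = torch.tensor([[1.0, -1.0], [0.0, 2.0]])
W2 = torch.tensor([[2.0, 0.0], [1.0, 1.0]])
W3 = torch.tensor([[1.0, 1.0]])
x = torch.tensor([3.0, -1.0])

# Three stacked linear layers equal one merged matrix W = W3 W2 W1.
W = W3 @ W2 @ W1
stacked = W3 @ (W2 @ (W1 @ x))
linear_collapses = torch.allclose(stacked, W @ x)

# With ReLU in between, the map stops being linear.
def f(v):
    return W2 @ torch.relu(W1 @ v)

# A linear map would give f(x) + f(-x) == 0; this one does not.
relu_is_linear = torch.allclose(f(x) + f(-x), torch.zeros(2))

print(linear_collapses, relu_is_linear)
```

Here the purely linear stack matches the merged matrix, while the ReLU version fails the linearity check.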

Common Activation Functions

ReLU (Rectified Linear Unit)

The default choice for hidden layers in most modern networks:

f(x) = max(0, x)

x < 0: output = 0 (neuron is inactive)
x ≥ 0: output = x (linear pass-through)

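import torch
import torch.nn as nn

x = torch.tensor([-1.0, 0.0, 2.0])  # example input
nn.ReLU()(x)   # module form, for use inside nn.Sequential
torch.relu(x)  # functional form, same result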

Pros: Simple, fast, doesn’t saturate for positive inputs, sparse activations
Cons: Dying ReLU problem. A neuron whose pre-activation is negative for every input outputs 0, receives zero gradient, and can get stuck permanently.
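To make "sparse activations" concrete, here is a small sketch that assumes zero-mean, unit-variance pre-activations (an assumption for illustration, not a property of any particular network). Under that assumption roughly half of the outputs are exactly zero, and the positive inputs pass through unchanged.

```python
import torch

torch.manual_seed(0)                     # fixed seed, only for repeatability
pre_activations = torch.randn(10_000)    # assumed standard-normal inputs
out = torch.relu(pre_activations)

# Fraction of outputs that are exactly zero (about one half for this input).
zero_fraction = (out == 0).float().mean().item()

# Positive inputs are passed through untouched.
positive_unchanged = torch.equal(out[out > 0], pre_activations[pre_activations > 0])

print(round(zero_fraction, 2), positive_unchanged)
```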

Leaky ReLU

Fixes dying ReLU by allowing a small gradient for negative inputs:

f(x) = x     if x > 0
f(x) = 0.01x if x ≤ 0

nn.LeakyReLU(negative_slope=0.01)
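The difference from ReLU shows up in the gradient rather than in the output. A minimal autograd sketch, using two arbitrary input values:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 3.0], requires_grad=True)

# ReLU: the negative input gets a gradient of exactly 0.
nn.ReLU()(x).sum().backward()
relu_grad = x.grad.clone()

# Leaky ReLU: the negative input gets the small slope instead.
x.grad = None
nn.LeakyReLU(negative_slope=0.01)(x).sum().backward()
leaky_grad = x.grad.clone()

print(relu_grad, leaky_grad)
```

Because the negative-side gradient is small but nonzero, a unit that has drifted into the negative region can still be updated.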

ELU (Exponential Linear Unit)

Smooth at x=0 (when alpha = 1); negative outputs saturate at -alpha, which pushes mean activations toward zero:

f(x) = x           if x > 0
f(x) = α(eˣ - 1)   if x ≤ 0

nn.ELU(alpha=1.0)
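For reference, ELU is commonly defined as x for positive inputs and alpha * (exp(x) - 1) otherwise. The sketch below is a plain-Python version of that definition (the helper name `elu` is made up here), used to check the two properties mentioned: matching slopes on both sides of zero when alpha = 1, and saturation at -alpha for very negative inputs.

```python
import math

def elu(x, alpha=1.0):
    """ELU as commonly defined: x if x > 0, else alpha * (exp(x) - 1)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

h = 1e-6
right_slope = (elu(h) - elu(0.0)) / h    # slope just right of zero
left_slope = (elu(0.0) - elu(-h)) / h    # slope just left of zero (near alpha)
floor = elu(-20.0)                       # approaches -alpha for very negative x

print(right_slope, left_slope, floor)
```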

GELU (Gaussian Error Linear Unit)

The activation of choice in modern Transformers (BERT, GPT, ViT):

f(x) = x × Φ(x)  where Φ is the Gaussian CDF
     ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))

nn.GELU()

GELU is smooth everywhere, non-monotonic for negative inputs (it dips slightly below zero before returning toward 0), and is often reported to outperform ReLU in language model pretraining.
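The two formulas above can be compared directly. This is a plain-Python sketch (the function names are made up for illustration) that evaluates the exact form via the error function and the tanh approximation on a grid from -5 to 5, and also checks the non-monotonic dip for negative inputs.

```python
import math

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the formula above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

# Non-monotonic: the value at -1 is lower than at both -3 and 0.
dip = gelu_exact(-1.0) < gelu_exact(-3.0) < gelu_exact(0.0) + 1e-12

print(max_gap, dip)
```

On this grid the two versions stay very close, which is why the approximation is a common drop-in.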

Swish / SiLU

f(x) = x × sigmoid(x)

nn.SiLU()  # SiLU = Swish

Used in EfficientNet and modern vision models. Similar to GELU in practice.
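A quick check that `nn.SiLU` matches the formula above, plus a rough look at how far it sits from GELU. The grid of inputs is arbitrary, so the printed gap is only indicative.

```python
import torch
import torch.nn as nn

x = torch.linspace(-4.0, 4.0, steps=9)   # the integers -4 .. 4

manual = x * torch.sigmoid(x)            # formula written out
builtin = nn.SiLU()(x)
matches = torch.allclose(manual, builtin)

# Largest absolute difference between SiLU and GELU on this grid.
gap_to_gelu = (builtin - nn.GELU()(x)).abs().max().item()

print(matches, gap_to_gelu)
```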

Sigmoid

f(x) = 1 / (1 + e⁻ˣ)
Output range: (0, 1)

nn.Sigmoid()

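Use for: Outputs that must lie in (0, 1), such as binary or multi-label classification probabilities, and the gates inside LSTM/GRU cells (built-in). Avoid in hidden layers: it saturates for large |x| and causes vanishing gradients.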
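The saturation argument can be made concrete with the sigmoid derivative, σ(x)(1 - σ(x)). The sketch below is plain Python; the 10-layer figure is a simplified upper bound that ignores the weight matrices and only multiplies the activation derivatives.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)         # the derivative is largest at x = 0
saturated = sigmoid_grad(10.0)   # far from 0 the derivative is nearly gone

# Simplified bound: product of the best-case derivative across 10 layers.
bound_10_layers = peak ** 10

print(peak, saturated, bound_10_layers)
```

Even in the best case the derivative is 0.25, so the activation alone shrinks the gradient at every sigmoid layer it passes through.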

Tanh

f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
Output range: (-1, 1)

nn.Tanh()

Use for: LSTM/GRU candidate-state computations (built-in), and some output layers where a (-1, 1) range is needed. Like sigmoid, it saturates, so avoid it in deep hidden layers.

Choosing the Output Activation

Task	Output Activation	Loss Function
Binary classification	None during training (sigmoid at inference)	BCEWithLogitsLoss
Multi-class classification	None during training (softmax at inference)	CrossEntropyLoss
Regression	None (linear)	MSELoss / HuberLoss
Multi-label classification	None during training (sigmoid at inference)	BCEWithLogitsLoss
Output in [0, 1]	Sigmoid	MSELoss

Note: nn.CrossEntropyLoss in PyTorch applies log-softmax internally, and nn.BCEWithLogitsLoss applies sigmoid internally. Pass raw logits to both; don't add softmax or sigmoid before them.
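The following sketch checks this behavior on a few made-up logits and targets. It compares `nn.CrossEntropyLoss` with an explicit log-softmax followed by NLL loss, shows that applying softmax first changes the result, and does the analogous comparison for `nn.BCEWithLogitsLoss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Multi-class: raw logits for 2 samples and 3 classes (illustrative values).
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
target = torch.tensor([0, 2])

ce = nn.CrossEntropyLoss()(logits, target)
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Mistake: softmax applied before the loss, so it is effectively applied twice.
double_softmax = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), target)

# Binary: BCEWithLogitsLoss on logits equals BCELoss on sigmoid(logits).
bin_logits = torch.tensor([1.5, -0.3, 0.0])
bin_target = torch.tensor([1.0, 0.0, 1.0])
bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_target)
bce_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_target)

print(ce.item(), manual.item(), double_softmax.item())
print(bce_logits.item(), bce_probs.item())
```

The double-softmax variant runs without any error, which is what makes this mistake easy to miss.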

The Dying ReLU Problem

When a neuron's pre-activation is negative for every input, ReLU always outputs 0, so the gradient flowing through it is always 0 and its weights never update. The neuron is effectively dead and usually stays that way.
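A dead unit can be reproduced and detected in a few lines. This sketch builds a two-unit layer with hand-set parameters (the weights and the -100 bias are contrived purely to force one unit dead) and checks both the output and the gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)
with torch.no_grad():
    layer.weight.fill_(0.1)
    # Unit 0 gets a huge negative bias so its pre-activation is always < 0.
    layer.bias.copy_(torch.tensor([-100.0, 0.5]))

x = torch.rand(16, 3)              # batch of 16 inputs in [0, 1)
out = torch.relu(layer(x))
out.sum().backward()

# A unit is flagged dead if it outputs 0 for every sample in the batch.
dead_mask = (out == 0).all(dim=0)

# The dead unit receives no gradient on its weights or its bias.
unit0_weight_grad = layer.weight.grad[0]
unit0_bias_grad = layer.bias.grad[0].item()

print(dead_mask, unit0_weight_grad, unit0_bias_grad)
```

Tracking the dead fraction over a validation batch during training is one simple way to tell whether this is actually happening in a network.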

Causes:

Large negative biases after initialization
Very high learning rates (one oversized update can push a neuron's pre-activation negative for all inputs)

Fixes:

# Option 1: Leaky ReLU
nn.LeakyReLU(0.01)

# Option 2: ELU — smooth negative region
nn.ELU(alpha=1.0)

# Option 3: Better initialization
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

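# Option 4: Lower learning rate + gradient clipping
# (model is a placeholder for your nn.Module; max_norm=1.0 is an illustrative value)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # call between backward() and optimizer.step()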
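To see what Option 3 does in practice, the sketch below initializes one layer and inspects the result. The layer size (512) and the standard-normal input batch are arbitrary choices for illustration. With `nonlinearity='relu'` and `mode='fan_in'`, the weight standard deviation should land near sqrt(2 / fan_in).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 512)        # illustrative size
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

expected_std = math.sqrt(2.0 / 512)        # gain sqrt(2) divided by sqrt(fan_in)
actual_std = layer.weight.std().item()

# With zero biases and standard-normal inputs, no unit should be dead at init.
x = torch.randn(1024, 512)
dead_fraction = (torch.relu(layer(x)) == 0).all(dim=0).float().mean().item()

print(expected_std, actual_std, dead_fraction)
```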

Activation Selection Guide

Hidden layers in CNN/MLP:     → ReLU (fast default)
                                  Leaky ReLU (if dying ReLU is a problem)
Hidden layers in Transformer: → GELU
Vision models (EfficientNet): → SiLU/Swish
LSTM/GRU internals:           → Tanh + Sigmoid (built-in, don't change)
Output (binary):              → None (use BCEWithLogitsLoss)
Output (multi-class):         → None (use CrossEntropyLoss)
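As one way to apply the guide, here is a small sketch of a lookup table and model builder. The dictionary keys and the `make_mlp` helper are hypothetical names invented for this example, not part of PyTorch. The final layer has no activation, so the model emits raw logits for CrossEntropyLoss.

```python
import torch
import torch.nn as nn

# Hypothetical mapping that mirrors the guide above; the keys are made up.
HIDDEN_ACTIVATIONS = {
    "mlp": nn.ReLU,
    "cnn": nn.ReLU,
    "leaky": nn.LeakyReLU,
    "transformer": nn.GELU,
    "efficientnet": nn.SiLU,
}

def make_mlp(kind, in_dim, hidden_dim, out_dim):
    """Build a one-hidden-layer MLP with the activation chosen by `kind`."""
    act = HIDDEN_ACTIVATIONS[kind]
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        act(),
        nn.Linear(hidden_dim, out_dim),   # no output activation: raw logits
    )

model = make_mlp("transformer", in_dim=8, hidden_dim=16, out_dim=3)
logits = model(torch.randn(4, 8))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 0]))

print(logits.shape, loss.item())
```

Keeping the activation as a single swappable argument makes it cheap to try ReLU first and compare against GELU or SiLU later.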

GELU and SiLU have become increasingly popular because they often outperform plain ReLU in large-scale pretraining. For new architectures, start with ReLU for simplicity and switch to GELU/SiLU if your experiments show a benefit.