Weight Initialization: Why Xavier and He Init Prevent Training Failure

How weight initialization affects training stability, and why Xavier and He initialization exist to prevent vanishing and exploding gradients.

Before a neural network sees a single training example, its weights need to start somewhere, and the choice of starting values strongly affects whether training succeeds at all. Initialize weights too large, and activations explode through the layers. Too small, and signals vanish before reaching the output. Weight initialization is not a minor implementation detail; it is a principled decision with well-established solutions.


Why Not Just Initialize Everything to Zero

Initializing all weights to zero seems harmless, but it causes a specific, fatal problem: every neuron in a layer computes the same output, receives the same gradient, and therefore updates identically at every step. Nothing in training ever breaks this symmetry, so all neurons in a layer stay functionally redundant no matter how long training runs.

import numpy as np
# This never works -- every neuron stays identical throughout training
weights = np.zeros((100, 50))

Weights need to start with some randomness so different neurons can learn different, useful features — the question is what scale that randomness should be.
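
The symmetry argument above can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-layer tanh network trained with mean squared error on random toy data; the sizes, the data, and the helper name hidden_weight_grad are illustrative choices, not part of the original text. It compares the first-layer gradient under constant, zero, and random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))   # 8 samples, 4 input features (toy data)
y = rng.standard_normal((8, 1))   # toy regression targets

def hidden_weight_grad(W1, W2):
    """Gradient of the MSE loss w.r.t. W1 for a 2-layer tanh network."""
    h = np.tanh(X @ W1)                  # hidden activations, shape (8, 3)
    out = h @ W2                         # network output, shape (8, 1)
    d_out = 2 * (out - y) / len(X)       # dLoss/d_out for mean squared error
    d_h = (d_out @ W2.T) * (1 - h ** 2)  # backprop through tanh
    return X.T @ d_h                     # shape (4, 3): one column per hidden neuron

# Constant init: every hidden neuron receives the same gradient column
g_const = hidden_weight_grad(np.full((4, 3), 0.5), np.full((3, 1), 0.5))
# Zero init: in this setup the first-layer gradient is zero everywhere
g_zero = hidden_weight_grad(np.zeros((4, 3)), np.zeros((3, 1)))
# Random init: gradient columns differ, so neurons can diverge
g_rand = hidden_weight_grad(0.5 * rng.standard_normal((4, 3)),
                            0.5 * rng.standard_normal((3, 1)))

const_columns_identical = np.allclose(g_const, g_const[:, :1])
rand_columns_identical = np.allclose(g_rand, g_rand[:, :1])
print(const_columns_identical, np.all(g_zero == 0), rand_columns_identical)
```

If the gradient columns are identical, a gradient step moves every hidden neuron by the same amount, so the neurons remain copies of each other. Only the random case gives each neuron its own update direction.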


Why the Scale of Random Initialization Matters

Initializing weights with too large a variance causes activations to grow layer after layer, eventually saturating activation functions (pushing sigmoid/tanh outputs to their nearly flat extremes) or overflowing numerically (covered in Numerical Computation). Too small a variance causes activations to shrink toward zero as they pass through successive layers, so little signal is left by the time it reaches the later layers.

# Too large -- risk of exploding activations/gradients
weights_too_large = np.random.randn(784, 256) * 1.0
# Too small -- risk of vanishing activations/gradients
weights_too_small = np.random.randn(784, 256) * 0.001

Both extremes directly connect to the Vanishing Gradient Problem and Exploding Gradient Problem — poor initialization can trigger either failure mode before training even has a chance to make progress.
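
To make the two failure modes concrete, here is a rough simulation, assuming a stack of 20 tanh layers of width 256 fed with random inputs. The depth, width, and the helper name final_layer_stats are illustrative assumptions. It applies the two scales from the snippet above and reports what the activations look like at the last layer.

```python
import numpy as np

def final_layer_stats(scale, depth=20, width=256, batch=128, seed=0):
    """Push random inputs through `depth` tanh layers whose weights are scale * N(0, 1)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale
        a = np.tanh(a @ W)
    std = a.std()                           # spread of the final activations
    saturated = np.mean(np.abs(a) > 0.99)   # fraction pinned near tanh's flat ends
    return std, saturated

std_large, sat_large = final_layer_stats(scale=1.0)
std_small, sat_small = final_layer_stats(scale=0.001)
print(f"scale=1.0:   std={std_large:.3f}, saturated fraction={sat_large:.2f}")
print(f"scale=0.001: std={std_small:.3e}, saturated fraction={sat_small:.2f}")
```

With the large scale, most activations should sit in the saturated region where tanh's gradient is close to zero. With the small scale, the activations should collapse to values vanishingly close to zero.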


Xavier (Glorot) Initialization

Xavier initialization, designed specifically for sigmoid and tanh activations, sets the variance of the initial weights based on the number of input and output connections for that layer, keeping the variance of activations roughly consistent as signals pass through each layer.

import numpy as np

def xavier_init(n_inputs, n_outputs):
    # Uniform limit chosen so that Var(W) = 2 / (n_inputs + n_outputs)
    limit = np.sqrt(6 / (n_inputs + n_outputs))
    return np.random.uniform(-limit, limit, size=(n_inputs, n_outputs))

W1 = xavier_init(784, 256)

# In PyTorch, this is available directly
import torch.nn as nn
layer = nn.Linear(784, 256)
nn.init.xavier_uniform_(layer.weight)

The derivation behind Xavier initialization assumes the activation is roughly linear around zero. This holds well for tanh near its center and more loosely for sigmoid, whose output is centered at 0.5 with a slope of 0.25 there, which is why the scheme is associated with these saturating activations rather than with ReLU.
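
The constant 6 in the uniform limit follows from a standard fact: a uniform distribution on (-a, a) has variance a^2 / 3. Setting that equal to the Xavier target variance 2 / (n_inputs + n_outputs) gives a = sqrt(6 / (n_inputs + n_outputs)). The short check below is a sketch with illustrative variable names; it compares the target, the closed-form variance, and the variance of sampled weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_outputs = 784, 256

limit = np.sqrt(6 / (n_inputs + n_outputs))
W = rng.uniform(-limit, limit, size=(n_inputs, n_outputs))

target_var = 2 / (n_inputs + n_outputs)   # variance Xavier aims for
uniform_var = limit ** 2 / 3              # closed-form variance of U(-limit, limit)
empirical_var = W.var()                   # variance of the sampled weights
print(target_var, uniform_var, empirical_var)
```

The three numbers should agree closely, with the empirical value differing only by sampling noise.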


He Initialization: Designed Specifically for ReLU

He initialization, published a few years after Xavier, adjusts the variance formula specifically to account for ReLU’s behavior — since ReLU zeros out roughly half of its inputs (all negative values), the remaining variance needs to be scaled up to compensate.

import numpy as np

def he_init(n_inputs, n_outputs):
    # Std chosen so that Var(W) = 2 / n_inputs
    std = np.sqrt(2 / n_inputs)
    return np.random.randn(n_inputs, n_outputs) * std

W1 = he_init(784, 256)

# PyTorch equivalent
import torch.nn as nn
layer = nn.Linear(784, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # "Kaiming" is He's given name

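Since ReLU is the default activation for most modern hidden layers (covered in Activation Functions), He initialization is a sensible default for most networks built today. Framework defaults do not always match it, though: Keras Dense layers default to Glorot uniform, and PyTorch's nn.Linear and convolution layers use a Kaiming-uniform variant whose scale is smaller than standard He initialization for ReLU. It is therefore worth setting the scheme explicitly, as in the snippet above.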
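
The factor of 2 in He initialization can be sanity-checked on a single layer. The sketch below assumes unit-variance Gaussian inputs and a 784-to-256 layer, matching the sizes used earlier; the variable names are illustrative. It measures how many outputs ReLU zeroes and whether the mean squared activation after ReLU matches the mean squared input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_outputs, batch = 784, 256, 2000

x = rng.standard_normal((batch, n_inputs))                              # unit-variance inputs
W = rng.standard_normal((n_inputs, n_outputs)) * np.sqrt(2 / n_inputs)  # He init
a = np.maximum(x @ W, 0.0)                                              # ReLU

zero_fraction = np.mean(a == 0)   # share of outputs that ReLU zeroed
input_power = np.mean(x ** 2)     # mean squared input
output_power = np.mean(a ** 2)    # mean squared activation after ReLU
print(zero_fraction, input_power, output_power)
```

About half the outputs should be zero, and the two mean-squared values should be close to each other. That is the sense in which doubling the weight variance compensates for the half of the signal that ReLU discards.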


Choosing the Right Initialization Scheme

| Activation function | Recommended initialization |
| --- | --- |
| Sigmoid, Tanh | Xavier/Glorot |
| ReLU, Leaky ReLU | He/Kaiming |
| GELU, Swish (transformer-style) | Typically He or a scheme specific to the architecture (e.g., scaled init for transformers) |

Mismatching initialization to activation function is a subtle mistake. Xavier's variance is 2 / (n_inputs + n_outputs) while He's is 2 / n_inputs, so for a layer with equal fan-in and fan-out, Xavier supplies only half the variance that ReLU needs. In a shallow network this is usually tolerable, but the shortfall compounds across layers and can measurably slow convergence in deeper networks.
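
The compounding effect of a mismatch can be seen in a small forward-pass experiment. This is a sketch under stated assumptions: a 30-layer ReLU network of constant width 256, random Gaussian inputs, and the normal-distribution form of Xavier. The helper name relu_rms_by_layer is hypothetical. Both runs share a seed, so the only difference is the weight scale.

```python
import numpy as np

def relu_rms_by_layer(weight_std, depth=30, width=256, batch=256, seed=0):
    """RMS activation after each layer of a ReLU network with N(0, weight_std^2) weights."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    rms = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        a = np.maximum(a @ W, 0.0)
        rms.append(np.sqrt(np.mean(a ** 2)))
    return np.array(rms)

width = 256
xavier_std = np.sqrt(2 / (width + width))   # Xavier (normal form): Var = 2 / (fan_in + fan_out)
he_std = np.sqrt(2 / width)                 # He: Var = 2 / fan_in

xavier_rms = relu_rms_by_layer(xavier_std, width=width)
he_rms = relu_rms_by_layer(he_std, width=width)
print(f"He:     first={he_rms[0]:.3f}  last={he_rms[-1]:.3f}")
print(f"Xavier: first={xavier_rms[0]:.3f}  last={xavier_rms[-1]:.3e}")
```

Under He initialization the activation scale should stay within the same order of magnitude from the first layer to the last. Under Xavier, each ReLU layer roughly halves the mean squared activation, so the signal at layer 30 should be several orders of magnitude smaller.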


Why This Matters More as Networks Get Deeper

A poorly initialized shallow network (2–3 layers) often trains fine anyway, since there is little opportunity for activations to compound toward an extreme across so few layers. A poorly initialized deep network (dozens or hundreds of layers) can fail to train at all, because even a slightly wrong per-layer variance, multiplied across many layers, becomes severe. This is why initialization research became critical as networks grew deeper throughout the 2010s, and why initialization remains one of the first things to check when a very deep network fails to train. Batch normalization (covered in Batch Normalization) provides a complementary layer of protection against the same class of problem.
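
A quick piece of arithmetic shows why depth changes the picture. Suppose, purely as an illustration, that each layer scales the signal by a fixed factor of 0.9 or 1.1 (a 10% error in either direction). The helper compounded_scale below is a hypothetical name for that simple model.

```python
def compounded_scale(per_layer_gain, depth):
    """Overall signal scale after `depth` layers that each multiply it by `per_layer_gain`."""
    return per_layer_gain ** depth

for depth in (3, 50, 100):
    shrink = compounded_scale(0.9, depth)
    grow = compounded_scale(1.1, depth)
    print(f"depth={depth:3d}  gain 0.9 -> {shrink:.2e}   gain 1.1 -> {grow:.2e}")
```

At depth 3 the signal stays within about 30% of its starting scale in either direction. At depth 100 the same per-layer error shrinks it by more than four orders of magnitude or grows it by a similar factor.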

Initializing Biases: A Simpler, Separate Decision

While weight initialization requires the careful, variance-preserving schemes covered above, bias terms are typically initialized much more simply — usually to zero, since biases don’t face the same symmetry-breaking requirement that weights do (different neurons already receive different random weights, which is sufficient to break symmetry on its own).

import torch.nn as nn
layer = nn.Linear(784, 256)
nn.init.zeros_(layer.bias) # standard, simple default for bias initialization

One notable exception: in certain architectures, biases for “forget”-type gates (as in LSTM, covered in LSTM and GRU) are sometimes deliberately initialized to a positive value rather than zero, specifically to encourage the network to retain information by default early in training, before it has learned when forgetting is actually appropriate — a small, targeted deviation from the “just use zero” default, motivated by that specific architecture’s behavior.
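
The effect of a positive forget-gate bias can be illustrated with a simplified model. The sketch assumes that early in training the gate's pre-activation is dominated by its bias, and it ignores any new content written to the cell state. Both are simplifying assumptions for illustration, and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def retained_fraction(forget_bias, steps):
    """Share of the cell state left after `steps` updates if the gate equals sigmoid(bias)."""
    gate = sigmoid(forget_bias)
    return gate ** steps

steps = 10
for forget_bias in (0.0, 1.0):
    gate = sigmoid(forget_bias)
    retained = retained_fraction(forget_bias, steps)
    print(f"bias={forget_bias}: gate={gate:.3f}, retained after {steps} steps={retained:.4f}")
```

With a zero bias the gate starts at 0.5, so the cell state halves at every step. A bias of 1.0 starts the gate at about 0.73, which leaves far more of the state intact over the same number of steps.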

Summary

| Scheme | Designed for | Key idea |
| --- | --- | --- |
| Zero init | Not appropriate for weights | Fails to break the symmetry needed for learning |
| Xavier/Glorot | Sigmoid, Tanh | Balances variance assuming near-linear activation region |
| He/Kaiming | ReLU and variants | Compensates for ReLU zeroing out half its inputs |

Weight initialization is one of the cheapest, highest-leverage decisions in training a deep network — choosing correctly costs nothing extra computationally, and choosing incorrectly can be the difference between a network that trains smoothly and one that fails before it ever gets started.