Supplement · Weight Initialization

Why Initialization Matters


By the end of this reading you will be able to:

Explain the symmetry problem and why initializing all weights to the same value causes every neuron in a layer to learn the same function
Describe how activation variance propagates through a deep network and what happens when it explodes or collapses
Identify the three failure modes of bad initialization — signal collapse, signal explosion, and symmetry — and link each to a class of initializer
Distinguish the four main categories of initializer (constant, random, variance-scaling, structured) and state the primary use case for each

The Weight Initialization Problem

Before training begins, every weight in a neural network must be assigned a starting value. That choice determines:

Whether the forward pass produces meaningful activations or collapses to zero
Whether gradients flow cleanly backward or explode or vanish
How quickly — and sometimes whether — the network converges at all

Initialization is not a minor implementation detail. It is the first decision that shapes the entire trajectory of training.

Failure Mode 1 — The Symmetry Problem

Suppose you initialize all weights in a layer to the same value — say, zero.

Every neuron in the layer receives identical inputs and produces identical outputs. If the next layer is initialized the same way, every neuron also receives an identical gradient during backpropagation, so on the next update every neuron's weights change by the same amount. The neurons remain identical, forever.

This is the symmetry problem: neurons that start identical stay identical. A layer with 512 neurons behaves like a layer with 1. All that capacity is wasted.

$h_j^{(l)} = \sigma\!\left(\sum_i w_{ij}^{(l)} h_i^{(l-1)} + b_j^{(l)}\right)$

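If $w_{ij}^{(l)} = c$ for all $i, j$ (and the biases are equal too), then $h_j^{(l)}$ is identical for all $j$. Provided the outgoing weights are also equal, as they are when every layer uses the same constant, the gradients $\partial \mathcal{L} / \partial w_{ij}^{(l)}$ are identical across $j$ for each input $i$, so the neurons' weights update in lockstep. Symmetry is preserved throughout training.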

The fix: break symmetry by initializing weights from a random distribution so that different neurons start from different points.
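The lockstep behaviour can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a tiny tanh network with one hidden layer, a squared-error loss, and two made-up training examples, and the function name `train_constant_init` is hypothetical. Every weight starts at the same constant, and after training the hidden neurons are still exact copies of each other.

```python
import math

def train_constant_init(c=0.5, n_in=3, n_hidden=4, steps=20, lr=0.1):
    """Train a tiny tanh MLP whose weights all start at the constant c."""
    # W1[j][i]: weight from input i to hidden neuron j; w2[j]: hidden j to output.
    W1 = [[c] * n_in for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [c] * n_hidden
    # Toy (x, y) pairs, invented purely for illustration.
    data = [([1.0, -2.0, 0.5], 1.0), ([0.3, 0.8, -1.0], -1.0)]

    for _ in range(steps):
        for x, y in data:
            # Forward pass.
            h = [math.tanh(sum(W1[j][i] * x[i] for i in range(n_in)) + b1[j])
                 for j in range(n_hidden)]
            y_hat = sum(w2[j] * h[j] for j in range(n_hidden))
            err = y_hat - y  # dL/dy_hat for L = 0.5 * err**2

            # Backward pass with plain SGD updates.
            for j in range(n_hidden):
                # Gradient at hidden pre-activation j (uses w2[j] before its update).
                delta_j = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                b1[j] -= lr * delta_j
                for i in range(n_in):
                    W1[j][i] -= lr * delta_j * x[i]
    return W1, w2

W1, w2 = train_constant_init()
print("hidden neurons still identical:", all(row == W1[0] for row in W1))
print("weights of neuron 0 after training:", W1[0])
```

Within one neuron the weights drift apart, because each gradient is scaled by a different input $x_i$. Across neurons nothing differs, so the four hidden units carry the information of one.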

Failure Mode 2 — Signal Collapse

Random initialization fixes symmetry, but the scale of the random values matters enormously.

Consider a deep network with $L$ layers, each applying $z^{(l)} = W^{(l)} h^{(l-1)}$ (no activation, for simplicity, so $h^{(l)} = z^{(l)}$). If the weights are drawn independently from $\mathcal{N}(0, \sigma^2)$ and the inputs are zero-mean and independent of the weights, the variance of each output at layer $l$ is:

$\text{Var}(z_j^{(l)}) = n_{l-1} \cdot \sigma^2 \cdot \text{Var}(h^{(l-1)})$

where $n_{l-1}$ is the number of inputs (fan-in). Each layer therefore multiplies the variance by $n_{l-1} \sigma^2$. For a 50-layer network with $n = 500$, $\sigma = 0.01$, and unit-variance inputs:

$\text{Var}(z^{(50)}) = (500 \cdot 0.0001)^{50} = 0.05^{50} \approx 10^{-65}$

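Activations shrink to effectively zero: a variance this small corresponds to magnitudes that later layers and the loss cannot distinguish from zero. The gradients with respect to the weights scale with these activations, so they vanish too, and training stalls from the first step.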
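The compounding is easy to evaluate directly. The sketch below assumes the simplified linear recursion above with unit input variance, and the helper name `log10_variance_after` is hypothetical. It works in log space so that the result does not depend on floating-point range.

```python
import math

def log10_variance_after(depth, fan_in, sigma, input_var=1.0):
    """log10 of Var(z^(depth)) under the recursion
    Var(z^(l)) = fan_in * sigma**2 * Var(h^(l-1))."""
    per_layer_gain = fan_in * sigma ** 2
    return depth * math.log10(per_layer_gain) + math.log10(input_var)

exponent = log10_variance_after(depth=50, fan_in=500, sigma=0.01)
print(f"Var(z^(50)) is about 10^{exponent:.1f}")
```

With a per-layer gain of 0.05, the exponent comes out near -65. Changing `sigma` or `fan_in` shows how sensitive the final scale is to the per-layer gain.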

Failure Mode 3 — Signal Explosion

The opposite happens if $\sigma$ is too large. With $\sigma = 1$ and $n = 500$:

$\text{Var}(z^{(50)}) = (500 \cdot 1)^{50} = 500^{50} \approx 10^{135}$

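In float32, whose largest finite value is about $3.4 \times 10^{38}$, activations of this magnitude overflow to inf within the first forward pass, and later operations on them can produce NaN.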
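Both failure modes also show up empirically in a small simulation. This is a sketch under stated assumptions: a purely linear network, a width of 128 and depth of 20 (smaller than the example in the text so that plain Python stays fast), and a hypothetical helper name `mean_square_by_layer`. Python floats are 64-bit, so the values stay finite here rather than overflowing.

```python
import random

def mean_square_by_layer(sigma, width=128, depth=20, seed=0):
    """Push a random vector through `depth` linear layers with N(0, sigma^2)
    weights and record the mean squared activation after each layer."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = []
    for _ in range(depth):
        # Each output is a dot product of a fresh random weight row with h.
        h = [sum(rng.gauss(0.0, sigma) * h_i for h_i in h) for _ in range(width)]
        history.append(sum(v * v for v in h) / width)
    return history

small = mean_square_by_layer(sigma=0.01)
large = mean_square_by_layer(sigma=1.0)
print(f"sigma=0.01: layer 1 -> {small[0]:.3e}, layer 20 -> {small[-1]:.3e}")
print(f"sigma=1.00: layer 1 -> {large[0]:.3e}, layer 20 -> {large[-1]:.3e}")
```

The mean square changes by a roughly constant factor per layer, close to `width * sigma**2`, which is the same geometric behaviour the variance formula predicts.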

The Goldilocks Condition

For variance to be stable across layers, we need:

$n_{l-1} \cdot \sigma^2 \approx 1 \quad \Longrightarrow \quad \sigma^2 \approx \frac{1}{n_{l-1}}$

This is the core insight behind every variance-scaling initializer. Xavier, He, and LeCun all derive from this condition — they differ in what activation function they account for and whether they consider fan-in, fan-out, or both.
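As a rough illustration of how the three schemes relate, the sketch below computes the standard deviation each one assigns to a zero-mean normal initializer, using the variance formulas as they are commonly stated. The function name `init_std` and the scheme labels are hypothetical, and real libraries expose further options (uniform variants, gain factors) that are not modelled here.

```python
import math

def init_std(fan_in, fan_out, scheme):
    """Standard deviation of a zero-mean normal initializer for each scheme."""
    if scheme == "lecun":       # fan-in only: sigma^2 = 1 / fan_in
        variance = 1.0 / fan_in
    elif scheme == "xavier":    # fan-in and fan-out: sigma^2 = 2 / (fan_in + fan_out)
        variance = 2.0 / (fan_in + fan_out)
    elif scheme == "he":        # fan-in, doubled to account for ReLU: sigma^2 = 2 / fan_in
        variance = 2.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return math.sqrt(variance)

for scheme in ("lecun", "xavier", "he"):
    std = init_std(fan_in=500, fan_out=500, scheme=scheme)
    print(f"{scheme:>6}: std = {std:.4f}, fan_in * std^2 = {500 * std ** 2:.2f}")
```

For a square layer, the LeCun and Xavier variants coincide and satisfy $n \sigma^2 = 1$ exactly. The He variant targets $n \sigma^2 = 2$, compensating for the half of the units that a ReLU zeroes out.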

The Same Logic Applies to Gradients

The backward pass has the same structure. The gradient of the loss w.r.t. layer $l-1$ is:

$\frac{\partial \mathcal{L}}{\partial h^{(l-1)}} = W^{(l)\top} \frac{\partial \mathcal{L}}{\partial z^{(l)}}$

For gradient variance to be stable, we need:

$n_l \cdot \sigma^2 \approx 1 \quad \Longrightarrow \quad \sigma^2 \approx \frac{1}{n_l}$

where $n_l$ is the fan-out (number of outputs). Forward and backward stability impose different constraints on $\sigma^2$ whenever $n_{l-1} \neq n_l$. Xavier resolves the tension by taking the harmonic mean of the two target variances, $\sigma^2 = \frac{2}{n_{l-1} + n_l}$.
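The tension between the two constraints can be seen on a single non-square layer. The sketch below is illustrative only: it assumes a linear layer with fan-in 64 and fan-out 256, unit-variance random inputs and upstream gradients, and hypothetical helper names `mean_square` and `layer_gains`. It measures how much the layer scales the signal in each direction for three choices of $\sigma^2$.

```python
import random
from statistics import harmonic_mean

def mean_square(values):
    return sum(v * v for v in values) / len(values)

def layer_gains(fan_in, fan_out, variance, seed=0):
    """Empirical forward and backward variance gains of one linear layer
    z = W h, with W of shape fan_out x fan_in and W[j][i] ~ N(0, variance)."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    W = [[rng.gauss(0.0, sigma) for _ in range(fan_in)] for _ in range(fan_out)]
    h = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]    # unit-variance input
    g = [rng.gauss(0.0, 1.0) for _ in range(fan_out)]   # unit-variance dL/dz
    z = [sum(W[j][i] * h[i] for i in range(fan_in)) for j in range(fan_out)]
    dh = [sum(W[j][i] * g[j] for j in range(fan_out)) for i in range(fan_in)]  # W^T g
    return mean_square(z) / mean_square(h), mean_square(dh) / mean_square(g)

fan_in, fan_out = 64, 256
choices = [
    ("1/fan_in", 1 / fan_in),
    ("1/fan_out", 1 / fan_out),
    ("harmonic mean", harmonic_mean([1 / fan_in, 1 / fan_out])),
]
results = {}
for name, variance in choices:
    results[name] = layer_gains(fan_in, fan_out, variance)
    fwd, bwd = results[name]
    print(f"{name:>13}: forward gain {fwd:.2f}, backward gain {bwd:.2f}")
```

Up to sampling noise, the forward gain tracks `fan_in * variance` and the backward gain tracks `fan_out * variance`. Setting either one to 1 pushes the other off by a factor of 4 here, and the harmonic-mean choice splits the difference.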

The Four Categories of Initializer

Category	Examples	Primary Use
Constant	zeros, ones, constant, eye, dirac	Biases, identity residuals, specialized conv layers
Random	uniform, normal, truncated normal, sparse	Generic starting points when scale is set externally
Variance-scaling	Xavier/Glorot, He/Kaiming, LeCun	Default for weights in Linear and Conv layers
Structured	Orthogonal	RNN hidden-to-hidden weights; deep linear networks

The readings that follow cover each category in detail — formulas, code, and when to reach for each one.

References

Glorot & Bengio (2010) — Understanding the Difficulty of Training Deep Feedforward Neural Networks — First systematic study of initialization and variance propagation; introduced Xavier initialization

He et al. (2015) — Delving Deep into Rectifiers — Extended the variance analysis to ReLU networks; introduced He/Kaiming initialization
