Supplement · Weight Initialization

Why Initialization Matters

By the end of this reading you will be able to:
  • Explain the symmetry problem and why initializing all weights to the same value causes every neuron in a layer to learn the same function
  • Describe how activation variance propagates through a deep network and what happens when it explodes or collapses
  • Identify the three failure modes of bad initialization — signal collapse, signal explosion, and symmetry — and link each to a class of initializer
  • Distinguish the four main categories of initializer (constant, random, variance-scaling, structured) and state the primary use case for each

The Weight Initialization Problem

Before training begins, every weight in a neural network must be assigned a starting value. That choice determines:

  • Whether the forward pass produces meaningful activations or collapses to zero
  • Whether gradients flow cleanly backward or explode or vanish
  • How quickly — and sometimes whether — the network converges at all

Initialization is not a minor implementation detail. It is the first decision that shapes the entire trajectory of training.

Failure Mode 1 — The Symmetry Problem

Suppose you initialize all weights in a layer to the same value — say, zero.

Every neuron in the layer receives identical inputs and produces identical outputs. During backpropagation, every neuron receives identical gradients. On the next update, every weight changes by the same amount. The neurons remain identical — forever.

This is the symmetry problem: neurons that start identical stay identical. A layer with 512 neurons behaves like a layer with 1. All that capacity is wasted.

$$h_j^{(l)} = \sigma\!\left(\sum_i w_{ij}^{(l)} h_i^{(l-1)} + b_j^{(l)}\right)$$

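If $w_{ij}^{(l)} = c$ for all $i, j$ and the biases $b_j^{(l)}$ are equal as well, then $h_j^{(l)}$ is identical for all $j$. When the weights leaving those neurons are also equal, the backpropagated errors match, so the gradients $\partial \mathcal{L} / \partial w_{ij}^{(l)}$ are the same for every $j$ and the weights update in lockstep. Symmetry is preserved throughout training.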

The fix: break symmetry by initializing weights from a random distribution so that different neurons start from different points.
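The lockstep argument can be checked numerically. The following is a minimal sketch in plain Python (standard library only), assuming a tiny one-hidden-layer tanh network with a squared-error loss. The helper `hidden_grads`, the input values, and the layer sizes are illustrative choices, not part of the reading. The output weights `v` are set equal too, which the argument requires.

```python
import math
import random


def hidden_grads(W, b, v, x, y):
    """Return (hidden activations, dL/dW) for a one-hidden-layer tanh network.

    W[j][i] is the weight from input i to hidden unit j, b[j] its bias and
    v[j] its output weight. The loss is 0.5 * (y_hat - y) ** 2.
    """
    z = [sum(W[j][i] * x[i] for i in range(len(x))) + b[j] for j in range(len(W))]
    h = [math.tanh(zj) for zj in z]
    y_hat = sum(v[j] * h[j] for j in range(len(h)))
    err = y_hat - y
    # Backpropagate through the output weight and tanh:
    # dL/dz_j = err * v_j * (1 - h_j^2)
    dz = [err * v[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    dW = [[dz[j] * x[i] for i in range(len(x))] for j in range(len(W))]
    return h, dW


x, y = [0.5, -1.0, 2.0], 1.0
n_in, n_hidden = len(x), 4
b = [0.0] * n_hidden
v = [0.5] * n_hidden

# Constant initialization: every hidden unit starts with the same weights.
W_const = [[0.3] * n_in for _ in range(n_hidden)]
h_const, dW_const = hidden_grads(W_const, b, v, x, y)

# Random initialization: each hidden unit starts somewhere different.
rng = random.Random(0)
W_rand = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
h_rand, dW_rand = hidden_grads(W_rand, b, v, x, y)

# Count how many distinct gradient rows (one row per hidden unit) there are.
print("constant init:", len({tuple(row) for row in dW_const}), "distinct gradient row(s)")
print("random init:  ", len({tuple(row) for row in dW_rand}), "distinct gradient row(s)")
```

With constant weights every hidden unit receives the same gradient row, so a gradient step leaves the units identical. With random weights the rows differ and the units can specialize.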

Failure Mode 2 — Signal Collapse

Random initialization fixes symmetry, but the scale of the random values matters enormously.

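Consider a deep network with $L$ layers, each applying $z^{(l)} = W^{(l)} h^{(l-1)}$ (no activation, for simplicity, so $h^{(l)} = z^{(l)}$). If the weights are drawn independently from $\mathcal{N}(0, \sigma^2)$, where $\sigma$ now denotes the weight standard deviation rather than the activation function, and the inputs are zero-mean and independent of the weights, the variance of the output at layer $l$ is: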

$$\text{Var}(z_j^{(l)}) = n_{l-1} \cdot \sigma^2 \cdot \text{Var}(h^{(l-1)})$$

where $n_{l-1}$ is the number of inputs (fan-in). For a 50-layer network with unit-variance inputs, $n = 500$, and $\sigma = 0.01$:

$$\text{Var}(z^{(50)}) = (500 \cdot 0.0001)^{50} = 0.05^{50} \approx 10^{-65}$$

Activations shrink to values that are effectively zero, and the weight gradients, which scale with those activations, shrink with them. Training is dead on arrival.
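The arithmetic above is easy to check in log space. This is a small sketch that assumes the linear-network variance formula from this section and unit-variance inputs. The function name `log10_variance_after` is illustrative only.

```python
import math


def log10_variance_after(depth, fan_in, sigma, input_var=1.0):
    """log10 of Var(z^(depth)) for a linear network with i.i.d. N(0, sigma^2) weights.

    Each layer multiplies the variance by fan_in * sigma ** 2, so we add
    logarithms instead of multiplying, which avoids underflow and overflow.
    """
    per_layer_gain = fan_in * sigma ** 2
    return math.log10(input_var) + depth * math.log10(per_layer_gain)


for layer in (1, 10, 25, 50):
    exponent = log10_variance_after(layer, fan_in=500, sigma=0.01)
    print(f"layer {layer:2d}: Var(z) ~ 10^{exponent:.1f}")
```

Each layer scales the variance by 0.05, which is about 1.3 orders of magnitude, so fifty layers remove roughly 65 orders of magnitude.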

Failure Mode 3 — Signal Explosion

The opposite happens if $\sigma$ is too large. With $\sigma = 1$ and $n = 500$:

$$\text{Var}(z^{(50)}) = (500 \cdot 1)^{50} = 500^{50} \approx 10^{135}$$

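In 32-bit floating point, whose largest finite value is about $3.4 \times 10^{38}$, activations of this size overflow to infinity partway through the first forward pass, and later operations on those infinities produce NaN.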
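How soon the overflow happens depends on the floating-point format. The sketch below assumes float32 and the same linear-network variance formula, and treats one standard deviation as the "typical" activation magnitude. The helper `first_overflow_layer` is a hypothetical name introduced for illustration.

```python
import math

FLOAT32_MAX = 3.4028235e38  # largest finite IEEE 754 single-precision value


def first_overflow_layer(fan_in, sigma, limit=FLOAT32_MAX, input_std=1.0):
    """First layer whose typical activation magnitude (one std) exceeds `limit`.

    Returns None when the signal does not grow from layer to layer.
    """
    # Per-layer growth of log10(std); std is the square root of the variance.
    log_std_gain = 0.5 * math.log10(fan_in * sigma ** 2)
    if log_std_gain <= 0:
        return None
    layers_needed = (math.log10(limit) - math.log10(input_std)) / log_std_gain
    return math.floor(layers_needed) + 1


print(first_overflow_layer(fan_in=500, sigma=1.0))   # sigma too large: overflows
print(first_overflow_layer(fan_in=500, sigma=0.01))  # shrinking signal: None
```

Under these assumptions the standard deviation grows by a factor of about 22 per layer, so the float32 limit is crossed well before layer 50. In float64 the same network would stay finite for longer, but the values would still be useless for training.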

The Goldilocks Condition

For variance to be stable across layers, we need:

$$n_{l-1} \cdot \sigma^2 \approx 1 \quad \Longrightarrow \quad \sigma^2 \approx \frac{1}{n_{l-1}}$$

This is the core insight behind every variance-scaling initializer. Xavier, He, and LeCun all derive from this condition — they differ in what activation function they account for and whether they consider fan-in, fan-out, or both.
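The condition can be checked empirically. The following is a rough sketch in plain Python, assuming a linear network with square layers and a single random input vector. The width and depth are reduced (100 units, 10 layers) so the pure-Python loops run quickly, and `final_variance` is an illustrative helper name.

```python
import random
import statistics


def final_variance(depth, width, sigma, seed=0):
    """Push one random input through `depth` linear layers; return the output variance."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance input
    for _ in range(depth):
        # Each output unit gets its own fresh row of i.i.d. N(0, sigma^2) weights.
        h = [sum(rng.gauss(0.0, sigma) * hi for hi in h) for _ in range(width)]
    return statistics.pvariance(h)


width, depth = 100, 10
settings = [("too small", 0.01), ("1/sqrt(n)", width ** -0.5), ("too large", 1.0)]
for label, sigma in settings:
    print(f"{label:>10}: Var(z) = {final_variance(depth, width, sigma):.3e}")
```

Only the setting with $\sigma^2 = 1/n$ should keep the variance within an order of magnitude of 1. The other two drift by many orders of magnitude even at this modest depth. Because the estimate uses a single input and finite width, the stable case fluctuates around 1 rather than hitting it exactly.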

The Same Logic Applies to Gradients

The backward pass has the same structure. The gradient of the loss with respect to the activations of layer $l-1$ is:

$$\frac{\partial \mathcal{L}}{\partial h^{(l-1)}} = W^{(l)\top} \frac{\partial \mathcal{L}}{\partial z^{(l)}}$$

For gradient variance to be stable, we need:

$$n_l \cdot \sigma^2 \approx 1 \quad \Longrightarrow \quad \sigma^2 \approx \frac{1}{n_l}$$

where $n_l$ is the fan-out (number of outputs). Forward and backward stability impose different constraints on $\sigma^2$ whenever fan-in and fan-out differ. Xavier resolves the tension by using the harmonic mean of the two targets, $\sigma^2 = 2 / (n_{l-1} + n_l)$.
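The compromise can be verified with a few lines of arithmetic. This sketch assumes the standard Glorot/Xavier variance $2 / (\text{fan-in} + \text{fan-out})$ and compares it with the harmonic mean of the forward and backward targets. The layer sizes (500 and 300) and the function name `xavier_variance` are arbitrary illustrative choices.

```python
from statistics import harmonic_mean


def xavier_variance(fan_in, fan_out):
    """Glorot/Xavier weight variance: 2 / (fan_in + fan_out)."""
    return 2.0 / (fan_in + fan_out)


fan_in, fan_out = 500, 300
forward_var = 1.0 / fan_in    # keeps activation variance stable
backward_var = 1.0 / fan_out  # keeps gradient variance stable

print("forward target: ", forward_var)
print("backward target:", backward_var)
print("harmonic mean:  ", harmonic_mean([forward_var, backward_var]))
print("Xavier variance:", xavier_variance(fan_in, fan_out))
```

The harmonic mean of $1/n_{l-1}$ and $1/n_l$ equals $2 / (n_{l-1} + n_l)$, so the last two printed values agree up to rounding. When fan-in equals fan-out, both targets coincide and Xavier reduces to $\sigma^2 = 1/n$.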

The Four Categories of Initializer

| Category | Examples | Primary use |
|---|---|---|
| Constant | zeros, ones, constant, eye, dirac | Biases, identity residuals, specialized conv layers |
| Random | uniform, normal, truncated normal, sparse | Generic starting points when scale is set externally |
| Variance-scaling | Xavier/Glorot, He/Kaiming, LeCun | Default for weights in Linear and Conv layers |
| Structured | Orthogonal | RNN hidden-to-hidden weights; deep linear networks |

The readings that follow cover each category in detail — formulas, code, and when to reach for each one.

References
Glorot & Bengio (2010) — Understanding the Difficulty of Training Deep Feedforward Neural Networks — First systematic study of initialization and variance propagation; introduced Xavier initialization
He et al. (2015) — Delving Deep into Rectifiers — Extended the variance analysis to ReLU networks; introduced He/Kaiming initialization