Why Initialization Matters
- Explain the symmetry problem and why initializing all weights to the same value causes every neuron in a layer to learn the same function
- Describe how activation variance propagates through a deep network and what happens when it explodes or collapses
- Identify the three failure modes of bad initialization — signal collapse, signal explosion, and symmetry — and link each to a class of initializer
- Distinguish the four main categories of initializer (constant, random, variance-scaling, structured) and state the primary use case for each
The Weight Initialization Problem
Before training begins, every weight in a neural network must be assigned a starting value. That choice determines:
- Whether the forward pass produces meaningful activations or collapses to zero
- Whether gradients flow cleanly backward or explode or vanish
- How quickly — and sometimes whether — the network converges at all
Initialization is not a minor implementation detail. It is the first decision that shapes the entire trajectory of training.
Failure Mode 1 — The Symmetry Problem
Suppose you initialize all weights in a layer to the same value — say, zero.
Every neuron in the layer receives identical inputs and produces identical outputs. During backpropagation, every neuron receives identical gradients. On the next update, every weight changes by the same amount. The neurons remain identical — forever.
This is the symmetry problem: neurons that start identical stay identical. A layer with 512 neurons behaves like a layer with 1. All that capacity is wasted.
If $w_{ij} = c$ for all $i, j$, then $h_j = \sum_i w_{ij} x_i = c \sum_i x_i$ is identical for all $j$. The gradients are also identical, so weights update in lockstep. Symmetry is preserved throughout training.
The fix: break symmetry by initializing weights from a random distribution so that different neurons start from different points.
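The lockstep behaviour can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the two-layer tanh network, the constant 0.5, the learning rate, and the random data are all assumptions chosen for illustration. A nonzero constant is used because an all-zero network would also have all-zero gradients here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))   # 32 samples, 4 features (made-up data)
y = rng.normal(size=(32, 1))   # arbitrary regression targets

def train(W1, W2, steps=20, lr=0.1):
    """Plain gradient descent on a 2-layer tanh network with MSE loss."""
    for _ in range(steps):
        h = np.tanh(x @ W1)                   # hidden activations, shape (32, 3)
        err = h @ W2 - y                      # prediction error, shape (32, 1)
        grad_W2 = h.T @ err / len(x)
        grad_h = (err @ W2.T) * (1.0 - h**2)  # backprop through tanh
        grad_W1 = x.T @ grad_h / len(x)
        W1 = W1 - lr * grad_W1
        W2 = W2 - lr * grad_W2
    return W1, W2

# Constant init: every hidden unit (column of W1) starts with the same weights.
W1_const, _ = train(np.full((4, 3), 0.5), np.full((3, 1), 0.5))
# Random init: hidden units start from different points.
W1_rand, _ = train(rng.normal(scale=0.5, size=(4, 3)),
                   rng.normal(scale=0.5, size=(3, 1)))

# Largest difference between hidden units' incoming weights after training.
const_spread = (W1_const.max(axis=1) - W1_const.min(axis=1)).max()
rand_spread = (W1_rand.max(axis=1) - W1_rand.min(axis=1)).max()
print(f"constant init, spread across units: {const_spread:.2e}")
print(f"random init,   spread across units: {rand_spread:.2e}")
```

With constant initialization the three hidden units should still be indistinguishable after training (up to floating-point rounding), while the randomly initialized units remain distinct.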
Failure Mode 2 — Signal Collapse
Random initialization fixes symmetry, but the scale of the random values matters enormously.
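Consider a deep network with $L$ layers, each applying $h^{(l)} = W^{(l)} h^{(l-1)}$ (no activation for simplicity). If weights are drawn from $\mathcal{N}(0, \sigma^2)$, and weights and inputs are independent with zero mean, the variance of the output at layer $l$ is:
$$\mathrm{Var}\big(h^{(l)}\big) = \big(n_{\text{in}}\,\sigma^2\big)^{l}\,\mathrm{Var}(x)$$
where $n_{\text{in}}$ is the number of inputs (fan-in). For a 50-layer network with $n_{\text{in}}\,\sigma^2$ well below 1:
$$\mathrm{Var}\big(h^{(50)}\big) = \big(n_{\text{in}}\,\sigma^2\big)^{50}\,\mathrm{Var}(x) \approx 0$$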
Activations become numerically zero. No gradient can flow. Training is dead on arrival.
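To make the collapse concrete, here is a small NumPy sketch of a 50-layer linear network. The width of 512 and the standard deviation of 0.01 are illustrative assumptions chosen for this example, as is the batch of 256 unit-variance inputs. With these values the per-layer factor is 0.0512, so the predicted output variance is on the order of $10^{-65}$.

```python
import numpy as np

n, depth, sigma = 512, 50, 0.01   # illustrative values (assumed for this sketch)

rng = np.random.default_rng(0)
h = rng.normal(size=(n, 256))     # 256 unit-variance inputs, one per column

for _ in range(depth):
    W = rng.normal(scale=sigma, size=(n, n))
    h = W @ h                     # linear layer, no activation

predicted = (n * sigma**2) ** depth   # (n * sigma^2)^L, taking Var(x) = 1
measured = h.var()
print(f"per-layer factor n*sigma^2: {n * sigma**2:.4f}")
print(f"predicted variance after {depth} layers: {predicted:.3e}")
print(f"measured variance after {depth} layers:  {measured:.3e}")
```

The sketch runs in float64 so that the tiny values remain representable. In float32 they would fall below the smallest representable magnitude and become exactly zero.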
Failure Mode 3 — Signal Explosion
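The opposite happens if $\sigma$ is too large. With $n_{\text{in}}\,\sigma^2$ well above 1, the same factor grows exponentially with depth:
$$\mathrm{Var}\big(h^{(50)}\big) = \big(n_{\text{in}}\,\sigma^2\big)^{50}\,\mathrm{Var}(x) \gg 1$$
Activations overflow to `inf` and then turn into NaN within the first forward pass.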
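The overflow can be reproduced with a short NumPy sketch in float32, the precision most training code uses. The width of 512 and the standard deviation of 1.0 are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

n, depth, sigma = 512, 50, 1.0    # illustrative values (assumed for this sketch)

rng = np.random.default_rng(0)
h = rng.normal(size=(n, 64)).astype(np.float32)

first_nonfinite = None
# Silence overflow/invalid warnings; producing them is the point of the demo.
with np.errstate(over="ignore", invalid="ignore"):
    for layer in range(1, depth + 1):
        W = rng.normal(scale=sigma, size=(n, n)).astype(np.float32)
        h = W @ h                 # each layer scales the std by about sqrt(n)
        if first_nonfinite is None and not np.isfinite(h).all():
            first_nonfinite = layer

print(f"first layer with inf/NaN activations: {first_nonfinite}")
print(f"NaN present at the output: {bool(np.isnan(h).any())}")
```

Once any activation reaches `inf`, the next matrix product mixes `+inf` and `-inf` terms, which is what produces NaN.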
The Goldilocks Condition
For variance to be stable across layers, we need:
$$n_{\text{in}}\,\sigma^2 = 1 \quad\Longleftrightarrow\quad \sigma^2 = \frac{1}{n_{\text{in}}}$$
This is the core insight behind every variance-scaling initializer. Xavier, He, and LeCun all derive from this condition — they differ in what activation function they account for and whether they consider fan-in, fan-out, or both.
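A quick numerical check of this condition, again as a hedged NumPy sketch: the width, depth, and batch size are assumptions, and the network is purely linear, so this corresponds to LeCun-style scaling rather than to a ReLU-aware initializer.

```python
import numpy as np

n, depth = 512, 50                 # illustrative width and depth (assumed)
sigma = 1.0 / np.sqrt(n)           # chosen so that n * sigma^2 = 1

rng = np.random.default_rng(0)
h = rng.normal(size=(n, 256))      # unit-variance inputs
variances = [h.var()]

for _ in range(depth):
    W = rng.normal(scale=sigma, size=(n, n))
    h = W @ h                      # linear layer, no activation
    variances.append(h.var())

print(f"input variance:       {variances[0]:.3f}")
print(f"variance at layer 25: {variances[25]:.3f}")
print(f"variance at layer {depth}: {variances[-1]:.3f}")
```

The variance is only preserved in expectation, so individual runs drift somewhat, but it should stay within roughly an order of magnitude of 1 instead of changing by dozens of orders of magnitude.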
The Same Logic Applies to Gradients
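The backward pass has the same structure. The gradient of the loss with respect to the activations entering layer $l$ is:
$$\frac{\partial \mathcal{L}}{\partial h^{(l-1)}} = \big(W^{(l)}\big)^{\top}\,\frac{\partial \mathcal{L}}{\partial h^{(l)}}$$
For gradient variance to be stable, we need:
$$n_{\text{out}}\,\sigma^2 = 1 \quad\Longleftrightarrow\quad \sigma^2 = \frac{1}{n_{\text{out}}}$$
where $n_{\text{out}}$ is the fan-out (number of outputs). Forward and backward stability impose different constraints on $\sigma^2$. Xavier resolves the tension by using their harmonic mean, $\sigma^2 = 2 / (n_{\text{in}} + n_{\text{out}})$, with $n_{\text{in}}$ the fan-in.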
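The arithmetic behind that compromise is short enough to verify directly. The helper name `stable_variances` and the 512-by-256 layer shape below are hypothetical, chosen only for illustration.

```python
import math

def stable_variances(fan_in: int, fan_out: int) -> dict:
    """Weight variances that satisfy each stability constraint (hypothetical helper)."""
    forward = 1.0 / fan_in                   # keeps activation variance stable
    backward = 1.0 / fan_out                 # keeps gradient variance stable
    xavier = 2.0 / (fan_in + fan_out)        # Xavier/Glorot compromise
    return {"forward": forward, "backward": backward, "xavier": xavier}

v = stable_variances(fan_in=512, fan_out=256)

# Harmonic mean of the two constraints: 2ab / (a + b).
harmonic = 2 * v["forward"] * v["backward"] / (v["forward"] + v["backward"])

for name, var in v.items():
    print(f"{name:>8}: variance = {var:.6f}, std = {math.sqrt(var):.4f}")
print(f"harmonic mean of forward and backward: {harmonic:.6f}")
```

The harmonic mean of $1/n_{\text{in}}$ and $1/n_{\text{out}}$ equals $2/(n_{\text{in}} + n_{\text{out}})$, so the Xavier variance always lies between the two constraints.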
The Four Categories of Initializer
| Category | Examples | Primary Use |
|---|---|---|
| Constant | zeros, ones, constant, eye, dirac | Biases, identity residuals, specialized conv layers |
| Random | uniform, normal, truncated normal, sparse | Generic starting points when scale is set externally |
| Variance-scaling | Xavier/Glorot, He/Kaiming, LeCun | Default for weights in Linear and Conv layers |
| Structured | Orthogonal | RNN hidden-to-hidden weights; deep linear networks |
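As a rough orientation, one representative from each row of the table can be sketched in plain NumPy. This is an illustrative sketch rather than any framework's implementation: the layer shape and the hand-picked scale of 0.02 are assumptions, and real libraries ship their own versions of these initializers.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128        # hypothetical layer shape

# Constant: biases commonly start at zero.
bias = np.zeros(fan_out)

# Random: plain Gaussian whose scale (0.02 here) is chosen by hand.
w_random = rng.normal(scale=0.02, size=(fan_out, fan_in))

# Variance-scaling: He/Kaiming normal, std = sqrt(2 / fan_in), intended for ReLU layers.
w_he = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

# Structured: orthogonal square matrix from the QR decomposition of a Gaussian matrix.
q, r = np.linalg.qr(rng.normal(size=(fan_out, fan_out)))
w_orth = q * np.sign(np.diag(r))  # flip column signs so diag(r) would be positive

print("He std (target vs. sample):", np.sqrt(2.0 / fan_in), w_he.std())
print("orthogonal check:", np.allclose(w_orth.T @ w_orth, np.eye(fan_out)))
```

Only the variance-scaling and structured examples look at the layer's shape. The constant and plain random ones ignore it, which is why they need their scale set externally.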
The readings that follow cover each category in detail — formulas, code, and when to reach for each one.