Supplement · Weight Initialization

Xavier / Glorot Initialization

15 min read
By the end of this reading you will be able to:
  • Derive the Xavier variance formula from the requirement that forward-pass and backward-pass variance are both preserved, arriving at Var(W) = 2 / (fan_in + fan_out)
  • Distinguish xavier_uniform_ from xavier_normal_ and state the variance of each in terms of fan_in and fan_out
  • Apply xavier_uniform_ and xavier_normal_ in PyTorch and GlorotUniform / GlorotNormal in TensorFlow, including setting the gain parameter
  • Explain why Xavier initialization assumes near-linear activations and therefore breaks down for ReLU networks

The Two Constraints

Xavier initialization (Glorot & Bengio, 2010) is an early and widely adopted principled answer to the question: what variance should weights have? It derives the answer from two simultaneous requirements.

Constraint 1 — forward pass: For the variance of activations to remain constant across layers:

$$\text{Var}(h^{(l)}) = \text{Var}(h^{(l-1)}) \quad \Longrightarrow \quad n_{l-1} \cdot \text{Var}(w) = 1$$

where $n_{l-1}$ is the fan-in (number of inputs to the layer).
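The implication follows from a short variance calculation. The sketch below assumes the unit operates in its linear regime, and that weights and incoming activations are independent, zero-mean, and identically distributed. The index names $i$ and $j$ are introduced here for illustration only.

```latex
\begin{aligned}
h^{(l)}_i &\approx \sum_{j=1}^{n_{l-1}} w_{ij}\, h^{(l-1)}_j \\
\text{Var}\big(h^{(l)}_i\big)
  &= \sum_{j=1}^{n_{l-1}} \text{Var}(w_{ij})\, \text{Var}\big(h^{(l-1)}_j\big)
   = n_{l-1} \cdot \text{Var}(w) \cdot \text{Var}\big(h^{(l-1)}\big)
\end{aligned}
```

Requiring the activation variance to be the same on both sides leaves $n_{l-1} \cdot \text{Var}(w) = 1$.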

Constraint 2 — backward pass: For gradient variance to remain constant:

$$\text{Var}\!\left(\frac{\partial \mathcal{L}}{\partial h^{(l-1)}}\right) = \text{Var}\!\left(\frac{\partial \mathcal{L}}{\partial h^{(l)}}\right) \quad \Longrightarrow \quad n_l \cdot \text{Var}(w) = 1$$

where $n_l$ is the fan-out (number of outputs).
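The backward constraint follows from a variance calculation over the sum that backpropagation performs. The sketch below assumes the linear regime and independent, zero-mean weights and gradients. The shorthand $\delta$ for the gradient with respect to a layer's activations is introduced here for illustration.

```latex
\begin{aligned}
\delta^{(l-1)}_j &\equiv \frac{\partial \mathcal{L}}{\partial h^{(l-1)}_j}
  \approx \sum_{i=1}^{n_l} w_{ij}\, \delta^{(l)}_i \\
\text{Var}\big(\delta^{(l-1)}_j\big)
  &= n_l \cdot \text{Var}(w) \cdot \text{Var}\big(\delta^{(l)}\big)
\end{aligned}
```

The sum runs over the $n_l$ units that consume $h^{(l-1)}_j$, which is why fan-out appears here instead of fan-in.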

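These two constraints are incompatible unless fan-in equals fan-out. Xavier resolves the tension by taking the harmonic mean of the two target variances, $1/n_{l-1}$ and $1/n_l$, which is the same as replacing the layer width with the average of fan-in and fan-out: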

$$\text{Var}(w) = \frac{2}{n_{l-1} + n_l} = \frac{2}{\text{fan\_in} + \text{fan\_out}}$$
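To make the compromise concrete, here is a minimal sketch that evaluates both constraints and the Xavier value for one layer. The helper name `xavier_variance` and the layer sizes (128 inputs, 256 outputs) are illustrative choices, not part of any library.

```python
def xavier_variance(fan_in: int, fan_out: int) -> float:
    """Compromise variance 2 / (fan_in + fan_out). Hypothetical helper."""
    return 2.0 / (fan_in + fan_out)

fan_in, fan_out = 128, 256           # illustrative layer sizes
forward_target = 1.0 / fan_in        # satisfies the forward constraint only
backward_target = 1.0 / fan_out      # satisfies the backward constraint only
compromise = xavier_variance(fan_in, fan_out)

# The harmonic mean of the two targets reproduces the compromise value.
harmonic_mean = 2.0 / (1.0 / forward_target + 1.0 / backward_target)

print(f"forward target : {forward_target:.6f}")
print(f"backward target: {backward_target:.6f}")
print(f"Xavier variance: {compromise:.6f}")
print(f"harmonic mean  : {harmonic_mean:.6f}")
```

The Xavier value always lies between the two targets, so neither pass is preserved exactly unless fan-in equals fan-out.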

The Linear Approximation Assumption

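The derivation above treats the activation function as if it were linear around $z = 0$. This is approximately true for tanh in its active region, where its derivative is 1 at $z = 0$. Sigmoid is also close to linear around zero, but its slope there is only $1/4$ and its output is not zero-centered, so the assumption holds less well. The assumption is badly violated by ReLU, which zeros out roughly half of its inputs when pre-activations are symmetric around zero. The next reading (He/Kaiming) corrects for this.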
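The effect of the linearity assumption can be seen in a small simulation. The sketch below, written with the standard library only, pushes random inputs through a stack of square layers initialized with Xavier uniform weights, once with an identity activation and once with ReLU. The helper names, width, depth, and batch size are all illustrative assumptions.

```python
import math
import random

def xavier_uniform_matrix(fan_in, fan_out, rng):
    """fan_out x fan_in matrix with entries from U[-a, a] (gain = 1)."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]

def mean_square_after(weights, inputs, activation):
    """Pass each input through every layer; return the mean squared output."""
    total, count = 0.0, 0
    for x in inputs:
        h = x
        for w in weights:
            h = [activation(sum(wij * hj for wij, hj in zip(row, h))) for row in w]
        total += sum(v * v for v in h)
        count += len(h)
    return total / count

rng = random.Random(0)
width, depth, batch = 64, 5, 32      # small sizes so pure Python stays fast
weights = [xavier_uniform_matrix(width, width, rng) for _ in range(depth)]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]

linear_ms = mean_square_after(weights, inputs, lambda z: z)
relu_ms = mean_square_after(weights, inputs, lambda z: max(z, 0.0))

print(f"identity mean square: {linear_ms:.4f}")
print(f"ReLU mean square    : {relu_ms:.4f}")
print(f"ratio               : {relu_ms / linear_ms:.4f} (2**-depth = {2.0 ** -depth:.4f})")
```

With the identity activation the signal scale stays near its input value, while with ReLU it shrinks by roughly a factor of two per layer, since ReLU discards about half of the squared signal at each step.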

Xavier Uniform

torch.nn.init.xavier_uniform_(tensor, gain=1.0) / tf.keras.initializers.GlorotUniform()

Draws from a uniform distribution symmetric around zero, scaled so that, with gain = 1, $\text{Var}(w) = 2 / (\text{fan\_in} + \text{fan\_out})$:

$$w \sim \mathcal{U}\!\left[-a,\, a\right], \quad a = \text{gain} \cdot \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}$$

The factor of $\sqrt{6}$ comes from the relationship between the variance of $\mathcal{U}[-a, a]$ and $a$: $\text{Var}(\mathcal{U}[-a,a]) = a^2 / 3$, so solving $a^2 / 3 = 2 / (\text{fan\_in} + \text{fan\_out})$ gives $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$.
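The bound can be checked numerically without any framework. This sketch, using only the standard library and illustrative layer sizes, computes $a$ for a 128-input, 256-output layer and compares the empirical variance of samples from $\mathcal{U}[-a, a]$ with the target.

```python
import math
import random
import statistics

fan_in, fan_out = 128, 256                   # illustrative layer sizes
a = math.sqrt(6.0 / (fan_in + fan_out))      # uniform bound with gain = 1
target_var = 2.0 / (fan_in + fan_out)

rng = random.Random(0)
samples = [rng.uniform(-a, a) for _ in range(fan_in * fan_out)]
empirical_var = statistics.pvariance(samples)

print(f"a             = {a:.4f}")
print(f"a**2 / 3      = {a * a / 3:.6f}")
print(f"target var    = {target_var:.6f}")
print(f"empirical var = {empirical_var:.6f}")
```

The analytic value $a^2/3$ matches the target exactly, and the sampled variance should land close to it for a matrix of this size.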

PyTorch:

import torch
import torch.nn as nn

w = torch.empty(256, 128)  # fan_out=256, fan_in=128

# Uniform: draws from U[-a, a], a = sqrt(6 / (128+256)) = sqrt(6/384) = 0.125
nn.init.xavier_uniform_(w, gain=1.0)

# With gain for tanh (recommended: 5/3)
nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('tanh'))

# Apply to a full model
def init_xavier(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(128, 256), nn.Tanh(),
    nn.Linear(256, 128), nn.Tanh(),
    nn.Linear(128, 10)
)
model.apply(init_xavier)

TensorFlow:

import tensorflow as tf

# GlorotUniform is the DEFAULT kernel initializer for Dense and Conv layers
dense = tf.keras.layers.Dense(256)   # already uses GlorotUniform

# Explicit
dense = tf.keras.layers.Dense(256,
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
    bias_initializer='zeros')

# Standalone
glorot_u = tf.keras.initializers.GlorotUniform()
w = glorot_u(shape=(128, 256))

Xavier Normal

torch.nn.init.xavier_normal_(tensor, gain=1.0) / tf.keras.initializers.GlorotNormal()

Same variance target, drawn from a zero-mean normal distribution:

$$w \sim \mathcal{N}\!\left(0,\, \sigma^2\right), \quad \sigma = \text{gain} \cdot \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}}$$

(Keras samples from a truncated normal internally; PyTorch samples from an untruncated normal with this $\sigma$.)
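Truncation matters because clipping the tails of a normal distribution lowers its standard deviation. The sketch below computes that shrinkage analytically, assuming a cutoff at two standard deviations. The cutoff is an assumption made for illustration, so check your framework's documentation for how its truncated-normal initializer is actually defined and whether it compensates. The helper name `truncated_std` is hypothetical.

```python
import math
from statistics import NormalDist

def truncated_std(cutoff: float) -> float:
    """Std of a standard normal restricted to [-cutoff, cutoff]."""
    unit = NormalDist()                       # mean 0, std 1
    mass = unit.cdf(cutoff) - unit.cdf(-cutoff)
    variance = 1.0 - 2.0 * cutoff * unit.pdf(cutoff) / mass
    return math.sqrt(variance)

fan_in, fan_out = 128, 256                    # illustrative layer sizes
target_std = math.sqrt(2.0 / (fan_in + fan_out))
shrink = truncated_std(2.0)                   # assumed cutoff: 2 std

print(f"shrink factor               = {shrink:.4f}")
print(f"target std                  = {target_std:.4f}")
print(f"std if truncated unadjusted = {target_std * shrink:.4f}")
print(f"pre-truncation std needed   = {target_std / shrink:.4f}")
```

Under this assumption, a truncated-normal initializer has to sample with a slightly larger standard deviation before truncation in order to hit the Xavier variance afterwards.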

PyTorch:

w = torch.empty(256, 128)
nn.init.xavier_normal_(w, gain=1.0)
# std = sqrt(2 / 384) ≈ 0.0722

TensorFlow:

glorot_n = tf.keras.initializers.GlorotNormal()
w = glorot_n(shape=(128, 256))

dense = tf.keras.layers.Dense(256,
    kernel_initializer=tf.keras.initializers.GlorotNormal())

The gain Parameter

PyTorch exposes a gain multiplier that scales the computed limits or standard deviation. It accounts for the fact that some activation functions contract or expand variance by a known amount:

# PyTorch gain values by activation
print(nn.init.calculate_gain('linear'))      # 1.0
print(nn.init.calculate_gain('sigmoid'))     # 1.0
print(nn.init.calculate_gain('tanh'))        # 1.6667  (5/3)
print(nn.init.calculate_gain('relu'))        # 1.4142  (sqrt(2))
print(nn.init.calculate_gain('leaky_relu'))  # ≈ 1.4141 (varies with slope)

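With gain=1.0 you get exactly the variance from the derivation above, which is the usual choice for sigmoid and for layers with no activation. For tanh, PyTorch's calculate_gain returns 5/3, as used in the earlier xavier_uniform_ example. For ReLU, you should use He initialization instead of Xavier with gain.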
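Because gain multiplies the bound or standard deviation, it multiplies the weight variance by gain squared. A minimal sketch of that scaling in plain Python follows. The helper names `xavier_uniform_bound` and `xavier_normal_std` are hypothetical and simply restate the formulas from this reading.

```python
import math

def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Bound a of U[-a, a] for Xavier uniform. Hypothetical helper."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

def xavier_normal_std(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Standard deviation for Xavier normal. Hypothetical helper."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))

fan_in, fan_out = 128, 256                    # illustrative layer sizes
for label, gain in [("gain = 1.0", 1.0), ("gain = 5/3 (tanh)", 5.0 / 3.0)]:
    bound = xavier_uniform_bound(fan_in, fan_out, gain)
    std = xavier_normal_std(fan_in, fan_out, gain)
    # Both variants give the same variance: bound**2 / 3 == std**2.
    print(f"{label:18s} bound={bound:.4f}  std={std:.4f}  var={std * std:.6f}")
```

For this layer, moving from gain 1.0 to 5/3 raises the uniform bound from 0.125 to about 0.208 and multiplies the variance by 25/9.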

Uniform vs Normal: Which to Choose?

In practice, the difference is small. Both variants target the same variance; Xavier uniform keeps every weight inside a bounded range, while Xavier normal can produce occasional larger values. The key factors:

  • Xavier uniform (GlorotUniform): the default kernel initializer in Keras; every weight lies within ±a
  • Xavier normal (GlorotNormal): preferred in some Transformer architectures
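The comparison can be made concrete by sampling both distributions at the same target variance. This standard-library sketch uses illustrative layer sizes and an untruncated normal, as in the PyTorch variant.

```python
import math
import random
import statistics

fan_in, fan_out = 128, 256                    # illustrative layer sizes
n = fan_in * fan_out
a = math.sqrt(6.0 / (fan_in + fan_out))       # Xavier uniform bound
sigma = math.sqrt(2.0 / (fan_in + fan_out))   # Xavier normal std

rng = random.Random(0)
uniform_w = [rng.uniform(-a, a) for _ in range(n)]
normal_w = [rng.gauss(0.0, sigma) for _ in range(n)]   # untruncated

for name, w in [("uniform", uniform_w), ("normal", normal_w)]:
    var = statistics.pvariance(w)
    largest = max(abs(v) for v in w)
    print(f"{name:8s} var={var:.6f}  max|w|={largest:.4f}")
```

The two variances should agree closely, while the largest normal sample exceeds the uniform bound, which is the only systematic difference between the two variants.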

When to Use Xavier

| Scenario | Appropriate? | Notes |
|---|---|---|
| Linear layers with sigmoid or tanh | Yes | This is the primary use case |
| Linear layers with no activation | Yes | Linear approximation holds exactly |
| Convolutional layers with sigmoid/tanh | Yes | |
| Linear/conv layers with ReLU | No | Use He/Kaiming instead |
| RNN weight matrices | Partially | Orthogonal is often better for hidden-to-hidden |
| Embedding layers | No | Normal or truncated normal with small std |

Quick Reference

| Variant | Formula | PyTorch | TF/Keras |
|---|---|---|---|
| Xavier uniform | $\mathcal{U}[-a, a],\; a = \text{gain}\sqrt{6/(f_{in}+f_{out})}$ | xavier_uniform_ | GlorotUniform |
| Xavier normal | $\mathcal{N}(0, \sigma^2),\; \sigma = \text{gain}\sqrt{2/(f_{in}+f_{out})}$ | xavier_normal_ | GlorotNormal |

References
Glorot & Bengio (2010) — Understanding the Difficulty of Training Deep Feedforward Neural Networks — Introduced Xavier initialization with the fan-in/fan-out variance derivation