Supplement · Weight Initialization

Xavier / Glorot Initialization

15 min read

By the end of this reading you will be able to:

Derive the Xavier variance formula from the requirement that forward-pass and backward-pass variance are both preserved, arriving at Var(W) = 2 / (fan_in + fan_out)
Distinguish xavier_uniform_ from xavier_normal_ and state the variance of each in terms of fan_in and fan_out
Apply xavier_uniform_ and xavier_normal_ in PyTorch and GlorotUniform / GlorotNormal in TensorFlow, including setting the gain parameter
Explain why Xavier initialization assumes near-linear activations and therefore breaks down for ReLU networks

The Two Constraints

Xavier initialization (Glorot & Bengio, 2010) is the first principled answer to the question: what variance should weights have? It derives the answer from two simultaneous requirements.

Constraint 1 — forward pass: For the variance of activations to remain constant across layers:

$\text{Var}(h^{(l)}) = \text{Var}(h^{(l-1)}) \quad \Longrightarrow \quad n_{l-1} \cdot \text{Var}(w) = 1$

where $n_{l-1}$ is the fan-in (number of inputs to the layer).
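
The step from equal variances to $n_{l-1} \cdot \text{Var}(w) = 1$ can be spelled out as follows. This is a sketch under assumptions the text leaves implicit: weights and inputs are independent, zero-mean, and identically distributed within a layer, and the activation is approximated by the identity. The pre-activation symbol $z_i^{(l)}$ and the index names are introduced here for illustration.

```latex
z_i^{(l)} = \sum_{j=1}^{n_{l-1}} w_{ij}\, h_j^{(l-1)}, \qquad h^{(l)} \approx z^{(l)}

\text{Var}\!\left(z_i^{(l)}\right)
  = \sum_{j=1}^{n_{l-1}} \text{Var}(w_{ij})\, \text{Var}\!\left(h_j^{(l-1)}\right)
  = n_{l-1} \cdot \text{Var}(w) \cdot \text{Var}\!\left(h^{(l-1)}\right)
```

Requiring $\text{Var}(h^{(l)}) = \text{Var}(h^{(l-1)})$ then leaves $n_{l-1} \cdot \text{Var}(w) = 1$. The backward-pass constraint follows from the same argument applied to the transposed weight matrix, which sums over $n_l$ terms instead.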

Constraint 2 — backward pass: For gradient variance to remain constant:

$\text{Var}\!\left(\frac{\partial \mathcal{L}}{\partial h^{(l-1)}}\right) = \text{Var}\!\left(\frac{\partial \mathcal{L}}{\partial h^{(l)}}\right) \quad \Longrightarrow \quad n_l \cdot \text{Var}(w) = 1$

where $n_l$ is the fan-out (number of outputs).

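These two constraints cannot both hold unless fan-in equals fan-out. Xavier resolves the tension with a compromise: it replaces the layer width in each constraint with the average of fan-in and fan-out, which is the same as taking the harmonic mean of the two target variances $1/n_{l-1}$ and $1/n_l$: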

$\text{Var}(w) = \frac{2}{n_{l-1} + n_l} = \frac{2}{\text{fan\_in} + \text{fan\_out}}$
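
To see what the compromise costs, here is a small arithmetic sketch for a layer with fan_in = 128 and fan_out = 256 (the same shape used in the examples later in this reading). The variable names are illustrative, and the per-layer multipliers assume the linear approximation.

```python
fan_in, fan_out = 128, 256

var_forward = 1.0 / fan_in               # satisfies constraint 1 only
var_backward = 1.0 / fan_out             # satisfies constraint 2 only
var_xavier = 2.0 / (fan_in + fan_out)    # the compromise

# Per-layer variance multipliers when the Xavier variance is used
forward_gain = fan_in * var_xavier       # 256/384, about 0.667
backward_gain = fan_out * var_xavier     # 512/384, about 1.333

print(f"Var(w): forward-only={var_forward:.6f}, "
      f"backward-only={var_backward:.6f}, xavier={var_xavier:.6f}")
print(f"forward multiplier={forward_gain:.3f}, "
      f"backward multiplier={backward_gain:.3f}")
```

Neither multiplier is exactly 1 when the layer is not square, but the two always average to 1, and both equal 1 when fan_in equals fan_out.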

The Linear Approximation Assumption

The derivation above treats the activation function as if it were linear around $z = 0$. This is a good approximation for tanh in its active region (its derivative is exactly 1 at $z = 0$) and a rougher one for sigmoid, which is also linear near zero but has slope $1/4$ there and is not zero-centered. It is badly violated by ReLU, which zeros out about half of its inputs when pre-activations are symmetric around zero. The next reading (He/Kaiming) corrects for this.
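
The assumption can be checked numerically. The sketch below uses only the Python standard library; the sample size, seed, and input standard deviation of 0.1 are arbitrary choices made for illustration. It compares second moments before and after each activation (the second moment is used because ReLU outputs are not zero-mean) and estimates the slope of tanh and sigmoid at zero by finite differences.

```python
import math
import random

random.seed(0)
n = 100_000
z = [random.gauss(0.0, 0.1) for _ in range(n)]  # small, zero-centered pre-activations

def second_moment(xs):
    return sum(x * x for x in xs) / len(xs)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# How much of the input signal survives each activation
ratio_tanh = second_moment([math.tanh(v) for v in z]) / second_moment(z)
ratio_relu = second_moment([max(v, 0.0) for v in z]) / second_moment(z)

# Slopes at z = 0 via central finite differences
eps = 1e-6
slope_tanh = (math.tanh(eps) - math.tanh(-eps)) / (2 * eps)
slope_sigmoid = (sigmoid(eps) - sigmoid(-eps)) / (2 * eps)

print(f"tanh keeps {ratio_tanh:.3f} of the second moment")
print(f"ReLU keeps {ratio_relu:.3f} of the second moment")
print(f"slope at 0: tanh={slope_tanh:.3f}, sigmoid={slope_sigmoid:.3f}")
```

For small inputs tanh passes nearly all of the signal through, while ReLU keeps roughly half, which is the factor of 2 that He initialization restores.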

Xavier Uniform

torch.nn.init.xavier_uniform_(tensor, gain=1.0) / tf.keras.initializers.GlorotUniform()

Draws from a uniform distribution symmetric around zero, scaled so that $\text{Var}(w) = 2 / (\text{fan\_in} + \text{fan\_out})$ when gain is 1:

$w \sim \mathcal{U}\!\left[-a,\, a\right], \quad a = \text{gain} \cdot \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}$

The factor of $\sqrt{6}$ comes from the relationship between the variance of $\mathcal{U}[-a, a]$ and $a$: $\text{Var}(\mathcal{U}[-a,a]) = a^2 / 3$, so solving $a^2 / 3 = 2 / (\text{fan\_in} + \text{fan\_out})$ gives $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$.
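
For readers who want the intermediate step, the variance of $\mathcal{U}[-a, a]$ follows from a one-line integral. The distribution has mean zero, so the variance equals the second moment, and the density is $1/(2a)$ on the interval:

```latex
\text{Var}(w) = \mathbb{E}[w^2]
  = \int_{-a}^{a} \frac{w^2}{2a}\, dw
  = \frac{1}{2a} \cdot \frac{2a^3}{3}
  = \frac{a^2}{3}
```

As a check with fan_in = 128 and fan_out = 256: $a = \sqrt{6/384} = 0.125$, and $a^2/3 = 0.015625/3 = 2/384$, which matches the target variance.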

PyTorch:

import torch
import torch.nn as nn

w = torch.empty(256, 128)  # fan_out=256, fan_in=128

# Uniform: draws from U[-a, a], a = sqrt(6 / (128+256)) = sqrt(6/384) = 0.125
nn.init.xavier_uniform_(w, gain=1.0)

# With gain for tanh (recommended: 5/3)
nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('tanh'))

# Apply to a full model
def init_xavier(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(128, 256), nn.Tanh(),
    nn.Linear(256, 128), nn.Tanh(),
    nn.Linear(128, 10)
)
model.apply(init_xavier)
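
A quick way to confirm that the initializer behaves as the formula says is to inspect the sampled tensor. The following is a minimal sketch, assuming PyTorch is installed; the seed and the variable names are arbitrary.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

fan_out, fan_in = 256, 128
w = torch.empty(fan_out, fan_in)
nn.init.xavier_uniform_(w)  # gain defaults to 1.0

bound = math.sqrt(6.0 / (fan_in + fan_out))   # 0.125
target_var = 2.0 / (fan_in + fan_out)

max_abs = w.abs().max().item()    # should not exceed the bound
sample_var = w.var().item()       # should be close to the target

print(f"bound={bound}, largest |w|={max_abs:.4f}")
print(f"target Var(w)={target_var:.6f}, sample Var(w)={sample_var:.6f}")
```

The sample variance will not match the target exactly because it is estimated from a finite number of weights, but with 32,768 entries it should land within a few percent.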

TensorFlow:

import tensorflow as tf

# GlorotUniform is the DEFAULT kernel initializer for Dense and Conv layers
dense = tf.keras.layers.Dense(256)   # already uses GlorotUniform

# Explicit
dense = tf.keras.layers.Dense(256,
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
    bias_initializer='zeros')

# Standalone
glorot_u = tf.keras.initializers.GlorotUniform()
w = glorot_u(shape=(128, 256))

Xavier Normal

torch.nn.init.xavier_normal_(tensor, gain=1.0) / tf.keras.initializers.GlorotNormal()

Same variance target, drawn from a zero-mean normal distribution:

$w \sim \mathcal{N}\!\left(0,\, \sigma^2\right), \quad \sigma = \text{gain} \cdot \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}}$

(Keras's GlorotNormal samples from a truncated normal; PyTorch's xavier_normal_ samples from an ordinary, untruncated normal.)

PyTorch:

w = torch.empty(256, 128)
nn.init.xavier_normal_(w, gain=1.0)
# std ≈ sqrt(2 / 384) ≈ 0.0722

TensorFlow:

glorot_n = tf.keras.initializers.GlorotNormal()
w = glorot_n(shape=(128, 256))

dense = tf.keras.layers.Dense(256,
    kernel_initializer=tf.keras.initializers.GlorotNormal())
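
The truncated versus untruncated distinction matters for the variance. The sketch below uses only the Python standard library and assumes a cutoff at two standard deviations, which is how Keras documents its truncated normal; `truncated_gauss` is a hypothetical helper written for this illustration, not a framework function.

```python
import random
import statistics

random.seed(0)

def truncated_gauss(sigma, limit=2.0):
    """Redraw until the sample lies within +/- limit * sigma."""
    while True:
        x = random.gauss(0.0, sigma)
        if abs(x) <= limit * sigma:
            return x

sigma = 1.0
n = 100_000
plain = [random.gauss(0.0, sigma) for _ in range(n)]
trunc = [truncated_gauss(sigma) for _ in range(n)]

std_plain = statistics.pstdev(plain)
std_trunc = statistics.pstdev(trunc)

print(f"untruncated std: {std_plain:.4f}")
print(f"truncated std:   {std_trunc:.4f}")
```

Cutting the tails shrinks the standard deviation to roughly 0.88 of the nominal value, so an implementation that wants the truncated samples to hit the Xavier variance has to widen sigma by about 1/0.88 before sampling. Keras's variance-scaling initializers are understood to apply a correction of this kind.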

The gain Parameter

PyTorch exposes a gain multiplier that scales the computed limits or standard deviation. It accounts for the fact that some activation functions contract or expand variance by a known amount:

# PyTorch gain values by activation
print(nn.init.calculate_gain('linear'))      # 1.0
print(nn.init.calculate_gain('sigmoid'))     # 1.0
print(nn.init.calculate_gain('tanh'))        # 1.6667  (5/3)
print(nn.init.calculate_gain('relu'))        # 1.4142  (sqrt(2))
print(nn.init.calculate_gain('leaky_relu'))  # ≈ 1.4141 (default negative_slope=0.01; varies with slope)

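The original derivation corresponds to gain=1.0, which is a reasonable choice for sigmoid and tanh. PyTorch's calculate_gain suggests 5/3 for tanh, a larger scale intended to offset the way tanh contracts values away from zero. For ReLU, use He initialization rather than Xavier with a gain.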
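
Because gain multiplies the bound and the standard deviation, it multiplies the variance by gain squared. The sketch below makes that concrete; `xavier_scale` is a hypothetical helper that restates the two formulas from this reading and is not part of PyTorch.

```python
import math

def xavier_scale(fan_in, fan_out, gain=1.0):
    """Hypothetical helper: return (uniform bound a, normal std sigma)."""
    a = gain * math.sqrt(6.0 / (fan_in + fan_out))
    sigma = gain * math.sqrt(2.0 / (fan_in + fan_out))
    return a, sigma

a_lin, s_lin = xavier_scale(128, 256)                    # gain = 1
a_tanh, s_tanh = xavier_scale(128, 256, gain=5.0 / 3.0)  # PyTorch's tanh gain

variance_ratio = (s_tanh / s_lin) ** 2   # equals gain squared, 25/9

print(f"gain=1:   a={a_lin:.4f}, sigma={s_lin:.4f}")
print(f"gain=5/3: a={a_tanh:.4f}, sigma={s_tanh:.4f}")
print(f"variance ratio: {variance_ratio:.4f}")
```

With the tanh gain the uniform bound grows from 0.125 to about 0.208, and the weight variance grows by a factor of about 2.78.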

Uniform vs Normal: Which to Choose?

In practice, the difference is small: both variants target the same variance. Xavier uniform has a bounded range, so no weight exceeds $a$ in magnitude, while Xavier normal can produce occasional larger values. The key factors:

Xavier uniform (GlorotUniform): the Keras default for Dense and Conv layers; every sample lies within $\pm a$
Xavier normal (GlorotNormal): unbounded tails in PyTorch, truncated tails in Keras; used in some Transformer implementations
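
The comparison can be illustrated with a small simulation. This is a sketch using only the Python standard library, with an arbitrary seed; it samples as many values as a 256 x 128 weight matrix holds from each distribution and compares the variance and the largest magnitude. The normal samples here are untruncated, as in PyTorch.

```python
import math
import random
import statistics

random.seed(0)

fan_in, fan_out = 128, 256
n = fan_in * fan_out
target_var = 2.0 / (fan_in + fan_out)

a = math.sqrt(6.0 / (fan_in + fan_out))       # uniform bound
sigma = math.sqrt(2.0 / (fan_in + fan_out))   # normal std

u = [random.uniform(-a, a) for _ in range(n)]
g = [random.gauss(0.0, sigma) for _ in range(n)]

var_u = statistics.pvariance(u)
var_g = statistics.pvariance(g)
max_u = max(abs(x) for x in u)
max_g = max(abs(x) for x in g)

print(f"target variance: {target_var:.6f}")
print(f"uniform: variance={var_u:.6f}, largest |w|={max_u:.4f}")
print(f"normal:  variance={var_g:.6f}, largest |w|={max_g:.4f}")
```

Both sample variances sit close to the target. The difference is in the tails: the uniform samples never exceed $a$, while the normal samples include values well beyond it.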

When to Use Xavier

Scenario	Appropriate?
Linear layers with sigmoid or tanh	Yes — this is the primary use case
Linear layers with no activation	Yes — linear approximation holds exactly
Convolutional layers with sigmoid/tanh	Yes
Linear/conv layers with ReLU	No — use He/Kaiming instead
RNN weight matrices	Partially — orthogonal is often better for hidden-to-hidden
Embedding layers	No — normal or truncated normal with small std

Quick Reference

Variant	Formula	PyTorch	TF/Keras
Xavier uniform	$\mathcal{U}[-a, a],\; a = \text{gain}\sqrt{6/(f_{in}+f_{out})}$	`xavier_uniform_`	`GlorotUniform`
Xavier normal	$\mathcal{N}(0, \sigma^2),\; \sigma = \text{gain}\sqrt{2/(f_{in}+f_{out})}$	`xavier_normal_`	`GlorotNormal`

References

Glorot & Bengio (2010) — Understanding the Difficulty of Training Deep Feedforward Neural Networks — Introduced Xavier initialization with the fan-in/fan-out variance derivation
