Xavier / Glorot Initialization
- Derive the Xavier variance formula from the requirement that forward-pass and backward-pass variance are both preserved, arriving at Var(W) = 2 / (fan_in + fan_out)
- Distinguish xavier_uniform_ from xavier_normal_ and state the variance of each in terms of fan_in and fan_out
- Apply xavier_uniform_ and xavier_normal_ in PyTorch and GlorotUniform / GlorotNormal in TensorFlow, including setting the gain parameter
- Explain why Xavier initialization assumes near-linear activations and therefore breaks down for ReLU networks
The Two Constraints
Xavier initialization (Glorot & Bengio, 2010) is one of the first principled answers to the question: what variance should weights have? It derives the answer from two simultaneous requirements.
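Constraint 1 (forward pass): For the variance of activations to remain constant across layers:
$$\mathrm{fan}_{\mathrm{in}} \cdot \mathrm{Var}(W) = 1$$
where $\mathrm{fan}_{\mathrm{in}}$ is the fan-in (number of inputs to the layer).
Constraint 2 (backward pass): For gradient variance to remain constant:
$$\mathrm{fan}_{\mathrm{out}} \cdot \mathrm{Var}(W) = 1$$
where $\mathrm{fan}_{\mathrm{out}}$ is the fan-out (number of outputs).
These two constraints are incompatible unless fan-in equals fan-out. Xavier resolves the tension by taking the harmonic mean of the two target variances, $1/\mathrm{fan}_{\mathrm{in}}$ and $1/\mathrm{fan}_{\mathrm{out}}$, which amounts to averaging fan-in and fan-out in the denominator:
$$\mathrm{Var}(W) = \frac{2}{\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}}}$$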
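The two constraints can be checked numerically. The sketch below is a NumPy simulation, not part of the original derivation: it assumes a purely linear layer y = Wx with zero-mean i.i.d. Gaussian weights and unit-variance inputs and gradients, and the helper name `variance_ratios` is made up for illustration. Under those assumptions the forward variance ratio should be close to fan_in · Var(W) and the backward ratio close to fan_out · Var(W).

```python
import numpy as np

def variance_ratios(fan_in, fan_out, var_w, n_samples=2000, seed=0):
    """Return (forward, backward) variance ratios for a linear layer y = W x.

    Hypothetical helper for illustration. W has shape (fan_out, fan_in)
    with i.i.d. zero-mean Gaussian entries of variance var_w.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(var_w), size=(fan_out, fan_in))
    x = rng.normal(size=(n_samples, fan_in))    # unit-variance inputs
    g = rng.normal(size=(n_samples, fan_out))   # unit-variance upstream gradients
    forward = (x @ W.T).var()    # Var(y) / Var(x), expected ~ fan_in * var_w
    backward = (g @ W).var()     # Var(dL/dx) / Var(dL/dy), expected ~ fan_out * var_w
    return forward, backward

fan_in, fan_out = 128, 256
for name, var_w in [("1/fan_in", 1 / fan_in),
                    ("1/fan_out", 1 / fan_out),
                    ("xavier", 2 / (fan_in + fan_out))]:
    fwd, bwd = variance_ratios(fan_in, fan_out, var_w)
    print(f"{name:>10}: forward={fwd:.2f}  backward={bwd:.2f}")
```

With fan_in = 128 and fan_out = 256, the Xavier choice should give a forward ratio near 2/3 and a backward ratio near 4/3. Neither is exactly 1, but neither pass is favored at the other's expense.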
The Linear Approximation Assumption
The derivation above treats the activation function as if it were linear around $x = 0$. This is approximately true for tanh and sigmoid in their active region (tanh has derivative $1$ at $x = 0$; sigmoid is also close to linear there, though with slope $1/4$), but it is badly violated by ReLU, which zeros out about half of its inputs when pre-activations are centered on zero. The next reading (He/Kaiming) corrects for this.
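To see the breakdown concretely, the following NumPy sketch pushes random inputs through a stack of Xavier-initialized layers and compares tanh with ReLU. It is an illustration under stated assumptions (square layers, no biases, Gaussian weights with the Xavier variance, 20 layers), and `final_activation_variance` is a hypothetical helper name.

```python
import numpy as np

def final_activation_variance(activation, depth=20, width=256,
                              n_samples=1000, seed=0):
    """Variance of the activations after `depth` Xavier-initialized layers.

    Hypothetical helper: square layers (fan_in == fan_out == width),
    no biases, weights drawn from N(0, 2 / (fan_in + fan_out)).
    """
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n_samples, width))     # unit-variance inputs
    std = np.sqrt(2.0 / (width + width))        # Xavier standard deviation
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(width, width))
        h = activation(h @ W.T)
    return h.var()

tanh_var = final_activation_variance(np.tanh)
relu_var = final_activation_variance(lambda z: np.maximum(z, 0.0))
print(f"tanh: {tanh_var:.3e}   relu: {relu_var:.3e}")
```

Under these assumptions the ReLU signal loses roughly half of its second moment at every layer, so after 20 layers it is orders of magnitude smaller than the tanh signal, which decays far more slowly.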
Xavier Uniform
torch.nn.init.xavier_uniform_(tensor, gain=1.0) / tf.keras.initializers.GlorotUniform()
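Draws from a uniform distribution symmetric around zero, scaled so that $\mathrm{Var}(W) = 2 / (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})$:
$$W \sim U[-a, a], \qquad a = \sqrt{\frac{6}{\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}}}}$$
The factor of $6$ comes from the relationship between the variance of $U[-a, a]$ and $a$: $\mathrm{Var}(U[-a, a]) = a^2 / 3$, so solving $a^2 / 3 = 2 / (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})$ gives $a = \sqrt{6 / (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})}$.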
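The relationship between the limit and the variance can be verified with a few lines of NumPy. This is a sketch for checking the arithmetic, not the framework implementation; `xavier_uniform_limit` is a hypothetical helper name.

```python
import math
import numpy as np

def xavier_uniform_limit(fan_in, fan_out):
    """Limit a of U[-a, a] such that a**2 / 3 == 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

fan_in, fan_out = 128, 256
a = xavier_uniform_limit(fan_in, fan_out)

rng = np.random.default_rng(0)
w = rng.uniform(-a, a, size=(fan_out, fan_in))

print(f"limit a         = {a:.4f}")
print(f"target variance = {2 / (fan_in + fan_out):.6f}")
print(f"a**2 / 3        = {a ** 2 / 3:.6f}")
print(f"sample variance = {w.var():.6f}")
```

The last three printed values should agree closely, with the sample variance differing only by sampling noise.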
PyTorch:
import torch
import torch.nn as nn
w = torch.empty(256, 128) # fan_out=256, fan_in=128
# Uniform: draws from U[-a, a], a = sqrt(6 / (128+256)) = sqrt(6/384) ≈ 0.125
nn.init.xavier_uniform_(w, gain=1.0)
# With gain for tanh (recommended: 5/3)
nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('tanh'))
# Apply to a full model
def init_xavier(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)
model = nn.Sequential(
    nn.Linear(128, 256), nn.Tanh(),
    nn.Linear(256, 128), nn.Tanh(),
    nn.Linear(128, 10)
)
model.apply(init_xavier)
TensorFlow:
import tensorflow as tf
# GlorotUniform is the DEFAULT kernel initializer for Dense and Conv layers
dense = tf.keras.layers.Dense(256) # already uses GlorotUniform
# Explicit
dense = tf.keras.layers.Dense(256,
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
    bias_initializer='zeros')
# Standalone
glorot_u = tf.keras.initializers.GlorotUniform()
w = glorot_u(shape=(128, 256))
Xavier Normal
torch.nn.init.xavier_normal_(tensor, gain=1.0) / tf.keras.initializers.GlorotNormal()
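Same variance target, drawn from a zero-mean normal distribution:
$$W \sim \mathcal{N}(0, \sigma^2), \qquad \sigma = \sqrt{\frac{2}{\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}}}}$$
(Keras draws from a truncated normal internally; PyTorch draws from an untruncated normal.)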
PyTorch:
w = torch.empty(256, 128)
nn.init.xavier_normal_(w, gain=1.0)
# std = sqrt(2 / (128 + 256)) = sqrt(2 / 384) ≈ 0.0722
TensorFlow:
glorot_n = tf.keras.initializers.GlorotNormal()
w = glorot_n(shape=(128, 256))
dense = tf.keras.layers.Dense(256,
    kernel_initializer=tf.keras.initializers.GlorotNormal())
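The truncated versus untruncated distinction matters for the variance. A normal distribution that discards draws beyond a cutoff has a smaller spread than the untruncated one, so an implementation has to widen the underlying normal to still hit the Xavier target. The NumPy sketch below illustrates the size of the effect; the cutoff of two standard deviations is an assumption made here, and this is not the Keras code path.

```python
import numpy as np

fan_in, fan_out = 128, 256
sigma = np.sqrt(2.0 / (fan_in + fan_out))   # Xavier normal target std

rng = np.random.default_rng(0)
plain = rng.normal(0.0, sigma, size=200_000)

# Naive truncation: keep only the draws within two standard deviations.
# The kept samples follow the same distribution as "discard and re-draw".
truncated = plain[np.abs(plain) <= 2 * sigma]

ratio = truncated.std() / sigma
print(f"target std    = {sigma:.4f}")
print(f"plain std     = {plain.std():.4f}")
print(f"truncated std = {truncated.std():.4f}")
print(f"ratio         = {ratio:.3f}")
```

The ratio should come out near 0.88, which is why truncated-normal initializers typically rescale the underlying standard deviation rather than truncate naively.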
The gain Parameter
PyTorch exposes a gain multiplier that scales the computed limits or standard deviation. It accounts for the fact that some activation functions contract or expand variance by a known amount:
# PyTorch gain values by activation
print(nn.init.calculate_gain('linear')) # 1.0
print(nn.init.calculate_gain('sigmoid')) # 1.0
print(nn.init.calculate_gain('tanh')) # 1.6667 (5/3)
print(nn.init.calculate_gain('relu')) # 1.4142 (sqrt(2))
print(nn.init.calculate_gain('leaky_relu')) # ≈ 1.4141 (varies with slope)
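For most use cases with sigmoid or tanh, gain=1.0 is an appropriate default and reproduces the original Glorot formula; calculate_gain('tanh') = 5/3 is the value PyTorch suggests if you want to offset tanh's contraction of variance. The Keras Glorot initializers take no gain argument; tf.keras.initializers.VarianceScaling(scale=gain**2, mode='fan_avg', distribution='uniform') gives the equivalent scaling. For ReLU, you should use He initialization instead of Xavier with gain.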
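The gain values printed above follow simple closed forms, and the gain enters the Xavier formulas as a plain multiplier. The sketch below re-derives them in pure Python so the effect on the limit and standard deviation is visible; `gain_for` is a hypothetical stand-in written for this example, not PyTorch's own function, and it covers only the activations listed here.

```python
import math

def gain_for(nonlinearity, negative_slope=0.01):
    """Hypothetical re-derivation of the gain values listed above."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        # Depends on the negative slope, hence "varies with slope".
        return math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")

fan_in, fan_out = 128, 256
for name in ("linear", "tanh", "relu", "leaky_relu"):
    g = gain_for(name)
    limit = g * math.sqrt(6.0 / (fan_in + fan_out))   # uniform limit with gain
    std = g * math.sqrt(2.0 / (fan_in + fan_out))     # normal std with gain
    print(f"{name:>10}: gain={g:.4f}  limit={limit:.4f}  std={std:.4f}")
```

Because the gain multiplies the limit and the standard deviation, the resulting weight variance scales with the square of the gain.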
Uniform vs Normal: Which to Choose?
In practice, the difference is small, because both variants target the same variance. Xavier uniform is slightly more conservative (every weight lies within a bounded range), while Xavier normal can produce occasional larger values. The key factors:
- Xavier uniform (GlorotUniform): the Keras default for Dense and Conv layers; weights are bounded by $\pm\sqrt{6 / (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})}$
- Xavier normal (GlorotNormal): used in some Transformer implementations
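The comparison can be made concrete by sampling both variants for the same layer shape. This NumPy sketch draws the weights directly from the two distributions rather than calling either framework, and uses an untruncated normal.

```python
import numpy as np

fan_in, fan_out = 128, 256
target_std = np.sqrt(2.0 / (fan_in + fan_out))   # shared Xavier target
limit = np.sqrt(6.0 / (fan_in + fan_out))        # uniform bound

rng = np.random.default_rng(0)
w_uniform = rng.uniform(-limit, limit, size=(fan_out, fan_in))
w_normal = rng.normal(0.0, target_std, size=(fan_out, fan_in))

for name, w in [("uniform", w_uniform), ("normal", w_normal)]:
    print(f"{name:>8}: std={w.std():.4f}  max|w|={np.abs(w).max():.4f}")
```

Both standard deviations should land close to the same target, while the largest normal weight exceeds the uniform bound.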
When to Use Xavier
| Scenario | Appropriate? |
|---|---|
| Linear layers with sigmoid or tanh | Yes — this is the primary use case |
| Linear layers with no activation | Yes — linear approximation holds exactly |
| Convolutional layers with sigmoid/tanh | Yes |
| Linear/conv layers with ReLU | No — use He/Kaiming instead |
| RNN weight matrices | Partially — orthogonal is often better for hidden-to-hidden |
| Embedding layers | No — normal or truncated normal with small std |
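For the convolutional rows, fan-in and fan-out include the kernel's spatial extent. The pure-Python sketch below assumes PyTorch's weight layout of (out_channels, in_channels, *kernel_size); Keras stores kernels in a different order, so the indexing would need adjusting there. `fans_from_shape` is a hypothetical helper name.

```python
import math

def fans_from_shape(shape):
    """Return (fan_in, fan_out) for a weight of shape (out, in, *kernel_dims).

    Hypothetical helper assuming PyTorch's layout: dimension 0 holds the
    output channels/features, dimension 1 the input channels/features, and
    any remaining dimensions are the kernel's spatial extent.
    """
    if len(shape) < 2:
        raise ValueError("fan_in and fan_out need at least 2 dimensions")
    receptive_field = math.prod(shape[2:])   # 1 for a plain linear layer
    return shape[1] * receptive_field, shape[0] * receptive_field

for label, shape in [("Linear(128, 256)", (256, 128)),
                     ("Conv2d(64, 128, kernel_size=3)", (128, 64, 3, 3))]:
    fan_in, fan_out = fans_from_shape(shape)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    print(f"{label}: fan_in={fan_in}, fan_out={fan_out}, xavier std={std:.4f}")
```

A 3×3 convolution therefore has fans nine times larger than a linear layer with the same channel counts, and a correspondingly smaller Xavier standard deviation.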
Quick Reference
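| Variant | Formula | PyTorch | TF/Keras |
|---|---|---|---|
| Xavier uniform | $W \sim U[-a, a]$, $a = \sqrt{6 / (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})}$ | xavier_uniform_ | GlorotUniform |
| Xavier normal | $W \sim \mathcal{N}(0, \sigma^2)$, $\sigma = \sqrt{2 / (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})}$ | xavier_normal_ | GlorotNormal |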