Supplement · Weight Initialization

Practical Guide & Default Behaviors

By the end of this reading you will be able to:
  • State the default initialization used by PyTorch for nn.Linear, nn.Conv2d, nn.Embedding, nn.LSTM, and nn.BatchNorm1d
  • State the default initialization used by TensorFlow/Keras for Dense, Conv2D, LSTM, and BatchNormalization
  • Apply model.apply() in PyTorch and kernel_initializer in TensorFlow to override default initialization across all layers of a model
  • Select the appropriate initializer for a given activation function and layer type using the decision guide

PyTorch Default Initializations

Most built-in PyTorch layers initialize their parameters in a reset_parameters() method, which is called at construction. The defaults are documented but not always obvious:

| Layer | Weight default | Bias default |
| --- | --- | --- |
| nn.Linear | kaiming_uniform_(a=√5) | uniform_(-1/√fan_in, 1/√fan_in) |
| nn.Conv1d/2d/3d | kaiming_uniform_(a=√5) | uniform_(-1/√fan_in, 1/√fan_in) |
| nn.Embedding | normal_(mean=0, std=1) | N/A |
| nn.LSTM / nn.GRU | uniform_(-1/√H, 1/√H), where H is the hidden size | same |
| nn.RNN | uniform_(-1/√H, 1/√H) | same |
| nn.BatchNorm* | ones_() (weight γ) | zeros_() (bias β) |
| nn.LayerNorm | ones_() (weight) | zeros_() (bias) |
| nn.MultiheadAttention | xavier_uniform_ for the input projections; same as Linear for the output projection | zeros |

A note on PyTorch's kaiming_uniform_ default for Linear and Conv: the a=√5 parameter corresponds to a LeakyReLU negative slope of √5 ≈ 2.24, which is not a standard activation. The resulting gain is √(2/(1+5)) = 1/√3, so the weights end up drawn from uniform(-1/√fan_in, 1/√fan_in), the same range as the bias. This is a historical artifact that has been kept for backward compatibility. In practice, if your network uses ReLU, explicitly override with a=0.
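The effect of a=√5 can be checked with a little arithmetic. The sketch below re-implements, in plain Python, the two formulas that kaiming_uniform_ applies in fan_in mode (gain = √(2/(1+a²)) for leaky_relu, bound = gain·√(3/fan_in)). The helper name kaiming_uniform_bound is made up for illustration and is not part of PyTorch.

```python
import math

def kaiming_uniform_bound(fan_in: int, a: float = 0.0) -> float:
    """Half-width of the uniform range used by kaiming_uniform_ (fan_in mode).

    Hypothetical helper: it mirrors the formulas, it does not call PyTorch.
    """
    gain = math.sqrt(2.0 / (1.0 + a ** 2))   # leaky_relu gain for negative slope a
    return gain * math.sqrt(3.0 / fan_in)    # U(-bound, bound) has std gain / sqrt(fan_in)

fan_in = 128
default_bound = kaiming_uniform_bound(fan_in, a=math.sqrt(5))  # nn.Linear default
relu_bound = kaiming_uniform_bound(fan_in, a=0.0)              # He uniform for ReLU

print(f"default bound:  {default_bound:.4f}")
print(f"1/sqrt(fan_in): {1 / math.sqrt(fan_in):.4f}")
print(f"ReLU bound:     {relu_bound:.4f}")
print(f"variance ratio (ReLU / default): {(relu_bound / default_bound) ** 2:.1f}")
```

The first two printed values coincide, and the variance ratio comes out at 6: the default weights carry one sixth of the variance that He initialization would give a ReLU layer.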

import torch.nn as nn

# Inspect default initialization behavior
linear = nn.Linear(128, 256)
print(linear.weight[:2, :4])   # kaiming_uniform_(a=sqrt(5)) by default
print(linear.bias[:4])         # uniform_(-1/sqrt(128), 1/sqrt(128))

# Override after construction (reset_parameters() has already run by this point)
nn.init.kaiming_normal_(linear.weight, nonlinearity='relu')

TensorFlow / Keras Default Initializations

| Layer | Kernel default | Bias default | Recurrent default |
| --- | --- | --- | --- |
| Dense | GlorotUniform | Zeros | N/A |
| Conv1D/2D/3D | GlorotUniform | Zeros | N/A |
| Embedding | RandomUniform(-0.05, 0.05) | N/A | N/A |
| LSTM | GlorotUniform | Zeros (forget-gate bias set to 1 by unit_forget_bias=True) | Orthogonal |
| GRU | GlorotUniform | Zeros | Orthogonal |
| SimpleRNN | GlorotUniform | Zeros | Orthogonal |
| BatchNormalization | Ones (γ) | Zeros (β) | N/A |
| LayerNormalization | Ones (γ) | Zeros (β) | N/A |

Note: Keras uses GlorotUniform (Xavier uniform) as the default for all feedforward layers, while PyTorch uses kaiming_uniform_. This is one of the main practical differences between the two frameworks.

Overriding Initialization in PyTorch

The idiomatic PyTorch approach is model.apply(fn), which recursively applies fn to every submodule:

import torch.nn as nn

# Strategy 1: model.apply with isinstance checks
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:   # bias is None when the layer was built with bias=False
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, mean=0, std=0.02)
    elif isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 10)
)
model.apply(init_weights)

# Strategy 2: override reset_parameters in a custom module
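class MyLinear(nn.Linear):
    def reset_parameters(self):
        # Called by nn.Linear.__init__, so the override takes effect at construction
        nn.init.xavier_uniform_(self.weight)
        if self.bias is not None:
            nn.init.zeros_(self.bias)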

Overriding Initialization in TensorFlow

In Keras, pass the initializer at layer construction:

import tensorflow as tf

# Per-layer override
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu',
        kernel_initializer=tf.keras.initializers.HeNormal(),
        bias_initializer='zeros'),
    tf.keras.layers.Dense(128, activation='relu',
        kernel_initializer=tf.keras.initializers.HeNormal(),
        bias_initializer='zeros'),
    tf.keras.layers.Dense(10, activation='softmax',
        kernel_initializer=tf.keras.initializers.GlorotUniform())
])

# Helper factory, usable with either the Sequential or the Functional API
def make_dense(units, activation):
    return tf.keras.layers.Dense(
        units, activation=activation,
        kernel_initializer='he_normal',   # string shorthand
        bias_initializer='zeros'
    )

# LSTM with orthogonal recurrent init (already the default, shown explicitly)
lstm = tf.keras.layers.LSTM(64,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros')

Initializer String Shorthands

Keras accepts string names for common initializers. PyTorch has no string shorthand, so you call the corresponding nn.init function directly:

| Initializer | PyTorch function (no string shorthand) | TF/Keras string |
| --- | --- | --- |
| Xavier uniform | nn.init.xavier_uniform_ | 'glorot_uniform' |
| Xavier normal | nn.init.xavier_normal_ | 'glorot_normal' |
| He normal | nn.init.kaiming_normal_ | 'he_normal' |
| He uniform | nn.init.kaiming_uniform_ | 'he_uniform' |
| Zeros | nn.init.zeros_ | 'zeros' |
| Ones | nn.init.ones_ | 'ones' |
| Orthogonal | nn.init.orthogonal_ | 'orthogonal' |
| Truncated normal | nn.init.trunc_normal_ | 'truncated_normal' |

Decision Guide

Use this table as a starting point — it reflects empirical best practices across modern architectures:

| Activation | Layer type | Recommended init |
| --- | --- | --- |
| ReLU, LeakyReLU, PReLU | Linear, Conv | He / Kaiming (normal, fan_in) |
| GELU, SiLU, Mish | Linear, Conv | He or Xavier (both work; He is slightly more principled) |
| Sigmoid, Tanh | Linear | Xavier / Glorot |
| Linear (no activation) | Linear | Xavier |
| SELU | Linear | LeCun |
| Any | RNN hidden-to-hidden | Orthogonal |
| Any | Embedding | normal_(std=0.02) or uniform_(-0.05, 0.05) |
| Any | BatchNorm weight γ | ones_ |
| Any | BatchNorm bias β | zeros_ |
| Any | Output bias (classification) | zeros_ or log-prior |
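The "log-prior" option for the output bias means setting each class's bias to the log of its training-set frequency, so that with near-zero output weights the initial softmax prediction already matches the class prior. A minimal sketch in plain Python follows; the class counts are invented for illustration and the softmax helper is written out by hand rather than taken from a framework.

```python
import math

# Hypothetical class counts from a training set (illustrative numbers only)
class_counts = [900, 90, 10]
total = sum(class_counts)
priors = [c / total for c in class_counts]

# Log-prior bias: b_k = log(p_k)
bias = [math.log(p) for p in priors]

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# With zero weights the logits equal the bias, so the first prediction is the prior
initial_probs = softmax(bias)
print([round(p, 3) for p in initial_probs])

# Expected initial cross-entropy: prior entropy vs. log(K) for a zero bias
loss_log_prior = -sum(p * math.log(p) for p in priors)
loss_zero_bias = math.log(len(class_counts))
print(f"log-prior bias: {loss_log_prior:.3f}   zero bias: {loss_zero_bias:.3f}")
```

On imbalanced data the log-prior bias starts training from a lower loss than a zero bias, because the network does not have to spend its first steps learning the class frequencies. In PyTorch the values could be copied into the final layer's bias under torch.no_grad().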

Common Mistakes

1. Using PyTorch's default kaiming_uniform_(a=√5) for ReLU layers. The default a=√5 yields a weight variance of 1/(3·fan_in), one sixth of the 2/fan_in that He initialization prescribes for ReLU. Override it explicitly:

nn.init.kaiming_normal_(m.weight, a=0, nonlinearity='relu')

2. Forgetting to initialize biases. Keras zero-initializes biases by default, but PyTorch draws them from uniform_(-1/√fan_in, 1/√fan_in). When you override weight init manually in PyTorch, the bias keeps that default unless you reset it as well.

3. Using Xavier for ReLU networks. Xavier sets Var(w) = 2/(fan_in + fan_out), which for fan_in ≈ fan_out is about half the 2/fan_in that ReLU needs. Training will likely succeed eventually but may require a lower learning rate.
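The "half the variance" figure holds exactly only when fan_in equals fan_out. The short calculation below, using the standard variance formulas for the two schemes, shows how the ratio moves with layer shape; the layer sizes are arbitrary examples.

```python
def xavier_variance(fan_in: int, fan_out: int) -> float:
    return 2.0 / (fan_in + fan_out)   # Glorot: Var(w) = 2 / (fan_in + fan_out)

def he_variance(fan_in: int) -> float:
    return 2.0 / fan_in               # He (fan_in mode): Var(w) = 2 / fan_in

ratios = {}
for fan_in, fan_out in [(512, 512), (128, 256), (1024, 10)]:
    ratios[(fan_in, fan_out)] = xavier_variance(fan_in, fan_out) / he_variance(fan_in)
    print(f"{fan_in:>4} -> {fan_out:<4} Xavier/He variance ratio = "
          f"{ratios[(fan_in, fan_out)]:.3f}")
```

The ratio simplifies to fan_in / (fan_in + fan_out): one half for a square layer, smaller for layers that widen, and close to one for layers that narrow sharply.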

4. Using a fixed small std (e.g. 0.01) for all layers. This ignores fan-in entirely. In a layer with 1024 inputs, σ = 0.01 gives an output variance of 1024 × 0.0001 = 0.1024 for unit-variance inputs: signal collapse is avoided, but only barely, and the shrinkage compounds with every additional layer. Use variance scaling instead.
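To see how quickly the 0.1024 factor compounds, the sketch below tracks the signal variance through a stack of equally wide linear layers. It assumes zero-mean i.i.d. weights, unit-variance inputs, and no activation functions, so each layer simply multiplies the variance by fan_in · σ²; the depth of 5 is an arbitrary choice.

```python
fan_in = 1024
depth = 5

def variance_after(depth: int, fan_in: int, weight_std: float) -> float:
    """Output variance after `depth` linear layers, for unit-variance input.

    Simplifying assumption: activations are ignored, so every layer scales
    the variance by fan_in * weight_std**2.
    """
    return (fan_in * weight_std ** 2) ** depth

fixed = variance_after(depth, fan_in, weight_std=0.01)
scaled = variance_after(depth, fan_in, weight_std=(1.0 / fan_in) ** 0.5)

print(f"fixed std 0.01:       {fixed:.2e}")
print(f"std = 1/sqrt(fan_in): {scaled:.2e}")
```

With the fixed std the variance drops by roughly an order of magnitude per layer, while the fan-in-scaled std keeps it at 1 under the same assumptions.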

5. Applying orthogonal init to feedforward layers. Orthogonal init is designed for square or nearly-square matrices where repeated multiplication is the concern. For standard feedforward layers, Xavier or He is more appropriate.
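For contrast, here is where orthogonal init does belong: the hidden-to-hidden weights of a recurrent layer. The sketch below applies it to an nn.LSTM, one gate block at a time. Splitting the stacked (4·hidden, hidden) matrix into four square blocks is one common convention rather than the only option, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

hidden = 64
lstm = nn.LSTM(input_size=32, hidden_size=hidden, num_layers=2)

for name, param in lstm.named_parameters():
    if "weight_hh" in name:
        # weight_hh_l{k} stacks the four gate matrices into shape (4*hidden, hidden);
        # give each square (hidden, hidden) block its own orthogonal matrix
        with torch.no_grad():
            for gate in param.chunk(4, dim=0):
                nn.init.orthogonal_(gate)
    elif "weight_ih" in name:
        nn.init.xavier_uniform_(param)   # input-to-hidden: ordinary feedforward init
    elif "bias" in name:
        nn.init.zeros_(param)

# Each gate block W should now satisfy W @ W.T ≈ I
w = lstm.weight_hh_l0.detach()
for gate in w.chunk(4, dim=0):
    print(torch.allclose(gate @ gate.T, torch.eye(hidden), atol=1e-4))
```

The input-to-hidden weights are multiplied only once per time step, which is why they get a feedforward-style initializer here while the repeatedly applied hidden-to-hidden weights get the orthogonal one.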