Supplement · Weight Initialization

Practical Guide & Default Behaviors

13 min read

By the end of this reading you will be able to:

State the default initialization used by PyTorch for nn.Linear, nn.Conv2d, nn.Embedding, nn.LSTM, and nn.BatchNorm1d
State the default initialization used by TensorFlow/Keras for Dense, Conv2D, LSTM, and BatchNormalization
Apply model.apply() in PyTorch and kernel_initializer in TensorFlow to override default initialization across all layers of a model
Select the appropriate initializer for a given activation function and layer type using the decision guide

PyTorch Default Initializations

PyTorch initializes most built-in layers through a reset_parameters() method that runs at construction. The defaults are documented but not always obvious:

Layer	Weight default	Bias default
`nn.Linear`	`kaiming_uniform_(a=√5)`	`uniform_(-1/√fan_in, 1/√fan_in)`
`nn.Conv1d/2d/3d`	`kaiming_uniform_(a=√5)`	`uniform_(-1/√fan_in, 1/√fan_in)`
`nn.Embedding`	`normal_(mean=0, std=1)`	N/A
`nn.LSTM` / `nn.GRU`	`uniform_(-1/√H, 1/√H)` where $H$ is hidden size	same
`nn.RNN`	`uniform_(-1/√H, 1/√H)`	same
`nn.BatchNorm*`	`ones_()` (weight $\gamma$ )	`zeros_()` (bias $\beta$ )
`nn.LayerNorm`	`ones_()` (weight)	`zeros_()` (bias)
`nn.MultiheadAttention`	`xavier_uniform_` for the input projection; Linear default for the output projection	zeros

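A note on PyTorch's kaiming_uniform_ default for Linear and Conv: read literally, the a=√5 parameter corresponds to a LeakyReLU negative slope of $\sqrt{5} \approx 2.24$, which is not a standard activation. Its actual effect is to set the gain to $\sqrt{2/(1+5)} = \sqrt{1/3}$, so the weights are drawn from uniform_(-1/√fan_in, 1/√fan_in), the same range as the bias. This reproduces PyTorch's older uniform default and has been kept for backward compatibility. In practice, if your network uses ReLU, override it explicitly with nonlinearity='relu' (equivalently a=0).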
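The effect of a=√5 can be checked by hand. The sketch below uses only the formulas documented for kaiming_uniform_ (gain = sqrt(2 / (1 + a²)), bound = gain · sqrt(3 / fan_in)). The helper name kaiming_uniform_bound is made up for illustration and is not a PyTorch function.

```python
import math

def kaiming_uniform_bound(fan_in, a=0.0):
    """Half-width of the uniform range used by kaiming_uniform_ (leaky_relu gain)."""
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    return gain * math.sqrt(3.0 / fan_in)

fan_in = 128
default_bound = kaiming_uniform_bound(fan_in, a=math.sqrt(5))  # PyTorch default for nn.Linear
relu_bound = kaiming_uniform_bound(fan_in, a=0.0)              # He uniform for ReLU

print(default_bound, 1 / math.sqrt(fan_in))  # the two values agree
print(relu_bound / default_bound)            # ratio of the two bounds, sqrt(6)
```

The ReLU-appropriate range is wider than the default by a factor of √6, regardless of fan_in.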

import torch.nn as nn

# Inspect default initialization behavior
linear = nn.Linear(128, 256)
print(linear.weight[:2, :4])   # kaiming_uniform_(a=sqrt(5)) by default
print(linear.bias[:4])         # uniform_(-1/sqrt(128), 1/sqrt(128))

# Override after construction (the layer already holds its default values)
linear.apply(lambda m: nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
    if isinstance(m, nn.Linear) else None)

TensorFlow / Keras Default Initializations

Layer	Kernel default	Bias default	Recurrent default
`Dense`	`GlorotUniform`	`Zeros`	N/A
`Conv1D/2D/3D`	`GlorotUniform`	`Zeros`	N/A
`Embedding`	`RandomUniform(-0.05, 0.05)`	N/A	N/A
`LSTM`	`GlorotUniform`	`Zeros`	`Orthogonal`
`GRU`	`GlorotUniform`	`Zeros`	`Orthogonal`
`SimpleRNN`	`GlorotUniform`	`Zeros`	`Orthogonal`
`BatchNormalization`	`Ones` ( $\gamma$ )	`Zeros` ( $\beta$ )	N/A
`LayerNormalization`	`Ones` ( $\gamma$ )	`Zeros` ( $\beta$ )	N/A

Note: Keras uses GlorotUniform (Xavier uniform) as the default for all feedforward layers, while PyTorch uses kaiming_uniform_. This is one of the main practical differences between the two frameworks.
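To make the difference concrete, the following sketch compares the sampling range each framework would use for the same 128 → 256 dense layer. It relies on the documented GlorotUniform limit, sqrt(6 / (fan_in + fan_out)), and on the fact that kaiming_uniform_(a=√5) reduces to a limit of 1/√fan_in. Both function names are hypothetical helpers, not framework APIs.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    # Keras GlorotUniform draws from U(-limit, limit)
    return math.sqrt(6.0 / (fan_in + fan_out))

def pytorch_linear_default_limit(fan_in):
    # kaiming_uniform_(a=sqrt(5)) reduces to U(-1/sqrt(fan_in), 1/sqrt(fan_in))
    return 1.0 / math.sqrt(fan_in)

fan_in, fan_out = 128, 256
keras_limit = glorot_uniform_limit(fan_in, fan_out)
torch_limit = pytorch_linear_default_limit(fan_in)

print(f"Keras Dense limit:    {keras_limit:.4f}")
print(f"PyTorch Linear limit: {torch_limit:.4f}")
```

For this shape the Keras range is the wider of the two. The ordering depends on fan_out, because the PyTorch default ignores it.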

Overriding Initialization in PyTorch

The idiomatic PyTorch approach is model.apply(fn), which recursively applies fn to every submodule:

import torch.nn as nn

# Strategy 1: model.apply with isinstance checks
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:   # bias is None when the layer was built with bias=False
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, mean=0, std=0.02)
    elif isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 10)
)
model.apply(init_weights)

# Strategy 2: override reset_parameters in a custom module
class MyLinear(nn.Linear):
    def reset_parameters(self):
        nn.init.xavier_uniform_(self.weight)
        if self.bias is not None:   # bias is None when bias=False
            nn.init.zeros_(self.bias)
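To see which modules model.apply(fn) actually reaches, a small sketch can record every module passed to fn. The record helper and the toy model are illustrative assumptions; only nn.Module.apply itself is PyTorch API.

```python
import torch.nn as nn

visited = []

def record(module):
    # Hypothetical helper: note the class name of each module that apply() hands over.
    visited.append(type(module).__name__)

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Sequential(nn.Linear(8, 2)),  # nested container
)
model.apply(record)
print(visited)
```

Children are visited before their parent, and containers such as nn.Sequential are passed to fn as well. That is why an init function needs isinstance checks: without them it would try to access .weight on modules that have none.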

Overriding Initialization in TensorFlow

In Keras, pass the initializer at layer construction:

import tensorflow as tf

# Per-layer override
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu',
        kernel_initializer=tf.keras.initializers.HeNormal(),
        bias_initializer='zeros'),
    tf.keras.layers.Dense(128, activation='relu',
        kernel_initializer=tf.keras.initializers.HeNormal(),
        bias_initializer='zeros'),
    tf.keras.layers.Dense(10, activation='softmax',
        kernel_initializer=tf.keras.initializers.GlorotUniform())
])

# Helper function that reuses the same initializers for every Dense layer
def make_dense(units, activation):
    return tf.keras.layers.Dense(
        units, activation=activation,
        kernel_initializer='he_normal',   # string shorthand
        bias_initializer='zeros'
    )

# LSTM with orthogonal recurrent init (already the default, shown explicitly)
lstm = tf.keras.layers.LSTM(64,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros')
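Keras has no direct counterpart to model.apply(), so one way to get a uniform override across a model is to bake the initializers into the layer constructor. The sketch below does this with functools.partial. The name HeDense is hypothetical, and this is one possible pattern rather than an official Keras idiom.

```python
import functools
import tensorflow as tf

# Hypothetical factory: a Dense constructor with He-normal kernels and zero biases preset.
HeDense = functools.partial(
    tf.keras.layers.Dense,
    kernel_initializer='he_normal',
    bias_initializer='zeros',
)

model = tf.keras.Sequential([
    HeDense(256, activation='relu'),
    HeDense(128, activation='relu'),
    # Keyword arguments given at call time replace the preset ones.
    HeDense(10, activation='softmax', kernel_initializer='glorot_uniform'),
])

for layer in model.layers:
    print(layer.name, type(layer.kernel_initializer).__name__)
```

Because the initializers are passed as strings, each layer gets its own initializer object rather than sharing one instance.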

Initializer String Shorthands

Keras accepts string names for common initializers. PyTorch has no string shorthands; you call the corresponding nn.init function directly:

Initializer	PyTorch (no strings; use functions)	TF/Keras string
Xavier uniform	`nn.init.xavier_uniform_`	`'glorot_uniform'`
Xavier normal	`nn.init.xavier_normal_`	`'glorot_normal'`
He normal	`nn.init.kaiming_normal_`	`'he_normal'`
He uniform	`nn.init.kaiming_uniform_`	`'he_uniform'`
Zeros	`nn.init.zeros_`	`'zeros'`
Ones	`nn.init.ones_`	`'ones'`
Orthogonal	`nn.init.orthogonal_`	`'orthogonal'`
Truncated normal	`nn.init.trunc_normal_`	`'truncated_normal'`

Decision Guide

Use this table as a starting point — it reflects empirical best practices across modern architectures:

Activation	Layer type	Recommended init
ReLU, LeakyReLU, PReLU	Linear, Conv	He / Kaiming (normal, fan_in)
GELU, SiLU, Mish	Linear, Conv	He or Xavier — both work; He is slightly more principled
Sigmoid, Tanh	Linear	Xavier / Glorot
Linear (no activation)	Linear	Xavier
SELU	Linear	LeCun
Any	RNN hidden-to-hidden	Orthogonal
Any	Embedding	`normal_(std=0.02)` or `uniform_(-0.05, 0.05)`
Any	BatchNorm weight $\gamma$	`ones_`
Any	BatchNorm bias $\beta$	`zeros_`
Any	Output bias (classification)	`zeros_` or log-prior
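The three variance-scaling schemes named in the table differ only in which fan they divide by. A minimal sketch of the standard deviations they prescribe for the normal variants is shown below; target_std is a hypothetical helper, not part of either framework.

```python
import math

def target_std(scheme, fan_in, fan_out=None):
    """Standard deviation prescribed by a variance-scaling scheme (normal variants)."""
    if scheme == "he":        # ReLU family: Var(w) = 2 / fan_in
        return math.sqrt(2.0 / fan_in)
    if scheme == "xavier":    # sigmoid / tanh / linear: Var(w) = 2 / (fan_in + fan_out)
        return math.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "lecun":     # SELU: Var(w) = 1 / fan_in
        return math.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("he", "xavier", "lecun"):
    print(scheme, round(target_std(scheme, fan_in=512, fan_out=512), 4))
```

For a square layer, Xavier and LeCun coincide, and He is larger by a factor of √2.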

Common Mistakes

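1. Relying on PyTorch's default kaiming_uniform_(a=√5) for ReLU layers. The default yields a weight variance of $1/(3 \cdot \text{fan\_in})$, six times smaller than the $2/\text{fan\_in}$ that He initialization prescribes for ReLU. Override it explicitly: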

nn.init.kaiming_normal_(m.weight, a=0, nonlinearity='relu')

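2. Forgetting to initialize biases. Keras zero-initializes biases by default, but PyTorch's Linear and Conv layers draw them from uniform_(-1/√fan_in, 1/√fan_in). When you override only the weight init, the bias keeps its framework default, so set it explicitly if you want zeros.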

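3. Using Xavier for ReLU networks. When fan_in and fan_out are similar, Xavier gives a variance of about $1/\text{fan\_in}$, half the $2/\text{fan\_in}$ that ReLU needs, so activations shrink as depth grows. Training will likely succeed eventually but may require a lower learning rate.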

4. Using a fixed small std (e.g. 0.01) for all layers. This ignores fan-in entirely. In a layer with 1024 unit-variance inputs, $\sigma = 0.01$ gives an output variance of $1024 \times 0.0001 = 0.1024$. A single layer survives that, but every such layer shrinks the variance roughly tenfold, so the signal collapses within a few layers. Use variance scaling instead.

5. Applying orthogonal init to feedforward layers. Orthogonal init is designed for square or nearly-square matrices where repeated multiplication is the concern. For standard feedforward layers, Xavier or He is more appropriate.
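Mistakes 3 and 4 can be quantified with the per-layer variance gain of a ReLU layer. The sketch below assumes zero-mean weights, pre-activations that are symmetric around zero, no biases, and no normalization layers, in which case each layer multiplies the pre-activation variance by fan_in · Var(w) / 2. The helper relu_layer_gain and the chosen width and depth are illustrative assumptions.

```python
import math

def relu_layer_gain(fan_in, weight_var):
    # Var(next pre-activation) / Var(current pre-activation) for a ReLU layer
    # under the assumptions stated above: fan_in * Var(w) * 1/2.
    return fan_in * weight_var / 2.0

fan_in = fan_out = 1024
depth = 10
schemes = {
    "fixed std=0.01": 0.01 ** 2,
    "Xavier": 2.0 / (fan_in + fan_out),
    "He": 2.0 / fan_in,
}

for name, var in schemes.items():
    gain = relu_layer_gain(fan_in, var)
    print(f"{name:15s} per-layer gain {gain:.4f}, after {depth} layers {gain ** depth:.3e}")
```

Only He keeps the gain at 1. Xavier halves the variance at every layer for this square shape, and the fixed std shrinks it far faster.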

References

PyTorch — torch.nn.init documentation — Full reference for all PyTorch initialization functions including default layer behaviors

TensorFlow — tf.keras.initializers documentation — Full reference for all Keras initializer classes and their default assignments by layer
