Practical Guide & Default Behaviors
- State the default initialization used by PyTorch for nn.Linear, nn.Conv2d, nn.Embedding, nn.LSTM, and nn.BatchNorm1d
- State the default initialization used by TensorFlow/Keras for Dense, Conv2D, LSTM, and BatchNormalization
- Apply model.apply() in PyTorch and kernel_initializer in TensorFlow to override default initialization across all layers of a model
- Select the appropriate initializer for a given activation function and layer type using the decision guide
PyTorch Default Initializations
PyTorch initializes most parameterized layers via a reset_parameters() method, called at construction. The defaults are documented but not always obvious:
| Layer | Weight default | Bias default |
|---|---|---|
| nn.Linear | kaiming_uniform_(a=√5) | uniform_(-1/√fan_in, 1/√fan_in) |
| nn.Conv1d/2d/3d | kaiming_uniform_(a=√5) | uniform_(-1/√fan_in, 1/√fan_in) |
| nn.Embedding | normal_(mean=0, std=1) | N/A |
| nn.LSTM / nn.GRU | uniform_(-1/√H, 1/√H), where H is the hidden size | same |
| nn.RNN | uniform_(-1/√H, 1/√H) | same |
| nn.BatchNorm* | ones_() (weight γ) | zeros_() (bias β) |
| nn.LayerNorm | ones_() (weight) | zeros_() (bias) |
| nn.MultiheadAttention | xavier_uniform_ for the input projection; same as Linear for the output projection | zeros |
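A note on PyTorch's kaiming_uniform_ default for Linear and Conv: the a=√5 parameter corresponds to a LeakyReLU negative slope of √5 ≈ 2.24, which is not a standard activation. With that slope the gain is √(2/(1+5)) = 1/√3, so the weights end up drawn from uniform(-1/√fan_in, 1/√fan_in), the same range as the bias. This is a historical artifact that has been kept for backward compatibility. In practice, if your network uses ReLU, explicitly override with a=0 (equivalently, nonlinearity='relu').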
import torch.nn as nn

# Inspect default initialization behavior
linear = nn.Linear(128, 256)
print(linear.weight[:2, :4])  # kaiming_uniform_(a=sqrt(5)) by default
print(linear.bias[:4])        # uniform_(-1/sqrt(128), 1/sqrt(128))

# Override after construction; apply() visits the module and all of its submodules
def init_relu(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

linear.apply(init_relu)
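The effect of a=√5 can be checked numerically. The sketch below recomputes the bound that kaiming_uniform_ uses (gain × √(3/fan_in)) and compares it with the weights of a freshly constructed layer. The layer sizes and variable names are arbitrary and chosen only for illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 128, 256  # arbitrary sizes for illustration
linear = nn.Linear(fan_in, fan_out)

# Gain for leaky_relu with negative slope a is sqrt(2 / (1 + a^2)).
gain = nn.init.calculate_gain('leaky_relu', math.sqrt(5))

# kaiming_uniform_ samples from U(-bound, bound) with bound = gain * sqrt(3 / fan_in).
bound = gain * math.sqrt(3.0 / fan_in)

max_abs_weight = linear.weight.abs().max().item()
print(f"bound from a=sqrt(5): {bound:.6f}")
print(f"1 / sqrt(fan_in):     {1 / math.sqrt(fan_in):.6f}")
print(f"largest |weight|:     {max_abs_weight:.6f}")
```

The first two printed values should agree, and the largest weight magnitude should sit just below them.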
TensorFlow / Keras Default Initializations
| Layer | Kernel default | Bias default | Recurrent default |
|---|---|---|---|
| Dense | GlorotUniform | Zeros | N/A |
| Conv1D/2D/3D | GlorotUniform | Zeros | N/A |
| Embedding | RandomUniform(-0.05, 0.05) | N/A | N/A |
| LSTM | GlorotUniform | Zeros (forget-gate bias set to 1 while unit_forget_bias=True, the default) | Orthogonal |
| GRU | GlorotUniform | Zeros | Orthogonal |
| SimpleRNN | GlorotUniform | Zeros | Orthogonal |
| BatchNormalization | Ones (γ) | Zeros (β) | N/A |
| LayerNormalization | Ones (γ) | Zeros (β) | N/A |
Note: Keras uses GlorotUniform (Xavier uniform) as the default kernel initializer for Dense and Conv layers, while PyTorch uses kaiming_uniform_(a=√5) for nn.Linear and nn.Conv*. This is one of the main practical differences between the two frameworks.
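To make the difference concrete, the following sketch computes the sampling limits of the two defaults for one layer shape. It uses only the standard formulas (Glorot uniform: √(6/(fan_in + fan_out)); Kaiming uniform: gain × √(3/fan_in)). The function names and the 128 → 256 shape are illustrative assumptions, not part of either framework.

```python
import math


def glorot_uniform_limit(fan_in: int, fan_out: int) -> float:
    """Limit of Glorot/Xavier uniform, which samples from U(-limit, limit)."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def pytorch_default_limit(fan_in: int) -> float:
    """Limit of kaiming_uniform_ with a = sqrt(5): gain * sqrt(3 / fan_in)."""
    gain = math.sqrt(2.0 / (1.0 + 5.0))  # leaky_relu gain with a^2 = 5
    return gain * math.sqrt(3.0 / fan_in)


fan_in, fan_out = 128, 256  # illustrative layer shape
keras_limit = glorot_uniform_limit(fan_in, fan_out)
torch_limit = pytorch_default_limit(fan_in)

print(f"Keras Dense default limit:     {keras_limit:.4f}")
print(f"PyTorch nn.Linear default limit: {torch_limit:.4f}")
```

For this shape the Keras limit is larger than the PyTorch one, so the same architecture starts from differently scaled weights depending on the framework.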
Overriding Initialization in PyTorch
The idiomatic PyTorch approach is model.apply(fn), which recursively applies fn to every submodule and then to the model itself:
import torch.nn as nn

# Strategy 1: model.apply with isinstance checks
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:  # bias is None when the layer was built with bias=False
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, mean=0, std=0.02)
    elif isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
        if m.affine:  # weight and bias are None when affine=False
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 10)
)
model.apply(init_weights)

# Strategy 2: override reset_parameters in a custom module
class MyLinear(nn.Linear):
    def reset_parameters(self):
        nn.init.xavier_uniform_(self.weight)
        if self.bias is not None:
            nn.init.zeros_(self.bias)
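Recurrent layers are not covered by the isinstance checks above, because their parameters live directly on the nn.LSTM module under names such as weight_ih_l0 and weight_hh_l0. The sketch below shows one way to give an LSTM Xavier input weights and orthogonal hidden-to-hidden weights. The helper name init_lstm is hypothetical, and setting the forget-gate bias to 1 is an optional choice assumed here for illustration, not something this section prescribes.

```python
import torch
import torch.nn as nn


def init_lstm(lstm: nn.LSTM) -> None:
    """Hypothetical helper: Xavier input weights, orthogonal recurrent weights."""
    hidden = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if "weight_ih" in name:
                # Applied to the stacked (4 * hidden, input) matrix as a whole.
                nn.init.xavier_uniform_(param)
            elif "weight_hh" in name:
                # Shape is (4 * hidden, hidden): one (hidden, hidden) block per gate.
                for gate_block in param.chunk(4, dim=0):
                    nn.init.orthogonal_(gate_block)
            elif "bias" in name:
                param.zero_()
                if "bias_ih" in name:
                    # PyTorch orders the gates as input, forget, cell, output.
                    param[hidden:2 * hidden].fill_(1.0)


torch.manual_seed(0)
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
init_lstm(lstm)
```

Initializing each gate block separately keeps every hidden-to-hidden matrix orthogonal on its own. Calling orthogonal_ on the stacked (4H, H) matrix would only make its columns orthonormal.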
Overriding Initialization in TensorFlow
In Keras, pass the initializer at layer construction:
import tensorflow as tf

# Per-layer override
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu',
                          kernel_initializer=tf.keras.initializers.HeNormal(),
                          bias_initializer='zeros'),
    tf.keras.layers.Dense(128, activation='relu',
                          kernel_initializer=tf.keras.initializers.HeNormal(),
                          bias_initializer='zeros'),
    tf.keras.layers.Dense(10, activation='softmax',
                          kernel_initializer=tf.keras.initializers.GlorotUniform())
])

# Helper that applies the same initializers to every Dense layer it builds
def make_dense(units, activation):
    return tf.keras.layers.Dense(
        units, activation=activation,
        kernel_initializer='he_normal',  # string shorthand
        bias_initializer='zeros'
    )

# LSTM with orthogonal recurrent init (already the default, shown explicitly)
lstm = tf.keras.layers.LSTM(64,
                            kernel_initializer='glorot_uniform',
                            recurrent_initializer='orthogonal',
                            bias_initializer='zeros')
Initializer String Shorthands
Keras accepts string names for common initializers. PyTorch has no string shorthands; call the corresponding function in torch.nn.init instead:
| Initializer | PyTorch function | TF/Keras string |
|---|---|---|
| Xavier uniform | nn.init.xavier_uniform_ | 'glorot_uniform' |
| Xavier normal | nn.init.xavier_normal_ | 'glorot_normal' |
| He normal | nn.init.kaiming_normal_ | 'he_normal' |
| He uniform | nn.init.kaiming_uniform_ | 'he_uniform' |
| Zeros | nn.init.zeros_ | 'zeros' |
| Ones | nn.init.ones_ | 'ones' |
| Orthogonal | nn.init.orthogonal_ | 'orthogonal' |
| Truncated normal | nn.init.trunc_normal_ | 'truncated_normal' |
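If Keras-style names are convenient in a PyTorch codebase, a small lookup table can stand in for them. This is a minimal sketch: INIT_FNS and init_by_name are hypothetical names, PyTorch ships no such registry, and the functions are matched to the Keras strings only loosely, since default arguments (for example the standard deviation of a truncated normal) differ between the frameworks.

```python
import torch
import torch.nn as nn

# Hypothetical name-to-function table; not part of PyTorch.
INIT_FNS = {
    "glorot_uniform": nn.init.xavier_uniform_,
    "glorot_normal": nn.init.xavier_normal_,
    "he_normal": nn.init.kaiming_normal_,    # default a=0 gives the ReLU gain
    "he_uniform": nn.init.kaiming_uniform_,  # default a=0 gives the ReLU gain
    "zeros": nn.init.zeros_,
    "ones": nn.init.ones_,
    "orthogonal": nn.init.orthogonal_,
}


def init_by_name(module: nn.Module, name: str) -> None:
    """Apply the named initializer to the weight of every Linear/Conv layer."""
    init_fn = INIT_FNS[name]
    for m in module.modules():
        if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d)):
            init_fn(m.weight)


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
init_by_name(model, "he_normal")
```

Biases are left untouched by this helper, so they keep the framework default unless they are initialized separately.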
Decision Guide
Use this table as a starting point — it reflects empirical best practices across modern architectures:
| Activation | Layer type | Recommended init |
|---|---|---|
| ReLU, LeakyReLU, PReLU | Linear, Conv | He / Kaiming (normal, fan_in) |
| GELU, SiLU, Mish | Linear, Conv | He or Xavier — both work; He is slightly more principled |
| Sigmoid, Tanh | Linear | Xavier / Glorot |
| Linear (no activation) | Linear | Xavier |
| SELU | Linear | LeCun |
| Any | RNN hidden-to-hidden | Orthogonal |
| Any | Embedding | normal_(std=0.02) or uniform_(-0.05, 0.05) |
| Any | BatchNorm weight | ones_ |
| Any | BatchNorm bias | zeros_ |
| Any | Output bias (classification) | zeros_ or log-prior |
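The "log-prior" option in the last row means setting the output bias to the log of the class frequencies, so that an untrained classifier already predicts the base rates. A minimal sketch, assuming a 3-class problem with made-up class counts and an illustrative 64-dimensional feature vector:

```python
import torch
import torch.nn as nn

# Hypothetical class counts from an imbalanced 3-class training set.
class_counts = torch.tensor([900.0, 90.0, 10.0])
priors = class_counts / class_counts.sum()

head = nn.Linear(64, 3)  # final classification layer; sizes are illustrative
with torch.no_grad():
    head.bias.copy_(torch.log(priors))

# With a zero feature vector the logits equal the bias,
# so the softmax output reproduces the class priors.
probs = torch.softmax(head(torch.zeros(1, 64)), dim=-1)
print(probs)
print(priors)
```

For real inputs the weight term perturbs these probabilities, but the initial loss typically starts closer to the entropy of the label distribution than it would with a zero bias.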
Common Mistakes
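1. Using PyTorch's default kaiming_uniform_(a=√5) for ReLU layers. The default a=√5 is not appropriate for ReLU: it yields a weight variance of 1/(3·fan_in), one sixth of the 2/fan_in that He initialization prescribes. Override it explicitly: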
nn.init.kaiming_normal_(m.weight, a=0, nonlinearity='relu')
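2. Forgetting to initialize biases. Keras zero-initializes biases by default, but PyTorch's nn.Linear and nn.Conv* draw them from uniform_(-1/√fan_in, 1/√fan_in). When you override weight init manually, the bias keeps its default unless you also set it explicitly, and this step is often forgotten.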
3. Using Xavier for ReLU networks. Xavier's weight variance is 2/(fan_in + fan_out), which is half of the 2/fan_in that ReLU needs when fan_in = fan_out. Training will likely succeed eventually but may require a lower learning rate.
4. Using a fixed small std (e.g. 0.01) for all layers. This ignores fan-in entirely. In a layer with 1024 inputs, σ = 0.01 gives an output variance of 1024 × 0.01² ≈ 0.1 for unit-variance inputs: signal collapse is avoided, but barely. Use variance scaling instead.
5. Applying orthogonal init to feedforward layers. Orthogonal init is designed for square or nearly-square matrices where repeated multiplication is the concern. For standard feedforward layers, Xavier or He is more appropriate.
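The variance arguments in mistakes 3 and 4 can be checked empirically. The sketch below pushes unit-variance inputs through a single 1024 → 1024 linear map (no activation) under three weight scales and measures the output variance. NumPy is used only to keep the example framework-neutral; the shapes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, batch = 1024, 1024, 2048

x = rng.standard_normal((batch, fan_in))  # unit-variance inputs

# Fixed std ignores fan_in: Var(output) is about fan_in * std^2 = 1024 * 1e-4.
w_fixed = rng.normal(0.0, 0.01, size=(fan_in, fan_out))
var_fixed = float((x @ w_fixed).var())

# Xavier (normal): Var(W) = 2 / (fan_in + fan_out), so Var(output) is about 1 here.
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))
var_xavier = float((x @ w_xavier).var())

# He (normal, fan_in): Var(W) = 2 / fan_in, so Var(output) is about 2.
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
var_he = float((x @ w_he).var())

print(f"fixed std=0.01: {var_fixed:.3f}")
print(f"Xavier:         {var_xavier:.3f}")
print(f"He:             {var_he:.3f}")
```

The He pre-activation variance is roughly twice the Xavier one for this square layer, which is the factor a following ReLU removes by zeroing about half of the signal.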