Constant & Identity Initializations
- Distinguish when zero initialization is safe (biases) versus harmful (weights) and explain the symmetry mechanism that makes zero-weight networks untrainable
- Apply zeros_, ones_, constant_, eye_, and dirac_ in PyTorch and their Keras equivalents to initialize specific tensors
- Explain what dirac_ does to a convolutional weight tensor and why it produces an identity-like mapping at initialization
zeros — All Weights Set to Zero
torch.nn.init.zeros_(tensor) / tf.keras.initializers.Zeros()
Sets every element of the tensor to 0. As established in the previous reading, this makes the weights untrainable: every neuron in a layer computes the same output and receives the same gradient, so the symmetry between neurons is never broken.
When it is appropriate: biases. A zero bias means the initial decision boundary passes through the origin, which is a reasonable starting point. Many frameworks default to zero bias initialization.
When to avoid: weight matrices of any fully-connected or convolutional layer.
PyTorch:
import torch
import torch.nn as nn
# Weight tensor
w = torch.empty(256, 128)
nn.init.zeros_(w) # all zeros — never use for weights
# Correct use: bias only
linear = nn.Linear(128, 256)
nn.init.zeros_(linear.bias) # safe — bias zero init
TensorFlow:
import tensorflow as tf
# In a layer — zeros for bias (the default), not for kernel
dense = tf.keras.layers.Dense(
    256,
    kernel_initializer='glorot_uniform',  # NOT zeros
    bias_initializer='zeros'              # safe
)
# Standalone
zero_init = tf.keras.initializers.Zeros()
values = zero_init(shape=(3, 4)) # tensor of zeros
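The symmetry argument can be checked numerically. The sketch below is an illustrative setup rather than anything prescribed by the reading: the layer sizes, the tanh activation, and the random data are all assumptions. It zero-initializes a small two-layer network and inspects the gradients after one backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network: 4 inputs -> 3 hidden units -> 1 output
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for p in model.parameters():
    nn.init.zeros_(p)  # zero weights AND zero biases

x = torch.randn(8, 4)  # random batch (illustrative data)
y = torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

hidden_w_grad = model[0].weight.grad  # shape (3, 4), one row per hidden unit
output_w_grad = model[2].weight.grad  # shape (1, 3)

# Every hidden unit receives the same gradient row, so no update can
# make the units differ from one another.
rows_identical = bool((hidden_w_grad == hidden_w_grad[0]).all())
print(rows_identical)

# Largest absolute weight gradient in each layer
print(hidden_w_grad.abs().max().item(), output_w_grad.abs().max().item())

# The output bias is the only parameter that still receives a signal
print(model[2].bias.grad.abs().max().item())
```

In this setup the weight gradients are not just identical but exactly zero, because the hidden activations and the output weights are both zero, so only the output bias can move.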
ones — All Weights Set to One
torch.nn.init.ones_(tensor) / tf.keras.initializers.Ones()
Sets every element to 1. This retains the symmetry problem (every neuron in a layer is identical) and additionally causes activations to grow quickly: with 512 inputs and all-ones weights, every output is the sum of all 512 inputs, so a 512-wide layer can scale the input norm by up to 512 per layer.
Legitimate uses: initializing scale parameters in normalization layers to 1 (e.g. in BatchNorm) so that the layer acts as an identity at the start of training. Both PyTorch and Keras do this by default.
PyTorch:
bn = nn.BatchNorm1d(128)
# PyTorch does this internally:
nn.init.ones_(bn.weight) # gamma (scale) = 1
nn.init.zeros_(bn.bias) # beta (shift) = 0
TensorFlow:
batch_norm = tf.keras.layers.BatchNormalization()
# Keras defaults: gamma_initializer='ones', beta_initializer='zeros'
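To make the activation growth from all-ones weights concrete, here is a minimal sketch. It assumes a bias-free 512-to-512 linear layer and an all-ones input; both choices are illustrative and picked so the arithmetic is easy to follow.

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512, bias=False)  # assumed width, for illustration
nn.init.ones_(layer.weight)

x = torch.ones(1, 512)  # illustrative input
with torch.no_grad():
    y = layer(x)        # every output element is the sum of the inputs

growth = (y.norm() / x.norm()).item()
print(growth)  # ratio of output norm to input norm for this input

# All output units are identical, so the symmetry problem is still present
all_equal = bool((y == y[0, 0]).all())
print(all_equal)
```

Stacking several such layers compounds the factor, which is why all-ones initialization is reserved for scale parameters rather than weight matrices.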
constant — Arbitrary Fixed Value
torch.nn.init.constant_(tensor, val) / tf.keras.initializers.Constant(value=0)
Sets every element to a specified constant. Has the same symmetry problem as zeros and ones for weights. Used when you need a specific non-zero bias (e.g., initializing output logit biases to the log frequency of each class).
PyTorch:
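w = torch.empty(10)
nn.init.constant_(w, val=0.1) # every element = 0.1
# Use case: output layer bias initialized to class log-frequencies
class_counts = torch.tensor([900.0, 90.0, 10.0]) # illustrative counts, not real data
output_layer = nn.Linear(128, 3) # illustrative output layer, one logit per class
log_freq = torch.log(class_counts / class_counts.sum())
# constant_ writes one scalar everywhere, so copy the per-class vector instead
with torch.no_grad():
    output_layer.bias.copy_(log_freq)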
TensorFlow:
const_init = tf.keras.initializers.Constant(value=0.1)
bias = const_init(shape=(10,))
# In a layer
dense = tf.keras.layers.Dense(10, bias_initializer=tf.keras.initializers.Constant(0.1))
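The log-frequency bias trick mentioned above can be verified with a short sketch. The class counts, the feature size of 16, and the zero input are assumptions chosen for illustration. With a zero input the logits equal the bias, so the softmax output shows what the bias alone contributes.

```python
import torch
import torch.nn as nn

# Hypothetical class counts for an imbalanced 3-class problem
class_counts = torch.tensor([900.0, 90.0, 10.0])
prior = class_counts / class_counts.sum()

output_layer = nn.Linear(16, 3)  # assumed feature size of 16
with torch.no_grad():
    # Per-class values differ, so copy a vector rather than call constant_
    output_layer.bias.copy_(torch.log(prior))

# Zero features isolate the bias: logits == bias
features = torch.zeros(1, 16)
with torch.no_grad():
    probs = torch.softmax(output_layer(features), dim=1)[0]

print(probs)  # predicted class probabilities from the bias alone
print(prior)  # empirical class frequencies
```

Because the softmax of log-probabilities returns the probabilities themselves, the initial predictions match the class prior, which spares the network from spending its first updates learning the class imbalance.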
eye — Identity Matrix
torch.nn.init.eye_(tensor) / tf.keras.initializers.Identity(gain=1.0)
Initializes a 2D weight matrix as the identity matrix (or as close as possible for non-square tensors, filling the rest with zeros). At initialization, the layer acts as a passthrough — input equals output.
This is useful when you want a layer to start by doing nothing and gradually learn a transformation. It is most relevant for:
- Residual connections: if the shortcut and the main branch both start near the identity, the residual sum is approximately 2x, which is still well-conditioned
- Linear probes: when inserting an adapter layer that should not disrupt a pretrained representation at the start of fine-tuning
PyTorch:
w = torch.empty(64, 64)
nn.init.eye_(w) # identity matrix; requires a 2D tensor (non-square shapes get ones on the main diagonal only)
# w[i][j] = 1 if i == j, else 0
TensorFlow:
# Only usable for 2D weight matrices
identity_init = tf.keras.initializers.Identity(gain=1.0)
w = identity_init(shape=(64, 64))
dense = tf.keras.layers.Dense(64, kernel_initializer='identity') # passthrough only when the input also has 64 features
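The passthrough behaviour can be confirmed directly. This is a minimal sketch, assuming a bias-free square linear layer and a random input batch. It also shows what eye_ writes into a non-square tensor.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Square case: the layer returns its input unchanged
layer = nn.Linear(64, 64, bias=False)
nn.init.eye_(layer.weight)
x = torch.randn(4, 64)  # illustrative batch
with torch.no_grad():
    is_passthrough = torch.allclose(layer(x), x)
print(is_passthrough)

# Non-square case: ones on the main diagonal, zeros everywhere else
w = torch.empty(3, 5)
nn.init.eye_(w)
print(w)
```

Note that a layer created with the default bias=True also needs its bias zeroed to behave as an exact passthrough.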
dirac — Identity-Like Initialization for Conv Layers
torch.nn.init.dirac_(tensor, groups=1) (PyTorch only)
Initializes a convolutional weight tensor so that the layer acts as a passthrough at initialization: each output channel is an exact copy of the corresponding input channel, with the filter centered on the current pixel and zero everywhere else.
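For a kernel of shape (out_channels, in_channels, kH, kW) with out_channels = in_channels, this places a 1 at the spatial center of the filter that connects each channel to itself and 0 everywhere else: a Dirac delta function in the spatial dimensions.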
Why it matters: in very deep convolutional networks (e.g., plain networks without explicit residual connections), dirac initialization allows the model to start as an identity function across layers and learn to deviate from it. It is the convolutional analogue of eye_.
PyTorch:
# groups=1: standard conv; set groups to match nn.Conv2d groups argument
conv_weight = torch.empty(8, 8, 3, 3) # out_channels, in_channels, kH, kW
nn.init.dirac_(conv_weight) # exact identity when out_channels == in_channels; otherwise only the first min(out, in) channels are preserved
# Applying to an nn.Conv2d layer
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
nn.init.dirac_(conv.weight)
nn.init.zeros_(conv.bias)
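As a quick check of the identity-like mapping, the sketch below passes a random batch through a dirac-initialized convolution. The batch shape is an illustrative assumption; padding=1 is what keeps the spatial size unchanged for a 3x3 kernel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
nn.init.dirac_(conv.weight)
nn.init.zeros_(conv.bias)

x = torch.randn(2, 16, 8, 8)  # illustrative batch: (N, C, H, W)
with torch.no_grad():
    y = conv(x)
is_passthrough = torch.allclose(y, x, atol=1e-6)
print(is_passthrough)

w = conv.weight.detach()
print(w[0, 0])                     # filter linking channel 0 to itself
print(w[0, 1].abs().sum().item())  # filter linking input channel 1 to output channel 0
print(w.sum().item())              # total of all weights in the tensor
```

Only the center tap of each same-channel filter is non-zero, so the weight tensor contains exactly one 1 per channel.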
TensorFlow: No direct equivalent. A similar effect can be achieved manually:
import numpy as np
import tensorflow as tf
def dirac_initializer(shape, dtype=tf.float32):
    """Identity-like init for conv layers with in_channels == out_channels."""
    kH, kW, in_ch, out_ch = shape  # Keras Conv2D kernels are (kH, kW, in_ch, out_ch)
    kernel = np.zeros(shape, dtype=np.float32)
    min_ch = min(out_ch, in_ch)
    center_h, center_w = kH // 2, kW // 2
    for c in range(min_ch):
        kernel[center_h, center_w, c, c] = 1.0  # center tap maps channel c to itself
    return tf.constant(kernel, dtype=dtype)

conv = tf.keras.layers.Conv2D(16, 3, padding='same',
                              kernel_initializer=dirac_initializer)
Summary
| Initializer | PyTorch | TF/Keras | Use for |
|---|---|---|---|
| zeros | zeros_() | Zeros() | Biases, BatchNorm beta |
| ones | ones_() | Ones() | BatchNorm gamma, LayerNorm scale |
| constant | constant_(val) | Constant(value) | Output biases at specific value |
| eye | eye_() | Identity(gain) | Square weight matrices as passthrough |
| dirac | dirac_(groups) | (manual) | Conv layers as passthrough |