Supplement · Weight Initialization

Constant & Identity Initializations

By the end of this reading you will be able to:

Distinguish when zero initialization is safe (biases) versus harmful (weights) and explain the symmetry mechanism that makes zero-weight networks untrainable
Apply zeros_, ones_, constant_, eye_, and dirac_ in PyTorch and their Keras equivalents to initialize specific tensors
Explain what dirac_ does to a convolutional weight tensor and why it produces an identity-like mapping at initialization

zeros — All Weights Set to Zero

torch.nn.init.zeros_(tensor) / tf.keras.initializers.Zeros()

Sets every element of the tensor to $0$. As established in the previous reading, this makes a network untrainable when applied to weights: every neuron in a layer computes the same output and receives the same gradient, so the symmetry between neurons is never broken.

When it is appropriate: biases. A zero bias means the initial decision boundary passes through the origin, which is a reasonable starting point. Many frameworks default to zero bias initialization.

When to avoid: weight matrices of any fully-connected or convolutional layer.
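The effect on training can be checked directly. The following is a minimal sketch, assuming a hypothetical toy network (4 inputs, 3 tanh hidden units, 1 output) and random data that are not part of the original reading. With every weight and bias at zero, the hidden activations are zero, so no weight receives a gradient and only the output bias can move.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network: 4 inputs -> 3 tanh hidden units -> 1 output
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for layer in (model[0], model[2]):
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)

x = torch.randn(16, 4)
y = torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

hidden_w_grad = model[0].weight.grad   # all zeros: output weights are 0, so nothing flows back
output_w_grad = model[2].weight.grad   # all zeros: hidden activations are tanh(0) = 0
output_b_grad = model[2].bias.grad     # the only parameter with a non-zero gradient

print(hidden_w_grad.abs().max().item(), output_w_grad.abs().max().item())
print(output_b_grad)
```

Gradient descent on this network can only shift the output bias toward the mean of the targets; the weights never leave zero.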

PyTorch:

import torch
import torch.nn as nn

# Weight tensor
w = torch.empty(256, 128)
nn.init.zeros_(w)   # all zeros — never use for weights

# Correct use: bias only
linear = nn.Linear(128, 256)
nn.init.zeros_(linear.bias)   # safe — bias zero init

TensorFlow:

import tensorflow as tf

# In a layer — zeros for bias (the default), not for kernel
dense = tf.keras.layers.Dense(
    256,
    kernel_initializer='glorot_uniform',  # NOT zeros
    bias_initializer='zeros'              # safe
)

# Standalone
zero_init = tf.keras.initializers.Zeros()
values = zero_init(shape=(3, 4))   # tensor of zeros

ones — All Weights Set to One

torch.nn.init.ones_(tensor) / tf.keras.initializers.Ones()

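Sets every element to $1$. This retains the symmetry problem (every neuron in a layer is identical) and additionally causes activations to grow quickly: each output is the sum of all inputs, so in a layer with 512 zero-mean, unit-variance inputs each output has a standard deviation of $\sqrt{512} \approx 22.6$. From the second layer onward all inputs are identical, and a width-512 layer then multiplies the activation norm by 512.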
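The growth can be measured with a short experiment. This is a sketch under stated assumptions: three stacked width-512 layers with all-ones weights, no bias, and no activation function, fed a random zero-mean input. The growth factor of the first layer depends on the input sum, while every later layer sees identical inputs and scales the norm by the layer width.

```python
import torch

torch.manual_seed(0)
n = 512
W = torch.ones(n, n, dtype=torch.float64)   # all-ones weights, no bias, no activation
h = torch.randn(n, dtype=torch.float64)     # zero-mean, unit-variance input

ratios = []
for _ in range(3):
    out = W @ h                              # every output is the same sum of all inputs
    ratios.append((out.norm() / h.norm()).item())
    h = out

# After the first layer all activations in a layer are identical
all_outputs_equal = torch.allclose(h, h[0].expand_as(h))
print(ratios, all_outputs_equal)
```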

Legitimate uses: initializing scale parameters in normalization layers to 1 (e.g. $\gamma = 1$ in BatchNorm), together with a shift of $\beta = 0$, so that the learnable affine step leaves the normalized activations unchanged at the start of training. Both PyTorch and Keras do this by default.

PyTorch:

bn = nn.BatchNorm1d(128)
# PyTorch does this internally:
nn.init.ones_(bn.weight)   # gamma (scale) = 1
nn.init.zeros_(bn.bias)    # beta (shift) = 0
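The defaults can be confirmed without calling any initializer. The sketch below assumes freshly constructed `BatchNorm1d` and `LayerNorm` modules and a deliberately shifted, scaled random batch (an illustrative choice). It also shows that with $\gamma = 1$ and $\beta = 0$ the layer still normalizes its input; only the affine step is an identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(128)   # modules start in training mode
ln = nn.LayerNorm(128)

# Freshly constructed normalization layers already have scale = 1, shift = 0
defaults_ok = all(
    bool(torch.all(m.weight == 1)) and bool(torch.all(m.bias == 0))
    for m in (bn, ln)
)

x = 5.0 * torch.randn(64, 128) + 3.0   # batch with mean 3 and std 5
with torch.no_grad():
    out = bn(x)

per_feature_mean = out.mean(dim=0)                  # close to 0
per_feature_var = out.var(dim=0, unbiased=False)    # close to 1
print(defaults_ok, per_feature_mean.abs().max().item(), per_feature_var.mean().item())
```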

TensorFlow:

batch_norm = tf.keras.layers.BatchNormalization()
# Keras defaults: gamma_initializer='ones', beta_initializer='zeros'

constant — Arbitrary Fixed Value

torch.nn.init.constant_(tensor, val) / tf.keras.initializers.Constant(value=0)

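Sets every element to a specified constant. Has the same symmetry problem as zeros and ones for weights. Used when you need a specific non-zero bias. A related trick, initializing output logit biases to the log frequency of each class, needs one value per class, so in PyTorch those values are copied into the bias tensor rather than set with constant_, which takes a single scalar.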
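To illustrate the symmetry problem with a non-zero constant, here is a minimal sketch on a hypothetical toy network trained for a few SGD steps on random data (the network, learning rate, and step count are assumptions for illustration). The weights do change, but every hidden unit receives the same update, so the rows of the first weight matrix stay identical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network: 4 inputs -> 3 tanh hidden units -> 1 output
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for layer in (model[0], model[2]):
    nn.init.constant_(layer.weight, 0.5)
    nn.init.zeros_(layer.bias)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 4), torch.randn(16, 1)
for _ in range(5):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()

W1 = model[0].weight.detach()
rows_identical = torch.allclose(W1, W1[0].expand_as(W1))          # hidden units are still clones
weights_moved = not torch.allclose(W1, torch.full_like(W1, 0.5))  # but training did update them
print(rows_identical, weights_moved)
```

The three hidden units behave like a single unit, so the extra width adds no capacity.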

PyTorch:

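w = torch.empty(10)
nn.init.constant_(w, val=0.1)   # every element = 0.1

# Use case: output layer bias initialized to class log-frequencies.
# class_counts and output_layer are illustrative placeholders.
class_counts = torch.tensor([900.0, 90.0, 10.0])
output_layer = nn.Linear(128, 3)
log_freq = torch.log(class_counts / class_counts.sum())
# constant_ takes a single scalar, so copy the per-class values instead
with torch.no_grad():
    output_layer.bias.copy_(log_freq)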
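Why log-frequencies are a sensible bias can be seen from the softmax: when the logits equal the log of the class priors, the predicted probabilities are the priors themselves. The sketch below assumes hypothetical class counts and a 32-feature output head, and feeds a zero feature vector so that the logits reduce to the bias.

```python
import torch
import torch.nn as nn

# Hypothetical, heavily imbalanced class counts
class_counts = torch.tensor([900.0, 90.0, 10.0])
priors = class_counts / class_counts.sum()

head = nn.Linear(32, 3)   # illustrative output layer
with torch.no_grad():
    head.bias.copy_(torch.log(priors))
    # With a zero input the weights contribute nothing, so logits == bias
    probs = torch.softmax(head(torch.zeros(1, 32)), dim=1).squeeze(0)

print(priors, probs)
```

With ordinary small random weights the initial predictions stay close to these priors, so early training does not have to spend steps learning the class imbalance.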

TensorFlow:

const_init = tf.keras.initializers.Constant(value=0.1)
bias = const_init(shape=(10,))

# In a layer
dense = tf.keras.layers.Dense(10, bias_initializer=tf.keras.initializers.Constant(0.1))

eye — Identity Matrix

torch.nn.init.eye_(tensor) / tf.keras.initializers.Identity(gain=1.0)

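Initializes a 2D weight matrix as the identity matrix $I$. For a non-square matrix it places ones on the main diagonal and zeros elsewhere, which preserves as many inputs as possible. With a square matrix, a zero bias, and no activation function, the layer acts as a passthrough at initialization: the output equals the input.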

This is useful when you want a layer to start by doing nothing and gradually learn a transformation. It is most relevant for:

Residual connections: if the shortcut and the main branch both start near the identity, the residual sum is $\approx 2x$, which is still well-conditioned
Adapter layers: when inserting a new layer that should not disrupt a pretrained representation at the start of fine-tuning

PyTorch:

w = torch.empty(64, 64)
nn.init.eye_(w)   # identity matrix; the tensor must be 2D (non-square is allowed)
# w[i][j] = 1 if i == j, else 0
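The passthrough behavior, including the non-square cases, can be verified on small layers. This is a sketch with illustrative layer sizes, assuming zero biases and no activation function.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5)

def eye_linear(in_features, out_features):
    """Linear layer with eye_ weights and zero bias."""
    layer = nn.Linear(in_features, out_features)
    nn.init.eye_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

square = eye_linear(5, 5)   # weight shape (5, 5)
narrow = eye_linear(5, 3)   # weight shape (3, 5)
wide = eye_linear(5, 7)     # weight shape (7, 5)

with torch.no_grad():
    y_square = square(x)    # equals x
    y_narrow = narrow(x)    # keeps the first 3 features of x
    y_wide = wide(x)        # x followed by 2 zero features

print(y_square.shape, y_narrow.shape, y_wide.shape)
```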

TensorFlow:

# Only valid for 2D weight matrices
identity_init = tf.keras.initializers.Identity(gain=1.0)
w = identity_init(shape=(64, 64))

dense = tf.keras.layers.Dense(64, kernel_initializer='identity')  # exact passthrough only if the input also has 64 features

dirac — Identity-Like Initialization for Conv Layers

torch.nn.init.dirac_(tensor, groups=1) (PyTorch only)

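Initializes a convolutional weight tensor so that the layer acts as a passthrough at initialization: each output channel is an exact copy of the corresponding input channel, because the only non-zero weight sits at the center of the filter that maps a channel to itself. This assumes a zero bias, stride 1, and padding that preserves the spatial size.

For a kernel of shape $(C_{out}, C_{in}, kH, kW)$, this places a $1$ at the center of the $kH \times kW$ filter that connects output channel $c$ to input channel $c$, for each $c < \min(C_{out}, C_{in})$, and $0$ everywhere else, including every cross-channel filter. Spatially, each of these filters is a Dirac delta function. With $C_{out} = C_{in}$ the layer is an exact identity; otherwise as many input channels as possible are preserved.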

Why it matters: in very deep convolutional networks (e.g., plain networks without explicit residual connections), dirac initialization allows the model to start as an identity function across layers and learn to deviate from it. It is the convolutional analogue of eye_.

PyTorch:

# groups=1: standard conv; set groups to match nn.Conv2d groups argument
conv_weight = torch.empty(8, 8, 3, 3)   # out_channels, in_channels, kH, kW
nn.init.dirac_(conv_weight)              # exact identity when out_channels == in_channels

# Applying to an nn.Conv2d layer
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
nn.init.dirac_(conv.weight)
nn.init.zeros_(conv.bias)
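A quick check of the passthrough claim, as a sketch with illustrative shapes: a 3-to-3 channel convolution reproduces its input, and a 3-to-8 channel convolution copies the input into its first three output channels and leaves the rest at zero. Both layers assume a zero bias, stride 1, and `padding=1` for the 3x3 kernel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)   # batch, channels, height, width

same = nn.Conv2d(3, 3, kernel_size=3, padding=1)
nn.init.dirac_(same.weight)
nn.init.zeros_(same.bias)

wider = nn.Conv2d(3, 8, kernel_size=3, padding=1)
nn.init.dirac_(wider.weight)   # only min(8, 3) = 3 channels can be preserved
nn.init.zeros_(wider.bias)

with torch.no_grad():
    y_same = same(x)
    y_wider = wider(x)

# One non-zero weight per preserved channel, at the filter center
num_ones = int(same.weight.sum().item())
center_taps = [same.weight[c, c, 1, 1].item() for c in range(3)]
print(num_ones, center_taps)
```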

TensorFlow: No direct equivalent. A similar effect can be achieved manually:

import numpy as np
import tensorflow as tf

def dirac_initializer(shape, dtype=tf.float32):
    """Identity-like init for conv layers; exact when in_channels == out_channels."""
    kH, kW, in_ch, out_ch = shape   # Keras Conv2D kernels are (kH, kW, in_ch, out_ch)
    kernel = np.zeros(shape, dtype=np.float32)
    min_ch = min(out_ch, in_ch)
    center_h, center_w = kH // 2, kW // 2
    for c in range(min_ch):
        kernel[center_h, center_w, c, c] = 1.0
    return tf.constant(kernel, dtype=dtype)

conv = tf.keras.layers.Conv2D(16, 3, padding='same',
    kernel_initializer=dirac_initializer)

Summary

Initializer	PyTorch	TF/Keras	Use for
zeros	`zeros_()`	`Zeros()`	Biases, BatchNorm beta
ones	`ones_()`	`Ones()`	BatchNorm gamma, LayerNorm scale
constant	`constant_(val)`	`Constant(value)`	Output biases at specific value
eye	`eye_()`	`Identity(gain)`	Square weight matrices as passthrough
dirac	`dirac_(groups)`	(manual)	Conv layers as passthrough
