Supplement · Weight Initialization

Constant & Identity Initializations

11 min read
By the end of this reading you will be able to:
  • Distinguish when zero initialization is safe (biases) versus harmful (weights) and explain the symmetry mechanism that makes zero-weight networks untrainable
  • Apply zeros_, ones_, constant_, eye_, and dirac_ in PyTorch and their Keras equivalents to initialize specific tensors
  • Explain what dirac_ does to a convolutional weight tensor and why it produces an identity-like mapping at initialization

zeros — All Weights Set to Zero

torch.nn.init.zeros_(tensor) / tf.keras.initializers.Zeros()

Sets every element of the tensor to 0. As established in the previous reading, this breaks training when applied to weights: every neuron in a layer computes the same output and receives the same gradient, so the symmetry between neurons is never broken.

When it is appropriate: biases. A zero bias means the initial decision boundary passes through the origin, which is a reasonable starting point. Many frameworks default to zero bias initialization.

When to avoid: weight matrices of any fully-connected or convolutional layer.

PyTorch:

import torch
import torch.nn as nn

# Weight tensor
w = torch.empty(256, 128)
nn.init.zeros_(w)   # all zeros — never use for weights

# Correct use: bias only
linear = nn.Linear(128, 256)
nn.init.zeros_(linear.bias)   # safe — bias zero init

TensorFlow:

import tensorflow as tf

# In a layer — zeros for bias (the default), not for kernel
dense = tf.keras.layers.Dense(
    256,
    kernel_initializer='glorot_uniform',  # NOT zeros
    bias_initializer='zeros'              # safe
)

# Standalone
zero_init = tf.keras.initializers.Zeros()
values = zero_init(shape=(3, 4))   # tensor of zeros
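
The failure mode described at the start of this section can be observed directly. The following is a minimal sketch, assuming a small two-layer MLP with a tanh activation and random data (the layer sizes, the data, and the variable names are illustrative choices, not part of the reading). With every parameter set to zero, the hidden activations are all zero, so no weight receives a gradient and only the output bias can move.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny MLP with every weight and bias set to zero (illustrative sizes).
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
for p in model.parameters():
    nn.init.zeros_(p)

x = torch.randn(16, 4)
y = torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

hidden_grad = model[0].weight.grad       # gradient of the first layer's weights
output_grad = model[2].weight.grad       # gradient of the second layer's weights
output_bias_grad = model[2].bias.grad    # gradient of the output bias

print("max |grad| hidden weights:", hidden_grad.abs().max().item())
print("max |grad| output weights:", output_grad.abs().max().item())
print("output bias grad:", output_bias_grad)
```

Both weight gradients are exactly zero here because tanh(0) = 0. A nonzero constant would give nonzero gradients, but every neuron in a layer would still receive the same update.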

ones — All Weights Set to One

torch.nn.init.ones_(tensor) / tf.keras.initializers.Ones()

Sets every element to 1. This retains the symmetry problem (every neuron in a layer is identical) and additionally makes activations grow quickly: with 512 zero-mean, unit-variance inputs and all-ones weights, each output is the sum of all inputs, so activation magnitudes grow by a factor of about $\sqrt{512} \approx 22.6$ per layer.
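
The growth factor can be checked numerically. This is a rough sketch, assuming standard-normal inputs and a bias-free 512-to-512 linear layer (the batch size and names are illustrative assumptions). It compares the spread of the outputs with the spread of the inputs and also confirms that every output unit computes the same value.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in = 512
layer = nn.Linear(fan_in, fan_in, bias=False)
nn.init.ones_(layer.weight)

x = torch.randn(10_000, fan_in)   # zero-mean, unit-variance inputs
with torch.no_grad():
    out = layer(x)

# Ratio of output spread to input spread; expected to be close to sqrt(512).
growth = (out.std() / x.std()).item()
print("measured growth:", growth, "sqrt(fan_in):", math.sqrt(fan_in))

# Every output unit is the same sum of inputs, so all columns agree
# (a small tolerance absorbs floating-point summation order).
identical = torch.allclose(out, out[:, :1].expand_as(out), atol=1e-3)
print("all output units identical:", identical)
```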

Legitimate uses: initializing scale parameters in normalization layers to 1 (e.g. $\gamma = 1$ in BatchNorm) so that the learnable scale-and-shift step is an identity at the start of training and the layer outputs the plain normalized activations. Both PyTorch and Keras do this by default.

PyTorch:

bn = nn.BatchNorm1d(128)
# PyTorch does this internally:
nn.init.ones_(bn.weight)   # gamma (scale) = 1
nn.init.zeros_(bn.bias)    # beta (shift) = 0

TensorFlow:

batch_norm = tf.keras.layers.BatchNormalization()
# Keras defaults: gamma_initializer='ones', beta_initializer='zeros'
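
The PyTorch defaults can be confirmed with a short check. The sketch below assumes a freshly constructed nn.BatchNorm1d in training mode and compares its output with a manual normalization (the input statistics and variable names are illustrative). With gamma = 1 and beta = 0, the two should agree up to floating-point error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(128)   # freshly constructed, default initialization
gamma_is_one = torch.equal(bn.weight.detach(), torch.ones(128))
beta_is_zero = torch.equal(bn.bias.detach(), torch.zeros(128))

# Inputs with a nonzero mean and non-unit scale, so normalization is visible.
x = 5.0 + 3.0 * torch.randn(256, 128)
bn.train()
with torch.no_grad():
    out = bn(x)

# Manual normalization with the biased batch variance used in training mode.
manual = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + bn.eps)
matches = torch.allclose(out, manual, atol=1e-4)
print(gamma_is_one, beta_is_zero, matches)
```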

constant — Arbitrary Fixed Value

torch.nn.init.constant_(tensor, val) / tf.keras.initializers.Constant(value=0)

Sets every element to a specified constant. Has the same symmetry problem as zeros and ones for weights. Used when you need a specific non-zero bias (e.g., initializing output logit biases to the log frequency of each class).

PyTorch:

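w = torch.empty(10)
nn.init.constant_(w, val=0.1)   # every element = 0.1

# Use case: output layer bias initialized to class log-frequencies
# (class_counts and output_layer are illustrative placeholders)
class_counts = torch.tensor([900.0, 90.0, 10.0])
output_layer = nn.Linear(128, 3)
log_freq = torch.log(class_counts / class_counts.sum())
# constant_ takes a single scalar, so a per-class vector is copied in instead
with torch.no_grad():
    output_layer.bias.copy_(log_freq)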

TensorFlow:

const_init = tf.keras.initializers.Constant(value=0.1)
bias = const_init(shape=(10,))

# In a layer
dense = tf.keras.layers.Dense(10, bias_initializer=tf.keras.initializers.Constant(0.1))
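
To see why log-frequency biases are a sensible starting point, note that softmax applied to the log of a probability vector returns that vector. The following sketch uses hypothetical class counts (an assumption for illustration) and shows that the bias alone already reproduces the class priors, so the model's initial predictions match the label distribution before any weights are learned.

```python
import torch
import torch.nn as nn

# Hypothetical label counts for a 3-class problem (illustrative only).
class_counts = torch.tensor([900.0, 90.0, 10.0])
priors = class_counts / class_counts.sum()

output_layer = nn.Linear(32, 3)
with torch.no_grad():
    output_layer.bias.copy_(torch.log(priors))

# Softmax of the bias vector alone recovers the class priors.
bias_probs = torch.softmax(output_layer.bias.detach(), dim=0)
print("priors:    ", priors)
print("bias probs:", bias_probs)
```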

eye — Identity Matrix

torch.nn.init.eye_(tensor) / tf.keras.initializers.Identity(gain=1.0)

Initializes a 2D weight matrix as the identity matrix $I$ (for non-square tensors, ones on the main diagonal and zeros elsewhere). With a zero bias and no activation, the layer acts as a passthrough at initialization: the output equals the input.

This is useful when you want a layer to start by doing nothing and gradually learn a transformation. It is most relevant for:

  • Residual connections: if the shortcut and the main branch both start near the identity, the residual sum is $\approx 2x$, which is still well-conditioned
  • Adapter layers: when inserting a new layer that should not disrupt a pretrained representation at the start of fine-tuning

PyTorch:

w = torch.empty(64, 64)
nn.init.eye_(w)   # identity matrix; requires a 2D tensor (non-square is allowed)
# w[i][j] = 1 if i == j, else 0

TensorFlow:

# Only usable for 2D weight matrices
identity_init = tf.keras.initializers.Identity(gain=1.0)
w = identity_init(shape=(64, 64))

dense = tf.keras.layers.Dense(64, kernel_initializer='identity')  # kernel is square when the input also has 64 features
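
The passthrough behavior is easy to verify in PyTorch. This is a small sketch, assuming a 64-to-64 linear layer with its bias zeroed and, separately, a 3x5 tensor to show the non-square case (sizes and names are illustrative).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Square case: identity weights plus zero bias give output == input.
layer = nn.Linear(64, 64)
nn.init.eye_(layer.weight)
nn.init.zeros_(layer.bias)

x = torch.randn(5, 64)
with torch.no_grad():
    passthrough = torch.allclose(layer(x), x)

# Non-square case: ones on the main diagonal, zeros elsewhere,
# so a 3x5 matrix keeps only the first three input components.
wide = torch.empty(3, 5)
nn.init.eye_(wide)
v = torch.arange(5.0)          # [0., 1., 2., 3., 4.]
kept = wide @ v

print(passthrough, kept)
```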

dirac — Identity-Like Initialization for Conv Layers

torch.nn.init.dirac_(tensor, groups=1) (PyTorch only)

Initializes a convolutional weight tensor so that the layer acts as a passthrough at initialization: with a zero bias and padding that preserves the spatial size, each output channel is an exact copy of the corresponding input channel.

For a kernel of shape $(C_{out}, C_{in}, kH, kW)$ with $C_{out} = C_{in}$, this places a 1 at the spatial center of the $kH \times kW$ filter that connects output channel $i$ to input channel $i$, and 0 everywhere else: a Dirac delta in the spatial dimensions, applied channel by channel.

Why it matters: in very deep convolutional networks (e.g., plain networks without explicit residual connections), dirac initialization allows the model to start as an identity function across layers and learn to deviate from it. It is the convolutional analogue of eye_.

PyTorch:

# groups=1: standard conv; set groups to match nn.Conv2d groups argument
conv_weight = torch.empty(8, 8, 3, 3)   # out_channels, in_channels, kH, kW
nn.init.dirac_(conv_weight)              # exact identity needs out_channels == in_channels

# Applying to an nn.Conv2d layer
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
nn.init.dirac_(conv.weight)
nn.init.zeros_(conv.bias)
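
The identity behavior can be checked on a random input. The sketch below assumes a 3x3 convolution with padding=1 and 16 channels in and out (the input shape and names are illustrative). It verifies that the output matches the input and that the only nonzero weights are the centers of the filters linking channel i to channel i.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
nn.init.dirac_(conv.weight)
nn.init.zeros_(conv.bias)

x = torch.randn(2, 16, 10, 10)
with torch.no_grad():
    passthrough = torch.allclose(conv(x), x, atol=1e-5)

# One nonzero entry per output channel, at the center of filter [i, i].
nonzeros = int((conv.weight != 0).sum())
center_ok = all(conv.weight[i, i, 1, 1].item() == 1.0 for i in range(16))

print(passthrough, nonzeros, center_ok)
```

Without padding=1 the output would be spatially smaller than the input, so the comparison above only makes sense when the convolution preserves the spatial size.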

TensorFlow: No direct equivalent. A similar effect can be achieved manually:

import numpy as np
import tensorflow as tf

def dirac_initializer(shape, dtype=tf.float32):
    """Identity-like init for conv layers with in_channels == out_channels."""
    kH, kW, in_ch, out_ch = shape   # Keras Conv2D kernels are (kH, kW, in_ch, out_ch)
    kernel = np.zeros(shape, dtype=np.float32)
    min_ch = min(out_ch, in_ch)
    center_h, center_w = kH // 2, kW // 2
    for c in range(min_ch):
        kernel[center_h, center_w, c, c] = 1.0
    return tf.constant(kernel, dtype=dtype)

conv = tf.keras.layers.Conv2D(16, 3, padding='same',
    kernel_initializer=dirac_initializer)

Summary

Initializer   PyTorch          TF/Keras          Use for
zeros         zeros_()         Zeros()           Biases, BatchNorm beta
ones          ones_()          Ones()            BatchNorm gamma, LayerNorm scale
constant      constant_(val)   Constant(value)   Output biases at a specific value
eye           eye_()           Identity(gain)    Square weight matrices as passthrough
dirac         dirac_(groups)   (manual)          Conv layers as passthrough