Supplement · Weight Initialization

Orthogonal & Structured Initializations

By the end of this reading you will be able to:
  • Explain why orthogonal weight matrices preserve gradient norms and how this prevents vanishing and exploding gradients in RNNs
  • Describe how orthogonal initialization is computed via SVD of a random normal matrix
  • Apply orthogonal_ in PyTorch and Orthogonal in TensorFlow to RNN hidden-to-hidden weight matrices

Orthogonal Initialization

torch.nn.init.orthogonal_(tensor, gain=1.0) / tf.keras.initializers.Orthogonal(gain=1.0)

Initializes the weight matrix so that its rows (or columns, for tall matrices) form an orthonormal set. Concretely, a square $n \times n$ matrix $W$ is orthogonal if:

$$W^\top W = I \quad \text{(equivalently } WW^\top = I\text{)}$$

All singular values of an orthogonal matrix equal 1. This is the key property.
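
A short derivation shows why this property implies norm preservation. It uses only the definition $W^\top W = I$ and an arbitrary vector $x$ (a symbol introduced here for illustration):

```latex
\|Wx\|_2^2 = (Wx)^\top (Wx) = x^\top W^\top W x = x^\top x = \|x\|_2^2
```

Because every vector keeps its length, the largest and smallest stretch factors of $W$, which are its largest and smallest singular values, must both equal 1.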

Why Singular Values = 1 Matters

Consider a sequence of linear operations, for example the hidden-state updates in an RNN across $T$ timesteps:

$$h_t = \sigma(W_{hh}\, h_{t-1} + W_{xh}\, x_t)$$

Ignoring the activation for a moment, the gradient of the loss at step $t$ flows backward through $T - t$ applications of $W_{hh}^\top$:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} W_{hh}^\top$$

If every singular value of $W_{hh}$ exceeds 1, the norm of this product grows exponentially in $T - t$ (exploding gradients). If every singular value is below 1, it shrinks exponentially (vanishing gradients). If all singular values equal 1 (orthogonal $W_{hh}$), the product is itself orthogonal, so it preserves the norm of any gradient vector it multiplies. Gradients then neither explode nor vanish purely because of the weight matrix, although the activation's derivative can still shrink them.
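
The three regimes can be checked numerically. The following is a minimal NumPy sketch, not taken from any library: it builds a random orthogonal matrix from a QR decomposition, rescales it so that all singular values equal 0.9, 1.0, or 1.1, and pushes a unit-norm gradient back through 100 linear steps. The helper name `backprop_norm`, the size 64, and the step count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 64x64 orthogonal matrix: the Q factor of a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))

def backprop_norm(W, steps, grad):
    """Norm of `grad` after `steps` multiplications by W^T (activation ignored)."""
    for _ in range(steps):
        grad = W.T @ grad
    return np.linalg.norm(grad)

g0 = rng.standard_normal(64)
g0 /= np.linalg.norm(g0)  # unit-norm starting gradient

# scale * Q has all singular values equal to `scale`.
norms = {scale: backprop_norm(scale * Q, steps=100, grad=g0)
         for scale in (0.9, 1.0, 1.1)}

for scale, norm in norms.items():
    print(f"singular values = {scale}: gradient norm after 100 steps = {norm:.3e}")
```

With all singular values equal to `scale`, the norm after 100 steps is `scale ** 100`, so the three cases differ by many orders of magnitude even though the matrices differ by only 10%.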

Computing an Orthogonal Matrix

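One standard construction uses the reduced Singular Value Decomposition (SVD) of a random matrix:

  1. Draw $A \sim \mathcal{N}(0, 1)^{m \times n}$
  2. Compute the reduced SVD: $A = U \Sigma V^\top$
  3. If $m \geq n$: use $Q = U$ (shape $m \times n$, orthonormal columns)
  4. If $m < n$: use $Q = V^\top$ (shape $m \times n$, orthonormal rows)
  5. Scale: $W = \text{gain} \cdot Q$

In practice, both torch.nn.init.orthogonal_ and tf.keras.initializers.Orthogonal obtain $Q$ from a QR decomposition of the random matrix rather than from an SVD. The result has the same orthonormality properties.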
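
The five SVD steps can be sketched directly in NumPy. This is an illustrative implementation of the recipe as listed, not the code used by PyTorch or TensorFlow; the function name `orthogonal_via_svd` and its `seed` argument are hypothetical.

```python
import numpy as np

def orthogonal_via_svd(m, n, gain=1.0, seed=None):
    """Return an (m, n) matrix with orthonormal columns (m >= n)
    or orthonormal rows (m < n), scaled by `gain`."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))                    # step 1: random normal matrix
    U, _, Vt = np.linalg.svd(A, full_matrices=False)   # step 2: reduced SVD
    Q = U if m >= n else Vt                            # steps 3-4: both are (m, n)
    return gain * Q                                    # step 5: scale

W_tall = orthogonal_via_svd(128, 64, seed=0)
W_wide = orthogonal_via_svd(64, 128, seed=1)
print(np.allclose(W_tall.T @ W_tall, np.eye(64)))  # orthonormal columns
print(np.allclose(W_wide @ W_wide.T, np.eye(64)))  # orthonormal rows
```

Passing `full_matrices=False` matters: it requests the reduced SVD, so that `U` (for tall inputs) and `Vt` (for wide inputs) already have shape `(m, n)`.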

PyTorch:

import torch
import torch.nn as nn

w = torch.empty(64, 64)
nn.init.orthogonal_(w, gain=1.0)

# Verify orthogonality: W^T @ W ≈ I
print(torch.allclose(w.T @ w, torch.eye(64), atol=1e-5))  # True (up to float32 rounding)

# Non-square: orthonormal columns (m > n) or rows (m < n)
w_tall = torch.empty(128, 64)
nn.init.orthogonal_(w_tall)   # orthonormal columns: w_tall.T @ w_tall ≈ I_64

w_wide = torch.empty(64, 128)
nn.init.orthogonal_(w_wide)   # orthonormal rows: w_wide @ w_wide.T ≈ I_64

TensorFlow:

import tensorflow as tf

ortho_init = tf.keras.initializers.Orthogonal(gain=1.0)
w = ortho_init(shape=(64, 64))

# Verify: W^T @ W ≈ I
print(tf.reduce_max(tf.abs(tf.transpose(w) @ w - tf.eye(64))))  # ≈ 0

Applying to RNNs

Orthogonal initialization is most commonly used for the hidden-to-hidden weight matrix of RNNs, GRUs, and LSTMs, where the repeated matrix multiplication across timesteps makes gradient stability especially important.

PyTorch — vanilla RNN:

rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=2)

# Initialize all hidden-to-hidden weights orthogonally
for name, param in rnn.named_parameters():
    if 'weight_hh' in name:
        nn.init.orthogonal_(param)
    elif 'weight_ih' in name:
        nn.init.xavier_uniform_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)
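
After a loop like this, it can be worth confirming that the recurrent weights really are orthogonal. The sketch below repeats only the `weight_hh` part of the initialization on a fresh module and measures how far each matrix is from satisfying $W^\top W = I$. The helper `orthogonality_error` is a hypothetical name introduced for this check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=2)

for name, param in rnn.named_parameters():
    if 'weight_hh' in name:
        nn.init.orthogonal_(param)

def orthogonality_error(w):
    """Largest absolute entry of W^T W - I for a square matrix w."""
    eye = torch.eye(w.shape[1])
    return (w.T @ w - eye).abs().max().item()

# One entry per layer: weight_hh_l0 and weight_hh_l1
errors = {name: orthogonality_error(p.detach())
          for name, p in rnn.named_parameters() if 'weight_hh' in name}
print(errors)  # values close to 0, limited by float32 rounding
```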

PyTorch — LSTM (weight_hh has concatenated gates):

lstm = nn.LSTM(input_size=32, hidden_size=64)

for name, param in lstm.named_parameters():
    if 'weight_hh' in name:
        # weight_hh has shape (4*hidden, hidden): one (hidden, hidden) block per gate
        hidden_size = param.shape[1]
        for gate_idx in range(4):
            # Slicing param.data returns a view, so the in-place init updates the parameter
            block = param.data[gate_idx*hidden_size:(gate_idx+1)*hidden_size]
            nn.init.orthogonal_(block)
    elif 'weight_ih' in name:
        nn.init.xavier_uniform_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)
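
The reason for initializing each gate block separately can be seen by comparing it with a single call on the stacked matrix. The sketch below uses standalone tensors rather than an `nn.LSTM`, and the variable names are illustrative. Applied to the whole `(4*hidden, hidden)` matrix, orthogonal init gives orthonormal columns overall, but the individual gate blocks then have singular values below 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 64

# Option A: one call on the stacked (4*hidden, hidden) matrix
whole = torch.empty(4 * hidden, hidden)
nn.init.orthogonal_(whole)

# Option B: one call per (hidden, hidden) gate block
per_gate = torch.empty(4 * hidden, hidden)
for block in per_gate.chunk(4, dim=0):   # chunk returns views into per_gate
    nn.init.orthogonal_(block)

def gate_singular_values(w):
    """Singular values of all four gate blocks, concatenated."""
    return torch.cat([torch.linalg.svdvals(b) for b in w.chunk(4, dim=0)])

whole_sv = gate_singular_values(whole)
per_gate_sv = gate_singular_values(per_gate)

print("stacked init :", whole_sv.min().item(), whole_sv.max().item())
print("per-gate init:", per_gate_sv.min().item(), per_gate_sv.max().item())
```

Only the per-gate variant makes each gate's recurrent matrix orthogonal on its own, which is the property the gradient argument relies on.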

TensorFlow — built-in recurrent_initializer:

# SimpleRNN, GRU, and LSTM all accept a recurrent_initializer
rnn = tf.keras.layers.SimpleRNN(64, recurrent_initializer='orthogonal')
gru = tf.keras.layers.GRU(64, recurrent_initializer='orthogonal')
lstm = tf.keras.layers.LSTM(64, recurrent_initializer='orthogonal')

# Note: 'orthogonal' is already the Keras default recurrent_initializer for all three layers

The gain Parameter

Scaling an orthogonal matrix by gain changes all singular values from 1 to gain. Values $> 1$ bias the initialization toward expansion; values $< 1$ bias it toward contraction. For most RNN use cases, gain=1.0 is the right choice. When using orthogonal init as a warm start for a network that will later develop non-unit singular values (e.g., GAN discriminators), a gain slightly below 1 can help stability.
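
The effect of gain on the singular values is easy to confirm. The following sketch initializes a square matrix with a few gain values and reports the smallest and largest singular value. The helper `singular_value_range` and the chosen gains are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def singular_value_range(gain, n=64):
    """Smallest and largest singular value of an n x n orthogonal init."""
    w = torch.empty(n, n)
    nn.init.orthogonal_(w, gain=gain)
    s = torch.linalg.svdvals(w)
    return s.min().item(), s.max().item()

for gain in (0.5, 1.0, 1.5):
    lo, hi = singular_value_range(gain)
    print(f"gain={gain}: singular values in [{lo:.4f}, {hi:.4f}]")
```

Both ends of the range should match the gain up to float32 rounding, since scaling a matrix by a constant scales every singular value by that constant.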

Orthogonal Initialization for Deep Linear Networks

In deep linear networks (no activation functions), the product of weight matrices determines the effective linear map. With orthogonal initialization, every square layer starts as a norm-preserving map (a rotation, possibly combined with a reflection), so the product of the layers is also orthogonal and the effective Jacobian is well-conditioned from the start. This is a useful property for theoretical analysis and for networks that approximate linear mappings (e.g., some attention-free architectures).
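
To make "well-conditioned" concrete, the sketch below compares the condition number of the end-to-end map for a stack of orthogonal layers with that of a stack of Gaussian layers whose entries have variance $1/n$. The width 32, depth 10, and all function names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 32, 10

def orthogonal_layer():
    # Q factor of a QR decomposition of a random normal matrix
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def gaussian_layer():
    # i.i.d. normal entries with variance 1/n
    return rng.standard_normal((n, n)) / np.sqrt(n)

def product_condition_number(make_layer):
    """Condition number (max / min singular value) of a product of `depth` layers."""
    P = np.eye(n)
    for _ in range(depth):
        P = make_layer() @ P
    s = np.linalg.svd(P, compute_uv=False)
    return s.max() / s.min()

cond_orth = product_condition_number(orthogonal_layer)
cond_gauss = product_condition_number(gaussian_layer)
print(f"orthogonal layers: condition number = {cond_orth:.6f}")
print(f"gaussian layers:   condition number = {cond_gauss:.3e}")
```

The orthogonal product stays at a condition number of 1 regardless of depth, while the Gaussian product becomes increasingly ill-conditioned as layers are added.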

Practical Notes

  • Defined for matrices: orthogonal initialization is a 2D construction. PyTorch's orthogonal_ also accepts tensors with more than two dimensions, such as 4D conv weights: it flattens the trailing dimensions into one, applies orthogonal init to the resulting 2D matrix, then reshapes back.
  • Not variance-scaling: orthogonal init sets singular values to 1 regardless of fan-in or fan-out. Combine with a gain value scaled to your layer size if needed.
  • Computational cost: the matrix decomposition (SVD or QR) costs $O(\min(m,n) \cdot m \cdot n)$, which is non-trivial for large weight matrices. For very large layers, He or Xavier is more practical at initialization time.
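
The conv-layer behaviour from the first note can be checked on a small example. In this sketch, with layer sizes chosen arbitrarily, a `Conv2d` weight of shape `(16, 3, 3, 3)` is initialized and then viewed as a `(16, 27)` matrix, one row per output channel. Since there are fewer rows than columns, the rows should be orthonormal.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
nn.init.orthogonal_(conv.weight)              # weight shape: (16, 3, 3, 3)

flat = conv.weight.detach().reshape(16, -1)   # (16, 27): one row per output channel
gram = flat @ flat.T                          # should be close to the 16x16 identity
max_err = (gram - torch.eye(16)).abs().max().item()
print(f"max |W W^T - I| = {max_err:.2e}")
```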

Summary of Structured Initializations

| Initializer | What it produces | Primary use case |
| --- | --- | --- |
| orthogonal_ / Orthogonal | $W^\top W = I$; all singular values = 1 | RNN hidden-to-hidden weights; deep linear networks |
| eye_ / Identity | Identity matrix (2D tensors only) | Linear adapter layers; residual passthrough |
| dirac_ | Dirac delta conv filter (PyTorch only) | Conv layer identity passthrough |

References

  • Saxe et al. (2014). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. Derived exact analytical solutions for the training dynamics of deep linear networks and analyzed the benefits of orthogonal initialization.
  • Le et al. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. Proposed identity initialization (a special case of an orthogonal matrix) for ReLU RNNs as a practical alternative to LSTMs.