Supplement · Weight Initialization

Orthogonal & Structured Initializations

12 min read

By the end of this reading you will be able to:

Explain why orthogonal weight matrices preserve gradient norms and how this prevents vanishing and exploding gradients in RNNs
Describe how an orthogonal initialization can be computed from the SVD (or QR decomposition) of a random normal matrix
Apply orthogonal_ in PyTorch and Orthogonal in TensorFlow to RNN hidden-to-hidden weight matrices

Orthogonal Initialization

torch.nn.init.orthogonal_(tensor, gain=1.0) / tf.keras.initializers.Orthogonal(gain=1.0)

Initializes the weight matrix so that its rows (or columns, for tall matrices) form an orthonormal set. Concretely, a square $n \times n$ matrix $W$ is orthogonal if:

$W^\top W = I \quad \text{(equivalently } WW^\top = I\text{)}$

All singular values of an orthogonal matrix equal 1. This is the key property.
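As a quick numerical check of this property, the sketch below initializes a square matrix with orthogonal_ and inspects its singular values. The matrix size and the seed are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # seed chosen only to make the run repeatable

w = torch.empty(32, 32)
nn.init.orthogonal_(w)

# Singular values of an orthogonal matrix should all be (numerically) 1.
singular_values = torch.linalg.svdvals(w)
max_deviation = (singular_values - 1.0).abs().max().item()

# Multiplying by W should leave the Euclidean norm of a vector unchanged.
x = torch.randn(32)
norm_ratio = (torch.linalg.norm(w @ x) / torch.linalg.norm(x)).item()

print(f"max |sigma - 1| = {max_deviation:.2e}")
print(f"||Wx|| / ||x||  = {norm_ratio:.6f}")
```

Both printed values should sit at the float32 rounding level: a deviation near zero and a norm ratio near one.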

Why Singular Values = 1 Matters

Consider a sequence of linear operations — for example, the hidden-state updates in an RNN across $T$ timesteps:

$h_t = \sigma(W_{hh}\, h_{t-1} + W_{xh}\, x_t)$

Ignoring the activation for a moment, the gradient of the loss at step $t$ flows backward through $T - t$ applications of $W_{hh}^\top$ :

$\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} W_{hh}^\top$

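If every singular value of $W_{hh}$ exceeds 1, the norm of this product grows exponentially in $T - t$ (exploding gradients). If every singular value is below 1, it shrinks exponentially (vanishing gradients). If all singular values equal 1 (orthogonal $W_{hh}$), the product is itself orthogonal, so it preserves the norm of any gradient vector it multiplies, however many factors it contains. Gradients then neither explode nor vanish purely due to the weight matrix; the derivative of the activation, ignored here, can still shrink them.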
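The effect can be seen in a small NumPy experiment. This is only a sketch of the linearized case (no activation). The matrix size, the number of steps, and the scale factors 0.9, 1.0 and 1.1 are illustrative assumptions, and `backprop_norm` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 64, 50

# A random orthogonal matrix from the QR decomposition of a Gaussian matrix.
q, _ = np.linalg.qr(rng.standard_normal((n, n)))

def backprop_norm(w, steps, grad):
    """Norm of the gradient after `steps` multiplications by w.T."""
    for _ in range(steps):
        grad = w.T @ grad
    return np.linalg.norm(grad)

grad = rng.standard_normal(n)
grad /= np.linalg.norm(grad)  # start from a unit-norm gradient

# Scaling q by s gives a matrix whose singular values all equal s.
norms = {s: backprop_norm(s * q, steps, grad) for s in (0.9, 1.0, 1.1)}
for s, value in norms.items():
    print(f"singular values = {s}: gradient norm after {steps} steps = {value:.4g}")
```

Because every singular value equals the scale factor, the three norms come out as $0.9^{50}$, $1$ and $1.1^{50}$: roughly 0.005, 1 and 117.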

Computing an Orthogonal Matrix

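One way to construct a random orthogonal matrix is to take the reduced Singular Value Decomposition (SVD) of a random Gaussian matrix and discard the singular values:

Draw $A \sim \mathcal{N}(0, 1)^{m \times n}$
Compute the reduced SVD: $A = U \Sigma V^\top$
If $m \geq n$ : use $Q = U$ (shape $m \times n$ , orthonormal columns)
If $m < n$ : use $Q = V^\top$ (shape $m \times n$ , orthonormal rows)
Scale: $W = \text{gain} \cdot Q$

The built-in initializers in PyTorch and TensorFlow arrive at the same kind of matrix through a QR decomposition of $A$ (with a sign correction taken from the diagonal of $R$) rather than an SVD.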
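The SVD steps above can be sketched in a few lines of NumPy. The function name `orthogonal_via_svd` and its arguments are hypothetical, chosen for this illustration; this is not the code path of the library initializers.

```python
import numpy as np

def orthogonal_via_svd(m, n, gain=1.0, rng=None):
    """Return an (m, n) matrix with orthonormal columns (m >= n) or rows (m < n)."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((m, n))                    # A ~ N(0, 1)^{m x n}
    u, _, vt = np.linalg.svd(a, full_matrices=False)   # reduced SVD
    q = u if m >= n else vt                            # both choices have shape (m, n)
    return gain * q

rng = np.random.default_rng(0)
w_tall = orthogonal_via_svd(128, 64, rng=rng)
w_wide = orthogonal_via_svd(64, 128, rng=rng)

# Largest deviation from the identity for the relevant Gram matrix.
tall_err = np.abs(w_tall.T @ w_tall - np.eye(64)).max()
wide_err = np.abs(w_wide @ w_wide.T - np.eye(64)).max()
print(w_tall.shape, tall_err)
print(w_wide.shape, wide_err)
```

Passing `full_matrices=False` is what makes $U$ and $V^\top$ come back with shape $m \times n$ in the tall and wide cases respectively, so no slicing is needed.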

PyTorch:

import torch
import torch.nn as nn

w = torch.empty(64, 64)
nn.init.orthogonal_(w, gain=1.0)

# Verify orthogonality: W^T @ W ≈ I (tolerance loosened for float32 rounding)
print(torch.allclose(w.T @ w, torch.eye(64), atol=1e-5))  # True

# Non-square: orthonormal columns (m > n) or rows (m < n)
w_tall = torch.empty(128, 64)
nn.init.orthogonal_(w_tall)   # orthonormal columns: w_tall.T @ w_tall ≈ I_64

w_wide = torch.empty(64, 128)
nn.init.orthogonal_(w_wide)   # orthonormal rows: w_wide @ w_wide.T ≈ I_64

TensorFlow:

import tensorflow as tf

ortho_init = tf.keras.initializers.Orthogonal(gain=1.0)
w = ortho_init(shape=(64, 64))

# Verify: the largest entry of W^T W - I should be close to 0
print(tf.reduce_max(tf.abs(tf.transpose(w) @ w - tf.eye(64))))  # ≈ 0

Applying to RNNs

Orthogonal initialization is most commonly used for the hidden-to-hidden weight matrix of RNNs, GRUs, and LSTMs, where the repeated matrix multiplication across timesteps makes gradient stability especially important.

PyTorch — vanilla RNN:

rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=2)

# Initialize all hidden-to-hidden weights orthogonally
for name, param in rnn.named_parameters():
    if 'weight_hh' in name:
        nn.init.orthogonal_(param)
    elif 'weight_ih' in name:
        nn.init.xavier_uniform_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)

PyTorch — LSTM (weight_hh has concatenated gates):

lstm = nn.LSTM(input_size=32, hidden_size=64)

for name, param in lstm.named_parameters():
    if 'weight_hh' in name:
        # weight_hh is (4*hidden, hidden): initialize each gate block separately
        hidden_size = param.shape[1]
        with torch.no_grad():
            for gate_idx in range(4):
                # Each slice is a view, so the in-place init updates param itself
                block = param[gate_idx*hidden_size:(gate_idx+1)*hidden_size]
                nn.init.orthogonal_(block)
    elif 'weight_ih' in name:
        nn.init.xavier_uniform_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)

TensorFlow — built-in recurrent_initializer:

# SimpleRNN, GRU, and LSTM all accept a recurrent_initializer
rnn = tf.keras.layers.SimpleRNN(64, recurrent_initializer='orthogonal')
gru = tf.keras.layers.GRU(64, recurrent_initializer='orthogonal')
lstm = tf.keras.layers.LSTM(64, recurrent_initializer='orthogonal')

# In Keras, 'orthogonal' is already the default recurrent_initializer for all three layers

The gain Parameter

Scaling an orthogonal matrix by gain changes all singular values from 1 to gain. Values $> 1$ bias the initialization toward expansion; values $< 1$ bias it toward contraction. For most RNN use cases, gain=1.0 is the right choice. When orthogonal init serves as a warm start for a network that will later develop non-unit singular values (e.g., GAN discriminators), a gain slightly below 1 can help stability.
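A short check of how gain moves the singular values, using the PyTorch initializer. The gain values 0.8, 1.0 and 1.2 and the matrix size are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

spread = {}
for gain in (0.8, 1.0, 1.2):
    w = torch.empty(48, 48)
    nn.init.orthogonal_(w, gain=gain)
    s = torch.linalg.svdvals(w)
    # Record the smallest and largest singular value for this gain.
    spread[gain] = (s.min().item(), s.max().item())

for gain, (s_min, s_max) in spread.items():
    print(f"gain={gain}: singular values in [{s_min:.4f}, {s_max:.4f}]")
```

For each gain, the smallest and largest singular values should both match the gain up to float32 rounding.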

Orthogonal Initialization for Deep Linear Networks

In deep linear networks (no activation functions), the product of weight matrices determines the effective linear map. Orthogonal initialization makes all layers start as rotations, which keeps the effective Jacobian well-conditioned from the start — a useful property for theoretical analysis and for networks that approximate linear mappings (e.g., some attention-free architectures).
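To make the conditioning claim concrete, the sketch below compares the end-to-end map of a deep linear stack under orthogonal layers and under scaled Gaussian layers. The width, depth, the $1/\sqrt{\text{width}}$ Gaussian scaling, and both helper functions are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 20

def random_orthogonal(n):
    """Random orthogonal matrix via QR, with a sign fix on the columns."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def end_to_end_singular_values(layers):
    """Singular values of the product W_L ... W_2 W_1."""
    product = np.eye(width)
    for w in layers:
        product = w @ product
    return np.linalg.svd(product, compute_uv=False)

orth_layers = [random_orthogonal(width) for _ in range(depth)]
gauss_layers = [rng.standard_normal((width, width)) / np.sqrt(width)
                for _ in range(depth)]

s_orth = end_to_end_singular_values(orth_layers)
s_gauss = end_to_end_singular_values(gauss_layers)

print("orthogonal: min/max singular value =", s_orth.min(), s_orth.max())
print("gaussian:   min/max singular value =", s_gauss.min(), s_gauss.max())
```

The orthogonal stack keeps every singular value of the end-to-end map at 1, while the Gaussian stack spreads them over many orders of magnitude as depth grows.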

Practical Notes

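Defined for matrices: orthogonality is a property of 2D matrices. For tensors with more than two dimensions, such as 4D conv kernels, PyTorch flattens all trailing dimensions into one (a kernel of shape (out_channels, in_channels, kH, kW) becomes an out_channels × (in_channels·kH·kW) matrix), applies orthogonal init, then reshapes back.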
Not variance-scaling: orthogonal init sets singular values to 1 regardless of fan-in or fan-out. Combine with a gain value scaled to your layer size if needed.
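Computational cost: the underlying matrix decomposition (SVD or QR) costs $O(\min(m,n) \cdot m \cdot n)$ , which is non-trivial for large weight matrices. For very large layers, He or Xavier initialization, which only draws independent random entries, is cheaper at initialization time.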
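The conv-layer behavior in the first note can be checked directly. The channel counts and kernel size below are arbitrary, and the snippet only inspects the result of orthogonal_; it does not show the library's internals.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)
nn.init.orthogonal_(conv.weight)  # weight shape: (32, 16, 3, 3)

# Flatten the trailing dimensions: (32, 16*3*3) = (32, 144), a wide matrix.
flat = conv.weight.detach().reshape(conv.out_channels, -1)

# A wide matrix gets orthonormal rows, so flat @ flat.T should be close to I_32.
gram = flat @ flat.T
err = (gram - torch.eye(conv.out_channels)).abs().max().item()
print(flat.shape, f"max |flat flat^T - I| = {err:.2e}")
```

Because the flattened kernel is wider than it is tall here, it is the output filters (rows) that end up mutually orthonormal.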

Summary of Structured Initializations

Initializer	What it produces	Primary use case
`orthogonal_` / `Orthogonal`	$W^\top W = I$ ; all singular values = 1	RNN hidden-to-hidden weights; deep linear networks
`eye_` / `Identity`	Identity matrix (2D tensors only)	Linear adapter layers; residual passthrough
`dirac_`	Dirac delta conv filter (PyTorch only)	Conv layer identity passthrough
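The two passthrough initializers in the table can be exercised as follows. The layer sizes are illustrative assumptions. Biases are disabled and the conv uses padding=1 so that each layer reproduces its input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# eye_: a bias-free linear layer becomes the identity map.
linear = nn.Linear(8, 8, bias=False)
nn.init.eye_(linear.weight)
x = torch.randn(4, 8)
linear_is_identity = torch.allclose(linear(x), x)

# dirac_: each output channel copies the matching input channel.
conv = nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False)
nn.init.dirac_(conv.weight)
img = torch.randn(2, 4, 10, 10)
conv_is_identity = torch.allclose(conv(img), img)

print(linear_is_identity, conv_is_identity)
```

Both layers start out as exact passthroughs, which is why these initializers suit adapter and residual-style layers that should initially leave their input unchanged.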

References

Saxe et al. (2014). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. Showed that orthogonal initialization enables exact analytical solutions for linear network training dynamics

Le et al. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. Proposed identity initialization of the recurrent weights in ReLU RNNs as a practical alternative to LSTMs
