Orthogonal & Structured Initializations
- Explain why orthogonal weight matrices preserve gradient norms and how this prevents vanishing and exploding gradients in RNNs
- Describe how orthogonal initialization is computed via SVD of a random normal matrix
- Apply orthogonal_ in PyTorch and Orthogonal in TensorFlow to RNN hidden-to-hidden weight matrices
Orthogonal Initialization
torch.nn.init.orthogonal_(tensor, gain=1.0) / tf.keras.initializers.Orthogonal(gain=1.0)
Initializes the weight matrix so that its rows (or columns, for tall matrices) form an orthonormal set. Concretely, a square matrix $W$ is orthogonal if:
$$W^\top W = W W^\top = I$$
All singular values of an orthogonal matrix equal 1. This is the key property.
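The link between orthogonality and unit singular values takes two lines to show. A short derivation, writing $W$ for a square matrix with $W^\top W = I$ and $x$ for an arbitrary vector (both symbols are chosen here for illustration):

```latex
\begin{aligned}
\|Wx\|_2^2 &= (Wx)^\top (Wx) = x^\top W^\top W x = x^\top x = \|x\|_2^2, \\
\sigma_i(W) &= \sqrt{\lambda_i(W^\top W)} = \sqrt{\lambda_i(I)} = 1 \quad \text{for every } i.
\end{aligned}
```

The first line says an orthogonal map never changes the length of a vector; the second says the same thing in terms of singular values.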
Why Singular Values = 1 Matters
Consider a sequence of linear operations, for example the hidden-state updates in an RNN across timesteps:
$$h_t = \phi(W h_{t-1} + W_x x_t + b)$$
Ignoring the activation $\phi$ for a moment, the gradient of the loss at step $T$ flows backward to step $t$ through $T - t$ applications of $W^\top$:
$$\frac{\partial L}{\partial h_t} = (W^\top)^{T-t} \, \frac{\partial L}{\partial h_T}$$
If $W$ has singular values $> 1$, this product can grow exponentially (exploding gradients). If its singular values are $< 1$, it shrinks exponentially (vanishing gradients). If all singular values equal 1 (orthogonal $W$), the product has the same norm as any single factor. Gradients neither explode nor vanish purely due to the weight matrix.
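The effect is easy to reproduce numerically. The following is a minimal NumPy sketch, not a training loop: it assumes a purely linear recurrence (no activation), builds the recurrent matrix as scale * Q with Q orthogonal, and uses a random vector as a stand-in for the gradient at the last timestep. The names scale, steps, and g_T are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 64, 50

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
q, _ = np.linalg.qr(rng.standard_normal((n, n)))

g_T = rng.standard_normal(n)  # stand-in for the gradient at the final step
norms = {}
for scale in (0.9, 1.0, 1.1):
    w = scale * q  # every singular value of w equals `scale`
    g = g_T.copy()
    for _ in range(steps):
        g = w.T @ g  # one backward step through the linear recurrence
    norms[scale] = np.linalg.norm(g) / np.linalg.norm(g_T)

for scale, ratio in norms.items():
    print(f"singular values = {scale}: norm ratio after {steps} steps = {ratio:.3e}")
```

Because every singular value equals scale, the norm ratio is scale ** steps: it collapses for 0.9, stays at 1 for 1.0, and blows up for 1.1.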
Computing an Orthogonal Matrix
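One way to compute an orthogonal matrix is via the Singular Value Decomposition (SVD) of a random matrix; the library implementations (PyTorch's orthogonal_ and Keras's Orthogonal) use a QR decomposition of the random matrix instead, which yields the same kind of matrix. The SVD route: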
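- Draw $A \in \mathbb{R}^{m \times n}$ with i.i.d. entries $A_{ij} \sim \mathcal{N}(0, 1)$
- Compute the thin SVD: $A = U \Sigma V^\top$
- If $m \ge n$: use $Q = U$ (shape $m \times n$, orthonormal columns)
- If $m < n$: use $Q = V^\top$ (shape $m \times n$, orthonormal rows)
- Scale: $W = \text{gain} \cdot Q$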
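The steps above can be sketched in a few lines of NumPy. This is an illustration of the SVD recipe only, not the code the libraries run; orthogonal_svd and its seed argument are hypothetical names introduced here.

```python
import numpy as np

def orthogonal_svd(m, n, gain=1.0, seed=0):
    """Hypothetical helper: (m, n) matrix with orthonormal columns or rows."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, n))                   # Gaussian draw
    u, _, vt = np.linalg.svd(a, full_matrices=False)  # thin SVD
    q = u if m >= n else vt                           # pick the (m, n) factor
    return gain * q                                   # scale by gain

w_tall = orthogonal_svd(128, 64)  # orthonormal columns
w_wide = orthogonal_svd(64, 128)  # orthonormal rows

print(np.allclose(w_tall.T @ w_tall, np.eye(64)))
print(np.allclose(w_wide @ w_wide.T, np.eye(64)))
```

With full_matrices=False, exactly one of the two SVD factors has the target shape (m, n), which is why the branch on m >= n is enough.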
PyTorch:
import torch
import torch.nn as nn
w = torch.empty(64, 64)
nn.init.orthogonal_(w, gain=1.0)
# Verify orthogonality: W^T @ W ≈ I (float32, so allow a small tolerance)
print(torch.allclose(w.T @ w, torch.eye(64), atol=1e-5)) # True
# Non-square: orthonormal columns (m > n) or rows (m < n)
w_tall = torch.empty(128, 64)
nn.init.orthogonal_(w_tall) # orthonormal columns: w_tall.T @ w_tall ≈ I_64
w_wide = torch.empty(64, 128)
nn.init.orthogonal_(w_wide) # orthonormal rows: w_wide @ w_wide.T ≈ I_64
TensorFlow:
import tensorflow as tf
ortho_init = tf.keras.initializers.Orthogonal(gain=1.0)
w = ortho_init(shape=(64, 64))
# Verify
print(tf.reduce_max(tf.abs(tf.transpose(w) @ w - tf.eye(64)))) # ≈ 0
Applying to RNNs
Orthogonal initialization is most commonly used for the hidden-to-hidden weight matrix of RNNs, GRUs, and LSTMs, where the repeated matrix multiplication across timesteps makes gradient stability especially important.
PyTorch — vanilla RNN:
import torch.nn as nn
rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=2)
# Initialize all hidden-to-hidden weights orthogonally
for name, param in rnn.named_parameters():
    if 'weight_hh' in name:
        nn.init.orthogonal_(param)      # recurrent weights, one per layer
    elif 'weight_ih' in name:
        nn.init.xavier_uniform_(param)  # input-to-hidden weights
    elif 'bias' in name:
        nn.init.zeros_(param)
PyTorch — LSTM (weight_hh has concatenated gates):
import torch
import torch.nn as nn
lstm = nn.LSTM(input_size=32, hidden_size=64)
hidden_size = lstm.hidden_size
for name, param in lstm.named_parameters():
    if 'weight_hh' in name:
        # weight_hh is (4*hidden, hidden): input, forget, cell, and output gates stacked by rows.
        # Apply orthogonal init to each (hidden, hidden) gate block separately.
        with torch.no_grad():
            for gate_idx in range(4):
                block = param[gate_idx*hidden_size:(gate_idx+1)*hidden_size]
                nn.init.orthogonal_(block)  # writes in place through the view
    elif 'weight_ih' in name:
        nn.init.xavier_uniform_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)
TensorFlow — built-in recurrent_initializer:
# SimpleRNN, GRU, and LSTM all accept a recurrent_initializer
rnn = tf.keras.layers.SimpleRNN(64, recurrent_initializer='orthogonal')
gru = tf.keras.layers.GRU(64, recurrent_initializer='orthogonal')
lstm = tf.keras.layers.LSTM(64, recurrent_initializer='orthogonal')
# Keras already defaults to recurrent_initializer='orthogonal' for all three layers
The gain Parameter
Scaling an orthogonal matrix by gain changes all singular values from 1 to gain. Values $> 1$ bias the initialization toward expansion; values $< 1$ bias it toward contraction. For most RNN use cases, gain=1.0 is the right choice. When using orthogonal init as a warm start for a network that will later develop non-unit singular values (e.g., GAN discriminators), a gain slightly away from 1.0 can help stability.
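A quick check of this claim against PyTorch itself: the sketch below initializes a 64 x 64 matrix with three example gain values and reads off the smallest and largest singular value with torch.linalg.svdvals. The dictionary name sv_range and the chosen gains are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sv_range = {}
for gain in (0.5, 1.0, 2.0):
    w = torch.empty(64, 64)
    nn.init.orthogonal_(w, gain=gain)
    s = torch.linalg.svdvals(w)  # singular values of the initialized matrix
    sv_range[gain] = (s.min().item(), s.max().item())

for gain, (lo, hi) in sv_range.items():
    print(f"gain={gain}: singular values in [{lo:.4f}, {hi:.4f}]")
```

Up to float32 rounding, the minimum and maximum coincide with gain, so the whole spectrum moves together.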
Orthogonal Initialization for Deep Linear Networks
In deep linear networks (no activation functions), the product of weight matrices determines the effective linear map. Orthogonal initialization makes every square layer start as a norm-preserving map (a rotation, possibly combined with a reflection), which keeps the effective Jacobian well-conditioned from the start. This is a useful property for theoretical analysis and for networks that approximate linear mappings (e.g., some attention-free architectures).
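To make "well-conditioned" concrete, the sketch below multiplies 20 square layers together and compares the condition number of the end-to-end map under two initializations. The Gaussian baseline with variance 1/n is an assumption chosen here for comparison, and depth and n are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 20

prod_orth = np.eye(n)
prod_gauss = np.eye(n)
for _ in range(depth):
    # Orthogonal layer: Q factor of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    prod_orth = q @ prod_orth
    # Gaussian layer with entry variance 1/n (comparison baseline).
    prod_gauss = (rng.standard_normal((n, n)) / np.sqrt(n)) @ prod_gauss

cond_orth = np.linalg.cond(prod_orth)
cond_gauss = np.linalg.cond(prod_gauss)
print(f"orthogonal layers: condition number = {cond_orth:.6f}")
print(f"Gaussian layers:   condition number = {cond_gauss:.3e}")
```

A product of orthogonal matrices is itself orthogonal, so its condition number stays at 1 at any depth, while the Gaussian product becomes badly conditioned as depth grows.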
Practical Notes
- Defined for matrices: orthogonality is a property of 2D weight matrices. For conv layers (4D tensors), PyTorch flattens every dimension after the first to get a 2D matrix, applies orthogonal init, then reshapes back.
- Not variance-scaling: orthogonal init sets singular values to 1 regardless of fan-in or fan-out. Combine with a gain value scaled to your layer size if needed.
- Computational cost: SVD (like QR) is $O(mn \min(m, n))$ for an $m \times n$ matrix, i.e. $O(n^3)$ when square, which is non-trivial for large weight matrices. For very large layers, He or Xavier is more practical at initialization time.
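The conv-layer behaviour from the first note can be checked directly. A small sketch, assuming a Conv2d layer with 16 input channels, 32 output channels, and a 3 x 3 kernel (sizes picked for illustration): after orthogonal_, the weight viewed as a (32, 144) matrix should have orthonormal rows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)
nn.init.orthogonal_(conv.weight)  # weight shape: (32, 16, 3, 3)

# One row per output channel, all remaining dimensions flattened.
flat = conv.weight.detach().reshape(conv.out_channels, -1)
gram = flat @ flat.T  # (32, 32); identity if the rows are orthonormal

print(flat.shape)
print(torch.allclose(gram, torch.eye(32), atol=1e-5))
```

Here the flattened matrix is wide (32 rows, 144 columns), so the rows are orthonormal; with more output channels than flattened inputs it would be the columns instead.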
Summary of Structured Initializations
| Initializer | What it produces | Primary use case |
|---|---|---|
| orthogonal_ / Orthogonal | $W^\top W = I$; all singular values = 1 | RNN hidden-to-hidden weights; deep linear networks |
| eye_ / Identity | Identity matrix (square only) | Linear adapter layers; residual passthrough |
| dirac_ | Dirac delta conv filter (PyTorch only) | Conv layer identity passthrough |
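The two identity-style initializers in the table can be tried out as follows. This is a minimal PyTorch sketch; the layer sizes, the bias=False setting, and padding=1 are choices made here so that each layer reproduces its input exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# eye_: a square Linear layer that starts as the identity map.
linear = nn.Linear(8, 8, bias=False)
nn.init.eye_(linear.weight)
x = torch.randn(4, 8)
linear_ok = torch.allclose(linear(x), x, atol=1e-6)

# dirac_: a conv layer that starts as an identity passthrough.
# Matching channel counts and "same" padding keep the output shape equal to the input.
conv = nn.Conv2d(3, 3, kernel_size=3, padding=1, bias=False)
nn.init.dirac_(conv.weight)
img = torch.randn(2, 3, 16, 16)
conv_ok = torch.allclose(conv(img), img, atol=1e-6)

print(linear_ok, conv_ok)
```

Both layers pass their input through unchanged at initialization, which is the property that makes them useful for adapter layers and residual-style passthroughs.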