Supplement · Normalization in Deep Learning

Why Normalization? Covariate Shift and the Loss Landscape

By the end of this reading you will be able to:
  • Explain internal covariate shift — how simultaneous parameter updates in all layers cause the input distribution of each layer to shift during training — and state why this slows optimization
  • Explain the Santurkar et al. (2018) finding that batch normalization's primary benefit is loss landscape smoothing rather than covariate shift reduction, and state what 'β-smooth' means for an optimizer
  • Identify the four normalization axes (what is normalized, which dimension it is normalized over, whether statistics are computed per-example or across the batch, and whether scale/shift parameters are fixed, learned, or adaptive) and use them to locate BatchNorm, LayerNorm, InstanceNorm, and WeightNorm on a unified map
  • Explain why normalization allows larger learning rates and reduces sensitivity to weight initialization, connecting both to the loss landscape smoothing view

The Problem Before Normalization

Training deep networks in 2014 required careful initialization, very small learning rates, and extensive hyperparameter tuning. The root difficulty was that each layer's behavior depended heavily on the outputs of all previous layers — and those outputs were constantly changing as upstream parameters updated.

Consider what happens during a single gradient step: the optimizer updates parameters in every layer simultaneously. From layer $\ell$'s perspective, not only do its own weights change, but the distribution of its inputs (the outputs of layer $\ell-1$) shifts too. Layer $\ell$ must then adapt to this new input distribution at the same time it is trying to minimize the loss. Every layer is trying to hit a moving target.

This is internal covariate shift: the distribution of each layer's activations changes continuously throughout training, in a way that is correlated with the parameter updates themselves.
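
The effect can be illustrated with a deliberately tiny sketch in plain Python. The two-layer setup, the scalar weight `w1`, and the "after update" value are all hypothetical choices made for illustration; the point is only that the data never changes, yet the distribution the second layer sees does.

```python
import random
import statistics

random.seed(0)

# Toy setup (hypothetical): layer 1 computes h = w1 * x, and layer 2 receives h.
xs = [random.gauss(0.0, 1.0) for _ in range(1000)]  # fixed input data


def layer1_outputs(w1):
    """Activations of layer 1, i.e. the inputs that layer 2 sees."""
    return [w1 * x for x in xs]


w1_before = 0.5
w1_after = 2.0  # assumed value of w1 after some gradient updates

std_before = statistics.pstdev(layer1_outputs(w1_before))
std_after = statistics.pstdev(layer1_outputs(w1_after))

print(f"std of layer-2 inputs before update: {std_before:.3f}")
print(f"std of layer-2 inputs after update:  {std_after:.3f}")
# The data xs is unchanged, but layer 2's input distribution is now 4x wider.
```

In a real network the shift is smaller per step and happens in every layer at once, but the mechanism is the same.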


Two Concrete Failure Modes

Saturated Activations

For sigmoid and tanh activations, inputs far from zero produce near-zero gradients. Without normalization, activations can drift into saturated regions as earlier layers update. During backpropagation, the saturated units then pass almost no gradient back to the layers before them, and those layers effectively stop learning.
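
To get a feel for how quickly saturation sets in, the sketch below evaluates the sigmoid derivative at a few pre-activation values. The helper names `sigmoid` and `sigmoid_grad` are introduced here for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative peaks at 0.25 for z = 0 and decays roughly like exp(-|z|).
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}")
```

Because backpropagation multiplies these local derivatives layer by layer, a pre-activation that has drifted to around 10 scales the gradient passing through it by less than 1e-4.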

Scale Sensitivity

If the inputs to a layer have very different scales across features (e.g., feature 1 has mean 0, std 0.01 while feature 2 has mean 10, std 100), the loss landscape has very different curvature in different directions: steep along some, shallow along others. A single global learning rate performs poorly. If it is large enough to make progress in the shallow directions, it diverges in the steep ones; if it is small enough to be stable in the steep directions, it barely moves in the shallow ones.
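
A minimal sketch of this trade-off, assuming a two-parameter quadratic loss whose curvatures (`steep`, `shallow`) are made-up values chosen to differ by a factor of 10,000. Plain gradient descent on a quadratic is stable along a direction only when the learning rate is below 2 divided by that direction's curvature.

```python
# Hypothetical loss: L(a, b) = 0.5 * (steep * a**2 + shallow * b**2)
steep, shallow = 100.0, 0.01


def run_gd(lr, steps=100):
    """Run plain gradient descent from (1, 1) with one global learning rate."""
    a, b = 1.0, 1.0
    for _ in range(steps):
        a -= lr * steep * a      # dL/da = steep * a
        b -= lr * shallow * b    # dL/db = shallow * b
    return a, b


a_big, b_big = run_gd(lr=0.021)      # just above the stability limit 2 / steep = 0.02
a_small, b_small = run_gd(lr=0.019)  # just below the limit

print(f"lr=0.021: a={a_big:.3e}, b={b_big:.4f}")      # a has blown up
print(f"lr=0.019: a={a_small:.3e}, b={b_small:.4f}")  # a converged, b barely moved
```

With the stable learning rate, the steep coordinate converges within 100 steps while the shallow coordinate has covered only about 2% of the distance to its minimum.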


The Original Hypothesis: Covariate Shift Reduction

Ioffe & Szegedy (2015) introduced batch normalization with the hypothesis that its benefit came from reducing internal covariate shift — by normalizing layer inputs to zero mean and unit variance, each layer sees a more stable distribution and can optimize more effectively.

This is a compelling intuition and is still widely cited. However, subsequent work challenged whether covariate shift was actually the primary mechanism.


The Revised View: Loss Landscape Smoothing

Santurkar et al. (2018) ran controlled experiments where they added random noise to the activations of a BatchNorm network — reintroducing covariate shift deliberately — and found the network still trained faster than a non-BN baseline. Covariate shift was not the key.

Their finding: BatchNorm makes the loss landscape smoother.

Formally, normalization reduces both the Lipschitz constant $L$ of the loss and the smoothness constant $\beta$ of its gradient:

  • β-smooth gradient: $\|\nabla\mathcal{L}(\theta_1) - \nabla\mathcal{L}(\theta_2)\| \leq \beta\|\theta_1 - \theta_2\|$, so the gradient does not change too rapidly
  • Lipschitz loss: $|\mathcal{L}(\theta_1) - \mathcal{L}(\theta_2)| \leq L\|\theta_1 - \theta_2\|$, so the loss does not change too rapidly
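
What β-smoothness means for an optimizer can be made concrete with the standard descent lemma for gradient descent. The step size $\eta$ is notation introduced here and does not appear in the definitions above; the bound is a textbook consequence of the β-smoothness condition, not a result specific to BatchNorm.

```latex
% If \nabla\mathcal{L} is \beta-smooth, one gradient step with step size \eta satisfies
\mathcal{L}\bigl(\theta - \eta\nabla\mathcal{L}(\theta)\bigr)
  \;\leq\; \mathcal{L}(\theta)
  \;-\; \eta\Bigl(1 - \frac{\beta\eta}{2}\Bigr)\,\|\nabla\mathcal{L}(\theta)\|^{2}

% The right-hand side is below \mathcal{L}(\theta) whenever 0 < \eta < 2/\beta.
% Choosing \eta = 1/\beta gives a guaranteed decrease of
\mathcal{L}(\theta) - \mathcal{L}\bigl(\theta - \tfrac{1}{\beta}\nabla\mathcal{L}(\theta)\bigr)
  \;\geq\; \frac{1}{2\beta}\,\|\nabla\mathcal{L}(\theta)\|^{2}
```

Read this way, a smaller $\beta$ widens the range of step sizes that are guaranteed to decrease the loss, and increases the progress each such step can make.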

A smoother loss surface has two direct practical benefits:

  1. Larger learning rates are safe: because the gradient changes slowly, a large step in the gradient direction is much less likely to run into a sharp cliff or spike
  2. Less sensitivity to initialization: the optimizer tends to start in a better-conditioned region across a wider range of initial weights

This offers an explanation for the empirical observation that BN networks can be trained with learning rates roughly 10–30× larger than comparable non-BN networks.


A Unified Map: Normalization Axes

Every normalization technique can be characterized by four dimensions:

| Axis | Options |
| --- | --- |
| What is normalized | Activations (feature maps) vs. weights |
| Normalization dimension | Batch (across examples) vs. feature (across channels/dims) vs. spatial |
| Statistics scope | Global (batch) vs. per-example vs. per-group |
| Scale/shift parameters | Fixed (zero mean, unit variance) vs. learned (γ, β) vs. adaptive (predicted) |

Mapped onto the major techniques:

| Technique | Normalizes | Dimension | Statistics | Parameters |
| --- | --- | --- | --- | --- |
| BatchNorm | Activations | Per-feature across batch | Batch | Learned γ, β |
| LayerNorm | Activations | Per-example across features | Per-example | Learned γ, β |
| InstanceNorm | Activations | Per-example per-channel | Per-example/channel | Learned or none |
| GroupNorm | Activations | Per-example, per group of channels | Per-group | Learned γ, β |
| RMSNorm | Activations | Per-example across features | Per-example (no mean) | Learned γ only |
| WeightNorm | Weights | Per-output-unit | Per-weight-vector | Learned g (scale) |
| SpectralNorm | Weights | Entire matrix | Per-matrix | None (constrained) |
| AdaIN | Activations | Per-example per-channel | Adaptive (from style) | Predicted |

The subsequent readings examine each in depth.
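
As a preview, the difference between the BatchNorm and LayerNorm rows comes down to which axis the mean and variance are taken over. The sketch below uses plain Python lists and a hypothetical `normalize` helper; it omits the learned γ and β and the running statistics that real layers maintain, so it illustrates the axis choice only.

```python
import statistics


def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to unit variance (no learned gamma/beta)."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# A tiny batch: 3 examples (rows) x 2 features (columns).
batch = [
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
]

# BatchNorm-style: one set of statistics per feature, computed down each column.
columns = [normalize(list(col)) for col in zip(*batch)]
batch_normed = [list(row) for row in zip(*columns)]

# LayerNorm-style: one set of statistics per example, computed along each row.
layer_normed = [normalize(row) for row in batch]

print("BatchNorm-style:", batch_normed)
print("LayerNorm-style:", layer_normed)
```

In this toy batch the BatchNorm-style output keeps the ordering of the three examples within each feature, while the LayerNorm-style output maps every example to approximately [-1, 1], because each row is normalized without reference to the others.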


Why Learning Rate and Initialization Sensitivity Decrease

In a non-normalized network, the gradient magnitude depends on the scale of the activations, which in turn depends on initialization and the upstream parameter history. A poorly chosen learning rate makes training diverge or stall depending on the local scale, and the best learning rate changes as training progresses.

Normalization largely decouples gradient magnitude from activation scale. After normalization, gradients flow at a more consistent scale regardless of how the unnormalized activations behave. In the smoothing view, this is what a smaller β looks like in practice: the optimizer sees a more uniform loss landscape, so a single learning rate works across a wider range of initializations and stages of training.
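
The forward half of this argument is easy to check: rescaling the pre-activations, for example by initializing the incoming weights 1000× larger, leaves the normalized output essentially unchanged. The sketch below is a minimal illustration in plain Python with a hypothetical `normalize` helper and a made-up scale factor; it does not compute gradients.

```python
import random
import statistics

random.seed(0)


def normalize(values, eps=1e-12):
    """Zero-mean, unit-variance normalization over one feature."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


h = [random.gauss(0.0, 1.0) for _ in range(8)]  # pre-activations for one feature
h_scaled = [1000.0 * v for v in h]              # same layer, 1000x larger weights

out = normalize(h)
out_scaled = normalize(h_scaled)

max_diff = max(abs(a - b) for a, b in zip(out, out_scaled))
print(f"largest difference after normalization: {max_diff:.2e}")
```

Since the downstream layers receive the same values either way, the loss and the gradients they pass back are also the same, which is why the scale of the initialization matters much less once a normalization layer is in place.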


PyTorch and TensorFlow

The normalization API is largely consistent across frameworks. This reading is conceptual — the code for each specific technique appears in its own reading. But here is a quick demonstration of the core effect: normalization brings activations to a stable scale regardless of how the weights were initialized.

PyTorch:

import torch
import torch.nn as nn

torch.manual_seed(0)
W = torch.randn(64, 64) * 10.0    # poorly scaled weights
x = torch.randn(32, 64)
h = x @ W.T                        # raw pre-activations

print(f'Before normalization: mean={h.mean():.1f}, std={h.std():.1f}')
# Before normalization: mean~0, std~80  (scale depends entirely on W)

bn  = nn.BatchNorm1d(64)
h_n = bn(h)
print(f'After BatchNorm:      mean={h_n.mean():.4f}, std={h_n.std():.4f}')
# After BatchNorm: mean~0.0000, std~1.0000

# The same activation-scale stability holds for LayerNorm, GroupNorm, etc.
# Each reading introduces the relevant nn.* class for its technique.

TensorFlow:

import tensorflow as tf

tf.random.set_seed(0)
W = tf.random.normal((64, 64), stddev=10.0)
x = tf.random.normal((32, 64))
h = x @ tf.transpose(W)

tf.print('Before normalization: std =', tf.math.reduce_std(h))
# std ~ 80

bn  = tf.keras.layers.BatchNormalization()
h_n = bn(h, training=True)
tf.print('After BatchNorm:      std =', tf.math.reduce_std(h_n))
# std ~ 1.0