Supplement · Normalization in Deep Learning

Why Normalization? Covariate Shift and the Loss Landscape

By the end of this reading you will be able to:

Explain internal covariate shift — how simultaneous parameter updates in all layers cause the input distribution of each layer to shift during training — and state why this slows optimization
Explain the Santurkar et al. (2018) finding that batch normalization's primary benefit is loss landscape smoothing rather than covariate shift reduction, and state what 'β-smooth' means for an optimizer
Identify the four normalization axes (what is normalized, along which dimension, over what scope the statistics are computed, and how the scale and shift parameters are set) and use them to locate BatchNorm, LayerNorm, InstanceNorm, and WeightNorm on a unified map
Explain why normalization allows larger learning rates and reduces sensitivity to weight initialization, connecting both to the loss landscape smoothing view

The Problem Before Normalization

Training deep networks in 2014 required careful initialization, very small learning rates, and extensive hyperparameter tuning. The root difficulty was that each layer's behavior depended heavily on the outputs of all previous layers — and those outputs were constantly changing as upstream parameters updated.

Consider what happens during a single gradient step: the optimizer updates parameters in every layer simultaneously. From layer $\ell$ 's perspective, not only do its own weights change, but the distribution of its inputs (the outputs of layer $\ell-1$ ) shifts too. Layer $\ell$ must then adapt to this new input distribution at the same time it is trying to minimize the loss. Every layer is trying to hit a moving target.

This is internal covariate shift: the distribution of each layer's activations changes continuously throughout training, in a way that is correlated with the parameter updates themselves.
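
The effect can be observed in a toy setting. The sketch below is illustrative only: it assumes a two-layer linear network, a made-up regression target, and hand-derived mean-squared-error gradients, none of which come from the reading. It records what layer 2 receives before and after both layers are updated together for a few steps.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 10))              # fixed input batch
t = x.sum(axis=1)                           # toy regression target (assumption)
W1 = rng.normal(scale=0.3, size=(10, 10))   # layer 1 weights
w2 = rng.normal(scale=0.3, size=10)         # layer 2 weights

def layer2_inputs(W1):
    """Activations that layer 2 receives for the fixed batch x."""
    return x @ W1

h_before = layer2_inputs(W1)
for _ in range(20):
    h = x @ W1
    err = h @ w2 - t                        # residual of the MSE loss
    grad_W1 = 2.0 / len(x) * x.T @ np.outer(err, w2)
    grad_w2 = 2.0 / len(x) * h.T @ err
    W1 -= 0.01 * grad_W1                    # both layers update in the same step
    w2 -= 0.01 * grad_w2
h_after = layer2_inputs(W1)

# Same data, same layer 2, but its input statistics have moved.
print("per-unit std before:", h_before.std(axis=0).round(2))
print("per-unit std after: ", h_after.std(axis=0).round(2))
```

The input batch never changes, so any difference between the two printed rows comes entirely from the update to layer 1.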

Two Concrete Failure Modes

Saturated Activations

For sigmoid and tanh activations, inputs far from zero produce near-zero gradients. Without normalization, pre-activations can drift into these saturated regions as earlier layers update. Because backpropagation multiplies by the activation's local derivative, a saturated unit passes almost no gradient back to the layers before it, and those layers effectively stop learning.
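
To make the saturation concrete, the short sketch below evaluates the sigmoid's derivative at a few inputs. The helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:4.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}")
```

The derivative is 0.25 at z = 0 and falls below 1e-4 by z = 10, so a unit whose inputs drift that far contributes almost nothing to the backward pass.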

Scale Sensitivity

If the inputs to a layer have very different scales across features (e.g., feature 1 has mean 0, std 0.01 while feature 2 has mean 10, std 100), the loss landscape has very different curvature in different directions: shallow in some, steep in others. A single global learning rate performs poorly. One small enough to stay stable in the steep directions makes negligible progress in the shallow ones, and one large enough to make progress in the shallow directions diverges in the steep ones.
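
A minimal sketch of this trade-off, assuming a two-parameter quadratic loss with curvatures 100 and 1 (values picked for illustration, not taken from any real network). For plain gradient descent on a quadratic, a direction with curvature c is stable only when the learning rate is below 2/c.

```python
def gradient_descent(lr, curvatures=(100.0, 1.0), steps=100):
    """Minimize f(w) = 0.5 * sum(c_i * w_i**2) starting from w = (1, 1)."""
    w = [1.0, 1.0]
    for _ in range(steps):
        # df/dw_i = c_i * w_i
        w = [wi - lr * c * wi for wi, c in zip(w, curvatures)]
    return w

slow = gradient_descent(lr=0.019)      # just under the 2/100 stability limit
unstable = gradient_descent(lr=0.021)  # just over the limit

print(f"lr=0.019: steep={slow[0]:.2e}, shallow={slow[1]:.2e}")
print(f"lr=0.021: steep={unstable[0]:.2e}, shallow={unstable[1]:.2e}")
```

At the smaller rate the steep coordinate converges while the shallow one is still far from zero after 100 steps; nudging the rate past the limit makes the steep coordinate blow up.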

The Original Hypothesis: Covariate Shift Reduction

Ioffe & Szegedy (2015) introduced batch normalization with the hypothesis that its benefit came from reducing internal covariate shift — by normalizing layer inputs to zero mean and unit variance, each layer sees a more stable distribution and can optimize more effectively.
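
For reference, the transform applied to a single feature over a mini-batch $x_1, \dots, x_m$ can be written as follows, where $\epsilon$ is a small constant for numerical stability and $\gamma$, $\beta$ are the learned scale and shift (this $\beta$ is unrelated to the smoothness constant discussed later in this reading):

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
$$

Under the original hypothesis, pinning the mean and variance of $\hat{x}_i$ in this way is what keeps each layer's input distribution stable.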

This is a compelling intuition and is still widely cited. However, subsequent work challenged whether covariate shift was actually the primary mechanism.

The Revised View: Loss Landscape Smoothing

Santurkar et al. (2018) ran controlled experiments in which they injected random noise, with non-zero mean and non-unit variance that changed at every training step, into the activations after the BatchNorm layers. This deliberately reintroduced covariate shift, yet the noisy BatchNorm network still trained faster than a non-BN baseline. Reducing covariate shift was therefore not the key.

Their finding: BatchNorm makes the loss landscape smoother.

Formally, normalization improves both the Lipschitzness of the loss (a smaller constant $L$) and the β-smoothness of its gradients (a smaller constant $\beta$):

β-smooth gradient: $\|\nabla\mathcal{L}(\theta_1) - \nabla\mathcal{L}(\theta_2)\| \leq \beta\|\theta_1 - \theta_2\|$ — the gradient does not change too rapidly
Lipschitz loss: $|\mathcal{L}(\theta_1) - \mathcal{L}(\theta_2)| \leq L\|\theta_1 - \theta_2\|$ — the loss does not change too rapidly
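
What β-smoothness buys an optimizer can be stated through a standard result for gradient descent, sketched here with a step size $\eta$ (a symbol introduced for this illustration). If $\mathcal{L}$ is β-smooth and $0 < \eta \leq 1/\beta$, then a single gradient step is guaranteed to decrease the loss:

$$
\mathcal{L}\bigl(\theta - \eta\,\nabla\mathcal{L}(\theta)\bigr) \;\leq\; \mathcal{L}(\theta) - \frac{\eta}{2}\,\bigl\|\nabla\mathcal{L}(\theta)\bigr\|^2
$$

The admissible step size scales as $1/\beta$, so a smaller $\beta$ permits a proportionally larger learning rate with the same guarantee.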

A smoother loss surface has two direct practical benefits:

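Larger learning rates are safe: when the gradient changes slowly, the gradient at the current point stays predictive over a longer step, so a larger step is less likely to run into a sharp cliff or spike
Less sensitivity to initialization: the landscape is well-conditioned over a wider region, so a broader range of starting points leads to well-behaved optimization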

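This is consistent with the empirical observation that BN networks tolerate much larger learning rates than non-BN networks: Ioffe & Szegedy (2015) trained batch-normalized Inception variants at 5× and 30× the baseline learning rate.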

A Unified Map: Normalization Axes

Every normalization technique can be characterized along four axes:

Axis	Options
What is normalized	Activations (feature maps) vs. weights
Normalization dimension	Batch (across examples) vs. feature (across channels/dims) vs. spatial
Statistics scope	Global (batch) vs. per-example vs. per-group
Scale/shift parameters	Fixed (zero mean, unit variance) vs. learned (γ, β) vs. adaptive (predicted)

Mapped onto the major techniques:

Technique	Normalizes	Dimension	Statistics	Parameters
BatchNorm	Activations	Per-feature across batch	Batch	Learned γ, β
LayerNorm	Activations	Per-example across features	Per-example	Learned γ, β
InstanceNorm	Activations	Per-example, per-channel across spatial positions	Per-example/channel	Learned γ, β or none
GroupNorm	Activations	Per-example, per group of channels	Per-group	Learned γ, β
RMSNorm	Activations	Per-example across features	Per-example (no mean)	Learned γ only
WeightNorm	Weights	Per-output-unit	Per-weight-vector	Learned g (scale)
SpectralNorm	Weights	Entire matrix	Per-matrix	None (constrained)
AdaIN	Activations	Per-example, per-channel across spatial positions	Per-example/channel	Adaptive (scale and shift taken from the style input)
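
For the activation-based rows, the difference comes down to which axes the mean and variance are reduced over. The NumPy sketch below assumes an image-style tensor of shape (N, C, H, W) and omits the learned γ and β; the helper names `normalize` and `group_norm` are illustrative and are not any framework's API.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(8, 6, 4, 4))   # (N, C, H, W)

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

batch_norm = normalize(x, axes=(0, 2, 3))     # per channel, across batch and space
layer_norm = normalize(x, axes=(1, 2, 3))     # per example, across all features
instance_norm = normalize(x, axes=(2, 3))     # per example and channel, across space

def group_norm(x, groups, eps=1e-5):
    """Per example, statistics over each group of channels and all positions."""
    n, c, h, w = x.shape
    grouped = x.reshape(n, groups, c // groups, h, w)
    return normalize(grouped, axes=(2, 3, 4), eps=eps).reshape(n, c, h, w)

group_normed = group_norm(x, groups=3)

# Only BatchNorm's statistics depend on the other examples in the batch.
print("BatchNorm per-channel mean:", batch_norm.mean(axis=(0, 2, 3)).round(6))
print("LayerNorm per-example mean:", layer_norm.mean(axis=(1, 2, 3)).round(6))
```

Changing the `axes` argument is the only difference between the first three variants, which is the sense in which they sit on one map.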

The subsequent readings examine each in depth.

Why Learning Rate and Initialization Sensitivity Decrease

In a non-normalized network, the gradient magnitude depends on the scale of the activations, which in turn depends on initialization and the upstream parameter history. With a poorly chosen learning rate, training diverges or stalls depending on the local scale, and the best learning rate changes as training progresses.

Normalization decouples gradient magnitude from activation scale. A normalized layer's output is unchanged when its incoming weights are rescaled, so forward activations and backpropagated gradients stay at a consistent scale regardless of how the unnormalized activations behave. The optimizer sees a more uniform loss landscape, and a single learning rate remains usable across a wider range of initializations and for more of training.
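
The scale invariance behind this argument can be checked directly. The sketch below assumes a hand-written `batch_norm` helper (training-mode statistics, no learned γ or β) and compares the normalized output of a layer against the same layer with its weights multiplied by 10.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize each feature (column) over the batch dimension."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))
W = rng.normal(size=(64, 64))

out_small = batch_norm(x @ W.T)             # weights at their original scale
out_large = batch_norm(x @ (10.0 * W).T)    # same weights scaled by 10

print("raw std:", (x @ W.T).std(), "vs", (x @ (10.0 * W).T).std())
print("max difference after normalization:", np.abs(out_small - out_large).max())
```

The raw pre-activations differ by a factor of 10, but the normalized outputs agree up to the small effect of `eps`. Ioffe & Szegedy (2015) note the corresponding property for gradients: scaling the weights by a factor $a$ scales the gradient with respect to those weights by $1/a$, so larger weights receive smaller updates.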

PyTorch and TensorFlow

The normalization API is largely consistent across frameworks. This reading is conceptual — the code for each specific technique appears in its own reading. But here is a quick demonstration of the core effect: normalization brings activations to a stable scale regardless of how the weights were initialized.

PyTorch:

import torch
import torch.nn as nn

torch.manual_seed(0)
W = torch.randn(64, 64) * 10.0    # poorly scaled weights
x = torch.randn(32, 64)
h = x @ W.T                        # raw pre-activations

print(f'Before normalization: mean={h.mean():.1f}, std={h.std():.1f}')
# Before normalization: mean~0, std~80  (scale depends entirely on W)

bn  = nn.BatchNorm1d(64)
h_n = bn(h)
print(f'After BatchNorm:      mean={h_n.mean():.4f}, std={h_n.std():.4f}')
# After BatchNorm: mean~0.0000, std~1.0000

# The same stabilization of activation scale holds for LayerNorm, GroupNorm, etc.
# (each computes its statistics over different axes).
# Each reading introduces the relevant nn.* class for its technique.

TensorFlow:

import tensorflow as tf

tf.random.set_seed(0)
W = tf.random.normal((64, 64), stddev=10.0)
x = tf.random.normal((32, 64))
h = x @ tf.transpose(W)

tf.print('Before normalization: std =', tf.math.reduce_std(h))
# std ~ 80

bn  = tf.keras.layers.BatchNormalization()
h_n = bn(h, training=True)
tf.print('After BatchNorm:      std =', tf.math.reduce_std(h_n))
# std ~ 1.0

References

Ioffe & Szegedy 2015 — Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Santurkar et al. 2018 — How Does Batch Normalization Help Optimization?
