Normalization in Deep Learning

A comprehensive treatment of normalization techniques — from why they work to how to choose between them. Covers the covariate shift and loss landscape views, BatchNorm internals (running stats, train/eval modes, SyncBN), the LayerNorm family (RMSNorm, DeepNorm, pre/post-norm), weight and spectral normalization, small-batch alternatives (GroupNorm, InstanceNorm), and adaptive/conditional normalization (AdaIN, SPADE, FiLM, adaLN-Zero in DiT).

Intermediate · 1.5h estimated · 6 readings · 2 quizzes · 2 labs · 2 drill decks

Readings

Why Normalization? Covariate Shift and the Loss Landscape

Internal covariate shift, the loss landscape smoothing result of Santurkar et al., β-smoothness, and a unified taxonomy of all normalization techniques across four axes

13 min

Batch Normalization — Algorithm, Placement, and Multi-GPU

Full BN algorithm with running statistics, the bug caused by forgetting model.eval() at inference, spatial BN for CNNs, pre-activation vs. post-activation placement, and Synchronized BatchNorm

17 min

Layer Norm, RMSNorm, and the Pre-Norm / Post-Norm Debate

LayerNorm formula and properties, pre-norm vs. post-norm stability analysis, RMSNorm's removal of mean-centering and its 15–20% speedup, DeepNorm's scaled residual for transformers with 1000+ layers

15 min

Weight Normalization and Spectral Normalization