Supplement · Regularization

Regularization in PyTorch

Colab Notebook · Python · ~50 min

Lab Objectives
1. Implement L1 and L2 weight penalties from scratch and verify against PyTorch's built-in weight decay; demonstrate numerically why Adam+L2 ≠ AdamW and measure the divergence on a small network
2. Implement inverted dropout from scratch, verify expected-activation invariance, and run MC Dropout (T=100 stochastic passes) to estimate predictive uncertainty on a held-out set
3. Implement Batch Normalization from scratch (computing batch statistics in train mode and using running averages in eval mode) and reproduce the common eval-mode bug
4. Implement Mixup and CutMix augmentation pipelines; train a small CNN on CIFAR-10 with and without each technique and compare validation accuracy curves
5. Implement label smoothing from scratch, verify the logit-gap bound analytically, and visualise the difference in learned feature clusters (t-SNE) with and without smoothing
6. Apply spectral normalization to a discriminator-style network, clip gradients by global norm, and verify the Lipschitz constraint empirically using random input perturbations

Lab Overview

This lab translates every regularization technique from the readings into runnable PyTorch code. Each section follows the same structure: build from scratch → compare to the PyTorch built-in → run a targeted experiment that demonstrates the effect.

Sections

| Section | Topic | Key experiment |
|---|---|---|
| 1 | L1 / L2 penalties, AdamW | Adam+L2 vs AdamW: track effective per-parameter decay rate |
| 2 | Dropout, MC Dropout | Uncertainty on a rotated-MNIST OOD set |
| 3 | Batch Normalization | Reproduce eval-mode bug; from-scratch BN vs nn.BatchNorm1d |
| 4 | Mixup & CutMix | CIFAR-10 val accuracy with and without augmentation |
| 5 | Label smoothing | Logit-gap bound; t-SNE clusters with/without smoothing |
| 6 | Spectral norm, gradient clipping | Lipschitz verification; gradient norm histogram before/after clipping |

Section 1 — Weight Penalties and AdamW

Implement l2_penalty(params, lam) and l1_penalty(params, lam) that iterate over parameter tensors and return the scalar penalty term. Train a two-layer MLP on a small regression task three ways:

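  • Adam + L2 penalty added to the loss
  • Adam + weight_decay argument (coupled: PyTorch adds weight_decay * w to the gradient, so this matches the first run when the penalty is (weight_decay / 2) * ||w||²)
  • AdamW + weight_decay argument (decoupled: the decay is applied directly to the weights)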

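Log the effective decay magnitude per parameter at each step. With Adam+L2 the decay term is part of the gradient, so it is divided by the square root of the second-moment estimate and the effective decay differs from parameter to parameter; AdamW applies the same multiplicative shrinkage, lr × weight_decay, to every weight.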
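
The three-way comparison above can be sketched as follows. This is a minimal illustration, not the lab's reference solution: the helper `train_weights`, the toy data, the network sizes, and the hyperparameters are all assumptions made for the example. It relies on the fact that `torch.optim.Adam` folds `weight_decay * w` into the gradient, so an explicit penalty with `lam = weight_decay / 2` should reproduce it, while `torch.optim.AdamW` should not.

```python
import torch
from torch import nn

def l2_penalty(params, lam):
    """lam * sum of squared entries over all parameter tensors."""
    return lam * sum(p.pow(2).sum() for p in params)

def l1_penalty(params, lam):
    """lam * sum of absolute entries over all parameter tensors."""
    return lam * sum(p.abs().sum() for p in params)

def train_weights(opt_cls, weight_decay, explicit_lam=0.0, steps=10):
    """Hypothetical helper: a few full-batch steps on toy data, returns flat weights."""
    torch.manual_seed(0)  # identical init and data for every configuration
    model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
    x, y = torch.randn(32, 4), torch.randn(32, 1)
    opt = opt_cls(model.parameters(), lr=1e-2, weight_decay=weight_decay)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        if explicit_lam > 0:
            loss = loss + l2_penalty(model.parameters(), explicit_lam)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.cat([p.detach().flatten() for p in model.parameters()])

wd = 0.1
# d/dw [lam * w^2] = 2 * lam * w, so lam = wd / 2 matches Adam's weight_decay
w_adam_l2 = train_weights(torch.optim.Adam, 0.0, explicit_lam=wd / 2)
w_adam_wd = train_weights(torch.optim.Adam, wd)
w_adamw = train_weights(torch.optim.AdamW, wd)

gap_coupled = (w_adam_l2 - w_adam_wd).abs().max().item()
gap_decoupled = (w_adam_wd - w_adamw).abs().max().item()
print(f"Adam+L2 vs Adam(weight_decay): {gap_coupled:.2e}")
print(f"Adam(weight_decay) vs AdamW:   {gap_decoupled:.2e}")
```

The first gap should sit at floating-point noise level and the second should be visibly larger; logging per-parameter decay, as the exercise asks, is the next step from here.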

Section 2 — Dropout and MC Dropout

Build inverted_dropout(x, p, training): sample a Bernoulli keep-mask with keep probability 1-p and scale the survivors by 1/(1-p). Assert that the output averaged over 1000 random masks is ≈ the input, i.e. E[output] = input.
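
One possible shape for this function is sketched below. The input size, the seed, and the choice of p=0.5 are arbitrary assumptions for the check; the averaging loop is a Monte Carlo estimate, so the match is approximate rather than exact.

```python
import torch

def inverted_dropout(x, p, training=True):
    """Zero each element with probability p and rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return x  # eval mode is the identity: no mask, no rescaling
    keep = 1.0 - p
    mask = torch.bernoulli(torch.full_like(x, keep))  # 1 = keep, 0 = drop
    return x * mask / keep

torch.manual_seed(0)
x = torch.randn(64, 32)
# Average over 1000 independent masks to estimate E[output]
mean_out = torch.stack([inverted_dropout(x, 0.5) for _ in range(1000)]).mean(0)
print("mean |E[output] - input|:", (mean_out - x).abs().mean().item())
```

Because the rescaling happens at training time, no correction is needed at inference, which is why the eval branch returns `x` unchanged.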

For MC Dropout: train a classifier on MNIST, then at test time run T=100 forward passes with dropout active and compute the mean and variance of the softmax outputs across passes. Compare predictive entropy on clean vs rotated (out-of-distribution) test samples.
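
A minimal sketch of the MC Dropout inference loop follows. The function name `mc_dropout_predict`, the untrained MLP, and the random stand-in inputs are assumptions for illustration; in the lab the model would be the trained MNIST classifier. The sketch keeps every layer in eval mode except `nn.Dropout`, so layers such as batch norm (if present) still use their running statistics.

```python
import torch
from torch import nn

def mc_dropout_predict(model, x, T=100):
    """Return (mean, variance, predictive entropy) over T stochastic passes."""
    was_training = model.training
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):  # only nn.Dropout; extend for Dropout2d etc.
            m.train()                  # re-enable sampling in dropout layers only
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    model.train(was_training)          # restore the caller's mode
    mean = probs.mean(0)
    var = probs.var(0)
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)
    return mean, var, entropy

torch.manual_seed(0)
model = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10)
)
x = torch.randn(16, 1, 28, 28)  # stand-in for a batch of MNIST images
mean, var, entropy = mc_dropout_predict(model, x, T=100)
print(mean.shape, var.shape, entropy.shape)
```

Running the same function on clean and rotated batches and comparing the two `entropy` tensors gives the comparison the exercise asks for.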

Section 3 — Batch Normalization

Implement MyBatchNorm1d(num_features) as an nn.Module with:

  • Learnable gamma and beta parameters
  • running_mean and running_var buffers
  • training flag controlling which statistics are used

Verify numerical agreement with nn.BatchNorm1d. Then deliberately omit model.eval() before a test-forward and observe how predictions change with batch size and composition.
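
The module described above might look like the sketch below. The defaults `eps=1e-5` and `momentum=0.1` are chosen to mirror `nn.BatchNorm1d`; the sketch handles only 2-D input of shape (N, C) with N > 1 and omits options such as `affine=False`.

```python
import torch
from torch import nn

class MyBatchNorm1d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):  # x: (N, C)
        if self.training:
            mean = x.mean(0)
            var = x.var(0, unbiased=False)  # biased variance normalizes the batch
            with torch.no_grad():
                n = x.shape[0]
                unbiased = var * n / (n - 1)  # running_var tracks the unbiased estimate
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * unbiased)
        else:
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

torch.manual_seed(0)
mine, ref = MyBatchNorm1d(5), nn.BatchNorm1d(5)
x = torch.randn(32, 5) * 3 + 1
mine.train(); ref.train()
train_ok = torch.allclose(mine(x), ref(x), atol=1e-5)
mine.eval(); ref.eval()
eval_ok = torch.allclose(mine(x), ref(x), atol=1e-5)
print("train match:", train_ok, "| eval match:", eval_ok)
```

To reproduce the eval-mode bug, leave the module in train mode and feed the same sample inside two different batches: its output changes because the batch statistics change.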

Section 4 — Mixup and CutMix

Implement mixup_batch(x, y, alpha) and cutmix_batch(x, y, alpha). Both return mixed inputs and soft label tensors. Key invariant to check: mixed_labels.sum(dim=1) must equal 1.0 for every sample.
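
A sketch of both functions is given below. It assumes `y` holds integer class indices and adds a `num_classes` argument that is not in the signature above, so that the soft labels can be built inside the function. A single mixing coefficient is drawn per batch, and the CutMix label weight is recomputed from the actual pasted area because the box is clipped at the image border.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha, num_classes):
    """Blend each sample with a randomly chosen partner; returns soft labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    y1 = F.one_hot(y, num_classes).float()
    return lam * x + (1 - lam) * x[perm], lam * y1 + (1 - lam) * y1[perm]

def cutmix_batch(x, y, alpha, num_classes):
    """Paste a random box from a partner image; label weight follows box area."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    H, W = x.shape[-2:]
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    x_mix = x.clone()
    x_mix[:, :, top:bottom, left:right] = x[perm][:, :, top:bottom, left:right]
    # Recompute lambda from the clipped box so labels match the pixels
    lam_adj = 1 - (bottom - top) * (right - left) / (H * W)
    y1 = F.one_hot(y, num_classes).float()
    return x_mix, lam_adj * y1 + (1 - lam_adj) * y1[perm]

torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)           # stand-in for a CIFAR-10 batch
y = torch.randint(0, 10, (8,))
xm, ym = mixup_batch(x, y, alpha=0.4, num_classes=10)
xc, yc = cutmix_batch(x, y, alpha=1.0, num_classes=10)
print(ym.sum(dim=1), yc.sum(dim=1))     # the invariant: every row sums to 1
```

With soft labels the training loss has to accept a target distribution; `F.cross_entropy` does so when the target is a float tensor of class probabilities.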

Train a small ResNet-9 on CIFAR-10 for 30 epochs:

  • Baseline (no augmentation)
  • Random flips + crops only
  • Mixup (α=0.4)
  • CutMix (α=1.0)

Plot validation accuracy curves. The Mixup and CutMix models typically train more slowly at first (the loss on blended examples is higher) but often generalise better.

Section 5 — Label Smoothing

Implement smooth_labels(y_onehot, eps, K) using the formula $\tilde{y}_k = (1-\varepsilon)\,y_k + \varepsilon/K$. Cross-check on a 10-class example that the correct class receives 0.91 (ε=0.1).
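
The formula translates almost directly into code. The sketch below also compares against the `label_smoothing` argument of `F.cross_entropy`, which is understood to use the same ε/K convention; the random logits and targets are placeholders for the check.

```python
import torch
import torch.nn.functional as F

def smooth_labels(y_onehot, eps, K):
    """(1 - eps) * one-hot + eps / K, applied elementwise."""
    return (1.0 - eps) * y_onehot + eps / K

# 10-class cross-check: correct class 0.9 + 0.01, every other class 0.01
y = F.one_hot(torch.tensor([3]), num_classes=10).float()
s = smooth_labels(y, eps=0.1, K=10)
print(s)

# Compare the loss on hand-smoothed targets with the built-in option
torch.manual_seed(0)
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
soft = smooth_labels(F.one_hot(targets, 10).float(), eps=0.1, K=10)
loss_manual = F.cross_entropy(logits, soft)
loss_builtin = F.cross_entropy(logits, targets, label_smoothing=0.1)
print(loss_manual.item(), loss_builtin.item())
```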

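Verify the logit-gap bound: after training with label smoothing, log the difference z[correct_class] - z[wrong_class] for 1000 samples and confirm that it stays at or below roughly $\log[(1-\varepsilon+\varepsilon/K)/(\varepsilon/K)] = \log[1 + K(1-\varepsilon)/\varepsilon]$, the gap at which the smoothed cross-entropy is minimised. (The form $\log[(1-\varepsilon)(K-1)/\varepsilon]$ is the counterpart when the ε mass is spread over the K-1 wrong classes only.) Without label smoothing, the loss has no finite minimiser, so this gap can keep growing throughout training.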
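
The target gap follows from the stationarity condition of cross-entropy with soft targets. A short derivation, writing q for the smoothed target distribution, c for the correct class and j for any wrong class (symbols introduced here for the derivation), and covering both common smoothing conventions:

```latex
\begin{aligned}
\mathcal{L}(z) &= -\sum_{k=1}^{K} q_k \log \operatorname{softmax}(z)_k,
\qquad
\frac{\partial \mathcal{L}}{\partial z_k} = \operatorname{softmax}(z)_k - q_k \\[4pt]
\text{at the minimum:}\quad \operatorname{softmax}(z)_k = q_k
&\;\Longrightarrow\; z_c - z_j = \log\frac{q_c}{q_j} \\[4pt]
q_c = 1-\varepsilon+\tfrac{\varepsilon}{K},\quad q_j = \tfrac{\varepsilon}{K}
&\;\Longrightarrow\; z_c - z_j = \log\!\left[1 + \frac{K(1-\varepsilon)}{\varepsilon}\right] \\[4pt]
q_c = 1-\varepsilon,\quad q_j = \tfrac{\varepsilon}{K-1}
&\;\Longrightarrow\; z_c - z_j = \log\frac{(1-\varepsilon)(K-1)}{\varepsilon}
\end{aligned}
```

For K=10 and ε=0.1 these evaluate to log 91 ≈ 4.51 and log 81 ≈ 4.39 respectively. With ε=0 the target is one-hot, q_j = 0, and no finite z satisfies the stationarity condition, which is why the unsmoothed gap keeps growing.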

Compute t-SNE embeddings from the penultimate layer of both models; observe that label-smoothed features form tighter, more separated clusters.

Section 6 — Spectral Normalization and Gradient Clipping

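Apply nn.utils.spectral_norm to every linear layer in a discriminator-style MLP. Verify that the spectral norm of each normalized weight matrix stays at ≈ 1.0 throughout training by calling torch.linalg.matrix_norm(layer.weight, ord=2) after each step. Allow a small tolerance: the normalization divides by a power-iteration estimate of the largest singular value, so the measured norm can sit slightly above 1.0 until the estimate converges.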
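
A sketch of both checks, the per-layer spectral norm and the empirical Lipschitz ratio, is shown below. The layer sizes, the LeakyReLU activation (which is 1-Lipschitz), the number of warm-up passes, and the perturbation scale are assumptions for the example; no training is performed, the forward passes only serve to let the power iteration converge.

```python
import torch
from torch import nn

torch.manual_seed(0)
disc = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(16, 32)),
    nn.LeakyReLU(0.2),
    nn.utils.spectral_norm(nn.Linear(32, 1)),
)

# Each training-mode forward pass runs one power-iteration step per layer
disc.train()
warmup = torch.randn(8, 16)
for _ in range(200):
    disc(warmup)

# layer.weight is the normalized weight recomputed on the last forward pass
sigmas = [
    torch.linalg.matrix_norm(m.weight.detach(), ord=2).item()
    for m in disc
    if isinstance(m, nn.Linear)
]
print("spectral norms:", sigmas)

# Empirical Lipschitz check: |f(x + d) - f(x)| / |d| for small random d
disc.eval()
with torch.no_grad():
    x = torch.randn(256, 16)
    d = 1e-2 * torch.randn_like(x)
    ratios = (disc(x + d) - disc(x)).norm(dim=1) / d.norm(dim=1)
print("max observed ratio:", ratios.max().item())
```

Random perturbations only give a lower bound on the true Lipschitz constant, so a maximum ratio below 1 is consistent with the constraint but does not prove it.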

For gradient clipping: train a deep RNN on a character-language-model task without clipping, observe gradient norm spikes, then re-run with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). Plot histograms of the per-step global gradient norm for both runs.
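
The logging needed for the histogram can be sketched as follows. The random tensors stand in for the character-level data, and the model size, learning rate, and step count are arbitrary assumptions. `clip_grad_norm_` returns the global norm measured before clipping, so a single run can record both the pre-clip and post-clip values.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=32, num_layers=3, batch_first=True)
head = nn.Linear(32, 8)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

pre_clip, post_clip = [], []
for step in range(20):
    x = torch.randn(4, 50, 8)        # placeholder batch: (batch, time, features)
    target = torch.randn(4, 50, 8)   # placeholder targets
    out, _ = rnn(x)
    loss = nn.functional.mse_loss(head(out), target)
    opt.zero_grad()
    loss.backward()
    # Rescales all gradients in place so their global L2 norm is at most max_norm
    total = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    pre_clip.append(total.item())
    post = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
    post_clip.append(post.item())
    opt.step()

print("max pre-clip norm: ", max(pre_clip))
print("max post-clip norm:", max(post_clip))
```

Passing `pre_clip` and `post_clip` to a histogram plot gives the before/after picture; the post-clip values should never exceed `max_norm`.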