Regularization in PyTorch
Lab Overview
This lab translates every regularization technique from the readings into runnable PyTorch code. Each section follows the same structure: build from scratch → compare to the PyTorch built-in → run a targeted experiment that demonstrates the effect.
Sections
| Section | Topic | Key experiment |
|---|---|---|
| 1 | L1 / L2 penalties, AdamW | Adam+L2 vs AdamW: track effective per-parameter decay rate |
| 2 | Dropout, MC Dropout | Uncertainty on a rotated-MNIST OOD set |
| 3 | Batch Normalization | Reproduce eval-mode bug; from-scratch BN vs nn.BatchNorm1d |
| 4 | Mixup & CutMix | CIFAR-10 val accuracy with and without augmentation |
| 5 | Label smoothing | Logit-gap bound; t-SNE clusters with/without smoothing |
| 6 | Spectral norm, gradient clipping | Lipschitz verification; gradient norm histogram before/after clipping |
Section 1 — Weight Penalties and AdamW
Implement l2_penalty(params, lam) and l1_penalty(params, lam) that iterate over parameter tensors and return the scalar penalty term. Train a two-layer MLP on a small regression task three ways:
- Adam + L2 added to loss
- Adam + weight_decay argument
- AdamW + weight_decay argument
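Log the effective decay magnitude per parameter at each step to show that Adam+L2 produces inconsistent effective decay, because the penalty gradient is divided by the square root of each parameter's second-moment estimate, while AdamW applies the same decay rate (lr × weight_decay) to every parameter.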
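A minimal sketch of the two penalty helpers and the three optimizer setups follows. It assumes the penalty is defined as lam times the sum of squares (no factor of 1/2), so the matching weight_decay value for Adam is 2 × lam; the model sizes, learning rate, and the make_model helper are illustrative choices, not part of the lab spec.

```python
import torch
from torch import nn

def l2_penalty(params, lam):
    """lam * sum of squared entries over all parameter tensors (scalar tensor)."""
    return lam * sum(p.pow(2).sum() for p in params)

def l1_penalty(params, lam):
    """lam * sum of absolute entries over all parameter tensors (scalar tensor)."""
    return lam * sum(p.abs().sum() for p in params)

def make_model():
    # Hypothetical two-layer MLP for a small regression task.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

torch.manual_seed(0)
x, y = torch.randn(16, 4), torch.randn(16, 1)
lam = 1e-2

# Variant 1: Adam, with the L2 term added to the loss by hand.
model_a = make_model()
opt_a = torch.optim.Adam(model_a.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(model_a(x), y) + l2_penalty(model_a.parameters(), lam)
opt_a.zero_grad()
loss.backward()
opt_a.step()

# Variant 2: Adam with weight_decay. It adds weight_decay * p to the gradient,
# which matches the gradient of lam * ||p||^2 when weight_decay = 2 * lam.
model_b = make_model()
opt_b = torch.optim.Adam(model_b.parameters(), lr=1e-3, weight_decay=2 * lam)

# Variant 3: AdamW. The decay is applied to the weights directly,
# outside the adaptive gradient scaling.
model_c = make_model()
opt_c = torch.optim.AdamW(model_c.parameters(), lr=1e-3, weight_decay=2 * lam)
```

Variants 2 and 3 are stepped the same way as variant 1, minus the explicit penalty term in the loss.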
Section 2 — Dropout and MC Dropout
Build inverted_dropout(x, p, training) — mask with Bernoulli, scale survivors by 1/(1-p). Assert that E[output] ≈ input over 1000 random calls.
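One possible from-scratch version is sketched below. The default for training and the ValueError on an invalid p are assumptions added for safety; the expectation check averages 1000 independent calls on a tensor of ones.

```python
import torch

def inverted_dropout(x, p, training=True):
    """Zero each element with probability p and scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return x  # identity at eval time
    keep = 1.0 - p
    mask = torch.bernoulli(torch.full_like(x, keep))  # 1 = keep, 0 = drop
    return x * mask / keep

torch.manual_seed(0)
x = torch.ones(1000)
# Average over 1000 random calls; each element should be close to its input.
mean_out = torch.stack([inverted_dropout(x, 0.5) for _ in range(1000)]).mean(dim=0)
max_dev = (mean_out - x).abs().max().item()
```

The built-in to compare against is torch.nn.functional.dropout(x, p, training), which applies the same inverted scaling.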
For MC Dropout: train a classifier on MNIST, then at test time run T=100 forward passes with dropout active and compute the mean and variance of the softmax outputs. Compare predictive entropy on clean vs rotated (out-of-distribution) test samples.
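The test-time procedure could look roughly like the sketch below. The helper name mc_dropout_predict, the untrained stand-in model, and the random input batch are placeholders for illustration; in the lab the model would be the trained MNIST classifier and the inputs would be clean or rotated test images.

```python
import torch
from torch import nn

def mc_dropout_predict(model, x, T=100):
    """Run T stochastic forward passes with only the dropout layers in train mode."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()  # keep dropout sampling active at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    mean = probs.mean(dim=0)   # (N, K) predictive mean
    var = probs.var(dim=0)     # (N, K) per-class variance across passes
    # Predictive entropy of the mean distribution, one value per sample.
    entropy = -(mean * torch.log(mean.clamp_min(1e-12))).sum(dim=-1)
    return mean, var, entropy

torch.manual_seed(0)
# Stand-in for a trained MNIST classifier (hypothetical architecture).
model = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10)
)
x = torch.randn(8, 1, 28, 28)  # placeholder batch, not real MNIST data
mean, var, entropy = mc_dropout_predict(model, x, T=100)
```

Switching only the nn.Dropout modules to train mode leaves any normalization layers in eval mode, which is usually what is wanted here.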
Section 3 — Batch Normalization
Implement MyBatchNorm1d(num_features) as an nn.Module with:
- Learnable gamma and beta parameters
- running_mean and running_var buffers
- training flag controlling which statistics are used
Verify numerical agreement with nn.BatchNorm1d. Then deliberately omit model.eval() before a test-forward and observe how predictions change with batch size and composition.
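A sketch of such a module, restricted to 2D inputs of shape (N, num_features), is shown below. The eps and momentum defaults are chosen to match nn.BatchNorm1d; the batch is normalized with the biased variance while the running buffer stores the unbiased one, which is the convention the built-in follows.

```python
import torch
from torch import nn

class MyBatchNorm1d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):  # x: (N, num_features)
        if self.training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)  # biased variance normalizes the batch
            with torch.no_grad():
                n = x.shape[0]
                unbiased = var * n / max(n - 1, 1)  # running buffer keeps unbiased variance
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * unbiased)
        else:
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

torch.manual_seed(0)
mine, ref = MyBatchNorm1d(4), nn.BatchNorm1d(4)
x = torch.randn(32, 4)
train_mine, train_ref = mine(x), ref(x)      # train mode: batch statistics
mine.eval(); ref.eval()
eval_mine, eval_ref = mine(x[:3]), ref(x[:3])  # eval mode: running statistics
```

Calling the modules on x[:3] without the eval() calls would normalize those three rows by their own statistics, which is the bug the experiment reproduces.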
Section 4 — Mixup and CutMix
Implement mixup_batch(x, y, alpha) and cutmix_batch(x, y, alpha). Both return mixed inputs and soft label tensors. Key invariant to check: mixed_labels.sum(dim=1) must equal 1.0 for every sample.
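The two helpers might be sketched as follows. This version assumes y is already a float one-hot (or soft) label tensor of shape (N, K), draws a single mixing coefficient per batch from Beta(alpha, alpha), and, for CutMix, recomputes the label weight from the actual pasted area after clipping the box to the image.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha):
    """Blend each sample with a randomly permuted partner; y is (N, K) float."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def cutmix_batch(x, y, alpha):
    """Paste a random box from a permuted partner; x is (N, C, H, W)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    H, W = x.shape[-2:]
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    mixed = x.clone()
    mixed[..., top:bottom, left:right] = x[perm][..., top:bottom, left:right]
    # Label weight follows the area that was actually kept from the original.
    lam_adj = 1 - (bottom - top) * (right - left) / (H * W)
    return mixed, lam_adj * y + (1 - lam_adj) * y[perm]

torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)                              # CIFAR-10-shaped placeholder
y = F.one_hot(torch.randint(10, (8,)), num_classes=10).float()
mx, my = mixup_batch(x, y, alpha=0.4)
cx_, cy_ = cutmix_batch(x, y, alpha=1.0)
```

Because both label mixes are convex combinations of rows that each sum to 1, the row-sum invariant holds by construction.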
Train a small ResNet-9 on CIFAR-10 for 30 epochs:
- Baseline (no augmentation)
- Random flips + crops only
- Mixup (α=0.4)
- CutMix (α=1.0)
Plot validation accuracy curves; note that the Mixup/CutMix models initially train slower (loss is higher on blended examples) but generalise better.
Section 5 — Label Smoothing
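Implement smooth_labels(y_onehot, eps, K) using the formula $y_{\text{smooth}} = (1 - \varepsilon)\, y_{\text{onehot}} + \varepsilon / K$. Cross-check on a 10-class example that the correct class receives 0.91 and every other class receives 0.01 (ε=0.1).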
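A short sketch, assuming the uniform-mixture form (1 - eps) * y_onehot + eps / K, which is the form consistent with the 0.91 cross-check. The comparison against the label_smoothing argument of F.cross_entropy uses random logits purely as a placeholder.

```python
import torch
import torch.nn.functional as F

def smooth_labels(y_onehot, eps, K):
    """Mix the one-hot target with a uniform distribution over K classes."""
    return (1.0 - eps) * y_onehot + eps / K

torch.manual_seed(0)
K, eps = 10, 0.1
targets = torch.randint(K, (5,))
y_smooth = smooth_labels(F.one_hot(targets, K).float(), eps, K)

# Compare a manual soft-target cross-entropy with the built-in label_smoothing option.
logits = torch.randn(5, K)
manual = -(y_smooth * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
builtin = F.cross_entropy(logits, targets, label_smoothing=eps)
```

The entry at each correct class is 0.9 + 0.1 / 10 = 0.91, and each of the nine other entries is 0.01.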
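Verify the logit-gap bound: after training with label smoothing, log the difference z[correct_class] - z[wrong_class] for 1000 samples and confirm it is bounded by roughly $\log\bigl(1 + K(1 - \varepsilon)/\varepsilon\bigr)$, the gap at which the softmax output matches a smoothed target of the form $(1 - \varepsilon)\, y_{\text{onehot}} + \varepsilon / K$ (about 4.51 for K=10, ε=0.1). Without label smoothing, this gap grows without bound through training.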
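The gap measurement can be sketched as below. The helper name logit_gaps is hypothetical, and the reference value log(1 + K(1 - eps)/eps) is derived here under the assumption that smoothed targets take the form (1 - eps) * y_onehot + eps / K; it is the gap at which the softmax output equals the smoothed target, so trained models should be expected to approach it rather than match it exactly. The self-check builds logits whose softmax equals the smoothed target and confirms the gap equals the reference.

```python
import math
import torch
import torch.nn.functional as F

def logit_gaps(logits, targets):
    """Return z[correct] - z[wrong] for every wrong class, shape (N, K-1)."""
    N, K = logits.shape
    correct = logits.gather(1, targets[:, None])   # (N, 1)
    wrong_mask = ~F.one_hot(targets, K).bool()      # True at the K-1 wrong classes
    return (correct - logits)[wrong_mask].view(N, K - 1)

def optimal_gap(K, eps):
    """Gap at which softmax matches the smoothed target (assumed convention)."""
    return math.log(1 + K * (1 - eps) / eps)

torch.manual_seed(0)
K, eps = 10, 0.1
targets = torch.randint(K, (1000,))
y_smooth = (1 - eps) * F.one_hot(targets, K).float() + eps / K
ideal_logits = torch.log(y_smooth)   # softmax(ideal_logits) == y_smooth
gaps = logit_gaps(ideal_logits, targets)
ref = optimal_gap(K, eps)
```

In the lab, ideal_logits would be replaced by the logits of each trained model on 1000 held-out samples, and the resulting gaps compared against ref.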
Compute t-SNE embeddings from the penultimate layer of both models; observe that label-smoothed features form tighter, more separated clusters.
Section 6 — Spectral Normalization and Gradient Clipping
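Apply nn.utils.spectral_norm to every linear layer in a discriminator-style MLP. Verify that the spectral norm of each normalized weight matrix stays close to 1.0 throughout training by calling torch.linalg.matrix_norm(layer.weight, ord=2) after each step. Allow a small tolerance: the layer estimates the largest singular value with power iteration, so the measured norm can sit slightly above 1.0, especially in the first few steps.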
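The verification loop might look like the sketch below. The layer sizes, the two-Gaussian "real vs fake" data, and the 50-step budget are placeholder assumptions. Note that layer.weight is the normalized weight computed during the most recent forward pass, so the logged value describes the weights that forward pass actually used.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical discriminator-style MLP with spectral norm on every linear layer.
disc = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(16, 32)),
    nn.LeakyReLU(0.2),
    nn.utils.spectral_norm(nn.Linear(32, 1)),
)
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
linear_layers = [m for m in disc if isinstance(m, nn.Linear)]

history = []  # one list of per-layer spectral norms per training step
for step in range(50):
    real = torch.randn(32, 16) + 1.0   # placeholder "real" samples
    fake = torch.randn(32, 16)         # placeholder "fake" samples
    x = torch.cat([real, fake])
    labels = torch.cat([torch.ones(32, 1), torch.zeros(32, 1)])
    loss = nn.functional.binary_cross_entropy_with_logits(disc(x), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Largest singular value of the normalized weight from the last forward.
        history.append(
            [torch.linalg.matrix_norm(l.weight, ord=2).item() for l in linear_layers]
        )
```

Newer PyTorch releases also provide torch.nn.utils.parametrizations.spectral_norm; the same check applies there.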
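For gradient clipping: train a deep RNN on a character-level language-modeling task without clipping, observe gradient norm spikes, then re-run with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). The call returns the total gradient norm measured before clipping, so log that value alongside the norm recomputed after the call. Plot the gradient norm histogram before and after.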
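A compact sketch of the logging loop follows. The three-layer nn.RNN, the random token data, and the SGD settings are stand-ins for the character-level task; the parameters of the RNN and the output head are collected in one list here instead of a single model.parameters() call.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab = 8  # placeholder vocabulary size
rnn = nn.RNN(input_size=vocab, hidden_size=32, num_layers=3, batch_first=True)
head = nn.Linear(32, vocab)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

pre_clip, post_clip = [], []
for step in range(20):
    x = torch.randn(16, 50, vocab)              # random inputs, not real text
    target = torch.randint(vocab, (16, 50))
    out, _ = rnn(x)
    loss = nn.functional.cross_entropy(
        head(out).reshape(-1, vocab), target.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the pre-clip total norm.
    total = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    pre_clip.append(total.item())
    # Recompute the global L2 norm after clipping for the "after" histogram.
    post = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
    post_clip.append(post.item())
    opt.step()
```

The two lists can be passed to a histogram plot directly; every value in post_clip should be at most max_norm.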