Optimizers in PyTorch
Lab Objectives
1. Implement all 13 PyTorch optimizers and observe their update dynamics
2. Visualize optimizer trajectories on a non-convex 2D loss surface
3. Diagnose and fix gradient explosion using clip_grad_norm_
4. Build a param-group config with layer-wise learning rate decay
5. Compare StepLR, CosineAnnealingLR, OneCycleLR, and ReduceLROnPlateau empirically
6. Reproduce super-convergence on CIFAR-10 using OneCycleLR + SGD
7. Implement the warmup + cosine decay recipe with SequentialLR
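To make objective 3 concrete, here is a minimal, dependency-free sketch of the arithmetic behind norm-based and value-based clipping. It uses plain Python lists instead of tensors; the helper names `clip_by_global_norm` and `clip_by_value` and the sample gradients are illustrative assumptions, not PyTorch functions.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradient vectors so their joint L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm_before_clipping).
    """
    # One global norm over all entries of all vectors, not one norm per vector.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The scale is capped at 1.0 so gradients already below max_norm are untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

def clip_by_value(grads, clip_value):
    """Clamp every entry to [-clip_value, clip_value] independently."""
    return [[max(-clip_value, min(clip_value, g)) for g in vec] for vec in grads]

# Hypothetical gradients for two parameters; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
norm_clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
value_clipped = clip_by_value(grads, clip_value=1.0)
```

Norm clipping rescales every entry by the same factor, so the update direction is preserved; value clipping clamps entries one by one and can change the direction (here `[3, 4, 12]` becomes `[1, 1, 1]`). In a PyTorch training loop, the corresponding call is `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`, placed between `loss.backward()` and `optimizer.step()`.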
Lab Overview
This lab gives you hands-on experience with all 13 PyTorch optimizers and 15 learning rate schedulers through a series of experiments of increasing complexity.
What You'll Build
- Optimizer trajectory visualization — plot gradient descent paths on the 2D Rosenbrock function for SGD, Adam, AdamW, and RMSprop
- Convergence speed comparison — train a 4-layer MLP on synthetic regression data with every optimizer; plot loss curves
- Adaptive vs. SGD on sparse inputs — compare Adagrad, RMSprop, and Adam on a sparse NLP bag-of-words task
- L-BFGS with closure — fit a physics curve with LBFGS and compare convergence to Adam
- Param groups and LLRD — fine-tune a pretrained ResNet-18 with layer-wise learning rate decay; measure accuracy vs. uniform LR
- Gradient clipping experiment — induce gradient explosion in a deep ReLU net; compare unclipped, clip_grad_value_, and clip_grad_norm_
- Scheduler comparison — train the same network with StepLR, CosineAnnealingLR, OneCycleLR, and ReduceLROnPlateau; plot LR curves and final accuracy
- Super-convergence with OneCycleLR — reproduce Smith's fast training result on CIFAR-10 ResNet-18: 93% accuracy in 30 epochs
- SequentialLR warmup + cosine — implement the standard transformer training recipe and compare to no-warmup baseline
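As a preview of the param-group exercise in the list above, the following sketch computes layer-wise decayed learning rates for the top-level blocks of a ResNet-18. It is only an illustration of the arithmetic: the helper `llrd_groups`, the grouping of the first convolution and batch norm into a single "stem" entry, and the values `base_lr=1e-3` and `decay=0.5` are assumptions, not settings prescribed by the lab.

```python
def llrd_groups(layer_names, base_lr, decay):
    """Return one config dict per layer with a layer-wise decayed learning rate.

    layer_names must be ordered from the input side to the output side.
    The last layer gets base_lr; each step toward the input multiplies by decay.
    """
    depth = len(layer_names)
    groups = []
    for i, name in enumerate(layer_names):
        distance_from_top = depth - 1 - i
        groups.append({"name": name, "lr": base_lr * decay ** distance_from_top})
    return groups

# Top-level blocks of a ResNet-18, input to output ("stem" is a label chosen here).
resnet18_blocks = ["stem", "layer1", "layer2", "layer3", "layer4", "fc"]
groups = llrd_groups(resnet18_blocks, base_lr=1e-3, decay=0.5)

for group in groups:
    print(f"{group['name']:>6}: lr = {group['lr']:.3e}")
```

To use such a config with a real optimizer, each dict would carry a `"params"` entry (for example `model.layer4.parameters()`) in place of `"name"`, and the list of dicts would be passed as the first argument to the optimizer constructor. The uniform-LR baseline corresponds to `decay=1.0`.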
Prerequisites
- Familiarity with PyTorch autograd and nn.Module
- Basic understanding of gradient descent
- Access to a GPU runtime (recommended for sections 8–9)
Key Concepts Reinforced
- How the optimizer interface (zero_grad, step, state_dict) works
- Why decoupled weight decay (AdamW) matters in practice
- The closure pattern required by LBFGS
- Layer-wise learning rate decay for fine-tuning
- Per-batch vs. per-epoch scheduler calling conventions
- OneCycleLR peak LR selection via the LR range test
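The per-batch vs. per-epoch distinction is easiest to see when the schedule is written as a plain function of the step index. The sketch below is a dependency-free approximation of a linear warmup followed by cosine decay. It is not PyTorch's SequentialLR, and the exact values produced by chaining LinearLR and CosineAnnealingLR depend on their arguments. The function name `warmup_cosine_lr` and all numeric settings are illustrative assumptions.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Progress through the decay phase, from 0.0 at its start toward 1.0.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Per-batch stepping: the horizon is counted in batches, not epochs.
epochs, batches_per_epoch = 10, 100
total_steps = epochs * batches_per_epoch
schedule = [
    warmup_cosine_lr(s, base_lr=1e-3, warmup_steps=100, total_steps=total_steps)
    for s in range(total_steps)
]
```

If the scheduler were advanced once per epoch instead, `total_steps` and `warmup_steps` would have to be counted in epochs. Mixing the two units (a horizon in batches with one call per epoch, or the reverse) stretches or compresses the schedule.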