Supplement · Optimizers

Optimizers in PyTorch

Colab Notebook · Python · ~55 min
Lab Objectives

  1. Implement all 13 PyTorch optimizers and observe their update dynamics
  2. Visualize optimizer trajectories on a non-convex 2D loss surface
  3. Diagnose and fix gradient explosion using clip_grad_norm_
  4. Build a param-group config with layer-wise learning rate decay
  5. Compare StepLR, CosineAnnealingLR, OneCycleLR, and ReduceLROnPlateau empirically
  6. Reproduce super-convergence on CIFAR-10 using OneCycleLR + SGD
  7. Implement the warmup + cosine decay recipe with SequentialLR
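
Objectives 3, 4, and 7 meet in a single training loop. The following is a minimal sketch rather than the notebook's own code: the small stand-in model, the decay factor of 0.5, the step counts, and the random batches are all illustrative assumptions. It builds one param group per layer with a geometrically decayed learning rate, clips the global gradient norm with clip_grad_norm_, and chains a LinearLR warmup into CosineAnnealingLR through SequentialLR, stepping the scheduler once per batch.

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

torch.manual_seed(0)

# Hypothetical stand-in model; the lab itself fine-tunes a ResNet-18.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

# Layer-wise LR decay: the last layer gets base_lr, and each earlier
# layer gets the next layer's rate multiplied by `decay` (assumed value).
base_lr, decay = 1e-3, 0.5
layers = [m for m in model if isinstance(m, nn.Linear)]
param_groups = [
    {"params": list(layer.parameters()),
     "lr": base_lr * decay ** (len(layers) - 1 - i)}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.AdamW(param_groups, lr=base_lr, weight_decay=0.01)
initial_lrs = [g["lr"] for g in optimizer.param_groups]

# Linear warmup for `warmup_steps` batches, then cosine decay to zero.
warmup_steps, total_steps = 10, 100
warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0,
                  total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_steps])

head_lrs = []  # learning rate of the last layer, recorded per batch
for step in range(total_steps):
    inputs, targets = torch.randn(32, 8), torch.randn(32, 1)  # random data
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # Rescales all gradients so their combined L2 norm is at most max_norm;
    # the return value is the norm measured before clipping.
    grad_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    head_lrs.append(optimizer.param_groups[-1]["lr"])
    scheduler.step()  # per-batch convention: one scheduler step per batch
```

In this sketch head_lrs starts at 10% of the base rate, reaches the full rate when warmup ends, and then decays toward zero. The earlier layers trace the same shape at a smaller scale.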

Lab Overview

This lab gives you hands-on experience with all 13 PyTorch optimizers and 15 learning rate schedulers through a series of experiments of increasing complexity.

What You'll Build

  1. Optimizer trajectory visualization — plot gradient descent paths on the 2D Rosenbrock function for SGD, Adam, AdamW, and RMSprop
  2. Convergence speed comparison — train a 4-layer MLP on synthetic regression data with every optimizer; plot loss curves
  3. Adaptive vs. SGD on sparse inputs — compare SGD against Adagrad, RMSprop, and Adam on a sparse NLP bag-of-words task
  4. L-BFGS with closure — fit a physics curve with LBFGS and compare convergence to Adam
  5. Param groups and LLRD — fine-tune a pretrained ResNet-18 with layer-wise learning rate decay; measure accuracy vs. uniform LR
  6. Gradient clipping experiment — induce gradient explosion in a deep ReLU net; compare unclipped, clip_grad_value_, and clip_grad_norm_
  7. Scheduler comparison — train the same network with StepLR, CosineAnnealingLR, OneCycleLR, and ReduceLROnPlateau; plot LR curves and final accuracy
  8. Super-convergence with OneCycleLR — reproduce Smith's fast training result with ResNet-18 on CIFAR-10: 93% accuracy in 30 epochs
  9. SequentialLR warmup + cosine — implement the standard transformer training recipe and compare to no-warmup baseline
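
As a rough preview of experiment 1, the sketch below records the path each optimizer takes on the 2D Rosenbrock function. The starting point, step count, and learning rates are illustrative guesses, not the notebook's tuned values, and the helper names rosenbrock and trace are hypothetical.

```python
import torch

def rosenbrock(p, a=1.0, b=100.0):
    """Rosenbrock function; its global minimum is at (a, a**2)."""
    x, y = p[0], p[1]
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def trace(opt_cls, steps=300, **opt_kwargs):
    """Run `steps` updates from a fixed start and return the visited points."""
    p = torch.tensor([-1.5, 2.0], requires_grad=True)  # assumed start point
    opt = opt_cls([p], **opt_kwargs)
    path = [p.detach().clone()]
    for _ in range(steps):
        opt.zero_grad()
        rosenbrock(p).backward()
        opt.step()
        path.append(p.detach().clone())
    return torch.stack(path)  # shape: (steps + 1, 2)

# Illustrative hyperparameters; each optimizer usually needs its own LR.
configs = {
    "SGD": (torch.optim.SGD, dict(lr=1e-4, momentum=0.9)),
    "Adam": (torch.optim.Adam, dict(lr=1e-2)),
    "AdamW": (torch.optim.AdamW, dict(lr=1e-2, weight_decay=0.01)),
    "RMSprop": (torch.optim.RMSprop, dict(lr=1e-3)),
}
paths = {name: trace(cls, **kw) for name, (cls, kw) in configs.items()}

for name, path in paths.items():
    print(f"{name:8s} final point {path[-1].tolist()} "
          f"loss {rosenbrock(path[-1]).item():.4f}")
```

Each entry of paths is a (steps + 1, 2) tensor, so its two columns can be passed directly to a plotting call and drawn over a contour plot of the surface.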

Prerequisites

  • Familiarity with PyTorch autograd and nn.Module
  • Basic understanding of gradient descent
  • Access to a GPU runtime (recommended for experiments 8–9)

Key Concepts Reinforced

  • How the optimizer interface (zero_grad, step, state_dict) works
  • Why decoupled weight decay (AdamW) matters in practice
  • The closure pattern required by LBFGS
  • Layer-wise learning rate decay for fine-tuning
  • Per-batch vs. per-epoch scheduler calling conventions
  • OneCycleLR peak LR selection via the LR range test
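
The first and third concepts above can be seen side by side in a few lines. This is a minimal sketch on a synthetic linear fit; the data, model, and hyperparameters are assumptions chosen only to keep it short. Adam follows the usual zero_grad, backward, step sequence, while LBFGS receives a closure because it may need to re-evaluate the loss several times within a single step.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic data for y = 3x + 0.5 (illustrative only).
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3.0 * x + 0.5

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()

# Standard interface: clear gradients, backpropagate, update.
adam = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(50):
    adam.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    adam.step()
adam_loss = loss_fn(model(x), y).item()

# state_dict() holds per-parameter buffers under "state" and the
# hyperparameters under "param_groups"; it is what a checkpoint stores.
state = adam.state_dict()

# LBFGS: step() takes a closure that recomputes the loss and gradients.
lbfgs = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20,
                          line_search_fn="strong_wolfe")

def closure():
    lbfgs.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

for _ in range(5):
    lbfgs.step(closure)
lbfgs_loss = loss_fn(model(x), y).item()

print(f"loss after Adam: {adam_loss:.6f}, after LBFGS: {lbfgs_loss:.2e}")
```

The closure must zero the gradients, run the forward and backward passes, and return the loss. Other optimizers accept an optional closure through the same step() signature, though most training loops do not pass one.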