Optimizers in PyTorch

Colab Notebook · Python · ~55 min

Lab Objectives

Use all 13 built-in PyTorch optimizers and observe their update dynamics

Visualize optimizer trajectories on a non-convex 2D loss surface

Diagnose and fix gradient explosion using clip_grad_norm_

Build a param-group config with layer-wise learning rate decay

Compare StepLR, CosineAnnealingLR, OneCycleLR, and ReduceLROnPlateau empirically

Reproduce super-convergence on CIFAR-10 using OneCycleLR + SGD

Implement the warmup + cosine decay recipe with SequentialLR
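
As a preview of the param-group and gradient-clipping objectives, here is a minimal sketch. It assumes a toy three-layer MLP in place of the pretrained network used in the lab, and the values of `base_lr`, `decay`, and `max_norm` are illustrative choices, not settings taken from the notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a pretrained network (assumption: the lab uses ResNet-18).
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

base_lr, decay = 1e-3, 0.5  # illustrative values, not tuned

# Keep only the layers that own parameters (ReLU has none).
trainable = [m for m in model if any(True for _ in m.parameters())]

# Walk from the output layer back to the input layer: the head gets base_lr,
# and each earlier layer gets the previous rate multiplied by `decay`.
param_groups = [
    {"params": list(layer.parameters()), "lr": base_lr * decay ** depth}
    for depth, layer in enumerate(reversed(trainable))
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

x, y = torch.randn(32, 8), torch.randn(32, 1)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# clip_grad_norm_ rescales all gradients in place so that their combined
# L2 norm is at most max_norm, and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()

lrs = [g["lr"] for g in optimizer.param_groups]
print("per-group LRs (head first):", lrs)
print("grad norm before/after clipping:", float(norm_before), float(norm_after))
```

Clipping sits between `backward()` and `step()`, so the optimizer only ever sees the rescaled gradients. The groups are ordered head first, which makes the decay exponent equal to the distance from the output layer.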

Lab Overview

This lab gives you hands-on experience with all 13 PyTorch optimizers and 15 learning rate schedulers through a series of experiments of increasing complexity.

What You'll Build

1. Optimizer trajectory visualization: plot the optimization paths of SGD, Adam, AdamW, and RMSprop on the 2D Rosenbrock function
2. Convergence speed comparison: train a 4-layer MLP on synthetic regression data with every optimizer and plot the loss curves
3. Adaptive vs. SGD on sparse inputs: compare SGD against Adagrad, RMSprop, and Adam on a sparse NLP bag-of-words task
4. L-BFGS with closure: fit a physics curve with LBFGS and compare its convergence to Adam
5. Param groups and LLRD: fine-tune a pretrained ResNet-18 with layer-wise learning rate decay and measure accuracy against a uniform LR
6. Gradient clipping experiment: induce gradient explosion in a deep ReLU net, then compare unclipped training, clip_grad_value_, and clip_grad_norm_
7. Scheduler comparison: train the same network with StepLR, CosineAnnealingLR, OneCycleLR, and ReduceLROnPlateau, then plot the LR curves and final accuracy
8. Super-convergence with OneCycleLR: reproduce Smith's fast-training result with ResNet-18 on CIFAR-10, targeting 93% accuracy in 30 epochs
9. SequentialLR warmup + cosine: implement the standard transformer training recipe and compare it to a no-warmup baseline
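
The trajectory and L-BFGS experiments share one pattern: record the parameter vector after every optimizer step. The sketch below is one way to do that, not the notebook's own code. The helper names `rosenbrock` and `trace`, the start point, the step counts, and the learning rates are all assumptions chosen for illustration. It passes a closure to every optimizer: LBFGS requires one, and the other optimizers accept one as an optional argument to `step`.

```python
import torch

def rosenbrock(p, a=1.0, b=100.0):
    """2D Rosenbrock function; its global minimum is at (a, a**2)."""
    x, y = p[0], p[1]
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def trace(optimizer_cls, steps, start=(-1.5, 2.0), **opt_kwargs):
    """Run `steps` optimizer steps and return the visited points as a tensor."""
    p = torch.tensor(start, dtype=torch.float64, requires_grad=True)
    optimizer = optimizer_cls([p], **opt_kwargs)
    path = [p.detach().clone()]

    def closure():
        # Re-evaluates the loss and gradients; LBFGS may call this
        # several times within a single step.
        optimizer.zero_grad()
        loss = rosenbrock(p)
        loss.backward()
        return loss

    for _ in range(steps):
        optimizer.step(closure)
        path.append(p.detach().clone())
    return torch.stack(path)

# Hyperparameters below are illustrative assumptions, not tuned values.
paths = {
    "SGD": trace(torch.optim.SGD, steps=200, lr=1e-4),
    "Adam": trace(torch.optim.Adam, steps=200, lr=1e-2),
    "LBFGS": trace(torch.optim.LBFGS, steps=20, lr=1.0,
                   line_search_fn="strong_wolfe"),
}

for name, path in paths.items():
    print(name, path[-1].tolist(), float(rosenbrock(path[-1])))
```

Each entry of `paths` is a `(steps + 1, 2)` tensor, so `path[:, 0]` and `path[:, 1]` can be drawn directly over a contour plot of the surface.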

Prerequisites

Familiarity with PyTorch autograd and nn.Module
Basic understanding of gradient descent
Access to a GPU runtime (recommended for sections 8–9)

Key Concepts Reinforced

How the optimizer interface (zero_grad, step, state_dict) works
Why decoupled weight decay (AdamW) matters in practice
The closure pattern required by LBFGS
Layer-wise learning rate decay for fine-tuning
Per-batch vs. per-epoch scheduler calling conventions
OneCycleLR peak LR selection via the LR range test
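
The two scheduler calling conventions can be seen side by side in a minimal sketch. It assumes toy linear models and made-up sizes (`epochs`, `steps_per_epoch`, `warmup_epochs`), and `train_step` is a hypothetical helper, not a function from the notebook. The warmup + cosine schedule is stepped once per epoch; OneCycleLR is stepped once per batch.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import (
    CosineAnnealingLR, LinearLR, OneCycleLR, SequentialLR,
)

torch.manual_seed(0)
epochs, steps_per_epoch, warmup_epochs = 10, 5, 2  # toy sizes (assumption)

# Stepped once per EPOCH: linear warmup, then cosine decay.
model_a = nn.Linear(4, 1)
opt_a = torch.optim.AdamW(model_a.parameters(), lr=1e-3, weight_decay=0.01)
sched_a = SequentialLR(
    opt_a,
    schedulers=[
        LinearLR(opt_a, start_factor=0.1, end_factor=1.0,
                 total_iters=warmup_epochs),
        CosineAnnealingLR(opt_a, T_max=epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

# Stepped once per BATCH: OneCycleLR is sized by the total number of batches.
model_b = nn.Linear(4, 1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1, momentum=0.9)
sched_b = OneCycleLR(opt_b, max_lr=0.1, epochs=epochs,
                     steps_per_epoch=steps_per_epoch)

def train_step(model, optimizer):
    """Hypothetical helper: one forward/backward/update on random data."""
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    optimizer.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()

lrs_a, lrs_b = [], []
for epoch in range(epochs):
    lrs_a.append(opt_a.param_groups[0]["lr"])
    for _ in range(steps_per_epoch):
        lrs_b.append(opt_b.param_groups[0]["lr"])
        train_step(model_a, opt_a)
        train_step(model_b, opt_b)
        sched_b.step()  # per batch, after optimizer.step()
    sched_a.step()      # per epoch, after the inner loop

print("warmup + cosine LR per epoch:", lrs_a)
print("OneCycleLR peak LR:", max(lrs_b))
```

Calling a per-batch scheduler once per epoch stretches its schedule far beyond the run, while calling OneCycleLR more often than its configured total number of steps raises an error, so it helps to check which convention each scheduler expects before writing the loop.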
