Supplement · Optimizers

What Is an Optimizer?

By the end of this reading you will be able to:
  • Trace the zero_grad → forward → backward → step cycle and explain why gradients must be cleared before each iteration
  • Configure param groups to assign different learning rates and weight decay to different parts of a model
  • Apply gradient clipping by norm and by value and explain why norm clipping is preferred for preserving gradient direction
  • Select an optimizer from the practical selection guide given an architecture, task, and training budget

The Optimization Problem

Training a neural network reduces to minimizing a loss function $L(\theta)$ over the model parameters $\theta$. At each step you observe a mini-batch gradient $g = \nabla_{\theta} L$ and need a rule for updating $\theta$. That rule is the optimizer.

The simplest rule is steepest descent: $\theta_{t+1} = \theta_t - \eta \, g_t$

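But raw gradient descent on deep networks suffers from ill-conditioning, saddle points, and noisy mini-batch estimates. Modern optimizers mitigate all three: momentum averages out mini-batch noise and keeps the iterate moving across flat regions and saddle points, while adaptive per-parameter step sizes compensate for ill-conditioning.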
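To make the update rule and the conditioning problem concrete, here is a minimal framework-free sketch in plain Python. The toy quadratic loss, the helper names gradient_descent and quadratic_grad, and the two step sizes are illustrative assumptions, not part of the reading.

```python
def gradient_descent(grad_fn, theta, lr, steps):
    """Repeat theta <- theta - lr * grad_fn(theta) for a fixed number of steps."""
    for _ in range(steps):
        g = grad_fn(theta)
        theta = [t - lr * g_i for t, g_i in zip(theta, g)]
    return theta


def quadratic_grad(theta):
    """Gradient of the toy loss L(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = theta
    return [x, 100.0 * y]


# Curvature is 1 along x and 100 along y, so plain descent is only
# stable for lr < 2 / 100, which is far too small for the flat x direction.
slow = gradient_descent(quadratic_grad, [1.0, 1.0], lr=0.019, steps=100)
diverged = gradient_descent(quadratic_grad, [1.0, 1.0], lr=0.021, steps=100)

print(slow)      # y is essentially 0, but x has only shrunk to roughly 0.15
print(diverged)  # y has blown up because each step overshoots the steep direction
```

A single global step size cannot suit both directions at once, which is the gap that momentum and adaptive methods are designed to narrow.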

The PyTorch Optimizer Interface

Every torch.optim optimizer inherits from Optimizer and exposes a uniform API:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for x, y in dataloader:
    optimizer.zero_grad()      # clear gradients from previous step
    loss = criterion(model(x), y)
    loss.backward()            # compute gradients via autograd
    optimizer.step()           # apply update rule

TensorFlow:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

for x, y in dataset:
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x, training=True), y)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

zero_grad

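PyTorch accumulates gradients by default: each backward() call adds into .grad rather than overwriting it. You must therefore clear gradients before each iteration, or the update will mix in stale gradients from earlier batches. With set_to_none=True (the default since PyTorch 2.0), zero_grad() sets .grad to None instead of filling it with zeros, which skips a memory write and lets the allocator reuse the buffer:

optimizer.zero_grad(set_to_none=True)  # explicit form of the default; faster than set_to_none=False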

TensorFlow: TF/Keras does not accumulate gradients by default — GradientTape computes fresh gradients each call, so there is no equivalent of zero_grad().
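The accumulate-by-default behaviour described for PyTorch can be mimicked without any framework. The sketch below uses a hypothetical ToyParam class and made-up gradient values; it is a stand-in for the mechanism, not PyTorch's implementation.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter with an accumulating .grad slot."""

    def __init__(self, value):
        self.value = value
        self.grad = None

    def backward(self, g):
        # Mimic accumulate-by-default: add into .grad instead of overwriting it.
        self.grad = g if self.grad is None else self.grad + g

    def zero_grad(self):
        # Mimic set_to_none=True: drop the stored gradient entirely.
        self.grad = None


batch_grads = [0.5, -0.2, 0.1]  # made-up per-batch gradients

stale = ToyParam(1.0)
for g in batch_grads:
    stale.backward(g)        # never cleared: gradients pile up
accumulated = stale.grad     # sum of all three batches, about 0.4

fresh = ToyParam(1.0)
for g in batch_grads:
    fresh.zero_grad()        # cleared first: only the current batch counts
    fresh.backward(g)
last_only = fresh.grad       # gradient of the last batch only
```

Without the clearing call, the third update would be driven by the sum of all three batches rather than by the current batch alone.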

State Persistence

Optimizers maintain running statistics (momentum buffers, squared-gradient accumulators) as internal state. Save and restore it with:

checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
}
torch.save(checkpoint, 'ckpt.pt')

# Resume
ckpt = torch.load('ckpt.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
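Why the optimizer state belongs in the checkpoint can be seen with a toy example. The sketch below assumes a hypothetical single-parameter ToyMomentumSGD class and the toy loss theta**2; only the state_dict / load_state_dict method names are borrowed from the API shown above.

```python
class ToyMomentumSGD:
    """Hypothetical single-parameter SGD with momentum, for illustration only."""

    def __init__(self, lr=0.1, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = 0.0  # internal state that a checkpoint must capture

    def step(self, theta, grad):
        self.velocity = self.momentum * self.velocity + grad
        return theta - self.lr * self.velocity

    def state_dict(self):
        return {"velocity": self.velocity}

    def load_state_dict(self, state):
        self.velocity = state["velocity"]


def grad(theta):
    return 2.0 * theta  # gradient of the toy loss theta**2


# Train for 5 steps, then checkpoint both the parameter and the optimizer state.
opt, theta = ToyMomentumSGD(), 1.0
for _ in range(5):
    theta = opt.step(theta, grad(theta))
ckpt = {"theta": theta, "optimizer": opt.state_dict()}

# Reference: keep training without interruption.
uninterrupted = opt.step(theta, grad(theta))

# Resume with the saved state: reproduces the uninterrupted step.
resumed_opt = ToyMomentumSGD()
resumed_opt.load_state_dict(ckpt["optimizer"])
resumed = resumed_opt.step(ckpt["theta"], grad(ckpt["theta"]))

# Resume with a fresh optimizer: the velocity is lost and the step differs.
cold = ToyMomentumSGD().step(ckpt["theta"], grad(ckpt["theta"]))
```

Restoring only the model weights corresponds to the cold case: training continues, but the first steps after resuming no longer match the original trajectory.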

Param Groups

A param group is a dict specifying which parameters use which hyperparameters. This enables per-layer settings:

optimizer = torch.optim.AdamW([
    {'params': backbone.parameters(), 'lr': 1e-4, 'weight_decay': 0.01},
    {'params': head.parameters(),     'lr': 1e-3, 'weight_decay': 0.0},
])

TensorFlow: Keras does not support per-variable optimizer hyperparameters natively. The standard approach is a custom training loop with separate optimizers:

backbone_opt = tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=0.01)
head_opt     = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.0)

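# Differentiate once over both variable lists, then split the gradients to match
backbone_vars = backbone.trainable_variables
head_vars = head.trainable_variables

with tf.GradientTape() as tape:
    loss = ...
all_grads = tape.gradient(loss, backbone_vars + head_vars)
backbone_grads = all_grads[:len(backbone_vars)]
head_grads = all_grads[len(backbone_vars):]
backbone_opt.apply_gradients(zip(backbone_grads, backbone_vars))
head_opt.apply_gradients(zip(head_grads, head_vars))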

Common use cases:

  • Fine-tuning — lower LR for pretrained backbone, higher LR for new head
  • Embedding layers — disable weight decay to avoid shrinking embeddings
  • Frozen layers — set requires_grad=False to skip their update entirely
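The param-group idea is simple enough to sketch in plain Python. In torch.optim, any key a group omits falls back to the keyword defaults passed to the optimizer constructor; the helpers make_param_groups and sgd_step below are hypothetical names that imitate that behaviour on scalar parameters, and the gradient values are made up.

```python
def make_param_groups(groups, **defaults):
    """Fill each group's missing hyperparameters from the optimizer-level defaults."""
    return [{**defaults, **group} for group in groups]


def sgd_step(param_groups, grads):
    """One SGD step with L2-style weight decay, using each group's own settings.

    `grads` maps parameter name -> gradient; parameters are plain floats here.
    """
    for group in param_groups:
        for name in group["params"]:
            g = grads[name] + group["weight_decay"] * group["params"][name]
            group["params"][name] -= group["lr"] * g


# Hypothetical two-part model: a pretrained backbone and a freshly initialised head.
groups = make_param_groups(
    [
        {"params": {"backbone.w": 1.0}, "lr": 1e-4, "weight_decay": 0.01},
        {"params": {"head.w": 1.0}},  # no overrides: falls back to the defaults
    ],
    lr=1e-3,
    weight_decay=0.0,
)

sgd_step(groups, grads={"backbone.w": 1.0, "head.w": 1.0})
backbone_w = groups[0]["params"]["backbone.w"]  # moved by 1e-4 * (1.0 + 0.01)
head_w = groups[1]["params"]["head.w"]          # moved by 1e-3 * 1.0
```

With identical gradients, the head moves roughly ten times further than the backbone, which is the intended effect when fine-tuning.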

Key Hyperparameters

Hyperparameter   Symbol                  Typical range   Role
Learning rate    $\eta$                  1e-4 – 1e-1     Step size scale
Momentum         $\beta$                 0.85 – 0.99     Exponential average of gradients
Weight decay     $\lambda$               1e-4 – 1e-2     Shrinks weights toward zero (L2 penalty in SGD and Adam, decoupled from the gradient in AdamW)
Epsilon          $\varepsilon$           1e-8 – 1e-6     Numerical stability in divisions
Betas            $(\beta_1, \beta_2)$    (0.9, 0.999)    Adam moment decay rates
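To show where each hyperparameter in the table enters an update, here is a sketch of one AdamW-style step for a single scalar parameter. It follows the commonly published form of the algorithm; the function name adamw_step and the example values are assumptions made for illustration.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One AdamW-style update for a scalar parameter; t is the 1-based step count."""
    beta1, beta2 = betas
    theta = theta - lr * weight_decay * theta   # decoupled weight decay (lambda)
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # eps keeps the division finite
    return theta, m, v


theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
# On the first step m_hat / sqrt(v_hat) is about sign(grad), so theta moves by about lr.
```

Because the step is normalised by the gradient's running magnitude, the learning rate sets the approximate size of each parameter change, which is one reason typical Adam learning rates sit at the low end of the range above.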

Gradient Clipping

Exploding gradients, common in RNNs and transformers, are handled by clipping the gradients after loss.backward() and before optimizer.step(). Use one of the two calls below, not both:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # L2 norm clip
# or
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)  # element-wise clip
optimizer.step()

TensorFlow:

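# Via optimizer constructor (applied automatically each step):
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, global_clipnorm=1.0)  # global L2 norm clip
# or: clipnorm=1.0 (clips each gradient tensor's norm separately), clipvalue=0.5 (element-wise)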

# Manual clip in a custom training loop:
grads = tape.gradient(loss, model.trainable_variables)
grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)  # equivalent to clip_grad_norm_
optimizer.apply_gradients(zip(grads, model.trainable_variables))

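clip_grad_norm_ computes the L2 norm over all parameter gradients together and, if it exceeds max_norm, scales every gradient by the same factor so that the total norm equals max_norm. clip_grad_value_ clamps each element independently to [-clip_value, clip_value]. Norm clipping is usually preferred because uniform scaling preserves the gradient's direction, whereas element-wise clamping shrinks only the large components and can therefore change the direction of the update.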
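The difference between the two clipping modes is easy to check numerically. The sketch below reimplements both on a plain list of floats; the helper names clip_by_norm, clip_by_value, and cosine are hypothetical, and the small eps in the scale factor is an assumption that only guards against division by zero.

```python
import math


def clip_by_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]


def clip_by_value(grads, clip_value):
    """Clamp each gradient element independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


def cosine(a, b):
    """Cosine similarity; 1.0 means the two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


grads = [3.0, 4.0]                    # global L2 norm is 5.0
by_norm = clip_by_norm(grads, 1.0)    # about [0.6, 0.8]
by_value = clip_by_value(grads, 0.5)  # [0.5, 0.5]

print(cosine(grads, by_norm))   # about 1.0: direction preserved
print(cosine(grads, by_value))  # about 0.99: direction rotated toward the diagonal
```

In this two-dimensional case the rotation is small, but with many parameters and a few very large components, value clipping can distort the update direction much more.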

Optimizer Selection Guide

Situation                            Recommended optimizer
General deep learning baseline       AdamW
Large-scale vision (ResNet, ViT)     SGD + momentum or AdamW
Transformers, LLMs                   AdamW with cosine + warmup
Sparse embeddings (NLP)              SparseAdam
Small models, L-BFGS tolerance       LBFGS
Convex problems, averaging needed    ASGD
No learning-rate tuning desired      Adadelta
Super-convergence experiments        SGD + OneCycleLR
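The "cosine + warmup" entry refers to a learning-rate schedule used alongside the optimizer. One common form, sketched here under the assumption of linear warmup from zero followed by a single half-cosine decay, looks like this; the function name warmup_cosine_lr and its parameters are illustrative, not a library API.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr once total_steps is passed
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [
    warmup_cosine_lr(s, base_lr=1e-3, warmup_steps=10, total_steps=100)
    for s in range(101)
]
# The rate rises linearly for 10 steps, peaks at base_lr, then decays smoothly to min_lr.
```

The warmup phase keeps early updates small while adaptive statistics are still poorly estimated, and the cosine phase lowers the rate gradually toward the end of the training budget.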