Supplement · Optimizers

What Is an Optimizer?


By the end of this reading you will be able to:

Trace the zero_grad → forward → backward → step cycle and explain why gradients must be cleared before each iteration
Configure param groups to assign different learning rates and weight decay to different parts of a model
Apply gradient clipping by norm and by value and explain why norm clipping is preferred for preserving gradient direction
Select an optimizer from the practical selection guide given an architecture, task, and training budget

The Optimization Problem

Training a neural network reduces to minimizing a loss function $L(\theta)$ over the model parameters $\theta$. At each step you observe a mini-batch gradient $g = \nabla_{\theta} L$ and need a rule for updating $\theta$. That rule is the optimizer.

The simplest rule is steepest descent with learning rate $\eta$: $\theta_{t+1} = \theta_t - \eta \, g_t$

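But raw gradient descent on deep networks suffers from ill-conditioning, saddle points, and noisy mini-batch estimates. Modern optimizers mitigate all three: momentum averages out mini-batch noise and carries the iterate through flat regions near saddle points, and per-parameter adaptive step sizes compensate for ill-conditioning.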
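To make the ill-conditioning problem concrete, here is a minimal plain-Python sketch of the update rule above on a two-parameter quadratic. The loss, the curvatures `a` and `b`, the start point, and the step sizes are all illustrative assumptions, not values from this reading.

```python
# Plain gradient descent on L(theta) = 0.5 * (a*theta1^2 + b*theta2^2).
# a, b, eta and the start point are illustrative assumptions.

def grad(theta, a=1.0, b=100.0):
    """Gradient of the quadratic loss: (a*theta1, b*theta2)."""
    return (a * theta[0], b * theta[1])

def gradient_descent(theta, eta, steps):
    """Apply theta <- theta - eta * g for a fixed number of steps."""
    for _ in range(steps):
        g = grad(theta)
        theta = (theta[0] - eta * g[0], theta[1] - eta * g[1])
    return theta

# Just below the stability limit 2/b = 0.02 set by the steep direction.
theta_final = gradient_descent(theta=(1.0, 1.0), eta=0.019, steps=100)
# Just above the limit: the steep coordinate grows instead of shrinking.
theta_diverged = gradient_descent(theta=(1.0, 1.0), eta=0.021, steps=100)

print(theta_final)
print(theta_diverged)
```

The steep direction caps the usable learning rate, while the shallow direction has barely moved after 100 steps at that rate. A single global step size cannot serve both directions well, which is the gap momentum and adaptive methods try to close.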

The PyTorch Optimizer Interface

Every torch.optim optimizer inherits from Optimizer and exposes a uniform API:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for x, y in dataloader:
    optimizer.zero_grad()      # clear gradients from previous step
    loss = criterion(model(x), y)
    loss.backward()            # compute gradients via autograd
    optimizer.step()           # apply update rule

TensorFlow:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

for x, y in dataset:
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x, training=True), y)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

zero_grad

PyTorch accumulates gradients by default. You must clear them each step. The set_to_none=True flag (default since PyTorch 2.0) sets .grad to None instead of zeroing — it's faster because it skips the memory write and lets the allocator reuse the buffer:

optimizer.zero_grad(set_to_none=True)  # explicit form; same as plain zero_grad() in PyTorch >= 2.0

TensorFlow: TF/Keras does not accumulate gradients by default — GradientTape computes fresh gradients each call, so there is no equivalent of zero_grad().
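The accumulation behavior in PyTorch is easy to observe directly. The following is a small sketch using a single scalar parameter `w` and a toy loss `3 * w`, both chosen for illustration only, so that every backward pass contributes a gradient of exactly 3.

```python
import torch

# A single scalar parameter; the loss 3*w has gradient 3 regardless of w.
w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3 * w).backward()
first = w.grad.item()          # gradient from one backward pass

(3 * w).backward()             # no zero_grad() in between
accumulated = w.grad.item()    # the second gradient was added to the first

optimizer.zero_grad()          # clears .grad (sets it to None in PyTorch >= 2.0)

(3 * w).backward()
fresh = w.grad.item()          # a clean gradient again

print(first, accumulated, fresh)
```

Skipping `zero_grad()` silently doubles the gradient here. The same mechanism is what makes deliberate gradient accumulation over several mini-batches possible: call `backward()` several times, then `step()` and `zero_grad()` once.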

State Persistence

Optimizers maintain running statistics (momentum buffers, squared-gradient accumulators) as internal state. Save and restore it with:

checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
}
torch.save(checkpoint, 'ckpt.pt')

# Resume
ckpt = torch.load('ckpt.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
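To see what the optimizer state actually contains, the sketch below saves and reloads it through an in-memory buffer. The tiny `Linear` model, the random input, and the learning rates are assumptions made for illustration. SGD with momentum is used because its state (one momentum buffer per parameter) is easy to inspect.

```python
import io
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One update so the optimizer has momentum buffers to save.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

state = optimizer.state_dict()
print(state.keys())                       # 'state' and 'param_groups'
print(state['param_groups'][0]['lr'])     # hyperparameters live in param_groups

# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
torch.save(state, buffer)
buffer.seek(0)

# A fresh optimizer built with a different lr, then overwritten by the load.
restored = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9)
restored.load_state_dict(torch.load(buffer))

print(restored.param_groups[0]['lr'])     # the saved lr, not the constructor's
print(len(restored.state))                # one state entry per parameter
```

Loading restores both the running statistics and the hyperparameters, so a learning rate passed to the constructor on resume is replaced by the checkpointed one. Resuming without the optimizer state would restart momentum (or Adam's moment estimates) from zero.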

Param Groups

A param group is a dict specifying which parameters use which hyperparameters. This enables per-layer settings:

optimizer = torch.optim.AdamW([
    {'params': backbone.parameters(), 'lr': 1e-4, 'weight_decay': 0.01},
    {'params': head.parameters(),     'lr': 1e-3, 'weight_decay': 0.0},
])

TensorFlow: Keras does not support per-variable optimizer hyperparameters natively. The standard approach is a custom training loop with separate optimizers:

backbone_opt = tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=0.01)
head_opt     = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.0)

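with tf.GradientTape() as tape:
    loss = ...
# Differentiate with respect to both variable lists at once; the result has the same structure.
backbone_vars = backbone.trainable_variables
head_vars = head.trainable_variables
backbone_grads, head_grads = tape.gradient(loss, [backbone_vars, head_vars])
backbone_opt.apply_gradients(zip(backbone_grads, backbone_vars))
head_opt.apply_gradients(zip(head_grads, head_vars))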

Common use cases:

Fine-tuning — lower LR for pretrained backbone, higher LR for new head
Embedding layers — disable weight decay to avoid shrinking embeddings
Frozen layers — set requires_grad=False to skip their update entirely
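The embedding use case above can be sketched as follows. `TinyClassifier` is a hypothetical model invented for this example, and the specific learning rates and decay values are assumptions. The point is how per-group keys override the defaults passed to the constructor.

```python
import torch

class TinyClassifier(torch.nn.Module):
    """Hypothetical model: an embedding table followed by a linear head."""
    def __init__(self, vocab_size=100, dim=16, classes=2):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, classes)

    def forward(self, tokens):
        return self.head(self.embed(tokens).mean(dim=1))

model = TinyClassifier()

optimizer = torch.optim.AdamW(
    [
        {'params': model.embed.parameters(), 'weight_decay': 0.0},
        {'params': model.head.parameters(), 'lr': 1e-2},
    ],
    lr=1e-3,            # used by groups that do not set their own lr
    weight_decay=0.01,  # used by groups that do not set their own weight_decay
)

for group in optimizer.param_groups:
    print(len(group['params']), group['lr'], group['weight_decay'])

# Hyperparameters can also be edited in place after construction.
optimizer.param_groups[1]['lr'] = 5e-3
```

Each group inherits any hyperparameter it does not set, so only the differences need to be written out. Editing `optimizer.param_groups` in place is also how learning-rate schedulers change the rate during training.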

Key Hyperparameters

Hyperparameter	Symbol	Typical range	Role
Learning rate	$\eta$	1e-4 – 1e-1	Step size scale
Momentum	$\beta$	0.85 – 0.99	Exponential average of gradients
Weight decay	$\lambda$	1e-4 – 1e-2	L2 penalty (SGD, Adam) or decoupled parameter shrinkage (AdamW)
Epsilon	$\varepsilon$	1e-8 – 1e-6	Numerical stability in divisions
Betas	$(\beta_1, \beta_2)$	(0.9, 0.999)	Adam moment decay rates
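
As a reference for where each hyperparameter in the table enters, the AdamW update for a gradient $g_t$ can be written as below. This is the commonly stated formulation with bias correction and decoupled weight decay, included here as a sketch rather than something derived in this reading. All operations are element-wise.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_{t+1} &= \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \lambda\, \theta_t \right)
\end{aligned}
$$

Here $\eta$ scales the whole step, $\beta_1$ and $\beta_2$ set how quickly the two running averages forget old gradients, $\varepsilon$ keeps the division well defined when $\hat{v}_t$ is near zero, and $\lambda$ shrinks the parameters independently of the gradient.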

Gradient Clipping

Exploding gradients, common in RNNs and transformers, are handled by clipping after loss.backward() and before optimizer.step(). Use one of the two methods:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # L2 norm clip
# or
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)  # element-wise clip
optimizer.step()

TensorFlow:

# Via optimizer constructor (applied automatically each step):
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, global_clipnorm=1.0)
# or: clipnorm=1.0 (clips each variable's gradient separately), clipvalue=0.5 (element-wise)

# Manual clip in a custom training loop:
grads = tape.gradient(loss, model.trainable_variables)
grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)  # equivalent to clip_grad_norm_
optimizer.apply_gradients(zip(grads, model.trainable_variables))

clip_grad_norm_ rescales the entire gradient vector so its L2 norm does not exceed max_norm. clip_grad_value_ clips each element independently to [-clip_value, clip_value]. Norm clipping is preferred because it preserves gradient direction.
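The direction argument can be checked with a few lines of arithmetic. The sketch below reimplements both clipping rules in plain Python on a hand-picked two-element gradient. The helper names and the example vector are assumptions for illustration, and the code is not the library implementation.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    # The small constant guards against division by zero.
    scale = min(1.0, max_norm / (l2_norm(grad) + 1e-6))
    return [g * scale for g in grad]

def clip_by_value(grad, clip_value):
    """Clamp each element independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))

grad = [3.0, 0.4]                       # one large and one small component
by_norm = clip_by_norm(grad, max_norm=1.0)
by_value = clip_by_value(grad, clip_value=0.5)

print(by_norm, cosine(grad, by_norm))   # same direction, shorter vector
print(by_value, cosine(grad, by_value)) # large component cut, direction rotated
```

Norm clipping multiplies every element by the same factor, so the update still points along the original gradient. Value clipping cuts only the large component, which tilts the update toward the small one.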

Optimizer Selection Guide

Situation	Recommended optimizer
General deep learning baseline	AdamW
Large-scale vision (ResNet, ViT)	SGD + momentum or AdamW
Transformers, LLMs	AdamW with cosine + warmup
Sparse embeddings (NLP)	SparseAdam
Small models, full-batch training (step() requires a closure)	LBFGS
Convex problems, averaging needed	ASGD
No learning-rate tuning desired	Adadelta
Super-convergence experiments	SGD + OneCycleLR
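
The "cosine + warmup" entry in the table refers to a learning-rate schedule rather than an optimizer setting. One common shape is sketched below as a plain-Python multiplier function. The name `warmup_cosine`, its parameters, and the step counts are assumptions for illustration, and other variants of this schedule exist.

```python
import math

def warmup_cosine(step, warmup_steps, total_steps, min_ratio=0.0):
    """Multiplier on the base LR: linear warmup, then cosine decay to min_ratio."""
    if step < warmup_steps:
        # Ramp linearly from 1/warmup_steps up to 1.
        return (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine(s, warmup_steps=10, total_steps=100) for s in range(100)]
print(schedule[0], schedule[9], schedule[50], schedule[99])
```

In PyTorch, a function of this form can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies each param group's base learning rate by the returned value at every scheduler step.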

References

PyTorch Docs — torch.optim — PyTorch 2.x documentation

Ruder 2016 — An overview of gradient descent optimization algorithms
