What Is an Optimizer?
- Trace the zero_grad → forward → backward → step cycle and explain why gradients must be cleared before each iteration
- Configure param groups to assign different learning rates and weight decay to different parts of a model
- Apply gradient clipping by norm and by value and explain why norm clipping is preferred for preserving gradient direction
- Select an optimizer from the practical selection guide given an architecture, task, and training budget
The Optimization Problem
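Training a neural network reduces to minimizing a loss function $L(\theta)$ over the model parameters $\theta$. At each step $t$ you observe a mini-batch gradient $g_t = \nabla_\theta L(\theta_t)$ and need a rule for updating $\theta_t$. That rule is the optimizer.
The simplest rule is steepest descent, which steps against the gradient with learning rate $\eta$:
$$\theta_{t+1} = \theta_t - \eta \, g_t$$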
But raw gradient descent on deep networks suffers from ill-conditioning, saddle points, and noisy mini-batch estimates. Modern optimizers address all three.
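To make the ill-conditioning problem concrete, here is a minimal sketch of plain steepest descent on a toy quadratic whose two directions differ 100x in curvature. The function names, the quadratic, and the learning rates are illustrative choices, not part of any library.

```python
def gradient_descent(grad_fn, theta, lr, steps):
    """Plain steepest descent: theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        g = grad_fn(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def grad(theta):
    # Gradient of f(theta) = 0.5 * (theta[0]**2 + 100 * theta[1]**2).
    # The second coordinate is 100x steeper than the first.
    return [theta[0], 100.0 * theta[1]]

# The steep direction forces lr < 2/100 = 0.02 for stability,
# so the flat direction shrinks by only 1.9% per step.
stable = gradient_descent(grad, [1.0, 1.0], lr=0.019, steps=100)

# Slightly above the stability limit, the steep direction blows up.
unstable = gradient_descent(grad, [1.0, 1.0], lr=0.021, steps=100)

print(stable)    # steep coordinate near 0, flat coordinate still about 0.15
print(unstable)  # steep coordinate has grown by orders of magnitude
```

The learning rate is capped by the steepest direction, which leaves the flattest direction under-trained. Momentum and adaptive per-parameter scaling are two ways optimizers work around that cap.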
The PyTorch Optimizer Interface
Every torch.optim optimizer inherits from Optimizer and exposes a uniform API:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for x, y in dataloader:
    optimizer.zero_grad()              # clear gradients from previous step
    loss = criterion(model(x), y)      # forward pass
    loss.backward()                    # compute gradients via autograd
    optimizer.step()                   # apply update rule
TensorFlow:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
for x, y in dataset:
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x, training=True), y)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
zero_grad
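PyTorch accumulates gradients by default: each backward() call adds to the existing .grad tensors rather than overwriting them. You must clear them each step, or the update mixes in stale gradients from earlier batches. The set_to_none=True flag (default since PyTorch 2.0) sets .grad to None instead of zeroing. It is faster because it skips the memory write and lets the allocator reuse the buffer:
optimizer.zero_grad(set_to_none=True) # explicit form; plain zero_grad() behaves the same on PyTorch 2.0+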
TensorFlow: TF/Keras does not accumulate gradients by default — GradientTape computes fresh gradients each call, so there is no equivalent of zero_grad().
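On the PyTorch side, the accumulation behavior is easy to check directly. The following is a small sketch with a hypothetical two-element parameter `w`: calling `backward()` twice without clearing doubles the stored gradient, and `zero_grad(set_to_none=True)` resets `.grad` to `None`.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

(w ** 2).sum().backward()
first = w.grad.clone()          # gradient of sum(w^2) is 2*w

(w ** 2).sum().backward()       # no zero_grad in between
accumulated = w.grad.clone()    # second gradient was added to the first

opt.zero_grad(set_to_none=True)
cleared = w.grad                # None until the next backward()

print(first, accumulated, cleared)
```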
State Persistence
Optimizers maintain running statistics (momentum buffers, squared-gradient accumulators) as internal state. Save and restore it with:
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
}
torch.save(checkpoint, 'ckpt.pt')
# Resume
ckpt = torch.load('ckpt.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
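As a quick check of what the optimizer state actually contains, the sketch below round-trips a checkpoint through an in-memory buffer instead of a file. The tiny `nn.Linear` model and the use of `io.BytesIO` are assumptions made to keep the example self-contained.

```python
import io
import torch

model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step so the optimizer creates its momentum buffers.
model(torch.randn(4, 3)).sum().backward()
opt.step()

# Save to an in-memory buffer (stands in for 'ckpt.pt').
buffer = io.BytesIO()
torch.save({'model': model.state_dict(), 'optimizer': opt.state_dict()}, buffer)
buffer.seek(0)
ckpt = torch.load(buffer)

# Rebuild fresh objects with the same structure, then restore both.
new_model = torch.nn.Linear(3, 1)
new_opt = torch.optim.SGD(new_model.parameters(), lr=0.1, momentum=0.9)
new_model.load_state_dict(ckpt['model'])
new_opt.load_state_dict(ckpt['optimizer'])

# State is keyed by parameter index; index 0 is the weight matrix here.
old_buf = opt.state_dict()['state'][0]['momentum_buffer']
new_buf = new_opt.state_dict()['state'][0]['momentum_buffer']
print(torch.equal(old_buf, new_buf))
```

Restoring only the model and not the optimizer would restart momentum from zero, which typically shows up as a loss spike right after resuming.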
Param Groups
A param group is a dict specifying which parameters use which hyperparameters. This enables per-layer settings:
optimizer = torch.optim.AdamW([
    {'params': backbone.parameters(), 'lr': 1e-4, 'weight_decay': 0.01},
    {'params': head.parameters(), 'lr': 1e-3, 'weight_decay': 0.0},
])
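The groups remain accessible after construction through `optimizer.param_groups`, which is also what learning-rate schedulers modify. A minimal sketch, assuming two stand-in `nn.Linear` modules for `backbone` and `head`; a group that omits a key falls back to the constructor default.

```python
import torch
from torch import nn

backbone = nn.Linear(8, 4)   # stand-in for a pretrained backbone
head = nn.Linear(4, 2)       # stand-in for a new task head

opt = torch.optim.AdamW([
    {'params': backbone.parameters(), 'lr': 1e-4, 'weight_decay': 0.01},
    {'params': head.parameters()},          # uses the defaults below
], lr=1e-3, weight_decay=0.0)

# Each group is a plain dict holding its parameters and hyperparameters.
summary = [(g['lr'], g['weight_decay'], len(g['params'])) for g in opt.param_groups]
print(summary)

# Hyperparameters can be edited in place, e.g. halving every learning rate.
for g in opt.param_groups:
    g['lr'] *= 0.5
halved = [g['lr'] for g in opt.param_groups]
```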
TensorFlow: Keras does not support per-variable optimizer hyperparameters natively. The standard approach is a custom training loop with separate optimizers:
backbone_opt = tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=0.01)
head_opt = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.0)
backbone_vars = backbone.trainable_variables
head_vars = head.trainable_variables
with tf.GradientTape() as tape:
    loss = ...
# One backward pass over both variable lists, then split the result
grads = tape.gradient(loss, backbone_vars + head_vars)
backbone_grads = grads[:len(backbone_vars)]
head_grads = grads[len(backbone_vars):]
backbone_opt.apply_gradients(zip(backbone_grads, backbone_vars))
head_opt.apply_gradients(zip(head_grads, head_vars))
Common use cases:
- Fine-tuning: lower LR for pretrained backbone, higher LR for new head
- Embedding layers: disable weight decay to avoid shrinking embeddings
- Frozen layers: set requires_grad=False to skip their update entirely
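The embedding and frozen-layer cases above can be combined in one setup. This is a sketch under assumed names: a hypothetical three-layer `nn.Sequential` model where the embedding gets no weight decay, the middle layer is frozen, and the last layer trains normally.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Embedding(100, 16),   # index 0: no weight decay
    nn.Linear(16, 16),       # index 1: frozen
    nn.Linear(16, 2),        # index 2: trained with weight decay
)

# Freeze the middle layer and leave it out of every param group.
for p in model[1].parameters():
    p.requires_grad = False

opt = torch.optim.AdamW([
    {'params': model[2].parameters(), 'weight_decay': 0.01},
    {'params': model[0].parameters(), 'weight_decay': 0.0},
], lr=1e-3)

n_in_optimizer = sum(len(g['params']) for g in opt.param_groups)

# One step: the frozen weights should not move.
frozen_before = model[1].weight.clone()
tokens = torch.randint(0, 100, (4,))
model(tokens).sum().backward()
opt.step()
frozen_unchanged = torch.equal(model[1].weight, frozen_before)
print(n_in_optimizer, frozen_unchanged)
```

Gradients still flow through the frozen layer to the embedding below it; only the frozen layer's own parameters receive no gradient and no update.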
Key Hyperparameters
| Hyperparameter | Symbol | Typical range | Role |
|---|---|---|---|
| Learning rate | $\eta$ | 1e-4 – 1e-1 | Step size scale |
| Momentum | $\mu$ | 0.85 – 0.99 | Exponential average of gradients |
| Weight decay | $\lambda$ | 1e-4 – 1e-2 | L2 regularization penalty (decoupled from the gradient in AdamW) |
| Epsilon | $\epsilon$ | 1e-8 – 1e-6 | Numerical stability in divisions |
| Betas | $(\beta_1, \beta_2)$ | (0.9, 0.999) | Adam moment decay rates |
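For orientation, this is where those hyperparameters appear as constructor arguments in `torch.optim`. The values are picked from the typical ranges in the table and are not recommendations; the dummy parameter list is only there to make the snippet runnable.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(3))]   # dummy parameter

# SGD: learning rate, momentum, weight decay.
sgd = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)

# AdamW: learning rate, betas, epsilon, weight decay.
adamw = torch.optim.AdamW(
    params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01
)

# Every optimizer records its constructor defaults in .defaults.
sgd_settings = {k: sgd.defaults[k] for k in ('lr', 'momentum', 'weight_decay')}
adamw_settings = {k: adamw.defaults[k] for k in ('lr', 'betas', 'eps', 'weight_decay')}
print(sgd_settings)
print(adamw_settings)
```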
Gradient Clipping
Exploding gradients, common in RNNs and transformers, are handled by clipping the gradients after loss.backward() and before optimizer.step():
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # L2 norm clip
# or
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5) # element-wise clip
optimizer.step()
TensorFlow:
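# Via optimizer constructor (applied automatically each step).
# global_clipnorm clips by the norm over all gradients, matching clip_grad_norm_:
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, global_clipnorm=1.0)
# or: clipnorm=1.0 (each variable's gradient clipped separately), clipvalue=0.5 (element-wise)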
# Manual clip in a custom training loop:
grads = tape.gradient(loss, model.trainable_variables)
grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0) # equivalent to clip_grad_norm_
optimizer.apply_gradients(zip(grads, model.trainable_variables))
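clip_grad_norm_ computes one L2 norm over the gradients of all parameters passed in and, if it exceeds max_norm, rescales the entire gradient vector so the norm equals max_norm. clip_grad_value_ clips each element independently to [-clip_value, clip_value]. Norm clipping is preferred because it multiplies every component by the same factor and so preserves gradient direction, whereas value clipping truncates only the largest components and can point the update in a different direction.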
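The difference in direction can be measured with cosine similarity. The sketch below applies both clipping functions to the same hand-set gradient; the helper `clipped_grad` and the gradient values are made up for illustration.

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

original = torch.tensor([3.0, 0.4])

def clipped_grad(clip_fn):
    """Attach `original` as the gradient of a dummy parameter, clip in place."""
    p = torch.nn.Parameter(torch.zeros(2))
    p.grad = original.clone()
    clip_fn([p])
    return p.grad

by_norm = clipped_grad(lambda ps: clip_grad_norm_(ps, max_norm=1.0))
by_value = clipped_grad(lambda ps: clip_grad_value_(ps, clip_value=0.5))

cos = torch.nn.functional.cosine_similarity
cos_norm = cos(by_norm, original, dim=0).item()     # same direction as before
cos_value = cos(by_value, original, dim=0).item()   # direction has rotated

print(by_norm, cos_norm)
print(by_value, cos_value)
```

Value clipping cuts the large first component to 0.5 and leaves the small second one untouched, so the clipped vector leans much more toward the second axis than the original gradient did.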
Optimizer Selection Guide
| Situation | Recommended optimizer |
|---|---|
| General deep learning baseline | AdamW |
| Large-scale vision (ResNet, ViT) | SGD + momentum or AdamW |
| Transformers, LLMs | AdamW with LR warmup + cosine decay schedule |
| Sparse embeddings (NLP) | SparseAdam |
| Small models, L-BFGS tolerance | LBFGS |
| Convex problems, averaging needed | ASGD |
| No learning-rate tuning desired | Adadelta |
| Super-convergence experiments | SGD + OneCycleLR |
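As one example from the table, the "AdamW with cosine + warmup" row might be set up as follows. This is a sketch, not a canonical recipe: the linear warmup shape, the step counts, and the helper `lr_scale` are all assumptions, implemented here with `LambdaLR`.

```python
import math
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

warmup_steps, total_steps = 10, 100   # hypothetical training budget

def lr_scale(step):
    """Multiplier on the base LR: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_scale)

lrs = []
for _ in range(total_steps):
    lrs.append(opt.param_groups[0]['lr'])
    opt.step()     # no gradients here, so parameters are left unchanged
    sched.step()   # advance the schedule after the optimizer step

print(lrs[0], max(lrs), lrs[-1])
```

In a real loop the forward and backward passes would sit before `opt.step()`; they are omitted here so that only the schedule is on display.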