SGD & Momentum Methods
- State the SGD with momentum update rule and explain how the velocity term damps oscillations across high-curvature directions
- Distinguish Nesterov momentum from classical momentum and explain why evaluating the gradient at the lookahead position improves convergence
- Apply weight decay in SGD as L2 regularization and state why L2 regularization and weight decay are equivalent for SGD but not for adaptive optimizers
- Distinguish ASGD (parameter averaging for variance reduction) from Rprop (sign-based per-parameter step sizes) and identify when each is appropriate
Stochastic Gradient Descent
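At its core, SGD replaces the full-dataset gradient with a mini-batch estimate:
$$\theta_{t+1} = \theta_t - \eta\, g_t, \qquad g_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \ell_i(\theta_t)$$
Here $\eta$ is the learning rate, $B_t$ is the mini-batch sampled at step $t$, and $\ell_i$ is the loss on example $i$.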
The noise from sampling is not purely harmful — it acts as implicit regularization and helps escape sharp minima. But vanilla SGD converges slowly on ill-conditioned loss surfaces because the optimal step size varies across directions.
PyTorch:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
TensorFlow:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
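As a rough illustration of the mini-batch update, the following pure-Python sketch runs SGD on a one-parameter least-squares problem. The toy data, the function names (minibatch_gradient, sgd), and the hyperparameters are invented for this example and do not come from any library.

```python
import random


def minibatch_gradient(w, batch):
    """Mean gradient of 0.5 * (w * x - y)**2 over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def sgd(data, lr=0.1, batch_size=4, steps=200, seed=0):
    """Plain mini-batch SGD on a single scalar weight (illustrative only)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)      # draw a random mini-batch
        w -= lr * minibatch_gradient(w, batch)    # theta <- theta - lr * g
    return w


# Hypothetical toy data generated from y = 3x, so the minimiser is w = 3.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
w_hat = sgd(data)
```

Each step sees only four of the twenty points, so the gradient estimate changes from batch to batch, yet the iterates still settle at the full-data minimiser.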
Momentum
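Momentum adds a velocity term that accumulates gradients exponentially, smoothing oscillations across high-curvature directions:
$$v_{t+1} = \beta\, v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}$$
Here $\beta$ is the momentum coefficient and $\eta$ the learning rate. With $\beta = 0.9$, the velocity is an exponentially weighted sum of past gradients with an effective horizon of about $1/(1-\beta) = 10$ steps. Gradient components that flip sign from step to step (the steep, high-curvature directions) largely cancel in this sum, while components that keep the same sign (the shallow directions) add up and accelerate progress.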
PyTorch:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
TensorFlow:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
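Dampening (dampening > 0) scales down the contribution of the current gradient to the velocity, which is useful when the gradient is noisy:
$$v_{t+1} = \beta\, v_t + (1 - d)\, g_t$$
where $d$ is the dampening value and $\beta$ the momentum coefficient.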
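The momentum update, including the optional dampening factor, can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the names momentum_step, quad_grad, and run are hypothetical, the test problem is a two-dimensional quadratic with curvatures 1 and 100, and the velocity starts at zero (PyTorch instead seeds the buffer with the first gradient, which this sketch ignores).

```python
def momentum_step(theta, v, grad, lr=0.01, beta=0.9, dampening=0.0):
    """One momentum step on lists of floats: v <- beta*v + (1-d)*g."""
    v = [beta * vi + (1.0 - dampening) * gi for vi, gi in zip(v, grad)]
    theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta, v


def quad_grad(theta, curvatures=(1.0, 100.0)):
    """Gradient of 0.5 * sum(c_i * theta_i**2), an ill-conditioned bowl."""
    return [c * t for c, t in zip(curvatures, theta)]


def run(beta, steps=100, lr=0.01):
    """Run `steps` updates from (1, 1) and return the final parameters."""
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, v = momentum_step(theta, v, quad_grad(theta), lr=lr, beta=beta)
    return theta


plain = run(beta=0.0)   # vanilla SGD: slow along the shallow direction
heavy = run(beta=0.9)   # momentum: much closer to 0 along that direction
```

With the same learning rate, the shallow coordinate (index 0) is still far from the optimum after 100 vanilla steps, whereas the momentum run has reduced it by roughly two orders of magnitude.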
Nesterov Accelerated Gradient
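Nesterov momentum evaluates the gradient at a lookahead position, where the parameters would be after applying the current velocity, rather than at the current parameters. With learning rate $\eta$ and momentum coefficient $\beta$:
$$\tilde\theta_t = \theta_t - \eta\beta\, v_t, \qquad v_{t+1} = \beta\, v_t + \nabla_\theta L(\tilde\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}$$
This makes momentum anticipatory rather than corrective: if the velocity is about to overshoot, the lookahead gradient already points back and shrinks the step. With exact gradients on smooth convex problems, Nesterov's method has a provably faster rate than plain gradient descent ($O(1/t^2)$ versus $O(1/t)$). In deep learning the gain over classical momentum is usually modest, but the option is widely used in practice: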
PyTorch:
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, nesterov=True
)
TensorFlow:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
Note: nesterov=True requires momentum > 0 and dampening = 0.
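To make the difference between the two rules concrete, here is a small sketch that follows the textbook lookahead description on a one-dimensional quadratic. The function names and the curvature value are assumptions chosen for illustration; framework implementations typically use an algebraically rearranged form rather than an explicit lookahead point.

```python
def grad(theta, curvature=10.0):
    """Gradient of the quadratic 0.5 * curvature * theta**2."""
    return curvature * theta


def classical_step(theta, v, lr, beta):
    """Classical momentum: gradient taken at the current parameters."""
    v = beta * v + grad(theta)
    return theta - lr * v, v


def nesterov_step(theta, v, lr, beta):
    """Nesterov momentum: gradient taken at the lookahead position."""
    lookahead = theta - lr * beta * v
    v = beta * v + grad(lookahead)
    return theta - lr * v, v


def run(step_fn, steps=50, lr=0.01, beta=0.9):
    """Return the list of iterates theta_1 .. theta_steps, starting from 1.0."""
    theta, v = 1.0, 0.0
    path = []
    for _ in range(steps):
        theta, v = step_fn(theta, v, lr, beta)
        path.append(theta)
    return path


classical_path = run(classical_step)
nesterov_path = run(nesterov_step)
```

On this toy problem both runs oscillate around zero, but the Nesterov oscillation dies out faster, so its last few iterates stay closer to the optimum.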
Weight Decay in SGD
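SGD applies L2 regularization by adding $\lambda\theta_t$ to the gradient before the update:
$$\theta_{t+1} = \theta_t - \eta\,(g_t + \lambda\theta_t) = (1 - \eta\lambda)\,\theta_t - \eta\, g_t$$
The right-hand side is a weight-decay step that shrinks the weights by the factor $(1 - \eta\lambda)$, so for plain SGD, L2 regularization and weight decay are equivalent. This equivalence breaks for adaptive optimizers, which rescale the $\lambda\theta_t$ term together with the gradient (see AdamW).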
PyTorch:
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4
)
TensorFlow:
# Older Keras SGD has no weight_decay argument; add an L2 kernel_regularizer to each layer instead.
# Newer releases (Keras 3 / TF 2.13+) accept weight_decay on the optimizer itself:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, weight_decay=1e-4)
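The equivalence claim can be checked numerically with scalar updates. The sketch below is illustrative only: all function names are hypothetical, and sign descent is used as a deliberately simplified stand-in for an adaptive optimizer, not as a model of Adam itself.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sgd_l2_step(theta, grad, lr, lam):
    """L2 regularization: add lam * theta to the gradient, then step."""
    return theta - lr * (grad + lam * theta)


def sgd_decay_step(theta, grad, lr, lam):
    """Weight decay: shrink theta directly, then take the plain step."""
    return (1.0 - lr * lam) * theta - lr * grad


def sign_l2_step(theta, grad, lr, lam):
    """Toy adaptive rule with the L2 term inside the rescaled gradient."""
    return theta - lr * sign(grad + lam * theta)


def sign_decay_step(theta, grad, lr, lam):
    """Same toy rule with the decay applied outside the rescaling."""
    return (1.0 - lr * lam) * theta - lr * sign(grad)


args = dict(theta=2.0, grad=0.5, lr=0.1, lam=0.01)
sgd_gap = abs(sgd_l2_step(**args) - sgd_decay_step(**args))
adaptive_gap = abs(sign_l2_step(**args) - sign_decay_step(**args))
```

For plain SGD the two formulations land on the same value, while under the rescaled update the L2 term is absorbed into the normalization and the two results differ.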
ASGD — Averaged SGD
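ASGD runs standard SGD but maintains a running average of parameter iterates, activated after t0 steps:
$$\bar\theta_t = \begin{cases} \theta_t, & t \le t_0 \\ \bar\theta_{t-1} + \dfrac{1}{t - t_0}\,(\theta_t - \bar\theta_{t-1}), & t > t_0 \end{cases}$$
so that for $t > t_0$ the average $\bar\theta_t$ is the arithmetic mean of the iterates $\theta_{t_0+1}, \dots, \theta_t$.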
Averaging reduces variance from late-stage gradient noise. It's theoretically optimal for convex problems and occasionally used in NLP (classic LSTM training).
PyTorch:
optimizer = torch.optim.ASGD(
    model.parameters(), lr=0.01, lambd=1e-4, alpha=0.75, t0=1e6
)
# Use optimizer.state[p]['ax'] for the averaged parameters
TensorFlow: No built-in ASGD. Implement parameter averaging manually, for example with tf.train.ExponentialMovingAverage, which keeps an exponential rather than a uniform average of the weights:
ema = tf.train.ExponentialMovingAverage(decay=0.999)
ema.apply(model.trainable_variables) # call after each optimizer step
# For inference: use ema.average(var) instead of var
In the PyTorch optimizer, the lambd parameter is a decay term that shrinks the effective learning rate as training proceeds; alpha is the exponent of that decay schedule, so larger values reduce the learning rate faster.
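The averaging step itself is easy to sketch without any framework. The example below is a minimal illustration, assuming a noisy one-dimensional quadratic and a constant learning rate; the names asgd and noisy_grad are hypothetical, and the learning-rate decay controlled by lambd and alpha is left out to keep the focus on the running mean.

```python
import random


def noisy_grad(theta, rng):
    """Gradient of 0.5 * theta**2 plus zero-mean Gaussian noise."""
    return theta + rng.gauss(0.0, 1.0)


def asgd(grad_fn, theta0, lr=0.1, t0=50, steps=500, seed=0):
    """SGD with a running mean of the iterates that starts after step t0."""
    rng = random.Random(seed)
    theta, avg = theta0, theta0
    history = []
    for t in range(1, steps + 1):
        theta -= lr * grad_fn(theta, rng)
        mu = 1.0 / max(1, t - t0)      # mu == 1 until averaging starts
        avg += mu * (theta - avg)      # incremental mean of the iterates
        history.append(theta)
    return theta, avg, history


last, averaged, history = asgd(noisy_grad, theta0=5.0)
```

The final iterate keeps jittering around the optimum at 0 because of the gradient noise, while the averaged value is the mean of the last 450 iterates and therefore fluctuates much less.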
Rprop — Resilient Backpropagation
Rprop ignores gradient magnitude entirely: each parameter moves against the sign of its gradient by its own step size. Step sizes grow or shrink depending on whether the gradient sign is consistent across consecutive steps:
- If sign(g_t) == sign(g_{t-1}): multiply the step size by etas[1] (default 1.2)
- If sign(g_t) != sign(g_{t-1}): multiply the step size by etas[0] (default 0.5)
- Step sizes are clipped to the step_sizes bounds (default 1e-6 to 50)
PyTorch:
optimizer = torch.optim.Rprop(
    model.parameters(), lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-6, 50)
)
TensorFlow: No built-in Rprop. RMSprop (tf.keras.optimizers.RMSprop) is Rprop's direct descendant and is the practical TF substitute for mini-batch settings.
Rprop works well for full-batch training and small networks but poorly in mini-batch settings, where sampling noise flips gradient signs from step to step and keeps shrinking the step sizes.
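The sign-based rule can be sketched per parameter as follows. This is an illustrative pure-Python version, not the library implementation: the function name rprop_step is hypothetical, and skipping the update after a sign flip (by zeroing the stored gradient) is one common variant that is assumed here.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(theta, grad, prev_grad, step_size,
               etas=(0.5, 1.2), step_sizes=(1e-6, 50.0)):
    """One Rprop update over lists of floats; returns new state lists."""
    eta_minus, eta_plus = etas
    min_step, max_step = step_sizes
    new_theta, new_prev, new_step = [], [], []
    for t, g, pg, s in zip(theta, grad, prev_grad, step_size):
        agreement = g * pg
        if agreement > 0:                 # same sign: grow the step
            s = min(s * eta_plus, max_step)
        elif agreement < 0:               # sign flip: shrink the step
            s = max(s * eta_minus, min_step)
            g = 0.0                       # assumed variant: skip this update
        new_theta.append(t - sign(g) * s) # only the sign of g is used
        new_prev.append(g)
        new_step.append(s)
    return new_theta, new_prev, new_step


theta, prev, steps = rprop_step(
    theta=[1.0, 1.0], grad=[2.0, -3.0], prev_grad=[1.0, 1.0], step_size=[0.1, 0.1]
)
```

In this call the first parameter sees a consistent sign, so its step grows and it moves; the second sees a flip, so its step is halved and it stays put for one iteration. The gradient magnitudes 2.0 and -3.0 have no effect beyond their signs.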
Choosing Between SGD Variants
| Variant | Best for |
|---|---|
| SGD (no momentum) | Convex baselines, simple experiments |
| SGD + momentum | Most supervised deep learning |
| SGD + Nesterov | When classical momentum shows oscillations |
| ASGD | Convex or near-convex problems, final-phase averaging (e.g. LSTM language models) |
| Rprop | Full-batch training, small networks |
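For large-scale convolutional vision models such as ResNets, well-tuned SGD with momentum and a cosine LR schedule often matches or beats Adam. Vision transformers (ViTs) are a different case and are usually trained with AdamW.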