Supplement · Optimizers

SGD & Momentum Methods

By the end of this reading you will be able to:
  • State the SGD with momentum update rule and explain how the velocity term damps oscillations across high-curvature directions
  • Distinguish Nesterov momentum from classical momentum and explain why evaluating the gradient at the lookahead position improves convergence
  • Apply weight decay in SGD as L2 regularization and state why L2 regularization and weight decay are equivalent for SGD but not for adaptive optimizers
  • Distinguish ASGD (parameter averaging for variance reduction) from Rprop (sign-based per-parameter step sizes) and identify when each is appropriate

Stochastic Gradient Descent

At its core, SGD replaces the full-dataset gradient with a mini-batch estimate:

$$\theta_{t+1} = \theta_t - \eta \, g_t, \quad g_t = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_{\theta} \ell_i$$

The noise from sampling is not purely harmful: it acts as implicit regularization and can help the iterates escape sharp minima. But vanilla SGD converges slowly on ill-conditioned loss surfaces, because a single learning rate must be small enough for the steepest direction, which leaves progress along the flat directions slow.

import torch  # model is assumed to be an existing torch.nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

TensorFlow:

import tensorflow as tf
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
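To make the mini-batch update concrete, here is a minimal pure-Python sketch that fits a line by SGD on squared error. The function name `sgd_linear_fit`, the toy data, and the hyperparameters are illustrative assumptions, not part of either framework.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, steps=500, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD on the loss 0.5*(w*x + b - y)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Sample a mini-batch of indices without replacement.
        batch = rng.sample(range(n), batch_size)
        # g_t: average of the per-example gradients over the batch.
        gw = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / batch_size
        gb = sum((w * xs[i] + b - ys[i]) for i in batch) / batch_size
        # theta_{t+1} = theta_t - lr * g_t
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy data (assumed for illustration): a noiseless line y = 3x - 1.
xs = [i / 10 for i in range(-20, 21)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Because the toy targets are noiseless, every per-example gradient vanishes at the true parameters, so the iterates settle there; with noisy targets they would keep jittering around the optimum at a scale set by the learning rate.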

Momentum

Momentum adds a velocity term $v$ that accumulates an exponentially decaying sum of past gradients, smoothing oscillations across high-curvature directions:

$$v_{t+1} = \beta v_t + g_t$$

$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

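With $\beta = 0.9$, the velocity is an exponentially weighted sum of past gradients with an effective horizon of roughly $1/(1-\beta) = 10$ steps. Because the velocity update does not scale $g_t$ by $(1-\beta)$, a constant gradient $g$ drives the velocity toward $g/(1-\beta)$, so the effective step size along consistent gradient directions is about $10\eta$. Along high-curvature directions, where successive gradients flip sign, the accumulated terms largely cancel, which damps the oscillation.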

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

TensorFlow:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

Dampening (dampening > 0) reduces the contribution of the current gradient to the velocity, which is useful when the gradient is noisy:

$$v_{t+1} = \beta v_t + (1 - \text{dampening}) \cdot g_t$$
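The velocity update, including the optional dampening factor, can be sketched in a few lines of plain Python. The helper names (`momentum_step`, `quad_grad`, `run`) and the ill-conditioned quadratic test problem are assumptions chosen for illustration; this is not how either framework implements the optimizer internally.

```python
def momentum_step(theta, v, grad, lr, beta=0.9, dampening=0.0):
    """v <- beta*v + (1 - dampening)*g, then theta <- theta - lr*v."""
    v = [beta * vi + (1.0 - dampening) * gi for vi, gi in zip(v, grad)]
    theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta, v

def quad_grad(theta, curvature=(1.0, 100.0)):
    """Gradient of 0.5 * sum(c_i * theta_i^2): one flat and one steep direction."""
    return [c * t for c, t in zip(curvature, theta)]

def run(beta, lr=0.01, steps=200):
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, v = momentum_step(theta, v, quad_grad(theta), lr, beta)
    return theta

plain = run(beta=0.0)   # beta = 0 reduces to plain gradient descent
heavy = run(beta=0.9)   # same learning rate, with momentum
print("plain:", plain)
print("heavy:", heavy)

# With a constant gradient of 1, the velocity approaches 1 / (1 - beta).
v_const = 0.0
for _ in range(200):
    v_const = 0.9 * v_const + 1.0
print("steady-state velocity:", v_const)
```

On this particular problem the learning rate is capped by the steep direction, so plain gradient descent crawls along the flat one, while momentum makes much faster progress there at the same learning rate. Other curvatures or learning rates can behave differently.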

Nesterov Accelerated Gradient

Nesterov momentum evaluates the gradient at a lookahead position, where the parameters would be after applying the current velocity, rather than at the current parameters:

$$\theta_{\text{look}} = \theta_t - \eta \beta v_t$$

$$v_{t+1} = \beta v_t + \nabla L(\theta_{\text{look}})$$

$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

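This makes momentum anticipatory rather than corrective: if the velocity is about to overshoot, the lookahead gradient already points back, so the correction arrives one step earlier. With exact gradients on smooth convex problems, Nesterov's accelerated gradient method (with a suitable momentum schedule) improves the worst-case convergence rate from $O(1/t)$ to $O(1/t^2)$. In stochastic deep-learning training the gain over classical momentum is usually smaller, but the option is cheap and widely used in practice: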

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, nesterov=True
)

TensorFlow:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

Note: in PyTorch, nesterov=True requires momentum > 0 and dampening = 0; other combinations raise a ValueError.
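The difference between the two rules is easiest to see side by side. The sketch below implements the explicit lookahead form on a one-dimensional quadratic; the function names, the curvature, and the learning rate are illustrative assumptions. Library implementations may use an algebraically rearranged update rather than an explicit lookahead point.

```python
def classical_step(theta, v, grad_fn, lr, beta=0.9):
    """Classical momentum: gradient taken at the current parameters."""
    v = beta * v + grad_fn(theta)
    return theta - lr * v, v

def nesterov_step(theta, v, grad_fn, lr, beta=0.9):
    """Nesterov momentum: gradient taken at the lookahead position."""
    theta_look = theta - lr * beta * v
    v = beta * v + grad_fn(theta_look)
    return theta - lr * v, v

def tail_peak(step_fn, steps=50, tail=10, lr=0.05, curvature=10.0):
    """Largest |theta| over the last `tail` steps on the loss 0.5*curvature*theta^2."""
    def grad_fn(th):
        return curvature * th
    theta, v = 1.0, 0.0
    history = []
    for _ in range(steps):
        theta, v = step_fn(theta, v, grad_fn, lr)
        history.append(abs(theta))
    return max(history[-tail:])

classical_peak = tail_peak(classical_step)
nesterov_peak = tail_peak(nesterov_step)
print("classical:", classical_peak)
print("nesterov: ", nesterov_peak)
```

With these settings classical momentum is still visibly oscillating around the minimum after 50 steps, while the lookahead variant has already settled much closer to it. This is a single hand-picked example, not evidence of a general ranking.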

Weight Decay in SGD

SGD applies L2 regularization by adding $\lambda \theta$ to the gradient before the update:

$$\theta_{t+1} = \theta_t - \eta (g_t + \lambda \theta_t)$$

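For plain SGD, L2 regularization and weight decay are equivalent: expanding the update gives $\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta g_t$, which is a multiplicative shrink of the weights followed by the usual gradient step. This equivalence breaks for adaptive optimizers, which rescale the whole gradient (including the $\lambda\theta_t$ term) per parameter, so the penalty no longer shrinks every weight at the same rate (see AdamW).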

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4
)

TensorFlow:

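# Older Keras SGD has no weight_decay argument; there, apply an L2 penalty via
# kernel_regularizer on each layer instead.
# Newer releases (Keras 3 / TF 2.13+) accept weight_decay directly. Keras applies it as a
# decoupled shrink of the variables, so results can differ from PyTorch once momentum is used:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, weight_decay=1e-4)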
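A short numeric check can make the equivalence claim tangible. The sketch below compares the two forms for a single scalar weight, first under plain SGD and then under a deliberately simplified adaptive step that divides the gradient by its own magnitude. That toy rescaling is a stand-in assumed here for illustration only; it is not Adam or any library's optimizer, and all function names are hypothetical.

```python
def sgd_l2(theta, g, lr, lam):
    """L2 penalty: add lam*theta to the gradient, then take the step."""
    return theta - lr * (g + lam * theta)

def sgd_decoupled(theta, g, lr, lam):
    """Weight decay: shrink theta by (1 - lr*lam), then take the step."""
    return (1.0 - lr * lam) * theta - lr * g

def scaled_l2(theta, g, lr, lam, eps=1e-8):
    """Toy adaptive step: the penalty is rescaled together with the gradient."""
    total = g + lam * theta
    return theta - lr * total / (abs(total) + eps)

def scaled_decoupled(theta, g, lr, lam, eps=1e-8):
    """Toy adaptive step with the shrink kept outside the rescaling."""
    return (1.0 - lr * lam) * theta - lr * g / (abs(g) + eps)

theta, g, lr, lam = 2.0, 0.5, 0.1, 0.01
plain_gap = abs(sgd_l2(theta, g, lr, lam) - sgd_decoupled(theta, g, lr, lam))
adaptive_gap = abs(scaled_l2(theta, g, lr, lam) - scaled_decoupled(theta, g, lr, lam))
print("plain SGD gap:", plain_gap)
print("toy adaptive gap:", adaptive_gap)
```

Under plain SGD the two forms agree up to floating-point rounding. Under the rescaled step they diverge, because the normalization absorbs the penalty term instead of letting it shrink the weight.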

ASGD — Averaged SGD

ASGD runs standard SGD but maintains a running average of parameter iterates, activated after $t_0$ steps:

$$\bar{\theta}_t = \frac{1}{t - t_0} \sum_{i=t_0+1}^{t} \theta_i$$

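Averaging reduces the variance caused by late-stage gradient noise. Iterate averaging of this kind (Polyak-Ruppert averaging) attains the asymptotically optimal convergence rate on stochastic convex problems, and it is occasionally used in NLP (classic LSTM training).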

optimizer = torch.optim.ASGD(
    model.parameters(), lr=0.01, lambd=1e-4, alpha=0.75, t0=1e6
)
# Use optimizer.state[p]['ax'] for the averaged parameters

TensorFlow: No built-in ASGD. Implement parameter averaging manually; tf.train.ExponentialMovingAverage gives an exponentially weighted approximation, in which recent iterates count more than they would in ASGD's uniform average:

ema = tf.train.ExponentialMovingAverage(decay=0.999)
ema.apply(model.trainable_variables)  # call after each optimizer step
# For inference: use ema.average(var) instead of var

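In PyTorch's ASGD, lambd and alpha together set a decay schedule for the effective learning rate, which at step t is lr / (1 + lambd · lr · t)^alpha. A larger lambd makes the decay set in sooner, and alpha is the exponent that controls how steep the decay is.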
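The averaging idea itself is framework-independent. The following sketch runs constant-step SGD on a noisy one-dimensional quadratic and keeps an incremental mean of the iterates produced after step t0. The function name, noise level, and step counts are assumptions for illustration, and the learning-rate decay controlled by lambd and alpha is deliberately left out.

```python
import random

def sgd_with_averaging(steps=6000, t0=1000, lr=0.1, noise=1.0, seed=0):
    """SGD on 0.5*theta^2 with Gaussian gradient noise, averaging after t0."""
    rng = random.Random(seed)
    theta, avg, count = 5.0, 0.0, 0
    tail = []  # iterates that enter the average, kept only for inspection
    for t in range(1, steps + 1):
        g = theta + rng.gauss(0.0, noise)   # noisy gradient estimate
        theta -= lr * g
        if t > t0:
            count += 1
            avg += (theta - avg) / count     # incremental mean update
            tail.append(theta)
    return theta, avg, tail

last, averaged, tail = sgd_with_averaging()
print("last iterate:    ", last)
print("averaged iterate:", averaged)
```

The last iterate keeps jittering around the optimum at 0 with a spread set by the learning rate and noise level, whereas the averaged iterate pools thousands of those jittering points and so typically lands much closer. On any single seed the last iterate can still happen to be near 0, so the comparison is best read statistically.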

Rprop — Resilient Backpropagation

Rprop ignores gradient magnitude entirely: each parameter moves by its own adaptive step size in the direction opposite to the sign of its gradient. Step sizes grow or shrink based on whether the gradient sign is consistent across consecutive steps:

  • If sign(g_t) == sign(g_{t-1}): multiply the step size by etas[1] (default 1.2)
  • If sign(g_t) != sign(g_{t-1}): multiply the step size by etas[0] (default 0.5)
  • Step sizes are clipped to step_sizes bounds (default 1e-6 to 50)

optimizer = torch.optim.Rprop(
    model.parameters(), lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-6, 50)
)

TensorFlow: No built-in Rprop. RMSprop (tf.keras.optimizers.RMSprop) is Rprop's direct descendant and is the practical TF substitute for mini-batch settings.

Rprop works well for full-batch training and small networks but poorly in mini-batch settings, where gradient signs flip because of sampling noise rather than because the step overshot.
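The sign-based rule can be sketched per parameter in plain Python. The version below assumes one common variant in which a sign flip shrinks the step size and skips the move for that step; the function name `rprop_step` and the quadratic test problem are illustrative assumptions, not a reproduction of any library's code.

```python
def rprop_step(theta, grad, prev_grad, step, etas=(0.5, 1.2), bounds=(1e-6, 50.0)):
    """One Rprop update over lists of parameters, gradients, and step sizes."""
    eta_minus, eta_plus = etas
    lo, hi = bounds
    new_theta, new_prev, new_step = [], [], []
    for t, g, pg, s in zip(theta, grad, prev_grad, step):
        agreement = g * pg
        if agreement > 0:        # same sign as last time: grow the step
            s = min(s * eta_plus, hi)
        elif agreement < 0:      # sign flipped: shrink the step, skip this move
            s = max(s * eta_minus, lo)
            g = 0.0
        sign = (g > 0) - (g < 0)  # only the sign of the gradient is used
        new_theta.append(t - s * sign)
        new_prev.append(g)
        new_step.append(s)
    return new_theta, new_prev, new_step

# Full-batch gradients of 0.5 * sum(c_i * theta_i^2) with very different curvatures.
curvature = (1.0, 100.0)
theta, prev, step = [1.0, 1.0], [0.0, 0.0], [0.01, 0.01]
for _ in range(100):
    grad = [c * t for c, t in zip(curvature, theta)]
    theta, prev, step = rprop_step(theta, grad, prev, step)
print(theta)
```

Both coordinates follow the same trajectory even though their gradients differ by a factor of 100, which is exactly what ignoring magnitude means. It also hints at why noisy mini-batch gradients are a problem: a single spurious sign flip halves the step size regardless of how small the noise was.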

Choosing Between SGD Variants

Variant | Best for
SGD (no momentum) | Convex baselines, simple experiments
SGD + momentum | Most supervised deep learning
SGD + Nesterov | When classical momentum shows oscillations
ASGD | Convex/quasi-convex language models, final-phase averaging
Rprop | Full-batch training, small networks

For large-scale convolutional vision models (e.g. ResNets), well-tuned SGD with momentum and a cosine LR schedule often matches or beats Adam. Vision Transformers, by contrast, are usually trained with AdamW.