Supplement · Optimizers

SGD & Momentum Methods

By the end of this reading you will be able to:

State the SGD with momentum update rule and explain how the velocity term damps oscillations across high-curvature directions
Distinguish Nesterov momentum from classical momentum and explain why evaluating the gradient at the lookahead position improves convergence
Apply weight decay in SGD as L2 regularization and state why L2 regularization and weight decay are equivalent for SGD but not for adaptive optimizers
Distinguish ASGD (parameter averaging for variance reduction) from Rprop (sign-based per-parameter step sizes) and identify when each is appropriate

Stochastic Gradient Descent

At its core, SGD replaces the full-dataset gradient with a mini-batch estimate: $\theta_{t+1} = \theta_t - \eta \, g_t, \quad g_t = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_{\theta} \ell_i(\theta_t)$, where $\eta$ is the learning rate and $B = |\mathcal{B}|$ is the mini-batch size.

The noise from sampling is not purely harmful: it acts as implicit regularization and helps escape sharp minima. But vanilla SGD converges slowly on ill-conditioned loss surfaces, because a single learning rate must be small enough for the steepest direction, which makes progress along the flat directions slow.

PyTorch:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

TensorFlow:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
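To make the update rule concrete, here is a minimal pure-Python sketch of mini-batch SGD on a one-dimensional linear fit. The function name `minibatch_sgd`, the toy data, and the hyperparameter values are illustrative assumptions, not part of either library.

```python
import random

def minibatch_sgd(xs, ys, lr=0.1, batch_size=4, steps=2000, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on the loss 0.5 * (w*x + b - y)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Draw the mini-batch B uniformly without replacement
        batch = rng.sample(range(len(xs)), batch_size)
        residuals = [(w * xs[i] + b - ys[i], xs[i]) for i in batch]
        # g_t: average of the per-example gradients over the batch
        grad_w = sum(r * x for r, x in residuals) / batch_size
        grad_b = sum(r for r, _ in residuals) / batch_size
        # theta_{t+1} = theta_t - eta * g_t
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noise-free toy data generated from y = 2x + 1 (illustrative values)
xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Because the toy data are noise-free, every mini-batch gradient vanishes at the true solution, so the iterates settle close to w = 2 and b = 1. With noisy targets they would instead keep fluctuating around the optimum.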

Momentum

Momentum adds a velocity term $v$ that accumulates gradients exponentially, smoothing oscillations across high-curvature directions: $v_{t+1} = \beta v_t + g_t$

$\theta_{t+1} = \theta_t - \eta v_{t+1}$

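With $\beta = 0.9$, the velocity is an exponentially weighted sum of past gradients with an effective horizon of about $1/(1-\beta) = 10$ steps. Because the sum is not normalized by $(1-\beta)$, a constant gradient $g$ builds up to a velocity of $g/(1-\beta)$, a tenfold larger effective step. Gradient components that flip sign from step to step largely cancel, which damps oscillations across high-curvature directions, while components that keep their sign accumulate and accelerate progress.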

PyTorch:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

TensorFlow:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

In PyTorch, dampening > 0 scales down the contribution of the current gradient to the velocity, which is useful when the gradient is noisy: $v_{t+1} = \beta v_t + (1 - \text{dampening}) \cdot g_t$
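The velocity update, including the dampening factor, can be sketched in a few lines of plain Python. The quadratic loss, the learning rate, and the helper names (`sgd_momentum`, `steady_state_velocity`) are assumptions chosen for illustration.

```python
def sgd_momentum(grad_fn, theta0, lr, beta=0.9, dampening=0.0, steps=100):
    """Run v <- beta * v + (1 - dampening) * g, then theta <- theta - lr * v."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = [beta * vi + (1.0 - dampening) * gi for vi, gi in zip(v, g)]
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta

def steady_state_velocity(beta, g=1.0, steps=200):
    """Velocity reached under a constant gradient g; approaches g / (1 - beta)."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + g
    return v

# Ill-conditioned toy loss: 0.5 * (100 * x^2 + y^2)
def grad(theta):
    return [100.0 * theta[0], theta[1]]

def loss(theta):
    return 0.5 * (100.0 * theta[0] ** 2 + theta[1] ** 2)

start = [1.0, 1.0]
plain = sgd_momentum(grad, start, lr=0.015, beta=0.0)  # beta = 0 is vanilla gradient descent
heavy = sgd_momentum(grad, start, lr=0.015, beta=0.9)
print(loss(plain), loss(heavy))
print(steady_state_velocity(0.9))
```

At the same learning rate, the momentum run ends with a lower loss on this toy problem, mainly because the flat y direction benefits from the accumulated velocity. The comparison is a sketch at one fixed setting, not a general benchmark.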

Nesterov Accelerated Gradient

Nesterov momentum evaluates the gradient at a lookahead position — where we would be after applying the current velocity — rather than at the current parameters: $\theta_{\text{look}} = \theta_t - \eta \beta v_t$

$v_{t+1} = \beta v_t + \nabla L(\theta_{\text{look}})$

$\theta_{t+1} = \theta_t - \eta v_{t+1}$

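This makes momentum anticipatory rather than corrective. In the deterministic, smooth convex setting, Nesterov's method attains an $O(1/t^2)$ convergence rate, compared with $O(1/t)$ for plain gradient descent. In stochastic deep-learning training the gain over classical momentum is typically smaller, but the method is widely used in practice: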

PyTorch:

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, nesterov=True
)

TensorFlow:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

Note: in PyTorch, nesterov=True requires momentum > 0 and dampening = 0; otherwise the constructor raises a ValueError.
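The three Nesterov equations above translate directly into code. The sketch below follows the formulas as written in this section (libraries may use an algebraically rearranged form internally) and compares the two variants on a toy quadratic. The function name `momentum_path`, the loss, and the learning rate are illustrative assumptions.

```python
def momentum_path(grad_fn, theta0, lr, beta=0.9, steps=100, nesterov=False):
    """Return every iterate of classical or Nesterov momentum."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    path = [list(theta)]
    for _ in range(steps):
        if nesterov:
            # Evaluate the gradient at the lookahead point theta - lr * beta * v
            point = [ti - lr * beta * vi for ti, vi in zip(theta, v)]
        else:
            point = theta
        g = grad_fn(point)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]
        path.append(list(theta))
    return path

def grad(theta):
    # Gradient of the toy loss 0.5 * (100 * x^2 + y^2)
    return [100.0 * theta[0], theta[1]]

classical = momentum_path(grad, [1.0, 1.0], lr=0.009, nesterov=False)
lookahead = momentum_path(grad, [1.0, 1.0], lr=0.009, nesterov=True)

# Summed |x| along the path: a rough measure of oscillation on the steep axis
osc_classical = sum(abs(p[0]) for p in classical)
osc_lookahead = sum(abs(p[0]) for p in lookahead)
print(osc_classical, osc_lookahead)
```

On this example the lookahead gradient counteracts the velocity before it overshoots, so the steep x coordinate settles much faster than under classical momentum. The stable learning-rate range also differs between the two variants, so a rate that works for one is not guaranteed to work for the other.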

Weight Decay in SGD

SGD applies L2 regularization by adding $\lambda \theta$ to the gradient before the update: $\theta_{t+1} = \theta_t - \eta (g_t + \lambda \theta_t)$

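Expanding the update gives $\theta_{t+1} = (1 - \eta \lambda)\,\theta_t - \eta\, g_t$: each step first shrinks the weights by a factor $(1 - \eta \lambda)$ and then applies the raw gradient. For plain SGD, L2 regularization and weight decay are therefore equivalent. This equivalence breaks for adaptive optimizers, which rescale the gradient per parameter and so rescale the $\lambda \theta$ term along with it (see AdamW).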

PyTorch:

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4
)

TensorFlow:

# Older Keras SGD has no weight_decay argument: apply L2 via kernel_regularizer on each layer.
# Newer releases (Keras 3 / TF 2.13+) accept weight_decay on the optimizer directly:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, weight_decay=1e-4)
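The equivalence claim can be checked numerically on a single scalar parameter. In the sketch below, `scale` is a hypothetical stand-in for the per-parameter denominator an adaptive optimizer would apply. It is not any library's actual update rule, and all numbers are illustrative.

```python
def sgd_l2_step(theta, g, lr, lam):
    """L2 form: add lam * theta to the gradient, then take a plain SGD step."""
    return theta - lr * (g + lam * theta)

def sgd_decay_step(theta, g, lr, lam):
    """Weight-decay form: shrink theta by (1 - lr * lam), then apply the raw gradient."""
    return (1.0 - lr * lam) * theta - lr * g

def scaled_l2_step(theta, g, lr, lam, scale):
    """Hypothetical adaptive step: the L2-augmented gradient is divided by `scale`."""
    return theta - lr * (g + lam * theta) / scale

def scaled_decay_step(theta, g, lr, lam, scale):
    """Hypothetical adaptive step with decoupled decay: only the gradient is rescaled."""
    return (1.0 - lr * lam) * theta - lr * g / scale

theta, g, lr, lam = 2.0, 0.5, 0.1, 0.01

same_a = sgd_l2_step(theta, g, lr, lam)
same_b = sgd_decay_step(theta, g, lr, lam)
diff_a = scaled_l2_step(theta, g, lr, lam, scale=4.0)
diff_b = scaled_decay_step(theta, g, lr, lam, scale=4.0)
print(same_a, same_b)  # agree up to floating-point rounding
print(diff_a, diff_b)  # differ once the gradient is rescaled
```

The two plain-SGD forms agree, while the two rescaled forms do not. With any `scale` other than 1, the L2 term is shrunk or amplified along with the gradient.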

ASGD — Averaged SGD

ASGD runs standard SGD but maintains a running average of parameter iterates, activated after t0 steps: $\bar{\theta}_t = \frac{1}{t - t_0 + 1} \sum_{i=t_0}^{t} \theta_i$

Averaging reduces variance from late-stage gradient noise. Polyak-Ruppert averaging achieves the asymptotically optimal convergence rate for strongly convex stochastic problems, and it is occasionally used in NLP (classic LSTM training).

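PyTorch:

optimizer = torch.optim.ASGD(
    model.parameters(), lr=0.01, lambd=1e-4, alpha=0.75, t0=1e6
)
# t0 is the step at which averaging starts; with the default 1e6, shorter runs never average.
# After training, optimizer.state[p]['ax'] holds the averaged copy of parameter p.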

TensorFlow: No built-in ASGD. One approximation is to keep an exponential moving average of the weights with tf.train.ExponentialMovingAverage. Unlike the uniform average above, it weights recent iterates more heavily:

ema = tf.train.ExponentialMovingAverage(decay=0.999)
ema.apply(model.trainable_variables)  # call after each optimizer step
# For inference: use ema.average(var) instead of var

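In PyTorch's ASGD, lambd is a decay term and alpha is the exponent of the learning-rate schedule: after $t$ steps the effective step size is $\eta_t = \text{lr} / (1 + \text{lambd} \cdot \text{lr} \cdot t)^{\text{alpha}}$, so a larger alpha makes the step size shrink faster.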
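The averaging rule itself needs only a running mean. The following sketch applies it to noisy SGD on a scalar toy loss, leaving out the lambd and alpha schedule. The function name `averaged_sgd`, the Gaussian noise model, and all constants are assumptions made for illustration.

```python
import random

def averaged_sgd(grad_fn, theta0, lr=0.1, steps=1000, t0=500, noise=1.0, seed=0):
    """Noisy SGD on a scalar parameter with iterate averaging from step t0 onward."""
    rng = random.Random(seed)
    theta = theta0
    avg = theta0
    averaged = []  # iterates that entered the average, kept only for checking
    for t in range(1, steps + 1):
        g = grad_fn(theta) + rng.gauss(0.0, noise)  # simulated mini-batch noise
        theta -= lr * g
        if t < t0:
            avg = theta  # before t0 the "average" simply tracks the iterate
        else:
            averaged.append(theta)
            # Incremental mean over theta_{t0}, ..., theta_t
            avg += (theta - avg) / len(averaged)
    return theta, avg, averaged

# Toy loss 0.5 * theta^2: the gradient is theta and the optimum is 0
last, avg, averaged = averaged_sgd(lambda th: th, theta0=5.0)
print(f"last iterate: {last:.4f}, averaged: {avg:.4f}")
```

The incremental update reproduces the uniform mean of the iterates from step t0 onward without storing them; the `averaged` list exists only so the result can be checked. On noisy problems like this one the averaged value usually sits closer to the optimum than the last iterate, though that is not guaranteed for any single run.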

Rprop — Resilient Backpropagation

Rprop ignores gradient magnitude entirely: each parameter moves against the sign of its gradient by its own adaptive step size. Step sizes grow or shrink based on whether the gradient sign is consistent across consecutive steps:

If sign(g_t) == sign(g_{t-1}): multiply the step size by etas[1] (default 1.2)
If sign(g_t) != sign(g_{t-1}): multiply the step size by etas[0] (default 0.5)
The step size is then clipped to the step_sizes bounds (default 1e-6 to 50)

PyTorch:

optimizer = torch.optim.Rprop(
    model.parameters(), lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-6, 50)
)

TensorFlow: No built-in Rprop. RMSprop (tf.keras.optimizers.RMSprop) is Rprop's direct descendant and is the practical TF substitute for mini-batch settings.

Rprop works well for full-batch training and small networks but poorly in mini-batch settings, where sampling noise flips gradient signs from step to step and keeps shrinking the step sizes.
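The sign-based rule can be sketched in plain Python as follows. This version skips the parameter update on the step where the sign flips, which is one common variant. The function name `rprop` and the toy gradient are assumptions, and the defaults mirror the values quoted above.

```python
def sign(x):
    return (x > 0) - (x < 0)

def rprop(grad_fn, theta0, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-6, 50.0), steps=200):
    """Full-batch Rprop with one adaptive step size per parameter."""
    eta_minus, eta_plus = etas
    step_min, step_max = step_sizes
    theta = list(theta0)
    step = [lr] * len(theta)      # lr only sets the initial step size
    prev_g = [0.0] * len(theta)
    for _ in range(steps):
        g = list(grad_fn(theta))
        for i in range(len(theta)):
            agreement = g[i] * prev_g[i]
            if agreement > 0:
                # Same sign as last step: grow the step size
                step[i] = min(step[i] * eta_plus, step_max)
            elif agreement < 0:
                # Sign flip: shrink the step size and skip this update
                step[i] = max(step[i] * eta_minus, step_min)
                g[i] = 0.0
            # Move against the gradient sign; the magnitude of g is never used
            theta[i] -= sign(g[i]) * step[i]
            prev_g[i] = g[i]
    return theta

def grad(theta):
    # Gradient of the toy loss 0.5 * (100 * x^2 + y^2): magnitudes differ by 100x
    return [100.0 * theta[0], theta[1]]

theta = rprop(grad, [1.0, 1.0])
print(theta)
```

Both coordinates start at the same value and follow exactly the same trajectory even though their gradients differ by a factor of 100, because only the sign of each gradient enters the update.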

Choosing Between SGD Variants

Variant	Best for
SGD (no momentum)	Convex baselines, simple experiments
SGD + momentum	Most supervised deep learning
SGD + Nesterov	When classical momentum shows oscillations
ASGD	Convex/quasi-convex language models, final-phase averaging
Rprop	Full-batch training, small networks

For large-scale convolutional vision models such as ResNets, well-tuned SGD with momentum and a cosine LR schedule often matches or beats Adam. Vision transformers (ViTs), by contrast, are more commonly trained with AdamW.

References

Sutskever et al. 2013 — On the importance of initialization and momentum in deep learning

Polyak & Juditsky 1992 — Acceleration of stochastic approximation by averaging
