Supplement · Optimizers

Adaptive Learning Rate Methods


By the end of this reading you will be able to:

Derive the Adagrad update rule and explain why its monotonically increasing squared-gradient accumulator causes the effective learning rate to approach zero
Explain how RMSprop's exponential moving average fixes Adagrad's dying-rate problem and identify the effect of the decay factor ρ (alpha in PyTorch) on gradient memory
Trace the Adadelta update through its two EMAs and explain how the numerator's update-scale EMA eliminates the global learning rate hyperparameter
Select among Adagrad, RMSprop, and Adadelta given constraints on gradient sparsity, stationarity, and hyperparameter tuning budget

The Motivation for Per-Parameter Learning Rates

A single global learning rate treats every parameter identically. But parameters differ dramatically: input-layer weights in a language model see many zero gradients (sparse tokens), while output weights receive dense updates every step. Adaptive methods assign each parameter its own effective learning rate based on its gradient history.

Adagrad

Adagrad accumulates the sum of squared gradients $G_t$ for each parameter and divides the learning rate by its square root:

$G_{t,i} = G_{t-1,i} + g_{t,i}^2$

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \varepsilon}} \, g_{t,i}$

Parameters that receive large gradients accumulate a large $G$ , reducing their effective learning rate. Parameters with rare, small gradients keep a small $G$ , retaining a large effective rate. This makes Adagrad ideal for sparse features (e.g., word embeddings with infrequent tokens).
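To make the per-parameter behavior concrete, here is a minimal pure-Python sketch of the two equations above. The function name adagrad_step and the toy gradient pattern are illustrative assumptions, not part of any library. One parameter receives a gradient on every step, the other only on every tenth step.

```python
import math

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-10):
    """One Adagrad step on lists of floats; returns (new_theta, new_accum)."""
    # G_t = G_{t-1} + g_t^2, tracked separately for each parameter
    new_accum = [G + g * g for G, g in zip(accum, grad)]
    # theta_{t+1} = theta_t - lr / sqrt(G_t + eps) * g_t
    new_theta = [
        th - lr / math.sqrt(G + eps) * g
        for th, g, G in zip(theta, grad, new_accum)
    ]
    return new_theta, new_accum

# Hypothetical toy setup: parameter 0 is "dense", parameter 1 is "sparse".
theta, accum = [0.0, 0.0], [0.0, 0.0]
for t in range(100):
    grad = [1.0, 1.0 if t % 10 == 0 else 0.0]
    theta, accum = adagrad_step(theta, grad, accum)

# Effective per-parameter learning rate after 100 steps
effective_lr = [0.01 / math.sqrt(G + 1e-10) for G in accum]
print(accum, effective_lr)
```

The sparse parameter ends with the smaller accumulator and therefore the larger effective rate, which is the property that suits Adagrad to infrequent features.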

PyTorch:

import torch  # model is assumed to be an existing torch.nn.Module
optimizer = torch.optim.Adagrad(
    model.parameters(), lr=0.01, eps=1e-10, weight_decay=0
)

TensorFlow:

import tensorflow as tf
optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=0.01, epsilon=1e-10
)

The fatal flaw: $G_t$ is monotonically increasing and never forgets. After many steps, every parameter's effective learning rate approaches zero and learning stops. Adagrad is mostly useful for convex shallow models or short training runs.
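The decay can be checked numerically. The sketch below assumes an idealized case in which one parameter sees the same gradient on every step; the helper name adagrad_effective_lr is hypothetical. Under that assumption $G_t = t \, g^2$, so the effective rate falls off like $1/\sqrt{t}$.

```python
import math

def adagrad_effective_lr(num_steps, grad=1.0, lr=0.01, eps=1e-10):
    """Effective rate lr / sqrt(G_t + eps) after num_steps identical gradients."""
    G = 0.0
    for _ in range(num_steps):
        G += grad * grad  # the accumulator only ever grows
    return lr / math.sqrt(G + eps)

# Effective learning rate after 1, 100, and 10,000 steps
rates = {t: adagrad_effective_lr(t) for t in (1, 100, 10_000)}
print(rates)
```

Each hundredfold increase in step count cuts the effective rate by a factor of ten, and nothing in the update can ever raise it again.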

RMSprop

RMSprop (Hinton, unpublished 2012) fixes Adagrad's dying-rate problem by using an exponential moving average of squared gradients instead of the cumulative sum:

$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2$

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \varepsilon}} \, g_t$

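The decay factor $\rho$ controls how quickly the history is forgotten (PyTorch defaults to 0.99, Keras to 0.9). A squared gradient from $k$ steps ago carries weight $(1-\rho)\,\rho^k$, so the average has an effective memory of roughly $1/(1-\rho)$ steps: with $\rho = 0.99$, a gradient from 100 steps ago still counts about 37% as much as the newest one, and its weight becomes negligible only after several hundred steps. Because old terms decay, the normalizer tracks the recent gradient scale, and the effective learning rate adapts without shrinking toward zero.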
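Unrolling the EMA recursion from a zero initial state gives $E[g^2]_t = (1-\rho) \sum_{k \ge 0} \rho^k \, g_{t-k}^2$, so the squared gradient from $k$ steps ago carries weight $(1-\rho)\,\rho^k$. The following sketch computes those weights relative to the newest gradient and checks the unrolled form against the recursion. The function names are illustrative assumptions.

```python
def ema_weight(k, rho=0.99):
    """Weight an EMA with decay rho gives the squared gradient from k steps ago."""
    return (1.0 - rho) * rho ** k

def rmsprop_second_moment(grads, rho=0.99):
    """Run E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2 from a zero start."""
    avg = 0.0
    for g in grads:
        avg = rho * avg + (1.0 - rho) * g * g
    return avg

# Weight of an old gradient relative to the most recent one (equals rho ** k)
relative = {k: ema_weight(k) / ema_weight(0) for k in (10, 100, 500)}

# The recursion and the unrolled weighted sum agree on a short sequence.
grads = [1.0, 2.0, 3.0]
recursive = rmsprop_second_moment(grads)
unrolled = sum(ema_weight(k) * g * g for k, g in enumerate(reversed(grads)))
print(relative, recursive, unrolled)
```

The relative weight falls geometrically with age, which is what lets the average follow a changing gradient scale instead of accumulating forever.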

PyTorch:

import torch  # model is assumed to be an existing torch.nn.Module
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=0.01, alpha=0.99, eps=1e-8,
    momentum=0, centered=False, weight_decay=0
)

TensorFlow:

import tensorflow as tf
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=0.001, rho=0.99, epsilon=1e-8,
    momentum=0.0, centered=False
)
# Note: PyTorch param is 'alpha'; TF param is 'rho' — same concept

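Centered RMSprop (centered=True) also tracks an EMA of the gradient itself, $E[g]_t$, and normalizes by the square root of the estimated gradient variance rather than the raw second moment:

$\text{Var}[g]_t = E[g^2]_t - (E[g]_t)^2$

Subtracting the squared mean removes the mean gradient's contribution, making the normalizer a true variance estimate. This can improve convergence on some problems, such as recurrent networks, at the cost of one extra state buffer per parameter.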
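The difference between the two normalizers shows up when gradients have a large mean and small fluctuations. The sketch below is a simplified illustration with a hypothetical function name; it keeps $\varepsilon$ inside the square root to match the equations in this reading, whereas framework implementations may place it differently.

```python
import math

def rmsprop_normalizers(grads, rho=0.99, eps=1e-8):
    """Return (uncentered, centered) normalizers after processing grads.

    uncentered = sqrt(E[g^2] + eps)
    centered   = sqrt(E[g^2] - E[g]^2 + eps)
    """
    sq_avg, g_avg = 0.0, 0.0
    for g in grads:
        sq_avg = rho * sq_avg + (1.0 - rho) * g * g  # EMA of squared gradients
        g_avg = rho * g_avg + (1.0 - rho) * g        # EMA of raw gradients
    variance = max(sq_avg - g_avg * g_avg, 0.0)      # guard against round-off
    return math.sqrt(sq_avg + eps), math.sqrt(variance + eps)

# Toy gradients alternating around a mean of 1.0 with a spread of 0.1
grads = [1.1 if i % 2 == 0 else 0.9 for i in range(2000)]
uncentered, centered = rmsprop_normalizers(grads)
print(uncentered, centered)
```

Here the uncentered normalizer stays near the gradient magnitude while the centered one tracks only the spread, so the centered variant takes larger steps along directions where the gradient is consistent.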

Adadelta

Adadelta (Zeiler 2012) pushes the adaptive idea further by eliminating the learning rate hyperparameter entirely. It tracks both a running average of squared gradients and a running average of squared parameter updates:

$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1-\rho) \, g_t^2$

$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} \, g_t$

$E[\Delta\theta^2]_t = \rho \, E[\Delta\theta^2]_{t-1} + (1-\rho) \, \Delta\theta_t^2$

The numerator $\sqrt{E[\Delta\theta^2] + \varepsilon}$ self-calibrates the update scale based on recent parameter changes, giving each parameter an automatically tuned effective step size.
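The three equations can be traced in order with a scalar sketch. This is a minimal illustration under stated assumptions (a hypothetical adadelta_step helper, both EMAs initialized to zero, and a toy objective $f(\theta) = \theta^2$), not a reproduction of any framework's implementation.

```python
import math

def adadelta_step(theta, grad, sq_grad_avg, sq_delta_avg, rho=0.9, eps=1e-6):
    """One scalar Adadelta step; returns (theta, sq_grad_avg, sq_delta_avg)."""
    # E[g^2]_t: EMA of squared gradients
    sq_grad_avg = rho * sq_grad_avg + (1.0 - rho) * grad * grad
    # The update uses the previous update-scale EMA, E[dtheta^2]_{t-1}
    delta = -math.sqrt(sq_delta_avg + eps) / math.sqrt(sq_grad_avg + eps) * grad
    # E[dtheta^2]_t: EMA of squared updates
    sq_delta_avg = rho * sq_delta_avg + (1.0 - rho) * delta * delta
    return theta + delta, sq_grad_avg, sq_delta_avg

# Minimize f(theta) = theta^2 (gradient 2 * theta) without a learning rate.
theta, sq_grad_avg, sq_delta_avg = 5.0, 0.0, 0.0
first_delta = None
for step in range(200):
    grad = 2.0 * theta
    new_theta, sq_grad_avg, sq_delta_avg = adadelta_step(
        theta, grad, sq_grad_avg, sq_delta_avg
    )
    if first_delta is None:
        first_delta = new_theta - theta  # size of the very first update
    theta = new_theta
print(first_delta, theta)
```

With both averages starting at zero, the first update is on the order of $\sqrt{\varepsilon}$, so $\varepsilon$ effectively sets the initial step scale and early progress is slow until the update-scale average builds up.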

PyTorch:

import torch  # model is assumed to be an existing torch.nn.Module
optimizer = torch.optim.Adadelta(
    model.parameters(), lr=1.0, rho=0.9, eps=1e-6, weight_decay=0
)

TensorFlow:

import tensorflow as tf
optimizer = tf.keras.optimizers.Adadelta(
    learning_rate=1.0, rho=0.9, epsilon=1e-6
)
# learning_rate=1.0 gives the pure Adadelta update, matching PyTorch's default

Note that PyTorch's implementation retains lr as a global scale applied to the Adadelta update; setting lr=1.0 (the default) gives the pure version. Keras defaults to learning_rate=0.001, so it must be set to 1.0 explicitly to get the same behavior. Adadelta is relatively insensitive to its remaining hyperparameters and works well when hyperparameter tuning budgets are limited.

Comparison

Method	Second-moment estimate	Forgets old gradients?	LR required?
Adagrad	Cumulative sum	No (monotone decay)	Yes
RMSprop	EMA (ρ)	Yes	Yes
Adadelta	EMA (ρ) + update EMA	Yes	No (optional scale)

When to Use Each

Adagrad — sparse NLP tasks (TF-IDF, bag-of-words), short convex training runs
RMSprop — RNNs, reinforcement learning (historically strong), quick convergence on non-stationary objectives
Adadelta — situations where LR search is expensive; robust default without tuning

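In practice, Adam and AdamW have largely superseded all three for dense deep learning. Understanding these methods remains essential because Adam is their direct descendant: it combines RMSprop's moving average of squared gradients with a moving average of the gradient itself.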

References

Zeiler 2012 — ADADELTA: An Adaptive Learning Rate Method

Duchi et al. 2011 — Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
