Adaptive Learning Rate Methods
- Derive the Adagrad update rule and explain why its monotonically increasing squared-gradient accumulator causes the effective learning rate to approach zero
- Explain how RMSprop's exponential moving average fixes Adagrad's dying-rate problem and identify the effect of the decay factor α on gradient memory
- Trace the Adadelta update through its two EMAs and explain how the numerator's update-scale EMA eliminates the global learning rate hyperparameter
- Select among Adagrad, RMSprop, and Adadelta given constraints on gradient sparsity, stationarity, and hyperparameter tuning budget
The Motivation for Per-Parameter Learning Rates
A single global learning rate treats every parameter identically. But parameters differ dramatically: in a language model, the embedding row for a rare token receives a zero gradient on most steps, while output weights receive dense updates every step. Adaptive methods assign each parameter its own effective learning rate based on its gradient history.
Adagrad
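Adagrad accumulates the sum of squared gradients for each parameter and divides the learning rate by its square root:
$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$
Here $g_t$ is the gradient at step $t$, $\eta$ is the global learning rate, $\epsilon$ is a small constant for numerical stability, and all operations are elementwise.
Parameters that receive large gradients accumulate a large $G_t$, reducing their effective learning rate $\eta / (\sqrt{G_t} + \epsilon)$. Parameters with rare, small gradients keep a small $G_t$, retaining a large effective rate. This makes Adagrad well suited to sparse features (e.g., word embeddings with infrequent tokens).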
PyTorch:
import torch
# model: an existing torch.nn.Module
optimizer = torch.optim.Adagrad(
    model.parameters(), lr=0.01, eps=1e-10, weight_decay=0
)
TensorFlow:
import tensorflow as tf
optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=0.01, epsilon=1e-10
)
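The fatal flaw: the accumulator $G_t$ (the running sum of squared gradients) never decreases and never forgets. For any parameter that keeps receiving nonzero gradients, the effective learning rate $\eta / (\sqrt{G_t} + \epsilon)$ shrinks toward zero and learning eventually stalls. Adagrad is therefore mostly useful for convex shallow models or short training runs.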
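The shrinking rate can be seen in a minimal pure-Python sketch of the accumulator for a single scalar parameter. The function name `adagrad_effective_lr` and the two synthetic gradient streams are illustrative assumptions, not part of any library; the stream `dense` has a unit gradient on every step, while `sparse` has one on only 1% of steps.

```python
import math

LR, EPS = 0.01, 1e-10  # same values as the optimizer calls above


def adagrad_effective_lr(grads):
    """Run Adagrad's accumulator over a sequence of scalar gradients.

    Returns the effective learning rate LR / (sqrt(G_t) + EPS) after each step.
    """
    accum = 0.0
    history = []
    for g in grads:
        accum += g * g  # running sum of squares; it never decreases
        history.append(LR / (math.sqrt(accum) + EPS))
    return history


steps = 1000
dense = [1.0] * steps                                          # gradient on every step
sparse = [1.0 if t % 100 == 0 else 0.0 for t in range(steps)]  # gradient on 1% of steps

dense_lr = adagrad_effective_lr(dense)
sparse_lr = adagrad_effective_lr(sparse)

print(f"dense  step 1: {dense_lr[0]:.6f}  step 1000: {dense_lr[-1]:.6f}")
print(f"sparse step 1: {sparse_lr[0]:.6f}  step 1000: {sparse_lr[-1]:.6f}")
```

With these inputs the dense parameter's rate falls by a factor of about 32 (the square root of 1000), while the sparse parameter's rate falls by only about 3. Neither rate ever recovers, which is the behavior the paragraph above describes.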
RMSprop
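RMSprop (Hinton, unpublished 2012) fixes Adagrad's dying-rate problem by using an exponential moving average (EMA) of squared gradients instead of the cumulative sum:
$$v_t = \alpha\, v_{t-1} + (1 - \alpha)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$
The decay factor $\alpha$ (typically 0.99) controls how quickly the history is forgotten. A gradient from $k$ steps ago carries weight proportional to $\alpha^k$, so the average spans roughly $1/(1-\alpha) = 100$ steps. A gradient from 100 steps ago still has about 37% of the newest gradient's weight ($0.99^{100} \approx 0.37$), and contributions become negligible only after several hundred steps. Because $v_t$ can shrink as well as grow, the effective learning rate adapts but never collapses.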
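PyTorch:
import torch
# model: an existing torch.nn.Module
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=0.01, alpha=0.99, eps=1e-8,
    momentum=0, centered=False, weight_decay=0
)
TensorFlow:
import tensorflow as tf
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=0.001, rho=0.99, epsilon=1e-8,
    momentum=0.0, centered=False
)
# Note: PyTorch calls the decay factor 'alpha'; TF calls it 'rho'. Same concept.
# The lr values shown are each library's default (0.01 in PyTorch, 0.001 in Keras).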
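To make the role of the decay factor concrete, here is a minimal pure-Python sketch for a single scalar parameter. The helper names `ema_weight` and `rmsprop_effective_lr` are illustrative assumptions, and the recursion omits momentum and centering. It prints how much weight an old squared gradient retains relative to the newest one, then shows that the effective rate settles instead of decaying under a constant gradient.

```python
import math

ALPHA, LR, EPS = 0.99, 0.01, 1e-8


def ema_weight(k, alpha=ALPHA):
    """Weight the EMA places on the squared gradient from k steps ago."""
    return (1.0 - alpha) * alpha ** k


def rmsprop_effective_lr(grads, alpha=ALPHA):
    """Effective learning rate LR / (sqrt(v_t) + EPS) after each step."""
    v = 0.0
    history = []
    for g in grads:
        v = alpha * v + (1.0 - alpha) * g * g  # EMA replaces Adagrad's running sum
        history.append(LR / (math.sqrt(v) + EPS))
    return history


# Weight of an old gradient relative to the newest one is alpha**k.
for k in (10, 100, 500):
    print(f"k={k:>3}: relative weight {ema_weight(k) / ema_weight(0):.3f}")

# Under a constant unit gradient, v_t tends to 1, so the rate settles near LR.
rates = rmsprop_effective_lr([1.0] * 5000)
print(f"effective rate after 5000 steps: {rates[-1]:.5f}")
```

Lowering `ALPHA` shortens the memory: with 0.9 the average spans roughly 10 steps, so the normalizer reacts faster but is noisier.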
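Centered RMSprop (centered=True) normalizes by an estimate of the gradient's variance rather than the raw second moment:
$$\bar{g}_t = \alpha\, \bar{g}_{t-1} + (1 - \alpha)\, g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t - \bar{g}_t^2} + \epsilon}\, g_t$$
Here $v_t$ is the EMA of squared gradients and $\bar{g}_t$ is an EMA of the gradients themselves. Subtracting $\bar{g}_t^2$ removes the mean gradient's contribution, so the normalizer estimates the variance. This can improve convergence on recurrent networks.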
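The difference between the two normalizers can be sketched in pure Python for a single scalar parameter. The function `normalizers`, the synthetic gradient stream, and the `max(..., 0.0)` guard against rounding are assumptions of this sketch rather than a description of any library's internals.

```python
import math

ALPHA, EPS = 0.99, 1e-8


def normalizers(grads, alpha=ALPHA):
    """Return the (plain, centered) RMSprop denominators after the last gradient."""
    v = 0.0      # EMA of squared gradients (raw second moment)
    g_bar = 0.0  # EMA of gradients (mean estimate)
    for g in grads:
        v = alpha * v + (1.0 - alpha) * g * g
        g_bar = alpha * g_bar + (1.0 - alpha) * g
    plain = math.sqrt(v) + EPS
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    centered = math.sqrt(max(v - g_bar * g_bar, 0.0)) + EPS
    return plain, centered


# Gradient with a large mean (1.0) and a small oscillation (+/- 0.1).
grads = [1.1 if t % 2 == 0 else 0.9 for t in range(5000)]
plain, centered = normalizers(grads)
print(f"plain denominator:    {plain:.3f}")
print(f"centered denominator: {centered:.3f}")
```

For this stream the plain denominator is close to the mean gradient magnitude (about 1.0), while the centered one tracks only the oscillation (about 0.1), so the centered variant takes a step roughly ten times larger.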
Adadelta
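Adadelta (Zeiler 2012) pushes the adaptive idea further by eliminating the learning rate hyperparameter entirely. It tracks both a running average of squared gradients and a running average of squared parameter updates:
$$\begin{aligned}
E[g^2]_t &= \rho\, E[g^2]_{t-1} + (1 - \rho)\, g_t^2 \\
\Delta\theta_t &= -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t \\
E[\Delta\theta^2]_t &= \rho\, E[\Delta\theta^2]_{t-1} + (1 - \rho)\, \Delta\theta_t^2 \\
\theta_{t+1} &= \theta_t + \Delta\theta_t
\end{aligned}$$
Here $\rho$ is the decay factor shared by both EMAs and $\epsilon$ is a small constant that also seeds the very first update, when both averages are still zero.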
The numerator self-calibrates the update scale based on recent parameter changes, giving each parameter an automatically tuned effective step size.
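PyTorch:
import torch
# model: an existing torch.nn.Module
optimizer = torch.optim.Adadelta(
    model.parameters(), lr=1.0, rho=0.9, eps=1e-6, weight_decay=0
)
TensorFlow:
import tensorflow as tf
optimizer = tf.keras.optimizers.Adadelta(
    learning_rate=1.0, rho=0.9, epsilon=1e-6
)
# Keras defaults to learning_rate=0.001, so set it explicitly:
# learning_rate=1.0 gives the pure Adadelta update, matching PyTorch's default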
Note that PyTorch's implementation retains lr as a global scale applied to the Adadelta update; the default lr=1.0 gives the pure version. Because the pure update has no learning rate to tune, Adadelta works well when hyperparameter tuning budgets are limited.
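The two-EMA update can be traced with a minimal pure-Python sketch for a single scalar parameter. The names `adadelta_step` and `first_updates` are illustrative assumptions. The sketch feeds a constant gradient at two very different scales and prints the resulting updates, to show that the step size is set by the ratio of the two EMAs rather than by the raw gradient magnitude.

```python
import math

RHO, EPS = 0.9, 1e-6  # same values as the optimizer calls above


def adadelta_step(grad, avg_sq_grad, avg_sq_delta):
    """One pure Adadelta update (global scale lr = 1.0) for a scalar parameter.

    avg_sq_grad  : EMA of squared gradients (denominator)
    avg_sq_delta : EMA of squared updates   (numerator)
    Returns the update to add to the parameter and both refreshed EMAs.
    """
    avg_sq_grad = RHO * avg_sq_grad + (1.0 - RHO) * grad * grad
    # The numerator uses the update EMA from the previous step.
    delta = -math.sqrt(avg_sq_delta + EPS) / math.sqrt(avg_sq_grad + EPS) * grad
    avg_sq_delta = RHO * avg_sq_delta + (1.0 - RHO) * delta * delta
    return delta, avg_sq_grad, avg_sq_delta


def first_updates(grad_scale, steps=5):
    """Updates produced by a constant gradient of size grad_scale."""
    avg_sq_grad = avg_sq_delta = 0.0
    deltas = []
    for _ in range(steps):
        delta, avg_sq_grad, avg_sq_delta = adadelta_step(
            grad_scale, avg_sq_grad, avg_sq_delta
        )
        deltas.append(delta)  # the parameter update would be theta += delta
    return deltas


small = first_updates(1.0)
large = first_updates(1000.0)
for s, l in zip(small, large):
    print(f"grad=1: {s:+.6f}   grad=1000: {l:+.6f}")
```

The two columns agree to within rounding even though the gradients differ by a factor of 1000, which is why no hand-tuned learning rate is needed to compensate for gradient scale. The very first updates are small because the numerator starts from `EPS` alone.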
Comparison
| Method | Second-moment estimate | Forgets old gradients? | LR required? |
|---|---|---|---|
| Adagrad | Cumulative sum | No (monotone decay) | Yes |
| RMSprop | EMA (α) | Yes | Yes |
| Adadelta | EMA (ρ) + update EMA | Yes | No (optional scale) |
When to Use Each
- Adagrad — sparse NLP tasks (TF-IDF, bag-of-words), short convex training runs
- RMSprop — RNNs, reinforcement learning (historically strong), quick convergence on non-stationary objectives
- Adadelta — situations where LR search is expensive; robust default without tuning
In practice, Adam and AdamW have largely superseded all three for dense deep learning. Understanding these methods remains essential because Adam builds directly on them: it combines RMSprop's EMA of squared gradients with momentum and bias correction.