Supplement · Optimizers

The Adam Family

By the end of this reading you will be able to:
  • Derive Adam's bias correction terms and explain why dividing by (1 − β^t) is necessary when moment buffers are zero-initialized
  • Distinguish AdamW from Adam by tracing how decoupled weight decay changes the update rule and restores uniform L2 regularization semantics
  • Compare RAdam's rectification term to standard Adam and state the conditions under which RAdam falls back to SGD-like updates during early training
  • Select among Adam, AdamW, Adamax, NAdam, RAdam, and SparseAdam based on gradient sparsity, weight decay requirements, and training stability needs

Adam

Adam (Kingma & Ba 2014) combines momentum (first moment) with RMSprop's adaptive learning rate (second moment), and critically adds bias correction to compensate for zero-initialization:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \qquad \text{(first moment)}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \qquad \text{(second moment)}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad \text{(bias correction)}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t$$

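With $\beta_1 = 0.9$ and $\beta_2 = 0.999$, the second-moment correction factor $1 - \beta_2^t$ is tiny at $t = 1$ (exactly 0.001). After one step the raw buffer holds only $v_1 = 0.001\, g_1^2$, and dividing by 0.001 restores $\hat{v}_1 = g_1^2$. Without the correction, $\sqrt{v_1}$ would underestimate the gradient scale by a factor of about 32 while $m_1$ would be too small by only a factor of 10, so the first updates would be roughly three times larger than intended. The correction fades as $t$ grows, because $1 - \beta^t \to 1$.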
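As a quick numeric check of the bias correction, the sketch below computes the ratio $m / \sqrt{v}$ after a single step from zero-initialized buffers, with and without the correction. The function name `first_step_ratio` is hypothetical and the setup is a scalar toy case, not a full optimizer.

```python
import math

def first_step_ratio(g, beta1=0.9, beta2=0.999, corrected=True):
    """Return m / sqrt(v) after one Adam step from zero-initialized buffers."""
    m = (1 - beta1) * g          # m_1 = beta1 * 0 + (1 - beta1) * g
    v = (1 - beta2) * g * g      # v_1 = beta2 * 0 + (1 - beta2) * g^2
    if corrected:
        m /= 1 - beta1 ** 1      # divide by (1 - beta1^t) with t = 1
        v /= 1 - beta2 ** 1      # divide by (1 - beta2^t) with t = 1
    return m / math.sqrt(v)

raw = first_step_ratio(2.0, corrected=False)    # about 3.16: step too large
fixed = first_step_ratio(2.0, corrected=True)   # about 1.0: step of size lr
```

The uncorrected ratio equals $0.1 / \sqrt{0.001}$ regardless of the gradient value, which is why the inflation is systematic rather than data-dependent.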

import torch

# model is assumed to be any torch.nn.Module
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0
)

TensorFlow:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8
)
# AMSGrad variant: tf.keras.optimizers.Adam(..., amsgrad=True)

AMSGrad (amsgrad=True) uses the maximum of all past $\hat{v}_t$ values instead of the current estimate, guaranteeing a non-increasing effective learning rate and improving convergence guarantees at the cost of slightly slower adaptation.
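The effect of the running maximum can be seen on a scalar gradient stream. The following is a minimal sketch, assuming the maximum is taken over the raw second-moment buffer with bias correction omitted for brevity; `second_moment_trace` is a hypothetical helper, not a library function.

```python
def second_moment_trace(grads, beta2=0.999, amsgrad=False):
    """Return the per-step denominator statistic for a scalar gradient stream."""
    v, v_max, trace = 0.0, 0.0, []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
        v_max = max(v_max, v)            # AMSGrad keeps the largest v seen so far
        trace.append(v_max if amsgrad else v)
    return trace

grads = [10.0] + [0.1] * 50              # one spike, then small gradients
plain = second_moment_trace(grads)                  # decays after the spike
ams = second_moment_trace(grads, amsgrad=True)      # never decreases
```

After the spike, the plain estimate slowly forgets it and the step size grows again, while the AMSGrad statistic stays pinned at the spike's level.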

AdamW — Decoupled Weight Decay

In Adam, weight_decay adds $\lambda \theta$ to the gradient before the moment estimates are computed, so the decay term is rescaled by $1/\sqrt{\hat{v}}$ along with the loss gradient. Parameters with large historical gradient magnitudes therefore receive less regularization, breaking the intended uniform shrinkage.

AdamW (Loshchilov & Hutter 2017) applies weight decay after the adaptive update, independently of the gradient:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \lambda \theta_t \right)$$

This restores the uniform shrinkage that L2 regularization produces under plain SGD: every parameter is shrunk by the same fraction $\eta \lambda$ per step regardless of its gradient history. AdamW is the default choice for transformers, diffusion models, and most modern architectures.
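The difference between the two decay styles is easiest to see when the loss gradient is zero, so that only weight decay acts on the parameter. The sketch below is a scalar toy implementation under that assumption; `adam_step` and its `decoupled` flag are hypothetical names used for illustration, not the PyTorch API.

```python
import math

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One scalar Adam step; state is a dict with keys m, v, t."""
    if weight_decay and not decoupled:
        grad = grad + weight_decay * theta     # coupled L2: decay enters the moments
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        update += weight_decay * theta         # AdamW: decay bypasses 1/sqrt(v_hat)
    return theta - lr * update

def fresh():
    return {"m": 0.0, "v": 0.0, "t": 0}

# Zero loss gradient, so only the decay term moves the parameter.
adamw = adam_step(2.0, 0.0, fresh(), lr=0.1, weight_decay=0.1, decoupled=True)
coupled = adam_step(2.0, 0.0, fresh(), lr=0.1, weight_decay=0.1)
# adamw is 1.98: shrunk by lr * weight_decay = 1% of its value
# coupled is about 1.9: the decay was normalized into a full lr-sized step
```

In the coupled case the decay term is divided by its own magnitude, so the shrinkage no longer depends on $\lambda$ in the way plain weight decay would.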

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
    weight_decay=0.01  # typical range: 0.01 – 0.1
)

TensorFlow:

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=1e-3, weight_decay=0.01,
    beta_1=0.9, beta_2=0.999, epsilon=1e-8
)

Adamax

Adamax is a variant of Adam that replaces the L2 norm in the second moment with the L∞ (max) norm:

$$u_t = \max(\beta_2 u_{t-1}, |g_t|)$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t + \varepsilon} \hat{m}_t$$

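Because $u_t$ is a running maximum rather than an exponential average, it is not pulled toward its zero initialization: after the first step, $u_1 = |g_1|$ exactly. Adamax therefore needs no bias correction on $u_t$ (the first moment $\hat{m}_t$ is still corrected). It is more robust to large gradient spikes and can be useful for embedding layers with extreme gradient variance.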
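A scalar sketch of one Adamax step makes the missing correction visible. It follows the two Adamax equations above; the function name and the dict-based state are illustrative assumptions.

```python
def adamax_step(theta, g, state, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adamax step; state is a dict with keys m, u, t."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["u"] = max(beta2 * state["u"], abs(g))      # running max, no averaging
    m_hat = state["m"] / (1 - beta1 ** state["t"])    # only m is bias-corrected
    return theta - lr * m_hat / (state["u"] + eps)

state = {"m": 0.0, "u": 0.0, "t": 0}
theta = adamax_step(1.0, 0.5, state)
# state["u"] is already 0.5 after one step, so the step size is about lr
```

Compare this with Adam's raw second moment, which after one step holds only $(1 - \beta_2)\, g_1^2$ and has to be rescaled.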

optimizer = torch.optim.Adamax(
    model.parameters(), lr=2e-3, betas=(0.9, 0.999), eps=1e-8
)

TensorFlow:

optimizer = tf.keras.optimizers.Adamax(
    learning_rate=2e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8
)

NAdam — Nesterov Adam

NAdam (Dozat 2016) incorporates Nesterov momentum into Adam by using a lookahead first moment: the momentum that would carry into the next step is applied in the current update, alongside the bias-corrected current gradient. The lookahead term uses the raw buffer $m_t$, with $1 - \beta_1^{t+1}$ serving as its bias correction:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon} \left( \frac{\beta_1 m_t}{1 - \beta_1^{t+1}} + \frac{(1-\beta_1) g_t}{1 - \beta_1^t} \right)$$

In practice, NAdam often converges slightly faster than Adam on smooth objectives and is a low-risk drop-in replacement.
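The lookahead update can be sketched for a scalar parameter as follows. This is a simplified version that assumes a constant $\beta_1$ (no momentum schedule such as PyTorch's momentum_decay) and applies the $1 - \beta_1^{t+1}$ correction to the raw first-moment buffer; `nadam_step` is a hypothetical name.

```python
import math

def nadam_step(theta, g, state, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar NAdam step with constant beta1; state has keys m, v, t."""
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    v_hat = state["v"] / (1 - beta2 ** t)
    # Momentum as it will look at step t + 1, plus the corrected current gradient
    lookahead = (beta1 * state["m"] / (1 - beta1 ** (t + 1))
                 + (1 - beta1) * g / (1 - beta1 ** t))
    return theta - lr * lookahead / (math.sqrt(v_hat) + eps)

state = {"m": 0.0, "v": 0.0, "t": 0}
theta = nadam_step(0.0, 1.0, state)
```

On the first step the lookahead term adds roughly half a gradient's worth of extra movement compared with Adam, which is the Nesterov "peek ahead" in action.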

optimizer = torch.optim.NAdam(
    model.parameters(), lr=2e-3, betas=(0.9, 0.999),
    momentum_decay=0.004  # gradual momentum warmup
)

TensorFlow:

optimizer = tf.keras.optimizers.Nadam(
    learning_rate=2e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8
)

RAdam — Rectified Adam

RAdam (Liu et al. 2019) diagnoses Adam's instability in early training: the second moment estimate has high variance when $t$ is small, causing the effective learning rate to fluctuate wildly. RAdam computes an analytical rectification term $r_t$ based on the estimated variance of the second moment:

$$\rho_t = \rho_\infty - \frac{2t\beta_2^t}{1 - \beta_2^t}, \quad \rho_\infty = \frac{2}{1 - \beta_2} - 1$$

  • If $\rho_t > 4$: the variance is tractable, so apply the rectified adaptive update (like Adam)
  • Else: fall back to SGD with momentum (no adaptive scaling)

This gives RAdam a built-in warmup with no extra schedule to tune: the earliest steps are SGD-like, and the adaptive phase engages once the second moment estimate stabilizes.
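To make the switch-over concrete, the sketch below evaluates $\rho_t$ and finds the first step at which the adaptive branch is taken. The closed form for $r_t$ is taken from the RAdam paper rather than from the text above, and the function names are hypothetical; the cutoff of 4 follows the rule stated here, and implementations may use a slightly different value.

```python
import math

def rho(t, beta2=0.999):
    """Length of the approximated simple moving average at step t."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    return rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

def rectification(t, beta2=0.999):
    """Return r_t, or None while rho_t <= 4 (SGD-with-momentum fallback)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho(t, beta2)
    if rho_t <= 4.0:
        return None
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)

# First step at which the adaptive (rectified) update is used
first_adaptive = next(t for t in range(1, 100) if rectification(t) is not None)
```

With $\beta_2 = 0.999$ the fallback covers only the first four steps; after that, $r_t$ starts near zero and rises toward 1, so the adaptive update is phased in gradually rather than switched on at full strength.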

optimizer = torch.optim.RAdam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8
)

TensorFlow: core TF has no built-in RAdam. It is available via TF-Addons (tfa.optimizers.RectifiedAdam), though that package is no longer actively developed, or it can be implemented from the paper. In practice, AdamW with a short linear learning-rate warmup (LinearLR in PyTorch) achieves similar early-training stability.

SparseAdam

Standard Adam updates the second moment for every parameter every step, even when a parameter's gradient is zero (e.g., an embedding row for a token not in the batch). SparseAdam applies a lazy update: only parameters with non-zero gradients are updated at each step.
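The contrast between dense and lazy updates can be sketched without any framework. In the toy code below each embedding row has a single scalar second-moment value, and the sparse gradient is a dict from row index to gradient; both helpers are hypothetical illustrations, not the PyTorch implementation.

```python
def lazy_moment_update(v_rows, sparse_grad, beta2=0.999):
    """Lazy update: touch only the rows that appear in sparse_grad."""
    out = list(v_rows)
    for idx, g in sparse_grad.items():
        out[idx] = beta2 * out[idx] + (1 - beta2) * g * g
    return out

def dense_moment_update(v_rows, sparse_grad, beta2=0.999):
    """Dense update: every row decays; untouched rows see a zero gradient."""
    return [beta2 * v + (1 - beta2) * sparse_grad.get(i, 0.0) ** 2
            for i, v in enumerate(v_rows)]

v = [1.0, 1.0, 1.0]
grad = {1: 2.0}                           # only row 1 was used in this batch
lazy = lazy_moment_update(v, grad)        # rows 0 and 2 are left unchanged
dense = dense_moment_update(v, grad)      # rows 0 and 2 decay by beta2
```

For a large vocabulary where each batch touches a small fraction of rows, the lazy form skips most of the work and leaves rare tokens' statistics intact between appearances.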

# Only valid for parameters that produce sparse gradients,
# e.g. an nn.Embedding constructed with sparse=True
optimizer = torch.optim.SparseAdam(
    [{'params': model.embedding.parameters()}],
    lr=1e-3, betas=(0.9, 0.999), eps=1e-8
)

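TensorFlow: there is no separate SparseAdam class. tf.keras.optimizers.Adam accepts sparse gradients as tf.IndexedSlices (the TF equivalent of sparse tensors), but its documentation notes that the sparse path still applies moment decay to every row, so the result matches the dense update rather than a lazy one. A lazy variant was provided by TF-Addons as tfa.optimizers.LazyAdam.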

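Constraint: SparseAdam only supports sparse gradients. Dense parameters need a separate optimizer instance (for example, SparseAdam for the embedding and AdamW for the rest of the model, with both stepped every iteration). Alternatively, keep the embedding dense and use AdamW for everything.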

Adam Family Summary

| Variant | Key difference | Default weight_decay (PyTorch) |
|---|---|---|
| Adam | First + second moment with bias correction | 0 |
| AdamW | Decoupled weight decay | 0.01 |
| Adamax | L∞ second moment, robust to spikes | 0 |
| NAdam | Nesterov lookahead first moment | 0 |
| RAdam | Variance-rectified warmup, auto SGD fallback | 0 |
| SparseAdam | Lazy update for sparse gradients | n/a |