
Learning Rate Schedulers

By the end of this reading you will be able to:
  • Compute the learning rate at a given step for StepLR, ExponentialLR, and CosineAnnealingLR using their respective formulas
  • Implement a linear warmup schedule using LinearLR and explain why warmup prevents early-training instability in large-batch and transformer settings
  • Configure ReduceLROnPlateau and explain how patience, threshold, and cooldown control when and by how much the learning rate is reduced
  • Implement a custom learning rate schedule using LambdaLR and verify the trajectory using get_last_lr()

Why Schedule the Learning Rate?

A fixed learning rate is a compromise: large enough to make progress early, small enough to converge precisely late. Scheduling solves both ends: start large (fast progress) and decay toward a small value (precise convergence). Some schedules additionally use a warmup phase where the LR increases from near-zero to its peak, stabilizing adaptive optimizers during the high-variance early steps.

The Scheduler API

All schedulers wrap an optimizer and expose scheduler.step(), which must be called after optimizer.step():

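import torch

# `model` and `train_one_epoch` are placeholders for your own module and training loop
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train_one_epoch()   # calls optimizer.step() for every batch
    scheduler.step()    # once per epoch, after the optimizer steps, not before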

TensorFlow: Schedulers are passed directly to the optimizer constructor as a LearningRateSchedule:

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1, decay_steps=100
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
# No scheduler.step() needed: TF advances the schedule on every optimizer step,
# so decay_steps counts batches, not epochs

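To resume from a checkpoint, restore the optimizer state first, then pass last_epoch so the scheduler continues from the right epoch. A non-default last_epoch requires an initial_lr entry in every optimizer param group (a restored optimizer state_dict carries it); without it the constructor raises a KeyError. Saving and reloading scheduler.state_dict() is the more robust alternative.

from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=30, last_epoch=current_epoch)  # optimizer state already restored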
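Resuming through state_dict() can be sketched as follows. This is a minimal illustration, not a full checkpointing recipe: the tiny Linear model, the helper make_training_state, and the in-memory checkpoint dict are assumptions made for the example (a real run would write the dict with torch.save).

```python
import math
import torch
from torch.optim.lr_scheduler import StepLR

def make_training_state():
    """Hypothetical helper: builds a toy model, optimizer, and scheduler."""
    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
    return model, optimizer, scheduler

# "Train" for 35 epochs (no gradients here; only the schedule matters).
model, optimizer, scheduler = make_training_state()
for epoch in range(35):
    optimizer.step()
    scheduler.step()

# Save both states; in practice this dict would go through torch.save.
checkpoint = {
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}

# Rebuild fresh objects, then restore optimizer first and scheduler second.
_, resumed_optimizer, resumed_scheduler = make_training_state()
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
resumed_scheduler.load_state_dict(checkpoint["scheduler"])

print(resumed_scheduler.last_epoch, resumed_scheduler.get_last_lr())
```

The resumed scheduler reports the same epoch counter and learning rate as the original, so the next scheduler.step() continues the decay where it left off.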

Step-Based Decay

StepLR

Decays the learning rate by gamma every step_size epochs: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t / \text{step\_size} \rfloor}$

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# LR: 0.1 → 0.01 at epoch 30 → 0.001 at epoch 60

TensorFlow:

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=30, decay_rate=0.1, staircase=True
)
# staircase=True gives step-decay; staircase=False gives smooth exponential
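To compute the StepLR value at a given epoch by hand, the formula can be evaluated directly. The function name step_lr below is a hypothetical helper for illustration; it does not call into PyTorch.

```python
import math

def step_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Closed-form StepLR: base_lr * gamma ** floor(epoch / step_size)."""
    return base_lr * gamma ** (epoch // step_size)

# Same settings as the example above: lr=0.1, step_size=30, gamma=0.1.
for epoch in (0, 29, 30, 59, 60):
    print(epoch, step_lr(0.1, 0.1, 30, epoch))
```

The value stays at the base LR through epoch 29 and drops by a factor of gamma exactly at epochs 30 and 60.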

MultiStepLR

Decays at specified milestone epochs rather than a fixed interval:

scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 80], gamma=0.1
)

Useful when you know in advance when the loss plateaus (e.g., after major data augmentation changes).
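The milestone rule amounts to counting how many milestones have already been reached. A small sketch, assuming a hypothetical helper multistep_lr and sorted milestones:

```python
import math
from bisect import bisect_right

def multistep_lr(base_lr, gamma, milestones, epoch):
    """base_lr * gamma ** (number of milestones that are <= epoch)."""
    reached = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** reached

# Same settings as above: milestones=[30, 60, 80], gamma=0.1, base LR 0.1.
for epoch in (29, 30, 60, 80):
    print(epoch, multistep_lr(0.1, 0.1, [30, 60, 80], epoch))
```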

ExponentialLR

Decays by gamma every single epoch: $\eta_t = \eta_0 \cdot \gamma^t$

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
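Because the decay compounds every epoch, it helps to check how quickly a given gamma shrinks the LR. The sketch below evaluates the formula and inverts it; both function names are hypothetical helpers, and the base LR of 0.1 is assumed from the earlier examples.

```python
import math

def exponential_lr(base_lr: float, gamma: float, epoch: int) -> float:
    """Closed-form ExponentialLR: base_lr * gamma ** epoch."""
    return base_lr * gamma ** epoch

def epochs_to_reach(base_lr: float, gamma: float, target_lr: float) -> int:
    """Smallest epoch t with base_lr * gamma ** t <= target_lr."""
    return math.ceil(math.log(target_lr / base_lr) / math.log(gamma))

# With gamma=0.95, how long until the LR has dropped by 10x?
t = epochs_to_reach(0.1, 0.95, 0.01)
print(t, exponential_lr(0.1, 0.95, t))
```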

PolynomialLR

Decays following a polynomial curve over total_iters steps, where $T$ is total_iters: $\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^{\text{power}}$

scheduler = torch.optim.lr_scheduler.PolynomialLR(
    optimizer, total_iters=100, power=1.0  # power=1 → linear decay
)
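The polynomial formula can be evaluated directly to see how power changes the curve. This is a sketch with a hypothetical helper polynomial_lr; clamping t at total_iters is an assumption made here so the LR stays at zero once the schedule has finished.

```python
import math

def polynomial_lr(base_lr: float, total_iters: int, power: float, step: int) -> float:
    """base_lr * (1 - t / T) ** power, with t clamped to T (assumption)."""
    t = min(step, total_iters)
    return base_lr * (1 - t / total_iters) ** power

# Halfway through a 100-step schedule, for linear and quadratic decay.
print(polynomial_lr(0.1, 100, 1.0, 50))
print(polynomial_lr(0.1, 100, 2.0, 50))
```

With power=1 the LR is halved at the midpoint; with power=2 it has already fallen to a quarter of the base value.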

Warmup Schedulers

ConstantLR

Multiplies the LR by factor for the first total_iters steps, then restores the base LR:

scheduler = torch.optim.lr_scheduler.ConstantLR(
    optimizer, factor=1/3, total_iters=5  # LR starts at base/3 for 5 steps
)

LinearLR

Linearly interpolates from start_factor × base_lr to end_factor × base_lr over total_iters:

scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=10
)
# Typical use: warmup LR from 0.1× to 1× over 10 steps

LinearLR is the standard warmup scheduler in modern training recipes (e.g., combined with CosineAnnealingLR via SequentialLR).
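The two warmup rules differ only in the multiplicative factor they apply to the base LR. The sketch below models both factors in plain Python; constant_factor and linear_factor are hypothetical helpers, and the base LR of 0.1 is assumed for illustration.

```python
import math

def constant_factor(step: int, factor: float = 1 / 3, total_iters: int = 5) -> float:
    """ConstantLR-style factor: `factor` until total_iters, then 1."""
    return factor if step < total_iters else 1.0

def linear_factor(step: int, start_factor: float = 0.1,
                  end_factor: float = 1.0, total_iters: int = 10) -> float:
    """LinearLR-style factor: linear ramp from start_factor to end_factor."""
    t = min(step, total_iters)
    return start_factor + (end_factor - start_factor) * t / total_iters

base_lr = 0.1
constant_lrs = [base_lr * constant_factor(s) for s in range(12)]
linear_lrs = [base_lr * linear_factor(s) for s in range(12)]
print(constant_lrs)
print(linear_lrs)
```

The constant variant jumps to the base LR in a single step, while the linear variant ramps up gradually, which is why the latter is the usual choice for warmup.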

Cosine Annealing

CosineAnnealingLR

Decays the LR following the first half of a cosine curve from $\eta_{\max}$ to $\eta_{\min}$: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T_{\max}}\right)\right)$

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

TensorFlow:

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    alpha=1e-6 / 0.1  # alpha = eta_min / initial_lr
)

The cosine shape decays the LR slowly at first, fastest around the midpoint, and slowly again near the end, so training spends extra time at both the high and the low learning rates. It is among the most widely used annealing schedules for image classification.
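The shape of the curve can be checked numerically by evaluating the cosine formula and comparing the per-step drop at the start, middle, and end. The helpers cosine_lr and drop are hypothetical; the defaults mirror the PyTorch example above with an assumed peak LR of 0.1.

```python
import math

def cosine_lr(t: int, eta_max: float = 0.1, eta_min: float = 1e-6,
              t_max: int = 100) -> float:
    """eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / t_max))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max))

def drop(t: int) -> float:
    """How much the LR decreases between step t and step t + 1."""
    return cosine_lr(t) - cosine_lr(t + 1)

print(cosine_lr(0), cosine_lr(50), cosine_lr(100))
print(drop(0), drop(50), drop(99))
```

The per-step drop near the midpoint is far larger than at either end, which is the slow-fast-slow pattern described above.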

Metric-Driven Decay

ReduceLROnPlateau

Reduces the LR when a monitored metric stops improving:

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',        # 'min' for loss, 'max' for accuracy
    factor=0.1,        # new_lr = lr * factor
    patience=10,       # epochs with no improvement before reduction
    threshold=1e-4,    # minimum relative change to count as improvement (threshold_mode='rel' is the default)
    min_lr=1e-7,
)

# Called with metric, not just step:
for epoch in range(epochs):
    val_loss = validate()
    scheduler.step(val_loss)  # <-- pass the metric

TensorFlow:

# Use the ReduceLROnPlateau callback with model.fit:
callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.1, patience=10,
    min_delta=1e-4, min_lr=1e-7
)
model.fit(train_data, validation_data=val_data, callbacks=[callback])

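ReduceLROnPlateau is optimizer-agnostic: it works with any optimizer. Once more than patience consecutive epochs pass without an improvement larger than threshold, the LR is multiplied by factor (never dropping below min_lr). The cooldown parameter sets how many epochs to wait after a reduction before bad epochs are counted again.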
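How patience, threshold, and cooldown interact is easiest to see in a stripped-down model of the rule. The class below is a simplified sketch, not the library implementation: PlateauSketch is a hypothetical name, and it assumes mode='min' with a relative threshold.

```python
import math

class PlateauSketch:
    """Simplified stand-in for the plateau rule (mode='min', relative threshold)."""

    def __init__(self, lr, factor=0.1, patience=2, threshold=1e-4,
                 cooldown=0, min_lr=1e-7):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.cooldown = cooldown
        self.min_lr = min_lr
        self.best = math.inf
        self.num_bad_epochs = 0
        self.cooldown_counter = 0

    def step(self, metric):
        # An epoch is an improvement only if it beats the best value
        # by more than the relative threshold.
        if metric < self.best * (1 - self.threshold):
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1

        # While cooling down, bad epochs are not accumulated.
        if self.cooldown_counter > 0:
            self.cooldown_counter -= 1
            self.num_bad_epochs = 0

        # Reduce once the bad-epoch count exceeds patience.
        if self.num_bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.cooldown_counter = self.cooldown
            self.num_bad_epochs = 0
        return self.lr

# Loss improves once, then stalls: the reduction fires on the third stalled epoch.
sched = PlateauSketch(lr=0.1, patience=2)
history = [sched.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]]

# With patience=0 and a flat loss, cooldown=2 spaces out the reductions.
cooled = PlateauSketch(lr=0.1, patience=0, cooldown=2)
cooled_history = [cooled.step(1.0) for _ in range(5)]
print(history)
print(cooled_history)
```

In the first run the LR stays at its initial value for four epochs and drops on the fifth. In the second, each reduction is followed by two epochs in which no further reduction can be triggered.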

Custom Schedules

LambdaLR

Applies an arbitrary function of the epoch index as a multiplicative factor:

warmup = lambda epoch: min((epoch + 1) / 10, 1.0)  # linear warmup over 10 epochs; +1 so epoch 0 does not train at LR 0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)
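A custom schedule is worth verifying before a long run by recording get_last_lr() at every epoch and comparing it with the intended factor. The sketch below does this for a 10-epoch linear warmup; the toy Linear model, the base LR of 0.1, and the function warmup_factor (with a +1 offset so the first epoch does not run at LR 0) are assumptions made for the example.

```python
import math
import torch

def warmup_factor(epoch: int) -> float:
    """Hypothetical warmup rule: ramps from 0.1 to 1.0 over 10 epochs."""
    return min((epoch + 1) / 10, 1.0)

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

lrs = []
for epoch in range(12):
    lrs.append(scheduler.get_last_lr()[0])  # LR used during this epoch
    optimizer.step()                        # stands in for a real training epoch
    scheduler.step()

# Compare the recorded trajectory with base_lr * factor(epoch).
expected = [0.1 * warmup_factor(e) for e in range(12)]
matches = all(math.isclose(a, b) for a, b in zip(lrs, expected))
print(lrs)
print("trajectory matches:", matches)
```

Plotting or printing lrs this way catches off-by-one mistakes, such as a schedule whose first epoch runs at zero learning rate.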

TensorFlow:

class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr, warmup_steps):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)  # the optimizer passes an integer step counter
        return self.base_lr * tf.minimum(step / self.warmup_steps, 1.0)

optimizer = tf.keras.optimizers.AdamW(learning_rate=LinearWarmup(1e-3, 10))

MultiplicativeLR

Like LambdaLR but multiplies the current LR (not the base LR) by the lambda output each step:

scheduler = torch.optim.lr_scheduler.MultiplicativeLR(
    optimizer, lr_lambda=lambda epoch: 0.95
)
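The difference between the two rules shows up when the same function is fed to both. The sketch below models each update rule in plain Python; lambda_style and multiplicative_style are hypothetical helpers, and the base LR of 0.1 is assumed.

```python
import math

def lambda_style(base_lr, fn, epochs):
    """LambdaLR-style: lr_t = base_lr * fn(t), always relative to the base LR."""
    return [base_lr * fn(t) for t in range(epochs)]

def multiplicative_style(base_lr, fn, epochs):
    """MultiplicativeLR-style: lr_t = lr_{t-1} * fn(t), starting from base_lr."""
    lrs = [base_lr]
    for t in range(1, epochs):
        lrs.append(lrs[-1] * fn(t))
    return lrs

constant = lambda epoch: 0.95
print(lambda_style(0.1, constant, 4))
print(multiplicative_style(0.1, constant, 4))
```

A constant factor of 0.95 yields a flat LR under the LambdaLR-style rule but a compounding exponential decay under the MultiplicativeLR-style rule.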

Scheduler Comparison

| Scheduler | Shape | LR required | Best for |
|---|---|---|---|
| StepLR | Staircase | Yes | Simple baselines |
| MultiStepLR | Multi-step staircase | Yes | Known plateau epochs |
| ExponentialLR | Exponential decay | Yes | Short runs |
| PolynomialLR | Polynomial | Yes | BERT-style fine-tuning |
| CosineAnnealingLR | Half-cosine | Yes | Image classification |
| ReduceLROnPlateau | Adaptive | Yes | When val metric is reliable |
| LambdaLR | Arbitrary | Yes | Custom warmup recipes |