Learning Rate Schedulers
- Compute the learning rate at a given step for StepLR, ExponentialLR, and CosineAnnealingLR using their respective formulas
- Implement a linear warmup schedule using LinearLR and explain why warmup prevents early-training instability in large-batch and transformer settings
- Configure ReduceLROnPlateau and explain how patience, threshold, and cooldown control when and by how much the learning rate is reduced
- Implement a custom learning rate schedule using LambdaLR and verify the trajectory using get_last_lr()
Why Schedule the Learning Rate?
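A fixed learning rate is a compromise: it has to be large enough to make progress early yet small enough to converge precisely late, and a single value rarely does both well. Scheduling addresses both ends: start large (fast progress) and decay toward a small value (precise convergence). Some schedules additionally use a warmup phase in which the LR increases from near zero to its peak. Early in training, gradients are large and noisy and the moment estimates of adaptive optimizers such as Adam rest on only a few samples, so a full-size step can destabilize the run; a short warmup keeps updates small until those estimates settle, which matters most in large-batch and transformer training.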
The Scheduler API
All schedulers wrap an optimizer and expose scheduler.step(), which must be called after optimizer.step(); calling it first skips the first value of the schedule:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train_one_epoch()   # forward/backward passes and the optimizer.step() calls
    scheduler.step()    # once per epoch, after optimizer.step(), not before
TensorFlow: Schedules are passed directly to the optimizer constructor as a LearningRateSchedule. Note that decay_steps counts optimizer steps (batches), not epochs:
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1, decay_steps=100
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
# No scheduler.step() needed — TF updates LR automatically each optimizer step
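To resume from a checkpoint, save scheduler.state_dict() next to the optimizer state and restore it with scheduler.load_state_dict(). Passing last_epoch to the constructor also works, but only if every optimizer param group already contains initial_lr (which is the case after loading an optimizer state dict that was saved with a scheduler attached); otherwise PyTorch raises a KeyError:
# Requires 'initial_lr' in optimizer.param_groups, e.g. after optimizer.load_state_dict(...)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, last_epoch=current_epoch)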
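The state-dict route can be sketched as follows. This is a minimal illustration, not a full checkpointing recipe: the tiny nn.Linear model, the make_training_objects helper, and the in-memory checkpoint dictionary are all hypothetical placeholders, and optimizer.step() stands in for real training.

```python
import torch

def make_training_objects():
    # Hypothetical stand-in model; any nn.Module is handled the same way.
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    return model, optimizer, scheduler

# Original run: 45 epochs, so one decay (at epoch 30) has already happened.
model, optimizer, scheduler = make_training_objects()
for epoch in range(45):
    optimizer.step()   # placeholder for an epoch of training
    scheduler.step()

# In practice this dictionary would be written with torch.save().
checkpoint = {
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}

# Resumed run: rebuild fresh objects, then load both state dicts.
resumed_model, resumed_optimizer, resumed_scheduler = make_training_objects()
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
resumed_scheduler.load_state_dict(checkpoint["scheduler"])

resumed_lr = resumed_scheduler.get_last_lr()[0]
print("resumed at epoch", resumed_scheduler.last_epoch, "with lr", resumed_lr)
```

Loading the optimizer state restores the decayed LR in its param groups, and loading the scheduler state restores its epoch counter, so the next decay still lands at epoch 60.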
Step-Based Decay
StepLR
Decays the learning rate by gamma every step_size epochs:
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# LR: 0.1 → 0.01 at epoch 30 → 0.001 at epoch 60
TensorFlow:
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=30, decay_rate=0.1, staircase=True
)
# staircase=True gives step-decay; staircase=False gives smooth exponential
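To compute the StepLR value at a given epoch by hand, the staircase can be written in closed form. The helper below is a plain-Python sketch (the name step_lr is ours, not a library function) and assumes no other scheduler is modifying the LR.

```python
def step_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Closed-form StepLR: one multiplication by gamma per completed block of step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Values just before and at each decay boundary for lr=0.1, gamma=0.1, step_size=30.
for epoch in (0, 29, 30, 59, 60):
    print(epoch, step_lr(0.1, 0.1, 30, epoch))
```

The integer division is what produces the staircase: the exponent only changes when epoch crosses a multiple of step_size.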
MultiStepLR
Decays at specified milestone epochs rather than a fixed interval:
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 80], gamma=0.1
)
Useful when you know in advance when the loss plateaus (e.g., after major data augmentation changes).
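The LR at any epoch follows from counting how many milestones have been passed. A small plain-Python sketch of that rule (multistep_lr is an illustrative name, not part of PyTorch):

```python
from bisect import bisect_right

def multistep_lr(base_lr, gamma, milestones, epoch):
    """Closed-form MultiStepLR: multiply by gamma once per milestone already reached."""
    # bisect_right counts the milestones that are <= epoch.
    passed = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** passed

for epoch in (0, 29, 30, 60, 79, 80):
    print(epoch, multistep_lr(0.1, 0.1, [30, 60, 80], epoch))
```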
ExponentialLR
Decays by gamma every single epoch:
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
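Because the decay compounds every epoch, a gamma close to 1 still shrinks the LR quickly. The sketch below (plain Python, with illustrative helper names) gives the closed form and estimates how many epochs it takes to fall to a given fraction of the base LR.

```python
import math

def exponential_lr(base_lr: float, gamma: float, epoch: int) -> float:
    """Closed-form ExponentialLR: base_lr * gamma^epoch."""
    return base_lr * gamma ** epoch

def epochs_to_reach(fraction: float, gamma: float) -> int:
    """Smallest whole number of epochs after which lr <= fraction * base_lr.

    Solves gamma^n <= fraction for n; assumes 0 < gamma < 1 and 0 < fraction < 1.
    """
    return math.ceil(math.log(fraction) / math.log(gamma))

halving_epochs = epochs_to_reach(0.5, 0.95)
print("epochs to halve the LR at gamma=0.95:", halving_epochs)
print("lr at that epoch:", exponential_lr(0.1, 0.95, halving_epochs))
```

This kind of estimate is a quick sanity check when picking gamma for a run of known length.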
PolynomialLR
Decays the LR from its base value to zero along a polynomial curve over total_iters steps:
scheduler = torch.optim.lr_scheduler.PolynomialLR(
    optimizer, total_iters=100, power=1.0  # power=1 → linear decay
)
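The effect of power is easiest to see from the closed form. The following is a plain-Python sketch (polynomial_lr is an illustrative name), assuming the schedule is the only thing changing the LR.

```python
def polynomial_lr(base_lr: float, total_iters: int, power: float, step: int) -> float:
    """Closed-form polynomial decay: base_lr * (1 - t / total_iters)^power, clamped at total_iters."""
    t = min(step, total_iters)
    return base_lr * (1.0 - t / total_iters) ** power

# power=1 is a straight line to zero; power=2 drops faster early and flattens near the end.
for step in (0, 25, 50, 75, 100):
    print(step, polynomial_lr(0.1, 100, 1.0, step), polynomial_lr(0.1, 100, 2.0, step))
```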
Warmup Schedulers
ConstantLR
Multiplies the LR by factor for the first total_iters steps, then restores the base LR:
scheduler = torch.optim.lr_scheduler.ConstantLR(
    optimizer, factor=1/3, total_iters=5  # LR starts at base/3 for 5 steps
)
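As a quick check on where the jump back to the base LR happens, the rule can be written as a one-line function. This is a plain-Python sketch with an illustrative name, not library code.

```python
def constant_lr(base_lr: float, factor: float, total_iters: int, step: int) -> float:
    """Closed-form ConstantLR: scaled LR while step < total_iters, base LR afterwards."""
    return base_lr * (factor if step < total_iters else 1.0)

# With total_iters=5 the reduced LR applies to steps 0..4 and the base LR from step 5 on.
for step in range(7):
    print(step, constant_lr(0.1, 1 / 3, 5, step))
```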
LinearLR
Linearly interpolates from start_factor × base_lr to end_factor × base_lr over total_iters:
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=10
)
# Typical use: warmup LR from 0.1× to 1× over 10 steps
LinearLR is a common warmup scheduler in modern training recipes, where it is typically chained with a decay schedule such as CosineAnnealingLR via SequentialLR.
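A warmup-then-cosine recipe of that kind might look like the sketch below. It assumes a recent PyTorch release in which SequentialLR is available; the nn.Linear model, the step counts, and the bare optimizer.step() call are placeholders for a real training loop.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(4, 2)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 10, 100
warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)
decay = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
# milestones=[warmup_steps] hands control from warmup to decay after 10 steps.
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[warmup_steps])

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])  # LR that this step would train with
    optimizer.step()                              # placeholder for a real update
    scheduler.step()

print("first:", lrs[0], "peak:", max(lrs), "last:", lrs[-1])
```

Reading the LR from optimizer.param_groups records exactly what the optimizer would use at each step, which makes the hand-off between the two schedulers easy to inspect or plot.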
Cosine Annealing
CosineAnnealingLR
Decays the LR along the first half of a cosine curve, from the initial LR $\eta_{\max}$ to $\eta_{\min}$ (the eta_min argument) over T_max epochs:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)
TensorFlow:
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    alpha=1e-6 / 0.1  # alpha = eta_min / initial_lr
)
The cosine shape decays slowly at first, fastest in the middle, and slowly again at the end, so the run spends extra time both at high learning rates (exploration) and at low ones (fine convergence). It is one of the most widely used annealing schedules for image classification.
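The slow-fast-slow shape can be checked numerically from the standard half-cosine formula. The helper below is a plain-Python sketch (cosine_annealing_lr is an illustrative name) valid for epochs between 0 and T_max.

```python
import math

def cosine_annealing_lr(base_lr: float, eta_min: float, t_max: int, epoch: int) -> float:
    """Half-cosine from base_lr (epoch 0) down to eta_min (epoch t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

base_lr, eta_min, t_max = 0.1, 1e-6, 100

# Compare how much the LR drops over 10 epochs at the start versus in the middle.
early_drop = cosine_annealing_lr(base_lr, eta_min, t_max, 0) - cosine_annealing_lr(base_lr, eta_min, t_max, 10)
middle_drop = cosine_annealing_lr(base_lr, eta_min, t_max, 45) - cosine_annealing_lr(base_lr, eta_min, t_max, 55)
print("drop over epochs 0-10: ", early_drop)
print("drop over epochs 45-55:", middle_drop)
```

The middle window loses several times more LR than the first one, which is the "fast middle" of the curve.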
Metric-Driven Decay
ReduceLROnPlateau
Reduces the LR when a monitored metric stops improving:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',       # 'min' for loss, 'max' for accuracy
    factor=0.1,       # new_lr = lr * factor
    patience=10,      # epochs with no improvement tolerated before a reduction
    threshold=1e-4,   # minimum change to count as improvement (relative by default)
    min_lr=1e-7,      # lower bound on the LR
)
# Called with the metric, not just step():
for epoch in range(epochs):
    val_loss = validate()
    scheduler.step(val_loss)  # <-- pass the metric
TensorFlow:
# Use the ReduceLROnPlateau callback with model.fit:
callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.1, patience=10,
    min_delta=1e-4, min_lr=1e-7
)
model.fit(train_data, validation_data=val_data, callbacks=[callback])
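ReduceLROnPlateau is optimizer-agnostic: it works with any optimizer. A reduction fires once the metric has failed to improve by more than threshold for more than patience consecutive epochs, and each reduction multiplies the LR by factor (never going below min_lr). The cooldown parameter then pauses the bad-epoch count for that many epochs, so back-to-back reductions cannot happen before the new LR has had a chance to take effect.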
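To make the interplay of patience, threshold, and cooldown concrete, here is a simplified plain-Python model of the bookkeeping for mode='min' with a relative threshold. PlateauTracker is an illustrative class written for this explanation; it is modeled on the documented behavior and is not the library's implementation.

```python
import math

class PlateauTracker:
    """Simplified model of plateau-based LR reduction (mode='min', relative threshold)."""

    def __init__(self, lr, factor=0.1, patience=10, threshold=1e-4, cooldown=0, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.cooldown = cooldown
        self.min_lr = min_lr
        self.best = math.inf
        self.num_bad_epochs = 0
        self.cooldown_counter = 0

    def step(self, metric):
        # An epoch counts as an improvement only if it beats best by the relative threshold.
        if metric < self.best * (1.0 - self.threshold):
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        # During cooldown, bad epochs are not counted.
        if self.cooldown_counter > 0:
            self.cooldown_counter -= 1
            self.num_bad_epochs = 0
        # Reduce once more than `patience` bad epochs have accumulated.
        if self.num_bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.cooldown_counter = self.cooldown
            self.num_bad_epochs = 0
        return self.lr

# A validation loss stuck at 1.0: the first call sets the baseline, the rest are bad epochs.
tracker = PlateauTracker(lr=0.1, patience=2, cooldown=1)
lrs = [tracker.step(1.0) for _ in range(8)]
print(lrs)
```

With patience=2 the first reduction arrives on the fourth call (three bad epochs in a row), and cooldown=1 delays the second reduction by one epoch compared with cooldown=0.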
Custom Schedules
LambdaLR
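Applies an arbitrary function of the epoch index as a multiplicative factor on the base LR:
# (epoch + 1) keeps the factor above zero at epoch 0; epoch / 10 would start training with LR = 0
warmup = lambda epoch: min((epoch + 1) / 10, 1.0)  # linear warmup over 10 epochs
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)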
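Before launching a long run it is worth verifying that a custom schedule produces the trajectory you expect. The sketch below records get_last_lr() at each epoch and compares it with the factor function. The nn.Linear model and the bare optimizer.step() call are placeholders, and warmup_factor is an illustrative variant whose factor starts at 1/10 rather than 0.

```python
import torch

warmup_epochs = 10

def warmup_factor(epoch: int) -> float:
    """Linear warmup factor that starts at 1/warmup_epochs and saturates at 1.0."""
    return min((epoch + 1) / warmup_epochs, 1.0)

model = torch.nn.Linear(4, 2)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

trajectory = []
for epoch in range(15):
    trajectory.append(scheduler.get_last_lr()[0])  # LR in effect for this epoch
    optimizer.step()                                # placeholder for an epoch of training
    scheduler.step()

# LambdaLR sets lr = base_lr * lr_lambda(epoch), so the two columns should agree.
expected = [0.1 * warmup_factor(epoch) for epoch in range(15)]
for epoch, (actual, target) in enumerate(zip(trajectory, expected)):
    print(epoch, actual, target)
```

get_last_lr() returns one value per param group, hence the [0] index for this single-group optimizer.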
TensorFlow:
class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr, warmup_steps):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # step arrives as an integer tensor; cast before dividing
        step = tf.cast(step, tf.float32)
        return self.base_lr * tf.minimum(step / self.warmup_steps, 1.0)

optimizer = tf.keras.optimizers.AdamW(learning_rate=LinearWarmup(1e-3, 10))
MultiplicativeLR
Like LambdaLR but multiplies the current LR (not the base LR) by the lambda output each step:
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(
    optimizer, lr_lambda=lambda epoch: 0.95
)
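The difference between the two is easiest to see by feeding the same constant lambda to both rules. The functions below are a plain-Python sketch of the two update rules (the names are illustrative), not calls into PyTorch.

```python
def lambda_lr_trajectory(base_lr, lr_lambda, num_epochs):
    """LambdaLR rule: lr(epoch) = base_lr * lr_lambda(epoch)."""
    return [base_lr * lr_lambda(epoch) for epoch in range(num_epochs)]

def multiplicative_lr_trajectory(base_lr, lr_lambda, num_epochs):
    """MultiplicativeLR rule: lr(epoch) = lr(epoch - 1) * lr_lambda(epoch) for epoch >= 1."""
    lrs = [base_lr]
    for epoch in range(1, num_epochs):
        lrs.append(lrs[-1] * lr_lambda(epoch))
    return lrs

constant_factor = lambda epoch: 0.95

lambda_lrs = lambda_lr_trajectory(0.1, constant_factor, 5)
multiplicative_lrs = multiplicative_lr_trajectory(0.1, constant_factor, 5)
print("LambdaLR rule:        ", lambda_lrs)
print("MultiplicativeLR rule:", multiplicative_lrs)
```

Under the LambdaLR rule a constant factor gives a constant LR, whereas under the MultiplicativeLR rule the same factor compounds into an exponential decay.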
Scheduler Comparison
| Scheduler | Shape | Initial LR required | Best for |
|---|---|---|---|
| StepLR | Staircase | Yes | Simple baselines |
| MultiStepLR | Multi-step staircase | Yes | Known plateau epochs |
| ExponentialLR | Exponential decay | Yes | Short runs |
| PolynomialLR | Polynomial | Yes | BERT-style fine-tuning |
| CosineAnnealingLR | Half-cosine | Yes | Image classification |
| ReduceLROnPlateau | Adaptive | Yes | When val metric is reliable |
| LambdaLR | Arbitrary | Yes | Custom warmup recipes |