Advanced Schedulers & Composition
- Explain CosineAnnealingWarmRestarts and state how T_0 and T_mult control the restart period, cycle growth, and exploration behavior
- Configure OneCycleLR for super-convergence, identifying the three training phases and the role of div_factor, pct_start, and final_div_factor
- Distinguish CyclicLR's three modes (triangular, triangular2, exp_range) and explain why inverse momentum cycling reduces oscillations during the high-LR phase
- Compose a warmup-then-cosine schedule using SequentialLR and trace the learning rate value through the transition milestone
CosineAnnealingWarmRestarts (SGDR)
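SGDR (Loshchilov & Hutter 2016) extends cosine annealing by periodically restarting the LR from its maximum. After each restart the period optionally grows by a factor $T_{\text{mult}}$:
$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{T_{\text{cur}}}{T_i}\,\pi\right)\right)$$
where $T_i$ is the length of the $i$-th cycle and $T_{\text{cur}}$ (the time elapsed since the last restart) resets to 0 at each restart.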
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,        # length of first restart cycle (in epochs)
    T_mult=2,      # each cycle is 2× longer: 10 → 20 → 40 ...
    eta_min=1e-6,  # LR floor reached at the end of each cycle
)
TensorFlow:
schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.1,
    first_decay_steps=10,  # T_0, counted in optimizer steps (not epochs)
    t_mul=2.0,             # T_mult
    alpha=1e-6 / 0.1,      # eta_min / initial_lr
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
The restarts serve as exploration pulses: after annealing into a sharp minimum, the LR spikes again and the model can escape to a flatter basin, which tends to generalize better. This is particularly effective when combined with snapshot ensembling: saving a checkpoint at the end of each cycle, just before the restart, and averaging the checkpoints' predictions.
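To make the effect of T_0 and T_mult concrete, the following plain-Python sketch evaluates the cosine-with-restarts rule by hand instead of calling PyTorch. The function name `sgdr_lr` and the value `eta_max=0.1` are illustrative assumptions, and epochs are treated as integers.

```python
import math

def sgdr_lr(epoch, T_0=10, T_mult=2, eta_max=0.1, eta_min=1e-6):
    """Return (lr, cycle_index, T_cur) for an integer epoch under SGDR."""
    T_i, T_cur, cycle = T_0, epoch, 0
    # Walk past completed cycles; each new cycle is T_mult times longer.
    while T_cur >= T_i:
        T_cur -= T_i
        T_i *= T_mult
        cycle += 1
    # Cosine annealing within the current cycle.
    lr = eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * T_cur / T_i))
    return lr, cycle, T_cur

# Epochs at which T_cur resets to 0, i.e. where the LR jumps back to eta_max.
restarts = [e for e in range(1, 100) if sgdr_lr(e)[2] == 0]
print(restarts)  # [10, 30, 70]
```

With T_0=10 and T_mult=2 the cycles last 10, 20, and 40 epochs, so the restarts land at epochs 10, 30, and 70. These are also the natural points for saving snapshots.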
CyclicLR
CyclicLR (Smith 2017) cycles the LR between base_lr and max_lr using a triangular wave. It is typically applied per batch (not per epoch).
Modes
| Mode | Behavior |
|---|---|
| triangular | Fixed triangle amplitude every cycle |
| triangular2 | Amplitude halves after each cycle |
| exp_range | Amplitude scaled by gamma^iteration at each iteration |
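The three modes differ only in how the triangle's height is scaled. The sketch below computes the LR by hand in plain Python to show this; the function name `cyclic_lr` and the value `gamma=0.9999` are assumptions chosen for illustration, and the arithmetic follows the triangular policy as described in the table rather than any particular library's internals.

```python
import math

def cyclic_lr(it, base_lr=1e-4, max_lr=1e-2, step_size=2000,
              mode="triangular", gamma=0.9999):
    """LR at iteration `it` for a symmetric triangular cycle of 2 * step_size steps."""
    cycle = math.floor(1 + it / (2 * step_size))   # 1-based cycle index
    x = abs(it / step_size - 2 * cycle + 1)        # 1 at the cycle ends, 0 at the peak
    height = (max_lr - base_lr) * max(0.0, 1.0 - x)
    if mode == "triangular":
        scale = 1.0                                # same amplitude every cycle
    elif mode == "triangular2":
        scale = 1.0 / (2 ** (cycle - 1))           # amplitude halves each cycle
    elif mode == "exp_range":
        scale = gamma ** it                        # amplitude shrinks every iteration
    else:
        raise ValueError(f"unknown mode: {mode}")
    return base_lr + height * scale

# Compare the LR at the peak of the first three cycles.
peaks = [2000, 6000, 10000]
for mode in ("triangular", "triangular2", "exp_range"):
    print(mode, [round(cyclic_lr(p, mode=mode), 6) for p in peaks])
```

In triangular mode every peak reaches max_lr, while in triangular2 the second peak sits halfway between base_lr and max_lr and the third a quarter of the way.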
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,
    max_lr=1e-2,
    step_size_up=2000,    # steps to go from base_lr → max_lr
    mode='triangular2',
    cycle_momentum=True,  # needs an optimizer with momentum (e.g. SGD); momentum cycles inversely to LR
)
# Step per batch:
for batch in dataloader:
    train_batch()     # forward pass, backward pass, optimizer.step()
    scheduler.step()  # advance the LR once per batch
TensorFlow: No built-in CyclicLR. Implement as a custom schedule:
class TriangularCyclicLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr, max_lr, step_size):
        self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size

    def __call__(self, step):
        step = tf.cast(step, tf.float32)  # the optimizer passes an integer step counter
        cycle = tf.floor(1 + step / (2 * self.step_size))
        x = tf.abs(step / self.step_size - 2 * cycle + 1)
        return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1 - x)
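Inverse momentum cycling (cycle_momentum=True) is a key feature: as the LR rises toward max_lr, momentum falls from max_momentum toward base_momentum. Large steps combined with high momentum tend to overshoot, so lowering momentum during the high-LR phase damps oscillations.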
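The inverse relationship can be sketched in plain Python by driving both quantities from the same triangular fraction. The function name `lr_and_momentum` and the momentum bounds 0.8 and 0.9 are assumptions made for this illustration.

```python
import math

def lr_and_momentum(it, base_lr=1e-4, max_lr=1e-2, step_size=2000,
                    base_momentum=0.8, max_momentum=0.9):
    """Return (lr, momentum) at iteration `it`; momentum moves opposite to the LR."""
    cycle = math.floor(1 + it / (2 * step_size))
    x = abs(it / step_size - 2 * cycle + 1)
    frac = max(0.0, 1.0 - x)   # 0 at the bottom of the LR triangle, 1 at the peak
    lr = base_lr + (max_lr - base_lr) * frac
    momentum = max_momentum - (max_momentum - base_momentum) * frac
    return lr, momentum

for it in (0, 1000, 2000, 3000, 4000):
    lr, m = lr_and_momentum(it)
    print(f"iter {it:4d}  lr={lr:.5f}  momentum={m:.3f}")
```

One way to read the result: for classical SGD with momentum and a constant gradient, the steady-state step is proportional to lr / (1 - momentum). At the LR peak that is 0.01 / 0.2 = 0.05 with cycling, against 0.01 / 0.1 = 0.1 if momentum had stayed at 0.9, so the effective step at the peak is roughly halved.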
OneCycleLR — Super-Convergence
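OneCycleLR (Smith & Topin 2018) is a single-cycle policy: the LR rises from an initial value of max_lr/div_factor to max_lr over the first pct_start fraction of training, then decays to a value far below the initial LR. Smith and Topin reported that this policy can enable super-convergence: reaching comparable or better accuracy in roughly an order of magnitude fewer iterations than conventional piecewise-constant schedules on some benchmarks. The original policy has three phases (warmup, anneal back to the initial LR, then a final decay); PyTorch's default (three_phase=False) merges the last two into a single annealing phase.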
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=len(dataloader) * epochs,  # total training steps
    pct_start=0.3,           # 30% of steps for warmup (increase phase)
    anneal_strategy='cos',   # 'cos' (smooth) or 'linear'
    div_factor=25.0,         # initial_lr = max_lr / div_factor
    final_div_factor=1e4,    # min_lr = initial_lr / final_div_factor
)
TensorFlow: No built-in OneCycleLR. Approximate with warmup then cosine decay:
total_steps = len(dataloader) * epochs
warmup_steps = int(0.3 * total_steps)
max_lr = 0.1
# Phase 1: linear warmup from max_lr/25 to max_lr
# Phase 2: cosine decay from max_lr to max_lr/25/10000
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=max_lr / 25,       # LR at step 0, where warmup starts
    decay_steps=total_steps - warmup_steps,  # length of the cosine phase only
    alpha=1 / (25 * 1e4),                    # floor as a fraction of the LR the decay starts from
    warmup_steps=warmup_steps, warmup_target=max_lr,  # TF 2.13+
)
Key parameters:
- max_lr: the peak LR, usually chosen with an LR range test
- pct_start=0.3: the first 30% of steps are warmup
- div_factor=25: initial LR = max_lr/25
- final_div_factor=1e4: final LR = initial_lr/10000 = max_lr/250000
OneCycleLR is stepped once per batch, so total_steps must equal the number of scheduler.step() calls; stepping past it raises an error. It is a common default for fast training experiments.
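The parameter arithmetic above can be checked with a small plain-Python sketch of a two-phase, cosine-annealed one-cycle policy. The function name `one_cycle_lr`, the choice of 1000 total steps, and the exact phase boundaries (peak at step pct_start * total_steps - 1, end at step total_steps - 1) are assumptions for illustration, so values may differ slightly from a given library version.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """LR at `step` for a two-phase one-cycle policy with cosine annealing."""
    initial_lr = max_lr / div_factor          # LR at step 0
    min_lr = initial_lr / final_div_factor    # LR at the final step
    peak_step = pct_start * total_steps - 1   # last step of the increase phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Moves smoothly from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

    if step <= peak_step:
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    pct = (step - peak_step) / (last_step - peak_step)
    return cos_anneal(max_lr, min_lr, pct)

total = 1000
for s in (0, 150, 299, 650, 999):
    print(s, f"{one_cycle_lr(s, total):.8f}")
```

Under these assumptions the LR starts at 0.004 (0.1/25), peaks at 0.1 on step 299, and ends at 4e-7 (0.004/10000).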
SequentialLR — Compose Schedulers
SequentialLR chains multiple schedulers, switching between them at specified milestones:
# Warmup for 5 epochs, then cosine annealing for 95 epochs
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=5
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=95
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5]
)
The milestone [5] means: use warmup for epochs 0–4, then switch to cosine from epoch 5 onward.
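To trace the LR through the milestone, the sketch below writes the intended warmup-then-cosine curve as a closed-form function in plain Python. It does not call PyTorch; the function name `warmup_then_cosine` and the optimizer LR of 0.1 are assumptions, and the values a real SequentialLR reports at the switch may differ slightly between library versions.

```python
import math

def warmup_then_cosine(epoch, base_lr=0.1, start_factor=0.1, warmup_epochs=5,
                       T_max=95, eta_min=0.0):
    """Intended LR at `epoch`: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Linear ramp of the multiplicative factor from start_factor to 1.0.
        factor = start_factor + (1.0 - start_factor) * epoch / warmup_epochs
        return base_lr * factor
    t = epoch - warmup_epochs  # the cosine schedule starts counting at the milestone
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / T_max))

# Print the epochs on either side of the milestone at epoch 5.
for epoch in range(3, 8):
    print(epoch, round(warmup_then_cosine(epoch), 5))
```

The last warmup epoch (epoch 4) gives 0.082, and the first cosine epoch (epoch 5) starts at the full base LR of 0.1, so the curve is continuous across the milestone before the cosine decay begins.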
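ChainedScheduler
ChainedScheduler calls step() on every scheduler it wraps at each step, so their multiplicative LR factors compound. It complements SequentialLR rather than competing with it: use ChainedScheduler when several schedules should act at the same time, and SequentialLR when they should run one after another.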
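As a rough illustration of multiplicative composition, the sketch below combines two per-epoch factors by hand: an exponential decay and a single step drop. All function names and constants here are hypothetical and nothing calls PyTorch; the point is only the arithmetic of applying both factors at every epoch.

```python
def exp_factor(epoch, gamma=0.9):
    """Exponential decay: multiply by gamma once per epoch."""
    return gamma ** epoch

def step_factor(epoch, milestones=(3,), drop=0.1):
    """Step decay: multiply by `drop` once for every milestone already reached."""
    return drop ** sum(epoch >= m for m in milestones)

def chained_lr(epoch, base_lr=1.0):
    """Both factors act at every epoch, so they multiply."""
    return base_lr * exp_factor(epoch) * step_factor(epoch)

for epoch in range(5):
    print(epoch, round(chained_lr(epoch), 5))
```

Here the exponential decay runs throughout, and from epoch 3 onward the step drop scales the result by a further factor of 0.1.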
The Warmup + Cosine Decay Pattern
The standard recipe for transformers and large vision models:
# Total steps = epochs × steps_per_epoch
warmup_steps = 500
total_steps = 10000
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-4, end_factor=1.0, total_iters=warmup_steps
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps, eta_min=1e-7
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps]
)
TensorFlow:
import math

class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr, warmup_steps, total_steps, eta_min=1e-7):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.eta_min = eta_min

    def __call__(self, step):
        step = tf.cast(step, tf.float32)  # the optimizer passes an integer step counter
        warmup_lr = self.base_lr * (step / self.warmup_steps)
        cos_steps = tf.cast(self.total_steps - self.warmup_steps, tf.float32)
        # Clamp progress so the LR stays at eta_min once total_steps is reached
        progress = tf.minimum((step - self.warmup_steps) / cos_steps, 1.0)
        cos_lr = self.eta_min + 0.5 * (self.base_lr - self.eta_min) * (
            1 + tf.cos(math.pi * progress)
        )
        return tf.where(step < self.warmup_steps, warmup_lr, cos_lr)

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=WarmupCosineDecay(1e-3, 500, 10000)
)
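The same pattern can also be written as a single multiplicative factor per step, which is the form a custom LR function takes. The sketch below is plain Python; the function name `warmup_cosine_factor` is hypothetical, and using one `min_factor` for both the warmup start and the final floor is a simplifying assumption rather than a match for the separate start_factor and eta_min settings.

```python
import math

def warmup_cosine_factor(step, warmup_steps=500, total_steps=10000, min_factor=1e-4):
    """Multiplier applied to the base LR at `step` (one step per batch)."""
    if step < warmup_steps:
        # Linear ramp from min_factor up to 1.0 over the warmup steps.
        return min_factor + (1.0 - min_factor) * step / warmup_steps
    # Fraction of the cosine phase completed, clamped so the factor holds at the floor.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_factor + 0.5 * (1.0 - min_factor) * (1 + math.cos(math.pi * progress))

for s in (0, 250, 500, 5250, 10000):
    print(s, f"{warmup_cosine_factor(s):.6f}")
```

A function of this shape could be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR and stepped once per batch, which avoids composing two scheduler objects when only one curve is needed.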
Scheduler Selection Summary
| Goal | Scheduler |
|---|---|
| Fastest training (single run) | OneCycleLR |
| Best generalization (ensemble) | CosineAnnealingWarmRestarts |
| Transformer / LLM fine-tuning | LinearLR warmup + CosineAnnealingLR |
| Unknown plateau epochs | ReduceLROnPlateau |
| Exploration during training | CyclicLR (triangular2) |
| Fixed-epoch milestone curriculum | MultiStepLR |
| Custom LR function | LambdaLR |
Calling scheduler.step() Correctly
# CORRECT — optimizer.step() before scheduler.step()
optimizer.step()
scheduler.step()
# WRONG — scheduler.step() before optimizer.step() causes off-by-one LR
scheduler.step() # Do not do this first
optimizer.step()
For per-batch schedulers (CyclicLR, OneCycleLR, and CosineAnnealingWarmRestarts when stepped with a fractional epoch such as scheduler.step(epoch + i / iters)), call scheduler.step() after each batch's optimizer.step(). For per-epoch schedulers, call it once per epoch, after that epoch's batch loop has finished.