Supplement · Optimizers

Advanced Schedulers & Composition

By the end of this reading you will be able to:
  • Explain CosineAnnealingWarmRestarts and state how T_0 and T_mult control the restart period, cycle growth, and exploration behavior
  • Configure OneCycleLR for super-convergence, identifying the three training phases and the role of div_factor, pct_start, and final_div_factor
  • Distinguish CyclicLR's three modes (triangular, triangular2, exp_range) and explain why inverse momentum cycling reduces oscillations during the high-LR phase
  • Compose a warmup-then-cosine schedule using SequentialLR and trace the learning rate value through the transition milestone

CosineAnnealingWarmRestarts (SGDR)

SGDR (Loshchilov & Hutter 2016) extends cosine annealing by periodically restarting the LR from its maximum. After each restart the period optionally grows by a factor $T_{\text{mult}}$:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min}) \left(1 + \cos\left(\frac{\pi T_{\text{cur}}}{T_i}\right)\right)$$

where $T_i = T_0 \cdot T_{\text{mult}}^i$ is the length of the $i$-th cycle and $T_{\text{cur}}$ resets to 0 at each restart.

scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,      # length of first restart cycle (in epochs)
    T_mult=2,    # each cycle is 2× longer: 10 → 20 → 40 ...
    eta_min=1e-6
)

TensorFlow:

schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.1,
    first_decay_steps=10,    # T_0 in steps
    t_mul=2.0,               # T_mult
    alpha=1e-6 / 0.1         # eta_min / initial_lr
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)

The restarts act as exploration pulses: once the LR has annealed and the model has settled into a possibly sharp minimum, the LR jumps back up and the model can escape toward a flatter basin, which tends to generalize better. This pairs well with snapshot ensembling: save the model at the end of each cycle, just before the restart, and average the snapshots' predictions.
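To make the T_0 / T_mult arithmetic concrete, the sketch below evaluates the SGDR formula in plain Python and lists the epochs at which a restart occurs. The function name `sgdr_lr` and the value `eta_max=0.1` are assumptions chosen for illustration; this is a direct reading of the formula, not PyTorch's implementation.

```python
import math

def sgdr_lr(epoch, t_0=10, t_mult=2, eta_min=1e-6, eta_max=0.1):
    """Return (lr, cycle_index, t_cur) for an integer epoch under SGDR."""
    t_i, t_cur, cycle = t_0, epoch, 0
    # Walk forward through the cycles until the epoch falls inside one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult      # the next cycle is t_mult times longer
        cycle += 1
    lr = eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
    return lr, cycle, t_cur

# Epochs at which T_cur resets to 0 and the LR jumps back to eta_max.
restarts = [e for e in range(80) if e > 0 and sgdr_lr(e)[2] == 0]
# The last epoch of each cycle is the natural point to save a snapshot.
snapshots = [e - 1 for e in restarts]
print(restarts)   # [10, 30, 70]
print(snapshots)  # [9, 29, 69]
```

With T_0=10 and T_mult=2 the cycles span 10, 20, and 40 epochs, so most of a long run is spent in the later, slower cycles.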

CyclicLR

CyclicLR (Smith 2017) cycles the LR between base_lr and max_lr using a triangular wave. It is typically applied per batch (not per epoch).

Modes

| Mode | Behavior |
|---|---|
| triangular | Fixed triangle amplitude every cycle |
| triangular2 | Amplitude halves after each cycle |
| exp_range | Amplitude decays by gamma^iteration |

scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,
    max_lr=1e-2,
    step_size_up=2000,   # steps to go from base_lr → max_lr
    mode='triangular2',
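    cycle_momentum=True, # needs an optimizer with a momentum setting (e.g. SGD); momentum cycles inversely to LR
)
# Step per batch:
for batch in dataloader:
    train_batch(batch)   # placeholder: forward pass, backward pass, optimizer.step()
    scheduler.step()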

TensorFlow: No built-in CyclicLR. Implement as a custom schedule:

class TriangularCyclicLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Triangular cyclic LR with fixed amplitude, evaluated per optimizer step."""

    def __init__(self, base_lr, max_lr, step_size):
        self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size

    def __call__(self, step):
        step = tf.cast(step, tf.float32)  # the optimizer passes an integer step
        cycle = tf.floor(1 + step / (2 * self.step_size))
        x = tf.abs(step / self.step_size - 2 * cycle + 1)  # 1 at cycle edges, 0 at the peak
        return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1 - x)

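Inverse momentum cycling (cycle_momentum=True) is a key feature: as the LR rises, momentum falls from max_momentum toward base_momentum, and it rises again as the LR falls. With SGD momentum β, a persistent gradient produces an effective step of roughly LR / (1 − β), so lowering β while the LR is high keeps that effective step from growing too large and damps oscillations during the high-LR phase.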
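The three modes and the inverse momentum cycle can be sketched in plain Python as follows. The function `cyclic_lr` is a hypothetical helper written for this reading. The momentum bounds 0.8 and 0.9 are assumed values, chosen to match what PyTorch's CyclicLR is understood to use by default. The code applies the triangular formula directly and is not a copy of the library code.

```python
import math

def cyclic_lr(it, base_lr=1e-4, max_lr=1e-2, step_size=2000,
              mode="triangular", gamma=1.0,
              base_momentum=0.8, max_momentum=0.9):
    """Return (lr, momentum) at 0-based batch iteration `it`."""
    cycle = math.floor(1 + it / (2 * step_size))   # 1, 2, 3, ...
    x = abs(it / step_size - 2 * cycle + 1)        # 1 at cycle edges, 0 at the peak
    height = max(0.0, 1 - x)                       # position on the triangle, in [0, 1]
    if mode == "triangular":
        scale = 1.0                                # same amplitude every cycle
    elif mode == "triangular2":
        scale = 1.0 / (2 ** (cycle - 1))           # amplitude halves each cycle
    elif mode == "exp_range":
        scale = gamma ** it                        # amplitude decays every iteration
    else:
        raise ValueError(f"unknown mode: {mode}")
    lr = base_lr + (max_lr - base_lr) * height * scale
    # Momentum moves in the opposite direction: lowest when the LR is highest.
    momentum = max_momentum - (max_momentum - base_momentum) * height * scale
    return lr, momentum

for it in (0, 2000, 4000, 6000):
    print(it, cyclic_lr(it, mode="triangular2"))
```

At iteration 2000 the LR reaches max_lr while momentum is at its minimum; at iteration 6000, the peak of the second cycle, triangular2 reaches only halfway between base_lr and max_lr.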

OneCycleLR — Super-Convergence

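OneCycleLR (Smith & Topin 2018) is a single-cycle policy: the LR rises from an initial value of max_lr / div_factor to max_lr over the first pct_start fraction of training, then decays to a value far below the initial LR. The original policy has three phases (ramp up, ramp back down to the initial LR, then a final annihilation phase); PyTorch merges the last two by default and restores them with three_phase=True. Smith and Topin reported that this enables super-convergence: in their experiments, networks reached comparable or better accuracy in roughly an order of magnitude fewer iterations than with conventional piecewise-constant schedules.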

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=len(dataloader) * epochs,  # total training steps
    pct_start=0.3,           # 30% of steps for warmup (increase phase)
    anneal_strategy='cos',   # 'cos' (smooth) or 'linear'
    div_factor=25.0,         # initial_lr = max_lr / div_factor
    final_div_factor=1e4,    # min_lr = initial_lr / final_div_factor
)

TensorFlow: No built-in OneCycleLR. Approximate with warmup then cosine decay:

total_steps = len(dataloader) * epochs
warmup_steps = int(0.3 * total_steps)

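# Phase 1: linear warmup from max_lr/25 to max_lr
# Phase 2: cosine decay from max_lr to max_lr/25/10000
max_lr = 0.1
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=max_lr / 25,       # warmup starts here when warmup_target is set
    decay_steps=total_steps - warmup_steps,
    alpha=1.0 / (25 * 1e4),                  # final LR as a fraction of the peak (assumed to scale warmup_target)
    warmup_steps=warmup_steps, warmup_target=max_lr,  # TF 2.13+
)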

Key parameters:

  • max_lr: the peak LR, usually chosen with an LR range test
  • pct_start=0.3: the first 30% of steps are warmup
  • div_factor=25: initial LR = max_lr/25
  • final_div_factor=1e4: final LR = initial_lr/10000 = max_lr/250000
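The parameter arithmetic above can be checked with a small plain-Python sketch of the default two-phase schedule with cosine annealing. The helper names `one_cycle_lr` and `cos_anneal` are hypothetical, and the phase boundaries follow how PyTorch is understood to place them, so treat the exact step indices as an assumption.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """LR at 0-based `step` for a two-phase, cosine-annealed one-cycle policy."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak_step = pct_start * total_steps - 1   # last step of the warmup phase
    last_step = total_steps - 1               # last step of training

    def cos_anneal(start, end, pct):
        # Moves smoothly from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= peak_step:
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    return cos_anneal(max_lr, min_lr, (step - peak_step) / (last_step - peak_step))

total = 100
print(one_cycle_lr(0, total))    # initial LR, max_lr / 25
print(one_cycle_lr(29, total))   # peak LR, max_lr
print(one_cycle_lr(99, total))   # final LR, max_lr / 250000
```

The sketch assumes pct_start * total_steps is greater than 1; a real run would normally have thousands of steps.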

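OneCycleLR is stepped once per batch, so total_steps must match the number of optimizer steps in the run; PyTorch raises an error if the scheduler is stepped beyond it. It is a common choice for fast training experiments.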

SequentialLR — Compose Schedulers

SequentialLR chains multiple schedulers, switching between them at specified milestones:

# Warmup for 5 epochs, then cosine annealing for 95 epochs
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=5
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=95
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5]
)

The milestone [5] means: use warmup for epochs 0–4, then switch to cosine from epoch 5 onward.
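The hand-off at the milestone can be traced with a closed-form sketch of the two schedules. It assumes an optimizer LR of 0.1 (the original snippet does not state one) and eta_min=0; `warmup_then_cosine` is a hypothetical helper that computes the values the composition is intended to produce, rather than calling PyTorch.

```python
import math

def warmup_then_cosine(epoch, base_lr=0.1, start_factor=0.1, warmup_epochs=5,
                       t_max=95, eta_min=0.0):
    """Intended LR of LinearLR (5 epochs) followed by CosineAnnealingLR (95 epochs)."""
    if epoch < warmup_epochs:
        # LinearLR: the factor grows linearly from start_factor toward 1.0.
        factor = start_factor + (1.0 - start_factor) * epoch / warmup_epochs
        return base_lr * factor
    # After the milestone the cosine schedule starts from its own step 0.
    t = epoch - warmup_epochs
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))

for epoch in range(3, 8):
    print(epoch, warmup_then_cosine(epoch))
```

Under these assumptions the last warmup epoch (epoch 4) runs at 0.82 × the base LR, epoch 5 runs at the full base LR because the cosine schedule begins at its maximum, and the LR declines from epoch 6 onward.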

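ChainedScheduler

ChainedScheduler calls step() on every scheduler in its list at each step, so their LR factors compound multiplicatively. It is not a replacement for SequentialLR: use SequentialLR when one schedule should hand off to another at a milestone, and ChainedScheduler when several chainable schedules (for example a short warmup and an exponential decay) should act at the same time.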

The Warmup + Cosine Decay Pattern

The standard recipe for transformers and large vision models:

# Total steps = epochs × steps_per_epoch
warmup_steps = 500
total_steps = 10000

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-4, end_factor=1.0, total_iters=warmup_steps
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps, eta_min=1e-7
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps]
)

TensorFlow:

import math

class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr, warmup_steps, total_steps, eta_min=1e-7):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.eta_min = eta_min

    def __call__(self, step):
        step = tf.cast(step, tf.float32)  # an integer step cannot be mixed with float32 below
        warmup_lr = self.base_lr * (step / self.warmup_steps)
        cos_steps = tf.cast(self.total_steps - self.warmup_steps, tf.float32)
        # Clamp progress to 1 so the LR stays at eta_min past total_steps
        progress = tf.minimum((step - self.warmup_steps) / cos_steps, 1.0)
        cos_lr = self.eta_min + 0.5 * (self.base_lr - self.eta_min) * (
            1 + tf.cos(math.pi * progress)
        )
        return tf.where(step < self.warmup_steps, warmup_lr, cos_lr)

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=WarmupCosineDecay(1e-3, 500, 10000)
)
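The same shape can also be written as a single multiplicative factor, which is the form torch.optim.lr_scheduler.LambdaLR expects for its lr_lambda argument. The sketch below is one possible formulation; `warmup_cosine_factor` and `min_factor` are hypothetical names, and the floor of 1e-4 is an assumption that mirrors start_factor above rather than eta_min.

```python
import math

def warmup_cosine_factor(step, warmup_steps=500, total_steps=10000, min_factor=1e-4):
    """Multiplier applied to the optimizer's base LR at a given step."""
    if step < warmup_steps:
        # Linear ramp, floored so the very first step is not exactly zero.
        return max(min_factor, step / warmup_steps)
    # Fraction of the decay phase completed, clamped so the factor stays flat afterwards.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_factor + 0.5 * (1 - min_factor) * (1 + math.cos(math.pi * progress))

# Hypothetical usage with PyTorch (not executed here):
#   scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine_factor)

for step in (0, 250, 500, 10000):
    print(step, warmup_cosine_factor(step))
```

A single function keeps the whole schedule in one place, at the cost of losing the reusable building blocks that SequentialLR composes.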

Scheduler Selection Summary

| Goal | Scheduler |
|---|---|
| Fastest training (single run) | OneCycleLR |
| Best generalization (ensemble) | CosineAnnealingWarmRestarts |
| Transformer / LLM fine-tuning | LinearLR warmup + CosineAnnealingLR |
| Unknown plateau epochs | ReduceLROnPlateau |
| Exploration during training | CyclicLR (triangular2) |
| Fixed-epoch milestone curriculum | MultiStepLR |
| Custom LR function | LambdaLR |

Calling scheduler.step() Correctly

# CORRECT — optimizer.step() before scheduler.step()
optimizer.step()
scheduler.step()

# WRONG — scheduler.step() before optimizer.step() causes off-by-one LR
scheduler.step()  # Do not do this first
optimizer.step()

For per-batch schedulers (CyclicLR, OneCycleLR, and CosineAnnealingWarmRestarts when it is stepped with a fractional epoch such as scheduler.step(epoch + i / len(dataloader))), call scheduler.step() after each batch's optimizer.step(). For per-epoch schedulers, call it once per epoch, after that epoch's batch loop has finished.
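The off-by-one effect can be illustrated with a toy model of the two call orders. This is not PyTorch code: `lrs_seen_by_optimizer` is a hypothetical helper, and the schedule is a plain list of made-up LR values indexed by the scheduler's step counter.

```python
def lrs_seen_by_optimizer(schedule, n_steps, scheduler_first):
    """Return the LR used by each of `n_steps` parameter updates."""
    t = 0            # the scheduler's internal step counter
    seen = []
    for _ in range(n_steps):
        if scheduler_first:
            t += 1                   # scheduler.step() advances the schedule too early
        seen.append(schedule[t])     # optimizer.step() uses whatever LR is current
        if not scheduler_first:
            t += 1                   # scheduler.step() after the update
    return seen

schedule = [0.1, 0.08, 0.06, 0.04, 0.02]
print(lrs_seen_by_optimizer(schedule, 4, scheduler_first=False))  # [0.1, 0.08, 0.06, 0.04]
print(lrs_seen_by_optimizer(schedule, 4, scheduler_first=True))   # [0.08, 0.06, 0.04, 0.02]
```

In the wrong order the first scheduled value is never used, which matters most for warmup and one-cycle schedules where the early values are deliberately small.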