Supplement · Regularization

Early Stopping and Data Augmentation

15 min read

By the end of this reading you will be able to:

Explain early stopping as L2 regularization in disguise — tracing the connection between stopping time and the effective regularization strength — and describe the patience hyperparameter
Explain how data augmentation acts as a regularizer by enlarging the effective training set, and distinguish label-preserving augmentations (flips, crops) from label-mixing augmentations (Mixup, CutMix)
Describe the Mixup interpolation scheme — the convex combination of both inputs and labels — and explain why soft labels act as regularizers
Distinguish CutOut (masking a region of a single image) from CutMix (replacing the masked region with a patch from another image and blending labels proportionally) and state the intuition behind each

Early Stopping

The simplest regularizer requires no change to the model or loss function: stop training when validation loss stops improving.

Algorithm:

Hold out a validation set
After each epoch (or every $k$ steps), evaluate validation loss
Keep the checkpoint with the lowest validation loss seen so far
Stop training if validation loss has not improved for a set number of consecutive evaluations (the patience)
Restore the best checkpoint

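Patience is the key hyperparameter: the number of consecutive evaluations without improvement to tolerate before stopping. Too low, and a noisy validation curve triggers a premature stop (underfitting); too high, and training wastes compute well past the best checkpoint, although restoring that checkpoint still recovers the best weights seen. Typical values: 5–20 evaluations when evaluating once per epoch.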

Why It Works: Connection to L2 Regularization

For gradient descent with learning rate $\eta$ on a quadratic loss near a minimum, starting from small initial weights, the number of training steps $T$ and the equivalent L2 regularization strength $\lambda$ are related by $\lambda \approx 1/(\eta T)$: training for fewer steps is equivalent to stronger L2 regularization.

Intuitively: starting from small weights, gradient descent first fits the high-curvature directions of the loss, where the data constrain the parameters strongly. Only later do the parameters drift along the low-curvature directions, where there is little signal and the model mostly fits noise. Stopping early constrains the effective complexity: the model stays in the low-complexity region it reaches first.
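The correspondence can be checked numerically on a one-dimensional quadratic. The sketch below is illustrative only: it assumes gradient descent starts from $w = 0$ with a learning rate $\eta$, and the function names and the values of `lr`, `steps`, and `curvatures` are made up for the example. It compares how much of the optimum each method recovers along directions of different curvature.

```python
import numpy as np

def gd_shrink_factor(h, lr, steps):
    """Fraction of the optimum w* reached after `steps` GD updates from w = 0
    on the 1-D quadratic loss 0.5 * h * (w - w_star)**2."""
    return 1.0 - (1.0 - lr * h) ** steps

def l2_shrink_factor(h, lam):
    """Fraction of w* kept by the minimizer of
    0.5 * h * (w - w_star)**2 + 0.5 * lam * w**2."""
    return h / (h + lam)

lr, steps = 0.01, 100            # illustrative values (assumptions)
lam = 1.0 / (lr * steps)         # matched L2 strength: lambda = 1 / (lr * T)
curvatures = np.array([0.01, 0.1, 1.0, 10.0, 50.0])

gd = gd_shrink_factor(curvatures, lr, steps)
l2 = l2_shrink_factor(curvatures, lam)
for h, a, b in zip(curvatures, gd, l2):
    print(f"curvature {h:6.2f}: early stopping {a:.3f}, L2 {b:.3f}")
```

Both factors are close to 0 along flat directions and close to 1 along sharp ones. The match is approximate rather than exact, which is why the relation is stated with $\approx$.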

Practical Implementation

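max_epochs = 100  # illustrative upper bound on training length
patience = 10     # evaluations to wait without improvement (illustrative)
best_val_loss = float('inf')
patience_count = 0

# train_one_epoch, evaluate, save_checkpoint and load_checkpoint are
# placeholders for your own training utilities.
for epoch in range(max_epochs):
    train_one_epoch(model, optimizer, train_loader)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model)  # keep best weights
        patience_count = 0      # improvement resets the counter
    else:
        patience_count += 1
        if patience_count >= patience:
            break               # no improvement for `patience` evaluations

load_checkpoint(model)  # restore best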
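The loop above depends on training helpers that are not defined in this reading. As a self-contained sketch of the same patience logic, the hypothetical `EarlyStopper` class below runs on a list of simulated validation losses; the class name, its attributes, and the loss values are assumptions for illustration, not part of any library.

```python
class EarlyStopper:
    """Tracks the best validation loss and signals when patience runs out."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one evaluation; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_step = step   # a real loop would save a checkpoint here
            self.bad_evals = 0
            return False
        self.bad_evals += 1
        return self.bad_evals >= self.patience

# Simulated validation curve: improves, then starts to overfit.
val_losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.66, 0.70, 0.75]
stopper = EarlyStopper(patience=3)
stopped_at = None
for step, loss in enumerate(val_losses):
    if stopper.update(step, loss):
        stopped_at = step
        break

print(f"best step {stopper.best_step}, stopped at step {stopped_at}")
```

With a patience of 3, the run stops three evaluations after the minimum at step 3, and the weights to restore are those from step 3 rather than from the final step.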

Data Augmentation: Enlarging the Training Set

A model that sees many variations of each training example finds it harder to memorize specific examples, because the effective training set is much larger than the raw dataset. Data augmentation creates these variations on the fly during training.

Why it is regularization: with fresh random transforms each epoch, the model rarely sees exactly the same input twice, and this diversity reduces its ability to overfit the specific pixel patterns in the original images.

Standard Geometric and Color Augmentations

Transform	What it does	Typical setting
Random horizontal flip	Mirror image left-right	p=0.5
Random crop	Sample a random sub-region at the target input size	224×224 from 256×256
Color jitter	Randomly adjust brightness, contrast, saturation, hue	Strength 0.4
Gaussian blur	Low-pass filter with random radius	Used in SimCLR contrastive pretraining
Random rotation	Rotate by small angle	±15° for natural images
Random erasing	Overwrite a random rectangle with zeros or noise	Closely related to CutOut

These augmentations are label-preserving — a flipped cat is still a cat. They extend the invariances learned by the model.
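As a minimal sketch of two label-preserving transforms from the table, the NumPy functions below apply a random crop followed by a random horizontal flip. The function names and the random stand-in image are assumptions for illustration; a real pipeline would normally use a library's transform classes instead.

```python
import numpy as np

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror an (H, W, C) image left-right with probability p."""
    if rng.random() < p:
        return image[:, ::-1, :]
    return image

def random_crop(image, rng, size):
    """Cut a random size x size window out of an (H, W, C) image."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - size + 1)    # upper bound is exclusive
    left = rng.integers(0, width - size + 1)
    return image[top:top + size, left:left + size, :]

def augment(image, rng, crop_size=224):
    """Apply crop then flip; the label is left untouched."""
    return random_horizontal_flip(random_crop(image, rng, crop_size), rng)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))   # stand-in for a real 256x256 RGB image
view_a = augment(image, rng)
view_b = augment(image, rng)        # a second, independently drawn view
```

Calling `augment` once per example per epoch is what produces a fresh variation each time the image is visited.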

Mixup

Mixup (Zhang et al., 2018) interpolates between two training examples and their labels:

$\tilde{x} = \lambda x_i + (1-\lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda) y_j$

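where $y_i$ and $y_j$ are one-hot label vectors and $\lambda \sim \text{Beta}(\alpha, \alpha)$ with typical $\alpha \in [0.2, 0.4]$. Here $\lambda$ is the mixing coefficient, unrelated to the L2 strength in the early stopping section.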

Key properties:

Because $\lambda$ is almost never exactly 0 or 1, the interpolated example almost never coincides with either training example; this encourages the model to behave linearly between training points
The soft (blended) labels $\tilde{y}$ regularize output confidence: assigning probability 1 to any single class is penalized on a blended example
Reduces undesirable oscillations in predictions away from the training examples and is reported to improve calibration

Why soft labels regularize: Under cross-entropy with hard labels, the loss keeps decreasing as the logit of the correct class grows, so the model can become arbitrarily confident. With a soft target, the loss is minimized when the predicted probabilities match the blend ratio, which discourages overconfident predictions.
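The interpolation can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: drawing a single $\lambda$ per batch and pairing each example with a shuffled copy of the same batch are implementation choices assumed here, and `mixup_batch` is a hypothetical name.

```python
import numpy as np

def mixup_batch(x, y_onehot, rng, alpha=0.2):
    """Blend each example with a randomly chosen partner from the same batch.

    x: (B, ...) float inputs; y_onehot: (B, K) one-hot labels.
    One lambda is drawn for the whole batch (an assumption of this sketch).
    """
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))           # partner index j for each i
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.random((8, 32, 32, 3))               # stand-in image batch
labels = rng.integers(0, 10, size=8)
y = np.eye(10)[labels]                       # one-hot labels, 10 classes
x_mix, y_mix, lam = mixup_batch(x, y, rng)
```

Each row of `y_mix` still sums to 1, so it can be used directly as the target of a cross-entropy loss that accepts probability vectors.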

CutOut

CutOut (DeVries & Taylor, 2017) zeros out a randomly located square region of a training image:

Mask size: dataset-dependent; typically around 16×16 pixels on small images such as 32×32 CIFAR-10, and larger on higher-resolution inputs
Labels unchanged: the label stays a hard one-hot vector
Effect: forces the model to use the wider context of the image rather than relying only on the most discriminative region (e.g., a dog's face instead of the whole dog)

Analogous to dropout on the input space — creates robustness to occlusion.
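A minimal NumPy sketch of the masking step follows. It assumes the square is centered on a uniformly random pixel and clipped at the image border, so the masked area can be smaller than `size × size`; the function name `cutout` and the stand-in image are illustrative.

```python
import numpy as np

def cutout(image, rng, size=16):
    """Zero a size x size square centered at a random pixel of an (H, W, C) image.

    The square is clipped at the borders, so the masked area can be smaller.
    """
    height, width = image.shape[:2]
    cy = rng.integers(0, height)
    cx = rng.integers(0, width)
    top, bottom = max(cy - size // 2, 0), min(cy + size // 2, height)
    left, right = max(cx - size // 2, 0), min(cx + size // 2, width)
    out = image.copy()                      # leave the original untouched
    out[top:bottom, left:right, :] = 0.0
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3)) + 1.0       # strictly positive stand-in image
masked = cutout(image, rng, size=16)
```

The label passed to the loss is the same one-hot vector as for the unmasked image.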

CutMix

CutMix (Yun et al., 2019) combines CutOut and Mixup: replace the cut region with a patch from a different training image, then blend the labels proportionally to the area:

$\tilde{x} = \mathbf{M} \odot x_i + (\mathbf{1}-\mathbf{M}) \odot x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda) y_j, \quad \lambda = \frac{|\mathbf{M}|}{HW}$

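where $\mathbf{M}$ is a binary mask (1 where pixels come from $x_i$, 0 inside the pasted patch), $|\mathbf{M}|$ is its number of ones, $H$ and $W$ are the image height and width, and $\lambda$ is the fraction of the image from $x_i$.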

Vs. Mixup: pixel values are not interpolated (no ghosting). Vs. CutOut: the masked region is filled with real image content rather than zeros. In the experiments reported by Yun et al. (2019), CutMix outperforms both on image classification benchmarks.
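The following NumPy sketch builds the mask and the area-weighted label for one pair of images. It assumes the target ratio is drawn from $\text{Beta}(\alpha, \alpha)$ and the box is clipped at the border, after which $\lambda$ is recomputed from the actual mask so that it equals $|\mathbf{M}|/(HW)$; `cutmix_pair` and the constant stand-in images are hypothetical.

```python
import numpy as np

def cutmix_pair(x_i, x_j, y_i, y_j, rng, alpha=1.0):
    """Paste a random box from x_j into x_i and blend the labels by area.

    x_i, x_j: (H, W, C) images; y_i, y_j: one-hot label vectors.
    """
    height, width = x_i.shape[:2]
    lam = rng.beta(alpha, alpha)                  # target fraction kept from x_i
    box_h = int(height * np.sqrt(1.0 - lam))
    box_w = int(width * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, height), rng.integers(0, width)
    top, bottom = max(cy - box_h // 2, 0), min(cy + box_h // 2, height)
    left, right = max(cx - box_w // 2, 0), min(cx + box_w // 2, width)

    mask = np.ones((height, width, 1))            # M: 1 keeps x_i, 0 takes x_j
    mask[top:bottom, left:right, :] = 0.0
    x_mix = mask * x_i + (1.0 - mask) * x_j
    lam = mask.sum() / (height * width)           # actual |M| / (HW) after clipping
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x_i, x_j = np.zeros((32, 32, 3)), np.ones((32, 32, 3))   # easy-to-inspect stand-ins
y_i, y_j = np.eye(10)[3], np.eye(10)[7]
x_mix, y_mix, lam = cutmix_pair(x_i, x_j, y_i, y_j, rng)
```

Because `x_i` is all zeros and `x_j` all ones in this toy setup, the mean of `x_mix` equals the pasted fraction $1 - \lambda$, which makes the label weights easy to verify.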

RandAugment

RandAugment (Cubuk et al., 2020) simplifies augmentation policy search: randomly apply $N$ augmentation operations from a fixed library (shear, translate, rotate, solarize, equalize, etc.), each at magnitude $M$:

Two hyperparameters only: $N$ (number of ops, typically 2) and $M$ (magnitude, 1–10)
Replaces AutoAugment's expensive search with a simple grid search or default values
Near-AutoAugment accuracy with trivial implementation

It is a common default for image classification augmentation when no domain-specific policy is available.
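The selection logic is short enough to sketch directly. The four operations below are simplified, hypothetical stand-ins written for float images in $[0, 1]$, not the operation library from the paper; only the structure (pick $N$ operations at random, apply each at the shared magnitude $M$) is the point of the example.

```python
import numpy as np

# Hypothetical stand-in operations on float images in [0, 1], shape (H, W, C).
# Each takes a strength in [0, 1]; the real library uses shear, equalize, etc.
def brightness(image, strength):
    return np.clip(image + 0.5 * strength, 0.0, 1.0)

def contrast(image, strength):
    mean = image.mean()
    return np.clip(mean + (1.0 + strength) * (image - mean), 0.0, 1.0)

def solarize(image, strength):
    threshold = 1.0 - strength            # stronger -> more pixels inverted
    return np.where(image >= threshold, 1.0 - image, image)

def translate_x(image, strength):
    shift = int(round(strength * 0.3 * image.shape[1]))
    return np.roll(image, shift, axis=1)  # wraps around; a simplification

OPS = [brightness, contrast, solarize, translate_x]

def rand_augment(image, rng, num_ops=2, magnitude=5, max_magnitude=10):
    """Apply `num_ops` randomly chosen ops, all at the same magnitude."""
    strength = magnitude / max_magnitude
    for index in rng.integers(0, len(OPS), size=num_ops):
        image = OPS[index](image, strength)
    return image

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))           # stand-in image
augmented = rand_augment(image, rng, num_ops=2, magnitude=5)
```

In practice a library implementation is preferable; torchvision, for example, provides `transforms.RandAugment` with `num_ops` and `magnitude` arguments.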

References

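DeVries & Taylor (2017). Improved Regularization of Convolutional Neural Networks with Cutout.

Zhang et al. (2018). mixup: Beyond Empirical Risk Minimization.

Yun et al. (2019). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.

Cubuk et al. (2020). RandAugment: Practical Automated Data Augmentation with a Reduced Search Space.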
