Supplement · Regularization

Overfitting and the Bias-Variance Tradeoff

By the end of this reading you will be able to:

Distinguish underfitting (high bias) from overfitting (high variance) using training and validation loss curves, and identify the symptoms of each in practice
Explain the bias-variance decomposition of expected test error and state what each term measures
Define regularization precisely — any technique that reduces generalization error, possibly at the cost of higher training error — and explain why this encompasses techniques that do not add an explicit penalty term
Identify the double-descent phenomenon and explain why modern overparameterized models can interpolate the training data and still generalize well

The Fundamental Problem

A model trained to minimize training loss has no direct incentive to do well on unseen data. The gap between training performance and test performance is the generalization gap, and closing it is the central concern of regularization.

Two failure modes sit at opposite ends:

Underfitting (high bias): The model is too simple to capture the true structure in the data. Training loss and test loss are both high, and the gap between them is small. More capacity or longer training usually helps.

Overfitting (high variance): The model has memorized the training data, including its noise. Training loss is low but test loss is high — the learned function does not generalize.
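Both failure modes can be reproduced on a toy problem. The sketch below is an illustration, not part of the original reading: the noisy sine target, the 20 training points, and the polynomial degrees 1, 3, and 15 are all assumed choices, and the fit is plain least squares with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (an illustrative target)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(2000)

def fit_and_score(degree):
    # Least-squares polynomial fit in the Chebyshev basis
    # (better conditioned than raw powers of x).
    model = np.polynomial.Chebyshev.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse

# degree 1: too simple; degree 3: reasonable; degree 15: nearly one parameter per point
results = {d: fit_and_score(d) for d in (1, 3, 15)}
for d, (train_mse, test_mse) in results.items():
    print(f"degree {d:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

With these settings, degree 1 should show both errors high (underfitting), while degree 15 pushes training error down and test error up (overfitting). The exact numbers depend on the seed.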

Reading the Loss Curves

The clearest diagnostic tool is plotting training loss and validation loss against training steps:

Loss
 |
 |\  training loss
 | \ \___________
 |  \              validation loss
 |   \___    _____/
 |       \__/
 |         ↑
 |     optimal early stop point
 +----------------------------→ Steps

Training loss trends downward because the optimizer minimizes it directly, although with minibatch noise it can fluctuate from step to step
Validation loss decreases, then rises; its minimum marks the onset of overfitting and is the natural early stopping point
The gap between the two curves is the generalization gap

A large gap with low training loss indicates overfitting. Both losses high with a small gap indicates underfitting.
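The curve-reading rules above can be written down as a small heuristic. This is a sketch under stated assumptions: `diagnose` is a hypothetical helper, and its `high_loss` and `gap_tol` thresholds are placeholders that would need tuning for a real task and loss scale.

```python
def diagnose(train_losses, val_losses, high_loss=0.5, gap_tol=0.1):
    """Return (best_step, final_gap, verdict) from two recorded loss curves.

    The thresholds are illustrative assumptions, not universal constants.
    """
    # Step with the lowest validation loss: the early stopping point.
    best_step = min(range(len(val_losses)), key=val_losses.__getitem__)
    final_train, final_val = train_losses[-1], val_losses[-1]
    gap = final_val - final_train  # generalization gap at the end of training

    if final_train > high_loss and gap < gap_tol:
        verdict = "underfitting"   # both losses high, small gap
    elif gap >= gap_tol and final_val > val_losses[best_step]:
        verdict = "overfitting"    # large gap, validation loss has turned upward
    else:
        verdict = "ok"
    return best_step, gap, verdict

# Made-up curves for demonstration only.
overfit = diagnose([1.0, 0.6, 0.4, 0.25, 0.15, 0.08],
                   [1.1, 0.7, 0.55, 0.5, 0.6, 0.75])
underfit = diagnose([1.0, 0.9, 0.85, 0.84],
                    [1.02, 0.93, 0.88, 0.87])
print(overfit)
print(underfit)
```

In practice the same function can be applied to the `train_losses` and `val_losses` lists recorded during training.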

Bias-Variance Decomposition

For squared-loss regression with $y = f(x) + \varepsilon$, the expected test error at a point $x$, averaged over training sets and label noise, decomposes as:

$\mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr] = \underbrace{\text{Bias}[\hat{f}(x)]^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to training set}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$

Bias: how far the average prediction (over all possible training sets) is from the true answer. High when the model family is too restricted.
Variance: how much predictions vary across different training sets. High when the model is too sensitive to the specific data seen — overfitting.
Irreducible noise $\sigma^2$ : inherent noise in the labels; cannot be reduced by any model.
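For readers who want the missing steps, here is a short derivation. It assumes $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$, $\text{Var}[\varepsilon] = \sigma^2$, and $\varepsilon$ independent of the training set; the arguments $(x)$ are dropped for brevity, and $\mathbb{E}$ averages over both training sets and noise.

$$
\begin{aligned}
\mathbb{E}\bigl[(y - \hat{f})^2\bigr]
&= \mathbb{E}\bigl[(f - \hat{f} + \varepsilon)^2\bigr] \\
&= \mathbb{E}\bigl[(f - \hat{f})^2\bigr] + 2\,\mathbb{E}\bigl[\varepsilon\,(f - \hat{f})\bigr] + \mathbb{E}\bigl[\varepsilon^2\bigr] \\
&= \mathbb{E}\bigl[(f - \hat{f})^2\bigr] + \sigma^2 \\
&= \bigl(f - \mathbb{E}[\hat{f}]\bigr)^2 + \mathbb{E}\bigl[(\hat{f} - \mathbb{E}[\hat{f}])^2\bigr] + \sigma^2 \\
&= \text{Bias}[\hat{f}]^2 + \text{Var}[\hat{f}] + \sigma^2
\end{aligned}
$$

The cross term in the second line vanishes because $\varepsilon$ has zero mean and is independent of $\hat{f}$. The fourth line adds and subtracts $\mathbb{E}[\hat{f}]$ inside the square; its cross term vanishes because $\mathbb{E}\bigl[\hat{f} - \mathbb{E}[\hat{f}]\bigr] = 0$.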

The classical tradeoff: increasing model complexity reduces bias but increases variance. The goal is to find the complexity that minimizes the sum of squared bias and variance.
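The tradeoff can be estimated numerically by refitting the same model class on many training sets. The sketch below makes simplifying assumptions that are not from the reading: a sine target, a fixed set of 30 inputs so that training sets differ only in their label noise, and polynomial models of degree 1, 3, and 9.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.3                                # label-noise std, so sigma^2 = 0.09
x_train = np.linspace(0.0, 1.0, 30)        # fixed design (assumption)
x_eval = np.linspace(0.05, 0.95, 50)       # points where predictions are compared
f_true = np.sin(2 * np.pi * x_eval)        # hypothetical true function

def bias2_and_variance(degree, n_repeats=300):
    """Monte Carlo estimate of squared bias and variance, averaged over x_eval."""
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        # A fresh training set: same inputs, new noise draw.
        y = np.sin(2 * np.pi * x_train) + rng.normal(0.0, SIGMA, size=x_train.size)
        model = np.polynomial.Chebyshev.fit(x_train, y, degree, domain=[0, 1])
        preds[r] = model(x_eval)
    mean_pred = preds.mean(axis=0)                 # average prediction over training sets
    bias2 = np.mean((mean_pred - f_true) ** 2)     # squared bias
    variance = np.mean(preds.var(axis=0))          # spread across training sets
    return bias2, variance

estimates = {d: bias2_and_variance(d) for d in (1, 3, 9)}
for d, (bias2, variance) in estimates.items():
    total = bias2 + variance + SIGMA ** 2
    print(f"degree {d}: bias^2 {bias2:.4f}, variance {variance:.4f}, expected error {total:.4f}")
```

As the degree grows, the squared-bias estimate should shrink while the variance estimate grows, which is the classical tradeoff in miniature.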

What Is Regularization?

A broad and useful definition (Goodfellow et al., 2016):

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

This encompasses far more than explicit penalty terms:

Technique	Mechanism
L2 weight penalty	Shrinks weights toward zero
Dropout	Stochastic masking of activations
Early stopping	Stops before overfitting region
Data augmentation	Expands effective training set
Batch normalization	Implicit regularization via noise
Weight initialization	Constrains initial function class
SGD noise	Implicit bias toward flat minima

All of these reduce generalization error, often at the cost of higher training error (or at least not minimizing training error as aggressively).
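The first row of the table is easy to check numerically. The following sketch uses closed-form ridge regression as a stand-in for an L2 weight penalty; the synthetic data (30 samples, 25 features, noisy linear target) and the penalty strengths are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 30, 2000, 25
w_true = rng.normal(size=n_features)       # hypothetical ground-truth weights

def sample(n):
    X = rng.normal(size=(n, n_features))
    y = X @ w_true + rng.normal(0.0, 2.0, size=n)
    return X, y

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)

def ridge_fit(X, y, lam):
    # Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

results = {}
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_fit(X_train, y_train, lam)
    results[lam] = (mse(X_train, y_train, w), mse(X_test, y_test, w), np.linalg.norm(w))
    print(f"lambda {lam:6.1f}: train {results[lam][0]:.3f}, "
          f"test {results[lam][1]:.3f}, ||w|| {results[lam][2]:.3f}")
```

Training error can only go up as the penalty grows, and the weight norm can only go down. A moderate penalty should still lower test error on this setup, matching the definition: better generalization, not better training fit.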

Double Descent

The classical bias-variance tradeoff predicts a U-shaped test error curve — there is an optimal capacity, and going past it hurts. This was the prevailing view until the deep learning era revealed a more complex picture.

Double descent (Belkin et al., 2019): as model capacity increases past the interpolation threshold (where training loss reaches zero), test error can decrease again — even though the model perfectly memorizes the training data.

Test error
     |
     |   classical regime    modern regime
     |\                 /\
     | \               /  \
     |  \             /    \
     |   \           /      \___  test error decreases again
     |    \_________/           \_________
     +------------------|------------------→ Model capacity
                        ↑
                  interpolation
                    threshold
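A capacity sweep like the one in the diagram can be run on a small problem. This is a minimal sketch, assuming a random ReLU feature model whose width `p` plays the role of capacity, with output weights fit by least squares. The data, noise level, and widths are illustrative choices, and whether a clear peak appears near `p = n_train` depends on them and on the seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 5
w_true = rng.normal(size=d)                      # hypothetical linear target

def sample(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(0.0, 0.5, size=n)
    return X, y

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)
W_all = rng.normal(size=(d, 400)) / np.sqrt(d)   # fixed random first layer

def relu_features(X, p):
    # The first p random ReLU features; p acts as the capacity knob.
    return np.maximum(X @ W_all[:, :p], 0.0)

def fit_eval(p):
    Phi_train, Phi_test = relu_features(X_train, p), relu_features(X_test, p)
    # lstsq returns the minimum-norm solution when the system is underdetermined (p > n_train)
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    return train_mse, test_mse

curve = {p: fit_eval(p) for p in (5, 10, 20, 40, 80, 200, 400)}
for p, (train_mse, test_mse) in curve.items():
    print(f"p = {p:3d}: train MSE {train_mse:.2e}, test MSE {test_mse:.3f}")
```

Once `p` is well past `n_train = 40`, training error is zero up to rounding, so the model interpolates the noisy labels. Comparing the printed test errors around `p = 40` with those at larger widths shows whether the second descent appears in this configuration.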

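Why it happens: In the overparameterized regime, many functions interpolate the training data, so which one the optimizer selects matters. Gradient-based training from a small initialization is implicitly biased toward low-norm solutions: for linear least squares started at zero, gradient descent converges exactly to the minimum-norm interpolant, and explicit weight decay pushes in the same direction. For well-structured data, such low-norm interpolants often generalize well. For deep networks this picture is a useful heuristic rather than a complete theory.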
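The minimum-norm claim can be verified directly in the linear case. The sketch below assumes an underdetermined least-squares problem (10 equations, 50 unknowns, random data) and full-batch gradient descent started at zero; it makes no claim about deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                               # fewer equations than unknowns
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||Xw - y||^2, started from zero.
w = np.zeros(p)
lr = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / lambda_max(X^T X), a safe step size
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)

# Minimum-norm interpolant via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Another interpolant: add a component from the null space of X.
null_part = (np.eye(p) - np.linalg.pinv(X) @ X) @ rng.normal(size=p)
w_other = w_min_norm + null_part

print("GD matches min-norm solution:", np.allclose(w, w_min_norm, atol=1e-8))
print("norms (GD, other interpolant):", np.linalg.norm(w), np.linalg.norm(w_other))
```

Both `w` and `w_other` fit the training data exactly, but gradient descent from zero never leaves the row space of `X`, so it lands on the interpolant with the smallest norm.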

Practical implication: modern neural networks operate far in the overparameterized regime and still generalize. Classical regularization techniques remain valuable, but they work alongside this implicit regularization — not instead of it.

PyTorch and TensorFlow

The bias-variance tradeoff is most directly observable through learning curves — plotting train and validation loss over epochs. This is the first diagnostic to run when a model overfits or underfits.

PyTorch — minimal training loop with learning curve tracking:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder data so the loop runs as-is; substitute your own tensors.
# Targets have shape (N, 1) to match the model output and avoid silent broadcasting in MSELoss.
X_train, y_train = torch.randn(512, 64), torch.randn(512, 1)
X_val,   y_val   = torch.randn(128, 64), torch.randn(128, 1)

model     = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

train_losses, val_losses = [], []

for epoch in range(100):
    # Training step (full batch; the loss is measured before the weight update)
    model.train()
    train_pred = model(X_train)
    train_loss = criterion(train_pred, y_train)
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    # Validation step: no gradients, eval mode (matters for Dropout/BatchNorm)
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val)

    train_losses.append(train_loss.item())
    val_losses.append(val_loss.item())

# Diagnose: large gap between train_losses and val_losses → overfitting
# Both curves high and similar → underfitting (high bias)
# Both curves low and similar → good generalization

TensorFlow / Keras:

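import numpy as np
import tensorflow as tf

# Placeholder data so the example runs as-is; substitute your own arrays.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(512, 64)).astype("float32")
y_train = rng.normal(size=(512, 1)).astype("float32")
X_val   = rng.normal(size=(128, 64)).astype("float32")
y_val   = rng.normal(size=(128, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# history.history['loss'] and history.history['val_loss'] are the learning curves
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    verbose=0,
)

train_losses = history.history['loss']
val_losses   = history.history['val_loss']
# val_loss rising while loss keeps falling (a widening gap) → overfitting; add regularization or stop earlier
# val_loss ≈ loss but both high → underfitting; use a larger model or train longer
# A small, stable gap with val_loss slightly above loss is normal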
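Once validation loss is tracked each epoch, early stopping is a small addition to either loop. The helper below is a framework-agnostic sketch: `EarlyStopper`, its `patience` and `min_delta` parameters, and the validation curve are hypothetical and shown only to illustrate the logic.

```python
class EarlyStopper:
    """Patience-based early stopping on a stream of validation losses (illustrative helper)."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience          # epochs to wait without improvement
        self.min_delta = min_delta        # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the patience counter.
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation curve: improves for four epochs, then rises.
val_curve = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_curve):
    if stopper.should_stop(epoch, loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

In a real loop, the model weights from `best_epoch` would also be saved and restored after stopping. Keras ships this behavior as `tf.keras.callbacks.EarlyStopping` with `monitor='val_loss'`, `patience`, and `restore_best_weights=True`.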

References

Belkin, Hsu, Ma, and Mandal (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS 116(32).

Goodfellow, Bengio, and Courville (2016). Deep Learning. MIT Press.
