Supplement · Regularization

Overfitting and the Bias-Variance Tradeoff

By the end of this reading you will be able to:
  • Distinguish underfitting (high bias) from overfitting (high variance) using training and validation loss curves, and identify the symptoms of each in practice
  • Explain the bias-variance decomposition of expected test error and state what each term measures
  • Define regularization precisely — any technique that reduces generalization error, possibly at the cost of higher training error — and explain why this encompasses techniques that do not add an explicit penalty term
  • Identify the double-descent phenomenon and explain why modern overparameterized models can interpolate the training data and still generalize well

The Fundamental Problem

A model trained to minimize training loss has no direct incentive to do well on unseen data. The gap between training performance and test performance is the generalization gap, and closing it is the central concern of regularization.

Two failure modes sit at opposite ends:

Underfitting (high bias): The model is too simple, or too little trained, to capture the true structure in the data. Training loss and test loss are both high. More capacity or longer training usually helps.

Overfitting (high variance): The model has memorized the training data, including its noise. Training loss is low but test loss is high — the learned function does not generalize.
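The two failure modes can be seen on a toy problem. The sketch below is illustrative only: the sine target, noise level, sample sizes, and polynomial degrees are all assumptions chosen for the demo, not values from this reading. It fits polynomials of increasing degree to a small noisy sample and reports training and test error for each.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (a hypothetical toy target)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set, easy to memorize
x_test, y_test = make_data(500)     # large held-out set

def mse(poly, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((poly(x) - y) ** 2))

results = {}
for degree in (1, 4, 12):           # too simple, moderate, very flexible
    poly = Polynomial.fit(x_train, y_train, deg=degree)  # least-squares fit
    results[degree] = (mse(poly, x_train, y_train), mse(poly, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Training error can only fall as the degree grows, because each model family contains the previous one. Test error typically falls and then rises: the degree-1 fit tends to show both errors high, while the degree-12 fit tends to show a very low training error next to a much larger test error.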


Reading the Loss Curves

The clearest diagnostic tool is plotting training loss and validation loss against training steps:

Loss
 |
 |\  training loss
 | \ \___________
 |  \              validation loss
 |   \___    _____/
 |       \__/
 |         ↑
 |     optimal early stop point
 +----------------------------→ Steps
  • Training loss generally keeps decreasing, because the optimizer minimizes it directly (mini-batch noise can make it fluctuate from step to step)
  • Validation loss decreases, then rises; its minimum marks the onset of overfitting and is the natural early-stopping point
  • The gap between the two curves is the generalization gap

A large gap with low training loss indicates overfitting. Both losses high and close together indicates underfitting.
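These reading rules can be turned into a rough heuristic. The function below is a sketch, not a standard API: the name `diagnose` and the thresholds `high_loss` and `gap_tol` are hypothetical, and suitable values depend entirely on the task and loss scale.

```python
def diagnose(train_losses, val_losses, high_loss=0.5, gap_tol=0.1):
    """Crude curve-reading heuristic; both thresholds are assumed, task-specific values."""
    # Epoch with the lowest validation loss: the natural early-stopping point.
    best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i])
    # Generalization gap at the end of training.
    final_gap = val_losses[-1] - train_losses[-1]

    if train_losses[-1] > high_loss and final_gap < gap_tol:
        verdict = "underfitting"    # both curves high and close together
    elif final_gap >= gap_tol and best_epoch < len(val_losses) - 1:
        verdict = "overfitting"     # validation loss has turned upward, gap is large
    else:
        verdict = "ok"
    return {"best_epoch": best_epoch, "final_gap": final_gap, "verdict": verdict}

# Made-up curves for illustration only.
overfit = diagnose([1.0, 0.6, 0.4, 0.25, 0.15, 0.08],
                   [1.1, 0.7, 0.55, 0.5, 0.6, 0.75])
underfit = diagnose([1.0, 0.9, 0.85, 0.82],
                    [1.05, 0.95, 0.9, 0.86])
print(overfit["verdict"], overfit["best_epoch"])
print(underfit["verdict"], underfit["best_epoch"])
```

In practice the curves are noisy, so a heuristic like this is best applied to smoothed losses and treated as a prompt to look at the plot, not as a verdict.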


Bias-Variance Decomposition

For a squared loss regression problem, the expected test error decomposes as:

$$\mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr] = \underbrace{\text{Bias}[\hat{f}(x)]^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to training set}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$

  • Bias: how far the average prediction (over all possible training sets) is from the true answer. High when the model family is too restricted.
  • Variance: how much predictions vary across different training sets. High when the model is too sensitive to the specific data seen — overfitting.
  • Irreducible noise $\sigma^2$: inherent noise in the labels; cannot be reduced by any model.

The classical tradeoff: increasing model complexity reduces bias but increases variance. The goal is to find the complexity that minimizes the sum of squared bias and variance.
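The bias and variance terms are averages over many possible training sets, so they can be estimated by simulation. The sketch below assumes a toy setup that is not part of this reading (a sine target, Gaussian label noise with standard deviation `SIGMA`, polynomial models). It redraws the training set many times and measures how the average prediction and the spread of predictions change with polynomial degree.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
SIGMA = 0.3                              # label-noise standard deviation (assumed)
x_grid = np.linspace(0.05, 0.95, 50)     # fixed test inputs

def true_f(x):
    """Hypothetical ground-truth function for the simulation."""
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_train=30, n_trials=300):
    """Estimate bias^2 and variance of a polynomial fit, averaged over x_grid."""
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        # Draw a fresh training set and record the fitted model's predictions.
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = true_f(x) + rng.normal(0.0, SIGMA, size=n_train)
        preds[t] = Polynomial.fit(x, y, deg=degree)(x_grid)
    mean_pred = preds.mean(axis=0)                        # average prediction
    bias_sq = float(np.mean((mean_pred - true_f(x_grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))          # spread across training sets
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, var = bias_variance(degree)
    print(f"degree={degree}  bias^2={b2:.4f}  variance={var:.4f}  "
          f"bias^2 + variance + sigma^2 = {b2 + var + SIGMA**2:.4f}")
```

The last printed column is the decomposition's estimate of expected test error. In this setup the low-degree model is dominated by bias and the high-degree model by variance.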


What Is Regularization?

A broad and useful definition (Goodfellow et al., 2016):

Regularization is any modification we make to a learning algorithm intended to reduce its generalization error but not its training error.

This encompasses far more than explicit penalty terms:

Technique               Mechanism
L2 weight penalty       Shrinks weights toward zero
Dropout                 Stochastic masking of activations
Early stopping          Stops before overfitting region
Data augmentation       Expands effective training set
Batch normalization     Implicit regularization via noise
Weight initialization   Constrains initial function class
SGD noise               Implicit bias toward flat minima

All of these aim to reduce generalization error, often at the cost of higher training error (or at least of minimizing training error less aggressively).
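The trade of training error for a more constrained model is easiest to see with an L2 penalty on linear regression. The sketch below uses the closed-form ridge solution on synthetic data; the problem sizes and the penalty strength `lam=5.0` are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 30                        # few samples relative to features (assumed toy sizes)
w_true = rng.normal(size=d)

def sample(n_rows):
    """Draw synthetic (X, y) pairs from a noisy linear model."""
    X = rng.normal(size=(n_rows, d))
    y = X @ w_true + rng.normal(scale=1.0, size=n_rows)
    return X, y

X, y = sample(n)
X_test, y_test = sample(1000)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w_ols = ridge_fit(X, y, lam=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=5.0)   # hypothetical penalty strength

print("train MSE   ols:", mse(w_ols, X, y), " ridge:", mse(w_ridge, X, y))
print("test MSE    ols:", mse(w_ols, X_test, y_test), " ridge:", mse(w_ridge, X_test, y_test))
print("weight norm ols:", np.linalg.norm(w_ols), " ridge:", np.linalg.norm(w_ridge))
```

Two outcomes are guaranteed: the penalized fit has training error at least as high as the unpenalized one, and its weight norm is no larger. Whether test error improves depends on the data and on `lam`, which is why the penalty strength is normally tuned on a validation set.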


Double Descent

The classical bias-variance tradeoff predicts a U-shaped test error curve — there is an optimal capacity, and going past it hurts. This was the prevailing view until the deep learning era revealed a more complex picture.

Double descent (Belkin et al., 2019): as model capacity increases past the interpolation threshold (where training loss reaches zero), test error can decrease again — even though the model perfectly memorizes the training data.

Test error
     |
     |   classical regime          modern regime (test error falls again)
     |\                  /\
     | \                /  \
     |  \              /    \
     |   \____________/      \____________
     +-------------------|------------------→ Model capacity
                         ↑
                   interpolation
                     threshold

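Why it happens: In the overparameterized regime, many functions interpolate the training data, and the training procedure determines which one is selected. Gradient-based optimizers such as SGD carry an implicit bias toward low-norm solutions, particularly from small initialization, and explicit weight decay pushes in the same direction. In linear least squares, gradient descent started from zero converges exactly to the minimum-norm interpolant; for deep networks the analogous claim is a working hypothesis rather than a theorem. For well-structured data, such low-norm interpolants often generalize well.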
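The linear case makes the idea of "many interpolants, one of minimum norm" concrete. The sketch below assumes a random underdetermined least-squares problem (the sizes and step count are arbitrary). It builds two weight vectors that both fit the data exactly, then checks that plain gradient descent started from zero lands on the minimum-norm one. This illustrates the linear setting only and does not demonstrate the behavior of deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                         # more parameters than samples (assumed toy sizes)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Minimum-L2-norm solution of the underdetermined system X w = y.
X_pinv = np.linalg.pinv(X)
w_min = X_pinv @ y

# A second interpolant: add a vector from the null space of X.
v = rng.normal(size=d)
null_component = v - X_pinv @ (X @ v)   # remove the row-space part of v
w_other = w_min + null_component

# Gradient descent on 0.5 * ||Xw - y||^2, started from zero.
lr = 1.0 / np.linalg.norm(X, 2) ** 2    # step size from the largest singular value
w_gd = np.zeros(d)
for _ in range(2000):
    w_gd -= lr * X.T @ (X @ w_gd - y)

print("residuals:", np.linalg.norm(X @ w_min - y), np.linalg.norm(X @ w_other - y))
print("norms: min-norm", np.linalg.norm(w_min), " other", np.linalg.norm(w_other))
print("GD distance to min-norm solution:", np.linalg.norm(w_gd - w_min))
```

Starting from zero keeps every gradient step inside the row space of `X`, which is why gradient descent never picks up a null-space component and ends at the minimum-norm interpolant.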

Practical implication: modern neural networks operate far in the overparameterized regime and still generalize. Classical regularization techniques remain valuable, but they work alongside this implicit regularization — not instead of it.


PyTorch and TensorFlow

The bias-variance tradeoff is most directly observable through learning curves — plotting train and validation loss over epochs. This is the first diagnostic to run when a model overfits or underfits.

PyTorch — minimal training loop with learning curve tracking:

import torch
import torch.nn as nn

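# Assumes X_train, y_train, X_val, y_val are float tensors already in scope,
# with inputs of shape (N, 64) and targets of shape (N, 1) to match the model output
model     = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))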
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

train_losses, val_losses = [], []

for epoch in range(100):
    # Training step
    model.train()
    train_pred = model(X_train)
    train_loss = criterion(train_pred, y_train)
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    # Validation step — no gradients, eval mode (matters for Dropout/BatchNorm)
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val)

    train_losses.append(train_loss.item())
    val_losses.append(val_loss.item())

# Diagnose: large gap between train_losses and val_losses → overfitting
# Both curves high and similar → underfitting (high bias)
# Both curves low and similar → good generalization

TensorFlow / Keras:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),  # explicit input layer instead of input_shape on Dense
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# history.history['loss'] and history.history['val_loss'] are the learning curves
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    verbose=0,
)

train_losses = history.history['loss']
val_losses   = history.history['val_loss']
# val_loss rising while train loss keeps falling (a widening gap) → overfitting; add regularization or stop earlier
# val_loss ≈ train_loss but both high → underfitting; use a larger model