Supplement · Loss Functions

Regression Losses: L1, MSE, Huber, SmoothL1

15 min read
By the end of this reading you will be able to:
  • Derive the gradient of L1Loss and MSELoss and explain why MSE penalises large errors more severely than MAE
  • Apply HuberLoss with a given delta and describe the quadratic-to-linear transition at the boundary
  • Compare L1, MSE, Huber, and SmoothL1 losses in terms of gradient magnitude as a function of error magnitude and identify which is robust to outliers
  • Select among the four regression losses given the noise distribution of the target variable

Overview

Regression losses measure the discrepancy between a real-valued prediction $x_i$ and a real-valued target $y_i$. They differ in how they penalize large errors, a choice that determines robustness to outliers and gradient behaviour.


nn.L1Loss — Mean Absolute Error

$$\mathcal{L}_{\text{L1}} = \frac{1}{N} \sum_{i=1}^N |x_i - y_i|$$

The gradient of the absolute value is simply $\pm 1$ (it is undefined exactly at zero, but PyTorch uses a subgradient of 0 there):

$$\frac{\partial \mathcal{L}_{\text{L1}}}{\partial x_i} = \frac{1}{N} \operatorname{sign}(x_i - y_i)$$

Because the gradient does not grow with error magnitude, L1 is robust to outliers — a single hugely wrong prediction contributes gradient ±1/N, same as a small error.

import torch
import torch.nn as nn

loss = nn.L1Loss()  # default reduction is 'mean'
output = loss(torch.tensor([3.0, 1.0]), torch.tensor([1.0, 3.0]))
# (|3-1| + |1-3|) / 2 = (2 + 2) / 2 → output = 2.0
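
The gradient formula above can be checked by hand without PyTorch. The following is a minimal sketch in plain Python; the helper names `l1_loss`, `sign`, and `l1_grad` are illustrative assumptions, not part of any library. It shows that an outlier residual of 100 receives the same gradient magnitude as a residual of 2.

```python
def l1_loss(preds, targets):
    """Mean absolute error, following the L1 formula above."""
    n = len(preds)
    return sum(abs(x - y) for x, y in zip(preds, targets)) / n


def sign(v):
    # Subgradient convention used in the text: sign(0) = 0.
    return (v > 0) - (v < 0)


def l1_grad(preds, targets):
    """Gradient of the mean absolute error with respect to each prediction."""
    n = len(preds)
    return [sign(x - y) / n for x, y in zip(preds, targets)]


preds = [3.0, 1.0, 100.0, 2.0]   # the third prediction is a large outlier
targets = [1.0, 3.0, 0.0, 2.0]   # the last prediction is exactly right

loss_value = l1_loss(preds, targets)
grads = l1_grad(preds, targets)

print(loss_value)  # (2 + 2 + 100 + 0) / 4 = 26.0
print(grads)       # [0.25, -0.25, 0.25, 0.0]
```

The outlier dominates the loss value, but its gradient entry (0.25) is no larger than that of the small errors.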

When to use: Robust regression; super-resolution; when outliers in the target set should not dominate training.


nn.MSELoss — Mean Squared Error

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^N (x_i - y_i)^2$$

The gradient grows linearly with the residual:

$$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial x_i} = \frac{2}{N}(x_i - y_i)$$

This makes MSE sensitive to outliers: a prediction that is 10 units off contributes 100× as much to the loss, and 10× as much gradient, as one that is 1 unit off. It also means large errors are corrected faster, which is an advantage when the data is clean.

import torch
import torch.nn as nn

loss = nn.MSELoss()  # default reduction is 'mean'
output = loss(torch.tensor([3.0, 1.0]), torch.tensor([1.0, 3.0]))
# ((3-1)² + (1-3)²) / 2 = (4 + 4) / 2 → output = 4.0
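
To make the linear growth of the gradient concrete, here is a small sketch in plain Python that evaluates the MSE gradient formula and compares it against a central finite difference. The helper names (`mse_loss`, `mse_grad`, `numeric_grad`) and the step size `eps` are assumptions chosen for illustration.

```python
def mse_loss(preds, targets):
    """Mean squared error, following the MSE formula above."""
    n = len(preds)
    return sum((x - y) ** 2 for x, y in zip(preds, targets)) / n


def mse_grad(preds, targets):
    """Analytic gradient: (2 / N) * (x_i - y_i)."""
    n = len(preds)
    return [2.0 * (x - y) / n for x, y in zip(preds, targets)]


def numeric_grad(loss_fn, preds, targets, i, eps=1e-6):
    """Central finite difference with respect to preds[i]."""
    up = list(preds)
    up[i] += eps
    down = list(preds)
    down[i] -= eps
    return (loss_fn(up, targets) - loss_fn(down, targets)) / (2 * eps)


preds = [1.0, 10.0]     # residuals of 1 and 10
targets = [0.0, 0.0]

loss_value = mse_loss(preds, targets)   # (1 + 100) / 2 = 50.5
grads = mse_grad(preds, targets)        # [1.0, 10.0]
approx = numeric_grad(mse_loss, preds, targets, i=1)

print(loss_value, grads, round(approx, 4))
```

The residual of 10 contributes 100 to the sum of squares against 1 for the small residual, while its gradient entry is 10 times larger.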

When to use: Standard regression; audio synthesis; when all targets are trusted (no outliers).
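
One way to see the difference in outlier sensitivity is to ask which single constant prediction minimizes each loss on a small target set. MSE is minimized by the mean and L1 by the median. The sketch below checks this by brute force over a grid of candidate constants; the data values and the grid spacing are made up for illustration.

```python
import statistics

targets = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last target is an outlier


def mse_for_constant(c):
    """MSE when every prediction equals the constant c."""
    return sum((c - t) ** 2 for t in targets) / len(targets)


def l1_for_constant(c):
    """Mean absolute error when every prediction equals the constant c."""
    return sum(abs(c - t) for t in targets) / len(targets)


# Candidate constants 0.0, 0.5, ..., 100.0 (hypothetical search grid).
candidates = [i * 0.5 for i in range(201)]

best_for_mse = min(candidates, key=mse_for_constant)
best_for_l1 = min(candidates, key=l1_for_constant)

print(best_for_mse, statistics.mean(targets))    # 22.0 22.0
print(best_for_l1, statistics.median(targets))   # 3.0 3.0
```

The single outlier drags the MSE-optimal prediction to 22, far from the bulk of the data, while the L1-optimal prediction stays at 3.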


nn.HuberLoss — Quadratic Near Zero, Linear Far Away

Huber loss introduces a threshold $\delta > 0$ that separates the quadratic region from the linear region:

$$\ell_i = \begin{cases} \frac{1}{2}(x_i - y_i)^2 & \text{if } |x_i - y_i| \le \delta \\ \delta \left(|x_i - y_i| - \frac{\delta}{2}\right) & \text{otherwise} \end{cases}$$

For small errors ($|x - y| \le \delta$) it behaves like MSE: smooth gradient, easy convergence. For large errors ($|x - y| > \delta$) it behaves like L1: gradient capped at $\delta$, outlier-robust.

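At $|x_i - y_i| = \delta$ the two pieces agree in value ($\delta^2/2$) and in slope ($\pm\delta$), so the loss is continuous and differentiable everywhere and gradient-based optimizers see no jump in the gradient at the boundary.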
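
For reference, differentiating each branch of the piecewise definition with respect to $x_i$ gives the per-element gradient below (before the $1/N$ factor from mean reduction). This is a worked derivation from the formula above, not a statement about PyTorch internals.

```latex
\frac{\partial \ell_i}{\partial x_i} =
\begin{cases}
  x_i - y_i & \text{if } |x_i - y_i| \le \delta \\
  \delta \, \operatorname{sign}(x_i - y_i) & \text{otherwise}
\end{cases}
```

At $|x_i - y_i| = \delta$ the first branch gives $\pm\delta$, which equals the second branch, so the gradient has no jump there.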

import torch
import torch.nn as nn

loss = nn.HuberLoss(delta=1.0)  # delta=1.0 is the default
input  = torch.tensor([0.5, 2.0])
target = torch.tensor([0.0, 0.0])
output = loss(input, target)
# |0.5-0| = 0.5 ≤ 1 → 0.5 * 0.5² = 0.125
# |2.0-0| = 2.0 > 1 → 1 * (2.0 - 0.5) = 1.5
# mean = (0.125 + 1.5) / 2 → output = 0.8125
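
The same arithmetic can be reproduced with a hand-written version of the piecewise formula. This is a sketch in plain Python under the definitions given in this section; `huber` and `huber_grad` are illustrative names, and they operate on a single residual $x_i - y_i$ rather than on tensors.

```python
def huber(error, delta=1.0):
    """Per-element Huber loss for a residual error = x - y."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2          # quadratic region
    return delta * (a - 0.5 * delta)     # linear region


def huber_grad(error, delta=1.0):
    """Per-element gradient with respect to the prediction."""
    if abs(error) <= delta:
        return error                     # grows linearly with the residual
    return delta if error > 0 else -delta  # capped at +/- delta


errors = [0.5, 2.0]
per_element = [huber(e) for e in errors]            # [0.125, 1.5]
mean_loss = sum(per_element) / len(per_element)     # 0.8125

# Value and slope on either side of the boundary |error| = delta.
just_inside = (huber(1.0), huber_grad(1.0))
just_outside = (huber(1.0 + 1e-9), huber_grad(1.0 + 1e-9))

print(per_element, mean_loss)
print(just_inside, just_outside)
```

The two boundary tuples agree to within rounding, which is the continuity of value and slope described above.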

When to use: Reinforcement learning value functions; regression with occasional outliers; anywhere you want L2 smoothness but L1 tail behaviour.


nn.SmoothL1Loss — Faster R-CNN Variant

SmoothL1 uses a similar piecewise formula but with a different scaling via parameter $\beta$:

$$\ell_i = \begin{cases} \frac{1}{2\beta}(x_i - y_i)^2 & \text{if } |x_i - y_i| < \beta \\ |x_i - y_i| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$

With $\beta = 1$ this produces the same values as Huber with $\delta = 1$. For other values, each branch is the corresponding Huber branch (with $\delta = \beta$) divided by $\beta$, so SmoothL1 equals Huber loss scaled by $1/\beta$. As a result, the gradient in the linear region always has magnitude 1 rather than $\delta$. PyTorch's default is $\beta = 1$, which matches the threshold used in the Fast/Faster R-CNN formulation.

import torch.nn as nn
loss = nn.SmoothL1Loss(beta=1.0)  # beta=1.0 is the default
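
The relationship between the two piecewise formulas in this reading can be checked numerically. The sketch below implements both per-element formulas in plain Python (the function names are illustrative assumptions) and confirms that, for the same threshold, SmoothL1 equals Huber divided by $\beta$.

```python
import math


def huber(error, delta):
    """Per-element Huber loss, as defined in the Huber section."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2
    return delta * (a - 0.5 * delta)


def smooth_l1(error, beta):
    """Per-element SmoothL1 loss, as defined in this section (beta > 0)."""
    a = abs(error)
    if a < beta:
        return 0.5 * error ** 2 / beta
    return a - 0.5 * beta


# Compare the two losses over a few thresholds and residuals.
checks = []
for beta in (0.5, 1.0, 2.0):
    for error in (0.1, 1.0, 3.0):
        s = smooth_l1(error, beta)
        h = huber(error, beta)
        checks.append((beta, error, s, h, math.isclose(s, h / beta)))

for row in checks:
    print(row)
```

With `beta = 1.0` the two columns are identical; for other thresholds they differ only by the factor $1/\beta$.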

When to use: Object detection bounding box regression (Faster R-CNN, SSD); historically the standard loss for that domain.


Comparison: Gradient Magnitude vs. Error

| Error $|x - y|$ | L1 gradient | MSE gradient | Huber gradient ($\delta = 1$) |
|---|---|---|---|
| 0.1 | 1/N | 0.2/N | 0.1/N |
| 1.0 | 1/N | 2.0/N | 1.0/N |
| 5.0 | 1/N | 10.0/N | 1.0/N (capped) |
| 10.0 | 1/N | 20.0/N | 1.0/N (capped) |

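L1 has a constant gradient magnitude of 1/N regardless of the error. MSE scales the gradient linearly with the error. Huber's gradient grows linearly (the loss is quadratic) up to $\delta$ and is capped at $\delta/N$ beyond it, combining MSE's smooth behaviour near zero with L1's bounded gradient on outliers.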
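
The gradient magnitudes in this comparison follow directly from the gradient formulas in each section. As a sketch, the plain-Python snippet below recomputes them per element (that is, with the $1/N$ factor left out); the function names are illustrative assumptions.

```python
def l1_grad_mag(error):
    """|d/dx |x - y||: constant 1 away from zero, 0 at zero."""
    return 1.0 if error != 0 else 0.0


def mse_grad_mag(error):
    """|d/dx (x - y)^2| = 2 * |x - y|."""
    return 2.0 * abs(error)


def huber_grad_mag(error, delta=1.0):
    """Linear in the residual up to delta, then capped at delta."""
    return min(abs(error), delta)


rows = [
    (e, l1_grad_mag(e), mse_grad_mag(e), huber_grad_mag(e))
    for e in (0.1, 1.0, 5.0, 10.0)
]

print("error  L1   MSE   Huber")
for e, g_l1, g_mse, g_huber in rows:
    print(f"{e:5.1f}  {g_l1:.1f}  {g_mse:4.1f}  {g_huber:.1f}")
```

Dividing each entry by N recovers the gradients of the mean-reduced losses.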