Supplement · Loss Functions

Regression Losses: L1, MSE, Huber, SmoothL1

15 min read

By the end of this reading you will be able to:

Derive the gradient of L1Loss and MSELoss and explain why MSE penalises large errors more severely than MAE
Apply HuberLoss with a given delta and describe the quadratic-to-linear transition at the boundary
Compare L1, MSE, Huber, and SmoothL1 losses in terms of gradient magnitude as a function of error magnitude and identify which is robust to outliers
Select among the four regression losses given the noise distribution of the target variable

Overview

Regression losses measure the discrepancy between a real-valued prediction $x_i$ and a real-valued target $y_i$. They differ in how they penalise large errors, a choice that determines both robustness to outliers and gradient behaviour.
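As a preview of how the four losses in this reading differ, the sketch below uses autograd to measure the per-element gradient magnitude at three residuals. The residual values, the `reduction="sum"` setting (chosen so no 1/N factor hides the per-element slope), and the names `losses`, `errors`, and `grad_mag` are illustrative assumptions, not part of the original material.

```python
import torch
import torch.nn as nn

# The four losses, summed rather than averaged so gradients are per-element.
losses = {
    "L1": nn.L1Loss(reduction="sum"),
    "MSE": nn.MSELoss(reduction="sum"),
    "Huber": nn.HuberLoss(reduction="sum", delta=1.0),
    "SmoothL1": nn.SmoothL1Loss(reduction="sum", beta=1.0),
}
errors = [0.1, 1.0, 10.0]  # illustrative residuals x - y (target is 0)

grad_mag = {}
for name, fn in losses.items():
    pred = torch.tensor(errors, requires_grad=True)
    fn(pred, torch.zeros(len(errors))).backward()
    grad_mag[name] = pred.grad.abs().tolist()

for name, mags in grad_mag.items():
    print(f"{name:9s}", [round(m, 3) for m in mags])
# L1 stays at 1 for every residual, MSE grows as 2 * |error|,
# Huber and SmoothL1 grow up to the threshold and are then capped at 1.
```

Only MSE lets the gradient keep growing with the residual, which is why it is the one loss here that is not robust to outliers.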

nn.L1Loss — Mean Absolute Error

$\mathcal{L}_{\text{L1}} = \frac{1}{N} \sum_{i=1}^N |x_i - y_i|$

The gradient of the absolute value is simply $\pm 1$ (it is undefined exactly at zero, but PyTorch uses a subgradient of 0 there):

$\frac{\partial \mathcal{L}_{\text{L1}}}{\partial x_i} = \frac{1}{N} \operatorname{sign}(x_i - y_i)$

Because the gradient does not grow with error magnitude, L1 is robust to outliers — a single hugely wrong prediction contributes gradient ±1/N, same as a small error.

import torch
import torch.nn as nn

loss = nn.L1Loss()
output = loss(torch.tensor([3.0, 1.0]), torch.tensor([1.0, 3.0]))
# |3-1| + |1-3| = 2 + 2 = 4 → mean = 2.0
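The gradient formula above can be checked with autograd. This is a minimal sketch; the three predictions (a small error, a huge error, and an exact match) are made-up values chosen to show that the gradient ignores error magnitude and is 0 at a zero residual.

```python
import torch
import torch.nn as nn

# A small error, a huge error, and an exact match.
pred = torch.tensor([1.5, 100.0, 2.0], requires_grad=True)
target = torch.tensor([1.0, 0.0, 2.0])

loss = nn.L1Loss()(pred, target)
loss.backward()

# Analytic gradient: sign(x_i - y_i) / N, with sign(0) = 0.
expected = torch.sign(pred.detach() - target) / pred.numel()

print(pred.grad)                             # approximately [1/3, 1/3, 0]
print(torch.allclose(pred.grad, expected))   # True
```

The error of 100 and the error of 0.5 receive the same gradient, 1/N.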

When to use: Robust regression; super-resolution; when outliers in the target set should not dominate training.

nn.MSELoss — Mean Squared Error

$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^N (x_i - y_i)^2$

The gradient grows linearly with the residual:

$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial x_i} = \frac{2}{N}(x_i - y_i)$

This makes MSE sensitive to outliers: a prediction that is 10 units off contributes 100× more to the loss, and 10× more gradient, than one that is 1 unit off (under MAE the loss ratio would be only 10× and the gradients would be equal). The flip side is that large errors are corrected faster, which is an advantage when the data is clean.

loss = nn.MSELoss()
output = loss(torch.tensor([3.0, 1.0]), torch.tensor([1.0, 3.0]))
# (3-1)² + (1-3)² = 4 + 4 = 8 → mean = 4.0
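The same autograd check works for MSE. The sketch below uses two illustrative residuals, 1 and 10, to show the 100× difference in per-element loss and the 10× difference in gradient; the variable names are assumptions for this example only.

```python
import torch
import torch.nn as nn

pred = torch.tensor([1.0, 10.0], requires_grad=True)
target = torch.tensor([0.0, 0.0])

# Per-element squared errors (no averaging): 1 and 100.
per_element = nn.MSELoss(reduction="none")(pred, target).detach()

loss = nn.MSELoss()(pred, target)
loss.backward()

# Analytic gradient: 2 * (x_i - y_i) / N.
expected = 2.0 * (pred.detach() - target) / pred.numel()

print(per_element)                           # squared errors 1 and 100
print(pred.grad)                             # gradients 1 and 10
print(torch.allclose(pred.grad, expected))   # True
```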

When to use: Standard regression; audio synthesis; when all targets are trusted (no outliers).

nn.HuberLoss — Quadratic Near Zero, Linear Far Away

Huber loss introduces a threshold $\delta > 0$ that separates the quadratic region from the linear region:

$\ell_i = \begin{cases} \frac{1}{2}(x_i - y_i)^2 & \text{if } |x_i - y_i| \le \delta \\ \delta \left(|x_i - y_i| - \frac{\delta}{2}\right) & \text{otherwise} \end{cases}$

For small errors ($|x-y| \le \delta$) it behaves like MSE scaled by $\frac{1}{2}$: smooth gradient, easy convergence. For large errors ($|x-y| > \delta$) it behaves like L1 scaled by $\delta$: gradient magnitude capped at $\delta$, outlier-robust.

The transition is continuous and differentiable everywhere, so the gradient has no jump at the boundary (unlike L1, whose gradient jumps from $-1$ to $+1$ at zero).
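To check the boundary behaviour, differentiate each branch with respect to $x_i$, writing $r_i = x_i - y_i$ as a shorthand introduced here for the residual:

$\frac{\partial \ell_i}{\partial x_i} = \begin{cases} r_i & \text{if } |r_i| \le \delta \\ \delta \operatorname{sign}(r_i) & \text{otherwise} \end{cases}$

At $|r_i| = \delta$ the quadratic branch gives $\frac{1}{2}\delta^2$ and the linear branch gives $\delta\left(\delta - \frac{\delta}{2}\right) = \frac{1}{2}\delta^2$, so the values agree; both gradients have magnitude $\delta$, so the slopes agree too. Only the second derivative jumps, from 1 to 0.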

loss = nn.HuberLoss(delta=1.0)  # default delta=1
input  = torch.tensor([0.5, 2.0])
target = torch.tensor([0.0, 0.0])
output = loss(input, target)
# |0.5-0|=0.5 ≤ 1 → 0.5*(0.5)²=0.125
# |2.0-0|=2.0 > 1 → 1*(2.0-0.5)=1.5
# mean = (0.125 + 1.5)/2 = 0.8125
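The piecewise definition can also be written out by hand and compared with the built-in module. In this sketch `huber_manual` is a hypothetical helper written for illustration, not a PyTorch function, and the input values (including the outlier at -7) are made up.

```python
import torch
import torch.nn as nn

def huber_manual(pred, target, delta=1.0):
    """Hypothetical helper: elementwise Huber loss, then the mean."""
    abs_err = (pred - target).abs()
    quadratic = 0.5 * abs_err ** 2               # used where |error| <= delta
    linear = delta * (abs_err - 0.5 * delta)     # used where |error| > delta
    return torch.where(abs_err <= delta, quadratic, linear).mean()

pred = torch.tensor([0.5, 2.0, -7.0], requires_grad=True)
target = torch.zeros(3)

manual = huber_manual(pred, target, delta=1.0)
builtin = nn.HuberLoss(delta=1.0)(pred, target)
print(torch.allclose(manual, builtin))   # True

builtin.backward()
# Undo the 1/N factor: each gradient is the residual clamped to [-delta, delta].
per_element_grad = pred.grad * pred.numel()
print(per_element_grad)                  # approximately [0.5, 1.0, -1.0]
```

The outlier at -7 receives the same gradient magnitude as the residual of 2, which is the capping behaviour described above.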

When to use: Reinforcement learning value functions; regression with occasional outliers; anywhere you want L2 smoothness but L1 tail behaviour.

nn.SmoothL1Loss — Faster R-CNN Variant

SmoothL1 uses a similar piecewise formula but with a different scaling via parameter $\beta$:

$\ell_i = \begin{cases} \frac{1}{2\beta}(x_i - y_i)^2 & \text{if } |x_i - y_i| < \beta \\ |x_i - y_i| - \frac{\beta}{2} & \text{otherwise} \end{cases}$

With $\beta = 1$ this produces the same values as Huber with $\delta = 1$. For other values the two differ by a constant factor: SmoothL1 with parameter $\beta$ equals Huber with $\delta = \beta$ divided by $\beta$, so its gradient magnitude in the linear region is always 1 rather than $\delta$. The default is $\beta = 1$; PyTorch releases that predate the beta argument fixed the threshold at 1, matching the original Faster R-CNN formulation.
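The relationship between the two losses can be checked numerically. The sketch below assumes an illustrative threshold of 0.5 and made-up inputs, and compares `nn.SmoothL1Loss(beta=0.5)` with `nn.HuberLoss(delta=0.5)` element by element.

```python
import torch
import torch.nn as nn

beta = 0.5  # illustrative threshold, used as beta for SmoothL1 and delta for Huber
target = torch.zeros(3)

pred_s = torch.tensor([0.2, 1.0, 3.0], requires_grad=True)
pred_h = pred_s.detach().clone().requires_grad_(True)

smooth = nn.SmoothL1Loss(beta=beta, reduction="none")(pred_s, target)
huber = nn.HuberLoss(delta=beta, reduction="none")(pred_h, target)

# The losses differ only by the constant factor beta.
print(torch.allclose(smooth * beta, huber))   # True

smooth.sum().backward()
huber.sum().backward()
print(pred_s.grad)   # approximately [0.4, 1.0, 1.0]: slope 1 in the linear region
print(pred_h.grad)   # approximately [0.2, 0.5, 0.5]: slope delta in the linear region
```

In practice this means changing $\beta$ in SmoothL1 moves the threshold without changing how strongly large errors pull on the model, while changing $\delta$ in Huber does both.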