Regression Losses: L1, MSE, Huber, SmoothL1
- Derive the gradient of L1Loss and MSELoss and explain why MSE penalises large errors more severely than MAE
- Apply HuberLoss with a given delta and describe the quadratic-to-linear transition at the boundary
- Compare L1, MSE, Huber, and SmoothL1 losses in terms of gradient magnitude as a function of error magnitude and identify which is robust to outliers
- Select among the four regression losses given the noise distribution of the target variable
Overview
Regression losses measure the discrepancy between a real-valued prediction $\hat{y}$ and a real-valued target $y$. They differ in how they penalize large errors, a choice that determines robustness to outliers and gradient behaviour.
nn.L1Loss — Mean Absolute Error
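With the default mean reduction over $N$ elements, where $\hat{y}_i$ is the prediction and $y_i$ the target:

$$\mathcal{L}_{\text{L1}} = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert$$

The gradient of the absolute value is simply the sign of the residual (it is undefined exactly at zero, but PyTorch uses a subgradient of 0 there):

$$\frac{\partial \mathcal{L}_{\text{L1}}}{\partial \hat{y}_i} = \frac{1}{N}\,\operatorname{sign}(\hat{y}_i - y_i)$$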
Because the gradient does not grow with error magnitude, L1 is robust to outliers — a single hugely wrong prediction contributes gradient ±1/N, same as a small error.
```python
import torch
import torch.nn as nn

loss = nn.L1Loss()
output = loss(torch.tensor([3.0, 1.0]), torch.tensor([1.0, 3.0]))
# |3-1| + |1-3| = 2 + 2 = 4 → mean = 2.0
```
When to use: Robust regression; super-resolution; when outliers in the target set should not dominate training.
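The constant-magnitude gradient can be checked with autograd. The following is a minimal sketch; the tensor values are arbitrary illustrations chosen so that one residual is tiny and one is huge.

```python
import torch
import torch.nn as nn

# Residuals are roughly 0.1, 100 and -3: very different magnitudes.
pred = torch.tensor([1.1, 101.0, -3.0], requires_grad=True)
target = torch.tensor([1.0, 1.0, 0.0])

loss = nn.L1Loss()(pred, target)  # reduction="mean" by default
loss.backward()

l1_grad = pred.grad
print(l1_grad)  # every entry is sign(residual) / N, with N = 3
```

The residual of 100 and the residual of 0.1 receive the same gradient magnitude, which is the sense in which L1 ignores how wrong an outlier is.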
nn.MSELoss — Mean Squared Error
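With the default mean reduction, MSE averages the squared residuals:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

The gradient grows linearly with the residual:

$$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial \hat{y}_i} = \frac{2}{N}\,(\hat{y}_i - y_i)$$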
This makes MSE sensitive to outliers: a prediction that is 10 units off contributes 100× more to the loss than one that is 1 unit off. It also means large errors are corrected faster — which is an advantage when the data is clean.
```python
import torch
import torch.nn as nn

loss = nn.MSELoss()
output = loss(torch.tensor([3.0, 1.0]), torch.tensor([1.0, 3.0]))
# (3-1)² + (1-3)² = 4 + 4 = 8 → mean = 4.0
```
When to use: Standard regression; audio synthesis; when all targets are trusted (no outliers).
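As a quick check of the linear gradient, the sketch below compares autograd against the closed form 2 · residual / N. The input values are illustrative only.

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.0, 11.0], requires_grad=True)
target = torch.tensor([1.0, 1.0])  # residuals: 1 and 10

loss = nn.MSELoss()(pred, target)  # (1 + 100) / 2
loss.backward()

mse_grad = pred.grad
closed_form = 2.0 * (pred.detach() - target) / pred.numel()
print(mse_grad)     # gradient computed by autograd
print(closed_form)  # 2 * residual / N
```

The residual that is 10 times larger gets a gradient 10 times larger, while its contribution to the loss is 100 times larger.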
nn.HuberLoss — Quadratic Near Zero, Linear Far Away
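Huber loss introduces a threshold $\delta$ that separates the quadratic region from the linear region. Writing $r = \hat{y} - y$ for the residual, the per-element loss is:

$$\ell_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } \lvert r \rvert \le \delta \\ \delta\left(\lvert r \rvert - \tfrac{1}{2}\delta\right) & \text{if } \lvert r \rvert > \delta \end{cases}$$

For small errors ($\lvert r \rvert \le \delta$) it behaves like MSE: smooth gradient, easy convergence. For large errors ($\lvert r \rvert > \delta$) it behaves like L1: the gradient magnitude is capped at $\delta$, which makes it outlier-robust.
The loss and its first derivative are both continuous at $\lvert r \rvert = \delta$ (each branch has value $\tfrac{1}{2}\delta^2$ and slope $\delta$ there), so the loss is differentiable everywhere and compatible with all gradient-based optimizers.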
```python
import torch
import torch.nn as nn

loss = nn.HuberLoss(delta=1.0) # default delta=1
input = torch.tensor([0.5, 2.0])
target = torch.tensor([0.0, 0.0])
output = loss(input, target)
# |0.5-0|=0.5 ≤ 1 → 0.5*(0.5)²=0.125
# |2.0-0|=2.0 > 1 → 1*(2.0-0.5)=1.5
# mean = (0.125 + 1.5)/2 = 0.8125
```
When to use: Reinforcement learning value functions; regression with occasional outliers; anywhere you want L2 smoothness but L1 tail behaviour.
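To make the piecewise definition concrete, here is a small sketch that re-implements Huber loss by hand and compares it with nn.HuberLoss. The helper `huber_reference` is a hypothetical name introduced for illustration, not part of PyTorch, and delta=1.5 is an arbitrary choice.

```python
import torch
import torch.nn as nn

def huber_reference(pred, target, delta=1.0):
    """Hand-written Huber loss with mean reduction (illustrative helper)."""
    r = (pred - target).abs()
    quadratic = 0.5 * r ** 2           # used where |r| <= delta
    linear = delta * (r - 0.5 * delta)  # used where |r| > delta
    return torch.where(r <= delta, quadratic, linear).mean()

pred = torch.tensor([0.5, 2.0, -4.0], requires_grad=True)
target = torch.zeros(3)

manual = huber_reference(pred, target, delta=1.5)
builtin = nn.HuberLoss(delta=1.5)(pred, target)
builtin.backward()

huber_grad = pred.grad
print(manual.item(), builtin.item())  # the two values should agree
print(huber_grad)  # residual / N inside the threshold, ±delta / N outside
```

The first element lies inside the quadratic region, so its gradient is proportional to the residual; the other two lie outside, so their gradients have the same magnitude despite the different errors.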
nn.SmoothL1Loss — Faster R-CNN Variant
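SmoothL1 uses a similar piecewise formula but with a different scaling via the parameter $\beta$. Writing $r = \hat{y} - y$:

$$\ell_\beta(r) = \begin{cases} \dfrac{r^2}{2\beta} & \text{if } \lvert r \rvert < \beta \\ \lvert r \rvert - \tfrac{1}{2}\beta & \text{otherwise} \end{cases}$$

With $\beta = 1$ this produces the same values as Huber with $\delta = 1$. For other values the quadratic region is divided by $\beta$, giving a different gradient scale: SmoothL1 equals Huber with $\delta = \beta$ divided by $\beta$, so its linear region always has slope 1 whereas Huber's has slope $\delta$. The default is $\beta = 1.0$; PyTorch releases that predate the beta argument used $\beta = 1$ implicitly, matching the smooth L1 definition from the Fast R-CNN paper that Faster R-CNN reuses.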
```python
import torch.nn as nn
loss = nn.SmoothL1Loss(beta=1.0)
```
When to use: Object detection bounding box regression (Faster R-CNN, SSD); historically the standard loss for that domain.
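The relationship between SmoothL1 and Huber can be checked numerically. The sketch below assumes a PyTorch version that provides both nn.SmoothL1Loss with a beta argument and nn.HuberLoss; the inputs and beta=0.5 are arbitrary.

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.2, 0.8, 3.0])
target = torch.zeros(3)
beta = 0.5

smooth = nn.SmoothL1Loss(beta=beta)(pred, target)
huber = nn.HuberLoss(delta=beta)(pred, target)

# Multiplying SmoothL1 by beta should recover Huber with delta = beta.
same_up_to_beta = torch.allclose(smooth * beta, huber)

# With beta = delta = 1 the two losses should coincide directly.
same_at_one = torch.allclose(
    nn.SmoothL1Loss(beta=1.0)(pred, target),
    nn.HuberLoss(delta=1.0)(pred, target),
)
print(same_up_to_beta, same_at_one)
```

In practice this means the choice between the two mostly affects the effective learning rate on the regression term when $\beta \neq 1$.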
Comparison: Gradient Magnitude vs. Error
| Error $\lvert \hat{y} - y \rvert$ | L1 gradient | MSE gradient | Huber gradient ($\delta = 1$) |
|---|---|---|---|
| 0.1 | 1/N | 0.2/N | 0.1/N |
| 1.0 | 1/N | 2.0/N | 1.0/N |
| 5.0 | 1/N | 10.0/N | 1.0/N (capped) |
| 10.0 | 1/N | 20.0/N | 1.0/N (capped) |
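L1 keeps the gradient magnitude fixed at 1/N regardless of error. MSE scales the gradient linearly with error. Huber's gradient grows linearly up to $\lvert \hat{y} - y \rvert = \delta$ and is capped at $\delta/N$ beyond that, so it combines MSE's smooth behaviour near zero with L1's robustness to outliers. SmoothL1 with $\beta = 1$ matches the Huber column. Only MSE has an unbounded gradient, which is why it is the one loss here that is not robust to outliers.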
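The table can be reproduced with autograd. This sketch evaluates each loss on a single-element tensor, so N = 1 and the printed numbers are the table entries without the 1/N factor.

```python
import torch
import torch.nn as nn

losses = {
    "L1": nn.L1Loss(),
    "MSE": nn.MSELoss(),
    "Huber": nn.HuberLoss(delta=1.0),
}
errors = [0.1, 1.0, 5.0, 10.0]

grads = {name: [] for name in losses}
for err in errors:
    for name, fn in losses.items():
        # One element per tensor, so the mean reduction divides by N = 1.
        pred = torch.tensor([err], requires_grad=True)
        fn(pred, torch.zeros(1)).backward()
        grads[name].append(round(pred.grad.item(), 4))

for name, row in grads.items():
    print(name, row)
```

Plotting these rows against the error would show a flat line for L1, a straight line through the origin for MSE, and a ramp that flattens at $\delta$ for Huber.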