Supplement · Loss Functions

Probabilistic Regression: Poisson & Gaussian NLL

12 min read
By the end of this reading you will be able to:
  • Derive any regression loss from the negative log-likelihood of an assumed output distribution
  • Apply PoissonNLLLoss to count-valued targets and explain when the log_input flag must be set
  • Use GaussianNLLLoss to train a model that predicts both mean and variance and interpret the heteroscedastic output
  • Explain why predicting variance alongside mean improves model calibration and enables uncertainty quantification

From Point Estimates to Distributions

L1 and MSE assume the model outputs a single number — a point estimate of the target. Probabilistic losses go further: the model outputs the parameters of a probability distribution over the target, and the loss is the negative log-likelihood (NLL) of the observed target under that distribution.

This lets the model express uncertainty: when the target is ambiguous, the predicted variance should be large.


Deriving a Loss from Maximum Likelihood

Suppose the target $y_i$ is drawn from a distribution $p(y_i \mid \theta_i)$ parameterised by the model output $\theta_i$. Assuming the $N$ samples are independent, maximum likelihood estimation (MLE) maximises the joint likelihood

$$\prod_{i=1}^N p(y_i \mid \theta_i)$$

Taking the logarithm (which is monotone, so maximising log-likelihood is equivalent), negating to turn maximisation into minimisation, and averaging over the samples (a constant factor that does not move the minimiser):

$$\mathcal{L}_{\text{NLL}} = -\frac{1}{N} \sum_{i=1}^N \log p(y_i \mid \theta_i)$$

This is the negative log-likelihood objective. Every probabilistic loss in PyTorch is a special case.
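
As a concrete instance, MSE itself falls out of this recipe. The sketch below assumes a Gaussian output distribution with a fixed unit variance (an assumption made here for illustration) and checks numerically that its NLL equals half the MSE plus a constant that does not depend on the model. The tensor values are arbitrary.

```python
import math
import torch
from torch.distributions import Normal

# Arbitrary illustrative predictions and targets (not taken from the reading).
mu = torch.tensor([0.3, -1.2, 2.5])   # model outputs, interpreted as Gaussian means
y = torch.tensor([0.0, -1.0, 3.0])    # observed targets

# NLL under N(mu, 1), averaged over samples.
nll = -Normal(loc=mu, scale=1.0).log_prob(y).mean()

# Half the MSE plus the model-independent constant 0.5 * log(2*pi).
half_mse_plus_const = 0.5 * torch.mean((mu - y) ** 2) + 0.5 * math.log(2 * math.pi)

print(nll.item(), half_mse_plus_const.item())  # the two values agree
```

Minimising one therefore minimises the other, which is why MSE can be read as a Gaussian NLL with the variance held constant.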


nn.PoissonNLLLoss — Count Data

When targets are non-negative integer counts (word frequencies, photon arrivals, events per interval), the natural model is a Poisson distribution:

$$p(y \mid \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad \lambda > 0$$

The network predicts $x = \log \lambda$ (to ensure $\lambda = e^x > 0$). The NLL is:

$$-\log p(y \mid e^x) = e^x - y \cdot x + \log(y!)$$

Dropping the constant $\log(y!)$ (it does not depend on the model) gives the default loss:

$$\ell_i = e^{x_i} - y_i \cdot x_i$$

With full=True, PyTorch adds Stirling's approximation for the factorial term, $\log(y!) \approx y \log y - y + 0.5 \log(2\pi y)$, applied only to targets greater than 1.
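
To see how close the Stirling term comes to the exact factorial, the following sketch compares PoissonNLLLoss(full=True) with the exact Poisson NLL from torch.distributions.Poisson. The rates and counts are arbitrary illustrative values.

```python
import torch
import torch.nn as nn
from torch.distributions import Poisson

log_rate = torch.tensor([0.5, 1.2, 2.0, 0.0])   # x = log(lambda), arbitrary values
counts = torch.tensor([1.0, 3.0, 8.0, 0.0])     # integer-valued count targets

# Per-sample loss including the approximate log(y!) term.
approx_loss = nn.PoissonNLLLoss(log_input=True, full=True, reduction="none")
approx_nll = approx_loss(log_rate, counts)

# Exact NLL: log(y!) is computed via lgamma(y + 1) rather than Stirling's formula.
exact_nll = -Poisson(rate=log_rate.exp()).log_prob(counts)

gap = (exact_nll - approx_nll).abs()
print(gap)  # small for every sample; zero up to rounding where y <= 1
```

Since the factorial term does not depend on the model, full=True changes the reported loss value but not the gradients.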

import torch
import torch.nn as nn

loss = nn.PoissonNLLLoss(log_input=True)   # input is log(λ); this is the default
x = torch.tensor([0.5, 1.2])               # log(λ)
y = torch.tensor([1.0, 3.0])               # count targets
output = loss(x, y)                        # default reduction='mean'
# ℓ₁ = exp(0.5) − 1·0.5 = 1.649 − 0.5 = 1.149
# ℓ₂ = exp(1.2) − 3·1.2 = 3.320 − 3.6 = −0.280
# output = (1.149 − 0.280) / 2 ≈ 0.434

When to use: NLP word counts; medical event rates; any target that is a non-negative integer following a Poisson process.
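
The log_input flag tells the loss how to interpret the network output. With log_input=True (the default) the input is taken to be $\log \lambda$, as in the derivation above. With log_input=False the input is taken to be the rate $\lambda$ itself, so the network must already produce a positive value. A minimal sketch, with arbitrary values, showing that the two conventions agree when fed consistent inputs:

```python
import torch
import torch.nn as nn

log_rate = torch.tensor([0.5, 1.2])   # x = log(lambda)
rate = log_rate.exp()                 # lambda itself, strictly positive
counts = torch.tensor([1.0, 3.0])

# Input is log(lambda): loss = exp(x) - y * x
loss_from_log = nn.PoissonNLLLoss(log_input=True)(log_rate, counts)

# Input is lambda: loss = lambda - y * log(lambda + eps); eps guards against log(0)
loss_from_rate = nn.PoissonNLLLoss(log_input=False)(rate, counts)

print(loss_from_log.item(), loss_from_rate.item())  # equal up to rounding
```

If the final layer has no positivity constraint, treat its output as $\log \lambda$ and keep log_input=True. Set log_input=False only when the output is already a rate, for example after a softplus or exp.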


nn.GaussianNLLLoss — Heteroscedastic Regression

When the target is real-valued but the observation noise varies across samples, model the target as

$$y_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$$

The network predicts both $\mu_i$ (the mean, called input) and $\sigma_i^2$ (the variance, called var). The NLL is:

$$-\log p(y_i \mid \mu_i, \sigma_i^2) = \frac{1}{2}\left[\log(2\pi\sigma_i^2) + \frac{(\mu_i - y_i)^2}{\sigma_i^2}\right]$$

Dropping the constant $\frac{1}{2}\log(2\pi)$:

$$\ell_i = \frac{1}{2}\left[\log(\sigma_i^2) + \frac{(\mu_i - y_i)^2}{\sigma_i^2}\right]$$

The model learns a trade-off: if it is very uncertain ($\sigma_i^2$ large), the squared-error term is down-weighted but the log-variance term increases. If it is very confident ($\sigma_i^2$ small), the log-variance is small but squared error is amplified.
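
The trade-off has a closed-form balance point: for a fixed squared error $e^2 = (\mu_i - y_i)^2$, setting the derivative of $\ell_i$ with respect to $\sigma_i^2$ to zero gives $\sigma_i^2 = e^2$. The sketch below checks this numerically by scanning candidate variances for an arbitrary error of 0.5.

```python
import torch
import torch.nn as nn

loss_fn = nn.GaussianNLLLoss(reduction="none")

mean = torch.tensor(1.5)      # arbitrary predicted mean
target = torch.tensor(1.0)    # arbitrary target, so the squared error is 0.25

candidate_vars = torch.linspace(0.01, 2.0, 2000)   # grid of sigma^2 values to try
losses = loss_fn(
    mean.expand_as(candidate_vars),
    target.expand_as(candidate_vars),
    candidate_vars,
)

best_var = candidate_vars[losses.argmin()]
print(best_var.item())  # close to the squared error, 0.25
```

In other words, the loss is minimised when the predicted variance matches the squared error the model actually makes, which is what pushes the variance output toward the true noise level.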

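import torch
import torch.nn as nn

loss = nn.GaussianNLLLoss()
mean = torch.tensor([1.0, 2.0])   # predicted μ
var  = torch.tensor([0.5, 2.0])   # predicted σ²  (must be > 0)
target = torch.tensor([1.2, 1.0])
output = loss(mean, target, var)  # argument order: input, target, var
# ℓ₁ = 0.5·(log 0.5 + 0.04/0.5) = −0.307;  ℓ₂ = 0.5·(log 2 + 1/2) = 0.597
# output = mean of the two ≈ 0.145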

PyTorch clamps var to a minimum of eps (default 1e-6) for numerical stability, and raises an error if var contains negative entries.
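
A quick check of how the loss treats very small or invalid variances. The sketch assumes the behaviour of current PyTorch releases: var is clamped from below at eps (default 1e-6), and a negative entry raises a ValueError. The values are arbitrary.

```python
import math
import torch
import torch.nn as nn

loss_fn = nn.GaussianNLLLoss(eps=1e-6, reduction="none")
mean = torch.tensor([0.0])
target = torch.tensor([0.0])   # zero error, so only the log-variance term remains

tiny = loss_fn(mean, target, torch.tensor([1e-12]))    # below eps, clamped up to eps
at_eps = loss_fn(mean, target, torch.tensor([1e-6]))   # exactly eps
print(tiny.item(), at_eps.item(), 0.5 * math.log(1e-6))  # all three agree

try:
    loss_fn(mean, target, torch.tensor([-0.1]))
    negative_rejected = False
except ValueError:
    negative_rejected = True
print(negative_rejected)
```

In practice the variance output is usually passed through a positive map such as softplus or exp, so the constraint holds by construction rather than relying on the clamp.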

When to use: Uncertainty-aware regression; weather forecasting; any setting where prediction confidence should be data-driven.


Why Predict Variance?

Standard MSE corresponds to a Gaussian NLL whose variance is fixed and shared by all samples, so the model acts as though it is equally confident about every prediction. With GaussianNLLLoss, a well-trained model instead learns:

  • For easy, predictable samples → small $\sigma^2$ → tight distribution
  • For ambiguous, noisy samples → large $\sigma^2$ → wide distribution

The calibrated uncertainty can then be used for downstream decisions (e.g., active learning, safety-critical rejection).
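
A minimal end-to-end sketch of this behaviour, under assumed synthetic data: the noise standard deviation is 0.05 for negative inputs and 0.5 for non-negative ones. MeanVarNet is a hypothetical two-output network introduced only for this illustration, and the architecture, learning rate, and step count are arbitrary choices rather than recommendations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic data (assumed for illustration): noise std 0.05 for x < 0, 0.5 for x >= 0.
x = torch.rand(512, 1) * 2 - 1
noise_std = torch.where(x < 0, torch.tensor(0.05), torch.tensor(0.5))
y = 2 * x + noise_std * torch.randn_like(x)

class MeanVarNet(nn.Module):
    """Hypothetical network: output column 0 is the mean, column 1 a raw variance."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))

    def forward(self, inputs):
        out = self.body(inputs)
        mean = out[:, :1]
        var = F.softplus(out[:, 1:]) + 1e-6   # softplus keeps the variance positive
        return mean, var

model = MeanVarNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.GaussianNLLLoss()

for _ in range(500):
    optimizer.zero_grad()
    mean, var = model(x)
    loss = loss_fn(mean, y, var)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    _, var = model(x)
    var_quiet = var[x < 0].mean()    # region with low observation noise
    var_noisy = var[x >= 0].mean()   # region with high observation noise
print(var_quiet.item(), var_noisy.item())  # the noisy region should get the larger variance
```

A downstream rule could then, for example, flag or reject predictions whose variance exceeds a chosen threshold.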