Supplement · Loss Functions

Probabilistic Regression: Poisson & Gaussian NLL

By the end of this reading you will be able to:

Derive any regression loss from the negative log-likelihood of an assumed output distribution
Apply PoissonNLLLoss to count-valued targets and explain when the log_input flag must be set
Use GaussianNLLLoss to train a model that predicts both mean and variance and interpret the heteroscedastic output
Explain why predicting variance alongside mean improves model calibration and enables uncertainty quantification

From Point Estimates to Distributions

L1 and MSE assume the model outputs a single number — a point estimate of the target. Probabilistic losses go further: the model outputs the parameters of a probability distribution over the target, and the loss is the negative log-likelihood (NLL) of the observed target under that distribution.

This lets the model express uncertainty: when the target is ambiguous, the predicted variance should be large.

Deriving a Loss from Maximum Likelihood

Suppose the target $y_i$ is drawn from a distribution $p(y_i \mid \theta_i)$ parameterised by the model output $\theta_i$. Maximum likelihood estimation (MLE) maximises the joint likelihood of the data, which for independent samples is the product

$\prod_{i=1}^N p(y_i \mid \theta_i)$

Taking the logarithm (which is monotone, so maximising the log-likelihood is equivalent), negating to turn maximisation into minimisation, and averaging over the $N$ samples gives:

$\mathcal{L}_{\text{NLL}} = -\frac{1}{N} \sum_{i=1}^N \log p(y_i \mid \theta_i)$

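This is the negative log-likelihood objective. Both losses in this reading, nn.PoissonNLLLoss and nn.GaussianNLLLoss, are special cases of it, and so is MSE, which corresponds to a Gaussian output distribution with fixed variance.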
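As a quick check of this recipe, the sketch below assumes a Gaussian output distribution with unit variance and confirms numerically that its mean NLL equals half the MSE plus a constant, so minimising one minimises the other. The tensors `mu` and `y` are random stand-ins for model outputs and targets, not data from this reading.

```python
import math

import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.randn(8)   # stand-in for model outputs (assumed, for illustration)
y = torch.randn(8)    # stand-in for targets (assumed, for illustration)

# Mean negative log-likelihood of y under N(mu, 1).
nll = -Normal(loc=mu, scale=1.0).log_prob(y).mean()

# Half the MSE plus the constant 0.5 * log(2π) that the NLL carries.
shifted_mse = 0.5 * F.mse_loss(mu, y) + 0.5 * math.log(2 * math.pi)

print(torch.allclose(nll, shifted_mse, atol=1e-6))
```

Because the two quantities differ only by a scale factor and an additive constant, they share the same minimiser.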

nn.PoissonNLLLoss — Count Data

When targets are non-negative integer counts (word frequencies, photon arrivals, events per interval), the natural model is a Poisson distribution:

$p(y \mid \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad \lambda > 0$

The network predicts $x = \log \lambda$ (to ensure $\lambda = e^x > 0$). The NLL is:

$-\log p(y \mid e^x) = e^x - y \cdot x + \log(y!)$

Dropping the constant $\log(y!)$ (it does not depend on the model) gives the default loss:

$\ell_i = e^{x_i} - y_i \cdot x_i$

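With full=True, PyTorch adds Stirling's approximation for the factorial term, $\log(y!) \approx y \log y - y + 0.5 \log(2\pi y)$, for targets greater than 1. The term does not depend on the model output, so it changes the reported loss value but not the gradients.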
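To get a feel for how close the approximation is, the following sketch compares it with the exact value of $\log(y!)$ computed through `math.lgamma`. The helper names `stirling_log_factorial` and `exact_log_factorial` are made up for this illustration and are not part of PyTorch.

```python
import math


def stirling_log_factorial(y: float) -> float:
    """Stirling approximation to log(y!) in the form quoted above (y > 0)."""
    return y * math.log(y) - y + 0.5 * math.log(2 * math.pi * y)


def exact_log_factorial(y: float) -> float:
    """Exact log(y!) via the log-gamma function: log(y!) = lgamma(y + 1)."""
    return math.lgamma(y + 1)


for y in (2, 3, 10, 100):
    exact = exact_log_factorial(y)
    approx = stirling_log_factorial(y)
    print(f"y={y:>3}  exact={exact:.4f}  stirling={approx:.4f}  gap={exact - approx:.4f}")
```

The gap shrinks roughly like $1/(12y)$, so the approximation is already tight for moderate counts.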

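import torch
import torch.nn as nn

loss = nn.PoissonNLLLoss(log_input=True)   # input is log(λ); this is the default
x = torch.tensor([0.5, 1.2])               # log(λ)
y = torch.tensor([1.0, 3.0])               # count targets
output = loss(x, y)                        # default reduction='mean' averages ℓ₁ and ℓ₂
# ℓ₁ = exp(0.5) − 1·0.5 = 1.649 − 0.5 = 1.149
# ℓ₂ = exp(1.2) − 3·1.2 = 3.320 − 3.6 = −0.280
# output = (1.149 − 0.280) / 2 ≈ 0.434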
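
The per-sample values in the comments can be checked directly. The sketch below uses `reduction="none"` to get one loss per element, adds back the dropped $\log(y!)$ term with `torch.lgamma`, and compares the result against `torch.distributions.Poisson`. It also shows the `log_input=False` variant, where the input is taken to be $\lambda$ itself rather than $\log \lambda$; that setting only makes sense when the network output is already positive, for example after a softplus.

```python
import torch
import torch.nn as nn
from torch.distributions import Poisson

x = torch.tensor([0.5, 1.2])   # log(λ), same values as the example above
y = torch.tensor([1.0, 3.0])   # count targets

# One loss value per element instead of the default mean.
per_sample = nn.PoissonNLLLoss(log_input=True, reduction="none")(x, y)

# Adding log(y!) = lgamma(y + 1) recovers the exact Poisson NLL.
exact_nll = per_sample + torch.lgamma(y + 1)
reference = -Poisson(rate=x.exp()).log_prob(y)

# With log_input=False the input is λ itself; PyTorch then evaluates
# λ − y·log(λ + eps), which agrees with the log-input form up to eps.
rate_form = nn.PoissonNLLLoss(log_input=False, reduction="none")(x.exp(), y)

print(per_sample)
print(torch.allclose(exact_nll, reference, atol=1e-6))
print(torch.allclose(per_sample, rate_form, atol=1e-6))
```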

When to use: NLP word counts; medical event rates; any target that is a non-negative integer following a Poisson process.

nn.GaussianNLLLoss — Heteroscedastic Regression

When the target is real-valued but the observation noise varies across samples, model the target as

$y_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$

The network predicts both $\mu_i$ (the mean, called input) and $\sigma_i^2$ (the variance, called var). The NLL is:

$-\log p(y_i \mid \mu_i, \sigma_i^2) = \frac{1}{2}\left[\log(2\pi\sigma_i^2) + \frac{(\mu_i - y_i)^2}{\sigma_i^2}\right]$

Dropping the constant $\frac{1}{2}\log(2\pi)$:

$\ell_i = \frac{1}{2}\left[\log(\sigma_i^2) + \frac{(\mu_i - y_i)^2}{\sigma_i^2}\right]$

The model learns a trade-off: if it is very uncertain ($\sigma_i^2$ large), the squared-error term is down-weighted but the log-variance term increases. If it is very confident ($\sigma_i^2$ small), the log-variance is small but squared error is amplified.
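One way to see the trade-off is to hold the squared error fixed and sweep the variance. Setting the derivative of $\ell_i$ with respect to $\sigma_i^2$ to zero gives $\sigma_i^2 = (\mu_i - y_i)^2$, so the per-sample loss is smallest when the predicted variance matches the squared error. The sketch below checks this on a grid; the squared error of 0.25 is an arbitrary value chosen for illustration.

```python
import torch

sq_err = 0.25                                # fixed (μ − y)², arbitrary illustrative value
var_grid = torch.linspace(0.01, 2.0, 200)    # candidate variances, step 0.01

# Per-sample Gaussian NLL (constant dropped) for each candidate variance.
loss = 0.5 * (torch.log(var_grid) + sq_err / var_grid)

best_var = var_grid[loss.argmin()].item()
print(f"variance with the lowest loss: {best_var:.2f}")
```

Variances much smaller than the squared error are penalised by the amplified error term, and much larger ones by the log-variance term.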

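import torch
import torch.nn as nn
loss = nn.GaussianNLLLoss()
mean = torch.tensor([1.0, 2.0])   # predicted μ
var  = torch.tensor([0.5, 2.0])   # predicted σ²  (must be > 0)
target = torch.tensor([1.2, 1.0])
output = loss(mean, target, var)  # argument order: input, target, var
# ℓ₁ = 0.5·(log 0.5 + 0.04/0.5) = 0.5·(−0.693 + 0.080) = −0.307
# ℓ₂ = 0.5·(log 2.0 + 1.00/2.0) = 0.5·(0.693 + 0.500) = 0.597
# output = (−0.307 + 0.597) / 2 ≈ 0.145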
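
As a sanity check on the formula, the sketch below recomputes the per-sample loss by hand and, with `full=True`, compares the result against the log-density from `torch.distributions.Normal`. It reuses the values from the example above and is meant as an illustration rather than a test suite.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

mean = torch.tensor([1.0, 2.0])     # predicted μ
var = torch.tensor([0.5, 2.0])      # predicted σ²
target = torch.tensor([1.2, 1.0])

# Default form: the constant 0.5 * log(2π) is dropped.
per_sample = nn.GaussianNLLLoss(reduction="none")(mean, target, var)
manual = 0.5 * (torch.log(var) + (mean - target) ** 2 / var)

# full=True keeps the constant, giving the exact Gaussian NLL.
full_nll = nn.GaussianNLLLoss(full=True, reduction="none")(mean, target, var)
reference = -Normal(loc=mean, scale=var.sqrt()).log_prob(target)

print(per_sample)
print(torch.allclose(per_sample, manual, atol=1e-6))
print(torch.allclose(full_nll, reference, atol=1e-6))
```

Note that `Normal` is parameterised by the standard deviation, so the variance has to be passed through a square root.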

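PyTorch clamps var from below at a small eps (default 1e-6) for numerical stability; the clamp is ignored by autograd, so gradients are unaffected. Negative entries in var raise an error.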
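
Since var must be positive, the network needs some way to guarantee that. A minimal sketch follows, assuming a softplus on the variance head, which is one common choice rather than anything this reading prescribes. The class name `MeanVarianceNet`, the layer sizes, and the random data are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeanVarianceNet(nn.Module):
    """Hypothetical two-head regressor that outputs a mean and a positive variance."""

    def __init__(self, in_features: int, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.raw_var_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        mean = self.mean_head(h)
        # softplus maps any real number to a positive one; the small floor
        # is an illustrative choice to keep the variance away from zero.
        var = F.softplus(self.raw_var_head(h)) + 1e-6
        return mean, var


torch.manual_seed(0)
model = MeanVarianceNet(in_features=3)
x = torch.randn(16, 3)   # random stand-in inputs
y = torch.randn(16, 1)   # random stand-in targets

mean, var = model(x)
loss = nn.GaussianNLLLoss()(mean, y, var)
loss.backward()          # gradients flow into both heads
print(loss.item())
```

Predicting the log-variance and exponentiating it is another option; either way the point is that the positivity constraint lives in the model, not in the loss.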

When to use: Uncertainty-aware regression; weather forecasting; any setting where prediction confidence should be data-driven.

Why Predict Variance?

Training with MSE is equivalent to assuming a Gaussian with the same fixed variance for every sample, so the model acts as though it is equally confident about every prediction. With GaussianNLLLoss, a well-trained model learns:

For easy, predictable samples → small $\sigma^2$ → tight distribution
For ambiguous, noisy samples → large $\sigma^2$ → wide distribution

When the predicted variances are well calibrated, they can be used for downstream decisions (e.g., active learning, safety-critical rejection).
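
As an illustration of such a downstream use, the sketch below turns predicted means and variances into approximate 95% intervals under the Gaussian assumption and rejects predictions whose standard deviation exceeds a threshold. The prediction values and the threshold `max_std` are made up for this example.

```python
import torch

# Hypothetical model outputs for three samples.
mean = torch.tensor([1.0, 2.0, 0.5])
var = torch.tensor([0.04, 2.25, 0.01])

std = var.sqrt()

# Approximate 95% predictive interval under the Gaussian assumption.
lower = mean - 1.96 * std
upper = mean + 1.96 * std

# Reject predictions the model itself marks as too uncertain.
max_std = 0.5            # hypothetical, application-specific threshold
accept = std <= max_std

for m, lo, hi, ok in zip(mean.tolist(), lower.tolist(), upper.tolist(), accept.tolist()):
    print(f"mean={m:.2f}  interval=[{lo:.2f}, {hi:.2f}]  accept={ok}")
```

The intervals are only as trustworthy as the variance estimates, so it is worth checking coverage on held-out data before relying on them.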
