Supplement · Loss Functions

What Is a Loss Function?

10 min read

By the end of this reading you will be able to:

Explain the role of a loss function in supervised learning as a scalar signal that measures the gap between predictions and targets
Distinguish the three reduction modes (none, mean, sum) and identify which to use for batch training vs. manual weighting vs. sequence models
Trace the computational graph from a loss value back to a parameter update and explain where autograd attaches
Select an appropriate loss family for a given supervised task using the output type and target distribution as decision criteria

Motivation

A neural network learns by adjusting its parameters $\theta$ so that its predictions $f_\theta(x)$ are close to the ground-truth targets $y$. A loss function (also called a criterion or objective) turns this notion of "closeness" into a single scalar that the optimizer minimizes.

Without a loss there is nothing to differentiate, so gradient descent has no direction to follow.

The Training Objective

Given a dataset of $N$ input-target pairs $\{(x_i, y_i)\}_{i=1}^N$, the empirical training loss is

$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f_\theta(x_i),\, y_i\bigr)$

where $\ell$ is the per-sample loss specific to the task. The optimizer updates parameters via

$\theta \leftarrow \theta - \alpha \, \nabla_\theta \mathcal{L}(\theta)$

where $\alpha$ is the learning rate. PyTorch computes $\nabla_\theta \mathcal{L}$ automatically when you call loss.backward(), storing each parameter's gradient in its .grad attribute.
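The update rule above can be sketched as a single hand-written gradient step. This is a minimal illustration, not how training is normally written (an optimizer from torch.optim would usually do the update). The toy data, the nn.Linear model, and the value of alpha are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy dataset: N = 8 samples, 3 features, 1 regression target.
x = torch.randn(8, 3)
y = torch.randn(8, 1)

model = nn.Linear(3, 1)      # plays the role of f_theta
criterion = nn.MSELoss()     # squared error, averaged over the batch
alpha = 0.05                 # learning rate (illustrative value)

loss_before = criterion(model(x), y)   # L(theta)
loss_before.backward()                 # fills p.grad for every parameter p

# theta <- theta - alpha * grad, written by hand instead of using torch.optim.SGD
with torch.no_grad():
    for p in model.parameters():
        p -= alpha * p.grad
        p.grad.zero_()                 # clear gradients before the next step

loss_after = criterion(model(x), y)    # loss under the updated parameters
```

Comparing loss_before and loss_after shows the effect of one step; with a learning rate that is too large the loss can increase instead.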

PyTorch Convention

In PyTorch, a built-in loss function is an nn.Module. Most of them take two tensors, input (the model prediction) and target (the ground truth), and return a scalar under the default reduction:

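import torch
import torch.nn as nn

model = nn.Linear(3, 1)                # stand-in model, so the loss has parameters to differentiate
criterion = nn.MSELoss()
input  = model(torch.randn(4, 3))      # model output, shape (4, 1)
target = torch.randn(4, 1)             # ground truth, shape (4, 1)
loss   = criterion(input, target)      # scalar (0-dim tensor)
loss.backward()                        # populates .grad for every model parameter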

The names input and target follow PyTorch's own documentation. In context:

input — the raw output of the last layer (logits or probabilities, depending on the loss)
target — the label or value you want the model to predict
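Whether input should hold logits or probabilities depends on the loss. As a small sketch of that distinction, the example below feeds the same hypothetical logits to BCEWithLogitsLoss directly and to BCELoss after a sigmoid; the tensor values are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(4, 1)                             # raw last-layer output
target = torch.tensor([[1.0], [0.0], [1.0], [0.0]])    # binary labels stored as floats

# BCEWithLogitsLoss applies the sigmoid internally, so it expects logits.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, target)

# BCELoss expects probabilities in [0, 1], so the sigmoid is applied first.
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), target)
```

The two values agree up to floating-point error. Passing logits to a loss that expects probabilities (or the reverse) is a common source of silently wrong training.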

Reduction Modes

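Every built-in PyTorch loss accepts a reduction keyword that controls how the unreduced losses $\ell_i$ are aggregated. For element-wise losses such as L1Loss and MSELoss, $i$ runs over every element of the output tensor, so 'mean' divides by the total number of elements: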

`reduction`	Formula	When to use
`'mean'`	$\frac{1}{N}\sum_i \ell_i$	Default; loss scale is independent of batch size
`'sum'`	$\sum_i \ell_i$	When you want total loss, e.g. VAE ELBO
`'none'`	$[\ell_1, \ell_2, \ldots, \ell_N]$	Per-sample losses; useful for custom sample-level weighting

criterion_mean = nn.L1Loss(reduction='mean')
criterion_none = nn.L1Loss(reduction='none')

l_mean = criterion_mean(input, target)      # scalar
l_each = criterion_none(input, target)      # shape (4, 1)
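To make the relationship between the three modes concrete, the sketch below recomputes 'mean' and 'sum' from the 'none' output and then applies a custom weighting. The weights tensor is a hypothetical example, not something PyTorch provides.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input = torch.randn(4, 1)
target = torch.randn(4, 1)

l_each = nn.L1Loss(reduction='none')(input, target)   # shape (4, 1)
l_mean = nn.L1Loss(reduction='mean')(input, target)   # scalar
l_sum = nn.L1Loss(reduction='sum')(input, target)     # scalar

# 'mean' and 'sum' are reductions of the 'none' output.
mean_by_hand = l_each.mean()
sum_by_hand = l_each.sum()

# Hypothetical per-sample weights: count sample 2 twice, ignore sample 3.
weights = torch.tensor([[1.0], [1.0], [2.0], [0.0]])
weighted_loss = (weights * l_each).sum() / weights.sum()   # scalar, safe to call .backward() on
```

Reducing by hand like this is the usual reason to request reduction='none': the result is a scalar again, but the aggregation is under your control.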

The Computational Graph

PyTorch builds a dynamic computation graph during the forward pass. The loss is the final node of that graph; calling .backward() on it propagates gradients back through every operation to the leaf tensors, which include the model parameters.

  x  ──► layer 1 ──► layer 2 ──► output
                                     │
                         target ──► loss ──► .backward()
                                               │
                                       ∇θ flows back

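Calling .backward() with no arguments seeds the backward pass with an implicit gradient of 1, which is only defined for a scalar output. This is why the loss must be a scalar, unless you pass a tensor of the same shape as the loss via backward(gradient=...). Most built-in losses produce a scalar when reduction != 'none'.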
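As an illustration of the non-scalar case, the sketch below calls backward(gradient=...) on an unreduced loss and compares the result with the gradient of the summed loss. The parameter w and the data are hypothetical, chosen only to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
w = torch.randn(3, 1, requires_grad=True)   # hypothetical single parameter tensor
x = torch.randn(4, 3)
target = torch.randn(4, 1)

# Unreduced loss: shape (4, 1), so per_sample.backward() with no argument
# would raise a RuntimeError.
per_sample = nn.MSELoss(reduction='none')(x @ w, target)

# Supplying a gradient of ones weights every element equally ...
per_sample.backward(gradient=torch.ones_like(per_sample))
grad_from_vector = w.grad.clone()

# ... which matches the gradient of the summed (scalar) loss.
w.grad = None
nn.MSELoss(reduction='sum')(x @ w, target).backward()
grad_from_sum = w.grad.clone()
```

The gradient argument acts as a set of weights on the output elements, so a tensor of ones reproduces reduction='sum'.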

Choosing the Right Loss

The right loss encodes your statistical assumption about the target:

Assumption	Loss
Target is a continuous value; errors are symmetric	MSELoss or HuberLoss
Target follows a Poisson distribution (count data)	PoissonNLLLoss
Target is Gaussian with learnable variance	GaussianNLLLoss
Binary label $y \in \{0,1\}$	BCEWithLogitsLoss
Categorical label $y \in \{0,\ldots,C-1\}$	CrossEntropyLoss
Two distributions should match	KLDivLoss
Similar items should be embedded close together	TripletMarginLoss
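Each row of the table also implies a particular shape and dtype for input and target. The sketch below shows three of the rows side by side under assumed toy dimensions (a batch of 4 and 3 classes); all tensor values are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, num_classes = 4, 3

# Categorical label: logits of shape (batch, C), integer class indices of shape (batch,).
class_logits = torch.randn(batch, num_classes)
class_target = torch.tensor([0, 2, 1, 2])
ce = nn.CrossEntropyLoss()(class_logits, class_target)

# The same quantity by hand: mean negative log-softmax at the true class.
log_probs = torch.log_softmax(class_logits, dim=1)
ce_by_hand = -log_probs[torch.arange(batch), class_target].mean()

# Binary label: one logit per sample, float targets of the same shape.
binary_logits = torch.randn(batch, 1)
binary_target = torch.tensor([[1.0], [0.0], [0.0], [1.0]])
bce = nn.BCEWithLogitsLoss()(binary_logits, binary_target)

# Continuous target: prediction and target share one shape.
pred = torch.randn(batch, 1)
value = torch.randn(batch, 1)
mse = nn.MSELoss()(pred, value)
```

Note that CrossEntropyLoss takes raw logits and class indices here, so no softmax layer and no one-hot encoding are needed.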

The remaining readings in this supplement cover each family in detail.

