Supplement · Loss Functions

What Is a Loss Function?

10 min read
By the end of this reading you will be able to:
  • Explain the role of a loss function in supervised learning as a scalar signal that measures the gap between predictions and targets
  • Distinguish the three reduction modes (none, mean, sum) and identify which to use for batch training vs. manual weighting vs. sequence models
  • Trace the computational graph from a loss value back to a parameter update and explain where autograd attaches
  • Select an appropriate loss family for a given supervised task using the output type and target distribution as decision criteria

Motivation

A neural network learns by adjusting its parameters $\theta$ so that its predictions $f_\theta(x)$ are close to the ground-truth targets $y$. A loss function (also called a criterion or objective) turns this notion of "closeness" into a single scalar that the optimizer minimizes.

Without a loss, gradient descent has no direction to follow.


The Training Objective

Given a dataset of $N$ input-target pairs $\{(x_i, y_i)\}_{i=1}^{N}$, the empirical training loss is

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f_\theta(x_i),\, y_i\bigr)$$

where $\ell$ is the per-sample loss specific to the task. The optimizer updates parameters via

$$\theta \leftarrow \theta - \alpha \, \nabla_\theta \mathcal{L}(\theta)$$

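where $\alpha$ is the learning rate. PyTorch computes $\nabla_\theta \mathcal{L}$ automatically: loss.backward() stores the gradient of each parameter in its .grad attribute, and the optimizer's step() then applies the update.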
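
The update rule above can be sketched by hand, without an optimizer object. This is a minimal illustration on toy data; the linear model, the data values, the learning rate of 0.1, and the helper name empirical_loss are all assumptions made for the example, not part of the reading.

```python
import torch

torch.manual_seed(0)

# Toy data: N = 8 samples of a linear relationship (illustrative values).
x = torch.randn(8, 1)
y = 3.0 * x + 0.5

# Parameters theta = (w, b); requires_grad=True asks autograd to track them.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
alpha = 0.1  # learning rate (assumed value)

def empirical_loss(w, b):
    """Hypothetical helper: mean squared error over the whole toy dataset."""
    pred = x * w + b                    # f_theta(x_i)
    return ((pred - y) ** 2).mean()     # (1/N) * sum_i l(f_theta(x_i), y_i)

loss_before = empirical_loss(w, b)
loss_before.backward()                  # fills w.grad and b.grad

with torch.no_grad():                   # the update itself is not tracked
    w -= alpha * w.grad                 # theta <- theta - alpha * grad
    b -= alpha * b.grad

# Reset gradients so the next backward() does not accumulate into them.
w.grad.zero_()
b.grad.zero_()

loss_after = empirical_loss(w, b)
print(loss_before.item(), loss_after.item())
```

In practice an optimizer such as torch.optim.SGD performs the no_grad update and the gradient reset for you through step() and zero_grad().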


PyTorch Convention

In PyTorch, a loss function is an nn.Module that takes two tensors, input (the model prediction) and target (the ground truth), and returns a scalar under the default reduction:

import torch
import torch.nn as nn

criterion = nn.MSELoss()
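input  = torch.randn(4, 1, requires_grad=True)  # stand-in for model output
target = torch.randn(4, 1)                      # ground truth
loss   = criterion(input, target)               # scalar (default reduction='mean')
loss.backward()   # fills .grad on leaf tensors that require grad (here: input;
                  # in a real model, the parameters)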

The names input and target follow PyTorch's own documentation. In context:

  • input — the raw output of the last layer (logits or probabilities, depending on the loss)
  • target — the label or value you want the model to predict
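
Whether input should hold logits or probabilities depends on the loss, as the first bullet notes. The sketch below compares two binary cross-entropy criteria on the same made-up data; the tensor values and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(4, 1)                               # raw last-layer output
target = torch.tensor([[1.0], [0.0], [1.0], [0.0]])      # binary labels as floats

# BCEWithLogitsLoss expects raw logits and applies the sigmoid internally.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, target)

# BCELoss expects probabilities in [0, 1], so the sigmoid is applied by hand.
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), target)

# Both routes compute the same quantity, up to floating-point error.
print(loss_from_logits.item(), loss_from_probs.item())
```

Passing probabilities to a loss that expects logits (or the reverse) usually runs without an error but silently optimizes the wrong quantity, so it is worth checking the documentation of each criterion.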

Reduction Modes

PyTorch's built-in losses accept a reduction keyword that controls how the per-sample losses $\ell_i$ are aggregated:

| reduction | Formula | When to use |
| --- | --- | --- |
| 'mean' | $\frac{1}{N}\sum_i \ell_i$ | Default; loss scale is independent of batch size |
| 'sum' | $\sum_i \ell_i$ | When you want total loss, e.g. VAE ELBO |
| 'none' | $[\ell_1, \ell_2, \ldots, \ell_N]$ | Per-sample losses; useful for custom sample-level weighting |

criterion_mean = nn.L1Loss(reduction='mean')
criterion_none = nn.L1Loss(reduction='none')

l_mean = criterion_mean(input, target)      # scalar
l_each = criterion_none(input, target)      # shape (4, 1)
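
To make the relationship between the three modes concrete, the sketch below checks that 'sum' and 'mean' are aggregations of the 'none' output, then uses 'none' for manual weighting. The weights tensor is a hypothetical example (for instance, to up-weight one sample and mask another), not a recommended setting.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.randn(4, 1)
target = torch.randn(4, 1)

per_elem = nn.L1Loss(reduction='none')(pred, target)   # shape (4, 1)
total    = nn.L1Loss(reduction='sum')(pred, target)    # scalar
average  = nn.L1Loss(reduction='mean')(pred, target)   # scalar

# 'sum' and 'mean' are aggregations of the 'none' output.
assert torch.allclose(per_elem.sum(), total)
assert torch.allclose(per_elem.mean(), average)

# Hypothetical per-sample weights: double the third sample, mask the fourth.
weights = torch.tensor([[1.0], [1.0], [2.0], [0.0]])

# Weighted average of the per-sample losses; still a scalar, so it can be
# passed to backward() like any other loss.
weighted_loss = (weights * per_elem).sum() / weights.sum()
print(weighted_loss.item())
```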

The Computational Graph

PyTorch builds a dynamic computation graph during the forward pass. The loss node sits at the top; calling .backward() propagates gradients back through every operation to the leaf parameters.

  x  ──► layer 1 ──► layer 2 ──► output
                                     │
                         target ──► loss ──► .backward()
                                               │
                                       ∇θ flows back

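This is why the loss must be a scalar: called with no arguments, .backward() seeds the backward pass with an implicit gradient of 1, which is only defined for a scalar output. For a non-scalar tensor you must supply the seed yourself via backward(gradient=...). Most built-in losses produce a scalar when reduction != 'none'.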
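
The two ways of backpropagating from a non-scalar loss can be compared directly. In this sketch a single weight vector w stands in for the model parameters; the shapes and random values are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)    # stand-in for model parameters
x = torch.randn(4, 3)
target = torch.randn(4)

per_sample = nn.MSELoss(reduction='none')(x @ w, target)   # shape (4,)

# Calling per_sample.backward() with no arguments would raise a RuntimeError
# here, because autograd only creates the implicit seed for scalar outputs.

# Option 1: pass an explicit seed; a vector of ones weights every sample equally.
# retain_graph=True keeps the graph alive for the second backward pass below.
per_sample.backward(gradient=torch.ones_like(per_sample), retain_graph=True)
grad_from_seed = w.grad.clone()

# Option 2: reduce to a scalar first, then call backward() as usual.
w.grad = None                             # clear the accumulated gradient
per_sample.sum().backward()
grad_from_sum = w.grad.clone()

print(torch.allclose(grad_from_seed, grad_from_sum))
```

A seed of ones is equivalent to summing the per-sample losses; any other seed vector amounts to a weighted sum.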


Choosing the Right Loss

The right loss encodes your statistical assumption about the target:

| Assumption | Loss |
| --- | --- |
| Target is a continuous value; errors are symmetric | MSELoss or HuberLoss |
| Target follows a Poisson distribution (count data) | PoissonNLLLoss |
| Target is Gaussian with learnable variance | GaussianNLLLoss |
| Binary label $y \in \{0,1\}$ | BCEWithLogitsLoss |
| Categorical label $y \in \{0,\ldots,C-1\}$ | CrossEntropyLoss |
| Two distributions should match | KLDivLoss |
| Similar items should be embedded close together | TripletMarginLoss |
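
The choice of loss also fixes the shape and dtype that input and target must have. The sketch below shows this for three rows of the table, using made-up tensors; the batch size, class count, and label values are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, num_classes = 4, 3

# Categorical labels: logits of shape (batch, C), integer class indices (batch,).
ce_logits = torch.randn(batch, num_classes)
ce_target = torch.tensor([0, 2, 1, 2])                   # dtype torch.int64
ce_loss = nn.CrossEntropyLoss()(ce_logits, ce_target)

# Binary labels: one logit per sample, float targets of the same shape.
bce_logits = torch.randn(batch, 1)
bce_target = torch.tensor([[1.0], [0.0], [0.0], [1.0]])
bce_loss = nn.BCEWithLogitsLoss()(bce_logits, bce_target)

# Matching distributions: KLDivLoss takes log-probabilities as input and,
# by default, plain probabilities as target.
log_q = torch.log_softmax(torch.randn(batch, num_classes), dim=1)
p = torch.softmax(torch.randn(batch, num_classes), dim=1)
kl_loss = nn.KLDivLoss(reduction='batchmean')(log_q, p)

print(ce_loss.item(), bce_loss.item(), kl_loss.item())
```

For KLDivLoss, 'batchmean' divides the summed divergence by the batch size, which matches the mathematical definition of KL divergence averaged over samples.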

The remaining readings in this supplement cover each family in detail.