What Is a Loss Function?
- Explain the role of a loss function in supervised learning as a scalar signal that measures the gap between predictions and targets
- Distinguish the three reduction modes (none, mean, sum) and identify which to use for batch training vs. manual weighting vs. sequence models
- Trace the computational graph from a loss value back to a parameter update and explain where autograd attaches
- Select an appropriate loss family for a given supervised task using the output type and target distribution as decision criteria
Motivation
A neural network learns by adjusting its parameters $\theta$ so that its predictions $\hat{y}$ are close to the ground-truth targets $y$. A loss function (also called a criterion or objective) turns this notion of "closeness" into a precise scalar number that the optimizer minimizes.
Without a loss, gradient descent has no direction to follow.
The Training Objective
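Given a dataset of input-target pairs $\{(x_i, y_i)\}_{i=1}^{N}$, the empirical training loss is

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)$$

where $f_\theta$ is the model and $\ell$ is the per-sample loss specific to the task. The optimizer updates the parameters $\theta$ via

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)$$

where $\eta$ is the learning rate. PyTorch computes $\nabla_\theta \mathcal{L}$ automatically via loss.backward().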
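The objective and update rule can be traced in a few lines. The following is a minimal sketch, assuming a hypothetical linear model with squared error as the per-sample loss; the toy data, the names `w`, `b`, and `lr`, and the learning-rate value are illustrative choices, not part of the original text.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: N = 8 samples, 3 features, targets from a known linear rule
x = torch.randn(8, 3)
y = x @ torch.tensor([[1.0], [-2.0], [0.5]])

# Parameters theta of an assumed linear model f_theta(x) = x @ w + b
w = torch.zeros(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.05  # learning rate (eta), chosen arbitrarily for this sketch


def empirical_loss():
    pred = x @ w + b
    per_sample = (pred - y).pow(2).squeeze(1)  # squared error per sample, shape (8,)
    return per_sample.mean()                   # average over the N samples


loss_before = empirical_loss()
loss_before.backward()        # autograd fills w.grad and b.grad

with torch.no_grad():         # the update itself should not be recorded in the graph
    w -= lr * w.grad
    b -= lr * b.grad
w.grad.zero_()                # clear gradients before the next step
b.grad.zero_()

loss_after = empirical_loss()
print(loss_before.item(), loss_after.item())
```

In practice an optimizer such as torch.optim.SGD performs the update and the gradient reset; the manual version is shown only to make the correspondence with the formula visible.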
PyTorch Convention
In PyTorch, a built-in loss function is an nn.Module that takes two tensors, input (model prediction) and target (ground truth), and by default returns a scalar:
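```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
# requires_grad=True stands in for a model output that is attached to the graph
input = torch.randn(4, 1, requires_grad=True)   # model output
target = torch.randn(4, 1)                      # ground truth
loss = criterion(input, target)                 # scalar
loss.backward()                                 # populates .grad on tensors that require grad (here: input)
```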
The names input and target follow PyTorch's own documentation. In context:
- input: the raw output of the last layer (logits or probabilities, depending on the loss)
- target: the label or value you want the model to predict
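To illustrate "logits or probabilities, depending on the loss", the sketch below feeds the same hypothetical predictions to two binary cross-entropy criteria: BCEWithLogitsLoss, which expects raw logits, and BCELoss, which expects probabilities. The tensors are made-up values for demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(5)                            # raw last-layer outputs for 5 samples
target = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])   # binary labels as floats

# BCEWithLogitsLoss expects raw logits and applies the sigmoid internally
loss_from_logits = nn.BCEWithLogitsLoss()(logits, target)

# BCELoss expects probabilities in [0, 1], so the sigmoid is applied by hand
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), target)

# Both routes should agree up to floating-point error
print(torch.allclose(loss_from_logits, loss_from_probs, atol=1e-5))
```

Passing probabilities to a criterion that expects logits (or the reverse) does not raise an error in general, so it is worth checking the expected input of each loss before wiring it to a model.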
Reduction Modes
Every PyTorch loss accepts a reduction keyword that controls whether and how the per-sample losses are aggregated:
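| reduction | Formula | When to use |
|---|---|---|
| 'mean' | $\frac{1}{N}\sum_{i=1}^{N} \ell_i$ | Default; loss scale is independent of batch size |
| 'sum' | $\sum_{i=1}^{N} \ell_i$ | When you want total loss, e.g. VAE ELBO |
| 'none' | $(\ell_1, \dots, \ell_N)$ | Per-sample losses; useful for custom sample-level weighting |

Here $\ell_i$ is the $i$-th unreduced loss value and $N$ is the number of such values.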
```python
# Reuses input and target from the snippet above
criterion_mean = nn.L1Loss(reduction='mean')
criterion_none = nn.L1Loss(reduction='none')
l_mean = criterion_mean(input, target)   # scalar
l_each = criterion_none(input, target)   # shape (4, 1)
```
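The relationship between the three modes, and the "custom sample-level weighting" use of 'none', can be sketched as follows. The weight values are hypothetical, and the weighted average shown is one of several reasonable ways to normalize.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input = torch.randn(4, 1)
target = torch.randn(4, 1)

per_elem = nn.L1Loss(reduction='none')(input, target)   # shape (4, 1)
l_mean = nn.L1Loss(reduction='mean')(input, target)     # scalar
l_sum = nn.L1Loss(reduction='sum')(input, target)       # scalar

# 'mean' and 'sum' are aggregations of the unreduced values
assert torch.allclose(per_elem.mean(), l_mean)
assert torch.allclose(per_elem.sum(), l_sum)

# Hypothetical per-sample weights, e.g. to emphasize the last sample
weights = torch.tensor([[1.0], [1.0], [1.0], [3.0]])
weighted_loss = (weights * per_elem).sum() / weights.sum()   # weighted average, scalar
print(weighted_loss.item())
```

Reducing manually after weighting keeps the final value a scalar, so it can be passed to backward() like any built-in reduced loss.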
The Computational Graph
PyTorch builds a dynamic computation graph during the forward pass. The loss node sits at the top; calling .backward() propagates gradients back through every operation to the leaf parameters.
```text
x ──► layer 1 ──► layer 2 ──► output
                                │
                    target ──► loss ──► .backward()
                                │
                          ∇θ flows back
```
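This is why the loss is normally a scalar: backward() called with no arguments only works on a single-element tensor. For a non-scalar output you must instead pass a gradient tensor of the same shape via backward(gradient=...). Most built-in losses produce a scalar when reduction != 'none'.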
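The scalar requirement can be checked directly. The sketch below, using made-up tensors, compares reducing to a scalar before calling backward() with passing an explicit gradient tensor to a non-scalar loss, and shows that omitting the argument on a non-scalar raises an error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
target = torch.randn(4, 1)

# Route 1: reduce to a scalar, then call backward() with no arguments
pred_a = torch.zeros(4, 1, requires_grad=True)
nn.MSELoss(reduction='sum')(pred_a, target).backward()

# Route 2: keep per-element losses and pass an explicit gradient tensor
pred_b = torch.zeros(4, 1, requires_grad=True)
per_elem = nn.MSELoss(reduction='none')(pred_b, target)   # shape (4, 1), not a scalar
per_elem.backward(gradient=torch.ones_like(per_elem))     # all-ones weights reproduce the sum

same = torch.allclose(pred_a.grad, pred_b.grad)
print(same)

# Without a gradient argument, backward() on a non-scalar raises a RuntimeError
pred_c = torch.zeros(4, 1, requires_grad=True)
try:
    nn.MSELoss(reduction='none')(pred_c, target).backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)
```

The gradient argument acts as a set of weights on the output elements, which is why a tensor of ones gives the same parameter gradients as summing first.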
Choosing the Right Loss
The right loss encodes your statistical assumption about the target:
| Assumption | Loss |
|---|---|
| Target is a continuous value; errors are symmetric | MSELoss or HuberLoss |
| Target follows a Poisson distribution (count data) | PoissonNLLLoss |
| Target is Gaussian with learnable variance | GaussianNLLLoss |
| Binary label | BCEWithLogitsLoss |
| Categorical label | CrossEntropyLoss |
| Two distributions should match | KLDivLoss |
| Similar items should be embedded close together | TripletMarginLoss |
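As a rough illustration of how the choice of loss constrains tensor shapes and dtypes, the sketch below calls four of the criteria from the table on random stand-in data. The batch size, class count, and Poisson rate are arbitrary assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = 6          # hypothetical batch size
num_classes = 3    # hypothetical class count

# Regression: continuous targets with the same shape as the prediction
reg_loss = nn.MSELoss()(torch.randn(batch, 1), torch.randn(batch, 1))

# Binary classification: one logit per sample, float targets in {0, 1}
bin_target = torch.randint(0, 2, (batch,)).float()
bin_loss = nn.BCEWithLogitsLoss()(torch.randn(batch), bin_target)

# Multi-class classification: logits of shape (batch, classes), integer class indices
cls_target = torch.randint(0, num_classes, (batch,))
cls_loss = nn.CrossEntropyLoss()(torch.randn(batch, num_classes), cls_target)

# Count data: by default PoissonNLLLoss reads its input as the log of the rate
cnt_target = torch.poisson(torch.full((batch,), 2.0))
cnt_loss = nn.PoissonNLLLoss()(torch.randn(batch), cnt_target)

for name, value in [("mse", reg_loss), ("bce", bin_loss), ("ce", cls_loss), ("poisson", cnt_loss)]:
    print(name, value.item())
```

Each call returns a scalar under the default reduction, so any of these criteria can be dropped into the same training loop once the model output and target match the expected format.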
The remaining readings in this supplement cover each family in detail.