Supplement · Loss Functions

Classification Losses: BCE, CrossEntropy, NLL

By the end of this reading you will be able to:
  • Derive binary cross-entropy as the Bernoulli negative log-likelihood and state the numerical instability of combining it with a sigmoid
  • Explain why BCEWithLogitsLoss is preferred over BCELoss and describe its numerically stable log-sum-exp implementation
  • Apply CrossEntropyLoss for multi-class classification and identify its internal log-softmax plus NLL decomposition
  • Match each classification loss (BCELoss, BCEWithLogitsLoss, CrossEntropyLoss, NLLLoss) to its expected input format: logits, probabilities, or log-probabilities

Information-Theoretic Foundation

Classification losses arise naturally from information theory. The cross-entropy between a true distribution $p$ and a predicted distribution $q$ measures the average number of bits needed to encode samples from $p$ using a code optimised for $q$ (bits for a base-2 logarithm, nats for the natural logarithm used in practice):

$$H(p, q) = -\sum_{k} p_k \log q_k$$

When $p$ is the one-hot encoding of the ground-truth label and $q$ is the model's predicted probability vector, minimising cross-entropy is equivalent to maximum likelihood estimation of a categorical distribution.
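As a quick numerical check of this equivalence, the sketch below evaluates $H(p, q)$ in plain Python for a one-hot $p$. The helper name `cross_entropy` and the example probabilities are illustrative assumptions; natural logarithms are used, so the result is in nats.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), skipping terms where p_k == 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.0, 1.0, 0.0]          # one-hot: the true class is index 1
q = [0.2, 0.7, 0.1]          # illustrative predicted probabilities

h = cross_entropy(p, q)
nll = -math.log(q[1])        # negative log-likelihood of the true class

print(h, nll)                # the two values coincide for a one-hot p
```

With a one-hot $p$, every term of the sum vanishes except the one for the true class, which is why the cross-entropy collapses to a single negative log-probability.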


nn.BCELoss — Binary Cross-Entropy

For binary classification with label $y \in \{0, 1\}$ and predicted probability $p \in (0, 1)$:

$$\ell = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$

This is the NLL of the Bernoulli distribution $p(y \mid p) = p^y (1-p)^{1-y}$. When $y = 1$, only the first term survives and the loss penalises low $p$. When $y = 0$, only the second term survives and the loss penalises high $p$.

Important: BCELoss expects probabilities in $(0, 1)$ as input. You must apply torch.sigmoid yourself before passing to BCELoss.

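import torch
import torch.nn as nn

m = nn.Sigmoid()                          # maps logits to probabilities
loss = nn.BCELoss()                       # expects probabilities, not logits
logit = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])    # float targets
output = loss(m(logit), target)           # averaged over the three samples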
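To make the value returned by the snippet above less opaque, here is a minimal plain-Python sketch that recomputes the same quantity term by term. The helper names `sigmoid` and `bce` are hypothetical, and the final averaging assumes BCELoss's default `reduction='mean'`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Per-sample binary cross-entropy for probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

per_sample = [bce(sigmoid(x), y) for x, y in zip(logits, targets)]
mean_loss = sum(per_sample) / len(per_sample)

print([round(v, 4) for v in per_sample])  # roughly [0.1269, 0.3133, 0.4741]
print(round(mean_loss, 4))                # roughly 0.3048
```

The third sample contributes the largest loss because its logit of 0.5 gives the least confident correct prediction.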

nn.BCEWithLogitsLoss — Numerically Stable Version

BCEWithLogitsLoss combines sigmoid and BCE into one operation, using the log-sum-exp trick to avoid numerical overflow:

$$\ell = -\left[ y \log \sigma(x) + (1-y) \log(1 - \sigma(x)) \right]$$

which simplifies to:

$$\ell = \max(x, 0) - x \cdot y + \log\bigl(1 + e^{-|x|}\bigr)$$

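Because the exponent $-|x|$ is never positive, $e^{-|x|}$ always lies in $(0, 1]$ and cannot overflow. A direct evaluation, by contrast, has to compute $e^{x}$ or $e^{-x}$, which overflows to inf in float32 once $|x|$ exceeds roughly 88, and it can round $\sigma(x)$ to exactly 0 or 1 so that one of the log terms becomes $-\infty$. The pos_weight parameter re-weights the loss on positive examples, which is useful for class-imbalanced datasets.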
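The difference between the two forms is easy to see numerically. The sketch below is plain Python with hypothetical helper names, not PyTorch's actual implementation. Python floats are float64, so overflow sets in near $|x| \approx 709$ rather than 88, but the failure mode is the same.

```python
import math

def bce_with_logits_naive(x, y):
    """Direct transcription: sigmoid first, then the two log terms."""
    s = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(s) + (1 - y) * math.log(1 - s))

def bce_with_logits_stable(x, y):
    """max(x, 0) - x*y + log(1 + exp(-|x|)); the exponent is never positive."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# For moderate logits the two forms agree.
for x, y in [(2.0, 1.0), (-1.0, 0.0), (0.5, 1.0)]:
    assert math.isclose(bce_with_logits_naive(x, y),
                        bce_with_logits_stable(x, y), rel_tol=1e-9)

# For an extreme logit the naive form breaks down, the stable form does not.
try:
    bce_with_logits_naive(-1000.0, 1.0)   # math.exp(1000.0) overflows
    naive_failed = False
except (OverflowError, ValueError):
    naive_failed = True

stable_value = bce_with_logits_stable(-1000.0, 1.0)
print(naive_failed, stable_value)         # True 1000.0
```

For a logit of -1000 with a positive label, the loss is essentially $|x| = 1000$, which the stable form returns exactly.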

import torch
import torch.nn as nn

loss = nn.BCEWithLogitsLoss()          # takes RAW LOGITS directly
logit  = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])
output = loss(logit, target)           # no explicit sigmoid needed

Always prefer BCEWithLogitsLoss over Sigmoid + BCELoss: the two are mathematically equivalent, but the fused version is more numerically stable.
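The pos_weight parameter of BCEWithLogitsLoss can be illustrated with a short sketch. The weight of 3.0 is an arbitrary assumption standing in for a dataset with roughly three negatives per positive. The manual expression scales only the positive-class term, which is how the PyTorch documentation describes the parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logit = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])

# Assumed imbalance: treat each positive as if it appeared 3 times.
pos_weight = torch.tensor([3.0])
weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logit, target)
unweighted = nn.BCEWithLogitsLoss()(logit, target)

# Manual check: scale only the y * log(sigmoid(x)) term by pos_weight.
# log(1 - sigmoid(x)) is written as logsigmoid(-x) for stability.
manual = -(pos_weight * target * F.logsigmoid(logit)
           + (1 - target) * F.logsigmoid(-logit)).mean()

print(weighted.item(), manual.item(), unweighted.item())
```

Raising pos_weight above 1 increases the penalty for missed positives (better recall at the cost of precision), while values below 1 do the opposite.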


nn.CrossEntropyLoss — Multi-class Classification

For a $C$-class problem with logits $x \in \mathbb{R}^C$ and ground-truth class $c \in \{0, \ldots, C-1\}$:

$$\ell = -\log \frac{e^{x_c}}{\sum_{j=0}^{C-1} e^{x_j}} = -x_c + \log \sum_{j=0}^{C-1} e^{x_j}$$

This is the NLL of the softmax distribution: $\text{softmax}(x)_j = e^{x_j} / \sum_k e^{x_k}$. CrossEntropyLoss applies log-softmax and NLL in one numerically stable step.

import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()         # takes RAW LOGITS
x = torch.randn(4, 10)               # 4 samples, 10 classes
y = torch.tensor([0, 3, 7, 2])       # integer class labels
output = loss(x, y)
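The right-hand form of the formula, $-x_c + \log \sum_j e^{x_j}$, can be evaluated for a single sample in plain Python. This is a minimal sketch with hypothetical helper names. Subtracting the maximum logit before exponentiating is the standard log-sum-exp stabilisation, shown here as an illustration rather than as PyTorch's exact code path.

```python
import math

def log_sum_exp(xs):
    """Stable log(sum(exp(x))): subtract the max before exponentiating."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def cross_entropy_from_logits(logits, c):
    """-x_c + log sum_j exp(x_j) for one sample with true class index c."""
    return -logits[c] + log_sum_exp(logits)

# Uniform logits: every class gets probability 1/C, so the loss is log(C).
uniform = cross_entropy_from_logits([0.0] * 10, c=3)

# Huge logits would overflow a naive exp(), but the shifted form is fine.
confident = cross_entropy_from_logits([1000.0, 0.0, -1000.0], c=0)

print(uniform, math.log(10))   # both about 2.3026
print(confident)               # 0.0
```

The uniform case gives a useful sanity check: a 10-class model that predicts nothing useful should report a loss near $\log 10 \approx 2.30$.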

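CrossEntropyLoss also accepts soft labels (for example teacher-provided probabilities): pass a float tensor of shape $(N, C)$ as the target instead of class indices. Label smoothing does not require soft targets; keep the integer labels and set the label_smoothing argument:

import torch.nn as nn

loss = nn.CrossEntropyLoss(label_smoothing=0.1)   # targets stay integer class indices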

With label_smoothing=ε, the one-hot target is mixed with a uniform distribution: the true class gets $p_c = 1 - \varepsilon + \varepsilon/C$ and every other class gets $\varepsilon/C$. This discourages the model from becoming overconfident.
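The relationship between label smoothing and soft targets can be checked directly. The sketch below builds the smoothed distribution by hand and passes it as a float target. It assumes a PyTorch version that supports both label_smoothing and probability targets (1.10 or later, as far as I am aware); the seed and shapes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N, C, eps = 4, 10, 0.1
x = torch.randn(N, C)                      # raw logits
y = torch.tensor([0, 3, 7, 2])             # integer class labels

# Built-in smoothing with integer targets.
builtin = nn.CrossEntropyLoss(label_smoothing=eps)(x, y)

# Same target distribution written out explicitly as soft labels:
# 1 - eps + eps/C on the true class, eps/C everywhere else.
soft = F.one_hot(y, num_classes=C).float() * (1 - eps) + eps / C
explicit = nn.CrossEntropyLoss()(x, soft)  # float (N, C) target

print(builtin.item(), explicit.item())     # the two losses should match
```

Each row of `soft` still sums to 1, so it remains a valid probability distribution.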


nn.NLLLoss — The Building Block

NLLLoss is the final step inside CrossEntropyLoss. It expects log-probabilities as input (the output of F.log_softmax) and, for sample $i$ with target class $c_i$, returns:

$$\ell = -x_{i, c_i}$$

It simply looks up and negates the log-probability assigned to the correct class.

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)             # normalise over the class dimension
loss = nn.NLLLoss()
logits = torch.randn(4, 10)
y      = torch.tensor([0, 3, 7, 2])
output = loss(m(logits), y)
# Equivalent to nn.CrossEntropyLoss()(logits, y)

CrossEntropyLoss = LogSoftmax + NLLLoss. Use CrossEntropyLoss directly; reach for NLLLoss only when you already have log-probabilities, for example when the model ends in a LogSoftmax layer or when class log-probabilities come from a mixture model.
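The "look up and negate" description of NLLLoss can be reproduced with plain tensor indexing. The following sketch assumes the default `reduction='mean'`, with an arbitrary seed and arbitrary shapes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
y = torch.tensor([0, 3, 7, 2])

log_probs = F.log_softmax(logits, dim=1)

# Pick out log_probs[i, y[i]] for every row i, negate, then average.
picked = log_probs[torch.arange(len(y)), y]
manual = -picked.mean()

print(torch.allclose(manual, F.nll_loss(log_probs, y)))       # True
print(torch.allclose(manual, F.cross_entropy(logits, y)))     # True
```

NLLLoss itself does no normalisation. If its input is not already a log-probability, the result is not a valid likelihood.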


Summary: Which Input Does Each Loss Expect?

| Loss | Input type | Target type |
| --- | --- | --- |
| BCELoss | Probabilities in $(0, 1)$ | Float $\{0.0, 1.0\}$ |
| BCEWithLogitsLoss | Raw logits in $\mathbb{R}$ | Float $\{0.0, 1.0\}$ |
| CrossEntropyLoss | Raw logits in $\mathbb{R}^C$ | Integer class index or float probabilities |
| NLLLoss | Log-probabilities | Integer class index |

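A common mistake is passing probabilities to CrossEntropyLoss. It applies log-softmax internally, so the probabilities are normalised by softmax a second time. No error is raised, but the predicted distribution is flattened, the loss can never approach zero, and the gradients are distorted.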
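The effect of this mistake can be shown without PyTorch. The sketch below mimics the per-sample computation of CrossEntropyLoss in plain Python. The helper names and the example logits are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(inputs, c):
    """What CrossEntropyLoss computes for one sample: -log softmax(inputs)[c]."""
    return -math.log(softmax(inputs)[c])

logits = [6.0, 1.0, 1.0]            # a confident, correct prediction (class 0)
probs = softmax(logits)

right = cross_entropy_from_logits(logits, c=0)   # logits passed, as intended
wrong = cross_entropy_from_logits(probs, c=0)    # probabilities passed by mistake

# Even a perfect probability vector cannot push the mistaken loss to zero.
floor = cross_entropy_from_logits([1.0, 0.0, 0.0], c=0)

print(round(right, 4), round(wrong, 4), round(floor, 4))
```

Here `right` is about 0.013, while `wrong` is about 0.56, just above the floor of $\log(1 + 2/e) \approx 0.551$ for three classes. A training loss that plateaus at a suspiciously high value is often the first symptom of this bug.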