Classification Losses: BCE, CrossEntropy, NLL
- Derive binary cross-entropy as the Bernoulli negative log-likelihood and state the numerical instability of combining it with a sigmoid
- Explain why BCEWithLogitsLoss is preferred over BCELoss and describe its numerically stable log-sum-exp implementation
- Apply CrossEntropyLoss for multi-class classification and identify its internal log-softmax plus NLL decomposition
- Match each classification loss (BCELoss, BCEWithLogitsLoss, CrossEntropyLoss, NLLLoss) to its expected input format: logits, probabilities, or log-probabilities
Information-Theoretic Foundation
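Classification losses arise naturally from information theory. The cross-entropy between a true distribution $p$ and a predicted distribution $q$ measures the average number of bits (or nats, with the natural logarithm) needed to encode samples from $p$ using a code designed for $q$; the excess over the entropy $H(p)$ is the number of bits wasted:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
When $p$ is the one-hot encoding of the ground-truth label and $q$ is the model's predicted probability vector, the sum collapses to $-\log q_c$ for the true class $c$, so minimising cross-entropy is equivalent to maximum likelihood estimation of a categorical distribution.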
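As a small numerical check of the one-hot case, the sketch below evaluates the cross-entropy sum in plain Python and compares it with the negative log-probability of the true class. The helper name `cross_entropy` and the example probabilities are illustrative assumptions; natural logarithms are used, so the result is in nats rather than bits.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), in nats; terms with p(x) == 0 are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.0, 1.0, 0.0]      # one-hot ground truth: the true class is index 1
q = [0.2, 0.7, 0.1]      # hypothetical model prediction

h = cross_entropy(p, q)
nll = -math.log(q[1])    # negative log-likelihood of the true class
print(h, nll)            # the two values coincide
```

Only the probability assigned to the true class enters the loss; how the remaining mass is spread over the wrong classes does not matter for a one-hot target.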
nn.BCELoss — Binary Cross-Entropy
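For binary classification with label $y \in \{0, 1\}$ and predicted probability $\hat{y} \in (0, 1)$:
$$\ell(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
This is the NLL of the Bernoulli distribution $P(y \mid \hat{y}) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y}$. When $y = 1$, only the first term survives and the loss penalises low $\hat{y}$. When $y = 0$, only the second term survives and the loss penalises high $\hat{y}$.
Important: BCELoss expects probabilities in $[0, 1]$ as input. You must apply torch.sigmoid yourself before passing to BCELoss. This separate sigmoid is the source of the numerical instability: for logits of large magnitude the probability rounds to exactly 0 or 1 in float32, and the logarithm of 0 diverges.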
import torch
import torch.nn as nn
m = nn.Sigmoid()
loss = nn.BCELoss()
logit = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])
output = loss(m(logit), target)  # sigmoid first: BCELoss needs probabilities
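To connect the formula to the module, the following sketch recomputes the Bernoulli negative log-likelihood by hand and compares it with nn.BCELoss on the same inputs. The variable names `manual` and `builtin` are illustrative, and the default `reduction="mean"` is assumed.

```python
import torch
import torch.nn as nn

logit = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])
prob = torch.sigmoid(logit)

# Bernoulli negative log-likelihood per sample, then averaged over the batch
manual = -(target * torch.log(prob) + (1 - target) * torch.log(1 - prob)).mean()
builtin = nn.BCELoss()(prob, target)   # default reduction="mean"

print(manual.item(), builtin.item())   # the two values agree up to rounding
```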
nn.BCEWithLogitsLoss — Numerically Stable Version
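BCEWithLogitsLoss combines sigmoid and BCE into one operation, using the log-sum-exp trick to avoid numerical overflow. For a logit $x$ with $\hat{y} = \sigma(x)$:
$$\ell(x, y) = -\left[\, y \log \sigma(x) + (1 - y) \log(1 - \sigma(x)) \,\right]$$
which simplifies to:
$$\ell(x, y) = \max(x, 0) - x y + \log\left(1 + e^{-|x|}\right)$$
Because the exponent $-|x|$ is never positive, this avoids computing $e^{x}$ for large positive $x$ (which overflows to inf in float32). The pos_weight parameter re-weights positive examples, useful for class-imbalanced datasets.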
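The stable form max(x, 0) - x*y + log(1 + exp(-|x|)) can be sketched for a single logit in plain Python. This is only an illustration of the algebra, not PyTorch's actual kernel; the function names `bce_with_logits` and `bce_naive` are hypothetical. Python floats are float64, so the naive form only breaks at a far more extreme logit than it would in float32.

```python
import math

def bce_with_logits(x, y):
    """Stable BCE for one logit x and label y in {0, 1}; the exponent is never positive."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def bce_naive(x, y):
    """Direct formula: sigmoid first, then the two log terms."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# For a moderate logit the two forms agree.
moderate = (bce_with_logits(2.0, 1.0), bce_naive(2.0, 1.0))

# For an extreme logit the stable form still works: exp(-1000) underflows to 0.0.
extreme = bce_with_logits(-1000.0, 1.0)

try:
    bce_naive(-1000.0, 1.0)      # math.exp(1000.0) is out of range for a float
    naive_failed = False
except OverflowError:
    naive_failed = True

print(moderate, extreme, naive_failed)
```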
loss = nn.BCEWithLogitsLoss() # takes RAW LOGITS directly
logit = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])
output = loss(logit, target)
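The effect of pos_weight can be seen on a single example. In this sketch the weight of 3.0 is an arbitrary illustrative value; in practice it is often set near the ratio of negative to positive examples.

```python
import torch
import torch.nn as nn

logit = torch.tensor([0.5])
pos = torch.tensor([1.0])
neg = torch.tensor([0.0])

plain = nn.BCEWithLogitsLoss()
weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))  # illustrative weight

# A positive example costs three times as much under the weighted loss...
ratio_pos = weighted(logit, pos) / plain(logit, pos)
# ...while a negative example is unaffected.
ratio_neg = weighted(logit, neg) / plain(logit, neg)

print(ratio_pos.item(), ratio_neg.item())
```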
Always prefer BCEWithLogitsLoss over Sigmoid + BCELoss: it is mathematically equivalent but more numerically stable.
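The difference shows up once a logit is extreme. The sketch below uses a hypothetical confidently wrong prediction (logit of -200 for a positive label). In float32 the sigmoid underflows to exactly zero, so the two-step route has lost the information about how wrong the prediction was, while the fused loss still reports it. The two-step value stays finite only because BCELoss clamps its log outputs, as described in the PyTorch documentation.

```python
import torch
import torch.nn as nn

logit = torch.tensor([-200.0])   # confidently wrong prediction
target = torch.tensor([1.0])

prob = torch.sigmoid(logit)                     # underflows to 0.0 in float32
two_step = nn.BCELoss()(prob, target)           # log(0) is clamped inside BCELoss
fused = nn.BCEWithLogitsLoss()(logit, target)   # works on the logit directly

print(prob.item(), two_step.item(), fused.item())
```

The fused loss grows linearly with the size of the logit (about 200 here), whereas the two-step loss saturates at the clamp.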
nn.CrossEntropyLoss — Multi-class Classification
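For a $C$-class problem with logits $\mathbf{z} \in \mathbb{R}^C$ and ground-truth class $c$:
$$\ell(\mathbf{z}, c) = -\log \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}} = -z_c + \log \sum_{j=1}^{C} e^{z_j}$$
This is the NLL of the softmax distribution: $\hat{p}_j = e^{z_j} / \sum_{k=1}^{C} e^{z_k}$. CrossEntropyLoss applies log-softmax and NLL in one numerically stable step.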
loss = nn.CrossEntropyLoss() # takes RAW LOGITS
x = torch.randn(4, 10) # 4 samples, 10 classes
y = torch.tensor([0, 3, 7, 2]) # integer class labels
output = loss(x, y)
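The closed form, minus the true-class logit plus the log-sum-exp of all logits, can be checked against the module directly. This is a sketch assuming the default mean reduction; the seed is fixed only to make the run repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 10)             # 4 samples, 10 classes
y = torch.tensor([0, 3, 7, 2])     # integer class labels

# Per-sample loss: log sum_j exp(z_j) - z_c
per_sample = torch.logsumexp(x, dim=1) - x[torch.arange(4), y]
manual = per_sample.mean()
builtin = nn.CrossEntropyLoss()(x, y)

print(manual.item(), builtin.item())   # agree up to rounding
```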
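For soft labels (for example, teacher-provided probabilities), pass a float tensor of class probabilities with shape $(N, C)$ as the target. For label smoothing, the label_smoothing argument builds the softened target from integer labels for you: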
loss = nn.CrossEntropyLoss(label_smoothing=0.1)
With label_smoothing=ε, the true distribution is mixed with a uniform distribution: the correct class gets target probability $1 - \varepsilon + \varepsilon / C$, all others $\varepsilon / C$. This prevents the model from being overconfident.
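The two routes to soft targets can be compared directly: build the smoothed distribution by hand and pass it as a float target, or let label_smoothing do it. This sketch assumes a PyTorch version in which CrossEntropyLoss accepts both probability targets and the label_smoothing argument (1.10 or later, to the best of my knowledge).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, eps = 10, 0.1
x = torch.randn(4, num_classes)
y = torch.tensor([0, 3, 7, 2])

# Smoothed targets by hand: shape (4, 10), each row sums to 1
one_hot = F.one_hot(y, num_classes).float()
soft = (1 - eps) * one_hot + eps / num_classes

via_soft_targets = nn.CrossEntropyLoss()(x, soft)               # float targets
via_argument = nn.CrossEntropyLoss(label_smoothing=eps)(x, y)   # integer targets

print(via_soft_targets.item(), via_argument.item())
```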
nn.NLLLoss — The Building Block
NLLLoss is the final step inside CrossEntropyLoss. It expects log-probabilities as input (the output of F.log_softmax) and, for a vector of log-probabilities $\mathbf{x}$ and true class $c$, returns:
$$\ell(\mathbf{x}, c) = -x_c$$
It simply looks up and negates the log-probability assigned to the correct class.
m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
logits = torch.randn(4, 10)
y = torch.tensor([0, 3, 7, 2])
output = loss(m(logits), y)
# Equivalent to nn.CrossEntropyLoss()(logits, y)
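The "look up and negate" description can be made literal with an indexing operation. The sketch below gathers the correct-class log-probability from each row and checks the result against both NLLLoss and CrossEntropyLoss, assuming the default mean reduction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
y = torch.tensor([0, 3, 7, 2])
log_probs = F.log_softmax(logits, dim=1)

# Pick the log-probability of the correct class in each row, negate, average
picked = log_probs.gather(1, y.unsqueeze(1)).squeeze(1)
manual = -picked.mean()

nll = nn.NLLLoss()(log_probs, y)
ce = nn.CrossEntropyLoss()(logits, y)

print(manual.item(), nll.item(), ce.item())   # all three agree up to rounding
```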
CrossEntropyLoss = LogSoftmax + NLLLoss — use CrossEntropyLoss directly; use NLLLoss only when you have already computed log-probabilities (e.g., CTC, mixture models).
Summary: Which Input Does Each Loss Expect?
| Loss | Input type | Target type |
|---|---|---|
| BCELoss | Probabilities | Float |
| BCEWithLogitsLoss | Raw logits | Float |
| CrossEntropyLoss | Raw logits | Integer class index or float probabilities |
| NLLLoss | Log-probabilities | Integer class index |
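A common mistake is passing probabilities to CrossEntropyLoss. It applies log-softmax internally, so the probabilities are treated as logits and normalised a second time. No error is raised, but because the inputs are confined to $[0, 1]$ the loss is squeezed into a narrow range and can never approach zero, so training degrades silently.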
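A minimal illustration of this failure mode, using a hypothetical three-class example: even a perfect, fully confident probability vector yields a clearly non-zero loss when it is fed in as if it were logits, whereas well-separated logits drive the loss towards zero.

```python
import math
import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()
y = torch.tensor([0])

probs = torch.tensor([[1.0, 0.0, 0.0]])     # perfect prediction, given as probabilities
wrong = loss(probs, y)                       # probabilities misused as logits

logits = torch.tensor([[20.0, 0.0, 0.0]])   # illustrative logits for the same prediction
right = loss(logits, y)

# The misuse has a floor of log(1 + 2/e) for three classes, roughly 0.55
floor = math.log(1 + 2 / math.e)
print(wrong.item(), floor, right.item())
```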