Supplement · Loss Functions

Classification Losses: BCE, CrossEntropy, NLL

18 min read

By the end of this reading you will be able to:

Derive binary cross-entropy as the Bernoulli negative log-likelihood and state the numerical instability of combining it with a sigmoid
Explain why BCEWithLogitsLoss is preferred over BCELoss and describe its numerically stable log-sum-exp implementation
Apply CrossEntropyLoss for multi-class classification and identify its internal log-softmax plus NLL decomposition
Match each classification loss (BCELoss, BCEWithLogitsLoss, CrossEntropyLoss, NLLLoss) to its expected input format: logits, probabilities, or log-probabilities

Information-Theoretic Foundation

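Classification losses arise naturally from information theory. The cross-entropy between a true distribution $p$ and a predicted distribution $q$ is the average number of bits (or nats, when the natural logarithm is used) needed to encode samples from $p$ using a code optimised for $q$. The excess over the entropy of $p$, that is, the bits actually wasted, is the KL divergence.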

$H(p, q) = -\sum_{k} p_k \log q_k$

When $p$ is the one-hot encoding of the ground-truth label and $q$ is the model's predicted probability vector, minimising cross-entropy is equivalent to maximum likelihood estimation of a categorical distribution.
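As a quick numerical check of the one-hot case, the sketch below evaluates $H(p, q)$ in plain Python (natural logarithm, so the unit is nats) and confirms that it collapses to $-\log q_c$ for the true class $c$. The function name `cross_entropy` and the example numbers are illustrative assumptions, not part of any library.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), skipping terms where p_k == 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.0, 1.0, 0.0]          # one-hot label: true class is index 1
q = [0.2, 0.7, 0.1]          # model's predicted probabilities

h = cross_entropy(p, q)
nll = -math.log(q[1])        # negative log-likelihood of the true class
print(h, nll)                # the two values coincide
```

Only the predicted probability of the true class enters the loss; how the remaining mass is spread over the wrong classes does not matter.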

nn.BCELoss — Binary Cross-Entropy

For binary classification with label $y \in \{0, 1\}$ and predicted probability $p \in (0, 1)$ :

$\ell = -\left[ y \log p + (1 - y) \log(1 - p) \right]$

This is the NLL of the Bernoulli distribution $P(y \mid p) = p^y (1-p)^{1-y}$. When $y = 1$, only the first term survives and the loss penalises low $p$. When $y = 0$, only the second term survives and the loss penalises high $p$.

Important: BCELoss expects probabilities in $(0,1)$ as input. You must apply torch.sigmoid yourself before passing to BCELoss.

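import torch
import torch.nn as nn

m = nn.Sigmoid()                         # maps logits to probabilities
loss = nn.BCELoss()                      # expects probabilities in (0, 1)
logit = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])   # targets must be floats
output = loss(m(logit), target)          # mean over the three elements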
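To see what this call computes, here is a minimal plain-Python sketch that reproduces the same numbers by hand, assuming the default `reduction='mean'`. The helper names `sigmoid` and `bce` are hypothetical and exist only for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Per-element binary cross-entropy for probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

per_element = [bce(sigmoid(x), y) for x, y in zip(logits, targets)]
mean_loss = sum(per_element) / len(per_element)
print(per_element)   # one loss value per element
print(mean_loss)     # roughly 0.305
```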

nn.BCEWithLogitsLoss — Numerically Stable Version

BCEWithLogitsLoss combines sigmoid and BCE into one operation, using the log-sum-exp trick to avoid numerical overflow:

$\ell = -\left[ y \log \sigma(x) + (1-y) \log(1 - \sigma(x)) \right]$

which simplifies to:

$\ell = \max(x, 0) - x \cdot y + \log\bigl(1 + e^{-|x|}\bigr)$

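Because the exponent $-|x|$ is never positive, $e^{-|x|}$ lies in $(0, 1]$ and cannot overflow. A naive evaluation instead computes $e^{x}$ or $e^{-x}$, which overflows to inf in float32 once the argument exceeds roughly 88. The pos_weight parameter scales the loss of positive examples, which is useful for class-imbalanced datasets.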
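The difference between the two forms of the loss above can be checked with a small plain-Python sketch. It is an illustration under stated assumptions, not PyTorch's actual implementation: `bce_naive` and `bce_stable` are hypothetical names, and Python floats are float64, so the failure shows up at larger magnitudes than it would in float32.

```python
import math

def bce_naive(x, y):
    """Sigmoid followed by BCE, evaluated exactly as the formula is written."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_stable(x, y):
    """max(x, 0) - x*y + log(1 + exp(-|x|)); the exponent is never positive."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# Moderate logit: both forms agree.
print(bce_naive(2.0, 1.0), bce_stable(2.0, 1.0))

# Confident wrong prediction: sigmoid(40) rounds to exactly 1.0,
# so the naive form tries to evaluate log(0).
try:
    bce_naive(40.0, 0.0)
except ValueError as err:
    print("naive form failed:", err)

print(bce_stable(40.0, 0.0))     # close to 40
print(bce_stable(-1000.0, 1.0))  # close to 1000, no overflow
```

For a confidently wrong prediction the stable form grows linearly in $|x|$, which is the correct limiting behaviour of the loss.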

loss = nn.BCEWithLogitsLoss()          # takes RAW LOGITS directly
logit  = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])
output = loss(logit, target)
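The pos_weight parameter mentioned above can be exercised on the same tensors. The sketch below assumes a weight of 3.0, an arbitrary value chosen for illustration, and compares the result against a manual computation in which only the positive-label term is scaled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logit = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])

# Weight every positive example three times as heavily as a negative one.
# The value 3.0 is an arbitrary choice for illustration.
pos_weight = torch.tensor([3.0])
weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logit, target)

# Manual check: scale only the y * log(sigmoid(x)) term by pos_weight.
# log(1 - sigmoid(x)) equals logsigmoid(-x).
manual = -(pos_weight * target * F.logsigmoid(logit)
           + (1 - target) * F.logsigmoid(-logit)).mean()

print(weighted.item(), manual.item())   # the two values agree
```

A common heuristic, stated here as an assumption rather than a rule, is to set the weight near the ratio of negative to positive examples in the training set.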

Always prefer BCEWithLogitsLoss over Sigmoid + BCELoss: the two are mathematically equivalent, but the fused version is more numerically stable.

nn.CrossEntropyLoss — Multi-class Classification

For a $C$ -class problem with logits $x \in \mathbb{R}^C$ and ground-truth class $c \in \{0, \ldots, C-1\}$ :

$\ell = -\log \frac{e^{x_c}}{\sum_{j=0}^{C-1} e^{x_j}} = -x_c + \log \sum_{j=0}^{C-1} e^{x_j}$

This is the NLL of the softmax distribution: $\text{softmax}(x)_j = e^{x_j} / \sum_k e^{x_k}$ . CrossEntropyLoss applies log-softmax and NLL in one numerically stable step.

loss = nn.CrossEntropyLoss()         # takes RAW LOGITS
x = torch.randn(4, 10)               # 4 samples, 10 classes
y = torch.tensor([0, 3, 7, 2])       # integer class labels
output = loss(x, y)
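The "numerically stable step" can be illustrated for a single sample in plain Python. This is a sketch of the standard max-subtraction form of log-sum-exp, not a transcription of PyTorch's internals, and the function name `cross_entropy_from_logits` is hypothetical.

```python
import math

def cross_entropy_from_logits(x, c):
    """-x_c + log(sum_j exp(x_j)), with the max subtracted before exponentiating."""
    m = max(x)
    log_sum_exp = m + math.log(sum(math.exp(xj - m) for xj in x))
    return -x[c] + log_sum_exp

print(cross_entropy_from_logits([2.0, 1.0, 0.1], 0))   # roughly 0.417

# Extreme logits: math.exp(1000.0) on its own would raise OverflowError,
# but every shifted exponent here is <= 0.
print(cross_entropy_from_logits([1000.0, 0.0, -1000.0], 1))  # close to 1000
```

Subtracting the maximum does not change the result, because the shift cancels between the numerator and the denominator of the softmax.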

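CrossEntropyLoss also accepts soft labels. For teacher-provided class probabilities, pass a float tensor of shape $(N, C)$ as the target instead of integer indices. For label smoothing, keep the integer labels and set the label_smoothing argument: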

loss = nn.CrossEntropyLoss(label_smoothing=0.1)

With label_smoothing=$\varepsilon$, the one-hot target is mixed with a uniform distribution: the true class receives $p_c = 1 - \varepsilon + \varepsilon/C$ and every other class receives $\varepsilon/C$. This discourages overconfident predictions.
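The two ways of supplying soft labels can be compared directly. The sketch below, with an arbitrary seed and batch chosen for illustration, builds the smoothed distribution by hand, passes it as a float target of shape $(N, C)$, and checks that the result matches the label_smoothing argument applied to integer labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N, C, eps = 4, 10, 0.1
x = torch.randn(N, C)
y = torch.tensor([0, 3, 7, 2])

# Option 1: integer labels, smoothing applied by the loss itself.
smoothed = nn.CrossEntropyLoss(label_smoothing=eps)(x, y)

# Option 2: build the smoothed distribution by hand and pass it as a
# float target of shape (N, C).
soft_target = (1 - eps) * F.one_hot(y, num_classes=C).float() + eps / C
manual = nn.CrossEntropyLoss()(x, soft_target)

print(smoothed.item(), manual.item())   # the two values agree
```

The same float-target path is what a distillation setup would use, with the teacher's probabilities in place of `soft_target`.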

nn.NLLLoss — The Building Block

NLLLoss is the final step inside CrossEntropyLoss. It expects log-probabilities as input (the output of F.log_softmax). For sample $i$ with target class $c_i$, the per-sample loss is:

$\ell = -x_{i, c_i}$

It simply looks up and negates the log-probability assigned to the correct class.

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
logits = torch.randn(4, 10)
y      = torch.tensor([0, 3, 7, 2])
output = loss(m(logits), y)
# Equivalent to nn.CrossEntropyLoss()(logits, y)
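The "look up and negate" description can be written out with plain tensor indexing. The following sketch, which assumes the default mean reduction and an arbitrary seed, checks that manual indexing, NLLLoss on log-probabilities, and CrossEntropyLoss on raw logits all give the same value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
y = torch.tensor([0, 3, 7, 2])

log_probs = F.log_softmax(logits, dim=1)

# Pick log_probs[i, y[i]] for every row i, negate, and average.
manual = -log_probs[torch.arange(len(y)), y].mean()

via_nll = nn.NLLLoss()(log_probs, y)
via_ce = nn.CrossEntropyLoss()(logits, y)
print(manual.item(), via_nll.item(), via_ce.item())   # all three agree
```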

CrossEntropyLoss = LogSoftmax + NLLLoss — use CrossEntropyLoss directly; use NLLLoss only when you have already computed log-probabilities (e.g., CTC, mixture models).

Summary: Which Input Does Each Loss Expect?

Loss	Input type	Target type
BCELoss	Probabilities $(0,1)$	Float $\{0.0, 1.0\}$
BCEWithLogitsLoss	Raw logits $\mathbb{R}$	Float $\{0.0, 1.0\}$
CrossEntropyLoss	Raw logits $\mathbb{R}^C$	Integer class index or float probabilities
NLLLoss	Log-probabilities	Integer class index

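A common mistake is passing probabilities to CrossEntropyLoss. It applies log-softmax internally, so the softmax is effectively applied twice: the second pass flattens the distribution and the loss can no longer approach zero, even for a perfect prediction. No error is raised, which makes the bug easy to miss.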
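The effect of feeding probabilities where logits are expected can be seen in a small plain-Python sketch for a single three-class sample. The function name and the numbers are illustrative assumptions; the function mirrors the per-sample formula $-x_c + \log \sum_j e^{x_j}$.

```python
import math

def cross_entropy_from_logits(x, c):
    """What CrossEntropyLoss computes for one sample: -x_c + logsumexp(x)."""
    m = max(x)
    return -x[c] + m + math.log(sum(math.exp(xj - m) for xj in x))

# A perfectly confident, correct prediction, expressed two ways.
probs = [1.0, 0.0, 0.0]      # probabilities (the wrong input type)
logits = [30.0, 0.0, 0.0]    # logits whose softmax is almost the same distribution

print(cross_entropy_from_logits(logits, 0))  # essentially zero
print(cross_entropy_from_logits(probs, 0))   # roughly 0.551, cannot go lower for C = 3
```

Because probabilities are confined to $[0, 1]$, the best achievable value in this misuse is $\log(1 + (C-1)/e)$, so the reported loss plateaus well above zero.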

