Supplement · Loss Functions

Distribution & Similarity Losses


By the end of this reading you will be able to:

Compute KL divergence between two distributions and explain its asymmetry and why it diverges when the predicted distribution assigns zero probability to an outcome the target distribution supports
Apply MarginRankingLoss to a pair of scores and verify the direction of the margin constraint
Distinguish HingeEmbeddingLoss (takes a precomputed distance per pair with {-1, +1} labels) from CosineEmbeddingLoss (takes two embedding vectors with {-1, +1} labels and scores them by cosine similarity)
Explain when to prefer a distribution divergence loss (KLDiv) over a point-wise similarity loss (cosine, hinge)

nn.KLDivLoss — Kullback-Leibler Divergence

The KL divergence measures how much information is lost when using distribution $Q$ to approximate distribution $P$ :

$D_{\text{KL}}(P \| Q) = \sum_k P_k \log \frac{P_k}{Q_k} = \sum_k P_k \log P_k - \sum_k P_k \log Q_k$

In PyTorch, $P$ is the target (the "true" distribution) and $Q$ is the model's predicted distribution. The input must be log-probabilities:

$\ell_k = P_k \bigl(\log P_k - \text{input}_k\bigr)$

where input = $\log Q$ .

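import torch
import torch.nn as nn
import torch.nn.functional as F

kl = nn.KLDivLoss(reduction='batchmean')
log_q = F.log_softmax(torch.randn(4, 10), dim=1)  # input: log-probs from model
p     = F.softmax(torch.randn(4, 10),     dim=1)  # target: probabilities, not logs
output = kl(log_q, p)                             # scalar: summed KL / batch size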
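The formula above can be checked by hand. The sketch below uses two small illustrative distributions (the values and the helper name `kl_by_hand` are assumptions, not part of the reading) to show that swapping $P$ and $Q$ changes the result, and that the divergence becomes infinite only when $Q$ is zero where $P$ is not.

```python
import torch
import torch.nn as nn

def kl_by_hand(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """D_KL(P || Q) = sum_k p_k * (log p_k - log q_k), treating 0 * log 0 as 0."""
    terms = torch.where(p > 0, p * (p.log() - q.log()), torch.zeros_like(p))
    return terms.sum()

# Illustrative distributions (hypothetical values).
p = torch.tensor([0.5, 0.4, 0.1])
q = torch.tensor([0.8, 0.1, 0.1])

kl_pq = kl_by_hand(p, q)  # D_KL(P || Q)
kl_qp = kl_by_hand(q, p)  # D_KL(Q || P): generally a different number

# nn.KLDivLoss takes log Q as the input and P as the target.
kl_module = nn.KLDivLoss(reduction='sum')(q.log(), p)

# Q assigns zero probability to an outcome that P supports: infinite divergence.
q_with_zero = torch.tensor([0.9, 0.0, 0.1])
kl_infinite = kl_by_hand(p, q_with_zero)

# P assigns zero probability to an outcome: that term simply drops out.
p_with_zero = torch.tensor([0.9, 0.0, 0.1])
kl_finite = kl_by_hand(p_with_zero, q)

print(f"KL(P||Q) = {kl_pq.item():.4f}, KL(Q||P) = {kl_qp.item():.4f}")
print(f"Q has a zero: {kl_infinite.item()}, P has a zero: {kl_finite.item():.4f}")
```

In practice this is one reason the model side is usually produced with `log_softmax`, which never yields an exact zero probability.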

Reduction note: Use reduction='batchmean' (sums over all elements, then divides by the batch size $N$ ) to match the mathematical definition of KL divergence. The default, reduction='mean', divides by the total number of elements $N \times C$ instead, so it returns the batch-averaged KL divergence scaled by $1/C$ .
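The relationship between the reductions can be confirmed numerically. This is a small sketch with random inputs; the shapes $N = 4$ and $C = 10$ are chosen only for illustration, and the warning filter is there because some PyTorch versions warn when 'mean' is used with this loss.

```python
import warnings

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 10  # batch size and number of classes (illustrative)
log_q = F.log_softmax(torch.randn(N, C), dim=1)
p = F.softmax(torch.randn(N, C), dim=1)

kl_sum = nn.KLDivLoss(reduction='sum')(log_q, p)
kl_batchmean = nn.KLDivLoss(reduction='batchmean')(log_q, p)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # 'mean' may trigger a reduction warning
    kl_mean = nn.KLDivLoss(reduction='mean')(log_q, p)

# batchmean = sum / N, mean = sum / (N * C), so mean * C recovers batchmean.
print(kl_batchmean.item(), (kl_sum / N).item(), (kl_mean * C).item())
```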

When to use: Knowledge distillation (student learns from teacher's soft probabilities); variational autoencoders (regularise latent distribution toward prior); temperature scaling for calibration.
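For the knowledge distillation case, a minimal sketch might look like the following. The function name `distillation_loss`, the temperature value, and the $T^2$ rescaling (a common convention that keeps gradient magnitudes comparable across temperatures) are assumptions for illustration, not something this reading prescribes.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) on temperature-softened distributions (hypothetical helper)."""
    log_q = F.log_softmax(student_logits / temperature, dim=1)  # student: log-probs
    p = F.softmax(teacher_logits / temperature, dim=1)          # teacher: probs
    # Assumed convention: multiply by T^2 to offset the 1/T^2 gradient scaling.
    return F.kl_div(log_q, p, reduction='batchmean') * temperature ** 2

torch.manual_seed(0)
teacher_logits = torch.randn(8, 10)                      # stand-in for teacher outputs
student_logits = torch.randn(8, 10, requires_grad=True)  # stand-in for student outputs

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow to the student only

# A student that matches the teacher exactly incurs (numerically) zero loss.
zero_loss = distillation_loss(teacher_logits, teacher_logits)
print(loss.item(), zero_loss.item())
```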

nn.MarginRankingLoss — Pairwise Ranking

Given two scores $x_1, x_2$ and a label $y \in \{-1, +1\}$ , enforce that the better-ranked input exceeds the other by a margin $m$ :

$\ell = \max\bigl(0,\; -y \cdot (x_1 - x_2) + m\bigr)$

If $y = +1$ : the loss is zero when $x_1 \ge x_2 + m$ (" $x_1$ should rank higher and does, by at least the margin").
If $y = -1$ : the loss is zero when $x_2 \ge x_1 + m$ (" $x_2$ should rank higher and does, by at least the margin").

The loss is zero when the margin constraint is satisfied; otherwise it penalises the violation linearly.

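import torch
import torch.nn as nn

loss = nn.MarginRankingLoss(margin=1.0)
x1 = torch.randn(5)               # scores for the first item of each pair
x2 = torch.randn(5)               # scores for the second item of each pair
y  = torch.sign(torch.randn(5))   # random ±1 labels
output = loss(x1, x2, y)          # mean over the 5 pairs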
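To verify the direction of the margin constraint, it helps to use fixed scores instead of random ones. The values below are illustrative: the first and third pairs satisfy the constraint, the second is correctly ordered but inside the margin, and the fourth is ordered the wrong way round.

```python
import torch
import torch.nn as nn

x1 = torch.tensor([2.0, 0.5, 0.0, 0.0])
x2 = torch.tensor([0.0, 0.0, 2.0, 2.0])
y = torch.tensor([1.0, 1.0, -1.0, 1.0])  # +1: x1 should win, -1: x2 should win
margin = 1.0

# reduction='none' keeps one loss value per pair so each case can be inspected.
per_pair = nn.MarginRankingLoss(margin=margin, reduction='none')(x1, x2, y)

# The same quantity written out from the formula max(0, -y * (x1 - x2) + m).
by_hand = torch.clamp(-y * (x1 - x2) + margin, min=0)

print(per_pair)  # tensor([0.0000, 0.5000, 0.0000, 3.0000])
```

The last pair shows the linear penalty: $x_1$ trails $x_2$ by 2 when it should lead by 1, so the violation is 3.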

When to use: Learning-to-rank (search, recommendation); information retrieval; any task where relative ordering matters more than absolute scores.

nn.HingeEmbeddingLoss — Similar vs. Dissimilar Inputs

Given a distance $x$ between two inputs (one non-negative scalar per pair, typically $\|\mathbf{u} - \mathbf{v}\|$ ) and a label $y \in \{+1, -1\}$ :

$\ell = \begin{cases} x & \text{if } y = +1 \\ \max(0, m - x) & \text{if } y = -1 \end{cases}$

When $y = +1$ (similar pair): minimising $\ell = x$ pulls embeddings together.
When $y = -1$ (dissimilar pair): the loss is zero if the distance $x$ is already at least the margin $m$ ; otherwise minimising $m - x$ pushes the embeddings apart.

import torch
import torch.nn as nn

loss = nn.HingeEmbeddingLoss(margin=1.0)
x = torch.tensor([0.3, 1.5, 0.8])   # distances
y = torch.tensor([1., -1., 1.])     # +1=similar, -1=dissimilar
output = loss(x, y)                 # mean over the 3 pairs
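The same three pairs can be worked through by hand, and the distances themselves can come from embeddings. In this sketch the embedding shapes and the use of `F.pairwise_distance` to produce $x$ are assumptions for illustration; the loss itself only ever sees the distance values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([0.3, 1.5, 0.8])  # distances from the example above
y = torch.tensor([1., -1., 1.])    # +1=similar, -1=dissimilar
margin = 1.0

per_pair = nn.HingeEmbeddingLoss(margin=margin, reduction='none')(x, y)

# Piecewise definition: x for similar pairs, max(0, m - x) for dissimilar pairs.
by_hand = torch.where(y == 1, x, torch.clamp(margin - x, min=0))
print(per_pair)  # tensor([0.3000, 0.0000, 0.8000])

# Producing x from two batches of embeddings (hypothetical 16-dim vectors).
torch.manual_seed(0)
u = torch.randn(3, 16)
v = torch.randn(3, 16)
distances = F.pairwise_distance(u, v)  # Euclidean distance per row, shape (3,)
loss_from_embeddings = nn.HingeEmbeddingLoss(margin=margin)(distances, y)
```

The dissimilar pair at distance 1.5 contributes nothing because it is already farther apart than the margin of 1.0.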

When to use: Siamese networks for verification (face verification, signature verification); one-shot learning.

nn.CosineEmbeddingLoss — Directional Similarity

Measures similarity between two embedding vectors using cosine similarity $\cos(\mathbf{x}_1, \mathbf{x}_2) = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{\|\mathbf{x}_1\| \|\mathbf{x}_2\|}$ :

$\ell = \begin{cases} 1 - \cos(\mathbf{x}_1, \mathbf{x}_2) & \text{if } y = +1 \\ \max\bigl(0, \cos(\mathbf{x}_1, \mathbf{x}_2) - m\bigr) & \text{if } y = -1 \end{cases}$

When $y = +1$ : loss is $1 - \cos(\cdot) \in [0, 2]$ . Minimising it makes the two vectors point in the same direction.
When $y = -1$ : loss penalises the cosine similarity exceeding a margin $m$ (default 0). Dissimilar pairs should have $\cos \le m$ .

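import torch
import torch.nn as nn

loss = nn.CosineEmbeddingLoss(margin=0.0)
u = torch.randn(4, 128)          # first embedding of each pair
v = torch.randn(4, 128)          # second embedding of each pair
y = torch.sign(torch.randn(4))   # ±1 labels
output = loss(u, v, y)           # mean over the 4 pairs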
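The piecewise definition can be reproduced directly from `F.cosine_similarity`. This is a sketch with random embeddings and fixed labels chosen for illustration; the two results should agree up to floating-point tolerance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
u = torch.randn(4, 128)
v = torch.randn(4, 128)
y = torch.tensor([1., -1., 1., -1.])  # fixed labels so both branches are exercised
margin = 0.0

cos = F.cosine_similarity(u, v, dim=1)  # one similarity per pair, in [-1, 1]

# y = +1: 1 - cos;  y = -1: max(0, cos - m).
by_hand = torch.where(y == 1, 1 - cos, torch.clamp(cos - margin, min=0))

per_pair = nn.CosineEmbeddingLoss(margin=margin, reduction='none')(u, v, y)
print(by_hand)
print(per_pair)
```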

Why cosine vs. Euclidean? Cosine similarity is scale-invariant: it depends only on the angle between vectors, not their magnitude. This is natural for text embeddings (TF-IDF, sentence embeddings), where vector norms often reflect artifacts such as term frequency or document length rather than semantic distance.
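Scale invariance is easy to see numerically. In the sketch below (random vectors and a scale factor of 10, both chosen only for illustration), rescaling one vector leaves the cosine similarity unchanged while the Euclidean distance grows substantially.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
u = torch.randn(128)
v = torch.randn(128)
v_scaled = 10.0 * v  # same direction as v, ten times the norm

cos_before = F.cosine_similarity(u, v, dim=0)
cos_after = F.cosine_similarity(u, v_scaled, dim=0)

dist_before = torch.dist(u, v)        # Euclidean (p=2) distance
dist_after = torch.dist(u, v_scaled)

print(f"cosine:    {cos_before.item():.4f} -> {cos_after.item():.4f}")
print(f"Euclidean: {dist_before.item():.2f} -> {dist_after.item():.2f}")
```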

When to use: Sentence embedding learning; semantic similarity; contrastive learning with unit-normalised representations.

