Supplement · Loss Functions

Distribution & Similarity Losses

By the end of this reading you will be able to:
  • Compute KL divergence between two distributions and explain its asymmetry and why it diverges when the predicted distribution assigns zero probability to an outcome the target distribution supports
  • Apply MarginRankingLoss to a pair of scores and verify the direction of the margin constraint
  • Distinguish HingeEmbeddingLoss (takes a precomputed distance plus a {-1, +1} similar/dissimilar label) from CosineEmbeddingLoss (takes two vectors plus the same kind of label and scores them by cosine similarity)
  • Explain when to prefer a distribution divergence loss (KLDiv) over a point-wise similarity loss (cosine, hinge)

nn.KLDivLoss — Kullback-Leibler Divergence

The KL divergence measures how much information is lost when using distribution $Q$ to approximate distribution $P$:

$$D_{\text{KL}}(P \| Q) = \sum_k P_k \log \frac{P_k}{Q_k} = \sum_k P_k \log P_k - \sum_k P_k \log Q_k$$
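
To make the asymmetry concrete, the sketch below evaluates the sum directly for two small hand-picked distributions. The numbers and the helper name `kl_divergence` are illustrative assumptions, not part of the reading. It also shows that the divergence becomes infinite when $Q$ assigns zero probability to an outcome that $P$ supports, whereas a zero in $P$ simply drops that term.

```python
import torch

def kl_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k), with the convention 0 * log 0 = 0."""
    support = p > 0  # terms with p_k = 0 contribute nothing to the sum
    return torch.sum(p[support] * (torch.log(p[support]) - torch.log(q[support])))

p = torch.tensor([0.5, 0.4, 0.1])
q = torch.tensor([0.8, 0.1, 0.1])

kl_pq = kl_divergence(p, q)  # D_KL(P || Q)
kl_qp = kl_divergence(q, p)  # D_KL(Q || P), generally a different number

# Q puts zero mass on an outcome that P supports: log(0) = -inf, so the sum is +inf.
q_zero = torch.tensor([0.9, 0.0, 0.1])
kl_inf = kl_divergence(p, q_zero)

# P puts zero mass on an outcome: that term is skipped and the result stays finite.
p_zero = torch.tensor([0.9, 0.0, 0.1])
kl_finite = kl_divergence(p_zero, q)

print(kl_pq.item(), kl_qp.item(), kl_inf.item(), kl_finite.item())
```

Swapping the arguments changes the value because each direction weights the log-ratio by a different distribution.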

In PyTorch, $P$ is the target (the "true" distribution) and $Q$ is the model's predicted distribution. The input must be log-probabilities:

$$\ell_k = P_k \bigl(\log P_k - \text{input}_k\bigr)$$

where $\text{input} = \log Q$.

import torch
import torch.nn as nn
import torch.nn.functional as F

kl = nn.KLDivLoss(reduction='batchmean')
log_q = F.log_softmax(torch.randn(4, 10), dim=1)  # log-probs from model
p     = F.softmax(torch.randn(4, 10),     dim=1)  # target distribution
output = kl(log_q, p)
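
As a sanity check, the pointwise formula can be evaluated by hand and compared against the built-in loss. This is a minimal sketch: the seed and tensor shapes are arbitrary choices, and the names `pointwise`, `builtin` and `manual` are introduced here only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed so the comparison is repeatable

log_q = F.log_softmax(torch.randn(4, 10), dim=1)  # model log-probabilities (the input)
p = F.softmax(torch.randn(4, 10), dim=1)          # target probabilities

builtin = nn.KLDivLoss(reduction='batchmean')(log_q, p)

# Pointwise terms P_k * (log P_k - input_k), summed over classes, then averaged over the batch.
pointwise = p * (torch.log(p) - log_q)
manual = pointwise.sum(dim=1).mean()

print(builtin.item(), manual.item())
```

The two values should agree up to floating-point error, which confirms that the first argument is treated as $\log Q$ and the second as $P$.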

Reduction note: Use reduction='batchmean' (divides by batch size $N$) for proper KL divergence. The legacy default 'mean' divides by the total number of elements $N \times C$, which gives a $1/C$ scaled version.
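
The scaling difference between the two reductions can be checked numerically. The sketch below assumes a batch of $N = 4$ rows and $C = 10$ classes, both arbitrary. PyTorch typically emits a warning when 'mean' is used with this loss, which is expected here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 10  # batch size and number of classes (illustrative values)

log_q = F.log_softmax(torch.randn(N, C), dim=1)
p = F.softmax(torch.randn(N, C), dim=1)

loss_sum = nn.KLDivLoss(reduction='sum')(log_q, p)              # total over all elements
loss_batchmean = nn.KLDivLoss(reduction='batchmean')(log_q, p)  # total / N
loss_mean = nn.KLDivLoss(reduction='mean')(log_q, p)            # total / (N * C)

# 'batchmean' should be C times larger than 'mean'.
ratio = (loss_batchmean / loss_mean).item()
print(ratio)
```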

When to use: Knowledge distillation (student learns from teacher's soft probabilities); variational autoencoders (regularise latent distribution toward prior); temperature scaling for calibration.
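
For the distillation case, one common way to wire up the loss looks roughly like the sketch below. The temperature `T`, the tensor shapes and the random logits standing in for real teacher and student models are all assumptions made for illustration. The `T**2` scaling is a widely used convention rather than something this reading prescribes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

T = 2.0  # softmax temperature (hypothetical value)
teacher_logits = torch.randn(4, 10)                      # stand-in for a frozen teacher
student_logits = torch.randn(4, 10, requires_grad=True)  # stand-in for student outputs

# The teacher supplies the target distribution P; the student supplies log Q.
p_teacher = F.softmax(teacher_logits / T, dim=1)
log_q_student = F.log_softmax(student_logits / T, dim=1)

# Multiplying by T**2 keeps gradient magnitudes comparable across temperatures.
distill_loss = nn.KLDivLoss(reduction='batchmean')(log_q_student, p_teacher) * T**2
distill_loss.backward()

print(distill_loss.item(), student_logits.grad.shape)
```

Only the student logits receive gradients, since the teacher's probabilities act as a fixed target.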


nn.MarginRankingLoss — Pairwise Ranking

Given two scores $x_1, x_2$ and a label $y \in \{-1, +1\}$, enforce that the better-ranked input exceeds the other by a margin $m$:

$$\ell = \max\bigl(0,\; -y \cdot (x_1 - x_2) + m\bigr)$$

  • If $y = +1$: the loss is zero when $x_1 > x_2 + m$ ("$x_1$ should rank higher and does").
  • If $y = -1$: the loss is zero when $x_2 > x_1 + m$ ("$x_2$ should rank higher and does").

The loss is zero when the margin constraint is satisfied; otherwise it penalises the violation linearly.

import torch
import torch.nn as nn

loss = nn.MarginRankingLoss(margin=1.0)
x1 = torch.randn(5)
x2 = torch.randn(5)
y  = torch.sign(torch.randn(5))   # random ±1 labels
output = loss(x1, x2, y)
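
With random inputs it is hard to see which direction the margin pushes, so the sketch below uses three hand-picked pairs and `reduction='none'` to expose the per-pair losses. The score values are illustrative assumptions.

```python
import torch
import torch.nn as nn

margin = 1.0
loss_fn = nn.MarginRankingLoss(margin=margin, reduction='none')

x1 = torch.tensor([3.0, 1.5, 0.0])
x2 = torch.tensor([1.0, 1.0, 2.0])
y = torch.tensor([1.0, 1.0, -1.0])

per_pair = loss_fn(x1, x2, y)

# Pair 0: y=+1 and x1 beats x2 by 2.0 >= margin, so the loss is 0.
# Pair 1: y=+1 but x1 beats x2 by only 0.5, so the loss is 1.0 - 0.5 = 0.5.
# Pair 2: y=-1 and x2 beats x1 by 2.0 >= margin, so the loss is 0.
manual = torch.clamp(-y * (x1 - x2) + margin, min=0.0)

print(per_pair, manual)
```

The second pair is ranked correctly yet still penalised, because the gap is smaller than the margin.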

When to use: Learning-to-rank (search, recommendation); information retrieval; any task where relative ordering matters more than absolute scores.


nn.HingeEmbeddingLoss — Similar vs. Dissimilar Inputs

Given a distance value $x$ (a scalar, typically $\|\mathbf{u} - \mathbf{v}\|$) and a label $y \in \{+1, -1\}$:

$$\ell = \begin{cases} x & \text{if } y = +1 \\ \max(0, m - x) & \text{if } y = -1 \end{cases}$$

  • When $y = +1$ (similar pair): minimising $\ell = x$ pulls embeddings together.
  • When $y = -1$ (dissimilar pair): the loss is zero if the distance $x$ already exceeds the margin $m$; otherwise it pushes the embeddings apart.

import torch
import torch.nn as nn

loss = nn.HingeEmbeddingLoss(margin=1.0)
x = torch.tensor([0.3, 1.5, 0.8])   # distances
y = torch.tensor([1., -1., 1.])     # +1=similar, -1=dissimilar
output = loss(x, y)
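
The same three distances can be pushed through the piecewise definition by hand. The sketch below also shows one plausible way to obtain the distances from a pair of embedding batches using `F.pairwise_distance`. The embedding size of 16 and the random embeddings are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

margin = 1.0
x = torch.tensor([0.3, 1.5, 0.8])   # precomputed distances
y = torch.tensor([1.0, -1.0, 1.0])  # +1 = similar, -1 = dissimilar

builtin = nn.HingeEmbeddingLoss(margin=margin)(x, y)  # default reduction is 'mean'

# Similar pairs pay their distance; dissimilar pairs pay max(0, margin - distance).
per_pair = torch.where(y == 1, x, torch.clamp(margin - x, min=0.0))
manual = per_pair.mean()

# In practice x usually comes from embeddings, e.g. the Euclidean distance per row.
torch.manual_seed(0)
u = torch.randn(3, 16)
v = torch.randn(3, 16)
dist = F.pairwise_distance(u, v)  # shape (3,)
loss_from_embeddings = nn.HingeEmbeddingLoss(margin=margin)(dist, y)

print(per_pair, builtin.item(), manual.item(), loss_from_embeddings.item())
```

The dissimilar pair at distance 1.5 already lies beyond the margin, so it contributes nothing to the mean.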

When to use: Siamese networks for verification (face verification, signature verification); one-shot learning.


nn.CosineEmbeddingLoss — Directional Similarity

Measures similarity between two embedding vectors using cosine similarity $\cos(\mathbf{x}_1, \mathbf{x}_2) = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{\|\mathbf{x}_1\| \|\mathbf{x}_2\|}$:

$$\ell = \begin{cases} 1 - \cos(\mathbf{x}_1, \mathbf{x}_2) & \text{if } y = +1 \\ \max\bigl(0, \cos(\mathbf{x}_1, \mathbf{x}_2) - m\bigr) & \text{if } y = -1 \end{cases}$$

  • When $y = +1$: loss is $1 - \cos(\cdot) \in [0, 2]$. Minimising it makes the two vectors point in the same direction.
  • When $y = -1$: loss penalises the cosine similarity exceeding a margin $m$ (default 0). Dissimilar pairs should have $\cos \le m$.

import torch
import torch.nn as nn

loss = nn.CosineEmbeddingLoss(margin=0.0)
u = torch.randn(4, 128)
v = torch.randn(4, 128)
y = torch.sign(torch.randn(4))   # ±1 labels
output = loss(u, v, y)
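
The piecewise definition can be reproduced with `F.cosine_similarity` and compared element by element against the built-in loss. This is a sketch under assumed inputs: the seed, the fixed label pattern and the variable names are chosen here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
margin = 0.0
u = torch.randn(4, 128)
v = torch.randn(4, 128)
y = torch.tensor([1.0, -1.0, 1.0, -1.0])  # fixed labels so both branches are exercised

builtin = nn.CosineEmbeddingLoss(margin=margin, reduction='none')(u, v, y)

# y = +1: 1 - cos; y = -1: max(0, cos - margin)
cos = F.cosine_similarity(u, v, dim=1)
manual = torch.where(y == 1, 1.0 - cos, torch.clamp(cos - margin, min=0.0))

print(builtin, manual)
```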

Why cosine vs. Euclidean? Cosine similarity is scale-invariant — it depends only on the angle between vectors, not their magnitude. This is natural for text embeddings (TF-IDF, sentence embeddings) where vector norms encode term frequency artifacts, not semantic distance.
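
Scale invariance is easy to confirm numerically: rescaling one vector leaves the cosine similarity unchanged but moves the Euclidean distance substantially. The vectors and the scale factor of 10 below are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
u = torch.randn(128)
v = torch.randn(128)
u_scaled = 10.0 * u  # same direction as u, ten times the norm

cos_original = F.cosine_similarity(u, v, dim=0)
cos_scaled = F.cosine_similarity(u_scaled, v, dim=0)

dist_original = torch.dist(u, v)       # Euclidean distance
dist_scaled = torch.dist(u_scaled, v)  # changes with the norm of u

print(cos_original.item(), cos_scaled.item())
print(dist_original.item(), dist_scaled.item())
```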

When to use: Sentence embedding learning; semantic similarity; contrastive learning with unit-normalised representations.