Distribution & Similarity Losses
- Compute KL divergence between two distributions and explain its asymmetry and why it diverges when the predicted distribution assigns zero probability to an outcome the target distribution allows
- Apply MarginRankingLoss to a pair of scores and verify the direction of the margin constraint
- Distinguish HingeEmbeddingLoss (similar/dissimilar with {-1, +1} labels) from CosineEmbeddingLoss (cosine distance minimisation)
- Explain when to prefer a distribution divergence loss (KLDiv) over a point-wise similarity loss (cosine, hinge)
nn.KLDivLoss — Kullback-Leibler Divergence
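The KL divergence measures how much information is lost when using distribution $Q$ to approximate distribution $P$:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
In PyTorch, $P$ is the target (the "true" distribution) and $Q$ is the model's predicted distribution. The input must be log-probabilities:
$$\ell(\text{input}, \text{target}) = \sum \text{target} \cdot \left(\log \text{target} - \text{input}\right)$$
where input = $\log Q$.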
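import torch
import torch.nn as nn
import torch.nn.functional as F
kl = nn.KLDivLoss(reduction='batchmean')
log_q = F.log_softmax(torch.randn(4, 10), dim=1) # log-probs from model (input)
p = F.softmax(torch.randn(4, 10), dim=1) # target distribution (probabilities, not logs)
output = kl(log_q, p) # argument order: (input, target)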
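The asymmetry of the KL divergence, and the case where it becomes infinite, can be checked numerically. The sketch below is illustrative only: `kl_divergence` is a hypothetical helper written for this example (it is not a PyTorch function), and the probability vectors are made-up values. It uses the usual convention $0 \cdot \log 0 = 0$, so outcomes the target rules out contribute nothing, while an outcome the target allows but the approximation assigns zero probability makes the sum infinite.

```python
import torch

def kl_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """D_KL(P || Q) for 1-D probability vectors (hypothetical helper)."""
    # torch.xlogy(a, b) computes a * log(b) and returns 0 wherever a == 0,
    # which implements the convention 0 * log 0 = 0.
    return (torch.xlogy(p, p) - torch.xlogy(p, q)).sum()

# Two distributions with full support: both directions are finite but differ.
p = torch.tensor([0.7, 0.2, 0.1])
q = torch.tensor([0.4, 0.4, 0.2])
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(f"D(P||Q) = {kl_pq.item():.4f}, D(Q||P) = {kl_qp.item():.4f}")

# q_zero assigns zero probability to the third outcome, which p allows.
q_zero = torch.tensor([0.5, 0.5, 0.0])
kl_p_qzero = kl_divergence(p, q_zero)   # infinite: p[2] > 0 but q_zero[2] == 0
kl_qzero_p = kl_divergence(q_zero, p)   # finite: the zero sits on the first argument
print(kl_p_qzero.item(), kl_qzero_p.item())
```

With `F.log_softmax` as the source of the input, the log-probabilities stay finite for finite logits, so the infinite case mostly arises when distributions are built by hand or clipped.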
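Reduction note: Use reduction='batchmean' (divides the summed loss by the batch size $N$) for the proper KL divergence. The default 'mean' divides by the total number of elements instead, which is $N \times C$ for an input of shape $(N, C)$, so it returns the 'batchmean' value scaled down by a factor of $C$.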
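The difference between the two reductions can be confirmed against a hand-computed sum. This is a minimal sketch; the shape `(N, C) = (4, 10)` and the random seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 10
log_q = F.log_softmax(torch.randn(N, C), dim=1)  # model log-probabilities
p = F.softmax(torch.randn(N, C), dim=1)          # target probabilities

# Sum of target * (log target - input) over every element of the batch.
manual_sum = (p * (p.log() - log_q)).sum()

batchmean = nn.KLDivLoss(reduction='batchmean')(log_q, p)
# reduction='mean' may emit a warning recommending 'batchmean'.
elementwise_mean = nn.KLDivLoss(reduction='mean')(log_q, p)

print(batchmean.item(), (manual_sum / N).item())               # should agree
print(elementwise_mean.item(), (manual_sum / (N * C)).item())  # should agree
```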
When to use: Knowledge distillation (student learns from teacher's soft probabilities); variational autoencoders (regularise latent distribution toward prior); temperature scaling for calibration.
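As an illustration of the knowledge-distillation use case, the following is a rough sketch rather than a reference implementation. The function name `distillation_loss`, the temperature value, and the random logits are all assumptions made for this example; multiplying by `temperature ** 2` is a common convention for keeping gradient magnitudes comparable across temperatures, not something the text above prescribes.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions (sketch)."""
    # Input to kl_div must be log-probabilities; the target is plain probabilities.
    log_q = F.log_softmax(student_logits / temperature, dim=1)
    p = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_q, p, reduction='batchmean') * temperature ** 2

torch.manual_seed(0)
teacher_logits = torch.randn(8, 10)                      # stand-in for a frozen teacher
student_logits = torch.randn(8, 10, requires_grad=True)  # stand-in for student outputs

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student logits
print(loss.item(), student_logits.grad.shape)
```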
nn.MarginRankingLoss — Pairwise Ranking
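Given two scores $x_1$ and $x_2$ and a label $y \in \{-1, +1\}$, enforce that the better-ranked input exceeds the other by a margin $m$:
$$\ell(x_1, x_2, y) = \max\bigl(0,\; -y\,(x_1 - x_2) + m\bigr)$$
- If $y = +1$: the loss is zero when $x_1 - x_2 \ge m$ ("$x_1$ should rank higher and does").
- If $y = -1$: the loss is zero when $x_2 - x_1 \ge m$ ("$x_2$ should rank higher and does").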
The loss is zero when the margin constraint is satisfied; otherwise it penalises the violation linearly.
loss = nn.MarginRankingLoss(margin=1.0)
x1 = torch.randn(5)
x2 = torch.randn(5)
y = torch.sign(torch.randn(5)) # random ±1 labels
output = loss(x1, x2, y)
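To verify the direction of the margin constraint, it helps to use fixed scores instead of random ones and to inspect the per-pair losses. The values below are chosen for illustration; `reduction='none'` keeps one loss per pair so each case can be read off directly.

```python
import torch
import torch.nn as nn

margin = 1.0
loss_fn = nn.MarginRankingLoss(margin=margin, reduction='none')

x1 = torch.tensor([2.0, 2.0, 0.8])
x2 = torch.tensor([0.5, 0.5, 0.5])
y = torch.tensor([1.0, -1.0, 1.0])
# pair 0: y=+1 and x1 beats x2 by 1.5 >= margin      -> no loss
# pair 1: y=-1 but x1 is higher (wrong order)        -> loss 1.5 + 1.0 = 2.5
# pair 2: y=+1, right order but gap 0.3 < margin     -> loss 1.0 - 0.3 = 0.7

per_pair = loss_fn(x1, x2, y)
manual = torch.clamp(-y * (x1 - x2) + margin, min=0)
print(per_pair)  # approximately [0.0, 2.5, 0.7]
```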
When to use: Learning-to-rank (search, recommendation); information retrieval; any task where relative ordering matters more than absolute scores.
nn.HingeEmbeddingLoss — Similar vs. Dissimilar Inputs
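Given a distance value $x$ (a scalar, typically a pairwise distance such as $x = \lVert a - b \rVert$ between two embeddings $a$ and $b$) and a label $y \in \{-1, +1\}$:
$$\ell(x, y) = \begin{cases} x & \text{if } y = +1 \\ \max(0,\; m - x) & \text{if } y = -1 \end{cases}$$
- When $y = +1$ (similar pair): the loss is the distance $x$ itself, so minimising it pulls the embeddings together.
- When $y = -1$ (dissimilar pair): the loss is zero if the distance already exceeds the margin $m$; otherwise it pushes the embeddings apart.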
loss = nn.HingeEmbeddingLoss(margin=1.0)
x = torch.tensor([0.3, 1.5, 0.8]) # distances
y = torch.tensor([1., -1., 1.]) # +1=similar, -1=dissimilar
output = loss(x, y)
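The two branches of the loss can be checked by hand. This sketch reuses the three distances from the snippet above and adds one extra dissimilar pair (distance 0.4, an illustrative value) that falls inside the margin, so that the penalising branch is also exercised.

```python
import torch
import torch.nn as nn

margin = 1.0
loss_fn = nn.HingeEmbeddingLoss(margin=margin, reduction='none')

x = torch.tensor([0.3, 1.5, 0.8, 0.4])    # distances
y = torch.tensor([1., -1., 1., -1.])      # +1 = similar, -1 = dissimilar

per_pair = loss_fn(x, y)
# Similar pairs pay their distance; dissimilar pairs pay max(0, margin - distance).
manual = torch.where(y == 1, x, torch.clamp(margin - x, min=0))
print(per_pair)         # approximately [0.3, 0.0, 0.8, 0.6]
print(per_pair.mean())  # what the default reduction='mean' would return
```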
When to use: Siamese networks for verification (face verification, signature verification); one-shot learning.
nn.CosineEmbeddingLoss — Directional Similarity
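Measures similarity between two embedding vectors $u$ and $v$ using cosine similarity $\cos(u, v) = \dfrac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}$:
$$\ell(u, v, y) = \begin{cases} 1 - \cos(u, v) & \text{if } y = +1 \\ \max\bigl(0,\; \cos(u, v) - m\bigr) & \text{if } y = -1 \end{cases}$$
- When $y = +1$: loss is $1 - \cos(u, v)$. Minimising it makes the two vectors point in the same direction.
- When $y = -1$: loss penalises the cosine similarity exceeding a margin $m$ (default 0). Dissimilar pairs should have $\cos(u, v) \le m$.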
loss = nn.CosineEmbeddingLoss(margin=0.0)
u = torch.randn(4, 128)
v = torch.randn(4, 128)
y = torch.sign(torch.randn(4)) # ±1 labels
output = loss(u, v, y)
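With random 128-dimensional vectors the individual cases are hard to see, so the sketch below uses small hand-picked 2-D vectors (illustrative values only) and compares the module's per-pair output with a manual computation built on `F.cosine_similarity`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

margin = 0.0
loss_fn = nn.CosineEmbeddingLoss(margin=margin, reduction='none')

u = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
v = torch.tensor([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])
# pair 0: similar, same direction (cos = 1)          -> loss 0
# pair 1: similar, orthogonal (cos = 0)              -> loss 1
# pair 2: dissimilar, 45 degrees apart (cos ~ 0.707) -> loss ~ 0.707
# pair 3: dissimilar, opposite direction (cos = -1)  -> loss 0

per_pair = loss_fn(u, v, y)
cos = F.cosine_similarity(u, v, dim=1)
manual = torch.where(y == 1, 1 - cos, torch.clamp(cos - margin, min=0))
print(per_pair)
```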
Why cosine vs. Euclidean? Cosine similarity is scale-invariant: it depends only on the angle between vectors, not on their magnitude. This is natural for text embeddings (TF-IDF, sentence embeddings), where vector norms tend to reflect term-frequency artifacts rather than semantic distance.
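The scale-invariance claim is easy to check numerically. In this small sketch the vectors and the scale factor of 10 are arbitrary illustrative values: rescaling one vector leaves the cosine similarity unchanged but changes the Euclidean distance substantially.

```python
import torch
import torch.nn.functional as F

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 1.0, 0.5])
scaled_v = 10.0 * v  # same direction as v, ten times the norm

cos_before = F.cosine_similarity(u, v, dim=0)
cos_after = F.cosine_similarity(u, scaled_v, dim=0)

dist_before = torch.dist(u, v)        # Euclidean (p=2) distance
dist_after = torch.dist(u, scaled_v)

print(cos_before.item(), cos_after.item())    # equal up to rounding
print(dist_before.item(), dist_after.item())  # distance grows with the scale
```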
When to use: Sentence embedding learning; semantic similarity; contrastive learning with unit-normalised representations.