KL Divergence and Cross-Entropy
- Define entropy H(P) as the expected surprise under P, and explain why it is maximized by the uniform distribution and minimized by a point mass
- Derive the relationship H(P,Q) = H(P) + D_KL(P‖Q) and state what each term represents in terms of unavoidable coding cost vs. the penalty for using the wrong distribution
- Explain the asymmetry of KL divergence and distinguish the behavior of forward KL (D_KL(P‖Q), mean-seeking) from reverse KL (D_KL(Q‖P), mode-seeking) when approximating a multimodal distribution
- Identify cross-entropy loss as −log Q minimization, explain how this equals MLE when P is the empirical data distribution, and locate KL divergence in the VAE ELBO and in PPO's policy update constraint
Information and Surprise
How surprised are you when an event occurs? If a very likely event ($p \approx 1$) happens, you are not surprised at all. If a very unlikely event ($p \approx 0$) happens, you are very surprised.
The information content (or surprise) of an event with probability $p$ is:
$$I(p) = -\log p$$
Using natural logarithm gives information in nats; base-2 logarithm gives bits.
Properties:
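- $I(p) \ge 0$ (information is non-negative since $0 < p \le 1$)
- $I(p) = 0$ when $p = 1$ (certain events are not informative)
- $I(p) \to \infty$ as $p \to 0$ (impossible events would be infinitely surprising)
- $I(p_1 p_2) = I(p_1) + I(p_2)$ for independent events (information adds)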
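The properties listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `information` and the example probabilities are illustrative choices, not part of the lesson.

```python
import math

def information(p: float, base: float = math.e) -> float:
    """Surprise -log(p) of an event with probability p (nats by default)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

certain = information(1.0)                     # a certain event carries no information
rare = information(1e-6)                       # a rare event is very surprising
joint = information(0.5 * 0.25)                # independent events: probabilities multiply
parts = information(0.5) + information(0.25)   # so their information adds
one_bit = information(0.5, base=2)             # a fair coin flip, measured in bits
```

Dividing by `math.log(base)` is the only difference between nats and bits, so the choice of logarithm changes the unit but not the ordering of events by surprise.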
Entropy
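Entropy is the expected surprise of a distribution $P$, that is, the average information content of a draw from $P$:
$$H(P) = \mathbb{E}_{x \sim P}[-\log P(x)] = -\sum_x P(x) \log P(x)$$
For continuous distributions with density $p(x)$ (differential entropy):
$$h(P) = -\int p(x) \log p(x)\, dx$$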
Key properties:
- $H(P) \ge 0$ for discrete distributions
- Entropy is maximized by the uniform distribution — maximum uncertainty
- Entropy is minimized (= 0) by a point mass — no uncertainty
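- A fair coin has $H = \log 2 \approx 0.693$ nats = 1 bit
- A biased coin with $P(\text{heads}) = p$ has $H = -p \log p - (1 - p) \log (1 - p)$ nats, which is strictly less than $\log 2$ whenever $p \ne 1/2$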
Intuition for ML: Entropy measures how much information (on average) you learn from a single observation. A peaked distribution is easy to predict; a flat distribution is hard.
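To make the key properties concrete, here is a small sketch that computes discrete entropy directly from its definition. The helper name `entropy` and the specific distributions (including the 0.9/0.1 coin) are assumptions made for illustration.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy -sum p log p; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(base)

fair_nats = entropy([0.5, 0.5])            # log 2 nats
fair_bits = entropy([0.5, 0.5], base=2)    # the same quantity in bits
biased = entropy([0.9, 0.1])               # hypothetical bias; easier to predict
point_mass = entropy([1.0, 0.0])           # no uncertainty at all
uniform4 = entropy([0.25] * 4)             # log 4, the maximum over 4 outcomes
peaked4 = entropy([0.7, 0.1, 0.1, 0.1])    # a peaked distribution over 4 outcomes
```

Comparing `peaked4` with `uniform4` shows the intuition above: the flatter the distribution, the more you learn on average from each observation.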
Cross-Entropy
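Cross-entropy measures the expected surprise when events are drawn from $P$ but you are using model $Q$ to predict them:
$$H(P, Q) = \mathbb{E}_{x \sim P}[-\log Q(x)] = -\sum_x P(x) \log Q(x)$$
- When $Q = P$: $H(P, Q) = H(P)$, the best you can do
- When $Q \ne P$: $H(P, Q) > H(P)$, so you are paying an extra cost for using the wrong model
Cross-entropy loss in classification. If $P$ is the one-hot ground truth (only the true class $c$ has $P(c) = 1$) and $Q$ is the model's softmax output, then:
$$H(P, Q) = -\sum_k P(k) \log Q(k) = -\log Q(c)$$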
Minimizing cross-entropy loss over a dataset is equivalent to maximizing the log-likelihood of the training data under the model — this is maximum likelihood estimation (MLE).
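The equivalence between cross-entropy loss and maximum likelihood can be seen in a few lines. This is a sketch under stated assumptions: the logits and labels are made-up values for a hypothetical 3-class model, and `softmax` and `cross_entropy` are hand-written helpers rather than any library's API.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model outputs (logits) and integer class labels.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]]
labels = [0, 2, 1]

ce_terms, log_likelihood = [], 0.0
for logits, c in zip(batch_logits, labels):
    q = softmax(logits)
    one_hot = [1.0 if k == c else 0.0 for k in range(len(q))]
    ce_terms.append(cross_entropy(one_hot, q))   # collapses to -log q[c]
    log_likelihood += math.log(q[c])             # log-probability of the observed label

mean_ce = sum(ce_terms) / len(ce_terms)          # equals -log_likelihood / N
```

Because `mean_ce` is just the negated, rescaled log-likelihood, any parameter change that lowers one raises the other.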
KL Divergence
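Kullback-Leibler (KL) divergence measures how much extra surprise you experience by using $Q$ instead of $P$:
$$D_{\mathrm{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}[\log P(x) - \log Q(x)]$$
This decomposes cross-entropy:
$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \Vert Q)$$
The entropy $H(P)$ is fixed (you cannot change the true distribution). So minimizing cross-entropy over $Q$ is equivalent to minimizing KL divergence.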
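A quick numerical check of the decomposition may help. In this sketch the three distributions are arbitrary illustrative values, with `p` standing in for the true distribution and `q_good`, `q_bad` for two candidate models.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]           # stand-in for the true distribution
q_good = [0.6, 0.25, 0.15]    # a model close to p
q_bad = [0.1, 0.2, 0.7]       # a model far from p

# Unavoidable coding cost plus the penalty for using the wrong distribution.
lhs = cross_entropy(p, q_good)
rhs = entropy(p) + kl_divergence(p, q_good)
```

Since `entropy(p)` is the same for both models, ranking `q_good` and `q_bad` by cross-entropy gives the same order as ranking them by KL divergence.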
Non-Negativity
$$D_{\mathrm{KL}}(P \Vert Q) \ge 0, \quad \text{with equality if and only if } P = Q$$
This follows from Jensen's inequality applied to the concave function $\log$. It confirms that cross-entropy is always at least as large as entropy: the wrong model always costs more.
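For reference, the Jensen step can be written out for the discrete case. This is a standard argument sketched here, with sums restricted to the support of $P$:

$$
\begin{aligned}
-D_{\mathrm{KL}}(P \Vert Q)
&= \sum_{x:\,P(x)>0} P(x) \log \frac{Q(x)}{P(x)} \\
&\le \log \sum_{x:\,P(x)>0} P(x)\,\frac{Q(x)}{P(x)} && \text{(Jensen, since } \log \text{ is concave)} \\
&= \log \sum_{x:\,P(x)>0} Q(x) \;\le\; \log 1 = 0 .
\end{aligned}
$$

Multiplying through by $-1$ gives $D_{\mathrm{KL}}(P \Vert Q) \ge 0$. Because $\log$ is strictly concave, equality requires the ratio $Q(x)/P(x)$ to be constant on the support of $P$, which together with normalization forces $Q = P$.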
Asymmetry
KL divergence is not a distance metric: in general $D_{\mathrm{KL}}(P \Vert Q) \ne D_{\mathrm{KL}}(Q \Vert P)$. This asymmetry has important practical consequences:
Forward KL, $D_{\mathrm{KL}}(P \Vert Q)$: penalizes $Q$ placing zero mass where $P$ has mass ($Q$ must cover all of $P$). When minimized, $Q$ spreads to cover all modes of $P$: mean-seeking or mass-covering behavior.
Reverse KL, $D_{\mathrm{KL}}(Q \Vert P)$: penalizes $Q$ placing mass where $P$ has zero mass ($Q$ must not assign probability to regions $P$ rejects). When minimized, $Q$ concentrates on one mode of $P$: mode-seeking or zero-forcing behavior.
Practical implication: Variational inference with $D_{\mathrm{KL}}(q \Vert p)$ (reverse KL) tends to find one good mode of the posterior rather than covering all of them.
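The mean-seeking versus mode-seeking contrast can be illustrated on a discretized bimodal target. This sketch does not run an optimizer; it only compares two hand-picked single-Gaussian candidates, and the grid, mixture parameters, and candidate parameters are all arbitrary assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats; infinite if Q misses mass that P has."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

# Grid on [-12, 12] with step 0.05 (an arbitrary discretization choice).
xs = [-12 + 0.05 * i for i in range(481)]

# Bimodal target P: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
target = normalize([0.5 * normal_pdf(x, -3, 0.5) + 0.5 * normal_pdf(x, 3, 0.5) for x in xs])

# Two hand-picked single-Gaussian candidates for Q.
wide = normalize([normal_pdf(x, 0, 3) for x in xs])       # spreads over both modes
narrow = normalize([normal_pdf(x, 3, 0.5) for x in xs])   # sits on one mode only

forward_wide, forward_narrow = kl_divergence(target, wide), kl_divergence(target, narrow)
reverse_wide, reverse_narrow = kl_divergence(wide, target), kl_divergence(narrow, target)
```

Forward KL prefers `wide` because `narrow` assigns almost no mass to the left mode. Reverse KL prefers `narrow`, whose cost is close to $\log 2$, because `wide` puts mass in the valley between the modes where the target has almost none.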
KL in the VAE ELBO
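Variational autoencoders (VAEs) optimize the Evidence Lower BOund (ELBO):
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x) \Vert p(z))$$
The KL term acts as a regularizer: it pushes the encoder posterior $q_\phi(z \mid x)$ toward the prior $p(z)$. When $q_\phi(z \mid x)$ is a Gaussian with mean $\mu$ and variance $\sigma^2$ (diagonal covariance) and the prior is the standard normal $\mathcal{N}(0, I)$, this KL has a closed form:
$$D_{\mathrm{KL}}(q_\phi(z \mid x) \Vert p(z)) = \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$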
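As a sanity check on the closed form, the sketch below compares it with a brute-force numerical integral in one dimension. It assumes a diagonal-Gaussian encoder and a standard normal prior; the function names and the test values of the mean and variance are illustrative.

```python
import math

def gaussian_kl_to_standard_normal(mu, var):
    """Closed-form D_KL(N(mu, diag(var)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

def numeric_kl_1d(mu, var, lo=-20.0, hi=20.0, n=40001):
    """Riemann-sum estimate of the same KL in one dimension."""
    dx = (hi - lo) / (n - 1)
    sigma = math.sqrt(var)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        log_q = -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        log_p = -0.5 * x * x - 0.5 * math.log(2 * math.pi)
        total += math.exp(log_q) * (log_q - log_p) * dx
    return total

kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])  # encoder equals prior
kl_shifted = gaussian_kl_to_standard_normal([1.0], [1.0])             # mean moved by 1
kl_closed = gaussian_kl_to_standard_normal([0.7], [0.3])              # arbitrary test point
kl_numeric = numeric_kl_1d(0.7, 0.3)                                  # should agree closely
```

The penalty is zero only when the encoder output matches the prior exactly, and it grows as the mean moves away from 0 or the variance moves away from 1.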
KL and Policy Optimization
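In reinforcement learning, Trust Region Policy Optimization (TRPO) constrains each policy update by a KL budget $\delta$:
$$\mathbb{E}_{s}\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s)\big) \right] \le \delta$$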
This prevents the new policy from diverging too far from the old one — ensuring the importance sampling correction (from r5) remains reliable.
PPO replaces the hard KL constraint with a clipped objective, but adds an optional KL penalty term for stability.
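Both ideas can be sketched for a small discrete action space. The policies, the budget `delta`, and the clip range `eps` below are hypothetical values chosen for illustration, and the helper names are not taken from any RL library. The clipped term follows the usual PPO form, min(r·A, clip(r, 1 − ε, 1 + ε)·A), where r is the probability ratio and A the advantage.

```python
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(old_policy, new_policy):
    """Average D_KL(pi_old(.|s) || pi_new(.|s)) over a batch of states."""
    kls = [kl_divergence(old, new) for old, new in zip(old_policy, new_policy)]
    return sum(kls) / len(kls)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical action probabilities for two states: old policy and two candidate updates.
old = [[0.5, 0.5], [0.8, 0.2]]
small_step = [[0.52, 0.48], [0.78, 0.22]]
large_step = [[0.95, 0.05], [0.30, 0.70]]
delta = 0.01   # illustrative KL budget, not a recommended value

small_ok = mean_kl(old, small_step) <= delta                # stays inside the trust region
large_ok = mean_kl(old, large_step) <= delta                # violates the KL budget
gain_capped = clipped_surrogate(ratio=1.9, advantage=1.0)   # gain is capped at 1 + eps
loss_kept = clipped_surrogate(ratio=1.9, advantage=-1.0)    # the pessimistic value is kept
```

The clip removes the incentive to push the probability ratio far from 1, which limits the policy change without computing a KL term explicitly.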
Summary: The Information Theory Triangle
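| Quantity | Formula | Meaning |
|---|---|---|
| Entropy | $H(P) = -\sum_x P(x) \log P(x)$ | Uncertainty in $P$; bits needed to encode samples |
| Cross-entropy | $H(P, Q) = -\sum_x P(x) \log Q(x)$ | Bits needed using model $Q$ to encode $P$-distributed samples |
| KL divergence | $D_{\mathrm{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ | Extra bits wasted using $Q$ instead of $P$ |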
Cross-entropy = Entropy + KL. Minimizing cross-entropy = minimizing KL = maximum likelihood estimation.