Prerequisite · Probability Foundations

KL Divergence and Cross-Entropy

By the end of this reading you will be able to:
  • Define entropy H(P) as the expected surprise under P, and explain why it is maximized by the uniform distribution and minimized by a point mass
  • Derive the relationship H(P,Q) = H(P) + D_KL(P‖Q) and state what each term represents in terms of unavoidable coding cost vs. the penalty for using the wrong distribution
  • Explain the asymmetry of KL divergence and distinguish the behavior of forward KL (D_KL(P‖Q), mean-seeking) from reverse KL (D_KL(Q‖P), mode-seeking) when approximating a multimodal distribution
  • Identify cross-entropy loss as −log Q minimization, explain how this equals MLE when P is the empirical data distribution, and locate KL divergence in the VAE ELBO and in the policy update constraint of TRPO (which PPO replaces with clipping)

Information and Surprise

How surprised are you when an event occurs? If a very likely event ($p \approx 1$) happens, you are not surprised at all. If a very unlikely event ($p \approx 0$) happens, you are very surprised.

The information content (or surprise) of an event $x$ with probability $p(x)$ is:

$$I(x) = -\log p(x)$$

Using the natural logarithm gives information in nats; using the base-2 logarithm gives bits.

Properties:

  • $I(x) \geq 0$ (information is non-negative since $p \leq 1$)
  • $I(x) = 0$ when $p(x) = 1$ (certain events are not informative)
  • $I(x) \to \infty$ as $p(x) \to 0$ (impossible events would be infinitely surprising)
  • $I(x_1 x_2) = I(x_1) + I(x_2)$ for independent events (information adds)
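
The properties above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `surprise` and the example probabilities are illustrative choices, not part of the reading.

```python
import math

def surprise(p, base=math.e):
    """Information content -log(p) of an event with probability p, 0 < p <= 1.

    base=math.e gives nats, base=2 gives bits.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

certain = surprise(1.0)          # a certain event carries no information
rare_nats = surprise(0.01)       # natural log: nats
rare_bits = surprise(0.01, 2)    # base-2 log: bits

# Additivity: for independent events, p(x1, x2) = p(x1) * p(x2),
# so the surprise of the pair is the sum of the individual surprises.
p1, p2 = 0.5, 0.25
joint = surprise(p1 * p2, 2)
separate = surprise(p1, 2) + surprise(p2, 2)

print(certain == 0.0, rare_nats, rare_bits, joint, separate)
```

Nats and bits differ only by the constant factor log 2, which is why the choice of base never changes which of two events is more surprising.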

Entropy

Entropy is the expected surprise of a distribution, that is, the average information content of a draw from $P$:

$$H(P) = \mathbb{E}_{x \sim P}[-\log P(x)] = -\sum_x P(x) \log P(x)$$

For continuous distributions (differential entropy):

$$H(P) = -\int P(x) \log P(x)\, dx$$

Key properties:

  • $H(P) \geq 0$ for discrete distributions (differential entropy can be negative)
  • Over a fixed finite set of outcomes, entropy is maximized by the uniform distribution (maximum uncertainty)
  • Entropy is minimized ($= 0$) by a point mass (no uncertainty)
  • A fair coin has $H = \log 2 \approx 0.693$ nats $= 1$ bit
  • A biased coin with $p = 0.9$ has $H = -(0.9 \log 0.9 + 0.1 \log 0.1) \approx 0.325$ nats

Intuition for ML: Entropy measures how much information (on average) you learn from a single observation. A peaked distribution is easy to predict; a flat distribution is hard.
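
As a quick check of the coin values and the two extreme cases above, here is a small sketch in plain Python. The helper name `entropy` and the four-outcome example distributions are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; outcomes with p == 0 contribute 0 by convention."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

fair = entropy([0.5, 0.5])                 # should equal log 2
biased = entropy([0.9, 0.1])               # lower: the coin is more predictable
point_mass = entropy([1.0, 0.0])           # no uncertainty at all
uniform4 = entropy([0.25] * 4)             # should equal log 4
skewed4 = entropy([0.7, 0.1, 0.1, 0.1])    # same support, less uncertainty

print(round(fair, 3), round(biased, 3), point_mass == 0.0, uniform4 > skewed4)
```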


Cross-Entropy

Cross-entropy $H(P, Q)$ measures the expected surprise when events are drawn from $P$ but you are using model $Q$ to predict them:

$$H(P, Q) = \mathbb{E}_{x \sim P}[-\log Q(x)] = -\sum_x P(x) \log Q(x)$$

  • When $Q = P$: $H(P, P) = H(P)$, the best you can do
  • When $Q \neq P$: $H(P, Q) > H(P)$, so you pay an extra cost for using the wrong model
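
The two cases above can be illustrated with a short sketch. The distributions `p` and `q` are made-up examples, and the helper assumes that Q(x) > 0 wherever P(x) > 0 (otherwise the cross-entropy is infinite).

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.7, 0.2, 0.1]   # hypothetical true distribution
q = [0.4, 0.4, 0.2]   # hypothetical (wrong) model

h_p = cross_entropy(p, p)    # Q = P: reduces to the entropy H(P)
h_pq = cross_entropy(p, q)   # Q != P: strictly larger

print(h_p, h_pq, h_pq > h_p)
```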

Cross-entropy loss in classification. If $P$ is the one-hot ground truth (only class $k$ has $P(k) = 1$) and $Q = \hat{\pi}$ is the model's softmax output, then:

$$H(P, Q) = -\log \hat{\pi}_k$$

Minimizing cross-entropy loss over a dataset is equivalent to maximizing the log-likelihood of the training data under the model — this is maximum likelihood estimation (MLE).
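
To make the link to MLE concrete, the sketch below uses a tiny made-up label set and a hypothetical categorical model (no inputs, just class probabilities). It checks that the average negative log-likelihood over the data equals the cross-entropy between the empirical label distribution and the model, and that the empirical frequencies themselves achieve a lower loss than the hand-picked model. The names `labels`, `model`, and `softmax` are illustrative.

```python
import math
from collections import Counter

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Per-example loss with a one-hot target: -log of the probability of the true class.
probs = softmax([2.0, 0.5, -1.0])
true_class = 0
example_loss = -math.log(probs[true_class])

# Dataset view: hypothetical labels and a hypothetical categorical model.
labels = ["cat", "dog", "cat", "cat", "bird", "dog", "cat", "cat"]
n = len(labels)
counts = Counter(labels)
empirical = {c: counts[c] / n for c in counts}   # empirical distribution P
model = {"bird": 0.2, "cat": 0.5, "dog": 0.3}    # model distribution Q

def mean_nll(q):
    """Average negative log-likelihood of the labels under q."""
    return -sum(math.log(q[y]) for y in labels) / n

def cross_entropy(p, q):
    """H(P, Q) for distributions stored as dicts over the same classes."""
    return -sum(p[c] * math.log(q[c]) for c in p if p[c] > 0.0)

print(mean_nll(model), cross_entropy(empirical, model), mean_nll(empirical))
```

Because the average over examples just reweights each class by its frequency, the first two printed values agree up to floating-point error.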


KL Divergence

Kullback-Leibler (KL) divergence measures how much extra surprise you experience by using $Q$ instead of $P$:

$$D_{\text{KL}}(P \| Q) = H(P, Q) - H(P) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

This decomposes cross-entropy:

$$\underbrace{H(P, Q)}_{\text{cross-entropy}} = \underbrace{H(P)}_{\text{entropy (unavoidable)}} + \underbrace{D_{\text{KL}}(P \| Q)}_{\text{KL (penalty for wrong model)}}$$

The entropy $H(P)$ is fixed (you cannot change the true distribution) and does not depend on $Q$. So minimizing cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence $D_{\text{KL}}(P \| Q)$.
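
The decomposition can be verified numerically. In this sketch the distributions `p`, `q_a`, and `q_b` are made-up examples; it also checks that ranking two candidate models by cross-entropy gives the same order as ranking them by KL, since the two differ only by the constant H(P).

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.7, 0.2, 0.1]     # hypothetical true distribution
q_a = [0.6, 0.3, 0.1]   # hypothetical model, close to p
q_b = [0.2, 0.3, 0.5]   # hypothetical model, far from p

for q in (q_a, q_b):
    lhs = cross_entropy(p, q)
    rhs = entropy(p) + kl(p, q)
    print(lhs, rhs)     # the two columns agree up to rounding

same_ranking = (cross_entropy(p, q_a) < cross_entropy(p, q_b)) == (kl(p, q_a) < kl(p, q_b))
```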

Non-Negativity

$$D_{\text{KL}}(P \| Q) \geq 0, \quad \text{with equality iff } P = Q$$

This follows from Jensen's inequality applied to the concave function $\log$. It confirms that cross-entropy is always at least as large as entropy: the wrong model always costs more.
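
For readers who want the Jensen step spelled out, one standard way to write it for the discrete case is sketched below. The sums run over the support of P, which is a notational choice made here.

```latex
\begin{aligned}
-D_{\text{KL}}(P \| Q)
  &= \sum_{x:\,P(x)>0} P(x) \log \frac{Q(x)}{P(x)} \\
  &\leq \log \sum_{x:\,P(x)>0} P(x)\,\frac{Q(x)}{P(x)}
     && \text{(Jensen, since $\log$ is concave)} \\
  &= \log \sum_{x:\,P(x)>0} Q(x) \\
  &\leq \log 1 = 0 .
\end{aligned}
```

Because $\log$ is strictly concave, equality in the Jensen step requires $Q(x)/P(x)$ to be constant on the support of $P$; combined with both distributions summing to 1, this forces $P = Q$.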

Asymmetry

$$D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P) \quad \text{in general}$$

KL divergence is not a distance metric. This asymmetry has important practical consequences:

Forward KL: $D_{\text{KL}}(P \| Q)$ penalizes $Q$ for placing (near-)zero mass where $P$ has mass, so $Q$ must cover all of $P$. When minimized, $Q$ spreads to cover all modes of $P$: mean-seeking or mass-covering behavior.

Reverse KL: $D_{\text{KL}}(Q \| P)$ penalizes $Q$ for placing mass where $P$ has (near-)zero mass, so $Q$ must avoid regions that $P$ rejects. When minimized, $Q$ concentrates on one mode of $P$: mode-seeking or zero-forcing behavior.

Practical implication: Variational inference with $D_{\text{KL}}(Q \| P)$ (reverse KL) tends to find one good mode of the posterior rather than covering all of them.
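
The difference between the two directions can be seen in a toy experiment. The sketch below is an illustration under stated assumptions, not a general fitting procedure: the target P is a made-up equal mixture of two narrow bumps at −2 and +2 on a discrete grid, Q is a single discretized Gaussian, and the best Q is found by brute-force search over a small hypothetical set of (mean, standard deviation) candidates.

```python
import math

xs = [i / 10 for i in range(-80, 81)]   # grid on [-8, 8] with step 0.1

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def gaussian(mean, sd):
    """Discretized Gaussian on the grid, normalized to sum to 1."""
    return normalize([math.exp(-0.5 * ((x - mean) / sd) ** 2) for x in xs])

def kl(p, q):
    """D_KL(P || Q) in nats for two distributions on the same grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Bimodal target P: equal mixture of bumps at -2 and +2, each with sd 0.5.
p = [0.5 * a + 0.5 * b for a, b in zip(gaussian(-2.0, 0.5), gaussian(2.0, 0.5))]

# Candidate single-Gaussian fits: means -3.0..3.0, sds 0.5..3.0, in steps of 0.5.
candidates = [(m / 2, s / 2) for m in range(-6, 7) for s in range(1, 7)]

forward_best = min(candidates, key=lambda c: kl(p, gaussian(*c)))   # argmin KL(P || Q)
reverse_best = min(candidates, key=lambda c: kl(gaussian(*c), p))   # argmin KL(Q || P)

print("forward KL fit (mean, sd):", forward_best)
print("reverse KL fit (mean, sd):", reverse_best)
```

On this grid the forward-KL fit should be a broad Gaussian centered between the two modes, while the reverse-KL fit should be a narrow Gaussian sitting on one of them. Which of the two modes it picks is arbitrary, since the target is symmetric.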


KL in the VAE ELBO

Variational autoencoders (VAEs) optimize the Evidence Lower BOund (ELBO):

$$\mathcal{L} = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))$$

The KL term $D_{\text{KL}}(q_\phi(z|x) \| p(z))$ acts as a regularizer: it pushes the encoder posterior $q_\phi(z|x)$ toward the prior $p(z) = \mathcal{N}(0, I)$. When $q_\phi$ is a diagonal Gaussian, the KL is a sum over latent dimensions, and for a dimension with mean $\mu$ and variance $\sigma^2$ it has a closed form:

$$D_{\text{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - \log \sigma^2 - 1)$$
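
As a sanity check on the closed form, the sketch below compares it against a brute-force numerical integral for one latent dimension. The function names, the integration range, and the example values of the mean and standard deviation are all illustrative assumptions.

```python
import math

def kl_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one dimension, in nats."""
    var = sigma ** 2
    return 0.5 * (mu ** 2 + var - math.log(var) - 1.0)

def kl_numeric(mu, sigma, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint Riemann sum of q(x) * (log q(x) - log p(x)) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_q = -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2
        log_p = -0.5 * math.log(2 * math.pi) - 0.5 * x ** 2
        total += math.exp(log_q) * (log_q - log_p) * dx
    return total

closed = kl_to_std_normal(1.0, 0.5)
numeric = kl_numeric(1.0, 0.5)
print(closed, numeric)

# For a diagonal Gaussian encoder, the total KL is the sum over latent dimensions.
mus, sigmas = [1.0, 0.0, -0.3], [0.5, 1.0, 2.0]
total_kl = sum(kl_to_std_normal(m, s) for m, s in zip(mus, sigmas))
```

The closed form is zero exactly when the mean is 0 and the variance is 1, that is, when the encoder output already matches the prior in that dimension.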


KL and Policy Optimization

In reinforcement learning, Trust Region Policy Optimization (TRPO) constrains each policy update by a KL budget:

$$\mathbb{E}_{s \sim \rho^{\pi_{\text{old}}}}\!\left[D_{\text{KL}}(\pi_{\text{old}}(\cdot|s) \| \pi_\theta(\cdot|s))\right] \leq \delta$$

This prevents the new policy from diverging too far from the old one — ensuring the importance sampling correction (from r5) remains reliable.

PPO replaces the hard KL constraint with a clipped surrogate objective; a variant of PPO instead uses an adaptive KL penalty term, and some implementations add a KL penalty alongside clipping for stability.
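
The TRPO-style constraint can be sketched as a simple check: average the per-state KL between the old and new action distributions over sampled states and compare it with the budget δ. Everything concrete below (the three states, the action probabilities, the value of `delta`, and the function names) is a hypothetical example, not taken from any particular implementation.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(old_policy, new_policy):
    """Average KL(old || new) over the sampled states (one row per state)."""
    return sum(kl(o, n) for o, n in zip(old_policy, new_policy)) / len(old_policy)

# Action probabilities for three sampled states over three actions (made up).
pi_old = [[0.50, 0.30, 0.20], [0.10, 0.60, 0.30], [0.25, 0.25, 0.50]]
pi_small_step = [[0.55, 0.27, 0.18], [0.12, 0.58, 0.30], [0.20, 0.30, 0.50]]
pi_large_step = [[0.90, 0.05, 0.05]] * 3

delta = 0.01   # hypothetical KL budget

small_ok = mean_kl(pi_old, pi_small_step) <= delta   # update stays in the trust region
large_ok = mean_kl(pi_old, pi_large_step) <= delta   # update would be rejected

print(mean_kl(pi_old, pi_small_step), small_ok)
print(mean_kl(pi_old, pi_large_step), large_ok)
```

Note the argument order: the constraint in the text is KL(old ‖ new), with the expectation taken over states visited by the old policy.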


Summary: The Information Theory Triangle

| Quantity | Formula | Meaning |
| --- | --- | --- |
| Entropy $H(P)$ | $-\mathbb{E}_P[\log P]$ | Uncertainty in $P$; bits needed to encode samples |
| Cross-entropy $H(P, Q)$ | $-\mathbb{E}_P[\log Q]$ | Bits needed using model $Q$ to encode $P$-distributed samples |
| KL divergence $D_{\text{KL}}(P \Vert Q)$ | $H(P, Q) - H(P)$ | Extra bits wasted using $Q$ instead of $P$ |

Cross-entropy = Entropy + KL. Minimizing cross-entropy = minimizing KL = maximum likelihood estimation.