Prerequisite · Probability Foundations

KL Divergence and Cross-Entropy

By the end of this reading you will be able to:

Define entropy H(P) as the expected surprise under P, and explain why it is maximized by the uniform distribution and minimized by a point mass
Derive the relationship H(P,Q) = H(P) + D_KL(P‖Q) and state what each term represents in terms of unavoidable coding cost vs. the penalty for using the wrong distribution
Explain the asymmetry of KL divergence and distinguish the behavior of forward KL (D_KL(P‖Q), mean-seeking) from reverse KL (D_KL(Q‖P), mode-seeking) when approximating a multimodal distribution
Identify cross-entropy loss as minimization of −log Q, explain why this is equivalent to MLE when P is the empirical data distribution, and locate KL divergence in the VAE ELBO, in TRPO's policy update constraint, and in PPO's KL penalty variant

Information and Surprise

How surprised are you when an event occurs? If a very likely event ( $p \approx 1$ ) happens, you are not surprised at all. If a very unlikely event ( $p \approx 0$ ) happens, you are very surprised.

The information content (or surprise) of an event $x$ with probability $p(x)$ is:

$I(x) = -\log p(x)$

The natural logarithm gives information in nats; the base-2 logarithm gives bits.

Properties:

$I(x) \geq 0$ (information is non-negative since $p \leq 1$ )
$I(x) = 0$ when $p(x) = 1$ (certain events are not informative)
$I(x) \to \infty$ as $p(x) \to 0$ (impossible events would be infinitely surprising)
$I(x_1, x_2) = I(x_1) + I(x_2)$ for independent events, since $p(x_1, x_2) = p(x_1)\,p(x_2)$ (information adds)
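
The properties above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `surprise` and the example probabilities are illustrative choices, not part of the original reading.

```python
import math

def surprise(p, base=math.e):
    """Information content log(1/p) of an event with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log(1.0 / p) / math.log(base)

certain = surprise(1.0)            # a certain event carries no information
fair_flip_bits = surprise(0.5, 2)  # one fair coin flip, measured in bits
rare_nats = surprise(0.001)        # a rare event, measured in nats

# Additivity: independent events multiply in probability, so surprises add.
p1, p2 = 0.5, 0.2
joint = surprise(p1 * p2)
separate = surprise(p1) + surprise(p2)

print(certain, fair_flip_bits, rare_nats)
print(joint, separate)
```

Dividing by `math.log(base)` is just the change-of-base rule, so the same function reports nats or bits depending on the `base` argument.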

Entropy

Entropy is the expected surprise of a distribution — the average information content of a draw from $P$ :

$H(P) = \mathbb{E}_{x \sim P}[-\log P(x)] = -\sum_x P(x)\log P(x)$

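For a continuous distribution with density $P(x)$, the analogous quantity is the differential entropy, which (unlike the discrete case) can be negative: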

$H(P) = -\int P(x) \log P(x)\, dx$

Key properties:

$H(P) \geq 0$ for discrete distributions
Over a finite set of $n$ outcomes, entropy is maximized by the uniform distribution, where $H = \log n$ (maximum uncertainty)
Entropy is minimized (= 0) by a point mass — no uncertainty
A fair coin has $H = \log 2 \approx 0.693$ nats = 1 bit
A biased coin with $p = 0.9$ has $H = -(0.9\log 0.9 + 0.1\log 0.1) \approx 0.325$ nats

Intuition for ML: Entropy measures how much information (on average) you learn from a single observation. A peaked distribution is easy to predict; a flat distribution is hard.
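
The coin examples and the two extreme cases can be reproduced in a few lines. This is a minimal sketch in plain Python; the helper name `entropy` and the four-outcome distributions are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; outcomes with p == 0 contribute 0 by convention."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)

fair_coin = entropy([0.5, 0.5])            # log 2
biased_coin = entropy([0.9, 0.1])          # lower than the fair coin
point_mass = entropy([1.0, 0.0])           # no uncertainty
uniform_4 = entropy([0.25] * 4)            # log 4, the maximum over 4 outcomes
peaked_4 = entropy([0.7, 0.1, 0.1, 0.1])   # same support, less uncertainty
fair_coin_bits = fair_coin / math.log(2)   # convert nats to bits

print(fair_coin, biased_coin, point_mass)
print(uniform_4, peaked_4, fair_coin_bits)
```

Skipping terms with `p == 0` implements the usual convention that $0 \log 0 = 0$, which is what lets the point mass come out at exactly zero.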

Cross-Entropy

Cross-entropy $H(P, Q)$ measures the expected surprise when events are drawn from $P$ but you are using model $Q$ to predict them:

$H(P, Q) = \mathbb{E}_{x \sim P}[-\log Q(x)] = -\sum_x P(x) \log Q(x)$

When $Q = P$ : $H(P, P) = H(P)$ — the best you can do
When $Q \neq P$ : $H(P, Q) > H(P)$ — you are paying an extra cost for using the wrong model
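
To make the two cases concrete, here is a small sketch that evaluates cross-entropy for a matching and a mismatched model. The distributions `p`, `q_good`, and `q_bad` are hypothetical values chosen only for illustration.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in nats for discrete distributions given as aligned lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # Q rules out an outcome that P produces
            total -= pi * math.log(qi)
    return total

p = [0.7, 0.2, 0.1]        # hypothetical true distribution
q_good = [0.6, 0.3, 0.1]   # a model close to P
q_bad = [0.1, 0.2, 0.7]    # a badly mismatched model

h_p = cross_entropy(p, p)          # equals the entropy H(P)
h_good = cross_entropy(p, q_good)
h_bad = cross_entropy(p, q_bad)

print(h_p, h_good, h_bad)
```

The early return of `math.inf` reflects the fact that a model assigning zero probability to an outcome that actually occurs incurs unbounded surprise.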

Cross-entropy loss in classification. If $P$ is the one-hot ground truth (only class $k$ has $P(k) = 1$ ) and $Q = \hat{\pi}$ is the model's softmax output, then:

$H(P, Q) = -\log \hat{\pi}_k$

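Averaged over a dataset of $N$ labeled examples, this loss is $-\frac{1}{N}\sum_{i=1}^{N} \log \hat{\pi}^{(i)}_{k_i}$, the negative mean log-likelihood of the training labels under the model. Minimizing it is therefore equivalent to maximizing the log-likelihood, which is maximum likelihood estimation (MLE).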
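
The link between the classification loss and the log-likelihood can be sketched as follows. The logits, labels, and function names are hypothetical; the log-softmax uses the standard log-sum-exp trick for numerical stability, which is an implementation choice rather than something the reading prescribes.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy_loss(logits, label):
    """Cross-entropy against a one-hot target: -log of the true class probability."""
    return -log_softmax(logits)[label]

# Hypothetical mini-batch of (logits, true class index) pairs.
batch = [([2.0, 0.5, -1.0], 0), ([0.1, 0.2, 3.0], 2), ([1.0, 1.0, 1.0], 1)]

mean_loss = sum(cross_entropy_loss(z, y) for z, y in batch) / len(batch)

# The same quantity computed as a negative average log-likelihood.
log_likelihood = sum(log_softmax(z)[y] for z, y in batch)
neg_avg_log_lik = -log_likelihood / len(batch)

print(mean_loss, neg_avg_log_lik)
```

Because the two numbers agree for any batch, any parameter update that lowers the mean cross-entropy loss raises the log-likelihood by the same amount per example.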

KL Divergence

Kullback-Leibler (KL) divergence measures how much extra surprise you experience by using $Q$ instead of $P$ :

$D_{\text{KL}}(P \| Q) = H(P, Q) - H(P) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

This decomposes cross-entropy:

$\underbrace{H(P, Q)}_{\text{cross-entropy}} = \underbrace{H(P)}_{\text{entropy (unavoidable)}} + \underbrace{D_{\text{KL}}(P \| Q)}_{\text{KL (penalty for wrong model)}}$

The entropy $H(P)$ does not depend on $Q$ (you cannot change the true distribution). So minimizing cross-entropy over $Q$ is equivalent to minimizing $D_{\text{KL}}(P \| Q)$ over $Q$.
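
The decomposition can be verified numerically on a small example. This is a sketch with hypothetical distributions `p` and `q`; it assumes `q` is strictly positive wherever `p` is, so the KL divergence is finite.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.7, 0.2, 0.1]   # hypothetical true distribution
q = [0.5, 0.3, 0.2]   # hypothetical model

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)

# Cross-entropy = entropy + KL, up to floating-point rounding.
print(h_pq, h_p + kl_pq)
```

Since `h_p` is the same for every choice of `q`, ranking candidate models by `h_pq` or by `kl_pq` gives the same ordering.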

Non-Negativity

$D_{\text{KL}}(P \| Q) \geq 0, \quad \text{with equality iff } P = Q$

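This follows from Jensen's inequality applied to the concave function $\log$ : $-D_{\text{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{Q(x)}{P(x)}\right] \leq \log \mathbb{E}_{x \sim P}\left[\frac{Q(x)}{P(x)}\right] = \log \sum_{x : P(x) > 0} Q(x) \leq \log 1 = 0$ . It confirms that cross-entropy is never smaller than entropy: using the wrong model always costs more.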

Asymmetry

$D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P) \quad \text{in general}$

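KL divergence is therefore not a distance metric (it also fails the triangle inequality). This asymmetry has important practical consequences when $Q$ is restricted to a simpler family than $P$ :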

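Forward KL: $D_{\text{KL}}(P \| Q)$ heavily penalizes $Q$ for placing little or no mass where $P$ has mass, because the term $P(x)\log\frac{P(x)}{Q(x)}$ grows without bound as $Q(x) \to 0$ ( $Q$ must cover all of $P$ ). When minimized over a restricted family such as single Gaussians, $Q$ spreads to cover all modes of $P$ : mean-seeking or mass-covering behavior.

Reverse KL: $D_{\text{KL}}(Q \| P)$ heavily penalizes $Q$ for placing mass where $P$ has little or none, because the term $Q(x)\log\frac{Q(x)}{P(x)}$ grows without bound as $P(x) \to 0$ ( $Q$ must not assign probability to regions $P$ rejects). When minimized over the same restricted family, $Q$ concentrates on one mode of $P$ : mode-seeking or zero-forcing behavior.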

Practical implication: Variational inference with $D_{\text{KL}}(Q \| P)$ (reverse KL) tends to find one good mode of the posterior rather than covering all of them.
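
The contrast between the two directions can be seen in a toy experiment. The sketch below is an illustration under stated assumptions, not a method from the reading: the target `p` is a hypothetical two-mode mixture discretized on a grid, the approximating family is single discretized Gaussians, and the fit is a brute-force search over a coarse set of `(mean, std)` candidates.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_on_grid(xs, mean, std):
    """Discretized, renormalized Gaussian over the grid xs."""
    return normalize([math.exp(-0.5 * ((x - mean) / std) ** 2) for x in xs])

def kl(p, q):
    """D_KL(p || q) for aligned discrete distributions with q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

xs = [i * 0.1 for i in range(-120, 121)]  # grid on [-12, 12]

# Target P: equal mixture of two narrow modes at -3 and +3.
left = gaussian_on_grid(xs, -3.0, 0.5)
right = gaussian_on_grid(xs, 3.0, 0.5)
p = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Candidate family Q: single Gaussians on a coarse (mean, std) grid.
candidates = [(m * 0.5, s * 0.5) for m in range(-8, 9) for s in range(1, 9)]

def best(objective):
    """Return the (mean, std) candidate with the smallest objective value."""
    return min(candidates, key=lambda ms: objective(gaussian_on_grid(xs, *ms)))

forward_fit = best(lambda q: kl(p, q))  # minimizes D_KL(P || Q)
reverse_fit = best(lambda q: kl(q, p))  # minimizes D_KL(Q || P)

print("forward KL fit (mean, std):", forward_fit)
print("reverse KL fit (mean, std):", reverse_fit)
```

In this setup the forward fit should sit between the modes with a large standard deviation, while the reverse fit should lock onto one of the two modes with a small one. Which mode the reverse fit picks is arbitrary here, since the target is symmetric.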

KL in the VAE ELBO

Variational autoencoders (VAEs) optimize the Evidence Lower BOund (ELBO):

$\mathcal{L} = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))$

The KL term $D_{\text{KL}}(q_\phi(z|x) \| p(z))$ acts as a regularizer: it pushes the encoder posterior $q_\phi(z|x)$ toward the prior, typically $p(z) = \mathcal{N}(0, I)$ . When $q_\phi$ is a diagonal Gaussian, the KL is a sum over latent dimensions, and each dimension with mean $\mu$ and variance $\sigma^2$ contributes the closed form:

$D_{\text{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - \log \sigma^2 - 1)$
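
The closed form can be sanity-checked against direct numerical integration. This is a minimal sketch: the function names, the integration range, and the example means and variances are assumptions chosen for illustration, and the Riemann sum serves only as a check, not as something a VAE would compute in practice.

```python
import math

def kl_to_standard_normal(mu, var):
    """Closed-form D_KL(N(mu, var) || N(0, 1)) for one latent dimension."""
    return 0.5 * (mu ** 2 + var - math.log(var) - 1.0)

def kl_numeric(mu, var, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint Riemann-sum estimate of the same integral, used only as a check."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_q = -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        log_p = -0.5 * math.log(2 * math.pi) - x ** 2 / 2
        total += math.exp(log_q) * (log_q - log_p) * dx
    return total

closed = kl_to_standard_normal(1.0, 0.25)
numeric = kl_numeric(1.0, 0.25)

# Diagonal Gaussian encoder: the per-dimension terms add up.
mus, variances = [0.5, -1.0, 0.0], [1.0, 0.5, 2.0]
total_kl = sum(kl_to_standard_normal(m, v) for m, v in zip(mus, variances))

print(closed, numeric, total_kl)
```

Working with log-densities inside the integrand avoids taking the logarithm of values that underflow in the tails.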

KL and Policy Optimization

In reinforcement learning, Trust Region Policy Optimization (TRPO) constrains each policy update by a KL budget:

$\mathbb{E}_{s \sim \rho^{\pi_{\text{old}}}}\!\left[D_{\text{KL}}(\pi_{\text{old}}(\cdot|s) \| \pi_\theta(\cdot|s))\right] \leq \delta$

This prevents the new policy from diverging too far from the old one — ensuring the importance sampling correction (from r5) remains reliable.
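
For discrete action spaces, the constraint amounts to averaging a per-state KL divergence and comparing it with the budget. The sketch below uses hypothetical action distributions and an illustrative `delta`; it shows only the constraint check, not TRPO's actual optimization procedure.

```python
import math

def kl(p, q):
    """D_KL(p || q) for aligned discrete distributions with q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(old_policy, new_policy):
    """Average D_KL(pi_old(.|s) || pi_new(.|s)) over a batch of sampled states."""
    kls = [kl(o, n) for o, n in zip(old_policy, new_policy)]
    return sum(kls) / len(kls)

# Hypothetical action distributions at three sampled states (3 actions each).
pi_old = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.4, 0.4, 0.2]]
small_step = [[0.52, 0.29, 0.19], [0.1, 0.79, 0.11], [0.41, 0.4, 0.19]]
large_step = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]

delta = 0.01  # illustrative KL budget, not a recommended setting

small_ok = mean_kl(pi_old, small_step) <= delta  # stays inside the trust region
large_ok = mean_kl(pi_old, large_step) <= delta  # violates the constraint

print(mean_kl(pi_old, small_step), small_ok)
print(mean_kl(pi_old, large_step), large_ok)
```

Note the argument order: the old policy comes first, matching the direction $D_{\text{KL}}(\pi_{\text{old}} \| \pi_\theta)$ in the constraint, and the states are those visited by the old policy.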

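PPO drops the hard KL constraint. Its most widely used variant replaces it with a clipped surrogate objective; an alternative variant instead adds a KL penalty term to the objective, with an adaptively tuned coefficient.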

Summary: The Information Theory Triangle

Quantity	Formula	Meaning
Entropy $H(P)$	$-\mathbb{E}_P[\log P]$	Uncertainty in $P$ ; bits needed to encode samples
Cross-entropy $H(P,Q)$	$-\mathbb{E}_P[\log Q]$	Bits needed using model $Q$ to encode $P$ -distributed samples
KL divergence $D_{\text{KL}}(P \| Q)$	$H(P,Q) - H(P)$	Extra bits wasted using $Q$ instead of $P$

Cross-entropy = Entropy + KL. Minimizing cross-entropy = minimizing KL = maximum likelihood estimation.
