Prerequisite · Probability Foundations

Entropy, KL Divergence, and Fitting Distributions

Colab Notebook · Python · ~45 min
Lab Objectives
1. Compute entropy for Bernoulli(θ) across θ ∈ (0,1) and confirm the maximum at θ = 0.5 equals 1 bit.
2. Implement KL divergence for discrete distributions and verify non-negativity, D_KL(p‖p) = 0, and asymmetry.
3. Numerically verify the cross-entropy decomposition H(p,q) = H(p) + D_KL(p‖q) for several distribution pairs.
4. Fit a Gaussian to a bimodal mixture by minimizing forward KL and reverse KL separately; plot and compare the two fitted curves.
5. Implement sequential Beta posterior updates starting from Beta(1,1) and plot how the posterior concentrates as coin-flip data accumulates.

Lab 3: Entropy, KL Divergence, and Fitting Distributions

Information-theoretic quantities (entropy, cross-entropy, and KL divergence) appear in loss functions, regularization terms, and training objectives throughout deep learning. This lab implements them from scratch, verifies their properties, and demonstrates the mean-seeking behavior of forward KL versus the mode-seeking behavior of reverse KL, which determines whether a model learns a blurry or a sharp approximation to the data.
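
As a preview, the three quantities named above can be sketched in a few lines of NumPy. This is a minimal sketch and not the notebook's reference solution: the function names `entropy`, `cross_entropy`, and `kl_divergence` and the example vectors `p` and `q` are assumptions made for illustration, and all logarithms are base 2 so results are in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits; outcomes with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

# Hypothetical example distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

print("H(fair coin)  =", entropy([0.5, 0.5]))   # exactly 1 bit
print("D_KL(p || p)  =", kl_divergence(p, p))   # exactly 0
print("D_KL(p || q)  =", kl_divergence(p, q))
print("D_KL(q || p)  =", kl_divergence(q, p))   # differs from D_KL(p || q)
print("H(p, q)       =", cross_entropy(p, q))
print("H(p) + D_KL   =", entropy(p) + kl_divergence(p, q))
```

The last two printed values should agree up to floating-point error, which is the decomposition the lab asks you to verify. Masking on `p > 0` encodes the convention 0 log 0 = 0; the sketch does not guard against `q` being zero where `p` is positive, in which case the divergence is infinite.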

What You'll Build

  • An entropy calculator: compute $H(p) = -\sum_x p(x) \log_2 p(x)$ for discrete distributions, verify that the uniform distribution over $K$ outcomes maximizes entropy at $\log_2 K$ bits, and plot entropy as a function of bias $\theta$ for a Bernoulli
  • A KL divergence library: implement $D_{\text{KL}}(p \| q)$ for discrete and continuous distributions; verify non-negativity (always $\geq 0$), confirm $D_{\text{KL}}(p \| p) = 0$, and demonstrate asymmetry with a numerical example
  • A cross-entropy decomposition: show that $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$ holds numerically, and confirm that minimizing cross-entropy over $q$ is equivalent to minimizing $D_{\text{KL}}(p \| q)$ when $p$ is fixed (i.e., training with cross-entropy loss = minimizing forward KL)
  • A forward vs reverse KL fitting experiment: given a bimodal target distribution $p$ (mixture of two Gaussians), fit a single Gaussian $q_\theta$ by minimizing (a) $D_{\text{KL}}(p \| q_\theta)$ (forward KL) and (b) $D_{\text{KL}}(q_\theta \| p)$ (reverse KL) via gradient descent; overlay the fitted curves on $p$ and observe mean-seeking vs mode-seeking behavior
  • A Beta–Dirichlet posterior update: implement Bayesian updating with a Beta prior, observe how $\text{Beta}(1,1)$ (uniform) evolves toward the MLE as data accumulates, and connect the $\alpha$/$\beta$ pseudo-counts to the sample mean
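
The forward vs reverse KL experiment in the list above can be sketched with NumPy alone. Everything specific here is an assumption for illustration: the target is a hypothetical equal-weight mixture of N(-2, 0.5²) and N(2, 0.5²), integrals are approximated by a Riemann sum on a fixed grid, and gradients come from central finite differences rather than an autodiff library, so the notebook's own setup may differ.

```python
import numpy as np

# Integration grid; wide enough that both densities are negligible at the edges.
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2), kept in log space to avoid log(0)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

# Hypothetical bimodal target: 0.5 * N(-2, 0.5^2) + 0.5 * N(2, 0.5^2).
log_p = np.log(0.5) + np.logaddexp(
    log_normal_pdf(x, -2.0, 0.5), log_normal_pdf(x, 2.0, 0.5)
)

def kl_on_grid(log_a, log_b):
    """Riemann-sum approximation of D_KL(a || b) in nats."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

def forward_kl(params):
    mu, log_sigma = params
    return kl_on_grid(log_p, log_normal_pdf(x, mu, np.exp(log_sigma)))

def reverse_kl(params):
    mu, log_sigma = params
    return kl_on_grid(log_normal_pdf(x, mu, np.exp(log_sigma)), log_p)

def fit(objective, init=(1.0, 0.0), lr=0.05, steps=2000, eps=1e-5):
    """Plain gradient descent on (mu, log sigma) with finite-difference gradients."""
    params = np.array(init, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(params)
        for i in range(params.size):
            step = np.zeros_like(params)
            step[i] = eps
            grad[i] = (objective(params + step) - objective(params - step)) / (2.0 * eps)
        params -= lr * grad
    return float(params[0]), float(np.exp(params[1]))

mu_f, sigma_f = fit(forward_kl)
mu_r, sigma_r = fit(reverse_kl)
print(f"forward KL fit: mu = {mu_f:.3f}, sigma = {sigma_f:.3f}")
print(f"reverse KL fit: mu = {mu_r:.3f}, sigma = {sigma_r:.3f}")
```

For a Gaussian family, minimizing forward KL amounts to matching the mean and variance of the target, so the forward fit should settle near mean 0 with variance 0.5² + 2² = 4.25, covering both modes. The reverse fit, started at mean 1, should collapse onto the nearby mode at 2 with a width close to 0.5. Reverse KL is not convex in these parameters, so a different initialization can land on the other mode or on a broader local solution; trying a few starting points is part of the experiment.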

Key Concepts Practiced

By the end you will have seen concretely why forward KL produces blurry means and reverse KL produces sharp modes, why cross-entropy training is equivalent to MLE, and how the Beta distribution's pseudo-count interpretation makes Bayesian updates intuitive.
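
The pseudo-count interpretation can be made concrete with a short sketch of sequential Beta updates. The helper name `beta_updates`, the simulated coin with bias 0.7, the seed, and the number of flips are all assumptions chosen for illustration; the update rule itself is the standard Beta-Bernoulli conjugate update, where each head adds 1 to α and each tail adds 1 to β.

```python
import numpy as np

def beta_updates(flips, alpha=1.0, beta=1.0):
    """Return the (alpha, beta) pair after each flip, starting from the prior.

    flips: iterable of 0/1 outcomes (1 = heads, 0 = tails).
    """
    history = [(alpha, beta)]
    for flip in flips:
        alpha += float(flip)        # each head is one more alpha pseudo-count
        beta += 1.0 - float(flip)   # each tail is one more beta pseudo-count
        history.append((alpha, beta))
    return history

def beta_mean(a, b):
    return a / (a + b)

def beta_variance(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

# Hypothetical data: 200 flips of a coin with true bias 0.7.
rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=200)

history = beta_updates(flips)       # starts from Beta(1, 1), the uniform prior
a_final, b_final = history[-1]
mle = float(flips.mean())

for n in (0, 10, 50, 200):
    a, b = history[n]
    print(f"n = {n:3d}: Beta({a:.0f}, {b:.0f}), "
          f"mean = {beta_mean(a, b):.3f}, var = {beta_variance(a, b):.5f}")
print(f"sample mean (MLE) = {mle:.3f}")
```

With h heads in n flips, the posterior is Beta(1 + h, 1 + n - h) and its mean is (h + 1) / (n + 2), which approaches the sample mean h / n as n grows while the variance shrinks. Plotting the density for each entry in `history` gives the concentration picture described in the objectives.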