Lab 3: Entropy, KL Divergence, and Fitting Distributions
Information-theoretic quantities — entropy, cross-entropy, KL divergence — appear in loss functions, regularization terms, and training objectives throughout deep learning. This lab implements them from scratch, verifies their properties, and demonstrates the mode-seeking vs mean-seeking behavior that determines whether a model learns a sharp or blurry approximation to data.
What You'll Build
- An entropy calculator: compute $H(p) = -\sum_x p(x) \log_2 p(x)$ for discrete distributions, verify that the uniform distribution over $n$ outcomes maximizes entropy at $\log_2 n$ bits, and plot entropy as a function of bias for a Bernoulli variable
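- A KL divergence library: implement $D_{\mathrm{KL}}(p \,\|\, q)$ for discrete and continuous distributions; verify non-negativity (always $\ge 0$), confirm $D_{\mathrm{KL}}(p \,\|\, p) = 0$, and demonstrate asymmetry, $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$, with a numerical example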
- A cross-entropy decomposition: show that $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ holds numerically, and confirm that minimizing cross-entropy over $q$ is equivalent to minimizing $D_{\mathrm{KL}}(p \,\|\, q)$ when $p$ is fixed (i.e., training with cross-entropy loss = minimizing forward KL)
- A forward vs reverse KL fitting experiment: given a bimodal target distribution $p$ (mixture of two Gaussians), fit a single Gaussian $q$ by minimizing (a) $D_{\mathrm{KL}}(p \,\|\, q)$ (forward KL) and (b) $D_{\mathrm{KL}}(q \,\|\, p)$ (reverse KL) via gradient descent; overlay the fitted curves on $p$ and observe mean-seeking vs mode-seeking behavior
- A Beta–Dirichlet posterior update: implement Bayesian updating with a Beta prior, observe how the posterior mean, starting from a $\mathrm{Beta}(1, 1)$ (uniform) prior, moves toward the MLE as data accumulates, and connect the $\alpha$ / $\beta$ pseudo-counts to the sample mean
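As a starting point for the first three items, the discrete quantities can be sketched in a few lines of plain Python. This is a minimal sketch, not the lab's reference solution: the function names (`entropy`, `cross_entropy`, `kl_divergence`, `bernoulli_entropy`) and the example distributions `p` and `q` are assumptions made here for illustration, and everything is measured in bits.

```python
import math

def entropy(p):
    """Shannon entropy in bits, with 0 * log(0) treated as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; infinite if q assigns zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total -= pi * math.log2(qi)
    return total

def kl_divergence(p, q):
    """D_KL(p || q) in bits; infinite if q assigns zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def bernoulli_entropy(theta):
    """Entropy in bits of a coin that lands heads with probability theta."""
    return entropy([theta, 1.0 - theta])

# Illustrative distributions (assumed here, not taken from the lab).
p = [0.7, 0.2, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

print("H(uniform over 4):", entropy([0.25] * 4))
print("KL(p||q):", kl_divergence(p, q), " KL(q||p):", kl_divergence(q, p))
print("H(p,q) - H(p) - KL(p||q):",
      cross_entropy(p, q) - entropy(p) - kl_divergence(p, q))

# Values to plot for the Bernoulli entropy curve.
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(theta, bernoulli_entropy(theta))
```

The zero-mass checks are one possible convention; a library version might instead clip `q` away from zero, which trades exactness for numerical convenience.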
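The forward vs reverse KL experiment can be prototyped without any autodiff library by integrating on a grid and using finite-difference gradients. The sketch below makes several assumptions that are not in the lab text: the target is an equal-weight mixture of $N(-3, 1)$ and $N(3, 1)$, the integrals are Riemann sums on $[-12, 12]$, the fitted Gaussian is parameterized by its mean and log standard deviation, and the starting point (`mu=2.0`) is deliberately placed closer to the right-hand mode because reverse KL is sensitive to initialization.

```python
import math

HALF_LOG_2PI = 0.5 * math.log(2.0 * math.pi)

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Assumed target p: equal-weight mixture of N(-3, 1) and N(3, 1) on a grid.
DX = 0.1
XS = [-12.0 + DX * i for i in range(241)]
P = [0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0) for x in XS]
LOG_P = [math.log(v) for v in P]

def log_q(x, mu, log_sigma):
    """Log-density of the single Gaussian q, computed analytically."""
    z = (x - mu) * math.exp(-log_sigma)
    return -0.5 * z * z - log_sigma - HALF_LOG_2PI

def forward_kl(mu, log_sigma):
    """Riemann-sum estimate of D_KL(p || q)."""
    return DX * sum(p * (lp - log_q(x, mu, log_sigma))
                    for x, p, lp in zip(XS, P, LOG_P))

def reverse_kl(mu, log_sigma):
    """Riemann-sum estimate of D_KL(q || p)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_q(x, mu, log_sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

def fit(loss, mu=2.0, log_sigma=0.0, lr=0.1, steps=600, eps=1e-5):
    """Plain gradient descent using central finite-difference gradients."""
    for _ in range(steps):
        g_mu = (loss(mu + eps, log_sigma) - loss(mu - eps, log_sigma)) / (2 * eps)
        g_ls = (loss(mu, log_sigma + eps) - loss(mu, log_sigma - eps)) / (2 * eps)
        mu -= lr * g_mu
        log_sigma -= lr * g_ls
    return mu, math.exp(log_sigma)

mu_f, sigma_f = fit(forward_kl)
mu_r, sigma_r = fit(reverse_kl)
print(f"forward KL fit: mu={mu_f:.3f}, sigma={sigma_f:.3f}")
print(f"reverse KL fit: mu={mu_r:.3f}, sigma={sigma_r:.3f}")
```

Under these assumptions the forward fit should settle near the mixture's overall mean with a wide standard deviation (it matches the first two moments of $p$), while the reverse fit should lock onto the mode nearest its starting point. Starting the reverse fit at a different mean can lead to the other mode or to a broad solution between them, which is worth trying as part of the experiment.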
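For the Beta posterior update, conjugacy reduces Bayesian updating to adding counts: a $\mathrm{Beta}(\alpha, \beta)$ prior combined with $h$ successes and $t$ failures gives a $\mathrm{Beta}(\alpha + h, \beta + t)$ posterior. The sketch below assumes Bernoulli coin-flip data with a hypothetical true bias of 0.7; the helper names `beta_update` and `beta_mean` are illustrative, not part of the lab.

```python
import random

def beta_update(alpha, beta, observations):
    """Conjugate update: add successes to alpha and failures to beta."""
    heads = sum(observations)
    tails = len(observations) - heads
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

random.seed(0)
TRUE_THETA = 0.7  # hypothetical bias, chosen only for this illustration
flips = [1 if random.random() < TRUE_THETA else 0 for _ in range(1000)]

prior_alpha, prior_beta = 1.0, 1.0  # Beta(1, 1) is the uniform prior
for n in (1, 10, 100, 1000):
    a, b = beta_update(prior_alpha, prior_beta, flips[:n])
    mle = sum(flips[:n]) / n
    print(f"n={n:4d}  posterior=Beta({a:.0f}, {b:.0f})  "
          f"posterior mean={beta_mean(a, b):.4f}  MLE={mle:.4f}")
```

The posterior mean is $(\alpha + h) / (\alpha + \beta + n)$, so the prior behaves like $\alpha + \beta$ extra observations with $\alpha$ of them successes. With the uniform prior, the gap between the posterior mean and the MLE $h / n$ is at most $1 / (n + 2)$, which is why the two agree once $n$ is large.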
Key Concepts Practiced
By the end you will have seen concretely why forward KL produces blurry means and reverse KL produces sharp modes, why cross-entropy training is equivalent to MLE, and how the Beta distribution's pseudo-count interpretation makes Bayesian updates intuitive.
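The link between cross-entropy training and MLE can be written out in one line. The notation below is an assumption introduced for illustration: $x_1, \dots, x_N$ are i.i.d. samples, $\hat{p}$ is their empirical distribution, and $q_\theta$ is the model.

$$
H(\hat{p}, q_\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log q_\theta(x_i) = H(\hat{p}) + D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta)
$$

Since $H(\hat{p})$ does not depend on $\theta$, the three problems share the same solution:

$$
\arg\min_\theta H(\hat{p}, q_\theta) = \arg\min_\theta D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta) = \arg\max_\theta \sum_{i=1}^{N} \log q_\theta(x_i)
$$

The right-hand side is the maximum-likelihood objective, so minimizing cross-entropy against the data is the same as minimizing forward KL from the empirical distribution to the model.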