Prerequisite · Probability Foundations

Key Distributions for Machine Learning

By the end of this reading you will be able to:
  • State the PMF or PDF, mean, and variance for the Bernoulli, Binomial, Categorical, and Multinomial distributions and identify the ML context where each appears
  • Explain the Gaussian distribution's central role via the Central Limit Theorem and extend it to the multivariate case, identifying what the covariance matrix encodes
  • Distinguish the Beta distribution (distribution over a single probability) from the Dirichlet distribution (distribution over a probability vector) and explain their role as conjugate priors
  • Identify the appropriate distribution family for a given ML modeling assumption: binary outcomes, class probabilities, continuous observations, count data, or categorical mixing weights

Why Distributions Matter

Every probabilistic ML model is built from named distributions. Understanding which distribution captures which kind of uncertainty — and what its parameters control — is essential for reading papers, designing models, and debugging training.


Discrete Distributions

Bernoulli($\theta$)

The simplest random variable: a single binary outcome.

$$p(x) = \theta^x (1-\theta)^{1-x}, \quad x \in \{0, 1\}$$

  • Parameters: $\theta \in [0,1]$, the probability of $x=1$
  • Mean: $\theta$
  • Variance: $\theta(1-\theta)$, maximized at $\theta = 0.5$
  • ML uses: binary classification output, individual pixel in a binary image model, gate activation
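
As a minimal sketch of the formulas above (standard library only; the function names are illustrative, not from any particular library), the snippet below evaluates the Bernoulli PMF, checks on a grid where the variance peaks, and shows that the negative log-likelihood of a single label is the quantity usually called binary cross-entropy.

```python
import math

def bernoulli_pmf(x: int, theta: float) -> float:
    """p(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return theta ** x * (1.0 - theta) ** (1 - x)

def bernoulli_variance(theta: float) -> float:
    """Variance theta * (1 - theta)."""
    return theta * (1.0 - theta)

def binary_cross_entropy(x: int, theta: float) -> float:
    """Negative log-likelihood of one label x under Bernoulli(theta)."""
    return -math.log(bernoulli_pmf(x, theta))

theta = 0.8  # illustrative predicted probability of the positive class
print("p(1), p(0):", bernoulli_pmf(1, theta), bernoulli_pmf(0, theta))

# Scan theta over a grid and report the (variance, theta) pair with the largest variance.
best_variance, best_theta = max((bernoulli_variance(t / 100), t / 100) for t in range(101))
print("largest variance on the grid:", best_variance, "at theta =", best_theta)

print("loss when the label is 1:", binary_cross_entropy(1, theta))
print("loss when the label is 0:", binary_cross_entropy(0, theta))
```

The loss is small when the predicted probability agrees with the label and grows without bound as the prediction approaches the wrong extreme, which is why a classifier trained this way is pushed toward calibrated values of $\theta$.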

Binomial($n$, $\theta$)

Sum of $n$ independent Bernoulli($\theta$) trials: the number of successes.

$$p(k) = \binom{n}{k} \theta^k (1-\theta)^{n-k}, \quad k \in \{0,1,\ldots,n\}$$

  • Parameters: $n$ (number of trials), $\theta$ (success probability)
  • Mean: $n\theta$
  • Variance: $n\theta(1-\theta)$
  • Bernoulli is the special case $n=1$
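
The mean and variance formulas can be checked directly against the PMF. The following is a small sketch, with $n = 10$ and $\theta = 0.3$ chosen purely for illustration, that enumerates every value of $k$ and computes the moments by summation.

```python
import math

def binomial_pmf(k: int, n: int, theta: float) -> float:
    """p(k) = C(n, k) * theta^k * (1 - theta)^(n - k)."""
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)

n, theta = 10, 0.3  # illustrative values
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

total = sum(pmf)                                            # should be 1
mean = sum(k * p for k, p in enumerate(pmf))                # should match n * theta
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))  # should match n * theta * (1 - theta)

print("sum of PMF:", total)
print("mean:", mean, "vs n*theta =", n * theta)
print("variance:", variance, "vs n*theta*(1-theta) =", n * theta * (1 - theta))
```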

Categorical($\boldsymbol{\pi}$)

Generalization of Bernoulli to $K$ outcomes. The probability of outcome $k$ is $\pi_k$.

$$p(x = k) = \pi_k, \quad k \in \{1,\ldots,K\}, \quad \sum_{k=1}^K \pi_k = 1$$

  • Parameters: probability vector $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ with $\pi_k \geq 0$, $\sum_k \pi_k = 1$
  • ML uses: multi-class classification (the softmax output defines $\boldsymbol{\pi}$), sampling tokens from a language model, action selection in RL
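
To make the softmax connection concrete, here is a hedged sketch: hypothetical logits are mapped to a probability vector and outcomes are drawn from it by inverting the cumulative distribution. The logit values and helper names are assumptions for illustration, and the code uses 0-based outcome indices where the text uses $1, \ldots, K$.

```python
import math
import random

def softmax(logits):
    """Map real-valued scores to a probability vector (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(pi, rng):
    """Draw one outcome index in {0, ..., K-1} by inverting the CDF."""
    u = rng.random()
    cumulative = 0.0
    for k, p in enumerate(pi):
        cumulative += p
        if u < cumulative:
            return k
    return len(pi) - 1  # guard against floating-point round-off

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # hypothetical classifier scores
pi = softmax(logits)

draws = [sample_categorical(pi, rng) for _ in range(10_000)]
frequencies = [draws.count(k) / len(draws) for k in range(len(pi))]

print("pi:         ", [round(p, 3) for p in pi])
print("frequencies:", [round(f, 3) for f in frequencies])
```

With enough draws the empirical frequencies approach $\boldsymbol{\pi}$, which is the sense in which token sampling from a language model is repeated sampling from a Categorical.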

Multinomial($n$, $\boldsymbol{\pi}$)

Generalization of Binomial: $n$ independent draws from a Categorical($\boldsymbol{\pi}$). Records the count vector $(c_1, \ldots, c_K)$, where $c_k$ is the number of times outcome $k$ occurred and $\sum_k c_k = n$.

$$p(c_1, \ldots, c_K) = \frac{n!}{c_1! \cdots c_K!} \pi_1^{c_1} \cdots \pi_K^{c_K}$$

  • ML uses: word count models (bag-of-words), topic models (Latent Dirichlet Allocation)
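
A short sketch of the Multinomial PMF and of drawing a count vector follows. The three-word vocabulary probabilities are made up for illustration; the sampler simply tallies $n$ Categorical draws, which mirrors how a bag-of-words vector summarizes a document.

```python
import math
import random

def multinomial_pmf(counts, pi):
    """p(c_1..c_K) = n! / (c_1! ... c_K!) * prod_k pi_k^c_k, with n = sum of counts."""
    n = sum(counts)
    coefficient = math.factorial(n)
    for c in counts:
        coefficient //= math.factorial(c)  # stays an exact integer at every step
    prob = float(coefficient)
    for c, p in zip(counts, pi):
        prob *= p ** c
    return prob

def sample_multinomial(n, pi, rng):
    """Tally n independent Categorical(pi) draws into a count vector."""
    counts = [0] * len(pi)
    for k in rng.choices(range(len(pi)), weights=pi, k=n):
        counts[k] += 1
    return counts

rng = random.Random(0)
pi = [0.2, 0.3, 0.5]  # hypothetical word probabilities
counts = sample_multinomial(20, pi, rng)

print("sampled counts:", counts, "sum =", sum(counts))
print("p([1, 1, 1]):", multinomial_pmf([1, 1, 1], pi))
# With K = 2 the formula reduces to the Binomial PMF.
print("p([3, 7]) with pi = [0.3, 0.7]:", multinomial_pmf([3, 7], [0.3, 0.7]))
```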

Continuous Distributions

Gaussian (Normal): $\mathcal{N}(\mu, \sigma^2)$

The most important distribution in ML and statistics.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

  • Parameters: mean $\mu$ (location), variance $\sigma^2$ (spread)
  • Mean: $\mu$
  • Variance: $\sigma^2$

Why it appears everywhere: The Central Limit Theorem states that the standardized sum (or average) of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian as the number of terms grows. This explains why measurement errors, additive noise, and empirical averages are often well modeled by Gaussians.
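
The theorem can be illustrated numerically. The sketch below (the sample sizes and the choice of Uniform(0, 1) are illustrative assumptions) averages 30 uniform draws, standardizes the average using the known mean $1/2$ and variance $1/12$, and compares the result with what a standard Gaussian would give.

```python
import math
import random
import statistics

def standardized_mean(n, rng):
    """Standardize the mean of n Uniform(0, 1) draws: (mean - mu) / (sigma / sqrt(n))."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)  # mean and std of Uniform(0, 1)
    sample_mean = sum(rng.random() for _ in range(n)) / n
    return (sample_mean - mu) / (sigma / math.sqrt(n))

rng = random.Random(0)
z = [standardized_mean(30, rng) for _ in range(20_000)]

z_mean = statistics.fmean(z)
z_std = statistics.pstdev(z)
within_one_sigma = sum(abs(v) < 1.0 for v in z) / len(z)

print("mean of standardized averages:", round(z_mean, 3))    # a standard Gaussian has mean 0
print("std of standardized averages: ", round(z_std, 3))     # and standard deviation 1
print("fraction within one sigma:    ", round(within_one_sigma, 3))  # and about 0.683 here
```

Although each individual draw is uniform, not bell-shaped, the standardized averages already behave much like a standard Gaussian at this modest sample size.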

Properties useful for ML:

  • Closed under linear transformations: if $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
  • Sum of independent Gaussians is Gaussian: $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
  • Maximum entropy: among all distributions on $\mathbb{R}$ with a given mean and variance, the Gaussian has the largest differential entropy, so it is the least committal choice when only those two moments are known
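
The first two properties can be sanity-checked by simulation. This sketch only verifies the means and variances of the transformed samples (it does not test Gaussianity itself), and all parameter values are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)
mu_x, sigma_x = 1.0, 2.0    # X ~ N(1, 2^2); random.gauss takes the standard deviation
mu_y, sigma_y = -3.0, 1.5   # Y ~ N(-3, 1.5^2), drawn independently of X
a, b = 0.5, 4.0
n = 50_000

x = [rng.gauss(mu_x, sigma_x) for _ in range(n)]
y = [rng.gauss(mu_y, sigma_y) for _ in range(n)]

linear = [a * xi + b for xi in x]            # expected: mean a*mu_x + b, variance a^2 * sigma_x^2
total = [xi + yi for xi, yi in zip(x, y)]    # expected: mean mu_x + mu_y, variance sigma_x^2 + sigma_y^2

linear_mean, linear_var = statistics.fmean(linear), statistics.pvariance(linear)
total_mean, total_var = statistics.fmean(total), statistics.pvariance(total)

print("aX + b: mean", round(linear_mean, 3), "variance", round(linear_var, 3))
print("X + Y:  mean", round(total_mean, 3), "variance", round(total_var, 3))
```

Note that variances add for independent variables while standard deviations do not, which is a common source of slips when combining noise terms.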

Multivariate Gaussian: $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

Extension to a vector $\mathbf{x} \in \mathbb{R}^d$:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

  • Parameters: mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$, covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ (symmetric, positive definite)
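  • The covariance matrix encodes the shape and orientation of the distribution's ellipsoidal contours: its diagonal entries are the per-dimension variances, its off-diagonal entries are the pairwise covariances, and its eigenvectors and eigenvalues give the axes of the ellipsoids and the variance along each axis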
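
For intuition, here is a two-dimensional sketch written with the standard library only. It samples via a hand-written Cholesky factor and evaluates the log-density with the explicit $2 \times 2$ inverse; the mean and covariance values are arbitrary, and in practice a numerical library would handle general $d$.

```python
import math
import random

def cholesky_2x2(sigma):
    """Lower-triangular L with L L^T = sigma, for a 2x2 symmetric positive definite matrix."""
    (s11, s12), (_, s22) = sigma
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]

def sample_mvn_2d(mu, sigma, rng):
    """x = mu + L z with z ~ N(0, I) has mean mu and covariance sigma."""
    L = cholesky_2x2(sigma)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + L[0][0] * z1,
            mu[1] + L[1][0] * z1 + L[1][1] * z2)

def mvn_logpdf_2d(x, mu, sigma):
    """Log of the density formula above, specialized to d = 2."""
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 ** 2
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via the explicit 2x2 inverse.
    quad = (s22 * d0 ** 2 - 2.0 * s12 * d0 * d1 + s11 * d1 ** 2) / det
    return -0.5 * quad - math.log(2.0 * math.pi) - 0.5 * math.log(det)

rng = random.Random(0)
mu = (1.0, -1.0)                    # illustrative mean vector
sigma = [[2.0, 1.2], [1.2, 1.0]]    # illustrative covariance (positive off-diagonal: tilted ellipses)

samples = [sample_mvn_2d(mu, sigma, rng) for _ in range(20_000)]
mean0 = sum(s[0] for s in samples) / len(samples)
mean1 = sum(s[1] for s in samples) / len(samples)
cov01 = sum((s[0] - mean0) * (s[1] - mean1) for s in samples) / len(samples)

print("empirical mean:", round(mean0, 3), round(mean1, 3))
print("empirical covariance between the two coordinates:", round(cov01, 3))
print("log-density at the mean:", mvn_logpdf_2d(mu, mu, sigma))
```

The empirical covariance between the two coordinates should land close to the off-diagonal entry of the covariance matrix, which is the quantity that tilts the contours away from the coordinate axes.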

ML uses: Gaussian noise models, VAE latent prior $p(z) = \mathcal{N}(0, I)$, Gaussian process priors, 3D Gaussian Splatting (the Gaussians are MVNs with learnable covariance)


Beta and Dirichlet: Priors Over Probabilities

Beta($\alpha$, $\beta$)

The Beta distribution is a distribution over the interval $[0,1]$, making it the natural prior over a probability parameter.

$$f(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}, \quad \theta \in [0,1]$$

where $B(\alpha,\beta) = \int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1} \, d\theta$ is the normalizing constant.

  • Parameters: $\alpha, \beta > 0$, which can be thought of as pseudo-counts of successes and failures
  • Mean: $\alpha/(\alpha+\beta)$
  • Shape: $\alpha = \beta = 1$ is Uniform; $\alpha, \beta > 1$ is unimodal; $\alpha, \beta < 1$ is U-shaped

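Conjugate prior for Bernoulli: If $\theta \sim \text{Beta}(\alpha,\beta)$ and you observe $n_1$ successes and $n_0$ failures, the posterior is $\text{Beta}(\alpha + n_1,\, \beta + n_0)$, the same family with updated counts. This follows because the likelihood $\theta^{n_1}(1-\theta)^{n_0}$ multiplies the prior kernel $\theta^{\alpha-1}(1-\theta)^{\beta-1}$ to give $\theta^{\alpha+n_1-1}(1-\theta)^{\beta+n_0-1}$. This is the defining property of a conjugate prior: the posterior has the same distributional form as the prior.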
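
The update rule is simple enough to sketch in a few lines. In the example below the "true" success probability, the prior, and the number of simulated flips are all hypothetical values chosen for illustration; the point is only that the posterior parameters are the prior pseudo-counts plus the observed counts.

```python
import random

def beta_posterior(alpha, beta, observations):
    """Beta(alpha, beta) prior plus 0/1 observations -> Beta(alpha + n1, beta + n0)."""
    n1 = sum(observations)           # number of successes
    n0 = len(observations) - n1      # number of failures
    return alpha + n1, beta + n0

def beta_mean(alpha, beta):
    """Mean alpha / (alpha + beta)."""
    return alpha / (alpha + beta)

rng = random.Random(0)
true_theta = 0.7                     # hypothetical value used only to simulate data
flips = [1 if rng.random() < true_theta else 0 for _ in range(200)]

prior = (2.0, 2.0)                   # weak prior centered on 0.5
post = beta_posterior(*prior, flips)

print("prior mean:    ", beta_mean(*prior))
print("posterior:      Beta", post)
print("posterior mean:", round(beta_mean(*post), 3))

# Posterior samples of theta, drawn with the standard library's Beta sampler.
posterior_draws = [rng.betavariate(*post) for _ in range(5)]
print("posterior draws:", [round(d, 3) for d in posterior_draws])
```

As more data arrive the observed counts dominate the pseudo-counts, so the posterior mean moves from the prior mean toward the empirical success rate.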

Dirichlet($\boldsymbol{\alpha}$)

The Dirichlet is the multivariate generalization of the Beta: a distribution over the probability simplex $\{\boldsymbol{\pi}: \pi_k \geq 0, \sum_{k=1}^K \pi_k = 1\}$ of $K$-component probability vectors.

$$f(\boldsymbol{\pi}) \propto \prod_{k=1}^K \pi_k^{\alpha_k - 1}$$

  • Parameters: concentration vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ with $\alpha_k > 0$
  • Mean: $\mathbb{E}[\pi_k] = \alpha_k / \sum_j \alpha_j$
  • Symmetric case: $\alpha_k = \alpha$ for all $k$; $\alpha < 1$ favors sparse vectors (mass near the corners of the simplex); $\alpha = 1$ is uniform over the simplex; $\alpha > 1$ concentrates mass near the center

Conjugate prior for Categorical/Multinomial: If $\boldsymbol{\pi} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ and you observe counts $(c_1, \ldots, c_K)$, the posterior is $\text{Dirichlet}(\alpha_1 + c_1, \ldots, \alpha_K + c_K)$.
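
The following sketch shows the count update and the effect of the concentration parameter. Sampling uses the standard construction of normalizing independent Gamma draws; the counts, the concentration values, and the helper names are illustrative assumptions.

```python
import random

def dirichlet_posterior(alpha, counts):
    """Dirichlet(alpha) prior plus observed counts -> Dirichlet(alpha + counts)."""
    return [a + c for a, c in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """E[pi_k] = alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

def sample_dirichlet(alpha, rng):
    """Normalize independent Gamma(alpha_k, 1) draws to get a point on the simplex."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def average_largest_weight(alpha, rng, draws=2_000):
    """Average of max_k pi_k over samples: near 1 means sparse, near 1/K means balanced."""
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(draws)) / draws

rng = random.Random(0)

posterior = dirichlet_posterior([1.0, 1.0, 1.0], [3, 0, 5])   # hypothetical counts
print("posterior parameters:", posterior)
print("posterior mean:", [round(m, 3) for m in dirichlet_mean(posterior)])

sparse = average_largest_weight([0.1, 0.1, 0.1], rng)     # symmetric, alpha < 1
dense = average_largest_weight([10.0, 10.0, 10.0], rng)   # symmetric, alpha > 1
print("average largest weight, alpha = 0.1:", round(sparse, 3))
print("average largest weight, alpha = 10: ", round(dense, 3))
```

With a small symmetric concentration most samples put nearly all their mass on one component, while a large concentration yields vectors close to uniform.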

ML uses: LDA (Latent Dirichlet Allocation) uses Dirichlet priors over topic distributions; Think Bayes Ch. 18 covers the Dirichlet-multinomial model in detail.


Quick Reference Table

| Distribution | Type | Parameters | Mean | Variance | ML use |
|---|---|---|---|---|---|
| Bernoulli($\theta$) | Discrete | $\theta \in [0,1]$ | $\theta$ | $\theta(1-\theta)$ | Binary classification |
| Categorical($\boldsymbol{\pi}$) | Discrete | $\boldsymbol{\pi}$ on simplex | $\pi_k$ | | Multi-class output |
| Gaussian($\mu,\sigma^2$) | Continuous | $\mu \in \mathbb{R},\, \sigma^2 > 0$ | $\mu$ | $\sigma^2$ | Regression, noise, VAE latent |
| MVN($\boldsymbol{\mu},\boldsymbol{\Sigma}$) | Continuous | mean + covariance | $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Gaussian processes, 3DGS |
| Beta($\alpha,\beta$) | Continuous | $\alpha,\beta > 0$ | $\frac{\alpha}{\alpha+\beta}$ | | Prior over $\theta$ |
| Dirichlet($\boldsymbol{\alpha}$) | Continuous | $\boldsymbol{\alpha} > 0$ | $\alpha_k / \sum_j \alpha_j$ | | Prior over $\boldsymbol{\pi}$, LDA |