Prerequisite · Probability Foundations

Key Distributions for Machine Learning


By the end of this reading you will be able to:

State the PMF or PDF, mean, and variance for the Bernoulli, Binomial, Categorical, and Multinomial distributions and identify the ML context where each appears
Explain the Gaussian distribution's central role via the Central Limit Theorem and extend it to the multivariate case, identifying what the covariance matrix encodes
Distinguish the Beta distribution (distribution over a single probability) from the Dirichlet distribution (distribution over a probability vector) and explain their role as conjugate priors
Identify the appropriate distribution family for a given ML modeling assumption: binary outcomes, class probabilities, continuous observations, count data, or categorical mixing weights

Why Distributions Matter

Every probabilistic ML model is built from named distributions. Understanding which distribution captures which kind of uncertainty — and what its parameters control — is essential for reading papers, designing models, and debugging training.

Discrete Distributions

Bernoulli($\theta$)

The simplest random variable: a single binary outcome.

$p(x) = \theta^x (1-\theta)^{1-x}, \quad x \in \{0, 1\}$

Parameters: $\theta \in [0,1]$ — the probability of $x=1$
Mean: $\theta$
Variance: $\theta(1-\theta)$ — maximized at $\theta = 0.5$
ML uses: binary classification output, an individual pixel in a binary image model, stochastic binary gates such as dropout masks
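
The link to binary classification can be made concrete: the negative log of the Bernoulli PMF is the binary cross-entropy loss for a single label. The sketch below uses plain Python; the function names `bernoulli_pmf` and `bernoulli_nll` and the value $\theta = 0.9$ are illustrative assumptions, not part of any library.

```python
import math

def bernoulli_pmf(x: int, theta: float) -> float:
    """p(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return theta ** x * (1.0 - theta) ** (1 - x)

def bernoulli_nll(x: int, theta: float) -> float:
    """Negative log-likelihood, i.e. binary cross-entropy for one label."""
    return -(x * math.log(theta) + (1 - x) * math.log(1.0 - theta))

# A classifier that predicts theta = 0.9 pays a small loss when the label
# is 1 and a much larger loss when the label is 0.
loss_pos = bernoulli_nll(1, 0.9)
loss_neg = bernoulli_nll(0, 0.9)
```

Minimizing the average of `bernoulli_nll` over a dataset is the same as maximizing the Bernoulli likelihood of the labels.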

Binomial($n$, $\theta$)

Sum of $n$ independent Bernoulli( $\theta$ ) trials: the number of successes.

$p(k) = \binom{n}{k} \theta^k (1-\theta)^{n-k}, \quad k \in \{0,1,\ldots,n\}$

Parameters: $n$ (number of trials), $\theta$ (success probability)
Mean: $n\theta$
Variance: $n\theta(1-\theta)$
Bernoulli is the special case $n=1$
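
As a quick numerical check of the formulas above, the following sketch evaluates the Binomial PMF for every $k$ and recovers the mean and variance by direct summation. The values $n = 10$ and $\theta = 0.3$ are arbitrary choices for illustration.

```python
import math

def binomial_pmf(k: int, n: int, theta: float) -> float:
    """p(k) = C(n, k) * theta^k * (1 - theta)^(n - k)."""
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)

n, theta = 10, 0.3
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

total = sum(pmf)                                           # should equal 1
mean = sum(k * p for k, p in enumerate(pmf))               # should equal n * theta
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))  # n * theta * (1 - theta)
```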

Categorical($\boldsymbol{\pi}$)

Generalization of Bernoulli to $K$ outcomes. The probability of outcome $k$ is $\pi_k$.

$p(x = k) = \pi_k, \quad k \in \{1,\ldots,K\}, \quad \sum_{k=1}^K \pi_k = 1$

Parameters: probability vector $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ with $\pi_k \geq 0$, $\sum_k \pi_k = 1$
Mean and variance (of the one-hot indicator $\mathbb{1}[x = k]$): $\pi_k$ and $\pi_k(1-\pi_k)$
ML uses: multi-class classification (the softmax output defines $\boldsymbol{\pi}$), sampling tokens from a language model, action selection in RL
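
A minimal sketch of how a softmax output defines $\boldsymbol{\pi}$ and how outcomes are then sampled. The logits are made-up scores, the helper names are hypothetical, and the code indexes outcomes from 0 rather than from 1 as in the formula above.

```python
import math
import random

def softmax(logits):
    """Map real-valued scores to a probability vector pi (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(pi, rng):
    """Draw one outcome index in {0, ..., K-1} with probabilities pi."""
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]

rng = random.Random(0)           # fixed seed so the run is repeatable
logits = [2.0, 1.0, 0.1]         # hypothetical classifier scores
pi = softmax(logits)

# Empirical frequencies over many draws should approach pi.
draws = [sample_categorical(pi, rng) for _ in range(10_000)]
freqs = [draws.count(k) / len(draws) for k in range(len(pi))]
```

Subtracting the maximum logit before exponentiating leaves the result unchanged but avoids overflow for large scores.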

Multinomial($n$, $\boldsymbol{\pi}$)

Generalization of Binomial: $n$ independent draws from a Categorical($\boldsymbol{\pi}$). It records the count vector $(c_1, \ldots, c_K)$, where $c_k$ is the number of times outcome $k$ occurred.

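$p(c_1, \ldots, c_K) = \frac{n!}{c_1! \cdots c_K!} \pi_1^{c_1} \cdots \pi_K^{c_K}, \quad c_k \in \{0,1,\ldots,n\}, \quad \sum_{k=1}^K c_k = n$

Mean: $\mathbb{E}[c_k] = n\pi_k$
Variance: $\operatorname{Var}(c_k) = n\pi_k(1-\pi_k)$, with $\operatorname{Cov}(c_j, c_k) = -n\pi_j\pi_k$ for $j \neq k$
Binomial is the special case $K=2$
ML uses: word count models (bag-of-words), topic models (Latent Dirichlet Allocation)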
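
The PMF and the "counts over $n$ categorical draws" description can be sketched in plain Python as follows. The probability vector and counts are illustrative, and the helper names are assumptions rather than a library API.

```python
import math
import random

def multinomial_pmf(counts, pi):
    """p(c_1..c_K) = n! / (c_1! ... c_K!) * prod_k pi_k^c_k."""
    n = sum(counts)
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)   # stays an exact integer at every step
    prob = float(coef)
    for c, p in zip(counts, pi):
        prob *= p ** c
    return prob

def sample_multinomial(n, pi, rng):
    """Count outcomes over n independent Categorical(pi) draws."""
    counts = [0] * len(pi)
    for k in rng.choices(range(len(pi)), weights=pi, k=n):
        counts[k] += 1
    return counts

pi = [0.5, 0.3, 0.2]
p_example = multinomial_pmf([2, 1, 1], pi)   # 4!/(2! 1! 1!) * 0.25 * 0.3 * 0.2
counts = sample_multinomial(100, pi, random.Random(0))
```

In a bag-of-words model, `counts` plays the role of the word-count vector of one document and `pi` the word distribution it was drawn from.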

Continuous Distributions

Gaussian (Normal) — $\mathcal{N}(\mu, \sigma^2)$

Arguably the most widely used distribution in ML and statistics.

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Parameters: mean $\mu$ (location), variance $\sigma^2$ (spread)
Mean: $\mu$ — Variance: $\sigma^2$

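Why it appears everywhere: The Central Limit Theorem states that the suitably standardized sum (subtract the mean, divide by the standard deviation) of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian as the number of terms grows. This helps explain why measurement errors, additive noise, and empirical averages are often well modeled by Gaussians.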
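
The theorem can be checked empirically. The sketch below sums $n = 30$ draws from Uniform(0, 1), which is far from Gaussian, standardizes each sum, and compares summary statistics with those of a standard Gaussian. The choice of distribution, $n$, and the number of repetitions are arbitrary assumptions for illustration.

```python
import math
import random

rng = random.Random(0)

def standardized_sum(n):
    """Sum n Uniform(0, 1) draws, then subtract the mean and divide by the std."""
    s = sum(rng.random() for _ in range(n))
    mean, var = n * 0.5, n / 12.0      # Uniform(0, 1) has mean 1/2, variance 1/12
    return (s - mean) / math.sqrt(var)

samples = [standardized_sum(30) for _ in range(5000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((z - sample_mean) ** 2 for z in samples) / len(samples)
# For a standard Gaussian, about 68.3% of the mass lies within one std of 0.
frac_within_1 = sum(abs(z) <= 1.0 for z in samples) / len(samples)
```

The sample mean and variance should be near 0 and 1, and `frac_within_1` should be near 0.683, even though each individual term is uniform.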

Properties useful for ML:

Closed under linear transformations: $aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
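Sum of independent Gaussians is Gaussian: if $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ are independent, then $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
Maximum entropy: among all distributions on $\mathbb{R}$ with a given mean and variance, the Gaussian has the largest differential entropy, so it is the least committal choice when only those two moments are known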
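
The first two properties can be sanity-checked by simulation. This sketch only compares sample means and variances with the predicted parameters; it does not test that the results are Gaussian. All numeric values are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(1)
N = 20_000

mu_x, sigma_x = 1.0, 2.0      # X ~ N(1, 4)
mu_y, sigma_y = -3.0, 1.5     # Y ~ N(-3, 2.25), drawn independently of X
a, b = 0.5, 4.0

xs = [rng.gauss(mu_x, sigma_x) for _ in range(N)]
ys = [rng.gauss(mu_y, sigma_y) for _ in range(N)]

lin = [a * x + b for x in xs]            # predicted: N(a*mu_x + b, a^2 * sigma_x^2) = N(4.5, 1)
tot = [x + y for x, y in zip(xs, ys)]    # predicted: N(mu_x + mu_y, sigma_x^2 + sigma_y^2) = N(-2, 6.25)

lin_mean, lin_var = statistics.fmean(lin), statistics.pvariance(lin)
tot_mean, tot_var = statistics.fmean(tot), statistics.pvariance(tot)
```

Note that `random.gauss` takes the standard deviation, not the variance, as its second argument.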

Multivariate Gaussian — $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

Extension to a vector $\mathbf{x} \in \mathbb{R}^d$:

$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$

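Parameters: mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$, covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ (symmetric, positive definite)
The covariance matrix encodes the shape and orientation of the distribution's ellipsoidal contours: diagonal entries are per-coordinate variances, off-diagonal entries are pairwise covariances, the eigenvectors of $\boldsymbol{\Sigma}$ give the axis directions, and the square roots of its eigenvalues give the relative axis lengths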

ML uses: Gaussian noise models, the VAE latent prior $p(z) = \mathcal{N}(0, I)$, Gaussian process priors, 3D Gaussian Splatting (each scene primitive is a multivariate Gaussian with learnable mean and covariance)
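
To see how the covariance matrix fixes the shape and orientation of the contours, the sketch below works through a two-dimensional case in plain Python. It uses the closed-form eigen-decomposition of a symmetric $2 \times 2$ matrix and a hand-written Cholesky factor for sampling. The covariance values are hypothetical, and for $d > 2$ a linear algebra library would be the usual choice.

```python
import math
import random

# Hypothetical 2D covariance: both coordinates have variance 2, covariance 1.2.
s11, s12, s22 = 2.0, 1.2, 2.0
mu = (0.0, 0.0)

# Closed-form eigenvalues of the symmetric matrix [[s11, s12], [s12, s22]].
half_trace = 0.5 * (s11 + s22)
radius = math.hypot(0.5 * (s11 - s22), s12)
lam_major, lam_minor = half_trace + radius, half_trace - radius
# Angle of the major axis of the contour ellipses, measured from the x-axis.
angle = 0.5 * math.atan2(2.0 * s12, s11 - s22)

# Lower-triangular Cholesky factor L with L L^T = Sigma, used for sampling:
# x = mu + L z with z ~ N(0, I).
l11 = math.sqrt(s11)
l21 = s12 / l11
l22 = math.sqrt(s22 - l21 ** 2)

rng = random.Random(0)

def sample():
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

pts = [sample() for _ in range(20_000)]
mx = sum(p[0] for p in pts) / len(pts)
my = sum(p[1] for p in pts) / len(pts)
emp_cov12 = sum((p[0] - mx) * (p[1] - my) for p in pts) / len(pts)
```

Here the eigenvalues are 3.2 and 0.8 and the major axis lies at 45 degrees, so the contours are ellipses stretched along the diagonal with axis lengths in the ratio $\sqrt{3.2} : \sqrt{0.8} = 2 : 1$.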

Beta and Dirichlet: Priors Over Probabilities

Beta($\alpha$, $\beta$)

The Beta distribution is a distribution over the interval $[0,1]$ — making it the natural prior over a probability parameter.

$f(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}, \quad \theta \in [0,1]$

where $B(\alpha,\beta) = \int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1} d\theta$ is the normalizing constant.

Parameters: $\alpha, \beta > 0$ — can be thought of as pseudo-counts of successes and failures
Mean: $\alpha/(\alpha+\beta)$
Shape: $\alpha = \beta = 1$ is Uniform; $\alpha, \beta > 1$ is unimodal; $\alpha, \beta < 1$ is U-shaped

Conjugate prior for Bernoulli: If $\theta \sim \text{Beta}(\alpha,\beta)$ and you observe $n_1$ successes and $n_0$ failures in independent Bernoulli($\theta$) trials, the posterior is $\text{Beta}(\alpha + n_1,\, \beta + n_0)$: the same family, with the observed counts added to the pseudo-counts. This is what makes the Beta a conjugate prior for the Bernoulli likelihood: the posterior has the same distributional form as the prior.
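
The update rule is simple enough to write in a few lines. The prior and the data below are made up for illustration, and the function names are hypothetical.

```python
def beta_posterior(alpha: float, beta: float, n1: int, n0: int):
    """Beta(alpha, beta) prior + n1 successes, n0 failures -> Beta posterior."""
    return alpha + n1, beta + n0

def beta_mean(alpha: float, beta: float) -> float:
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Hypothetical data: a Beta(2, 2) prior, then 7 successes and 3 failures.
prior = (2.0, 2.0)
post = beta_posterior(*prior, n1=7, n0=3)     # (9.0, 5.0)
prior_mean = beta_mean(*prior)                # 0.5
post_mean = beta_mean(*post)                  # 9 / 14
mle = 7 / (7 + 3)                             # 0.7, the raw success frequency
```

The posterior mean $9/14 \approx 0.64$ sits between the prior mean 0.5 and the observed frequency 0.7; with more data it moves closer to the observed frequency.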

Dirichlet($\boldsymbol{\alpha}$)

The Dirichlet is the multivariate generalization of the Beta: a distribution over the probability simplex $\{\boldsymbol{\pi} \in \mathbb{R}^K: \pi_k \geq 0, \sum_k \pi_k = 1\}$, which is a $(K-1)$-dimensional set. For $K = 2$ it reduces to the Beta.

$f(\boldsymbol{\pi}) \propto \prod_{k=1}^K \pi_k^{\alpha_k - 1}$

Parameters: concentration vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ with $\alpha_k > 0$
Mean: $\mathbb{E}[\pi_k] = \alpha_k / \sum_j \alpha_j$
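Symmetric case: $\alpha_k = \alpha$ for all $k$; $\alpha < 1$ favors sparse vectors (mass near the corners of the simplex), $\alpha = 1$ is uniform over the simplex, and $\alpha > 1$ concentrates mass near the center $(1/K, \ldots, 1/K)$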
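
The effect of the concentration parameter in the symmetric case can be seen by sampling. The sketch below draws from a Dirichlet by normalizing independent Gamma draws, a standard construction, and reports the average size of the largest component. The values of $K$, $\alpha$, and the number of trials are arbitrary assumptions.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw pi ~ Dirichlet(alphas) by normalizing independent Gamma(alpha_k, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def mean_max_weight(alpha, K, rng, trials=2000):
    """Average of max_k pi_k under a symmetric Dirichlet(alpha, ..., alpha)."""
    return sum(max(sample_dirichlet([alpha] * K, rng)) for _ in range(trials)) / trials

rng = random.Random(0)
K = 5
sparse = mean_max_weight(0.1, K, rng)    # small alpha: one component tends to dominate
dense = mean_max_weight(50.0, K, rng)    # large alpha: weights stay near 1/K
one_draw = sample_dirichlet([1.0] * K, rng)
```

With small $\alpha$ the largest weight is typically close to 1, while with large $\alpha$ it stays close to $1/K = 0.2$.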

Conjugate prior for Categorical/Multinomial: If $\boldsymbol{\pi} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ and you observe counts $(c_1, \ldots, c_K)$, the posterior is $\text{Dirichlet}(\alpha_1 + c_1, \ldots, \alpha_K + c_K)$.
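
As with the Beta, the update amounts to adding observed counts to the concentration parameters. The three-class prior and counts below are hypothetical, and the function names are illustrative.

```python
def dirichlet_posterior(alphas, counts):
    """Dirichlet(alphas) prior + observed counts -> Dirichlet(alphas + counts)."""
    if len(alphas) != len(counts):
        raise ValueError("alphas and counts must have the same length")
    return [a + c for a, c in zip(alphas, counts)]

def dirichlet_mean(alphas):
    """E[pi_k] = alpha_k / sum_j alpha_j."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Hypothetical 3-class example: uniform Dirichlet(1, 1, 1) prior, counts (5, 0, 2).
prior = [1.0, 1.0, 1.0]
counts = [5, 0, 2]
post = dirichlet_posterior(prior, counts)   # [6.0, 1.0, 3.0]
post_mean = dirichlet_mean(post)            # [0.6, 0.1, 0.3]
```

The class with zero observations still receives posterior mean 0.1 rather than 0; with a Dirichlet(1, ..., 1) prior the posterior mean coincides with add-one (Laplace) smoothing.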

ML uses: LDA (Latent Dirichlet Allocation) places Dirichlet priors over topic distributions. Think Bayes Ch. 18 covers the Dirichlet-multinomial model in detail.

Quick Reference Table

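Distribution	Type	Parameters	Mean	Variance	ML use
Bernoulli($\theta$)	Discrete	$\theta \in [0,1]$	$\theta$	$\theta(1-\theta)$	Binary classification
Binomial($n,\theta$)	Discrete	$n \in \mathbb{N},\, \theta \in [0,1]$	$n\theta$	$n\theta(1-\theta)$	Number of successes in $n$ trials
Categorical($\boldsymbol{\pi}$)	Discrete	$\boldsymbol{\pi}$ on simplex	$\pi_k$ (one-hot indicator)	$\pi_k(1-\pi_k)$ (one-hot indicator)	Multi-class output
Multinomial($n,\boldsymbol{\pi}$)	Discrete	$n \in \mathbb{N}$, $\boldsymbol{\pi}$ on simplex	$n\pi_k$	$n\pi_k(1-\pi_k)$	Bag-of-words counts, topic models
Gaussian($\mu,\sigma^2$)	Continuous	$\mu \in \mathbb{R},\, \sigma^2 > 0$	$\mu$	$\sigma^2$	Regression, noise, VAE latent
MVN($\boldsymbol{\mu},\boldsymbol{\Sigma}$)	Continuous	mean + covariance	$\boldsymbol{\mu}$	$\boldsymbol{\Sigma}$ (covariance)	Gaussian processes, 3DGS
Beta($\alpha,\beta$)	Continuous	$\alpha,\beta > 0$	$\frac{\alpha}{\alpha+\beta}$	$\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$	Prior over $\theta$
Dirichlet($\boldsymbol{\alpha}$)	Continuous	$\alpha_k > 0$, $\alpha_0 = \sum_j \alpha_j$	$\alpha_k / \alpha_0$	$\frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}$	Prior over $\boldsymbol{\pi}$, LDA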

References

Downey — Think Bayes 2e, Ch. 4, 8, 12, 18 — Estimating Proportions, Poisson Processes, Classification, Conjugate Priors

Downey — Think Stats 2e, Ch. 5 — Modeling Distributions
