Prerequisite · Probability Foundations

Random Variables and Probability Distributions

By the end of this reading you will be able to:
  • Define a random variable as a function from a sample space to the reals and distinguish discrete random variables (PMFs) from continuous random variables (PDFs)
  • Compute probabilities using a PMF, a PDF via integration, and a CDF, and state the normalization condition each must satisfy
  • Apply the law of total probability to marginalize over a nuisance variable, and state the conditions under which two random variables are independent
  • Explain what a joint distribution encodes and how marginal and conditional distributions are derived from it

What Is Probability?

Before defining random variables, we need a precise notion of probability. A probability space has three components:

  • Sample space $\Omega$ — the set of all possible outcomes. For a fair die: $\Omega = \{1,2,3,4,5,6\}$.
  • Event space $\mathcal{F}$ — a collection of subsets of $\Omega$ (the things we can assign probabilities to).
  • Probability measure $P$ — a function assigning each event a number in $[0,1]$ satisfying the Kolmogorov axioms:
    1. $P(A) \geq 0$ for all events $A$
    2. $P(\Omega) = 1$ (something always happens)
    3. $P(A \cup B) = P(A) + P(B)$ when $A$ and $B$ are mutually exclusive; more generally, the same additivity holds for any countable collection of pairwise mutually exclusive events

The rest of classical probability follows from these three axioms alone, including the complement rule $P(A^c) = 1 - P(A)$ and the union formula $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
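As a quick sanity check, the complement rule and the union formula can be verified by direct counting on the fair-die sample space. This is a minimal sketch that assumes equally likely outcomes; the names `prob`, `A`, and `B` are illustrative choices, not part of the text above.

```python
from fractions import Fraction

# Sample space for a fair six-sided die.
omega = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of omega) when outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({2, 4, 6})  # illustrative event: the roll is even
B = frozenset({4, 5, 6})  # illustrative event: the roll is at least 4

# Complement rule: P(A^c) = 1 - P(A)
complement_ok = prob(omega - A) == 1 - prob(A)

# Union formula: P(A or B) = P(A) + P(B) - P(A and B)
union_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)

print(prob(A), prob(B), prob(A | B), prob(A & B))
print(complement_ok, union_ok)
```

Exact fractions are used instead of floats so the equalities can be tested without rounding tolerance.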


Random Variables

A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ that assigns a real number to each outcome. For example, if $\Omega$ is all possible sequences of 10 coin flips, $X$ could be the number of heads.

Random variables let us work with numbers instead of abstract outcomes. The probability that $X$ takes a particular value or range of values is determined by $P$ acting on the corresponding events in $\Omega$; for example, $P(X = x) = P(\{\omega \in \Omega : X(\omega) = x\})$.
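The coin-flip example can be made concrete by enumerating the sample space and treating the random variable as an ordinary function on outcomes. This sketch assumes a fair coin, so every sequence is equally likely; the helper names `X` and `prob_X_equals` are hypothetical.

```python
from fractions import Fraction
from itertools import product

n_flips = 10

# Sample space: every sequence of 10 flips, each 'H' or 'T' (2**10 = 1024 outcomes).
omega = list(product("HT", repeat=n_flips))

def X(outcome):
    """Random variable: maps an outcome (a sequence of flips) to its number of heads."""
    return outcome.count("H")

def prob_X_equals(k):
    """P(X = k), computed as the probability of the event {outcome : X(outcome) = k}."""
    favorable = sum(1 for outcome in omega if X(outcome) == k)
    return Fraction(favorable, len(omega))

p5 = prob_X_equals(5)
print(p5)
```

The probability of each value of the random variable comes entirely from counting outcomes in the underlying sample space, which is the sense in which the probability measure "acts on the corresponding events".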


Discrete Random Variables and PMFs

A random variable is discrete if it takes values in a countable set (integers, categories, etc.).

The probability mass function (PMF) specifies the probability of each value:

$$p(x) = P(X = x)$$

Normalization: $\sum_x p(x) = 1$

Example — rolling a fair die:

$$p(x) = \frac{1}{6} \quad \text{for } x \in \{1,2,3,4,5,6\}$$

Example — Bernoulli($\theta$): a binary outcome (heads/tails, success/failure)

$$p(x) = \theta^x (1-\theta)^{1-x}, \quad x \in \{0,1\}$$

The parameter $\theta \in [0,1]$ is the probability that $X = 1$.
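Both example PMFs can be written as small functions and checked against the normalization condition. The following is a sketch under the assumption that exact rational arithmetic is acceptable; the function names and the value chosen for `theta` are illustrative.

```python
from fractions import Fraction

def die_pmf(x):
    """PMF of a fair die: 1/6 on {1, ..., 6}, zero elsewhere."""
    return Fraction(1, 6) if x in range(1, 7) else Fraction(0)

def bernoulli_pmf(x, theta):
    """PMF of Bernoulli(theta): theta**x * (1 - theta)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        return Fraction(0)
    return theta**x * (1 - theta) ** (1 - x)

theta = Fraction(3, 10)  # illustrative parameter value

# Normalization: each PMF sums to 1 over its support.
die_total = sum(die_pmf(x) for x in range(1, 7))
bern_total = sum(bernoulli_pmf(x, theta) for x in (0, 1))

# Probability of an event is the sum of the PMF over the values in it.
p_even = sum(die_pmf(x) for x in (2, 4, 6))

print(die_total, bern_total, p_even)
```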


Continuous Random Variables and PDFs

A random variable is continuous if it can take any value in an interval. For continuous RVs, $P(X = x) = 0$ for any specific value $x$, so probability is assigned to intervals rather than to individual points.

The probability density function (PDF) $f(x)$ is non-negative and satisfies:

$$P(a \leq X \leq b) = \int_a^b f(x)\, dx$$

Normalization: $\int_{-\infty}^{\infty} f(x)\, dx = 1$

Note that $f(x)$ itself is not a probability. It is a density, so it can exceed 1; for example, a density that is constant on $[0, 0.5]$ must equal 2 there to integrate to 1.

Example — Uniform($a$, $b$):

$$f(x) = \frac{1}{b-a} \quad \text{for } x \in [a,b]$$

Example — Gaussian($\mu$, $\sigma^2$):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
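To see how a density turns into a probability, the two example PDFs can be integrated numerically. This sketch uses a simple midpoint rule, which is an assumption made for illustration and not a recommended integrator; the helper `integrate` and the interval endpoints are hypothetical choices.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def uniform_pdf(x, a, b):
    """Uniform density on [a, b]; zero outside the interval."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

# P(-1 <= X <= 1) for a standard Gaussian, by integrating the density.
p_within_one_sigma = integrate(gaussian_pdf, -1.0, 1.0)

# Normalization check; [-10, 10] stands in for the whole real line.
gauss_total = integrate(gaussian_pdf, -10.0, 10.0)

# A density value can exceed 1 while the total probability is still 1.
peak_density = uniform_pdf(0.25, 0.0, 0.5)
uniform_total = integrate(lambda x: uniform_pdf(x, 0.0, 0.5), 0.0, 0.5)

print(p_within_one_sigma, gauss_total, peak_density, uniform_total)
```

The Uniform(0, 0.5) case shows a density equal to 2 everywhere on its support even though the area under it is 1.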


Cumulative Distribution Functions

The cumulative distribution function (CDF) works for both discrete and continuous RVs:

$$F(x) = P(X \leq x)$$

For discrete RVs: $F(x) = \sum_{t \leq x} p(t)$

For continuous RVs: $F(x) = \int_{-\infty}^x f(t)\, dt$

Key properties:

  • $F(x)$ is non-decreasing
  • $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to +\infty} F(x) = 1$
  • For continuous RVs: $f(x) = F'(x)$ wherever $F$ is differentiable (differentiating the CDF gives the PDF)

The CDF is useful for computing probabilities over ranges: $P(a < X \leq b) = F(b) - F(a)$.
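The range formula can be tried on one discrete and one continuous case. This sketch assumes the fair-die PMF for the discrete CDF and uses the standard closed form of the Gaussian CDF in terms of the error function; the function names are illustrative.

```python
import math
from fractions import Fraction

def die_cdf(x):
    """F(x) = P(X <= x) for a fair die: sum the PMF over values t <= x."""
    return sum((Fraction(1, 6) for t in range(1, 7) if t <= x), Fraction(0))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF written with the error function math.erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(2 < X <= 5) = F(5) - F(2): the roll is 3, 4, or 5.
p_3_to_5 = die_cdf(5) - die_cdf(2)

# P(-1 < X <= 1) for a standard Gaussian.
p_gauss = gaussian_cdf(1.0) - gaussian_cdf(-1.0)

print(p_3_to_5, p_gauss)
```

For the die, the strict inequality on the left matters: subtracting `die_cdf(2)` removes the outcomes 1 and 2 but keeps 3.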


Joint Distributions

When working with multiple random variables $X$ and $Y$, the joint distribution $p(x, y) = P(X = x, Y = y)$ (or the joint density $f(x,y)$ for continuous variables) encodes all the probabilistic information about both variables simultaneously.

Marginal Distributions

To recover the distribution of one variable alone, sum (or integrate) out the other:

$$p(x) = \sum_y p(x, y) \qquad \text{(discrete)}$$

$$f(x) = \int_{-\infty}^{\infty} f(x, y)\, dy \qquad \text{(continuous)}$$

This is called marginalizing over $y$.
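For a discrete joint distribution stored as a table, marginalizing is just summing rows or columns. The joint table below is made up for illustration (two binary variables), and `marginal` is a hypothetical helper.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y) over two binary variables, keyed by (x, y).
joint = {
    (0, 0): Fraction(10, 20),
    (0, 1): Fraction(2, 20),
    (1, 0): Fraction(5, 20),
    (1, 1): Fraction(3, 20),
}

def marginal(joint, axis):
    """Sum the joint PMF over the other variable; axis=0 keeps x, axis=1 keeps y."""
    out = {}
    for values, p in joint.items():
        key = values[axis]
        out[key] = out.get(key, Fraction(0)) + p
    return out

p_x = marginal(joint, axis=0)  # sums out y
p_y = marginal(joint, axis=1)  # sums out x

print(p_x, p_y)
```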

Conditional Distributions

The conditional distribution of $X$ given $Y = y$ is defined whenever $p(y) > 0$:

$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$

Conditioning restricts attention to outcomes where $Y = y$ and renormalizes.
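The "restrict and renormalize" reading of conditioning translates directly into code: keep the entries of the joint table with the given value of the second variable, then divide by their total. The table and the helper name `conditional_x_given_y` are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y) over two binary variables, keyed by (x, y).
joint = {
    (0, 0): Fraction(10, 20),
    (0, 1): Fraction(2, 20),
    (1, 0): Fraction(5, 20),
    (1, 1): Fraction(3, 20),
}

def conditional_x_given_y(joint, y):
    """Return p(x | y) as a dict over x; requires p(y) > 0."""
    # Restrict: total probability of the slice where the second variable equals y.
    p_y = sum(p for (_, yv), p in joint.items() if yv == y)
    if p_y == 0:
        raise ValueError("cannot condition on a value with probability zero")
    # Renormalize: divide each entry in the slice by p(y).
    return {xv: p / p_y for (xv, yv), p in joint.items() if yv == y}

p_x_given_y1 = conditional_x_given_y(joint, y=1)
print(p_x_given_y1)
```

The result is itself a valid PMF over the first variable, so its values sum to 1.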

Independence

Two random variables are independent (written $X \perp Y$) if and only if:

$$p(x, y) = p(x)\, p(y) \quad \text{for all } x, y$$

Equivalently, $p(x \mid y) = p(x)$ for every $y$ with $p(y) > 0$: knowing $Y$ tells you nothing about $X$.
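The factorization condition can be tested cell by cell on a joint table. In this sketch, one table is made up and turns out to be dependent, and a second is built as the product of the first table's marginals so that it is independent by construction; `marginals` and `is_independent` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the two marginal PMFs of a joint PMF keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, Fraction(0)) + p
        p_y[y] = p_y.get(y, Fraction(0)) + p
    return p_x, p_y

def is_independent(joint):
    """True if p(x, y) == p(x) * p(y) for every pair (x, y)."""
    p_x, p_y = marginals(joint)
    return all(
        joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
        for x, y in product(p_x, p_y)
    )

# Hypothetical joint PMF whose cells do not factorize.
dependent = {
    (0, 0): Fraction(10, 20),
    (0, 1): Fraction(2, 20),
    (1, 0): Fraction(5, 20),
    (1, 1): Fraction(3, 20),
}

# A joint PMF built as the product of those marginals is independent by construction.
p_x, p_y = marginals(dependent)
independent = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}

print(is_independent(dependent), is_independent(independent))
```

Both tables have identical marginals, which illustrates that marginals alone do not determine the joint distribution.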


The Law of Total Probability

If events $\{B_1, B_2, \ldots, B_n\}$ partition the sample space (mutually exclusive, collectively exhaustive), then for any event $A$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i)\, P(B_i)$$

In terms of random variables: to compute the marginal distribution of $X$, you can condition on any other variable $Y$ and average out:

$$p(x) = \sum_y p(x \mid y)\, p(y)$$

This identity appears throughout probabilistic ML. For example, the likelihood in a latent variable model is computed by marginalizing over the latent variable $z$:

$$p(x) = \sum_z p(x \mid z)\, p(z)$$
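A tiny latent variable model makes the marginalization concrete. The setup below is an invented example: a hidden choice between two coins (the latent variable) followed by one observed flip. The coin names, prior, and head probabilities are all assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical prior over the latent variable z (which coin was picked).
prior = {"fair": Fraction(1, 4), "biased": Fraction(3, 4)}

# Hypothetical probability of heads (x = 1) under each coin.
heads_prob = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}

def likelihood(x):
    """p(x) = sum over z of p(x | z) * p(z), for a single flip x in {0, 1}."""
    total = Fraction(0)
    for z, p_z in prior.items():
        theta = heads_prob[z]
        p_x_given_z = theta if x == 1 else 1 - theta  # Bernoulli PMF given the coin
        total += p_x_given_z * p_z
    return total

p_heads = likelihood(1)
p_tails = likelihood(0)
print(p_heads, p_tails)
```

The observed flip is scored without ever knowing which coin was used; the unknown coin is averaged out with weights given by the prior.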


Why This Matters for ML

The language of random variables is the language in which every probabilistic model is written:

  • A neural network classifier defines a conditional distribution $p(y \mid x; \theta)$ — the PMF over class labels given input $x$.
  • A generative model defines a joint distribution $p(x, z)$ over observations $x$ and latent variables $z$.
  • Training by maximum likelihood chooses $\theta$ to maximize $\prod_i p(x_i; \theta)$ — a product of PMF or PDF values.
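The maximum likelihood bullet can be illustrated with the Bernoulli PMF from earlier. This sketch makes several assumptions: the data are invented, the search is a coarse grid rather than a real optimizer, and the log of the product is maximized instead of the product itself (the logarithm is increasing, so the maximizer is the same).

```python
import math

# Hypothetical observations of a binary outcome (1 = success, 0 = failure).
data = [1, 0, 1, 1, 0, 1, 1, 1]

def log_likelihood(theta, xs):
    """Log of the product of Bernoulli PMF values: sum of log p(x_i; theta)."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

# Coarse grid search over theta in (0, 1); endpoints are excluded to avoid log(0).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))

# For Bernoulli data the maximizer is known to be the sample mean.
sample_mean = sum(data) / len(data)

print(theta_hat, sample_mean)
```

In a neural network the same objective is maximized by gradient-based optimization over many parameters, but the quantity being maximized has the same form: a sum of log PMF or log PDF values.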

Understanding distributions precisely — what they are, how they relate, how to manipulate them — is the entry point to every technique covered in the Spinning Up curriculum and beyond.