Prerequisite · Probability Foundations

Random Variables and Probability Distributions

By the end of this reading you will be able to:

Define a random variable as a function from a sample space to the reals and distinguish discrete random variables (PMFs) from continuous random variables (PDFs)
Compute probabilities using a PMF, a PDF via integration, and a CDF, and state the normalization condition each must satisfy
Apply the law of total probability to marginalize over a nuisance variable, and state the conditions under which two random variables are independent
Explain what a joint distribution encodes and how marginal and conditional distributions are derived from it

What Is Probability?

Before defining random variables, we need a precise notion of probability. A probability space has three components:

Sample space $\Omega$: the set of all possible outcomes. For a fair die, $\Omega = \{1,2,3,4,5,6\}$.
Event space $\mathcal{F}$: a collection of subsets of $\Omega$ (the things we can assign probabilities to).
Probability measure $P$: a function assigning each event a number in $[0,1]$ and satisfying the Kolmogorov axioms:
1. $P(A) \geq 0$ for all events $A$
2. $P(\Omega) = 1$ (something always happens)
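3. $P(A \cup B) = P(A) + P(B)$ when $A$ and $B$ are mutually exclusive (in full generality, the same additivity is required for any countable collection of mutually exclusive events)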

From these three axioms alone, all of classical probability follows, including the complement rule $P(A^c) = 1 - P(A)$, the union formula $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, and more.
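To make the axioms concrete, here is a minimal Python sketch that checks the complement rule and the union formula on the fair-die sample space. The helper `prob` and the example events `evens` and `at_least_five` are illustrative names chosen here, and the uniform measure is an assumption that matches the fair-die example only.

```python
from fractions import Fraction

# Sample space for one roll of a fair die.
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """P(event) under the uniform measure: |event| / |omega|."""
    return Fraction(len(event & omega), len(omega))

# Two illustrative events (hypothetical choices for this sketch).
evens = frozenset({2, 4, 6})
at_least_five = frozenset({5, 6})

# Complement rule: P(A^c) = 1 - P(A).
complement_ok = prob(omega - evens) == 1 - prob(evens)

# Union formula: P(A u B) = P(A) + P(B) - P(A n B).
union_lhs = prob(evens | at_least_five)
union_rhs = prob(evens) + prob(at_least_five) - prob(evens & at_least_five)

print(complement_ok, union_lhs, union_rhs)
```

Exact fractions are used instead of floats so that the two sides of each identity can be compared with `==`.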

Random Variables

A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ that assigns a real number to each outcome. For example, if $\Omega$ is all possible sequences of 10 coin flips, $X$ could be the number of heads.

Random variables let us work with numbers instead of abstract outcomes. The probability that $X$ takes a particular value or range of values is determined by $P$ acting on the corresponding events in $\Omega$: for example, $P(X = x) = P(\{\omega \in \Omega : X(\omega) = x\})$.
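The coin-flip example can be sketched by enumerating the sample space directly. The code below assumes a fair coin (so every sequence is equally likely), which the text above does not state; the names `omega`, `X`, and `pmf` are chosen for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

n_flips = 10

# Sample space: every length-10 sequence of 'H'/'T' (2**10 = 1024 outcomes).
omega = list(product("HT", repeat=n_flips))

def X(outcome):
    """Random variable: maps an outcome (a tuple of flips) to its number of heads."""
    return outcome.count("H")

# P(X = k) is the probability of the event {w in omega : X(w) = k}.
# Under the fair-coin assumption each outcome has probability 1 / len(omega).
counts = Counter(X(w) for w in omega)
pmf = {k: Fraction(c, len(omega)) for k, c in sorted(counts.items())}

for k, p in pmf.items():
    print(k, p)
```

The random variable is an ordinary function, and the distribution of $X$ falls out of counting which outcomes map to each value.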

Discrete Random Variables and PMFs

A random variable is discrete if it takes values in a countable set (integers, categories, etc.).

The probability mass function (PMF) specifies the probability of each value:

$p(x) = P(X = x)$

Normalization: $\sum_x p(x) = 1$

Example — rolling a fair die:

$p(x) = \frac{1}{6} \quad \text{for } x \in \{1,2,3,4,5,6\}$

Example: Bernoulli($\theta$), a binary outcome (heads/tails, success/failure)

$p(x) = \theta^x (1-\theta)^{1-x} \quad x \in \{0,1\}$

The parameter $\theta \in [0,1]$ is the probability that $X = 1$.
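As a quick check of the Bernoulli formula and its normalization, the following sketch evaluates the PMF at both values. The function name `bernoulli_pmf` and the value `theta = 0.3` are arbitrary choices for illustration.

```python
def bernoulli_pmf(x, theta):
    """p(x) = theta**x * (1 - theta)**(1 - x) for x in {0, 1}, and 0 elsewhere."""
    if x not in (0, 1):
        return 0.0
    return theta**x * (1 - theta) ** (1 - x)

theta = 0.3  # hypothetical parameter value

p_one = bernoulli_pmf(1, theta)   # equals theta
p_zero = bernoulli_pmf(0, theta)  # equals 1 - theta
total = p_zero + p_one            # normalization: should be 1 up to rounding

print(p_zero, p_one, total)
```

The exponent form is just a compact way of writing "return $\theta$ when $x = 1$ and $1 - \theta$ when $x = 0$".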

Continuous Random Variables and PDFs

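A random variable is continuous if it takes values in a continuum, such as an interval of the real line, and its distribution is described by a density. For continuous RVs, $P(X = x) = 0$ for any specific value, so nonzero probability is assigned only to ranges of values such as intervals.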

The probability density function (PDF) $f(x)$ satisfies:

$P(a \leq X \leq b) = \int_a^b f(x)\, dx$

Normalization: $\int_{-\infty}^{\infty} f(x)\, dx = 1$

Note that $f(x)$ itself is not a probability — it is a density, so it can exceed 1.

Example: Uniform($a$, $b$)

$f(x) = \frac{1}{b-a} \quad \text{for } x \in [a,b], \qquad f(x) = 0 \text{ otherwise}$

Example: Gaussian($\mu$, $\sigma^2$)

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
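The integral definition can be checked numerically. This is a rough sketch using a hand-written trapezoid rule rather than a library integrator; `gaussian_pdf` and `integrate` are helper names introduced here, and the truncation of the real line to $[-10, 10]$ is an approximation that assumes a standard Gaussian.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=10_000):
    """Composite trapezoid rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# P(-1 <= X <= 1) for a standard Gaussian: roughly 0.6827.
p_one_sigma = integrate(gaussian_pdf, -1.0, 1.0)

# Normalization check on [-10, 10]; the mass outside is negligible.
total_mass = integrate(gaussian_pdf, -10.0, 10.0)

# A density is not a probability: with sigma = 0.1 the peak value exceeds 1.
peak = gaussian_pdf(0.0, mu=0.0, sigma=0.1)

print(p_one_sigma, total_mass, peak)
```

The narrow Gaussian at the end illustrates why $f(x) > 1$ is not a contradiction: the area under the curve is still 1.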

Cumulative Distribution Functions

The cumulative distribution function (CDF) works for both discrete and continuous RVs:

$F(x) = P(X \leq x)$

For discrete RVs: $F(x) = \sum_{t \leq x} p(t)$

For continuous RVs: $F(x) = \int_{-\infty}^x f(t)\, dt$

Key properties:

$F(x)$ is non-decreasing
$\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to +\infty} F(x) = 1$
For continuous RVs: $f(x) = F'(x)$ wherever $F$ is differentiable (differentiating the CDF gives the PDF)

The CDF is useful for computing probabilities over ranges: $P(a < X \leq b) = F(b) - F(a)$.
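For the fair die, the discrete CDF and the range formula can be sketched in a few lines. The names `pmf` and `cdf` are chosen for this illustration.

```python
from fractions import Fraction

# PMF of a fair die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum p(t) over all support points t <= x."""
    return sum((p for t, p in pmf.items() if t <= x), Fraction(0))

# P(2 < X <= 5) = F(5) - F(2), which covers the faces 3, 4, and 5.
p_range = cdf(5) - cdf(2)

print(cdf(0), cdf(3.5), cdf(6), p_range)
```

Note that `cdf` accepts any real argument, not only the six faces: between support points the discrete CDF is flat.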

Joint Distributions

When working with multiple random variables $X$ and $Y$, the joint distribution $p(x, y) = P(X = x, Y = y)$ (or the joint density $f(x, y)$ in the continuous case) encodes all the probabilistic information about both variables simultaneously.

Marginal Distributions

To recover the distribution of one variable alone, sum (or integrate) out the other:

$p(x) = \sum_y p(x, y) \qquad \text{(discrete)}$

$f(x) = \int_{-\infty}^{\infty} f(x, y)\, dy \qquad \text{(continuous)}$

This is called marginalizing over $y$.
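In the discrete case, marginalizing is a sum over one index of a table. The joint table below is made up for illustration (it is not taken from any dataset), and `marginal_x` and `marginal_y` are helper names introduced here.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y) with x in {0, 1} and y in {0, 1, 2}.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
}

def marginal_x(joint):
    """p(x) = sum over y of p(x, y)."""
    out = {}
    for (x, _y), p in joint.items():
        out[x] = out.get(x, Fraction(0)) + p
    return out

def marginal_y(joint):
    """p(y) = sum over x of p(x, y)."""
    out = {}
    for (_x, y), p in joint.items():
        out[y] = out.get(y, Fraction(0)) + p
    return out

p_x = marginal_x(joint)
p_y = marginal_y(joint)
print(p_x, p_y)
```

Each marginal is itself a valid PMF, so its values sum to 1.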

Conditional Distributions

The conditional distribution of $X$ given $Y = y$ is:

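$p(x \mid y) = \frac{p(x, y)}{p(y)}, \quad \text{defined when } p(y) > 0$

Conditioning restricts attention to outcomes where $Y = y$ and renormalizes: dividing by $p(y)$ makes the conditional probabilities sum to 1 over $x$.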
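The "restrict, then renormalize" reading of the formula can be sketched on a small table. The joint PMF is a made-up example, and `conditional_x_given_y` is a helper name chosen here.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y) with x in {0, 1} and y in {0, 1, 2}.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
}

def conditional_x_given_y(joint, y):
    """Return p(x | y) as a dict over x; undefined when p(y) = 0."""
    # Restrict: keep only the entries where Y = y and total their mass, p(y).
    p_y = sum(p for (_x, yy), p in joint.items() if yy == y)
    if p_y == 0:
        raise ValueError("p(y) = 0, so the conditional distribution is undefined")
    # Renormalize: divide each kept entry by p(y).
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

p_x_given_y0 = conditional_x_given_y(joint, 0)
print(p_x_given_y0)
```

Here $p(y = 0) = 4/10$, so the two kept entries $1/10$ and $3/10$ become $1/4$ and $3/4$.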

Independence

Two random variables are independent (written $X \perp Y$) if and only if:

$p(x, y) = p(x)\, p(y) \quad \text{for all } x, y$

Equivalently, $p(x \mid y) = p(x)$ whenever $p(y) > 0$: knowing $Y$ tells you nothing about $X$.
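The factorization condition can be tested mechanically on a discrete joint table. This sketch uses two illustrative tables, one independent and one not; `marginals`, `is_independent`, `two_coins`, and `copied` are names introduced for this example.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return (p_x, p_y) by summing the joint table over the other variable."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, Fraction(0)) + p
        p_y[y] = p_y.get(y, Fraction(0)) + p
    return p_x, p_y

def is_independent(joint):
    """True if p(x, y) == p(x) * p(y) for every pair (x, y)."""
    p_x, p_y = marginals(joint)
    return all(
        joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
        for x, y in product(p_x, p_y)
    )

# Two fair coin flips: each of the four pairs has probability 1/4.
two_coins = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

# Hypothetical dependent table: Y is always equal to X.
copied = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(two_coins), is_independent(copied))
```

Both tables have the same marginals (each variable is a fair coin), which shows that marginals alone do not determine whether two variables are independent.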

The Law of Total Probability

If events $\{B_1, B_2, \ldots, B_n\}$ partition the sample space (mutually exclusive and collectively exhaustive), then for any event $A$:

$P(A) = \sum_{i=1}^n P(A \mid B_i)\, P(B_i)$

In terms of random variables: to compute the marginal distribution of $X$, condition on any other variable $Y$ and average over its values, weighting each by $p(y)$:

$p(x) = \sum_y p(x \mid y)\, p(y)$

This identity appears throughout probabilistic ML. It is how you compute the likelihood in a latent variable model, by marginalizing over a discrete latent variable $z$ (for continuous $z$ the sum becomes an integral):

$p(x) = \sum_z p(x \mid z)\, p(z)$
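A toy latent variable model makes the sum concrete. The sketch below assumes a two-component mixture in which $z$ picks one of two Bernoulli parameters; the numbers in `p_z` and `theta` are invented for illustration and do not come from the text.

```python
from fractions import Fraction

# Prior over the latent variable z (hypothetical values).
p_z = {0: Fraction(3, 5), 1: Fraction(2, 5)}

# Bernoulli parameter used for x under each value of z (hypothetical values).
theta = {0: Fraction(1, 10), 1: Fraction(4, 5)}

def p_x_given_z(x, z):
    """Bernoulli PMF p(x | z) with parameter theta[z], for x in {0, 1}."""
    t = theta[z]
    return t if x == 1 else 1 - t

def p_x(x):
    """Marginal likelihood: p(x) = sum over z of p(x | z) * p(z)."""
    return sum(p_x_given_z(x, z) * p_z[z] for z in p_z)

print(p_x(0), p_x(1))
```

With these numbers, $p(x = 1) = \tfrac{1}{10} \cdot \tfrac{3}{5} + \tfrac{4}{5} \cdot \tfrac{2}{5} = \tfrac{19}{50}$, a weighted average of the two component probabilities.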

Why This Matters for ML

The language of random variables is the language in which every probabilistic model is written:

A neural network classifier defines a conditional distribution $p(y \mid x; \theta)$: the PMF over class labels given input $x$.
A generative model defines a joint distribution $p(x, z)$ over observations $x$ and latent variables $z$.
Training by maximum likelihood chooses $\theta$ to maximize $\prod_i p(x_i; \theta)$, a product of PMF or PDF values over data points assumed independent and identically distributed. In practice one maximizes the equivalent sum of logs, $\sum_i \log p(x_i; \theta)$.
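As a small illustration of maximum likelihood, the sketch below fits a Bernoulli parameter to a handful of made-up observations. It assumes the data points are independent and identically distributed, works with the log-likelihood rather than the raw product, and uses a coarse grid search purely for transparency; the names `data`, `log_likelihood`, and `best_theta` are chosen here.

```python
import math

# Hypothetical observations: 1 = success, 0 = failure (assumed i.i.d.).
data = [1, 0, 1, 1, 0, 1, 1, 1]

def log_likelihood(theta, xs):
    """Sum over the data of log p(x; theta) for the Bernoulli PMF."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

# Coarse grid over theta in (0, 1); the endpoints are excluded to avoid log(0).
grid = [i / 100 for i in range(1, 100)]
best_theta = max(grid, key=lambda t: log_likelihood(t, data))

# For a Bernoulli model the maximizer is known in closed form: the sample mean.
sample_mean = sum(data) / len(data)

print(best_theta, sample_mean)
```

The grid search and the closed form agree here because the sample mean happens to lie on the grid; a real training loop would use gradient-based optimization instead.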

Understanding distributions precisely — what they are, how they relate, how to manipulate them — is the entry point to every technique covered in the Spinning Up curriculum and beyond.
