Random Variables and Probability Distributions
- Define a random variable as a function from a sample space to the reals and distinguish discrete random variables (PMFs) from continuous random variables (PDFs)
- Compute probabilities using a PMF, a PDF via integration, and a CDF, and state the normalization condition each must satisfy
- Apply the law of total probability to marginalize over a nuisance variable, and state the conditions under which two random variables are independent
- Explain what a joint distribution encodes and how marginal and conditional distributions are derived from it
What Is Probability?
Before defining random variables, we need a precise notion of probability. A probability space has three components:
- Sample space $\Omega$: the set of all possible outcomes. For a fair die: $\Omega = \{1, 2, 3, 4, 5, 6\}$.
- Event space $\mathcal{F}$: a collection of subsets of $\Omega$ (the things we can assign probabilities to).
- Probability measure $P$: a function assigning each event a number in $[0, 1]$ and satisfying the Kolmogorov axioms:
  - $P(A) \geq 0$ for all events $A \in \mathcal{F}$
  - $P(\Omega) = 1$ (something always happens)
  - $P(A \cup B) = P(A) + P(B)$ when $A$ and $B$ are mutually exclusive
From these three axioms alone, all of classical probability follows, including the complement rule $P(A^c) = 1 - P(A)$, the union formula $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, and more.
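As a rough illustration, the axioms and the two derived rules can be checked by brute force on the fair-die example. This is only a sketch: it assumes equally likely outcomes and takes every subset of the sample space as an event, and the names `prob` and `all_events` are made up for this example.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for one roll of a fair die.
omega = frozenset(range(1, 7))

def prob(event):
    """Probability measure for equally likely outcomes: |event| / |omega|."""
    return Fraction(len(event & omega), len(omega))

def all_events(space):
    """Every subset of the sample space (the power set serves as the event space)."""
    items = sorted(space)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]

events = all_events(omega)

# Axiom 1: non-negativity. Axiom 2: the whole space has probability 1.
assert all(prob(a) >= 0 for a in events)
assert prob(omega) == 1

# Axiom 3: additivity for mutually exclusive events.
even, odd_low = frozenset({2, 4, 6}), frozenset({1, 3})
assert prob(even | odd_low) == prob(even) + prob(odd_low)

# Derived results: complement rule and union formula.
at_least_4 = frozenset({4, 5, 6})
assert prob(omega - even) == 1 - prob(even)
assert prob(even | at_least_4) == prob(even) + prob(at_least_4) - prob(even & at_least_4)

print(prob(even | at_least_4))  # 2/3
```

Exact fractions are used so that the equalities hold without floating-point tolerance.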
Random Variables
A random variable $X$ is a function $X : \Omega \to \mathbb{R}$ that assigns a real number to each outcome. For example, if $\Omega$ is the set of all possible sequences of 10 coin flips, $X$ could be the number of heads.
Random variables let us work with numbers instead of abstract outcomes. The probability that $X$ takes a particular value or range of values is determined by $P$ acting on the corresponding events in $\mathcal{F}$; for instance, $P(X = k) = P(\{\omega \in \Omega : X(\omega) = k\})$.
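The coin-flip example can be written out literally: the random variable is an ordinary function on outcomes, and a probability about it is the probability of the set of outcomes it maps to that value. A minimal sketch, assuming fair and independent flips so that all 1024 sequences are equally likely (the helper name `prob_X_equals` is hypothetical):

```python
from fractions import Fraction
from itertools import product

# Sample space: all sequences of 10 coin flips, each flip "H" or "T".
omega = list(product("HT", repeat=10))

def X(outcome):
    """Random variable: maps an outcome (a flip sequence) to its number of heads."""
    return outcome.count("H")

def prob_X_equals(k):
    """P(X = k): probability of the event {w in omega : X(w) = k}."""
    event = [w for w in omega if X(w) == k]
    return Fraction(len(event), len(omega))

print(len(omega))        # 1024
print(prob_X_equals(5))  # 63/256
```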
Discrete Random Variables and PMFs
A random variable is discrete if it takes values in a countable set (integers, categories, etc.).
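The probability mass function (PMF) specifies the probability of each value:
$$p_X(x) = P(X = x)$$
Normalization:
$$\sum_x p_X(x) = 1$$
Example (rolling a fair die):
$$p_X(k) = \frac{1}{6} \quad \text{for } k \in \{1, 2, \ldots, 6\}$$
Example (Bernoulli($p$)): a binary outcome (heads/tails, success/failure)
$$p_X(1) = p, \qquad p_X(0) = 1 - p$$
The parameter $p \in [0, 1]$ is the probability of $X = 1$.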
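The two example PMFs are small enough to write down directly. The sketch below uses hypothetical function names (`die_pmf`, `bernoulli_pmf`) and checks normalization, then computes the probability of an event by summing PMF values over the outcomes in it.

```python
from fractions import Fraction

def die_pmf(k):
    """PMF of a fair six-sided die: 1/6 on {1, ..., 6}, 0 elsewhere."""
    return Fraction(1, 6) if k in range(1, 7) else Fraction(0)

def bernoulli_pmf(x, p):
    """PMF of Bernoulli(p): p at x = 1, 1 - p at x = 0, 0 elsewhere."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return Fraction(0)

# Normalization: each PMF sums to 1 over its support.
assert sum(die_pmf(k) for k in range(1, 7)) == 1
p = Fraction(3, 10)  # arbitrary parameter value chosen for illustration
assert bernoulli_pmf(0, p) + bernoulli_pmf(1, p) == 1

# Probability of an event = sum of PMF values over the outcomes in it.
prob_even = sum(die_pmf(k) for k in (2, 4, 6))
print(prob_even)  # 1/2
```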
Continuous Random Variables and PDFs
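A random variable is continuous if it can take any value in an interval. For continuous RVs, $P(X = x) = 0$ for any specific value $x$, so probability is defined only over intervals.
The probability density function (PDF) $f_X(x) \geq 0$ satisfies:
$$P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$$
Normalization:
$$\int_{-\infty}^{\infty} f_X(x)\, dx = 1$$
Note that $f_X(x)$ itself is not a probability. It is a density, so it can exceed 1.
Example (Uniform($a$, $b$)):
$$f_X(x) = \begin{cases} \dfrac{1}{b - a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$
Example (Gaussian($\mu$, $\sigma^2$)):
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$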
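To make the density-versus-probability distinction concrete, the sketch below evaluates both example PDFs and approximates interval probabilities with a simple midpoint rule. The `integrate` helper is a hypothetical stand-in for a proper quadrature routine, and `gaussian_pdf` takes the standard deviation `sigma` rather than the variance.

```python
import math

def uniform_pdf(x, a, b):
    """Density of Uniform(a, b): constant 1 / (b - a) on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

# A density may exceed 1: Uniform(0, 0.5) has height 2 on its support...
print(uniform_pdf(0.25, 0.0, 0.5))  # 2.0

# ...yet it still integrates to 1 (up to floating-point error).
total = integrate(lambda x: uniform_pdf(x, 0.0, 0.5), 0.0, 0.5)

# P(-1 <= X <= 1) for a standard Gaussian, roughly 0.6827.
p_within_1sd = integrate(lambda x: gaussian_pdf(x, 0.0, 1.0), -1.0, 1.0)
print(round(p_within_1sd, 4))
```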
Cumulative Distribution Functions
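The cumulative distribution function (CDF) works for both discrete and continuous RVs:
$$F_X(x) = P(X \leq x)$$
For discrete RVs:
$$F_X(x) = \sum_{k \leq x} p_X(k)$$
For continuous RVs:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt$$
Key properties:
- $F_X$ is non-decreasing
- $\lim_{x \to -\infty} F_X(x) = 0$, $\lim_{x \to \infty} F_X(x) = 1$
- For continuous RVs: $f_X(x) = \frac{d}{dx} F_X(x)$ (differentiating the CDF gives the PDF)
The CDF is useful for computing probabilities over ranges: $P(a < X \leq b) = F_X(b) - F_X(a)$.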
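The range formula and the derivative property can both be checked numerically. The sketch below uses a fair die for the discrete case and a standard Gaussian for the continuous case. The Gaussian CDF is written with the error function from Python's `math` module; the function names are hypothetical.

```python
import math
from fractions import Fraction

def die_cdf(x):
    """CDF of a fair die: (number of faces <= x) / 6, a step function."""
    faces_at_or_below = min(max(math.floor(x), 0), 6)
    return Fraction(faces_at_or_below, 6)

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF via the error function: (1 + erf((x - mu) / (sigma * sqrt(2)))) / 2."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Discrete: P(2 < X <= 4) = F(4) - F(2), i.e. the faces 3 and 4.
p_die = die_cdf(4) - die_cdf(2)
print(p_die)  # 1/3

# Continuous: P(-1 < X <= 1) for a standard Gaussian, roughly 0.6827.
p_gauss = gaussian_cdf(1.0) - gaussian_cdf(-1.0)

# Differentiating the CDF recovers the PDF (checked with a central difference).
x, h = 0.3, 1e-5
slope = (gaussian_cdf(x + h) - gaussian_cdf(x - h)) / (2 * h)
assert abs(slope - gaussian_pdf(x)) < 1e-6
```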
Joint Distributions
When working with multiple random variables $X$ and $Y$, the joint distribution $p_{X,Y}(x, y) = P(X = x, Y = y)$ (or the joint density $f_{X,Y}(x, y)$ for continuous variables) encodes all the probabilistic information about both variables simultaneously.
Marginal Distributions
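To recover the distribution of one variable alone, sum (or integrate) out the other:
$$p_X(x) = \sum_y p_{X,Y}(x, y) \qquad \text{or} \qquad f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy$$
This is called marginalizing over $Y$.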
Conditional Distributions
The conditional distribution of $X$ given $Y = y$ is:
$$p_{X \mid Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}, \qquad p_Y(y) > 0$$
Conditioning restricts attention to outcomes where $Y = y$ and renormalizes.
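For a small discrete case, a joint distribution is just a table, and marginalizing and conditioning are a few lines of bookkeeping. The joint table below is invented for illustration, and the helper names `marginal` and `conditional_x_given_y` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF over X in {0, 1} and Y in {0, 1}; keys are (x, y).
joint = {
    (0, 0): Fraction(4, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(4, 10),
}
assert sum(joint.values()) == 1

def marginal(joint, axis):
    """Sum the joint over the other variable; axis=0 keeps X, axis=1 keeps Y."""
    totals = defaultdict(Fraction)
    for pair, p in joint.items():
        totals[pair[axis]] += p
    return dict(totals)

def conditional_x_given_y(joint, y):
    """Distribution of X given Y = y: keep the entries with Y = y, then renormalize."""
    p_y = marginal(joint, axis=1).get(y, Fraction(0))
    if p_y == 0:
        raise ValueError("cannot condition on a zero-probability value")
    return {x: p / p_y for (x, y_val), p in joint.items() if y_val == y}

p_x = marginal(joint, axis=0)
x_given_y1 = conditional_x_given_y(joint, 1)
print(p_x[0], p_x[1])                # 1/2 1/2
print(x_given_y1[0], x_given_y1[1])  # 1/5 4/5
```

Observing the second variable changes the distribution of the first from (1/2, 1/2) to (1/5, 4/5), so the two variables in this table are dependent.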
Independence
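Two random variables are independent (written $X \perp Y$) if and only if:
$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y) \quad \text{for all } x, y$$
Equivalently, $p_{X \mid Y}(x \mid y) = p_X(x)$ whenever $p_Y(y) > 0$: knowing $Y$ tells you nothing about $X$.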
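The factorization test for independence can be applied mechanically to any finite joint table. A minimal sketch with two made-up tables, one for two fair coins and one for a correlated pair (`marginals` and `is_independent` are hypothetical helper names):

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the two marginal PMFs of a joint PMF keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return p_x, p_y

def is_independent(joint):
    """True if the joint equals the product of its marginals at every (x, y)."""
    p_x, p_y = marginals(joint)
    return all(
        joint.get((x, y), 0) == p_x[x] * p_y[y]
        for x, y in product(p_x, p_y)
    )

# Two fair coins flipped separately: every pair has probability 1/4.
two_fair_coins = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

# A correlated pair: the two variables usually agree.
correlated = {
    (0, 0): Fraction(4, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(4, 10),
}

print(is_independent(two_fair_coins))  # True
print(is_independent(correlated))      # False
```

Exact fractions keep the equality test meaningful. With floating-point probabilities a tolerance would be needed instead.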
The Law of Total Probability
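If events $B_1, \ldots, B_n$ partition the sample space (mutually exclusive, collectively exhaustive), then for any event $A$:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)$$
In terms of random variables: to compute the marginal distribution of $X$, you can condition on any other variable $Y$ and average it out:
$$p_X(x) = \sum_y p_{X \mid Y}(x \mid y)\, p_Y(y)$$
This identity appears constantly in probabilistic ML. It is how you compute the likelihood in a latent variable model by marginalizing over the latent variable $z$ (with the integral replaced by a sum when $z$ is discrete):
$$p(x) = \int p(x \mid z)\, p(z)\, dz$$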
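A toy latent variable model shows the marginalization at work. In this invented example the latent variable picks one of two coins and the observation is a single flip; the prior and the heads probabilities are arbitrary illustrative numbers, and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical model: latent z picks a coin, observation x is the flip (1 = heads).
prior_z = {"fair": Fraction(2, 3), "biased": Fraction(1, 3)}
heads_prob = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}

def likelihood(x, z):
    """p(x | z): Bernoulli PMF with the heads probability of coin z."""
    return heads_prob[z] if x == 1 else 1 - heads_prob[z]

def marginal_x(x):
    """p(x) = sum over z of p(x | z) p(z), the law of total probability."""
    return sum(likelihood(x, z) * prior_z[z] for z in prior_z)

print(marginal_x(1))  # 19/30
assert marginal_x(0) + marginal_x(1) == 1
```

Here the sum has only two terms. In realistic latent variable models the sum or integral over the latent variable is often intractable and has to be approximated.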
Why This Matters for ML
The language of random variables is the language in which every probabilistic model is written:
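- A neural network classifier defines a conditional distribution $p_\theta(y \mid x)$: the PMF over class labels $y$ given input $x$.
- A generative model defines a joint distribution $p_\theta(x, z)$ over observations $x$ and latent variables $z$.
- Training by maximum likelihood chooses $\theta$ to maximize the likelihood of the data, which for independent samples $x_1, \ldots, x_N$ is $\prod_{i=1}^{N} p_\theta(x_i)$, a product of PMF or PDF values.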
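Maximum likelihood can be illustrated on the smallest possible model, a single Bernoulli parameter. The sketch below uses made-up data and a coarse grid search in place of a real optimizer. It works with the log of the product, which has the same maximizer and avoids numerical underflow.

```python
import math

# Hypothetical observations, assumed independent; 1 = heads, 0 = tails.
data = [1, 0, 1, 1, 0, 1, 1, 1]

def log_likelihood(theta, data):
    """Log of the product of Bernoulli PMF values, computed as a sum of logs."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

# Coarse grid search over theta in (0, 1); a stand-in for gradient-based training.
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_hat)              # 0.75
print(sum(data) / len(data))  # 0.75, the sample mean
```

For the Bernoulli model the maximizer coincides with the sample mean, which is why both lines print the same value.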
Understanding distributions precisely — what they are, how they relate, how to manipulate them — is the entry point to every technique covered in the Spinning Up curriculum and beyond.