Bayes' Theorem and the Chain Rule of Probability
- Define conditional probability P(A|B) and derive the product rule P(A∩B) = P(A|B)P(B) from it
- Apply the chain rule of probability to factorize a joint distribution P(A₁,A₂,...,Aₙ) as a product of conditional distributions
- Derive Bayes' theorem from the definition of conditional probability and identify the prior, likelihood, posterior, and evidence in a concrete inference problem
- Explain how Bayes' theorem enables sequential belief updating and identify where this appears in probabilistic ML models
Conditional Probability
Conditioning is one of the most powerful operations in probability. It lets us update our beliefs once we have observed some information.
Definition. The conditional probability of $A$ given $B$ (written $P(A \mid B)$) is the probability that $A$ occurs, restricted to the world where $B$ has already occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$

Geometrically: conditioning on $B$ shrinks the sample space to $B$ and asks what fraction of $B$ is also in $A$.
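Example. A bag contains 3 red and 2 blue balls. You draw one and see it is red. What is the probability the next draw (without replacement) is also red? After the first draw, 2 red and 2 blue balls remain, so $P(\text{2nd red} \mid \text{1st red}) = 2/4 = 1/2$. The definition agrees: counting ordered pairs of draws gives $P(\text{both red}) = 6/20 = 3/10$ and $P(\text{1st red}) = 12/20 = 3/5$, and their ratio is $1/2$.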
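The bag example can be checked by brute-force enumeration. The sketch below lists every equally likely ordered pair of draws and applies the definition $P(A \mid B) = P(A \cap B) / P(B)$ directly. The ball labels and helper names are illustrative choices, not part of the original example.

```python
from fractions import Fraction
from itertools import permutations

# Bag from the example: 3 red and 2 blue balls (labels are illustrative).
bag = ["R1", "R2", "R3", "B1", "B2"]

# Every ordered pair of distinct balls is an equally likely (first, second) draw.
outcomes = list(permutations(bag, 2))


def prob(event):
    """Exact probability of an event over the equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def first_red(o):
    return o[0].startswith("R")


def both_red(o):
    return o[0].startswith("R") and o[1].startswith("R")


p_first_red = prob(first_red)                     # P(B)
p_both_red = prob(both_red)                       # P(A and B)
p_second_given_first = p_both_red / p_first_red   # P(A | B)

print(p_first_red, p_both_red, p_second_given_first)
```

This prints `3/5 3/10 1/2`. Using `Fraction` keeps the arithmetic exact, so the result can be compared against the hand calculation without rounding concerns.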
The Product Rule
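Rearranging the definition of conditional probability gives the product rule:

$$P(A \cap B) = P(A \mid B)\,P(B)$$

The joint probability of $A$ and $B$ equals the probability of $B$ times the probability of $A$ given $B$. This also holds symmetrically:

$$P(A \cap B) = P(B \mid A)\,P(A)$$

If $A$ and $B$ are independent, $P(A \mid B) = P(A)$, so the product rule simplifies to $P(A \cap B) = P(A)\,P(B)$.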
The Chain Rule of Probability
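The product rule extends to any number of variables. The chain rule factorizes a joint distribution into a product of conditionals:

$$P(A_1, A_2, \ldots, A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1}) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$$

For continuous random variables $X_1, \ldots, X_n$ with joint density $p$:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$

Why this matters for ML. Autoregressive language models use exactly this factorization. The probability of a sequence of tokens $w_1, \ldots, w_T$ is:

$$P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$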
Each next-token prediction is a conditional probability — the chain rule says these multiply to give the joint probability of the whole sequence.
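To make the factorization concrete, here is a minimal sketch that scores a token sequence by summing log conditional probabilities. The `BIGRAM` table is a hypothetical toy model invented for illustration, and it conditions only on the previous token, which is a simplifying assumption. A real autoregressive model conditions on the whole prefix.

```python
import math

# Hypothetical toy model: P(next token | previous token). "<s>" marks the start.
# Conditioning on only one previous token is a simplification for illustration.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.8, "ran": 0.2},
    "dog": {"sat": 0.4, "ran": 0.6},
}


def sequence_log_prob(tokens):
    """Chain rule in log space: sum of log P(w_t | w_{t-1}) over the sequence."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p


log_p = sequence_log_prob(["the", "cat", "sat"])
p = math.exp(log_p)
print(round(p, 4))  # product of the conditionals 0.6 * 0.5 * 0.8
```

Working in log space is the usual design choice here: products of many small probabilities underflow quickly, while sums of logs stay well behaved.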
Bayes' Theorem
Bayes' theorem follows directly from the product rule. Since $P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$, dividing both sides by $P(B)$:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In the language of inference, with a hypothesis $H$ and observed evidence $E$, we interpret this as:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

| Term | Symbol | Meaning |
|---|---|---|
| Prior | $P(H)$ | Our belief about hypothesis $H$ before seeing evidence |
| Likelihood | $P(E \mid H)$ | How probable the evidence $E$ is under hypothesis $H$ |
| Posterior | $P(H \mid E)$ | Updated belief in $H$ after observing evidence $E$ |
| Evidence | $P(E)$ | Probability of the evidence under all hypotheses; normalizes the posterior |
Computing the Evidence
The denominator $P(E)$ is found by summing (or integrating) over all hypotheses using the law of total probability:

$$P(E) = \sum_{i} P(E \mid H_i)\,P(H_i)$$
This ensures the posterior sums to 1.
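A short sketch of this normalization over a discrete set of hypotheses follows. The three-coin setup, the dictionary keys, and the function name `posterior` are all made up for illustration.

```python
def posterior(priors, likelihoods):
    """Return (posterior dict, evidence) from P(H_i) and P(E | H_i)."""
    # Unnormalized posterior: likelihood times prior for each hypothesis.
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Evidence P(E): law of total probability over all hypotheses.
    evidence = sum(joint.values())
    return {h: j / evidence for h, j in joint.items()}, evidence


# Hypothetical setup: one of three coins is picked uniformly at random,
# and the evidence E is a single flip that came up heads.
priors = {"fair": 1 / 3, "biased_heads": 1 / 3, "biased_tails": 1 / 3}
likelihoods = {"fair": 0.5, "biased_heads": 0.9, "biased_tails": 0.1}

post, evidence = posterior(priors, likelihoods)
print(evidence)
print(post)
```

Dividing every joint term by the same evidence value is what forces the posterior values to sum to 1, whatever the priors and likelihoods are.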
A Concrete Example
Disease testing. A disease has prevalence 1% in the population. A test is 95% sensitive (correctly detects 95% of cases) and 90% specific (correctly clears 90% of healthy people).
You test positive. What is the probability you have the disease?
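Let $D$ denote having the disease and $+$ a positive test result:

- $P(D) = 0.01$: prior (prevalence)
- $P(+ \mid D) = 0.95$: likelihood (sensitivity)
- $P(+ \mid \neg D) = 1 - 0.90 = 0.10$: false positive rate

Applying Bayes' theorem, with the evidence expanded by the law of total probability:

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.1085} \approx 0.088$$

Despite the positive test, the probability of disease is only about 8.8%, because the disease is rare (low prior). Healthy people outnumber sick people 99 to 1, so the 10% false positive rate produces roughly ten times as many false positives as there are true positives. This counterintuitive result is a direct consequence of the prior dominating when the base rate is low.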
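The disease-testing numbers can be reproduced in a few lines. This is a minimal sketch, and the function name `posterior_given_positive` and its parameter names are chosen here for illustration. The loop at the end varies the prevalence (values picked arbitrarily) to show how strongly the posterior depends on the prior.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    false_positive_rate = 1.0 - specificity
    # Evidence P(+): true positives plus false positives.
    p_positive = sensitivity * prevalence + false_positive_rate * (1.0 - prevalence)
    return sensitivity * prevalence / p_positive


p = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.90)
print(f"{p:.4f}")  # 0.0876

# Same test, different base rates: the posterior rises quickly with the prior.
for prevalence in (0.01, 0.10, 0.50):
    print(prevalence, round(posterior_given_positive(prevalence, 0.95, 0.90), 3))
```

With the test characteristics held fixed, only the prior changes between the loop iterations, which isolates its effect on the result.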
Sequential Updating
Bayes' theorem is naturally sequential: today's posterior becomes tomorrow's prior.
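Suppose you observe two independent pieces of evidence $E_1$ and $E_2$:

$$P(H \mid E_1, E_2) = \frac{P(E_1 \mid H)\,P(E_2 \mid H)\,P(H)}{P(E_1, E_2)}$$

(assuming $E_1 \perp E_2 \mid H$, that is, conditionally independent given $H$). You can apply Bayes' theorem sequentially:
- Use $E_1$ to get $P(H \mid E_1) = P(E_1 \mid H)\,P(H) / P(E_1)$
- Use this as the new prior and apply $E_2$:

$$P(H \mid E_1, E_2) = \frac{P(E_2 \mid H)\,P(H \mid E_1)}{P(E_2 \mid E_1)}$$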
The order of evidence does not matter — the final posterior is the same.
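The order-independence claim is easy to check numerically for a binary hypothesis. In this sketch the likelihood values for the two pieces of evidence are hypothetical, and conditional independence given $H$ is assumed, as in the text.

```python
def update(prior, lik_h, lik_not_h):
    """One Bayes update for a binary hypothesis H.

    prior     -- P(H) before this evidence
    lik_h     -- P(E | H)
    lik_not_h -- P(E | not H)
    """
    numerator = lik_h * prior
    return numerator / (numerator + lik_not_h * (1.0 - prior))


prior = 0.01
# Two hypothetical pieces of evidence, assumed conditionally independent given H.
# Each pair is (P(E | H), P(E | not H)).
e1 = (0.95, 0.10)
e2 = (0.80, 0.05)

# Sequential: the posterior after E1 becomes the prior for E2.
after_e1_then_e2 = update(update(prior, *e1), *e2)
# Reversed order.
after_e2_then_e1 = update(update(prior, *e2), *e1)
# Batch: multiply the likelihoods and update once.
batch = update(prior, e1[0] * e2[0], e1[1] * e2[1])

print(after_e1_then_e2, after_e2_then_e1, batch)
```

All three values agree up to floating-point rounding. The batch form only works because the joint likelihood factorizes under the conditional independence assumption.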
Bayes' Theorem in ML
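Maximum a posteriori (MAP) estimation. MAP chooses the model parameters $\theta$ that maximize the posterior $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)\,P(\theta)$, where $\mathcal{D}$ is the training data. The prior $P(\theta)$ acts as regularization: a zero-mean Gaussian prior corresponds to L2 weight decay.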
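The link between a Gaussian prior and an L2 penalty shows up even in a one-parameter model. The sketch below assumes observations $x_i \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$ and a prior $\mu \sim \mathcal{N}(0, \tau^2)$. The data and variance values are made up, and the function names are illustrative.

```python
def neg_log_posterior(mu, data, sigma2, tau2):
    """Negative log posterior of mu, up to an additive constant.

    Assumes x_i ~ Normal(mu, sigma2) and a prior mu ~ Normal(0, tau2).
    """
    squared_error = sum((x - mu) ** 2 for x in data) / (2.0 * sigma2)
    l2_penalty = mu ** 2 / (2.0 * tau2)  # this term comes from the Gaussian prior
    return squared_error + l2_penalty


def map_estimate(data, sigma2, tau2):
    """Closed-form minimizer of neg_log_posterior (set the derivative to zero)."""
    return sum(data) / (len(data) + sigma2 / tau2)


data = [2.1, 1.9, 2.4, 2.0]  # made-up observations
sigma2, tau2 = 1.0, 0.5      # assumed noise variance and prior variance

mle = sum(data) / len(data)                 # maximum likelihood: the sample mean
mu_map = map_estimate(data, sigma2, tau2)   # shrunk toward the prior mean 0
print(mle, mu_map)
```

The MAP estimate is the sample mean shrunk toward zero, with the ratio $\sigma^2 / \tau^2$ playing the role of the weight-decay strength: a tighter prior (smaller $\tau^2$) shrinks harder.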
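Bayesian neural networks. Instead of point estimates, maintain a posterior distribution $P(w \mid \mathcal{D})$ over the weights $w$. Predictions are averaged over weight settings, each weighted by its posterior probability: $P(y \mid x, \mathcal{D}) = \int P(y \mid x, w)\,P(w \mid \mathcal{D})\,dw$.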
Naive Bayes classifiers. Classify by computing $P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$ for each class label $y$, assuming feature independence given $y$.
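A minimal word-count Naive Bayes classifier might look like the sketch below. The training examples are invented, and the add-one (Laplace) smoothing is an assumption added here so that unseen words do not zero out a class. It is not something the text above specifies.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(words, model):
    """Return the label y maximizing log P(y) + sum_i log P(x_i | y)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        # Add-one smoothing (an assumption of this sketch).
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    # The evidence P(x) is the same for every label, so it can be skipped.
    return max(scores, key=scores.get)


# Made-up training data for illustration only.
examples = [
    (["free", "prize", "now"], "spam"),
    (["free", "offer"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["lunch", "tomorrow", "now"], "ham"),
]
model = train(examples)
print(predict(["free", "prize"], model))
print(predict(["meeting", "lunch"], model))
```

Because classification only needs the argmax, the sketch compares unnormalized log posteriors and never computes the evidence term.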
Probabilistic graphical models. Define complex joint distributions using the chain rule, with conditional independence assumptions encoded as a graph.