Prerequisite · Probability Foundations

Bayes' Theorem and the Chain Rule of Probability

15 min read

By the end of this reading you will be able to:

Define conditional probability P(A|B) and derive the product rule P(A∩B) = P(A|B)P(B) from it
Apply the chain rule of probability to factorize a joint distribution P(A₁,A₂,...,Aₙ) as a product of conditional distributions
Derive Bayes' theorem from the definition of conditional probability and identify the prior, likelihood, posterior, and evidence in a concrete inference problem
Explain how Bayes' theorem enables sequential belief updating and identify where this appears in probabilistic ML models

Conditional Probability

Conditioning is one of the most powerful operations in probability. It lets us update our beliefs once we have observed some information.

Definition. The conditional probability of $A$ given $B$ (written $P(A \mid B)$ ) is the probability that $A$ occurs, restricted to the world where $B$ has already occurred:

$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad \text{(provided } P(B) > 0\text{)}$

Geometrically: conditioning on $B$ shrinks the sample space to $B$ and asks what fraction of $B$ is also in $A$ .

Example. A bag contains 3 red and 2 blue balls. You draw one and see it is red. What is the probability the next draw (without replacement) is also red?

$P(\text{2nd red} \mid \text{1st red}) = P(\text{1st red and 2nd red}) / P(\text{1st red}) = (3/5 \cdot 2/4) / (3/5) = 2/4 = 0.5$
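The same answer can be checked by brute-force enumeration. The sketch below assumes the balls are individually labeled (the labels `R1`, `B1`, and so on are introduced here only for illustration) so that every ordered pair of draws is equally likely, and then applies the definition of conditional probability directly.

```python
from fractions import Fraction
from itertools import permutations

# Label the balls so that every ordered pair of draws is equally likely.
balls = ["R1", "R2", "R3", "B1", "B2"]
draws = list(permutations(balls, 2))  # 5 * 4 = 20 ordered outcomes

first_red = [d for d in draws if d[0].startswith("R")]
both_red = [d for d in first_red if d[1].startswith("R")]

p_first_red = Fraction(len(first_red), len(draws))  # P(1st red)
p_both_red = Fraction(len(both_red), len(draws))    # P(1st red and 2nd red)
p_second_given_first = p_both_red / p_first_red     # definition of P(A | B)

print(p_first_red, p_both_red, p_second_given_first)  # prints: 3/5 3/10 1/2
```

Using `Fraction` keeps the arithmetic exact, so the result matches the hand calculation without rounding.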

The Product Rule

Rearranging the definition of conditional probability gives the product rule:

$P(A \cap B) = P(A \mid B)\, P(B)$

The joint probability of $A$ and $B$ equals the probability of $B$ times the probability of $A$ given $B$ . This also holds symmetrically:

$P(A \cap B) = P(B \mid A)\, P(A)$

If $A$ and $B$ are independent, $P(A \mid B) = P(A)$ , so the product rule simplifies to $P(A \cap B) = P(A)\, P(B)$ .
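A small numerical check can make the product rule and the independence special case concrete. The sketch below uses two fair dice, an example chosen here for illustration and not taken from the text above; the event names are hypothetical.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls


def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def joint(a, b):
    """P(A and B)."""
    return prob(lambda o: a(o) and b(o))


def conditional(a, b):
    """P(A | B), assuming P(B) > 0."""
    return joint(a, b) / prob(b)


def first_even(o):
    return o[0] % 2 == 0


def sum_is_7(o):
    return o[0] + o[1] == 7


def sum_at_least_10(o):
    return o[0] + o[1] >= 10


# The product rule holds for any pair of events.
assert joint(first_even, sum_at_least_10) == (
    conditional(first_even, sum_at_least_10) * prob(sum_at_least_10)
)

# Independent pair: the joint equals the product of the marginals.
independent = joint(first_even, sum_is_7) == prob(first_even) * prob(sum_is_7)

# Dependent pair: it does not.
dependent = joint(first_even, sum_at_least_10) != (
    prob(first_even) * prob(sum_at_least_10)
)
```

Here "first die is even" and "sum is 7" turn out to be independent, while "first die is even" and "sum is at least 10" are not, because a large sum makes a large first die more likely.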

The Chain Rule of Probability

The product rule extends to any number of variables. The chain rule factorizes a joint distribution into a product of conditionals:

$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$

The same factorization holds for the joint density of continuous random variables $x_1, \ldots, x_n$ :

$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1})$
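The factorization can be verified numerically on a small discrete case. The sketch below builds an arbitrary joint distribution over three binary variables (the random weights are purely illustrative), computes each conditional as a ratio of marginals, and checks that their product recovers the joint.

```python
import random
from itertools import product

random.seed(0)

# An arbitrary joint distribution over three binary variables.
weights = {xs: random.random() for xs in product([0, 1], repeat=3)}
total = sum(weights.values())
joint = {xs: w / total for xs, w in weights.items()}


def marginal(prefix):
    """P(x_1..x_k = prefix), summing out the remaining variables."""
    k = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:k] == prefix)


def chain_rule(xs):
    """Product of P(x_k | x_1..x_{k-1}) for k = 1..n."""
    prob = 1.0
    for k in range(len(xs)):
        # Conditional as a ratio of marginals; marginal(()) is 1.
        prob *= marginal(xs[: k + 1]) / marginal(xs[:k])
    return prob


max_error = max(abs(chain_rule(xs) - joint[xs]) for xs in joint)
```

The product telescopes, which is exactly why the chain rule holds for any joint distribution and any ordering of the variables.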

Why this matters for ML. Autoregressive language models use exactly this factorization. The probability of a sequence of tokens $(w_1, w_2, \ldots, w_T)$ is:

$p(w_1, \ldots, w_T) = \prod_{t=1}^T p(w_t \mid w_1, \ldots, w_{t-1})$

Each next-token prediction is a conditional probability, and the chain rule says these multiply to give the joint probability of the whole sequence. For $t = 1$ the context is empty, so the first factor is simply $p(w_1)$ .
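As a rough illustration, the sketch below scores a sequence with a toy next-token model. The table `TABLE` and the function `next_token_probs` are hypothetical; for brevity the toy model looks only at the last token, whereas a real autoregressive model would condition on the whole prefix. Log-probabilities are summed instead of multiplying probabilities, a common way to avoid underflow on long sequences.

```python
import math

# Hypothetical toy model: next-token distribution keyed by the last token.
# None stands for the empty prefix (start of sequence).
TABLE = {
    None: {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}


def next_token_probs(prefix):
    """Return p(. | prefix). This toy version uses only the last token."""
    last = prefix[-1] if prefix else None
    return TABLE[last]


def sequence_log_prob(tokens):
    """Sum of log p(w_t | w_1..w_{t-1}): the log of the chain-rule product."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[token])
    return total


seq = ["the", "cat", "sleeps"]
p = math.exp(sequence_log_prob(seq))  # 0.6 * 0.5 * 1.0
```

Because every conditional distribution in the table sums to 1, the probabilities of all complete three-token sequences also sum to 1.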

Bayes' Theorem

Bayes' theorem follows directly from the product rule. Since $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$ , dividing the last equality by $P(B)$ (assumed positive) gives:

$\boxed{P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}}$

In the language of inference, we interpret this as:

$\underbrace{P(H \mid E)}_{\text{posterior}} = \frac{\underbrace{P(E \mid H)}_{\text{likelihood}} \cdot \underbrace{P(H)}_{\text{prior}}}{\underbrace{P(E)}_{\text{evidence}}}$

Term	Symbol	Meaning
Prior	$P(H)$	Our belief about hypothesis $H$ before seeing evidence
Likelihood	$P(E \mid H)$	How probable is the evidence under hypothesis $H$ ?
Posterior	$P(H \mid E)$	Updated belief after observing evidence $E$
Evidence	$P(E)$	Probability of the evidence under all hypotheses — normalizes the posterior

Computing the Evidence

The denominator $P(E)$ is found by summing (or integrating) over a set of mutually exclusive, exhaustive hypotheses using the law of total probability:

$P(E) = \sum_H P(E \mid H)\, P(H)$

This ensures the posterior sums to 1.
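For a finite set of hypotheses, the whole computation fits in a few lines. The following is a minimal sketch, assuming the hypotheses are mutually exclusive and exhaustive; the function name `posterior` and the two-coin example are hypothetical choices made for illustration.

```python
def posterior(prior, likelihood):
    """Return P(H | E) for each hypothesis H.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(E | H)
    """
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(E), law of total probability
    if evidence == 0:
        raise ValueError("evidence has probability zero under every hypothesis")
    return {h: u / evidence for h, u in unnormalized.items()}


# Hypothetical example: which of two coins produced a head?
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": 0.5, "biased": 0.9}  # P(heads | coin)
post = posterior(prior, likelihood)
```

The evidence never needs to be known in advance: it falls out as the sum of the unnormalized terms, which is why Bayes' theorem is often written with a proportionality sign.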

A Concrete Example

Disease testing. A disease has prevalence 1% in the population. A test is 95% sensitive (correctly detects 95% of cases) and 90% specific (correctly clears 90% of healthy people).

You test positive. What is the probability you have the disease?

$P(D) = 0.01$ — prior (prevalence)
$P(+\mid D) = 0.95$ — likelihood (sensitivity)
$P(+\mid \neg D) = 0.10$ — false positive rate
$P(+) = P(+\mid D)P(D) + P(+\mid \neg D)P(\neg D) = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.1085$

$P(D \mid +) = \frac{0.95 \times 0.01}{0.1085} \approx 0.088$

The probability of disease is only about 8.8% despite the positive test, because the disease is rare (low prior). Of every 10,000 people tested, about 95 are true positives and about 990 are false positives, so when the base rate is low the false positives from the large healthy group outnumber the true positives.
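The calculation is easy to reproduce and to rerun with other numbers. The sketch below wraps it in a function; the name `positive_predictive_value` is a label chosen here and does not come from the text.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D | +) from Bayes' theorem."""
    false_positive_rate = 1.0 - specificity
    p_positive = (
        sensitivity * prevalence + false_positive_rate * (1.0 - prevalence)
    )
    return sensitivity * prevalence / p_positive


ppv = positive_predictive_value(
    prevalence=0.01, sensitivity=0.95, specificity=0.90
)
print(round(ppv, 4))  # prints: 0.0876
```

Rerunning the function with `prevalence=0.10` and the same test gives a posterior of roughly 0.51, which shows how strongly the result depends on the prior.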

Sequential Updating

Bayes' theorem is naturally sequential: today's posterior becomes tomorrow's prior.

Suppose you observe two pieces of evidence $E_1$ and $E_2$ that are conditionally independent given $H$ , written $E_1 \perp E_2 \mid H$ . The joint likelihood then factorizes:

$P(H \mid E_1, E_2) \propto P(E_1 \mid H)\, P(E_2 \mid H)\, P(H)$

You can therefore apply Bayes' theorem sequentially:

Use $E_1$ to get $P(H \mid E_1) \propto P(E_1 \mid H) P(H)$
Use this as the new prior and apply $E_2$ : $P(H \mid E_1, E_2) \propto P(E_2 \mid H) P(H \mid E_1)$

Under this assumption the order of evidence does not matter: multiplication is commutative, so the final posterior is the same.
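A short numerical check of the order invariance, assuming conditionally independent observations. The two-coin setup and its likelihood values are hypothetical, and `update` is a helper defined here for illustration.

```python
def update(belief, likelihood):
    """One Bayesian update: multiply by the likelihood, then renormalize."""
    unnormalized = {h: belief[h] * likelihood[h] for h in belief}
    total = sum(unnormalized.values())
    return {h: u / total for h, u in unnormalized.items()}


prior = {"fair": 0.5, "biased": 0.5}
lik_heads = {"fair": 0.5, "biased": 0.9}  # P(heads | coin), hypothetical
lik_tails = {"fair": 0.5, "biased": 0.1}  # P(tails | coin)

# Sequential updates in both orders: each posterior becomes the next prior.
heads_then_tails = update(update(prior, lik_heads), lik_tails)
tails_then_heads = update(update(prior, lik_tails), lik_heads)

# Batch update using the product of the two likelihoods.
lik_both = {h: lik_heads[h] * lik_tails[h] for h in prior}
batch = update(prior, lik_both)
```

All three results agree up to floating-point rounding, with the fair coin ending at 0.125 / 0.17, about 0.735.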

Bayes' Theorem in ML

Maximum a posteriori (MAP) estimation. MAP estimation chooses the model parameters $\theta$ that maximize $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta)$ . The prior $P(\theta)$ acts as a regularizer: a zero-mean Gaussian prior corresponds to an L2 penalty on the weights (weight decay).
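The link between a Gaussian prior and L2 weight decay becomes visible after taking logarithms. The following is a sketch, assuming a zero-mean isotropic Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$ ; the symbols $\sigma^2$ and $\hat{\theta}_{\text{MAP}}$ are introduced here for illustration.

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log P(\mathcal{D} \mid \theta) + \log P(\theta) \right]$

$\log P(\theta) = -\frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 + \text{const}$

$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(\mathcal{D} \mid \theta) + \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 \right]$

Under this assumption the penalty coefficient is $1/(2\sigma^2)$ , so a narrower prior (smaller $\sigma^2$ ) means stronger regularization.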

Bayesian neural networks. Instead of point estimates, maintain a posterior distribution over weights. Prediction averages the network's outputs over possible weight settings, each weighted by its posterior probability.

Naive Bayes classifiers. Classify by computing $P(y \mid x) \propto P(x \mid y) P(y)$ for each class label $y$ , assuming feature independence given $y$ .
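A minimal sketch of this computation for binary word features. The class names, words, and probabilities below are hand-picked hypothetical values, not estimates from data; in practice they would be fitted from labeled examples.

```python
import math

# Hypothetical, hand-picked parameters for a two-class spam filter.
PRIOR = {"spam": 0.4, "ham": 0.6}  # P(y)
WORD_GIVEN_CLASS = {               # P(word present | y)
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.6},
}


def classify(features):
    """Return P(y | x) for features given as a dict word -> bool."""
    log_scores = {}
    for y in PRIOR:
        log_p = math.log(PRIOR[y])
        for word, present in features.items():
            p = WORD_GIVEN_CLASS[y][word]
            # Independence given y: per-feature log-likelihoods add up.
            log_p += math.log(p if present else 1.0 - p)
        log_scores[y] = log_p
    # Normalize after subtracting the max for numerical stability.
    m = max(log_scores.values())
    unnormalized = {y: math.exp(s - m) for y, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {y: u / z for y, u in unnormalized.items()}


post = classify({"free": True, "meeting": False})
```

With these numbers a message that contains "free" but not "meeting" gets a spam posterior of 0.288 / 0.312, about 0.92.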

Probabilistic graphical models. Define complex joint distributions using the chain rule, with conditional independence assumptions encoded as a graph.

References

Downey — Think Bayes 2e, Ch. 1–2 — Probability and Bayes's Theorem
