Prerequisite · Probability Foundations

Bayes' Theorem and the Chain Rule of Probability

By the end of this reading you will be able to:
  • Define conditional probability P(A|B) and derive the product rule P(A∩B) = P(A|B)P(B) from it
  • Apply the chain rule of probability to factorize a joint distribution P(A₁,A₂,...,Aₙ) as a product of conditional distributions
  • Derive Bayes' theorem from the definition of conditional probability and identify the prior, likelihood, posterior, and evidence in a concrete inference problem
  • Explain how Bayes' theorem enables sequential belief updating and identify where this appears in probabilistic ML models

Conditional Probability

Conditioning is one of the most powerful operations in probability. It lets us update our beliefs once we have observed some information.

Definition. The conditional probability of $A$ given $B$, written $P(A \mid B)$, is the probability that $A$ occurs, restricted to the world where $B$ has already occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad \text{(provided } P(B) > 0\text{)}$$

Geometrically: conditioning on $B$ shrinks the sample space to $B$ and asks what fraction of $B$ is also in $A$.

Example. A bag contains 3 red and 2 blue balls. You draw one and see it is red. What is the probability the next draw (without replacement) is also red?

  • $P(\text{2nd red} \mid \text{1st red}) = P(\text{1st red and 2nd red}) / P(\text{1st red}) = (3/5 \cdot 2/4) / (3/5) = 2/4 = 0.5$
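The same answer can be checked by brute-force enumeration. The sketch below is only an illustration: the ball labels and the helper `prob` are hypothetical names, and exact fractions are used so no rounding is involved.

```python
from fractions import Fraction
from itertools import permutations

# Label the balls so that every ordered pair of draws is equally likely.
balls = ["R1", "R2", "R3", "B1", "B2"]
draws = list(permutations(balls, 2))  # ordered draws without replacement

def prob(event):
    """Probability of an event over the equally likely ordered draws."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

def first_red(d):
    return d[0].startswith("R")

def both_red(d):
    return d[0].startswith("R") and d[1].startswith("R")

p_first_red = prob(first_red)   # P(1st red)
p_both_red = prob(both_red)     # P(1st red and 2nd red)

# Definition of conditional probability: P(A | B) = P(A and B) / P(B)
p_second_given_first = p_both_red / p_first_red

print(p_first_red, p_both_red, p_second_given_first)
```

Counting outcomes directly mirrors the geometric picture: the 12 draws that start with a red ball form the reduced sample space, and 6 of them also end with a red ball.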

The Product Rule

Rearranging the definition of conditional probability gives the product rule:

$$P(A \cap B) = P(A \mid B)\, P(B)$$

The joint probability of $A$ and $B$ equals the probability of $B$ times the probability of $A$ given $B$. This also holds symmetrically:

$$P(A \cap B) = P(B \mid A)\, P(A)$$

If $A$ and $B$ are independent, $P(A \mid B) = P(A)$, so the product rule simplifies to $P(A \cap B) = P(A)\, P(B)$.


The Chain Rule of Probability

The product rule extends to any number of variables. The chain rule factorizes a joint distribution into a product of conditionals:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$

For continuous random variables $x_1, \ldots, x_n$:

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1})$$
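To see the factorization at work, here is a small numerical check on a discrete case. It is a sketch under stated assumptions: the joint distribution over three binary variables is built from arbitrary hypothetical weights, and `marginal` and `chain_rule` are illustrative helper names.

```python
from fractions import Fraction
from itertools import product

# Hypothetical unnormalized weights for three binary variables (a1, a2, a3).
weights = dict(zip(product([0, 1], repeat=3), [1, 2, 3, 4, 5, 6, 7, 8]))
total = sum(weights.values())
joint = {k: Fraction(v, total) for k, v in weights.items()}

def marginal(prefix):
    """P(A_1..A_k = prefix), summing the joint over the remaining variables."""
    k = len(prefix)
    return sum(p for outcome, p in joint.items() if outcome[:k] == prefix)

def chain_rule(outcome):
    """P(a1) * P(a2 | a1) * P(a3 | a1, a2), each conditional from its definition."""
    result = Fraction(1)
    for k in range(1, len(outcome) + 1):
        # P(a_k | a_1..a_{k-1}) = P(a_1..a_k) / P(a_1..a_{k-1})
        result *= marginal(outcome[:k]) / marginal(outcome[:k - 1])
    return result

# The product of conditionals should reproduce every entry of the joint table.
chain_matches_joint = all(chain_rule(o) == joint[o] for o in joint)
print(chain_matches_joint)
```

Each conditional is a ratio of two marginals, so the product telescopes back to the joint probability. That telescoping is exactly why the chain rule holds for any ordering of the variables.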

Why this matters for ML. Autoregressive language models use exactly this factorization. The probability of a sequence of tokens $(w_1, w_2, \ldots, w_T)$ is:

$$p(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1})$$

Each next-token prediction is a conditional probability — the chain rule says these multiply to give the joint probability of the whole sequence.
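A toy illustration of scoring a sequence this way is sketched below. The vocabulary and probabilities are hypothetical, and the table conditions on only the previous token (a bigram simplification), whereas a real autoregressive model conditions on the full prefix.

```python
import math

# Hypothetical conditional tables p(next | previous); "<s>" marks the sequence start.
# Simplifying assumption: each token depends on one previous token only.
cond = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}

def sequence_log_prob(tokens):
    """Sum of log p(w_t | w_{t-1}), i.e. the log of the chain-rule product."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(cond[prev][tok])
        prev = tok
    return log_p

seq = ["the", "cat", "sleeps"]
log_p = sequence_log_prob(seq)
print(math.exp(log_p))
```

Summing log-probabilities instead of multiplying raw probabilities is the usual design choice, since products of many small numbers underflow quickly for long sequences.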


Bayes' Theorem

Bayes' theorem follows directly from the product rule. Since $P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$, dividing both sides by $P(B)$ (assuming $P(B) > 0$) gives:

$$\boxed{P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}}$$

In the language of inference, we interpret this as:

$$\underbrace{P(H \mid E)}_{\text{posterior}} = \frac{\underbrace{P(E \mid H)}_{\text{likelihood}} \cdot \underbrace{P(H)}_{\text{prior}}}{\underbrace{P(E)}_{\text{evidence}}}$$

| Term | Symbol | Meaning |
|---|---|---|
| Prior | $P(H)$ | Our belief about hypothesis $H$ before seeing evidence |
| Likelihood | $P(E \mid H)$ | How probable the evidence is under hypothesis $H$ |
| Posterior | $P(H \mid E)$ | Updated belief after observing evidence $E$ |
| Evidence | $P(E)$ | Probability of the evidence averaged over all hypotheses; normalizes the posterior |

Computing the Evidence

The denominator $P(E)$ is found by summing (or integrating) over all hypotheses using the law of total probability, where the hypotheses are mutually exclusive and exhaustive:

$$P(E) = \sum_H P(E \mid H)\, P(H)$$

Dividing by $P(E)$ ensures the posterior sums to 1 over the hypotheses.
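As a minimal sketch of this normalization, the function below computes the evidence and posterior for a finite set of hypotheses. The function name `posterior` and the three-coin example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Return (evidence, posterior) for mutually exclusive, exhaustive hypotheses.

    prior[h] is P(h); likelihood[h] is P(E | h) for the observed evidence E.
    """
    # Law of total probability: P(E) = sum over h of P(E | h) P(h)
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    post = {h: likelihood[h] * prior[h] / evidence for h in prior}
    return evidence, post

# Hypothetical example: one of three coins is picked uniformly at random and lands heads.
prior = {"fair": Fraction(1, 3), "biased": Fraction(1, 3), "two_headed": Fraction(1, 3)}
likelihood_heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 4), "two_headed": Fraction(1)}

evidence, post = posterior(prior, likelihood_heads)
print(evidence)
print(post)
print(sum(post.values()))
```

Because every unnormalized term is divided by the same sum, the posterior values add up to exactly 1 regardless of how the likelihoods are scaled.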


A Concrete Example

Disease testing. A disease has prevalence 1% in the population. A test is 95% sensitive (correctly detects 95% of cases) and 90% specific (correctly clears 90% of healthy people).

You test positive. What is the probability you have the disease?

  • $P(D) = 0.01$: prior (prevalence)
  • $P(+ \mid D) = 0.95$: likelihood (sensitivity)
  • $P(+ \mid \neg D) = 0.10$: false positive rate (1 − specificity)
  • $P(+) = P(+ \mid D)\, P(D) + P(+ \mid \neg D)\, P(\neg D) = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.1085$

$$P(D \mid +) = \frac{0.95 \times 0.01}{0.1085} \approx 0.088$$

The probability of disease is only about 8.8% despite a positive test, because the disease is rare (low prior). False positives from the healthy 99% of the population (0.099) far outnumber true positives from the diseased 1% (0.0095), so most positive results come from healthy people.
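One way to build intuition for this result is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical cohort of 100,000 people, a size chosen only so the counts come out as whole numbers.

```python
population = 100_000   # hypothetical cohort size, chosen for round numbers
prevalence = 0.01
sensitivity = 0.95
specificity = 0.90

diseased = round(population * prevalence)              # people with the disease
healthy = population - diseased                        # people without it
true_positives = round(diseased * sensitivity)         # diseased and test positive
false_positives = round(healthy * (1 - specificity))   # healthy but test positive

# Among everyone who tests positive, what fraction is actually diseased?
p_disease_given_positive = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, p_disease_given_positive)
```

The ratio of true positives to all positives is the same quantity as the posterior $P(D \mid +)$, just expressed in counts: the healthy group is so much larger that even a 10% false positive rate produces far more positives than the diseased group can.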


Sequential Updating

Bayes' theorem is naturally sequential: today's posterior becomes tomorrow's prior.

Suppose you observe two pieces of evidence $E_1$ and $E_2$ that are conditionally independent given $H$, written $E_1 \perp E_2 \mid H$. Then:

$$P(H \mid E_1, E_2) \propto P(E_1 \mid H)\, P(E_2 \mid H)\, P(H)$$

You can reach the same posterior by applying Bayes' theorem sequentially:

  1. Use $E_1$ to get $P(H \mid E_1) \propto P(E_1 \mid H)\, P(H)$
  2. Use this as the new prior and apply $E_2$: $P(H \mid E_1, E_2) \propto P(E_2 \mid H)\, P(H \mid E_1)$

The order of evidence does not matter: the likelihood factors simply multiply, and multiplication is commutative, so the final posterior is the same.
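The order-invariance claim can be checked numerically. The following is a minimal sketch with a hypothetical two-hypothesis coin problem; the helper `update` and all probabilities are illustrative assumptions, and the two observations are treated as conditionally independent given the hypothesis.

```python
from fractions import Fraction

def update(prior, likelihood):
    """One Bayes update: multiply by the likelihood, then renormalize."""
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    z = sum(unnorm.values())  # evidence for this observation
    return {h: p / z for h, p in unnorm.items()}

# Hypothetical setup: is a coin fair, or biased towards heads?
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
lik_heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}   # P(heads | H)
lik_tails = {"fair": Fraction(1, 2), "biased": Fraction(1, 4)}   # P(tails | H)

# Sequential updates in both orders: each posterior becomes the next prior.
heads_then_tails = update(update(prior, lik_heads), lik_tails)
tails_then_heads = update(update(prior, lik_tails), lik_heads)

# Batch update: use the product of both likelihoods at once.
lik_both = {h: lik_heads[h] * lik_tails[h] for h in prior}
batch = update(prior, lik_both)

print(heads_then_tails)
print(heads_then_tails == tails_then_heads == batch)
```

Renormalizing after every step only rescales the intermediate beliefs, so the sequential and batch routes end at the same posterior.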


Bayes' Theorem in ML

Maximum a posteriori (MAP) estimation. Choose model parameters $\theta$ to maximize $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)\, P(\theta)$. The prior $P(\theta)$ acts as a regularizer: a zero-mean Gaussian prior corresponds to L2 weight decay.
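The link between a Gaussian prior and L2 regularization can be sketched in a few lines. This derivation assumes a zero-mean isotropic Gaussian prior $P(\theta) = \mathcal{N}(0, \sigma^2 I)$; the variance $\sigma^2$ and the coefficient $\lambda$ are symbols introduced here for illustration.

```latex
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta} \; P(\mathcal{D} \mid \theta)\, P(\theta) \\
  &= \arg\max_{\theta} \; \bigl[ \log P(\mathcal{D} \mid \theta) + \log P(\theta) \bigr] \\
  &= \arg\max_{\theta} \; \Bigl[ \log P(\mathcal{D} \mid \theta)
       - \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 + \text{const} \Bigr] \\
  &= \arg\min_{\theta} \; \Bigl[ -\log P(\mathcal{D} \mid \theta)
       + \lambda \lVert \theta \rVert_2^2 \Bigr],
  \qquad \lambda = \frac{1}{2\sigma^2}
\end{aligned}
```

Under this assumption, a narrower prior (smaller $\sigma^2$) means a larger $\lambda$ and therefore a stronger pull of the weights towards zero.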

Bayesian neural networks. Instead of point estimates, maintain a posterior distribution over weights. Prediction averages the network's outputs over weight settings, each weighted by its posterior probability.

Naive Bayes classifiers. Classify by computing $P(y \mid x) \propto P(x \mid y)\, P(y)$ for each class label $y$, assuming the features are conditionally independent given $y$.
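A minimal sketch of this computation for binary features follows. The class names, priors, and per-feature probabilities are hypothetical values set by hand, not parameters fitted to data, and `naive_bayes_posterior` is an illustrative name.

```python
import math

# Hypothetical parameters for a two-class problem with three binary features.
class_prior = {"spam": 0.4, "ham": 0.6}   # P(y)
feature_prob = {                          # P(x_i = 1 | y) for each feature i
    "spam": [0.8, 0.6, 0.1],
    "ham": [0.1, 0.3, 0.5],
}

def naive_bayes_posterior(x):
    """Return P(y | x) for each class, assuming features are independent given y."""
    log_joint = {}
    for y, prior in class_prior.items():
        log_p = math.log(prior)
        for x_i, p_i in zip(x, feature_prob[y]):
            # Independence given y lets P(x | y) factor into per-feature terms.
            log_p += math.log(p_i if x_i == 1 else 1.0 - p_i)
        log_joint[y] = log_p
    # Normalize in log space (log-sum-exp) to obtain the posterior.
    m = max(log_joint.values())
    log_evidence = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {y: math.exp(v - log_evidence) for y, v in log_joint.items()}

post = naive_bayes_posterior([1, 1, 0])
prediction = max(post, key=post.get)
print(post, prediction)
```

The independence assumption is what keeps the model cheap: the likelihood needs only one number per feature per class instead of a full joint table over all feature combinations.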

Probabilistic graphical models. Define complex joint distributions using the chain rule, with conditional independence assumptions encoded as a graph.