Prerequisite · Calculus Foundations

Integration and the Fundamental Theorem

15 min read
By the end of this reading you will be able to:
  • Explain what a definite integral computes geometrically and how it is defined as a limit of Riemann sums
  • Apply the Fundamental Theorem of Calculus to evaluate definite integrals using antiderivatives
  • Connect integration to ML: interpret probability density functions as distributions whose integral equals 1, and express expected value and entropy as integrals

The Accumulation Problem

Derivatives answer the question of rates of change. Integration answers the complementary question: given a rate of change, how much has accumulated over an interval?

If you know a car's speed at every moment, integration tells you the total distance traveled. If you know the density of a probability distribution at every point, integration tells you the total probability in any region. Both are the same mathematical operation.


Area Under a Curve

The definite integral of $f$ from $a$ to $b$ is written:

$$\int_a^b f(x)\, dx$$

Geometrically, it equals the signed area between the curve $f(x)$ and the $x$-axis over $[a, b]$. Regions above the axis contribute positive area; regions below contribute negative area.


Riemann Sums

The precise definition: divide $[a, b]$ into $n$ subintervals of width $\Delta x = (b-a)/n$. In each subinterval, evaluate $f$ at some sample point $x_i^*$ and form a rectangle of height $f(x_i^*)$ and width $\Delta x$.

The sum of all rectangle areas approximates the integral:

$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^n f(x_i^*)\,\Delta x$$

As $n \to \infty$ (the rectangles become arbitrarily thin), the approximation becomes exact. The limit exists whenever $f$ is continuous on $[a, b]$, and it does not depend on which sample points $x_i^*$ are chosen.

The notation $dx$ in the integral is the limiting version of $\Delta x$: an infinitesimally thin strip width. It signals which variable we are integrating over.
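As a rough illustration of the limit in the Riemann sum definition, the sketch below computes left, midpoint, and right sums for $f(x) = x^2$ on $[0, 1]$, whose exact integral is $1/3$. The helper name `riemann_sum` and the choice of test function are illustrative assumptions, not something taken from this reading.

```python
def riemann_sum(f, a, b, n, rule="midpoint"):
    """Approximate the integral of f over [a, b] with n equal-width rectangles."""
    dx = (b - a) / n
    # Position of the sample point x_i^* inside each subinterval
    offset = {"left": 0.0, "midpoint": 0.5, "right": 1.0}[rule]
    return sum(f(a + (i + offset) * dx) for i in range(n)) * dx


def square(x):
    return x * x


exact = 1.0 / 3.0
for n in (10, 100, 1000):
    left = riemann_sum(square, 0.0, 1.0, n, "left")
    mid = riemann_sum(square, 0.0, 1.0, n, "midpoint")
    right = riemann_sum(square, 0.0, 1.0, n, "right")
    print(f"n={n:5d}  left={left:.6f}  mid={mid:.6f}  right={right:.6f}  exact={exact:.6f}")
```

Because $x^2$ is increasing on $[0, 1]$, the left sums stay below $1/3$ and the right sums stay above it, and all three close in on the exact value as $n$ grows.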


Antiderivatives

An antiderivative of $f(x)$ is any function $F(x)$ such that $F'(x) = f(x)$. We can reverse the power rule:

$$\int x^n\,dx = \frac{x^{n+1}}{n+1} + C \qquad (n \neq -1)$$

The constant $C$, the constant of integration, appears because differentiation destroys constant terms: for any constant $C$, $F + C$ has the same derivative $f$.

Key antiderivatives:

| $f(x)$ | $F(x) = \int f(x)\,dx$ |
|---|---|
| $x^n$ ($n \neq -1$) | $\frac{x^{n+1}}{n+1} + C$ |
| $e^x$ | $e^x + C$ |
| $\frac{1}{x}$ | $\ln\lvert x\rvert + C$ |
| $\cos x$ | $\sin x + C$ |
| $\sin x$ | $-\cos x + C$ |
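One way to sanity-check the table is to differentiate each $F$ numerically and compare against $f$. The sketch below does this with a central finite difference; the helper `derivative`, the exponent `n = 3`, and the test points are illustrative choices, and the check is numerical evidence rather than a proof.

```python
import math


def derivative(F, x, h=1e-5):
    """Central-difference approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


n = 3  # any exponent other than -1 works for the power-rule row

# (label, f, F) for each row of the table, taking C = 0
table = [
    ("x^n", lambda x: x**n, lambda x: x ** (n + 1) / (n + 1)),
    ("e^x", math.exp, math.exp),
    ("1/x", lambda x: 1 / x, lambda x: math.log(abs(x))),
    ("cos x", math.cos, math.sin),
    ("sin x", math.sin, lambda x: -math.cos(x)),
]

test_points = [-2.0, -0.5, 0.7, 1.5]  # avoids x = 0, where 1/x is undefined

max_error = 0.0
for label, f, F in table:
    err = max(abs(derivative(F, x) - f(x)) for x in test_points)
    max_error = max(max_error, err)
    print(f"{label:6s} max |F'(x) - f(x)| = {err:.2e}")
```

The negative test points are there to exercise the absolute value in $\ln\lvert x\rvert$, which is what makes that antiderivative valid on both sides of zero.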

The Fundamental Theorem of Calculus

The most important result in calculus connects differentiation and integration:

$$\int_a^b f(x)\,dx = F(b) - F(a)$$

where $f$ is continuous on $[a, b]$ and $F$ is any antiderivative of $f$ ($F' = f$). To compute a definite integral, you do not need to set up and take a limit of Riemann sums: you find an antiderivative, evaluate it at the two endpoints, and subtract. The constant of integration cancels in the subtraction, which is why any antiderivative works.

Example. $\int_1^3 2x\,dx$

$F(x) = x^2$ is an antiderivative of $2x$ since $(x^2)' = 2x$. Therefore $\int_1^3 2x\,dx = F(3) - F(1) = 9 - 1 = 8$.

Geometrically: the area under $2x$ from 1 to 3 is the trapezoid with vertices at $(1,2)$, $(3,6)$, $(3,0)$, $(1,0)$, with area $\frac{1}{2}(2+6)(2) = 8$. ✓

Example. $\int_0^1 e^x\,dx = e^1 - e^0 = e - 1 \approx 1.718$
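The two worked examples can be cross-checked numerically. The sketch below compares a midpoint Riemann sum against $F(b) - F(a)$ for each; the helper names `midpoint_integral` and `ftc` are made up for this illustration.

```python
import math


def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f over [a, b] with n subintervals."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def ftc(F, a, b):
    """Evaluate a definite integral from an antiderivative F."""
    return F(b) - F(a)


# Example 1: integral of 2x from 1 to 3, with antiderivative x^2
riemann_1 = midpoint_integral(lambda x: 2 * x, 1.0, 3.0)
exact_1 = ftc(lambda x: x**2, 1.0, 3.0)

# Example 2: integral of e^x from 0 to 1, with antiderivative e^x
riemann_2 = midpoint_integral(math.exp, 0.0, 1.0)
exact_2 = ftc(math.exp, 0.0, 1.0)

print(f"2x on [1, 3]:  Riemann = {riemann_1:.6f}, F(b) - F(a) = {exact_1:.6f}")
print(f"e^x on [0, 1]: Riemann = {riemann_2:.6f}, F(b) - F(a) = {exact_2:.6f}")
```

The Riemann sum needs ten thousand function evaluations to get close; the antiderivative route needs two.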


Why This Matters for ML

Probability density functions. The probability foundation module (r1) stated that a PDF $f(x)$ must satisfy $\int_{-\infty}^{\infty} f(x)\,dx = 1$. This normalization condition is an integral. For many distributions it can be verified directly with an antiderivative. The Gaussian is a notable exception: $e^{-x^2}$ has no elementary antiderivative, so the standard proof instead squares the integral and evaluates it in polar coordinates, giving $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$.

For any PDF, the probability of $X$ falling in $[a,b]$ is:

$$P(a \leq X \leq b) = \int_a^b f(x)\,dx$$
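For a distribution whose PDF has a simple antiderivative, the interval probability follows directly from the Fundamental Theorem. The sketch below uses an exponential distribution, $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$, with antiderivative $-e^{-\lambda x}$. The distribution, the rate `rate = 2.0`, and the interval are assumptions chosen for illustration.

```python
import math

rate = 2.0  # hypothetical rate parameter (lambda) of the exponential distribution


def pdf(x):
    """Exponential PDF: rate * exp(-rate * x) for x >= 0, zero otherwise."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def antiderivative(x):
    """An antiderivative of the PDF on x >= 0."""
    return -math.exp(-rate * x)


def midpoint_integral(f, a, b, n=100_000):
    """Midpoint Riemann sum of f over [a, b] with n subintervals."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


a, b = 0.5, 1.5
prob_exact = antiderivative(b) - antiderivative(a)  # F(b) - F(a)
prob_numeric = midpoint_integral(pdf, a, b)

# Normalization: the upper limit 20 stands in for infinity (the tail beyond is ~e^-40)
total_numeric = midpoint_integral(pdf, 0.0, 20.0)

print(f"P({a} <= X <= {b}): exact = {prob_exact:.6f}, numeric = {prob_numeric:.6f}")
print(f"Total probability (numeric): {total_numeric:.6f}")
```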

Expected value. The expected value of a continuous random variable is an integral:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$

This is the continuous analogue of the discrete weighted average: infinitely many values, each weighted by its probability density.
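To make the analogy concrete, the sketch below computes a discrete weighted average (a fair six-sided die) next to a continuous one (a Gaussian with mean `mu`). The parameter values and the truncation of the infinite range to ten standard deviations either side of the mean are assumptions made for this illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """PDF of the normal distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def midpoint_integral(f, a, b, n=20_000):
    """Midpoint Riemann sum of f over [a, b] with n subintervals."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


# Discrete case: sum of value * probability
die_mean = sum(k * (1 / 6) for k in range(1, 7))

# Continuous case: integral of x * f(x) dx
mu, sigma = 1.5, 0.8  # hypothetical parameters
lo, hi = mu - 10 * sigma, mu + 10 * sigma  # finite stand-in for (-inf, inf)
continuous_mean = midpoint_integral(lambda x: x * normal_pdf(x, mu, sigma), lo, hi)

print(f"Fair die mean:       {die_mean:.4f}")
print(f"Gaussian mean (num): {continuous_mean:.4f}  (parameter mu = {mu})")
```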

Entropy. The entropy of a continuous distribution (its differential entropy):

$$H = -\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx$$

KL divergence between distributions $p$ and $q$:

$$D_{\text{KL}}(p \| q) = \int_{-\infty}^{\infty} p(x) \ln \frac{p(x)}{q(x)}\,dx$$
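Both integrals can be checked numerically for Gaussians, where closed forms are known: $H = \tfrac{1}{2}\ln(2\pi e \sigma^2)$ for $\mathcal{N}(\mu, \sigma^2)$, and $D_{\text{KL}} = \ln\frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}$ for two Gaussians. The sketch below is a minimal comparison; the specific parameters and the integration range $[-12, 12]$ are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """PDF of the normal distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def midpoint_integral(f, a, b, n=20_000):
    """Midpoint Riemann sum of f over [a, b] with n subintervals."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


# Hypothetical parameters: p = N(0, 1), q = N(1, 2^2)
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 2.0
lo, hi = -12.0, 12.0  # finite stand-in for (-inf, inf); p is negligible outside


def p(x):
    return normal_pdf(x, mu_p, sigma_p)


def q(x):
    return normal_pdf(x, mu_q, sigma_q)


# Entropy of p: -integral of p ln p
entropy_numeric = -midpoint_integral(lambda x: p(x) * math.log(p(x)), lo, hi)
entropy_closed = 0.5 * math.log(2 * math.pi * math.e * sigma_p**2)

# KL(p || q): integral of p ln(p / q)
kl_numeric = midpoint_integral(lambda x: p(x) * math.log(p(x) / q(x)), lo, hi)
kl_closed = (
    math.log(sigma_q / sigma_p)
    + (sigma_p**2 + (mu_p - mu_q) ** 2) / (2 * sigma_q**2)
    - 0.5
)

print(f"Entropy: numeric = {entropy_numeric:.6f}, closed form = {entropy_closed:.6f}")
print(f"KL:      numeric = {kl_numeric:.6f}, closed form = {kl_closed:.6f}")
```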

All the probabilistic quantities that appear in the probability foundations module are, at their core, integrals. The $\sum$ of the discrete case and the $\int$ of the continuous case express the same idea: summing a quantity weighted by probability, just over different kinds of domains.


The Relationship Between Derivatives and Integrals

The Fundamental Theorem has a second part that makes explicit the inverse relationship:

$$\frac{d}{dx}\int_a^x f(t)\,dt = f(x)$$

Differentiating the accumulated area function recovers the original function. Differentiation and integration undo each other (up to an additive constant when integrating a derivative); they are inverse operations, much like multiplication and division.
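The second part of the theorem can also be observed numerically: build the accumulated-area function with a Riemann sum, differentiate it with a finite difference, and compare with the integrand. In the sketch below the integrand $\cos t$, the lower limit $a = 0$, and the step sizes are illustrative assumptions.

```python
import math


def f(t):
    """Integrand used for the illustration."""
    return math.cos(t)


def accumulated_area(x, a=0.0, n=2_000):
    """A(x): midpoint Riemann sum approximating the integral of f from a to x."""
    dx = (x - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def derivative(G, x, h=1e-3):
    """Central-difference approximation of G'(x)."""
    return (G(x + h) - G(x - h)) / (2 * h)


points = (0.5, 1.0, 2.0)
recovered = {x: derivative(accumulated_area, x) for x in points}

for x in points:
    print(f"x = {x}:  d/dx A(x) = {recovered[x]:.6f},  f(x) = {f(x):.6f}")
```

Both steps here are approximations, so the agreement is only as good as the chosen `n` and `h`; the theorem itself states exact equality.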

The two operations play complementary roles in ML:

  • Differentiation drives gradient descent, which navigates the loss surface
  • Integration underlies probability densities, which describe uncertainty

Both are indispensable, and the Fundamental Theorem is what ties them together.


PyTorch and TensorFlow

Numerical integration appears in ML for computing normalizing constants, estimating expectations via Monte Carlo, and evaluating metrics.

PyTorch:

```python
import torch

# Numerical integration: P(-1 <= X <= 1) for X ~ N(0, 1),
# using the trapezoidal rule on a grid of 1000 points
x = torch.linspace(-1, 1, 1000)
f = torch.exp(-x**2 / 2) / (2 * torch.pi) ** 0.5  # N(0, 1) PDF
prob = torch.trapezoid(f, x)
print(prob.item())  # ≈ 0.6827  (the 68% rule)

# Monte Carlo expected value: E[X²] for X ~ U(0, 1)
samples = torch.rand(100_000)
print((samples**2).mean().item())  # ≈ 1/3  (exact: ∫₀¹ x² dx = 1/3)
```

TensorFlow:

```python
import numpy as np
import tensorflow as tf

# Numerical integration using NumPy (TF tensors use the same approach)
x = np.linspace(-1, 1, 1000)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # N(0, 1) PDF
# NumPy 2.0 added np.trapezoid and deprecated np.trapz; use whichever exists
trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz
prob = trapezoid(f, x)
print(prob)  # ≈ 0.6827

# Monte Carlo estimate of E[X²] for X ~ U(0, 1)
samples = tf.random.uniform((100_000,))
print(tf.reduce_mean(samples**2).numpy())  # ≈ 0.333
```

Monte Carlo integration (estimating integrals by averaging function values at random points) is one of the most powerful and widely used techniques in probabilistic ML. It converts the integral $\mathbb{E}_p[f(x)] = \int f(x)\,p(x)\,dx$ into a sample average over draws from $p$, making continuous expectations tractable even in high dimensions.
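A minimal framework-free sketch of the same idea: estimate $\mathbb{E}_p[x^2]$ for $p = \mathcal{N}(0, 1)$, whose exact value is 1 (the variance), at several sample sizes. The helper `monte_carlo_expectation`, the fixed seed, and the sample sizes are assumptions made for this illustration.

```python
import random


def monte_carlo_expectation(f, sampler, num_samples):
    """Estimate E[f(X)] by averaging f over num_samples draws from sampler()."""
    return sum(f(sampler()) for _ in range(num_samples)) / num_samples


rng = random.Random(0)  # fixed seed so the run is reproducible


def sample_standard_normal():
    """Draw one sample from N(0, 1)."""
    return rng.gauss(0.0, 1.0)


exact = 1.0  # E[X^2] = Var(X) = 1 for X ~ N(0, 1)
estimates = {}
for num_samples in (100, 10_000, 100_000):
    estimates[num_samples] = monte_carlo_expectation(
        lambda x: x * x, sample_standard_normal, num_samples
    )
    error = abs(estimates[num_samples] - exact)
    print(f"N = {num_samples:7d}  estimate = {estimates[num_samples]:.4f}  |error| = {error:.4f}")
```

The typical error of such an estimate shrinks in proportion to $1/\sqrt{N}$ regardless of the dimension of $x$, which is why the method scales to problems where grid-based rules like the trapezoidal rule become impractical.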