Prerequisite · Calculus Foundations

The Chain Rule

By the end of this reading you will be able to:

Apply the chain rule to differentiate a composite function f(g(x)), identifying the outer and inner functions
Extend the chain rule to chains of three or more composed functions
Connect the chain rule to backpropagation: explain how gradients flow backward through a neural network as a repeated application of the chain rule

Composed Functions

Most functions we care about are not simple polynomials — they are compositions: one function applied inside another. For example:

$h(x) = (x^2 + 1)^{10}$ — raise a polynomial to a power
$h(x) = \sin(3x^2)$ — apply sine to a polynomial
$h(x) = \sqrt{x^2 + 1}$ — take the square root of something

In each case, $h(x) = f(g(x))$ for some outer function $f$ and inner function $g$. The rules from r2 do not handle this, so we need the chain rule.

The Chain Rule

If $h(x) = f(g(x))$, where $g$ is differentiable at $x$ and $f$ is differentiable at $g(x)$, then:

$h'(x) = f'(g(x)) \cdot g'(x)$

In words: differentiate the outside, leave the inside alone, then multiply by the derivative of the inside.

In Leibniz notation, let $u = g(x)$ and $y = f(u)$. Then:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

This looks as if the $du$'s cancel. They do not literally cancel, because $\frac{dy}{du}$ and $\frac{du}{dx}$ are limits rather than ordinary fractions, but the intuition is sound and the notation makes the rule easy to remember.

Worked Examples

Example 1: $h(x) = (x^2 + 1)^{10}$

Outer function: $f(u) = u^{10}$, so $f'(u) = 10u^9$
Inner function: $g(x) = x^2 + 1$, so $g'(x) = 2x$

$h'(x) = 10(x^2+1)^9 \cdot 2x = 20x(x^2+1)^9$

Example 2: $h(x) = \sin(3x^2)$

Outer: $f(u) = \sin u$, so $f'(u) = \cos u$
Inner: $g(x) = 3x^2$, so $g'(x) = 6x$

$h'(x) = \cos(3x^2) \cdot 6x = 6x\cos(3x^2)$

Example 3: $h(x) = \sqrt{x^2 + 1} = (x^2+1)^{1/2}$

Outer: $f(u) = u^{1/2}$, so $f'(u) = \frac{1}{2}u^{-1/2}$
Inner: $g(x) = x^2+1$, so $g'(x) = 2x$

$h'(x) = \frac{1}{2}(x^2+1)^{-1/2} \cdot 2x = \frac{x}{\sqrt{x^2+1}}$
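One way to sanity-check the three derivatives above is to compare each closed form with a central finite difference. The sketch below does this in plain Python. The helper name `central_difference`, the step size `1e-6`, and the test point `x0 = 0.7` are illustrative assumptions, not part of the reading.

```python
import math

def central_difference(h, x, eps=1e-6):
    """Approximate h'(x) with a symmetric finite difference."""
    return (h(x + eps) - h(x - eps)) / (2 * eps)

# Each entry pairs a function with the derivative found via the chain rule.
examples = {
    "(x^2 + 1)^10": (lambda x: (x**2 + 1) ** 10,
                     lambda x: 20 * x * (x**2 + 1) ** 9),
    "sin(3x^2)": (lambda x: math.sin(3 * x**2),
                  lambda x: 6 * x * math.cos(3 * x**2)),
    "sqrt(x^2 + 1)": (lambda x: math.sqrt(x**2 + 1),
                      lambda x: x / math.sqrt(x**2 + 1)),
}

x0 = 0.7  # arbitrary test point (assumption)
results = {}
for name, (h, h_prime) in examples.items():
    analytic = h_prime(x0)               # chain-rule answer
    numeric = central_difference(h, x0)  # independent numerical estimate
    results[name] = (analytic, numeric)
    print(f"{name:>14}: analytic={analytic:.6f}  numeric={numeric:.6f}")
```

If the two columns disagree beyond a few decimal places, the usual culprit is a forgotten inner derivative.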

Chains of Three or More

The chain rule extends to any number of composed functions. For $h(x) = f(g(k(x)))$:

$h'(x) = f'(g(k(x))) \cdot g'(k(x)) \cdot k'(x)$

In Leibniz notation, with $v = k(x)$, $u = g(v)$, $y = f(u)$:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx}$

Each link in the chain contributes a multiplicative factor. The gradient of the output with respect to the input is the product of all the local derivatives along the path.
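To make "the product of all the local derivatives along the path" concrete, here is a minimal sketch that walks a three-link chain and multiplies the local derivatives as it goes. The function $h(x) = \sin((x^2+1)^3)$, the name `value_and_derivative`, and the test point are illustrative assumptions.

```python
import math

# Each link is (function, derivative), listed innermost first.
# h(x) = sin((x^2 + 1)^3) is an illustrative choice, not from the reading.
links = [
    (lambda x: x**2 + 1, lambda x: 2 * x),     # v = k(x)
    (lambda v: v**3,     lambda v: 3 * v**2),  # u = g(v)
    (math.sin,           math.cos),            # y = f(u)
]

def value_and_derivative(links, x):
    """Evaluate the composition and its derivative at x.

    The derivative is the product of each link's local derivative,
    evaluated at that link's own input.
    """
    value, derivative = x, 1.0
    for fn, fn_prime in links:
        derivative *= fn_prime(value)  # local derivative at the current input
        value = fn(value)              # move one link outward
    return value, derivative

x0 = 0.5
y, dy_dx = value_and_derivative(links, x0)

# Closed form from the three-function chain rule, for comparison.
expected = math.cos((x0**2 + 1) ** 3) * 3 * (x0**2 + 1) ** 2 * 2 * x0
print(dy_dx, expected)
```

This sketch accumulates the product from the inside out. Backpropagation multiplies the same factors starting from the output instead, which gives the same result for a scalar chain.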

The Chain Rule and Backpropagation

A feedforward neural network is literally a chain of composed functions. For a simple two-layer network:

$\hat{y} = f_2\!\left(W_2\, f_1\!\left(W_1 x + b_1\right) + b_2\right)$

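where $f_1$ and $f_2$ are activation functions. Name the intermediate values $z_1 = W_1 x + b_1$, $a_1 = f_1(z_1)$, $z_2 = W_2 a_1 + b_2$, and $\hat{y} = f_2(z_2)$. With $k(x) = W_1 x + b_1$ and $g(a) = W_2 a + b_2$, this is $h(x) = f_2(g(f_1(k(x))))$, a composition of four functions.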

To train the network, we need $\partial \mathcal{L} / \partial W_1$ — how does the loss at the output depend on the weights deep in the network? By the chain rule:

$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$

Backpropagation is exactly this: traverse the computation graph from output to input, accumulating the product of local derivatives at each step. The chain rule is not merely related to backprop. Backprop is the chain rule applied systematically, reusing each accumulated product so that no factor is recomputed.
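The five-factor product can be traced by hand on a scalar version of the two-layer network. The sketch below assumes scalar weights, $f_1 = \tanh$, $f_2$ as the identity, and a squared-error loss. None of these choices come from the reading, and the numeric values are arbitrary. A finite-difference estimate is included as an independent check.

```python
import math

def forward(w1, b1, w2, b2, x):
    """Forward pass of a scalar two-layer network; returns all intermediates."""
    z1 = w1 * x + b1
    a1 = math.tanh(z1)   # f_1 = tanh (assumption)
    z2 = w2 * a1 + b2
    y_hat = z2           # f_2 = identity (assumption)
    return z1, a1, z2, y_hat

def loss_fn(y_hat, y):
    return (y_hat - y) ** 2  # squared error (assumption)

# Illustrative values, not from the reading.
w1, b1, w2, b2 = 0.8, -0.1, 1.5, 0.2
x, y = 0.6, 1.0

z1, a1, z2, y_hat = forward(w1, b1, w2, b2, x)

# Backward pass: one local derivative per factor in the chain.
dL_dyhat = 2 * (y_hat - y)
dyhat_dz2 = 1.0          # derivative of the identity
dz2_da1 = w2
da1_dz1 = 1 - a1 ** 2    # derivative of tanh
dz1_dw1 = x

dL_dw1 = dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Finite-difference check on w1.
eps = 1e-6
loss_plus = loss_fn(forward(w1 + eps, b1, w2, b2, x)[3], y)
loss_minus = loss_fn(forward(w1 - eps, b1, w2, b2, x)[3], y)
numeric = (loss_plus - loss_minus) / (2 * eps)

print(dL_dw1, numeric)
```

With matrix-valued weights the same factors appear, but each becomes a Jacobian and the order of multiplication matters. The scalar case sidesteps that bookkeeping.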

A Concrete ML Example: Differentiating Through a Neuron

Consider a single neuron with weight $w$, input $x$, bias $b$, and sigmoid activation $\sigma(z) = 1/(1 + e^{-z})$:

$a = \sigma(wx + b)$

The loss is the squared error $\mathcal{L} = (a - y)^2$ for target $y$. We want $\partial \mathcal{L}/\partial w$.

By the chain rule: $\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$

where $z = wx + b$.

$\partial \mathcal{L}/\partial a = 2(a - y)$ (power rule on the loss)
$\partial a/\partial z = \sigma(z)(1 - \sigma(z))$ (sigmoid derivative — derived in r4)
$\partial z/\partial w = x$ (linear in $w$)

Multiplying: $\partial \mathcal{L}/\partial w = 2(a - y) \cdot \sigma(z)(1-\sigma(z)) \cdot x$

Each factor has a natural interpretation: how wrong the prediction is, how sensitive the activation is, and how strongly the input influenced the pre-activation. This three-way product structure appears throughout deep learning.
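The three factors can be computed one at a time in plain Python and checked against a finite difference. This is a minimal sketch. The variable names `error_term`, `sensitivity`, and `input_term` are labels chosen here for the three interpretations, and the numeric values are the ones used in the PyTorch snippet later in this reading.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_loss(w, x, b, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

w, x, b, y = 0.5, 2.0, 0.0, 1.0

z = w * x + b
a = sigmoid(z)

error_term = 2 * (a - y)   # dL/da: how wrong the prediction is
sensitivity = a * (1 - a)  # da/dz: how sensitive the activation is
input_term = x             # dz/dw: the input itself

grad_w = error_term * sensitivity * input_term

# Independent check with a central finite difference.
eps = 1e-6
numeric = (neuron_loss(w + eps, x, b, y) - neuron_loss(w - eps, x, b, y)) / (2 * eps)

print(f"{grad_w:.4f}")  # prints -0.2115
```

The gradient is negative because the prediction $a \approx 0.73$ is below the target $y = 1$, so increasing $w$ lowers the loss.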

PyTorch and TensorFlow

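In PyTorch, autograd builds the computation graph during the forward pass and applies the chain rule backward when .backward() is called. TensorFlow records the forward pass on a tf.GradientTape and applies the chain rule when tape.gradient() is called. After either call you can inspect individual gradients: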

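```python
import torch

# Only w requires a gradient; x_val, b, and y are treated as constants.
w = torch.tensor(0.5, requires_grad=True)
x_val = torch.tensor(2.0)
b = torch.tensor(0.0)
y = torch.tensor(1.0)

# Forward pass: autograd records each operation as it runs.
z = w * x_val + b
a = torch.sigmoid(z)
loss = (a - y) ** 2

# Backward pass: applies the chain rule from loss back to w.
loss.backward()
print(w.grad.item())  # approximately -0.2115
# PyTorch computed dL/dw = 2(a-y) · σ(z)(1-σ(z)) · x automatically
```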

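```python
import tensorflow as tf

w = tf.Variable(0.5)
x_val = tf.constant(2.0)
b = tf.constant(0.0)  # bias, matching the PyTorch example
y = tf.constant(1.0)

# Operations inside the tape are recorded for differentiation.
with tf.GradientTape() as tape:
    z = w * x_val + b
    a = tf.sigmoid(z)
    loss = (a - y) ** 2

print(tape.gradient(loss, w).numpy())  # same result via chain rule
```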

The framework never needs a closed-form expression for $\partial \mathcal{L}/\partial w$. It only needs the local derivative of each primitive operation, and it computes the full gradient by walking backward through the recorded operations and applying the chain rule at each step.
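The idea of recording operations and then walking backward can be illustrated with a toy reverse-mode differentiator. The `Value` class below is a hypothetical teaching sketch written for this reading. It is not how PyTorch or TensorFlow are implemented, and it supports only the few scalar operations the neuron example needs.

```python
import math

class Value:
    """A scalar that remembers how it was computed (hypothetical sketch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, (self,), (s * (1.0 - s),))

    def backward(self):
        # Topologically order the graph so that a node is processed
        # only after every node that depends on it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local  # one chain-rule step


w, x, b, y = Value(0.5), Value(2.0), Value(0.0), Value(1.0)
diff = (w * x + b).sigmoid() - y
loss = diff * diff
loss.backward()
print(f"{w.grad:.4f}")  # prints -0.2115
```

Each operation stores only its own local derivatives, and `backward` never sees a formula for the whole gradient. The `+=` matters because a value used more than once, such as `diff` here, receives a contribution from each use.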

References

MIT 18.01SC — Session 11 — The Chain Rule
