The Chain Rule
- Apply the chain rule to differentiate a composite function f(g(x)), identifying the outer and inner functions
- Extend the chain rule to chains of three or more composed functions
- Connect the chain rule to backpropagation: explain how gradients flow backward through a neural network as a repeated application of the chain rule
Composed Functions
Most functions we care about are not simple polynomials — they are compositions: one function applied inside another. For example:
- Raising a polynomial to a power
- Applying sine to a polynomial
- Taking the square root of an expression
In each case, the function has the form $f(g(x))$ for some outer function $f$ and inner function $g$. The rules from r2 do not handle this, so we need the chain rule.
The Chain Rule
If $F(x) = f(g(x))$, then:
$$F'(x) = f'(g(x)) \cdot g'(x)$$
In words: differentiate the outside, leave the inside alone, then multiply by the derivative of the inside.
In Leibniz notation, let $y = f(u)$ and $u = g(x)$. Then:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
This looks like the $du$'s cancel. They do not literally cancel (they are not fractions), but the intuition is sound and the notation makes the rule easy to remember.
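A quick way to build trust in the rule is to compare it against a numerical derivative. The sketch below assumes an outer function sin(u) and an inner function x² + 1; both are chosen only for illustration, and the helper names are hypothetical.

```python
import math

# Assumed example functions (not from the text): f(u) = sin(u), g(x) = x^2 + 1
def f(u):
    return math.sin(u)

def f_prime(u):
    return math.cos(u)

def g(x):
    return x * x + 1.0

def g_prime(x):
    return 2.0 * x

def chain_rule_derivative(x):
    """Derivative of f(g(x)) via the chain rule: f'(g(x)) * g'(x)."""
    return f_prime(g(x)) * g_prime(x)

def central_difference(func, x, eps=1e-6):
    """Numerical derivative, used only as an independent check."""
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)

x0 = 0.7
analytic = chain_rule_derivative(x0)
numeric = central_difference(lambda x: f(g(x)), x0)
print(analytic, numeric)  # the two values should agree to several decimal places
```

Note that the outer derivative is evaluated at g(x), not at x. Forgetting this is the most common mistake when applying the rule by hand.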
Worked Examples
Example 1:
Outer function: , so Inner function: , so
Example 2:
Outer: , so Inner: , so
Example 3:
Outer: , so Inner: , so
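As an illustration of the outer/inner decomposition, here is one instance of each pattern named earlier: a polynomial raised to a power, sine of a polynomial, and a square root. The specific functions are assumptions chosen for illustration only.

```latex
\begin{align*}
y = (3x^2 + 1)^5:\quad & f(u) = u^5,\ f'(u) = 5u^4;\quad g(x) = 3x^2 + 1,\ g'(x) = 6x \\
& \frac{dy}{dx} = 5(3x^2 + 1)^4 \cdot 6x = 30x\,(3x^2 + 1)^4 \\[6pt]
y = \sin(x^3 + 2x):\quad & f(u) = \sin u,\ f'(u) = \cos u;\quad g(x) = x^3 + 2x,\ g'(x) = 3x^2 + 2 \\
& \frac{dy}{dx} = \cos(x^3 + 2x)\,(3x^2 + 2) \\[6pt]
y = \sqrt{x^2 + 1}:\quad & f(u) = u^{1/2},\ f'(u) = \tfrac{1}{2}u^{-1/2};\quad g(x) = x^2 + 1,\ g'(x) = 2x \\
& \frac{dy}{dx} = \frac{2x}{2\sqrt{x^2 + 1}} = \frac{x}{\sqrt{x^2 + 1}}
\end{align*}
```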
Chains of Three or More
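The chain rule extends to any number of composed functions. For $F(x) = f(g(h(x)))$:
$$F'(x) = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$
In Leibniz notation, with $y = f(u)$, $u = g(v)$, $v = h(x)$:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx}$$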
Each link in the chain contributes a multiplicative factor. The gradient of the output with respect to the input is the product of all the local derivatives along the path.
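The "product of local derivatives" idea can be sketched as a loop over the links of the chain. The three functions below (3x + 1, then squaring, then sine) and the helper `value_and_derivative` are assumptions made for illustration.

```python
import math

# Assumed chain, innermost first: h(x) = 3x + 1, g(v) = v^2, f(u) = sin(u).
# Each entry pairs a function with its derivative.
links = [
    (lambda x: 3.0 * x + 1.0, lambda x: 3.0),
    (lambda v: v * v, lambda v: 2.0 * v),
    (math.sin, math.cos),
]

def value_and_derivative(links, x):
    """Evaluate the composition and the product of local derivatives."""
    value = x
    derivative = 1.0
    for func, func_prime in links:
        # Local derivative of this link, evaluated at the link's own input
        derivative *= func_prime(value)
        value = func(value)
    return value, derivative

x0 = 0.3
y, dy_dx = value_and_derivative(links, x0)
print(y, dy_dx)
```

This loop multiplies the factors from the inside out. For a scalar chain the product is the same in either order; backpropagation multiplies from the outside in because that order is cheaper when there are many inputs and one output.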
The Chain Rule and Backpropagation
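A feedforward neural network is literally a chain of composed functions. For a simple two-layer network:
$$\hat{y} = \sigma_2(W_2\,\sigma_1(W_1 x + b_1) + b_2)$$
where $\sigma_1$ and $\sigma_2$ are activation functions. This is $f_2(f_1(x))$, a composition.
To train the network, we need $\partial L / \partial W_1$: how does the loss at the output depend on the weights deep in the network? By the chain rule, writing $h_1 = \sigma_1(W_1 x + b_1)$ for the hidden-layer output:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}$$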
Backpropagation is exactly this: traverse the computation graph from output to input, accumulating the product of local derivatives at each step. The chain rule is not merely related to backprop — it is backprop.
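To make the output-to-input traversal concrete, here is a minimal sketch of a hand-written backward pass for a scalar two-layer network. The tanh and sigmoid activations, the squared-error loss, the parameter values, and the function names are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, b1, w2, b2, x, y):
    """Forward pass; returns the loss and the values the backward pass needs."""
    z1 = w1 * x + b1
    h = math.tanh(z1)          # hidden activation (assumed tanh)
    z2 = w2 * h + b2
    y_hat = sigmoid(z2)        # output activation (assumed sigmoid)
    loss = (y_hat - y) ** 2    # assumed squared-error loss
    return loss, (x, h, y_hat, w2, y)

def backward(cache):
    """Walk from the loss back to the weights, multiplying local derivatives."""
    x, h, y_hat, w2, y = cache
    dL_dyhat = 2.0 * (y_hat - y)
    dL_dz2 = dL_dyhat * y_hat * (1.0 - y_hat)   # through the sigmoid
    dL_dw2 = dL_dz2 * h                          # z2 = w2*h + b2
    dL_dh = dL_dz2 * w2
    dL_dz1 = dL_dh * (1.0 - h * h)               # through tanh
    dL_dw1 = dL_dz1 * x                          # z1 = w1*x + b1
    return dL_dw1, dL_dw2

w1, b1, w2, b2, x, y = 0.4, 0.1, -0.6, 0.2, 1.5, 1.0
loss, cache = forward(w1, b1, w2, b2, x, y)
grad_w1, grad_w2 = backward(cache)
print(loss, grad_w1, grad_w2)
```

The intermediate quantity `dL_dz2` is computed once and reused for both weight gradients. That reuse of shared factors is what makes the backward traversal efficient.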
A Concrete ML Example: Differentiating Through a Neuron
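Consider a single neuron with weight $w$, input $x$, bias $b$, and sigmoid activation $\sigma$:
$$a = \sigma(wx + b)$$
The loss is the squared error $L = (a - y)^2$ for target $y$. We want $\partial L / \partial w$.
By the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
where $z = wx + b$.
- $\frac{\partial L}{\partial a} = 2(a - y)$ (power rule on the loss)
- $\frac{\partial a}{\partial z} = \sigma(z)(1 - \sigma(z))$ (sigmoid derivative, derived in r4)
- $\frac{\partial z}{\partial w} = x$ (linear in $w$)
Multiplying:
$$\frac{\partial L}{\partial w} = 2(a - y) \cdot \sigma(z)(1 - \sigma(z)) \cdot x$$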
Each factor has a natural interpretation: how wrong the prediction is, how sensitive the activation is, and how strongly the input influenced the pre-activation. This three-way product structure appears throughout deep learning.
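The three factors can be computed by hand in a few lines. This sketch uses w = 0.5, x = 2.0, b = 0.0, y = 1.0, the same values as the framework snippets in the next section; the function name `neuron_gradient` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_gradient(w, x, b, y):
    """Return the three local derivatives and their product dL/dw."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)     # how wrong the prediction is
    da_dz = a * (1.0 - a)     # how sensitive the activation is
    dz_dw = x                 # how strongly the input influenced z
    return dL_da, da_dz, dz_dw, dL_da * da_dz * dz_dw

dL_da, da_dz, dz_dw, grad = neuron_gradient(w=0.5, x=2.0, b=0.0, y=1.0)
print(dL_da, da_dz, dz_dw)
print(grad)  # approximately -0.2115
```

The gradient is negative because the prediction (about 0.73) is below the target of 1.0, so increasing w would reduce the loss.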
PyTorch and TensorFlow
PyTorch's autograd builds the computation graph during the forward pass and applies the chain rule backward when .backward() is called. TensorFlow's tf.GradientTape does the same for operations executed inside its context. You can inspect individual gradients:
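import torch

# w is the parameter we differentiate with respect to
w = torch.tensor(0.5, requires_grad=True)
x_val = torch.tensor(2.0)
b = torch.tensor(0.0)
y = torch.tensor(1.0)

# Forward pass: autograd records each operation
z = w * x_val + b
a = torch.sigmoid(z)
loss = (a - y) ** 2

# Backward pass: chain rule applied from loss back to w
loss.backward()
print(w.grad.item())  # approximately -0.2115
# PyTorch computed dL/dw = 2(a-y) · σ(z)(1-σ(z)) · x automatically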
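import tensorflow as tf

w = tf.Variable(0.5)
x_val = tf.constant(2.0)
y = tf.constant(1.0)

# Operations executed inside the tape context are recorded for differentiation
with tf.GradientTape() as tape:
    z = w * x_val  # bias omitted; it is 0.0 in the PyTorch version
    a = tf.sigmoid(z)
    loss = (a - y) ** 2

print(tape.gradient(loss, w).numpy())  # same result via chain rule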
The framework never needs to know the analytic form of $\partial L / \partial w$. It computes it by walking backward through the recorded operations and applying the chain rule at each step.
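The record-then-walk-backward idea can be sketched in plain Python. The `Value` class below is a hypothetical teaching helper, not the API of PyTorch or TensorFlow; it supports only the handful of operations the neuron example needs.

```python
import math

class Value:
    """Scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each entry: (parent Value, local derivative of self w.r.t. parent)
        self.parents = parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Value(self.data - other.data, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def square(self):
        return Value(self.data ** 2, ((self, 2.0 * self.data),))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, ((self, s * (1.0 - s)),))

    def backward(self):
        # Order nodes so that every node comes after all of its parents
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from output to inputs, applying the chain rule at each edge
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local

w, x, b, y = Value(0.5), Value(2.0), Value(0.0), Value(1.0)
z = w * x + b
a = z.sigmoid()
loss = (a - y).square()
loss.backward()
print(w.grad)  # approximately -0.2115
```

Each operation stores only its own local derivative. The full gradient emerges from the backward walk, which is why no analytic formula for the whole expression is ever written down.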