Prerequisite · Calculus Foundations

Differentiation Rules

By the end of this reading you will be able to:

Apply the constant-multiple, sum, product, and quotient rules to differentiate combinations of functions
Recall the derivatives of sin x and cos x and apply them with the product and quotient rules
Compute second derivatives and interpret their sign in terms of concavity

From Single Terms to Combinations

The power rule handles $x^n$ in isolation. Real functions — loss functions, activations, weight penalty terms — are built by combining simpler pieces: sums, products, quotients. This reading gives the rules for differentiating those combinations.

Constant and Sum Rules

Constant multiple: Scaling a function scales its derivative: $\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x)$

Sum/difference: The derivative distributes over addition: $\frac{d}{dx}[f(x) \pm g(x)] = f'(x) \pm g'(x)$

Together, these let us differentiate any polynomial term by term:

Example: for $f(x) = 5x^3 - 2x + 7$ , $f'(x) = 5 \cdot 3x^2 - 2 \cdot 1 + 0 = 15x^2 - 2$

Each term is differentiated independently. The constant $7$ vanishes because a flat term contributes zero slope.
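A quick numerical check of the term-by-term result can be sketched in plain Python. The helper name `central_diff` and the step size `h` are illustrative choices, not part of this reading; the finite difference only approximates the derivative.

```python
def central_diff(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric (central) finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return 5 * x**3 - 2 * x + 7

def f_prime(x):
    # Term by term: 5 * 3x^2 - 2 * 1 + 0
    return 15 * x**2 - 2

# Compare the hand-derived formula with the numerical slope at a few points.
for x in (-2.0, 0.0, 1.5):
    print(x, f_prime(x), central_diff(f, x))
```

The two columns should agree to several decimal places; any small gap comes from the finite step size and floating-point rounding.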

The Product Rule

Differentiating a product $f(x) \cdot g(x)$ is not simply $f'(x) \cdot g'(x)$ . The correct rule:

$\frac{d}{dx}[f(x)\,g(x)] = f'(x)\,g(x) + f(x)\,g'(x)$

Intuition. Think of $f$ and $g$ as the side lengths of a rectangle with area $A = f \cdot g$ . If both sides grow by small increments $df$ and $dg$ , the new area gains two strips, $f\,dg$ and $g\,df$ , plus a tiny corner $df \cdot dg$ that becomes negligible relative to the strips as the increments shrink. Dividing by $dx$ and taking the limit gives $dA/dx = f\,g' + g\,f'$ .

Example: for $h(x) = x^2 \sin x$ , $h'(x) = 2x \cdot \sin x + x^2 \cdot \cos x$

Each term differentiates one factor while leaving the other unchanged, and both terms must appear.
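The rectangle picture can be explored numerically. The sketch below uses the example $h(x) = x^2 \sin x$ at $x = 1$ and splits the change in area into the two strips and the corner. The function name `area_increment` and the chosen increments are hypothetical, picked only for illustration.

```python
import math

def f(x):
    return x**2

def g(x):
    return math.sin(x)

def area_increment(x, dx):
    """Split the change in f*g over [x, x + dx] into strips and corner."""
    df = f(x + dx) - f(x)
    dg = g(x + dx) - g(x)
    strips = f(x) * dg + g(x) * df  # the two thin strips
    corner = df * dg                # the small corner rectangle
    return strips, corner

x = 1.0
product_rule = 2 * x * math.sin(x) + x**2 * math.cos(x)  # f'g + fg'
print("product rule:", product_rule)

for dx in (1e-1, 1e-2, 1e-3):
    strips, corner = area_increment(x, dx)
    # Per unit dx, the strips approach f'g + fg' while the corner goes to 0.
    print(dx, strips / dx, corner / dx)
```

As `dx` shrinks, the strips column settles toward the product-rule value and the corner column shrinks roughly in proportion to `dx`.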

The Quotient Rule

For $h(x) = f(x)/g(x)$ , wherever $g(x) \neq 0$ :

$h'(x) = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{[g(x)]^2}$

A useful mnemonic: "low d-high minus high d-low, over low squared" — where "high" is the numerator $f$ and "low" is the denominator $g$ .

Example: $h(x) = \frac{x^2}{x + 1}$

$h'(x) = \frac{2x(x+1) - x^2 \cdot 1}{(x+1)^2} = \frac{x^2 + 2x}{(x+1)^2}$

Note the minus sign and the order of the terms: $f'(x)\,g(x)$ comes first, and $f(x)\,g'(x)$ is subtracted from it, not added. Swapping the order or flipping the sign is a common error.
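To see how much the sign matters, the sketch below compares the quotient-rule result for $h(x) = x^2/(x+1)$ with a numerical slope and with a deliberately wrong version that adds instead of subtracts. The helper names (`central_diff`, `h_prime_wrong_sign`) and the test point $x = 2$ are illustrative assumptions.

```python
def central_diff(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric (central) finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def h_func(x):
    return x**2 / (x + 1)

def h_prime(x):
    # Quotient rule: (2x(x + 1) - x^2 * 1) / (x + 1)^2
    return (x**2 + 2 * x) / (x + 1) ** 2

def h_prime_wrong_sign(x):
    # Deliberate mistake: the second term is added instead of subtracted.
    return (2 * x * (x + 1) + x**2) / (x + 1) ** 2

x = 2.0
print("numerical slope:", central_diff(h_func, x))
print("quotient rule:  ", h_prime(x))             # 8/9
print("wrong sign:     ", h_prime_wrong_sign(x))  # 16/9
```

Only the correctly signed formula matches the numerical slope; at this point the wrong-sign version is off by a factor of two.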

Derivatives of Trigonometric Functions

Since you know trigonometry, these belong in your toolkit:

$\frac{d}{dx}[\sin x] = \cos x \qquad \frac{d}{dx}[\cos x] = -\sin x$

$\frac{d}{dx}[\tan x] = \sec^2 x$

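The sine and cosine results follow from the limit definition using the angle-addition identities for $\sin(x+h)$ and $\cos(x+h)$ ; the tangent result then follows from the quotient rule applied to $\tan x = \sin x / \cos x$ . Here we only need the results.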
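As a worked check, the tangent derivative can be recovered from the sine and cosine derivatives with the quotient rule, taking $f(x) = \sin x$ and $g(x) = \cos x$ (valid wherever $\cos x \neq 0$ ):

$\frac{d}{dx}[\tan x] = \frac{d}{dx}\!\left[\frac{\sin x}{\cos x}\right] = \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x}$

$= \frac{\cos^2 x + \sin^2 x}{\cos^2 x} = \frac{1}{\cos^2 x} = \sec^2 x$

The last step uses the identity $\sin^2 x + \cos^2 x = 1$ .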

Pattern: Differentiating repeatedly cycles $\sin$ and $\cos$ with alternating signs: $\sin x \xrightarrow{d/dx} \cos x \xrightarrow{d/dx} -\sin x \xrightarrow{d/dx} -\cos x \xrightarrow{d/dx} \sin x$
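The four-step cycle can be traced with a small sketch. It represents $a \sin x + b \cos x$ by the coefficient pair `(a, b)`; this representation and the function name `differentiate` are assumptions made for illustration, not something the reading defines.

```python
def differentiate(coeffs):
    """Differentiate a*sin(x) + b*cos(x), given as the pair (a, b).

    d/dx[a sin x + b cos x] = a cos x - b sin x, i.e. the pair (-b, a).
    """
    a, b = coeffs
    return (-b, a)

names = {(1, 0): "sin x", (0, 1): "cos x", (-1, 0): "-sin x", (0, -1): "-cos x"}

current = (1, 0)  # start from sin x
cycle = [current]
for _ in range(4):
    current = differentiate(current)
    cycle.append(current)

print(" -> ".join(names[c] for c in cycle))
```

After four derivatives the pair returns to `(1, 0)`, which is why the fourth derivative of $\sin x$ is $\sin x$ again.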

Higher Derivatives

Because $f'(x)$ is itself a function, we can differentiate it again:

$f''(x)$ — the second derivative: how rapidly the slope is changing
$f^{(n)}(x)$ — the $n$ th derivative, applying differentiation $n$ times

Alternate notation: $d^2y/dx^2$ for the second derivative.

Geometric meaning of $f''$ : If $f''(x) > 0$ , the slope is increasing — the graph curves upward (concave up). If $f''(x) < 0$ , the slope is decreasing — the graph curves downward (concave down). This will be central to detecting minima in r5.

Example: For $f(x) = x^3 - 3x$ : $f'(x) = 3x^2 - 3 \qquad f''(x) = 6x$

At $x = 1$ : $f'' = 6 > 0$ (concave up). At $x = -1$ : $f'' = -6 < 0$ (concave down).
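The sign test can also be tried numerically. The sketch below estimates $f''$ with a second-order central difference and labels the concavity. The helper names `second_diff` and `concavity`, the step size, and the tolerance are hypothetical choices.

```python
def second_diff(f, x, h=1e-3):
    """Approximate f''(x) with a central second difference."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def f(x):
    return x**3 - 3 * x

def concavity(value, tol=1e-6):
    """Classify the sign of a second-derivative estimate."""
    if value > tol:
        return "concave up"
    if value < -tol:
        return "concave down"
    return "no curvature detected"

for x in (1.0, -1.0, 0.0):
    d2 = second_diff(f, x)
    print(x, round(d2, 4), concavity(d2))
```

At $x = 0$ the estimate is zero, which matches $f''(0) = 6 \cdot 0 = 0$ : the sign test says nothing about concavity at that point.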

Why This Matters for ML

L2 weight penalty. The regularization term $\frac{\lambda}{2}w^2$ added to the loss has gradient: $\frac{d}{dw}\!\left[\frac{\lambda}{2}w^2\right] = \lambda w$

This is the "weight decay" effect: the regularization gradient pulls every weight toward zero with force proportional to its magnitude. The $\frac{1}{2}$ in the penalty is purely for notational convenience — it cancels the 2 from the power rule.
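A minimal sketch of this shrinking effect, assuming plain gradient descent applied to the penalty term alone. The learning rate `lr`, the starting weight, and the value of `lam` are hypothetical values chosen for illustration.

```python
lam = 0.01  # regularization strength (illustrative value)
lr = 0.1    # hypothetical learning rate
w = 0.5     # hypothetical starting weight

def penalty_grad(w, lam):
    # d/dw [(lam / 2) * w^2] = lam * w
    return lam * w

history = [w]
for step in range(3):
    # Each update subtracts lr * lam * w, so w is multiplied by (1 - lr * lam).
    w = w - lr * penalty_grad(w, lam)
    history.append(w)

print(history)
```

Every step multiplies the weight by the same factor $1 - \text{lr} \cdot \lambda$ , so under these assumptions the penalty alone shrinks the weight geometrically toward zero.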

Mean squared error. For a prediction $\hat{y}$ against target $y$ , the loss $(\hat{y} - y)^2$ has gradient: $\frac{d}{d\hat{y}}[(\hat{y} - y)^2] = 2(\hat{y} - y)$

The derivative is proportional to the error, and its sign shows whether the prediction is too high or too low. Large errors produce large gradients; small errors produce small gradients. Gradient updates under MSE therefore scale with the size of the mistake.
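The proportionality is easy to tabulate. The sketch below evaluates the squared-error gradient for a few predictions against a single target; the target and prediction values are hypothetical.

```python
def squared_error(y_hat, y):
    return (y_hat - y) ** 2

def squared_error_grad(y_hat, y):
    # Expanding gives y_hat^2 - 2*y*y_hat + y^2,
    # whose derivative with respect to y_hat is 2*y_hat - 2*y.
    return 2 * (y_hat - y)

y = 1.0  # hypothetical target
for y_hat in (1.1, 2.0, 5.0, 0.0):
    # Loss and gradient for each prediction.
    print(y_hat, squared_error(y_hat, y), squared_error_grad(y_hat, y))
```

Predictions above the target give a positive gradient and predictions below it give a negative one, with a magnitude equal to twice the error.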

PyTorch and TensorFlow

Automatic differentiation in both frameworks applies these rules internally. You can verify any of them:

import torch

# Product rule: d/dx[x² sin(x)] at x = 1
x = torch.tensor(1.0, requires_grad=True)
h = x**2 * torch.sin(x)
h.backward()
print(x.grad.item())
# = 2·sin(1) + 1²·cos(1) ≈ 2(0.8415) + 0.5403 ≈ 2.223

# L2 regularization gradient
w = torch.tensor(0.5, requires_grad=True)
lam = 0.01
loss = (lam / 2) * w**2
loss.backward()
print(w.grad.item())  # lam * w = 0.01 * 0.5 = 0.005 (printed with float32 rounding)

import tensorflow as tf

x = tf.Variable(1.0)
with tf.GradientTape() as tape:
    h = x**2 * tf.sin(x)
print(tape.gradient(h, x).numpy())  # ≈ 2.223

w = tf.Variable(0.5)
lam = 0.01
with tf.GradientTape() as tape:
    loss = (lam / 2) * w**2
print(tape.gradient(loss, w).numpy())  # 0.005

References

MIT 18.01SC — Sessions 6, 9–11 — Calculating Derivatives; Product, Quotient, and Chain Rules
