Prerequisite · Calculus Foundations

Differentiation Rules

By the end of this reading you will be able to:
  • Apply the constant, sum, product, and quotient rules to differentiate combinations of functions
  • Recall the derivatives of sin x and cos x and apply them with the product and quotient rules
  • Compute second derivatives and interpret their sign in terms of concavity

From Single Terms to Combinations

The power rule handles $x^n$ in isolation. Real functions (loss functions, activations, weight penalty terms) are built by combining simpler pieces: sums, products, quotients. This reading gives the rules for differentiating those combinations.


Constant and Sum Rules

Constant multiple: Scaling a function scales its derivative:

$$\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x)$$

Sum/difference: The derivative distributes over addition and subtraction:

$$\frac{d}{dx}[f(x) \pm g(x)] = f'(x) \pm g'(x)$$

Together, these let us differentiate any polynomial term by term:

Example: $f(x) = 5x^3 - 2x + 7$

$$f'(x) = 5 \cdot 3x^2 - 2 \cdot 1 + 0 = 15x^2 - 2$$

Each term is differentiated independently. The constant $7$ vanishes because a flat term contributes zero slope.
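The term-by-term procedure can be sketched in a few lines of Python. This is a minimal illustration rather than anything the reading itself defines: the helper name `differentiate_poly` and the lowest-degree-first coefficient list are assumptions chosen for this example.

```python
def differentiate_poly(coeffs):
    """Differentiate a polynomial given by its coefficients.

    coeffs[k] is the coefficient of x**k (lowest degree first), so
    [7, -2, 0, 5] stands for 7 - 2x + 0x^2 + 5x^3.
    Returns the coefficient list of the derivative in the same layout.
    """
    # Power rule plus constant-multiple rule on each term:
    #   d/dx [a_k * x^k] = k * a_k * x^(k-1)
    # The sum rule lets us treat the terms one at a time.
    # The k = 0 entry (the constant) becomes 0 and is dropped by the slice.
    return [k * a for k, a in enumerate(coeffs)][1:]


# f(x) = 5x^3 - 2x + 7, written lowest degree first
f_coeffs = [7, -2, 0, 5]
df_coeffs = differentiate_poly(f_coeffs)
print(df_coeffs)  # [-2, 0, 15], i.e. f'(x) = 15x^2 - 2
```

The result matches the worked example: the constant term disappears and the remaining coefficients are $-2$ and $15$.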


The Product Rule

Differentiating a product $f(x) \cdot g(x)$ is not simply $f'(x) \cdot g'(x)$. The correct rule:

$$\frac{d}{dx}[f(x)\,g(x)] = f'(x)\,g(x) + f(x)\,g'(x)$$

Intuition. Think of $f$ and $g$ as the side lengths of a rectangle with area $A = f \cdot g$. If both sides grow by small increments $df$ and $dg$, the new area gains two strips, $f\,dg$ and $g\,df$, plus a tiny corner $df \cdot dg$ that is negligible as the increments shrink. Dividing the added area by $dx$ gives $dA/dx \approx f\,g' + g\,f'$, which becomes exact in the limit.

Example: $h(x) = x^2 \sin x$

$$h'(x) = 2x \cdot \sin x + x^2 \cdot \cos x$$

Neither factor alone is enough — both must contribute.
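A quick numerical sanity check, assuming a central finite difference is an acceptable stand-in for the true derivative. The helper names `central_diff`, `h_prime`, and `product_of_derivatives` are illustrative, not from the reading. The sketch compares the product-rule result for $x^2 \sin x$ at $x = 1$ with the finite-difference estimate and with the tempting but incorrect $f'(x) \cdot g'(x)$.

```python
import math


def central_diff(fn, x, eps=1e-6):
    """Approximate fn'(x) with a symmetric finite difference."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


def h(x):
    return x**2 * math.sin(x)


def h_prime(x):
    # Product rule: f'(x) g(x) + f(x) g'(x) with f = x^2, g = sin x
    return 2 * x * math.sin(x) + x**2 * math.cos(x)


def product_of_derivatives(x):
    # The incorrect shortcut f'(x) * g'(x)
    return 2 * x * math.cos(x)


x0 = 1.0
numeric = central_diff(h, x0)
print("finite difference:      ", numeric)
print("product rule:           ", h_prime(x0))
print("product of derivatives: ", product_of_derivatives(x0))
```

The first two numbers should agree to several decimal places, while the third is visibly different.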


The Quotient Rule

For $h(x) = f(x)/g(x)$:

$$h'(x) = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{[g(x)]^2}$$

A useful mnemonic: "low d-high minus high d-low, over low squared", where "high" is the numerator $f$ and "low" is the denominator $g$.

Example: $h(x) = \frac{x^2}{x + 1}$

$$h'(x) = \frac{2x(x+1) - x^2 \cdot 1}{(x+1)^2} = \frac{x^2 + 2x}{(x+1)^2}$$

Note the minus sign: the term $f(x)\,g'(x)$ is subtracted, not added, so the order of the two numerator terms matters. Getting this sign wrong is a common error.
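One way to see where the minus sign comes from is to recover the quotient rule from the product rule. The sketch below assumes $h$ is differentiable and $g(x) \neq 0$; it is offered as a supporting derivation, not as the reading's own proof.

```latex
\begin{aligned}
f(x) &= h(x)\,g(x)
  && \text{(since } h = f/g\text{)} \\
f'(x) &= h'(x)\,g(x) + h(x)\,g'(x)
  && \text{(product rule)} \\
h'(x) &= \frac{f'(x) - h(x)\,g'(x)}{g(x)}
  && \text{(solve for } h'\text{)} \\
      &= \frac{f'(x) - \dfrac{f(x)}{g(x)}\,g'(x)}{g(x)}
       = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{[g(x)]^2}
\end{aligned}
```

The subtraction appears when $h(x)\,g'(x)$ is moved to the other side, which is why the $f\,g'$ term carries the minus sign.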


Derivatives of Trigonometric Functions

Since you know trigonometry, these belong in your toolkit:

$$\frac{d}{dx}[\sin x] = \cos x \qquad \frac{d}{dx}[\cos x] = -\sin x$$

$$\frac{d}{dx}[\tan x] = \sec^2 x$$

The sine and cosine results follow from the limit definition using the angle-addition identities for $\sin(x+h)$ and $\cos(x+h)$; the tangent result then follows from the quotient rule applied to $\sin x / \cos x$. The results are what we need in practice.
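As a worked instance of the quotient rule with the trigonometric derivatives, here is the $\tan x$ result written out, taking $f = \sin x$ and $g = \cos x$ and assuming $\cos x \neq 0$:

```latex
\begin{aligned}
\frac{d}{dx}[\tan x]
  &= \frac{d}{dx}\!\left[\frac{\sin x}{\cos x}\right] \\
  &= \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} \\
  &= \frac{\cos^2 x + \sin^2 x}{\cos^2 x}
   = \frac{1}{\cos^2 x}
   = \sec^2 x
\end{aligned}
```

The last step uses the identity $\sin^2 x + \cos^2 x = 1$.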

Pattern: Differentiating repeatedly cycles through $\sin$ and $\cos$, returning to the starting function after four steps:

$$\sin x \xrightarrow{d/dx} \cos x \xrightarrow{d/dx} -\sin x \xrightarrow{d/dx} -\cos x \xrightarrow{d/dx} \sin x$$


Higher Derivatives

Because $f'(x)$ is itself a function, we can differentiate it again:

  • $f''(x)$: the second derivative, which measures how rapidly the slope is changing
  • $f^{(n)}(x)$: the $n$th derivative, obtained by applying differentiation $n$ times

Alternate notation: $d^2y/dx^2$ for the second derivative.

Geometric meaning of $f''$: If $f''(x) > 0$, the slope is increasing and the graph curves upward (concave up). If $f''(x) < 0$, the slope is decreasing and the graph curves downward (concave down). This will be central to detecting minima in r5.

Example: For $f(x) = x^3 - 3x$:

$$f'(x) = 3x^2 - 3 \qquad f''(x) = 6x$$

At $x = 1$: $f'' = 6 > 0$ (concave up). At $x = -1$: $f'' = -6 < 0$ (concave down).
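The sign test can also be checked numerically. The sketch below assumes a second-order central difference is a good enough estimate of $f''$; the helper names `second_diff` and `concavity` are illustrative.

```python
def second_diff(fn, x, eps=1e-4):
    """Approximate fn''(x) with a second-order central difference."""
    return (fn(x + eps) - 2 * fn(x) + fn(x - eps)) / eps**2


def f(x):
    return x**3 - 3 * x


def concavity(fn, x):
    """Classify the graph at x by the sign of the estimated second derivative."""
    curvature = second_diff(fn, x)
    if curvature > 0:
        return "concave up"
    if curvature < 0:
        return "concave down"
    return "inconclusive"


for x0 in (1.0, -1.0):
    print(x0, second_diff(f, x0), concavity(f, x0))
```

The estimates should land close to the analytic values $6$ and $-6$ from $f''(x) = 6x$.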


Why This Matters for ML

L2 weight penalty. The regularization term $\frac{\lambda}{2}w^2$ added to the loss has gradient:

$$\frac{d}{dw}\!\left[\frac{\lambda}{2}w^2\right] = \lambda w$$

This is the "weight decay" effect: the regularization gradient pulls every weight toward zero with force proportional to its magnitude. The $\frac{1}{2}$ in the penalty is purely for notational convenience: it cancels the 2 from the power rule.
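To make the decay concrete, here is a minimal sketch that applies plain gradient descent to the penalty term alone, ignoring any data loss. The update rule, the learning rate `lr`, the strength `lam`, and the starting weight are all assumptions picked for illustration.

```python
lam = 0.1  # regularization strength (illustrative value)
lr = 0.5   # learning rate (illustrative value)
w = 2.0    # starting weight (illustrative value)

history = [w]
for _ in range(5):
    grad = lam * w     # d/dw [(lam / 2) * w^2] = lam * w
    w = w - lr * grad  # same as multiplying w by (1 - lr * lam)
    history.append(w)

print(history)
```

Each step multiplies the weight by the same factor $1 - \text{lr} \cdot \lambda$, so under these assumptions the weight shrinks geometrically toward zero.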

Mean squared error. For a prediction $\hat{y}$ against target $y$, the loss $(\hat{y} - y)^2$ has gradient:

$$\frac{d}{d\hat{y}}[(\hat{y} - y)^2] = 2(\hat{y} - y)$$

The derivative is proportional to the error. Large errors produce large gradients; small errors produce small updates. This is why MSE naturally self-scales with the size of the mistake.
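A small sketch of that proportionality, using the single-prediction squared error from the formula above. The function name `squared_error_grad` and the sample values are illustrative.

```python
def squared_error_grad(y_hat, y):
    """Gradient of (y_hat - y)^2 with respect to y_hat."""
    return 2.0 * (y_hat - y)


y = 1.0
for y_hat in (1.5, 2.0, 11.0):
    error = y_hat - y
    # The gradient is exactly twice the error, so it grows linearly with it
    print(error, squared_error_grad(y_hat, y))
```

A prediction below the target gives a negative gradient, so the sign indicates which direction the prediction should move.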


PyTorch and TensorFlow

Automatic differentiation in both frameworks applies these rules internally. You can verify any of them, starting with PyTorch:

import torch

# Product rule: d/dx[x² sin(x)] at x = 1
x = torch.tensor(1.0, requires_grad=True)
h = x**2 * torch.sin(x)
h.backward()
print(x.grad.item())
# = 2·sin(1) + 1²·cos(1) ≈ 2(0.8415) + 0.5403 ≈ 2.223

# L2 regularization gradient
w = torch.tensor(0.5, requires_grad=True)
lam = 0.01
loss = (lam / 2) * w**2
loss.backward()
print(w.grad.item())  # lam * w = 0.01 * 0.5 = 0.005

The same checks in TensorFlow:

import tensorflow as tf

x = tf.Variable(1.0)
with tf.GradientTape() as tape:
    h = x**2 * tf.sin(x)
print(tape.gradient(h, x).numpy())  # ≈ 2.223

w = tf.Variable(0.5)
lam = 0.01
with tf.GradientTape() as tape:
    loss = (lam / 2) * w**2
print(tape.gradient(loss, w).numpy())  # 0.005