Differentiation Rules
- Apply the constant, sum, product, and quotient rules to differentiate combinations of functions
- Recall the derivatives of sin x and cos x and apply them with the product and quotient rules
- Compute second derivatives and interpret their sign in terms of concavity
From Single Terms to Combinations
The power rule handles a single term $x^n$ in isolation. Real functions (loss functions, activations, weight penalty terms) are built by combining simpler pieces: sums, products, quotients. This reading gives the rules for differentiating those combinations.
Constant and Sum Rules
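Constant multiple: Scaling a function scales its derivative:
$$\frac{d}{dx}\left[c\,f(x)\right] = c\,f'(x)$$
Sum/difference: The derivative distributes over addition and subtraction:
$$\frac{d}{dx}\left[f(x) \pm g(x)\right] = f'(x) \pm g'(x)$$
Together with the power rule, these let us differentiate any polynomial term by term:
$$\frac{d}{dx}\left[a_n x^n + \cdots + a_1 x + a_0\right] = n\,a_n x^{n-1} + \cdots + a_1$$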
Example:
$$\frac{d}{dx}\left[3x^4 - 5x^2 + 7\right] = 12x^3 - 10x$$
Each term is differentiated independently. The constant vanishes because a flat term contributes zero slope.
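A quick numerical check can make the term-by-term rule concrete. The sketch below compares a rule-based derivative against a central finite difference; the polynomial $3x^4 - 5x^2 + 7$ and the helper name `central_diff` are choices made here for illustration, not part of the original reading.

```python
def central_diff(fn, x, h=1e-5):
    """Approximate fn'(x) with a symmetric difference quotient."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

# Illustrative polynomial (chosen for this sketch): p(x) = 3x^4 - 5x^2 + 7
def p(x):
    return 3 * x**4 - 5 * x**2 + 7

# Derivative assembled term by term: constant multiple, sum, and power rules.
# The constant 7 contributes nothing.
def dp(x):
    return 12 * x**3 - 10 * x

for x0 in (-1.5, 0.0, 2.0):
    numeric = central_diff(p, x0)
    by_rule = dp(x0)
    print(f"x = {x0:5.2f}   numeric = {numeric:.6f}   rule = {by_rule:.6f}")
```

The two columns should agree to several decimal places; the small residual comes from the finite step size `h`.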
The Product Rule
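Differentiating a product is not simply $f'(x)\,g'(x)$. The correct rule:
$$\frac{d}{dx}\left[f(x)\,g(x)\right] = f'(x)\,g(x) + f(x)\,g'(x)$$
Intuition. Think of $f$ and $g$ as the side lengths of a rectangle with area $fg$. If both sides grow by small increments $df$ and $dg$, the new area gains two strips, $g\,df$ and $f\,dg$, plus a tiny corner $df\,dg$ that is negligible as the increments shrink. So $d(fg) = g\,df + f\,dg$.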
Example: For $h(x) = (x^2 + 1)(3x - 2)$:
$$h'(x) = 2x\,(3x - 2) + (x^2 + 1)\cdot 3 = 9x^2 - 4x + 3$$
Neither factor alone is enough — both must contribute.
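The product rule can be checked numerically, and the same check shows how far off the tempting shortcut $f'g'$ is. This is a minimal sketch assuming the illustrative pair $f(x) = x^2 + 1$ and $g(x) = 3x - 2$; the function names are hypothetical.

```python
def central_diff(fn, x, h=1e-5):
    """Approximate fn'(x) with a symmetric difference quotient."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

# Illustrative factors (assumed for this sketch) and their derivatives
def f(x):  return x**2 + 1
def df(x): return 2 * x
def g(x):  return 3 * x - 2
def dg(x): return 3.0

def product(x):
    return f(x) * g(x)

x0 = 2.0
numeric = central_diff(product, x0)             # slope measured directly
product_rule = df(x0) * g(x0) + f(x0) * dg(x0)  # f'g + fg'
naive = df(x0) * dg(x0)                         # the incorrect shortcut f'g'

print(f"numeric slope : {numeric:.6f}")
print(f"product rule  : {product_rule:.6f}")
print(f"naive f'g'    : {naive:.6f}")
```

At $x = 2$ the product rule gives $4 \cdot 4 + 5 \cdot 3 = 31$, matching the measured slope, while the shortcut gives $4 \cdot 3 = 12$.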
The Quotient Rule
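For $h(x) = \dfrac{f(x)}{g(x)}$ with $g(x) \neq 0$:
$$h'(x) = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}$$
A useful mnemonic: "low d-high minus high d-low, over low squared", where "high" is the numerator $f$ and "low" is the denominator $g$.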
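Example: For $h(x) = \dfrac{x^2}{x + 1}$:
$$h'(x) = \frac{2x\,(x + 1) - x^2 \cdot 1}{(x + 1)^2} = \frac{x^2 + 2x}{(x + 1)^2}$$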
Note the minus sign: the numerator term involving $g'$ is subtracted, not added. Getting this sign wrong is a common error.
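The effect of the sign error is easy to see numerically. The sketch below assumes the illustrative quotient $x^2 / (x + 1)$ and compares the correct rule, the wrong-sign variant, and a finite-difference estimate; all names are hypothetical.

```python
def central_diff(fn, x, h=1e-5):
    """Approximate fn'(x) with a symmetric difference quotient."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

# Illustrative numerator ("high") and denominator ("low"), assumed for this sketch
def high(x):   return x**2
def d_high(x): return 2 * x
def low(x):    return x + 1
def d_low(x):  return 1.0

def quotient(x):
    return high(x) / low(x)

x0 = 2.0
numeric = central_diff(quotient, x0)

# "low d-high minus high d-low, over low squared"
correct = (low(x0) * d_high(x0) - high(x0) * d_low(x0)) / low(x0) ** 2
# Same expression with the sign flipped: the common mistake
wrong_sign = (low(x0) * d_high(x0) + high(x0) * d_low(x0)) / low(x0) ** 2

print(f"numeric slope : {numeric:.6f}")
print(f"quotient rule : {correct:.6f}")
print(f"wrong sign    : {wrong_sign:.6f}")
```

Only the version with the minus sign agrees with the measured slope, which is $8/9$ at $x = 2$.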
Derivatives of Trigonometric Functions
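Since you know trigonometry, these belong in your toolkit:
$$\frac{d}{dx}\sin x = \cos x \qquad \frac{d}{dx}\cos x = -\sin x$$
These follow from the limit definition using the angle-addition identities for $\sin(x + h)$ and $\cos(x + h)$. The results are what we need in practice.
Pattern: Differentiating repeatedly cycles $\sin$ and $\cos$ with alternating signs:
$$\sin x \;\to\; \cos x \;\to\; -\sin x \;\to\; -\cos x \;\to\; \sin x$$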
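The four-step cycle (sin, cos, -sin, -cos, then back to sin) can be confirmed numerically. This is a small sketch using Python's standard `math` module; the test point `x0 = 0.7` is arbitrary and the helper name is hypothetical.

```python
import math

def central_diff(fn, x, h=1e-5):
    """Approximate fn'(x) with a symmetric difference quotient."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

# Each entry should be the derivative of the entry before it,
# and the last entry should differentiate back to the first.
cycle = [
    ("sin x", math.sin),
    ("cos x", math.cos),
    ("-sin x", lambda x: -math.sin(x)),
    ("-cos x", lambda x: -math.cos(x)),
]

x0 = 0.7  # arbitrary test point
for i, (name, fn) in enumerate(cycle):
    next_name, next_fn = cycle[(i + 1) % len(cycle)]
    numeric = central_diff(fn, x0)
    print(f"d/dx[{name:>6}] = {numeric: .6f}   {next_name:>6} = {next_fn(x0): .6f}")
```

Each printed pair should match, so the fourth derivative of $\sin x$ is $\sin x$ again.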
Higher Derivatives
Because $f'(x)$ is itself a function, we can differentiate it again:
- $f''(x)$: the second derivative, which measures how rapidly the slope is changing
- $f^{(n)}(x)$: the $n$th derivative, obtained by applying differentiation $n$ times
Alternate notation: $\dfrac{d^2 f}{dx^2}$ for the second derivative.
Geometric meaning of $f''$: If $f''(x) > 0$, the slope is increasing and the graph curves upward (concave up). If $f''(x) < 0$, the slope is decreasing and the graph curves downward (concave down). This will be central to detecting minima in r5.
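Example: For $f(x) = x^3 - 3x$: $f'(x) = 3x^2 - 3$ and $f''(x) = 6x$.
At $x = 1$: $f''(1) = 6 > 0$ (concave up). At $x = -1$: $f''(-1) = -6 < 0$ (concave down).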
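The sign test for concavity can be sketched with a second-order finite difference. The function $x^3 - 3x$ is an illustrative choice for this sketch, and `second_diff` and `concavity` are hypothetical helper names.

```python
def second_diff(fn, x, h=1e-4):
    """Approximate fn''(x) with the standard three-point stencil."""
    return (fn(x + h) - 2 * fn(x) + fn(x - h)) / h**2

def concavity(fn, x, tol=1e-3):
    """Classify local curvature from the sign of the approximate second derivative."""
    curvature = second_diff(fn, x)
    if curvature > tol:
        return "concave up"
    if curvature < -tol:
        return "concave down"
    return "inconclusive"  # second derivative is (numerically) zero

# Illustrative function (assumed for this sketch); its exact second derivative is 6x
def f(x):
    return x**3 - 3 * x

for x0 in (1.0, -1.0, 0.0):
    print(f"x = {x0:5.2f}   f'' ~ {second_diff(f, x0): .4f}   {concavity(f, x0)}")
```

The tolerance `tol` is there because a numerically estimated second derivative is rarely exactly zero; at $x = 0$ the exact value is $0$, so the sign test alone says nothing about concavity there.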
Why This Matters for ML
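L2 weight penalty. The regularization term $\frac{\lambda}{2}\lVert w \rVert^2 = \frac{\lambda}{2}\sum_i w_i^2$ added to the loss has gradient:
$$\frac{\partial}{\partial w_i}\left[\frac{\lambda}{2}\sum_j w_j^2\right] = \lambda\, w_i$$
This is the "weight decay" effect: the regularization gradient pulls every weight toward zero with force proportional to its magnitude. The $\frac{1}{2}$ in the penalty is purely for notational convenience: it cancels the 2 from the power rule.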
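To see the decay in action, the sketch below runs plain gradient descent on the penalty term alone, ignoring any data loss. The values of `lam`, `lr`, and the starting weight are hypothetical, chosen only to make the shrinkage visible.

```python
lam = 0.5   # regularization strength (hypothetical value)
lr = 0.1    # learning rate (hypothetical value)
w = 2.0     # starting weight (hypothetical value)

history = [w]
for step in range(10):
    grad = lam * w       # gradient of (lam / 2) * w**2
    w = w - lr * grad    # equivalent to w *= (1 - lr * lam)
    history.append(w)

for step, value in enumerate(history):
    print(f"step {step:2d}   w = {value:.6f}")
```

Each update multiplies the weight by the constant factor $1 - \text{lr} \cdot \lambda$ (here $0.95$), so the weight shrinks geometrically toward zero without ever changing sign.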
Mean squared error. For a prediction $\hat{y}$ against target $y$, the loss $L = (\hat{y} - y)^2$ has gradient:
$$\frac{\partial L}{\partial \hat{y}} = 2\,(\hat{y} - y)$$
The derivative is proportional to the error. Large errors produce large gradients; small errors produce small updates. This is why MSE naturally self-scales with the size of the mistake.
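The proportionality between error and gradient can be tabulated directly. This sketch assumes the single-example squared error $(\hat{y} - y)^2$ with no $\frac{1}{2}$ factor, so the gradient is $2(\hat{y} - y)$; the function names and sample predictions are hypothetical.

```python
def central_diff(fn, x, h=1e-5):
    """Approximate fn'(x) with a symmetric difference quotient."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

def squared_error(y_hat, y):
    return (y_hat - y) ** 2

def squared_error_grad(y_hat, y):
    """Derivative with respect to the prediction: sum rule plus power rule."""
    return 2 * (y_hat - y)

target = 1.0
for y_hat in (1.1, 2.0, 5.0):
    analytic = squared_error_grad(y_hat, target)
    numeric = central_diff(lambda p: squared_error(p, target), y_hat)
    print(f"error = {y_hat - target:4.1f}   analytic = {analytic:.4f}   numeric = {numeric:.4f}")
```

Moving the prediction from an error of 0.1 to an error of 4.0 scales the gradient by the same factor of 40.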
PyTorch and TensorFlow
Autograd applies these rules internally. You can verify any of them:
import torch
# Product rule: d/dx[x² sin(x)] at x = 1
x = torch.tensor(1.0, requires_grad=True)
h = x**2 * torch.sin(x)
h.backward()
print(x.grad.item())
# = 2·sin(1) + 1²·cos(1) ≈ 2(0.8415) + 0.5403 ≈ 2.223
# L2 regularization gradient
w = torch.tensor(0.5, requires_grad=True)
lam = 0.01
loss = (lam / 2) * w**2
loss.backward()
print(w.grad.item()) # lam * w = 0.01 * 0.5 = 0.005
import tensorflow as tf
x = tf.Variable(1.0)
with tf.GradientTape() as tape:
    h = x**2 * tf.sin(x)
print(tape.gradient(h, x).numpy()) # ≈ 2.223
w = tf.Variable(0.5)
lam = 0.01
with tf.GradientTape() as tape:
    loss = (lam / 2) * w**2
print(tape.gradient(loss, w).numpy()) # ≈ 0.005