Derivatives: Rate of Change and the Limit
- Explain what a derivative is as the limit of a difference quotient and interpret it geometrically as the slope of a tangent line
- Compute the derivative of a power function using the power rule and state the derivative of a constant
- Interpret a partial derivative in an ML context as the rate at which a loss function changes when one parameter is adjusted
The Central Question of Calculus
When something changes continuously — a car accelerating, a neural network's loss falling with each training step — it changes at some rate at each instant. The average rate of change over an interval is easy: divide the change in output by the change in input. But what is the rate of change at a single moment?
This question, well-posed but deceptively subtle, is what calculus was invented to answer. The answer is the derivative.
Average Rate of Change: The Secant Line
Given a function $f(x)$, the average rate of change between $x$ and $x + h$ is:

$$\frac{f(x + h) - f(x)}{h}$$

Geometrically, this is the slope of the secant line: the straight line connecting $(x, f(x))$ and $(x + h, f(x + h))$ on the graph.
Example. For $f(x) = x^2$, the average rate of change from $x = 1$ to $x = 4$ (so $h = 3$):

$$\frac{f(4) - f(1)}{3} = \frac{16 - 1}{3} = 5$$

But the slope of $x^2$ is clearly different at $x = 1$ versus $x = 4$. The value 5 is only a coarse average across the interval, not the instantaneous rate at either endpoint.
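The secant-slope calculation can be sketched in a few lines of Python. The function names and the choice of $f(x) = x^2$ on the interval from 1 to 4 are illustrative assumptions, not part of the original text.

```python
def average_rate_of_change(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    """Illustrative function f(x) = x^2."""
    return x ** 2


# Secant slope of x^2 from x = 1 to x = 4, so h = 3.
secant_slope = average_rate_of_change(square, 1.0, 3.0)
print(secant_slope)  # 5.0
```

Calling the same function on shorter intervals near each endpoint gives different slopes, which is why a single average cannot stand in for the instantaneous rate.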
The Limit: From Secant to Tangent
As $h$ shrinks toward zero, the secant line pivots toward a limiting position: the tangent line at $x$. The slope of that tangent line is the instantaneous rate of change.

The derivative of $f$ at $x$, written $f'(x)$, is:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

$\lim_{h \to 0}$ means: the value this expression converges to as $h$ gets arbitrarily close to zero (without reaching it). We do not plug in $h = 0$ directly, because that gives $\frac{0}{0}$. We simplify first, then let $h$ vanish.
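Worked example: $f(x) = x^2$.

$$\frac{f(x + h) - f(x)}{h} = \frac{(x + h)^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$$

Once simplified, letting $h \to 0$ gives $f'(x) = 2x$.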
- At $x = 3$: slope $6$. The parabola is rising steeply.
- At $x = 0$: slope $0$. The parabola is flat at its vertex.
- At $x = -2$: slope $-4$. The parabola is falling.
The derivative is itself a function — it gives the slope at every point.
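To watch the limit happen numerically, the sketch below evaluates the difference quotient of $f(x) = x^2$ at $x = 3$ for a shrinking sequence of $h$ values. The function names and the particular step sizes are assumptions chosen for illustration.

```python
def difference_quotient(f, x, h):
    """Secant slope (f(x + h) - f(x)) / h for a nonzero step h."""
    return (f(x + h) - f(x)) / h


def square(x):
    """Illustrative function f(x) = x^2, whose derivative is 2x."""
    return x ** 2


quotients = []
for h in (1.0, 0.1, 0.01, 0.001):
    q = difference_quotient(square, 3.0, h)
    quotients.append(q)
    # Algebraically q equals 2*3 + h, so it approaches 6 as h shrinks.
    print(f"h = {h:<6} quotient = {q:.6f}")
```

The printed quotients move toward 6, the value of $2x$ at $x = 3$, without $h$ ever being set to zero.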
Notation
Several notations for the derivative appear in the literature:
| Notation | Read as |
|---|---|
| $f'(x)$ | "f prime of x" |
| $\frac{dy}{dx}$ | "dy by dx" (when $y = f(x)$) |
| $\frac{d}{dx} f(x)$ | "d by dx of f" |
The Leibniz notation is especially useful in ML because it makes explicit which variable we are differentiating with respect to — essential when a model has thousands of parameters. It also makes the chain rule (r3) read like canceling fractions.
The Power Rule
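Carrying out the difference-quotient calculation for $f(x) = x^n$ in general (using the binomial theorem to expand $(x + h)^n$) yields a clean pattern:

$$\frac{d}{dx} x^n = n x^{n-1}$$

This holds for any real exponent $n$ (positive integers, fractions, negatives) wherever $x^n$ is differentiable, although the non-integer cases need a different proof than the binomial expansion.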
| Function | Derivative |
|---|---|
| $c$ (any constant) | $0$ |
A constant function is horizontal — zero slope everywhere.
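As a sanity check on the power rule, the sketch below compares $n x^{n-1}$ against a central-difference estimate for an integer, a fractional, and a negative exponent. The helper names, the test point $x = 2$, and the step size are illustrative assumptions.

```python
def power_rule(n, x):
    """Derivative of x**n at x, using the power rule n * x**(n - 1)."""
    return n * x ** (n - 1)


def central_difference(f, x, h=1e-6):
    """Numerical slope estimate from points on either side of x."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 2.0
results = {}
for n in (3, 0.5, -1):
    exact = power_rule(n, x)
    approx = central_difference(lambda t: t ** n, x)
    results[n] = (exact, approx)
    print(f"n = {n}: power rule = {exact:.6f}, numerical = {approx:.6f}")
```

The two columns should agree to several decimal places. The numerical estimate is only an approximation, whereas the power rule gives the exact value.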
Differentiability
The limit exists only when the ratio converges to a definite value from both sides. This requires:
- $f$ is continuous at $x$ (no jumps or holes)
- No sharp corner or cusp at $x$

The absolute value $f(x) = |x|$ is continuous but not differentiable at $x = 0$: the left-hand slope is $-1$ and the right-hand slope is $+1$, so the limit does not exist. Every differentiable function is continuous, but not vice versa.
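The mismatch between the two one-sided slopes of $|x|$ at $x = 0$ can be checked directly. This is a minimal sketch; the helper name and the step size are assumptions for illustration.

```python
def one_sided_slope(f, x, h):
    """Difference quotient with a signed step: h > 0 approaches from the right,
    h < 0 approaches from the left."""
    return (f(x + h) - f(x)) / h


right = one_sided_slope(abs, 0.0, 1e-6)   # approach 0 from the right
left = one_sided_slope(abs, 0.0, -1e-6)   # approach 0 from the left
print(left, right)  # -1.0 1.0
```

Because the two values disagree no matter how small the step is, the two-sided limit that defines the derivative does not exist at 0.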
In practice, the activation functions and loss functions used in ML are differentiable almost everywhere. ReLU is the best-known case with a non-differentiable point: its derivative is undefined at exactly $x = 0$, which frameworks handle by convention, commonly by using 0 there.
Why This Matters for ML
Training a neural network is an optimization problem: minimize a loss function $L(\theta)$ over parameters $\theta = (\theta_1, \dots, \theta_n)$. Gradient descent updates each parameter by:

$$\theta_i \leftarrow \theta_i - \eta \, \frac{\partial L}{\partial \theta_i}$$

where $\eta > 0$ is the learning rate (step size). The symbol $\partial$ denotes a partial derivative: the derivative of $L$ with respect to $\theta_i$ while all other parameters are held fixed. Partial derivatives follow exactly the same rules as ordinary derivatives; the notation simply acknowledges that there are multiple variables.

The term $\frac{\partial L}{\partial \theta_i}$ asks: if I nudge $\theta_i$ up by a tiny amount, how much does the loss increase? If the answer is positive, decreasing $\theta_i$ decreases the loss, which is exactly what the minus sign in the update achieves.
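A partial derivative can be estimated by nudging one parameter while holding the others fixed. The sketch below does this for a made-up two-parameter loss; the loss function, the parameter names `w1` and `w2`, and the step size are all hypothetical and chosen only so the exact answers are easy to verify by hand.

```python
def loss(w1, w2):
    """Hypothetical loss: L(w1, w2) = w1^2 + 3*w1*w2."""
    return w1 ** 2 + 3 * w1 * w2


def partial_w1(w1, w2, h=1e-6):
    """Estimate dL/dw1 by varying w1 only; w2 is held fixed."""
    return (loss(w1 + h, w2) - loss(w1 - h, w2)) / (2 * h)


def partial_w2(w1, w2, h=1e-6):
    """Estimate dL/dw2 by varying w2 only; w1 is held fixed."""
    return (loss(w1, w2 + h) - loss(w1, w2 - h)) / (2 * h)


# By hand: dL/dw1 = 2*w1 + 3*w2 and dL/dw2 = 3*w1.
# At (w1, w2) = (1, 2) these are 8 and 3.
grad_w1 = partial_w1(1.0, 2.0)
grad_w2 = partial_w2(1.0, 2.0)
print(grad_w1, grad_w2)  # approximately 8 and 3
```

Both estimates are positive here, so decreasing either parameter slightly would lower this particular loss.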
Without derivatives, gradient-based training is impossible. With them, every update is a principled step in the direction that locally lowers the loss.
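The update rule (subtract the learning rate times the derivative) can be run by hand on a one-parameter problem. This is a minimal sketch under stated assumptions: the loss $L(w) = (w - 3)^2$, the starting point, the learning rate, and the number of steps are all hypothetical choices.

```python
def loss(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def loss_derivative(w):
    """dL/dw = 2 * (w - 3), from the power rule and the chain rule."""
    return 2.0 * (w - 3.0)


w = 0.0               # starting guess (assumption)
learning_rate = 0.1   # step size (assumption)

for step in range(50):
    # Move against the derivative: a positive slope pushes w down,
    # a negative slope pushes w up.
    w = w - learning_rate * loss_derivative(w)

print(w, loss(w))  # w ends close to 3 and the loss close to 0
```

Each step shrinks the distance to the minimum by a constant factor here, because the loss is a simple parabola. Real losses are not this well behaved.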
PyTorch and TensorFlow
Both frameworks compute derivatives automatically via automatic differentiation (called autograd in PyTorch). They record each operation during the forward pass and apply the derivative rules from this reading (and the next three) to compute derivatives that are exact up to floating-point rounding, rather than finite-difference approximations.
```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # track operations on x
f = x ** 2        # f(x) = x²
f.backward()      # compute df/dx and accumulate it into x.grad
print(x.grad)     # tensor(6.)  ← f'(3) = 2·3 = 6
```
```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:  # record operations involving x
    f = x ** 2                   # f(x) = x²
df_dx = tape.gradient(f, x)      # df/dx evaluated at x = 3
print(df_dx.numpy())             # 6.0
```
The requires_grad=True flag (PyTorch) and GradientTape context (TensorFlow) tell the framework to track operations so that derivatives can be computed afterward. This is the machinery that makes backpropagation possible at scale.
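To make "track operations, then compute derivatives afterward" concrete, here is a toy reverse-mode sketch in plain Python. It is not how PyTorch or TensorFlow are implemented; the `Var` class and everything in it are hypothetical, and only addition and multiplication are supported.

```python
class Var:
    """Toy scalar that remembers how it was computed (hypothetical class)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent Var, local derivative of self w.r.t. parent).
        self.parents = parents

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        """Propagate derivatives from this node back to its inputs."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the output with respect to itself
        for node in reversed(order):
            for parent, local in node.parents:
                # Chain rule: accumulate local slope times downstream slope.
                parent.grad += local * node.grad


x = Var(3.0)
f = x * x        # f(x) = x^2, recorded as a multiplication
f.backward()
print(x.grad)    # 6.0
```

The forward pass builds a record of operations and their local derivatives; the backward pass walks that record in reverse and accumulates products of local derivatives.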