Prerequisite · Calculus Foundations

Derivatives: Rate of Change and the Limit

15 min read
By the end of this reading you will be able to:
  • Explain what a derivative is as the limit of a difference quotient and interpret it geometrically as the slope of a tangent line
  • Compute the derivative of a power function using the power rule and state the derivative of a constant
  • Interpret a partial derivative in an ML context as the rate at which a loss function changes when one parameter is adjusted

The Central Question of Calculus

When something changes continuously — a car accelerating, a neural network's loss falling with each training step — it changes at some rate at each instant. The average rate of change over an interval is easy: divide the change in output by the change in input. But what is the rate of change at a single moment?

This question, well-posed but deceptively subtle, is what calculus was invented to answer. The answer is the derivative.


Average Rate of Change: The Secant Line

Given a function $f(x)$, the average rate of change between $x$ and $x+h$ is:

$$\frac{f(x+h) - f(x)}{h}$$

Geometrically this is the slope of the secant line: the straight line connecting $(x,\, f(x))$ and $(x+h,\, f(x+h))$ on the graph.

Example. For $f(x) = x^2$, the average rate of change from $x = 2$ to $x = 3$ (so $h = 1$) is:

$$\frac{3^2 - 2^2}{1} = \frac{9 - 4}{1} = 5$$

But the slope of $x^2$ is clearly different at $x = 2$ versus $x = 3$. The value 5 is only a coarse average across the interval, not the instantaneous rate at either endpoint.
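The secant-slope calculation above can be sketched in a few lines of Python. The function names `average_rate_of_change` and `square` are illustrative choices, not part of any library.

```python
def average_rate_of_change(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


# Interval from x = 2 to x = 3 (h = 1), as in the example above.
slope_wide = average_rate_of_change(square, 2.0, 1.0)
print(slope_wide)    # 5.0

# A narrower interval starting at the same point gives a different average.
slope_narrow = average_rate_of_change(square, 2.0, 0.5)
print(slope_narrow)  # 4.5
```

The two results differ because an average rate depends on the interval chosen, which is why a single number like 5 cannot describe the slope at one point.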


The Limit: From Secant to Tangent

As $h$ shrinks toward zero, the secant line pivots toward a limiting position: the tangent line at $x$. The slope of that tangent line is the instantaneous rate of change.

The derivative of $f$ at $x$, written $f'(x)$, is:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

$\lim_{h \to 0}$ means: the value this expression converges to as $h$ gets arbitrarily close to zero (without reaching it). We do not plug in $h = 0$ directly, since that gives $0/0$. We simplify first, then let $h$ vanish.

Worked example: $f(x) = x^2$

$$\frac{(x+h)^2 - x^2}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$$

Once simplified, letting $h \to 0$ gives $f'(x) = 2x$.

  • At $x = 3$: slope $= 6$. The parabola is rising steeply.
  • At $x = 0$: slope $= 0$. The parabola is flat at its vertex.
  • At $x = -2$: slope $= -4$. The parabola is falling.

The derivative $f'(x) = 2x$ is itself a function: it gives the slope at every point.
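To see the limit numerically, the sketch below evaluates the difference quotient of $x^2$ at $x = 3$ for a shrinking sequence of $h$ values. The particular step sizes and the helper name `difference_quotient` are arbitrary choices for illustration.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line from x to x + h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


x = 3.0
quotients = []
for h in [1.0, 0.1, 0.01, 0.001]:
    q = difference_quotient(square, x, h)
    quotients.append(q)
    # Algebraically the quotient equals 2x + h, so it approaches 2x = 6.
    print(f"h = {h:<6} quotient = {q:.6f}")
```

Each quotient is roughly $6 + h$, so the printed values close in on the tangent slope $f'(3) = 6$ without $h$ ever being set to zero.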


Notation

Several notations for the derivative appear in the literature:

| Notation | Read as |
| --- | --- |
| $f'(x)$ | "f prime of x" |
| $\frac{dy}{dx}$ | "dy by dx" (when $y = f(x)$) |
| $\frac{d}{dx}[f(x)]$ | "d by dx of f" |

The Leibniz notation $dy/dx$ is especially useful in ML because it makes explicit which variable we are differentiating with respect to, which is essential when a model has thousands of parameters. It also makes the chain rule (r3) read like canceling fractions.


The Power Rule

Carrying out the difference-quotient calculation for $x^n$ in general (using the binomial theorem to expand $(x+h)^n$) yields a clean pattern:

$$\frac{d}{dx}\left[x^n\right] = n x^{n-1}$$

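The binomial expansion covers only positive integer $n$, but the rule itself holds for any real exponent (positive integers, fractions, negatives) wherever $x^n$ and $n x^{n-1}$ are both defined.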
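As one concrete instance of that expansion, here is the case $n = 3$ worked through with the same difference quotient used for $x^2$:

$$\frac{(x+h)^3 - x^3}{h} = \frac{x^3 + 3x^2h + 3xh^2 + h^3 - x^3}{h} = \frac{3x^2h + 3xh^2 + h^3}{h} = 3x^2 + 3xh + h^2$$

Letting $h \to 0$ removes every term that still contains $h$ and leaves $3x^2$, which matches $n x^{n-1}$ with $n = 3$.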

| Function | Derivative |
| --- | --- |
| $x^4$ | $4x^3$ |
| $\sqrt{x} = x^{1/2}$ | $\frac{1}{2}x^{-1/2} = \frac{1}{2\sqrt{x}}$ |
| $x^{-1} = 1/x$ | $-x^{-2} = -1/x^2$ |
| $c$ (any constant) | $0$ |

A constant function is horizontal — zero slope everywhere.
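The power-rule entries can be sanity-checked numerically. The sketch below compares $n x^{n-1}$ against a central difference, $(f(x+h) - f(x-h)) / (2h)$, which is a swapped-in approximation chosen here because it is more accurate than the one-sided quotient; the evaluation point $x = 2$ and the step size are arbitrary assumptions.

```python
def power_rule(n, x):
    """Derivative of x**n according to the power rule."""
    return n * x ** (n - 1)


def central_difference(f, x, h=1e-6):
    """Symmetric numerical estimate of f'(x); an approximation, not exact."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 2.0
results = []
for n in [4, 0.5, -1]:
    exact = power_rule(n, x)
    approx = central_difference(lambda t: t ** n, x)
    results.append((n, exact, approx))
    print(f"n = {n:<4} power rule = {exact:.6f}   numerical = {approx:.6f}")

# The constant function has zero slope everywhere.
constant_slope = central_difference(lambda t: 7.0, x)
print(constant_slope)  # 0.0
```

Agreement to several decimal places is evidence for the rule, not a proof; the numerical estimate carries a small rounding and truncation error.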


Differentiability

The limit defining $f'(x)$ exists only when the difference quotient converges to the same finite value as $h \to 0$ from both sides. This requires:

  • $f$ is continuous at $x$ (no jumps or holes)
  • No sharp corner or cusp at $x$

The absolute value $|x|$ is continuous but not differentiable at $x = 0$: the left-hand slope is $-1$ and the right-hand slope is $+1$, so the limit does not exist. Every differentiable function is continuous, but not vice versa.
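The mismatch between the two sides can be checked directly. This sketch evaluates the difference quotient of $|x|$ at $x = 0$ with positive and negative $h$; the helper name `one_sided_quotient` and the step sizes are illustrative.

```python
def one_sided_quotient(f, x, h):
    """Difference quotient with a signed step h (h < 0 approaches from the left)."""
    return (f(x + h) - f(x)) / h


steps = (0.1, 0.01, 0.001)
right = [one_sided_quotient(abs, 0.0, h) for h in steps]
left = [one_sided_quotient(abs, 0.0, -h) for h in steps]

print(right)  # [1.0, 1.0, 1.0]
print(left)   # [-1.0, -1.0, -1.0]
```

No matter how small the step, the right-hand quotients stay at $+1$ and the left-hand quotients stay at $-1$, so there is no single value for the limit to converge to.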

In practice, the activation functions and loss functions used in ML are differentiable almost everywhere. ReLU is the standard example: its derivative is 0 for $x < 0$ and 1 for $x > 0$ but undefined at exactly $x = 0$, and frameworks handle that single point by convention, typically by returning 0 there.


Why This Matters for ML

Training a neural network is an optimization problem: minimize a loss function $\mathcal{L}(\theta)$ over parameters $\theta$. Gradient descent updates each parameter $w$ by:

$$w \leftarrow w - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}$$

Here $\eta > 0$ is the learning rate, which sets the size of each step.

The symbol $\partial$ denotes a partial derivative: the derivative of $\mathcal{L}$ with respect to $w$ while all other parameters are held fixed. Partial derivatives follow exactly the same rules as ordinary derivatives; the notation simply acknowledges that there are multiple variables.
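As a small illustration of "holding the other parameters fixed", the sketch below uses a made-up two-parameter loss $\mathcal{L}(w, b) = w^2 + 3wb$. The loss, the parameter values, and the function names are all hypothetical, and the analytic partials assume the sum and constant-multiple rules in addition to the power rule.

```python
def loss(w, b):
    """Hypothetical two-parameter loss, chosen only for illustration."""
    return w ** 2 + 3 * w * b


def dloss_dw(w, b):
    # Treat b as a constant: d/dw [w^2] = 2w and d/dw [3wb] = 3b.
    return 2 * w + 3 * b


def dloss_db(w, b):
    # Treat w as a constant: d/db [w^2] = 0 and d/db [3wb] = 3w.
    return 3 * w


w, b, h = 1.0, 2.0, 1e-6

# Nudge one parameter at a time while the other stays fixed.
numeric_dw = (loss(w + h, b) - loss(w, b)) / h
numeric_db = (loss(w, b + h) - loss(w, b)) / h

print(dloss_dw(w, b), numeric_dw)  # analytic value is 8.0
print(dloss_db(w, b), numeric_db)  # analytic value is 3.0
```

Each numerical estimate changes only one argument of `loss`, which is exactly what the $\partial$ notation expresses.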

The term $\partial \mathcal{L}/\partial w$ asks: if I nudge $w$ up by a tiny amount, how much does the loss increase per unit of nudge? If the answer is positive, decreasing $w$ decreases the loss, which is exactly what the minus sign in the update achieves. If the answer is negative, the same minus sign increases $w$, which again lowers the loss.
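The update rule can be tried on the simplest possible case. This sketch assumes a one-parameter loss $\mathcal{L}(w) = w^2$, whose derivative $2w$ was computed earlier in this reading; the starting point, learning rate, and step count are arbitrary choices.

```python
def loss(w):
    return w ** 2


def dloss_dw(w):
    # From the worked example: the derivative of w^2 is 2w.
    return 2 * w


w = 3.0      # starting parameter value (slope here is 6, so the loss is rising)
eta = 0.1    # learning rate
losses = [loss(w)]

for step in range(20):
    w = w - eta * dloss_dw(w)   # gradient descent update
    losses.append(loss(w))

print(f"final w = {w:.4f}, final loss = {losses[-1]:.6f}")
```

With this learning rate each update multiplies $w$ by $0.8$, so the parameter moves steadily toward the minimum at $w = 0$ and the loss falls at every step. A much larger learning rate would overshoot, which is one reason $\eta$ is kept small.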

Without derivatives, gradient-based training is impossible. With them, each update moves the parameters in the direction that locally lowers the loss, provided the learning rate is small enough.


PyTorch and TensorFlow

Both frameworks compute derivatives automatically via automatic differentiation (PyTorch calls its engine autograd). They record each operation during the forward pass and apply the derivative rules from this reading (and the next three) to compute derivatives that are exact up to floating-point rounding, rather than finite-difference approximations.

import torch

x = torch.tensor(3.0, requires_grad=True)
f = x ** 2          # f(x) = x²

f.backward()        # compute df/dx and accumulate it into x.grad
print(x.grad)       # tensor(6.)  ← f'(3) = 2·3 = 6  ✓

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    f = x ** 2

df_dx = tape.gradient(f, x)
print(df_dx.numpy())   # 6.0

The requires_grad=True flag (PyTorch) and GradientTape context (TensorFlow) tell the framework to track operations so that derivatives can be computed afterward. This is the machinery that makes backpropagation possible at scale.
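To give a feel for how derivative rules can be applied mechanically to each operation, here is a toy sketch in plain Python. It is not how PyTorch or TensorFlow are implemented: they use reverse-mode differentiation over a recorded graph, whereas this sketch swaps in the simpler forward-mode idea of carrying a value and its derivative together. The `Dual` class and its methods are hypothetical names invented for this example.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Plain numbers are constants, so their derivative is 0.
        if not isinstance(other, Dual):
            other = Dual(other)
        # The derivative of a sum is the sum of the derivatives.
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __pow__(self, n):
        # Power rule for a constant exponent n, scaled by the inner derivative.
        return Dual(self.value ** n, n * self.value ** (n - 1) * self.deriv)


x = Dual(3.0, 1.0)       # dx/dx = 1 marks x as the input variable
f = x ** 2
print(f.value, f.deriv)  # 9.0 6.0

g = x ** 2 + x ** 3
print(g.value, g.deriv)  # 36.0 33.0
```

No step size $h$ appears anywhere: the derivative comes from applying the power rule to each operation, which is why the result carries no finite-difference error.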