Prerequisite · Calculus Foundations

Derivatives: Rate of Change and the Limit

15 min read


By the end of this reading you will be able to:

Explain what a derivative is as the limit of a difference quotient and interpret it geometrically as the slope of a tangent line
Compute the derivative of a power function using the power rule and state the derivative of a constant
Interpret a partial derivative in an ML context as the rate at which a loss function changes when one parameter is adjusted

The Central Question of Calculus

When something changes continuously — a car accelerating, a neural network's loss falling with each training step — it changes at some rate at each instant. The average rate of change over an interval is easy: divide the change in output by the change in input. But what is the rate of change at a single moment?

This question, well-posed but deceptively subtle, is what calculus was invented to answer. The answer is the derivative.

Average Rate of Change: The Secant Line

Given a function $f(x)$, the average rate of change between $x$ and $x+h$ is:

$\frac{f(x+h) - f(x)}{h}$

Geometrically this is the slope of the secant line — the straight line connecting $(x,\, f(x))$ and $(x+h,\, f(x+h))$ on the graph.

Example. For $f(x) = x^2$, the average rate of change from $x = 2$ to $x = 3$ (so $h = 1$) is $\frac{3^2 - 2^2}{1} = \frac{9 - 4}{1} = 5$.

But the slope of $x^2$ is clearly different at $x = 2$ versus $x = 3$. The value 5 is only a coarse average across the interval, not the instantaneous rate at either endpoint.
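The secant-slope calculation can be sketched in a few lines of Python. The helper name `average_rate_of_change` and the function `square` are illustrative choices made for this sketch, not part of any library.

```python
def average_rate_of_change(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


# Interval from x = 2 to x = 3, so h = 1.
slope_2_to_3 = average_rate_of_change(square, 2.0, 1.0)
print(slope_2_to_3)  # 5.0
```

Passing a smaller `h` gives the average over a narrower interval starting at the same point.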

The Limit: From Secant to Tangent

As $h$ shrinks toward zero, the secant line pivots toward a limiting position: the tangent line at $x$. The slope of that tangent line is the instantaneous rate of change.

The derivative of $f$ at $x$, written $f'(x)$, is:

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

$\lim_{h \to 0}$ means the value this expression converges to as $h$ gets arbitrarily close to zero without ever reaching it. We do not plug in $h = 0$ directly, because that gives the undefined form $0/0$. We simplify first, then let $h$ vanish.

Worked example: $f(x) = x^2$

$\frac{(x+h)^2 - x^2}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$

Canceling $h$ is valid because $h \neq 0$ inside the limit. Letting $h \to 0$ in $2x + h$ then gives $f'(x) = 2x$.

At $x = 3$: slope $= 6$. The parabola is rising steeply.
At $x = 0$: slope $= 0$. The parabola is flat at its vertex.
At $x = -2$: slope $= -4$. The parabola is falling.

The derivative $f'(x) = 2x$ is itself a function — it gives the slope at every point.
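To watch the limit happen numerically, the sketch below evaluates the difference quotient of $x^2$ at $x = 3$ for shrinking values of $h$. The names `difference_quotient`, `step_sizes`, and `quotients` are hypothetical, chosen only for this illustration. Since the quotient simplifies to $2x + h$, each value should sit roughly $h$ above 6, up to floating-point rounding.

```python
def difference_quotient(f, x, h):
    """(f(x + h) - f(x)) / h, the slope of a secant line of width h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


step_sizes = [1.0, 0.1, 0.01, 0.001]
quotients = [difference_quotient(square, 3.0, h) for h in step_sizes]

# Each quotient is close to 6 + h, so the values approach f'(3) = 6.
for h, q in zip(step_sizes, quotients):
    print(h, q)
```

This is only a numerical illustration of the limit. The algebraic simplification above is what actually establishes $f'(x) = 2x$.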

Notation

Several notations for the derivative appear in the literature:

Notation	Read as
$f'(x)$	"f prime of x"
$\frac{dy}{dx}$	"dy by dx" (when $y = f(x)$)
$\frac{d}{dx}[f(x)]$	"d by dx of f"

The Leibniz notation $dy/dx$ is especially useful in ML because it makes explicit which variable we are differentiating with respect to — essential when a model has thousands of parameters. It also makes the chain rule (r3) read like canceling fractions.

The Power Rule

Carrying out the difference-quotient calculation for $x^n$ in general (using the binomial theorem to expand $(x+h)^n$) yields a clean pattern:

$\frac{d}{dx}\left[x^n\right] = n x^{n-1}$

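The binomial argument covers positive integer $n$, but the rule itself holds for any real exponent, including fractions and negatives, wherever $x^n$ and $x^{n-1}$ are defined.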

Function	Derivative
$x^4$	$4x^3$
$\sqrt{x} = x^{1/2}$	$\frac{1}{2}x^{-1/2} = \frac{1}{2\sqrt{x}}$
$x^{-1} = 1/x$	$-x^{-2} = -1/x^2$
$c$ (any constant)	$0$

A constant function is horizontal — zero slope everywhere.
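As a sanity check on the table, the following sketch compares the power-rule prediction with a numerical slope estimate at $x = 2$. The helpers `power_rule` and `central_difference` are hypothetical names introduced here. The central difference is an approximation used only for checking, and the step `h=1e-6` is an arbitrary choice.

```python
def power_rule(n, x):
    """Derivative of x**n predicted by the power rule: n * x**(n - 1)."""
    return n * x ** (n - 1)


def central_difference(f, x, h=1e-6):
    """Numerical slope estimate from points just left and right of x."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 2.0
for n in [4, 0.5, -1]:
    predicted = power_rule(n, x0)
    estimated = central_difference(lambda x: x ** n, x0)
    # The two numbers should agree to several decimal places.
    print(n, predicted, estimated)
```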

Differentiability

The limit $f'(x)$ exists only when the difference quotient converges to the same finite value as $h \to 0$ from both sides. At a minimum, this requires:

$f$ is continuous at $x$ (no jumps or holes)
No sharp corner, cusp, or vertical tangent at $x$

The absolute value $|x|$ is continuous but not differentiable at $x = 0$: the left-hand slope is $-1$ and the right-hand slope is $+1$, so the limit does not exist. Every differentiable function is continuous, but not vice versa.
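The disagreement between the two sides can be seen directly by evaluating the difference quotient of $|x|$ at $0$ with a small positive and a small negative $h$. The helper name `one_sided_slope` is a hypothetical choice for this sketch.

```python
def one_sided_slope(f, x, h):
    """Difference quotient with a signed step h (h > 0: right, h < 0: left)."""
    return (f(x + h) - f(x)) / h


right = one_sided_slope(abs, 0.0, 1e-6)    # approach from the right
left = one_sided_slope(abs, 0.0, -1e-6)    # approach from the left
print(right, left)  # 1.0 -1.0
```

Because the two one-sided values differ no matter how small the step is, there is no single limit to call the derivative at $0$.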

In practice, the activation functions and loss functions used in ML are differentiable almost everywhere. ReLU is the standard example: its derivative is undefined only at $x = 0$, and frameworks handle that single point by convention, assigning a fixed value there (commonly 0).

Why This Matters for ML

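Training a neural network is an optimization problem: minimize a loss function $\mathcal{L}(\theta)$ over parameters $\theta$. Gradient descent updates each individual parameter $w$ (one component of $\theta$) using a small positive step size $\eta$, called the learning rate: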

$w \leftarrow w - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}$

The symbol $\partial$ denotes a partial derivative: the derivative of $\mathcal{L}$ with respect to $w$ while all other parameters are held fixed. Partial derivatives follow exactly the same rules as ordinary derivatives — the notation simply acknowledges that there are multiple variables.
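A minimal sketch of "hold everything else fixed", assuming a made-up two-parameter loss $\mathcal{L}(w, b) = (wx + b - y)^2$ for a single data point with $x = 2$ and $y = 1$. The loss, the data point, and the helper names `partial_w` and `partial_b` are all assumptions for illustration. Each helper estimates one partial derivative by nudging only its own parameter.

```python
def loss(w, b, x=2.0, y=1.0):
    """Hypothetical squared-error loss for one data point (x, y)."""
    return (w * x + b - y) ** 2


def partial_w(w, b, h=1e-6):
    # Vary w only; b is held fixed.
    return (loss(w + h, b) - loss(w - h, b)) / (2 * h)


def partial_b(w, b, h=1e-6):
    # Vary b only; w is held fixed.
    return (loss(w, b + h) - loss(w, b - h)) / (2 * h)


dw = partial_w(1.0, 0.5)
db = partial_b(1.0, 0.5)
print(dw, db)  # close to 6.0 and 3.0
```

At $w = 1$, $b = 0.5$ the exact partial derivatives of this loss are $2(wx + b - y)\,x = 6$ and $2(wx + b - y) = 3$, which the numerical estimates should match closely.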

The term $\partial \mathcal{L}/\partial w$ asks: if I nudge $w$ up by a tiny amount, how much does the loss increase? If the answer is positive, decreasing $w$ decreases the loss — which is exactly what the minus sign in the update achieves.

Without derivatives, gradient-based training is impossible. With them, each update moves the parameters in the direction that locally decreases the loss, provided the learning rate is small enough.
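The update rule can be tried on a toy problem. The sketch below assumes a one-parameter loss $\mathcal{L}(w) = (w - 3)^2$, a starting point of $w = 0$, a learning rate of $0.1$, and 50 steps. All of these are arbitrary choices for illustration, not values taken from any real model.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def dloss_dw(w):
    # Expand (w - 3)^2 = w^2 - 6w + 9 and differentiate term by term
    # with the power rule: 2w - 6 = 2 (w - 3).
    return 2.0 * (w - 3.0)


eta = 0.1      # learning rate (assumed value)
w = 0.0        # initial parameter (assumed value)
losses = []
for step in range(50):
    losses.append(loss(w))
    w = w - eta * dloss_dw(w)   # the gradient descent update

print(w)  # close to 3, the minimizer
```

With this learning rate, each step shrinks the distance to the minimum by a factor of $0.8$, so the recorded losses fall at every step. A much larger learning rate would overshoot instead.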

PyTorch and TensorFlow

Both frameworks compute derivatives through automatic differentiation (PyTorch calls its implementation autograd). They record each operation during the forward pass and apply the derivative rules from this reading (and the next three) to compute derivatives that are exact up to floating-point rounding, with no finite-difference approximation.

import torch

x = torch.tensor(3.0, requires_grad=True)
f = x ** 2          # f(x) = x²

f.backward()        # accumulate gradients
print(x.grad)       # tensor(6.)  ← f'(3) = 2·3 = 6  ✓

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    f = x ** 2

df_dx = tape.gradient(f, x)
print(df_dx.numpy())   # 6.0

The requires_grad=True flag (PyTorch) and GradientTape context (TensorFlow) tell the framework to track operations so that derivatives can be computed afterward. This is the machinery that makes backpropagation possible at scale.
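To give a feel for how derivative rules can be applied operation by operation, here is a toy sketch in plain Python. It is not how PyTorch or TensorFlow are implemented: those frameworks record a graph and work backward through it (reverse mode), while this sketch uses the simpler forward-mode idea of carrying a derivative alongside each value. The class name `Dual` is hypothetical, and the `u'` factor in the power rule anticipates the chain rule from a later reading.

```python
class Dual:
    """A number that carries its value and its derivative together."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __pow__(self, n):
        # Power rule with the inner derivative carried along:
        # d/dx [u^n] = n * u^(n-1) * u'
        return Dual(self.value ** n, n * self.value ** (n - 1) * self.deriv)

    def __add__(self, other):
        # The derivative of a sum is the sum of the derivatives.
        return Dual(self.value + other.value, self.deriv + other.deriv)


x = Dual(3.0, 1.0)       # seed: dx/dx = 1
f = x ** 2               # f(x) = x^2
print(f.value, f.deriv)  # 9.0 6.0
```

No step size $h$ appears anywhere: the derivative comes from applying a rule at each operation, which is why the result matches $f'(3) = 6$ rather than approximating it.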

References

MIT 18.01SC — Sessions 1–3 — Derivatives, Slope, Velocity, Rate of Change
