Prerequisite · Calculus Foundations

Optimization and Linear Approximation

16 min read

By the end of this reading you will be able to:

Identify critical points by setting the derivative to zero and classify them using the second derivative test
Apply the first-order (linear) approximation f(x₀ + δ) ≈ f(x₀) + f'(x₀)·δ to estimate function values near a point
Explain gradient descent as iterative application of the linear approximation, and justify why the negative gradient direction decreases the function

The Goal: Finding Minima

Optimization is the process of finding input values that minimize (or maximize) a function. In ML, we want to find parameters $\theta$ that minimize a loss function $\mathcal{L}(\theta)$. Calculus tells us where to look.

Critical Points

At a minimum of a smooth function, the tangent line is horizontal — the slope is zero. So:

$f'(x) = 0 \quad \Rightarrow \quad x \text{ is a critical point}$

Critical points are candidates for minima, but also for maxima and for flat inflection points (the one-dimensional analogue of saddle points). Setting the derivative to zero tells us where to look; we need more information to classify each point.

Example. Let $f(x) = x^3 - 3x$. Then:

$f'(x) = 3x^2 - 3 = 0 \quad \Rightarrow \quad x^2 = 1 \quad \Rightarrow \quad x = \pm 1$

Two critical points. Are they minima, maxima, or neither?

The Second Derivative Test

The second derivative $f''(x)$ measures how the slope is changing. At a critical point where $f'(x_0) = 0$ :

$f''(x_0)$	Shape of graph	Type
$> 0$	Curves upward (concave up)	Local minimum
$< 0$	Curves downward (concave down)	Local maximum
$= 0$	Test inconclusive	Need further analysis

Intuition: If $f''(x_0) > 0$ , the slope is increasing through zero — it was negative just before $x_0$ (function falling) and positive just after (function rising). So $x_0$ is a valley bottom.

Example continued. $f''(x) = 6x$

At $x = 1$: $f''(1) = 6 > 0$ → local minimum
At $x = -1$: $f''(-1) = -6 < 0$ → local maximum
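The classification above can be checked mechanically. The following is a minimal sketch in plain Python, with the derivatives of $f(x) = x^3 - 3x$ written out by hand; the helper name `classify` is introduced here for illustration only.

```python
# Second derivative test for f(x) = x^3 - 3x, derivatives written out by hand.
def f_prime(x):
    return 3 * x**2 - 3

def f_double_prime(x):
    return 6 * x

def classify(x0):
    """Classify a critical point x0 by the sign of f''(x0) (illustrative helper)."""
    curvature = f_double_prime(x0)
    if curvature > 0:
        return "local minimum"
    if curvature < 0:
        return "local maximum"
    return "inconclusive"

critical_points = [-1.0, 1.0]  # solutions of f'(x) = 0 from the example
results = {x0: classify(x0) for x0 in critical_points}

for x0, kind in results.items():
    print(f"x = {x0:+.0f}: f' = {f_prime(x0):.0f}, f'' = {f_double_prime(x0):+.0f} -> {kind}")
```

A symbolic algebra library could derive `f_prime` and `f_double_prime` automatically; they are hard-coded here to keep the sketch dependency-free.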

Local vs Global

A local minimum is the lowest point in some neighborhood — but there may be lower points elsewhere. A global minimum is the lowest point over the entire domain.
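The running example $f(x) = x^3 - 3x$ illustrates the distinction. As a small sketch (the sample points are chosen arbitrarily for illustration), the local minimum at $x = 1$ is the lowest point nearby, yet the function takes much lower values far to the left:

```python
def f(x):
    return x**3 - 3 * x

local_min_value = f(1.0)   # value at the local minimum x = 1
far_left_value = f(-3.0)   # a point well outside the neighborhood of x = 1

# Nearby points are all at least as high as the local minimum...
nearby = [f(1.0 + d) for d in (-0.5, -0.1, 0.1, 0.5)]
print(all(v >= local_min_value for v in nearby))

# ...but a distant point is lower, so x = 1 is not a global minimum.
print(far_left_value < local_min_value)
```

Over all real numbers this cubic has no global minimum at all, since it decreases without bound as $x \to -\infty$.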

For training neural networks:

The loss landscape is high-dimensional and non-convex — many local minima and saddle points exist
Reaching the global minimum is generally not required or expected
In practice, the minima that gradient descent finds often generalize well. One common explanation is that it tends to settle in flat minima (small $|f''|$), which are often associated with simpler solutions

For a convex function, one whose graph curves upward everywhere ($f'' \geq 0$ for a twice-differentiable function), every critical point is a global minimum. Logistic regression loss is convex; most deep network losses are not.

Linear Approximation

For a differentiable function $f$ , near any point $x_0$ , the tangent line provides a close approximation:

$f(x_0 + \delta) \approx f(x_0) + f'(x_0) \cdot \delta$

This is the first-order Taylor approximation — valid when $\delta$ is small. The approximation says: start at $f(x_0)$ , then move by slope $\times$ step.

Example. Estimate $\sqrt{4.1}$ .

Let $f(x) = \sqrt{x}$, $x_0 = 4$, $\delta = 0.1$.

$f'(x) = \frac{1}{2\sqrt{x}} \quad \Rightarrow \quad f'(4) = \frac{1}{4}$

$\sqrt{4.1} \approx \sqrt{4} + \frac{1}{4}(0.1) = 2 + 0.025 = 2.025$

Actual value: $\sqrt{4.1} \approx 2.0248$ . The error is about $0.0002$ — excellent for a one-term approximation.
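To see how the quality of the estimate depends on the step, the sketch below repeats the calculation for a few values of $\delta$. The helper `linear_approx` and the extra step sizes are illustrative choices, not part of the original example.

```python
import math

def linear_approx(f, f_prime, x0, delta):
    """First-order estimate of f(x0 + delta)."""
    return f(x0) + f_prime(x0) * delta

def sqrt_prime(x):
    return 1.0 / (2.0 * math.sqrt(x))

x0 = 4.0
errors = {}
for delta in (0.1, 0.01, 0.001):
    estimate = linear_approx(math.sqrt, sqrt_prime, x0, delta)
    exact = math.sqrt(x0 + delta)
    errors[delta] = abs(estimate - exact)
    print(f"delta={delta}: estimate={estimate:.6f}, exact={exact:.6f}, error={errors[delta]:.2e}")
```

Shrinking $\delta$ by a factor of 10 shrinks the error by roughly a factor of 100, which is the behavior expected when the neglected term is proportional to $\delta^2$.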

Gradient Descent

The linear approximation is the mathematical foundation of gradient descent. Suppose we want to decrease $f$ starting from $x_0$ . The approximation says:

$f(x_0 + \delta) \approx f(x_0) + f'(x_0) \cdot \delta$

To decrease $f$, we want the first-order change to be negative:

$f'(x_0) \cdot \delta < 0$

The simplest choice is a step opposite to the slope: set $\delta = -\eta \cdot f'(x_0)$ for a small $\eta > 0$ (the learning rate). Then:

$f(x_0 + \delta) \approx f(x_0) - \eta \cdot [f'(x_0)]^2 \leq f(x_0)$

with strict inequality whenever $f'(x_0) \neq 0$.

The update $x \leftarrow x - \eta \cdot f'(x)$ therefore decreases $f$ locally whenever $f'(x) \neq 0$, as long as $\eta$ is small enough that the linear approximation remains valid.
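One way to make "small enough" concrete is to keep the second-order Taylor term. This is a sketch under an added assumption: $f$ is twice differentiable with $|f''| \leq M$ on the interval between $x_0$ and $x_0 + \delta$. The bound $M$ is introduced here and is not part of the original argument.

```latex
\begin{aligned}
f(x_0 + \delta) &= f(x_0) + f'(x_0)\,\delta + \tfrac{1}{2} f''(\xi)\,\delta^2
  && \text{for some } \xi \text{ between } x_0 \text{ and } x_0 + \delta \\
f\bigl(x_0 - \eta f'(x_0)\bigr) &\leq f(x_0) - \eta\left(1 - \tfrac{\eta M}{2}\right)[f'(x_0)]^2
  && \text{substituting } \delta = -\eta f'(x_0)
\end{aligned}
```

Under that assumption, the step strictly decreases $f$ whenever $0 < \eta < 2/M$ and $f'(x_0) \neq 0$.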

This is gradient descent — the core algorithm for training neural networks. In multiple dimensions, $f'(x)$ becomes $\nabla_\theta \mathcal{L}$ (the gradient vector), and the update becomes $\theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L}$ .
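The multi-dimensional update can be sketched in plain Python. The two-parameter loss below, $\mathcal{L}(\theta) = (\theta_1 - 1)^2 + 2(\theta_2 + 3)^2$, is a hypothetical example chosen for illustration, with its gradient written out by hand; real training code would obtain the gradient from an autodiff library.

```python
def loss(theta):
    t1, t2 = theta
    return (t1 - 1.0) ** 2 + 2.0 * (t2 + 3.0) ** 2

def grad(theta):
    """Gradient vector of the loss: one partial derivative per parameter."""
    t1, t2 = theta
    return [2.0 * (t1 - 1.0), 4.0 * (t2 + 3.0)]

theta = [0.0, 0.0]   # starting point (arbitrary)
eta = 0.1            # learning rate
history = [loss(theta)]

for _ in range(100):
    g = grad(theta)
    # theta <- theta - eta * gradient, applied to each coordinate
    theta = [t - eta * gi for t, gi in zip(theta, g)]
    history.append(loss(theta))

print(f"theta = ({theta[0]:.4f}, {theta[1]:.4f}), loss = {history[-1]:.2e}")
```

Each coordinate moves against its own partial derivative, and the recorded loss never increases from one step to the next for this choice of `eta`.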

Choosing the Learning Rate

The learning rate $\eta$ controls the step size:

Too large: The linear approximation breaks down. We might overshoot and land at a point with higher loss.
Too small: Progress is glacially slow. Many steps needed to converge.
Just right: Consistent decrease per step, converging to a minimum in reasonable time.

This is why learning rate scheduling (warmup, decay) and adaptive methods (Adam, RMSProp) matter. Schedules change $\eta$ over the course of training, while adaptive methods scale each parameter's step using statistics of recent gradients. Both compensate for the fact that the linear approximation has varying accuracy across the loss surface.
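The three regimes can be observed on a simple quadratic. The sketch below uses $f(x) = x^2 - 4x + 5$ (minimum at $x = 2$) and three learning rates picked for illustration; the helper `run_gd` is hypothetical.

```python
def f(x):
    return x**2 - 4 * x + 5

def f_prime(x):
    return 2 * x - 4

def run_gd(eta, steps=20, x=0.0):
    """Run plain gradient descent from x for a fixed number of steps."""
    for _ in range(steps):
        x = x - eta * f_prime(x)
    return x

# too small, reasonable, too large
final = {eta: run_gd(eta) for eta in (0.001, 0.1, 1.1)}

for eta, x in final.items():
    print(f"eta={eta}: x={x:.4f}, f(x)={f(x):.4f}")
```

With the smallest rate the iterate barely leaves the starting point in 20 steps, with the middle rate it approaches $x = 2$, and with the largest rate each step overshoots by more than the previous one, so the loss grows instead of shrinking.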

PyTorch and TensorFlow

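import torch

# Manual gradient descent on f(x) = x² - 4x + 5  (minimum at x=2, where f=1)
x = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for step in range(20):
    f = x**2 - 4*x + 5
    f.backward()                 # populates x.grad with df/dx = 2x - 4
    if step % 5 == 0:
        # report x and f at the same point, before the update
        print(f"step {step}: x={x.item():.4f}, f={f.item():.4f}")
    with torch.no_grad():        # keep the update out of the autograd graph
        x -= lr * x.grad         # gradient descent step
    x.grad.zero_()               # clear the accumulated gradient
# Approaches x ≈ 2.0,  f ≈ 1.0  (x ≈ 1.977 after 20 steps)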

import tensorflow as tf

# Same function, TF-style
x = tf.Variable(0.0)
lr = 0.1

for step in range(20):
    with tf.GradientTape() as tape:   # record operations on x
        f = x**2 - 4*x + 5
    grad = tape.gradient(f, x)        # df/dx at the current x
    if step % 5 == 0:
        # report x and f at the same point, before the update
        print(f"step {step}: x={x.numpy():.4f}, f={f.numpy():.4f}")
    x.assign_sub(lr * grad)  # x = x - lr * grad

These loops implement the theory directly: compute the derivative at the current point, step in the negative gradient direction, repeat. The x.grad.zero_() call in PyTorch is necessary because autograd accumulates gradients: without clearing them, each step's gradient would be added to the previous one.

References

MIT 18.01SC — Sessions 23–24, 28–30 — Linear Approximation and Applications of Differentiation
