Prerequisite · Calculus Foundations

Optimization and Linear Approximation

16 min read
0:00
Audio overview generated with
By the end of this reading you will be able to:
  • Identify critical points by setting the derivative to zero and classify them using the second derivative test
  • Apply the first-order (linear) approximation f(x₀ + δ) ≈ f(x₀) + f'(x₀)·δ to estimate function values near a point
  • Explain gradient descent as iterative application of the linear approximation, and justify why the negative gradient direction decreases the function

The Goal: Finding Minima

Optimization is the process of finding input values that minimize (or maximize) a function. In ML, we want to find parameters θ\theta that minimize a loss function L(θ)\mathcal{L}(\theta). Calculus tells us exactly where to look.


Critical Points

At a minimum of a smooth function, the tangent line is horizontal — the slope is zero. So:

f(x)=0x is a critical pointf'(x) = 0 \quad \Rightarrow \quad x \text{ is a critical point}

Critical points are candidates for minima, but also for maxima and saddle points. Setting the derivative to zero tells us where to look; we need more information to classify.

Example. f(x)=x33xf(x) = x^3 - 3x f(x)=3x23=0x2=1x=±1f'(x) = 3x^2 - 3 = 0 \quad \Rightarrow \quad x^2 = 1 \quad \Rightarrow \quad x = \pm 1

Two critical points. Are they minima, maxima, or neither?


The Second Derivative Test

The second derivative f(x)f''(x) measures how the slope is changing. At a critical point where f(x0)=0f'(x_0) = 0:

f(x0)f''(x_0) Shape of graph Type
>0> 0 Curves upward (concave up) Local minimum
<0< 0 Curves downward (concave down) Local maximum
=0= 0 Test inconclusive Need further analysis

Intuition: If f(x0)>0f''(x_0) > 0, the slope is increasing through zero — it was negative just before x0x_0 (function falling) and positive just after (function rising). So x0x_0 is a valley bottom.

Example continued. f(x)=6xf''(x) = 6x

  • At x=1x = 1: f=6>0f'' = 6 > 0local minimum
  • At x=1x = -1: f=6<0f'' = -6 < 0local maximum

Local vs Global

A local minimum is the lowest point in some neighborhood — but there may be lower points elsewhere. A global minimum is the lowest point over the entire domain.

For training neural networks:

  • The loss landscape is high-dimensional and non-convex — many local minima and saddle points exist
  • Reaching the global minimum is generally not required or expected
  • In practice, local minima found by gradient descent tend to generalize well, because flat minima (small f|f''|) often correspond to simpler solutions

A function where every critical point is a global minimum is convex — its graph always curves upward (f0f'' \geq 0 everywhere). Logistic regression loss is convex; most deep network losses are not.


Linear Approximation

For a differentiable function ff, near any point x0x_0, the tangent line provides a close approximation:

f(x0+δ)f(x0)+f(x0)δf(x_0 + \delta) \approx f(x_0) + f'(x_0) \cdot \delta

This is the first-order Taylor approximation — valid when δ\delta is small. The approximation says: start at f(x0)f(x_0), then move by slope ×\times step.

Example. Estimate 4.1\sqrt{4.1}.

Let f(x)=xf(x) = \sqrt{x}, x0=4x_0 = 4, δ=0.1\delta = 0.1. f(x)=12xf(4)=14f'(x) = \frac{1}{2\sqrt{x}} \quad \Rightarrow \quad f'(4) = \frac{1}{4} 4.14+14(0.1)=2+0.025=2.025\sqrt{4.1} \approx \sqrt{4} + \frac{1}{4}(0.1) = 2 + 0.025 = 2.025

Actual value: 4.12.0248\sqrt{4.1} \approx 2.0248. The error is about 0.00020.0002 — excellent for a one-term approximation.


Gradient Descent

The linear approximation is the mathematical foundation of gradient descent. Suppose we want to decrease ff starting from x0x_0. The approximation says:

f(x0+δ)f(x0)+f(x0)δf(x_0 + \delta) \approx f(x_0) + f'(x_0) \cdot \delta

To decrease ff, we want this change to be negative: f(x0)δ<0f'(x_0) \cdot \delta < 0

The simplest choice: set δ=ηf(x0)\delta = -\eta \cdot f'(x_0) for small η>0\eta > 0 (the learning rate). Then: f(x0+δ)f(x0)η[f(x0)]2f(x0)f(x_0 + \delta) \approx f(x_0) - \eta \cdot [f'(x_0)]^2 \leq f(x_0)

The update xxηf(x)x \leftarrow x - \eta \cdot f'(x) is guaranteed to decrease ff locally — as long as η\eta is small enough that the linear approximation remains valid.

This is gradient descent — the core algorithm for training neural networks. In multiple dimensions, f(x)f'(x) becomes θL\nabla_\theta \mathcal{L} (the gradient vector), and the update becomes θθηθL\theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L}.


Choosing the Learning Rate

The learning rate η\eta controls the step size:

  • Too large: The linear approximation breaks down. We might overshoot and land at a point with higher loss.
  • Too small: Progress is glacially slow. Many steps needed to converge.
  • Just right: Consistent decrease per step, converging to a minimum in reasonable time.

This is why learning rate scheduling (warmup, decay) and adaptive methods (Adam, RMSProp) matter: they adjust η\eta based on observed gradient behavior, compensating for the fact that the linear approximation has varying accuracy across the loss surface.


PyTorch and TensorFlow

import torch

# Manual gradient descent on f(x) = x² - 4x + 5  (minimum at x=2)
x = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for step in range(20):
    f = x**2 - 4*x + 5
    f.backward()
    with torch.no_grad():
        x -= lr * x.grad   # gradient descent step
    x.grad.zero_()
    if step % 5 == 0:
        print(f"step {step}: x={x.item():.4f}, f={f.item():.4f}")
# Converges to x ≈ 2.0,  f ≈ 1.0
import tensorflow as tf

# Same function, TF-style
x = tf.Variable(0.0)
lr = 0.1

for step in range(20):
    with tf.GradientTape() as tape:
        f = x**2 - 4*x + 5
    grad = tape.gradient(f, x)
    x.assign_sub(lr * grad)  # x = x - lr * grad
    if step % 5 == 0:
        print(f"step {step}: x={x.numpy():.4f}, f={f.numpy():.4f}")

These loops implement exactly the theory: compute the derivative at the current point, step in the negative gradient direction, repeat. The zero_grad() call in PyTorch is necessary because autograd accumulates gradients — without clearing them, each step's gradient would pile onto the previous.