Optimization and Linear Approximation
- Identify critical points by setting the derivative to zero and classify them using the second derivative test
- Apply the first-order (linear) approximation f(x₀ + δ) ≈ f(x₀) + f'(x₀)·δ to estimate function values near a point
- Explain gradient descent as iterative application of the linear approximation, and justify why the negative gradient direction decreases the function
The Goal: Finding Minima
Optimization is the process of finding input values that minimize (or maximize) a function. In ML, we want to find parameters that minimize a loss function . Calculus tells us exactly where to look.
Critical Points
At a minimum of a smooth function, the tangent line is horizontal — the slope is zero. So:
Critical points are candidates for minima, but also for maxima and saddle points. Setting the derivative to zero tells us where to look; we need more information to classify.
Example.
Two critical points. Are they minima, maxima, or neither?
The Second Derivative Test
The second derivative measures how the slope is changing. At a critical point where :
| Shape of graph | Type | |
|---|---|---|
| Curves upward (concave up) | Local minimum | |
| Curves downward (concave down) | Local maximum | |
| Test inconclusive | Need further analysis |
Intuition: If , the slope is increasing through zero — it was negative just before (function falling) and positive just after (function rising). So is a valley bottom.
Example continued.
- At : → local minimum
- At : → local maximum
Local vs Global
A local minimum is the lowest point in some neighborhood — but there may be lower points elsewhere. A global minimum is the lowest point over the entire domain.
For training neural networks:
- The loss landscape is high-dimensional and non-convex — many local minima and saddle points exist
- Reaching the global minimum is generally not required or expected
- In practice, local minima found by gradient descent tend to generalize well, because flat minima (small ) often correspond to simpler solutions
A function where every critical point is a global minimum is convex — its graph always curves upward ( everywhere). Logistic regression loss is convex; most deep network losses are not.
Linear Approximation
For a differentiable function , near any point , the tangent line provides a close approximation:
This is the first-order Taylor approximation — valid when is small. The approximation says: start at , then move by slope step.
Example. Estimate .
Let , , .
Actual value: . The error is about — excellent for a one-term approximation.
Gradient Descent
The linear approximation is the mathematical foundation of gradient descent. Suppose we want to decrease starting from . The approximation says:
To decrease , we want this change to be negative:
The simplest choice: set for small (the learning rate). Then:
The update is guaranteed to decrease locally — as long as is small enough that the linear approximation remains valid.
This is gradient descent — the core algorithm for training neural networks. In multiple dimensions, becomes (the gradient vector), and the update becomes .
Choosing the Learning Rate
The learning rate controls the step size:
- Too large: The linear approximation breaks down. We might overshoot and land at a point with higher loss.
- Too small: Progress is glacially slow. Many steps needed to converge.
- Just right: Consistent decrease per step, converging to a minimum in reasonable time.
This is why learning rate scheduling (warmup, decay) and adaptive methods (Adam, RMSProp) matter: they adjust based on observed gradient behavior, compensating for the fact that the linear approximation has varying accuracy across the loss surface.
PyTorch and TensorFlow
import torch
# Manual gradient descent on f(x) = x² - 4x + 5 (minimum at x=2)
x = torch.tensor(0.0, requires_grad=True)
lr = 0.1
for step in range(20):
f = x**2 - 4*x + 5
f.backward()
with torch.no_grad():
x -= lr * x.grad # gradient descent step
x.grad.zero_()
if step % 5 == 0:
print(f"step {step}: x={x.item():.4f}, f={f.item():.4f}")
# Converges to x ≈ 2.0, f ≈ 1.0
import tensorflow as tf
# Same function, TF-style
x = tf.Variable(0.0)
lr = 0.1
for step in range(20):
with tf.GradientTape() as tape:
f = x**2 - 4*x + 5
grad = tape.gradient(f, x)
x.assign_sub(lr * grad) # x = x - lr * grad
if step % 5 == 0:
print(f"step {step}: x={x.numpy():.4f}, f={f.numpy():.4f}")
These loops implement exactly the theory: compute the derivative at the current point, step in the negative gradient direction, repeat. The zero_grad() call in PyTorch is necessary because autograd accumulates gradients — without clearing them, each step's gradient would pile onto the previous.