Prerequisite · Calculus Foundations

Gradient Descent from Scratch

Colab Notebook · Python · ~50 min

Lab Objectives
  1. Implement a gradient descent loop from scratch and plot the convergence trajectory for f(x) = x² − 4x + 5.
  2. Run three learning rate experiments (too small, optimal, too large) and describe the qualitative behavior of each.
  3. Extend gradient descent to 2D and visualize the optimization path on a contour plot of f(x,y) = x² + y².
  4. Derive and implement the MSE gradients ∂L/∂w and ∂L/∂b analytically, then use them to fit a line to noisy data.
  5. Confirm that your hand-rolled gradient descent matches torch.optim.SGD step-for-step on the linear regression problem.

Lab 2: Gradient Descent from Scratch

Gradient descent is the algorithm that makes neural network training possible. In this lab you will implement it without any framework machinery, using only the update rule $x \leftarrow x - \eta \cdot f'(x)$ applied repeatedly, and build the intuition for why the learning rate is the most important hyperparameter.
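As a rough preview of the update rule in action, here is a minimal sketch of the 1D loop on this lab's function $f(x) = x^2 - 4x + 5$, whose derivative is $f'(x) = 2x - 4$. The function names, the starting point `x0=0.0`, the learning rate `eta=0.1`, and the step count are illustrative assumptions, not values prescribed by the notebook. It uses only the standard library and leaves plotting out.

```python
def f(x):
    """Objective from this lab: f(x) = x^2 - 4x + 5, minimized at x = 2."""
    return x**2 - 4 * x + 5


def f_prime(x):
    """Analytic derivative: f'(x) = 2x - 4."""
    return 2 * x - 4


def gradient_descent_1d(x0, eta, n_steps):
    """Apply x <- x - eta * f'(x) n_steps times.

    Returns the list of (x, f(x)) pairs visited, including the start.
    """
    x = x0
    history = [(x, f(x))]
    for _ in range(n_steps):
        x = x - eta * f_prime(x)
        history.append((x, f(x)))
    return history


# Illustrative settings (assumptions, not taken from the notebook).
history = gradient_descent_1d(x0=0.0, eta=0.1, n_steps=50)
x_final, f_final = history[-1]
print(f"x = {x_final:.4f}, f(x) = {f_final:.4f}")
```

With these settings the iterates approach the minimizer $x = 2$, where $f(2) = 1$. The logged `history` is what you would pass to a plotting routine to draw the trajectory over the curve.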

What You'll Build

  • A 1D gradient descent loop on $f(x) = x^2 - 4x + 5$, logging $x$ and $f(x)$ at each step and plotting the convergence trajectory overlaid on the function
  • A learning rate comparison: three runs with $\eta$ too small, just right, and too large, with annotated plots showing slow convergence, clean convergence, and divergence/oscillation
  • A 2D gradient descent on the quadratic bowl $f(x,y) = x^2 + y^2$, with contour plots and gradient arrows showing the path to the origin
  • A linear regression via GD: fit a line $y = wx + b$ to noisy data by minimizing MSE, implementing the parameter updates $w \leftarrow w - \eta \cdot \partial\mathcal{L}/\partial w$ and $b \leftarrow b - \eta \cdot \partial\mathcal{L}/\partial b$ by hand
  • A loss curve comparison between the hand-rolled update and torch.optim.SGD confirming they converge identically
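For the linear regression item, taking the loss to be $\mathcal{L} = \frac{1}{n}\sum_i (w x_i + b - y_i)^2$ gives the gradients $\partial\mathcal{L}/\partial w = \frac{2}{n}\sum_i (w x_i + b - y_i)\,x_i$ and $\partial\mathcal{L}/\partial b = \frac{2}{n}\sum_i (w x_i + b - y_i)$. The sketch below shows one way these might be coded. The synthetic data (true slope 3, intercept −1, noise level 0.1), the helper names, and the hyperparameters are all assumptions made for illustration. The torch.optim.SGD comparison is left out so the example needs only the standard library.

```python
import random


def make_data(n=100, true_w=3.0, true_b=-1.0, noise=0.1, seed=0):
    """Hypothetical noisy line: y = true_w * x + true_b + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + true_b + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradients(w, b, xs, ys):
    """Analytic gradients of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    dw = (2.0 / n) * sum(r * x for r, x in zip(residuals, xs))
    db = (2.0 / n) * sum(residuals)
    return dw, db


def fit_line(xs, ys, eta=0.1, n_steps=500):
    """Full-batch gradient descent on (w, b), starting from zero."""
    w, b = 0.0, 0.0
    losses = []
    for _ in range(n_steps):
        losses.append(mse(w, b, xs, ys))
        dw, db = mse_gradients(w, b, xs, ys)
        w -= eta * dw
        b -= eta * db
    return w, b, losses


xs, ys = make_data()
w, b, losses = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}, final MSE = {losses[-1]:.4f}")
```

The recovered `w` and `b` should land close to the slope and intercept used to generate the data, and `losses` is the curve you would plot. When the loss is computed on the full batch, torch.optim.SGD with the same learning rate and its default zero momentum applies the same update, $p \leftarrow p - \eta \cdot \nabla_p \mathcal{L}$, which is why the two curves are expected to coincide.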

Key Concepts Practiced

By the end you will have built an intuition for the loss surface, understood why stepping in the negative gradient direction decreases the function (from the linear approximation argument), and seen directly how the learning rate determines whether training converges, diverges, or oscillates.
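The linear approximation argument can be stated in one line: for a small step, $f(x - \eta f'(x)) \approx f(x) - \eta\, f'(x)^2$, so the function decreases whenever $f'(x) \neq 0$ and $\eta$ is small enough. For the 1D quadratic in this lab the update can also be worked out exactly: since $f'(x) = 2x - 4$, each step gives $x_{k+1} - 2 = (1 - 2\eta)(x_k - 2)$. The distance to the minimum therefore shrinks when $0 < \eta < 1$, flips sign on every step when $\eta > 0.5$, and grows when $\eta > 1$. The sketch below checks this numerically. The three rates (0.01, 0.4, 1.1) and the step count are illustrative assumptions, not the values the notebook uses.

```python
def f_prime(x):
    """Derivative of f(x) = x^2 - 4x + 5."""
    return 2 * x - 4


def distance_to_minimum(eta, x0=0.0, n_steps=25):
    """Run gradient descent and record |x - 2| after every step."""
    x = x0
    distances = [abs(x - 2.0)]
    for _ in range(n_steps):
        x = x - eta * f_prime(x)
        distances.append(abs(x - 2.0))
    return distances


# Illustrative learning rates (assumed for this sketch).
rates = {"too small": 0.01, "reasonable": 0.4, "too large": 1.1}
results = {label: distance_to_minimum(eta) for label, eta in rates.items()}

for label, d in results.items():
    print(f"{label:>10} (eta={rates[label]}): |x - 2| goes {d[0]:.3g} -> {d[-1]:.3g}")
```

Under these assumptions the small rate is still far from the minimum after 25 steps, the middle rate has effectively converged, and the large rate ends farther away than it started because $|1 - 2\eta| > 1$.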