Before You Start: Prerequisites & Learning Path
- Identify which background knowledge areas are assumed by this course and locate the platform supplements that cover each
- Recognize the canonical implementation sequence for deep RL algorithms (VPG → DQN → A2C → PPO → DDPG) and how this course maps onto it
Before diving into RL theory and code, it helps to know exactly what background is assumed and where to fill any gaps. This reading maps the prerequisites identified in Joshua Achiam's essay "Spinning Up as a Deep RL Researcher" onto the platform's own supplements, so you can check your footing before starting Module 1.
What This Course Assumes
The tables below list the background areas the essay calls out, with the platform supplement that covers each.
Mathematics
| Topic | What you need | Where to review |
|---|---|---|
| Probability & statistics | Random variables, expected values, standard deviations, Bayes' theorem, chain rule of probability, importance sampling | Probability Foundations prereq |
| Linear algebra & calculus | Vectors, matrices, gradients | Matrix Algebra Foundations prereq · Calculus Foundations prereq |
| Taylor series | Understanding second-order approximations (used in TRPO) | Optional — not required for the main track |
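
Of the probability topics above, importance sampling is the one that reappears most directly in later code: the probability ratio in PPO is an importance weight. A minimal sketch in plain Python, assuming a toy two-outcome distribution; the names `p`, `q`, `f`, and `importance_sampling_estimate` are illustrative and do not come from the essay:

```python
import random

def importance_sampling_estimate(f, p, q, n_samples, rng):
    """Estimate E_p[f(X)] using samples drawn from q instead of p.

    p and q are dicts mapping each outcome to its probability.
    q must give nonzero probability to every outcome p can produce.
    """
    outcomes = list(q)
    weights = [q[x] for x in outcomes]
    total = 0.0
    for _ in range(n_samples):
        x = rng.choices(outcomes, weights=weights)[0]  # sample from q
        total += (p[x] / q[x]) * f(x)                  # reweight by p/q
    return total / n_samples

# Toy example (all values are made up for illustration).
p = {0: 0.9, 1: 0.1}          # target distribution
q = {0: 0.5, 1: 0.5}          # sampling distribution
f = lambda x: 10.0 * x        # function whose expectation we want

rng = random.Random(0)
estimate = importance_sampling_estimate(f, p, q, 20_000, rng)
exact = sum(p[x] * f(x) for x in p)   # 0.9*0 + 0.1*10
print(f"estimate={estimate:.3f}  exact={exact:.3f}")
```

If you can explain why the `p[x] / q[x]` factor makes the estimate unbiased, this prerequisite is probably in good shape.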
Deep Learning
| Topic | What you need | Where to review |
|---|---|---|
| Neural network architectures | MLP, vanilla RNN, LSTM, GRU, convolutional layers, ResNets, attention mechanisms, transformer | Neural Network Architectures supplement |
| Regularization | Weight decay (L2), dropout | Regularization supplement |
| Normalization | Batch normalization, layer normalization, weight normalization | Normalization supplement |
| Optimizers | SGD, momentum SGD, Adam | Optimizers supplement |
| Reparameterization trick | How to backpropagate through a stochastic sample | Covered in this course — r8 (SAC) |
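
As a small preview of the reparameterization trick (not the course's own treatment), the sketch below estimates gradients of an expectation through a Gaussian sample in plain Python. It assumes a hand-written derivative `df` in place of autodiff, and the function name `reparam_gradients` is hypothetical:

```python
import random

def reparam_gradients(mu, sigma, df, n_samples, rng):
    """Monte Carlo gradient of E[f(x)], x ~ N(mu, sigma^2), w.r.t. mu and sigma.

    The sample is written as x = mu + sigma * eps with eps ~ N(0, 1), so the
    randomness lives in eps and x is a differentiable function of (mu, sigma):
        dx/dmu = 1,  dx/dsigma = eps
    df is the derivative of f, supplied by hand here; an autodiff framework
    would compute it for you.
    """
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps
        g_mu += df(x) * 1.0      # chain rule through dx/dmu
        g_sigma += df(x) * eps   # chain rule through dx/dsigma
    return g_mu / n_samples, g_sigma / n_samples

# f(x) = x^2, so E[f] = mu^2 + sigma^2 and the exact gradients are (2*mu, 2*sigma).
mu, sigma = 1.5, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma, lambda x: 2.0 * x, 50_000, random.Random(0))
print(f"d/dmu ~ {g_mu:.3f} (exact {2 * mu})   d/dsigma ~ {g_sigma:.3f} (exact {2 * sigma})")
```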
Framework
You should be able to write a simple supervised learning loop in PyTorch or TensorFlow before starting. If you can load data, build an nn.Module, compute a loss, call .backward(), and step an optimizer — you're ready.
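As a self-check, the loop described above might look like the following in PyTorch. This is a minimal sketch on synthetic data; the `TinyMLP` class, the data, and every hyperparameter are assumptions chosen for illustration, not values from the essay or the course:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data (made up for illustration): y = 3x - 1 plus noise.
x = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
y = 3.0 * x - 1.0 + 0.05 * torch.randn_like(x)

class TinyMLP(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, inputs):
        return self.net(inputs)

model = TinyMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(100):
    # Shuffle, then iterate over minibatches of 64.
    perm = torch.randperm(x.shape[0])
    for start in range(0, x.shape[0], 64):
        idx = perm[start:start + 64]
        optimizer.zero_grad()                    # clear gradients from the previous step
        loss = loss_fn(model(x[idx]), y[idx])    # forward pass and loss
        loss.backward()                          # backpropagate
        optimizer.step()                         # update parameters
    losses.append(loss.item())                   # loss on the last minibatch of the epoch

print(f"first epoch loss: {losses[0]:.4f}   last epoch loss: {losses[-1]:.4f}")
```

If every line here is familiar, the framework prerequisite is met; the RL algorithms later reuse the same `zero_grad` / `backward` / `step` pattern with different loss functions.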
The Implementation Sequence
The essay recommends implementing these five algorithms from scratch, approximately in this order, starting with the simplest:
| Order | Algorithm | Why this order |
|---|---|---|
| 1 | VPG / REINFORCE | Simplest policy gradient — bare log-prob × return, ~80–150 lines |
| 2 | DQN | Introduces Q-learning, replay buffer, and target networks for discrete actions |
| 3 | A2C | Synchronous actor-critic; adds a value function baseline to VPG |
| 4 | PPO | Extends VPG with a clipped surrogate objective for stable multi-step updates |
| 5 | DDPG | Extends Q-learning to continuous action spaces with a deterministic policy |
This course covers VPG (r4), PPO (r6), DDPG (r7), TD3, and SAC (r8). Spinning Up does not implement DQN or A2C, so they are not covered here; the exercises in lab1 are the right place to attempt those if you want to follow the full sequence.
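To make the "log-prob × return" and "clipped surrogate" entries in the table concrete, here is a rough sketch of both losses on plain Python floats. It only shows the arithmetic: a real implementation would use framework tensors so gradients can flow. The function names are hypothetical, and `clip_ratio=0.2` is a commonly used default assumed here rather than a value taken from this course:

```python
import math

def vpg_loss(logps, returns):
    """Negative mean of log pi(a|s) * return: the basic policy-gradient surrogate."""
    return -sum(lp * r for lp, r in zip(logps, returns)) / len(logps)

def ppo_clip_loss(logps_new, logps_old, advantages, clip_ratio=0.2):
    """Negative mean of the PPO clipped surrogate objective."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        ratio = math.exp(lp_new - lp_old)   # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        total += min(ratio * adv, clipped * adv)   # take the more pessimistic term
    return -total / len(advantages)

# Two made-up samples: the old policy picked each action with probability 0.5.
logps_old = [math.log(0.5), math.log(0.5)]
logps_new = [math.log(0.8), math.log(0.2)]   # ratios 1.6 and 0.4
advantages = [1.0, -1.0]

loss_vpg = vpg_loss(logps_old, returns=[1.0, 3.0])
loss_ppo = ppo_clip_loss(logps_new, logps_old, advantages)
loss_ppo_no_change = ppo_clip_loss(logps_old, logps_old, advantages)
print(loss_vpg, loss_ppo, loss_ppo_no_change)
```

In the first sample the ratio 1.6 is clipped to 1.2, and in the second the ratio 0.4 is clipped to 0.8, so neither sample rewards moving the policy further than the clip range allows.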
Write single-threaded code first. Only parallelize after you have a working, correct implementation. Broken RL code almost always fails silently — the agent just never learns — so the ability to read your own code critically and know exactly what it should be doing is more important than any hyperparameter setting.
Debug Environments
When testing any implementation, start here before trying anything more complex:
import gym
# Pick ONE of the environments below; each line is an alternative, not a sequence.
# Discrete (policy gradients, DQN)
env = gym.make('CartPole-v1') # solves in <100 epochs with VPG
env = gym.make('FrozenLake-v1') # sparse reward, good for Q-learning
# Continuous (actor-critic, DDPG, PPO); the -v2 MuJoCo environments need mujoco-py installed
env = gym.make('InvertedPendulum-v2') # fast convergence, good smoke test
env = gym.make('HalfCheetah-v2') # use max_ep_len=150 initially
Target turnaround: < 5 minutes per debug experiment on your local CPU. If an experiment takes 30 minutes and the agent doesn't learn, you've lost 30 minutes to a single bug. Keep it fast until you're confident the implementation is correct, then scale up.
What to Measure
The essay's debugging checklist — instrument everything from day one:
- AverageEpRet: mean cumulative reward per episode (the headline metric)
- EpLen: mean episode length (tells you if the agent is surviving longer)
- VVals: value function estimates (should track actual returns over time)
- LossPi / LossQ: policy and Q-function losses
- Entropy: policy entropy for stochastic methods (should not collapse to zero early)
- KL: KL divergence between old and new policy (for TRPO/PPO)
Spinning Up's progress.txt logs these automatically, with each algorithm writing the subset that applies to it. Read it; don't just watch the terminal.
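If you are writing an implementation from scratch and want similar instrumentation, a per-epoch accumulator can be very small. The sketch below is a hypothetical `MetricLog` class, not Spinning Up's logger; the mean/std/min/max summary and the tab-separated output are assumptions modeled loosely on the metrics listed above, and it assumes the same metric names are stored every epoch:

```python
import statistics
from collections import defaultdict

class MetricLog:
    """Hypothetical per-epoch metric accumulator (not Spinning Up's logger)."""

    def __init__(self):
        self._buffers = defaultdict(list)
        self.rows = []

    def store(self, **metrics):
        # Call once per episode or per update with whatever you want tracked.
        for name, value in metrics.items():
            self._buffers[name].append(float(value))

    def dump_epoch(self, epoch):
        # Summarize everything stored since the last dump, then reset.
        row = {"Epoch": epoch}
        for name, values in self._buffers.items():
            row["Average" + name] = statistics.mean(values)
            row["Std" + name] = statistics.pstdev(values)
            row["Min" + name] = min(values)
            row["Max" + name] = max(values)
        self.rows.append(row)
        self._buffers.clear()
        return row

    def to_tsv(self):
        # One header line plus one tab-separated line per epoch.
        if not self.rows:
            return ""
        header = list(self.rows[0])
        lines = ["\t".join(header)]
        for row in self.rows:
            cells = [f"{row[k]:.4g}" if isinstance(row[k], float) else str(row[k])
                     for k in header]
            lines.append("\t".join(cells))
        return "\n".join(lines)

# Made-up episode results for one epoch.
log = MetricLog()
for ep_ret, ep_len in [(10.0, 10), (20.0, 20), (30.0, 30)]:
    log.store(EpRet=ep_ret, EpLen=ep_len)
row = log.dump_epoch(0)
print(log.to_tsv())
```

Keeping min and max alongside the mean helps catch cases where the average looks healthy but a few episodes are failing badly.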