Deep Reinforcement Learning · RL Foundations

Before You Start: Prerequisites & Learning Path

By the end of this reading you will be able to:
  • Identify which background knowledge areas are assumed by this course and locate the platform supplements that cover each
  • Recognize the canonical implementation sequence for deep RL algorithms (VPG → DQN → A2C → PPO → DDPG) and how this course maps onto it

Before diving into RL theory and code, it helps to know exactly what background is assumed and where to fill any gaps. This reading maps the prerequisites identified in Joshua Achiam's essay "Spinning Up as a Deep RL Researcher" onto the platform's own supplements, so you can check your footing before starting Module 1.


What This Course Assumes

The tables below list every background area the essay calls out, along with the platform supplement that covers each one.

Mathematics

| Topic | What you need | Where to review |
| --- | --- | --- |
| Probability & statistics | Random variables, expected values, std dev, Bayes' theorem, chain rule of probability, importance sampling | Probability Foundations prereq |
| Linear algebra & calculus | Vectors, matrices, gradients | Matrix Algebra Foundations prereq · Calculus Foundations prereq |
| Taylor series | Understanding second-order approximations (used in TRPO) | Optional; not required for the main track |
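
As a quick self-check on importance sampling, the last item in the probability row, here is a minimal sketch using only the Python standard library. The distributions `p` and `q`, the values `f`, and the helper name `importance_sampling_estimate` are made up for illustration. The idea is to estimate an expectation under `p` using samples drawn from a different distribution `q`, reweighting each sample by `p(x) / q(x)`.

```python
import random


def importance_sampling_estimate(f, p, q, num_samples=100_000, seed=0):
    """Estimate E_p[f(x)] from samples drawn under q, reweighted by p(x) / q(x)."""
    rng = random.Random(seed)
    outcomes = list(range(len(p)))
    # Sample from the proposal distribution q, not from the target p.
    draws = rng.choices(outcomes, weights=q, k=num_samples)
    # Each sample contributes f(x) scaled by its importance weight.
    return sum(p[x] / q[x] * f[x] for x in draws) / num_samples


f = [1.0, 2.0, 3.0]        # value attached to each outcome (illustrative)
p = [0.1, 0.2, 0.7]        # target distribution we care about
q = [1 / 3, 1 / 3, 1 / 3]  # proposal distribution we can actually sample from

exact = sum(p_x * f_x for p_x, f_x in zip(p, f))  # 0.1*1 + 0.2*2 + 0.7*3
estimate = importance_sampling_estimate(f, p, q)
print(f"exact={exact:.3f}  estimate={estimate:.3f}")
```

If the reweighting step looks familiar, you are in good shape: policy-gradient methods that reuse data collected under an older policy rely on the same ratio.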

Deep Learning

| Topic | What you need | Where to review |
| --- | --- | --- |
| Neural network architectures | MLP, vanilla RNN, LSTM, GRU, convolutional layers, ResNets, attention mechanisms, transformer | Neural Network Architectures supplement |
| Regularization | Weight decay (L2), dropout | Regularization supplement |
| Normalization | Batch normalization, layer normalization, weight normalization | Normalization supplement |
| Optimizers | SGD, momentum SGD, Adam | Optimizers supplement |
| Reparameterization trick | How to backpropagate through a stochastic sample | Covered in this course: r8 (SAC) |

Framework

You should be able to write a simple supervised learning loop in PyTorch or TensorFlow before starting. If you can load data, build a model (an nn.Module in PyTorch), compute a loss, backpropagate (loss.backward() in PyTorch), and step an optimizer, you're ready.
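
For reference, the readiness check above corresponds roughly to the following PyTorch sketch. It assumes `torch` is installed; the synthetic regression data, the `TinyRegressor` class, and the hyperparameters are illustrative choices, not part of the course material.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Load data: a synthetic regression set, y = 3x - 1 plus noise (made up for illustration).
x = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
y = 3.0 * x - 1.0 + 0.05 * torch.randn_like(x)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)


class TinyRegressor(nn.Module):
    """A small MLP mapping one input feature to one output."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, inputs):
        return self.net(inputs)


model = TinyRegressor()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

with torch.no_grad():
    initial_loss = loss_fn(model(x), y).item()

for epoch in range(20):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()                      # clear gradients from the previous step
        loss = loss_fn(model(batch_x), batch_y)    # compute the loss
        loss.backward()                            # backpropagate
        optimizer.step()                           # update the parameters

with torch.no_grad():
    final_loss = loss_fn(model(x), y).item()

print(f"loss before: {initial_loss:.4f}  after: {final_loss:.4f}")
```

If every line here reads as routine, the framework prerequisite is covered.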


The Implementation Sequence

The essay recommends implementing these five algorithms from scratch, in order, starting with the simplest:

| Order | Algorithm | Why this order |
| --- | --- | --- |
| 1 | VPG / REINFORCE | Simplest policy gradient: bare log-prob × return, ~80–150 lines |
| 2 | DQN | Introduces Q-learning, replay buffer, and target networks for discrete actions |
| 3 | A2C | Synchronous actor-critic; adds a value function baseline to VPG |
| 4 | PPO | Extends VPG with a clipped surrogate objective for stable multi-step updates |
| 5 | DDPG | Extends Q-learning to continuous action spaces with a deterministic policy |
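
To make the "log-prob × return" update in row 1 concrete, here is a minimal sketch of REINFORCE on a hypothetical two-armed bandit, written with the standard library only. The toy problem, the function names, and the hyperparameters are assumptions for illustration. A real VPG implementation would use a neural network policy, multi-step episodes, and an autodiff framework rather than the hand-derived gradient used here.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]  # hypothetical toy problem: arm 1 always pays more
    logits = [0.0, 0.0]       # policy parameters, one logit per arm
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices([0, 1], weights=probs)[0]
        ret = arm_rewards[action]  # one-step episode, so the return equals the reward
        # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - probs[k].
        for k in range(2):
            grad_logp = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * grad_logp * ret  # gradient ascent on log-prob × return
    return softmax(logits)


final_probs = train_bandit()
print(f"P(arm 0)={final_probs[0]:.3f}  P(arm 1)={final_probs[1]:.3f}")
```

The policy shifts its probability mass toward the better arm. The later algorithms in the table can be read as refinements of this update: lower-variance return estimates, constraints on step size, or a switch to learned Q-functions.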

This course covers VPG (r4), PPO (r6), DDPG (r7), TD3, and SAC. DQN and A2C are outside the Spinning Up scope and not covered here — the exercises in lab1 are the right place to attempt those if you want to follow the full sequence.

Write single-threaded code first. Only parallelize after you have a working, correct implementation. Broken RL code almost always fails silently — the agent just never learns — so the ability to read your own code critically and know exactly what it should be doing is more important than any hyperparameter setting.


Debug Environments

When testing any implementation, start here before trying anything more complex:

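import gym  # classic OpenAI Gym API; assumes the gym package is installed

# Each line below is an alternative; pick one environment per experiment.

# Discrete action spaces (policy gradients, DQN)
env = gym.make('CartPole-v1')        # solves in <100 epochs with VPG
env = gym.make('FrozenLake-v1')      # sparse reward, good for Q-learning

# Continuous action spaces (actor-critic, DDPG, PPO)
# The -v2 MuJoCo IDs rely on the legacy mujoco-py bindings.
env = gym.make('InvertedPendulum-v2')  # fast convergence, good smoke test
env = gym.make('HalfCheetah-v2')       # use max_ep_len=150 initially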

Target turnaround: < 5 minutes per debug experiment on your local CPU. If an experiment takes 30 minutes and the agent doesn't learn, you've lost 30 minutes to a single bug. Keep it fast until you're confident the implementation is correct, then scale up.


What to Measure

The essay's debugging checklist — instrument everything from day one:

  • AverageEpRet — mean cumulative reward per episode (the headline metric)
  • EpLen — mean episode length (tells you if the agent is surviving longer)
  • VVals — value function estimates (should track actual returns over time)
  • LossPi / LossQ — policy and Q-function losses
  • Entropy — policy entropy for stochastic methods (should not collapse to zero early)
  • KL — KL divergence between old and new policy (for TRPO/PPO)
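
A few of the metrics above can be computed by hand, which helps when sanity-checking what a logger reports. The following is a minimal sketch using only the standard library. The helper names and sample numbers are hypothetical, and the KL function is a common sample-based estimate rather than an exact divergence.

```python
import math
from statistics import mean


def summarize_epoch(ep_returns, ep_lengths):
    """Headline statistics over the episodes collected in one epoch."""
    return {"AverageEpRet": mean(ep_returns), "EpLen": mean(ep_lengths)}


def categorical_entropy(probs):
    """Entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(old || new).

    Both lists hold log-probabilities of the same actions, which are assumed
    to have been sampled under the old policy.
    """
    return mean(old - new for old, new in zip(logp_old, logp_new))


# Made-up numbers, purely to exercise the helpers.
stats = summarize_epoch(ep_returns=[20.0, 35.0, 50.0], ep_lengths=[20, 35, 50])
uniform_entropy = categorical_entropy([0.5, 0.5])    # maximum for two actions
collapsed_entropy = categorical_entropy([1.0, 0.0])  # a fully deterministic policy
unchanged_kl = approx_kl([-0.7, -1.2], [-0.7, -1.2])  # identical policies

print(stats, uniform_entropy, collapsed_entropy, unchanged_kl)
```

An entropy that falls to the collapsed value within the first few epochs, or a KL estimate that jumps sharply after one update, is usually worth investigating before changing any hyperparameters.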

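Spinning Up's logger writes the metrics that apply to each algorithm to progress.txt automatically (Entropy and KL, for example, only appear for stochastic policy-gradient methods). Read it, don't just watch the terminal.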