Deep Reinforcement Learning · RL Foundations

Before You Start: Prerequisites & Learning Path

By the end of this reading you will be able to:

Identify which background knowledge areas are assumed by this course and locate the platform supplements that cover each
Recognize the canonical implementation sequence for deep RL algorithms (VPG → DQN → A2C → PPO → DDPG) and how this course maps onto it

Before diving into RL theory and code, it helps to know exactly what background is assumed and where to fill any gaps. This reading maps the prerequisites identified in Joshua Achiam's essay "Spinning Up as a Deep RL Researcher" onto the platform's own supplements, so you can check your footing before starting Module 1.

What This Course Assumes

The tables below list every background area the essay calls out, with the corresponding platform supplement where each is covered.

Mathematics

Topic	What you need	Where to review
Probability & statistics	Random variables, expected values, standard deviations, Bayes' theorem, chain rule of probability, importance sampling	Probability Foundations prereq
Linear algebra & calculus	Vectors, matrices, gradients	Matrix Algebra Foundations prereq · Calculus Foundations prereq
Taylor series	Understanding second-order approximations (used in TRPO)	Optional — not required for the main track
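
Of the probability topics above, importance sampling is the one that comes back most directly later on, since the PPO probability ratio is an importance weight. As a quick self-check, here is a minimal NumPy sketch; the two Gaussians, the sample count, and the helper names are illustrative assumptions, not something taken from the essay. It estimates an expectation under a target distribution p using only samples drawn from a different proposal q.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def importance_sampling_estimate(f, n_samples=100_000, seed=0):
    """Estimate E_p[f(x)] for p = N(1, 1) using samples from q = N(0, 2^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.0, scale=2.0, size=n_samples)  # samples come from q, not p
    weights = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # p(x) / q(x)
    return float(np.mean(weights * f(x)))


estimate = importance_sampling_estimate(lambda x: x ** 2)
# For p = N(1, 1), the exact value is mean^2 + variance = 2.
print(f"importance-sampled E_p[x^2] = {estimate:.3f} (exact value is 2)")
```

The proposal is deliberately wider than the target. If q had thinner tails than p, the weights could become very large and the estimate would be much noisier.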

Deep Learning

Topic	What you need	Where to review
Neural network architectures	MLP, vanilla RNN, LSTM, GRU, convolutional layers, ResNets, attention mechanisms, transformer	Neural Network Architectures supplement
Regularization	Weight decay (L2), dropout	Regularization supplement
Normalization	Batch normalization, layer normalization, weight normalization	Normalization supplement
Optimizers	SGD, momentum SGD, Adam	Optimizers supplement
Reparameterization trick	How to backpropagate through a stochastic sample	Covered in this course — r8 (SAC)
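
The last row is the one prerequisite this course teaches itself, but a preview may help. The sketch below is a minimal PyTorch illustration with assumed values for the mean, log standard deviation, and sample count. It writes a Gaussian sample as a deterministic function of the parameters plus independent noise, so that gradients can flow back to those parameters.

```python
import torch

torch.manual_seed(0)

# Parameters of a Gaussian; the values are arbitrary choices for this sketch.
mu = torch.tensor(0.5, requires_grad=True)
log_std = torch.tensor(-1.0, requires_grad=True)

eps = torch.randn(10_000)            # noise drawn independently of mu and log_std
z = mu + torch.exp(log_std) * eps    # z ~ N(mu, std^2), differentiable in both parameters

loss = (z ** 2).mean()               # Monte Carlo estimate of E[z^2] = mu^2 + std^2
loss.backward()

# Analytic gradients of mu^2 + std^2 for comparison:
#   d/d mu      = 2 * mu
#   d/d log_std = 2 * std^2
print("grad wrt mu:     ", mu.grad.item(), "analytic:", 2 * mu.item())
print("grad wrt log_std:", log_std.grad.item(),
      "analytic:", 2 * torch.exp(2 * log_std).item())
```

Sampling `z` directly from a distribution object without this rewrite would cut the graph between the sample and the parameters, which is why the trick matters for methods such as SAC.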

Framework

You should be able to write a simple supervised learning loop in PyTorch or TensorFlow before starting. If you can load data, build a model, compute a loss, backpropagate, and step an optimizer (in PyTorch: an nn.Module, .backward(), and optimizer.step()), you're ready.
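
As a self-check, the following is a minimal PyTorch sketch of such a loop. The synthetic regression data, the `TinyRegressor` class, and every hyperparameter are assumptions made for illustration; none of them come from the course materials. If each line reads as familiar, the framework prerequisite is probably met.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Load data: a synthetic regression task (y = 3x - 1 plus small noise).
inputs = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
targets = 3.0 * inputs - 1.0 + 0.05 * torch.randn_like(inputs)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=32, shuffle=True)


class TinyRegressor(nn.Module):
    """Two-layer MLP; the name and sizes are hypothetical."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x)


model = TinyRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

epoch_losses = []
for epoch in range(50):
    running = 0.0
    for batch_inputs, batch_targets in loader:
        optimizer.zero_grad()                                # clear old gradients
        loss = loss_fn(model(batch_inputs), batch_targets)   # compute a loss
        loss.backward()                                      # backpropagate
        optimizer.step()                                     # step the optimizer
        running += loss.item()
    epoch_losses.append(running / len(loader))

print(f"first epoch loss {epoch_losses[0]:.4f}, last epoch loss {epoch_losses[-1]:.4f}")
```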

The Implementation Sequence

The essay recommends implementing these five algorithms from scratch, in order, starting with the simplest:

Order	Algorithm	Why this order
1	VPG / REINFORCE	Simplest policy gradient — bare log-prob × return, ~80–150 lines
2	DQN	Introduces Q-learning, replay buffer, and target networks for discrete actions
3	A2C	Synchronous actor-critic; adds a value function baseline to VPG
4	PPO	Extends VPG with a clipped surrogate objective, so several gradient steps can be taken on each batch of data without the policy moving too far
5	DDPG	Extends Q-learning to continuous action spaces with a deterministic policy
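
To make rows 1 and 4 concrete, here is a hedged PyTorch sketch of the two policy losses. The function names, the default `clip_ratio=0.2`, and the three-timestep batch are assumptions for illustration. A full implementation also needs rollout collection, return or advantage estimation, and an optimizer loop.

```python
import torch


def vpg_loss(log_probs, returns):
    """Row 1: log-prob times return, negated so that minimizing it ascends the policy gradient."""
    return -(log_probs * returns).mean()


def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_ratio=0.2):
    """Row 4: clipped surrogate objective, negated for use with a minimizer."""
    ratio = torch.exp(log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    return -torch.min(ratio * advantages, clipped).mean()


# Made-up batch of three timesteps.
log_probs = torch.tensor([-0.5, -1.0, -2.0], requires_grad=True)
returns = torch.tensor([1.0, 2.0, 3.0])
advantages = torch.tensor([1.0, -1.0, 0.5])
old_log_probs = log_probs.detach().clone()  # first update step: ratio is exactly 1

print("VPG loss:", vpg_loss(log_probs, returns).item())
print("PPO loss:", ppo_clip_loss(log_probs, old_log_probs, advantages).item())
```

Once the ratio moves outside the clip range in the direction the advantage favors, the clipped term is selected and its gradient with respect to the policy is zero. That is what limits how far repeated updates on the same batch can push the policy.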

This course covers VPG (r4), PPO (r6), DDPG (r7), TD3, and SAC. DQN and A2C are outside the Spinning Up scope and not covered here — the exercises in lab1 are the right place to attempt those if you want to follow the full sequence.

Write single-threaded code first. Only parallelize after you have a working, correct implementation. Broken RL code almost always fails silently — the agent just never learns — so the ability to read your own code critically and know exactly what it should be doing is more important than any hyperparameter setting.

Debug Environments

When testing any implementation, start here before trying anything more complex:

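import gym  # classic Gym API; the -v2 MuJoCo tasks also need mujoco-py installed

# Pick one environment at a time; the alternatives are left commented out.

# Discrete (policy gradients, DQN)
env = gym.make('CartPole-v1')            # solves in <100 epochs with VPG
# env = gym.make('FrozenLake-v1')        # sparse reward, good for Q-learning

# Continuous (actor-critic, DDPG, PPO)
# env = gym.make('InvertedPendulum-v2')  # fast convergence, good smoke test
# env = gym.make('HalfCheetah-v2')       # cap episodes at 150 steps initially (max_ep_len=150 in Spinning Up)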

Target turnaround: < 5 minutes per debug experiment on your local CPU. If an experiment takes 30 minutes and the agent doesn't learn, you've lost 30 minutes to a single bug. Keep it fast until you're confident the implementation is correct, then scale up.

What to Measure

The essay's debugging checklist — instrument everything from day one:

AverageEpRet — mean cumulative reward per episode (the headline metric)
EpLen — mean episode length (tells you if the agent is surviving longer)
VVals — value function estimates (should track actual returns over time)
LossPi / LossQ — policy and Q-function losses
Entropy — policy entropy for stochastic methods (should not collapse to zero early)
KL — KL divergence between old and new policy (for TRPO/PPO)

Spinning Up's progress.txt logs all of these automatically. Read it, don't just watch the terminal.
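
One way to read the log rather than the terminal is to load it into columns and check the metrics programmatically. The sketch below uses only the standard library and assumes progress.txt is a tab-separated file with one header row. The helper names, the entropy floor, and the sample numbers are hypothetical; the sample file is synthetic and stands in for a real run.

```python
import csv
import os
import tempfile


def read_progress(path):
    """Read a tab-separated log with a header row into {column name: list of floats}."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        columns = {name: [] for name in (reader.fieldnames or [])}
        for row in reader:
            for name in columns:
                columns[name].append(float(row[name]))
    return columns


def entropy_collapsed_early(columns, floor=0.05, first_epochs=5):
    """Hypothetical check: did Entropy fall below `floor` within the first few epochs?"""
    return any(value < floor for value in columns.get("Entropy", [])[:first_epochs])


# Synthetic numbers, made up for this sketch; a real run writes its own progress.txt.
sample = (
    "Epoch\tAverageEpRet\tEpLen\tEntropy\tKL\n"
    "0\t20.0\t20.0\t0.69\t0.004\n"
    "1\t35.5\t35.5\t0.65\t0.006\n"
    "2\t60.0\t60.0\t0.58\t0.009\n"
)
with tempfile.TemporaryDirectory() as log_dir:
    progress_path = os.path.join(log_dir, "progress.txt")
    with open(progress_path, "w") as handle:
        handle.write(sample)
    columns = read_progress(progress_path)

print("AverageEpRet by epoch:", columns["AverageEpRet"])
print("Entropy collapsed early:", entropy_collapsed_early(columns))
```

The same dictionary can feed a plot or a simple assertion in a test script, which makes regressions visible without rereading terminal output.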

References

Spinning Up as a Deep RL Researcher (Achiam, 2018)
