The RL Algorithm Landscape
- Distinguish model-free from model-based RL and explain the key upside (planning) and downside (model bias) of model-based approaches
- Compare policy optimization and Q-learning along the dimensions of stability, sample efficiency, and directness of optimization
- Classify VPG, TRPO, PPO, DDPG, TD3, SAC, DQN, and A3C into the correct branches of the RL algorithm taxonomy
The Big Picture
Modern RL encompasses dozens of algorithms that differ in what they learn, how they learn it, and what assumptions they make about the environment. Before diving into individual algorithms, it pays to understand the major branching points in algorithm design.
Model-Free vs. Model-Based
The first major branching point: does the agent learn (or use) a model of the environment? A model is a function that predicts state transitions and rewards: given $(s_t, a_t)$, predict $s_{t+1}$ and $r_t$.
Model-Based RL
Upside: planning. With a model, the agent can think ahead: simulate possible futures and choose actions that lead to better outcomes. AlphaZero does this with Monte Carlo Tree Search over a known model (the rules of the game) and achieves superhuman performance in Go and chess.
Downside: model bias. A learned model is never perfect, and the agent can learn to exploit its inaccuracies, producing behavior that scores well under the model but does not transfer to the real environment. Model learning is hard, and even after significant compute investment, the remaining bias can still cause serious failures at deployment.
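The planning upside and the model-bias downside can both be seen in a small sketch. Everything in it is a hypothetical illustration rather than any library's API: a five-state chain environment (`true_step`), a stand-in for a learned model (`biased_model`) that wrongly predicts a reward for pushing left at state 0, and an exhaustive planner (`plan`) that scores every action sequence under whichever step function it is given.

```python
import itertools

N_STATES = 5          # hypothetical chain: states 0..4, reward on entering state 4
ACTIONS = (-1, +1)    # move left or move right

def true_step(state, action):
    """Real environment dynamics: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def biased_model(state, action):
    """Stand-in for a learned model with one error (assumed for illustration):
    it predicts a reward of 2.0 for pushing left at state 0."""
    next_state, reward = true_step(state, action)
    if state == 0 and action == -1:
        reward = 2.0
    return next_state, reward

def rollout_return(step_fn, state, actions):
    """Undiscounted return of an action sequence under the given step function."""
    total = 0.0
    for action in actions:
        state, reward = step_fn(state, action)
        total += reward
    return total

def plan(step_fn, state, horizon):
    """Exhaustive planning: the action sequence with the highest predicted return."""
    return max(itertools.product(ACTIONS, repeat=horizon),
               key=lambda seq: rollout_return(step_fn, state, seq))

plan_from_model = plan(biased_model, state=0, horizon=4)
predicted = rollout_return(biased_model, 0, plan_from_model)   # what the model promises
realized = rollout_return(true_step, 0, plan_from_model)       # what the environment pays
best_possible = rollout_return(true_step, 0, plan(true_step, 0, 4))
```

In this toy setup the planner chooses to push left four times because the model promises reward there. The plan has a predicted return of 8.0 but a realized return of 0.0, while planning against the true dynamics reaches the goal and earns 1.0.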
Model-Free RL
Model-free methods forgo the model entirely and learn directly from environment interaction. This sacrifices the ability to plan, but a model-free method:
- Is far simpler to implement and tune.
- Avoids model bias.
- Has been more extensively validated on benchmarks.
All algorithms in this course (VPG, TRPO, PPO, DDPG, TD3, SAC) are model-free.
Policy Optimization vs. Q-Learning
Within model-free RL, the second major split is what to learn.
Policy Optimization
Explicitly represent a policy $\pi_\theta(a \mid s)$ and optimize its parameters $\theta$ directly. The agent stores and updates the policy network itself. Updates are almost always performed on-policy: each update uses only data collected with the most recent version of the policy.
Examples: A2C/A3C, VPG, TRPO, PPO.
Strengths: Principled — you optimize directly for what you care about. Stable and reliable when implemented correctly.
Weakness: Sample inefficient. Data collected under one policy goes stale once that policy is updated, so each round of updates requires fresh rollouts.
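A minimal sketch of the on-policy loop, assuming a two-armed bandit and a softmax policy over two logits. The function names, the arm reward probabilities, and the hyperparameters are all hypothetical choices for illustration. The update is a basic policy-gradient step with a batch-mean baseline, not a faithful VPG, TRPO, or PPO implementation.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit_policy(steps=1000, batch_size=16, lr=0.1, seed=0):
    rng = random.Random(seed)
    arm_means = [0.2, 0.8]   # hypothetical probability of reward 1 for each arm
    logits = [0.0, 0.0]      # policy parameters
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: collect a fresh batch with the *current* policy.
        batch = []
        for _ in range(batch_size):
            action = 0 if rng.random() < probs[0] else 1
            reward = 1.0 if rng.random() < arm_means[action] else 0.0
            batch.append((action, reward))
        baseline = sum(r for _, r in batch) / batch_size
        # Policy gradient: grad of log pi(a) w.r.t. logit k is 1[k == a] - pi(k).
        grad = [0.0, 0.0]
        for action, reward in batch:
            for k in range(2):
                indicator = 1.0 if k == action else 0.0
                grad[k] += (indicator - probs[k]) * (reward - baseline)
        logits = [l + lr * g / batch_size for l, g in zip(logits, grad)]
        # The batch is discarded here; the next update samples again.
    return softmax(logits)
```

Calling `train_bandit_policy()` returns the final action probabilities, which should put most of the mass on the better arm. Note that every one of the 1000 updates pays for 16 new environment samples, which is the sample-inefficiency cost described above.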
Q-Learning
Learn an approximation $Q_\theta(s, a)$ of the optimal action-value function $Q^*(s, a)$. The policy is recovered implicitly: $a(s) = \arg\max_a Q_\theta(s, a)$. Updates are almost always performed off-policy: any previously collected transition is usable, because the Bellman equation holds regardless of which policy collected the data.
Examples: DQN, C51, HER.
Strengths: Much more sample efficient — data can be stored in a replay buffer and reused many times.
Weakness: Less stable. Q-learning optimizes performance only indirectly, by training the Q-function to satisfy a self-consistency equation (the Bellman equation) rather than by maximizing return directly. There are many failure modes, especially with function approximation.
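The off-policy property can be sketched with tabular Q-learning on a toy problem. The five-state chain, the helper names, and the hyperparameters below are assumptions made for illustration. All data comes from a uniformly random behavior policy, is stored once in a replay buffer, and is then reused for many Bellman backups.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, episode ends at state 4
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def collect_random_transitions(n, rng):
    """Fill a replay buffer using a uniformly random behavior policy."""
    buffer, state = [], 0
    for _ in range(n):
        action = rng.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        buffer.append((state, action, reward, next_state, done))
        state = 0 if done else next_state   # reset after reaching the goal
    return buffer

def q_learning_from_buffer(buffer, updates=20000, lr=0.1, gamma=0.9, seed=0):
    """Repeatedly sample stored transitions and apply the Q-learning backup."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(updates):
        s, a, r, s2, done = rng.choice(buffer)
        # Bellman target: bootstrap from the best next action unless terminal.
        target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += lr * (target - q[(s, a)])
    return q

buffer = collect_random_transitions(2000, random.Random(0))
q = q_learning_from_buffer(buffer)
# Implicit policy: act greedily with respect to the learned Q-values.
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
```

Although the data was gathered by a policy that moves at random, the greedy policy read off the learned Q-table moves right in every non-terminal state. The 2000 stored transitions are reused across 20000 updates, which is where the sample-efficiency advantage comes from.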
Interpolating Between the Two
Policy optimization and Q-learning are not mutually exclusive; some algorithms live between them:
DDPG: Learns both a deterministic policy and a Q-function, each used to improve the other. Off-policy (uses a replay buffer), but also directly updates a policy network.
TD3: DDPG with three fixes aimed at Q-value overestimation: clipped double-Q learning, delayed policy updates, and target policy smoothing.
SAC: Like DDPG but with a stochastic policy and entropy regularization, giving better exploration and stability.
These hybrid approaches try to get the sample efficiency of Q-learning while maintaining some of the stability of policy optimization.
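One way to see how the policy and the Q-function improve each other is to write down the two objectives that DDPG alternates between. The notation is an assumption, since the text above does not fix symbols: $Q_\phi$ is the learned Q-function, $\mu_\theta$ the deterministic policy, $\mathcal{D}$ the replay buffer, $d$ a terminal flag, and the "targ" subscripts denote slowly updated target networks.

```latex
% Q-function: regress toward a Bellman target computed with the target networks
L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}
  \Big[ \big( Q_\phi(s,a)
      - \big( r + \gamma (1-d)\, Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s')) \big)
  \big)^2 \Big]

% Policy: choose actions that the current Q-function rates highly
\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}} \big[ Q_\phi(s, \mu_\theta(s)) \big]
```

The first objective is the Q-learning half (a Bellman backup on replayed data); the second is the policy-optimization half (a direct gradient step on the policy parameters, using the Q-function as the signal).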
The Taxonomy at a Glance
| Family | Algorithm | On/Off-Policy | Action Space |
|---|---|---|---|
| Policy Optimization | VPG | On | Discrete or Continuous |
| Policy Optimization | TRPO | On | Discrete or Continuous |
| Policy Optimization | PPO | On | Discrete or Continuous |
| Policy Optimization | A2C/A3C | On | Discrete or Continuous |
| Q-Learning | DQN | Off | Discrete |
| Actor-Critic (hybrid) | DDPG | Off | Continuous only |
| Actor-Critic (hybrid) | TD3 | Off | Continuous only |
| Actor-Critic (hybrid) | SAC | Off | Continuous (discrete variants exist) |
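For quick lookups, the classification can also be held in a small data structure. This is only a sketch: the `Entry` type, the `TAXONOMY` dictionary, and the helper function are hypothetical names, and the entries simply restate the classification given in this section (A3C is filed under policy optimization, as in the examples earlier).

```python
from typing import NamedTuple

class Entry(NamedTuple):
    family: str
    on_policy: bool
    action_space: str

# Hypothetical lookup table restating the classification from this section.
TAXONOMY = {
    "VPG":  Entry("policy optimization", True, "discrete or continuous"),
    "TRPO": Entry("policy optimization", True, "discrete or continuous"),
    "PPO":  Entry("policy optimization", True, "discrete or continuous"),
    "A3C":  Entry("policy optimization", True, "discrete or continuous"),
    "DQN":  Entry("q-learning", False, "discrete"),
    "DDPG": Entry("actor-critic (hybrid)", False, "continuous"),
    "TD3":  Entry("actor-critic (hybrid)", False, "continuous"),
    "SAC":  Entry("actor-critic (hybrid)", False, "continuous"),
}

def off_policy_algorithms():
    """Names of algorithms that can reuse replayed data, in sorted order."""
    return sorted(name for name, entry in TAXONOMY.items() if not entry.on_policy)

def family_of(name):
    """Family label for an algorithm name, e.g. family_of('PPO')."""
    return TAXONOMY[name].family
```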
Model-Based Variants (Brief Overview)
Model-based methods are an active research area. Notable approaches:
- Pure planning (MPC): Replan every step using a learned model, execute only the first action. Used in MBMF.
- Expert iteration: Use planning (MCTS) to generate better actions, then distill into the policy. Used in ExIt and AlphaZero.
- Data augmentation: Use a model to generate synthetic rollouts that supplement real experience (as in MBVE) or replace it entirely (as in World Models).
These approaches can dramatically improve sample efficiency when the model is accurate, but require careful handling of model uncertainty.
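The pure-planning loop from the list above (replan every step, execute only the first action) can be sketched as follows. This is a toy under stated assumptions: `model` is a hand-written stand-in for a learned model on a five-state chain, it also plays the role of the real environment (so there is no model error here), and the planner enumerates every action sequence instead of using a practical optimizer.

```python
import itertools

N_STATES = 5          # hypothetical chain: states 0..4, reward on entering state 4
ACTIONS = (-1, +1)    # move left or move right

def model(state, action):
    """Stand-in for a learned model: predicts (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def plan_first_action(state, horizon=4, gamma=0.9):
    """Score every action sequence under the model; return the best first action."""
    best_return, best_first = float("-inf"), ACTIONS[0]
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for t, action in enumerate(seq):
            s, reward = model(s, action)
            total += (gamma ** t) * reward
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

def run_mpc(start=0, max_steps=10):
    """Model-predictive control: replan at every step, execute one action."""
    state, trajectory = start, [start]
    for _ in range(max_steps):
        action = plan_first_action(state)
        # In a real system this would be an environment step; here the toy
        # model doubles as the environment.
        state, _ = model(state, action)
        trajectory.append(state)
        if state == N_STATES - 1:
            break
    return trajectory
```

With these settings `run_mpc()` walks straight to the goal, returning `[0, 1, 2, 3, 4]`. Discarding the rest of each plan and replanning from the newly observed state is what limits how far errors in a learned model can compound.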