The RL Algorithm Landscape

By the end of this reading you will be able to:
  • Distinguish model-free from model-based RL and explain the key upside (planning) and downside (model bias) of model-based approaches
  • Compare policy optimization and Q-learning along the dimensions of stability, sample efficiency, and directness of optimization
  • Classify VPG, TRPO, PPO, DDPG, TD3, SAC, DQN, and A3C into the correct branches of the RL algorithm taxonomy

The Big Picture

Modern RL encompasses dozens of algorithms that differ in what they learn, how they learn it, and what assumptions they make about the environment. Before diving into individual algorithms, it pays to understand the major branching points in algorithm design.

Model-Free vs. Model-Based

The first major branching point: does the agent learn (or use) a model of the environment? A model is a function that predicts state transitions and rewards: given a state-action pair $(s, a)$, predict the next state $s'$ and the reward $r$.
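To make "a model is a function" concrete, here is a minimal sketch of a count-based model fit from observed transitions. The class name `TabularModel` and its methods are hypothetical, chosen only for illustration. It assumes small, discrete, hashable states and actions; practical model-based methods typically use neural networks instead.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Count-based model (illustrative): estimates s' and r for a given (s, a)."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)  # (s, a) -> counts of observed s'
        self.reward_sum = defaultdict(float)     # (s, a) -> sum of observed rewards
        self.visits = Counter()                  # (s, a) -> number of observations

    def update(self, s, a, r, s_next):
        """Record one real transition (s, a, r, s')."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def predict(self, s, a):
        """Return (most frequent next state, mean reward), or None if (s, a) is unseen."""
        n = self.visits[(s, a)]
        if n == 0:
            return None
        s_next = self.next_counts[(s, a)].most_common(1)[0][0]
        return s_next, self.reward_sum[(s, a)] / n


# Made-up transitions (s, a, r, s'), used only to exercise the model.
model = TabularModel()
for s, a, r, s_next in [(0, 1, 0.0, 1), (0, 1, 0.0, 1), (0, 1, 1.0, 2), (1, 0, 1.0, 2)]:
    model.update(s, a, r, s_next)

prediction = model.predict(0, 1)  # (predicted s', predicted r) for s=0, a=1
```

Note that `predict` returns nothing useful for state-action pairs the agent has never tried, which is one simple way a learned model can be incomplete.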

Model-Based RL

Upside: planning. With a model, the agent can think ahead: it simulates possible futures and chooses actions that lead to better outcomes. AlphaZero does this with Monte Carlo Tree Search over a known model (the rules of the game) and achieves superhuman performance in Go, chess, and shogi.

Downside: model bias. A ground-truth model is usually unavailable, so the agent has to learn one from experience, and a learned model is never perfect. The agent may overfit to the model's errors, learning to exploit inaccuracies in ways that don't transfer to the real environment. Model learning is hard, and even a large investment of time and compute can fail to pay off: the remaining bias can cause the policy to perform well in the model and badly at deployment.
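The trade-off between planning and model bias can be sketched in a few lines. Everything here is a made-up toy: `true_env`, `biased_model`, and the reward numbers are hypothetical stand-ins, and the "planner" only looks one step ahead.

```python
def plan_one_step(model, state, actions):
    """Pick the action with the highest reward predicted by `model`."""
    return max(actions, key=lambda a: model(state, a)[1])


def true_env(state, action):
    """Hypothetical ground-truth dynamics: returns (next_state, reward)."""
    rewards = {"left": 1.0, "right": 0.5}
    return state, rewards[action]


def biased_model(state, action):
    """Hypothetical learned model that overestimates the reward of 'right'."""
    rewards = {"left": 0.9, "right": 2.0}
    return state, rewards[action]


actions = ["left", "right"]
planned = plan_one_step(biased_model, 0, actions)  # what the agent chooses
ideal = plan_one_step(true_env, 0, actions)        # what it should choose

# Reward lost in the real environment because the plan exploited a model error.
regret = true_env(0, ideal)[1] - true_env(0, planned)[1]
```

The planner is doing its job correctly; the loss comes entirely from the model's error on one action, which is exactly what the planner ends up exploiting.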

Model-Free RL

Model-free methods forgo the model entirely and learn directly from environment interactions. This sacrifices some potential for planning, but the model-free approach:

  • Is far simpler to implement and tune.
  • Avoids model bias.
  • Has been more extensively developed and validated on benchmarks than model-based methods.

All algorithms in this course (VPG, TRPO, PPO, DDPG, TD3, SAC) are model-free.

Policy Optimization vs. Q-Learning

Within model-free RL, the second major split is what to learn.

Policy Optimization

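Represent the policy explicitly as $\pi_\theta(a|s)$ and optimize its parameters $\theta$ directly, typically by gradient ascent on expected return. The agent explicitly stores and updates the policy network. Updates are almost always performed on-policy: each update uses only data collected with the most recent version of the policy.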

Examples: A2C/A3C, VPG, TRPO, PPO.

Strengths: Principled, because you optimize directly for the thing you care about (expected return). This tends to make these methods stable and reliable when implemented correctly.

Weakness: Sample inefficient. On-policy data goes stale as soon as the policy changes, so each batch is used for one update (or a few, as in PPO) and then discarded, and every further update requires fresh rollouts.
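The on-policy loop described above can be sketched on the simplest possible problem, a two-armed bandit. This is a REINFORCE-style update with a batch-mean baseline, not the course's VPG implementation. The function names, reward means, and hyperparameters are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(arm_means, iterations=200, batch_size=32, lr=0.5, seed=0):
    """Gradient ascent on expected reward for a softmax policy over bandit arms."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    for _ in range(iterations):
        probs = softmax(logits)
        # Collect a fresh batch with the *current* policy (on-policy data).
        batch = []
        for _ in range(batch_size):
            a = rng.choices(range(len(probs)), weights=probs)[0]
            r = arm_means[a] + rng.gauss(0.0, 0.1)  # noisy reward
            batch.append((a, r))
        baseline = sum(r for _, r in batch) / batch_size
        grad = [0.0] * len(logits)
        for a, r in batch:
            for k in range(len(logits)):
                # d/d logit_k of log pi(a) is 1[k == a] - probs[k] for a softmax policy.
                indicator = 1.0 if k == a else 0.0
                grad[k] += (indicator - probs[k]) * (r - baseline) / batch_size
        logits = [x + lr * g for x, g in zip(logits, grad)]
        # The batch is discarded here; the next update needs new samples.
    return softmax(logits)


final_probs = train_bandit_policy([0.2, 0.8])  # arm 1 pays more on average
```

Every iteration throws its batch away, which is where the sample inefficiency comes from: 200 updates cost 200 × 32 fresh environment samples.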

Q-Learning

Learn an approximation $Q_\theta(s,a)$ of the optimal action-value function $Q^*(s,a)$. The policy is recovered implicitly by acting greedily: $a^*(s) = \arg\max_a Q_\theta(s,a)$. Updates are almost always performed off-policy: any previously collected transition $(s,a,r,s')$ is usable, because the Bellman equation has to hold for every transition, regardless of which policy collected the data.

Examples: DQN, C51, HER.

Strengths: Substantially more sample efficient than on-policy methods, because data can be stored in a replay buffer and reused many times.

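Weakness: Less stable. Q-learning optimizes for performance only indirectly: it trains the Q-function to satisfy a self-consistency condition (the Bellman equation) rather than maximizing expected return. There are many failure modes, especially when function approximation is combined with bootstrapping and off-policy data.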
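A minimal tabular sketch shows both ideas at once: the Bellman backup as the update rule, and reuse of stored transitions regardless of who collected them. The three-state chain, the `done` flag, and all names and hyperparameters are assumptions made for illustration. DQN replaces the table with a neural network and adds a target network.

```python
import random
from collections import defaultdict


def q_learning_from_buffer(buffer, actions, gamma=0.9, lr=0.5, updates=2000, seed=0):
    """Tabular Q-learning that samples stored transitions instead of acting."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> estimated Q-value, default 0
    for _ in range(updates):
        s, a, r, s_next, done = rng.choice(buffer)  # reuse old data freely
        # Bellman backup: r + gamma * max_a' Q(s', a'), with no bootstrap at terminals.
        target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += lr * (target - q[(s, a)])
    return q


def greedy_action(q, s, actions):
    """Implicit policy: act greedily with respect to the learned Q-values."""
    return max(actions, key=lambda a: q[(s, a)])


# Transitions (s, a, r, s', done) from a chain 0 -> 1 -> 2 (state 2 is terminal),
# as they might be logged by an arbitrary exploratory behavior policy.
buffer = [
    (0, "stay", 0.0, 0, False),
    (0, "go", 0.0, 1, False),
    (1, "stay", 0.0, 1, False),
    (1, "go", 1.0, 2, True),
]
actions = ["stay", "go"]
q = q_learning_from_buffer(buffer, actions)
best_first_action = greedy_action(q, 0, actions)
```

Four stored transitions support 2000 updates here. The update never asks which policy produced a transition, which is what makes replay possible.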

Interpolating Between the Two

Policy optimization and Q-learning are not mutually exclusive; some algorithms sit between the two:

DDPG: Learns both a deterministic policy and a Q-function, each used to improve the other. Off-policy (uses a replay buffer), but also directly updates a policy network.

SAC: Like DDPG but with a stochastic policy and entropy regularization, giving better exploration and stability.

These hybrid approaches try to get the sample efficiency of Q-learning while maintaining some of the stability of policy optimization.
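One way to see how these methods combine the two families is to write out their update targets. The formulas below follow a common presentation of DDPG and SAC rather than anything defined earlier in this reading. The symbols are assumptions introduced here: $\phi$ for critic parameters (kept separate from the policy parameters $\theta$), $\gamma$ for the discount factor, $d$ for a terminal flag, $\mathcal{D}$ for the replay buffer, $\mu_\theta$ for the deterministic policy, $\alpha$ for the entropy coefficient, and the subscript "targ" for slowly updated target networks.

```latex
% DDPG: a Q-learning-style critic target, plus a policy trained to maximize the critic.
\[
  y = r + \gamma (1 - d)\, Q_{\phi_{\text{targ}}}\big(s', \mu_{\theta_{\text{targ}}}(s')\big),
  \qquad
  \max_{\theta} \; \mathbb{E}_{s \sim \mathcal{D}} \big[ Q_{\phi}\big(s, \mu_{\theta}(s)\big) \big]
\]

% SAC: the next action is sampled from a stochastic policy, and an entropy bonus is added.
\[
  y = r + \gamma (1 - d) \Big( \min_{j = 1, 2} Q_{\phi_{\text{targ}, j}}(s', \tilde{a}')
      - \alpha \log \pi_{\theta}(\tilde{a}' \mid s') \Big),
  \qquad \tilde{a}' \sim \pi_{\theta}(\cdot \mid s')
\]
```

In both cases the critic is trained by regression onto $y$ using replayed transitions (the Q-learning side), while the policy network is updated by gradient ascent (the policy optimization side). The minimum over two critics in the SAC target is the clipped double-Q trick that TD3 also uses.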

The Taxonomy at a Glance

| Family | Algorithm | On/Off-Policy | Action Space |
|---|---|---|---|
| Policy Optimization | VPG | On | Discrete or continuous |
| Policy Optimization | TRPO | On | Discrete or continuous |
| Policy Optimization | PPO | On | Discrete or continuous |
| Q-Learning | DQN | Off | Discrete |
| Actor-Critic (hybrid) | DDPG | Off | Continuous only |
| Actor-Critic (hybrid) | TD3 | Off | Continuous only |
| Actor-Critic (hybrid) | SAC | Off | Continuous (discrete variants exist) |

Model-Based Variants (Brief Overview)

Model-based methods are an active research area. Notable approaches:

  • Pure planning (MPC): Replan every step using a learned model, execute only the first action. Used in MBMF.
  • Expert iteration: Use planning (MCTS) to generate better actions, then distill into the policy. Used in ExIt and AlphaZero.
  • Data augmentation: Use a model to generate synthetic rollouts that either supplement real experience (MBVE) or replace it entirely for policy training (World Models).

These approaches can dramatically improve sample efficiency when the model is accurate, but require careful handling of model uncertainty.
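The pure-planning (MPC) loop from the list above can be sketched as follows. The one-dimensional task, `GOAL`, `ACTIONS`, and the hand-written `model` function are hypothetical stand-ins for a learned model. The planner enumerates every action sequence, which only works for tiny problems; practical MPC usually samples candidate plans instead.

```python
import itertools

GOAL = 5              # hypothetical target position on a 1-D line
ACTIONS = (-1, 0, 1)  # move left, stay, move right


def model(state, action):
    """Stand-in for a learned model: returns (predicted next state, predicted reward)."""
    next_state = state + action
    return next_state, -abs(next_state - GOAL)


def mpc_action(state, horizon=4):
    """Score every action sequence under the model; return the best plan's first action."""
    best_return, best_first = float("-inf"), None
    for plan in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in plan:
            s, r = model(s, a)  # imagined step, no real interaction
            total += r
        if total > best_return:
            best_return, best_first = total, plan[0]
    return best_first


# Control loop: replan at every step and execute only the first planned action.
state = 0
trajectory = [state]
for _ in range(10):
    action = mpc_action(state)
    state, _ = model(state, action)  # here the "real" environment equals the model
    trajectory.append(state)
```

Because the real environment equals the model in this toy, the agent walks straight to the goal and stays there. With a learned model the two would differ, and replanning at every step is what limits how far model errors can compound.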