Deep Reinforcement Learning · RL Foundations

The RL Algorithm Landscape

By the end of this reading you will be able to:

Distinguish model-free from model-based RL and explain the key upside (planning) and downside (model bias) of model-based approaches
Compare policy optimization and Q-learning along the dimensions of stability, sample efficiency, and directness of optimization
Classify VPG, TRPO, PPO, DDPG, TD3, SAC, DQN, and A3C into the correct branches of the RL algorithm taxonomy

The Big Picture

Modern RL encompasses dozens of algorithms that differ in what they learn, how they learn it, and what assumptions they make about the environment. Before diving into individual algorithms, it pays to understand the major branching points in algorithm design.

Model-Free vs. Model-Based

The first major branching point: does the agent learn (or use) a model of the environment? A model is a function that predicts state transitions and rewards: given $(s, a)$, predict $s'$ and $r$.
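As a rough illustration of what "a model" can look like in code, the sketch below fits a tabular model from logged transitions by counting outcomes. The names `TabularModel`, `update`, and `predict` are hypothetical (not from any library), and the logged transitions are made up; a learned model in deep RL would more typically be a neural network.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Hypothetical count-based model: predicts (s', r) for a given (s, a)."""

    def __init__(self):
        # For every (s, a), count how often each (s', r) outcome was observed.
        self.counts = defaultdict(Counter)

    def update(self, s, a, r, s_next):
        """Record one observed transition."""
        self.counts[(s, a)][(s_next, r)] += 1

    def predict(self, s, a):
        """Return the most frequently observed (s', r), or None if (s, a) is unseen."""
        outcomes = self.counts.get((s, a))
        if not outcomes:
            return None
        return outcomes.most_common(1)[0][0]


model = TabularModel()
# Logged transitions (s, a, r, s'), invented for illustration.
for s, a, r, s_next in [(0, 1, 0.0, 1), (0, 1, 0.0, 1), (0, 1, 0.0, 0), (1, 1, 1.0, 2)]:
    model.update(s, a, r, s_next)

print(model.predict(0, 1))  # (1, 0.0)
print(model.predict(2, 0))  # None
```

The `None` result for an unseen pair hints at why learned models are risky: the model has nothing reliable to say about parts of the state space the data never covered.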

Model-Based RL

Upside: planning. With a model, the agent can think ahead: simulate possible futures and choose actions that lead to better outcomes. AlphaZero does this with Monte Carlo Tree Search over a known model (the rules of the game) and achieves superhuman performance in Go and chess.

Downside: model bias. A ground-truth model is rarely available, so the agent usually has to learn one from experience, and a learned model is never perfect. The agent may learn to exploit the model's inaccuracies, performing well under the model but poorly in the real environment. Model learning is also hard: even a large investment of effort and compute can fail to pay off, and the remaining bias can cause severe failures at deployment.

Model-Free RL

Model-free methods forgo the model entirely and learn directly from environment interactions. This sacrifices the potential for planning, but the model-free approach:

Is far simpler to implement and tune.
Avoids model bias.
Has been more extensively validated on benchmarks.

All algorithms in this course (VPG, TRPO, PPO, DDPG, TD3, SAC) are model-free.

Policy Optimization vs. Q-Learning

Within model-free RL, the second major split is what to learn.

Policy Optimization

Represent and directly optimize a policy $\pi_\theta(a|s)$. The agent explicitly stores and updates the policy network. Updates are almost always performed on-policy: each update uses only data collected with the most recent version of the policy.
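A minimal sketch of the idea, assuming a one-state problem (a two-armed bandit) and a softmax policy with one logit per action. It uses the basic REINFORCE-style update, which is the simplest member of this family rather than any specific algorithm from the course; `train_bandit_policy` and its parameters are hypothetical names chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(rewards, steps=500, lr=0.5, seed=0):
    """REINFORCE on a one-state problem; rewards[a] is the (assumed deterministic) reward of action a."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters: one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        # On-policy: the action is sampled from the *current* policy.
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        r = rewards[a]
        # For a softmax policy, d log pi(a) / d theta_k = 1[k == a] - probs[k].
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * r * grad_log_pi
    return softmax(theta)


probs = train_bandit_policy(rewards=[0.0, 1.0])
print(probs)  # probability mass should concentrate on action 1
```

Each sample is used for exactly one update and then thrown away, which is the on-policy pattern in its purest form.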

Examples: A2C/A3C, VPG, TRPO, PPO.

Strengths: Principled — you optimize directly for what you care about. Stable and reliable when implemented correctly.

Weakness: Sample inefficient. On-policy data goes stale once the policy changes, so it is discarded after at most a few gradient steps, and every round of updates requires fresh rollouts.

Q-Learning

Learn an approximation $Q_\theta(s,a)$ of the optimal action-value function $Q^*$. The policy is recovered implicitly: $a^*(s) = \arg\max_a Q_\theta(s,a)$. Updates are almost always performed off-policy: any previously collected transition $(s,a,r,s')$ is usable, because the Bellman equation for $Q^*$ holds for every transition regardless of which policy collected it.
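The off-policy idea can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption made for illustration: the five-state chain environment, the `step` function, the discount `GAMMA`, and the step size `ALPHA`. For brevity the replay buffer is filled by enumerating each state-action pair once, standing in for data gathered by any behavior policy.

```python
N_STATES = 5          # states 0..4 on a chain; state 4 is terminal
LEFT, RIGHT = 0, 1
GAMMA, ALPHA = 0.9, 0.5


def step(s, a):
    """Toy deterministic chain: reward 1.0 only on reaching the terminal state."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    done = s_next == N_STATES - 1
    return (1.0 if done else 0.0), s_next, done


# Replay buffer of (s, a, r, s', done) tuples. The learner never asks which
# policy produced a transition; it only needs the transition itself.
buffer = []
for s in range(N_STATES - 1):
    for a in (LEFT, RIGHT):
        r, s_next, done = step(s, a)
        buffer.append((s, a, r, s_next, done))

Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(300):                      # reuse the same stored data many times
    for s, a, r, s_next, done in buffer:
        target = r if done else r + GAMMA * max(Q[s_next])
        Q[s][a] += ALPHA * (target - Q[s][a])

# The policy is recovered implicitly as the argmax over actions.
greedy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy)  # [1, 1, 1, 1]: always move right
```

The eight stored transitions are replayed 300 times each, which is where the sample efficiency of this family comes from.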

Examples: DQN, C51, HER.

Strengths: Much more sample efficient — data can be stored in a replay buffer and reused many times.

Weakness: Less stable. Q-learning optimizes performance only indirectly, by training $Q_\theta$ to satisfy a self-consistency equation (the Bellman equation) rather than maximizing return directly. There are many failure modes, especially with function approximation.
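To make "self-consistency equation" concrete, here is the Bellman optimality equation in one common form. The discount factor $\gamma$, the transition distribution $P$, and the replay buffer $\mathcal{D}$ are notation assumed for this sketch; they are not defined elsewhere in this reading.

$$Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[ r + \gamma \max_{a'} Q^*(s',a') \right]$$

Deep Q-learning methods typically fit $Q_\theta$ by reducing a squared Bellman error on stored transitions, for example:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[ \Big( Q_\theta(s,a) - \big( r + \gamma \max_{a'} Q_\theta(s',a') \big) \Big)^2 \right]$$

The regression target on the right itself depends on $\theta$, so the learner is chasing a moving target. That is one commonly cited source of the instability described above.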

Interpolating Between the Two

Policy optimization and Q-learning are not incompatible; some algorithms live on the spectrum between them:

DDPG: Learns both a deterministic policy and a Q-function, each used to improve the other. Off-policy (uses a replay buffer), but also directly updates a policy network.

SAC: Like DDPG, but learns a stochastic policy and adds entropy regularization, which encourages exploration and tends to stabilize learning.

These hybrid approaches try to get the sample efficiency of Q-learning while maintaining some of the stability of policy optimization.
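One way to see how these methods sit between the two families is to write down the policy objectives they are usually described with. The notation is assumed for this sketch: $\mu_\theta$ is a deterministic policy, $Q_\phi$ is a learned Q-function with its own parameters $\phi$, $\mathcal{D}$ is the replay buffer, and $\alpha$ is an entropy coefficient.

$$\text{DDPG:} \quad \max_\theta \; \mathbb{E}_{s \sim \mathcal{D}}\left[ Q_\phi\big(s, \mu_\theta(s)\big) \right]$$

$$\text{SAC:} \quad \max_\theta \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid s)}\left[ Q_\phi(s,a) - \alpha \log \pi_\theta(a \mid s) \right]$$

In both cases $Q_\phi$ is trained off-policy with a Bellman-style update, while the policy is improved by gradient ascent through $Q_\phi$. The $-\alpha \log \pi_\theta$ term is the entropy bonus that rewards the SAC policy for staying stochastic.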

The Taxonomy at a Glance

Family	Algorithm	On/Off-Policy	Action Space
Policy Optimization	VPG	On	Discrete or Continuous
Policy Optimization	TRPO	On	Discrete or Continuous
Policy Optimization	PPO	On	Discrete or Continuous
Q-Learning	DQN	Off	Discrete
Actor-Critic (hybrid)	DDPG	Off	Continuous only
Actor-Critic (hybrid)	TD3	Off	Continuous only
Actor-Critic (hybrid)	SAC	Off	Continuous only
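For quick self-testing, the table can be encoded as a small lookup structure. This is only a convenience sketch; `AlgoInfo`, `TAXONOMY`, `classify`, and `off_policy_algorithms` are hypothetical names, and the entries are copied from the rows above.

```python
from typing import NamedTuple


class AlgoInfo(NamedTuple):
    family: str
    on_policy: bool
    action_space: str


# One entry per row of the taxonomy table.
TAXONOMY = {
    "VPG": AlgoInfo("Policy Optimization", True, "Discrete or Continuous"),
    "TRPO": AlgoInfo("Policy Optimization", True, "Discrete or Continuous"),
    "PPO": AlgoInfo("Policy Optimization", True, "Discrete or Continuous"),
    "DQN": AlgoInfo("Q-Learning", False, "Discrete"),
    "DDPG": AlgoInfo("Actor-Critic (hybrid)", False, "Continuous only"),
    "TD3": AlgoInfo("Actor-Critic (hybrid)", False, "Continuous only"),
    "SAC": AlgoInfo("Actor-Critic (hybrid)", False, "Continuous only"),
}


def classify(name: str) -> AlgoInfo:
    """Look up an algorithm by name (case-insensitive)."""
    return TAXONOMY[name.upper()]


def off_policy_algorithms() -> list:
    """Names of all algorithms in the table that can reuse replay data."""
    return sorted(n for n, info in TAXONOMY.items() if not info.on_policy)


print(classify("ppo").family)      # Policy Optimization
print(off_policy_algorithms())     # ['DDPG', 'DQN', 'SAC', 'TD3']
```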

Model-Based Variants (Brief Overview)

Model-based methods are an active research area. Notable approaches:

Pure planning (MPC): Replan every step using a learned model, execute only the first action. Used in MBMF.
Expert iteration: Use planning (MCTS) to generate better actions, then distill into the policy. Used in ExIt and AlphaZero.
Data augmentation: Use a model to generate synthetic rollouts that supplement real experience. Used in MBVE and World Models.

These approaches can dramatically improve sample efficiency when the model is accurate, but require careful handling of model uncertainty.
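The pure-planning (MPC) loop from the list above can be sketched as follows. All names and the environment are assumptions for illustration: `model` stands in for a learned model of a 1-D line world with a goal at position 3, and `plan` scores every action sequence exhaustively, which is only feasible because the problem is tiny. Practical MPC methods sample or optimize over candidate sequences instead.

```python
import itertools


def model(s, a):
    """Stand-in for a learned model: predicts (s', r) on a 1-D line with a goal at 3."""
    s_next = s + a                      # actions are -1, 0, or +1
    return s_next, -abs(s_next - 3)     # reward: negative distance to the goal


def plan(s, horizon=3, actions=(-1, 0, 1)):
    """Score every action sequence under the model; return the best sequence's first action."""
    best_return, best_first = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        sim_s, total = s, 0.0
        for a in seq:                   # imagined rollout, no real interaction
            sim_s, r = model(sim_s, a)
            total += r
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first


def real_step(s, a):
    """The real environment; in this toy example it matches the model exactly."""
    return s + a


s, trajectory = 0, [0]
for _ in range(5):
    a = plan(s)          # replan from the current real state
    s = real_step(s, a)  # execute only the first planned action
    trajectory.append(s)
print(trajectory)  # [0, 1, 2, 3, 3, 3]
```

Because the toy model is exact, planning works perfectly here. If `model` and `real_step` disagreed, the planner would optimize for the model's version of the world, which is the model-bias problem in miniature.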

References

OpenAI Spinning Up — Part 2: Kinds of RL Algorithms

Mnih et al. 2013 — DQN: Playing Atari with Deep Reinforcement Learning

Silver et al. 2017 — AlphaZero
