This reading is a curated reference list, adapted from the Spinning Up documentation. It covers the papers most worth reading if you want to understand where deep RL comes from and where it is going. Far from comprehensive, it is a useful starting map for anyone looking to do research or applied work in the field.
1. Model-Free RL
a. Deep Q-Learning
| # | Paper |
| --- | --- |
| 1 | Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN. |
| 2 | Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015. Algorithm: Deep Recurrent Q-Learning. |
| 3 | Dueling Network Architectures for Deep Reinforcement Learning, Wang et al, 2015. Algorithm: Dueling DQN. |
| 4 | Deep Reinforcement Learning with Double Q-learning, van Hasselt et al, 2015. Algorithm: Double DQN. |
| 5 | Prioritized Experience Replay, Schaul et al, 2015. Algorithm: PER. |
| 6 | Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al, 2017. Algorithm: Rainbow DQN. |
b. Policy Gradients
| # | Paper |
| --- | --- |
| 7 | Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016. Algorithm: A3C. |
| 8 | Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO. |
| 9 | High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE. |
| 10 | Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty. |
| 11 | Emergence of Locomotion Behaviours in Rich Environments, Heess et al, 2017. Algorithm: PPO-Penalty. |
| 12 | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, Wu et al, 2017. Algorithm: ACKTR. |
| 13 | Sample Efficient Actor-Critic with Experience Replay, Wang et al, 2016. Algorithm: ACER. |
| 14 | Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC. |
c. Deterministic Policy Gradients
d. Distributional RL
Distributional RL learns the full distribution of returns rather than just the mean. The papers below report that this improves on expected-value baselines such as DQN on Atari benchmarks.
| # | Paper |
| --- | --- |
| 18 | A Distributional Perspective on Reinforcement Learning, Bellemare et al, 2017. Algorithm: C51. |
| 19 | Distributional Reinforcement Learning with Quantile Regression, Dabney et al, 2017. Algorithm: QR-DQN. |
| 20 | Implicit Quantile Networks for Distributional Reinforcement Learning, Dabney et al, 2018. Algorithm: IQN. |
| 21 | Dopamine: A Research Framework for Deep Reinforcement Learning, Castro et al, 2018. Contains implementations of DQN, C51, IQN, and Rainbow. |
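
To make the "full distribution rather than just the mean" idea concrete, here is a minimal sketch using a fixed set of return values ("atoms") with a probability on each, loosely in the spirit of a categorical representation. The atoms, the two actions, and their probabilities are made up for illustration, and the helper names are hypothetical; this is not the implementation from any paper above.

```python
import math

# Hypothetical support: five evenly spaced return values ("atoms").
ATOMS = [-10.0, -5.0, 0.0, 5.0, 10.0]

# Made-up probabilities over the atoms for two actions.
RETURN_DISTS = {
    "safe": [0.0, 0.0, 1.0, 0.0, 0.0],
    "risky": [0.5, 0.0, 0.0, 0.0, 0.5],
}


def expected_return(probs):
    """Mean of the categorical return distribution (what a mean-only Q-value tracks)."""
    return sum(p * z for p, z in zip(probs, ATOMS))


def return_std(probs):
    """Standard deviation of the return distribution, which a mean-only estimate discards."""
    mean = expected_return(probs)
    return math.sqrt(sum(p * (z - mean) ** 2 for p, z in zip(probs, ATOMS)))


for action, probs in RETURN_DISTS.items():
    print(action, expected_return(probs), return_std(probs))
```

Both actions have the same expected return, so a mean-only value estimate cannot tell them apart, while the distribution shows that one of them is far riskier.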
e. Policy Gradients with Action-Dependent Baselines
f. Path-Consistency Learning
g. Combining Policy Learning and Q-Learning
h. Evolutionary Algorithms
2. Exploration
a. Intrinsic Motivation
When extrinsic rewards are sparse, intrinsic motivation — reward signals derived from the agent's own curiosity or surprise — can drive exploration.
| # | Paper |
| --- | --- |
| 32 | VIME: Variational Information Maximizing Exploration, Houthooft et al, 2016. |
| 33 | Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. |
| 34 | Count-Based Exploration with Neural Density Models, Ostrovski et al, 2017. |
| 35 | #Exploration: A Study of Count-Based Exploration for Deep RL, Tang et al, 2016. |
| 36 | EX2: Exploration with Exemplar Models, Fu et al, 2017. |
| 37 | Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: ICM. |
| 38 | Large-Scale Study of Curiosity-Driven Learning, Burda et al, 2018. |
| 39 | Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND. |
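
One simple way to picture an intrinsic reward is a tabular count-based bonus that shrinks as a state is revisited. The sketch below assumes hashable states and a bonus of `beta / sqrt(N(s))`; the class name `CountBonus` and the default `beta` are hypothetical, and the deep RL papers above replace the exact count table with density models, hashing, or prediction error.

```python
import math
from collections import Counter


class CountBonus:
    """Adds an exploration bonus that decays with the visit count of each state."""

    def __init__(self, beta=0.1):
        self.beta = beta          # bonus scale (illustrative value, not tuned)
        self.counts = Counter()   # N(s): number of visits to each state so far

    def reward(self, state, extrinsic):
        """Record a visit to `state` and return extrinsic reward plus the bonus."""
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])
        return extrinsic + bonus


bonus = CountBonus(beta=1.0)
for _ in range(4):
    print(bonus.reward("start", extrinsic=0.0))
```

With zero extrinsic reward, the agent still receives a signal that is largest for novel states and fades as a state becomes familiar.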
b. Unsupervised RL
3. Transfer and Multitask RL
| # | Paper |
| --- | --- |
| 43 | Progressive Neural Networks, Rusu et al, 2016. |
| 44 | Universal Value Function Approximators, Schaul et al, 2015. Algorithm: UVFA. |
| 45 | Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, 2016. Algorithm: UNREAL. |
| 46 | The Intentional Unintentional Agent, Cabi et al, 2017. Algorithm: IU Agent. |
| 47 | PathNet: Evolution Channels Gradient Descent in Super Neural Networks, Fernando et al, 2017. |
| 48 | Mutual Alignment Transfer Learning, Wulfmeier et al, 2017. |
| 49 | Learning an Embedding Space for Transferable Robot Skills, Hausman et al, 2018. |
| 50 | Hindsight Experience Replay, Andrychowicz et al, 2017. Algorithm: HER. |
4. Hierarchy
5. Memory
| # | Paper |
| --- | --- |
| 54 | Model-Free Episodic Control, Blundell et al, 2016. Algorithm: MFEC. |
| 55 | Neural Episodic Control, Pritzel et al, 2017. Algorithm: NEC. |
| 56 | Neural Map: Structured Memory for Deep RL, Parisotto and Salakhutdinov, 2017. |
| 57 | Unsupervised Predictive Memory in a Goal-Directed Agent, Wayne et al, 2018. Algorithm: MERLIN. |
| 58 | Relational Recurrent Neural Networks, Santoro et al, 2018. Algorithm: RMC. |
6. Model-Based RL
a. Model is Learned
These methods learn a dynamics model from experience and use it to plan or augment training.
| # | Paper |
| --- | --- |
| 59 | Imagination-Augmented Agents for Deep RL, Weber et al, 2017. Algorithm: I2A. |
| 60 | Neural Network Dynamics for Model-Based Deep RL with Model-Free Fine-Tuning, Nagabandi et al, 2017. Algorithm: MBMF. |
| 61 | Model-Based Value Expansion for Efficient Model-Free RL, Feinberg et al, 2018. Algorithm: MVE. |
| 62 | Sample-Efficient RL with Stochastic Ensemble Value Expansion, Buckman et al, 2018. Algorithm: STEVE. |
| 63 | Model-Ensemble Trust-Region Policy Optimization, Kurutach et al, 2018. Algorithm: ME-TRPO. |
| 64 | Model-Based RL via Meta-Policy Optimization, Clavera et al, 2018. Algorithm: MB-MPO. |
| 65 | Recurrent World Models Facilitate Policy Evolution, Ha and Schmidhuber, 2018. |
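
The "learn a model, then plan with it" loop can be sketched on a toy problem. Everything here is an assumption made for illustration: a hypothetical five-state chain environment (`true_step`), a lookup table standing in for a learned neural dynamics model, and exhaustive search over short action sequences standing in for a real planner. None of it reproduces a specific algorithm from the table above.

```python
import itertools

ACTIONS = (-1, 1)


def true_step(state, action):
    """Hypothetical 1-D chain: states 0..4, reward 1.0 for landing on state 4."""
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward


# Step 1: gather experience and fit the model. Here "fitting" is just storing
# each observed (state, action) -> (next_state, reward) transition in a table.
model = {}
for s in range(5):
    for a in ACTIONS:
        model[(s, a)] = true_step(s, a)


def plan(state, horizon=3):
    """Return the first action of the best action sequence under the learned model."""
    best_return, best_first = float("-inf"), None
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s, r = model[(s, a)]  # imagined step: queries the model, not the environment
            total += r
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first


print(plan(2))  # first action chosen from state 2
```

The planner never calls `true_step`; all of its rollouts are imagined inside the model, which is the property that lets model-based methods trade real environment interaction for computation.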
b. Model is Given
7. Meta-RL
Meta-RL systems learn to learn: they develop policies that can quickly adapt to new tasks from a small number of interactions.
8. Scaling RL
| # | Paper |
| --- | --- |
| 72 | Accelerated Methods for Deep Reinforcement Learning, Stooke and Abbeel, 2018. Systematic analysis of parallelization in deep RL. |
| 73 | IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, Espeholt et al, 2018. Algorithm: IMPALA. |
| 74 | Distributed Prioritized Experience Replay, Horgan et al, 2018. Algorithm: Ape-X. |
| 75 | Recurrent Experience Replay in Distributed Reinforcement Learning, Kapturowski et al, 2018. Algorithm: R2D2. |
| 76 | RLlib: Abstractions for Distributed Reinforcement Learning, Liang et al, 2017. A scalable library of RL algorithm implementations. |
9. RL in the Real World
10. Safety
| # | Paper |
| --- | --- |
| 81 | Concrete Problems in AI Safety, Amodei et al, 2016. Establishes a taxonomy of safety problems for AI systems. |
| 82 | Deep Reinforcement Learning From Human Preferences, Christiano et al, 2017. Algorithm: LFP. |
| 83 | Constrained Policy Optimization, Achiam et al, 2017. Algorithm: CPO. |
| 84 | Safe Exploration in Continuous Action Spaces, Dalal et al, 2018. Algorithm: DDPG+Safety Layer. |
| 85 | Trial without Error: Towards Safe RL via Human Intervention, Saunders et al, 2017. Algorithm: HIRL. |
| 86 | Leave No Trace: Learning to Reset for Safe and Autonomous RL, Eysenbach et al, 2017. |
11. Imitation Learning and Inverse Reinforcement Learning
IL and IRL address the case where you have demonstrations of good behavior but no explicit reward signal.
| # | Paper |
| --- | --- |
| 87 | Modeling Purposeful Adaptive Behavior with Maximum Causal Entropy, Ziebart, 2010. Crisp formulation of maximum entropy IRL. |
| 88 | Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Finn et al, 2016. Algorithm: GCL. |
| 89 | Generative Adversarial Imitation Learning, Ho and Ermon, 2016. Algorithm: GAIL. |
| 90 | DeepMimic: Example-Guided Deep RL of Physics-Based Character Skills, Peng et al, 2018. Algorithm: DeepMimic. |
| 91 | Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs, Peng et al, 2018. Algorithm: VAIL. |
| 92 | One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL, Le Paine et al, 2018. Algorithm: MetaMimic. |
12. Reproducibility, Analysis, and Critique
This category is essential reading for anyone who wants to trust empirical RL results. The reproducibility literature reveals how fragile benchmark claims can be.
| # | Paper |
| --- | --- |
| 93 | Benchmarking Deep RL for Continuous Control, Duan et al, 2016. Introduced rllab. |
| 94 | Reproducibility of Benchmarked Deep RL Tasks for Continuous Control, Islam et al, 2017. |
| 95 | Deep Reinforcement Learning that Matters, Henderson et al, 2017. Shows how hyperparameters, random seeds, and implementation details can dominate performance differences. |
| 96 | Where Did My Optimum Go?: Gradient Descent Optimization in Policy Gradient Methods, Henderson et al, 2018. |
| 97 | Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?, Ilyas et al, 2018. |
| 98 | Simple Random Search Provides a Competitive Approach to RL, Mania et al, 2018. |
| 99 | Benchmarking Model-Based Reinforcement Learning, Wang et al, 2019. |
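
Because random seeds alone can move results substantially, one practical habit is to report a summary over several seeds rather than a single run. The sketch below is only an illustration of that habit: the algorithm names and return values are invented, and `summarize` is a hypothetical helper, not part of any library cited here.

```python
import statistics

# Invented final returns for two hypothetical algorithms, five seeds each.
FINAL_RETURNS = {
    "algo_a": [310.0, 295.0, 120.0, 305.0, 300.0],
    "algo_b": [250.0, 260.0, 255.0, 245.0, 265.0],
}


def summarize(returns):
    """Summarize per-seed results: mean, sample standard deviation, and range."""
    return {
        "mean": statistics.mean(returns),
        "std": statistics.stdev(returns),
        "min": min(returns),
        "max": max(returns),
        "n_seeds": len(returns),
    }


for name, returns in FINAL_RETURNS.items():
    print(name, summarize(returns))
```

Reporting the spread and the number of seeds alongside the mean makes it visible when an apparent improvement rests on one lucky or unlucky run.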
13. Classic Papers in RL Theory or Review
These pre-deep-RL papers established the mathematical foundations that modern methods rely on.
| # | Paper |
| --- | --- |
| 100 | Policy Gradient Methods for RL with Function Approximation, Sutton et al, 2000. Established the policy gradient theorem. |
| 101 | An Analysis of Temporal-Difference Learning with Function Approximation, Tsitsiklis and Van Roy, 1997. Convergence results and counter-examples for value learning. |
| 102 | Reinforcement Learning of Motor Skills with Policy Gradients, Peters and Schaal, 2008. Thorough review of policy gradient methods. |
| 103 | Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002. Early monotonic improvement theory, roots of TRPO. |
| 104 | A Natural Policy Gradient, Kakade, 2002. Brought natural gradients into RL, precursor to TRPO, ACKTR. |
| 105 | Algorithms for Reinforcement Learning, Szepesvari, 2009. Foundational reference on RL theory. |
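
For orientation, the policy gradient theorem mentioned in entry 100 can be written in one common form as follows. The notation is an assumption of this note rather than the paper's exact symbols: $J(\theta)$ is the performance of policy $\pi_\theta$, $d^{\pi_\theta}$ is the state weighting induced by the policy (stationary or discounted, depending on the formulation), and $Q^{\pi_\theta}$ is its action-value function.

```latex
\nabla_\theta J(\theta)
  = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)
```

The key point is that the gradient contains no derivative of the state distribution itself, which is what makes it possible to estimate from sampled trajectories.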