
Deep Deterministic Policy Gradient (DDPG)


By the end of this reading you will be able to:

Explain how DDPG extends DQN to continuous action spaces using a deterministic policy network that differentiably maximizes Q
Describe the roles of replay buffer, target networks, and polyak averaging in DDPG and explain why each is necessary for stability
Configure DDPG's key hyperparameters: polyak, act_noise, start_steps, replay_size, batch_size, update_after, and update_every

The Core Idea

DDPG is essentially deep Q-learning for continuous action spaces. In a discrete space, computing $\arg\max_a Q(s,a)$ is cheap: evaluate Q for every action and take the largest. In a continuous space there are infinitely many actions, so the maximization becomes an optimization problem that would have to be solved at every environment step, which is prohibitively slow.

DDPG's insight: if $Q^*(s,a)$ is differentiable with respect to $a$, we can maintain a separate policy network $\mu_\theta(s)$ trained to maximize Q. Instead of solving an optimization problem at every step, we run a single forward pass of the policy: $\max_a Q(s,a) \approx Q(s, \mu_\theta(s))$

Quick Facts

Off-policy algorithm (uses a replay buffer)
Continuous action spaces only
Deterministic policy $\mu_\theta(s)$
No MPI parallelization support in the Spinning Up implementation

The Two Learning Problems

1. Learning the Q-Function (Bellman Minimization)

DDPG minimizes the mean-squared Bellman error (MSBE): $L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_\phi(s,a) - y(r,s',d)\right)^2\right]$

where $\mathcal{D}$ is the replay buffer, $d$ is 1 if $s'$ is a terminal state and 0 otherwise, and the target $y$ uses a target policy network $\mu_{\theta_{\text{targ}}}$ and a target Q-network $Q_{\phi_{\text{targ}}}$: $y(r,s',d) = r + \gamma(1-d)\, Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s'))$
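As a small worked example of the target and the loss, here is a NumPy sketch on a made-up batch of three transitions. The arrays `q_targ_next` and `q_pred` are hypothetical stand-ins for the outputs of the target and main Q-networks; no real networks are involved.

```python
import numpy as np

def bellman_targets(r, d, q_targ_next, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s')) for a batch."""
    return r + gamma * (1.0 - d) * q_targ_next

def msbe_loss(q_pred, y):
    """Mean-squared Bellman error over the batch."""
    return np.mean((q_pred - y) ** 2)

# Hypothetical batch of three transitions; the last one is terminal (d = 1).
r = np.array([1.0, 0.5, -1.0])
d = np.array([0.0, 0.0, 1.0])
q_targ_next = np.array([10.0, 8.0, 5.0])  # stand-in for Q_targ(s', mu_targ(s'))
q_pred = np.array([10.0, 8.0, 0.0])       # stand-in for Q_phi(s, a)

y = bellman_targets(r, d, q_targ_next)
loss = msbe_loss(q_pred, y)
print(y, loss)
```

For the terminal transition the factor $(1-d)$ removes the bootstrap term, so the target is just the reward.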

2. Learning the Policy (Gradient Ascent on Q)

$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}}\left[Q_\phi(s, \mu_\theta(s))\right]$

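Because $Q_\phi(s, \mu_\theta(s))$ is differentiable with respect to $\theta$ through the action, this is plain gradient ascent via the chain rule, with the Q-function parameters $\phi$ treated as constants. No sampling over actions is needed.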
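To make the chain rule concrete, the sketch below runs gradient ascent on a toy problem with scalar states and actions. The quadratic critic and the linear policy are assumptions chosen so the gradient can be written by hand; in a real PyTorch implementation automatic differentiation would do this step.

```python
import numpy as np

def dq_da(s, a):
    # Gradient w.r.t. the action of a toy critic Q(s, a) = -(a - 2s)^2
    # (an assumption for illustration); its best action is a = 2 * s.
    return -2.0 * (a - 2.0 * s)

def policy(theta, s):
    # Toy deterministic linear policy mu_theta(s) = theta * s.
    return theta * s

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=256)  # stand-in for states sampled from D

theta, lr = 0.0, 0.5
for _ in range(200):
    a = policy(theta, states)
    # Chain rule: dQ/dtheta = dQ/da * dmu/dtheta, and dmu/dtheta = s here.
    grad = np.mean(dq_da(states, a) * states)
    theta += lr * grad  # gradient ascent on E[Q(s, mu_theta(s))]

print(theta)
```

The parameter converges to the value at which the policy outputs the critic's preferred action for every state, which is the behaviour the objective asks for.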

Three Stability Tricks

Trick 1: Replay Buffer

DDPG stores transitions $(s,a,r,s',d)$ in a fixed-size buffer, overwriting the oldest entries once it is full, and samples random minibatches from it for training. Random sampling breaks the temporal correlation between consecutive samples, which is crucial for stable neural network training, and lets each transition be reused many times. This data reuse is the key advantage over on-policy methods.

The Bellman equation for $Q^*$ must hold for every transition, regardless of which policy collected it, which is exactly why off-policy replay works.
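A minimal replay buffer along these lines might look like the sketch below. The class and method names are illustrative rather than taken from any particular library, and sampling is uniform with replacement for simplicity.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size FIFO buffer; the oldest transitions are overwritten first."""

    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.act = np.zeros((size, act_dim), dtype=np.float32)
        self.rew = np.zeros(size, dtype=np.float32)
        self.obs2 = np.zeros((size, obs_dim), dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, s, a, r, s2, d):
        i = self.ptr
        self.obs[i], self.act[i], self.rew[i] = s, a, r
        self.obs2[i], self.done[i] = s2, d
        self.ptr = (self.ptr + 1) % self.max_size  # wrap around when full
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, batch_size, rng):
        # Uniform sampling (with replacement) over the filled part of the buffer.
        idx = rng.integers(0, self.size, size=batch_size)
        return dict(obs=self.obs[idx], act=self.act[idx], rew=self.rew[idx],
                    obs2=self.obs2[idx], done=self.done[idx])

rng = np.random.default_rng(0)
buf = ReplayBuffer(obs_dim=3, act_dim=1, size=5)
for t in range(8):  # store more transitions than the buffer can hold
    buf.store(np.full(3, t), [0.0], float(t), np.full(3, t + 1), 0.0)
batch = buf.sample_batch(4, rng)
```

After eight stores into a buffer of size five, only the five most recent transitions remain, which is the "fixed-size" behaviour described above.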

Trick 2: Target Networks

If we compute the Bellman target using the same parameters we're updating, the target is a moving goalpost — the network can enter feedback loops and diverge. The solution: maintain target networks $Q_{\phi_{\text{targ}}}$ and $\mu_{\theta_{\text{targ}}}$ that are updated slowly.

Trick 3: Polyak Averaging

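Unlike DQN, which copies the main network into the target network periodically, DDPG uses polyak averaging: a soft update applied after every gradient update, $\phi_{\text{targ}} \leftarrow \rho \phi_{\text{targ}} + (1-\rho) \phi$, with the same rule applied to the policy parameters $\theta_{\text{targ}}$.

With $\rho = 0.995$ (the default), each update moves the target parameters only 0.5% of the way toward the main parameters, so the target network tracks the main network slowly and smoothly.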
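The update rule can be sketched in a few lines of NumPy. Parameters are represented here as plain lists of arrays, which is an assumption for illustration; a framework implementation would loop over the network's parameter tensors instead.

```python
import numpy as np

def polyak_update(target_params, main_params, rho=0.995):
    """In place: targ <- rho * targ + (1 - rho) * main, for each parameter array."""
    for p_targ, p in zip(target_params, main_params):
        p_targ *= rho
        p_targ += (1.0 - rho) * p

main = [np.ones(3)]     # stand-in for the main network's parameters
target = [np.zeros(3)]  # stand-in for the target network's parameters
for _ in range(200):
    polyak_update(target, main)

gap = 1.0 - target[0][0]  # remaining distance between target and main
print(gap)
```

With the main parameters held fixed, the gap shrinks by a factor of $\rho$ per update, so after 200 updates roughly 37% of the initial gap remains ($0.995^{200} \approx 0.37$).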

Exploration

The policy is deterministic, so the agent won't explore on its own. At training time, noise is added to actions: $a = \text{clip}(\mu_\theta(s) + \epsilon, a_{Low}, a_{High}), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$

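The original paper used Ornstein-Uhlenbeck (time-correlated) noise, but uncorrelated, zero-mean Gaussian noise is simpler and generally works well in practice.

Start steps trick: For the first start_steps environment interactions, actions are sampled uniformly at random, ignoring the policy entirely. This fills the replay buffer with diverse experience before the agent starts acting with its own (noisy) policy. Note that with the defaults below, gradient updates begin after update_after = 1000 steps, so learning is already under way while random actions are still being taken.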
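The action-selection logic described above can be sketched as follows. The function name and the constant `mu` are hypothetical; `mu` stands in for a trained policy network.

```python
import numpy as np

def select_action(mu, obs, t, start_steps, act_noise, act_low, act_high, rng):
    """Uniform random action for the first start_steps steps, noisy policy action after."""
    if t < start_steps:
        return rng.uniform(act_low, act_high)
    # Gaussian exploration noise with standard deviation act_noise, then clip to bounds.
    a = mu(obs) + act_noise * rng.standard_normal(act_low.shape)
    return np.clip(a, act_low, act_high)

rng = np.random.default_rng(0)
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
mu = lambda obs: np.array([0.95, -0.2])  # hypothetical stand-in for the policy network

a_random = select_action(mu, None, t=0, start_steps=10000, act_noise=0.1,
                         act_low=low, act_high=high, rng=rng)
a_policy = select_action(mu, None, t=10000, start_steps=10000, act_noise=0.1,
                         act_low=low, act_high=high, rng=rng)
```

Clipping matters when the policy output sits near a bound, as with the first action component here: without it, the added noise could push the action outside the valid range.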

PyTorch Implementation

from spinup import ddpg_pytorch as ddpg
import gym

ddpg(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    steps_per_epoch=4000,
    epochs=100,
    replay_size=int(1e6),   # replay buffer size
    gamma=0.99,
    polyak=0.995,           # ρ for target network updates
    pi_lr=1e-3,
    q_lr=1e-3,
    batch_size=100,
    start_steps=10000,      # random exploration before policy kicks in
    update_after=1000,      # wait until buffer has enough data
    update_every=50,        # env steps between update rounds (each round runs update_every gradient steps)
    act_noise=0.1,          # Gaussian noise std for exploration
    num_test_episodes=10,
    max_ep_len=1000,
    logger_kwargs=dict(output_dir='/tmp/ddpg', exp_name='ddpg')
)

Key Hyperparameters

Hyperparameter	Default	Role
`polyak`	0.995	Target network smoothing. Close to 1 = very slow update.
`act_noise`	0.1	Gaussian noise std during training. Tune for exploration.
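`start_steps`	10000	Number of initial environment steps taken with uniformly random actions before the policy is used.
`update_after`	1000	Number of environment steps to collect before gradient updates begin, so the buffer holds enough data.
`update_every`	50	Environment steps between rounds of updates. Each round runs `update_every` gradient steps, so the ratio of env steps to gradient steps stays at 1.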
`batch_size`	100	Minibatch size. 100–256 typical.
`replay_size`	1e6	Max replay buffer size. Usually sufficient.
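To see how `update_after` and `update_every` interact, the sketch below counts gradient steps over one epoch of 4000 environment steps. It assumes the schedule described in the table (a round of `update_every` gradient steps every `update_every` environment steps, once `update_after` steps have passed); the function itself is illustrative.

```python
def count_gradient_steps(total_env_steps, update_after=1000, update_every=50):
    """Count gradient steps taken under the update schedule described in the table."""
    grad_steps = 0
    for t in range(total_env_steps):
        # (the environment step and the replay-buffer store would happen here)
        if t >= update_after and t % update_every == 0:
            grad_steps += update_every  # one round = update_every gradient steps
    return grad_steps

n = count_gradient_steps(4000)
print(n)
```

Under these assumptions the first 1000 steps contribute no updates, and the remaining 3000 environment steps are matched by 3000 gradient steps.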

DDPG's Weakness

DDPG can be brittle. The most common failure mode is Q-value overestimation: the Q-network develops spurious peaks in action space, and the policy learns to exploit these errors, choosing actions that score well under the learned Q-function but perform poorly in the environment. This feedback loop can degrade performance rapidly.

TD3 (next reading) was designed specifically to address this failure mode.

References

Lillicrap et al. 2016 — Continuous Control With Deep Reinforcement Learning

Silver et al. 2014 — Deterministic Policy Gradient Algorithms

OpenAI Spinning Up — DDPG Documentation
