Deep Reinforcement Learning · Off-Policy Methods & Tooling

Deep Deterministic Policy Gradient (DDPG)

By the end of this reading you will be able to:
  • Explain how DDPG extends DQN to continuous action spaces using a deterministic policy network that differentiably maximizes Q
  • Describe the roles of replay buffer, target networks, and polyak averaging in DDPG and explain why each is necessary for stability
  • Configure DDPG's key hyperparameters: polyak, act_noise, start_steps, replay_size, batch_size, update_after, and update_every

The Core Idea

DDPG is essentially deep Q-learning for continuous action spaces. In a discrete space with a small number of actions, computing $\arg\max_a Q(s,a)$ is cheap: evaluate Q for every action and pick the largest. In a continuous space, the maximization becomes an optimization problem that would have to be solved at every environment step, which is prohibitively slow.

DDPG's insight: if $Q^*(s,a)$ is differentiable with respect to $a$, we can maintain a separate policy network $\mu_\theta(s)$ trained to maximize Q. Instead of running an optimizer at every step, we take a single forward pass through the policy:

$$\max_a Q(s,a) \approx Q(s, \mu_\theta(s))$$

Quick Facts

  • Off-policy algorithm (uses a replay buffer)
  • Continuous action spaces only
  • Deterministic policy $\mu_\theta(s)$
  • The Spinning Up implementation does not support MPI parallelization

The Two Learning Problems

1. Learning the Q-Function (Bellman Minimization)

DDPG minimizes the mean-squared Bellman error (MSBE) over transitions sampled from the replay buffer $\mathcal{D}$:

$$L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_\phi(s,a) - y(r,s',d)\right)^2\right]$$

where the target $y$ uses a target policy network $\mu_{\theta_{\text{targ}}}$ and a target Q-network $Q_{\phi_{\text{targ}}}$:

$$y(r,s',d) = r + \gamma(1-d)\, Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s'))$$
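As a rough illustration of the target and loss above, the following NumPy sketch computes $y$ and the MSBE for a hand-made batch. The functions `mu_targ` and `q_targ` are hypothetical closed-form stand-ins for the target networks, and all batch values are made up for the example.

```python
import numpy as np

gamma = 0.99

def mu_targ(s):
    # Hypothetical stand-in for the target policy network.
    return 0.5 * s

def q_targ(s, a):
    # Hypothetical stand-in for the target Q-network.
    return s + a

def bellman_target(r, s_next, d):
    # y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))
    return r + gamma * (1.0 - d) * q_targ(s_next, mu_targ(s_next))

def msbe(q_pred, y):
    # Mean-squared Bellman error over the minibatch.
    return np.mean((q_pred - y) ** 2)

# Made-up batch of two transitions; the second one is terminal (d = 1).
r = np.array([1.0, 2.0])
s_next = np.array([2.0, 4.0])
d = np.array([0.0, 1.0])
q_pred = np.array([3.0, 2.5])   # pretend output of Q_phi(s, a)

y = bellman_target(r, s_next, d)
loss = msbe(q_pred, y)
```

For the terminal transition the factor $(1-d)$ zeroes the bootstrap term, so the target reduces to the reward alone.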

2. Learning the Policy (Gradient Ascent on Q)

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}}\left[Q_\phi(s, \mu_\theta(s))\right]$$

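Since $Q_\phi$ is differentiable with respect to the action, the gradient flows through $\mu_\theta(s)$ back to $\theta$ by the chain rule. The update is plain gradient ascent on $\theta$ with the Q-function parameters treated as constants, and no sampling over actions is needed.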
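To make the chain rule concrete, here is a minimal sketch under strong simplifying assumptions: the critic is a fixed, known toy function $Q(s,a) = -(a - 2s)^2$ rather than a learned network, and the policy is a one-parameter linear map $\mu_\theta(s) = \theta s$. Both are invented for illustration. The best action is $a = 2s$, so gradient ascent should drive $\theta$ toward 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def dq_da(s, a):
    # Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to a.
    return -2.0 * (a - 2.0 * s)

theta = 0.0   # parameter of the toy policy mu_theta(s) = theta * s
lr = 0.5

for _ in range(100):
    s = rng.uniform(-1.0, 1.0, size=64)   # stand-in for states sampled from the buffer
    a = theta * s                          # forward pass of the policy
    # Chain rule: dQ/dtheta = dQ/da * da/dtheta, with da/dtheta = s.
    grad_theta = np.mean(dq_da(s, a) * s)
    theta += lr * grad_theta               # ascent step; critic held fixed
```

In a real implementation an autodiff framework computes the same product of gradients automatically; the manual version here only exposes the two factors.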

Three Stability Tricks

Trick 1: Replay Buffer

DDPG stores past transitions $(s,a,r,s',d)$ in a fixed-size buffer, overwriting the oldest entries once it is full, and samples random minibatches from it for training. Random sampling breaks the temporal correlation between consecutive samples, which is crucial for stable neural network training, and lets each transition be reused many times. This data reuse is the key advantage over on-policy methods.

The Bellman equation holds for any transition regardless of how it was collected, which is exactly why off-policy replay works.
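A minimal sketch of such a buffer, assuming flat NumPy arrays and first-in-first-out overwriting. The class name, method names, and field names are illustrative choices, not taken from any particular library.

```python
import numpy as np

class ReplayBuffer:
    """Minimal FIFO replay buffer (illustrative sketch)."""

    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.act = np.zeros((size, act_dim), dtype=np.float32)
        self.rew = np.zeros(size, dtype=np.float32)
        self.obs2 = np.zeros((size, obs_dim), dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, s, a, r, s2, d):
        # Write at the current pointer; once full, this overwrites the oldest entry.
        i = self.ptr
        self.obs[i], self.act[i], self.rew[i] = s, a, r
        self.obs2[i], self.done[i] = s2, d
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, batch_size, rng):
        # Uniform random indices break the temporal ordering of the data.
        idx = rng.integers(0, self.size, size=batch_size)
        return dict(obs=self.obs[idx], act=self.act[idx], rew=self.rew[idx],
                    obs2=self.obs2[idx], done=self.done[idx])

buf = ReplayBuffer(obs_dim=3, act_dim=1, size=5)
for t in range(7):   # 7 stores into a buffer of capacity 5
    buf.store(np.full(3, t), [0.0], float(t), np.full(3, t + 1), 0.0)
batch = buf.sample_batch(4, np.random.default_rng(0))
```

After seven stores into a capacity-5 buffer, the two oldest transitions (rewards 0 and 1) have been overwritten and only the five most recent remain.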

Trick 2: Target Networks

If we compute the Bellman target using the same parameters we are updating, the target is a moving goalpost: the network can enter feedback loops and diverge. The solution is to maintain target networks $Q_{\phi_{\text{targ}}}$ and $\mu_{\theta_{\text{targ}}}$ that are updated slowly.

Trick 3: Polyak Averaging

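Unlike DQN, which copies the main network into the target network periodically, DDPG uses polyak averaging, a soft update applied after every gradient step:

$$\phi_{\text{targ}} \leftarrow \rho\, \phi_{\text{targ}} + (1-\rho)\, \phi$$

The same rule is applied to the target policy parameters $\theta_{\text{targ}}$.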

With $\rho = 0.995$ (default), the target network tracks the main network very slowly and smoothly.
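The soft update is short enough to sketch directly. The example below assumes parameters are held as a list of NumPy arrays and, purely for illustration, keeps the main weights frozen so the speed of tracking is easy to see. The helper name `polyak_update` is hypothetical.

```python
import numpy as np

def polyak_update(targ_params, main_params, rho=0.995):
    # In-place soft update: targ <- rho * targ + (1 - rho) * main
    for p_targ, p in zip(targ_params, main_params):
        p_targ *= rho
        p_targ += (1.0 - rho) * p

main = [np.ones(2)]     # pretend main-network weights, held fixed here
targ = [np.zeros(2)]    # target weights start far from the main weights

polyak_update(targ, main)
after_one = targ[0].copy()

for _ in range(137):    # 138 updates in total
    polyak_update(targ, main)
after_138 = targ[0].copy()
```

A single update moves the target only 0.5% of the way toward the main weights. With the main weights frozen, the remaining gap after $n$ updates is $0.995^n$, so it takes roughly 138 updates to close half of it.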

Exploration

The policy is deterministic, so the agent won't explore on its own. At training time, noise is added to actions:

$$a = \text{clip}(\mu_\theta(s) + \epsilon,\; a_{\text{Low}},\; a_{\text{High}}), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

The original paper used Ornstein-Uhlenbeck (time-correlated) noise, but uncorrelated, zero-mean Gaussian noise is simpler and has been found to work well in practice.

Start steps trick: For the first start_steps environment interactions, actions are sampled uniformly at random from the action space, ignoring the policy entirely. This fills the replay buffer with diverse experience before the agent starts acting with its own, initially poor, policy.
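The two exploration mechanisms can be combined in one action-selection function. The sketch below assumes a one-dimensional action bounded in [-1, 1]. The function `mu` is a hypothetical stand-in for the policy network, and `select_action` is an illustrative helper rather than a library call.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(s):
    # Hypothetical stand-in for the policy network mu_theta(s).
    return np.tanh(s[:1])

def select_action(s, t, start_steps=10000, act_noise=0.1,
                  a_low=-1.0, a_high=1.0):
    if t < start_steps:
        # Warm-up phase: ignore the policy and sample uniformly.
        return rng.uniform(a_low, a_high, size=1)
    # Afterwards: deterministic action plus Gaussian noise, clipped to bounds.
    a = mu(s) + act_noise * rng.standard_normal(1)
    return np.clip(a, a_low, a_high)

s = np.array([3.0, -1.0])
a_warmup = select_action(s, t=0)        # uniform random action
a_policy = select_action(s, t=20000)    # noisy policy action
```

At test time the noise would be dropped and `mu(s)` used directly, since the goal there is to measure how well the deterministic policy performs.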

PyTorch Implementation

from spinup import ddpg_pytorch as ddpg
import gym

ddpg(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    steps_per_epoch=4000,
    epochs=100,
    replay_size=int(1e6),   # replay buffer size
    gamma=0.99,
    polyak=0.995,           # ρ for target network updates
    pi_lr=1e-3,
    q_lr=1e-3,
    batch_size=100,
    start_steps=10000,      # random exploration before policy kicks in
    update_after=1000,      # wait until buffer has enough data
    update_every=50,        # env steps between gradient updates
    act_noise=0.1,          # Gaussian noise std for exploration
    num_test_episodes=10,
    max_ep_len=1000,
    logger_kwargs=dict(output_dir='/tmp/ddpg', exp_name='ddpg')
)

Key Hyperparameters

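| Hyperparameter | Default | Role |
|---|---|---|
| polyak | 0.995 | Target network smoothing coefficient ρ. Closer to 1 means slower target updates. |
| act_noise | 0.1 | Standard deviation of the Gaussian action noise during training. Tune for exploration. |
| start_steps | 10000 | Environment steps of uniform-random actions before the policy is used. |
| update_after | 1000 | Environment steps to collect before gradient updates begin, so the buffer holds enough data. |
| update_every | 50 | Environment steps between update rounds. Each round runs update_every gradient steps, so the ratio of environment steps to gradient steps stays at 1. |
| batch_size | 100 | Minibatch size. 100–256 is typical. |
| replay_size | 1e6 | Maximum replay buffer size. Usually sufficient. |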
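How update_after and update_every interact is easiest to see by counting. The sketch below assumes updates run in bursts of update_every gradient steps once every update_every environment steps, which is what the 1:1 ratio in the table implies. The function name and the placeholder counter are illustrative.

```python
def count_gradient_steps(total_steps, update_after=1000, update_every=50):
    grad_steps = 0
    for t in range(total_steps):
        # (environment step and buffer store would happen here)
        if t >= update_after and t % update_every == 0:
            # Run a burst of update_every gradient steps to keep a 1:1 ratio.
            for _ in range(update_every):
                grad_steps += 1   # placeholder for one minibatch update
    return grad_steps

steps_in_first_epoch = count_gradient_steps(total_steps=4000)
```

With the defaults and a 4000-step epoch, no updates happen during the first 1000 steps, and the remaining 3000 environment steps are matched by 3000 gradient steps (60 bursts of 50).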

DDPG's Weakness

DDPG can be brittle. The most common failure mode is Q-value overestimation: the Q-network develops spurious peaks in the action space, and the policy learns to exploit these errors, choosing actions that score well under the learned Q-function but perform poorly in the environment. Because those inflated values also feed into the Bellman targets, the error reinforces itself and performance can degrade rapidly.
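The upward bias that comes from maximizing over noisy estimates can be shown with a small simulation. This is a simplified analogy, not DDPG itself: it assumes a discrete set of 10 actions whose true value is exactly 0, with zero-mean Gaussian estimation error, and takes the max over actions in place of a policy that seeks the peak of Q.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trials, n_actions = 10000, 10
true_q = np.zeros(n_actions)             # every action is truly worth 0
noise = rng.standard_normal((n_trials, n_actions))
est_q = true_q + noise                   # unbiased but noisy estimates

mean_est = est_q.mean()                  # close to 0: estimates are unbiased
mean_max = est_q.max(axis=1).mean()      # well above 0: the max is biased upward
```

Each individual estimate is unbiased, yet the value at the chosen (maximizing) action is systematically too high, because the maximization preferentially selects positive errors.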

TD3 (next reading) was designed specifically to address this failure mode.