Deep Deterministic Policy Gradient (DDPG)
- Explain how DDPG extends DQN to continuous action spaces using a deterministic policy network that differentiably maximizes Q
- Describe the roles of replay buffer, target networks, and polyak averaging in DDPG and explain why each is necessary for stability
- Configure DDPG's key hyperparameters: polyak, act_noise, start_steps, replay_size, batch_size, update_after, and update_every
The Core Idea
DDPG is essentially deep Q-learning for continuous action spaces. In a discrete action space, computing $\max_a Q^*(s,a)$ is trivial: evaluate Q for every action and take the largest. In a continuous space, the max becomes an optimization problem that would have to be solved at every environment step, which is prohibitively slow.
DDPG's insight: if $Q^*(s,a)$ is differentiable with respect to $a$, we can maintain a separate policy network $\mu(s)$ trained to maximize Q. Instead of running an optimization at every step, we just forward-pass the policy:

$$\max_a Q(s,a) \approx Q(s, \mu(s))$$
Quick Facts
- Off-policy algorithm (uses a replay buffer)
- Continuous action spaces only
- Deterministic policy
- No MPI parallelization support
The Two Learning Problems
1. Learning the Q-Function (Bellman Minimization)
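DDPG minimizes the mean-squared Bellman error (MSBE) over transitions $(s,a,r,s',d)$ sampled from the replay buffer $\mathcal{D}$, where $d$ indicates whether $s'$ is terminal:

$$L(\phi, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathrm{E}}\left[ \left( Q_\phi(s,a) - y(r,s',d) \right)^2 \right]$$

where the target uses a target policy network $\mu_{\theta_{\text{targ}}}$ and a target Q-network $Q_{\phi_{\text{targ}}}$:

$$y(r,s',d) = r + \gamma (1-d)\, Q_{\phi_{\text{targ}}}\left(s', \mu_{\theta_{\text{targ}}}(s')\right)$$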
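As a rough illustration, the target for one transition is the reward plus the discounted target-network value of the next state, with the bootstrap term dropped when the episode has ended. The sketch below uses plain Python callables in place of neural networks; `toy_q`, `toy_mu`, and the scalar states and actions are hypothetical stand-ins chosen only to keep the arithmetic easy to follow.

```python
from typing import Callable, List, Tuple

# One transition: (state, action, reward, next_state, done)
Transition = Tuple[float, float, float, float, bool]


def bellman_target(
    reward: float,
    next_state: float,
    done: bool,
    q_targ: Callable[[float, float], float],
    mu_targ: Callable[[float], float],
    gamma: float = 0.99,
) -> float:
    """y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))."""
    if done:
        # Terminal transition: there is no next state to bootstrap from.
        return reward
    next_action = mu_targ(next_state)
    return reward + gamma * q_targ(next_state, next_action)


def msbe(
    batch: List[Transition],
    q: Callable[[float, float], float],
    q_targ: Callable[[float, float], float],
    mu_targ: Callable[[float], float],
    gamma: float = 0.99,
) -> float:
    """Mean-squared Bellman error of q over a batch of transitions."""
    total = 0.0
    for s, a, r, s2, d in batch:
        y = bellman_target(r, s2, d, q_targ, mu_targ, gamma)
        total += (q(s, a) - y) ** 2
    return total / len(batch)


# Hypothetical stand-ins for the critic and the target actor.
def toy_q(s: float, a: float) -> float:
    return s + a


def toy_mu(s: float) -> float:
    return 1.0


batch = [
    (0.0, 0.0, 1.0, 2.0, False),  # non-terminal: bootstraps from s'
    (1.0, 1.0, 5.0, 0.0, True),   # terminal: target is just the reward
]
loss = msbe(batch, toy_q, toy_q, toy_mu, gamma=0.5)
```

In a real implementation the target would be computed without tracking gradients, so that only the main Q-network is updated by this loss.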
2. Learning the Policy (Gradient Ascent on Q)
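The policy $\mu_\theta$ is trained to output the action that maximizes $Q_\phi$:

$$\max_\theta \underset{s \sim \mathcal{D}}{\mathrm{E}}\left[ Q_\phi(s, \mu_\theta(s)) \right]$$

Since $Q_\phi$ is differentiable with respect to the action, and the action $\mu_\theta(s)$ is differentiable with respect to $\theta$, this is just gradient ascent on $\theta$ with the Q-function parameters held fixed. No sampling over actions is needed.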
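To make the chain rule concrete, here is a minimal sketch with a hand-differentiated toy critic instead of a neural network. Everything in it is an assumption made for illustration: the critic `q_value` peaks at `action == 2 * state`, and the actor is the one-parameter linear policy `theta * state`, so gradient ascent should drive `theta` toward 2.

```python
from typing import List


def q_value(state: float, action: float) -> float:
    """Toy critic (hypothetical): highest when action == 2 * state."""
    return -(action - 2.0 * state) ** 2


def dq_daction(state: float, action: float) -> float:
    """Derivative of the toy critic with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def policy(theta: float, state: float) -> float:
    """Toy deterministic actor: mu_theta(s) = theta * s."""
    return theta * state


def policy_gradient(theta: float, states: List[float]) -> float:
    """Chain rule dQ/dtheta = dQ/da * da/dtheta, averaged over the batch."""
    total = 0.0
    for s in states:
        a = policy(theta, s)
        total += dq_daction(s, a) * s  # da/dtheta = s for this actor
    return total / len(states)


def train_policy(states: List[float], theta: float = 0.0,
                 lr: float = 0.1, steps: int = 200) -> float:
    """Gradient ascent on theta; the critic is held fixed throughout."""
    for _ in range(steps):
        theta += lr * policy_gradient(theta, states)
    return theta


states = [0.5, 1.0, 1.5]
theta = train_policy(states)
```

With neural networks, an autodiff framework computes the same chain rule automatically: the loss is the negative mean of the critic evaluated at the actor's output, and only the actor's parameters are stepped.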
Three Stability Tricks
Trick 1: Replay Buffer
DDPG stores past transitions $(s,a,r,s',d)$ in a fixed-size buffer, discarding the oldest once the buffer is full, and samples random minibatches from it for training. This breaks the temporal correlation between consecutive samples (crucial for stable neural network training) and allows data reuse, the key advantage over on-policy methods.
The optimal Q-function satisfies the Bellman equation for any transition, regardless of which policy collected it, which is exactly why off-policy replay works.
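A fixed-size replay buffer can be sketched as a ring buffer: new transitions overwrite the oldest ones once capacity is reached, and minibatches are drawn uniformly at random. The class and method names below are illustrative, not taken from any particular library, and this version samples with replacement for simplicity.

```python
import random
from typing import Any, List, Optional


class ReplayBuffer:
    """Minimal FIFO replay buffer (illustrative sketch)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self._storage: List[Any] = []
        self._next = 0  # index that the next write goes to
        self._rng = random.Random(seed)

    def store(self, transition: Any) -> None:
        if len(self._storage) < self.capacity:
            self._storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest entry.
            self._storage[self._next] = transition
        self._next = (self._next + 1) % self.capacity

    def sample(self, batch_size: int) -> List[Any]:
        # Uniform sampling with replacement breaks temporal ordering.
        return self._rng.choices(self._storage, k=batch_size)

    def __len__(self) -> int:
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    # (state, action, reward, next_state, done) with dummy values
    buffer.store((i, 0.0, 1.0, i + 1, False))
minibatch = buffer.sample(batch_size=4)
```

After five writes into a capacity-3 buffer, only the three most recent transitions remain, which is the behavior `replay_size` controls.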
Trick 2: Target Networks
If we compute the Bellman target using the same parameters we're updating, the target is a moving goalpost: the network can enter feedback loops and diverge. The solution is to maintain target networks $Q_{\phi_{\text{targ}}}$ and $\mu_{\theta_{\text{targ}}}$ whose parameters change slowly.
Trick 3: Polyak Averaging
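Unlike DQN, which copies the main network into the target network every fixed number of steps, DDPG uses polyak averaging, a soft update applied after every update of the main networks:

$$\phi_{\text{targ}} \leftarrow \rho\, \phi_{\text{targ}} + (1-\rho)\, \phi$$

The same rule is applied to the target policy parameters $\theta_{\text{targ}}$. With $\rho = 0.995$ (default), the target network tracks the main network very slowly and smoothly.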
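The soft update moves each target parameter a fraction (1 − ρ) of the way toward its main-network counterpart. A minimal sketch over plain lists of floats (real code would apply the same rule to every tensor of the target networks; the function name is illustrative):

```python
from typing import List


def polyak_update(target_params: List[float], main_params: List[float],
                  polyak: float = 0.995) -> List[float]:
    """Return polyak * target + (1 - polyak) * main, element-wise."""
    return [
        polyak * t + (1.0 - polyak) * m
        for t, m in zip(target_params, main_params)
    ]


main = [1.0]
target = [0.0]

after_one = polyak_update(target, main)

# Repeating the update with the main parameters held fixed shows
# how slowly the target catches up.
for _ in range(1000):
    target = polyak_update(target, main)
after_thousand = target
```

Each update shrinks the gap between target and main parameters by a factor of ρ, so with ρ = 0.995 it takes roughly 1 / (1 − ρ) = 200 updates for the gap to fall by a factor of e.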
Exploration
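The policy is deterministic, so the agent won't explore on its own. At training time, noise is added to the policy's action and the result is clipped to the valid action range:

$$a = \mathrm{clip}\left(\mu_\theta(s) + \epsilon,\ a_{\text{low}},\ a_{\text{high}}\right), \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

where $\sigma$ is set by act_noise. The original paper used Ornstein-Uhlenbeck (time-correlated) noise, but uncorrelated, zero-mean Gaussian noise has since been reported to work well and is simpler to implement.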
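Start steps trick: For the first start_steps environment interactions, actions are sampled uniformly at random from the action space (ignoring the policy entirely). This fills the replay buffer with diverse experience before the agent starts acting on its own policy. Gradient updates can begin earlier than that, as soon as update_after steps have been collected.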
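The two exploration mechanisms can be combined in one action-selection function. The sketch below assumes a scalar action with bounds `act_low` and `act_high`; the function name, the signature, and the exact boundary condition on `step` are illustrative choices, not a specific library's API.

```python
import random
from typing import Callable


def select_action(
    step: int,
    state: float,
    policy: Callable[[float], float],
    act_low: float,
    act_high: float,
    start_steps: int = 10000,
    act_noise: float = 0.1,
    rng: random.Random = random.Random(),
) -> float:
    """Pick a training-time action for a 1-D continuous action space."""
    if step < start_steps:
        # Warm-up phase: ignore the policy and sample uniformly.
        return rng.uniform(act_low, act_high)
    # Afterwards: deterministic action plus zero-mean Gaussian noise,
    # clipped back into the valid range.
    noisy = policy(state) + rng.gauss(0.0, act_noise)
    return max(act_low, min(act_high, noisy))


rng = random.Random(0)


def toy_policy(state: float) -> float:
    """Hypothetical actor whose raw output exceeds the action bounds."""
    return 5.0


warmup_action = select_action(0, 0.0, toy_policy, -1.0, 1.0, rng=rng)
clipped_action = select_action(
    20000, 0.0, toy_policy, -1.0, 1.0, act_noise=0.0, rng=rng
)
```

At test time the noise would be switched off (here, `act_noise=0.0`) so that evaluation measures the deterministic policy itself.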
PyTorch Implementation
    from spinup import ddpg_pytorch as ddpg
    import gym

    ddpg(
        env_fn=lambda: gym.make('HalfCheetah-v2'),
        ac_kwargs=dict(hidden_sizes=[256, 256]),
        steps_per_epoch=4000,
        epochs=100,
        replay_size=int(1e6),   # replay buffer size
        gamma=0.99,
        polyak=0.995,           # ρ for target network updates
        pi_lr=1e-3,
        q_lr=1e-3,
        batch_size=100,
        start_steps=10000,      # random exploration before policy kicks in
        update_after=1000,      # wait until buffer has enough data
        update_every=50,        # env steps between gradient updates
        act_noise=0.1,          # Gaussian noise std for exploration
        num_test_episodes=10,
        max_ep_len=1000,
        logger_kwargs=dict(output_dir='/tmp/ddpg', exp_name='ddpg'),
    )
Key Hyperparameters
| Hyperparameter | Default | Role |
|---|---|---|
| polyak | 0.995 | Target network smoothing. Close to 1 = very slow update. |
| act_noise | 0.1 | Gaussian noise std during training. Tune for exploration. |
| start_steps | 10000 | Steps of random action before policy is used. |
| update_after | 1000 | Don't start updates until buffer has this many samples. |
| update_every | 50 | Steps between update batches (ratio of env steps to gradient steps = 1). |
| batch_size | 100 | Minibatch size. 100–256 typical. |
| replay_size | 1e6 | Max replay buffer size. Usually sufficient. |
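The interaction between update_after and update_every is easiest to see in the shape of the training loop. The sketch below only counts gradient steps; it assumes that each update round runs update_every gradient steps in a row, which is what keeps the ratio of environment steps to gradient steps at 1. The function name is hypothetical.

```python
def count_gradient_steps(total_env_steps: int,
                         update_after: int = 1000,
                         update_every: int = 50) -> int:
    """Count gradient steps taken over a run of total_env_steps."""
    gradient_steps = 0
    for t in range(total_env_steps):
        # ... act in the environment and store the transition here ...
        if t >= update_after and t % update_every == 0:
            # One update round: as many gradient steps as env steps
            # that have passed since the previous round.
            for _ in range(update_every):
                # ... sample a minibatch and update Q, policy, targets ...
                gradient_steps += 1
    return gradient_steps


steps_after_2000 = count_gradient_steps(2000)
```

Under these assumptions, the first 1000 environment steps trigger no updates, and every later block of 50 environment steps is matched by 50 gradient steps.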
DDPG's Weakness
DDPG can be brittle. A common failure mode is Q-value overestimation: the learned Q-function develops spurious peaks over the action space, and the policy learns to exploit these errors, choosing actions that score well under the learned Q-function but perform poorly in the environment. This feedback loop can degrade performance rapidly.
TD3 (next reading) was designed specifically to address this failure mode.