DDPG & SAC in TensorFlow
Setup
Verify your TensorFlow 1.x setup for Spinning Up:
# Spinning Up's TF implementations require TF 1.x
pip install tensorflow==1.15
# Test DDPG quickly on Pendulum:
python -m spinup.run ddpg_tf1 --env Pendulum-v1 --epochs 5
You should see training metrics including TestEpRet and QVals.
Exercise 1: Baseline DDPG on Pendulum
Run DDPG on Pendulum-v1 (a simple continuous control task) to understand the output structure and baseline performance:
from spinup import ddpg_tf1 as ddpg
import gym
ddpg(
    env_fn=lambda: gym.make('Pendulum-v1'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=50,
    replay_size=int(1e6),
    gamma=0.99,
    polyak=0.995,
    pi_lr=1e-3,
    q_lr=1e-3,
    batch_size=100,
    start_steps=10000,
    update_after=1000,
    update_every=50,
    act_noise=0.1,
    num_test_episodes=10,
    max_ep_len=200,
    logger_kwargs=dict(output_dir='/tmp/ddpg-pendulum', exp_name='ddpg-pendulum')
)
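The schedule arguments interact in a way that is easier to see with numbers. The sketch below is a standalone counter, not Spinning Up code. It assumes the training loop acts uniformly at random while t <= start_steps and, once t >= update_after, runs update_every gradient steps every update_every environment steps. The function name schedule_summary is hypothetical.

```python
def schedule_summary(steps_per_epoch, epochs, start_steps, update_after, update_every):
    """Count random-action steps and gradient updates for an off-policy run.

    Assumed loop (an illustration, not copied from Spinning Up):
      - act randomly while t <= start_steps, otherwise use the noisy policy
      - when t >= update_after and t % update_every == 0,
        run update_every gradient steps
    """
    total_steps = steps_per_epoch * epochs
    random_steps = 0
    gradient_updates = 0
    for t in range(total_steps):
        if t <= start_steps:
            random_steps += 1
        if t >= update_after and t % update_every == 0:
            gradient_updates += update_every
    return {
        "total_steps": total_steps,
        "random_steps": random_steps,
        "gradient_updates": gradient_updates,
    }


# Settings from the Pendulum run above.
baseline = schedule_summary(4000, 50, start_steps=10000, update_after=1000, update_every=50)
print(baseline)
```

Under these assumptions roughly 5% of the environment steps are purely random, and the number of gradient updates stays close to one per environment step regardless of update_every.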
Tasks:
- Inspect progress.txt and identify which columns are most informative.
- Check QVals over training: does it grow monotonically? What does divergence look like?
- Load the saved policy and run 10 test episodes:
python -m spinup.run test_policy /tmp/ddpg-pendulum/ -n 10
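For the first two tasks it can help to load the log programmatically. The following is a minimal sketch that assumes progress.txt is tab-separated with a header row and that the Q-value column is named AverageQVals; check the header of your own file, since the column names here are assumptions. The helper names are hypothetical.

```python
import csv
import io


def parse_progress(text):
    """Parse tab-separated logger output into {column: [float, ...]}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


def is_monotonic_increasing(values):
    """True if every value is >= the one before it."""
    return all(b >= a for a, b in zip(values, values[1:]))


# Placeholder rows for illustration only; real values come from progress.txt.
sample = (
    "Epoch\tAverageTestEpRet\tAverageQVals\n"
    "1\t-1200.0\t-5.0\n"
    "2\t-900.0\t-20.0\n"
    "3\t-400.0\t-35.0\n"
)
log = parse_progress(sample)
print(sorted(log))
print(is_monotonic_increasing(log["AverageQVals"]))
```

To use it on a real run, pass the file contents instead of the placeholder string, for example parse_progress(open('/tmp/ddpg-pendulum/progress.txt').read()).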
Exercise 2: Observing DDPG Instability
DDPG is notoriously sensitive to hyperparameters. Deliberately create instability:
# Unstable DDPG: high learning rate, reduced polyak smoothing
ddpg(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=50,
    polyak=0.99,       # less smoothing than the default 0.995
    pi_lr=0.01,        # 10x the default; unstable
    q_lr=0.01,
    act_noise=0.1,
    start_steps=1000,  # much less random exploration
    update_after=100,
    logger_kwargs=dict(output_dir='/tmp/ddpg-unstable', exp_name='ddpg-unstable')
)
Compare against stable DDPG with defaults:
ddpg(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=50,
    logger_kwargs=dict(output_dir='/tmp/ddpg-stable', exp_name='ddpg-stable')
)
python -m spinup.run plot /tmp/ddpg-unstable/ /tmp/ddpg-stable/
Observations:
- Does QVals diverge in the unstable run?
- Does TestEpRet crash after peaking?
- How does the polyak value affect stability?
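To build intuition for the polyak question, the sketch below applies the soft target update, target <- polyak * target + (1 - polyak) * main, to a single scalar weight and counts how many updates the target needs to close half of its gap to a main weight that is held fixed. This is a toy illustration with hypothetical helper names, not the Spinning Up update code.

```python
def polyak_update(target, main, polyak):
    """One soft target update: target <- polyak * target + (1 - polyak) * main."""
    return polyak * target + (1.0 - polyak) * main


def updates_to_close_gap(polyak, fraction=0.5):
    """Number of soft updates until the target has closed `fraction` of its
    gap to a main-network weight that is held fixed at 1.0."""
    target, main = 0.0, 1.0
    steps = 0
    while target < fraction * main:
        target = polyak_update(target, main, polyak)
        steps += 1
    return steps


for polyak in (0.995, 0.99):
    print(polyak, updates_to_close_gap(polyak))
```

A smaller polyak makes the target network track the main network faster, so the bootstrapped targets move more quickly along with whatever errors the critic currently has.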
Exercise 3: SAC vs DDPG on HalfCheetah
Compare SAC and DDPG head-to-head with matched compute budgets:
import gym
from spinup import sac_tf1 as sac, ddpg_tf1 as ddpg
shared_kwargs = dict(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    steps_per_epoch=4000,
    epochs=100,
    start_steps=10000,
    update_after=1000,
    update_every=50,
    batch_size=100,
    num_test_episodes=10,
    max_ep_len=1000,
)
# Run DDPG with 3 seeds:
for seed in [0, 10, 20]:
    ddpg(**shared_kwargs,
         gamma=0.99, polyak=0.995, pi_lr=1e-3, q_lr=1e-3, act_noise=0.1,
         seed=seed,
         logger_kwargs=dict(output_dir=f'/tmp/ddpg-hc-s{seed}', exp_name='ddpg-hc'))
# Run SAC with 3 seeds:
for seed in [0, 10, 20]:
    sac(**shared_kwargs,
        gamma=0.99, polyak=0.995, lr=1e-3, alpha=0.2,
        seed=seed,
        logger_kwargs=dict(output_dir=f'/tmp/sac-hc-s{seed}', exp_name='sac-hc'))
# Alternatively, use spinup.run for cleaner seeded runs:
# python -m spinup.run ddpg_tf1 --env HalfCheetah-v2 --seed 0 10 20 --epochs 100
# python -m spinup.run sac_tf1 --env HalfCheetah-v2 --seed 0 10 20 --epochs 100
python -m spinup.run plot /tmp/ddpg-hc-s0/ /tmp/ddpg-hc-s10/ /tmp/ddpg-hc-s20/ \
/tmp/sac-hc-s0/ /tmp/sac-hc-s10/ /tmp/sac-hc-s20/
Analysis:
- Which algorithm achieves higher TestEpRet by epoch 100?
- Which has less variance across seeds?
- At what epoch does SAC surpass DDPG's final performance?
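The analysis questions above come down to two small computations: a per-epoch mean and spread across seeds, and the first epoch at which one curve reaches a given level. A minimal sketch follows, with hypothetical helper names and made-up numbers that only exercise the functions; they are not results from either algorithm.

```python
from statistics import mean, pstdev


def aggregate_seeds(curves):
    """Per-epoch mean and standard deviation across seeds.

    `curves` is a list of equal-length lists, one per seed.
    """
    per_epoch = list(zip(*curves))
    return [mean(v) for v in per_epoch], [pstdev(v) for v in per_epoch]


def first_epoch_reaching(curve, threshold):
    """1-based index of the first epoch whose value is >= threshold, or None."""
    for epoch, value in enumerate(curve, start=1):
        if value >= threshold:
            return epoch
    return None


# Made-up numbers purely to exercise the functions; they are not results.
algo_a = [[0.0, 10.0, 20.0], [0.0, 14.0, 24.0], [0.0, 12.0, 22.0]]
algo_b = [[0.0, 30.0, 50.0], [0.0, 20.0, 40.0], [0.0, 25.0, 45.0]]

mean_a, std_a = aggregate_seeds(algo_a)
mean_b, std_b = aggregate_seeds(algo_b)
print(mean_a[-1], std_a[-1])
print(first_epoch_reaching(mean_b, mean_a[-1]))
```

With real runs, each inner list would be the test-return column from one seed's progress.txt, and the threshold would be the other algorithm's final mean.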
Exercise 4: SAC Temperature Ablation
SAC's entropy temperature controls the exploration-exploitation tradeoff. Run an ablation:
from spinup.utils.run_utils import ExperimentGrid
from spinup import sac_tf1
eg = ExperimentGrid(name='sac-alpha-sweep')
eg.add('env_name', 'Hopper-v2', '', True)
eg.add('seed', [0, 10, 20])
eg.add('epochs', 100)
eg.add('alpha', [0.05, 0.1, 0.2, 0.5], 'alpha')
eg.add('ac_kwargs:hidden_sizes', [(256, 256)], 'hid')
eg.run(sac_tf1, num_cpu=1)
Discussion:
- How does a very low α (0.05) change behavior compared to a high α (0.5)?
- Which α converges fastest on Hopper?
- Why might a high α work well in early training but hurt final performance?
Note on adaptive α: Spinning Up uses a fixed α, but follow-up SAC work also describes an automatic version that adjusts α to maintain a target entropy level. This variant often removes the need to hand-tune α per environment and is available in newer implementations.
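The automatic adjustment can be sketched in a few lines. The version below is one common formulation, not part of Spinning Up: it takes a gradient step on the loss L(log α) = -log α * mean(log π(a|s) + target_entropy), so α rises when the policy's entropy estimate falls below the target and drops when it is above. The target of -dim(action space) is a widely used heuristic, and the action dimension, learning rate, and log-probability values here are illustrative assumptions.

```python
import math


def update_log_alpha(log_alpha, batch_log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on
        L(log_alpha) = -log_alpha * mean(log_prob + target_entropy).
    """
    mean_log_prob = sum(batch_log_probs) / len(batch_log_probs)
    grad = -(mean_log_prob + target_entropy)  # dL / d(log_alpha)
    return log_alpha - lr * grad


act_dim = 3                       # assumed action dimension, for illustration
target_entropy = -float(act_dim)  # common heuristic: -dim(action space)
log_alpha_0 = math.log(0.2)

# Entropy estimate is -mean(log_prob). Here it is -5.0, below the target of
# -3.0, so the policy is too deterministic and alpha should go up.
too_deterministic = update_log_alpha(log_alpha_0, [5.0, 5.0], target_entropy)

# Here the entropy estimate is -1.0, above the target, so alpha should go down.
too_random = update_log_alpha(log_alpha_0, [1.0, 1.0], target_entropy)

print(math.exp(log_alpha_0), math.exp(too_deterministic), math.exp(too_random))
```

Optimizing log α rather than α keeps the temperature positive without any clipping.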
Exercise 5: Custom Environment
Create a custom Gym-compatible environment and train SAC on it:
import gym
import numpy as np
class SimpleReachEnv(gym.Env):
    """2D point-mass reaching task. Continuous 2D action space."""
    def __init__(self):
        self.observation_space = gym.spaces.Box(
            low=-2, high=2, shape=(4,), dtype=np.float32
        )  # [x, y, goal_x, goal_y]
        self.action_space = gym.spaces.Box(
            low=-1, high=1, shape=(2,), dtype=np.float32
        )  # [dx, dy]
        self.goal = np.zeros(2, dtype=np.float32)
        self.pos = np.zeros(2, dtype=np.float32)
    def reset(self):
        self.pos = np.random.uniform(-1, 1, 2).astype(np.float32)
        self.goal = np.random.uniform(-1, 1, 2).astype(np.float32)
        return np.concatenate([self.pos, self.goal])
    def step(self, action):
        # Keep the state in float32 so observations match observation_space.
        self.pos = np.clip(self.pos + 0.1 * action, -2, 2).astype(np.float32)
        dist = np.linalg.norm(self.pos - self.goal)
        reward = -float(dist)     # negative distance = dense reward
        done = bool(dist < 0.05)  # reached goal
        return np.concatenate([self.pos, self.goal]), reward, done, {}
# Register and train:
from spinup import sac_tf1 as sac
gym.envs.register(id='SimpleReach-v0', entry_point=SimpleReachEnv, max_episode_steps=50)
sac(
    env_fn=lambda: gym.make('SimpleReach-v0'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    epochs=50,
    alpha=0.1,
    start_steps=1000,
    update_after=500,
    max_ep_len=50,  # match max_episode_steps so time-outs are not stored as true terminals
    logger_kwargs=dict(output_dir='/tmp/sac-reach', exp_name='sac-reach')
)
Tasks:
- Verify the agent learns to reach the goal (average episode length should decrease)
- Try a sparse reward version, reward = 1.0 if dist < 0.05 else 0.0: does SAC still learn?
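The difference between the two reward definitions is easy to see outside the environment. This is a small standalone sketch; reach_reward is a hypothetical helper rather than part of the class above, and it evaluates both variants at a few positions relative to a goal at the origin.

```python
import math


def reach_reward(pos, goal, sparse=False, tol=0.05):
    """Reward for the reaching task: dense = -distance, sparse = 1.0 on success."""
    dist = math.hypot(pos[0] - goal[0], pos[1] - goal[1])
    if sparse:
        return 1.0 if dist < tol else 0.0
    return -dist


goal = (0.0, 0.0)
for pos in [(0.6, 0.8), (0.3, 0.4), (0.03, 0.0)]:
    print(pos, reach_reward(pos, goal), reach_reward(pos, goal, sparse=True))
```

The dense reward changes with every step toward the goal, while the sparse reward stays at zero until the agent lands inside the 0.05 radius, so early learning depends on exploration reaching that region at all.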
Problem Set 2 — Exercise 2.1: Value Function Fitting in TRPO
This exercise uses Spinning Up's TF1 TRPO implementation to demonstrate how dramatically value function quality affects policy gradient performance.
Background. GAE-Lambda depends on the value function baseline V^π to estimate advantages. If V^π is poorly fit, advantage estimates have high variance and policy updates become unreliable.
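The dependence on the baseline can be made concrete with a small GAE-Lambda computation. The sketch below is plain Python written for illustration, not the Spinning Up buffer code. It compares an exact baseline against one that predicts zero everywhere, on a toy trajectory with a reward of 1.0 at each of three steps.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE-Lambda advantages for one trajectory.

    values[t] is the baseline V(s_t); last_value bootstraps the state after
    the final reward (use 0.0 if the episode terminated).
    """
    values = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
gamma = 0.99

# An exact baseline for this reward sequence (the discounted reward-to-go).
exact_values = [1 + gamma + gamma**2, 1 + gamma, 1.0]
# A baseline that was never trained and predicts zero everywhere.
zero_values = [0.0, 0.0, 0.0]

adv_exact = gae_advantages(rewards, exact_values, 0.0, gamma)
adv_zero = gae_advantages(rewards, zero_values, 0.0, gamma)
print(adv_exact)
print(adv_zero)
```

With the exact baseline every TD residual vanishes and the advantages are essentially zero. With the zero baseline the advantages carry the full discounted return, so the scale and noise of raw returns feed directly into the policy gradient.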
Instructions. Run the following command to compare TRPO with train_v_iters=0 versus train_v_iters=80:
python -m spinup.run trpo_tf1 --env Hopper-v2 \
--train_v_iters[v] 0 80 \
--exp_name ex2-1 \
--epochs 250 \
--steps_per_epoch 4000 \
--seed 0 10 20 \
--dt
This runs 6 experiments (2 settings × 3 seeds). The --dt flag timestamps the run directories.
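The run count follows from taking every combination of the listed values. A minimal sketch of that expansion, using plain itertools rather than Spinning Up's own launcher:

```python
from itertools import product

grid = {
    "train_v_iters": [0, 80],
    "seed": [0, 10, 20],
}

# One dict per run: the Cartesian product of all listed values.
runs = [dict(zip(grid, combo)) for combo in product(*grid.values())]

for run in runs:
    print(run)
print(len(runs))
```

Adding another swept argument multiplies the count again, which is worth checking before launching a long sweep.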
When complete, plot results:
python -m spinup.run plot /path/to/ex2-1/
Analysis questions:
- What is the AverageEpRet gap between train_v_iters=0 and train_v_iters=80 at epoch 250?
- Does the train_v_iters=0 run make any learning progress, or does it flatline?
- Why does a poor value function hurt so much? (Hint: think about what high-variance advantage estimates do to the policy loss gradient.)
Solution. The official solution is available at: https://spinningup.openai.com/en/latest/spinningup/exercise2_1_soln.html
Problem Set 2 — Exercise 2.2: Silent Bug in DDPG
This exercise demonstrates one of the most important lessons in RL engineering: failures are frequently silent. Code will run without errors, but the agent will never learn.
Instructions. Navigate to the exercise file and run it:
python spinup/exercises/tf1/problem_set_2/exercise2_2.py
This launches 6 DDPG experiments: 3 seeds with a bug, 3 seeds without. When complete:
python -m spinup.run plot /path/to/exercise2_2_results/
Your task. Before looking at the solution:
- Observe the performance gap between bugged and non-bugged runs.
- Read the bugged exercise2_2.py, specifically the actor-critic network creation code. Do not look at ddpg/core.py.
- Form a hypothesis: what exactly is the bug, and how does it break the DDPG computation graph?
Hint. Recall the DDPG computation graph:
# Bellman backup
backup = tf.stop_gradient(r + gamma * (1 - d) * q_pi_targ)
# Losses
pi_loss = -tf.reduce_mean(q_pi)
q_loss = tf.reduce_mean((q - backup)**2)
A bug in the actor-critic code could affect what q_pi, q, and q_pi_targ refer to. Think about shared versus independent network weights.
Bonus. Are there hyperparameter settings that would have hidden the effects of the bug — where the bugged version still appears to learn? Why?
Solution. The official solution is available at: https://spinningup.openai.com/en/latest/spinningup/exercise2_2_soln.html