Deep Reinforcement Learning · Off-Policy Methods & Tooling

DDPG & SAC in TensorFlow

Colab Notebook · ~60 min

Lab Objectives

Run a baseline DDPG experiment on Pendulum-v1, interpret every column in progress.txt, and load the saved policy for test evaluation

Deliberately destabilize DDPG by increasing learning rates and reducing polyak smoothing; compare QVals and TestEpRet curves against the stable baseline

Run a controlled SAC vs DDPG comparison on HalfCheetah-v2 across 3 seeds each and identify the epoch at which SAC surpasses DDPG's final performance

Conduct a SAC temperature (α) ablation across α ∈ {0.05, 0.1, 0.2, 0.5} using ExperimentGrid on Hopper-v2 and analyze the exploration-exploitation tradeoff

Implement a custom Gym-compatible continuous-control environment and train SAC on it, including a sparse reward variant

Compare TRPO with train_v_iters=0 vs 80 on Hopper-v2 across 3 seeds to observe how value function quality drives policy gradient performance (Problem Set 2.1)

Observe a silent DDPG bug in vivo, diagnose it from degraded learning curves, and explain how the faulty actor-critic code breaks the computation graph (Problem Set 2.2)

Setup

Verify your TensorFlow 1.x setup for Spinning Up:

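# Spinning Up's TF implementations require TF 1.x.
# TF 1.15 wheels are published only for Python 3.7 and earlier, so check the runtime's Python version first.
pip install tensorflow==1.15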

# Test DDPG quickly on Pendulum:
python -m spinup.run ddpg_tf1 --env Pendulum-v1 --epochs 5

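You should see a table of training metrics printed once per epoch, including the TestEpRet and QVals statistics (logged as AverageTestEpRet, AverageQVals, and their Std/Max/Min variants).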

Exercise 1: Baseline DDPG on Pendulum

Run DDPG on Pendulum-v1 (a simple continuous control task) to understand the output structure and baseline performance:

from spinup import ddpg_tf1 as ddpg
import gym

ddpg(
    env_fn=lambda: gym.make('Pendulum-v1'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=50,
    replay_size=int(1e6),
    gamma=0.99,
    polyak=0.995,
    pi_lr=1e-3,
    q_lr=1e-3,
    batch_size=100,
    start_steps=10000,
    update_after=1000,
    update_every=50,
    act_noise=0.1,
    num_test_episodes=10,
    max_ep_len=200,
    logger_kwargs=dict(output_dir='/tmp/ddpg-pendulum', exp_name='ddpg-pendulum')
)

Tasks:

Inspect progress.txt — identify which columns are most informative
Check QVals over training — does it grow monotonically? What does divergence look like?
Load the saved policy and run 10 test episodes:

python -m spinup.run test_policy /tmp/ddpg-pendulum/ -n 10
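For the progress.txt task, a minimal sketch of a log summary may help. It assumes progress.txt is the tab-separated table written by Spinning Up's logger, with columns such as AverageTestEpRet and AverageQVals. The helper name summarize_progress is hypothetical, and only the standard library is used.

```python
import csv

def summarize_progress(path, columns=("AverageTestEpRet", "AverageQVals")):
    """Return {column: (first, last, max)} for each requested column found in the log."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    summary = {}
    for col in columns:
        if not rows or col not in rows[0]:
            continue  # column absent in this log: skip it rather than fail
        values = [float(r[col]) for r in rows]
        summary[col] = (values[0], values[-1], max(values))
    return summary

# Usage (path assumed from the run above):
# print(summarize_progress('/tmp/ddpg-pendulum/progress.txt'))
```

If the last value of AverageQVals is far from both the first value and any plausible discounted return for the task, that is a cue to plot the whole column.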

Exercise 2: Observing DDPG Instability

DDPG is notoriously sensitive to hyperparameters. Deliberately create instability:

# Unstable DDPG: high learning rate, reduced polyak smoothing
ddpg(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=50,
    polyak=0.99,        # less smoothing than default 0.995
    pi_lr=0.01,         # 10x default — unstable
    q_lr=0.01,
    act_noise=0.1,
    start_steps=1000,   # much less random exploration
    update_after=100,
    logger_kwargs=dict(output_dir='/tmp/ddpg-unstable', exp_name='ddpg-unstable')
)

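Compare against stable DDPG with defaults. If you run both in the same notebook session, restart the runtime or call tf.reset_default_graph() between runs so that variables from the earlier run do not leak into the new TF1 graph: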

ddpg(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=50,
    logger_kwargs=dict(output_dir='/tmp/ddpg-stable', exp_name='ddpg-stable')
)

python -m spinup.run plot /tmp/ddpg-unstable/ /tmp/ddpg-stable/

Observations:

Does QVals diverge in the unstable run?
Does TestEpRet crash after peaking?
How does the polyak value affect stability?
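To reason about the polyak question, the target-network update can be sketched in isolation. This is an illustrative scalar version, assuming the usual rule target <- polyak * target + (1 - polyak) * main. The function names are hypothetical and no neural network is involved.

```python
def polyak_update(target, main, polyak):
    """One target-network step: target <- polyak * target + (1 - polyak) * main."""
    return polyak * target + (1.0 - polyak) * main

def steps_to_track(polyak, fraction=0.5):
    """Count updates until the target has closed `fraction` of the gap to a fixed main value."""
    target, main, steps = 0.0, 1.0, 0
    while target < fraction:
        target = polyak_update(target, main, polyak)
        steps += 1
    return steps

for rho in (0.995, 0.99):
    print(rho, steps_to_track(rho))
```

With polyak=0.99 the target closes half the gap in about half as many updates as with 0.995, so the Bellman backup chases a faster-moving target.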

Exercise 3: SAC vs DDPG on HalfCheetah

Compare SAC and DDPG head-to-head with matched compute budgets:

from spinup import sac_tf1 as sac, ddpg_tf1 as ddpg

shared_kwargs = dict(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    steps_per_epoch=4000,
    epochs=100,
    start_steps=10000,
    update_after=1000,
    update_every=50,
    batch_size=100,
    num_test_episodes=10,
    max_ep_len=1000,
)

import tensorflow as tf

# Run DDPG with 3 seeds. Reset the default graph before each run so that
# variables from the previous run do not leak into the new TF1 graph.
for seed in [0, 10, 20]:
    tf.reset_default_graph()
    ddpg(**shared_kwargs,
         gamma=0.99, polyak=0.995, pi_lr=1e-3, q_lr=1e-3, act_noise=0.1,
         seed=seed,
         logger_kwargs=dict(output_dir=f'/tmp/ddpg-hc-s{seed}', exp_name='ddpg-hc'))

# Run SAC with 3 seeds:
for seed in [0, 10, 20]:
    tf.reset_default_graph()
    sac(**shared_kwargs,
        gamma=0.99, polyak=0.995, lr=1e-3, alpha=0.2,
        seed=seed,
        logger_kwargs=dict(output_dir=f'/tmp/sac-hc-s{seed}', exp_name='sac-hc'))

# Alternatively, use spinup.run for cleaner seeded runs:
# python -m spinup.run ddpg_tf1 --env HalfCheetah-v2 --seed 0 10 20 --epochs 100
# python -m spinup.run sac_tf1 --env HalfCheetah-v2 --seed 0 10 20 --epochs 100

python -m spinup.run plot /tmp/ddpg-hc-s0/ /tmp/ddpg-hc-s10/ /tmp/ddpg-hc-s20/ \
                          /tmp/sac-hc-s0/ /tmp/sac-hc-s10/ /tmp/sac-hc-s20/

Analysis:

Which algorithm achieves higher TestEpRet by epoch 100?
Which has less variance across seeds?
At what epoch does SAC surpass DDPG's final performance?
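The last question can be answered programmatically once the per-seed AverageTestEpRet curves are loaded. The sketch below assumes each run is a plain list with one value per epoch. The function name first_epoch_surpassing is hypothetical, and the returned value is a 0-based index into those lists, not the logged Epoch number.

```python
from statistics import mean

def first_epoch_surpassing(sac_runs, ddpg_runs):
    """Index of the first epoch where the seed-averaged SAC return exceeds
    DDPG's seed-averaged final return, or None if it never does."""
    ddpg_final = mean(run[-1] for run in ddpg_runs)
    n_epochs = min(len(run) for run in sac_runs)
    for epoch in range(n_epochs):
        if mean(run[epoch] for run in sac_runs) > ddpg_final:
            return epoch
    return None

# Toy curves (made-up numbers, two seeds each) just to show the call:
sac_runs = [[1.0, 5.0, 9.0], [3.0, 7.0, 11.0]]
ddpg_runs = [[0.0, 2.0, 6.0], [0.0, 4.0, 6.0]]
print(first_epoch_surpassing(sac_runs, ddpg_runs))
```

Averaging over seeds before comparing keeps a single lucky seed from deciding the crossover point.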

Exercise 4: SAC Temperature Ablation

SAC's entropy temperature $\alpha$ controls the exploration-exploitation tradeoff. Run an ablation:

from spinup.utils.run_utils import ExperimentGrid
from spinup import sac_tf1

eg = ExperimentGrid(name='sac-alpha-sweep')
eg.add('env_name', 'Hopper-v2', '', True)
eg.add('seed', [0, 10, 20])
eg.add('epochs', 100)
eg.add('alpha', [0.05, 0.1, 0.2, 0.5], 'alpha')
eg.add('ac_kwargs:hidden_sizes', [(256, 256)], 'hid')

eg.run(sac_tf1, num_cpu=1)
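The grid above expands to one run per combination of the list-valued entries. As a rough illustration of that expansion (not ExperimentGrid's own code), the combinations can be enumerated with the standard library:

```python
from itertools import product

alphas = [0.05, 0.1, 0.2, 0.5]
seeds = [0, 10, 20]

# One run per (alpha, seed) pair; the other grid entries hold a single value each.
variants = [dict(alpha=a, seed=s) for a, s in product(alphas, seeds)]
print(len(variants))
```

With num_cpu=1 these runs execute one after another, so budget wall-clock time accordingly.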

Discussion:

How does very low $\alpha$ (0.05) change behavior compared to high $\alpha$ (0.5)?
Which $\alpha$ converges fastest on Hopper?
Why might a high $\alpha$ work well in early training but hurt final performance?

Note on adaptive $\alpha$: Spinning Up uses a fixed $\alpha$, but a follow-up to the original SAC paper describes an automatic version that adjusts $\alpha$ to maintain a target entropy level. This usually removes the need to tune $\alpha$ per environment and is available in newer implementations.
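To make the role of $\alpha$ concrete, here is a scalar sketch of the entropy-regularized backup in the clipped double-Q form commonly given for SAC, next to the DDPG backup. The function names are hypothetical, and the exact graph in any given implementation may differ.

```python
def ddpg_backup(r, d, q_targ, gamma=0.99):
    """DDPG target: r + gamma * (1 - d) * Q_targ(s', pi_targ(s'))."""
    return r + gamma * (1.0 - d) * q_targ

def sac_backup(r, d, q1_targ, q2_targ, logp_next, gamma=0.99, alpha=0.2):
    """SAC target: the smaller target Q-value plus an entropy bonus -alpha * log pi(a'|s')."""
    soft_value = min(q1_targ, q2_targ) - alpha * logp_next
    return r + gamma * (1.0 - d) * soft_value

# The same transition under each alpha in the sweep (toy numbers):
for alpha in (0.05, 0.1, 0.2, 0.5):
    print(alpha, sac_backup(1.0, 0.0, 2.0, 3.0, logp_next=-1.0, alpha=alpha))
```

A larger $\alpha$ raises the target for low-probability actions (very negative log-probabilities), which rewards the policy for staying stochastic.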

Exercise 5: Custom Environment

Create a custom Gym-compatible environment and train SAC on it:

import gym
import numpy as np

class SimpleReachEnv(gym.Env):
    """2D point-mass reaching task. Continuous 2D action space."""
    
    def __init__(self):
        self.observation_space = gym.spaces.Box(
            low=-2, high=2, shape=(4,), dtype=np.float32
        )  # [x, y, goal_x, goal_y]
        self.action_space = gym.spaces.Box(
            low=-1, high=1, shape=(2,), dtype=np.float32
        )  # [dx, dy]
        self.goal = np.zeros(2)
        self.pos = np.zeros(2)
    
    def reset(self):
        self.pos = np.random.uniform(-1, 1, 2).astype(np.float32)
        self.goal = np.random.uniform(-1, 1, 2).astype(np.float32)
        return np.concatenate([self.pos, self.goal])
    
    def step(self, action):
        self.pos = np.clip(self.pos + 0.1 * action, -2, 2)
        dist = np.linalg.norm(self.pos - self.goal)
        reward = -dist  # negative distance = dense reward
        done = dist < 0.05  # reached goal
        return np.concatenate([self.pos, self.goal]), reward, done, {}

# Register and train:
gym.envs.register(id='SimpleReach-v0', entry_point=SimpleReachEnv, max_episode_steps=50)

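from spinup import sac_tf1 as sac

sac(
    env_fn=lambda: gym.make('SimpleReach-v0'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    epochs=50,
    alpha=0.1,
    start_steps=1000,
    update_after=500,
    max_ep_len=50,      # match max_episode_steps so time-limit cutoffs are not stored as terminal states
    logger_kwargs=dict(output_dir='/tmp/sac-reach', exp_name='sac-reach')
)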

Tasks:

Verify the agent learns to reach the goal (average episode length should decrease)
Try a sparse reward version: reward = 1.0 if dist < 0.05 else 0.0 — does SAC still learn?
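The two reward variants can be written as small standalone functions and checked before wiring them into the environment. This is a sketch with hypothetical names (dense_reward, sparse_reward). It uses only the standard library and takes positions as (x, y) pairs.

```python
import math

def dense_reward(pos, goal):
    """Negative Euclidean distance to the goal: informative at every step."""
    return -math.hypot(pos[0] - goal[0], pos[1] - goal[1])

def sparse_reward(pos, goal, tol=0.05):
    """1.0 only inside the goal radius, 0.0 everywhere else."""
    return 1.0 if math.hypot(pos[0] - goal[0], pos[1] - goal[1]) < tol else 0.0

print(dense_reward((0.0, 0.0), (0.3, 0.4)), sparse_reward((0.0, 0.0), (0.3, 0.4)))
```

Inside step, swapping reward = -dist for reward = sparse_reward(self.pos, self.goal) gives the sparse variant. The agent then sees a nonzero reward only on the step that reaches the goal.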

Problem Set 2 — Exercise 2.1: Value Function Fitting in TRPO

This exercise uses Spinning Up's TF1 TRPO implementation to demonstrate how dramatically value function quality affects policy gradient performance.

Background. GAE-Lambda uses the learned value function $V^\pi$ both as a baseline and to form the TD residuals from which advantages are computed. If $V^\pi$ is poorly fit, the advantage estimates become noisy and biased, and policy updates become unreliable.
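The dependence on the value function is easiest to see in code. The following is a minimal pure-Python sketch of the GAE-Lambda recursion for a single trajectory, assuming the standard definitions $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $A_t = \delta_t + \gamma\lambda A_{t+1}$. The function name gae_advantages is hypothetical, and this is not Spinning Up's buffer code.

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.97):
    """rewards[t] and values[t] = V(s_t) for t = 0..T-1; last_value stands in for V(s_T)."""
    values = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # discounted sum of residuals
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0]
# An uninformative value function (all zeros) versus the exact returns, undiscounted for clarity:
print(gae_advantages(rewards, [0.0, 0.0], gamma=1.0, lam=1.0))
print(gae_advantages(rewards, [2.0, 1.0], gamma=1.0, lam=1.0))
```

With the all-zero value function, the advantages are just the raw returns. With the exact values they shrink to zero on this deterministic toy trajectory, which is the variance reduction a well-fit baseline provides.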

Instructions. Run the following command to compare TRPO with train_v_iters=0 versus train_v_iters=80:

python -m spinup.run trpo_tf1 --env Hopper-v2 \
    --train_v_iters[v] 0 80 \
    --exp_name ex2-1 \
    --epochs 250 \
    --steps_per_epoch 4000 \
    --seed 0 10 20 \
    --dt

This runs 6 experiments (2 settings × 3 seeds). The [v] suffix gives train_v_iters the shorthand v in run names, and --dt timestamps the run directories.

When complete, plot results:

python -m spinup.run plot /path/to/ex2-1/

Analysis questions:

What is the AverageEpRet gap between train_v_iters=0 and train_v_iters=80 at epoch 250?
Does the train_v_iters=0 run make any learning progress, or does it flatline?
Why does a poor value function hurt so much? (Hint: think about what high-variance advantage estimates do to the policy loss gradient.)

Solution. The official solution is available at: https://spinningup.openai.com/en/latest/spinningup/exercise2_1_soln.html

Problem Set 2 — Exercise 2.2: Silent Bug in DDPG

This exercise demonstrates one of the most important lessons in RL engineering: failures are frequently silent. Code will run without errors, but the agent will never learn.

Instructions. Navigate to the exercise file and run it:

python spinup/exercises/tf1/problem_set_2/exercise2_2.py

This launches 6 DDPG experiments: 3 seeds with a bug, 3 seeds without. When complete:

python -m spinup.run plot /path/to/exercise2_2_results/

Your task. Before looking at the solution:

Observe the performance gap between bugged and non-bugged runs.
Read the bugged exercise2_2.py — specifically the actor-critic network creation code. Do not look at ddpg/core.py.
Form a hypothesis: what exactly is the bug, and how does it break the DDPG computation graph?

Hint. Recall the DDPG computation graph:

# Bellman backup
backup = tf.stop_gradient(r + gamma * (1 - d) * q_pi_targ)
# Losses
pi_loss = -tf.reduce_mean(q_pi)
q_loss = tf.reduce_mean((q - backup)**2)

A bug in the actor-critic code could affect what q_pi, q, and q_pi_targ actually are when they reach these lines. Check how each tensor is built and what shape it has.

Bonus. Are there hyperparameter settings that would have hidden the effects of the bug — where the bugged version still appears to learn? Why?

Solution. The official solution is available at: https://spinningup.openai.com/en/latest/spinningup/exercise2_2_soln.html
