Deep Reinforcement Learning · Policy Gradient Algorithms

VPG & PPO in PyTorch

Colab Notebook · ~60 min

Lab Objectives

Implement a minimal Vanilla Policy Gradient training loop from scratch — policy network, trajectory collection, rewards-to-go, and a single gradient step — and verify it learns CartPole-v1 within 50 epochs

Extend VPG to PPO-Clip by adding multiple gradient steps per epoch, the clipped surrogate objective, and approximate KL early stopping; ablate clip_ratio across 0.1 / 0.2 / 0.3 / 0.5

Run Spinning Up's full PPO implementation on LunarLander-v3 and compare the effect of varying λ (0.9 vs 0.97), network size, and the VPG baseline

Design and execute a systematic ExperimentGrid sweep across 54 PPO configurations (3 seeds × 3 architectures × 3 clip ratios × 2 λ values) and interpret the results

Implement the diagonal Gaussian log-likelihood function and auto-verify it against Spinning Up's reference implementation (Problem Set 1.1)

Implement an MLP diagonal Gaussian policy compatible with Spinning Up's PPO training loop and verify it achieves average score > 500 on InvertedPendulum-v2 within 20 epochs (Problem Set 1.2)

Implement the TD3 critic and policy loss functions from starter code and verify learning on HalfCheetah-v2 and InvertedPendulum-v2 (Problem Set 1.3)

Setup

Install Spinning Up and verify your environment:

# Clone and install Spinning Up
git clone https://github.com/openai/spinningup.git
cd spinningup
pip install -e .

# Verify installation:
python -m spinup.run vpg_pytorch --env CartPole-v1 --epochs 5

You should see epoch-by-epoch logging of AverageEpRet and other metrics.

Exercise 1: Minimal VPG from Scratch

Implement a minimal VPG training loop without using Spinning Up's VPG implementation (use it only for reference). Your implementation should:

Build a categorical policy network (for CartPole's discrete actions)
Collect one epoch of trajectories by rolling out the policy
Compute rewards-to-go for each timestep
Compute the pseudo-loss and take one gradient step

import torch
import torch.nn as nn
from torch.distributions import Categorical
import gym

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim)
        )
    
    def forward(self, obs):
        return Categorical(logits=self.net(obs))

def rewards_to_go(rewards):
    """Compute reward-to-go for each timestep."""
    n = len(rewards)
    rtg = torch.zeros(n)
    running_sum = 0
    for t in reversed(range(n)):
        running_sum = rewards[t] + running_sum  # no discount for simplicity
        rtg[t] = running_sum
    return rtg

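def collect_epoch(env, policy, steps=4000):
    """Roll out the policy for `steps` environment steps.

    Uses the older Gym API (reset() returns obs, step() returns a 4-tuple).
    """
    obs_buf, act_buf, rtg_buf = [], [], []
    obs = env.reset()
    ep_rewards = []
    
    for _ in range(steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():  # no gradients are needed while sampling
            act = policy(obs_t).sample()
        
        obs_buf.append(obs_t)
        act_buf.append(act.item())
        
        obs, rew, done, _ = env.step(act.item())
        ep_rewards.append(rew)
        
        if done:
            rtg_buf.extend(rewards_to_go(ep_rewards).tolist())
            obs = env.reset()
            ep_rewards = []
    
    # The epoch usually ends mid-episode. Include the partial trajectory so
    # rtg_buf has exactly one entry per stored observation and action.
    # (Treating the cut-off as an episode end underestimates those returns.)
    if ep_rewards:
        rtg_buf.extend(rewards_to_go(ep_rewards).tolist())
    
    return (
        torch.stack(obs_buf),
        torch.as_tensor(act_buf, dtype=torch.int64),
        torch.as_tensor(rtg_buf, dtype=torch.float32)
    )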

def train_vpg(env_name='CartPole-v1', epochs=50, steps=4000, lr=3e-4):
    env = gym.make(env_name)
    policy = PolicyNet(env.observation_space.shape[0], env.action_space.n)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    
    for epoch in range(epochs):
        obs, acts, rtg = collect_epoch(env, policy, steps)
        
        optimizer.zero_grad()
        log_probs = policy(obs).log_prob(acts)
        loss = -(log_probs * rtg).mean()
        loss.backward()
        optimizer.step()
        
        print(f'Epoch {epoch+1}: mean_rtg={rtg.mean():.1f}')
    
    return policy

if __name__ == '__main__':
    train_vpg()

Task: Run this for 50 epochs. Then:

Add a value function network and use advantage (RTG - baseline) as weights
Compare learning curves: RTG weights vs. advantage weights
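
The rewards_to_go helper above skips discounting, and the task asks for advantage weights. Both ideas can be sketched in plain Python (lists instead of tensors, so it runs without PyTorch). The names discounted_rewards_to_go and advantage_weights, the gamma default, and the values argument (standing in for a value network's predictions) are assumptions made for illustration; they are not part of the lab code.

```python
def discounted_rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: R_t = r_t + gamma * R_{t+1}."""
    rtg = [0.0] * len(rewards)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        rtg[t] = running_sum
    return rtg


def advantage_weights(rtg, values):
    """Advantage estimate: reward-to-go minus a baseline value per timestep."""
    return [g - v for g, v in zip(rtg, values)]


rtg = discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
adv = advantage_weights(rtg, values=[1.0, 1.0, 1.0])  # hypothetical baseline
print(rtg, adv)
```

In the PyTorch loop the same subtraction would be applied to the rtg tensor before it multiplies log_probs, with the baseline detached from the policy's graph.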

Exercise 2: Implement PPO-Clip

Extend your VPG implementation to PPO by adding:

Multiple gradient steps per epoch (train_pi_iters)
The clipped surrogate objective
Approximate KL early stopping

def compute_ppo_loss(obs, acts, adv, logp_old, policy, clip_ratio=0.2):
    """Compute PPO-Clip objective."""
    dist = policy(obs)
    logp = dist.log_prob(acts)
    
    # Probability ratio
    ratio = torch.exp(logp - logp_old)
    
    # Clipped surrogate objective
    clipped_ratio = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    loss = -torch.min(ratio * adv, clipped_ratio * adv).mean()
    
    # Approximate KL for early stopping
    approx_kl = (logp_old - logp).mean().item()
    return loss, approx_kl

def train_ppo_epoch(obs, acts, adv, logp_old, policy, optimizer,
                   clip_ratio=0.2, train_iters=80, target_kl=0.01):
    for i in range(train_iters):
        optimizer.zero_grad()
        loss, kl = compute_ppo_loss(obs, acts, adv, logp_old, policy, clip_ratio)
        if kl > 1.5 * target_kl:
            print(f'  Early stop at step {i}, KL={kl:.4f}')
            break
        loss.backward()
        optimizer.step()

Tasks:

Compare your PPO implementation against Spinning Up's on CartPole
Vary clip_ratio (0.1, 0.2, 0.3) and plot final performance
What happens with clip_ratio=0.5? With clip_ratio=0.01?
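
To build intuition for the clip_ratio questions, the per-sample clipped objective can be evaluated on single numbers. This is a scalar, plain-Python sketch of the same min/clamp logic used in compute_ppo_loss; the function name clipped_surrogate is hypothetical and the ratio and advantage values are arbitrary probes, not measured results.

```python
def clipped_surrogate(ratio, adv, clip_ratio=0.2):
    """Per-sample PPO-Clip objective (the quantity to maximize)."""
    clipped_ratio = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * adv, clipped_ratio * adv)


# Probe: the policy has moved far from the old one (ratio 1.5 or 0.5).
for eps in (0.01, 0.2, 0.5):
    gain_good_action = clipped_surrogate(1.5, adv=1.0, clip_ratio=eps)
    gain_bad_action = clipped_surrogate(0.5, adv=-1.0, clip_ratio=eps)
    print(eps, gain_good_action, gain_bad_action)
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + clip_ratio, and with a negative advantage it stops improving once the ratio drops below 1 - clip_ratio. A very small clip_ratio therefore removes almost all incentive to move the policy, while a large one leaves the update nearly unconstrained.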

Exercise 3: Run Spinning Up's PPO on LunarLander

Use Spinning Up's full PPO implementation (which includes GAE-Lambda, proper value function fitting, and logging) on a harder environment:

from spinup import ppo_pytorch as ppo
import gym

# Baseline run:
ppo(
    env_fn=lambda: gym.make('LunarLander-v3'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=150,
    gamma=0.99,
    lam=0.97,
    clip_ratio=0.2,
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_pi_iters=80,
    train_v_iters=80,
    target_kl=0.01,
    logger_kwargs=dict(output_dir='/tmp/ppo-lunar', exp_name='ppo-lunar-baseline')
)

# Then plot (run this in a shell, not inside Python):
python -m spinup.run plot /tmp/ppo-lunar/

Experiments (run each with seeds 0, 10, 20 so that differences between settings can be separated from seed-to-seed noise):

Baseline: lam=0.97, clip_ratio=0.2
Lower lambda: lam=0.9 (more bias, less variance)
Smaller architecture: hidden_sizes=[32,32]
Compare with VPG: python -m spinup.run vpg_pytorch --env LunarLander-v3 --epochs 150 --seed 0 10 20
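
The "more bias, less variance" note on lam refers to GAE-Lambda advantage estimation. A minimal plain-Python sketch for a single trajectory follows, assuming value estimates are supplied as a list. The function name gae_advantages and the last_value argument (the bootstrap value after the final step, 0 for a terminal state) are hypothetical and this is not Spinning Up's code.

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.97):
    """GAE-Lambda: A_t = sum_l (gamma * lam)^l * delta_{t+l},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    vals = list(values) + [last_value]
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * vals[t + 1] - vals[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


# lam=0 keeps only the one-step TD error; lam=1 sums all future TD errors.
print(gae_advantages([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=0.0))
print(gae_advantages([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0))
```

Lowering lam leans more on the value function's estimates (bias) and less on long sums of sampled rewards (variance), which is the trade-off the lam=0.9 run probes.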

Analysis questions:

How many epochs until PPO reaches average return > 200 ("solved")?
Does VPG converge at all on LunarLander with 150 epochs?
Which lam value converges faster?
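
The first analysis question can be answered from the training log rather than by eye. The sketch below assumes the run's output directory contains a tab-separated progress.txt with Epoch and AverageEpRet columns; check the header of your own file before relying on it. The helper name first_epoch_above is hypothetical.

```python
import csv
import io


def first_epoch_above(progress_text, threshold=200.0, column="AverageEpRet"):
    """Return the first epoch whose `column` exceeds `threshold`, else None."""
    reader = csv.DictReader(io.StringIO(progress_text), delimiter="\t")
    for row in reader:
        if float(row[column]) > threshold:
            return int(float(row["Epoch"]))
    return None


# Usage (path taken from the baseline run above; file layout is an assumption):
# with open("/tmp/ppo-lunar/progress.txt") as f:
#     print(first_epoch_above(f.read()))
```

Returning None when the threshold is never crossed makes the "does VPG converge at all" question a simple check on the result.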

Exercise 4: ExperimentGrid Sweep

Use ExperimentGrid to run a systematic hyperparameter search:

from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch

eg = ExperimentGrid(name='ppo-lunar-sweep')
eg.add('env_name', 'LunarLander-v3', '', True)
eg.add('seed', [0, 10, 20])
eg.add('epochs', 100)
eg.add('ac_kwargs:hidden_sizes', [(32,32), (64,64), (128,128)], 'hid')
eg.add('clip_ratio', [0.1, 0.2, 0.3], 'clip')
eg.add('lam', [0.95, 0.97], 'lam')

eg.run(ppo_pytorch, num_cpu=1)

This launches 3 seeds × 3 arch × 3 clip × 2 lam = 54 experiments.
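
The count can be checked by enumerating the Cartesian product directly. This plain-Python sketch mirrors the values passed to eg.add above; it only counts configurations and does not use or imitate ExperimentGrid itself.

```python
from itertools import product

grid = {
    "seed": [0, 10, 20],
    "hidden_sizes": [(32, 32), (64, 64), (128, 128)],
    "clip_ratio": [0.1, 0.2, 0.3],
    "lam": [0.95, 0.97],
}

# One dict per experiment: every combination of the listed values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 54
```

Enumerating the grid before launching is a cheap way to estimate total compute, since each added option multiplies the number of runs.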

After all runs complete:

python -m spinup.run plot /path/to/ppo-lunar-sweep/

Discussion: From the results, identify:

Which architecture performed best on average?
Is there an interaction between clip_ratio and lam?
Which configuration has the lowest variance across seeds?
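
For the variance question, one option is to group final returns by configuration and compute the mean and sample standard deviation across seeds. The sketch below assumes you have already extracted one final-return number per run; the function name summarize_by_config and the (config_key, seed, final_return) tuple layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_config(runs):
    """Map each config key to (mean, std) of final returns across seeds.

    `runs` is an iterable of (config_key, seed, final_return) tuples.
    """
    grouped = defaultdict(list)
    for config_key, _seed, final_return in runs:
        grouped[config_key].append(final_return)
    return {
        key: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in grouped.items()
    }
```

With only three seeds per configuration the standard deviation is itself a noisy estimate, so small differences between configurations should be read cautiously.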

Problem Set 1 — Exercise 1.1: Gaussian Log-Likelihood

These exercises are from the official Spinning Up problem sets, located in the cloned repository under spinup/exercises/pytorch/problem_set_1/.

Task. Write a function that takes in the means and log-stds of a batch of diagonal Gaussian distributions, along with previously-generated samples, and returns the log-likelihoods of those samples.

For a diagonal Gaussian with mean $\mu$ and diagonal covariance $\text{diag}(\sigma^2)$, the log-likelihood of a sample $x$ is:

$\log \pi(x|\mu, \sigma) = -\frac{1}{2} \sum_i \left( \frac{(x_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i + \log 2\pi \right)$

Open exercise1_1.py and implement your solution, then run it to auto-check against a known-good implementation:

cd spinningup
python spinup/exercises/pytorch/problem_set_1/exercise1_1.py

Evaluation. Outputs are compared against a reference implementation using a batch of random inputs. All elements of the output tensor should match within numerical tolerance.

Hint. For a diagonal covariance matrix, the multivariate log-likelihood decomposes into a sum over independent univariate Gaussians.
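
As a cross-check for hand-computed values, the formula above can be evaluated for a single sample in plain Python. This is a scalar sketch using lists and the math module, not the batched tensor solution the exercise asks for; the function name gaussian_log_likelihood here is chosen for illustration.

```python
import math


def gaussian_log_likelihood(x, mu, log_std):
    """Log-likelihood of one sample under a diagonal Gaussian.

    Sums the per-dimension terms from the formula above, using
    sigma_i^2 = exp(2 * log_std_i) and log(sigma_i) = log_std_i.
    """
    total = 0.0
    for x_i, mu_i, log_std_i in zip(x, mu, log_std):
        var_i = math.exp(2.0 * log_std_i)
        total += (x_i - mu_i) ** 2 / var_i + 2.0 * log_std_i + math.log(2.0 * math.pi)
    return -0.5 * total


# A sample at the mean of a unit-variance Gaussian: only the log(2*pi) terms remain.
print(gaussian_log_likelihood([0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
```

Because the terms are summed over dimensions, a batched version should return one number per sample, not one per dimension.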

Problem Set 1 — Exercise 1.2: MLP Diagonal Gaussian Policy for PPO

Task. Implement an MLP diagonal Gaussian policy for PPO.

Open exercise1_2.py and implement the policy class. The policy must:

Accept observations and return a Normal distribution (or a wrapper that supports .log_prob() and .sample())
Use the log-likelihood function you wrote in Exercise 1.1
Be compatible with Spinning Up's PPO training loop

python spinup/exercises/pytorch/problem_set_1/exercise1_2.py

Evaluation criteria. Your implementation is evaluated by running for 20 epochs on InvertedPendulum-v2. Success is:

Average score > 500 in the last 5 epochs, or
Score of 1000 (the maximum) in the last 5 epochs

Design notes:

The diagonal Gaussian policy needs both a mean network and a learned log-std parameter (a standalone nn.Parameter, not a network output, for stability)
Make sure log_prob returns the sum of per-dimension log-likelihoods (not a vector)
The act method should return a deterministic action (mean) during evaluation and a sampled action during training
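
The last design note can be illustrated without any network. In the sketch below, mean stands in for the mean network's output and log_std for the learned parameter; both are plain lists here. The signature of act, including the deterministic flag and the rng argument, is an assumption for illustration and may not match the interface the exercise file expects.

```python
import math
import random


def act(mean, log_std, deterministic=False, rng=random):
    """Return the mean action, or a sample mean + std * noise per dimension."""
    if deterministic:
        return list(mean)
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


eval_action = act([0.5, -0.5], [0.0, 0.0], deterministic=True)
train_action = act([0.5, -0.5], [0.0, 0.0], rng=random.Random(0))
print(eval_action, train_action)
```

Parameterizing by log_std keeps the standard deviation positive for any parameter value, which is one reason to learn it as a free parameter.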

Problem Set 1 — Exercise 1.3: TD3 Computation Graph

Task. Implement the main mathematical logic for the TD3 algorithm — the loss functions and intermediate calculations.

Open exercise1_3.py. You are given the entirety of TD3 except for the loss functions. Find # YOUR CODE HERE to begin.

Recall the TD3 update rules:

Critic update (clipped double-Q):

# Target action with smoothing noise
with torch.no_grad():
    pi_targ = ac_targ.pi(o2)
    noise = torch.clamp(torch.randn_like(pi_targ) * target_noise,
                        -noise_clip, noise_clip)
    a2 = torch.clamp(pi_targ + noise, -act_limit, act_limit)
    # Clipped double-Q: back up from the smaller of the two target critics
    q1_pi_targ = ac_targ.q1(o2, a2)
    q2_pi_targ = ac_targ.q2(o2, a2)
    q_pi_targ = torch.min(q1_pi_targ, q2_pi_targ)
    backup = r + gamma * (1 - d) * q_pi_targ

loss_q1 = ((ac.q1(o, a) - backup)**2).mean()
loss_q2 = ((ac.q2(o, a) - backup)**2).mean()
loss_q = loss_q1 + loss_q2

Policy update (delayed, only every policy_delay steps):

loss_pi = -ac.q1(o, ac.pi(o)).mean()
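
The two clipping steps and the backup can be traced on single numbers to check the tensor code against hand calculations. This is a scalar, plain-Python sketch; the helper names and the default values for target_noise, noise_clip, act_limit, and gamma are assumptions for illustration, not values taken from the exercise file.

```python
def clip(x, low, high):
    return max(low, min(x, high))


def smoothed_target_action(mu_targ, noise, target_noise=0.2,
                           noise_clip=0.5, act_limit=1.0):
    """Target policy smoothing: clip the scaled noise, then clip the action."""
    eps = clip(noise * target_noise, -noise_clip, noise_clip)
    return clip(mu_targ + eps, -act_limit, act_limit)


def td3_backup(r, d, q1_targ, q2_targ, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller target Q-value."""
    return r + gamma * (1.0 - d) * min(q1_targ, q2_targ)


print(smoothed_target_action(0.9, noise=10.0))
print(td3_backup(1.0, d=0.0, q1_targ=2.0, q2_targ=3.0, gamma=0.5))
```

When d is 1 the bootstrap term vanishes and the target is just the reward, which is a quick sanity check for the (1 - d) factor.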

Run your implementation:

python spinup/exercises/pytorch/problem_set_1/exercise1_3.py --env HalfCheetah-v2
python spinup/exercises/pytorch/problem_set_1/exercise1_3.py --env InvertedPendulum-v2

Use --use_soln to run Spinning Up's reference TD3 for comparison.

Evaluation. Within 10 epochs, the score on HalfCheetah-v2 should exceed 300 and the score on InvertedPendulum-v2 should max out at 150.
