VPG & PPO in PyTorch
Setup
Install Spinning Up and verify your environment:
# Clone and install Spinning Up
git clone https://github.com/openai/spinningup.git
cd spinningup
pip install -e .
# Verify installation:
python -m spinup.run vpg_pytorch --env CartPole-v1 --epochs 5
You should see epoch-by-epoch logging of AverageEpRet and other metrics.
Exercise 1: Minimal VPG from Scratch
Implement a minimal VPG training loop without using Spinning Up's VPG implementation (use it only for reference). Your implementation should:
- Build a categorical policy network (for CartPole's discrete actions)
- Collect one epoch of trajectories by rolling out the policy
- Compute rewards-to-go for each timestep
- Compute the pseudo-loss and take one gradient step
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gym  # assumes the classic Gym API: reset() -> obs, step() -> (obs, rew, done, info)

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim)
        )

    def forward(self, obs):
        # Return a distribution object so callers can sample and score actions
        return Categorical(logits=self.net(obs))

def rewards_to_go(rewards):
    """Compute reward-to-go for each timestep."""
    n = len(rewards)
    rtg = torch.zeros(n)
    running_sum = 0
    for t in reversed(range(n)):
        running_sum = rewards[t] + running_sum  # no discount for simplicity
        rtg[t] = running_sum
    return rtg

def collect_epoch(env, policy, steps=4000):
    obs_buf, act_buf, rtg_buf = [], [], []
    obs = env.reset()
    ep_rewards = []
    for _ in range(steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = policy(obs_t)
        act = dist.sample()
        obs_buf.append(obs)
        act_buf.append(act.item())
        obs, rew, done, _ = env.step(act.item())
        ep_rewards.append(rew)
        if done:
            ep_rtg = rewards_to_go(ep_rewards)
            rtg_buf.extend(ep_rtg.tolist())
            obs = env.reset()
            ep_rewards = []
    # The epoch usually ends mid-episode; flush the partial episode so that
    # obs_buf, act_buf, and rtg_buf all have the same length
    if ep_rewards:
        rtg_buf.extend(rewards_to_go(ep_rewards).tolist())
    return (
        torch.as_tensor(np.array(obs_buf), dtype=torch.float32),
        torch.as_tensor(act_buf, dtype=torch.int64),
        torch.as_tensor(rtg_buf, dtype=torch.float32)
    )

def train_vpg(env_name='CartPole-v1', epochs=50, steps=4000, lr=3e-4):
    env = gym.make(env_name)
    policy = PolicyNet(env.observation_space.shape[0], env.action_space.n)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for epoch in range(epochs):
        obs, acts, rtg = collect_epoch(env, policy, steps)
        optimizer.zero_grad()
        log_probs = policy(obs).log_prob(acts)
        # Pseudo-loss: its gradient is the policy gradient estimate
        loss = -(log_probs * rtg).mean()
        loss.backward()
        optimizer.step()
        print(f'Epoch {epoch+1}: mean_rtg={rtg.mean():.1f}')
    return policy

if __name__ == '__main__':
    train_vpg()
Task: Run this for 50 epochs. Then:
- Add a value function network and use advantage (RTG - baseline) as weights
- Compare learning curves: RTG weights vs. advantage weights
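One possible starting point for the baseline task is sketched below. The names `ValueNet`, `advantage_weights`, and `value_loss` are illustrative and not part of Spinning Up, and the advantage normalization is an assumed (though common) variance-reduction choice rather than something the exercise requires.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """State-value baseline V(s): same MLP shape as the policy, one scalar output."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs):
        # Drop the trailing dimension so the output has shape (batch,)
        return self.net(obs).squeeze(-1)

def advantage_weights(value_fn, obs, rtg):
    """Return normalized (RTG - baseline) weights, detached from the value net."""
    with torch.no_grad():
        adv = rtg - value_fn(obs)
    # Normalization is an assumption of this sketch, not required by the task
    return (adv - adv.mean()) / (adv.std() + 1e-8)

def value_loss(value_fn, obs, rtg):
    """Mean-squared error between V(s) and the observed reward-to-go."""
    return ((value_fn(obs) - rtg) ** 2).mean()
```

In the training loop, the weights returned by `advantage_weights` would replace `rtg` in the policy pseudo-loss, while `value_loss` would be minimized with its own optimizer for a few gradient steps per epoch.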
Exercise 2: Implement PPO-Clip
Extend your VPG implementation to PPO by adding:
- Multiple gradient steps per epoch (train_pi_iters)
- The clipped surrogate objective
- Approximate KL early stopping
def compute_ppo_loss(obs, acts, adv, logp_old, policy, clip_ratio=0.2):
    """Compute PPO-Clip objective."""
    dist = policy(obs)
    logp = dist.log_prob(acts)
    # Probability ratio
    ratio = torch.exp(logp - logp_old)
    # Clipped surrogate objective
    clipped_ratio = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    loss = -torch.min(ratio * adv, clipped_ratio * adv).mean()
    # Approximate KL for early stopping
    approx_kl = (logp_old - logp).mean().item()
    return loss, approx_kl

def train_ppo_epoch(obs, acts, adv, logp_old, policy, optimizer,
                    clip_ratio=0.2, train_iters=80, target_kl=0.01):
    for i in range(train_iters):
        optimizer.zero_grad()
        loss, kl = compute_ppo_loss(obs, acts, adv, logp_old, policy, clip_ratio)
        if kl > 1.5 * target_kl:
            print(f'  Early stop at step {i}, KL={kl:.4f}')
            break
        loss.backward()
        optimizer.step()
Tasks:
- Compare your PPO implementation against Spinning Up's on CartPole
- Vary clip_ratio (0.1, 0.2, 0.3) and plot final performance
- What happens with clip_ratio=0.5? With clip_ratio=0.01?
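To build intuition before running the clip_ratio experiments, the scalar sketch below evaluates the clipped objective for a single sample and counts how many ratios fall outside the clip range. The helper names `clipped_surrogate` and `clip_fraction` are hypothetical; the sketch uses plain Python floats rather than tensors.

```python
def clipped_surrogate(ratio, adv, clip_ratio=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1 - clip_ratio, min(ratio, 1 + clip_ratio))
    return min(ratio * adv, clipped * adv)

def clip_fraction(ratios, clip_ratio=0.2):
    """Share of samples whose ratio lies outside [1-eps, 1+eps]."""
    outside = [r for r in ratios if r < 1 - clip_ratio or r > 1 + clip_ratio]
    return len(outside) / len(ratios)

# Positive advantage: the objective stops growing once ratio > 1 + eps
print(clipped_surrogate(1.5, adv=1.0, clip_ratio=0.2))   # 1.2
# Negative advantage: the objective stops improving once ratio < 1 - eps
print(clipped_surrogate(0.5, adv=-1.0, clip_ratio=0.2))  # -0.8
```

A very small clip_ratio puts most samples in the flat region after the first update, so later gradient steps contribute little. A large value lets much bigger policy changes through unclipped. Logging the clip fraction alongside the approximate KL is one way to see this in your own runs.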
Exercise 3: Run Spinning Up's PPO on LunarLander
Use Spinning Up's full PPO implementation (which includes GAE-Lambda, proper value function fitting, and logging) on a harder environment:
from spinup import ppo_pytorch as ppo
import gym
# Baseline run:
ppo(
    # Spinning Up imports gym (not gymnasium), where the registered ID is LunarLander-v2
    env_fn=lambda: gym.make('LunarLander-v2'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=150,
    gamma=0.99,
    lam=0.97,
    clip_ratio=0.2,
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_pi_iters=80,
    train_v_iters=80,
    target_kl=0.01,
    logger_kwargs=dict(output_dir='/tmp/ppo-lunar', exp_name='ppo-lunar-baseline')
)
# Then plot:
python -m spinup.run plot /tmp/ppo-lunar/
Experiments (run each with seeds 0, 10, and 20 to gauge variance across seeds):
- Baseline: lam=0.97, clip_ratio=0.2
- Lower lambda: lam=0.9 (more bias, less variance)
- Smaller architecture: hidden_sizes=[32,32]
- Compare with VPG:
python -m spinup.run vpg_pytorch --env LunarLander-v2 --epochs 150 --seed 0 10 20
Analysis questions:
- How many epochs until PPO reaches average return > 200 ("solved")?
- Does VPG converge at all on LunarLander with 150 epochs?
- Which lam value converges faster?
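The lam parameter in these experiments controls GAE-Lambda advantage estimation. As a rough illustration of what it does, here is a minimal pure-Python sketch for a single trajectory. The function name `gae_advantages` and the list-based interface are assumptions of this sketch; Spinning Up's own buffer code is organized differently.

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.97):
    """GAE-Lambda advantages for one trajectory.

    rewards[t] and values[t] are per-step lists; last_value bootstraps the
    value of the state after the final step (0.0 if the episode terminated).
    """
    n = len(rewards)
    next_values = list(values[1:]) + [last_value]
    adv = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_values[t] - values[t]
        # Exponentially weighted sum of current and future residuals
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

With lam=0 the estimate reduces to the one-step TD residual (low variance, more bias from the value function). With lam=1 it becomes the discounted reward-to-go minus the value baseline (less bias, more variance).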
Exercise 4: ExperimentGrid Sweep
Use ExperimentGrid to run a systematic hyperparameter search:
from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch
import torch
eg = ExperimentGrid(name='ppo-lunar-sweep')
eg.add('env_name', 'LunarLander-v2', '', True)
eg.add('seed', [0, 10, 20])
eg.add('epochs', 100)
eg.add('ac_kwargs:hidden_sizes', [(32,32), (64,64), (128,128)], 'hid')
eg.add('clip_ratio', [0.1, 0.2, 0.3], 'clip')
eg.add('lam', [0.95, 0.97], 'lam')
eg.run(ppo_pytorch, num_cpu=1)
This launches 3 seeds × 3 arch × 3 clip × 2 lam = 54 experiments.
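The count can be checked by enumerating the Cartesian product of the swept values. This is only a sanity check on the arithmetic using the standard library, and it is not meant to reflect how ExperimentGrid builds its variants internally.

```python
from itertools import product

# Same value lists as the ExperimentGrid above
grid = {
    'seed': [0, 10, 20],
    'hidden_sizes': [(32, 32), (64, 64), (128, 128)],
    'clip_ratio': [0.1, 0.2, 0.3],
    'lam': [0.95, 0.97],
}

# One dict per experiment: every combination of one value per key
variants = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(variants))   # 54
print(variants[0])
```

Since runs execute one after another with num_cpu=1, it is worth timing a single configuration first to estimate the total wall-clock cost of all 54.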
After all runs complete:
python -m spinup.run plot /path/to/ppo-lunar-sweep/
Discussion: From the results, identify:
- Which architecture performed best on average?
- Is there an interaction between clip_ratio and lam?
- Which configuration has the lowest variance across seeds?
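To answer these questions numerically rather than by eye, one option is to read each run's log and summarize final performance per configuration. The sketch below assumes each run directory contains a tab-separated progress.txt with an AverageEpRet column, which is how Spinning Up's logger is understood to write its output; check one of your own files before relying on it. The helper names `final_return` and `summarize` are hypothetical.

```python
import csv
import statistics

def final_return(progress_path, last_n=10, column='AverageEpRet'):
    """Mean of `column` over the last `last_n` epochs of one run."""
    with open(progress_path, newline='') as f:
        rows = list(csv.DictReader(f, delimiter='\t'))
    tail = [float(row[column]) for row in rows[-last_n:]]
    return statistics.mean(tail)

def summarize(seed_paths, last_n=10):
    """Mean and sample std of final return across the seeds of one configuration."""
    finals = [final_return(p, last_n) for p in seed_paths]
    spread = statistics.stdev(finals) if len(finals) > 1 else 0.0
    return statistics.mean(finals), spread
```

Calling `summarize` once per configuration, with the progress.txt paths of its three seeds, gives a mean to rank architectures by and a spread to compare variance across seeds.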
Problem Set 1 — Exercise 1.1: Gaussian Log-Likelihood
These exercises are from the official Spinning Up problem sets, located in the cloned repository under spinup/exercises/pytorch/problem_set_1/.
Task. Write a function that takes in the means and log-stds of a batch of diagonal Gaussian distributions, along with previously-generated samples, and returns the log-likelihoods of those samples.
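For a $k$-dimensional diagonal Gaussian with mean $\mu$ and diagonal covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2)$, the log-likelihood of a sample $x$ is:
$$\log p(x) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(x_i - \mu_i)^2}{\sigma_i^2} + 2\log\sigma_i\right) + k\log 2\pi\right)$$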
Open exercise1_1.py and implement your solution, then run it to auto-check against a known-good implementation:
cd spinningup
python spinup/exercises/pytorch/problem_set_1/exercise1_1.py
Evaluation. Outputs are compared against a reference implementation using a batch of random inputs. All elements of the output tensor should match within numerical tolerance.
Hint. For a diagonal covariance matrix, the multivariate log-likelihood decomposes into a sum over independent univariate Gaussians.
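The decomposition in the hint can be checked numerically on a single sample using only the standard library. This is a scalar illustration with made-up input values, not a batched PyTorch solution to the exercise; the function names are hypothetical.

```python
import math

def univariate_logpdf(x, mu, log_std):
    """Log-density of a 1-D Gaussian with mean mu and std exp(log_std)."""
    std = math.exp(log_std)
    return -0.5 * ((x - mu) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)

def diag_gaussian_logpdf(x, mu, log_std):
    """Closed-form log-density of a k-dim diagonal Gaussian (plain lists)."""
    k = len(x)
    quad = sum(((xi - mi) / math.exp(ls)) ** 2 + 2 * ls
               for xi, mi, ls in zip(x, mu, log_std))
    return -0.5 * (quad + k * math.log(2 * math.pi))

# Arbitrary illustrative inputs
x, mu, log_std = [0.3, -1.2], [0.0, -1.0], [-0.5, 0.1]
summed = sum(univariate_logpdf(*args) for args in zip(x, mu, log_std))
print(math.isclose(summed, diag_gaussian_logpdf(x, mu, log_std)))  # True
```

In the batched version, the same structure becomes an elementwise computation followed by a sum over the last (action) dimension.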
Problem Set 1 — Exercise 1.2: MLP Diagonal Gaussian Policy for PPO
Task. Implement an MLP diagonal Gaussian policy for PPO.
Open exercise1_2.py and implement the policy class. The policy must:
- Accept observations and return a Normal distribution (or a wrapper that supports .log_prob() and .sample())
- Use the log-likelihood function you wrote in Exercise 1.1
- Be compatible with Spinning Up's PPO training loop
python spinup/exercises/pytorch/problem_set_1/exercise1_2.py
Evaluation criteria. Your implementation is evaluated by running for 20 epochs on InvertedPendulum-v2. Success is:
- Average score > 500 in the last 5 epochs, or
- Score of 1000 (the maximum) in the last 5 epochs
Design notes:
- The diagonal Gaussian policy needs both a mean network and a learned log-std parameter (a standalone nn.Parameter, not a network output, for stability)
- Make sure log_prob returns the sum of per-dimension log-likelihoods (not a vector)
- The act method should return a deterministic action (mean) during evaluation and a sampled action during training
Problem Set 1 — Exercise 1.3: TD3 Computation Graph
Task. Implement the main mathematical logic for the TD3 algorithm — the loss functions and intermediate calculations.
Open exercise1_3.py. You are given the entirety of TD3 except for the loss functions. Find # YOUR CODE HERE to begin.
Recall the TD3 update rules:
Critic update (clipped double-Q):
# Target action with smoothing noise
with torch.no_grad():
    pi_targ = ac_targ.pi(o2)
    noise = torch.clamp(torch.randn_like(pi_targ) * target_noise,
                        -noise_clip, noise_clip)
    a2 = torch.clamp(pi_targ + noise, -act_limit, act_limit)
    # Clipped double-Q: bootstrap from the smaller of the two target critics
    q1_pi_targ = ac_targ.q1(o2, a2)
    q2_pi_targ = ac_targ.q2(o2, a2)
    q_pi_targ = torch.min(q1_pi_targ, q2_pi_targ)
    backup = r + gamma * (1 - d) * q_pi_targ
# Both critics regress toward the same target
loss_q1 = ((ac.q1(o, a) - backup)**2).mean()
loss_q2 = ((ac.q2(o, a) - backup)**2).mean()
loss_q = loss_q1 + loss_q2
Policy update (delayed, only every policy_delay steps):
loss_pi = -ac.q1(o, ac.pi(o)).mean()
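The "delayed" part of the policy update is just a schedule over gradient steps. The sketch below lists which networks would be updated at each step for a given policy_delay. The function name `td3_update_schedule` is hypothetical, and the sketch deliberately leaves out the actual optimizer calls and target-network updates.

```python
def td3_update_schedule(num_updates, policy_delay=2):
    """List which networks are updated at each gradient step j."""
    schedule = []
    for j in range(num_updates):
        updated = ['q1', 'q2']        # both critics are updated on every step
        if j % policy_delay == 0:
            updated.append('pi')      # actor is updated only on delayed steps
        schedule.append(updated)
    return schedule

for j, nets in enumerate(td3_update_schedule(4, policy_delay=2)):
    print(j, nets)
```

With policy_delay=2 the critics receive twice as many updates as the actor, which gives the Q-estimates time to settle before the policy is moved against them.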
Run your implementation:
python spinup/exercises/pytorch/problem_set_1/exercise1_3.py --env HalfCheetah-v2
python spinup/exercises/pytorch/problem_set_1/exercise1_3.py --env InvertedPendulum-v2
Use --use_soln to run Spinning Up's reference TD3 for comparison.
Evaluation. Within 10 epochs, HalfCheetah should exceed 300 and InvertedPendulum should max out at 150.