Proximal Policy Optimization (PPO)
- Explain the PPO-Clip objective: what the ratio r_t(θ) measures and how clipping at (1-ε, 1+ε) prevents large policy updates
- Trace the PPO-Clip objective for positive and negative advantage cases to show that the new policy cannot over-exploit the old data in either direction
- Launch PPO in both PyTorch and TensorFlow and configure clip_ratio, target_kl, train_pi_iters, and lam
The PPO Idea
TRPO solves the policy update stability problem rigorously but expensively (conjugate gradient, line search, second-order methods). PPO asks: Can we get the same safety guarantees with a much simpler first-order method?
The answer is yes — almost. PPO modifies the objective directly so that large deviations from the old policy are penalized or ignored, without needing an explicit constraint. It runs multiple minibatch SGD steps per batch of experience, which makes better use of collected data.
PPO-Clip
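Spinning Up implements PPO-Clip (the dominant variant used at OpenAI). Define the probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
The PPO-Clip objective is:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
where:
- $\hat{A}_t$ is the advantage estimate for the action taken at step $t$
- $\epsilon$ is the clipping parameter (clip_ratio)
- $\text{clip}(x, a, b)$ restricts $x$ to the interval $[a, b]$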
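As a minimal sketch of the ratio and the clipped objective, both can be evaluated from per-sample log-probabilities in plain Python. The function name `ppo_clip_objective` and its arguments are illustrative assumptions, not Spinning Up's API; a real implementation would operate on tensors and minimize the negative of this value with an optimizer.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_ratio=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    Hypothetical helper for illustration: inputs are plain lists of floats.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed from log-probs
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# New policy equal to the old one: every ratio is 1, so the objective
# is just the mean advantage.
same = ppo_clip_objective([-1.0, -2.0], [-1.0, -2.0], [0.5, -0.5])

# New policy makes a good action 1.5x more likely; credit is capped at 1 + eps.
capped = ppo_clip_objective([math.log(0.6)], [math.log(0.4)], [1.0])
```

Working with log-probabilities and exponentiating the difference is the usual way to form the ratio, since policies typically expose log-probabilities directly.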
Why Does This Work?
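Case 1: Positive advantage. The action was better than expected, so we want to increase its probability. The ratio $r_t(\theta)$ grows as the new policy increases $\pi_\theta(a_t \mid s_t)$. But once $r_t(\theta) > 1+\epsilon$, the min kicks in and the objective stops improving: there is no incentive to move further from the old policy.
Case 2: Negative advantage. The action was worse than expected, so we want to decrease its probability. The ratio $r_t(\theta)$ falls as the new policy decreases $\pi_\theta(a_t \mid s_t)$. With a negative advantage the per-sample term reduces to $\max(r_t(\theta),\, 1-\epsilon)$ times the advantage, so once $r_t(\theta) < 1-\epsilon$, the max kicks in and the objective stops improving: again, no incentive to move too far.
In both cases, the objective is flat once the ratio has moved more than $\epsilon$ away from 1 in the direction the advantage favors. Moves in the unfavorable direction are not clipped, so they are still penalized. This acts as a soft trust region.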
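The two cases can be traced numerically. The sketch below evaluates the per-sample term over a range of ratios for an advantage of +1 and of -1, with epsilon = 0.2; `clipped_term` is a hypothetical helper written for this illustration.

```python
def clipped_term(ratio, advantage, clip_ratio=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    return min(ratio * advantage, clipped * advantage)

ratios = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0]
positive = [clipped_term(r, +1.0) for r in ratios]
negative = [clipped_term(r, -1.0) for r in ratios]

# Print one row per ratio so the flat regions are easy to spot.
for r, pos, neg in zip(ratios, positive, negative):
    print(f"r={r:.1f}  A=+1: {pos:+.2f}  A=-1: {neg:+.2f}")
```

For the positive advantage the term stops growing once the ratio passes 1.2, and for the negative advantage it stops growing once the ratio drops below 0.8. On the other side of each case (for example, a ratio of 2.0 with a negative advantage) the term keeps getting worse, which is why the clip removes the incentive to over-exploit without hiding harmful moves.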
Early Stopping
Spinning Up adds one more safeguard: if the mean approximate KL divergence between the new and old policies exceeds 1.5 × target_kl, policy updates for that epoch stop early (before all train_pi_iters gradient steps are taken). This bounds how far the policy can drift in a single epoch when clipping alone is not enough.
    # During PPO inner loop:
    for i in range(train_pi_iters):
        optimizer.zero_grad()
        loss_pi = compute_ppo_clip_loss(data, policy)
        kl = compute_approx_kl(data, policy)  # cheap approximation
        if kl > 1.5 * target_kl:  # early stop
            break
        loss_pi.backward()
        optimizer.step()
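The loop above leaves the KL helper undefined. One cheap estimator, assumed here for illustration, is the mean difference between old and new log-probabilities of the sampled actions; because those actions were drawn under the old policy, this is a sample estimate of KL(old || new). The names `approx_kl` and `policy_steps_taken` are hypothetical, and the per-iteration KL values are made-up inputs, not measured results.

```python
def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(pi_old || pi_new) from actions drawn under pi_old."""
    diffs = [lo - ln for lo, ln in zip(logp_old, logp_new)]
    return sum(diffs) / len(diffs)

def policy_steps_taken(kl_per_iter, target_kl=0.01, train_pi_iters=80):
    """Count gradient steps applied before the 1.5 * target_kl check trips.

    kl_per_iter is a hypothetical list of KL estimates, one per iteration,
    each measured before that iteration's gradient step.
    """
    steps = 0
    for kl in kl_per_iter[:train_pi_iters]:
        if kl > 1.5 * target_kl:
            break  # early stop: skip this and all remaining steps
        steps += 1
    return steps

kl_now = approx_kl([-1.0, -0.5], [-1.1, -0.6])
steps = policy_steps_taken([0.0, 0.004, 0.009, 0.016, 0.03])
```

In this example the fourth estimate is the first to exceed 1.5 × 0.01, so only three of the possible steps are applied. The estimator is noisy and can even be slightly negative on small batches, which is one reason the threshold carries some slack above target_kl.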
PyTorch Implementation
    from spinup import ppo_pytorch as ppo
    import gym

    ppo(
        env_fn=lambda: gym.make('HalfCheetah-v2'),
        ac_kwargs=dict(hidden_sizes=[64, 64]),
        steps_per_epoch=4000,
        epochs=200,
        gamma=0.99,
        clip_ratio=0.2,     # epsilon in the clipping formula
        pi_lr=3e-4,
        vf_lr=1e-3,
        train_pi_iters=80,  # max gradient steps on policy per epoch
        train_v_iters=80,
        lam=0.97,
        target_kl=0.01,     # early stopping threshold
        logger_kwargs=dict(output_dir='/tmp/ppo', exp_name='ppo-halfcheetah')
    )
From the command line:
    python -m spinup.run ppo_pytorch --env HalfCheetah-v2 \
        --epochs 200 --clip_ratio 0.2 --seed 0 10 20
TensorFlow: Identical Interface
    from spinup import ppo_tf1 as ppo
    # All arguments identical to ppo_pytorch
Key Hyperparameters
| Hyperparameter | Default | Role |
|---|---|---|
| clip_ratio | 0.2 | ε in the clipped objective: how far the new policy can move from the old. Typical: 0.1–0.3. |
| target_kl | 0.01 | KL threshold for early stopping. Typical: 0.01–0.05. |
| train_pi_iters | 80 | Max policy gradient steps per epoch (subject to early stopping). |
| train_v_iters | 80 | Value function gradient steps per epoch. |
| lam | 0.97 | GAE-Lambda for advantage estimation. |
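To make the role of lam concrete, here is a minimal sketch of GAE-Lambda advantage estimation for a single trajectory segment. The function name and argument layout are assumptions for illustration and do not mirror Spinning Up's buffer code; the recursion itself is the standard one, where each advantage is the one-step TD residual plus a (gamma × lam)-discounted copy of the next advantage.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE-Lambda advantages for one trajectory segment.

    values[t] is the value estimate V(s_t); last_value bootstraps the state
    after the final step (use 0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
td_only = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=0.0)
monte_carlo = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=1.0)
```

With lam = 0 each advantage is just the one-step TD residual (low variance, more bias from the value function); with lam = 1 it becomes the full return-to-go minus the value estimate (less bias, more variance). The default of 0.97 sits close to the second end of that range.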
PPO vs. TRPO vs. VPG
| | VPG | TRPO | PPO |
|---|---|---|---|
| Constraint | None | Hard KL (constrained opt) | Soft (clipped objective) |
| Implementation | Trivial | Complex (CG, line search) | Simple |
| Data reuse | 1 step | 1 step | Multiple minibatch steps |
| Typical performance | Low | High | High (≈ TRPO) |
| Framework support | PyTorch and TF | TF only | PyTorch and TF |
PPO in Practice
PPO is the workhorse on-policy algorithm. It was used in:
- OpenAI Five (Dota 2 at superhuman level)
- OpenAI's dexterous robotic manipulation work
- Many large-scale RLHF (Reinforcement Learning from Human Feedback) systems
If you need an on-policy algorithm and aren't sure where to start, use PPO.