Proximal Policy Optimization (PPO)

By the end of this reading you will be able to:
  • Explain the PPO-Clip objective: what the ratio r_t(θ) measures and how clipping at (1-ε, 1+ε) prevents large policy updates
  • Trace the PPO-Clip objective for positive and negative advantage cases to show that the new policy cannot over-exploit the old data in either direction
  • Launch PPO in both PyTorch and TensorFlow and configure clip_ratio, target_kl, train_pi_iters, and lam

The PPO Idea

TRPO solves the policy update stability problem rigorously but expensively: each update needs conjugate gradient and a line search on top of second-order information. PPO asks: can a much simpler first-order method keep updates comparably safe?

The answer is yes, almost. PPO modifies the objective directly so that large deviations from the old policy are penalized or ignored, without needing an explicit constraint. It also runs multiple minibatch SGD steps on each batch of experience, which makes better use of the collected data.

PPO-Clip

Spinning Up implements PPO-Clip (the dominant variant used at OpenAI). Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}$$
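In practice the ratio is usually computed from log-probabilities rather than by dividing raw probabilities, which is numerically safer. The snippet below is a minimal sketch of that idea; the function name prob_ratio and the example probabilities are illustrative assumptions, not part of any library.

```python
import math

def prob_ratio(logp_new, logp_old):
    """r_t = pi_theta(a|s) / pi_theta_k(a|s), computed in log space."""
    return math.exp(logp_new - logp_old)

# Hypothetical example: the old policy assigned the sampled action
# probability 0.25, and the updated policy assigns it 0.30.
r = prob_ratio(math.log(0.30), math.log(0.25))
print(f"ratio = {r:.3f}")
```

A ratio above 1 means the new policy makes the sampled action more likely than the old policy did; a ratio below 1 means less likely.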

The PPO-Clip objective is:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \; g(\epsilon, \hat{A}_t)\right)\right]$$

where:

$$g(\epsilon, A) = \begin{cases} (1+\epsilon) A & A \geq 0 \\ (1-\epsilon) A & A < 0 \end{cases}$$
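The objective can be sketched per sample in a few lines of plain Python. This is an illustration under stated assumptions, not Spinning Up's code: the function names are hypothetical, and ppo_clip_term_alt shows the clip-based form from the PPO paper, which gives the same value as the g-based form used here.

```python
def g(eps, adv):
    """Bound g(eps, A): (1+eps)*A for A >= 0, (1-eps)*A for A < 0."""
    return (1 + eps) * adv if adv >= 0 else (1 - eps) * adv

def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(x, hi))

def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, g(eps, A))."""
    return min(ratio * adv, g(eps, adv))

def ppo_clip_term_alt(ratio, adv, eps=0.2):
    """Clip-based form: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)

def ppo_clip_objective(ratios, advs, eps=0.2):
    """Sample mean of the per-sample terms (the expectation over t)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advs)]
    return sum(terms) / len(terms)

# Two hypothetical samples: one good action pushed too far up,
# one bad action pushed too far down. Both terms hit their bound.
print(ppo_clip_objective([1.5, 0.5], [2.0, -1.0]))
```

A training loop would maximize this quantity, which in most frameworks means minimizing its negative.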

Why Does This Work?

Case 1: Positive advantage. Action $a_t$ was better than expected, so we want to increase its probability. The ratio $r_t(\theta)$ grows as the new policy makes $a_t$ more likely. But once $r_t > 1+\epsilon$, the min selects the constant term $(1+\epsilon)\hat{A}_t$ and the objective stops improving: there is no incentive to move further from the old policy.

Case 2: Negative advantage. Action $a_t$ was worse than expected, so we want to decrease its probability. The ratio $r_t(\theta)$ falls as the new policy makes $a_t$ less likely. But once $r_t < 1-\epsilon$, the min selects the constant term $(1-\epsilon)\hat{A}_t$ and the objective stops improving: again, there is no incentive to move too far.

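In both cases, the objective is flat once the ratio has moved more than $\epsilon$ away from 1 in the direction the advantage favors. Moves in the opposite direction are not clipped, so the policy can always undo a harmful change. This acts as a soft trust region.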
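The two cases can be traced numerically by measuring the slope of the per-sample objective with respect to the ratio. The sketch below uses a central finite difference; the helper names, the step size h, and the sampled ratios are illustrative choices.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, g(eps, A))."""
    bound = (1 + eps) * adv if adv >= 0 else (1 - eps) * adv
    return min(ratio * adv, bound)

def slope(ratio, adv, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective)/d(ratio)."""
    up = ppo_clip_term(ratio + h, adv, eps)
    down = ppo_clip_term(ratio - h, adv, eps)
    return (up - down) / (2 * h)

# Sweep the ratio across the clip range for a positive and a negative advantage.
for adv in (1.0, -1.0):
    for ratio in (0.5, 0.9, 1.0, 1.1, 1.5):
        print(f"A={adv:+.0f}  r={ratio:.1f}  slope={slope(ratio, adv):+.2f}")
```

With a positive advantage the slope is zero at r = 1.5 but still equal to the advantage at r = 0.5; with a negative advantage the slope is zero at r = 0.5 but nonzero at r = 1.5. A zero slope means gradient ascent gets no signal from that sample, which is how the objective removes the incentive to keep moving.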

Early Stopping

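Spinning Up adds one more safeguard: if the approximate mean KL divergence of the new policy from the old grows past a threshold based on target_kl (1.5 × target_kl in the loop below), policy training stops early, before all train_pi_iters gradient steps are taken. This guards against updates that clipping alone lets through, since clipping removes the incentive to move further but does not put a hard limit on how far the policy moves.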

# During PPO inner loop:
for i in range(train_pi_iters):
    optimizer.zero_grad()
    loss_pi = compute_ppo_clip_loss(data, policy)
    kl = compute_approx_kl(data, policy)   # cheap approximation
    if kl > 1.5 * target_kl:              # early stop
        break
    loss_pi.backward()
    optimizer.step()
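To make the loop above concrete, here is a self-contained toy version in plain Python. Everything in it is an assumption chosen for illustration: a one-parameter Bernoulli policy, a hand-written gradient of the clipped objective, a fixed batch of eight samples, and a learning rate of 0.05. The KL estimate is the sample mean of the old log-probability minus the new one, which is one common cheap approximation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_probs(theta, actions):
    """Log-probability of each action under a Bernoulli policy with logit theta."""
    p = sigmoid(theta)
    return [math.log(p) if a == 1 else math.log(1.0 - p) for a in actions]

def approx_kl(logp_old, logp_new):
    """Cheap KL estimate: mean of (logp_old - logp_new) over the batch."""
    return sum(o - n for o, n in zip(logp_old, logp_new)) / len(logp_old)

def clipped_grad(theta, actions, advs, logp_old, eps):
    """Gradient of the mean PPO-Clip term w.r.t. theta (zero for clipped samples)."""
    p = sigmoid(theta)
    logp = log_probs(theta, actions)
    total = 0.0
    for a, adv, lo, ln in zip(actions, advs, logp_old, logp):
        ratio = math.exp(ln - lo)
        clipped = (adv >= 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
        if not clipped:
            total += ratio * adv * (a - p)  # d(ratio * adv)/d(theta)
    return total / len(actions)

def update_policy(theta, actions, advs, eps=0.2, target_kl=0.01,
                  train_pi_iters=80, lr=0.05):
    """Gradient ascent on the clipped objective with KL-based early stopping."""
    logp_old = log_probs(theta, actions)  # frozen snapshot of the old policy
    steps = 0
    for _ in range(train_pi_iters):
        kl = approx_kl(logp_old, log_probs(theta, actions))
        if kl > 1.5 * target_kl:  # early stop, same rule as the loop above
            break
        theta += lr * clipped_grad(theta, actions, advs, logp_old, eps)
        steps += 1
    return theta, steps, approx_kl(logp_old, log_probs(theta, actions))

# Toy batch: action 1 was good (A = +1), action 0 was bad (A = -1).
actions = [1, 1, 1, 1, 0, 0, 0, 0]
advs = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
theta, steps, kl = update_policy(0.0, actions, advs)
print(f"stopped after {steps} of 80 steps, approx KL = {kl:.4f}")
```

In this toy setup the KL check fires well before the 80-step budget is used, and before any sample's ratio reaches the clip boundary, which shows the two safeguards acting independently.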

PyTorch Implementation

from spinup import ppo_pytorch as ppo
import gym

ppo(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=200,
    gamma=0.99,
    clip_ratio=0.2,       # epsilon in the clipping formula
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_pi_iters=80,    # max gradient steps on policy per epoch
    train_v_iters=80,
    lam=0.97,
    target_kl=0.01,       # early stopping threshold
    logger_kwargs=dict(output_dir='/tmp/ppo', exp_name='ppo-halfcheetah')
)

From the command line:

python -m spinup.run ppo_pytorch --env HalfCheetah-v2 \
  --epochs 200 --clip_ratio 0.2 --seed 0 10 20

TensorFlow: Identical Interface

from spinup import ppo_tf1 as ppo
# All arguments identical to ppo_pytorch

Key Hyperparameters

Hyperparameter    Default   Role
clip_ratio        0.2       $\epsilon$: how far the new policy can move from the old one. Typical: 0.1–0.3.
target_kl         0.01      KL threshold for early stopping. Typical: 0.01–0.05.
train_pi_iters    80        Max policy gradient steps per epoch (subject to early stopping).
train_v_iters     80        Value function gradient steps per epoch.
lam               0.97      GAE-Lambda for advantage estimation.
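The role of lam is easiest to see in the advantage computation itself. The following is a minimal sketch of GAE-Lambda for a single trajectory, written as a plain backward loop; the function name and argument layout are assumptions, and this is not Spinning Up's buffer code.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE-Lambda: A_t = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    values holds V(s_0), ..., V(s_{T-1}); last_value is V(s_T), used to
    bootstrap (pass 0.0 if the trajectory ended in a terminal state).
    """
    values_ext = list(values) + [last_value]
    advs = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        running = delta + gamma * lam * running
        advs[t] = running
    return advs

# Hypothetical three-step trajectory with a value function that predicts 0.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]
print(gae_advantages(rewards, values, 0.0, gamma=1.0, lam=0.0))  # one-step TD residuals
print(gae_advantages(rewards, values, 0.0, gamma=1.0, lam=1.0))  # full returns minus values
```

Setting lam to 0 reduces the estimate to one-step TD residuals (low variance, more bias), while lam at 1 uses the full discounted return minus the value baseline (less bias, more variance). The default of 0.97 sits close to the second extreme.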

PPO vs. TRPO vs. VPG

                      VPG              TRPO                        PPO
Constraint            None             Hard KL (constrained opt)   Soft (clipped objective)
Implementation        Trivial          Complex (CG, line search)   Simple
Data reuse            1 step           1 step                      Multiple minibatch steps
Typical performance   Low              High                        High (≈ TRPO)
Framework support     PyTorch and TF   TF only                     PyTorch and TF

PPO in Practice

PPO is the workhorse on-policy algorithm. It was used in:

  • OpenAI Five (Dota 2 at superhuman level)
  • OpenAI's robotics dexterous manipulation work
  • Many large-scale RLHF (Reinforcement Learning from Human Feedback) systems

If you need an on-policy algorithm and aren't sure where to start, use PPO.