Proximal Policy Optimization (PPO)

By the end of this reading you will be able to:

Explain the PPO-Clip objective: what the ratio r_t(θ) measures and how clipping at (1-ε, 1+ε) prevents large policy updates
Trace the PPO-Clip objective for positive and negative advantage cases to show that the new policy cannot over-exploit the old data in either direction
Launch PPO in both PyTorch and TensorFlow and configure clip_ratio, target_kl, train_pi_iters, and lam

The PPO Idea

TRPO addresses the policy update stability problem rigorously but expensively: each update needs conjugate gradient, a line search, and second-order information. PPO asks: can we get comparable stability with a much simpler first-order method?

The answer is yes — almost. PPO modifies the objective directly so that large deviations from the old policy are penalized or ignored, without needing an explicit constraint. It runs multiple minibatch SGD steps per batch of experience, which makes better use of collected data.

PPO-Clip

Spinning Up implements PPO-Clip (the dominant variant used at OpenAI). Define the probability ratio between the new policy $\pi_\theta$ and the old policy $\pi_{\theta_k}$ that collected the data: $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}$

The PPO-Clip objective is: $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \; g(\epsilon, \hat{A}_t)\right)\right]$

where: $g(\epsilon, A) = \begin{cases} (1+\epsilon) A & A \geq 0 \\ (1-\epsilon) A & A < 0 \end{cases}$
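As a rough illustration, the objective above can be evaluated on a batch with a few lines of framework-free Python. This is only a sketch: the function names and the three-transition batch are made up for this example, and a real implementation would compute the same quantity on tensors so it can be differentiated.

```python
import math

def g(eps, adv):
    """Bound on the surrogate: (1+eps)*A if A >= 0, else (1-eps)*A."""
    return (1.0 + eps) * adv if adv >= 0 else (1.0 - eps) * adv

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Sample average of min(r_t * A_t, g(eps, A_t)) over a batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_theta / pi_theta_k
        total += min(ratio * adv, g(eps, adv))
    return total / len(advantages)

# Hypothetical batch of three transitions (values chosen for illustration only).
logp_old = [math.log(0.5), math.log(0.25), math.log(0.4)]
logp_new = [math.log(0.75), math.log(0.125), math.log(0.4)]
advantages = [1.0, -2.0, 0.5]

objective = ppo_clip_objective(logp_new, logp_old, advantages)
print(f"clipped objective: {objective:.4f}")
```

In this batch the first sample has ratio 1.5 with a positive advantage and the second has ratio 0.5 with a negative advantage, so both are capped by $g$; the third has ratio 1 and contributes its plain advantage.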

Why Does This Work?

Case 1: Positive advantage. Action $a_t$ was better than expected, so we want to increase its probability. The ratio $r_t(\theta)$ grows as the new policy makes $a_t$ more likely. But once $r_t > 1+\epsilon$, the min selects the constant term $(1+\epsilon)\hat{A}_t$ and the objective stops improving: there is no incentive to move further from the old policy.

Case 2: Negative advantage. Action $a_t$ was worse than expected, so we want to decrease its probability. The ratio $r_t(\theta)$ falls as the new policy makes $a_t$ less likely, which raises $r_t(\theta)\hat{A}_t$ because $\hat{A}_t < 0$. But once $r_t < 1-\epsilon$, the min selects the constant term $(1-\epsilon)\hat{A}_t$ and the objective stops improving: again, there is no incentive to move too far.

In both cases, the objective is flat once the ratio has moved more than $\epsilon$ away from 1 in the direction the advantage favors. Moves in the opposite direction are not clipped, so a change that makes the objective worse is still penalized in full. This acts as a soft trust region.
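The two cases can be traced numerically. The sketch below writes the per-sample term in the equivalent form $\min(r\hat{A}, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\hat{A})$ and evaluates it over a handful of ratios; the helper names and the chosen ratios are illustrative assumptions, not part of any library.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def surrogate(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip term, written with clip() instead of g()."""
    return min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

ratios = [0.5, 0.8, 1.0, 1.2, 1.5]
positive = [surrogate(r, +1.0) for r in ratios]  # advantage A = +1
negative = [surrogate(r, -1.0) for r in ratios]  # advantage A = -1

for r, p, n in zip(ratios, positive, negative):
    print(f"r={r:.1f}  A=+1: {p:+.2f}  A=-1: {n:+.2f}")
```

With $\epsilon = 0.2$, the $A = +1$ column stops growing at 1.2 once $r \geq 1.2$, and the $A = -1$ column stays at $-0.8$ once $r \leq 0.8$. The unfavorable directions are untouched: $r = 0.5$ with $A = +1$ yields 0.5, and $r = 1.5$ with $A = -1$ yields $-1.5$.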

Early Stopping

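Spinning Up adds one more safeguard: if the mean KL divergence of the new policy from the old exceeds 1.5 × target_kl, the policy update for that epoch stops early (before all train_pi_iters gradient steps are taken). Clipping removes the incentive to move far from the old policy but does not strictly forbid it, so this check guards against updates that drift too far anyway.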

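# Policy update loop for one epoch (pseudocode).
# compute_ppo_clip_loss and compute_approx_kl are placeholder helpers.
for i in range(train_pi_iters):
    optimizer.zero_grad()
    loss_pi = compute_ppo_clip_loss(data, policy)  # negative of the clipped objective
    kl = compute_approx_kl(data, policy)           # cheap sample estimate of the KL
    if kl > 1.5 * target_kl:                       # early stop: skip remaining steps
        break
    loss_pi.backward()
    optimizer.step()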
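One cheap KL estimate that fits the loop above is the sample mean of $\log \pi_{\theta_k}(a_t|s_t) - \log \pi_\theta(a_t|s_t)$ over the stored actions. The sketch below assumes that estimator; the function names, the `margin` parameter, and the log-probabilities are hypothetical and only serve to show the stopping test.

```python
import math

def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(pi_old || pi_new) using actions drawn under pi_old."""
    diffs = [lo - ln for lo, ln in zip(logp_old, logp_new)]
    return sum(diffs) / len(diffs)

def should_stop_early(kl, target_kl=0.01, margin=1.5):
    """True when the estimated KL exceeds margin * target_kl."""
    return kl > margin * target_kl

# Hypothetical log-probabilities of three stored actions.
logp_old = [math.log(0.5), math.log(0.5), math.log(0.25)]
logp_new = [math.log(0.4), math.log(0.45), math.log(0.2)]

unchanged = approx_kl(logp_old, logp_old)  # policy has not moved yet
drifted = approx_kl(logp_old, logp_new)    # policy after some gradient steps

print(should_stop_early(unchanged), should_stop_early(drifted))
```

Because it is a finite-sample estimate, this quantity can come out slightly negative even though the true KL divergence never is; that is harmless for a one-sided threshold test.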

PyTorch Implementation

from spinup import ppo_pytorch as ppo
import gym

ppo(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=200,
    gamma=0.99,
    clip_ratio=0.2,       # epsilon in the clipping formula
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_pi_iters=80,    # max gradient steps on policy per epoch
    train_v_iters=80,
    lam=0.97,
    target_kl=0.01,       # early stopping threshold
    logger_kwargs=dict(output_dir='/tmp/ppo', exp_name='ppo-halfcheetah')
)

From the command line:

python -m spinup.run ppo_pytorch --env HalfCheetah-v2 \
  --epochs 200 --clip_ratio 0.2 --seed 0 10 20

TensorFlow: Identical Interface

from spinup import ppo_tf1 as ppo
# All arguments identical to ppo_pytorch

Key Hyperparameters

Hyperparameter	Default	Role
`clip_ratio`	0.2	$\epsilon$ — how far the new policy can move from old. Typical: 0.1–0.3.
`target_kl`	0.01	KL threshold for early stopping. Typical: 0.01–0.05.
`train_pi_iters`	80	Max gradient steps per epoch (subject to early stopping).
`train_v_iters`	80	Value function gradient steps per epoch.
`lam`	0.97	GAE-Lambda for advantage estimation.
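To make the role of `lam` concrete, here is a minimal sketch of GAE-Lambda over a single trajectory, assuming the standard recursion $\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The function name, its arguments, and the toy rewards and values are illustrative and not taken from Spinning Up.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t] and values[t] refer to step t; last_value is V(s_T),
    the value of the state after the final step (0.0 if terminal).
    """
    next_values = values[1:] + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] - values[t]  # TD residual
        running = delta + gamma * lam * running                  # discounted sum of residuals
        advantages[t] = running
    return advantages

# Toy trajectory: reward 1 per step, constant value estimate 0.5.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]

one_step = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
monte_carlo = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
print(one_step, monte_carlo)
```

Setting `lam=0` reduces each advantage to the one-step TD residual (lower variance, more bias from the value function), while `lam=1` gives the full return minus the value baseline (less bias, higher variance). Values such as the default 0.97 sit between the two.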

PPO vs. TRPO vs. VPG

	VPG	TRPO	PPO
Constraint	None	Hard KL (constrained opt)	Soft (clipped objective)
Implementation	Trivial	Complex (CG, line search)	Simple
Data reuse	1 step	1 step	Multiple minibatch steps
Typical performance	Low	High	High (≈ TRPO)
Spinning Up frameworks	PyTorch and TF	TF only	PyTorch and TF

PPO in Practice

PPO is the workhorse on-policy algorithm. It was used in:

OpenAI Five (Dota 2 at superhuman level)
OpenAI's robotics dexterous manipulation work
Many large-scale RLHF (Reinforcement Learning from Human Feedback) systems

If you need an on-policy algorithm and aren't sure where to start, use PPO.

References

Schulman et al. 2017 — Proximal Policy Optimization Algorithms

Heess et al. 2017 — Emergence of Locomotion Behaviours in Rich Environments

OpenAI Spinning Up — PPO Documentation
