Trust Region Policy Optimization (TRPO)
- Explain the core problem that TRPO solves: why large policy gradient steps can cause performance collapse, and how KL-divergence constrains the update
- Describe the TRPO update rule: surrogate advantage objective, KL constraint, conjugate gradient, and backtracking line search
- Configure TRPO's key hyperparameters: delta (KL limit), damping_coeff, cg_iters, and lam
The Problem with Large Policy Gradient Steps
Vanilla policy gradient keeps new and old policies close in parameter space — it simply takes a step of size along the gradient. But small moves in parameter space can produce large changes in policy behavior, because the mapping from parameters to action distributions is nonlinear.
A single bad gradient step can collapse performance. Recovering from collapse often requires reducing the learning rate and waiting many epochs — effectively wasting all the data you collected.
The TRPO Idea: Constrain the KL Divergence
TRPO asks: What is the largest performance-improving step we can take such that the new policy stays close to the old one in terms of behavior — not just parameter values?
It formalizes "closeness" using KL divergence:
The TRPO update solves a constrained optimization:
where is the surrogate advantage — a measure of how much improves on using old data:
The ratio is the importance weight — it corrects for the fact that we're evaluating the new policy using data from the old one.
Solving the Constrained Problem Efficiently
The exact constrained optimization is expensive. TRPO uses two approximations:
1. Taylor Expansion
Expand objective and constraint to leading order around :
Here is the policy gradient and is the Hessian of the KL constraint. The solution is:
This is exactly the natural policy gradient.
2. Conjugate Gradient
Computing directly is in parameters — infeasible for large neural networks. TRPO instead solves using the conjugate gradient algorithm, which only requires computing matrix-vector products (not storing ). Computing is a second-order AD operation available in PyTorch and TensorFlow.
3. Backtracking Line Search
Because of Taylor approximation errors, the analytical step might violate the constraint. TRPO adds a backtracking line search:
where is the backtrack coefficient and is the smallest integer such that the new policy satisfies the KL constraint and shows a positive surrogate advantage.
TensorFlow API
Note: The Spinning Up TRPO implementation is TensorFlow only — there is no PyTorch version.
from spinup import trpo_tf1 as trpo
import gym
trpo(
env_fn=lambda: gym.make('HalfCheetah-v2'),
ac_kwargs=dict(hidden_sizes=[64, 64]),
steps_per_epoch=4000,
epochs=50,
gamma=0.99,
delta=0.01, # KL divergence limit
vf_lr=1e-3,
train_v_iters=80,
damping_coeff=0.1, # Hessian damping for numerical stability
cg_iters=10, # Conjugate gradient iterations
backtrack_iters=10,
backtrack_coeff=0.8,
lam=0.97,
logger_kwargs=dict(output_dir='/tmp/trpo', exp_name='trpo')
)
Key Hyperparameters
| Hyperparameter | Default | Notes |
|---|---|---|
delta |
0.01 | KL limit. Typical values: 0.01–0.05. Smaller = more conservative. |
damping_coeff |
0.1 | Adds to Hessian for stability. Usually don't touch. |
cg_iters |
10 | More iterations → better approximation, but slower. |
backtrack_iters |
10 | Max backtracking steps (rarely reached). |
backtrack_coeff |
0.8 | Shrink factor per backtrack step. |
lam |
0.97 | GAE-Lambda. |
TRPO vs. VPG
| VPG | TRPO | |
|---|---|---|
| Update rule | Unconstrained gradient | KL-constrained optimization |
| Step size tuning | Sensitive to learning rate | Robust to step size (constraint handles it) |
| Computational cost | Cheap | Expensive (conjugate gradient, line search) |
| Performance | Lower ceiling | Higher, more monotonic improvement |
In practice, PPO (next reading) achieves similar performance to TRPO at a fraction of the implementation complexity, which is why TRPO sees less use despite its theoretical elegance.