Deep Reinforcement Learning · Policy Gradient Algorithms

Trust Region Policy Optimization (TRPO)

12 min read
By the end of this reading you will be able to:
  • Explain the core problem that TRPO solves: why large policy gradient steps can cause performance collapse, and how KL-divergence constrains the update
  • Describe the TRPO update rule: surrogate advantage objective, KL constraint, conjugate gradient, and backtracking line search
  • Configure TRPO's key hyperparameters: delta (KL limit), damping_coeff, cg_iters, and lam

The Problem with Large Policy Gradient Steps

Vanilla policy gradient keeps new and old policies close in parameter space — it simply takes a step of size α\alpha along the gradient. But small moves in parameter space can produce large changes in policy behavior, because the mapping from parameters to action distributions is nonlinear.

A single bad gradient step can collapse performance. Recovering from collapse often requires reducing the learning rate and waiting many epochs — effectively wasting all the data you collected.

The TRPO Idea: Constrain the KL Divergence

TRPO asks: What is the largest performance-improving step we can take such that the new policy stays close to the old one in terms of behavior — not just parameter values?

It formalizes "closeness" using KL divergence: DˉKL(θθk)=Esπθk[DKL(πθ(s)πθk(s))]\bar{D}_{KL}(\theta || \theta_k) = \mathbb{E}_{s \sim \pi_{\theta_k}}\left[D_{KL}\left(\pi_\theta(\cdot|s) \,\|\, \pi_{\theta_k}(\cdot|s)\right)\right]

The TRPO update solves a constrained optimization: θk+1=argmaxθ  L(θk,θ)s.t.DˉKL(θθk)δ\theta_{k+1} = \arg\max_\theta \; \mathcal{L}(\theta_k, \theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta || \theta_k) \leq \delta

where L(θk,θ)\mathcal{L}(\theta_k, \theta) is the surrogate advantage — a measure of how much πθ\pi_\theta improves on πθk\pi_{\theta_k} using old data: L(θk,θ)=Es,aπθk[πθ(as)πθk(as)Aπθk(s,a)]\mathcal{L}(\theta_k, \theta) = \mathbb{E}_{s,a \sim \pi_{\theta_k}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a)\right]

The ratio πθ(as)/πθk(as)\pi_\theta(a|s) / \pi_{\theta_k}(a|s) is the importance weight — it corrects for the fact that we're evaluating the new policy using data from the old one.

Solving the Constrained Problem Efficiently

The exact constrained optimization is expensive. TRPO uses two approximations:

1. Taylor Expansion

Expand objective and constraint to leading order around θk\theta_k: LgT(θθk),DˉKL12(θθk)TH(θθk)\mathcal{L} \approx g^T(\theta - \theta_k), \quad \bar{D}_{KL} \approx \frac{1}{2}(\theta - \theta_k)^T H (\theta - \theta_k)

Here gg is the policy gradient and HH is the Hessian of the KL constraint. The solution is: θk+1=θk+2δgTH1gH1g\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}} H^{-1} g

This is exactly the natural policy gradient.

2. Conjugate Gradient

Computing H1gH^{-1} g directly is O(n2)O(n^2) in parameters — infeasible for large neural networks. TRPO instead solves Hx=gHx = g using the conjugate gradient algorithm, which only requires computing matrix-vector products HvHv (not storing HH). Computing Hv=θ[(θDˉKL)Tv]Hv = \nabla_\theta[(\nabla_\theta \bar{D}_{KL})^T v] is a second-order AD operation available in PyTorch and TensorFlow.

Because of Taylor approximation errors, the analytical step might violate the constraint. TRPO adds a backtracking line search: θk+1=θk+αj2δgTH1gH1g\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^T H^{-1} g}} H^{-1} g

where α(0,1)\alpha \in (0,1) is the backtrack coefficient and jj is the smallest integer such that the new policy satisfies the KL constraint and shows a positive surrogate advantage.

TensorFlow API

Note: The Spinning Up TRPO implementation is TensorFlow only — there is no PyTorch version.

from spinup import trpo_tf1 as trpo
import gym

trpo(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=50,
    gamma=0.99,
    delta=0.01,          # KL divergence limit
    vf_lr=1e-3,
    train_v_iters=80,
    damping_coeff=0.1,   # Hessian damping for numerical stability
    cg_iters=10,         # Conjugate gradient iterations
    backtrack_iters=10,
    backtrack_coeff=0.8,
    lam=0.97,
    logger_kwargs=dict(output_dir='/tmp/trpo', exp_name='trpo')
)

Key Hyperparameters

Hyperparameter Default Notes
delta 0.01 KL limit. Typical values: 0.01–0.05. Smaller = more conservative.
damping_coeff 0.1 Adds αI\alpha I to Hessian for stability. Usually don't touch.
cg_iters 10 More iterations → better H1gH^{-1}g approximation, but slower.
backtrack_iters 10 Max backtracking steps (rarely reached).
backtrack_coeff 0.8 Shrink factor per backtrack step.
lam 0.97 GAE-Lambda.

TRPO vs. VPG

VPG TRPO
Update rule Unconstrained gradient KL-constrained optimization
Step size tuning Sensitive to learning rate Robust to step size (constraint handles it)
Computational cost Cheap Expensive (conjugate gradient, line search)
Performance Lower ceiling Higher, more monotonic improvement

In practice, PPO (next reading) achieves similar performance to TRPO at a fraction of the implementation complexity, which is why TRPO sees less use despite its theoretical elegance.