Deep Reinforcement Learning · Policy Gradient Algorithms

Trust Region Policy Optimization (TRPO)

12 min read

By the end of this reading you will be able to:

Explain the core problem that TRPO solves: why large policy gradient steps can cause performance collapse, and how KL-divergence constrains the update
Describe the TRPO update rule: surrogate advantage objective, KL constraint, conjugate gradient, and backtracking line search
Configure TRPO's key hyperparameters: delta (KL limit), damping_coeff, cg_iters, and lam

The Problem with Large Policy Gradient Steps

Vanilla policy gradient keeps new and old policies close in parameter space — it simply takes a step of size $\alpha$ along the gradient. But small moves in parameter space can produce large changes in policy behavior, because the mapping from parameters to action distributions is nonlinear.

A single bad gradient step can collapse performance. Recovering from collapse often requires reducing the learning rate and waiting many epochs — effectively wasting all the data you collected.

The TRPO Idea: Constrain the KL Divergence

TRPO asks: What is the largest performance-improving step we can take such that the new policy stays close to the old one in terms of behavior — not just parameter values?

It formalizes "closeness" using KL divergence: $\bar{D}_{KL}(\theta || \theta_k) = \mathbb{E}_{s \sim \pi_{\theta_k}}\left[D_{KL}\left(\pi_\theta(\cdot|s) \,\|\, \pi_{\theta_k}(\cdot|s)\right)\right]$

The TRPO update solves a constrained optimization: $\theta_{k+1} = \arg\max_\theta \; \mathcal{L}(\theta_k, \theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta || \theta_k) \leq \delta$

where $\mathcal{L}(\theta_k, \theta)$ is the surrogate advantage — a measure of how much $\pi_\theta$ improves on $\pi_{\theta_k}$ using old data: $\mathcal{L}(\theta_k, \theta) = \mathbb{E}_{s,a \sim \pi_{\theta_k}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a)\right]$

The ratio $\pi_\theta(a|s) / \pi_{\theta_k}(a|s)$ is the importance weight — it corrects for the fact that we're evaluating the new policy using data from the old one.

Solving the Constrained Problem Efficiently

The exact constrained optimization is expensive. TRPO uses two approximations:

1. Taylor Expansion

Expand objective and constraint to leading order around $\theta_k$ : $\mathcal{L} \approx g^T(\theta - \theta_k), \quad \bar{D}_{KL} \approx \frac{1}{2}(\theta - \theta_k)^T H (\theta - \theta_k)$

Here $g$ is the policy gradient and $H$ is the Hessian of the KL constraint. The solution is: $\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}} H^{-1} g$

This is exactly the natural policy gradient.

2. Conjugate Gradient

Computing $H^{-1} g$ directly is $O(n^2)$ in parameters — infeasible for large neural networks. TRPO instead solves $Hx = g$ using the conjugate gradient algorithm, which only requires computing matrix-vector products $Hv$ (not storing $H$ ). Computing $Hv = \nabla_\theta[(\nabla_\theta \bar{D}_{KL})^T v]$ is a second-order AD operation available in PyTorch and TensorFlow.

3. Backtracking Line Search

Because of Taylor approximation errors, the analytical step might violate the constraint. TRPO adds a backtracking line search: $\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^T H^{-1} g}} H^{-1} g$

where $\alpha \in (0,1)$ is the backtrack coefficient and $j$ is the smallest integer such that the new policy satisfies the KL constraint and shows a positive surrogate advantage.

TensorFlow API

Note: The Spinning Up TRPO implementation is TensorFlow only — there is no PyTorch version.

from spinup import trpo_tf1 as trpo
import gym

trpo(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=50,
    gamma=0.99,
    delta=0.01,          # KL divergence limit
    vf_lr=1e-3,
    train_v_iters=80,
    damping_coeff=0.1,   # Hessian damping for numerical stability
    cg_iters=10,         # Conjugate gradient iterations
    backtrack_iters=10,
    backtrack_coeff=0.8,
    lam=0.97,
    logger_kwargs=dict(output_dir='/tmp/trpo', exp_name='trpo')
)

Key Hyperparameters

Hyperparameter	Default	Notes
`delta`	0.01	KL limit. Typical values: 0.01–0.05. Smaller = more conservative.
`damping_coeff`	0.1	Adds $\alpha I$ to Hessian for stability. Usually don't touch.
`cg_iters`	10	More iterations → better $H^{-1}g$ approximation, but slower.
`backtrack_iters`	10	Max backtracking steps (rarely reached).
`backtrack_coeff`	0.8	Shrink factor per backtrack step.
`lam`	0.97	GAE-Lambda.

TRPO vs. VPG

	VPG	TRPO
Update rule	Unconstrained gradient	KL-constrained optimization
Step size tuning	Sensitive to learning rate	Robust to step size (constraint handles it)
Computational cost	Cheap	Expensive (conjugate gradient, line search)
Performance	Lower ceiling	Higher, more monotonic improvement

In practice, PPO (next reading) achieves similar performance to TRPO at a fraction of the implementation complexity, which is why TRPO sees less use despite its theoretical elegance.

References

Schulman et al. 2015 — Trust Region Policy Optimization

Kakade & Langford 2002 — Approximately Optimal Approximate Reinforcement Learning

OpenAI Spinning Up — TRPO Documentation

Previous Next →

Trust Region Policy Optimization (TRPO)

The Problem with Large Policy Gradient Steps

The TRPO Idea: Constrain the KL Divergence

Solving the Constrained Problem Efficiently

1. Taylor Expansion

2. Conjugate Gradient

3. Backtracking Line Search

TensorFlow API

Key Hyperparameters

TRPO vs. VPG

Privacy Policy

What we collect

What we don't collect

Your choices

Contact