Policy Gradient Theory
- Derive the basic policy gradient using the log-derivative trick, arriving at the REINFORCE estimator
- Explain the reward-to-go trick and why it reduces variance without introducing bias
- Implement a minimal policy gradient training loop in PyTorch, including the pseudo-loss and gradient step
- Explain why a baseline (e.g., the value function) can be subtracted from the return without biasing the gradient
The Policy Gradient Objective
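We want to find policy parameters $\theta$ that maximize the expected return:
$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$
We'll do this with gradient ascent, using a step size $\alpha$:
$$\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$$
The challenge: how do we compute $\nabla_\theta J(\pi_\theta)$ when the expectation is over trajectories $\tau$ whose distribution depends on $\theta$ in a complex way?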
Deriving the Policy Gradient
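Step 1: Expand the Expectation
Write the expectation as an integral over trajectories, with $P(\tau \mid \theta)$ the probability of trajectory $\tau$ under $\pi_\theta$, and move the gradient inside:
$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \int_\tau P(\tau \mid \theta) \, R(\tau) \, d\tau = \int_\tau \nabla_\theta P(\tau \mid \theta) \, R(\tau) \, d\tau$$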
Step 2: The Log-Derivative Trick
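For any differentiable, positive function $f(\theta)$, $\nabla_\theta f = f \, \nabla_\theta \log f$. Applied to $f = P(\tau \mid \theta)$, this turns the integral of a gradient of a probability into an expectation:
$$\nabla_\theta J(\pi_\theta) = \int_\tau P(\tau \mid \theta) \, \nabla_\theta \log P(\tau \mid \theta) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau \mid \theta) \, R(\tau)\right]$$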
Step 3: Simplify the Log-Probability
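The log-probability of a trajectory $\tau = (s_0, a_0, \dots, s_{T+1})$ factors as:
$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big( \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big)$$
The gradient kills all terms that don't depend on $\theta$: the start-state distribution $\rho_0$ and the environment transitions $P(s_{t+1} \mid s_t, a_t)$. Only the policy log-probs survive:
$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$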
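The Basic Policy Gradient (REINFORCE)
Putting the steps together gives the basic policy gradient:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)\right]$$
Because this is an expectation, it can be estimated from samples. Collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$ by running the current policy $\pi_\theta$ and compute:
$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)$$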
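As a sanity check on the derivation, the sketch below compares the sample-based estimate against the exact gradient in the simplest possible setting: a one-step problem (a bandit) with a softmax policy over three actions. The logits, rewards, sample count, and function names are all illustrative assumptions, not part of the original material, and the example uses only the Python standard library.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(logits, rewards):
    """Exact gradient of J = sum_a pi(a) * r(a) with respect to the logits."""
    pi = softmax(logits)
    j = sum(p * r for p, r in zip(pi, rewards))
    # d pi(a) / d logit_k = pi(a) * (1[a == k] - pi(k)),
    # so dJ / d logit_k = pi(k) * (r(k) - J).
    return [p * (r - j) for p, r in zip(pi, rewards)]


def sampled_gradient(logits, rewards, num_samples, rng):
    """Score-function estimate: average of grad log pi(a) * r(a) over samples."""
    pi = softmax(logits)
    n = len(logits)
    actions = rng.choices(range(n), weights=pi, k=num_samples)
    grad = [0.0] * n
    for a in actions:
        for k in range(n):
            # d log pi(a) / d logit_k = 1[a == k] - pi(k)
            score = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += score * rewards[a] / num_samples
    return grad


# Hypothetical problem: three actions with fixed rewards.
logits = [0.2, -0.1, 0.5]
rewards = [1.0, 0.0, 2.0]
exact = exact_gradient(logits, rewards)
estimate = sampled_gradient(logits, rewards, 100_000, random.Random(0))
```

With enough samples the two vectors should agree up to Monte Carlo noise, which is the practical content of the log-derivative trick: the gradient of an expectation is recovered using only sampled actions and their log-probability gradients.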
Implementing the Policy Gradient in PyTorch
```python
from torch.distributions import Categorical


def compute_loss(obs, acts, weights, policy_net):
    """Policy gradient pseudo-loss.

    The gradient of this function (w.r.t. policy parameters) is the
    negative of the policy gradient estimator, so minimizing it with a
    standard optimizer performs gradient ascent on expected return.
    Note: this is NOT a standard supervised loss.
    """
    logits = policy_net(obs)             # shape: (batch, n_actions)
    dist = Categorical(logits=logits)
    logp = dist.log_prob(acts)           # shape: (batch,)
    return -(logp * weights).mean()      # negative because we want ascent


# Training step. `policy`, `optimizer`, and the *_batch tensors are assumed
# to be defined elsewhere; the batch must be collected with the current policy.
optimizer.zero_grad()
batch_loss = compute_loss(obs_batch, act_batch, return_batch, policy)
batch_loss.backward()
optimizer.step()
```
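Important warning: This "loss" is not a loss function in the usual supervised-learning sense.
- The data distribution depends on the parameters: The data must be collected with the current policy. Using stale data from an older policy breaks the estimator.
- It doesn't measure performance: Minimizing this loss on a fixed batch will not reliably improve the expected return $J(\pi_\theta)$, and the loss can keep decreasing while actual performance gets worse. The loss is only useful for one gradient step per batch.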
The only metric that matters during training is average episode return.
Reward-to-Go: Don't Let the Past Distract You
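In the REINFORCE gradient, every log-prob at time $t$ is weighted by the entire trajectory return $R(\tau)$. But rewards received before time $t$ cannot have been caused by action $a_t$. Including them just adds noise.
The reward-to-go is the sum of rewards from time $t$ onward, where $r_{t'}$ denotes the reward received at step $t'$:
$$\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$$
The policy gradient can equivalently be written as:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{R}_t\right]$$
The expectation is the same, but the variance is lower: past rewards, whose product with $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ has zero mean, no longer appear as weights.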
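The reward-to-go weights for one episode can be computed in a single backward pass over the rewards. The following is a minimal sketch in plain Python; the function name `reward_to_go` and the example rewards are illustrative assumptions, and no discounting is applied.

```python
def reward_to_go(rewards):
    """Return [sum(rewards[t:]) for each t], computed in one backward pass."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


# Hypothetical 4-step episode.
episode_rewards = [1.0, 0.0, 2.0, 3.0]

# Plain REINFORCE weights every time step by the full return ...
full_return_weights = [sum(episode_rewards)] * len(episode_rewards)
# ... while reward-to-go only counts rewards from step t onward.
rtg_weights = reward_to_go(episode_rewards)
```

For this episode the full-return weights are all 6.0, while the reward-to-go weights shrink toward the end of the episode because earlier rewards are dropped.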
Baselines: Further Variance Reduction
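We can subtract any baseline $b(s_t)$ from the reward-to-go weights $\hat{R}_t$ without introducing bias, as long as the baseline depends only on the state and not on the action:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \big(\hat{R}_t - b(s_t)\big)\right]$$
This is because $\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t)\right] = 0$, a consequence of the expected grad-log-prob (EGLP) lemma.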
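The zero-mean property behind the baseline can be checked in a few lines. The derivation below is a sketch for a discrete action space (for continuous actions the sum becomes an integral); the symbol $b(s)$ for the baseline is notation assumed here.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
  &= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
  &= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) \\
  &= b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

The step that pulls $b(s)$ out of the sum is exactly where action-independence is needed: a baseline that depended on $a$ could not be factored out, and the argument would fail.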
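The most common baseline is the on-policy value function $V^\pi(s_t)$, because then the weight becomes an estimate of the advantage:
$$\hat{R}_t - V^\pi(s_t) \approx A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
In practice, a learned value function $V_\phi(s_t)$ is used as a critic to estimate $V^\pi(s_t)$. This is the actor-critic architecture:
- Actor (policy $\pi_\theta$): decides actions.
- Critic (value function $V_\phi$): estimates how good the current state is.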
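To make the effect of a baseline concrete, the sketch below reuses a one-step softmax problem and compares per-sample gradient estimates with and without subtracting the state value. Everything here (the rewards, the uniform logits, the helper names, the sample count) is an illustrative assumption; only the Python standard library is used.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score_samples(logits, rewards, baseline, num_samples, rng):
    """Per-sample estimates of dJ/d logit_0 using the weight (r(a) - baseline)."""
    pi = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=pi, k=num_samples)
    # d log pi(a) / d logit_0 = 1[a == 0] - pi(0)
    return [((1.0 if a == 0 else 0.0) - pi[0]) * (rewards[a] - baseline)
            for a in actions]


# Hypothetical problem: rewards are large but differ only slightly.
logits = [0.0, 0.0, 0.0]
rewards = [10.0, 11.0, 12.0]
pi = softmax(logits)
value = sum(p * r for p, r in zip(pi, rewards))  # expected reward, plays the role of V(s)

# Same seed, so both estimators see the same sampled actions.
no_baseline = score_samples(logits, rewards, 0.0, 50_000, random.Random(1))
with_baseline = score_samples(logits, rewards, value, 50_000, random.Random(1))

mean_no, mean_with = statistics.fmean(no_baseline), statistics.fmean(with_baseline)
var_no, var_with = statistics.pvariance(no_baseline), statistics.pvariance(with_baseline)
```

Both sample means should land near the exact gradient (here $-1/3$ for the first logit), while the variance of the baseline-corrected samples should be far smaller, since the weights now measure how much better or worse an action is than average rather than the raw reward.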
Generalized Advantage Estimation (GAE)
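In practice, VPG and its successors use Generalized Advantage Estimation (GAE-λ) to navigate the bias-variance tradeoff between high-variance Monte Carlo returns (λ=1) and low-variance but biased 1-step TD estimates (λ=0). With discount factor $\gamma$ and TD residual $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, the estimator is:
$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \, \delta_{t+l}$$
A typical value is $\lambda$ between 0.95 and 0.97. The Spinning Up implementation uses GAE-Lambda throughout VPG, TRPO, and PPO.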
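GAE-λ advantages for a single trajectory can be computed with a backward recursion over the TD residuals. The sketch below is a plain-Python illustration under stated assumptions: the function name, the `last_value` bootstrap argument, and the default `gamma` and `lam` values are chosen for this example and are not taken from any particular implementation.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE-lambda advantages for one trajectory.

    rewards[t] is r_t, values[t] is the critic's V(s_t), and last_value is
    V(s_T+1) used for bootstrapping (0.0 if the episode terminated).
    Uses A_t = delta_t + gamma * lam * A_{t+1}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    values_ext = list(values) + [last_value]
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


# Hypothetical 3-step episode that ends in a terminal state.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]

td_adv = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=0.0)  # 1-step TD residuals
mc_adv = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=1.0)  # reward-to-go minus V(s_t)
```

The two extreme settings recover the endpoints of the tradeoff: with `lam=0.0` each advantage is just the 1-step TD residual, and with `lam=1.0` (and `gamma=1.0`) the sum telescopes to the reward-to-go minus the critic's value for that state.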