Deep Reinforcement Learning · RL Foundations

Policy Gradient Theory

By the end of this reading you will be able to:
  • Derive the basic policy gradient using the log-derivative trick, arriving at the REINFORCE estimator
  • Explain the reward-to-go trick and why it reduces variance without introducing bias
  • Implement a minimal policy gradient training loop in PyTorch, including the pseudo-loss and gradient step
  • Explain why a baseline (e.g., the value function) can be subtracted from the return without biasing the gradient

The Policy Gradient Objective

We want to find policy parameters $\theta$ that maximize expected return: $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$

We'll do this with gradient ascent: $\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$

The challenge: how do we compute $\nabla_\theta J(\pi_\theta)$ when the expectation is over trajectories that depend on $\theta$ in a complex way?

Deriving the Policy Gradient

Step 1: Expand the Expectation

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \int_\tau P(\tau|\theta)\, R(\tau)$$

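Step 2: Apply the Log-Derivative Trick

For any differentiable, strictly positive $P(\tau|\theta)$, the chain rule gives $\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)$. Moving the gradient inside the integral and substituting this identity turns the gradient of an expectation into an expectation we can sample from: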

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\, \nabla_\theta \log P(\tau|\theta)\right]$$
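To make the identity concrete, here is a minimal sketch that checks it on a toy problem where the "trajectory" is a single draw from a three-outcome softmax distribution. The setup, the function names, and the numbers are illustrative assumptions, not part of the derivation above. The expectation is computed exactly by enumeration and compared against a finite-difference gradient of the expected return.

```python
import math

def softmax(theta):
    """Probabilities of a categorical distribution with logits theta."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, returns):
    """J(theta) = sum_x P(x|theta) * R(x) over a finite outcome set."""
    return sum(p * r for p, r in zip(softmax(theta), returns))

def score_function_gradient(theta, returns):
    """E_x[R(x) * grad log P(x|theta)], computed exactly by enumeration.

    For softmax logits, d log P(x) / d theta_k = 1[x == k] - P(k).
    """
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for x, (p_x, r_x) in enumerate(zip(probs, returns)):
        for k in range(len(theta)):
            dlogp = (1.0 if x == k else 0.0) - probs[k]
            grad[k] += p_x * r_x * dlogp
    return grad

def finite_difference_gradient(theta, returns, eps=1e-6):
    """Central-difference approximation of grad J(theta)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        diff = expected_return(up, returns) - expected_return(down, returns)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical logits and per-outcome returns, chosen only for illustration.
theta = [0.2, -0.5, 1.0]
returns = [1.0, 3.0, -2.0]
g_score = score_function_gradient(theta, returns)
g_fd = finite_difference_gradient(theta, returns)
```

The two gradients should agree up to finite-difference error, which is the log-derivative trick in miniature.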

Step 3: Simplify the Log-Probability

The log-probability of a trajectory factors as: $\log P(\tau|\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \left[\log P(s_{t+1}|s_t,a_t) + \log \pi_\theta(a_t|s_t)\right]$

The gradient $\nabla_\theta$ kills all terms that don't depend on $\theta$: the start-state distribution and the environment transitions. Only the policy log-probs survive: $\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t)$
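Written out term by term, the simplification looks like this. It is the same calculation as above, with the vanishing terms marked explicitly:

```latex
\begin{aligned}
\nabla_\theta \log P(\tau|\theta)
  &= \underbrace{\nabla_\theta \log \rho_0(s_0)}_{=\,0}
   + \sum_{t=0}^{T} \Big[
       \underbrace{\nabla_\theta \log P(s_{t+1}|s_t,a_t)}_{=\,0}
     + \nabla_\theta \log \pi_\theta(a_t|s_t)
     \Big] \\
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
\end{aligned}
```

A practical consequence is that the estimator never needs a model of the environment dynamics, only the policy's own log-probabilities.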

The Basic Policy Gradient (REINFORCE)

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)\right]$$

This can be estimated from samples. Collect $N$ trajectories and compute: $\hat{g} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \cdot R(\tau^{(i)})$
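As a rough illustration of the sample estimate, the sketch below assumes the simplest possible setting: one-step episodes ($T = 0$) with a softmax policy over three actions and a fixed reward per action. The environment, the helper names, and the sample size are all assumptions made for the example. Because the problem is tiny, the exact gradient is available in closed form, so $\hat{g}$ can be compared against it.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(probs, rng):
    """Draw an action index from a categorical distribution."""
    u = rng.random()
    cumulative = 0.0
    for a, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return a
    return len(probs) - 1  # guard against floating-point round-off

def reinforce_estimate(theta, reward_per_action, num_trajectories, rng):
    """g_hat = (1/N) * sum_i grad log pi(a_i) * R(tau_i), one-step episodes."""
    probs = softmax(theta)
    g_hat = [0.0] * len(theta)
    for _ in range(num_trajectories):
        a = sample_action(probs, rng)
        ret = reward_per_action[a]  # R(tau) for a one-step trajectory
        for k in range(len(theta)):
            dlogp = (1.0 if a == k else 0.0) - probs[k]
            g_hat[k] += dlogp * ret / num_trajectories
    return g_hat

def exact_gradient(theta, reward_per_action):
    """Closed form for this toy problem: dJ/dtheta_k = pi(k) * (r_k - J)."""
    probs = softmax(theta)
    j = sum(p * r for p, r in zip(probs, reward_per_action))
    return [probs[k] * (reward_per_action[k] - j) for k in range(len(theta))]

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]              # uniform initial policy
reward_per_action = [1.0, 0.0, 2.0]  # hypothetical rewards
g_hat = reinforce_estimate(theta, reward_per_action, 20000, rng)
g_exact = exact_gradient(theta, reward_per_action)
```

With more trajectories the estimate tightens around the exact gradient; with few trajectories it is noticeably noisy, which motivates the variance-reduction techniques later in this reading.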

Implementing the Policy Gradient in PyTorch

from torch.distributions import Categorical

def compute_loss(obs, acts, weights, policy_net):
    """Policy gradient pseudo-loss.
    
    The gradient of this function (w.r.t. policy parameters)
    equals the policy gradient estimator.
    Note: this is NOT a standard supervised loss.
    """
    logits = policy_net(obs)
    dist = Categorical(logits=logits)
    logp = dist.log_prob(acts)         # shape: (batch,)
    return -(logp * weights).mean()    # negative because we want ascent

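# Training step (assumes `policy`, `optimizer`, and the batch tensors
# `obs_batch`, `act_batch`, `return_batch` were created earlier, with the
# batch collected by running the current policy):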
optimizer.zero_grad()
batch_loss = compute_loss(obs_batch, act_batch, return_batch, policy)
batch_loss.backward()
optimizer.step()

Important warning: This loss is not a loss function in the usual sense.

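  1. The data depends on the parameters: The batch must be collected with the current policy. Data from an older policy no longer follows the distribution the expectation is taken over, so the estimate becomes biased.
  2. Doesn't measure performance: Minimizing this loss on a fixed batch is not guaranteed to improve $J(\pi_\theta)$; the loss can be driven arbitrarily low while the policy's actual return gets worse. The loss is only useful for one gradient step per batch, after which fresh data should be collected.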

The only metric that matters during training is average episode return.

Reward-to-Go: Don't Let the Past Distract You

In the REINFORCE gradient, every log-prob at time $t$ is weighted by the entire trajectory return $R(\tau)$. But rewards before time $t$ cannot have been caused by action $a_t$. Including them just adds noise.

The reward-to-go $\hat{R}_t$ is the sum of rewards from time $t$ onward: $\hat{R}_t = \sum_{t'=t}^T r_{t'}$

The policy gradient can equivalently be written as: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{R}_t\right]$

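Same expectation, but typically lower variance: the terms that pair $\nabla_\theta \log \pi_\theta(a_t|s_t)$ with rewards received before time $t$ have zero mean, so dropping them removes noise without changing the gradient.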
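A minimal sketch of the reward-to-go computation, assuming the rewards of one episode are stored in a plain Python list (the function name and sample rewards are illustrative). A single backward pass accumulates the running sum, so the cost is linear in the episode length.

```python
def reward_to_go(rewards):
    """Return [R_hat_0, ..., R_hat_T], where R_hat_t = sum(rewards[t:])."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Hypothetical rewards from one four-step episode.
rewards = [1.0, 0.0, 2.0, -1.0]

# Basic REINFORCE weights: the full return R(tau), repeated for every step.
weights_full = [sum(rewards)] * len(rewards)

# Reward-to-go weights: each step only sees rewards from that step onward.
weights_rtg = reward_to_go(rewards)
```

Either list can serve as the per-timestep weight for the log-probabilities; only the first entry is guaranteed to match between the two.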

Baselines: Further Variance Reduction

We can subtract any baseline $b(s_t)$ from the return weights without introducing bias, as long as the baseline doesn't depend on the action: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \left(\hat{R}_t - b(s_t)\right)\right]$

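This follows from the expected grad-log-prob (EGLP) lemma, $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$. Since $b(s)$ does not depend on $a$, it factors out of the expectation: $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \cdot 0 = 0$.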
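The lemma can be checked numerically on a small case. The sketch below assumes a single state with a softmax policy over three actions and made-up per-action returns; all names and values are hypothetical. It computes the expected gradient exactly by enumerating actions, with and without a constant baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(probs, action):
    """Gradient of log pi(action) w.r.t. softmax logits: one_hot(action) - probs."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def exact_gradient(theta, action_returns, baseline=0.0):
    """E_{a~pi}[grad log pi(a) * (return(a) - baseline)], by enumeration."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, p_a in enumerate(probs):
        score = grad_log_pi(probs, a)
        for k in range(len(theta)):
            grad[k] += p_a * score[k] * (action_returns[a] - baseline)
    return grad

theta = [0.3, -1.2, 0.8]
action_returns = [1.0, 4.0, -2.0]  # hypothetical returns for each action

g_plain = exact_gradient(theta, action_returns)
g_shifted = exact_gradient(theta, action_returns, baseline=10.0)

# EGLP itself: weighting every action by the same constant gives zero.
eglp = exact_gradient(theta, [1.0, 1.0, 1.0])
```

The expected gradient is unchanged by the baseline. What the baseline does change is the spread of the individual sample terms around that expectation.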

The most common choice of baseline is the on-policy value function $V^\pi(s_t)$, because the weight then becomes a sample estimate of the advantage: $\hat{R}_t - V^\pi(s_t) \approx A^\pi(s_t, a_t)$

In practice, a learned value function $V_\phi(s)$ is used as a critic to estimate $V^\pi$. This is the actor-critic architecture:

  • Actor (policy $\pi_\theta$): decides actions.
  • Critic (value function $V_\phi$): estimates how good the current state is.

Generalized Advantage Estimation (GAE)

In practice, VPG and its successors use Generalized Advantage Estimation (GAE-λ) to navigate the bias-variance tradeoff between high-variance Monte Carlo returns (λ=1) and low-variance but biased 1-step TD estimates (λ=0):

$$\hat{A}_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

Here $\gamma$ is the discount factor and $\delta_t$ is the one-step TD residual. A typical value is $\lambda = 0.97$. The Spinning Up implementation uses GAE-Lambda throughout VPG, TRPO, and PPO.
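A minimal sketch of the GAE computation for a single finite episode, written in plain Python rather than taken from any particular library. It assumes `values` holds one extra entry for the state after the last reward (0.0 if the episode terminated), so the infinite sum is truncated at the end of the episode. The function name, argument layout, and sample numbers are assumptions for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """GAE(lambda) advantages for one episode.

    rewards: [r_0, ..., r_T]
    values:  [V(s_0), ..., V(s_T), V(s_{T+1})], where the last entry is the
             bootstrap value (use 0.0 when s_{T+1} is terminal).
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    running = 0.0
    for t in reversed(range(num_steps)):
        # One-step TD residual delta_t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursive form of the sum: A_t = delta_t + (gamma * lam) * A_{t+1}.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical three-step episode ending in a terminal state.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]

adv_td = gae_advantages(rewards, values, gamma=0.9, lam=0.0)  # 1-step TD
adv_mc = gae_advantages(rewards, values, gamma=0.9, lam=1.0)  # Monte Carlo
```

With `lam=0.0` each advantage reduces to the TD residual $\delta_t$; with `lam=1.0` it equals the discounted reward-to-go minus $V_\phi(s_t)$, matching the two endpoints of the tradeoff described above.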