Policy Gradient Theory

By the end of this reading you will be able to:

Derive the basic policy gradient using the log-derivative trick, arriving at the REINFORCE estimator
Explain the reward-to-go trick and why it reduces variance without introducing bias
Implement a minimal policy gradient training loop in PyTorch, including the pseudo-loss and gradient step
Explain why a baseline (e.g., the value function) can be subtracted from the return without biasing the gradient

The Policy Gradient Objective

We want to find policy parameters $\theta$ that maximize expected return: $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$

We'll do this with gradient ascent: $\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$

The challenge: how do we compute $\nabla_\theta J(\pi_\theta)$ when the expectation is over trajectories that depend on $\theta$ in a complex way?

Deriving the Policy Gradient

Step 1: Expand the Expectation

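Write the expectation as an integral over trajectories and move the gradient inside the integral: $\nabla_\theta J(\pi_\theta) = \nabla_\theta \int_\tau P(\tau|\theta)\, R(\tau)\, d\tau = \int_\tau \nabla_\theta P(\tau|\theta)\, R(\tau)\, d\tau$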

Step 2: The Log-Derivative Trick

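Because $\nabla_\theta \log P = \nabla_\theta P / P$, any trajectory probability that is positive and differentiable in $\theta$ satisfies $\nabla_\theta P(\tau|\theta) = P(\tau|\theta) \nabla_\theta \log P(\tau|\theta)$ . Substituting this into the integral puts a factor of $P(\tau|\theta)$ back in front, so the gradient becomes an expectation over trajectories sampled from the current policy: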

$\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \nabla_\theta \log P(\tau|\theta)\right]$
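The identity can be sanity-checked numerically. The sketch below assumes a one-step problem (a single action per trajectory) with a softmax policy, so the expectation can be enumerated exactly; the logits and rewards are hypothetical values chosen for illustration. It compares the expectation form of the gradient with a finite-difference gradient of $J$ itself.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * R(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards):
    """E_a[R(a) * grad_theta log pi_theta(a)], computed by exact enumeration.

    For a softmax policy, d/d theta_k log pi(a) = 1[a == k] - pi(k).
    """
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(pi, rewards)):
        for k in range(len(theta)):
            grad_logp = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += p_a * r_a * grad_logp
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference approximation of grad_theta J(theta)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = expected_return(up, rewards) - expected_return(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

theta = [0.5, 0.0, -0.5]    # hypothetical logits
rewards = [1.0, 2.0, 4.0]   # hypothetical return for each action

g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print(g_score)
print(g_fd)
```

The two printed vectors should agree up to finite-difference error, which is what the log-derivative trick predicts.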

Step 3: Simplify the Log-Probability

The log-probability of a trajectory factors as: $\log P(\tau|\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \left[\log P(s_{t+1}|s_t,a_t) + \log \pi_\theta(a_t|s_t)\right]$

The gradient $\nabla_\theta$ kills all terms that don't depend on $\theta$ : the start-state distribution and the environment transitions. Only the policy log-probs survive: $\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t)$

The Basic Policy Gradient (REINFORCE)

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)\right]$

This can be estimated from samples. Collect $N$ trajectories and compute: $\hat{g} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \cdot R(\tau^{(i)})$
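As a rough illustration of the sample estimate, the sketch below assumes the same kind of one-step softmax problem (each "trajectory" is a single action), with hypothetical logits, rewards, and sample count. It draws actions from the policy, averages $R \cdot \nabla_\theta \log \pi_\theta$, and compares the result with the exact gradient, which for this toy case has the closed form $\pi_k (R_k - \mathbb{E}[R])$.

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def sample_gradient_estimate(theta, rewards, num_samples, rng):
    """Monte Carlo estimate g_hat = (1/N) * sum_i R(a_i) * grad log pi(a_i)."""
    pi = softmax(theta)
    actions = list(range(len(theta)))
    g_hat = [0.0] * len(theta)
    for _ in range(num_samples):
        a = rng.choices(actions, weights=pi)[0]   # sample from the current policy
        for k in actions:
            grad_logp = (1.0 if a == k else 0.0) - pi[k]
            g_hat[k] += rewards[a] * grad_logp
    return [g / num_samples for g in g_hat]

def exact_gradient(theta, rewards):
    """Closed form for the one-step softmax case: pi_k * (R_k - E[R])."""
    pi = softmax(theta)
    mean_reward = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - mean_reward) for p, r in zip(pi, rewards)]

theta = [0.5, 0.0, -0.5]    # hypothetical logits
rewards = [1.0, 2.0, 4.0]   # hypothetical return for each action
rng = random.Random(0)      # fixed seed so the run is repeatable

g_hat = sample_gradient_estimate(theta, rewards, num_samples=20000, rng=rng)
g_exact = exact_gradient(theta, rewards)
print(g_hat)
print(g_exact)
```

With more samples the estimate tightens around the exact value; with few samples it is noisy, which is the variance problem the later sections address.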

Implementing the Policy Gradient in PyTorch

from torch.distributions import Categorical

def compute_loss(obs, acts, weights, policy_net):
    """Policy gradient pseudo-loss.
    
    The gradient of this function (w.r.t. policy parameters)
    equals the policy gradient estimator.
    Note: this is NOT a standard supervised loss.
    """
    logits = policy_net(obs)
    dist = Categorical(logits=logits)
    logp = dist.log_prob(acts)         # shape: (batch,)
    return -(logp * weights).mean()    # negative because we want ascent

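# Training step (assumes `policy`, `optimizer`, and the batch tensors
# `obs_batch`, `act_batch`, `return_batch` were created beforehand):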
optimizer.zero_grad()
batch_loss = compute_loss(obs_batch, act_batch, return_batch, policy)
batch_loss.backward()
optimizer.step()

Important warning: This loss is not a loss function in the usual sense.

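The data distribution depends on the parameters: The batch must be sampled with the current policy, so the data changes whenever $\theta$ changes. Reusing stale data from an older policy breaks the estimator.
Doesn't measure performance: The pseudo-loss only has the right gradient at the current parameters. Minimizing it on a fixed batch can drive the loss arbitrarily low without improving $J(\pi_\theta)$ , so each batch is only good for one gradient step.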

The only metric that matters during training is average episode return.
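To make the on-policy requirement concrete, here is a minimal sketch in plain Python rather than PyTorch. It assumes a toy three-action, one-step problem with hypothetical rewards, and the function name `train_bandit` and all hyperparameters are illustrative choices. Each iteration samples a fresh batch from the current policy, takes exactly one gradient ascent step, and records the batch's average return as the progress metric.

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def train_bandit(rewards, num_iters=200, batch_size=64, lr=0.1, seed=0):
    """One gradient ascent step per freshly sampled on-policy batch."""
    rng = random.Random(seed)
    n = len(rewards)
    actions = list(range(n))
    theta = [0.0] * n            # start from a uniform policy
    avg_returns = []
    for _ in range(num_iters):
        pi = softmax(theta)      # the policy that generates this batch
        batch = rng.choices(actions, weights=pi, k=batch_size)
        grad = [0.0] * n
        for a in batch:
            for k in actions:
                grad_logp = (1.0 if a == k else 0.0) - pi[k]
                grad[k] += rewards[a] * grad_logp
        # Ascent step on the averaged estimate; the batch is then discarded.
        theta = [t + lr * g / batch_size for t, g in zip(theta, grad)]
        avg_returns.append(sum(rewards[a] for a in batch) / batch_size)
    return theta, avg_returns

rewards = [1.0, 2.0, 4.0]        # hypothetical return for each action
theta, avg_returns = train_bandit(rewards)
final_policy = softmax(theta)
final_expected_return = sum(p * r for p, r in zip(final_policy, rewards))
print(avg_returns[0], avg_returns[-1], final_expected_return)
```

The quantity worth watching is `avg_returns`, not any loss value: it should drift upward as the policy shifts probability toward the highest-reward action.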

Reward-to-Go: Don't Let the Past Distract You

In the REINFORCE gradient, every log-prob at time $t$ is weighted by the entire trajectory return $R(\tau)$ . But rewards before time $t$ cannot have been caused by action $a_t$ . Including them just adds noise.

The reward-to-go $\hat{R}_t$ is the sum of rewards from time $t$ onward: $\hat{R}_t = \sum_{t'=t}^T r_{t'}$

The policy gradient can equivalently be written as: $\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{R}_t\right]$

The expectation is unchanged, but the variance is lower: rewards earned before time $t$ contribute zero to the gradient in expectation, so dropping them from the weight on $a_t$ removes only noise.
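Computing the reward-to-go for one episode is a single backward pass over the rewards. The helper below is a minimal sketch with a hypothetical name (`reward_to_go`) and a made-up reward sequence; it uses the undiscounted sum defined above.

```python
def reward_to_go(rewards):
    """Return [R_hat_0, ..., R_hat_T], where R_hat_t = sum of rewards[t:]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each entry reuses the sum already accumulated after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

episode_rewards = [1.0, 0.0, 2.0, 3.0]   # hypothetical rewards r_0..r_3
print(reward_to_go(episode_rewards))     # [6.0, 5.0, 5.0, 3.0]
```

These per-timestep values would replace the single trajectory return $R(\tau)$ as the weights on the log-probabilities.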

Baselines: Further Variance Reduction

We can subtract any baseline $b(s_t)$ from the return weights without introducing bias, as long as the baseline doesn't depend on the action: $\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \left(\hat{R}_t - b(s_t)\right)\right]$

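This follows from the Expected Grad-Log-Prob (EGLP) lemma. Because $\pi_\theta(\cdot|s)$ is normalized, $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = \int_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \int_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$ . Since $b(s)$ does not depend on the action, it factors out of the expectation, giving $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = 0$ .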

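The most common choice of baseline is the on-policy value function $V^\pi(s_t)$ . It is not guaranteed to be the variance-minimizing baseline, but it is a natural one, because the weight then becomes a sample estimate of the advantage: $\hat{R}_t - V^\pi(s_t) \approx A^\pi(s_t, a_t)$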
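Both claims about baselines (no bias, less variance) can be checked exactly on a toy case. The sketch assumes a single-state, one-step problem with a softmax policy and hypothetical logits and rewards, so the value function is just the expected reward. It enumerates all actions to get the exact mean and total variance of the estimator with and without the baseline. The variance reduction is a property of these particular numbers, not a general guarantee.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def estimator_moments(theta, rewards, baseline):
    """Exact mean and total variance of (R(a) - b) * grad log pi(a), a ~ pi."""
    pi = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        weight = rewards[a] - baseline
        g = [weight * ((1.0 if a == k else 0.0) - pi[k]) for k in range(n)]
        for k in range(n):
            mean[k] += pi[a] * g[k]
        second_moment += pi[a] * sum(x * x for x in g)
    # Total variance = E[|g|^2] - |E[g]|^2, summed over components.
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

theta = [0.5, 0.0, -0.5]    # hypothetical logits
rewards = [1.0, 2.0, 4.0]   # hypothetical return for each action
value = sum(p * r for p, r in zip(softmax(theta), rewards))  # V for the single state

mean_plain, var_plain = estimator_moments(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(theta, rewards, baseline=value)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

The two means match, which is the EGLP lemma at work, while the variance with the value baseline is noticeably smaller for this example.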

In practice, a learned value function $V_\phi(s)$ is used as a critic to estimate $V^\pi$ . This is the actor-critic architecture:

Actor (policy $\pi_\theta$ ): decides actions.
Critic (value function $V_\phi$ ): estimates how good the current state is.

Generalized Advantage Estimation (GAE)

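In practice, VPG and its successors use Generalized Advantage Estimation (GAE-λ) to navigate the bias-variance tradeoff. Setting λ=1 recovers the high-variance Monte Carlo estimate, and λ=0 gives the lower-variance 1-step TD estimate, which is biased whenever the critic $V_\phi$ is inaccurate. With discount factor $\gamma$ and TD residual $\delta_t$ :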

$\hat{A}_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$

A typical value is $\lambda = 0.97$ . The Spinning Up implementation uses GAE-Lambda throughout VPG, TRPO, and PPO.
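For a finite episode, the sum can be computed with the backward recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. The sketch below is a plain-Python illustration under stated assumptions: the function name `gae_advantages`, its arguments, and the example numbers are hypothetical, and the infinite sum is truncated at the end of the episode, with `last_value` standing in for the critic's estimate of the state after the final step (0 if the episode terminated).

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE(lambda) advantages for one episode.

    rewards[t] = r_t and values[t] = V(s_t) for t = 0..T;
    last_value = V(s_{T+1}), or 0.0 if the episode ended in a terminal state.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # backward recursion
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = [1.0, 0.0, 2.0, 3.0]   # hypothetical rewards
values = [0.5, 1.0, 1.5, 2.0]    # hypothetical critic outputs V(s_t)

# lam = 0: each advantage is just the 1-step TD residual.
td = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
# lam = 1 (with gamma = 1): reward-to-go minus the value baseline.
mc = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
print(td)   # [1.5, 0.5, 2.5, 1.0]
print(mc)   # [5.5, 4.0, 3.5, 1.0]
```

Intermediate values of `lam` interpolate between these two extremes.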

References

OpenAI Spinning Up — Part 3: Intro to Policy Optimization

Schulman et al. 2016 — High Dimensional Continuous Control Using Generalized Advantage Estimation

Williams 1992 — Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE)
