Deep Reinforcement Learning · Policy Gradient Algorithms

Vanilla Policy Gradient (VPG)

By the end of this reading you will be able to:
  • Implement VPG's training loop in PyTorch: collect trajectories, compute GAE advantages, take one policy gradient step, and fit the value function
  • Configure the key VPG hyperparameters (steps_per_epoch, gamma, lam, pi_lr, vf_lr, train_v_iters) and explain each one's role
  • Explain why VPG's exploration degrades over training and what practical consequences this has for hyperparameter tuning

The Key Idea

Vanilla Policy Gradient (VPG) is the simplest practical implementation of the policy gradient idea developed in Part 3. The core intuition: push up the probabilities of actions with positive advantage (they did better than the value function expected), push down the probabilities of actions with negative advantage, and repeat.

VPG is on-policy: every update uses freshly collected trajectories. After one gradient step, the data is discarded.

Quick Facts

  • On-policy algorithm
  • Works for discrete and continuous action spaces
  • Supports MPI parallelization (collect trajectories on multiple workers simultaneously)
  • Uses GAE-Lambda for advantage estimation
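
To make the GAE-Lambda fact concrete, here is a minimal, dependency-free sketch of how rewards-to-go and GAE advantages can be computed for a single episode. The function names `discount_cumsum` and `gae_advantages` are illustrative choices rather than a library API, and the sketch assumes the episode ended in a terminal state unless a bootstrap `last_value` is supplied.

```python
def discount_cumsum(xs, discount):
    """Return y where y[t] = sum over l >= 0 of discount**l * xs[t + l]."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out


def gae_advantages(rewards, values, gamma=0.99, lam=0.97, last_value=0.0):
    """GAE(lambda) advantages for one episode.

    rewards[t] and values[t] are the reward and V(s_t) at step t;
    last_value is V of the state after the final step (0 if terminal).
    """
    next_values = list(values[1:]) + [last_value]
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [r + gamma * nv - v
              for r, nv, v in zip(rewards, next_values, values)]
    # Advantages are the (gamma * lam)-discounted sum of TD residuals.
    return discount_cumsum(deltas, gamma * lam)


rewards = [1.0, 1.0, 1.0]   # toy episode, made up for illustration
values = [0.5, 0.5, 0.5]    # toy value estimates V(s_t)
returns = discount_cumsum(rewards, 0.99)   # rewards-to-go (targets for V)
adv = gae_advantages(rewards, values)      # weights for the policy gradient
```

With `lam=1.0` the advantages reduce to rewards-to-go minus the value baseline (low bias, high variance); with `lam=0.0` they reduce to the one-step TD residual (higher bias, lower variance). Values in between trade the two off.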

The Algorithm

Input: initial policy θ₀, initial value function φ₀
For k = 0, 1, 2, ...:
  1. Collect trajectories D_k by running π_θk in the environment
  2. Compute rewards-to-go R̂_t
  3. Compute advantage estimates Â_t using V_φk (via GAE-λ)
  4. Update policy by gradient ascent:
       θ_{k+1} = θ_k + α * (1/|D_k|) Σ_{τ∈D_k} Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)|_θk * Â_t
  5. Fit value function by gradient descent on MSE:
       φ_{k+1} = argmin_φ (1/(|D_k| T)) Σ_{τ∈D_k} Σ_{t=0}^{T} (V_φ(s_t) - R̂_t)²
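
Steps 4 and 5 can be sketched in PyTorch as follows. This is a minimal illustration on synthetic data, not a full training loop: the network sizes, the random batch, and the names `pi_net` and `v_net` are placeholders, and the advantages and rewards-to-go are assumed to be already computed. The loss averages over all timesteps in the batch, which differs from the per-trajectory average in step 4 only by a constant factor.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)
obs_dim, n_actions, batch = 4, 2, 256   # placeholder sizes for illustration

# Small MLPs standing in for the policy (action logits) and value function.
pi_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
pi_opt = torch.optim.Adam(pi_net.parameters(), lr=3e-4)
vf_opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

# Synthetic stand-ins for one epoch of on-policy data.
obs = torch.randn(batch, obs_dim)
act = torch.randint(0, n_actions, (batch,))
adv = torch.randn(batch)   # would come from GAE in a real run
ret = torch.randn(batch)   # would be the rewards-to-go

# Step 4: a single gradient ascent step on the policy objective.
logp = Categorical(logits=pi_net(obs)).log_prob(act)
loss_pi = -(logp * adv).mean()   # minimizing this ascends the policy gradient
pi_opt.zero_grad()
loss_pi.backward()
pi_opt.step()

# Step 5: several gradient descent steps on the value-function MSE.
train_v_iters = 80
v_losses = []
for _ in range(train_v_iters):
    loss_v = ((v_net(obs).squeeze(-1) - ret) ** 2).mean()
    vf_opt.zero_grad()
    loss_v.backward()
    vf_opt.step()
    v_losses.append(loss_v.item())
```

The asymmetry is deliberate: the policy takes one step per batch because the data stops being on-policy as soon as θ changes, while the value function is an ordinary regression problem and can take many steps on the same batch.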

PyTorch Implementation

from spinup import vpg_pytorch as vpg
import gym

# Launch VPG on CartPole:
vpg(
    env_fn=lambda: gym.make('CartPole-v1'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=50,
    gamma=0.99,
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_v_iters=80,
    lam=0.97,
    max_ep_len=1000,
    logger_kwargs=dict(output_dir='/tmp/vpg-test', exp_name='vpg')
)

From the command line:

python -m spinup.run vpg_pytorch --env CartPole-v1 --epochs 50

Key Hyperparameters

Hyperparameter    Default   Role
steps_per_epoch   4000      Number of environment steps per policy update. Larger = lower-variance gradient estimate, but slower.
gamma             0.99      Discount factor.
lam               0.97      GAE-Lambda. Higher = lower bias, higher variance.
pi_lr             3e-4      Policy learning rate.
vf_lr             1e-3      Value function learning rate (higher than policy is typical).
train_v_iters     80        Gradient steps on value function per epoch.
max_ep_len        1000      Max episode length before forced termination.

Loading a Trained Policy

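After training, the PyTorch model is saved as pyt_save/model.pt inside the logger's output directory:

import gym
import torch

# Load the whole actor-critic module (not just a state dict).
# Recent PyTorch releases may require torch.load(..., weights_only=False) for this.
ac = torch.load('path/to/pyt_save/model.pt')

# Get actions:
env = gym.make('CartPole-v1')
obs = env.reset()  # older Gym API; newer versions return (obs, info)
action = ac.act(torch.as_tensor(obs, dtype=torch.float32))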

Exploration vs. Exploitation

VPG trains a stochastic policy and explores by sampling actions from it. Early in training, the policy is typically close to random (high entropy), which provides exploration. As training proceeds, the update rule pushes probability toward actions that have already paid off, so the policy usually becomes progressively more deterministic.

The danger: the policy may converge prematurely to a local optimum before finding the globally best behavior. VPG has no explicit mechanism to prevent this. SAC, by contrast, adds an entropy term to its objective and so rewards randomness throughout training.
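
One way to watch for this collapse is to track the policy's entropy over training. The sketch below compares two hypothetical sets of logits for a 4-action policy; the logits are made up for illustration and do not come from a trained model.

```python
import math

import torch
from torch.distributions import Categorical

# Hypothetical logits for a 4-action policy at two points in training.
early_logits = torch.zeros(4)                      # uniform: maximally random
late_logits = torch.tensor([8.0, 0.0, 0.0, 0.0])   # almost always picks action 0

early_entropy = Categorical(logits=early_logits).entropy().item()
late_entropy = Categorical(logits=late_logits).entropy().item()

print(f"early: {early_entropy:.3f} nats (maximum is ln 4 = {math.log(4):.3f})")
print(f"late:  {late_entropy:.3f} nats")
```

If the mean entropy over a batch of visited states falls toward zero while returns are still low, the policy has probably stopped exploring before finding good behavior.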

When to Use VPG

VPG is primarily useful as a baseline and a learning tool rather than a production algorithm. Its simplicity makes it easy to debug. For competitive performance:

  • Use PPO when an on-policy algorithm is required (simpler and more robust than TRPO).
  • Use SAC when sample efficiency matters and the action space is continuous.