Vanilla Policy Gradient (VPG)
- Implement VPG's training loop in PyTorch: collect trajectories, compute GAE advantages, take one policy gradient step, and fit the value function
- Configure the key VPG hyperparameters (steps_per_epoch, gamma, lam, pi_lr, vf_lr, train_v_iters) and explain each one's role
- Explain why VPG's exploration degrades over training and what practical consequences this has for hyperparameter tuning
The Key Idea
Vanilla Policy Gradient (VPG) is the simplest practical implementation of the policy gradient idea developed in Part 3. The core intuition: push up the probabilities of actions whose estimated advantage is positive (they turned out better than expected), push down the probabilities of actions whose estimated advantage is negative, and repeat.
VPG is on-policy: every update uses trajectories freshly collected with the current policy. After one policy gradient step, and the value function fit on that same batch, the data is discarded.
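As a rough illustration of that intuition, the sketch below takes a single policy update on a stand-in batch. The network shape, the random observations, and the hand-picked advantages are all assumptions made for the example, not part of any library; a real run would fill these tensors from fresh rollouts. Adam is used as the optimizer here, whereas the update rule is usually written as plain gradient ascent.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical tiny policy network: 4-dim observations, 2 discrete actions.
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Stand-in batch (random data); in VPG these tensors would come from
# fresh rollouts of the current policy.
obs = torch.randn(8, 4)
act = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
adv = torch.tensor([2.0, -1.0, 0.5, -0.5, 1.0, -2.0, 1.5, -1.5])


def policy_loss():
    # Log-probability of each taken action under the current policy.
    logp = Categorical(logits=policy(obs)).log_prob(act)
    # Minimizing -(logp * adv).mean() is gradient ascent on the
    # advantage-weighted log-likelihood.
    return -(logp * adv).mean()


loss_before = policy_loss()
optimizer.zero_grad()
loss_before.backward()
optimizer.step()  # the single policy update for this batch

with torch.no_grad():
    loss_after = policy_loss()
print(f"surrogate loss: {loss_before.item():.4f} -> {loss_after.item():.4f}")
```

The surrogate loss is only meaningful on the batch it was computed from; after the step, new trajectories would be collected before updating again.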
Quick Facts
- On-policy algorithm
- Works for discrete and continuous action spaces
- Supports MPI parallelization (collect trajectories on multiple workers simultaneously)
- Uses GAE-Lambda for advantage estimation
The Algorithm
Input: initial policy θ₀, initial value function φ₀
For k = 0, 1, 2, ...:
1. Collect trajectories D_k by running π_θk in the environment
2. Compute rewards-to-go R̂_t
3. Compute advantage estimates Â_t using V_φk (via GAE-λ)
4. Update policy by gradient ascent:
θ_{k+1} = θ_k + α * (1/|D_k|) Σ_{τ∈D_k} Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)|_θk * Â_t
5. Fit value function by gradient descent on MSE:
φ_{k+1} = argmin_φ (1/(|D_k| T)) Σ_{τ∈D_k} Σ_{t=0}^{T} (V_φ(s_t) - R̂_t)²
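Steps 2 and 3 can be sketched in plain Python for a single trajectory. This is a minimal illustration, not the library's buffer code: the function names and the `last_value` bootstrap argument are assumptions chosen for the example.

```python
def discount_cumsum(xs, discount):
    """Return y with y[t] = sum over l >= 0 of discount**l * xs[t + l]."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out


def gae_and_returns(rewards, values, last_value=0.0, gamma=0.99, lam=0.97):
    """GAE-lambda advantages and rewards-to-go for one trajectory.

    rewards[t] and values[t] = V(s_t) for t = 0..T-1. last_value stands in
    for V(s_T): 0.0 if the episode truly ended, otherwise a bootstrap value.
    """
    vals = list(values) + [last_value]
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [rewards[t] + gamma * vals[t + 1] - vals[t]
              for t in range(len(rewards))]
    # GAE is the (gamma * lam)-discounted sum of the residuals.
    advantages = discount_cumsum(deltas, gamma * lam)
    # Rewards-to-go, bootstrapped with last_value if the trajectory was cut off.
    returns = discount_cumsum(list(rewards) + [last_value], gamma)[:-1]
    return advantages, returns


# Small check with gamma = lam = 1, where the advantage is return minus value.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
adv, ret = gae_and_returns(rewards, values, gamma=1.0, lam=1.0)
print(ret)  # [3.0, 2.0, 1.0]
print(adv)  # [2.5, 1.5, 0.5]
```

With `lam=1` the estimator reduces to the rewards-to-go minus the value baseline; lowering `lam` leans more on the value function, trading variance for bias.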
PyTorch Implementation
import gym
from spinup import vpg_pytorch as vpg

# Launch VPG on CartPole:
vpg(
    env_fn=lambda: gym.make('CartPole-v1'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=50,
    gamma=0.99,
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_v_iters=80,
    lam=0.97,
    max_ep_len=1000,
    logger_kwargs=dict(output_dir='/tmp/vpg-test', exp_name='vpg')
)
From the command line:
python -m spinup.run vpg_pytorch --env CartPole-v1 --epochs 50
Key Hyperparameters
| Hyperparameter | Default | Role |
|---|---|---|
| steps_per_epoch | 4000 | Number of environment steps per policy update. Larger = lower variance gradient estimate but slower. |
| gamma | 0.99 | Discount factor. |
| lam | 0.97 | GAE-Lambda. Higher = lower bias, higher variance. |
| pi_lr | 3e-4 | Policy learning rate. |
| vf_lr | 1e-3 | Value function learning rate (higher than policy is typical). |
| train_v_iters | 80 | Gradient steps on value function per epoch. |
| max_ep_len | 1000 | Max episode length before forced termination. |
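To get a feel for what these defaults imply, the short calculation below works out the training budget and two rule-of-thumb horizons. The value `epochs=50` is taken from the example call earlier, and the `1 / (1 - rate)` horizon is a common heuristic for geometric discounting rather than an exact quantity.

```python
# Values taken from the example call above (epochs=50 is that call's setting).
steps_per_epoch, epochs = 4000, 50
gamma, lam = 0.99, 0.97
train_v_iters = 80

total_env_steps = steps_per_epoch * epochs  # environment interactions overall
policy_updates = epochs                     # one policy gradient step per epoch
value_updates = epochs * train_v_iters      # value-function gradient steps overall

# Rule of thumb: geometric weights with rate r mostly cover about 1 / (1 - r) steps.
reward_horizon = 1.0 / (1.0 - gamma)        # roughly 100 steps
gae_horizon = 1.0 / (1.0 - gamma * lam)     # roughly 25 steps

print(total_env_steps, policy_updates, value_updates)
print(round(reward_horizon), round(gae_horizon, 1))
```

The contrast between 50 policy updates and 200,000 environment steps is the price of being on-policy: each batch supports only one policy step.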
Loading a Trained Policy
After training, the PyTorch model is saved as pyt_save/model.pt:
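import gym
import torch

# Loads the whole pickled actor-critic object, not just a state dict.
# On recent PyTorch versions this may require torch.load(..., weights_only=False).
ac = torch.load('path/to/pyt_save/model.pt')
env = gym.make('CartPole-v1')  # the environment the policy was trained on
# Get actions:
obs = env.reset()  # older gym API; newer gym versions return (obs, info)
action = ac.act(torch.as_tensor(obs, dtype=torch.float32))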
Exploration vs. Exploitation
VPG trains a stochastic policy. Early in training, the policy is nearly random (high entropy), which provides exploration. As training proceeds, the policy becomes more deterministic as it exploits discovered high-reward actions.
The danger: the policy may converge prematurely to a local optimum before finding the globally best behavior. VPG has no explicit mechanism to prevent this, unlike SAC, which adds an entropy bonus to its objective and so keeps encouraging exploration throughout training.
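One way to watch this happen is to track the policy's entropy. The sketch below uses made-up logits for a two-action policy (the `logit_gaps` values are purely illustrative) to show how entropy falls as the policy grows more confident in one action.

```python
import math

import torch
from torch.distributions import Categorical

# Hypothetical logits for a 2-action policy at increasing "confidence".
logit_gaps = [0.0, 1.0, 3.0, 6.0]
entropies = []
for gap in logit_gaps:
    dist = Categorical(logits=torch.tensor([gap, 0.0]))
    entropies.append(dist.entropy().item())
    p0 = dist.probs[0].item()
    print(f"logit gap {gap:>4}: P(a0)={p0:.3f}, entropy={entropies[-1]:.3f}")

# A uniform 2-action policy has the maximum possible entropy, ln(2).
print(f"max entropy for 2 actions: {math.log(2):.3f}")
```

In a training run, a mean entropy that collapses toward zero early, while returns are still low, is a hint that the policy may have stopped exploring.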
When to Use VPG
VPG is primarily useful as a baseline and a learning tool rather than a production algorithm. Its simplicity makes it easy to debug. For competitive performance:
- Use PPO when an on-policy algorithm is required (simpler and more robust than TRPO).
- Use SAC when sample efficiency matters and the action space is continuous.