Deep Reinforcement Learning · Policy Gradient Algorithms

Vanilla Policy Gradient (VPG)

By the end of this reading you will be able to:

Implement VPG's training loop in PyTorch: collect trajectories, compute GAE advantages, take one policy gradient step, and fit the value function
Configure the key VPG hyperparameters (steps_per_epoch, gamma, lam, pi_lr, vf_lr, train_v_iters) and explain each one's role
Explain why VPG's exploration degrades over training and what practical consequences this has for hyperparameter tuning

The Key Idea

Vanilla Policy Gradient (VPG) is the simplest practical implementation of the policy gradient idea developed in Part 3. The core intuition: push up the probabilities of actions with positive advantage (those that turned out better than the value function expected), push down the probabilities of actions with negative advantage, and repeat.

VPG is on-policy: every update uses trajectories freshly collected with the current policy. After one policy gradient step the data is discarded, because it no longer reflects the updated policy.
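
The "push up, push down" intuition can be written as a surrogate loss whose gradient is the negative of the policy gradient estimate. The snippet below is a minimal sketch for a discrete action space, not the Spinning Up code: the function name policy_loss and the toy logits, actions, and advantages are made up for illustration.

```python
import torch
from torch.distributions import Categorical

def policy_loss(logits, actions, advantages):
    """Surrogate loss; its gradient is minus the policy gradient estimate."""
    # log pi_theta(a_t | s_t) for the actions that were actually taken
    logp = Categorical(logits=logits).log_prob(actions)
    # Minimizing -(logp * A).mean() is gradient ascent on the objective
    return -(logp * advantages).mean()

# Toy batch (illustrative values): 3 states, 2 discrete actions
logits = torch.zeros(3, 2, requires_grad=True)    # uniform policy
actions = torch.tensor([0, 1, 0])
advantages = torch.tensor([1.0, -1.0, 0.5])

loss = policy_loss(logits, actions, advantages)
loss.backward()

# A gradient descent step on the loss moves the logits along -grad
step_direction = -logits.grad
```

In this toy batch, step_direction is positive for the logit of action 0 in the first state (positive advantage) and negative for the logit of action 1 in the second state (negative advantage), which is the behavior the intuition describes.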

Quick Facts

On-policy algorithm
Works for discrete and continuous action spaces
The Spinning Up implementation supports MPI parallelization (multiple workers collect trajectories simultaneously)
Uses GAE-Lambda for advantage estimation

The Algorithm

Input: initial policy θ₀, initial value function φ₀
For k = 0, 1, 2, ...:
  1. Collect trajectories D_k by running π_θk in the environment
  2. Compute rewards-to-go R̂_t
  3. Compute advantage estimates Â_t using V_φk (via GAE-λ)
  4. Update policy by gradient ascent:
       θ_{k+1} = θ_k + α * (1/|D_k|) Σ Σ ∇_θ log π_θ(a_t|s_t)|_θk * Â_t
  5. Fit value function by gradient descent on MSE:
       φ_{k+1} = argmin_φ (1/|D_k|T) Σ Σ (V_φ(s_t) - R̂_t)²
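
Steps 2 and 3 can be sketched for a single trajectory as follows. This is an illustrative NumPy version under stated assumptions: the helper names discount_cumsum and gae_and_returns are chosen here, and last_value stands for V(s_T), which should be 0.0 when the episode ended in a terminal state and the critic's estimate when the trajectory was cut off.

```python
import numpy as np

def discount_cumsum(x, discount):
    """Return y with y[t] = sum over l >= 0 of discount**l * x[t + l]."""
    out = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def gae_and_returns(rewards, values, last_value, gamma=0.99, lam=0.97):
    """Compute GAE-lambda advantages and rewards-to-go for one trajectory.

    rewards, values: length-T sequences of r_t and V(s_t)
    last_value: V(s_T) used for bootstrapping (0.0 if s_T is terminal)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    v = np.append(np.asarray(values, dtype=np.float64), last_value)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * v[1:] - v[:-1]
    # GAE: A_t = sum over l of (gamma * lam)**l * delta_{t+l}
    advantages = discount_cumsum(deltas, gamma * lam)
    # Rewards-to-go, bootstrapped with last_value
    returns = discount_cumsum(np.append(rewards, last_value), gamma)[:-1]
    return advantages, returns

# Toy trajectory: with V = 0 and lam = 1, advantages equal rewards-to-go
adv, ret = gae_and_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                           gamma=0.5, lam=1.0)
# adv and ret are both [1.75, 1.5, 1.0]
```

With lam=0 the advantage reduces to the one-step TD residual, and with lam=1 it becomes the discounted return minus the value baseline, which is the bias-variance trade-off that lam controls.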

PyTorch Implementation

from spinup import vpg_pytorch as vpg
import gym

# Launch VPG on CartPole:
vpg(
    env_fn=lambda: gym.make('CartPole-v1'),
    ac_kwargs=dict(hidden_sizes=[64, 64]),
    steps_per_epoch=4000,
    epochs=50,
    gamma=0.99,
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_v_iters=80,
    lam=0.97,
    max_ep_len=1000,
    logger_kwargs=dict(output_dir='/tmp/vpg-test', exp_name='vpg')
)

From the command line:

python -m spinup.run vpg_pytorch --env CartPole-v1 --epochs 50

Key Hyperparameters

Hyperparameter	Default	Role
`steps_per_epoch`	4000	Number of environment steps collected per policy update. Larger batches give a lower-variance gradient estimate, but each update costs more environment interaction.
`gamma`	0.99	Discount factor, between 0 and 1.
`lam`	0.97	GAE λ, between 0 and 1. Higher values reduce bias but increase variance in the advantage estimates.
`pi_lr`	3e-4	Policy learning rate.
`vf_lr`	1e-3	Value function learning rate (higher than the policy learning rate is typical).
`train_v_iters`	80	Gradient descent steps on the value function per epoch.
`max_ep_len`	1000	Maximum episode length; longer episodes are cut off at this many steps.
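
To make the roles of vf_lr and train_v_iters concrete, the sketch below runs one epoch's worth of value-function fitting. It is a hedged illustration rather than the library's code: the network v_net, the Tanh activations, and the synthetic obs and returns tensors are assumptions standing in for the critic and for data that would normally come from collected trajectories.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

obs_dim = 4                       # e.g. CartPole observations (assumed)
vf_lr, train_v_iters = 1e-3, 80   # defaults from the table above

# Hypothetical critic: a small MLP mirroring hidden_sizes=[64, 64]
v_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
vf_optimizer = torch.optim.Adam(v_net.parameters(), lr=vf_lr)

# Synthetic stand-ins for one epoch of observations and rewards-to-go
obs = torch.randn(256, obs_dim)
returns = obs.sum(dim=1)

def value_loss():
    # Mean squared error between V(s_t) and the rewards-to-go targets
    return ((v_net(obs).squeeze(-1) - returns) ** 2).mean()

loss_before = value_loss().item()
for _ in range(train_v_iters):    # many critic steps per epoch
    vf_optimizer.zero_grad()
    loss = value_loss()
    loss.backward()
    vf_optimizer.step()
loss_after = value_loss().item()
```

The asymmetry is deliberate: the policy takes a single step per epoch with pi_lr, while the value function takes train_v_iters regression steps on the same batch with vf_lr.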

Loading a Trained Policy

After training, the PyTorch model is saved as pyt_save/model.pt:

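import gym
import torch

# Load the saved actor-critic (the file stores the whole module)
ac = torch.load('path/to/pyt_save/model.pt')

# Recreate the training environment and query the policy for an action.
# Assumes the older gym API, where reset() returns only the observation.
env = gym.make('CartPole-v1')
obs = env.reset()
action = ac.act(torch.as_tensor(obs, dtype=torch.float32))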

Exploration vs. Exploitation

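VPG trains a stochastic policy and explores by sampling actions from it. Early in training, the policy is nearly random (high entropy), which provides exploration. As training proceeds, the update rule pushes probability toward actions that have already produced high reward, so the policy becomes progressively more deterministic.

The danger: the policy may converge prematurely to a local optimum before finding the globally best behavior. VPG has no explicit mechanism to prevent this, unlike SAC, which adds an entropy term to its objective throughout training. In practice this makes results sensitive to initialization and to step-size settings such as pi_lr, since a policy that sharpens quickly has less opportunity to explore.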
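
One way to watch for this during training is to log the policy's average entropy each epoch. The following is a minimal sketch for a discrete policy; the function mean_entropy and the two example logit tensors are illustrative and not part of the original training code.

```python
import torch
from torch.distributions import Categorical

def mean_entropy(logits):
    """Average policy entropy (in nats) over a batch of states."""
    return Categorical(logits=logits).entropy().mean().item()

# Illustrative logits for a 2-action policy over 5 states
early_logits = torch.zeros(5, 2)                # uniform: maximal entropy
late_logits = torch.tensor([[6.0, 0.0]] * 5)    # strongly prefers action 0

early = mean_entropy(early_logits)   # ln(2), about 0.693
late = mean_entropy(late_logits)     # close to 0
```

An entropy that falls toward zero early in training, while returns are still low, suggests the policy has stopped exploring before finding good behavior.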

When to Use VPG

VPG is primarily useful as a baseline and a learning tool rather than a production algorithm. Its simplicity makes it easy to debug. For competitive performance:

Use PPO when an on-policy algorithm is required (simpler and more robust than TRPO).
Use SAC when sample efficiency matters and the action space is continuous.

References

Sutton et al. 2000 — Policy Gradient Methods for Reinforcement Learning with Function Approximation

Schulman et al. 2016 — High Dimensional Continuous Control Using Generalized Advantage Estimation

OpenAI Spinning Up — VPG Documentation
