Running, Logging & Benchmarking

By the end of this reading you will be able to:

Run any Spinning Up algorithm from the command line using spinup.run, selecting PyTorch vs TensorFlow and setting hyperparameters as flags
Use ExperimentGrid to define a hyperparameter sweep and launch all combinations programmatically
Interpret the experiment output directory (config.json, progress.txt, pyt_save/), load a saved policy with test_policy, and plot results with the Spinning Up plotter

Running from the Command Line

Spinning Up ships with spinup/run.py, a unified launcher for all algorithms:

python -m spinup.run [algo] [flags]

Choosing PyTorch or TensorFlow

# PyTorch version:
python -m spinup.run ppo_pytorch --env HalfCheetah-v2 --exp_name ppo-hc

# TensorFlow version:
python -m spinup.run ppo_tf1 --env HalfCheetah-v2 --exp_name ppo-hc-tf

# Default (reads spinup/user_config.py):
python -m spinup.run ppo --env HalfCheetah-v2

Setting Hyperparameters

Every keyword argument for the algorithm function can be set as a flag:

python -m spinup.run ppo_pytorch --env Walker2d-v2 \
  --exp_name walker-ppo \
  --epochs 200 \
  --gamma 0.99 \
  --clip_ratio 0.2 \
  --hid[h] [64,64]   # --hid is a shortcut for --ac_kwargs:hidden_sizes; [h] sets its shorthand in save-directory names

Values pass through eval(), so you can pass Python expressions:

--act torch.nn.ELU     # sets activation function directly
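
To make the eval() step concrete, here is a minimal sketch of how a launcher could turn flag strings into Python values. The helper name parse_flag_value is hypothetical, and falling back to the raw string when evaluation fails is an assumption that explains why plain names such as Walker2d-v2 still work; it is not a copy of spinup/run.py.

```python
def parse_flag_value(raw):
    """Evaluate a flag string as a Python expression, else keep it as text."""
    try:
        return eval(raw)
    except Exception:
        # Not a valid expression (e.g. an environment name): keep the string.
        return raw


for raw in ["0.99", "[64,64]", "Walker2d-v2"]:
    value = parse_flag_value(raw)
    print(f"{raw!r} -> {value!r} ({type(value).__name__})")
```

Since eval() executes whatever it is given, only pass flag values you trust.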

Running Multiple Experiments

Provide multiple values for any argument to run all combinations in series:

# Three seeds, two architectures = 6 experiments total:
python -m spinup.run ppo_pytorch --env Walker2d-v2 \
  --seed 0 10 20 \
  --hid[h] [64,64] [256,256]

Experiments run in series, not in parallel: each one already uses enough compute that running several at once would not finish them any sooner.
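
The count of six comes from taking every combination of the supplied values. A small sketch of that expansion, using only the standard library (the ordering below is that of itertools.product and is not necessarily the order in which Spinning Up launches the runs):

```python
from itertools import product

# Values given on the command line, keyed by flag name.
flags = {
    "seed": [0, 10, 20],
    "hid": [[64, 64], [256, 256]],
}

names = list(flags)
# One dict per experiment: every seed paired with every architecture.
variants = [dict(zip(names, combo)) for combo in product(*flags.values())]

for i, variant in enumerate(variants, start=1):
    print(i, variant)
print(len(variants), "experiments")
```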

ExperimentGrid: Sweeps from Python

For larger hyperparameter searches, use ExperimentGrid in a Python script:

from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch
import torch

eg = ExperimentGrid(name='ppo-sweep')
eg.add('env_name', 'HalfCheetah-v2', '', in_name=True)
eg.add('seed', [10 * i for i in range(5)])               # 5 seeds
eg.add('epochs', 200)
eg.add('ac_kwargs:hidden_sizes', [(64, 64), (256, 256)], 'hid')  # 2 architectures
eg.add('ac_kwargs:activation', [torch.nn.Tanh, torch.nn.ReLU], '')  # 2 activations

eg.run(ppo_pytorch, num_cpu=1)  # runs 5 × 2 × 2 = 20 experiments

Key eg.add() parameters:

key: the name of the algorithm keyword argument to set
vals: a single value or a list of values to try
shorthand: abbreviation for the parameter, used when naming save directories
in_name=True: always include this parameter in the directory name, even if it has only one value
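
A colon in a parameter name, as in ac_kwargs:hidden_sizes, marks nesting: the value ends up inside the ac_kwargs dictionary that is passed to the algorithm. The following is a minimal sketch of that expansion. The helper unflatten is hypothetical and written for illustration; it is not ExperimentGrid's own code, and the activation is a stand-in string so the example runs without PyTorch.

```python
def unflatten(flat):
    """Turn {'a:b': 1, 'c': 2} into {'a': {'b': 1}, 'c': 2}."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(":")
        node = nested
        for part in parents:
            # Descend into (or create) the intermediate dictionary.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


variant = {
    "env_name": "HalfCheetah-v2",
    "seed": 0,
    "ac_kwargs:hidden_sizes": (64, 64),
    "ac_kwargs:activation": "Tanh",  # stand-in for torch.nn.Tanh
}
kwargs = unflatten(variant)
print(kwargs)
```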

Experiment Outputs

Each experiment run saves to:

data_dir/[timestamp_]exp_name[suffix]/[timestamp_]exp_name[suffix]_s[seed]/
  config.json      ← hyperparameters used (for record-keeping)
  progress.txt     ← tab-separated training metrics by epoch
  vars.pkl         ← pickled environment (if serializable)
  pyt_save/        ← PyTorch: model.pt (ActorCritic nn.Module)
  tf1_save/        ← TF: SavedModel directory
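
Because each run directory holds its own progress.txt and config.json, runs can be located by walking the data directory. The sketch below builds a throwaway layout with made-up values and then collects the config of each run. The helper find_runs and the config keys shown are illustrative assumptions, not part of Spinning Up.

```python
import json
import os
import tempfile


def find_runs(root):
    """Return (run_dir, config) for every directory containing progress.txt."""
    runs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "progress.txt" in filenames:
            config = {}
            config_path = os.path.join(dirpath, "config.json")
            if os.path.exists(config_path):
                with open(config_path) as f:
                    config = json.load(f)
            runs.append((dirpath, config))
    return sorted(runs, key=lambda item: item[0])


# Throwaway layout with invented contents so the sketch runs anywhere.
root = tempfile.mkdtemp()
for seed in (0, 10):
    run_dir = os.path.join(root, "ppo-hc", f"ppo-hc_s{seed}")
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "config.json"), "w") as f:
        json.dump({"exp_name": "ppo-hc", "seed": seed, "gamma": 0.99}, f)
    with open(os.path.join(run_dir, "progress.txt"), "w") as f:
        f.write("Epoch\tAverageEpRet\n0\t-1.0\n")

runs = find_runs(root)
for run_dir, config in runs:
    print(os.path.relpath(run_dir, root), config["seed"])
```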

progress.txt columns

Typical columns logged by Spinning Up algorithms:

Column	Meaning
`Epoch`	Current epoch number
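`AverageEpRet`	Mean return over the episodes completed during the epoch (off-policy algorithms also log `AverageTestEpRet` for separate test episodes)
`StdEpRet`	Standard deviation of those episode returns
`MaxEpRet`	Highest episode return in the epoch
`MinEpRet`	Lowest episode return in the epoch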
`EpLen`	Mean episode length
`TotalEnvInteracts`	Total environment steps so far
`LossPi`	Policy loss (not a reliable progress metric)
`LossV`	Value function loss
`KL`	Approx KL divergence (on-policy algs)
`Time`	Wall-clock time elapsed since the start of training (seconds)

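Key rule: Watch AverageEpRet, not LossPi. In policy-gradient methods the loss is computed on data from the current policy, so its value does not track performance, and a falling loss does not mean the agent is improving.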
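
Since progress.txt is tab-separated with a header row, it can be read with the standard csv module. A minimal sketch, using an in-memory sample with invented numbers and only a subset of the columns:

```python
import csv
import io

# Made-up values in the tab-separated layout of progress.txt.
sample = (
    "Epoch\tAverageEpRet\tTotalEnvInteracts\tLossPi\n"
    "0\t-320.5\t4000\t0.012\n"
    "1\t-150.2\t8000\t-0.004\n"
    "2\t85.7\t12000\t0.009\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
steps = [int(r["TotalEnvInteracts"]) for r in rows]
returns = [float(r["AverageEpRet"]) for r in rows]

# Index of the epoch with the highest average return.
best = max(range(len(rows)), key=lambda i: returns[i])
print(f"best AverageEpRet {returns[best]} at {steps[best]} env steps")
```

For a real run, replace io.StringIO(sample) with open('path/to/progress.txt', newline='').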

Loading and Testing Trained Policies

PyTorch

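import gym
import torch

# Load the full ActorCritic module saved by the logger.
ac = torch.load('path/to/pyt_save/model.pt')

# Assumption: recreate the same environment the policy was trained on.
env = gym.make('HalfCheetah-v2')

obs, done, ep_ret = env.reset(), False, 0.0
while not done:
    # ac.act expects a float32 tensor and returns a NumPy action.
    action = ac.act(torch.as_tensor(obs, dtype=torch.float32))
    obs, rew, done, _ = env.step(action)  # classic Gym 4-tuple step API
    ep_ret += rew
print('Episode return:', ep_ret)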

Command-Line Test Runner

# Run 100 test episodes and render:
python -m spinup.run test_policy path/to/output_dir

# Options:
#  -n 50          run 50 episodes instead of 100
#  -l 500         max 500 steps per episode
#  -nr            no rendering (faster, just print returns)

Plotting Results

# Plot a single run:
python -m spinup.run plot path/to/output_dir

# Compare multiple runs on the same axes:
python -m spinup.run plot path/to/run1 path/to/run2

# Multiple seeds in the same directory are automatically averaged:
python -m spinup.run plot path/to/exp_folder/

The plotter reads progress.txt and plots mean ± std across seeds.
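
The aggregation across seeds can be sketched with the standard library. The return curves below are invented, and the sketch uses the sample standard deviation (n − 1); whether the plotter uses the sample or the population form is not stated here, so treat that choice as an assumption.

```python
from statistics import mean, stdev

# Made-up AverageEpRet curves: one list per seed, one entry per epoch.
returns_by_seed = {
    0: [10.0, 40.0, 90.0],
    10: [20.0, 50.0, 70.0],
    20: [30.0, 30.0, 80.0],
}

# Regroup so that each tuple holds all seeds' returns for one epoch.
per_epoch = list(zip(*returns_by_seed.values()))
means = [mean(vals) for vals in per_epoch]
stds = [stdev(vals) for vals in per_epoch]  # sample std (n - 1)

for epoch, (m, s) in enumerate(zip(means, stds)):
    print(f"epoch {epoch}: {m:.1f} ± {s:.1f}")
```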

Practical Benchmarking Tips

Always run multiple seeds (3–5 minimum). RL is stochastic and a single seed can look great or terrible by chance.
Report mean and std, not just the best seed.
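Wall-clock vs. environment steps: put TotalEnvInteracts on the x-axis. Wall-clock time depends on the machine, so step counts are the fair basis for comparing runs.
Baseline against VPG: treat VPG as a sanity check. If PPO does not beat VPG on the same environment and step budget, suspect a bug or a poorly chosen hyperparameter before drawing conclusions.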
Tune on simple environments first: CartPole, Pendulum, or HalfCheetah. Don't waste compute on complex envs before confirming basic sanity.

References

OpenAI Spinning Up — Running Experiments

OpenAI Spinning Up — Experiment Outputs

OpenAI Spinning Up — Plotting Results
