Deep Reinforcement Learning · Off-Policy Methods & Tooling

Running, Logging & Benchmarking

By the end of this reading you will be able to:
  • Run any Spinning Up algorithm from the command line using spinup.run, selecting PyTorch vs TensorFlow and setting hyperparameters as flags
  • Use ExperimentGrid to define a hyperparameter sweep and launch all combinations programmatically
  • Interpret the experiment output directory (config.json, progress.txt, pyt_save/), load a saved policy with test_policy, and plot results with the Spinning Up plotter

Running from the Command Line

Spinning Up ships with spinup/run.py, a unified launcher for all algorithms:

python -m spinup.run [algo] [flags]

Choosing PyTorch or TensorFlow

# PyTorch version:
python -m spinup.run ppo_pytorch --env HalfCheetah-v2 --exp_name ppo-hc

# TensorFlow version:
python -m spinup.run ppo_tf1 --env HalfCheetah-v2 --exp_name ppo-hc-tf

# No backend suffix: the backend is taken from DEFAULT_BACKEND in spinup/user_config.py:
python -m spinup.run ppo --env HalfCheetah-v2

Setting Hyperparameters

Every keyword argument for the algorithm function can be set as a flag:

python -m spinup.run ppo_pytorch --env Walker2d-v2 \
  --exp_name walker-ppo \
  --epochs 200 \
  --gamma 0.99 \
  --clip_ratio 0.2 \
  --hid[h] [64,64]   # --hid is a shortcut for --ac_kwargs:hidden_sizes; [h] sets the label used in save-directory names

Values pass through eval(), so you can pass Python expressions:

--act torch.nn.ELU     # sets activation function directly
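
The eval-based handling of flag values can be pictured with a minimal sketch. The helper parse_flag_value below is hypothetical and written only for illustration; it is not the launcher's actual code. It assumes that any string that fails to evaluate as a Python expression is kept as a plain string, which is why an environment name like HalfCheetah-v2 survives unchanged.

```python
def parse_flag_value(raw):
    """Evaluate a flag string as Python; fall back to the raw string.

    Hypothetical helper for illustration only.
    """
    try:
        return eval(raw)
    except Exception:
        # Not a valid expression (e.g. an environment name): keep the text.
        return raw


print(parse_flag_value("[64,64]"))         # [64, 64]
print(parse_flag_value("0.99"))            # 0.99
print(parse_flag_value("HalfCheetah-v2"))  # HalfCheetah-v2
```

Because eval() executes arbitrary Python, this style of parsing is only appropriate for commands you type yourself, never for untrusted input.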

Running Multiple Experiments

Provide multiple values for any argument to run all combinations in series:

# Three seeds, two architectures = 6 experiments total:
python -m spinup.run ppo_pytorch --env Walker2d-v2 \
  --seed 0 10 20 \
  --hid[h] [64,64] [256,256]

Experiments run one after another, not in parallel: a single run typically uses enough compute that launching several at once on the same machine would not finish any sooner.
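
To see where the count of six comes from, the following sketch enumerates the Cartesian product of the two flag lists. It illustrates the combinatorics only; the variable names are made up here, and the order in which the real launcher visits the combinations is not assumed.

```python
from itertools import product

seeds = [0, 10, 20]
hidden_sizes = [[64, 64], [256, 256]]

# One variant per (seed, architecture) pair.
variants = [
    {"seed": s, "hidden_sizes": h}
    for s, h in product(seeds, hidden_sizes)
]

for i, variant in enumerate(variants, start=1):
    print(f"run {i}/{len(variants)}: {variant}")
```

Adding a third flag with k values multiplies the total by k, so sweeps grow quickly.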

ExperimentGrid: Sweeps from Python

For larger hyperparameter searches, use ExperimentGrid in a Python script:

from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch
import torch

eg = ExperimentGrid(name='ppo-sweep')
eg.add('env_name', 'HalfCheetah-v2', '', in_name=True)
eg.add('seed', [10 * i for i in range(5)])               # 5 seeds
eg.add('epochs', 200)
eg.add('ac_kwargs:hidden_sizes', [(64, 64), (256, 256)], 'hid')  # 2 architectures
eg.add('ac_kwargs:activation', [torch.nn.Tanh, torch.nn.ReLU], '')  # 2 activations

eg.run(ppo_pytorch, num_cpu=1)  # runs 5 × 2 × 2 = 20 experiments

Key eg.add() parameters:

  • param_name: the algorithm kwarg name
  • values: a single value or a list of values to try
  • shorthand: used in naming save directories
  • in_name=True: always include this param in directory name (even if it has one value)
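
Parameter names containing a colon, such as ac_kwargs:hidden_sizes, address an entry inside a nested dictionary argument. The sketch below shows one way such flat keys could be expanded into nested kwargs. The function unflatten is a hypothetical helper, not part of the Spinning Up API, and a string stands in for the activation class so the example runs without PyTorch.

```python
def unflatten(flat):
    """Turn {'a:b': 1, 'c': 2} into {'a': {'b': 1}, 'c': 2}."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(":")
        node = nested
        for part in parts[:-1]:
            # Descend into (or create) the intermediate dictionary.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


variant = {
    "env_name": "HalfCheetah-v2",
    "seed": 10,
    "ac_kwargs:hidden_sizes": (64, 64),
    "ac_kwargs:activation": "Tanh",  # placeholder for torch.nn.Tanh
}

kwargs = unflatten(variant)
print(kwargs["ac_kwargs"])  # {'hidden_sizes': (64, 64), 'activation': 'Tanh'}
```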

Experiment Outputs

Each experiment run saves to:

data_dir/[timestamp_]exp_name[suffix]/[timestamp_]exp_name[suffix]_s[seed]/
  config.json      ← hyperparameters used (for record-keeping)
  progress.txt     ← tab-separated training metrics by epoch
  vars.pkl         ← pickled environment (if serializable)
  pyt_save/        ← PyTorch: model.pt (ActorCritic nn.Module)
  tf1_save/        ← TF: SavedModel directory

progress.txt columns

Typical columns logged by Spinning Up algorithms:

Column             Meaning
Epoch              Current epoch number
AverageEpRet       Mean return over episodes completed during the epoch (off-policy algorithms also log AverageTestEpRet for separate test episodes)
StdEpRet           Standard deviation of those episode returns
MaxEpRet           Best episode return in the epoch
MinEpRet           Worst episode return in the epoch
EpLen              Mean episode length
TotalEnvInteracts  Total environment steps so far
LossPi             Policy loss (not a reliable progress metric)
LossV              Value function loss
KL                 Approximate KL divergence between old and new policy (on-policy algorithms)
Time               Wall-clock time elapsed

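Key rule: Watch AverageEpRet, not LossPi. The policy loss is computed on data collected by the current policy, so its value changes meaning from epoch to epoch and does not track performance.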
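
Since progress.txt is tab-separated with a header row, it can be read with the standard library alone. The sketch below writes a small synthetic file (the numbers are made up for illustration and are not real training results), parses it, and finds the epoch with the highest AverageEpRet. The function name read_progress is hypothetical.

```python
import csv
import os
import tempfile

# Synthetic stand-in for a real progress.txt; values are invented.
sample = (
    "Epoch\tAverageEpRet\tTotalEnvInteracts\n"
    "0\t-120.5\t4000\n"
    "1\t35.0\t8000\n"
    "2\t210.25\t12000\n"
)


def read_progress(path):
    """Parse a tab-separated progress file into a list of float-valued rows."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [{k: float(v) for k, v in row.items()} for row in reader]


with tempfile.TemporaryDirectory() as run_dir:
    path = os.path.join(run_dir, "progress.txt")
    with open(path, "w") as f:
        f.write(sample)
    rows = read_progress(path)

best = max(rows, key=lambda r: r["AverageEpRet"])
print(int(best["Epoch"]), best["AverageEpRet"])  # 2 210.25
```

For a real run, point read_progress at the progress.txt inside the run's output directory.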

Loading and Testing Trained Policies

PyTorch

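import gym
import torch

# Load the full ActorCritic module saved during training.
ac = torch.load('path/to/pyt_save/model.pt')

# Recreate the environment the policy was trained on (example name; use your own).
env = gym.make('HalfCheetah-v2')

obs, done, ep_ret = env.reset(), False, 0.0
while not done:
    # ac.act returns the action as a NumPy array, computed without gradients.
    action = ac.act(torch.as_tensor(obs, dtype=torch.float32))
    obs, rew, done, _ = env.step(action)  # old Gym API: step returns a 4-tuple
    ep_ret += rew
print('Episode return:', ep_ret)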

Command-Line Test Runner

# Run 100 test episodes and render:
python -m spinup.run test_policy path/to/output_dir

# Options:
#  -n 50          run 50 episodes instead of 100
#  -l 500         max 500 steps per episode
#  -nr            no rendering (faster, just print returns)
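
The roles of the episode count (-n) and the step cap (-l) are easy to see in a hand-written evaluation loop. The sketch below is not the test_policy implementation. CountdownEnv is a hypothetical toy environment using the old Gym API, run_episodes is an illustrative helper, and treating max_ep_len=0 as "no cap" is an assumption of this sketch.

```python
class CountdownEnv:
    """Hypothetical toy environment: reward 1 per step, ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return float(self.t), 1.0, done, {}  # old Gym API: 4-tuple


def run_episodes(env, get_action, num_episodes=100, max_ep_len=0):
    """Return a list of episode returns; max_ep_len=0 means no step cap."""
    returns = []
    for _ in range(num_episodes):
        obs, ep_ret, ep_len, done = env.reset(), 0.0, 0, False
        while not done and (max_ep_len == 0 or ep_len < max_ep_len):
            obs, rew, done, _ = env.step(get_action(obs))
            ep_ret += rew
            ep_len += 1
        returns.append(ep_ret)
    return returns


env = CountdownEnv(horizon=5)
policy = lambda obs: 0  # placeholder for a trained policy's action function

print(run_episodes(env, policy, num_episodes=3))                # [5.0, 5.0, 5.0]
print(run_episodes(env, policy, num_episodes=2, max_ep_len=3))  # [3.0, 3.0]
```

With a real saved policy, get_action would wrap the loaded model's act method and env would be the training environment.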

Plotting Results

# Plot a single run:
python -m spinup.run plot path/to/output_dir

# Compare multiple runs on the same axes:
python -m spinup.run plot path/to/run1 path/to/run2

# Multiple seeds in the same directory are automatically averaged:
python -m spinup.run plot path/to/exp_folder/

The plotter reads progress.txt and plots mean ± std across seeds.
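
The averaging across seeds can be reproduced by hand as a check on what a plot shows. The sketch below uses synthetic AverageEpRet curves (invented numbers, one list per seed) and computes the per-epoch mean and standard deviation. It uses the population standard deviation; whether the plotter uses the population or sample form is not assumed here.

```python
from statistics import mean, pstdev

# Synthetic per-seed learning curves, one value per epoch (made-up numbers).
curves = {
    0: [10.0, 50.0, 90.0],
    10: [20.0, 40.0, 110.0],
    20: [30.0, 60.0, 100.0],
}

# Regroup so each tuple holds one epoch's values across all seeds.
per_epoch = list(zip(*curves.values()))
summary = [(mean(vals), pstdev(vals)) for vals in per_epoch]

for epoch, (m, s) in enumerate(summary):
    print(f"epoch {epoch}: {m:.1f} +/- {s:.2f}")
```

This assumes every seed logged the same number of epochs; curves of unequal length would need to be truncated or aligned on TotalEnvInteracts first.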

Practical Benchmarking Tips

  1. Always run multiple seeds (3–5 minimum). RL is stochastic and a single seed can look great or terrible by chance.
  2. Report mean and std, not just the best seed.
  3. Plot against environment steps, not wall-clock time: put TotalEnvInteracts on the x-axis so comparisons stay fair across machines.
  4. Baseline against VPG: VPG is a useful sanity check. If PPO doesn't match or beat VPG on the same task, suspect a bug or a poor hyperparameter choice.
  5. Tune on simple environments first: CartPole, Pendulum, or HalfCheetah. Don't waste compute on complex envs before confirming basic sanity.