Running, Logging & Benchmarking
- Run any Spinning Up algorithm from the command line using spinup.run, selecting PyTorch vs TensorFlow and setting hyperparameters as flags
- Use ExperimentGrid to define a hyperparameter sweep and launch all combinations programmatically
- Interpret the experiment output directory (config.json, progress.txt, pyt_save/), load a saved policy with test_policy, and plot results with the Spinning Up plotter
Running from the Command Line
Spinning Up ships with spinup/run.py, a unified launcher for all algorithms:
python -m spinup.run [algo] [flags]
Choosing PyTorch or TensorFlow
# PyTorch version:
python -m spinup.run ppo_pytorch --env HalfCheetah-v2 --exp_name ppo-hc
# TensorFlow version:
python -m spinup.run ppo_tf1 --env HalfCheetah-v2 --exp_name ppo-hc-tf
# Default (reads spinup/user_config.py):
python -m spinup.run ppo --env HalfCheetah-v2
Setting Hyperparameters
Every keyword argument for the algorithm function can be set as a flag:
python -m spinup.run ppo_pytorch --env Walker2d-v2 \
--exp_name walker-ppo \
--epochs 200 \
--gamma 0.99 \
--clip_ratio 0.2 \
--hid[h] [64,64] # shortcut for ac_kwargs:hidden_sizes
Values pass through eval(), so you can pass Python expressions:
--act torch.nn.ELU # sets activation function directly
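The eval() behaviour described above can be pictured with a small sketch. It assumes a "try to evaluate, otherwise keep the string" rule; `parse_flag_value` is a hypothetical helper written for illustration, not a function exported by spinup.

```python
def parse_flag_value(raw):
    """Hypothetical helper: evaluate a flag value as Python, else keep the raw string."""
    try:
        return eval(raw)
    except Exception:
        # Not a valid expression (e.g. an environment name), so treat it as a string.
        return raw


print(parse_flag_value("[64,64]"))         # parsed into a list of two ints
print(parse_flag_value("0.99"))            # parsed into a float
print(parse_flag_value("HalfCheetah-v2"))  # evaluation fails, stays a string
```

Because eval() executes arbitrary Python, flag values should only come from a trusted source such as your own shell.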
Running Multiple Experiments
Provide multiple values for any argument to run all combinations in series:
# Three seeds, two architectures = 6 experiments total:
python -m spinup.run ppo_pytorch --env Walker2d-v2 \
--seed 0 10 20 \
--hid[h] [64,64] [256,256]
Experiments run in series (not parallel) because each one uses enough compute to saturate a machine.
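The "all combinations" rule is a Cartesian product over the flag values. The sketch below reproduces the count for the command above using only the standard library; the dictionary keys and the iteration order are illustrative assumptions and may not match the order in which the launcher actually runs the variants.

```python
from itertools import product

# Values taken from the command above; each flag contributes one axis of the grid.
seeds = [0, 10, 20]
hidden_sizes = [[64, 64], [256, 256]]

# One dictionary per experiment: every seed paired with every architecture.
variants = [{"seed": s, "hid": h} for s, h in product(seeds, hidden_sizes)]

for i, variant in enumerate(variants, start=1):
    print(i, variant)

print(len(variants), "experiments")
```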
ExperimentGrid: Sweeps from Python
For larger hyperparameter searches, use ExperimentGrid in a Python script:
from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch
import torch
eg = ExperimentGrid(name='ppo-sweep')
eg.add('env_name', 'HalfCheetah-v2', '', in_name=True)
eg.add('seed', [10 * i for i in range(5)]) # 5 seeds
eg.add('epochs', 200)
eg.add('ac_kwargs:hidden_sizes', [(64, 64), (256, 256)], 'hid') # 2 architectures
eg.add('ac_kwargs:activation', [torch.nn.Tanh, torch.nn.ReLU], '') # 2 activations
eg.run(ppo_pytorch, num_cpu=1) # runs 5 × 2 × 2 = 20 experiments
Key eg.add() parameters:
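- key: the algorithm keyword argument to vary (e.g. 'seed' or 'ac_kwargs:hidden_sizes')
- vals: a single value or a list of values to try
- shorthand: abbreviation used for this parameter when naming save directories
- in_name=True: always include this parameter in the directory name, even if it has only one value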
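Colon-separated keys such as ac_kwargs:hidden_sizes address an entry inside a nested dictionary argument. The sketch below shows one way such keys could be expanded into the keyword arguments passed to the algorithm. `unflatten` is a hypothetical helper for illustration, not the code ExperimentGrid uses internally, and the activation is a placeholder string so the example runs without PyTorch.

```python
def unflatten(flat):
    """Hypothetical helper: turn 'a:b' keys into nested dicts {'a': {'b': ...}}."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(":")
        node = nested
        for part in parents:
            # Create the intermediate dictionary the first time it is needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


variant = {
    "seed": 0,
    "ac_kwargs:hidden_sizes": (64, 64),
    "ac_kwargs:activation": "Tanh",  # placeholder standing in for torch.nn.Tanh
}
kwargs = unflatten(variant)
print(kwargs)
```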
Experiment Outputs
Each experiment run saves to:
data_dir/[timestamp_]exp_name[suffix]/[timestamp_]exp_name[suffix]_s[seed]/
config.json ← hyperparameters used (for record-keeping)
progress.txt ← tab-separated training metrics by epoch
vars.pkl ← pickled environment (if serializable)
pyt_save/ ← PyTorch: model.pt (ActorCritic nn.Module)
tf1_save/ ← TF: SavedModel directory
progress.txt columns
Typical columns logged by Spinning Up algorithms:
| Column | Meaning |
|---|---|
| Epoch | Current epoch number |
| AverageEpRet | Mean return over episodes completed during the epoch |
| StdEpRet | Standard deviation of episode returns |
| MaxEpRet | Best episode return |
| MinEpRet | Worst episode return |
| EpLen | Mean episode length |
| TotalEnvInteracts | Total environment steps so far |
| LossPi | Policy loss (not a reliable progress metric) |
| LossV | Value function loss |
| KL | Approximate KL divergence (on-policy algorithms) |
| Time | Wall-clock time elapsed |
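Key rule: Watch AverageEpRet, not LossPi. The policy loss is computed on data collected by the current policy, which changes every epoch, so its value is not a meaningful diagnostic of performance.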
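To follow AverageEpRet programmatically rather than by eye, progress.txt can be parsed as a tab-separated file with the standard library. This is a minimal sketch: `read_column` and `summarize_returns` are hypothetical helper names, and the code assumes the file has a header row containing an AverageEpRet column.

```python
import csv


def read_column(path, column):
    """Return one column of a tab-separated progress file as a list of floats."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [float(row[column]) for row in reader]


def summarize_returns(path):
    """Report the last and best per-epoch AverageEpRet, plus the epoch count."""
    returns = read_column(path, "AverageEpRet")
    return {"last": returns[-1], "best": max(returns), "epochs": len(returns)}
```

A call such as summarize_returns('path/to/output_dir/progress.txt') would then give a quick check on whether returns are still improving.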
Loading and Testing Trained Policies
PyTorch
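import gym
import torch

# Load the full ActorCritic module saved during training.
ac = torch.load('path/to/pyt_save/model.pt')
# Assumption: use the same environment the policy was trained on.
env = gym.make('HalfCheetah-v2')

# Classic Gym API: reset() returns obs, step() returns four values.
obs = env.reset()
while True:
    action = ac.act(torch.as_tensor(obs, dtype=torch.float32))
    obs, rew, done, _ = env.step(action)
    if done:
        break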
Command-Line Test Runner
# Run 100 test episodes and render:
python -m spinup.run test_policy path/to/output_dir
# Options:
# -n 50 run 50 episodes instead of 100
# -l 500 max 500 steps per episode
# -nr no rendering (faster, just print returns)
Plotting Results
# Plot a single run:
python -m spinup.run plot path/to/output_dir
# Compare multiple runs on the same axes:
python -m spinup.run plot path/to/run1 path/to/run2
# Multiple seeds in the same directory are automatically averaged:
python -m spinup.run plot path/to/exp_folder/
The plotter reads progress.txt and plots mean ± std across seeds.
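The mean ± std aggregation can be illustrated without the plotter. The sketch below takes one return curve per seed and computes the per-epoch mean and population standard deviation. It is not the plotter's own code, `mean_std_across_seeds` is a hypothetical helper, and the numbers are placeholders rather than real results.

```python
from statistics import mean, pstdev


def mean_std_across_seeds(curves):
    """Given one return curve per seed, return (mean, std) for each epoch."""
    n_epochs = min(len(c) for c in curves)  # truncate to the shortest run
    stats = []
    for epoch in range(n_epochs):
        values = [c[epoch] for c in curves]
        stats.append((mean(values), pstdev(values)))
    return stats


# Placeholder numbers for illustration only, not real results.
curves = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
for epoch, (m, s) in enumerate(mean_std_across_seeds(curves)):
    print(f"epoch {epoch}: {m:.2f} ± {s:.2f}")
```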
Practical Benchmarking Tips
- Always run multiple seeds (3–5 minimum). RL is stochastic and a single seed can look great or terrible by chance.
- Report mean and std, not just the best seed.
- Wall-clock vs. environment steps: report TotalEnvInteracts on the x-axis for fair comparisons across machines.
- Baseline against VPG: VPG is a sanity check. If PPO doesn't beat VPG, something is wrong.
- Tune on simple environments first: CartPole, Pendulum, or HalfCheetah. Don't waste compute on complex envs before confirming basic sanity.