Visual RL with ViZDoom
Setup
ViZDoom ships with prebuilt Linux wheels — no compilation required on Colab:
pip install vizdoom "stable-baselines3[extra]" gymnasium imageio imageio-ffmpeg opencv-python-headless
After installing, import the gymnasium wrapper to register environments:
import vizdoom.gymnasium_wrapper # registers VizdoomXxx-v1 envs
import gymnasium as gym
Exercise 1: Train PPO on VizdoomBasic-v1
The simplest ViZDoom scenario: the agent faces a stationary monster in a rectangular room. The scenario defines three discrete actions: move left, move right, and shoot (the Gymnasium wrapper may also expose a no-op, so check env.action_space). Reward: +101 for a kill, −5 per wasted shot.
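To make the reward structure concrete, here is a small arithmetic sketch of the episode return. The kill and wasted-shot values are the ones quoted above; the living_penalty parameter is an assumption added for illustration (the stock scenario config is believed to charge a small per-tic cost, so check your installed basic.cfg), and it defaults to 0.0 so the numbers match the description.

```python
KILL_REWARD = 101.0      # from the scenario description
WASTED_SHOT_COST = 5.0   # from the scenario description


def basic_episode_return(killed, wasted_shots, steps=0, living_penalty=0.0):
    """Undiscounted return of one Basic episode.

    living_penalty is a hypothetical per-step cost; 0.0 reproduces the
    two-term reward quoted in the text.
    """
    total = KILL_REWARD if killed else 0.0
    total -= WASTED_SHOT_COST * wasted_shots
    total -= living_penalty * steps
    return total


# A clean kill with no misses.
print(basic_episode_return(killed=True, wasted_shots=0))    # 101.0
# Two misses before the kill.
print(basic_episode_return(killed=True, wasted_shots=2))    # 91.0
# An untrained agent that fires 30 times and never hits.
print(basic_episode_return(killed=False, wasted_shots=30))  # -150.0
```

The last case shows why early training returns are strongly negative: a policy that shoots at random pays for every miss long before it learns to line up the target.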
We preprocess each frame with a DoomScreenWrapper (grayscale + resize to 84×84) and train PPO with SB3's CnnPolicy:
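from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecMonitor

# train_env and eval_cb are assumed to be defined in earlier notebook cells
# (the vectorised, preprocessed environment and an evaluation callback).
model = PPO(
    policy='CnnPolicy',
    env=train_env,
    n_steps=512,           # steps collected per environment before each update
    batch_size=64,         # minibatch size for each gradient step
    n_epochs=4,            # passes over each rollout
    ent_coef=0.01,         # entropy bonus that keeps exploration alive
    learning_rate=2.5e-4,
    seed=0,
)
model.learn(total_timesteps=300_000, callback=eval_cb, progress_bar=True)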
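It can help to translate these hyperparameters into how often the policy is actually updated. The sketch below is plain arithmetic, not an SB3 API call. It assumes n_envs=4, which is not stated for train_env (4 is simply the default used by make_doom_stacked in Exercise 2), and it assumes PPO always finishes a full rollout before checking the step budget.

```python
import math


def ppo_schedule(total_timesteps, n_steps, n_envs, batch_size, n_epochs):
    """Derive rollout and update counts from PPO hyperparameters."""
    rollout_size = n_steps * n_envs                      # transitions per update
    minibatches = math.ceil(rollout_size / batch_size)   # per epoch
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches,
        "grad_steps_per_rollout": n_epochs * minibatches,
        "n_rollouts": n_rollouts,
        "env_steps_collected": n_rollouts * rollout_size,
    }


# n_envs=4 is an assumption; the other values are copied from the PPO call.
schedule = ppo_schedule(300_000, n_steps=512, n_envs=4, batch_size=64, n_epochs=4)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Under those assumptions each update sees 2048 transitions, split into 32 minibatches and replayed for 4 epochs (128 gradient steps), and training runs for 147 updates, or 301,056 environment steps in total.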
After training, gameplay is recorded directly in the notebook with imageio and displayed inline as an HTML5 video.
Tasks:
- At what step does the return first rise above zero?
- Try ent_coef=0.0: does the agent converge faster or get stuck earlier?
- What does SB3's CnnPolicy CNN encoder look like? (check model.policy.features_extractor)
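When you print the feature extractor, the layer shapes can be checked by hand. The sketch below assumes the default extractor follows the Nature DQN layout (three unpadded convolutions with kernel/stride 8/4, 4/2, 3/1 and 32, 64, 64 output channels); confirm the numbers against what model.policy.features_extractor actually prints.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel, stride); assumed Nature DQN layout, not read from SB3.
ASSUMED_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def cnn_shapes(height=84, width=84, layers=ASSUMED_LAYERS):
    """Return the (channels, height, width) shape after each conv layer."""
    shapes = []
    for out_channels, kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
        shapes.append((out_channels, height, width))
    return shapes


shapes = cnn_shapes()
print(shapes)  # [(32, 20, 20), (64, 9, 9), (64, 7, 7)]
channels, height, width = shapes[-1]
print(channels * height * width)  # 3136 features entering the linear layer
```

Note that the number of input channels (1 for a single grayscale frame, 4 with frame stacking) changes only the first layer's weights, not the spatial sizes.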
Exercise 2: Frame Stacking
A single frame gives the agent no information about motion. Frame stacking concatenates the last k frames along the channel dimension, which turns a partially observable MDP into an approximately Markovian one. This is the technique used in DeepMind's original DQN.
from stable_baselines3.common.vec_env import VecFrameStack
def make_doom_stacked(env_id, n_envs=4, n_stack=4, seed=0):
    base = make_doom(env_id, n_envs=n_envs, seed=seed)
    return VecFrameStack(base, n_stack=n_stack)
Observation shape: (84, 84, 1) → (84, 84, 4).
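To see what stacking along the channel dimension does, here is a dependency-free sketch using nested lists in place of image arrays. FrameStacker and its helpers are illustrative names, not part of SB3. Padding the history with zero frames on reset is an assumption about the reset convention; some implementations repeat the first frame instead.

```python
from collections import deque


def zeros_like(frame):
    """Return an all-zero frame with the same H x W x C layout."""
    return [[[0 for _ in pixel] for pixel in row] for row in frame]


class FrameStacker:
    """Keeps the last k frames and joins them along the channel axis."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)  # oldest frame is dropped automatically

    def reset(self, frame):
        self.frames.clear()
        for _ in range(self.frames.maxlen - 1):
            self.frames.append(zeros_like(frame))  # assumed zero padding
        self.frames.append(frame)
        return self.stacked()

    def step(self, frame):
        self.frames.append(frame)
        return self.stacked()

    def stacked(self):
        first = self.frames[0]
        # For every pixel, concatenate its channels from oldest to newest frame.
        return [
            [
                [c for f in self.frames for c in f[i][j]]
                for j in range(len(first[0]))
            ]
            for i in range(len(first))
        ]


def constant_frame(value, height=2, width=2):
    """Tiny single-channel frame filled with one value."""
    return [[[value] for _ in range(width)] for _ in range(height)]


def shape(frame):
    return (len(frame), len(frame[0]), len(frame[0][0]))


stacker = FrameStacker(k=4)
obs = stacker.reset(constant_frame(1))
print(shape(constant_frame(1)), "->", shape(obs))  # (2, 2, 1) -> (2, 2, 4)
print(obs[0][0])  # [0, 0, 0, 1]
obs = stacker.step(constant_frame(2))
obs = stacker.step(constant_frame(3))
print(obs[0][0])  # [0, 1, 2, 3]
```

Each pixel now carries a short history, so the difference between neighbouring channels encodes motion that a single frame cannot show.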
Discussion:
- On the Basic scenario (VizdoomBasic-v1) the gap between k=1 and k=4 is small. Why? (Hint: the monster doesn't move.)
- Which ViZDoom scenarios would benefit most from temporal context?
Exercise 3: VizdoomDefendCenter-v1
The agent stands in a circular arena. Monsters spawn at the perimeter and walk toward the center. Actions: rotate left, rotate right, shoot. The agent must track moving targets and fire before being overwhelmed.
This requires temporal reasoning: a policy that sees only single frames cannot tell which way a monster is moving or how fast, so it is likely to struggle here.
train_dc = make_doom_stacked('VizdoomDefendCenter-v1', n_envs=8, n_stack=4)
model_dc = PPO('CnnPolicy', train_dc, ent_coef=0.005, ...)
model_dc.learn(total_timesteps=500_000)
Analysis:
- How does the DefendCenter learning curve compare to Basic (steps to first reward, final return, variance)?
- Why does visual RL need far more steps than state-based RL for equivalent tasks?
- Which would help most: larger CNN, more frames, curiosity bonus, or prioritised replay?
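For the learning-curve comparison, a small helper can turn a curve into the three quantities asked about: steps to first reward, final return, and variance. This is a sketch with hypothetical names. The numbers in the usage example are synthetic placeholders, not measured results. If your evaluation callback is SB3's EvalCallback with a log_path, the timesteps and per-evaluation mean returns can be taken from the evaluations.npz file it saves.

```python
from statistics import mean, pvariance


def summarize_curve(timesteps, returns, threshold=0.0, tail=5):
    """Summarise a learning curve given evaluation steps and mean returns.

    first_step_above: first timestep whose return exceeds threshold (or None).
    final_return / final_variance: mean and population variance of the
    last `tail` evaluations.
    """
    first = next(
        (t for t, r in zip(timesteps, returns) if r > threshold), None
    )
    tail_returns = list(returns[-tail:])
    return {
        "first_step_above": first,
        "final_return": mean(tail_returns),
        "final_variance": pvariance(tail_returns),
    }


# Synthetic placeholder curve, for illustration only.
steps = [10_000, 20_000, 30_000, 40_000]
mean_returns = [-120.0, -30.0, 15.0, 60.0]
print(summarize_curve(steps, mean_returns, threshold=0.0, tail=2))
```

Running the same helper on the Basic and DefendCenter curves gives directly comparable numbers, as long as both runs use the same evaluation frequency and the same tail length.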