Visual RL with ViZDoom

Colab Notebook · Python · ~75 min
Lab Objectives

1. Install ViZDoom with the Gymnasium wrapper and inspect observation and action spaces for VizdoomBasic-v1 and VizdoomDefendCenter-v1
2. Build a DoomScreenWrapper that extracts the screen buffer, converts to grayscale, and resizes to 84×84, matching the standard Atari preprocessing used in DQN and PPO papers
3. Train PPO with CnnPolicy on VizdoomBasic-v1, plot evaluation return curves, and record in-notebook gameplay video with imageio
4. Add frame stacking via VecFrameStack (k=4), retrain on the same scenario, and compare single-frame vs stacked convergence speed and final return
5. Scale to VizdoomDefendCenter-v1 (moving enemies, rotation required) and explain why this scenario requires temporal context while Basic does not
6. Analyse how visual RL sample efficiency compares to state-based RL and identify which technique (larger CNN, more frames, exploration bonus, prioritised replay) would help most on each scenario

Setup

ViZDoom ships with prebuilt Linux wheels — no compilation required on Colab:

pip install vizdoom "stable-baselines3[extra]" gymnasium imageio imageio-ffmpeg opencv-python-headless

After installing, import the gymnasium wrapper to register environments:

import vizdoom.gymnasium_wrapper   # registers VizdoomXxx-v1 envs
import gymnasium as gym

Exercise 1: Train PPO on VizdoomBasic-v1

The simplest ViZDoom scenario: the agent faces a stationary monster in a rectangular room. Three discrete actions — move left, move right, shoot. Reward: +101 for a kill, −5 per wasted shot.

We preprocess each frame with a DoomScreenWrapper (grayscale + resize to 84×84) and train PPO with SB3's CnnPolicy:
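The DoomScreenWrapper itself is not shown in this section. As a rough sketch of the per-frame transform it describes, the function below converts an RGB frame to grayscale and resizes it to 84×84 using NumPy only, so it runs without ViZDoom installed. The name preprocess_frame is hypothetical, the input is assumed to be a uint8 array of shape (H, W, 3), and the nearest-neighbour resize is a simple stand-in for whatever OpenCV interpolation the notebook actually uses.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 84) -> np.ndarray:
    """Convert an RGB uint8 frame (H, W, 3) to grayscale (size, size, 1)."""
    # Luminance-weighted grayscale (ITU-R BT.601 weights).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame[..., :3].astype(np.float32) @ weights
    # Nearest-neighbour resize: pick evenly spaced source rows and columns.
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    small = gray[rows[:, None], cols[None, :]]
    # Return uint8 with a trailing channel axis (channels-last image).
    return np.clip(np.rint(small), 0, 255).astype(np.uint8)[..., None]

# Synthetic 240x320 RGB frame, a stand-in for a ViZDoom screen buffer.
frame = np.random.default_rng(0).integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
obs = preprocess_frame(frame)
print(obs.shape, obs.dtype)
```

Inside a Gymnasium observation wrapper, a function like this would be applied to the screen image of each observation before it reaches the policy.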

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecMonitor

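# train_env (the preprocessed, vectorized environment) and eval_cb (the
# evaluation callback) are assumed to be defined in earlier notebook cells.
model = PPO(
    policy='CnnPolicy',
    env=train_env,
    n_steps=512,           # rollout length per environment
    batch_size=64,         # minibatch size for each gradient step
    n_epochs=4,            # optimisation passes over each rollout
    ent_coef=0.01,         # entropy bonus to encourage exploration
    learning_rate=2.5e-4,
    seed=0,
)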
model.learn(total_timesteps=300_000, callback=eval_cb, progress_bar=True)

After training, gameplay is recorded directly in the notebook with imageio and displayed inline as an HTML5 video.

Tasks:

  • At what step does the return first rise above zero?
  • Try ent_coef=0.0 — does the agent converge faster or get stuck earlier?
  • What does SB3's CnnPolicy CNN encoder look like? (check model.policy.features_extractor)
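For the first task, one way to read the answer off the evaluation log rather than the plot is sketched below. It assumes eval_cb is an SB3 EvalCallback created with a log_path, in which case the evaluation timesteps and per-episode returns are saved to an evaluations.npz file under the keys "timesteps" and "results". The helper name first_step_above is hypothetical, and the numbers in the demo are synthetic, not measured results.

```python
import numpy as np

def first_step_above(timesteps, results, threshold=0.0):
    """Return the first evaluation timestep whose mean return exceeds threshold, or None."""
    timesteps = np.asarray(timesteps)
    # results has shape (n_evals, n_eval_episodes); average over episodes.
    returns = np.asarray(results, dtype=float).reshape(len(timesteps), -1)
    mean_returns = returns.mean(axis=1)
    above = np.flatnonzero(mean_returns > threshold)
    return int(timesteps[above[0]]) if above.size else None

# Synthetic numbers for illustration only.
demo_steps = [10_000, 20_000, 30_000, 40_000]
demo_results = [[-300.0, -250.0], [-120.0, -40.0], [-10.0, 35.0], [60.0, 80.0]]
print(first_step_above(demo_steps, demo_results))
```

With a real run, the two arrays would come from np.load on the saved evaluations.npz file; its location depends on the log_path chosen when the callback was built.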

Exercise 2: Frame Stacking

A single frame gives the agent no information about motion. Frame stacking concatenates the last k frames along the channel dimension, which turns a partially observable MDP into an approximately Markovian one. DeepMind's original DQN used the same technique.

from stable_baselines3.common.vec_env import VecFrameStack

def make_doom_stacked(env_id, n_envs=4, n_stack=4, seed=0):
    base = make_doom(env_id, n_envs=n_envs, seed=seed)
    return VecFrameStack(base, n_stack=n_stack)

Observation shape: (84, 84, 1) → (84, 84, 4).
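To make the shape change concrete, here is a small NumPy sketch of what a frame stacker does for a single environment. FrameStacker is a hypothetical class written for illustration, not the SB3 implementation, and filling the buffer with copies of the first frame on reset is an assumption of this sketch; VecFrameStack handles reset padding itself and may do it differently.

```python
from collections import deque

import numpy as np

class FrameStacker:
    """Keep the last k frames and expose them stacked on the channel axis."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # This sketch fills the buffer with copies of the first frame.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.stacked()

    def step(self, frame: np.ndarray) -> np.ndarray:
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.stacked()

    def stacked(self) -> np.ndarray:
        # k arrays of shape (H, W, 1) become one (H, W, k) array, oldest first.
        return np.concatenate(list(self.frames), axis=-1)

stacker = FrameStacker(k=4)
obs = stacker.reset(np.zeros((84, 84, 1), dtype=np.uint8))
obs = stacker.step(np.full((84, 84, 1), 7, dtype=np.uint8))
print(obs.shape)
```

The newest frame always sits in the last channel, so the policy's CNN sees a short history at every step without any recurrent state.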

Discussion:

  • On VizdoomBasic-v1 the gap between k=1 and k=4 is small. Why? (Hint: the monster doesn't move.)
  • Which ViZDoom scenarios would benefit most from temporal context?

Exercise 3: VizdoomDefendCenter-v1

The agent stands in a circular arena. Monsters spawn at the perimeter and walk toward the center. Actions: rotate left, rotate right, shoot. The agent must track moving targets and fire before being overwhelmed.

This requires temporal reasoning: a single frame shows where a monster is, but not its direction or speed of approach, so a policy that sees only static frames is at a disadvantage here.
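The point about static frames can be illustrated with a toy example that is independent of ViZDoom. The one-dimensional "screen" and the render helper below are purely hypothetical: two targets end up at the same pixel after moving in opposite directions, so their latest frames match while their short histories do not.

```python
import numpy as np

def render(position: int, width: int = 8) -> np.ndarray:
    """Render a 1-D 'screen' with a single bright pixel at the given position."""
    screen = np.zeros(width, dtype=np.uint8)
    screen[position] = 255
    return screen

# Two targets that are at the same place now but arrived from opposite sides.
moving_right = [render(2), render(3), render(4)]
moving_left = [render(6), render(5), render(4)]

# The latest frames are identical, so a single-frame policy cannot tell them apart.
same_single = np.array_equal(moving_right[-1], moving_left[-1])

# Stacking the last three frames makes the two histories distinguishable.
same_stacked = np.array_equal(np.stack(moving_right), np.stack(moving_left))

print(same_single, same_stacked)
```

A stacked observation therefore carries direction and speed implicitly, which is the information the agent needs when deciding which way to rotate.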

train_dc = make_doom_stacked('VizdoomDefendCenter-v1', n_envs=8, n_stack=4)
model_dc = PPO('CnnPolicy', train_dc, ent_coef=0.005)  # ... other hyperparameters not shown
model_dc.learn(total_timesteps=500_000)

Analysis:

  • How does the DefendCenter learning curve compare to Basic (steps to first reward, final return, variance)?
  • Why does visual RL need far more steps than state-based RL for equivalent tasks?
  • Which would help most: larger CNN, more frames, curiosity bonus, or prioritised replay?