ViZDoom and 3D Visual RL

By the end of this reading you will be able to:
  • Compare ViZDoom to Atari as a benchmark: identify what makes 3D first-person environments harder and what additional skills agents must learn
  • Order ViZDoom's built-in scenarios by difficulty and identify the key skill each scenario isolates
  • Explain why ViZDoom Deathmatch requires significantly more training steps and more capable architectures than Basic or DefendCenter

Why ViZDoom?

Atari established that CNNs can learn to play 2D games from pixels. ViZDoom (Kempka et al., 2016) asks the harder question: can agents learn in 3D, first-person, partially observable environments?

The Doom game engine provides:

  • Fully configurable scenarios (maps, enemies, rewards, available weapons)
  • First-person 3D rendering with genuine depth, perspective, and occlusion
  • Deterministic or stochastic enemy behaviour
  • A Python API for programmatic control
  • A Gymnasium wrapper registering environments as VizdoomXxx-v1

The gap between Atari and ViZDoom is the gap between a flat sprite game and a real 3D world — and it turns out to be large.

What 3D Adds Over Atari

Depth and Perspective

In Atari, objects are flat sprites whose size on screen does not change with distance. In Doom, a monster 10 units away appears larger in the frame than the same monster 50 units away. The agent must learn that apparent size encodes distance, a form of 3D reasoning implicit in the visual input.
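The size-distance relation can be illustrated with a pinhole-camera approximation. This is a sketch, not ViZDoom's actual renderer: the focal length and the monster height are hypothetical values chosen only to show the inverse relationship.

```python
def apparent_height_px(object_height: float, distance: float,
                       focal_length_px: float = 100.0) -> float:
    """Projected height in pixels under a pinhole-camera model.

    focal_length_px is a hypothetical value, not taken from ViZDoom.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return focal_length_px * object_height / distance


MONSTER_HEIGHT = 5.0  # hypothetical height in map units

near = apparent_height_px(MONSTER_HEIGHT, distance=10.0)
far = apparent_height_px(MONSTER_HEIGHT, distance=50.0)

# Apparent size scales as 1/distance, so 5x closer means 5x taller on screen.
print(f"near: {near:.1f} px, far: {far:.1f} px, ratio: {near / far:.1f}")
```

A CNN never sees the distance directly; it has to infer it from cues such as this change in projected size.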

Genuine Occlusion

Enemies can be fully hidden behind walls and only reveal themselves when the agent (or enemy) moves. A single frame provides zero information about occluded objects. Frame stacking helps only if occlusion is brief — longer occlusion requires memory.

Active Exploration

Most Atari games show all relevant information on screen at all times. Many ViZDoom scenarios require the agent to turn and search to find enemies or goals. This introduces an exploration component that is absent from most Atari games.

Moving through 3D space with a consistent heading requires integrating a sequence of actions. The agent must learn that a sequence of TURN_LEFT actions rotates it to face a new direction — a spatial reasoning problem absent from Atari.
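What "integrating a sequence of actions" means can be sketched as a running heading update. The rotation per action used here is a hypothetical value (the real amount depends on the scenario configuration), and the sign convention (left turns increase the angle) is an assumption made for the example.

```python
TURN_DEGREES = 5.0  # hypothetical rotation per TURN action


def integrate_heading(actions, start_heading=0.0, turn_degrees=TURN_DEGREES):
    """Return the heading in degrees, in [0, 360), after applying actions in order."""
    heading = start_heading
    for action in actions:
        if action == "TURN_LEFT":
            heading += turn_degrees
        elif action == "TURN_RIGHT":
            heading -= turn_degrees
        # Any other action (e.g. ATTACK) leaves the heading unchanged.
    return heading % 360.0


# 18 left turns of 5 degrees each rotate the agent by 90 degrees.
quarter_turn = integrate_heading(["TURN_LEFT"] * 18)
# One right turn from heading 0 wraps around to 355 degrees.
wrapped = integrate_heading(["TURN_RIGHT"])
print(quarter_turn, wrapped)
```

The agent is never given this heading explicitly. It has to learn an equivalent internal estimate from pixels and its own action history.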

Scenario Hierarchy

ViZDoom ships with built-in scenarios, listed here in roughly increasing order of difficulty. Each isolates a specific skill:

| Scenario | Action space | Key skill | Typical steps to learn |
|---|---|---|---|
| VizdoomBasic-v1 | Move L/R, Shoot | Align and fire at stationary target | 100–300k |
| VizdoomDefendLine-v1 | Turn L/R, Shoot | Eliminate a wave of approaching enemies | 300–500k |
| VizdoomDefendCenter-v1 | Turn L/R, Shoot | Track moving targets from a fixed position | 500k–1M |
| VizdoomHealthGathering-v1 | Move, Turn | Navigate and collect health packs | 500k–1M |
| VizdoomCorridor-v1 | Full movement + shoot | Navigate a corridor with enemies, collect armour | 1–3M |
| VizdoomMyWayHome-v1 | Full movement | Maze navigation to a fixed goal | 3–10M |
| VizdoomPredictPosition-v1 | Shoot | Time a slow rocket so that it meets a moving enemy | 5–20M |
| VizdoomDeathmatch-v1 | Full movement + all weapons | Kill enemies across the full map | 50–500M |

Taking representative values from these ranges, the gap between Basic (about 200k steps) and Deathmatch (about 100M steps) is a factor of 500. This is not just more training time: Deathmatch requires qualitatively different capabilities, including navigation, target acquisition, weapon selection, ammo management, and spatial memory.

Curriculum Learning

Training directly on hard scenarios often fails: the reward signal is too sparse for the agent to discover the right behaviours from random exploration. Curriculum learning — training on progressively harder scenarios — is a practical solution:

  1. Train on Basic until the agent reliably kills the stationary enemy
  2. Transfer to DefendCenter (moving enemies) using the Basic policy as initialization
  3. Transfer to Corridor (navigation required)
  4. Transfer to Deathmatch

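Each transfer reuses the learned CNN encoder, so the visual features do not have to be relearned from scratch. The policy head has to adapt to the new task, and where the action space changes between scenarios (Basic moves left/right, DefendCenter turns) it must be re-initialised with the new output size. This is a form of transfer learning within a domain.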
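One way to drive the staged schedule above is a small scheduler that promotes the agent once its recent mean return clears a threshold. This is a minimal sketch: the class name, the window size, and all promotion thresholds are hypothetical placeholders, not published values.

```python
from collections import deque

# Stage order follows the curriculum above. The promotion thresholds are
# hypothetical placeholders, not published values.
CURRICULUM = [
    ("VizdoomBasic-v1", 80.0),
    ("VizdoomDefendCenter-v1", 15.0),
    ("VizdoomCorridor-v1", 1000.0),
    ("VizdoomDeathmatch-v1", float("inf")),  # final stage: never promoted
]


class CurriculumScheduler:
    """Advance to the next scenario once the recent mean return clears a threshold."""

    def __init__(self, stages, window=100):
        self.stages = stages
        self.window = window
        self.index = 0
        self.returns = deque(maxlen=window)

    @property
    def scenario(self):
        return self.stages[self.index][0]

    def report(self, episode_return):
        """Record one episode return; return True if the agent was promoted."""
        self.returns.append(episode_return)
        threshold = self.stages[self.index][1]
        window_full = len(self.returns) == self.window
        mean_return = sum(self.returns) / len(self.returns)
        is_last_stage = self.index == len(self.stages) - 1
        if window_full and mean_return >= threshold and not is_last_stage:
            self.index += 1
            self.returns.clear()  # statistics from the easier task no longer apply
            return True
        return False


scheduler = CurriculumScheduler(CURRICULUM, window=3)
for episode_return in [50.0, 90.0, 95.0, 85.0]:
    if scheduler.report(episode_return):
        # In a real pipeline: build the new env and reuse the encoder weights here.
        print("promoted to", scheduler.scenario)
```

Requiring a full window before promotion avoids advancing on a single lucky episode. In practice the thresholds would be tuned per scenario.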

Frame Stacking vs Recurrence on ViZDoom

For Basic and DefendCenter, frame stacking (k = 4) provides sufficient temporal context: enemies are always visible and their motion is smooth.
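Frame stacking itself is a simple fixed-length buffer. The sketch below uses a hypothetical `FrameStack` class with strings standing in for image arrays. Library wrappers do the same thing on real frames, but this is not any particular library's implementation.

```python
from collections import deque


class FrameStack:
    """Keep the k most recent frames and expose them as one observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Repeat the first frame k times so the stack always has k entries,
        # even at the start of an episode.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)


stack = FrameStack(k=4)
obs = stack.reset("frame0")  # strings stand in for image arrays
for t in range(1, 6):
    obs = stack.step(f"frame{t}")
print(obs)
```

After five steps the buffer holds only frames 2 to 5. Anything older has been discarded, which is why this mechanism cannot serve as long-term memory.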

For HealthGathering and MyWayHome, the agent needs to remember where it has been to avoid revisiting dead ends. Frame stacking with k = 4 covers only about 0.1 seconds of game time at 35 fps (4/35 ≈ 0.11 s, ignoring frame skip), which is nowhere near enough. These scenarios benefit from recurrent policies:

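# Stable Baselines3 itself has no LSTM-based CnnPolicy,
# but sb3-contrib provides RecurrentPPO.
# Install it first (notebook syntax): !pip install sb3-contrib
from sb3_contrib import RecurrentPPO

# `env` is the ViZDoom environment created beforehand.
# Pass further hyperparameters as keyword arguments as needed.
model = RecurrentPPO("CnnLstmPolicy", env)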

The recurrent state persists across steps, giving the agent a trainable memory that is not limited to k frames.
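The limit of frame stacking can be checked with a little arithmetic. The sketch below assumes 35 game tics per second, as quoted above. The frame-skip value of 4 and the 10-second memory horizon are hypothetical choices for illustration.

```python
import math

DOOM_FPS = 35  # frames per second, as quoted in the text


def stack_coverage_seconds(k, frame_skip=1, fps=DOOM_FPS):
    """Game time represented by k stacked frames."""
    return k * frame_skip / fps


def frames_needed(seconds, frame_skip=1, fps=DOOM_FPS):
    """Stack size required to cover a given span of game time."""
    return math.ceil(seconds * fps / frame_skip)


no_skip = stack_coverage_seconds(4)                  # 4 / 35
with_skip = stack_coverage_seconds(4, frame_skip=4)  # 16 / 35
ten_second_stack = frames_needed(10)                 # frames to cover 10 s

print(f"k=4, no frame skip: {no_skip:.2f} s")
print(f"k=4, frame skip 4:  {with_skip:.2f} s")
print(f"frames to cover 10 s: {ten_second_stack}")
```

Even with frame skipping, four frames cover well under a second, while covering ten seconds without frame skip would need hundreds of stacked frames. A recurrent state avoids this growth in input size.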

High-Throughput Training: Sample Factory

Stable Baselines3 achieves ~2,000–5,000 environment frames per second on a single Colab GPU. For Deathmatch-level scenarios, this is prohibitively slow: 100M steps would take ~6 hours with an optimistic 5k fps.

Sample Factory (Petrenko et al., 2020) is a high-throughput RL framework built around asynchronous actors:

  • Separate processes for environment simulation, inference, and learning run concurrently
  • GPU inference is batched across many parallel actors
  • Achieves 100,000–300,000 fps on a single machine with a modern GPU
  • Native ViZDoom integration via sf_examples.vizdoom

# Train an APPO (asynchronous PPO) agent on ViZDoom Deathmatch against bots with Sample Factory:
python -m sf_examples.vizdoom.train_vizdoom \
    --env=doom_deathmatch_bots \
    --num_workers=16 \
    --num_envs_per_worker=4 \
    --train_for_env_steps=100_000_000

At 200k fps, 100M steps takes ~8 minutes rather than 6 hours.
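The wall-clock estimates in this section follow from dividing the step budget by throughput. The sketch below reproduces them, treating environment steps and frames as the same unit (as the text does) and ignoring startup and evaluation overhead.

```python
def wall_clock_seconds(env_steps, frames_per_second):
    """Idealised training time: step budget divided by throughput."""
    return env_steps / frames_per_second


STEPS = 100_000_000  # Deathmatch-scale budget used in the text

for label, fps in [
    ("SB3 at 2k fps", 2_000),
    ("SB3 at 5k fps", 5_000),
    ("Sample Factory at 200k fps", 200_000),
]:
    seconds = wall_clock_seconds(STEPS, fps)
    print(f"{label}: {seconds / 3600:.2f} h ({seconds / 60:.1f} min)")
```

The 5k fps case comes to 20,000 seconds (a little under 6 hours) and the 200k fps case to 500 seconds (a little over 8 minutes), a 40x difference that comes entirely from throughput.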

Modern Visual RL Beyond ViZDoom

ViZDoom remains a widely used 3D visual RL benchmark, but the field has expanded:

| Environment | Domain | Key challenge |
|---|---|---|
| Atari 100k | Games | Extreme sample efficiency (only 100k steps allowed) |
| Procgen | Procedural games | Generalisation across level layouts |
| MineDojo / MineRL | Minecraft | Long-horizon tasks, open-ended goals |
| Habitat | Indoor navigation | Photo-realistic scenes, embodied AI |
| Isaac Gym / IsaacLab | Robotics | GPU-accelerated physics, sim-to-real |
| DM Lab | 3D navigation | Memory, navigation, multi-task |

The shared challenge across all of these is sample efficiency: how do you learn good visual representations faster, with fewer environment interactions? Active research directions include self-supervised auxiliary tasks, data augmentation (RAD, DrQ), world models (DreamerV3), and pre-trained visual encoders.
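As one concrete instance of the data-augmentation direction, the sketch below implements a random shift in the spirit of DrQ: pad the image by edge replication, then crop back to the original size at a random offset. It works on nested lists to stay dependency-free. The function name and the default padding are assumptions for illustration, and real implementations operate on batched tensors.

```python
import random


def random_shift(image, pad=4, rng=random):
    """Shift a 2D image by up to `pad` pixels in each direction.

    Equivalent to replicate-padding by `pad` and taking a random crop of the
    original size. `image` is a list of equal-length rows.
    """
    height, width = len(image), len(image[0])
    # Random offset of the crop window inside the padded image.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)

    def clamp(value, upper):
        return max(0, min(value, upper))

    # Output pixel (r, c) reads original pixel (top + r - pad, left + c - pad),
    # clamped to the border, which reproduces edge-replication padding.
    return [
        [image[clamp(top + r - pad, height - 1)][clamp(left + c - pad, width - 1)]
         for c in range(width)]
        for r in range(height)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
shifted = random_shift(image, pad=1, rng=random.Random(0))
unchanged = random_shift(image, pad=0)
print(shifted)
```

The augmented frame keeps its shape, so it can be fed to the same encoder. Only the position of the content changes, which encourages features that are robust to small translations.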