ViZDoom and 3D Visual RL
- Compare ViZDoom to Atari as a benchmark: identify what makes 3D first-person environments harder and what additional skills agents must learn
- Order ViZDoom's built-in scenarios by difficulty and identify the key skill each scenario isolates
- Explain why ViZDoom Deathmatch requires significantly more training steps and more capable architectures than Basic or DefendCenter
Why ViZDoom?
Atari established that CNNs can learn to play 2D games from pixels. ViZDoom (Kempka et al., 2016) asks the harder question: can agents learn in 3D, first-person, partially observable environments?
The Doom game engine provides:
- Fully configurable scenarios (maps, enemies, rewards, available weapons)
- First-person 3D rendering with genuine depth, perspective, and occlusion
- Deterministic or stochastic enemy behaviour
- A Python API for programmatic control
- A Gymnasium wrapper that registers environments as VizdoomXxx-v1
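As a minimal sketch of how the Gymnasium wrapper is typically used, the snippet below builds an environment ID from the VizdoomXxx-v1 pattern and plays one episode with random actions. The helper names `scenario_env_id` and `run_random_episode` are illustrative, not part of ViZDoom, and the version suffix is taken from the naming above; it may differ depending on the installed ViZDoom release.

```python
def scenario_env_id(scenario: str, version: int = 1) -> str:
    """Build an environment ID following the VizdoomXxx-v1 naming pattern."""
    return f"Vizdoom{scenario}-v{version}"


def run_random_episode(scenario: str = "Basic", max_steps: int = 500) -> float:
    """Play one episode with random actions and return the total reward."""
    # Imported lazily so scenario_env_id works without ViZDoom installed.
    import gymnasium
    from vizdoom import gymnasium_wrapper  # noqa: F401  (importing registers the environments)

    env = gymnasium.make(scenario_env_id(scenario))
    try:
        obs, info = env.reset(seed=0)
        total_reward = 0.0
        for _ in range(max_steps):
            action = env.action_space.sample()  # random policy, for illustration only
            obs, reward, terminated, truncated, info = env.step(action)
            total_reward += float(reward)
            if terminated or truncated:
                break
        return total_reward
    finally:
        env.close()
```

Calling `run_random_episode("Basic")` requires the `vizdoom` and `gymnasium` packages; a random policy is only a smoke test that the environment loads and steps.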
The gap between Atari and ViZDoom is the gap between a flat sprite game and a real 3D world — and it turns out to be large.
What 3D Adds Over Atari
Depth and Perspective
In Atari, objects are flat sprites whose on-screen size does not depend on distance. In Doom, a monster 10 units away appears larger in the frame than the same monster 50 units away. The agent must learn that apparent size encodes distance, a form of 3D reasoning implicit in the visual input.
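The size-distance relationship can be illustrated with an idealised pinhole camera. This is a sketch of the geometry only, not the Doom renderer; the screen height, field of view, and monster height below are assumed values chosen for illustration.

```python
import math


def apparent_height_px(object_height: float, distance: float,
                       screen_height_px: int = 240,
                       vertical_fov_deg: float = 60.0) -> float:
    """Projected height in pixels of an object under a pinhole camera model."""
    # Focal length in pixels, derived from the vertical field of view.
    focal_px = (screen_height_px / 2) / math.tan(math.radians(vertical_fov_deg) / 2)
    # Perspective projection: size on screen falls off as 1 / distance.
    return focal_px * object_height / distance


MONSTER_HEIGHT = 56.0  # assumed object height in map units
near = apparent_height_px(MONSTER_HEIGHT, distance=10.0)
far = apparent_height_px(MONSTER_HEIGHT, distance=50.0)
print(f"near/far size ratio: {near / far:.1f}")
```

Because projected size scales as 1/distance, moving the monster five times farther away shrinks it by a factor of five on screen, whatever the camera parameters are.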
Genuine Occlusion
Enemies can be fully hidden behind walls and only reveal themselves when the agent (or enemy) moves. A single frame provides zero information about occluded objects. Frame stacking helps only if occlusion is brief — longer occlusion requires memory.
Active Exploration
Most Atari games present all relevant information on screen at all times. Many ViZDoom scenarios require the agent to turn and search to find enemies or goals, which introduces an exploration component that most Atari games lack.
Navigation
Moving through 3D space with a consistent heading requires integrating a sequence of actions. The agent must learn that a sequence of TURN_LEFT actions rotates it to face a new direction — a spatial reasoning problem absent from Atari.
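To make the integration concrete, here is a small dead-reckoning sketch that accumulates heading and position from a sequence of discrete actions. The turn angle and step length are hypothetical constants, not ViZDoom's actual values, and the agent itself never sees these quantities; it has to infer them from pixels.

```python
import math

TURN_DEG = 5.0   # hypothetical rotation per TURN action, in degrees
STEP_LEN = 8.0   # hypothetical displacement per MOVE_FORWARD action, in map units


def integrate_actions(actions, x=0.0, y=0.0, heading_deg=0.0):
    """Dead-reckon the final pose (x, y, heading) from a list of action names."""
    for action in actions:
        if action == "TURN_LEFT":
            heading_deg = (heading_deg + TURN_DEG) % 360.0
        elif action == "TURN_RIGHT":
            heading_deg = (heading_deg - TURN_DEG) % 360.0
        elif action == "MOVE_FORWARD":
            x += STEP_LEN * math.cos(math.radians(heading_deg))
            y += STEP_LEN * math.sin(math.radians(heading_deg))
    return x, y, heading_deg


# Eighteen left turns of 5 degrees rotate the agent by 90 degrees,
# so the following forward step moves it along the y-axis.
pose = integrate_actions(["TURN_LEFT"] * 18 + ["MOVE_FORWARD"])
```

The same MOVE_FORWARD action produces a different displacement depending on everything that came before it, which is why navigation cannot be solved by reacting to single frames.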
Scenario Hierarchy
ViZDoom ships with built-in scenarios in increasing order of difficulty. Each isolates a specific skill:
| Scenario | Action space | Key skill | Typical steps to learn |
|---|---|---|---|
| VizdoomBasic-v1 | Move L/R, Shoot | Align and fire at stationary target | 100k–300k |
| VizdoomDefendLine-v1 | Turn L/R, Shoot | Eliminate a wave of approaching enemies | 300k–500k |
| VizdoomDefendCenter-v1 | Turn L/R, Shoot | Track moving targets from a fixed position | 500k–1M |
| VizdoomHealthGathering-v1 | Move, Turn | Navigate and collect health packs | 500k–1M |
| VizdoomCorridor-v1 | Full movement + shoot | Navigate a corridor with enemies, collect armour | 1–3M |
| VizdoomMyWayHome-v1 | Full movement | Maze navigation to a fixed goal | 3–10M |
| VizdoomPredictPosition-v1 | Shoot | Time a delayed rocket shot to hit a moving enemy | 5–20M |
| VizdoomDeathmatch-v1 | Full movement + all weapons | Kill enemies across the full map | 50–500M |
The gap between Basic (200k steps) and Deathmatch (100M steps) is a factor of 500. This is not just more training time — Deathmatch requires qualitatively different capabilities: navigation, target acquisition, weapon selection, ammo management, and spatial memory.
Curriculum Learning
Training directly on hard scenarios often fails: the reward signal is too sparse for the agent to discover the right behaviours from random exploration. Curriculum learning — training on progressively harder scenarios — is a practical solution:
- Train on Basic until the agent reliably kills the stationary enemy
- Transfer to DefendCenter (moving enemies) using the Basic policy as initialization
- Transfer to Corridor (navigation required)
- Transfer to Deathmatch
Each transfer re-uses the learned CNN encoder; only the policy head adapts to the new task. This is a form of transfer learning within a domain.
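The staged schedule above can be sketched as a small scheduler that promotes the agent to the next scenario once its recent success rate clears a threshold. The `Stage` and `CurriculumScheduler` names, the thresholds, and the window size are all hypothetical; the source does not specify a promotion criterion.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Stage:
    env_id: str
    success_threshold: float  # hypothetical promotion criterion


CURRICULUM = [
    Stage("VizdoomBasic-v1", 0.9),
    Stage("VizdoomDefendCenter-v1", 0.8),
    Stage("VizdoomCorridor-v1", 0.7),
    Stage("VizdoomDeathmatch-v1", 1.0),  # final stage, never promoted from
]


class CurriculumScheduler:
    """Tracks recent episode outcomes and decides when to move to the next stage."""

    def __init__(self, stages, window=100):
        self.stages = stages
        self.window = window
        self.index = 0
        self.recent = deque(maxlen=window)

    @property
    def current(self):
        return self.stages[self.index]

    def report(self, success: bool) -> bool:
        """Record one episode outcome; return True if the agent was promoted."""
        self.recent.append(1.0 if success else 0.0)
        window_full = len(self.recent) == self.window
        rate = sum(self.recent) / len(self.recent)
        is_last = self.index == len(self.stages) - 1
        if window_full and not is_last and rate >= self.current.success_threshold:
            self.index += 1
            self.recent.clear()  # start measuring afresh on the new scenario
            return True
        return False
```

In a training loop, a `True` return from `report` would be the point at which to build the next environment and copy the CNN encoder weights into the new model.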
Frame Stacking vs Recurrence on ViZDoom
For Basic and DefendCenter, frame stacking (4 frames) provides sufficient temporal context: enemies are always visible and their motion is smooth.
For HealthGathering and MyWayHome, the agent needs to remember where it has been to avoid revisiting dead ends. Frame stacking with 4 frames covers only ~0.1 seconds at 35 fps, nowhere near enough. These scenarios benefit from recurrent policies:
# SB3 does not directly expose LSTM-based CnnPolicy,
# but sb3-contrib provides RecurrentPPO:
!pip install sb3-contrib
from sb3_contrib import RecurrentPPO
model = RecurrentPPO('CnnLstmPolicy', env, ...)
The recurrent state persists across steps, giving the agent a trainable memory that is not limited to a fixed window of stacked frames.
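The time span covered by a frame stack is simple arithmetic. The sketch below assumes a stack of 4 frames and Doom's 35 fps engine rate; `frame_skip` is an added assumption showing how action repeat widens the window.

```python
def stack_window_seconds(n_stack: int, frame_skip: int = 1, engine_fps: int = 35) -> float:
    """Approximate game time covered by n_stack observations taken every frame_skip tics."""
    return n_stack * frame_skip / engine_fps


no_skip = stack_window_seconds(4)                   # 4 / 35 s, roughly 0.11 s
with_skip = stack_window_seconds(4, frame_skip=4)   # 16 / 35 s, roughly 0.46 s
print(f"{no_skip:.2f} s without frame skip, {with_skip:.2f} s with frame_skip=4")
```

Even with frame skip, the window stays well under a second, which is far shorter than the time it takes to walk down a corridor and back.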
High-Throughput Training: Sample Factory
Stable Baselines3 achieves ~2,000–5,000 environment frames per second on a single Colab GPU. For Deathmatch-level scenarios, this is prohibitively slow: 100M steps would take ~6 hours with an optimistic 5k fps.
Sample Factory (Petrenko et al., 2020) is a high-throughput RL framework built around asynchronous actors:
- Separate processes for environment simulation, inference, and learning run concurrently
- GPU inference is batched across many parallel actors
- Achieves 100,000–300,000 fps on a single machine with a modern GPU
- Native ViZDoom integration via sf_examples.vizdoom
# Train a PPO agent on ViZDoom Deathmatch with Sample Factory:
python -m sf_examples.vizdoom.train_vizdoom \
--env=doom_deathmatch_bots \
--num_workers=16 \
--num_envs_per_worker=4 \
--train_for_env_steps=100_000_000
At 200k fps, 100M steps takes ~8 minutes rather than 6 hours.
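The wall-clock figures quoted in this section follow from dividing the step budget by throughput. A quick check, using the 5k and 200k fps values from the text:

```python
def training_time_seconds(env_steps: int, fps: float) -> float:
    """Wall-clock time needed to collect env_steps at a constant throughput."""
    return env_steps / fps


STEPS = 100_000_000
sb3_hours = training_time_seconds(STEPS, 5_000) / 3600     # about 5.6 hours
sf_minutes = training_time_seconds(STEPS, 200_000) / 60    # about 8.3 minutes
print(f"SB3 at 5k fps: {sb3_hours:.1f} h; Sample Factory at 200k fps: {sf_minutes:.1f} min")
```

These are idealised numbers: they assume constant throughput and ignore evaluation, checkpointing, and start-up overhead.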
Modern Visual RL Beyond ViZDoom
ViZDoom remains a widely used 3D visual RL benchmark, but the field has expanded:
| Environment | Domain | Key challenge |
|---|---|---|
| Atari 100k | Games | Extreme sample efficiency (only 100k steps allowed) |
| Procgen | Procedural games | Generalisation across level layouts |
| MineDojo / MineRL | Minecraft | Long-horizon tasks, open-ended goals |
| Habitat | Indoor navigation | Photo-realistic scenes, embodied AI |
| Isaac Gym / IsaacLab | Robotics | GPU-accelerated physics, sim-to-real |
| DM Lab | 3D navigation | Memory, navigation, multi-task |
The shared challenge across all of these is sample efficiency: how do you learn good visual representations faster, with fewer environment interactions? Active research directions include self-supervised auxiliary tasks, data augmentation (RAD, DrQ), world models (DreamerV3), and pre-trained visual encoders.
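As one concrete instance of the data-augmentation direction, the sketch below implements a random-shift augmentation in the spirit of DrQ: pad the image by replicating its edges, then crop a randomly offset window of the original size. It operates on plain nested lists for clarity; the function name and the default padding of 4 pixels are illustrative choices, not a reference implementation.

```python
import random


def random_shift(image, pad=4, rng=random):
    """Shift an H x W image (list of rows) by up to `pad` pixels in each direction.

    Equivalent to edge-replication padding followed by a random H x W crop.
    """
    h, w = len(image), len(image[0])
    dy = rng.randint(0, 2 * pad)  # crop offset inside the padded image
    dx = rng.randint(0, 2 * pad)

    def clamp(v, lo, hi):
        return max(lo, min(hi, v))

    return [
        [image[clamp(y + dy - pad, 0, h - 1)][clamp(x + dx - pad, 0, w - 1)]
         for x in range(w)]
        for y in range(h)
    ]


frame = [[row * 10 + col for col in range(5)] for row in range(4)]
shifted = random_shift(frame, pad=2, rng=random.Random(0))
```

The augmented frame keeps the same shape as the input, so it can be fed to the same encoder; the policy is encouraged to be invariant to small translations without any extra environment interaction.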