ViZDoom and 3D Visual RL

By the end of this reading you will be able to:

Compare ViZDoom to Atari as a benchmark: identify what makes 3D first-person environments harder and what additional skills agents must learn
Order ViZDoom's built-in scenarios by difficulty and identify the key skill each scenario isolates
Explain why ViZDoom Deathmatch requires significantly more training steps and more capable architectures than Basic or DefendCenter

Why ViZDoom?

Atari established that CNNs can learn to play 2D games from pixels. ViZDoom (Kempka et al., 2016) asks the harder question: can agents learn in 3D, first-person, partially observable environments?

The Doom game engine provides:

Fully configurable scenarios (maps, enemies, rewards, available weapons)
First-person 3D rendering with genuine depth, perspective, and occlusion
Deterministic or stochastic enemy behaviour
A Python API for programmatic control
A Gymnasium wrapper registering environments as VizdoomXxx-v1

The gap between Atari and ViZDoom is the gap between a flat sprite game and a real 3D world — and it turns out to be large.

What 3D Adds Over Atari

Depth and Perspective

In Atari, objects are 2D sprites drawn at a fixed scale. In Doom, a monster at distance 10 units appears larger in the frame than the same monster at distance 50 units. The agent must learn that apparent size encodes distance, a form of 3D reasoning implicit in the visual input.
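The link between apparent size and distance can be illustrated with a pinhole-camera approximation, in which on-screen height scales as 1/distance. This is only a sketch of the geometry, not ViZDoom's renderer; the focal length and monster height below are made-up values chosen for illustration.

```python
def projected_height_px(object_height: float, distance: float,
                        focal_length_px: float = 160.0) -> float:
    """Pinhole approximation: on-screen height is proportional to 1/distance."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return focal_length_px * object_height / distance


MONSTER_HEIGHT = 56.0  # hypothetical height in map units, not a ViZDoom constant

near = projected_height_px(MONSTER_HEIGHT, distance=10.0)
far = projected_height_px(MONSTER_HEIGHT, distance=50.0)

# The same monster is about five times taller on screen at 10 units than at 50.
size_ratio = near / far
print(round(size_ratio, 3))
```

Under this approximation the agent can, in principle, recover relative distance from pixel height alone, which is the kind of cue a CNN has to discover from reward.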

Genuine Occlusion

Enemies can be fully hidden behind walls and only reveal themselves when the agent (or enemy) moves. A single frame provides zero information about occluded objects. Frame stacking helps only if occlusion is brief — longer occlusion requires memory.

Active Exploration

Most Atari games present all relevant information on screen at all times. Many ViZDoom scenarios require the agent to turn and search to find enemies or goals. This introduces an exploration component that is absent from most Atari games.

Moving through 3D space with a consistent heading requires integrating a sequence of actions. The agent must learn that a sequence of TURN_LEFT actions rotates it to face a new direction — a spatial reasoning problem absent from Atari.
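A minimal sketch of what "integrating a sequence of actions" means for heading: no single action determines where the agent faces, only the running sum does. The action names follow the text, but the rotation per action (`TURN_DEGREES`) is a hypothetical value, not a ViZDoom setting.

```python
TURN_DEGREES = 5.0  # hypothetical rotation per turn action


def integrate_heading(actions, start_heading=0.0, turn_degrees=TURN_DEGREES):
    """Accumulate turn actions into a heading in degrees, wrapped to [0, 360)."""
    heading = start_heading
    for action in actions:
        if action == "TURN_LEFT":
            heading += turn_degrees
        elif action == "TURN_RIGHT":
            heading -= turn_degrees
        # Any other action (moving, shooting) leaves the heading unchanged.
    return heading % 360.0


# Eighteen left turns of 5 degrees each rotate the agent by a quarter turn.
quarter_turn = integrate_heading(["TURN_LEFT"] * 18)
print(quarter_turn)  # 90.0
```

The agent is never given this heading directly; it has to infer the effect of its turns from how the rendered view changes.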

Scenario Hierarchy

ViZDoom ships with a set of built-in scenarios, listed here in roughly increasing order of difficulty. Each isolates a specific skill:

Scenario	Action space	Key skill	Typical steps to learn
`VizdoomBasic-v1`	Move L/R, Shoot	Align and fire at stationary target	100–300k
`VizdoomDefendLine-v1`	Turn L/R, Shoot	Eliminate a wave of approaching enemies	300–500k
`VizdoomDefendCenter-v1`	Turn L/R, Shoot	Track moving targets from a fixed position	500k–1M
`VizdoomHealthGathering-v1`	Move, Turn	Navigate and collect health packs	500k–1M
`VizdoomCorridor-v1`	Full movement + shoot	Navigate a corridor with enemies, collect armour	1–3M
`VizdoomMyWayHome-v1`	Full movement	Maze navigation to a fixed goal	3–10M
`VizdoomPredictPosition-v1`	Turn L/R, Shoot	Lead a moving target with a slow rocket	5–20M
`VizdoomDeathmatch-v1`	Full movement + all weapons	Kill enemies across the full map	50–500M

The gap between Basic (~200k steps) and Deathmatch (~100M steps) is a factor of 500. This is not just more training time: Deathmatch requires qualitatively different capabilities, including navigation, target acquisition, weapon selection, ammo management, and spatial memory.

Curriculum Learning

Training directly on hard scenarios often fails: the reward signal is too sparse for the agent to discover the right behaviours from random exploration. Curriculum learning — training on progressively harder scenarios — is a practical solution:

Train on Basic until the agent reliably kills the stationary enemy
Transfer to DefendCenter (moving enemies) using the Basic policy as initialization
Transfer to Corridor (navigation required)
Transfer to Deathmatch

Each transfer re-uses the learned CNN encoder; only the policy head adapts to the new task. This is a form of transfer learning within a domain.
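The stage progression above can be sketched as a small scheduler that moves to the next scenario once recent episodes are mostly successful. This is one possible design, not a prescribed recipe: the `Curriculum` class, the success threshold, and the window size are all hypothetical, and what counts as a "successful" episode has to be defined per scenario.

```python
from collections import deque

STAGES = ["Basic", "DefendCenter", "Corridor", "Deathmatch"]  # order from the text


class Curriculum:
    """Advance to the next stage when the recent success rate passes a threshold."""

    def __init__(self, stages, threshold=0.9, window=100):
        self.stages = list(stages)
        self.threshold = threshold              # hypothetical mastery criterion
        self.results = deque(maxlen=window)     # most recent episode outcomes
        self.index = 0

    @property
    def stage(self):
        return self.stages[self.index]

    def record(self, success: bool) -> bool:
        """Record one episode outcome; return True if the curriculum advanced."""
        self.results.append(1.0 if success else 0.0)
        window_full = len(self.results) == self.results.maxlen
        mastered = (window_full
                    and sum(self.results) / len(self.results) >= self.threshold)
        if mastered and self.index < len(self.stages) - 1:
            self.index += 1
            self.results.clear()  # statistics do not carry over between scenarios
            return True
        return False


curriculum = Curriculum(STAGES, threshold=0.9, window=10)
for _ in range(10):
    advanced = curriculum.record(True)
print(curriculum.stage)  # DefendCenter
```

In a real training loop, the point where `record` returns `True` is where the next scenario's environment would be created and the previous encoder weights loaded.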

Frame Stacking vs Recurrence on ViZDoom

For Basic and DefendCenter, frame stacking ($k=4$) provides sufficient temporal context, because enemies are always visible and their motion is smooth.

For HealthGathering and MyWayHome, the agent needs to remember where it has been to avoid revisiting dead ends. Frame stacking with $k=4$ covers only ~0.1 seconds at 35 fps — nowhere near enough. These scenarios benefit from recurrent policies:

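# Stable Baselines3 has no built-in recurrent policies,
# but sb3-contrib provides RecurrentPPO with an LSTM-based CNN policy.
# In a notebook cell, the leading "!" runs a shell command:
!pip install sb3-contrib
from sb3_contrib import RecurrentPPO
# env is an already-constructed ViZDoom environment; remaining arguments are elided.
model = RecurrentPPO('CnnLstmPolicy', env, ...)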

The recurrent state persists across steps, giving the agent a trainable memory that is not limited to $k$ frames.
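To make the limit of a $k$-frame window concrete, here is a minimal frame-stack sketch built on a fixed-length deque, plus a helper that estimates how much time the stack covers. The `FrameStack` class is illustrative rather than any library's wrapper, and the `frame_skip` parameter is an added assumption (the 0.1-second figure in the text corresponds to `frame_skip=1`).

```python
from collections import deque

DOOM_TICS_PER_SECOND = 35  # frame rate quoted in the text


class FrameStack:
    """Keeps the k most recent frames; anything older is forgotten."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the stack with the first frame so the observation size is constant.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        self.frames.append(frame)  # the oldest frame drops out automatically
        return list(self.frames)


def coverage_seconds(k: int, frame_skip: int = 1) -> float:
    """Approximate time spanned by k stacked frames."""
    return k * frame_skip / DOOM_TICS_PER_SECOND


stack = FrameStack(k=4)
stack.reset("f0")
for i in range(1, 6):
    observation = stack.step(f"f{i}")

print(observation)                    # ['f2', 'f3', 'f4', 'f5']
print(round(coverage_seconds(4), 2))  # 0.11
```

After five steps the frame seen at reset is gone from the observation, whereas a recurrent state could still carry information derived from it.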

High-Throughput Training: Sample Factory

Stable Baselines3 achieves ~2,000–5,000 environment frames per second on a single Colab GPU. For Deathmatch-level scenarios, this is prohibitively slow: 100M steps would take ~6 hours with an optimistic 5k fps.

Sample Factory (Petrenko et al., 2020) is a high-throughput RL framework built around asynchronous actors:

Separate processes for environment simulation, inference, and learning run concurrently
GPU inference is batched across many parallel actors
Achieves 100,000–300,000 fps on a single machine with a modern GPU
Native ViZDoom integration via sf_examples.vizdoom

# Train an APPO (asynchronous PPO) agent on ViZDoom Deathmatch with Sample Factory:
python -m sf_examples.vizdoom.train_vizdoom \
    --env=doom_deathmatch_bots \
    --num_workers=16 \
    --num_envs_per_worker=4 \
    --train_for_env_steps=100_000_000

At 200k fps, 100M steps takes ~8 minutes rather than 6 hours.
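The wall-clock figures quoted in this section follow from dividing the step budget by the throughput. The short calculation below reproduces them; it treats one environment step as one frame, which is a simplification when frame skipping is in use.

```python
def wall_clock_seconds(total_steps: int, frames_per_second: float) -> float:
    """Training time if throughput stays constant for the whole run."""
    return total_steps / frames_per_second


STEPS = 100_000_000

sb3_hours = wall_clock_seconds(STEPS, 5_000) / 3600    # optimistic SB3 throughput
sf_minutes = wall_clock_seconds(STEPS, 200_000) / 60   # Sample Factory throughput
speedup = 200_000 / 5_000

print(f"{sb3_hours:.1f} h")     # 5.6 h
print(f"{sf_minutes:.1f} min")  # 8.3 min
print(f"{speedup:.0f}x")        # 40x
```

The same function makes it easy to check other budgets, for example the upper end of the Deathmatch range in the scenario table.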

Modern Visual RL Beyond ViZDoom

ViZDoom remains a widely used 3D visual RL benchmark, but the field has expanded:

Environment	Domain	Key challenge
Atari 100k	Games	Extreme sample efficiency (only 100k steps allowed)
Procgen	Procedural games	Generalisation across level layouts
MineDojo / MineRL	Minecraft	Long-horizon tasks, open-ended goals
Habitat	Indoor navigation	Photo-realistic scenes, embodied AI
Isaac Gym / Isaac Lab	Robotics	GPU-accelerated physics, sim-to-real
DM Lab	3D navigation	Memory, navigation, multi-task

The shared challenge across all of these is sample efficiency: how do you learn good visual representations faster, with fewer environment interactions? Active research directions include self-supervised auxiliary tasks, data augmentation (RAD, DrQ), world models (DreamerV3), and pre-trained visual encoders.

References

Kempka et al. 2016 — ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning

Petrenko et al. 2020 — Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning

Hafner et al. 2023 — Mastering Diverse Domains through World Models (DreamerV3)

Yarats et al. 2021 — Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning (DrQ-v2)
