Visual RL: From Pixels to Policies
- Explain why raw-pixel observations make RL harder than low-dimensional state inputs, and identify the two simultaneous learning problems the agent must solve
- Describe the key design decisions in Atari DQN (Mnih et al. 2015) that enabled learning from pixels and explain the role each plays
- Quantify the sample efficiency gap between state-based and pixel-based RL on equivalent tasks and explain why the gap exists
The Observation Gap
All the algorithms covered so far (VPG, PPO, DDPG, SAC) receive low-dimensional state vectors as input: a handful of numbers representing position, velocity, joint angles, and similar quantities. These state-based agents typically learn within a few million environment steps.
The same algorithms applied to raw pixels require tens to hundreds of millions of steps on equivalent tasks. The reason is that pixel-based RL imposes two simultaneous learning problems:
- Representation learning: extract a compact, task-relevant feature vector from a high-dimensional, redundant image
- Policy learning: map that feature vector to good actions
State-based RL only has to solve the second problem. When learning from pixels, every environment step and every gradient update must serve both problems at once.
Why Not a Linear Layer?
A 240×320×3 Doom frame contains 230,400 numbers. A fully connected layer mapping this to a 512-unit representation would have roughly 118 million parameters (230,400 × 512 weights plus 512 biases) for the first layer alone. Beyond the parameter count, linear layers cannot exploit spatial structure: a corridor detector must re-learn the same pattern for every possible screen position.
Convolutional layers solve both problems:
- Weight sharing across spatial positions drastically reduces parameters
- Translation equivariance means a detector fires wherever the relevant pattern appears, regardless of position
A standard three-layer CNN encoder for RL, applied to downsampled 84×84 frames, has roughly 1.7M parameters, about 70× fewer than the single fully connected layer on the raw frame.
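The parameter counts above can be checked with a few lines of arithmetic. The sketch below assumes the encoder is the Atari DQN layout (three unpadded convolutions followed by a 512-unit fully connected layer) applied to a stack of four 84×84 grayscale frames; the text does not specify the layer sizes, so treat them as an assumption. The helper names are illustrative only.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel convolution."""
    return in_ch * out_ch * kernel * kernel + out_ch


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Single fully connected layer on the raw 240x320x3 frame.
fc_on_raw = fc_params(240 * 320 * 3, 512)

# Assumed encoder: Atari DQN layout on four stacked 84x84 grayscale frames.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (out_channels, kernel, stride)
size, channels, cnn_total = 84, 4, 0
for out_ch, kernel, stride in layers:
    cnn_total += conv_params(channels, out_ch, kernel)
    size, channels = conv_out(size, kernel, stride), out_ch
cnn_total += fc_params(size * size * channels, 512)  # flatten -> 512 features

print(f"FC on raw frame: {fc_on_raw:,}")
print(f"CNN encoder:     {cnn_total:,}")
print(f"Ratio:           {fc_on_raw / cnn_total:.1f}x")
```

Under these assumptions the encoder comes to 1,684,128 parameters against 117,965,312 for the fully connected layer. Most of the encoder's parameters sit in its final fully connected layer, not in the convolutions themselves.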
Historical Breakthrough: Atari DQN
Mnih et al. (2013/2015) showed that a single convolutional architecture could learn to play 49 Atari 2600 games from raw pixels, achieving human-level or superhuman performance on many of them. The algorithm — DQN — combined three ideas that had existed separately:
- Deep convolutional Q-network: a CNN mapping frames to Q-values for each action
- Experience replay: storing transitions in a buffer and training on random minibatches to break temporal correlation
- Separate target network: a frozen copy of the Q-network updated periodically to stabilize the Bellman target
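The replay and target-network ideas can be sketched without a deep learning framework. The block below is a minimal illustration, not the original implementation: `ReplayBuffer` and `td_targets` are hypothetical names, the discount `gamma=0.99` is an assumed default, and `next_q_target` stands for the Q-values that a frozen target network would output for the next observations.

```python
import random

import numpy as np


class ReplayBuffer:
    """Fixed-capacity ring buffer of transitions with uniform sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index of the next slot to write

    def add(self, obs, action, reward, next_obs, done):
        transition = (obs, action, reward, next_obs, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition  # overwrite the oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        batch = random.sample(self.storage, batch_size)
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return (
            np.array(obs),
            np.array(actions),
            np.array(rewards, dtype=np.float32),
            np.array(next_obs),
            np.array(dones, dtype=np.float32),
        )

    def __len__(self):
        return len(self.storage)


def td_targets(rewards, dones, next_q_target, gamma=0.99):
    """Bellman targets r + gamma * max_a Q_target(s', a); no bootstrap on terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)
```

In a training loop, the online network would be regressed toward `td_targets(...)` on each sampled minibatch, and the target network's weights would be overwritten with the online weights every fixed number of updates.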
Five key preprocessing and architecture decisions made it work:
| Decision | Value | Why |
|---|---|---|
| Frame stacking | 4 frames | Encodes motion without recurrence |
| Grayscale | Yes | Colour adds little task-relevant signal on most Atari games |
| Resize | 84×84 | Reduces dimensionality while preserving game structure |
| Normalize | pixels ÷ 255 | Keeps inputs in [0, 1] for stable gradients |
| CNN architecture | 3 conv layers + 2 FC | Sufficient for Atari; became the standard |
This preprocessing pipeline and encoder architecture have been used almost unchanged in visual RL research for the decade since.
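The first four rows of the table can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the grayscale conversion uses the common BT.601 luminance weights, and the resize uses nearest-neighbour index sampling to stay dependency-free, whereas practical pipelines usually use an image library with area or bilinear interpolation. `preprocess` and `FrameStack` are hypothetical names.

```python
from collections import deque

import numpy as np

LUMA = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # BT.601 weights (assumed)


def preprocess(frame, size=84):
    """RGB uint8 frame (H, W, 3) -> float32 array (size, size) scaled to [0, 1]."""
    gray = frame.astype(np.float32) @ LUMA  # weighted sum over colour channels
    h, w = gray.shape
    rows = np.arange(size) * h // size  # nearest-neighbour row indices
    cols = np.arange(size) * w // size  # nearest-neighbour column indices
    small = gray[rows][:, cols]
    return np.clip(small / 255.0, 0.0, 1.0)


class FrameStack:
    """Keeps the k most recent preprocessed frames as one (k, H, W) observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At episode start, fill the stack with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(frame)
        return self.observation()

    def step(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        return self.observation()

    def observation(self):
        return np.stack(list(self.frames), axis=0)


rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)  # dummy Atari-sized frame
stack = FrameStack(k=4)
obs = stack.reset(preprocess(raw))
print(obs.shape, obs.dtype)
```

The resulting `(4, 84, 84)` array is the input a convolutional encoder would consume, with the frame index acting as the channel dimension.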
Sample Efficiency Gap
| Setting | Steps to learn CartPole-equivalent task |
|---|---|
| PPO, state input (4 numbers) | ~50k |
| PPO, pixel input (84×84 grayscale) | ~500k–2M |
| Gap | 10–40× |
The gap exists because every gradient signal that arrives from the environment must simultaneously improve the CNN encoder and the policy head. With a fixed pre-trained encoder (as in transfer learning), the gap shrinks dramatically — evidence that representation learning, not policy learning, is the bottleneck.
Partial Observability in 3D Environments
Atari games are almost fully observable — everything relevant fits in a single frame. 3D environments like ViZDoom introduce genuine partial observability: enemies can be behind walls, outside the field of view, or approaching from angles not captured by the current frame.
This places visual RL in Partially Observable MDP (POMDP) territory:
$$o_t \sim O(\cdot \mid s_t)$$
where the observation $o_t$ (the current screen frame) provides strictly less information than the full game state $s_t$. The agent must form a belief about hidden state, typically via:
- Frame stacking: approximate temporal memory; works well for smooth motion
- Recurrent policies (LSTM/GRU): explicit memory; handles long-range dependencies at the cost of training complexity
Most practical visual RL systems use frame stacking for short-horizon tasks and add recurrence only when it demonstrably helps.
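The difference in memory horizon between the two options can be illustrated with a toy example. The sketch below is not an LSTM or GRU: it swaps in a hand-written leaky scalar memory as a stand-in for a learned recurrent state, purely to show that a k-frame stack forgets an event after k steps while a recurrent state can retain a trace of it. All names and the decay value are assumptions made for illustration.

```python
from collections import deque


def stack_remembers(events, k):
    """Per step: does a k-frame stack still contain a frame where the event was visible?"""
    window = deque(maxlen=k)
    remembered = []
    for visible in events:
        window.append(visible)
        remembered.append(any(window))
    return remembered


def leaky_memory(events, decay=0.9):
    """Toy recurrent state h_t = decay * h_{t-1} + visible_t (stand-in for an LSTM/GRU)."""
    h, trace = 0.0, []
    for visible in events:
        h = decay * h + float(visible)
        trace.append(h)
    return trace


# An enemy is visible at t=0, then stays hidden behind a wall for nine steps.
events = [True] + [False] * 9
in_stack = stack_remembers(events, k=4)
trace = leaky_memory(events)

print(in_stack)
print([round(h, 3) for h in trace])
```

With a 4-frame stack the enemy is still represented at step 3 but gone from step 4 onward, while the leaky memory remains non-zero through the end of the sequence. A learned recurrent policy would have to discover a comparable retention behaviour through training, which is part of the added complexity noted above.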
What Makes Visual RL Harder in Practice
Beyond sample efficiency, three practical challenges stand out:
Representation collapse: early in training, the CNN may learn to ignore moving objects (enemies, goals) and focus on static textures, since static features produce more consistent predictions. Entropy bonuses and auxiliary tasks help.
Reward sparsity amplified by bad representations: a sparse reward signal is hard enough with a good state representation. With a poor visual encoder, the agent has almost no gradient signal to improve the encoder — a chicken-and-egg problem.
Computational cost: processing 84×84 frames through a CNN is 10–50× more expensive per step than a simple MLP forward pass. Vectorised environments (VecEnv) and GPU-accelerated rendering are standard mitigations.
The next reading covers the standard CNN encoder architecture and preprocessing pipeline used to address these challenges.