Visual RL: From Pixels to Policies

By the end of this reading you will be able to:
  • Explain why raw-pixel observations make RL harder than low-dimensional state inputs, and identify the two simultaneous learning problems the agent must solve
  • Describe the key design decisions in Atari DQN (Mnih et al. 2015) that enabled learning from pixels and explain the role each plays
  • Quantify the sample efficiency gap between state-based and pixel-based RL on equivalent tasks and explain why the gap exists

The Observation Gap

All the algorithms covered so far (VPG, PPO, DDPG, SAC) receive low-dimensional state vectors as input: a handful of numbers representing position, velocity, joint angles, and similar quantities. Depending on the task, these state-based agents typically learn in tens of thousands to a few million steps.

The same algorithms applied to raw pixels typically need an order of magnitude or more additional steps on equivalent tasks, and hard visual tasks can take tens to hundreds of millions of steps. The reason is that pixel-based RL imposes two simultaneous learning problems:

  1. Representation learning: extract a compact, task-relevant feature vector from a high-dimensional, redundant image
  2. Policy learning: map that feature vector to good actions

State-based RL only has to solve problem 2. When learning from pixels, every environment step and every gradient update must serve both problems at once.

Why Not a Linear Layer?

A 240×320×3 Doom frame contains 230,400 numbers. A fully connected layer mapping this to a 512-unit representation would have $230{,}400 \times 512 \approx 118\text{M}$ parameters, just for the first layer. Beyond the parameter count, fully connected layers cannot exploit spatial structure: a corridor detector must re-learn the same pattern for every possible screen position.

Convolutional layers solve both problems:

  • Weight sharing across spatial positions drastically reduces parameters
  • Translation equivariance means a detector fires wherever the relevant pattern appears, regardless of position

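A standard three-layer CNN encoder for RL (the Nature DQN network applied to a preprocessed 84×84×4 input) has roughly 1.7M parameters, including its 512-unit fully connected layer. That is about 70× fewer than the single fully connected layer on the raw frame described above, and the three convolutional layers themselves account for fewer than 80k of those parameters.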
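The arithmetic behind these parameter counts can be checked with a short script. This is a sketch under stated assumptions: the encoder is taken to be the Nature DQN layout (three unpadded conv layers followed by a 512-unit fully connected layer) applied to an 84×84 input with 4 stacked frames, and the helper names `conv_out`, `conv_params`, and `fc_params` are illustrative.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel convolutional layer."""
    return in_ch * kernel * kernel * out_ch + out_ch


def fc_params(n_in, n_out, bias=True):
    """Weights (and optionally biases) of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Fully connected layer on a raw 240x320x3 frame (weights only, as in the text).
raw_inputs = 240 * 320 * 3
fc_on_pixels = fc_params(raw_inputs, 512, bias=False)

# Assumed encoder: Nature DQN layout on an 84x84 input with 4 stacked frames.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (out_channels, kernel, stride)
size, channels, conv_total = 84, 4, 0
for out_ch, kernel, stride in layers:
    conv_total += conv_params(channels, out_ch, kernel)
    size, channels = conv_out(size, kernel, stride), out_ch

flat = size * size * channels                # features entering the FC layer
encoder_total = conv_total + fc_params(flat, 512)

print("FC on raw pixels:", fc_on_pixels)
print("Conv layers only:", conv_total)
print("CNN encoder total:", encoder_total)
print("Ratio:", round(fc_on_pixels / encoder_total))
```

Note that most of the encoder's parameters sit in its final fully connected layer; the convolutional layers are cheap because their weights are shared across spatial positions.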

Historical Breakthrough: Atari DQN

Mnih et al. (2013/2015) showed that a single convolutional architecture could learn to play 49 Atari 2600 games from raw pixels, achieving human-level or superhuman performance on many of them. The algorithm — DQN — combined three ideas that had existed separately:

  1. Deep convolutional Q-network: a CNN mapping frames to Q-values for each action
  2. Experience replay: storing transitions in a buffer and training on random minibatches to break temporal correlation
  3. Separate target network: a frozen copy of the Q-network updated periodically to stabilize the Bellman target
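Two of these ingredients can be sketched without a deep learning library. The code below is a minimal illustration, not the original implementation: `ReplayBuffer` and `td_target` are hypothetical names, observations are stand-in integers, and the target network is represented only by the list of Q-values it would output for the next state.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def td_target(reward, next_q_target, done, gamma=0.99):
    """Bellman target r + gamma * max_a' Q_target(s', a'); no bootstrap at terminal states."""
    return reward if done else reward + gamma * max(next_q_target)


# Usage: fill a tiny buffer past capacity, then draw a minibatch.
buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(obs=t, action=0, reward=1.0, next_obs=t + 1, done=(t == 4))
batch = buffer.sample(2)

# next_q_target stands in for the frozen target network's output on s'.
y = td_target(reward=1.0, next_q_target=[0.5, 2.0], done=False, gamma=0.9)
print(len(buffer), len(batch), y)
```

In a full agent, the values passed as `next_q_target` would come from the frozen copy of the network, which is overwritten with the online network's weights only every fixed number of updates.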

Five key preprocessing and architecture decisions made it work:

| Decision | Value | Why |
| --- | --- | --- |
| Frame stacking | $k=4$ frames | Encodes motion without recurrence |
| Grayscale | Yes | Colour adds little task-relevant signal on most Atari games |
| Resize | 84×84 | Reduces dimensionality while preserving game structure |
| Normalize pixels | ÷ 255 | Keeps inputs in $[0,1]$ for stable gradients |
| CNN architecture | 3 conv layers + 2 FC | Sufficient for Atari; became the standard |

This preprocessing pipeline and encoder architecture have been used almost unchanged in visual RL research for the decade since.
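The preprocessing steps (grayscale, resize, normalize, stack) can be sketched in NumPy. This is an illustration under assumptions rather than the original pipeline: the function names are hypothetical, the input is random noise shaped like a 210×160 Atari frame, and the resize uses nearest-neighbour index sampling to avoid extra dependencies, whereas real pipelines usually use an interpolating resize from an image library.

```python
import numpy as np


def preprocess(frame, out_size=84):
    """Convert one HxWx3 uint8 RGB frame to an out_size x out_size float32 image in [0, 1]."""
    # Luminance-weighted grayscale (ITU-R BT.601 weights).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Nearest-neighbour resize by index sampling (an assumption, see lead-in).
    h, w = gray.shape
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    small = gray[rows][:, cols]
    # Scale to [0, 1]; clip guards against float rounding just above 1.
    return np.clip(small / 255.0, 0.0, 1.0)


def stack_frames(frames):
    """Stack k preprocessed frames into a (k, H, W) observation, oldest first."""
    return np.stack(frames, axis=0)


# Usage: four random frames standing in for consecutive game screens.
rng = np.random.default_rng(0)
raw = [rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8) for _ in range(4)]
obs = stack_frames([preprocess(f) for f in raw])
print(obs.shape, obs.dtype)
```

The resulting array has one channel per stacked frame, which is the layout a convolutional Q-network would take as input.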

Sample Efficiency Gap

| Setting | Steps to learn CartPole-equivalent task |
| --- | --- |
| PPO, state input (4 numbers) | ~50k |
| PPO, pixel input (84×84 grayscale) | ~500k–2M |
| Gap | 10–40× |

The gap exists because every gradient signal that arrives from the environment must simultaneously improve the CNN encoder and the policy head. With a fixed pre-trained encoder (as in transfer learning), the gap shrinks dramatically — evidence that representation learning, not policy learning, is the bottleneck.

Partial Observability in 3D Environments

Atari games are almost fully observable: nearly everything relevant is on screen, and a short stack of frames recovers what a single frame misses, such as velocities. 3D environments like ViZDoom introduce genuine partial observability: enemies can be behind walls, outside the field of view, or approaching from angles not captured by the current frame.

This places visual RL in Partially Observable MDP (POMDP) territory:

$$o_t = O(s_t)$$

where the observation $o_t$ (the current screen frame) provides strictly less information than the full game state $s_t$. The agent must form a belief about hidden state, typically via:

  • Frame stacking: approximate temporal memory; works well for smooth motion
  • Recurrent policies (LSTM/GRU): explicit memory; handles long-range dependencies at the cost of training complexity

Most practical visual RL systems use frame stacking for short-horizon tasks and add recurrence only when it demonstrably helps.
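Frame stacking as a short-term memory can be sketched as a small stateful helper. The class below is a hypothetical illustration, not a specific library's wrapper: it assumes frames arrive as equally shaped NumPy arrays and that repeating the first frame is an acceptable way to fill the history at episode start.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keeps the k most recent frames and exposes them as one (k, H, W) array."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At episode start there is no history, so repeat the first frame k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return self.observation()

    def observation(self):
        return np.stack(list(self.frames), axis=0)


# Usage: tiny 2x2 "frames" whose value encodes the time step.
stack = FrameStack(k=4)
obs = stack.reset(np.zeros((2, 2)))
for t in range(1, 3):
    obs = stack.step(np.full((2, 2), float(t)))
print(obs[:, 0, 0])
```

The memory here reaches back exactly k frames, so anything that left the screen earlier is forgotten. That limit is what motivates recurrent policies for longer-range dependencies.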

What Makes Visual RL Harder in Practice

Beyond sample efficiency, three practical challenges stand out:

Representation collapse: early in training, the CNN may learn to ignore moving objects (enemies, goals) and focus on static textures, since static features produce more consistent predictions. Entropy bonuses and auxiliary tasks help.

Reward sparsity amplified by bad representations: a sparse reward signal is hard enough with a good state representation. With a poor visual encoder, the agent has almost no gradient signal to improve the encoder — a chicken-and-egg problem.

Computational cost: processing 84×84 frames through a CNN is 10–50× more expensive per step than a simple MLP forward pass. Vectorised environments (VecEnv) and GPU-accelerated rendering are standard mitigations.

The next reading covers the standard CNN encoder architecture and preprocessing pipeline used to address these challenges.