Convolutional Policies and Preprocessing

By the end of this reading you will be able to:

Describe the standard Atari preprocessing pipeline (grayscale, resize to 84×84, frame stack k=4, normalize) and justify each step from first principles
Explain why frame stacking provides approximate temporal memory and identify scenarios where it is insufficient and recurrence is needed instead
Implement a DoomScreenWrapper that extracts the screen buffer, converts to grayscale, and resizes — and wire it into SB3's CnnPolicy via VecFrameStack

The Standard Preprocessing Pipeline

Most pixel-based RL results since Atari DQN apply some variant of the same four-step preprocessing pipeline to raw frames. Each step has a principled justification:

Step 1: Convert to Grayscale

RGB frames contain three channels; grayscale collapses them to one. The rationale:

No task-relevant signal is lost on most benchmarks: enemy positions, wall geometry, and moving objects remain visible in grayscale
3× reduction in input size: a 240×320 RGB frame has 230,400 values; grayscale has 76,800
Cheaper learning: the three colour channels are highly correlated, so they triple the input size while adding little information the task needs

The exception is tasks where colour is semantically meaningful (e.g., red health packs vs green items). For those, RGB or a learned colour space may outperform grayscale.

import cv2
gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)  # (H, W, 3) -> (H, W)
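
To make the conversion concrete, here is a minimal NumPy sketch of the weighted sum behind an RGB-to-gray conversion. It assumes the BT.601 luma weights (0.299, 0.587, 0.114) that OpenCV documents for this conversion; the helper name to_grayscale is hypothetical, and rounding may differ slightly from cv2's fixed-point implementation.

```python
import numpy as np

# BT.601 luma weights (assumed to match OpenCV's documented RGB -> gray formula).
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def to_grayscale(rgb_frame: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) uint8 RGB frame to an (H, W) uint8 luma image."""
    gray = rgb_frame.astype(np.float32) @ LUMA_WEIGHTS  # weighted sum over channels
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

# Synthetic 240x320 RGB frame standing in for a real screen buffer.
rng = np.random.default_rng(0)
rgb_frame = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
gray = to_grayscale(rgb_frame)

print(rgb_frame.size, gray.size)  # 230400 76800
```

Green dominates the sum because perceived brightness depends most on it, which is why a plain channel average is usually not used.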

Step 2: Resize to 84×84

Downsampling to 84×84 is the field-standard resolution, established by Atari DQN:

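GPU memory: a minibatch of 64 stacked (84×84×4) observations holds about 1.8M values, roughly 7 MB as float32; the equivalent 240×320×4 observations hold about 19.7M values, roughly 79 MB, an 11× increase
Compute and parameter count: convolution weights do not depend on input resolution, but every feature map, and therefore the first fully connected layer's weight matrix, grows with the number of input pixels (quadratically in the side length)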
Information retention: game-relevant features (enemies, platforms, projectiles) are still visible at 84×84; most background texture is legitimately discarded

small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

INTER_AREA (area averaging) is preferred over INTER_LINEAR when downsampling because it avoids aliasing artifacts.
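
As a back-of-envelope check on the memory argument for downsampling, the sketch below computes the size of one minibatch of stacked observations at both resolutions. The helper batch_megabytes is hypothetical, and the figures cover only the raw input tensor (uint8 storage versus float32 after normalization), not activations or gradients.

```python
def batch_megabytes(batch, height, width, channels, bytes_per_value):
    """Size of one observation minibatch in megabytes (10**6 bytes)."""
    return batch * height * width * channels * bytes_per_value / 1e6

for name, (h, w) in {"84x84": (84, 84), "240x320": (240, 320)}.items():
    as_uint8 = batch_megabytes(64, h, w, 4, 1)    # raw pixel storage
    as_float32 = batch_megabytes(64, h, w, 4, 4)  # after conversion to float
    print(f"{name}: {as_uint8:.1f} MB uint8, {as_float32:.1f} MB float32")
# 84x84: 1.8 MB uint8, 7.2 MB float32
# 240x320: 19.7 MB uint8, 78.6 MB float32

# Pixel-count ratio between the two resolutions.
ratio = (240 * 320) / (84 * 84)
print(round(ratio, 1))  # 10.9
```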

Step 3: Frame Stacking

A single frame is ambiguous: it shows where objects are, but not which direction they're moving or how fast. Without motion information, even simple tasks like dodging a projectile are partially observable.

Frame stacking concatenates the last $k$ preprocessed frames along the channel dimension, giving the agent a short temporal window:

$o_t^{\text{stacked}} = [f_{t-k+1}, f_{t-k+2}, \ldots, f_t] \in \mathbb{R}^{H \times W \times k}$

With $k=4$ and 84×84 grayscale, the observation is $(84, 84, 4)$. The CNN sees four consecutive frames simultaneously and can detect motion from the differences between them, without any recurrence.

Why this works: for smooth, low-frequency motion (which covers most game mechanics), the velocity of an object is well-approximated by its displacement between frames. The CNN implicitly learns to compute differences.

When it fails: long-range dependencies (e.g., remembering a key picked up 10 seconds ago), very fast motion where objects teleport between frames, or tasks requiring memory of more than $k$ steps. Recurrent policies (LSTM/GRU) are the standard solution for these cases.
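
The mechanism can be illustrated with a small sketch that does not depend on any RL library. The FrameStack class and frame_with_dot helper are hypothetical names for illustration, not SB3 API, and filling the window with copies of the first frame at reset is just one common choice. The example moves a single bright pixel 3 columns per frame and recovers that velocity from the stacked channels.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the last k frames and expose them as one (H, W, k) array."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # Fill the window with copies of the first frame so the shape is fixed.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(frame)  # the oldest frame drops out automatically
        return self.observation()

    def observation(self) -> np.ndarray:
        return np.stack(list(self.frames), axis=-1)

def frame_with_dot(column: int, size: int = 84) -> np.ndarray:
    """Hypothetical test frame: black image with one bright pixel."""
    frame = np.zeros((size, size), dtype=np.uint8)
    frame[size // 2, column] = 255
    return frame

stack = FrameStack(k=4)
obs = stack.reset(frame_with_dot(10))
for column in (13, 16, 19):  # the dot moves 3 pixels right per frame
    obs = stack.step(frame_with_dot(column))

# Column of the bright pixel in each of the 4 stacked channels.
columns = [int(np.argmax(obs[:, :, i].max(axis=0))) for i in range(obs.shape[-1])]
velocity = np.diff(columns)  # displacement between consecutive frames
print(obs.shape, columns, velocity.tolist())  # (84, 84, 4) [10, 13, 16, 19] [3, 3, 3]
```

Anything that happened more than k frames ago is absent from obs, which is the limitation that motivates recurrent policies.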

In Stable Baselines3, frame stacking is implemented as a VecEnv wrapper:

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

# Base vectorised env returns (84, 84, 1) observations
vec_env = make_vec_env(...)
# Stack 4 frames -> observations become (84, 84, 4)
stacked_env = VecFrameStack(vec_env, n_stack=4)

Step 4: Normalize to [0, 1]

Raw pixel values are integers in $[0, 255]$. Neural networks train more stably when inputs have approximately unit scale. SB3's CnnPolicy performs this normalization automatically:

# NumPy equivalent of the scaling SB3 applies automatically to uint8 image observations:
processed = frame.astype(np.float32) / 255.0
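
One practical consequence is that wrappers can leave observations as uint8 and let the policy rescale them. The toy sketch below, using a hypothetical 2×2 "image", shows the value range after scaling and the 4× storage difference between the two dtypes. If a wrapper normalized frames itself, the policy's own scaling would presumably need to be switched off (SB3 exposes a normalize_images policy argument for this) to avoid dividing twice.

```python
import numpy as np

frame = np.array([[0, 64], [128, 255]], dtype=np.uint8)  # toy 2x2 "image"
processed = frame.astype(np.float32) / 255.0             # values now in [0, 1]

print(processed.dtype, float(processed.min()), float(processed.max()))  # float32 0.0 1.0
print(frame.nbytes, processed.nbytes)                                   # 4 16
```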

The Nature DQN CNN Encoder

The architecture introduced in Mnih et al. (2015) has three convolutional layers followed by a fully connected layer:

Layer	Filters	Kernel	Stride	Output size
Conv1	32	8×8	4	(20, 20, 32)
Conv2	64	4×4	2	(9, 9, 64)
Conv3	64	3×3	1	(7, 7, 64)
Flatten	—	—	—	3136
FC	512	—	—	512

Input: $(84, 84, 4)$ stacked frames. Output: 512-dimensional feature vector fed into the policy/value head.

Total parameters: ~1.7M — lightweight enough to train entirely from RL signal.
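
The output sizes in the table and the parameter total can be reproduced with a few lines of arithmetic. This sketch assumes unpadded convolutions with one bias per output channel, and it counts the encoder only, not the policy and value heads; conv_out and conv_params are hypothetical helper names.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output side length of an unpadded convolution."""
    return (size - kernel) // stride + 1

def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch

layers = [(4, 32, 8, 4), (32, 64, 4, 2), (64, 64, 3, 1)]  # (in, out, kernel, stride)
size, total = 84, 0
for in_ch, out_ch, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    total += conv_params(in_ch, out_ch, kernel)
    print(f"conv {kernel}x{kernel}/{stride}: ({size}, {size}, {out_ch})")

flat = size * size * 64      # 7 * 7 * 64
total += flat * 512 + 512    # fully connected layer: weights plus biases
print(flat, total)           # 3136 1684128
```

About 95% of the parameters sit in the fully connected layer, which is why its input size (and therefore the input resolution) matters far more for model size than the convolution kernels do.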

SB3's CnnPolicy uses this exact architecture as its default encoder:

from stable_baselines3 import PPO

model = PPO(
    policy='CnnPolicy',
    env=stacked_env,
    # ... (remaining arguments elided)
)
print(model.policy.features_extractor)
# NatureCNN(
#   (cnn): Sequential(Conv2d(4,32,8,4), ReLU,
#                     Conv2d(32,64,4,2), ReLU,
#                     Conv2d(64,64,3,1), ReLU, Flatten)
#   (linear): Sequential(Linear(3136,512), ReLU)
# )

Implementing a Preprocessing Wrapper

Gymnasium's ObservationWrapper is the standard way to apply preprocessing. The wrapper intercepts every observation and transforms it before the agent sees it:

import gymnasium as gym
import numpy as np
import cv2

class DoomScreenWrapper(gym.ObservationWrapper):
    '''Extract screen buffer, convert to grayscale, resize to (size, size, 1).'''

    def __init__(self, env, screen_size=84):
        super().__init__(env)
        self.screen_size = screen_size
        self.observation_space = gym.spaces.Box(
            low=0, high=255,
            shape=(screen_size, screen_size, 1),
            dtype=np.uint8,
        )

    def observation(self, obs):
        # obs is a dict {'screen': ndarray, 'gamevariables': ndarray}
        screen = obs['screen'] if isinstance(obs, dict) else obs
        gray   = cv2.cvtColor(screen, cv2.COLOR_RGB2GRAY)
        small  = cv2.resize(gray, (self.screen_size, self.screen_size),
                             interpolation=cv2.INTER_AREA)
        return small[:, :, np.newaxis]   # (H, W) -> (H, W, 1)

The full pipeline — vectorisation, wrapping, and stacking:

import gymnasium as gym
from vizdoom import gymnasium_wrapper  # noqa: F401  (assumed: importing registers the Vizdoom* env IDs)
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

def make_doom(env_id, n_envs=4, n_stack=4, seed=0):
    def _make():
        return DoomScreenWrapper(gym.make(env_id))
    # make_vec_env already wraps each env in a Monitor, so no VecMonitor is needed
    vec = make_vec_env(_make, n_envs=n_envs, seed=seed)
    return VecFrameStack(vec, n_stack=n_stack)

# Observation: (n_envs, 84, 84, 4)
train_env = make_doom('VizdoomBasic-v1', n_envs=8, n_stack=4)

Training PPO with CnnPolicy

With image observations, PPO's hyperparameters shift relative to the MLP setting:

Hyperparameter	MLP typical	CNN typical	Why
`n_steps`	2048	512	Shorter rollouts; more frequent encoder updates
`n_epochs`	10	4	Fewer passes per rollout (CNNs overfit more easily)
`learning_rate`	3e-4	2.5e-4	Slightly lower for stability with larger input
`ent_coef`	0.0	0.01	Entropy bonus encourages visual exploration
`n_envs`	4–8	8–16	More envs compensate for lower sample efficiency

model = PPO(
    policy='CnnPolicy',
    env=train_env,
    n_steps=512,
    batch_size=64,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    learning_rate=2.5e-4,
    seed=0,
)
model.learn(total_timesteps=300_000)
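
It can help to work out what these settings imply for the update schedule. The arithmetic below is a sketch that assumes 8 parallel environments (as in train_env above) and that training proceeds in whole rollouts, so the final rollout may overshoot total_timesteps slightly.

```python
import math

n_envs, n_steps, batch_size, n_epochs = 8, 512, 64, 4
total_timesteps = 300_000

rollout_size = n_envs * n_steps                          # transitions collected per rollout
minibatches_per_epoch = rollout_size // batch_size       # gradient steps per pass over the rollout
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs
n_rollouts = math.ceil(total_timesteps / rollout_size)   # assumes whole rollouts only

print(rollout_size, minibatches_per_epoch, gradient_steps_per_rollout, n_rollouts)
# 4096 64 256 74
```

Under these assumptions the encoder receives 256 gradient steps per rollout and roughly 19,000 over the whole run, which is a useful number to keep in mind when judging whether a flat learning curve simply reflects too little training.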

Interpreting Visual RL Learning Curves

Visual RL learning curves have a characteristic shape that differs from state-based RL:

Flat early phase (often 0–30% of training): the randomly initialised CNN encoder has not yet learned useful features, so the policy receives little usable signal and appears to make no progress
Rapid rise: once the encoder begins to extract consistent features, the policy improves quickly
Plateau: the encoder has converged; further improvement requires more data or architectural changes

If the curve stays flat for more than 50% of total training, consider a higher ent_coef or data augmentation (random crops, colour jitter) to improve encoder learning. With an off-policy algorithm such as DQN, a longer learning_starts is another option; PPO has no such parameter.
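
One way to put a number on "flat for more than 50% of training" is to smooth the episode rewards and find where they first rise noticeably. The sketch below is a hypothetical heuristic, not an established diagnostic: the helper names, the 10-episode window, and the 10% threshold are all assumptions chosen for illustration.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out

def flat_fraction(rewards, window=10, rel_gain=0.1):
    """Fraction of training before the smoothed reward first clears
    `rel_gain` of the gap between its starting value and its best value."""
    smooth = moving_average(rewards, window)
    start, best = smooth[0], max(smooth)
    if best <= start:
        return 1.0  # the curve never improved
    threshold = start + rel_gain * (best - start)
    for i, value in enumerate(smooth):
        if value > threshold:
            return i / len(smooth)
    return 1.0

# Synthetic curve: 60 flat episodes, then a linear rise over 40 episodes.
rewards = [0.0] * 60 + [float(i) for i in range(1, 41)]
print(flat_fraction(rewards))  # 0.67
```

On real runs the episode rewards would come from the training logs; a value above 0.5 would correspond to the situation described above.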

References

Mnih et al. 2015 — Human-level control through deep reinforcement learning

Raffin et al. 2021 — Stable-Baselines3: Reliable Reinforcement Learning Implementations

Laskin et al. 2020 — Reinforcement Learning with Augmented Data (RAD)
