Convolutional Policies and Preprocessing

By the end of this reading you will be able to:
  • Describe the standard Atari preprocessing pipeline (grayscale, resize to 84×84, frame stack k=4, normalize) and justify each step from first principles
  • Explain why frame stacking provides approximate temporal memory and identify scenarios where it is insufficient and recurrence is needed instead
  • Implement a DoomScreenWrapper that extracts the screen buffer, converts to grayscale, and resizes — and wire it into SB3's CnnPolicy via VecFrameStack

The Standard Preprocessing Pipeline

Most visual RL pipelines since Atari DQN apply some variant of the same four-step preprocessing to raw frames. Each step has a principled justification:

Step 1: Convert to Grayscale

RGB frames contain three channels; grayscale collapses them to one. The rationale:

  • No task-relevant signal is lost on most benchmarks: enemy positions, wall geometry, and moving objects are equally visible in grayscale
  • 3× reduction in input size: a 240×320 RGB frame has 230,400 values; grayscale has 76,800
  • Easier optimisation: the three channels are highly correlated, so they enlarge the input without adding much task-relevant information

The exception is tasks where colour is semantically meaningful (e.g., red health packs vs green items). For those, RGB or a learned colour space may outperform grayscale.

import cv2
gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)  # (H, W, 3) -> (H, W)
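
For intuition, the conversion is a fixed weighted sum of the three channels. The sketch below reproduces it in NumPy, assuming the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) that OpenCV documents for RGB-to-gray conversion. The helper name `rgb_to_gray` and the random frame are illustrative only, and rounding may differ from OpenCV's result by one intensity level.

```python
import numpy as np

# BT.601 luma weights, as documented for OpenCV's RGB -> gray conversion.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def rgb_to_gray(rgb_frame: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) uint8 RGB frame to an (H, W) uint8 grayscale frame."""
    gray = rgb_frame.astype(np.float32) @ LUMA_WEIGHTS  # weighted sum over channels
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

# Hypothetical 240x320 frame filled with random pixels.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
gray = rgb_to_gray(frame)
print(frame.size, gray.size)  # 230400 76800
```

Green dominates the sum, so an object that differs from its background mainly in red or blue loses contrast after conversion. That is one concrete way the colour exception above can bite.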

Step 2: Resize to 84×84

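84×84 is the de facto standard input resolution, established by Atari DQN. Downsampling to it helps in three ways:

  • GPU memory: a minibatch of 64 stacked 84×84×4 observations holds about 1.8M values, roughly 7 MB as float32 (1.8 MB as uint8); the same minibatch at 240×320×4 holds about 19.7M values, roughly 79 MB as float32, or about 11× more
  • Compute and parameters: convolution weights do not depend on input resolution, but convolution FLOPs and the flattened feature map (and therefore the first fully connected layer's weight count) grow with the number of pixels, that is, quadratically with side length
  • Information retention: game-relevant features (enemies, platforms, projectiles) are still visible at 84×84; most of what is discarded is background texture

small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)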
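
Memory figures like those in the list above depend on the storage dtype, so it helps to count bytes directly. The sketch below assumes a minibatch of 64 observations with 4 stacked frames each, and compares uint8 storage with float32. `batch_bytes` is an illustrative helper, and activation memory inside the network is not counted.

```python
def batch_bytes(height, width, channels, batch_size, bytes_per_value):
    """Bytes needed to hold one minibatch of stacked frames."""
    return height * width * channels * batch_size * bytes_per_value

MB = 1e6  # decimal megabytes

for name, (h, w) in {"84x84": (84, 84), "240x320": (240, 320)}.items():
    as_uint8 = batch_bytes(h, w, 4, 64, 1) / MB    # raw pixels
    as_float32 = batch_bytes(h, w, 4, 64, 4) / MB  # after conversion to float
    print(f"{name}: {as_uint8:.1f} MB uint8, {as_float32:.1f} MB float32")
# 84x84: 1.8 MB uint8, 7.2 MB float32
# 240x320: 19.7 MB uint8, 78.6 MB float32
```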

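INTER_AREA (area averaging) is preferred over INTER_LINEAR when downsampling because each output pixel averages all the source pixels it covers, which suppresses aliasing artifacts. Note that cv2.resize takes the target size as (width, height).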
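
The aliasing problem is easiest to see on a pattern finer than the output grid. The NumPy sketch below is an illustration of the idea rather than OpenCV's implementation: it downsamples one-pixel stripes by an integer factor of 2, once by dropping pixels and once by block averaging. The helper names are made up for this example.

```python
import numpy as np

def subsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Keep every `factor`-th pixel (no averaging)."""
    return img[::factor, ::factor]

def area_average(img: np.ndarray, factor: int) -> np.ndarray:
    """Average each non-overlapping factor x factor block."""
    h, w = img.shape
    blocks = img.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Vertical stripes one pixel wide: columns alternate 0, 255, 0, 255, ...
stripes = np.tile(np.array([0.0, 255.0]), (8, 4))  # shape (8, 8)

print(subsample(stripes, 2)[0])     # [0. 0. 0. 0.]
print(area_average(stripes, 2)[0])  # [127.5 127.5 127.5 127.5]
```

Dropping pixels turns the striped texture into a solid black region, which is a false structure. Averaging reports the correct mean brightness instead.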

Step 3: Frame Stacking

A single frame is ambiguous: it shows where objects are, but not which direction they're moving or how fast. Without motion information, even simple tasks like dodging a projectile are partially observable.

Frame stacking concatenates the last $k$ preprocessed frames along the channel dimension, giving the agent a short temporal window:

$$o_t^{\text{stacked}} = [f_{t-k+1}, f_{t-k+2}, \ldots, f_t] \in \mathbb{R}^{H \times W \times k}$$

With $k=4$ and 84×84 grayscale, the observation has shape $(84, 84, 4)$. The CNN sees four consecutive frames simultaneously and can detect motion from the differences between them, without any recurrence.

Why this works: for smooth, low-frequency motion (which covers most game mechanics), the velocity of an object is well approximated by its displacement between frames. The first convolutional layer spans all $k$ channels, so its filters can learn to compute such differences.

When it fails: long-range dependencies (e.g., remembering a key picked up 10 seconds ago), very fast motion where objects appear to teleport between frames, or any task requiring memory of more than $k$ steps. Recurrent policies (LSTM/GRU) are the standard solution for these cases.
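
A minimal sketch of the mechanism in plain NumPy, independent of any RL library. The `FrameStacker` class and the moving-dot frames are hypothetical. Filling the window with copies of the first frame at reset is one common convention; some implementations pad with zeros instead.

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keep the last k frames and expose them as one (H, W, k) array."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # At episode start, fill the window with copies of the first frame.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(frame)  # the oldest frame drops out automatically
        return self.observation()

    def observation(self) -> np.ndarray:
        return np.stack(list(self.frames), axis=-1)  # oldest first, newest last

def frame_with_dot(column: int, size: int = 84) -> np.ndarray:
    """Hypothetical frame: black background, one bright pixel in row 10."""
    frame = np.zeros((size, size), dtype=np.uint8)
    frame[10, column] = 255
    return frame

stacker = FrameStacker(k=4)
obs = stacker.reset(frame_with_dot(column=0))
for t in range(1, 4):
    obs = stacker.step(frame_with_dot(column=3 * t))  # moves 3 px per frame

# Recover per-frame displacement from the stack alone.
columns = [int(obs[10, :, i].argmax()) for i in range(4)]
print(obs.shape)         # (84, 84, 4)
print(columns)           # [0, 3, 6, 9]
print(np.diff(columns))  # [3 3 3]
```

The velocity (3 pixels per frame) is recoverable from a single stacked observation. Anything that happened more than three frames ago is gone, which is exactly the failure case described above.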

In Stable Baselines3, frame stacking is implemented as a VecEnv wrapper:

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

# Base vectorised env returns (84, 84, 1) observations
vec_env = make_vec_env(...)
# Stack 4 frames -> observations become (84, 84, 4)
stacked_env = VecFrameStack(vec_env, n_stack=4)

Step 4: Normalize to [0, 1]

Raw pixel values are integers in $[0, 255]$. Neural networks train more stably when inputs have approximately unit scale. SB3's CnnPolicy performs this normalization automatically when the observation space is a uint8 image and normalize_images is left at its default of True:

# Inside SB3's preprocessing (applied automatically):
processed = frame.astype(np.float32) / 255.0
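
Because the scaling happens inside the policy, a preprocessing wrapper can keep emitting uint8 frames. The small check below uses a made-up three-pixel frame to show the resulting range, and why deferring the float conversion is attractive: stored observations stay four times smaller.

```python
import numpy as np

# Hypothetical 1x3 frame covering the darkest, middle, and brightest values.
frame = np.array([[0, 128, 255]], dtype=np.uint8)
processed = frame.astype(np.float32) / 255.0

print(processed.dtype, processed.min(), processed.max())  # float32 0.0 1.0
print(frame.nbytes, processed.nbytes)                     # 3 12
```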

The Nature DQN CNN Encoder

The architecture introduced in Mnih et al. (2015) has three convolutional layers followed by a fully connected layer:

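| Layer | Filters / units | Kernel | Stride | Output size |
|---|---|---|---|---|
| Conv1 | 32 | 8×8 | 4 | (20, 20, 32) |
| Conv2 | 64 | 4×4 | 2 | (9, 9, 64) |
| Conv3 | 64 | 3×3 | 1 | (7, 7, 64) |
| Flatten | - | - | - | 3136 |
| FC | 512 | - | - | 512 |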

Input: $(84, 84, 4)$ stacked frames. Output: 512-dimensional feature vector fed into the policy/value head.

Total parameters: ~1.7M — lightweight enough to train entirely from RL signal.
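
The output sizes and the parameter total can be reproduced with a few lines of arithmetic. The sketch assumes unpadded convolutions, so each output side is floor((size - kernel) / stride) + 1, and one bias per output channel or unit. The helper names are illustrative.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output side length of an unpadded convolution."""
    return (size - kernel) // stride + 1

def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch

# (in_channels, out_channels, kernel, stride) for the three conv layers.
layers = [(4, 32, 8, 4), (32, 64, 4, 2), (64, 64, 3, 1)]

size, total = 84, 0
for in_ch, out_ch, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    total += conv_params(in_ch, out_ch, kernel)
    print(f"conv {in_ch}->{out_ch}: output ({size}, {size}, {out_ch})")

flat = size * size * layers[-1][1]  # 7 * 7 * 64
total += flat * 512 + 512           # fully connected layer
print(flat, total)  # 3136 1684128
```

About 95% of the parameters sit in the fully connected layer (3136 × 512 weights), which is why the size of the final feature map matters far more for model size than the convolution kernels do.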

SB3's CnnPolicy uses this exact architecture as its default encoder:

from stable_baselines3 import PPO

model = PPO(
    policy='CnnPolicy',
    env=stacked_env,
    ...
)
print(model.policy.features_extractor)
# NatureCNN(
#   (cnn): Sequential(Conv2d(4,32,8,4), ReLU,
#                     Conv2d(32,64,4,2), ReLU,
#                     Conv2d(64,64,3,1), ReLU, Flatten)
#   (linear): Sequential(Linear(3136,512), ReLU)
# )

Implementing a Preprocessing Wrapper

Gymnasium's ObservationWrapper is the standard way to apply preprocessing. The wrapper intercepts every observation and transforms it before the agent sees it:

import gymnasium as gym
import numpy as np
import cv2

class DoomScreenWrapper(gym.ObservationWrapper):
    '''Extract screen buffer, convert to grayscale, resize to (size, size, 1).'''

    def __init__(self, env, screen_size=84):
        super().__init__(env)
        self.screen_size = screen_size
        self.observation_space = gym.spaces.Box(
            low=0, high=255,
            shape=(screen_size, screen_size, 1),
            dtype=np.uint8,
        )

    def observation(self, obs):
        # obs is a dict {'screen': ndarray, 'gamevariables': ndarray}
        screen = obs['screen'] if isinstance(obs, dict) else obs
        gray   = cv2.cvtColor(screen, cv2.COLOR_RGB2GRAY)
        small  = cv2.resize(gray, (self.screen_size, self.screen_size),
                             interpolation=cv2.INTER_AREA)
        return small[:, :, np.newaxis]   # (H, W) -> (H, W, 1)

The full pipeline — vectorisation, wrapping, and stacking:

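from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack
# Assumption: ViZDoom's Gymnasium wrapper is installed; importing it
# registers the Vizdoom* environment IDs with Gymnasium.
from vizdoom import gymnasium_wrapper  # noqa: F401

def make_doom(env_id, n_envs=4, n_stack=4, seed=0):
    def _make():
        return DoomScreenWrapper(gym.make(env_id))
    # make_vec_env already wraps each env in a Monitor, so episode
    # statistics are recorded without an additional VecMonitor.
    vec = make_vec_env(_make, n_envs=n_envs, seed=seed)
    return VecFrameStack(vec, n_stack=n_stack)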

# Observation: (n_envs, 84, 84, 4)
train_env = make_doom('VizdoomBasic-v1', n_envs=8, n_stack=4)

Training PPO with CnnPolicy

With image observations, PPO's hyperparameters shift relative to the MLP setting:

| Hyperparameter | MLP typical | CNN typical | Why |
|---|---|---|---|
| n_steps | 2048 | 512 | Shorter rollouts; more frequent encoder updates |
| n_epochs | 10 | 4 | Fewer passes per rollout (CNNs overfit more easily) |
| learning_rate | 3e-4 | 2.5e-4 | Slightly lower for stability with larger input |
| ent_coef | 0.0 | 0.01 | Entropy bonus encourages visual exploration |
| n_envs | 4–8 | 8–16 | More envs compensate for lower sample efficiency |

model = PPO(
    policy='CnnPolicy',
    env=train_env,
    n_steps=512,
    batch_size=64,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    learning_rate=2.5e-4,
    seed=0,
)
model.learn(total_timesteps=300_000)
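
To see what these settings mean in terms of updates, the arithmetic below works through one configuration. It assumes the 8 environments used for train_env above, and that training proceeds in whole rollouts, so the last rollout may overshoot total_timesteps slightly.

```python
import math

n_envs, n_steps = 8, 512
batch_size, n_epochs = 64, 4
total_timesteps = 300_000

rollout_size = n_envs * n_steps                         # transitions per rollout
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs
n_rollouts = math.ceil(total_timesteps / rollout_size)  # whole rollouts only

print(rollout_size)                # 4096
print(minibatches_per_epoch)       # 64
print(gradient_steps_per_rollout)  # 256
print(n_rollouts)                  # 74
```

With only about 74 policy iterations in the whole run, each rollout matters. This is one reason shorter rollouts and more environments are a common trade for image inputs.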

Interpreting Visual RL Learning Curves

Visual RL learning curves have a characteristic shape that differs from state-based RL:

  • Flat early phase (often 0–30% of training): the randomly initialised encoder produces features that carry little task-relevant signal, so policy-gradient updates are dominated by noise and returns stay near the random-policy level
  • Rapid rise: once the encoder begins to extract consistent features, the policy improves quickly
  • Plateau: the encoder has converged; further improvement requires more data or architectural changes

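If the curve is flat for more than 50% of total training, consider a higher ent_coef or data augmentation (random crops or shifts; colour jitter if you keep RGB input) to improve encoder learning. A longer learning_starts is an option only for off-policy algorithms such as DQN; PPO has no such parameter.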
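
One rough way to put a number on the flat phase is to smooth the episode returns and find where the curve first makes a small fraction of its eventual gain. The helper below is a heuristic sketch, not a standard metric: `flat_fraction`, the window of 20 episodes, and the 10% threshold are all assumptions, and the curve is synthetic rather than real training data.

```python
import numpy as np

def flat_fraction(returns, window=20, threshold=0.1):
    """Fraction of the smoothed curve that passes before the return has
    made `threshold` of its eventual gain. Returns 1.0 if it never improves."""
    returns = np.asarray(returns, dtype=np.float64)
    smooth = np.convolve(returns, np.ones(window) / window, mode="valid")
    gain = smooth.max() - smooth[0]
    if gain <= 0:
        return 1.0
    progress = (smooth - smooth[0]) / gain
    first = int(np.argmax(progress >= threshold))  # first index past threshold
    return first / len(smooth)

# Synthetic curve (not real data): flat, then a linear rise, then a plateau.
curve = np.concatenate([np.zeros(300), np.linspace(0, 100, 200), np.full(500, 100.0)])
print(f"{flat_fraction(curve):.2f}")  # 0.32
```

A value above 0.5 on a real run would correspond to the warning sign described above. The measure is counted in episodes rather than timesteps, so it is only a proxy when episode lengths change during training.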