Convolutional Policies and Preprocessing
- Describe the standard Atari preprocessing pipeline (grayscale, resize to 84×84, frame stack k=4, normalize) and justify each step from first principles
- Explain why frame stacking provides approximate temporal memory and identify scenarios where it is insufficient and recurrence is needed instead
- Implement a DoomScreenWrapper that extracts the screen buffer, converts to grayscale, and resizes — and wire it into SB3's CnnPolicy via VecFrameStack
The Standard Preprocessing Pipeline
Most pixel-based RL results since Atari DQN apply some variant of the same four-step preprocessing pipeline to raw frames. Each step has a principled justification:
Step 1: Convert to Grayscale
RGB frames contain three channels; grayscale collapses them to one. The rationale:
- No task-relevant signal is lost on most benchmarks: enemy positions, wall geometry, and moving objects are equally visible in grayscale
- 3× reduction in input size: a 240×320 RGB frame has 230,400 values; grayscale has 76,800
- Easier optimisation: the three colour channels are highly correlated, so the extra inputs give the network more to fit without adding much information
The exception is tasks where colour is semantically meaningful (e.g., red health packs vs green items). For those, RGB or a learned colour space may outperform grayscale.
import cv2
gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY) # (H, W, 3) -> (H, W)
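To see what the conversion does to individual pixels, the sketch below applies the luma weights (0.299, 0.587, 0.114) that OpenCV documents for RGB-to-gray conversion. The helper name `to_gray` and the sample colours are illustrative assumptions, not part of any library. It also shows the colour exception in miniature: two clearly different colours can land on the same gray level.

```python
def to_gray(r, g, b):
    """Luma-weighted grayscale value for one RGB pixel (0-255 per channel)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)

# Hypothetical sample colours, chosen to collide after conversion.
bright_red = (255, 0, 0)
dark_green = (0, 130, 0)
white = (255, 255, 255)

gray_red = to_gray(*bright_red)
gray_green = to_gray(*dark_green)
gray_white = to_gray(*white)

print(gray_red, gray_green, gray_white)  # 76 76 255
```

If a task depends on telling such colours apart, the grayscale observation no longer carries that distinction.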
Step 2: Resize to 84×84
Downsampling to 84×84 is the field-standard resolution, established by Atari DQN:
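- GPU memory: a minibatch of 64 stacked 84×84×4 observations holds about 1.8M values (~7 MB as float32); the same batch at 240×320×4 holds about 19.7M values (~79 MB), roughly 11× more, before counting activations
- Compute and parameter count: convolutional weights do not depend on input resolution, but activation memory and compute scale with the number of pixels, and the first fully connected layer grows with the flattened feature-map size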
- Information retention: game-relevant features (enemies, platforms, projectiles) are still visible at 84×84; most background texture is legitimately discarded
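The memory argument can be checked with a few lines of arithmetic. This is a rough sketch that counts only the input tensor (not activations or gradients) and assumes a batch of 64 with 4 stacked frames; `batch_bytes` is a hypothetical helper.

```python
def batch_bytes(height, width, n_stack, batch_size, bytes_per_value):
    """Bytes needed to hold one minibatch of stacked frames."""
    return height * width * n_stack * batch_size * bytes_per_value

MB = 1e6
small_f32 = batch_bytes(84, 84, 4, 64, 4)     # float32 after normalization
large_f32 = batch_bytes(240, 320, 4, 64, 4)
small_u8 = batch_bytes(84, 84, 4, 64, 1)      # uint8 as stored in a buffer

print(f"84x84 float32:   {small_f32 / MB:.1f} MB")
print(f"240x320 float32: {large_f32 / MB:.1f} MB")
print(f"84x84 uint8:     {small_u8 / MB:.1f} MB")
print(f"ratio: {large_f32 / small_f32:.1f}x")
```

The ratio between the two resolutions is independent of dtype and batch size, since it is just the ratio of pixel counts.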
small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
INTER_AREA (area averaging) is preferred over INTER_LINEAR when downsampling because it avoids aliasing artifacts.
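A one-dimensional toy example illustrates the aliasing problem. It compares plain subsampling (keeping every second pixel) with block averaging on a stripe pattern. This is not OpenCV's implementation, only a sketch of the idea behind area averaging; both helper names are hypothetical.

```python
def subsample(row, factor):
    """Keep every `factor`-th pixel, with no filtering."""
    return row[::factor]

def area_average(row, factor):
    """Average each block of `factor` neighbouring pixels."""
    return [sum(row[i:i + factor]) / factor
            for i in range(0, len(row) - factor + 1, factor)]

stripes = [0, 255] * 4           # alternating black/white pixels

print(subsample(stripes, 2))     # [0, 0, 0, 0]
print(area_average(stripes, 2))  # [127.5, 127.5, 127.5, 127.5]
```

Subsampling reports a solid black row, which is an artifact of where the samples happen to fall; averaging reports mid-gray, which reflects the actual content of each block.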
Step 3: Frame Stacking
A single frame is ambiguous: it shows where objects are, but not which direction they're moving or how fast. Without motion information, even simple tasks like dodging a projectile are partially observable.
Frame stacking concatenates the last $k$ preprocessed frames along the channel dimension, giving the agent a short temporal window:
$$o_t = [x_{t-k+1}, \ldots, x_t]$$
where $x_t$ is the preprocessed frame at step $t$. With $k=4$ and 84×84 grayscale frames, the observation has shape 84×84×4. The CNN sees four consecutive frames simultaneously and can detect motion from the differences between them, without any recurrence.
Why this works: for smooth, low-frequency motion (which covers most game mechanics), the velocity of an object is well-approximated by its displacement between frames. The CNN implicitly learns to compute differences.
When it fails: long-range dependencies (e.g., remembering a key picked up 10 seconds ago), very fast motion where objects teleport between frames, or tasks requiring memory of more than $k$ steps. Recurrent policies (LSTM/GRU) are the standard solution for these cases.
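The mechanism can be sketched without any RL library. The hypothetical `FrameStack` class below keeps the last k frames in a deque and pads by repeating the first frame after a reset; that padding choice is an assumption, and VecFrameStack may pad differently. Frames are stood in for by a single number, the x-position of a projectile.

```python
from collections import deque

class FrameStack:
    """Keep the k most recent frames; pad with the first frame after reset."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def step(self, frame):
        self.frames.append(frame)   # the oldest frame is dropped automatically
        return tuple(self.frames)

# Toy "frames": the x-position of a projectile moving 3 pixels per step.
stack = FrameStack(k=4)
obs = stack.reset(10)
for x in (13, 16, 19):
    obs = stack.step(x)

velocity = obs[-1] - obs[-2]   # displacement between the two newest frames
print(obs, velocity)           # (10, 13, 16, 19) 3
```

Anything that happened more than k steps ago has already been pushed out of the deque, which is exactly the limitation that motivates recurrent policies.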
In Stable Baselines3, frame stacking is implemented as a VecEnv wrapper:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack
# Base vectorised env returns (84, 84, 1) observations
vec_env = make_vec_env(...)
# Stack 4 frames -> observations become (84, 84, 4)
stacked_env = VecFrameStack(vec_env, n_stack=4)
Step 4: Normalize to [0, 1]
Raw pixel values are integers in $[0, 255]$. Neural networks train more stably when inputs have approximately unit scale. SB3's CnnPolicy performs this normalization automatically:
import numpy as np
# Equivalent to what SB3's preprocessing does internally (applied automatically):
processed = frame.astype(np.float32) / 255.0
The Nature DQN CNN Encoder
The architecture introduced in Mnih et al. (2015) has three convolutional layers followed by a fully connected layer:
| Layer | Filters | Kernel | Stride | Output size |
|---|---|---|---|---|
| Conv1 | 32 | 8×8 | 4 | (20, 20, 32) |
| Conv2 | 64 | 4×4 | 2 | (9, 9, 64) |
| Conv3 | 64 | 3×3 | 1 | (7, 7, 64) |
| Flatten | — | — | — | 3136 |
| FC | 512 | — | — | 512 |
Input: 84×84×4 stacked frames. Output: 512-dimensional feature vector fed into the policy/value head.
Total parameters: ~1.7M — lightweight enough to train entirely from RL signal.
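The output sizes and the parameter total in the table above can be reproduced with the usual no-padding convolution formula, out = (in − kernel) / stride + 1. The sketch assumes an 84×84×4 input, no padding, and one bias per output unit; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a conv layer with no padding."""
    return (size - kernel) // stride + 1

def conv_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch

layers = [  # (in_channels, out_channels, kernel, stride)
    (4, 32, 8, 4),
    (32, 64, 4, 2),
    (64, 64, 3, 1),
]

size, total = 84, 0
sizes = []
for in_ch, out_ch, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)
    total += conv_params(in_ch, out_ch, kernel)

flat = size * size * layers[-1][1]   # flattened feature-map length
fc_params = flat * 512 + 512         # fully connected layer
total += fc_params

print(sizes, flat, total)            # [20, 9, 7] 3136 1684128
```

Roughly 95% of the parameters sit in the fully connected layer, not in the convolutions, which is why the size of the final feature map matters more for model size than the kernel shapes do.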
SB3's CnnPolicy uses this exact architecture as its default encoder:
from stable_baselines3 import PPO
model = PPO(
    policy='CnnPolicy',
    env=stacked_env,
    # ... remaining hyperparameters
)
print(model.policy.features_extractor)
# Abbreviated output:
# NatureCNN(
# (cnn): Sequential(Conv2d(4,32,8,4), ReLU,
# Conv2d(32,64,4,2), ReLU,
# Conv2d(64,64,3,1), ReLU, Flatten)
# (linear): Sequential(Linear(3136,512), ReLU)
# )
Implementing a Preprocessing Wrapper
Gymnasium's ObservationWrapper is the standard way to apply preprocessing. The wrapper intercepts every observation and transforms it before the agent sees it:
import gymnasium as gym
import numpy as np
import cv2
class DoomScreenWrapper(gym.ObservationWrapper):
    '''Extract screen buffer, convert to grayscale, resize to (size, size, 1).'''

    def __init__(self, env, screen_size=84):
        super().__init__(env)
        self.screen_size = screen_size
        self.observation_space = gym.spaces.Box(
            low=0, high=255,
            shape=(screen_size, screen_size, 1),
            dtype=np.uint8,
        )

    def observation(self, obs):
        # obs is a dict {'screen': ndarray, 'gamevariables': ndarray};
        # a bare ndarray is passed through unchanged
        screen = obs['screen'] if isinstance(obs, dict) else obs
        gray = cv2.cvtColor(screen, cv2.COLOR_RGB2GRAY)
        small = cv2.resize(gray, (self.screen_size, self.screen_size),
                           interpolation=cv2.INTER_AREA)
        return small[:, :, np.newaxis]  # (H, W) -> (H, W, 1)
The full pipeline — vectorisation, wrapping, and stacking:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack
from vizdoom import gymnasium_wrapper  # noqa: F401 (importing registers the Vizdoom* env IDs)

def make_doom(env_id, n_envs=4, n_stack=4, seed=0):
    def _make():
        return DoomScreenWrapper(gym.make(env_id))
    # make_vec_env already wraps each env in a Monitor, so no VecMonitor is needed
    vec = make_vec_env(_make, n_envs=n_envs, seed=seed)
    return VecFrameStack(vec, n_stack=n_stack)

# Observation: (n_envs, 84, 84, 4)
train_env = make_doom('VizdoomBasic-v1', n_envs=8, n_stack=4)
Training PPO with CnnPolicy
With image observations, PPO's hyperparameters shift relative to the MLP setting:
| Hyperparameter | MLP typical | CNN typical | Why |
|---|---|---|---|
| n_steps | 2048 | 512 | Shorter rollouts; more frequent encoder updates |
| n_epochs | 10 | 4 | Fewer passes per rollout (CNNs overfit more easily) |
| learning_rate | 3e-4 | 2.5e-4 | Slightly lower for stability with larger input |
| ent_coef | 0.0 | 0.01 | Entropy bonus encourages visual exploration |
| n_envs | 4–8 | 8–16 | More envs compensate for lower sample efficiency |
model = PPO(
    policy='CnnPolicy',
    env=train_env,
    n_steps=512,
    batch_size=64,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    learning_rate=2.5e-4,
    seed=0,
)
model.learn(total_timesteps=300_000)
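It helps to translate these settings into how much data and how many gradient steps each update involves. The arithmetic below is a sketch using the values above with 8 parallel environments, and it assumes that training always collects whole rollouts, so the run stops at the first rollout boundary at or past total_timesteps.

```python
import math

n_envs, n_steps = 8, 512
batch_size, n_epochs = 64, 4
total_timesteps = 300_000

rollout_size = n_envs * n_steps                       # transitions per rollout
minibatches_per_epoch = rollout_size // batch_size
grad_steps_per_rollout = minibatches_per_epoch * n_epochs
n_rollouts = math.ceil(total_timesteps / rollout_size)
actual_timesteps = n_rollouts * rollout_size

print(rollout_size, minibatches_per_epoch, grad_steps_per_rollout)  # 4096 64 256
print(n_rollouts, actual_timesteps)                                 # 74 303104
```

Under these assumptions the encoder receives 256 gradient steps per rollout and about 19,000 over the whole run, which is a useful number to keep in mind when deciding whether 300,000 timesteps is enough.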
Interpreting Visual RL Learning Curves
Visual RL learning curves have a characteristic shape that differs from state-based RL:
- Flat early phase (often 0–30% of training): the randomly initialised encoder produces features with little task-relevant structure, so the policy receives almost no usable learning signal and appears to do nothing useful
- Rapid rise: once the encoder begins to extract consistent features, the policy improves quickly
- Plateau: the encoder has converged; further improvement requires more data or architectural changes
If the curve is flat for more than 50% of total training, consider a higher ent_coef or data augmentation (random crops, colour jitter) to improve encoder learning. With off-policy algorithms such as DQN, a longer learning_starts is another option; PPO has no such parameter.
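The 50% rule of thumb can be applied to logged rewards with a small helper. The function below is a hypothetical sketch, not an SB3 utility: it smooths the curve with a moving average and reports the fraction of training that passes before the smoothed reward first rises above its starting level by a chosen share of the observed range. The window and threshold values are assumptions to tune per task.

```python
def flat_fraction(rewards, window=5, min_gain=0.1):
    """Fraction of the curve before the smoothed reward first rises.

    The rise is detected when the moving average exceeds the initial
    moving average by `min_gain` times the total observed range.
    """
    if len(rewards) < window:
        raise ValueError("need at least `window` reward values")
    smoothed = [sum(rewards[i:i + window]) / window
                for i in range(len(rewards) - window + 1)]
    span = max(smoothed) - min(smoothed)
    if span == 0:
        return 1.0                       # the curve never moved
    threshold = smoothed[0] + min_gain * span
    for i, value in enumerate(smoothed):
        if value > threshold:
            return i / len(smoothed)
    return 1.0                           # no rise above the starting level

# Synthetic curve: flat, then a rapid rise, then a plateau.
curve = [0.0] * 12 + [1.0, 2.0, 4.0, 6.0, 8.0] + [9.0] * 7
frac = flat_fraction(curve)
print(f"flat for {frac:.0%} of training")
```

On real runs the input would be the mean episode reward per evaluation or per rollout; noisy curves usually need a larger window than this synthetic one.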