Convolutional Layers and the Vision Inductive Bias
- Compute the output size of a convolutional layer given input size, kernel size, stride, and padding, and determine the receptive field of a neuron at a given depth
- Explain how weight sharing makes convolutions parameter-efficient and why translation equivariance is a useful inductive bias for image tasks
- Distinguish max pooling from average pooling and global average pooling, and identify what each discards and what it preserves
- Explain what depthwise separable convolutions are, compute the parameter and multiply-accumulate reduction factor vs. standard convolutions, and name one architecture that uses them
Why Not Just Use MLPs for Images?
An MLP applied to an image treats each pixel as an independent input feature. A $224 \times 224$ RGB image has $224 \cdot 224 \cdot 3 = 150{,}528$ inputs, so the first hidden layer of a width-1024 MLP alone would have about 154 million parameters. More importantly, the MLP has no notion of locality: nothing in its architecture distinguishes a pixel's immediate neighbor from a pixel at the opposite corner.
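The parameter count above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 224 x 224 RGB input (the resolution that yields 150,528 values); the variable names are illustrative only.

```python
# Parameter count for the first layer of an MLP on a flattened RGB image.
# Assumption: 224 x 224 input, the resolution that gives 150,528 values.
height, width, channels = 224, 224, 3
hidden_width = 1024

num_inputs = height * width * channels      # every pixel value is a separate feature
weights = num_inputs * hidden_width         # one weight per (input, hidden unit) pair
biases = hidden_width                       # one bias per hidden unit
total = weights + biases

print(f"inputs: {num_inputs:,}")                 # inputs: 150,528
print(f"first-layer parameters: {total:,}")      # first-layer parameters: 154,141,696
```

The count grows linearly with the number of pixels, so doubling the image side length quadruples the size of this one layer.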
The visual world has structure that we should build into the architecture:
- Locality: meaningful patterns (edges, textures, shapes) are local — they span small regions of the image
- Translation invariance: a cat in the upper left looks the same as a cat in the lower right
- Compositionality: higher-level features (eyes, ears) are built from lower-level ones (edges, curves)
Convolutional layers encode these as inductive biases — assumptions about the problem structure built into the architecture itself.
The Convolution Operation
A convolutional layer slides a small kernel (filter) over the input and computes a dot product at each position:
$$Y[i, j] = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X[i+m,\, j+n]\, W[m, n]$$
where $X$ is the input feature map and $W$ is a $K \times K$ kernel.
(Technical note: deep learning frameworks compute cross-correlation, not true convolution — the kernel is not flipped. The distinction is irrelevant for learned kernels.)
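The sliding dot product can be written out directly. The following is a rough NumPy sketch for a single-channel input with no padding and stride 1; `cross_correlate2d` is a hypothetical helper written for illustration, not a library function, and real frameworks use far faster implementations.

```python
import numpy as np

def cross_correlate2d(x, w):
    """Valid (no padding), stride-1 cross-correlation of a 2-D input with a 2-D kernel."""
    k_h, k_w = w.shape
    out_h = x.shape[0] - k_h + 1
    out_w = x.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the window whose top-left corner is (i, j).
            out[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * w)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)    # toy 4 x 4 "image", x[i, j] = 4i + j
w = np.array([[1.0, 0.0],
              [0.0, -1.0]])                     # toy 2 x 2 kernel: x[i, j] - x[i+1, j+1]

y = cross_correlate2d(x, w)                     # what deep learning layers compute
y_flipped = cross_correlate2d(x, np.flip(w))    # true convolution flips the kernel first

print(y.shape)      # (3, 3)
print(y[0, 0])      # -5.0
print(y_flipped[0, 0])  # 5.0
```

On this toy input every entry of `y` is -5.0 and every entry of `y_flipped` is 5.0, which shows that flipping changes the result for a fixed kernel. A learned kernel can simply absorb the flip, which is why the distinction does not matter in practice.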
Key Parameters
| Parameter | Effect |
|---|---|
| Kernel size ($K$) | Spatial extent of the filter, which sets the receptive field per layer; $3 \times 3$ is standard |
| Stride ($S$) | Step size of the sliding window; stride 2 halves spatial dimensions |
| Padding ($P$) | Zeros added to the border; same padding preserves spatial size at stride 1 |
| Channels in | Number of input feature maps |
| Channels out | Number of output feature maps = number of kernels |
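Output size formula:
$$H_{out} = \left\lfloor \frac{H_{in} + 2P - K}{S} \right\rfloor + 1$$
where $K$ is the kernel size, $S$ the stride, and $P$ the padding; the same formula gives $W_{out}$ from $W_{in}$.
Example: Input $32 \times 32$, kernel $3 \times 3$, stride 1, padding 1 → output $32 \times 32$ (same padding).
Input $32 \times 32$, kernel $3 \times 3$, stride 2, padding 1 → output $16 \times 16$.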
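The output-size rule, H_out = floor((H_in + 2P - K) / S) + 1 as written in the PyTorch comments later in this section, can be wrapped in a small helper. `conv_output_size` is a hypothetical name used here for illustration; the sketch assumes a square kernel, symmetric padding, and no dilation.

```python
def conv_output_size(size_in, kernel, stride=1, padding=0):
    """Output length along one spatial axis: floor((size_in + 2P - K) / S) + 1."""
    return (size_in + 2 * padding - kernel) // stride + 1

print(conv_output_size(32, kernel=3, stride=1, padding=1))   # 32  ("same" padding)
print(conv_output_size(32, kernel=3, stride=2, padding=1))   # 16  (stride 2 halves the size)
print(conv_output_size(32, kernel=3, stride=1, padding=0))   # 30  ("valid": no padding)
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
```

Height and width are handled independently, so the helper is called once per axis when they differ.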
Weight Sharing and Parameter Efficiency
The kernel weights are shared across all spatial positions. A single $K \times K$ kernel with $C_{in}$ input channels has $K^2 C_{in}$ parameters (plus one bias), regardless of the input spatial size.
Compare: a fully connected layer over a flattened $H \times W \times C_{in}$ input to produce $N$ outputs would need $H W C_{in} N$ parameters. A convolutional layer with $C_{out}$ kernels needs only $K^2 C_{in} C_{out}$ parameters, whatever $H$ and $W$ are.
Weight sharing encodes the prior that the same feature detector (edge, color gradient) is useful everywhere in the image. This is translation equivariance: if the input shifts, the output feature map shifts by the same amount.
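Translation equivariance can be checked numerically. The sketch below assumes wrap-around (circular) borders so that a shift is well defined at the edges; `circular_cross_correlate2d` is a hypothetical helper for illustration. With ordinary zero padding the same property holds only away from the border.

```python
import numpy as np

def circular_cross_correlate2d(x, w):
    """Stride-1 cross-correlation with wrap-around borders; output has the input's shape."""
    out = np.zeros_like(x, dtype=float)
    for m in range(w.shape[0]):
        for n in range(w.shape[1]):
            # Rolling by (-m, -n) places x[i + m, j + n] at position (i, j).
            out += w[m, n] * np.roll(x, shift=(-m, -n), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
w = rng.normal(size=(3, 3))     # the same 9 weights are reused at all 64 positions

shift = (2, 3)                  # move the input 2 rows down and 3 columns right
shifted_then_filtered = circular_cross_correlate2d(np.roll(x, shift, axis=(0, 1)), w)
filtered_then_shifted = np.roll(circular_cross_correlate2d(x, w), shift, axis=(0, 1))

equivariant = np.allclose(shifted_then_filtered, filtered_then_shifted)
print(equivariant)  # True
```

Shifting the input and then filtering gives the same feature map as filtering and then shifting, which is exactly what sharing one kernel across all positions guarantees.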
Receptive Field
The receptive field of a neuron is the region of the input that influences its value. For a single $3 \times 3$ conv layer, each output neuron sees a $3 \times 3$ input region. After $L$ such layers (stride 1), the receptive field grows to $(2L+1) \times (2L+1)$.
Stacking small layers covers the same receptive field as a single large kernel: two $3 \times 3$ layers give a $5 \times 5$ receptive field with fewer parameters ($2 \cdot 3^2 = 18$ vs. $5^2 = 25$ weights per input-output channel pair) and an extra non-linearity.
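Receptive-field growth can be computed layer by layer. The sketch below uses the standard recurrence in which each layer adds (kernel - 1) times the current spacing between neighboring units; `receptive_field` is a hypothetical helper, and it assumes no dilation.

```python
def receptive_field(layers):
    """Receptive field along one axis after a stack of conv or pooling layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # spacing, in input pixels, between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))            # 3
print(receptive_field([(3, 1), (3, 1)]))    # 5
print(receptive_field([(3, 1)] * 5))        # 11
print(receptive_field([(3, 2), (3, 1)]))    # 7

# Weights per input-output channel pair: two 3x3 layers vs. one 5x5 layer.
print(2 * 3 * 3, "vs", 5 * 5)               # 18 vs 25
```

The last receptive-field example shows why strided layers matter: once the stride is 2, every later layer widens the receptive field twice as fast.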
Pooling
Pooling layers reduce spatial dimensions by summarizing a local region:
Max pooling: takes the maximum value in each window. Preserves the strongest activation and is invariant to small translations; discards the weaker activations and the exact position of the maximum within the window.
Average pooling: takes the mean. Smoother; preserves the overall response level of the window but blurs sharp peaks.
Global average pooling (GAP): averages each entire feature map to a single scalar. Collapses $C \times H \times W$ to $C \times 1 \times 1$, discarding all spatial layout. Used before the final classification head in ResNets and modern CNNs, where it replaces large fully connected layers and greatly reduces parameter count.
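The three pooling variants are easy to compare on a tiny feature map. This is a minimal NumPy sketch for non-overlapping windows on a single channel; `pool2d` is a hypothetical helper written for this example, not a framework API.

```python
import numpy as np

def pool2d(x, size, reduce_fn):
    """Non-overlapping size x size pooling of a 2-D map whose sides are divisible by `size`."""
    h, w = x.shape
    # Split rows and columns into (block index, offset within block) pairs.
    windows = x.reshape(h // size, size, w // size, size)
    # Reduce over the two within-block axes.
    return reduce_fn(windows, axis=(1, 3))

x = np.arange(1, 17, dtype=float).reshape(4, 4)   # values 1..16, row by row

max_pooled = pool2d(x, 2, np.max)
avg_pooled = pool2d(x, 2, np.mean)
gap = x.mean()                                    # global average pooling of one channel

print(max_pooled)   # [[ 6.  8.] [14. 16.]]
print(avg_pooled)   # [[ 3.5  5.5] [11.5 13.5]]
print(gap)          # 8.5
```

Max pooling keeps only the largest value of each window, average pooling keeps the window's mean, and GAP reduces the whole map to one number per channel.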
Depthwise Separable Convolutions
Standard convolution mixes spatial filtering and channel mixing in one operation. Depthwise separable convolution splits them:
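- Depthwise conv: Apply a separate $K \times K$ kernel to each input channel independently (spatial filtering only, no channel mixing). Parameters: $K^2 C_{in}$
- Pointwise conv: Apply a $1 \times 1$ convolution to mix channels. Parameters: $C_{in} C_{out}$
Parameter reduction factor vs. standard conv:
$$\frac{K^2 C_{in} + C_{in} C_{out}}{K^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{K^2}$$
The multiply-accumulate count shrinks by the same ratio, because both layer types apply their weights once per output position.
For $K = 3$: roughly $8$ to $9\times$ fewer parameters at typical channel widths. For $K = 3$, $C_{out} = 128$: $1152 / 137 \approx 8.4\times$ reduction.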
Used in: MobileNet, Xception, EfficientNet — architectures designed for mobile and edge deployment.
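The savings can be tallied for a concrete layer. The sketch below counts weights and multiply-accumulates (MACs) for a standard and a depthwise separable convolution; the helper names are hypothetical, biases are ignored, and the separable layer is assumed to use stride 1 so both of its stages run at the same output resolution.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Weights and MACs of a standard k x k convolution (biases ignored)."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out        # every weight is used once per output position
    return params, macs

def separable_conv_cost(k, c_in, c_out, h_out, w_out):
    """Weights and MACs of a depthwise (k x k) plus pointwise (1 x 1) pair."""
    depthwise = k * k * c_in             # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
    params = depthwise + pointwise
    macs = params * h_out * w_out
    return params, macs

std_params, std_macs = standard_conv_cost(3, 64, 128, 32, 32)
sep_params, sep_macs = separable_conv_cost(3, 64, 128, 32, 32)

print(std_params, sep_params)                # 73728 8768
print(std_macs, sep_macs)                    # 75497472 8978432
print(round(std_params / sep_params, 2))     # 8.41
print(round(1 / (1 / 128 + 1 / 9), 2))       # 8.41, from 1 / (1/C_out + 1/K^2)
```

The ratio is the same for parameters and MACs, and it approaches 9 for a 3 x 3 kernel as the number of output channels grows.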
PyTorch and TensorFlow
PyTorch — nn.Conv2d, output size, pooling, depthwise separable:
```python
import torch
import torch.nn as nn

# Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
x = torch.randn(8, 3, 32, 32)   # (B, C_in, H, W)
out = conv(x)                   # (8, 64, 32, 32); padding=1 preserves spatial size

# Output size formula: H_out = floor((H_in + 2P - K) / S) + 1
# K=3, P=1, S=1 → floor((32 + 2 - 3) / 1) + 1 = 32 ✓

# Stride=2 for downsampling (often used in place of max pooling in modern nets)
conv_down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
out_down = conv_down(out)       # (8, 128, 16, 16)

# Max pooling and global average pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)        # halves H and W
avg_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))     # global average pool → (B, C, 1, 1)

# Depthwise separable convolution (MobileNet style)
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # groups=C_in
pointwise = nn.Conv2d(64, 128, kernel_size=1)                       # 1x1 conv mixes channels


class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        # Pointwise: 1x1 conv that mixes channels
        self.pw = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pw(self.dw(x))


# Typical CNN block: Conv → BN → ReLU
block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```
TensorFlow / Keras:
```python
import tensorflow as tf

# Conv2D uses channels-last by default: (B, H, W, C)
conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1,
                              padding='same', activation='relu')
x = tf.random.normal((8, 32, 32, 3))
out = conv(x)   # (8, 32, 32, 64); 'same' padding preserves H, W

# 'valid' padding: no padding, output shrinks
conv_valid = tf.keras.layers.Conv2D(64, 3, padding='valid')
# H_out = 32 - 3 + 1 = 30

# Pooling
max_pool = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)
global_avg = tf.keras.layers.GlobalAveragePooling2D()   # (B, C)

# Depthwise separable
dws = tf.keras.layers.SeparableConv2D(filters=128, kernel_size=3, padding='same')
```