Supplement · Neural Network Architectures

Convolutional Layers and the Vision Inductive Bias

16 min read

By the end of this reading you will be able to:

Compute the output size of a convolutional layer given input size, kernel size, stride, and padding, and determine the receptive field of a neuron at a given depth
Explain how weight sharing makes convolutions parameter-efficient and why translation equivariance is a useful inductive bias for image tasks
Distinguish max pooling from average pooling and global average pooling, and identify what each discards and what it preserves
Explain what depthwise separable convolutions are, compute the parameter and multiply-accumulate reduction factor vs. standard convolutions, and name one architecture that uses them

Why Not Just Use MLPs for Images?

An MLP applied to an image treats each pixel as an independent input feature. A $224 \times 224 \times 3$ RGB image has 150,528 inputs, so the first hidden layer of a width-1024 MLP alone would have about 154 million weights. More importantly, the MLP has no notion of locality: it connects a pixel to its immediate neighbor no differently than to a pixel at the opposite corner.
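The parameter count quoted above can be checked with a few lines of arithmetic. This is only a sketch; the variable names are illustrative and not from any library.

```python
# First-layer size of an MLP applied to a flattened 224 x 224 RGB image.
height, width, channels = 224, 224, 3
hidden_width = 1024

num_inputs = height * width * channels    # one input feature per pixel value
num_weights = num_inputs * hidden_width   # one weight per (input, hidden unit) pair
num_biases = hidden_width                 # one bias per hidden unit

print(num_inputs)    # 150528
print(num_weights)   # 154140672
```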

The visual world has structure that we should build into the architecture:

Locality: meaningful patterns (edges, textures, shapes) are local — they span small regions of the image
Translation invariance: a cat in the upper left looks the same as a cat in the lower right, so a feature detector that is useful at one location is useful at every location
Compositionality: higher-level features (eyes, ears) are built from lower-level ones (edges, curves)

Convolutional layers encode these as inductive biases — assumptions about the problem structure built into the architecture itself.

The Convolution Operation

A convolutional layer slides a small kernel (filter) over the input and computes a dot product at each position:

$[\mathbf{F} * \mathbf{k}]_{i,j} = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} \mathbf{F}_{i+m,\, j+n} \cdot \mathbf{k}_{m,n}$

where $\mathbf{F}$ is the input feature map and $\mathbf{k}$ is a $K \times K$ kernel.

(Technical note: deep learning frameworks compute cross-correlation, not true convolution — the kernel is not flipped. The distinction is irrelevant for learned kernels.)
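To make the sum concrete, here is a minimal pure-Python sketch of the formula above for a single channel, with no padding and stride 1. The function name `cross_correlate2d` and the toy input are assumptions made for illustration; real frameworks use heavily optimized kernels instead of nested loops.

```python
def cross_correlate2d(feature_map, kernel):
    """Slide a K x K kernel over a 2D list (no padding, stride 1)."""
    K = len(kernel)
    H, W = len(feature_map), len(feature_map[0])
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            # Dot product between the kernel and the K x K window at (i, j).
            total = 0
            for m in range(K):
                for n in range(K):
                    total += feature_map[i + m][j + n] * kernel[m][n]
            row.append(total)
        out.append(row)
    return out

# Toy 4 x 5 input with a vertical edge between columns 2 and 3.
F = [[0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1]]

# Hand-written kernel that responds to left-to-right intensity increases.
k = [[-1, 0, 1],
     [-1, 0, 1],
     [-1, 0, 1]]

result = cross_correlate2d(F, k)
print(result)  # [[0, 3, 3], [0, 3, 3]]
```

The output is zero over the flat region and positive wherever the window overlaps the edge, which is the behavior expected of an edge detector. In a trained network the kernel values are learned rather than written by hand.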

Key Parameters

Parameter	Effect
Kernel size $K$	Receptive field per layer; $3 \times 3$ is standard
Stride $S$	Step size of the sliding window; stride 2 halves spatial dimensions
Padding $P$	Zeros added to the border; `same` padding preserves spatial size
Channels in $C_{\text{in}}$	Number of input feature maps
Channels out $C_{\text{out}}$	Number of output feature maps = number of kernels

Output size formula:

$H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2P - K}{S} \right\rfloor + 1$

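Example: Input $224 \times 224$, kernel $3 \times 3$, stride 1, padding 1 → $\lfloor (224 + 2 - 3)/1 \rfloor + 1 = 224$, so the output is $224 \times 224$ (same padding).

Input $224 \times 224$, kernel $3 \times 3$, stride 2, padding 1 → $\lfloor (224 + 2 - 3)/2 \rfloor + 1 = 111 + 1 = 112$, so the output is $112 \times 112$.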
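The output size formula translates directly into a small helper. The function name `conv_output_size` is hypothetical, and the sketch assumes symmetric zero padding and the same kernel size and stride along both spatial axes.

```python
def conv_output_size(size_in, kernel, stride=1, padding=0):
    """H_out = floor((H_in + 2P - K) / S) + 1, along one spatial axis."""
    return (size_in + 2 * padding - kernel) // stride + 1

print(conv_output_size(224, kernel=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112
print(conv_output_size(32, kernel=3))                        # 30 (no padding)
```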

Weight Sharing and Parameter Efficiency

The kernel weights are shared across all spatial positions. A single $3 \times 3$ kernel with $C_{\text{in}}$ input channels has $9 \times C_{\text{in}}$ weights (plus one bias), regardless of the input spatial size.

Compare: a fully connected layer mapping a $224 \times 224 \times 3$ input to $112 \times 112 \times 64$ outputs would need $150{,}528 \times 802{,}816 \approx 1.2 \times 10^{11}$ weights. A $3 \times 3$ convolutional layer with stride 2 and 64 output channels performs the same spatial compression with only $3^2 \times 3 \times 64 = 1{,}728$ weights.
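The comparison can be reproduced with plain arithmetic. This sketch counts weights only and ignores biases; the variable names are illustrative.

```python
# Fully connected: every input value connects to every output value.
fc_inputs = 224 * 224 * 3
fc_outputs = 112 * 112 * 64
fc_weights = fc_inputs * fc_outputs

# Convolution: one K x K x C_in kernel per output channel, reused at every position.
kernel_size, c_in, c_out = 3, 3, 64
conv_weights = kernel_size * kernel_size * c_in * c_out

print(f"{fc_weights:.2e}")           # on the order of 1e11
print(conv_weights)                  # 1728
print(fc_weights // conv_weights)    # how many times larger the dense layer is
```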

Weight sharing encodes the prior that the same feature detector (edge, color gradient) is useful everywhere in the image. This is translation equivariance: if the input shifts, the output feature map shifts by the same amount.
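Translation equivariance can be observed on a toy one-dimensional signal. The sketch below is an illustration under simple assumptions (no padding, stride 1, a hand-picked kernel); the property holds exactly only away from the borders, where the shifted pattern stays fully inside the input.

```python
def cross_correlate1d(signal, kernel):
    """1D sliding dot product, no padding, stride 1."""
    K = len(kernel)
    return [sum(signal[i + m] * kernel[m] for m in range(K))
            for i in range(len(signal) - K + 1)]

kernel = [-1, 0, 1]

signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 2, 1, 0, 0]   # the same bump, moved 2 steps right

out = cross_correlate1d(signal, kernel)
out_shifted = cross_correlate1d(shifted, kernel)

print(out)          # [1, 2, 0, -2, -1, 0, 0]
print(out_shifted)  # [0, 0, 1, 2, 0, -2, -1]
```

The response pattern is identical in both outputs and has moved by the same 2 steps as the input, which is what equivariance means.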

Receptive Field

The receptive field of a neuron is the region of the input that influences its value. For a single $3 \times 3$ conv layer, each output neuron sees a $3 \times 3$ input region. Each additional $3 \times 3$ layer (stride 1) extends that region by one pixel on every side, so after $L$ such layers the receptive field grows to $(2L+1) \times (2L+1)$.

Stacking $3 \times 3$ layers covers the same receptive field as a single larger kernel: two $3 \times 3$ layers give a $5 \times 5$ receptive field with fewer parameters ($2 \times 9 = 18$ vs. $25$ per input-output channel pair) and an extra non-linearity between them.
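The receptive field of a stack of layers can be computed iteratively. The sketch below uses the standard recurrence that also handles strides, which goes slightly beyond the stride-1 formula in the text: each layer adds $(K - 1)$ times the cumulative stride of the layers before it. The function name and the `(kernel, stride)` tuple format are assumptions for illustration.

```python
def receptive_field(layers):
    """Receptive field along one axis for a list of (kernel_size, stride) layers."""
    rf = 1     # a single input pixel sees only itself
    jump = 1   # distance in input pixels between adjacent neurons at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))            # 3
print(receptive_field([(3, 1), (3, 1)]))    # 5, same as one 5 x 5 kernel
print(receptive_field([(3, 1)] * 4))        # 9, matches 2L + 1 for L = 4
print(receptive_field([(3, 2), (3, 1)]))    # 7, striding makes the field grow faster
```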

Pooling

Pooling layers reduce spatial dimensions by summarizing a local region:

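Max pooling: takes the maximum value in each $K \times K$ window. Preserves the strongest activation and discards both its exact position within the window and all weaker activations, which makes the output invariant to small translations.

Average pooling: takes the mean of each window. Preserves the overall response level of the region and discards which positions contributed to it, giving a smoother output than max pooling.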

Global average pooling (GAP): averages each entire feature map to a single scalar, collapsing $H \times W \times C$ to $C$. It discards all spatial layout and keeps only the mean activation per channel. Used before the final classification head in ResNets and modern CNNs, where it replaces large fully connected layers and dramatically reduces parameter count.
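The three pooling variants can be compared on a single toy feature map. This is a minimal sketch for one channel with non-overlapping windows (stride equal to window size); the helper names `pool2d` and `mean` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def pool2d(feature_map, size, reduce_fn):
    """Non-overlapping size x size pooling over a 2D list, using reduce_fn per window."""
    H, W = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, H - size + 1, size):
        row = []
        for j in range(0, W - size + 1, size):
            window = [feature_map[i + m][j + n]
                      for m in range(size) for n in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out

fmap = [[1, 3, 0, 0],
        [2, 8, 0, 4],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]

max_pooled = pool2d(fmap, 2, max)               # strongest activation per window
avg_pooled = pool2d(fmap, 2, mean)              # mean activation per window
gap = mean([v for row in fmap for v in row])    # one scalar for the whole map

print(max_pooled)  # [[8, 4], [0, 1]]
print(avg_pooled)  # [[3.5, 1.0], [0.0, 1.0]]
print(gap)         # 1.375
```

Note how the top-right window (a single 4 among zeros) survives max pooling at full strength but is diluted to 1.0 by average pooling.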

Depthwise Separable Convolutions

Standard convolution mixes spatial filtering and channel mixing in one operation. Depthwise separable convolution splits them:

Depthwise conv: Apply a separate $K \times K$ kernel to each input channel independently — spatial filtering only, no channel mixing. Parameters: $K^2 \times C_{\text{in}}$
Pointwise conv: Apply a $1 \times 1$ convolution to mix channels. Parameters: $C_{\text{in}} \times C_{\text{out}}$

Parameter reduction factor vs. standard conv:

$\frac{K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}}{K^2 C_{\text{in}} C_{\text{out}}} = \frac{1}{C_{\text{out}}} + \frac{1}{K^2} \approx \frac{1}{K^2}$

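For $K=3$ the ratio approaches $1/9$ as $C_{\text{out}}$ grows, so the layer has roughly $8$ to $9\times$ fewer parameters. For $K=3$, $C_{\text{out}} = 256$: $1/(1/256 + 1/9) \approx 8.7\times$ reduction. The same ratio applies to multiply-accumulates, because both variants use every weight once per output position.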
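The reduction factor can be checked numerically. The sketch below counts weights only (biases ignored) and assumes the standard and separable layers produce the same output spatial size; the function names and the choice of $C_{\text{in}} = 128$ and a $56 \times 56$ output are illustrative assumptions.

```python
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in     # one k x k kernel per input channel
    pointwise = c_in * c_out     # 1 x 1 conv that mixes channels
    return depthwise + pointwise

def reduction_factor(k, c_in, c_out):
    return standard_conv_params(k, c_in, c_out) / separable_conv_params(k, c_in, c_out)

k, c_in, c_out = 3, 128, 256
factor = reduction_factor(k, c_in, c_out)
print(round(factor, 2))   # 8.69

# MACs are parameters times the number of output positions, so the ratio is unchanged.
h_out, w_out = 56, 56
macs_standard = standard_conv_params(k, c_in, c_out) * h_out * w_out
macs_separable = separable_conv_params(k, c_in, c_out) * h_out * w_out
print(round(macs_standard / macs_separable, 2))   # 8.69
```

The factor does not depend on $C_{\text{in}}$, which cancels in the ratio.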

Used in: MobileNet (designed for mobile and edge deployment), Xception, and EfficientNet.

PyTorch and TensorFlow

PyTorch — nn.Conv2d, output size, pooling, depthwise separable:

import torch
import torch.nn as nn

# Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
x    = torch.randn(8, 3, 32, 32)   # (B, C_in, H, W)
out  = conv(x)                     # (8, 64, 32, 32) — padding=1 preserves spatial size

# Output size formula: H_out = floor((H_in + 2P - K) / S) + 1
# K=3, P=1, S=1 → floor((32 + 2 - 3) / 1) + 1 = 32  ✓

# Stride=2 for downsampling (replaces max pooling in modern nets)
conv_down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
out_down  = conv_down(out)         # (8, 128, 16, 16)

# Max pooling and global average pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)      # halves H and W
avg_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))   # global average pool → (B, C, 1, 1)

# Depthwise separable convolution (MobileNet style)
depthwise   = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # groups=C_in
pointwise   = nn.Conv2d(64, 128, kernel_size=1)

class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch,  3, padding=1, groups=in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1)
    def forward(self, x):
        return self.pw(self.dw(x))

# Typical CNN block: Conv → BN → ReLU
block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

TensorFlow / Keras:

import tensorflow as tf

# Conv2D uses channels-last by default: (B, H, W, C)
conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1,
                               padding='same', activation='relu')
x    = tf.random.normal((8, 32, 32, 3))
out  = conv(x)   # (8, 32, 32, 64) — 'same' padding preserves H, W

# 'valid' padding: no padding, output shrinks
conv_valid = tf.keras.layers.Conv2D(64, 3, padding='valid')
# H_out = 32 - 3 + 1 = 30

# Pooling
max_pool   = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)
global_avg = tf.keras.layers.GlobalAveragePooling2D()   # (B, C)

# Depthwise separable
dws = tf.keras.layers.SeparableConv2D(filters=128, kernel_size=3, padding='same')
