Supplement · Neural Network Architectures

Convolutional Layers and the Vision Inductive Bias

16 min read
By the end of this reading you will be able to:
  • Compute the output size of a convolutional layer given input size, kernel size, stride, and padding, and determine the receptive field of a neuron at a given depth
  • Explain how weight sharing makes convolutions parameter-efficient and why translation equivariance is a useful inductive bias for image tasks
  • Distinguish max pooling from average pooling and global average pooling, and identify what each discards and what it preserves
  • Explain what depthwise separable convolutions are, compute the parameter and multiply-accumulate reduction factor vs. standard convolutions, and name one architecture that uses them

Why Not Just Use MLPs for Images?

An MLP applied to an image treats each pixel as an independent input feature. A $224 \times 224 \times 3$ RGB image has 150,528 inputs, so the first hidden layer of a width-1024 MLP would have about 154 million weights. More importantly, the MLP has no notion of locality: it relates a pixel to its immediate neighbor no differently than to a pixel at the opposite corner.
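The parameter count above is easy to verify by hand. The short sketch below redoes the arithmetic; the variable names are illustrative, and biases are counted separately from weights.

```python
# Size of the first layer of an MLP applied to a flattened RGB image.
height, width, channels = 224, 224, 3
hidden_units = 1024  # hidden width used in the text

num_inputs = height * width * channels    # every pixel value becomes one input feature
num_weights = num_inputs * hidden_units   # one weight per (input, hidden unit) pair
num_biases = hidden_units                 # one bias per hidden unit

print(num_inputs)    # 150528
print(num_weights)   # 154140672, about 154 million
```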

The visual world has structure that we should build into the architecture:

  1. Locality: meaningful patterns (edges, textures, shapes) are local — they span small regions of the image
  2. Translation invariance: a cat in the upper left looks the same as a cat in the lower right
  3. Compositionality: higher-level features (eyes, ears) are built from lower-level ones (edges, curves)

Convolutional layers encode these as inductive biases — assumptions about the problem structure built into the architecture itself.


The Convolution Operation

A convolutional layer slides a small kernel (filter) over the input and computes a dot product at each position:

$$[\mathbf{F} * \mathbf{k}]_{i,j} = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} \mathbf{F}_{i+m,\, j+n} \cdot \mathbf{k}_{m,n}$$

where $\mathbf{F}$ is the input feature map and $\mathbf{k}$ is a $K \times K$ kernel.

(Technical note: deep learning frameworks compute cross-correlation, not true convolution — the kernel is not flipped. The distinction is irrelevant for learned kernels.)
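The sliding dot product can be written out directly. The following is a minimal pure-Python sketch of the formula for a single channel with no padding and stride 1, not how frameworks implement it. The function name `cross_correlate2d` and the edge-detector kernel are illustrative choices.

```python
def cross_correlate2d(feature_map, kernel):
    """Valid (no padding), stride-1 cross-correlation on nested lists."""
    k = len(kernel)
    out_h = len(feature_map) - k + 1
    out_w = len(feature_map[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the K x K window at (i, j).
            total = 0
            for m in range(k):
                for n in range(k):
                    total += feature_map[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


# 4x4 input with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical vertical-edge kernel: right column minus left column.
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

result = cross_correlate2d(image, edge_kernel)
print(result)  # [[3, 3], [3, 3]]
```

Every output position whose window straddles the edge responds with the same value, because the same kernel is applied at each location.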

Key Parameters

Parameter | Effect
Kernel size $K$ | Receptive field per layer; $3 \times 3$ is standard
Stride $S$ | Step size of the sliding window; stride 2 halves spatial dimensions
Padding $P$ | Zeros added to the border; same padding preserves spatial size
Channels in $C_{\text{in}}$ | Number of input feature maps
Channels out $C_{\text{out}}$ | Number of output feature maps = number of kernels

Output size formula:

$$H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2P - K}{S} \right\rfloor + 1$$

Example: Input $224 \times 224$, kernel $3 \times 3$, stride 1, padding 1 → output $224 \times 224$ (same padding).

Input $224 \times 224$, kernel $3 \times 3$, stride 2, padding 1 → output $112 \times 112$.
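The output size formula translates into a one-line helper. This is a sketch for one spatial dimension; the function name `conv_output_size` is illustrative, and the last two calls use extra settings chosen here to exercise the formula.

```python
def conv_output_size(size_in, kernel, stride=1, padding=0):
    """Output length along one spatial dimension: floor((H + 2P - K) / S) + 1."""
    return (size_in + 2 * padding - kernel) // stride + 1


# The two examples from the text.
print(conv_output_size(224, kernel=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112

# No padding: the output shrinks by K - 1.
print(conv_output_size(32, kernel=3))                        # 30

# A larger kernel with stride 2 and padding 3 also halves the size.
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
```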


Weight Sharing and Parameter Efficiency

The kernel weights are shared across all spatial positions. A single $3 \times 3$ kernel with $C_{\text{in}}$ input channels has $9 \times C_{\text{in}}$ parameters, regardless of the input spatial size.

Compare: a fully connected layer mapping a $224 \times 224 \times 3$ input to $112 \times 112 \times 64$ outputs would need $150{,}528 \times 802{,}816 \approx 1.2 \times 10^{11}$ weights. A stride-2 convolutional layer producing the same output shape needs only $3^2 \times 3 \times 64 = 1{,}728$ weights (plus 64 biases).
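The comparison can be reproduced with a few multiplications. The sketch below counts weights only (no biases) and uses illustrative variable names.

```python
in_h, in_w, in_c = 224, 224, 3
out_h, out_w, out_c = 112, 112, 64
kernel = 3

# Fully connected: one weight for every (input value, output value) pair.
fc_weights = (in_h * in_w * in_c) * (out_h * out_w * out_c)

# Convolution: one K x K x C_in kernel per output channel, reused at every position.
conv_weights = kernel * kernel * in_c * out_c

print(fc_weights)    # 120846286848
print(conv_weights)  # 1728
```

The convolutional count does not depend on `in_h` or `in_w` at all, which is the practical consequence of weight sharing.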

Weight sharing encodes the prior that the same feature detector (edge, color gradient) is useful everywhere in the image. The resulting property is translation equivariance: if the input shifts, the output feature map shifts by the same amount (exactly so at stride 1, away from the borders).
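Translation equivariance can be seen in a one-dimensional toy case. The sketch below applies the same kernel to a signal and to a copy shifted two steps to the right. The helper `correlate1d`, the kernel, and the signals are all illustrative.

```python
def correlate1d(signal, kernel):
    """Valid, stride-1 cross-correlation of two lists."""
    k = len(kernel)
    return [
        sum(signal[i + m] * kernel[m] for m in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1, 0, 1]                    # hypothetical edge detector
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]     # same pattern, moved 2 steps right

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

print(out)          # [1, 1, -1, -1, 0, 0]
print(out_shifted)  # [0, 0, 1, 1, -1, -1]

# The response pattern is identical, just moved by the same 2 steps.
print(out_shifted[2:] == out[:4])  # True
```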


Receptive Field

The receptive field of a neuron is the region of the input that influences its value. For a single $3 \times 3$ conv layer, each output neuron sees a $3 \times 3$ input region. After $L$ such layers (stride 1), the receptive field grows to $(2L+1) \times (2L+1)$.

Stacking $3 \times 3$ layers covers the same receptive field as a single larger kernel: two $3 \times 3$ layers give a $5 \times 5$ receptive field with fewer weights per input-output channel pair ($2 \times 9 = 18$ vs. $25$) and an extra non-linearity.
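Receptive field growth can be computed layer by layer. The sketch below uses the standard recurrence in which each layer adds $(K - 1)$ times the cumulative stride of the layers before it. Handling strides other than 1 goes slightly beyond the stride-1 case in the text, and the function name is illustrative.

```python
def receptive_field(layers):
    """Receptive field (one dimension) of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1)]))          # 3
print(receptive_field([(3, 1)] * 2))      # 5, same as one 5x5 kernel
print(receptive_field([(3, 1)] * 5))      # 11, matches 2L + 1 for L = 5
print(receptive_field([(3, 2), (3, 1)]))  # 7, an early stride widens later layers
```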


Pooling

Pooling layers reduce spatial dimensions by summarizing a local region:

Max pooling: takes the maximum value in each $K \times K$ window. Preserves the strongest activation and discards its exact position and all weaker values; approximately invariant to small translations.

Average pooling: takes the mean. Smoother: every value in the window contributes, so the output reflects the overall response rather than a single peak.

Global average pooling (GAP): averages each entire feature map to a single scalar. Collapses $H \times W \times C$ to $C$, discarding all spatial layout. Used before the final classification head in ResNets and many modern CNNs, where it replaces large fully connected layers and dramatically reduces parameter count.
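The three pooling variants differ only in the window and the reduction applied to it. Here is a minimal single-channel sketch with non-overlapping windows; the helper names `pool2d` and `mean` and the example values are illustrative.

```python
def pool2d(feature_map, size, reduce_fn):
    """Non-overlapping pooling (stride == size) over a nested-list feature map."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + m][j + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


fmap = [
    [1, 2, 0, 0],
    [3, 4, 0, 8],
    [0, 0, 1, 1],
    [0, 4, 1, 1],
]

max_pooled = pool2d(fmap, 2, max)              # keeps the peak of each window
avg_pooled = pool2d(fmap, 2, mean)             # keeps the average response
gap = mean([v for row in fmap for v in row])   # one scalar for the whole map

print(max_pooled)  # [[4, 8], [4, 1]]
print(avg_pooled)  # [[2.5, 2.0], [1.0, 1.0]]
print(gap)         # 1.625
```

The top-right window shows the difference most clearly: a single strong activation of 8 survives max pooling intact but is diluted to 2.0 by averaging.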


Depthwise Separable Convolutions

Standard convolution mixes spatial filtering and channel mixing in one operation. Depthwise separable convolution splits them:

  1. Depthwise conv: Apply a separate $K \times K$ kernel to each input channel independently. This is spatial filtering only, with no channel mixing. Parameters: $K^2 \times C_{\text{in}}$
  2. Pointwise conv: Apply a $1 \times 1$ convolution to mix channels. Parameters: $C_{\text{in}} \times C_{\text{out}}$

Parameter reduction factor vs. standard conv:

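$$\frac{K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}}{K^2 C_{\text{in}} C_{\text{out}}} = \frac{1}{C_{\text{out}}} + \frac{1}{K^2} \approx \frac{1}{K^2}$$

For $K=3$ the ratio approaches $1/9$, so roughly $9\times$ fewer parameters when $C_{\text{out}}$ is large. For $K=3$, $C_{\text{out}} = 256$: $1/256 + 1/9 \approx 0.115$, an $8.7\times$ reduction. The multiply-accumulate count shrinks by the same factor, because both layer types apply their weights once per output position.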
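The reduction factor can be checked numerically. The sketch below counts weights only (no biases); the choice of `c_in = 128` is arbitrary, since $C_{\text{in}}$ cancels out of the ratio, and the function names are illustrative.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard K x K convolution."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise K x K conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256  # c_in is an arbitrary illustrative value

standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
ratio = separable / standard

print(standard)                         # 294912
print(separable)                        # 33920
print(round(ratio, 3))                  # 0.115
print(round(standard / separable, 2))   # 8.69

# Multiplying both counts by H_out * W_out gives the multiply-accumulate
# counts, so the same ratio applies to compute.
```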

Used in: MobileNet and EfficientNet, which target efficient inference (including mobile and edge deployment), and Xception, which applies the same factorization to an Inception-style network.


PyTorch and TensorFlow

PyTorch: nn.Conv2d, output size, pooling, depthwise separable:

import torch
import torch.nn as nn

# Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
x    = torch.randn(8, 3, 32, 32)   # (B, C_in, H, W)
out  = conv(x)                     # (8, 64, 32, 32) — padding=1 preserves spatial size

# Output size formula: H_out = floor((H_in + 2P - K) / S) + 1
# K=3, P=1, S=1 → floor((32 + 2 - 3) / 1) + 1 = 32  ✓

# Stride=2 for downsampling (often used in place of max pooling in modern nets)
conv_down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
out_down  = conv_down(out)         # (8, 128, 16, 16)

# Max pooling and global average pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))   # global average pool → (B, C, 1, 1)

# Depthwise separable convolution (MobileNet style)
depthwise   = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # groups=C_in
pointwise   = nn.Conv2d(64, 128, kernel_size=1)

class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch,  3, padding=1, groups=in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1)
    def forward(self, x):
        return self.pw(self.dw(x))

# Typical CNN block: Conv → BN → ReLU
block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

TensorFlow / Keras:

import tensorflow as tf

# Conv2D uses channels-last by default: (B, H, W, C)
conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1,
                               padding='same', activation='relu')
x    = tf.random.normal((8, 32, 32, 3))
out  = conv(x)   # (8, 32, 32, 64) — 'same' padding preserves H, W

# 'valid' padding: no padding, output shrinks
conv_valid = tf.keras.layers.Conv2D(64, 3, padding='valid')
# H_out = 32 - 3 + 1 = 30

# Pooling
max_pool   = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)
global_avg = tf.keras.layers.GlobalAveragePooling2D()   # (B, C)

# Depthwise separable
dws = tf.keras.layers.SeparableConv2D(filters=128, kernel_size=3, padding='same')