Supplement · Neural Network Architectures

Going Deeper — ResNets and Residual Connections

14 min read
By the end of this reading you will be able to:
  • Explain the degradation problem — why naively adding layers to a CNN can degrade training accuracy — and why it cannot be explained by overfitting
  • Describe the residual block formula F(x) + x and explain how the identity shortcut enables very deep networks to train by preserving gradient flow
  • Distinguish the basic residual block from the bottleneck block, compute the parameter count of each, and state why bottleneck blocks are used in deeper ResNets
  • Identify where residual connections appear beyond CNNs — including transformer FFN sub-layers and highway networks — and explain what they have in common

The Degradation Problem

A reasonable hypothesis in 2015 was that deeper networks should perform at least as well as shallower ones: a 56-layer network can always learn the identity function for its extra layers, matching the 20-layer baseline.
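The construction behind this hypothesis can be checked on a toy example. The sketch below uses plain Python lists rather than a real CNN; `shallow_net`, `deep_net`, the weights `W1`, and the number of extra layers are hypothetical and only illustrate that appending identity layers after a ReLU leaves the output unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def identity_matrix(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical weights for a one-layer "shallow" network.
W1 = [[0.5, -1.0], [2.0, 0.25]]

def shallow_net(v):
    return relu(linear(W1, v))

def deep_net(v, extra_layers=36):
    """Shallow network followed by extra layers whose weights are the identity."""
    h = shallow_net(v)
    eye = identity_matrix(len(h))
    for _ in range(extra_layers):
        # h is already non-negative, so relu(I h) returns h unchanged.
        h = relu(linear(eye, h))
    return h

x = [1.0, -3.0]
print(shallow_net(x))
print(deep_net(x))
```

Both calls print the same vector, so a solution for the deeper network exists that is no worse than the shallow one. The open question is whether the optimizer finds it.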

Experiments showed the opposite. He et al. (2015) demonstrated that a plain 56-layer CNN had higher training error, not just higher test error, than a 20-layer CNN on CIFAR-10. This rules out overfitting as the cause, because an overfit network would show lower training error, not higher. The network was failing to fit even the data it had seen.

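Why? One commonly cited factor is unstable gradients. As gradients backpropagate through many layers, they are multiplied repeatedly by each layer's Jacobian (its weights combined with the activation derivative). Without careful initialization or normalization, this product either shrinks toward zero (vanishing gradients) or grows without bound (exploding gradients), so the signal from the loss is lost or distorted before it reaches the early layers.

Batch normalization helped, but did not fully solve the problem: the plain networks in He et al. were trained with batch normalization and still degraded, which suggests that very deep plain stacks are simply hard for the optimizer to fit. The breakthrough was a structural change.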


Residual Connections

He et al. (2015) introduced the residual block:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

where $\mathcal{F}(\mathbf{x})$ is a small sub-network (two convolutional layers, each followed by batch normalization, with a ReLU between them), and $\mathbf{x}$ is routed directly to the output via a skip connection (also called a shortcut or residual connection).

The network learns the residual $\mathcal{F}(\mathbf{x}) = \mathbf{H}(\mathbf{x}) - \mathbf{x}$ rather than the full transformation $\mathbf{H}(\mathbf{x})$. If the optimal mapping is close to the identity, $\mathcal{F}(\mathbf{x})$ only needs to learn a small correction, which is easier than learning the full mapping from scratch.

Why Residuals Help Gradient Flow

By the chain rule, the gradient of a loss $\mathcal{L}$ with respect to the block input $\mathbf{x}$ is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left(\frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I\right)$$

The $+I$ term comes from the identity shortcut. Even if $\partial \mathcal{F}/\partial \mathbf{x}$ becomes very small (vanishing gradients), the gradient of $\mathcal{L}$ still flows back through the $+I$ path unobstructed. Stacked residual connections thus form a gradient highway through the entire network.
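To see the effect of the identity term numerically, consider a scalar toy model in which every layer has the same small local derivative. This is an illustrative assumption, not a measurement from a real network; `local_grad` and the depths are hypothetical values.

```python
def plain_chain_gradient(local_grad: float, depth: int) -> float:
    """Gradient through `depth` plain layers: product of the local derivatives."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad
    return g

def residual_chain_gradient(local_grad: float, depth: int) -> float:
    """Gradient through `depth` residual blocks: product of (dF/dx + 1)."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad + 1.0
    return g

# Assumed local derivative of 0.01 per layer (hypothetical).
for depth in (5, 20, 56):
    print(depth,
          plain_chain_gradient(0.01, depth),
          residual_chain_gradient(0.01, depth))
```

With a local derivative of 0.01, the plain product collapses toward zero as depth grows, while the residual product stays of order one because each factor is close to 1 rather than close to 0.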


Block Variants

Basic Block (ResNet-18, ResNet-34)

Two $3 \times 3$ conv layers with batch normalization:

x → Conv(3×3, C) → BN → ReLU → Conv(3×3, C) → BN → (+x) → ReLU → y

Parameters per block (ignoring BN): $2 \times 3^2 \times C^2 = 18C^2$

Bottleneck Block (ResNet-50, ResNet-101, ResNet-152)

Three layers: a $1 \times 1$ conv to reduce channels, a $3 \times 3$ conv, and a $1 \times 1$ conv to expand back:

x → Conv(1×1, C/4) → BN → ReLU → Conv(3×3, C/4) → BN → ReLU → Conv(1×1, C) → BN → (+x) → ReLU → y

Parameters: $C \cdot (C/4) + 9 \cdot (C/4)^2 + (C/4) \cdot C = C^2/2 + 9C^2/16 \approx 1.06\,C^2$

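Compared with the basic block at $18C^2$, the bottleneck uses about 17× fewer parameters for the same input and output width $C$. This is why ResNets deeper than 34 layers all use bottleneck blocks: the saving lets them run at 4× the channel width of the corresponding basic-block stages at a similar per-block cost.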
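The two parameter counts can be reproduced with a few lines of arithmetic. The sketch below counts convolution weights only (no BN, no biases), as in the formulas above; the function names and the choice of $C = 256$ are illustrative assumptions.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights in a bias-free 2D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel

def basic_block_params(c: int) -> int:
    """Two 3x3 convolutions at width c."""
    return 2 * conv_params(c, c, 3)

def bottleneck_params(c: int, reduction: int = 4) -> int:
    """1x1 reduce to c/reduction, 3x3 at the reduced width, 1x1 expand back to c."""
    mid = c // reduction
    return (conv_params(c, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, c, 1))

c = 256  # assumed block width for the comparison
basic = basic_block_params(c)
bottleneck = bottleneck_params(c)
print(basic, bottleneck, round(basic / bottleneck, 2))
```

For $C = 256$ this gives 1,179,648 weights for the basic block and 69,632 for the bottleneck, a ratio of about 16.9. For comparison, a basic block at $C = 64$ has 73,728 weights, roughly the same as a bottleneck at $C = 256$.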

Projection Shortcut

When the block changes spatial dimensions (stride 2) or the number of channels, the shortcut $\mathbf{x}$ cannot be added directly. A projection shortcut applies a $1 \times 1$ conv (with matching stride) to $\mathbf{x}$ before addition:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + W_s \mathbf{x}$$
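The shape mismatch that motivates the projection can be checked with the standard convolution output-size formula. The sketch below is only shape arithmetic; the 56-pixel input size and the placement of the stride on the first $3 \times 3$ conv are assumptions chosen for illustration.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 56  # assumed input height/width

# Main branch of a basic block that downsamples: 3x3 stride 2, then 3x3 stride 1.
main = conv_out_size(size, kernel=3, stride=2, padding=1)
main = conv_out_size(main, kernel=3, stride=1, padding=1)

# Identity shortcut keeps the input size, so it no longer matches the main branch.
identity = size

# Projection shortcut: 1x1 conv with the same stride as the main branch.
projection = conv_out_size(size, kernel=1, stride=2, padding=0)

print(main, identity, projection)
```

The main branch and the projection both produce the same spatial size, while the untouched identity path does not, which is why the addition only works with $W_s$ in place.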


The ResNet Family

| Model | Layers | Parameters | ImageNet top-1 accuracy |
|---|---|---|---|
| ResNet-18 | 18 | 11M | 69.8% |
| ResNet-50 | 50 | 25M | 76.1% |
| ResNet-101 | 101 | 44M | 77.4% |
| ResNet-152 | 152 | 60M | 78.3% |

ResNets enabled training of networks 2–10× deeper than previous state-of-the-art and won ILSVRC 2015 by a large margin. The architecture remains a standard backbone for vision today.
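The layer counts in the model names follow from the block structure: one stem convolution, the convolutions inside the residual blocks, and one final fully connected layer. The sketch below reproduces that count. The per-stage block numbers come from the published ResNet configurations, not from the text above, and projection-shortcut convolutions are left out of the count by convention.

```python
# Residual blocks per stage and block type for each model (published configurations).
CONFIGS = {
    "ResNet-18":  ((2, 2, 2, 2),  "basic"),
    "ResNet-34":  ((3, 4, 6, 3),  "basic"),
    "ResNet-50":  ((3, 4, 6, 3),  "bottleneck"),
    "ResNet-101": ((3, 4, 23, 3), "bottleneck"),
    "ResNet-152": ((3, 8, 36, 3), "bottleneck"),
}

# A basic block has two conv layers; a bottleneck block has three.
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

def count_layers(blocks_per_stage, block_type):
    """Stem conv + convs inside all residual blocks + final fully connected layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1

for name, (blocks, block_type) in CONFIGS.items():
    print(name, count_layers(blocks, block_type))
```

ResNet-34 and ResNet-50 share the same number of blocks per stage; the jump from 34 to 50 layers comes entirely from swapping two-layer basic blocks for three-layer bottleneck blocks.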


Residual Connections Everywhere

The idea generalized far beyond CNNs:

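Transformers: Both the self-attention sub-layer and the FFN sub-layer are wrapped in residual connections. In the original post-norm form, $\mathbf{y} = \text{LayerNorm}(\mathbf{x} + \text{Attention}(\mathbf{x}))$, and the FFN sub-layer follows the same pattern with $\text{FFN}$ in place of $\text{Attention}$. Without these connections, transformers with more than a few layers are very difficult to train.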

Highway networks (Srivastava et al., 2015): A gated variant: $\mathbf{y} = T(\mathbf{x}) \odot \mathcal{F}(\mathbf{x}) + (1 - T(\mathbf{x})) \odot \mathbf{x}$, where $T$ is a learned gate with values between 0 and 1. Conceptually similar, but the blend between the transformed and the unchanged input is adaptive rather than fixed.

DenseNet: Within a dense block, each layer receives as input the concatenation of all previous layers' outputs. This is an extreme form of skip connectivity (concatenation rather than addition) that improves gradient flow and feature reuse.
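The three variants differ only in how the untouched input is combined with the transformed branch. A minimal sketch on plain Python lists follows; `branch_out` stands in for the output of $\text{Attention}(\mathbf{x})$, $\text{FFN}(\mathbf{x})$, or $\mathcal{F}(\mathbf{x})$, the gate values are supplied by hand rather than learned, and the layer norm omits the learned scale and shift. All function names are hypothetical.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def post_norm_residual(x, branch_out):
    """Transformer sub-layer: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, branch_out)])

def highway_blend(x, branch_out, gate):
    """Highway layer: T * F(x) + (1 - T) * x, element-wise."""
    return [t * f + (1.0 - t) * a for a, f, t in zip(x, branch_out, gate)]

def dense_concat(x, branch_out):
    """DenseNet layer: keep the input and append the new features."""
    return x + branch_out  # list concatenation, not element-wise addition

x = [1.0, 2.0]
branch_out = [10.0, 20.0]  # stand-in for the transformed branch

print(post_norm_residual(x, branch_out))
print(highway_blend(x, branch_out, gate=[0.0, 0.0]))  # gate closed: x passes through
print(highway_blend(x, branch_out, gate=[1.0, 1.0]))  # gate open: only F(x) remains
print(dense_concat(x, branch_out))
```

In every case the input reaches the output through a path that skips the transformed branch, which is the property these designs share with the ResNet shortcut.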

The universal pattern: Residual connections allow deep networks to be viewed as ensembles of shallow paths. Veit et al. (2016) showed that unrolling a ResNet reveals exponentially many paths of varying depth; most gradient flows through relatively short paths, with depth providing robustness rather than strictly sequential computation.
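The path count in the unrolled view is simple combinatorics: each residual block can be either traversed or skipped, so $n$ blocks give $2^n$ paths, and the number of paths that traverse exactly $k$ blocks is $\binom{n}{k}$. The sketch below tabulates this; the choice of 54 blocks and the length window are assumed example values.

```python
from math import comb

def path_length_counts(num_blocks: int):
    """Number of paths that pass through exactly k residual branches, for k = 0..n."""
    return [comb(num_blocks, k) for k in range(num_blocks + 1)]

def fraction_of_paths_between(num_blocks: int, lo: int, hi: int) -> float:
    """Share of all 2**n paths whose length lies in [lo, hi]."""
    counts = path_length_counts(num_blocks)
    return sum(counts[lo:hi + 1]) / 2 ** num_blocks

n = 54  # assumed number of residual blocks
print(sum(path_length_counts(n)) == 2 ** n)
print(fraction_of_paths_between(n, 19, 35))
```

Path lengths follow a binomial distribution centered at $n/2$, so paths that use all $n$ blocks, or none of them, are vanishingly rare. This count describes how many paths exist at each length; the observation that gradient concentrates on the shorter ones is an empirical result.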


PyTorch and TensorFlow

PyTorch — basic block, bottleneck block, and projection shortcut:

import torch
import torch.nn as nn

# Basic residual block (ResNet-18 / ResNet-34)
class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU()

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)   # F(x) + x

# Bottleneck block (ResNet-50/101/152): 1x1 → 3x3 → 1x1
class Bottleneck(nn.Module):
    expansion = 4
    def __init__(self, in_ch: int, mid_ch: int):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.net = nn.Sequential(
            nn.Conv2d(in_ch,  mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when dimensions change
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch)
        ) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.net(x) + self.shortcut(x))

# Use torchvision for pretrained ResNets
import torchvision.models as models
resnet50 = models.resnet50(weights='IMAGENET1K_V2')
resnet50.fc = nn.Linear(2048, 10)   # replace head for fine-tuning

TensorFlow / Keras:

import tensorflow as tf

# Functional API makes skip connections explicit
def residual_block(x, filters: int):
    # Identity shortcut: assumes x already has `filters` channels
    residual = x
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Add()([x, residual])   # F(x) + x
    return tf.keras.layers.ReLU()(x)

# Pretrained ResNet50 from Keras applications
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base.trainable = False   # freeze for transfer learning
out  = tf.keras.layers.GlobalAveragePooling2D()(base.output)
out  = tf.keras.layers.Dense(10)(out)
model = tf.keras.Model(base.input, out)