Supplement · Neural Network Architectures

Going Deeper — ResNets and Residual Connections

By the end of this reading you will be able to:

Explain the degradation problem — why naively adding layers to a CNN can degrade training accuracy — and why it cannot be explained by overfitting
Describe the residual block formula F(x) + x and explain how the identity shortcut enables very deep networks to train by preserving gradient flow
Distinguish the basic residual block from the bottleneck block, compute the parameter count of each, and state why bottleneck blocks are used in deeper ResNets
Identify where residual connections appear beyond CNNs — including transformer FFN sub-layers and highway networks — and explain what they have in common

The Degradation Problem

A reasonable hypothesis in 2015 was that deeper networks should perform at least as well as shallower ones: a 56-layer network can always learn the identity function for its extra layers, matching the 20-layer baseline.

Experiments showed the opposite. He et al. (2016) demonstrated that a plain 56-layer CNN had higher training error than a 20-layer CNN on CIFAR-10: not just higher test error, but higher training error. This rules out overfitting as the cause, because an overfit network would show low training error. The deeper network was failing to fit even the data it had seen.

Why? A long-standing suspect is gradient flow. As gradients backpropagate through many layers, they are multiplied repeatedly by weight matrices (and activation derivatives). With random initialization and no special design, this product either shrinks toward zero (vanishing gradients) or grows without bound (exploding gradients), so the signal from the loss is lost before it reaches the early layers.
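The repeated-multiplication effect described above can be illustrated with a small sketch. It is a simplification, not a model of a real CNN: it assumes purely linear layers with random Gaussian weights, and the names `backprop_norm`, `depth`, `dim`, and `scale` are hypothetical choices for this example.

```python
import math
import random


def backprop_norm(depth, dim, scale, seed=0):
    """Norm of a gradient vector after `depth` multiplications by W^T."""
    rng = random.Random(seed)
    g = [1.0 / math.sqrt(dim)] * dim  # unit-norm starting gradient
    std = scale / math.sqrt(dim)      # each step rescales the norm by roughly `scale`
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]
        # g <- W^T g, as in backpropagation through a linear layer
        g = [sum(W[i][j] * g[i] for i in range(dim)) for j in range(dim)]
    return math.sqrt(sum(v * v for v in g))


vanishing = backprop_norm(depth=50, dim=32, scale=0.5)
exploding = backprop_norm(depth=50, dim=32, scale=2.0)
print(f"scale=0.5 -> gradient norm {vanishing:.3e}")
print(f"scale=2.0 -> gradient norm {exploding:.3e}")
```

With a per-layer scale below 1 the norm collapses by many orders of magnitude over 50 layers, and with a scale above 1 it blows up by a similar factor. Real networks add nonlinearities and normalization, so this shows only the direction of the effect.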

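Batch normalization helped, but did not fully solve the problem: He et al. reported that their plain networks trained with BN had healthy gradient norms and still degraded with depth, which points to an optimization difficulty beyond vanishing gradients alone. The breakthrough was a structural change.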

Residual Connections

He et al. (2016) introduced the residual block:

$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$

where $\mathcal{F}(\mathbf{x})$ is a small sub-network (two conv + BN + ReLU layers), and $\mathbf{x}$ is routed directly to the output via a skip connection (also called a shortcut or residual connection).

The network learns the residual $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$ rather than the full desired mapping $\mathcal{H}(\mathbf{x})$. If the optimal mapping is close to the identity, $\mathcal{F}(\mathbf{x})$ only needs to learn a small correction, which is easier than learning the full mapping from scratch.
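A toy calculation may make the "small correction" argument concrete. This is a sketch under stated assumptions: the target $\mathcal{H}(x) = 1.05x$, the scalar weight `w`, and the four sample inputs are all hypothetical, chosen so that the target is close to the identity.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


xs = [-1.0, -0.5, 0.5, 1.0]
targets = [1.05 * x for x in xs]  # hypothetical near-identity target H(x) = 1.05 x

w = 0.0  # both models start from a zero weight

plain_pred = [w * x for x in xs]         # plain layer: y = w * x
residual_pred = [w * x + x for x in xs]  # residual layer: y = F(x) + x, with F(x) = w * x

plain_loss = mse(plain_pred, targets)
residual_loss = mse(residual_pred, targets)
print(f"plain loss at init:    {plain_loss:.6f}")
print(f"residual loss at init: {residual_loss:.6f}")
```

At initialization the residual form already outputs the identity, so it only has to move `w` from 0 to 0.05, while the plain form has to move `w` from 0 to 1.05.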

Why Residuals Help Gradient Flow

By the chain rule, the gradient of a loss $\mathcal{L}$ with respect to the block input $\mathbf{x}$ is:

$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left(\frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I\right)$

The $+I$ term comes from the identity shortcut. Even if $\partial \mathcal{F}/\partial \mathbf{x}$ becomes very small (vanishing gradients), the upstream gradient $\partial \mathcal{L}/\partial \mathbf{y}$ still reaches $\mathbf{x}$ unchanged through the $+I$ path. Stacked across blocks, the shortcuts act as a gradient highway through the entire network.
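The formula above can be checked numerically in a scalar setting. The sketch below assumes a hypothetical residual branch `F(x, w) = w * tanh(x)` with a deliberately small weight, stacks 30 such blocks, and compares the chain-rule gradient with and without the shortcut. A central finite difference is used as an independent check of the residual gradient.

```python
import math


def F(x, w):
    """Hypothetical residual branch with a small local derivative."""
    return w * math.tanh(x)


def forward(x, weights, residual):
    for w in weights:
        x = F(x, w) + x if residual else F(x, w)
    return x


def grad(x, weights, residual):
    """d(output)/d(input) via the chain rule, one factor per block."""
    g = 1.0
    for w in weights:
        dF = w * (1.0 - math.tanh(x) ** 2)      # dF/dx at the current input
        g *= (dF + 1.0) if residual else dF      # the +1 is the scalar form of +I
        x = F(x, w) + x if residual else F(x, w)
    return g


weights = [0.05] * 30
x0 = 0.3

plain_grad = grad(x0, weights, residual=False)
residual_grad = grad(x0, weights, residual=True)

eps = 1e-6
finite_diff = (forward(x0 + eps, weights, True) - forward(x0 - eps, weights, True)) / (2 * eps)

print(f"plain gradient:    {plain_grad:.3e}")
print(f"residual gradient: {residual_grad:.3e}")
print(f"finite difference: {finite_diff:.3e}")
```

Each plain factor is at most 0.05, so the plain gradient all but vanishes after 30 blocks. Each residual factor is at least 1, so the residual gradient stays of order one.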

Block Variants

Basic Block (ResNet-18, ResNet-34)

Two $3 \times 3$ conv layers with batch normalization:

x → Conv(3×3, C) → BN → ReLU → Conv(3×3, C) → BN → (+x) → ReLU → y

Parameters per block (ignoring BN): $2 \times 3^2 \times C^2 = 18C^2$

Bottleneck Block (ResNet-50, ResNet-101, ResNet-152)

Three layers: a $1 \times 1$ conv to reduce channels, a $3 \times 3$ conv, and a $1 \times 1$ conv to expand back:

x → Conv(1×1, C/4) → BN → ReLU → Conv(3×3, C/4) → BN → ReLU → Conv(1×1, C) → BN → (+x) → ReLU → y

Parameters (ignoring BN): $C \cdot (C/4) + 9 \cdot (C/4)^2 + (C/4) \cdot C = C^2/4 + 9C^2/16 + C^2/4 = 17C^2/16 \approx 1.06 C^2$

Compared with the basic block at $18C^2$, the bottleneck uses roughly 17× fewer parameters for the same input and output width $C$. This economy is why the deeper ResNets (50, 101, and 152 layers) use bottleneck blocks.
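The two parameter counts can be reproduced with a few lines of arithmetic. This is a minimal sketch that counts convolution weights only (no biases, no BN), matching the formulas above; the helper names and the choice $C = 256$ are illustrative assumptions.

```python
def conv_params(c_in, c_out, k):
    """Weights in a k x k convolution without bias."""
    return c_in * c_out * k * k


def basic_block_params(c):
    # Two 3x3 convs, C -> C -> C
    return 2 * conv_params(c, c, 3)


def bottleneck_params(c):
    # 1x1 reduce (C -> C/4), 3x3 (C/4 -> C/4), 1x1 expand (C/4 -> C)
    mid = c // 4
    return conv_params(c, mid, 1) + conv_params(mid, mid, 3) + conv_params(mid, c, 1)


C = 256
basic = basic_block_params(C)
bottleneck = bottleneck_params(C)
ratio = basic / bottleneck

print(f"basic block:      {basic:,} weights")
print(f"bottleneck block: {bottleneck:,} weights")
print(f"ratio:            {ratio:.2f}x")
```

For $C = 256$ this gives 1,179,648 weights for the basic block and 69,632 for the bottleneck, a ratio of about 16.9.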

Projection Shortcut

When the block changes spatial dimensions (stride 2) or the number of channels, the shortcut $\mathbf{x}$ cannot be added directly. A projection shortcut applies a $1 \times 1$ conv (with matching stride) to $\mathbf{x}$ before addition:

$\mathbf{y} = \mathcal{F}(\mathbf{x}) + W_s \mathbf{x}$
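A quick shape calculation shows why the projection is needed. The sketch below assumes a hypothetical downsampling block that takes a 64-channel, 56×56 input to 128 channels with stride 2, and uses the usual floor-based convolution output-size formula.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


h_in, c_in, c_out = 56, 64, 128  # illustrative values

# Main path F(x): a strided 3x3 conv followed by a stride-1 3x3 conv, both with padding 1
h_main = conv_out_size(conv_out_size(h_in, 3, 2, 1), 3, 1, 1)
main_shape = (c_out, h_main, h_main)

# Identity shortcut: x is left untouched, so its shape no longer matches F(x)
identity_shape = (c_in, h_in, h_in)

# Projection shortcut W_s: 1x1 conv with stride 2 and no padding
h_proj = conv_out_size(h_in, 1, 2, 0)
proj_shape = (c_out, h_proj, h_proj)
proj_params = c_in * c_out  # 1x1 kernel, no bias

print("F(x) shape:      ", main_shape)
print("identity shape:  ", identity_shape)
print("projection shape:", proj_shape)
print("projection weights:", proj_params)
```

The identity shortcut keeps the 64×56×56 shape and cannot be added to the 128×28×28 output of the main path, while the strided $1 \times 1$ projection produces a matching shape at a cost of 8,192 extra weights.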

The ResNet Family

Model	Layers	Params	ImageNet Top-1
ResNet-18	18	11M	69.8%
ResNet-50	50	25M	76.1%
ResNet-101	101	44M	77.4%
ResNet-152	152	60M	78.3%

ResNets enabled training of networks 2–10× deeper than the previous state of the art and won the ILSVRC 2015 classification task with an ensemble top-5 error of 3.57%. The architecture remains a standard backbone for vision today.

Residual Connections Everywhere

The idea generalized far beyond CNNs:

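Transformers: Both the self-attention sub-layer and the FFN sub-layer are wrapped in a residual connection. In the original (post-LN) layout, each sub-layer computes $\mathbf{y} = \text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x}))$, where Sublayer is either attention or the FFN. Without these shortcuts, transformers with more than a few layers are very hard to train.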

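Highway networks (Srivastava et al., 2015): A gated variant: $\mathbf{y} = T(\mathbf{x}) \odot \mathcal{F}(\mathbf{x}) + (1 - T(\mathbf{x})) \odot \mathbf{x}$ , where $T$ is a learned gate with values between 0 and 1. Conceptually similar, but the blend between the transformed and the carried input is adaptive, whereas a residual block always adds both with weight 1.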

DenseNet: Each layer receives as input the concatenation of all previous layers' outputs within a dense block. The shortcuts are combined by concatenation rather than addition, giving an extreme form of skip connectivity that improves gradient flow and feature reuse.

The universal pattern: Residual connections allow deep networks to be viewed as ensembles of shallow paths. Veit et al. (2016) showed that unrolling a ResNet reveals exponentially many paths of varying depth; most gradient flows through relatively short paths, with depth providing robustness rather than strictly sequential computation.
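The path-counting view can be made concrete with binomial coefficients: in a stack of $n$ residual blocks, each block is either traversed or skipped, so there are $2^n$ paths, and $\binom{n}{k}$ of them pass through exactly $k$ residual branches. The sketch below only counts paths; it does not model how much gradient each path carries. The value `n_blocks = 54` and the 20 to 34 window are illustrative assumptions.

```python
from math import comb

n_blocks = 54  # illustrative depth, in residual blocks

# Number of paths that pass through exactly k residual branches
paths_by_length = [comb(n_blocks, k) for k in range(n_blocks + 1)]

total_paths = sum(paths_by_length)
mean_length = sum(k * c for k, c in enumerate(paths_by_length)) / total_paths

# Share of paths whose length lies in a window around the mean
mid_range = sum(paths_by_length[20:35]) / total_paths

print(f"total paths:        {total_paths}")
print(f"mean path length:   {mean_length}")
print(f"share with 20 <= k <= 34: {mid_range:.3f}")
```

Very few of the paths use all 54 blocks (exactly one does), and most have a length near half the depth, which is consistent with viewing the network as a collection of much shallower paths.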

PyTorch and TensorFlow

PyTorch — basic block, bottleneck block, and projection shortcut:

import torch
import torch.nn as nn

# Basic residual block (ResNet-18 / ResNet-34)
class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU()

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)   # F(x) + x

# Bottleneck block (ResNet-50/101/152): 1x1 → 3x3 → 1x1
class Bottleneck(nn.Module):
    expansion = 4
    def __init__(self, in_ch: int, mid_ch: int):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.net = nn.Sequential(
            nn.Conv2d(in_ch,  mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when dimensions change
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch)
        ) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.net(x) + self.shortcut(x))

# Use torchvision for pretrained ResNets
import torchvision.models as models
resnet50 = models.resnet50(weights='IMAGENET1K_V2')
resnet50.fc = nn.Linear(2048, 10)   # replace head for fine-tuning

TensorFlow / Keras:

import tensorflow as tf

# Functional API makes skip connections explicit
def residual_block(x, filters: int):
    # Identity shortcut: assumes x already has `filters` channels
    residual = x
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Add()([x, residual])   # F(x) + x
    return tf.keras.layers.ReLU()(x)

# Pretrained ResNet50 from Keras applications
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base.trainable = False   # freeze for transfer learning
out  = tf.keras.layers.GlobalAveragePooling2D()(base.output)
out  = tf.keras.layers.Dense(10)(out)
model = tf.keras.Model(base.input, out)

References

He et al. 2016 — Deep Residual Learning for Image Recognition
