Going Deeper — ResNets and Residual Connections
- Explain the degradation problem — why naively adding layers to a CNN can degrade training accuracy — and why it cannot be explained by overfitting
- Describe the residual block formula F(x) + x and explain how the identity shortcut enables very deep networks to train by preserving gradient flow
- Distinguish the basic residual block from the bottleneck block, compute the parameter count of each, and state why bottleneck blocks are used in deeper ResNets
- Identify where residual connections appear beyond CNNs — including transformer FFN sub-layers and highway networks — and explain what they have in common
The Degradation Problem
A reasonable hypothesis in 2015 was that deeper networks should perform at least as well as shallower ones: a 56-layer network can always learn the identity function for its extra layers, matching the 20-layer baseline.
Experiments showed the opposite. He et al. (2015) demonstrated that a plain 56-layer CNN had higher error than a 20-layer CNN on CIFAR-10: not just higher test error, but higher training error. This rules out overfitting as the cause, since an overfit network would fit the training set better, not worse. The deeper network was failing to learn even on the data it had seen.
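Why? Part of the explanation lies in how gradients propagate. During backpropagation the gradient is multiplied by each layer's Jacobian (the weight matrix combined with the activation derivative) in turn. With random initialization and no special design, this product tends either to vanish (shrinking toward zero) or to explode (growing without bound), so the signal from the loss is distorted or lost before it reaches the early layers.
Batch normalization helped, but did not fully solve the problem: the plain networks in He et al.'s comparison already used it and still degraded with depth. The breakthrough was a structural change.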
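The repeated multiplication described above can be sketched numerically. The following is a minimal illustration, not a model of a real CNN: it assumes purely linear layers with Gaussian weights, ignores activations and normalization, and the helper name `plain_gradient_norms` and the chosen scales are made up for this example.

```python
import numpy as np


def plain_gradient_norms(depth, width, weight_scale, seed=0):
    """Norm of a backpropagated vector after each of `depth` linear layers."""
    rng = np.random.default_rng(seed)
    grad = rng.standard_normal(width)
    grad /= np.linalg.norm(grad)  # start from a unit-norm gradient
    norms = []
    for _ in range(depth):
        # Entries ~ N(0, weight_scale^2 / width), so each layer rescales
        # the gradient norm by roughly `weight_scale` on average.
        W = rng.standard_normal((width, width)) * weight_scale / np.sqrt(width)
        grad = W.T @ grad  # backprop through y = W x
        norms.append(float(np.linalg.norm(grad)))
    return norms


small = plain_gradient_norms(depth=56, width=64, weight_scale=0.5)
large = plain_gradient_norms(depth=56, width=64, weight_scale=2.0)
print(f"weight_scale=0.5 -> final gradient norm {small[-1]:.2e}")  # vanishes
print(f"weight_scale=2.0 -> final gradient norm {large[-1]:.2e}")  # explodes
```

Only a scale very close to 1 keeps the norm stable over 56 layers in this toy setting, which is why depth alone makes plain stacks fragile.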
Residual Connections
He et al. (2015) introduced the residual block:
$$y = F(x) + x$$
where $F(x)$ is a small sub-network (two conv + BN + ReLU layers), and $x$ is routed directly to the output via a skip connection (also called a shortcut or residual connection).
The network learns the residual $F(x) = H(x) - x$ rather than the full transformation $H(x)$. If the optimal mapping $H$ is close to the identity, $F$ only needs to learn a small correction, which is easier than learning the full mapping from scratch.
Why Residuals Help Gradient Flow
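By the chain rule, the gradient of a loss $L$ with respect to the input $x$ of a block $y = F(x) + x$ is:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(\frac{\partial F}{\partial x} + 1\right)$$
The term $1$ is the identity shortcut. Even if $\partial F / \partial x$ becomes very small (vanishing gradients), the gradient of $L$ still flows back through the $+1$ path unobstructed. Residual connections are a gradient highway through the entire network.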
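To see the effect of the identity term numerically, the sketch below backpropagates a unit-norm vector through 56 toy layers, once as a plain stack and once with a skip connection around each layer. It is an illustration under strong assumptions: each $F$ is a random linear map with small weights, there are no activations or normalization, and `backprop_norm` is a hypothetical helper written for this example.

```python
import numpy as np


def backprop_norm(depth, width, weight_scale, residual, seed=0):
    """Gradient norm after backpropagating through `depth` toy layers."""
    rng = np.random.default_rng(seed)
    grad = rng.standard_normal(width)
    grad /= np.linalg.norm(grad)  # unit-norm gradient at the output
    for _ in range(depth):
        # Jacobian of the sub-network F, deliberately small
        J_F = rng.standard_normal((width, width)) * weight_scale / np.sqrt(width)
        if residual:
            grad = grad + J_F.T @ grad  # (dF/dx + I)^T grad: identity path kept
        else:
            grad = J_F.T @ grad         # plain layer: only the dF/dx path
    return float(np.linalg.norm(grad))


plain_norm = backprop_norm(56, 64, 0.1, residual=False)
residual_norm = backprop_norm(56, 64, 0.1, residual=True)
print(f"plain:    {plain_norm:.2e}")
print(f"residual: {residual_norm:.2e}")
```

With the same small Jacobians, the plain stack drives the gradient toward zero while the residual stack keeps it at a usable magnitude, because every layer passes the incoming gradient through unchanged and only adds a small correction.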
Block Variants
Basic Block (ResNet-18, ResNet-34)
Two conv layers with batch normalization:
x → Conv(3×3, C) → BN → ReLU → Conv(3×3, C) → BN → (+x) → ReLU → y
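Parameters per block (ignoring BN and biases), for two 3×3 convolutions with $C$ input and $C$ output channels:
$$2 \times (3 \times 3 \times C \times C) = 18C^2$$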
Bottleneck Block (ResNet-50, ResNet-101, ResNet-152)
Three layers: a 1×1 conv to reduce channels, a 3×3 conv, and a 1×1 conv to expand back:
x → Conv(1×1, C/4) → BN → ReLU → Conv(3×3, C/4) → BN → ReLU → Conv(1×1, C) → BN → (+x) → ReLU → y
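Parameters (ignoring BN and biases):
$$C \cdot \frac{C}{4} + 3 \times 3 \times \frac{C}{4} \cdot \frac{C}{4} + \frac{C}{4} \cdot C = \frac{17}{16}C^2 \approx 1.06\,C^2$$
Compared with a basic block at the same channel width $C$, which needs $18C^2$ parameters, the bottleneck uses ~17× fewer parameters for the same output dimension. This is why ResNets deeper than 34 layers all use bottleneck blocks.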
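The parameter counts above can be checked with a few lines of arithmetic. This is a small sketch that counts convolution weights only (no BN, no biases); the helper names and the choice of $C = 256$ are assumptions made for the example.

```python
def conv_params(kernel, in_ch, out_ch):
    """Number of weights in a kernel x kernel convolution without bias."""
    return kernel * kernel * in_ch * out_ch


def basic_block_params(c):
    # Two 3x3 convolutions, C -> C channels each
    return 2 * conv_params(3, c, c)


def bottleneck_block_params(c, reduction=4):
    mid = c // reduction
    return (
        conv_params(1, c, mid)      # 1x1 reduce: C -> C/4
        + conv_params(3, mid, mid)  # 3x3 at reduced width
        + conv_params(1, mid, c)    # 1x1 expand: C/4 -> C
    )


c = 256
basic = basic_block_params(c)
bottleneck = bottleneck_block_params(c)
print(basic)                          # 1179648
print(bottleneck)                     # 69632
print(round(basic / bottleneck, 1))   # 16.9
```

The ratio $18 / (17/16) \approx 16.9$ does not depend on $C$, as long as $C$ is divisible by 4.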
Projection Shortcut
When the block changes spatial dimensions (stride 2) or the number of channels, the shortcut cannot be added directly. A projection shortcut applies a 1×1 conv (with matching stride) to $x$ before addition:
$$y = F(x) + W_s x$$
where $W_s$ denotes the 1×1 projection.
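A strided 1×1 convolution is simple enough to write out by hand, which makes the shape bookkeeping of the projection shortcut concrete. The sketch below uses NumPy rather than a deep learning framework, omits the batch dimension and the BN that usually follows, and the function name and tensor sizes are assumptions for illustration.

```python
import numpy as np


def projection_shortcut(x, W_s, stride=2):
    """1x1 convolution with stride: subsample spatially, then mix channels.

    x:   feature map of shape (C_in, H, W)
    W_s: 1x1 convolution weights of shape (C_out, C_in)
    """
    # A 1x1 kernel with stride s only ever looks at every s-th position.
    x_sub = x[:, ::stride, ::stride]
    # At each remaining position, apply the same linear map over channels.
    return np.einsum("oc,chw->ohw", W_s, x_sub)


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32, 32))   # 64 channels, 32x32 spatial
W_s = rng.standard_normal((128, 64))    # project 64 -> 128 channels
shortcut = projection_shortcut(x, W_s)
print(shortcut.shape)  # (128, 16, 16)
```

The result has the same shape as the output of a stride-2 block that doubles the channel count, so the two can be added elementwise.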
The ResNet Family
| Model | Layers | Params | ImageNet Top-1 |
|---|---|---|---|
| ResNet-18 | 18 | 11M | 69.8% |
| ResNet-50 | 50 | 25M | 76.1% |
| ResNet-101 | 101 | 44M | 77.4% |
| ResNet-152 | 152 | 60M | 78.3% |
ResNets enabled training of networks 2–10× deeper than previous state-of-the-art and won ILSVRC 2015 by a large margin. The architecture remains a standard backbone for vision today.
Residual Connections Everywhere
The idea generalized far beyond CNNs:
Transformers: Both the self-attention sub-layer and the FFN sub-layer use residual connections: $y = x + \mathrm{Sublayer}(x)$, combined with layer normalization. Without them, transformers with more than a few layers are very difficult to train.
Highway networks (Srivastava et al., 2015): A gated variant: $y = T(x) \cdot H(x) + (1 - T(x)) \cdot x$, where $T(x)$ is a learned gate. Conceptually similar but with adaptive blending.
DenseNet: Within a dense block, each layer receives as input the concatenation of all previous layers' outputs. Concatenating rather than adding makes this an extreme form of shortcut connectivity that improves gradient flow and feature reuse.
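The difference between a plain residual connection and the highway gate is easiest to see side by side. The following NumPy sketch uses single fully connected layers with a tanh nonlinearity; the layer shapes, the weight scales, and the extreme gate biases are assumptions chosen to make the two limiting cases visible.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def residual_layer(x, W):
    return x + np.tanh(W @ x)  # y = F(x) + x, the skip is always fully open


def highway_layer(x, W_h, W_t, b_t):
    h = np.tanh(W_h @ x)          # candidate transformation H(x)
    t = sigmoid(W_t @ x + b_t)    # transform gate T(x), each entry in (0, 1)
    return t * h + (1.0 - t) * x  # blend transformed and carried input


rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W_h = rng.standard_normal((8, 8)) * 0.1
W_t = rng.standard_normal((8, 8)) * 0.1

carried = highway_layer(x, W_h, W_t, b_t=-20.0)     # gate near 0: y is close to x
transformed = highway_layer(x, W_h, W_t, b_t=20.0)  # gate near 1: y is close to H(x)
res = residual_layer(x, W_h)                        # always x + F(x)
print(np.allclose(carried, x, atol=1e-6))
print(np.allclose(transformed, np.tanh(W_h @ x), atol=1e-6))
```

What the two share is a path along which the input can reach the output unchanged; the highway layer learns how much of that path to use, while the residual layer always keeps it.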
The universal pattern: Residual connections allow deep networks to be viewed as ensembles of shallow paths. Veit et al. (2016) showed that unrolling a ResNet reveals exponentially many paths of varying depth; most gradient flows through relatively short paths, with depth providing robustness rather than strictly sequential computation.
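The path-counting argument can be made concrete with a short calculation. If each of $n$ residual blocks can either be traversed or skipped, there are $2^n$ paths, and the number of paths through exactly $k$ blocks is $\binom{n}{k}$. The sketch below only counts paths; it does not model how gradient magnitude depends on path length, and $n = 54$ is an arbitrary value chosen for illustration.

```python
from math import comb


def path_length_distribution(num_blocks):
    """Fraction of the 2^n paths that pass through exactly k residual blocks."""
    total = 2 ** num_blocks
    return [comb(num_blocks, k) / total for k in range(num_blocks + 1)]


n = 54
dist = path_length_distribution(n)
mean_length = sum(k * p for k, p in enumerate(dist))
full_depth_fraction = dist[n]  # only one path goes through every block

print(f"number of paths: 2^{n} = {2 ** n}")
print(f"mean path length: {mean_length:.1f} blocks")  # 27.0
print(f"fraction using all {n} blocks: {full_depth_fraction:.1e}")
```

Path lengths follow a binomial distribution centered on $n/2$, so the single full-depth path is a vanishing fraction of the total.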
PyTorch and TensorFlow
PyTorch — basic block, bottleneck block, and projection shortcut:
```python
import torch
import torch.nn as nn


# Basic residual block (ResNet-18 / ResNet-34)
class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # bias=False: the BatchNorm after each conv provides its own shift
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)  # F(x) + x, then ReLU


# Bottleneck block (ResNet-50/101/152): 1x1 → 3x3 → 1x1
class Bottleneck(nn.Module):
    expansion = 4  # output channels = mid_ch * expansion

    def __init__(self, in_ch: int, mid_ch: int):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the channel count changes
        # (this block keeps stride 1, so spatial size never changes)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch)
        ) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.net(x) + self.shortcut(x))


# Use torchvision for pretrained ResNets
import torchvision.models as models

resnet50 = models.resnet50(weights='IMAGENET1K_V2')
resnet50.fc = nn.Linear(resnet50.fc.in_features, 10)  # replace the 2048-d head for fine-tuning
```
TensorFlow / Keras:
```python
import tensorflow as tf


# Functional API makes skip connections explicit
def residual_block(x, filters: int):
    # Assumes x already has `filters` channels; otherwise the addition fails
    residual = x
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Add()([x, residual])  # F(x) + x
    return tf.keras.layers.ReLU()(x)


# Pretrained ResNet50 from Keras applications
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base.trainable = False  # freeze for transfer learning
out = tf.keras.layers.GlobalAveragePooling2D()(base.output)
out = tf.keras.layers.Dense(10)(out)  # 10-class head, outputs logits
model = tf.keras.Model(base.input, out)
```