Supplement · Regularization

Dropout and Stochastic Regularization

By the end of this reading you will be able to:
  • Explain dropout's ensemble interpretation: why randomly masking units during training lets the full network at inference approximate an average over an exponential number of subnetworks
  • Explain inverted dropout — why activations are scaled by 1/(1−p) during training — and state why this allows the same network to be used at inference without any modification
  • Distinguish standard dropout from DropConnect, MC Dropout, and Stochastic Depth (DropPath), and explain the use case for each variant
  • Identify where dropout is and is not typically applied in modern architectures — including where batch normalization has replaced it and where it remains standard

The Dropout Idea

Dropout (Srivastava et al., 2014) is one of the most widely used regularizers in deep learning. During training, each neuron is independently set to zero with probability $p$ (the dropout rate, also called the drop probability):

$$\mathbf{h}_{\text{drop}} = \mathbf{m} \odot \mathbf{h}, \quad m_i \sim \text{Bernoulli}(1-p)$$

A new mask $\mathbf{m}$ is sampled independently for each training example and each forward pass.

At inference: the mask is removed. The full network is used without any randomness.
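The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the names `dropout_mask` and `dropout_forward` are made up for this example, and the input is set to all ones so that the fraction of zeroed entries is easy to read off.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(shape, p, rng):
    """Sample a keep-mask: 1 with probability 1 - p, 0 with probability p."""
    return (rng.random(shape) >= p).astype(np.float64)

def dropout_forward(h, p, rng, training=True):
    """Plain (unscaled) dropout: mask during training, identity at inference."""
    if not training:
        return h
    return dropout_mask(h.shape, p, rng) * h

h = np.ones((4, 1000))   # 4 examples, 1000 units each (illustrative sizes)
p = 0.3

out_a = dropout_forward(h, p, rng)                      # first training pass
out_b = dropout_forward(h, p, rng)                      # second pass draws a new mask
out_eval = dropout_forward(h, p, rng, training=False)   # no randomness

drop_fraction = 1.0 - out_a.mean()   # close to p, because h is all ones
```

Each row of `out_a` gets its own mask, and `out_b` differs from `out_a` even though the input is the same, which is the "per example, per forward pass" behaviour stated above.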


Inverted Dropout: Matching Train and Test Scales

If we drop neurons with probability $p$ during training, the expected value of each activation is scaled down by a factor of $(1-p)$. At inference, with nothing dropped, activations are $(1-p)^{-1}$ times larger than the network was trained to expect: a mismatch.

Inverted dropout corrects for this during training: surviving activations are scaled up by $1/(1-p)$:

$$\mathbf{h}_{\text{drop}} = \frac{\mathbf{m} \odot \mathbf{h}}{1-p}, \quad m_i \sim \text{Bernoulli}(1-p)$$

Because $\mathbb{E}[m_i] = 1-p$, we now have $\mathbb{E}[\mathbf{h}_{\text{drop}}] = \mathbf{h}$: the expected activation at train time equals the full activation at test time. The same network runs unchanged at inference, with no scaling correction needed.

This is the standard implementation in PyTorch (nn.Dropout) and TensorFlow.
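To check the expectation argument numerically, here is a minimal NumPy sketch of inverted dropout. The function name `inverted_dropout` and the toy activation vector are assumptions made for illustration; framework implementations differ in detail.

```python
import numpy as np

def inverted_dropout(h, p, rng, training=True):
    """Inverted dropout: zero units with probability p and rescale survivors."""
    if not training or p == 0.0:
        return h
    keep = 1.0 - p
    mask = rng.random(h.shape) < keep      # True with probability 1 - p
    return np.where(mask, h / keep, 0.0)   # survivors are divided by (1 - p)

rng = np.random.default_rng(0)
h = np.array([0.5, -1.0, 2.0, 4.0])        # toy activations
p = 0.5

# Average many stochastic training-mode passes to estimate E[h_drop].
samples = np.stack([inverted_dropout(h, p, rng) for _ in range(20000)])
train_mean = samples.mean(axis=0)

# Inference: no mask, no rescaling.
eval_out = inverted_dropout(h, p, rng, training=False)
```

With enough samples, `train_mean` should land close to `h`, while every individual sample holds either 0 or the activation divided by `1 - p`.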


Why Dropout Works

Ensemble Interpretation

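A network with $n$ droppable neurons has $2^n$ possible subnetworks (each neuron either present or absent). Dropout trains a random sample of these subnetworks, all sharing one set of weights. At inference, using the full network approximates averaging the predictions of all these subnetworks. For a single softmax layer the result equals their normalized geometric mean exactly; for deeper nonlinear networks it is an approximation that works well in practice.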

Ensembles usually outperform their individual members; dropout captures part of this benefit without the cost of training and storing multiple models.
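The ensemble claim can be verified by brute force in the one case where it holds exactly: a single softmax layer with dropout on its inputs and $p = 0.5$, so that all $2^n$ masks are equally likely. The sketch below uses made-up sizes (4 inputs, 3 classes) and random weights; it enumerates every mask and compares the normalized geometric mean of the subnetwork predictions with one pass through the unmasked layer.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_classes, p = 4, 3, 0.5             # illustrative sizes
W = rng.normal(size=(n_classes, n_in))
b = rng.normal(size=n_classes)
x = rng.normal(size=n_in)

# Enumerate all 2^n masks over the inputs; with p = 0.5 each is equally likely.
log_probs = []
for bits in itertools.product([0.0, 1.0], repeat=n_in):
    m = np.array(bits)
    z = W @ (m * x / (1.0 - p)) + b        # inverted dropout on the inputs
    log_probs.append(np.log(softmax(z)))

# Normalized geometric mean of the 2^n subnetwork predictions.
geo = np.exp(np.mean(log_probs, axis=0))
geo_mean_ensemble = geo / geo.sum()

# Single pass through the full layer with no mask.
full_network = softmax(W @ x + b)
```

Here `geo_mean_ensemble` and `full_network` agree up to floating-point error. Once hidden nonlinearities sit between the dropped units and the output, the equality no longer holds exactly and the full network is only an approximation to the ensemble.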

Preventing Co-Adaptation

Without dropout, neurons can co-adapt: a neuron learns to depend on specific other neurons to fix its mistakes. With dropout, no neuron can rely on any other being present — each must learn useful features independently. This forces redundant representations: multiple neurons learn to detect similar patterns, which generalizes better.

Noise as Regularization

Dropout is equivalent to adding multiplicative noise to activations. Adding noise to training inputs is known to improve generalization; dropout generalizes this to all hidden layers.
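To make the noise view concrete: inverted dropout multiplies each activation by a random factor that is $0$ with probability $p$ and $1/(1-p)$ otherwise. The sketch below computes the mean and variance of that factor exactly using rational arithmetic. The helper name `noise_moments` is illustrative and not part of any library.

```python
from fractions import Fraction

def noise_moments(p):
    """Exact mean and variance of eps = m / (1 - p), with m ~ Bernoulli(1 - p)."""
    keep = 1 - p
    values = [(Fraction(0), p), (1 / keep, keep)]   # (value, probability) pairs
    mean = sum(v * prob for v, prob in values)
    second = sum(v * v * prob for v, prob in values)
    return mean, second - mean ** 2

for p in (Fraction(1, 10), Fraction(1, 2), Fraction(4, 5)):
    mean, var = noise_moments(p)
    print(f"p={float(p):.1f}  mean={mean}  var={var}  p/(1-p)={p / (1 - p)}")
```

The mean is always 1, so the noise leaves the expected activation untouched, and the variance works out to $p/(1-p)$: raising the drop rate injects noise of growing strength.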


Dropout Placement

Dense layers: after the activation, before the next layer. Typical drop rate: 0.3–0.5 for fully connected layers.

CNNs (spatial dropout): standard neuron dropout on feature maps drops individual values. SpatialDropout2D drops entire feature channels — better for CNNs because adjacent spatial positions are highly correlated, so dropping individual values removes little information.
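The difference between the two schemes comes down to the shape of the mask. The following NumPy sketch, with hypothetical helper names `element_dropout` and `channel_dropout` and a channels-first (B, C, H, W) layout assumed, draws one Bernoulli value per element in the first case and one per (example, channel) pair in the second.

```python
import numpy as np

def element_dropout(x, p, rng):
    """Drop individual values of a (B, C, H, W) feature map."""
    keep = 1.0 - p
    mask = rng.random(x.shape) < keep
    return x * mask / keep

def channel_dropout(x, p, rng):
    """Drop whole channels: one Bernoulli draw per (example, channel)."""
    keep = 1.0 - p
    mask = rng.random(x.shape[:2] + (1, 1)) < keep   # broadcasts over H and W
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones((4, 64, 8, 8))
out_elem = element_dropout(x, 0.2, rng)
out_chan = channel_dropout(x, 0.2, rng)

# Fraction of each channel that survived.
chan_survival = (out_chan != 0).mean(axis=(2, 3))   # only 0.0 or 1.0
elem_survival = (out_elem != 0).mean(axis=(2, 3))   # mostly strictly between 0 and 1
```

With element-wise masking nearly every channel keeps most of its values, so a neighbouring pixel can usually stand in for a dropped one; channel-wise masking removes the whole feature map at once.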

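RNNs: apply dropout to the inputs and outputs of each recurrent layer, but not naively to the recurrent connections: sampling a fresh mask at every timestep there disrupts the flow of information and gradients through time. Variational dropout, which samples one mask per sequence and reuses it at every timestep, is often preferred.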
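The "same mask across timesteps" idea can be sketched as follows. This is only an illustration of the masking pattern applied to the inputs of a sequence; the helper name `variational_dropout_mask` and the (T, B, D) time-major layout are assumptions, and no recurrent cell is included.

```python
import numpy as np

def variational_dropout_mask(batch, features, p, rng):
    """One inverted-dropout mask per sequence, shared by all timesteps."""
    keep = 1.0 - p
    return (rng.random((batch, features)) < keep) / keep

rng = np.random.default_rng(0)
T, B, D = 5, 2, 6                                   # timesteps, batch, features
x_seq = rng.normal(size=(T, B, D))
mask = variational_dropout_mask(B, D, 0.5, rng)     # sampled once per sequence

# The same mask is applied to the input at every timestep.
dropped = np.stack([x_seq[t] * mask for t in range(T)])
zero_pattern = dropped == 0
```

Every timestep in `zero_pattern` shows the same dropped features, whereas standard dropout would sample a different pattern at each step.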

Transformers: dropout is applied after attention and after the FFN sub-layer. Typical rate: 0.1. In practice, many modern LLMs remove dropout entirely — at scale, data diversity provides sufficient regularization.

Do not use: immediately before softmax; in the output layer.


Variants

DropConnect

Instead of dropping activations, randomly zero-out individual weights during training:

$$\mathbf{h} = (\mathbf{M} \odot W)\mathbf{x}, \quad M_{ij} \sim \text{Bernoulli}(1-p)$$

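Because the mask acts on individual weights rather than on whole units, each forward pass effectively uses a different sparse weight matrix. DropConnect is often reported as a marginally stronger regularizer than dropout, but it costs more: the mask has one entry per weight instead of one per activation.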
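A minimal NumPy sketch of the training-time forward pass for a single linear layer is shown below. The function name `dropconnect_linear` is made up for this example, and it covers training only; how the layer is evaluated at inference is outside this sketch.

```python
import numpy as np

def dropconnect_linear(x, W, p, rng):
    """Training-time pass: h = (M * W) x with M_ij ~ Bernoulli(1 - p)."""
    M = rng.random(W.shape) < (1.0 - p)   # one Bernoulli draw per weight
    return (M * W) @ x

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))               # illustrative layer: 5 inputs, 3 outputs
x = rng.normal(size=5)

h_train = dropconnect_linear(x, W, 0.5, rng)     # roughly half the weights zeroed
h_no_drop = dropconnect_linear(x, W, 0.0, rng)   # p = 0 keeps every weight
```

Note that the mask has the shape of `W` (15 entries here), not the shape of the activation vector (3 entries), which is where the extra cost comes from.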

MC Dropout (Uncertainty Estimation)

At inference time, keep dropout active and run multiple stochastic forward passes:

$$p(y \mid x) \approx \frac{1}{T} \sum_{t=1}^T p(y \mid x, \hat{\theta}_t)$$

where $\hat{\theta}_t$ is the set of weights left active by the $t$-th sampled dropout mask, treated as a sample from an approximate weight posterior. This lets any network trained with dropout act as an approximate Bayesian model: the variance across forward passes estimates the model's uncertainty. Gal & Ghahramani (2016) showed that training with dropout can be interpreted as approximate variational inference in a deep Gaussian process.

Useful when the model needs to say "I don't know" (medical diagnosis, autonomous driving).

Stochastic Depth (DropPath)

Instead of dropping individual neurons, randomly drop entire residual blocks during training:

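$$\mathbf{y} = \mathcal{F}(\mathbf{x}) \cdot b + \mathbf{x}, \quad b \sim \text{Bernoulli}(1 - p_\ell)$$

where $p_\ell$ is the drop probability of block $\ell$, which increases linearly with depth (the earliest layers are rarely dropped; the deepest are dropped most often).

At inference: every block is kept and the residual branch $\mathcal{F}(\mathbf{x})$ is scaled by $(1-p_\ell)$. Implementations that instead divide the surviving branch by $(1-p_\ell)$ during training, in the style of inverted dropout, need no scaling at inference. The effective depth during training is randomly shorter, acting as an ensemble over depths.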
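The linear schedule and the train/inference asymmetry can be sketched in NumPy as below. The helper names `linear_drop_schedule` and `residual_block`, the choice to start the schedule at exactly 0, and the toy branch function are all assumptions for illustration; published implementations differ in where the schedule starts and in where the scaling is applied.

```python
import numpy as np

def linear_drop_schedule(num_blocks, p_max):
    """Drop probability per residual block, rising linearly from 0 to p_max."""
    return np.linspace(0.0, p_max, num_blocks)

def residual_block(x, branch, p_drop, rng, training=True):
    """y = F(x) * b + x with b ~ Bernoulli(1 - p_drop) during training."""
    if training:
        b = float(rng.random() < 1.0 - p_drop)   # one draw for the whole block
        return branch(x) * b + x
    return branch(x) * (1.0 - p_drop) + x        # inference: scale by survival probability

rng = np.random.default_rng(0)
p_per_block = linear_drop_schedule(num_blocks=6, p_max=0.2)

# Expected number of blocks that are active in a training pass.
expected_depth = float(np.sum(1.0 - p_per_block))

x = np.ones(4)
toy_branch = lambda v: 2.0 * v                   # stand-in for F(x)
y_train = residual_block(x, toy_branch, p_per_block[-1], rng)
y_eval = residual_block(x, toy_branch, p_per_block[-1], rng, training=False)
```

In training mode `y_train` is either `x` (block skipped) or `toy_branch(x) + x` (block kept); at inference the branch always contributes, scaled down by its survival probability.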

Used in: EfficientNet, DeiT (Vision Transformers), Swin Transformer. Particularly effective for very deep networks or transformer stacks with many layers.


Hyperparameter: Choosing Drop Rate $p$

| Setting | Typical $p$ |
| --- | --- |
| Fully connected layers (MLP head) | 0.3–0.5 |
| Transformer (attention + FFN) | 0.1 |
| CNN spatial layers | 0.1–0.2 |
| Stochastic depth (DropPath) | 0.0–0.2 (depth-dependent) |

Models with more capacity relative to the amount of training data generally benefit from higher dropout rates. If validation loss is high while training loss is low (overfitting), increase $p$. If both are high (underfitting), decrease $p$: the model is already struggling to learn.


PyTorch and TensorFlow

PyTorch — dropout, MC Dropout, and DropPath:

import torch
import torch.nn as nn

# Standard dropout (inverted: divides by 1-p during training, identity at eval)
drop = nn.Dropout(p=0.5)
x    = torch.randn(8, 256)

drop.train();  out_train = drop(x)   # ~half zeros, survivors scaled by 1/(1-0.5)=2
drop.eval();   out_eval  = drop(x)   # identity — no scaling needed

# Dropout2d: zeros entire channels in (B, C, H, W) feature maps
drop2d = nn.Dropout2d(p=0.2)
x2d    = torch.randn(4, 64, 8, 8)
out    = drop2d(x2d)                 # whole channels set to zero

# MC Dropout: keep model in train() at inference, run N forward passes
class MCDropoutMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 10),
        )
    def forward(self, x): return self.net(x)

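model = MCDropoutMLP()
model.train()                              # keep dropout active at inference
x_mc = torch.randn(8, 64)                  # input width must match nn.Linear(64, ...)
N = 30
with torch.no_grad():
    # softmax first, so the average matches (1/T) * sum_t p(y | x, theta_t)
    probs = torch.stack([model(x_mc).softmax(dim=-1) for _ in range(N)])  # (30, 8, 10)
mean = probs.mean(0)                       # predictive mean over the N passes
var  = probs.var(0)                        # spread across passes: proxy for epistemic uncertainty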

# Stochastic Depth (DropPath) — typical timm implementation
class DropPath(nn.Module):
    """Drop entire residual branches with probability drop_prob."""
    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        keep = 1 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)   # per-sample mask
        mask  = torch.bernoulli(torch.full(shape, keep, device=x.device)) / keep
        return x * mask

TensorFlow / Keras:

import tensorflow as tf

# Standard dropout
drop = tf.keras.layers.Dropout(rate=0.5)
x    = tf.random.normal((8, 256))
out_train = drop(x, training=True)    # active
out_eval  = drop(x, training=False)   # identity

# SpatialDropout2D: zeros entire channels in (B, H, W, C)
drop2d = tf.keras.layers.SpatialDropout2D(rate=0.2)
x2d    = tf.random.normal((4, 8, 8, 64))
out    = drop2d(x2d, training=True)

# MC Dropout: call model with training=True at inference
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10),
])

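N = 30
# softmax first, so the average matches (1/T) * sum_t p(y | x, theta_t)
preds = tf.stack([tf.nn.softmax(model(x, training=True)) for _ in range(N)])  # (30, 8, 10)
mean  = tf.reduce_mean(preds, axis=0)            # predictive mean over the N passes
var   = tf.math.reduce_variance(preds, axis=0)   # spread across passes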