Supplement · Activation Functions

The ReLU Family

15 min read
By the end of this reading you will be able to:
  • State the ReLU formula and its gradient, and explain why it avoids the vanishing gradient problem in the positive region
  • Explain the dying ReLU problem and describe how LeakyReLU and PReLU address it with a non-zero negative slope
  • Distinguish PReLU (learned slope), RReLU (randomised slope), and LeakyReLU (fixed slope) and identify when each is appropriate
  • Apply ReLU6 for mobile and quantised inference and explain why bounding the output at 6 aids fixed-point representation

ReLU — Rectified Linear Unit

One of the most widely used activations in modern deep learning:

$$f(x) = \max(0, x)$$

ReLU mitigates the vanishing gradient problem: for $x > 0$, the gradient is exactly 1, so the activation itself does not shrink the backpropagated signal, however deep the network. Its piecewise-linear form also makes it extremely cheap to compute.

Gradient: $f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ \text{undefined} & x = 0 \end{cases}$

In practice, frameworks set $f'(0) = 0$ (a valid subgradient choice). The zero gradient for $x < 0$ is what causes the dying neuron problem: if a neuron's pre-activation is negative for every training input, its weights receive zero gradient and it never updates.

PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.ReLU()(x))    # tensor([0., 0., 0., 1., 2.])
print(F.relu(x))       # identical; use nn.ReLU() in nn.Sequential

TensorFlow:

import tensorflow as tf

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.nn.relu(x))                  # [0. 0. 0. 1. 2.]
# In a model: tf.keras.layers.Dense(64, activation='relu')
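The gradient rule and the dying neuron effect described in this section can be checked directly with autograd. The following is a minimal PyTorch sketch rather than part of the original material: the single-neuron layer, its hand-set weights, and the bias of -5 are illustrative assumptions, chosen so that the pre-activation is negative for every input.

```python
import torch
import torch.nn as nn

# Pre-activations on both sides of zero, plus zero itself.
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0], requires_grad=True)

# Summing the outputs makes d(sum)/dx_i equal to f'(x_i) for each element.
torch.relu(x).sum().backward()
relu_grad = x.grad.clone()
print(relu_grad)  # tensor([0., 0., 0., 1., 1.]); PyTorch uses f'(0) = 0

# Hypothetical "dead" neuron: inputs lie in [0, 1), so the pre-activation
# is at most 3 * 0.1 - 5 < 0 and ReLU outputs 0 for every sample.
torch.manual_seed(0)
inputs = torch.rand(8, 3)
neuron = nn.Linear(3, 1)
with torch.no_grad():
    neuron.weight.fill_(0.1)
    neuron.bias.fill_(-5.0)

torch.relu(neuron(inputs)).sum().backward()
dead_weight_grad = neuron.weight.grad.clone()
dead_bias_grad = neuron.bias.grad.clone()
print(dead_weight_grad, dead_bias_grad)  # all zeros: no update is possible
```

Because every gradient reaching the neuron's parameters is zero, no optimizer step can move it back into the positive region, which is the failure mode the variants in the following sections are designed to avoid.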

LeakyReLU — A Small Leak Fixes Dead Neurons

$$f(x) = \max(\alpha x,\, x) = \begin{cases} x & x \geq 0 \\ \alpha x & x < 0 \end{cases}$$

where $\alpha$ is a small constant, typically $\alpha = 0.01$. The negative slope $\alpha$ ensures a non-zero gradient for all inputs, preventing neurons from dying.

Gradient: $f'(x) = 1$ for $x \geq 0$, $f'(x) = \alpha$ for $x < 0$.

Note that $\text{ReLU}(x) = \text{LeakyReLU}(x, \alpha=0)$.

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
leaky = nn.LeakyReLU(negative_slope=0.01)
print(leaky(x))   # tensor([-0.0200, -0.0100,  0.0000,  1.0000,  2.0000])

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.nn.leaky_relu(x, alpha=0.01))   # [-0.02 -0.01  0.    1.    2.  ]
# In a model: tf.keras.layers.LeakyReLU(negative_slope=0.01)  (Keras 3; Keras 2 names this argument alpha)
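To make the two equivalent forms and the $\alpha = 0$ special case concrete, here is a small pure-Python sketch. The helper names (`leaky_relu_piecewise`, `leaky_relu_max`, `leaky_relu_grad`) are made up for this illustration. Note that the max form matches the piecewise form only when $\alpha \leq 1$; for a larger $\alpha$ the max would select $x$ on the negative side.

```python
def leaky_relu_piecewise(x: float, alpha: float = 0.01) -> float:
    """x for x >= 0, alpha * x for x < 0."""
    return x if x >= 0 else alpha * x


def leaky_relu_max(x: float, alpha: float = 0.01) -> float:
    """max(alpha * x, x); equals the piecewise form when alpha <= 1."""
    return max(alpha * x, x)


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative: 1 for x >= 0, alpha for x < 0 (never zero when alpha > 0)."""
    return 1.0 if x >= 0 else alpha


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Both forms agree on every test point for the default alpha = 0.01.
forms_agree = all(leaky_relu_piecewise(x) == leaky_relu_max(x) for x in xs)

# Setting alpha = 0 recovers plain ReLU, max(0, x).
matches_relu = all(leaky_relu_piecewise(x, alpha=0.0) == max(0.0, x) for x in xs)

grads = [leaky_relu_grad(x) for x in xs]
print(forms_agree, matches_relu)  # True True
print(grads)                      # [0.01, 0.01, 1.0, 1.0, 1.0]
```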

PReLU — Parametric ReLU

$$f(x) = \max(\alpha x,\, x), \quad \alpha \text{ learnable}$$

The formula is identical to LeakyReLU, but $\alpha$ is a trainable parameter, initialized to 0.25 by default in PyTorch. The network learns the negative slope from data alongside its weights. PyTorch supports channel-wise $\alpha$ via nn.PReLU(num_parameters=C).

Gradient w.r.t. $\alpha$: for $x < 0$, $\partial f / \partial \alpha = x$; for $x \geq 0$, it is 0. This gradient is accumulated over all inputs that share the same $\alpha$ and updates $\alpha$ during training.

PyTorch:

# num_parameters=1: one shared alpha; set to C for per-channel slopes
prelu = nn.PReLU(num_parameters=1, init=0.25)
x = torch.tensor([-2., -1., 0., 1., 2.])
print(prelu(x))   # tensor([-0.5000, -0.2500,  0.0000,  1.0000,  2.0000], grad_fn=<...>) at init

TensorFlow:

# Keras PReLU learns one slope per input element by default (alpha initialized to zeros)
prelu = tf.keras.layers.PReLU(shared_axes=[1, 2])  # NHWC input: share across H and W, one slope per channel
# Without shared_axes, tf.keras.layers.PReLU() keeps a separate slope for every (h, w, c) position
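The gradient with respect to $\alpha$ can be verified with autograd. The sketch below is an illustration under stated assumptions: the toy input and the learning rate of 0.1 are arbitrary choices, and a plain SGD step is used only to show the slope moving. With a single shared $\alpha$, the gradient is the sum of $x$ over the negative inputs.

```python
import torch
import torch.nn as nn

prelu = nn.PReLU(num_parameters=1, init=0.25)
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

# d(sum f)/d(alpha) = sum of x over the negative inputs = (-2) + (-1) = -3
prelu(x).sum().backward()
alpha_grad = prelu.weight.grad.clone()
print(alpha_grad)  # tensor([-3.])

# One SGD step with an arbitrary learning rate: alpha <- 0.25 - 0.1 * (-3)
torch.optim.SGD(prelu.parameters(), lr=0.1).step()
alpha_after = prelu.weight.item()
print(round(alpha_after, 4))  # 0.55

# Channel-wise variant: one slope per channel (here 3 channels, NCHW input)
channel_prelu = nn.PReLU(num_parameters=3)
slopes_shape = tuple(channel_prelu.weight.shape)
out_shape = tuple(channel_prelu(torch.randn(2, 3, 4, 4)).shape)
print(slopes_shape, out_shape)  # (3,) (2, 3, 4, 4)
```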

RReLU — Randomized Leaky ReLU

$$f(x) = \begin{cases} x & x \geq 0 \\ ax & x < 0 \end{cases}, \quad a \sim \mathcal{U}(\text{lower}, \text{upper})$$

During training, the negative slope $a$ is sampled uniformly from $[\text{lower}, \text{upper}]$ (defaults: $[1/8, 1/3]$). At evaluation time, $a$ is fixed to $(\text{lower} + \text{upper}) / 2$. The randomization acts as a form of regularization, similar to dropout.

PyTorch:

# Slope sampled from U(lower, upper) during training; fixed to mean at eval
rrelu = nn.RReLU(lower=1./8, upper=1./3)
x = torch.tensor([-2., -1., 0., 1., 2.])
print(rrelu(x))   # negative outputs vary each forward pass during training

TensorFlow:

# No built-in RReLU in core TensorFlow; implement via a custom layer
import tensorflow as tf

class RReLU(tf.keras.layers.Layer):
    def __init__(self, lower=1/8, upper=1/3, **kwargs):
        super().__init__(**kwargs)
        self.lower, self.upper = lower, upper

    def call(self, x, training=False):
        if training:
            # Draw an independent slope for every element on each forward pass
            alpha = tf.random.uniform(tf.shape(x), self.lower, self.upper, dtype=x.dtype)
        else:
            # Deterministic at inference: use the mean of the sampling range
            alpha = (self.lower + self.upper) / 2
        return tf.where(x >= 0, x, alpha * x)
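The difference between training and evaluation behaviour is easy to observe in PyTorch. In this sketch, the all-negative input (1,000 copies of -1) is an assumption chosen so that each output directly reveals its sampled slope. Training mode should yield a spread of slopes inside $[1/8, 1/3]$, while evaluation mode should yield the single value $(1/8 + 1/3)/2 = 11/48 \approx 0.229$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rrelu = nn.RReLU(lower=1 / 8, upper=1 / 3)

# Every input is -1, so the output is -a and the slope is simply -output.
x = torch.full((1000,), -1.0)

rrelu.train()
train_slopes = -rrelu(x)   # a fresh slope per element on each forward pass
rrelu.eval()
eval_slopes = -rrelu(x)    # the same fixed slope for every element

print(train_slopes.min().item(), train_slopes.max().item())  # both within [0.125, 0.3333]
print(eval_slopes.unique())  # a single value, about 0.2292
```

Forgetting to call `eval()` before inference leaves the sampling active, so predictions would differ from run to run.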

ReLU6 — Bounded for Mobile

$$f(x) = \min(\max(0, x),\, 6) = \text{clip}(x, 0, 6)$$

ReLU6 clamps activations at 6, bounding the output to $[0, 6]$. This makes it friendly for fixed-point quantization: because the range is known in advance, a fixed number of bits can be spread over $[0, 6]$ with a predictable step size, instead of being stretched to cover rare large activations. Used extensively in MobileNet.

Gradient: 1 for $0 < x < 6$, 0 otherwise.

PyTorch:

x = torch.tensor([-1., 0., 3., 6., 8.])
print(nn.ReLU6()(x))   # tensor([0., 0., 3., 6., 6.])
print(F.relu6(x))      # identical

TensorFlow:

x = tf.constant([-1., 0., 3., 6., 8.])
print(tf.nn.relu6(x))   # [0. 0. 3. 6. 6.]
# Equivalent: tf.keras.layers.ReLU(max_value=6)
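To see why a fixed upper bound helps, consider a simple unsigned 8-bit uniform quantizer. This is a sketch of generic uniform quantization under assumed settings (8 bits, zero point at 0), not the scheme of any particular framework; the function names and the outlier value of 60 are hypothetical. With ReLU6 the step size is always $6/255$, whereas an unbounded ReLU whose largest observed activation is 60 needs a step ten times coarser to cover its range.

```python
def relu6(x: float) -> float:
    return min(max(0.0, x), 6.0)


def quantize(x: float, max_value: float, bits: int = 8):
    """Map x in [0, max_value] to an unsigned integer code; also return the step size."""
    levels = 2**bits - 1
    scale = max_value / levels
    code = min(max(round(x / scale), 0), levels)
    return code, scale


def dequantize(code: int, scale: float) -> float:
    return code * scale


activation = relu6(2.7)

# Bounded range: the step size is known before seeing any data.
code6, scale6 = quantize(activation, max_value=6.0)
error6 = abs(dequantize(code6, scale6) - activation)

# Unbounded ReLU: the range must stretch to the largest observed value (assumed 60 here).
code60, scale60 = quantize(activation, max_value=60.0)
error60 = abs(dequantize(code60, scale60) - activation)

print(code6, round(scale6, 4))    # 115 0.0235
print(code60, round(scale60, 4))  # 11 0.2353
print(error6 < error60)           # True
```

In both cases the rounding error is at most half a step, so the tighter range translates directly into a smaller worst-case error for the same bit width.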

Comparison Table

| Activation | Formula | Range | Dying neurons? | Learnable? |
|---|---|---|---|---|
| ReLU | $\max(0, x)$ | $[0, \infty)$ | Yes | No |
| LeakyReLU | $\max(\alpha x, x)$ | $(-\infty, \infty)$ | No | No |
| PReLU | $\max(\alpha x, x)$, $\alpha$ learned | $(-\infty, \infty)$ | No | Yes |
| RReLU | $\max(ax, x)$, $a \sim \mathcal{U}$ | $(-\infty, \infty)$ | No | Stochastic |
| ReLU6 | $\min(\max(0, x), 6)$ | $[0, 6]$ | Yes (above 6 too) | No |

When to Use Which

  • ReLU: Default choice — fast, simple, works well with BatchNorm
  • LeakyReLU: When dying neurons are a problem (e.g., GANs, deep unsupervised models)
  • PReLU: When you want adaptive negative slope and can afford extra parameters
  • RReLU: As a regularization technique; reported to help on small datasets
  • ReLU6: MobileNet, quantized models, edge deployment
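To compare the five variants side by side, the sketch below applies each PyTorch module to the same input. The dictionary name and the test values are assumptions for illustration. The modules are put in evaluation mode so that RReLU uses its fixed mean slope and the outputs are reproducible.

```python
import torch
import torch.nn as nn

activations = {
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(negative_slope=0.01),
    "PReLU": nn.PReLU(num_parameters=1, init=0.25),
    "RReLU": nn.RReLU(lower=1 / 8, upper=1 / 3),
    "ReLU6": nn.ReLU6(),
}

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 8.0])

outputs = {}
with torch.no_grad():                 # no gradients needed for a forward comparison
    for name, module in activations.items():
        module.eval()                 # RReLU then uses slope (1/8 + 1/3) / 2
        outputs[name] = module(x)
        print(f"{name:<10}", outputs[name])
```

Since all five are drop-in `nn.Module` layers with the same input and output shape, trying a different variant in an existing `nn.Sequential` model usually means changing a single line.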

References

Glorot et al. (2011), "Deep Sparse Rectifier Neural Networks": introduced ReLU for deep networks and argued that the sparsity produced by inactive units can be beneficial
He et al. (2015), "Delving Deep into Rectifiers": introduced PReLU and the He initialization scheme
Howard et al. (2017), "MobileNets": popularized ReLU6 for mobile-efficient architectures