Supplement · Activation Functions

The ReLU Family

15 min read

By the end of this reading you will be able to:

State the ReLU formula and its gradient, and explain why it avoids the vanishing gradient problem in the positive region
Explain the dying ReLU problem and describe how LeakyReLU and PReLU address it with a non-zero negative slope
Distinguish PReLU (learned slope), RReLU (randomised slope), and LeakyReLU (fixed slope) and identify when each is appropriate
Apply ReLU6 for mobile and quantised inference and explain why bounding the output at 6 aids fixed-point representation

ReLU — Rectified Linear Unit

The most widely used activation in modern deep learning:

$f(x) = \max(0, x)$

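ReLU largely mitigates the vanishing gradient problem: for $x > 0$, the gradient is exactly 1, so the activation itself never shrinks the backpropagated signal, regardless of network depth (the weights can still shrink or amplify it). Its piecewise-linear form also makes it very cheap to compute.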

Gradient: $f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \\ \text{undefined} & x = 0 \end{cases}$

In practice, frameworks set $f'(0) = 0$, which is a valid subgradient choice. The zero gradient for $x < 0$ is what causes the dying neuron problem: if a neuron's pre-activation is negative for every training input, no gradient flows through it and its weights never update.
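To make the depth argument concrete, here is a minimal sketch in plain Python that multiplies the activation derivatives along a 20-layer path. It is an illustration under simplifying assumptions, not a training setup: weights are ignored, every layer sees the same pre-activation, and the helper names (`sigmoid_grad`, `relu_grad`, `chained_grad`) are made up for this example.

```python
import math


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid at pre-activation z (at most 0.25)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU, using the f'(0) = 0 convention."""
    return 1.0 if z > 0 else 0.0


def chained_grad(grad_fn, pre_activations):
    """Product of per-layer activation derivatives along one path (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


depth = 20
positive_path = [1.0] * depth                    # every layer has z = 1
one_negative = [1.0] * (depth - 1) + [-1.0]      # last layer has z = -1

relu_positive = chained_grad(relu_grad, positive_path)
sigmoid_positive = chained_grad(sigmoid_grad, positive_path)
relu_blocked = chained_grad(relu_grad, one_negative)

print(f"ReLU, all positive:    {relu_positive}")
print(f"Sigmoid, all positive: {sigmoid_positive:.3e}")
print(f"ReLU, one negative:    {relu_blocked}")
```

On the all-positive path the ReLU factor stays at exactly 1, while the sigmoid product collapses towards zero. A single negative pre-activation zeroes the ReLU path, which is the mechanism behind dying neurons.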

PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.ReLU()(x))    # tensor([0., 0., 0., 1., 2.])
print(F.relu(x))       # identical; use nn.ReLU() in nn.Sequential

TensorFlow:

import tensorflow as tf

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.nn.relu(x))                  # [0. 0. 0. 1. 2.]
# In a model: tf.keras.layers.Dense(64, activation='relu')

LeakyReLU — A Small Leak Fixes Dead Neurons

$f(x) = \max(\alpha x,\, x) = \begin{cases} x & x \geq 0 \\ \alpha x & x < 0 \end{cases}$

where $\alpha$ is a small positive constant, typically $\alpha = 0.01$ (the $\max$ form assumes $0 \le \alpha \le 1$). The negative slope keeps the gradient non-zero for every input, so a neuron stuck in the negative region can still receive updates and recover.

Gradient: $f'(x) = 1$ for $x \geq 0$ , $f'(x) = \alpha$ for $x < 0$ .

Note that $\text{ReLU}(x) = \text{LeakyReLU}(x, \alpha=0)$ .
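The difference between a zero and a small non-zero slope can be seen on a toy problem. The sketch below trains a single neuron $f(wx + b)$ with plain SGD, starting from a bias so negative that every pre-activation is below zero. The data, learning rate, and function names (`train_neuron` and friends) are hypothetical choices for illustration only.

```python
def leaky_relu(z, alpha):
    """LeakyReLU; alpha = 0 recovers ReLU for z != 0."""
    return z if z >= 0 else alpha * z


def leaky_relu_grad(z, alpha):
    """Derivative of LeakyReLU with respect to its input."""
    return 1.0 if z >= 0 else alpha


def train_neuron(alpha, steps=100, lr=0.05):
    """SGD on squared error for one neuron that starts deep in the negative region."""
    w, b = 1.0, -10.0
    data = [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]   # (input, target) pairs
    for _ in range(steps):
        for x, target in data:
            z = w * x + b
            err = leaky_relu(z, alpha) - target
            # Chain rule: dL/dz = 2 * err * f'(z)
            g = 2.0 * err * leaky_relu_grad(z, alpha)
            w -= lr * g * x
            b -= lr * g
    return w, b


relu_w, relu_b = train_neuron(alpha=0.0)      # plain ReLU
leaky_w, leaky_b = train_neuron(alpha=0.01)   # LeakyReLU

print("ReLU:     ", relu_w, relu_b)
print("LeakyReLU:", leaky_w, leaky_b)
```

With $\alpha = 0$ the parameters never move from their initial values, because every gradient is multiplied by zero. With $\alpha = 0.01$ they drift slowly towards the positive region, which is exactly the escape route LeakyReLU is meant to provide.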

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
leaky = nn.LeakyReLU(negative_slope=0.01)
print(leaky(x))   # tensor([-0.0200, -0.0100,  0.0000,  1.0000,  2.0000])

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.nn.leaky_relu(x, alpha=0.01))   # [-0.02 -0.01  0.    1.    2.  ]
# In a model: tf.keras.layers.LeakyReLU(negative_slope=0.01)
# (argument name in Keras 3 / TF 2.16+; older Keras versions call it alpha)

PReLU — Parametric ReLU

$f(x) = \max(\alpha x,\, x), \quad \alpha \text{ learnable}$

Identical formula to LeakyReLU, but $\alpha$ is a trainable parameter, typically initialized to 0.25 (the PyTorch default and the value used by He et al.). The network learns the negative slope from data alongside its weights. PyTorch supports channel-wise $\alpha$ via nn.PReLU(num_parameters=C).

Gradient w.r.t. $\alpha$: $\partial f / \partial \alpha = x$ for $x < 0$, and $0$ for $x \geq 0$. This gradient propagates back and updates $\alpha$ during training.
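The gradient with respect to $\alpha$ can be checked numerically. The following plain-Python sketch compares the analytic value against a central finite difference and then applies one hand-written SGD step to $\alpha$. The function names, the upstream gradient, and the learning rate are assumptions chosen for illustration.

```python
def prelu(x, alpha):
    """PReLU forward pass for a scalar input."""
    return x if x >= 0 else alpha * x


def prelu_grad_alpha(x):
    """Analytic derivative of PReLU with respect to alpha."""
    return x if x < 0 else 0.0


alpha = 0.25
x = -2.0
eps = 1e-6

# Central finite difference in alpha, holding x fixed
numeric = (prelu(x, alpha + eps) - prelu(x, alpha - eps)) / (2 * eps)
analytic = prelu_grad_alpha(x)
print(numeric, analytic)

# One SGD step on alpha, given a hypothetical upstream gradient dL/df
upstream = 0.5
lr = 0.1
new_alpha = alpha - lr * upstream * analytic
print(new_alpha)
```

Only negative inputs contribute to the update of $\alpha$; positive inputs pass through unchanged and carry no information about the slope.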

PyTorch:

# num_parameters=1: one shared alpha; set to C for per-channel slopes
prelu = nn.PReLU(num_parameters=1, init=0.25)
x = torch.tensor([-2., -1., 0., 1., 2.])
print(prelu(x))   # tensor([-0.5000, -0.2500,  0.0000,  1.0000,  2.0000]) at init

TensorFlow:

# Keras PReLU learns one slope per input element by default (alpha starts at zero).
# shared_axes=[1, 2] shares the slope across the spatial dims of an NHWC tensor,
# leaving one learned slope per channel.
prelu = tf.keras.layers.PReLU(shared_axes=[1, 2])

RReLU — Randomized Leaky ReLU

$f(x) = \begin{cases} x & x \geq 0 \\ ax & x < 0 \end{cases}, \quad a \sim \mathcal{U}(\text{lower}, \text{upper})$

During training, the negative slope $a$ is sampled uniformly from $[\text{lower}, \text{upper}]$ (PyTorch defaults: $[1/8, 1/3]$). At evaluation time, $a$ is fixed to the mean of that range, $(\text{lower} + \text{upper}) / 2$. The randomization acts as a form of regularization, similar in spirit to dropout.
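The relationship between the two modes can be sketched in plain Python. The scalar `rrelu` function below is a hypothetical stand-in written for this example, not a framework API; it shows that the evaluation slope is the mean of the training-time distribution.

```python
import random


def rrelu(x, lower=1 / 8, upper=1 / 3, training=False, rng=random):
    """Scalar RReLU: random slope while training, mean slope at evaluation."""
    if x >= 0:
        return x
    a = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return a * x


rng = random.Random(0)  # seeded so the run is repeatable

eval_out = rrelu(-2.0)  # deterministic: slope is (1/8 + 1/3) / 2
train_outs = [rrelu(-2.0, training=True, rng=rng) for _ in range(10_000)]
train_mean = sum(train_outs) / len(train_outs)

print("eval output:       ", eval_out)
print("mean train output: ", train_mean)
print("train output range:", min(train_outs), max(train_outs))
```

Averaged over many forward passes, the training-time output approaches the evaluation output, so the network sees consistent activations in expectation when it switches modes.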

PyTorch:

# Slope sampled from U(lower, upper) during training; fixed to mean at eval
rrelu = nn.RReLU(lower=1./8, upper=1./3)
x = torch.tensor([-2., -1., 0., 1., 2.])
print(rrelu(x))   # negative outputs vary each forward pass during training

TensorFlow:

# No built-in RReLU in core TensorFlow/Keras; a minimal custom layer
import tensorflow as tf

class RReLU(tf.keras.layers.Layer):
    def __init__(self, lower=1/8, upper=1/3, **kwargs):
        super().__init__(**kwargs)  # forward name, dtype, etc. to the base Layer
        self.lower, self.upper = lower, upper

    def call(self, x, training=False):
        if training:
            # One slope per element, drawn from U(lower, upper)
            alpha = tf.random.uniform(tf.shape(x), self.lower, self.upper, dtype=x.dtype)
        else:
            # Deterministic slope at inference: the mean of the range
            alpha = (self.lower + self.upper) / 2
        return tf.where(x >= 0, x, alpha * x)

ReLU6 — Bounded for Mobile

$f(x) = \min(\max(0, x),\, 6) = \text{clip}(x, 0, 6)$

ReLU6 clamps activations at 6, bounding the output to $[0, 6]$. A fixed, known range makes it friendly to fixed-point quantization: the scale can be chosen ahead of time, and the available bits cover a small interval instead of being stretched to accommodate rare large outliers. It is used extensively in the MobileNet family.

Gradient: 1 for $0 < x < 6$ , 0 otherwise.
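A rough sketch of why the bound helps, in plain Python. It assumes a simple unsigned 8-bit scheme with zero-point 0 and a scale of 6/255; real quantization toolchains differ in detail, and the names `quantize`, `dequantize`, and the outlier value of 60 are illustrative assumptions.

```python
def relu6(x):
    """ReLU6: clip the input to the interval [0, 6]."""
    return min(max(0.0, x), 6.0)


LEVELS = 255            # largest unsigned 8-bit code
SCALE = 6.0 / LEVELS    # step size when the range is known to be [0, 6]


def quantize(y):
    """Map a value in [0, 6] to an integer code in [0, 255]."""
    return int(round(y / SCALE))


def dequantize(q):
    """Map an integer code back to an approximate real value."""
    return q * SCALE


for v in [-1.0, 0.0, 0.7, 3.0, 5.99, 6.0, 8.0]:
    y = relu6(v)
    q = quantize(y)
    print(f"x={v:5.2f}  relu6={y:4.2f}  code={q:3d}  error={abs(dequantize(q) - y):.5f}")

# For comparison: if an unbounded ReLU produced an outlier of 60 (hypothetical),
# the same 8 bits would have to cover [0, 60] with a step ten times larger.
UNBOUNDED_SCALE = 60.0 / LEVELS
print("step with bound 6: ", SCALE)
print("step with max 60:  ", UNBOUNDED_SCALE)
```

The round-trip error never exceeds half a step, and the step itself is fixed by the bound rather than by whatever maximum happens to appear in the calibration data.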

PyTorch:

x = torch.tensor([-1., 0., 3., 6., 8.])
print(nn.ReLU6()(x))   # tensor([0., 0., 3., 6., 6.])
print(F.relu6(x))      # identical

TensorFlow:

x = tf.constant([-1., 0., 3., 6., 8.])
print(tf.nn.relu6(x))   # [0. 0. 3. 6. 6.]
# Equivalent: tf.keras.layers.ReLU(max_value=6)

Comparison Table

Activation	Formula	Range	Dying neurons?	Learnable?
ReLU	$\max(0,x)$	$[0,\infty)$	Yes	No
LeakyReLU	$\max(\alpha x, x)$	$(-\infty,\infty)$	No	No
PReLU	$\max(\alpha x, x)$ , $\alpha$ learned	$(-\infty,\infty)$	No	Yes
RReLU	$\max(ax, x)$ , $a \sim \mathcal{U}$	$(-\infty,\infty)$	No	No (slope is sampled, not learned)
ReLU6	$\min(\max(0,x), 6)$	$[0, 6]$	Yes (gradient is also zero above 6)	No

When to Use Which

ReLU: Default choice — fast, simple, works well with BatchNorm
LeakyReLU: When dying neurons are a problem (e.g., GANs, deep unsupervised models)
PReLU: When you want adaptive negative slope and can afford extra parameters
RReLU: As a regularization technique; reported to help on small datasets, where overfitting is the main concern
ReLU6: MobileNet, quantized models, edge deployment

References

Glorot et al. (2011) — Deep Sparse Rectifier Neural Networks — Showed that rectifier units train deep networks effectively and argued that the sparsity from inactive units can be beneficial

He et al. (2015) — Delving Deep into Rectifiers (PReLU) — Introduced PReLU and the He initialization scheme

Howard et al. (2017) — MobileNets (ReLU6) — Popularized ReLU6 for mobile-efficient architectures
