The ReLU Family
- State the ReLU formula and its gradient, and explain why it avoids the vanishing gradient problem in the positive region
- Explain the dying ReLU problem and describe how LeakyReLU and PReLU address it with a non-zero negative slope
- Distinguish PReLU (learned slope), RReLU (randomised slope), and LeakyReLU (fixed slope) and identify when each is appropriate
- Apply ReLU6 for mobile and quantised inference and explain why bounding the output at 6 aids fixed-point representation
ReLU — Rectified Linear Unit
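The most widely used activation in modern deep learning:
$$\text{ReLU}(x) = \max(0, x)$$
ReLU largely avoids the vanishing gradient problem: for $x > 0$, the gradient is exactly 1, so the activation itself causes no shrinkage regardless of network depth. Its piecewise-linear form also makes it extremely fast to compute.
Gradient:
$$\frac{d}{dx}\,\text{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$
The derivative is undefined at $x = 0$. In practice, frameworks set $\text{ReLU}'(0) = 0$ (sub-gradient convention). The zero gradient for $x < 0$ is what causes the dying neuron problem: if a neuron's pre-activation is always negative, its weights receive no gradient and it never updates.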
PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.ReLU()(x)) # tensor([0., 0., 0., 1., 2.])
print(F.relu(x)) # identical; use nn.ReLU() in nn.Sequential
TensorFlow:
import tensorflow as tf
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.nn.relu(x)) # [0. 0. 0. 1. 2.]
# In a model: tf.keras.layers.Dense(64, activation='relu')
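
The depth argument made for ReLU can be checked numerically. The following is a minimal sketch in plain Python, not a full backpropagation: it multiplies the activation derivative along a chain of layers and ignores the weight matrices on purpose, so it isolates the activation's contribution only. The helper names (`sigmoid_grad`, `relu_grad`, `chained_gradient`) and the sigmoid baseline are illustrative choices, not part of any library.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic sigmoid at z (at most 0.25, reached at z = 0)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU, using the convention ReLU'(0) = 0."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z: float, depth: int) -> float:
    """Product of `depth` activation derivatives, all evaluated at z.

    Weights are deliberately left out, so this shows only how much the
    activation function alone scales a gradient passing through it.
    """
    product = 1.0
    for _ in range(depth):
        product *= grad_fn(z)
    return product


for depth in (1, 10, 50):
    sig = chained_gradient(sigmoid_grad, 0.0, depth)  # sigmoid's best case
    rel = chained_gradient(relu_grad, 1.0, depth)     # positive region of ReLU
    print(f"depth={depth:2d}  sigmoid={sig:.3e}  relu={rel:.1f}")
```

Even in the sigmoid's best case the product shrinks by a factor of 4 per layer, while the ReLU product stays at 1 in the positive region. In the negative region the ReLU product is 0, which is the dying neuron case.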
LeakyReLU — A Small Leak Fixes Dead Neurons
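$$\text{LeakyReLU}(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases}$$
where $\alpha$ is a small constant, typically $\alpha = 0.01$. The negative slope ensures a non-zero gradient for all inputs, preventing neurons from dying.
Gradient: $1$ for $x > 0$, $\alpha$ for $x < 0$.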
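Note that $\text{LeakyReLU}(x) = \max(\alpha x, x)$ whenever $0 < \alpha < 1$, and that $\alpha = 0$ recovers plain ReLU.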
PyTorch:
x = torch.tensor([-2., -1., 0., 1., 2.])
leaky = nn.LeakyReLU(negative_slope=0.01)
print(leaky(x)) # tensor([-0.0200, -0.0100, 0.0000, 1.0000, 2.0000])
TensorFlow:
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.nn.leaky_relu(x, alpha=0.01)) # [-0.02 -0.01 0. 1. 2. ]
# In a model: tf.keras.layers.LeakyReLU(negative_slope=0.01)  # Keras 3 name; older tf.keras versions use alpha=0.01
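
To make the dying neuron argument concrete, here is a small sketch of a single neuron $y = f(wx + b)$ whose bias has been pushed far into the negative region. The weight, bias, and inputs are made-up values chosen so that every pre-activation is negative, and `weight_gradient` is a hypothetical helper, not a framework function.

```python
def relu_grad(z: float) -> float:
    """ReLU derivative with the convention ReLU'(0) = 0."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z: float, alpha: float = 0.01) -> float:
    """LeakyReLU derivative: 1 on the positive side, alpha on the negative side."""
    return 1.0 if z > 0 else alpha


def weight_gradient(act_grad, w: float, b: float, inputs, upstream: float = 1.0) -> float:
    """Summed dL/dw for one neuron y = f(w*x + b).

    Chain rule per sample: dL/dw = upstream * f'(w*x + b) * x.
    """
    return sum(upstream * act_grad(w * x + b) * x for x in inputs)


# Hypothetical state after a bad update: all pre-activations are negative.
inputs = [0.5, 1.0, 1.5, 2.0]
w, b = 1.0, -10.0

dead = weight_gradient(relu_grad, w, b, inputs)
alive = weight_gradient(leaky_relu_grad, w, b, inputs)
print("ReLU gradient:     ", dead)   # exactly zero, so w can never recover
print("LeakyReLU gradient:", alive)  # small but non-zero
```

With ReLU the weight receives no signal at all, so gradient descent cannot move it back. With the leak, the gradient is scaled down by $\alpha$ but still points in a useful direction.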
PReLU — Parametric ReLU
Identical formula to LeakyReLU, but $\alpha$ is a trainable parameter, initialized to 0.25 in PyTorch. The network learns the negative slope from data. PyTorch supports a channel-wise $\alpha$ via nn.PReLU(num_parameters=C).
Gradient w.r.t. $\alpha$: for $x < 0$, $\frac{\partial\,\text{PReLU}(x)}{\partial \alpha} = x$; for $x \ge 0$ it is $0$. This gradient propagates back and updates $\alpha$ during training.
PyTorch:
# num_parameters=1: one shared alpha; set to C for per-channel slopes
prelu = nn.PReLU(num_parameters=1, init=0.25)
x = torch.tensor([-2., -1., 0., 1., 2.])
print(prelu(x).detach()) # tensor([-0.5000, -0.2500, 0.0000, 1.0000, 2.0000]) at init
TensorFlow:
# Keras PReLU learns one slope per input element by default, initialized to 0
prelu = tf.keras.layers.PReLU(shared_axes=[1, 2]) # share across height and width: one slope per channel (NHWC)
# Not equivalent to tf.keras.layers.PReLU(), which keeps a separate slope for every activation
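
The way the slope gets learned can be sketched by hand, without a framework. The example below assumes a squared-error loss and a toy dataset whose targets were generated with a negative slope of 0.1; both are hypothetical choices for illustration, as are the helper names. It applies the chain rule with $\partial\,\text{PReLU}(x)/\partial\alpha = x$ for $x < 0$ and runs plain gradient descent on $\alpha$.

```python
def prelu(x: float, alpha: float) -> float:
    """PReLU forward pass for a single value."""
    return x if x >= 0 else alpha * x


def prelu_grad_alpha(x: float) -> float:
    """d PReLU(x) / d alpha: x on the negative side, 0 elsewhere."""
    return x if x < 0 else 0.0


def loss(alpha: float, xs, targets) -> float:
    """Squared-error loss (an assumption for this sketch)."""
    return sum(0.5 * (prelu(x, alpha) - t) ** 2 for x, t in zip(xs, targets))


def loss_grad_alpha(alpha: float, xs, targets) -> float:
    """Chain rule: dL/dalpha = sum over samples of (y - t) * dy/dalpha."""
    return sum((prelu(x, alpha) - t) * prelu_grad_alpha(x) for x, t in zip(xs, targets))


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
targets = [-0.2, -0.1, 0.0, 1.0, 2.0]  # hypothetical data with true slope 0.1

alpha = 0.25          # same starting value as nn.PReLU's default init
learning_rate = 0.1
for _ in range(50):
    alpha -= learning_rate * loss_grad_alpha(alpha, xs, targets)

print(f"learned alpha: {alpha:.4f}")  # moves from 0.25 toward 0.1
```

Only the negative inputs contribute to the update, which is why a PReLU unit that rarely sees negative pre-activations will barely change its slope.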
RReLU — Randomized Leaky ReLU
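$$\text{RReLU}(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases}, \qquad \alpha \sim \mathcal{U}(l, u)$$
During training, the negative slope $\alpha$ is sampled uniformly from $\mathcal{U}(l, u)$ (PyTorch defaults: $l = 1/8$, $u = 1/3$). At evaluation time, $\alpha$ is fixed to the mean $(l + u)/2$. The randomization acts as a form of regularization, similar in spirit to dropout.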
PyTorch:
# Slope sampled from U(lower, upper) during training; fixed to mean at eval
rrelu = nn.RReLU(lower=1./8, upper=1./3)
x = torch.tensor([-2., -1., 0., 1., 2.])
print(rrelu(x)) # negative outputs vary each forward pass during training
TensorFlow:
# No built-in RReLU in TensorFlow; implement via a custom layer
import tensorflow as tf
class RReLU(tf.keras.layers.Layer):
    def __init__(self, lower=1/8, upper=1/3, **kwargs):
        super().__init__(**kwargs)
        self.lower, self.upper = lower, upper

    def call(self, x, training=False):
        if training:
            # One random slope per element, redrawn on every forward pass
            alpha = tf.random.uniform(tf.shape(x), self.lower, self.upper, dtype=x.dtype)
        else:
            # Deterministic slope at inference: the mean of U(lower, upper)
            alpha = (self.lower + self.upper) / 2
        return tf.where(x >= 0, x, alpha * x)
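
One way to see why the evaluation slope is the midpoint of the interval: averaged over many training passes, the randomized output matches the deterministic evaluation output. The sketch below checks this with the standard library only, using the same bounds as the PyTorch example (1/8 and 1/3). The function names and the fixed seed are illustrative assumptions.

```python
import random

LOWER, UPPER = 1 / 8, 1 / 3  # same bounds as the PyTorch example


def rrelu_train(x: float, rng: random.Random) -> float:
    """Training mode: draw a fresh negative slope from U(LOWER, UPPER)."""
    if x >= 0:
        return x
    return rng.uniform(LOWER, UPPER) * x


def rrelu_eval(x: float) -> float:
    """Evaluation mode: use the mean slope (LOWER + UPPER) / 2."""
    return x if x >= 0 else (LOWER + UPPER) / 2 * x


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = -2.0
samples = [rrelu_train(x, rng) for _ in range(100_000)]
train_mean = sum(samples) / len(samples)

print("range of training outputs:", min(samples), max(samples))
print("mean of training outputs: ", train_mean)
print("evaluation output:        ", rrelu_eval(x))
```

The training outputs scatter between $u \cdot x$ and $l \cdot x$, and their average sits close to the evaluation output, so the layer behaves consistently in expectation across the two modes.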
ReLU6 — Bounded for Mobile
$$\text{ReLU6}(x) = \min(\max(0, x), 6)$$
ReLU6 clamps activations at 6, bounding the output to $[0, 6]$. This makes it friendly for fixed-point quantization: when the activation range is known and small, fewer bits are needed to represent values at a given precision. Used extensively in MobileNet.
Gradient: 1 for $0 < x < 6$, 0 otherwise.
PyTorch:
x = torch.tensor([-1., 0., 3., 6., 8.])
print(nn.ReLU6()(x)) # tensor([0., 0., 3., 6., 6.])
print(F.relu6(x)) # identical
TensorFlow:
x = tf.constant([-1., 0., 3., 6., 8.])
print(tf.nn.relu6(x)) # [0. 0. 3. 6. 6.]
# Equivalent: tf.keras.layers.ReLU(max_value=6)
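
The fixed-point benefit can be illustrated with a toy quantizer. This is a sketch under stated assumptions: unsigned 8-bit codes, a linear scale with zero mapped to code 0, and a hypothetical unbounded layer whose calibration range was stretched to 60.0 by a rare outlier. Real quantization toolchains differ in detail, and `quantize`, `dequantize`, and `max_roundtrip_error` are illustrative helpers.

```python
def relu6(x: float) -> float:
    """ReLU6: clamp to the interval [0, 6]."""
    return min(max(x, 0.0), 6.0)


def quantize(value: float, max_value: float, bits: int = 8) -> int:
    """Map a value in [0, max_value] to an unsigned integer code."""
    levels = (1 << bits) - 1          # 255 steps for 8 bits
    scale = max_value / levels        # size of one quantization step
    clipped = min(max(value, 0.0), max_value)
    return round(clipped / scale)


def dequantize(code: int, max_value: float, bits: int = 8) -> float:
    """Map an integer code back to a real value."""
    levels = (1 << bits) - 1
    return code * (max_value / levels)


def max_roundtrip_error(values, max_value: float, bits: int = 8) -> float:
    """Largest |dequantize(quantize(v)) - v| over the given values."""
    return max(
        abs(dequantize(quantize(v, max_value, bits), max_value, bits) - v)
        for v in values
    )


activations = [relu6(x) for x in (-1.0, 0.3, 1.7, 3.14, 5.9, 8.0)]

step_bounded = 6.0 / 255    # range fixed by ReLU6
step_outlier = 60.0 / 255   # hypothetical range set by one large outlier

err_bounded = max_roundtrip_error(activations, 6.0)
err_outlier = max_roundtrip_error(activations, 60.0)
print(f"step {step_bounded:.4f} -> max error {err_bounded:.4f}")
print(f"step {step_outlier:.4f} -> max error {err_outlier:.4f}")
```

With the range pinned to $[0, 6]$, all 256 codes are spent on values the layer can actually produce. When the range has to cover an outlier ten times larger, each step is ten times coarser for the same bit width.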
Comparison Table
| Activation | Formula | Range | Dying neurons? | Learnable? |
|---|---|---|---|---|
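| ReLU | $\max(0, x)$ | $[0, \infty)$ | Yes | No |
| LeakyReLU | $\max(\alpha x, x)$, fixed $\alpha$ | $(-\infty, \infty)$ | No | No |
| PReLU | $x$ if $x \ge 0$, else $\alpha x$; $\alpha$ learned | $(-\infty, \infty)$ | No | Yes |
| RReLU | $x$ if $x \ge 0$, else $\alpha x$; $\alpha \sim \mathcal{U}(l, u)$ | $(-\infty, \infty)$ | No | Stochastic |
| ReLU6 | $\min(\max(0, x), 6)$ | $[0, 6]$ | Yes (above 6 too) | No |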
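
The "Dying neurons?" column can be reproduced by probing each activation's derivative in three regions: negative, moderate positive, and above 6. This is a rough sketch in plain Python. The slopes are the ones used in the examples in this section (0.01 for LeakyReLU, 0.25 for PReLU at initialization, the mean of 1/8 and 1/3 for RReLU in evaluation mode), and `slope_gradient` is a hypothetical helper.

```python
ALPHA_LEAKY = 0.01
ALPHA_PRELU = 0.25                       # PReLU at initialization
ALPHA_RRELU_EVAL = (1 / 8 + 1 / 3) / 2   # RReLU slope in evaluation mode


def slope_gradient(alpha: float):
    """Derivative of a leaky-style activation with negative slope alpha."""
    return lambda x: 1.0 if x > 0 else alpha


gradients = {
    "ReLU": slope_gradient(0.0),
    "LeakyReLU": slope_gradient(ALPHA_LEAKY),
    "PReLU": slope_gradient(ALPHA_PRELU),
    "RReLU (eval)": slope_gradient(ALPHA_RRELU_EVAL),
    "ReLU6": lambda x: 1.0 if 0 < x < 6 else 0.0,
}

probes = [-1.0, 3.0, 8.0]  # negative, moderate positive, above the ReLU6 cap
dead_zone = {}
for name, grad in gradients.items():
    values = [grad(x) for x in probes]
    dead_zone[name] = any(v == 0.0 for v in values)
    print(f"{name:13s} gradients at {probes}: {values}  dead zone: {dead_zone[name]}")
```

ReLU has a zero-gradient region on the negative side only, while ReLU6 has one on both sides of its linear segment, which matches the "above 6 too" note in the table.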
When to Use Which
- ReLU: Default choice — fast, simple, works well with BatchNorm
- LeakyReLU: When dying neurons are a problem (e.g., GANs, deep unsupervised models)
- PReLU: When you want adaptive negative slope and can afford extra parameters
- RReLU: As a regularization technique; reported to help on small datasets
- ReLU6: MobileNet, quantized models, edge deployment