Supplement · Activation Functions

What Is an Activation Function?

By the end of this reading you will be able to:

Explain why non-linear activation functions are necessary and what a network without them can only represent
Trace the gradient through a single activation using the chain rule and identify where vanishing gradients originate
Distinguish the vanishing gradient problem from the dying ReLU problem and state a remedy for each
Select an activation function family for a given architecture and task using the practical selection guide

The Role of Non-Linearity

Without activation functions, a neural network with any number of layers collapses to a single linear transformation. If every layer computes $z^{(l)} = W^{(l)} z^{(l-1)}$, the entire network reduces to $z^{(L)} = (W^{(L)} \cdots W^{(1)}) x = Wx$. No amount of depth helps: the same function can be expressed with a single weight matrix. Adding biases changes nothing essential, since a composition of affine maps is still one affine map.
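
The collapse can be checked numerically. The sketch below uses small hand-picked weight matrices (`W1`, `W2`, `W3` and the helper names are hypothetical, chosen only for illustration) and compares a three-layer forward pass without activations against a single multiplication by the product matrix.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


# Hypothetical weights for a 3-layer network with no activations.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 3 units  -> 2 units
W3 = [[2.0, 1.0]]                            # 2 units  -> 1 output

x = [0.5, -2.0]

# Layer-by-layer forward pass: W3 (W2 (W1 x)).
deep_out = matvec(W3, matvec(W2, matvec(W1, x)))

# Collapse all three layers into one matrix W = W3 W2 W1.
W = matmul(W3, matmul(W2, W1))
shallow_out = matvec(W, x)

print(deep_out, shallow_out)   # the two results agree
```

The single matrix `W` has shape 1 x 2, so the three-layer stack carries no more expressive power than one linear layer from 2 inputs to 1 output.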

Activation functions $\sigma$ break this linearity:

$z^{(l)} = \sigma\!\left(W^{(l)} z^{(l-1)} + b^{(l)}\right)$

This non-linearity is what makes neural networks universal approximators: given enough hidden units, a two-layer network (one hidden layer) with a suitable non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy (Universal Approximation Theorem).
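
As a small illustration of what the non-linearity buys (an example, not a demonstration of the theorem), the sketch below builds a network with one hidden layer of two ReLU units that computes $|x|$ exactly, a function no purely linear network can represent. The weights are hand-picked and the name `tiny_net` is hypothetical.

```python
def relu(v):
    return max(0.0, v)


def tiny_net(x):
    """One hidden layer with two ReLU units and hand-picked weights."""
    h1 = relu(1.0 * x)    # active for positive x
    h2 = relu(-1.0 * x)   # active for negative x
    return 1.0 * h1 + 1.0 * h2   # output layer sums the two units: |x|


# A linear map f satisfies f(a + b) == f(a) + f(b); this network does not.
a, b = 2.0, -3.0
print(tiny_net(a + b), tiny_net(a) + tiny_net(b))
```

The two printed values differ, which confirms the network is not linear even though each layer on its own is just a weighted sum.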

Backpropagation and the Chain Rule

Every activation function must be differentiable (or at least sub-differentiable) because training uses gradient descent via backpropagation. For a loss $\mathcal{L}$, the gradient at layer $l$ flows backward through $\sigma'$:

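$\frac{\partial \mathcal{L}}{\partial z^{(l-1)}} = W^{(l)\top} \left( \frac{\partial \mathcal{L}}{\partial z^{(l)}} \odot \sigma'\!\left(W^{(l)} z^{(l-1)} + b^{(l)}\right) \right)$

Here $\odot$ is the element-wise product: the incoming gradient is first scaled unit by unit by $\sigma'$ evaluated at the pre-activation, then mapped back through $W^{(l)\top}$.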

The term $\sigma'$ multiplied at every layer determines whether gradients grow, shrink, or stay stable as they propagate back through many layers.
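
The backward step through one layer can be sketched in plain Python. This is a minimal illustration under stated assumptions: a sigmoid activation, a hypothetical 2-unit layer with 3 inputs, and the loss $\mathcal{L} = \sum_i z_i^{(l)}$ so that the incoming gradient is all ones. The incoming gradient is scaled element-wise by $\sigma'$ and then multiplied by the transposed weights, and the result is compared against central finite differences.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(W, b, z_prev):
    """Return pre-activations a = W z_prev + b and outputs z = sigmoid(a)."""
    a = [sum(w * v for w, v in zip(row, z_prev)) + bi for row, bi in zip(W, b)]
    return a, [sigmoid(ai) for ai in a]


def backward(W, a, grad_z):
    """Map dL/dz (layer output) back to dL/dz_prev (layer input)."""
    # Scale each incoming gradient by sigma'(a_i) = s * (1 - s).
    delta = [g * sigmoid(ai) * (1.0 - sigmoid(ai)) for g, ai in zip(grad_z, a)]
    # Multiply by the transposed weight matrix.
    return [sum(W[i][j] * delta[i] for i in range(len(W))) for j in range(len(W[0]))]


# Hypothetical layer parameters and input.
W = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
b = [0.1, -0.2]
z_prev = [0.4, -0.6, 0.9]


def loss(z_in):
    return sum(forward(W, b, z_in)[1])


a, z = forward(W, b, z_prev)
analytic = backward(W, a, [1.0, 1.0])   # dL/dz is 1 for each output

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for j in range(len(z_prev)):
    plus = z_prev[:]
    minus = z_prev[:]
    plus[j] += eps
    minus[j] -= eps
    numeric.append((loss(plus) - loss(minus)) / (2 * eps))

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, which is a common sanity check when implementing a backward pass by hand.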

The Vanishing Gradient Problem

Saturating activations like Sigmoid and Tanh compress their input to a bounded range, and their derivatives approach zero for large $|x|$. For Sigmoid, with $\sigma$ denoting the sigmoid function here:

$\sigma'(x) = \sigma(x)(1-\sigma(x)) \xrightarrow{|x| \to \infty} 0$

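In a 50-layer sigmoid network, the backward pass multiplies 50 such derivative factors together. The sigmoid derivative never exceeds 0.25, so even in the best case (and ignoring the weight matrices) their product is at most $0.25^{50} \approx 8 \times 10^{-31}$, and the early layers receive essentially no learning signal. ReLU, whose derivative is exactly 1 for positive inputs, does not saturate in this way, which is a main reason it and its variants became the default for deep networks.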
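
The arithmetic is easy to reproduce. The sketch below considers only the product of activation derivatives and ignores the weight matrices, which is a simplifying assumption; the pre-activation value 5 in the saturated case is an arbitrary choice for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)               # largest value the derivative can take
best_case = peak ** 50                 # 50 layers, every unit at its peak
saturated = sigmoid_grad(5.0) ** 50    # 50 layers, every pre-activation at 5

print(peak, best_case, saturated)
```

Even the best case is around thirty orders of magnitude below 1, and the saturated case is smaller still by a wide margin.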

The Dying ReLU Problem

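ReLU mitigates vanishing gradients but introduces a new failure mode: a neuron whose pre-activation is negative for every input always outputs 0 and is called dead. This can happen, for example, after a large gradient update pushes its weights or bias far negative. Because ReLU's gradient is 0 for negative pre-activations, no gradient reaches the neuron's parameters and it never recovers.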
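
A dead neuron can be simulated in a few lines. The sketch below assumes a single ReLU neuron with a hypothetical, strongly negative bias, a squared-error loss, and plain gradient descent; all values are made up for illustration.

```python
def relu(a):
    return max(0.0, a)


def relu_grad(a):
    # Sub-gradient convention: use 0 at a == 0 (a common choice).
    return 1.0 if a > 0 else 0.0


# Hypothetical neuron whose bias was pushed far negative by a bad update.
w, b = 0.8, -50.0
inputs = [-3.0, -1.0, 0.0, 2.0, 5.0]
targets = [1.0] * len(inputs)
lr = 0.1

for step in range(100):
    grad_w = grad_b = 0.0
    for x, t in zip(inputs, targets):
        a = w * x + b                  # negative for every input here
        err = relu(a) - t              # derivative of 0.5 * err**2 w.r.t. the output
        grad_w += err * relu_grad(a) * x
        grad_b += err * relu_grad(a)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # parameters are unchanged after 100 steps
```

The error is non-zero on every example, yet the parameters never move, because each gradient term is multiplied by a ReLU derivative of 0.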

LeakyReLU, PReLU, ELU, and SELU all keep a non-zero gradient for negative inputs, which addresses the dying neuron problem. LeakyReLU and PReLU retain ReLU's computational simplicity; ELU and SELU evaluate an exponential for negative inputs and so cost slightly more.
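
The difference shows up directly in the derivatives. The sketch below writes out the gradients of ReLU, LeakyReLU, and ELU by hand; the defaults `negative_slope=0.01` and `alpha=1.0` are assumptions chosen to match commonly used values, and the function names are illustrative rather than any library's API.

```python
import math


def relu_grad(a):
    return 1.0 if a > 0 else 0.0


def leaky_relu_grad(a, negative_slope=0.01):
    return 1.0 if a > 0 else negative_slope


def elu_grad(a, alpha=1.0):
    # ELU(a) = a for a > 0, alpha * (exp(a) - 1) otherwise.
    return 1.0 if a > 0 else alpha * math.exp(a)


a = -2.0   # a negative pre-activation
grads = {
    "relu": relu_grad(a),
    "leaky_relu": leaky_relu_grad(a),
    "elu": elu_grad(a),
}
print(grads)
```

Only ReLU reports a gradient of exactly 0. Note that ELU's gradient still decays toward 0 for very negative inputs, so it is small rather than absent.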

Key Properties to Compare

When choosing an activation function, four properties matter most:

Property	Why It Matters
Output range	Bounded outputs (Sigmoid, Tanh) constrain values but saturate for large inputs; unbounded outputs (ReLU) do not saturate for positive inputs but can let activations grow very large
Smoothness	Smooth activations (GELU, SiLU, Mish) have a well-defined gradient everywhere; non-smooth ones (ReLU) have only a sub-gradient at the kink
Monotonicity	Most activations are monotone; non-monotone ones (SiLU, which dips to a minimum near $x \approx -1.28$) may add some expressiveness
Zero-centering	Zero-centered outputs (Tanh) reduce zig-zag gradient updates; non-zero-centered (ReLU, Sigmoid) can slow convergence
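
The monotonicity row can be checked numerically. The sketch below evaluates SiLU and Tanh on a grid (the range $[-6, 6]$ and step 0.001 are arbitrary choices), locates SiLU's minimum, and tests whether each function is non-decreasing.

```python
import math


def silu(x):
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# Evaluate on a grid over [-6, 6] with step 0.001.
xs = [i / 1000.0 for i in range(-6000, 6001)]
ys = [silu(x) for x in xs]

# Location and value of the minimum on the grid.
i_min = min(range(len(xs)), key=ys.__getitem__)
x_min, y_min = xs[i_min], ys[i_min]

# Monotone here means the outputs never decrease as x increases.
silu_monotone = all(y2 >= y1 for y1, y2 in zip(ys, ys[1:]))
tanh_vals = [math.tanh(x) for x in xs]
tanh_monotone = all(y2 >= y1 for y1, y2 in zip(tanh_vals, tanh_vals[1:]))

print(x_min, y_min, silu_monotone, tanh_monotone)
```

SiLU decreases until its minimum and increases afterwards, so it fails the monotonicity check, while Tanh passes it.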

Practical Selection Guide

Scenario	Recommended Activation
Default hidden layers	ReLU, then try GELU or SiLU
Transformers / NLP	GELU (BERT, GPT), SiLU (LLaMA)
Mobile / edge inference	ReLU6, Hardswish, Hardsigmoid
Multi-class output	Softmax
Binary output	Sigmoid
Large vocabulary NLP	AdaptiveLogSoftmax
Variance / positive params	Softplus
Self-normalizing networks	SELU
Sparse representations	Hardshrink, Softshrink

Using Activations in Practice

PyTorch — inline with nn.Sequential:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.GELU(),
    nn.Linear(128, 10)
)

PyTorch — functional style (stateless activations only):

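import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):  # layer sizes assumed, mirroring the nn.Sequential example
    def __init__(self):
        super().__init__()
        self.fc1, self.fc2, self.fc3 = nn.Linear(784, 256), nn.Linear(256, 128), nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))   # stateless call, no activation module to register
        x = F.gelu(self.fc2(x))
        return self.fc3(x)        # raw logits; no activation on the output layer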

TensorFlow/Keras — string shorthand:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='gelu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

TensorFlow/Keras — layer objects (configurable):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256),
    # negative_slope is the Keras 3 argument name; Keras 2 (TF <= 2.15) used alpha
    tf.keras.layers.LeakyReLU(negative_slope=0.01),
    tf.keras.layers.Dense(128),
    tf.keras.layers.ELU(alpha=1.0),
    tf.keras.layers.Dense(10, activation='softmax')
])

References

Cybenko (1989) — Universal Approximation Theorem — Proved that a two-layer network with sigmoid activations can approximate any continuous function

Hochreiter (1991) — Vanishing Gradient Problem — Identified that saturating activations cause gradients to vanish in deep networks
