Supplement · Activation Functions

What Is an Activation Function?

By the end of this reading you will be able to:
  • Explain why non-linear activation functions are necessary and what a network without them is limited to representing
  • Trace the gradient through a single activation using the chain rule and identify where vanishing gradients originate
  • Distinguish the vanishing gradient problem from the dying ReLU problem and state a remedy for each
  • Select an activation function family for a given architecture and task using the practical selection guide

The Role of Non-Linearity

Without activation functions, a neural network with any number of layers collapses to a single linear transformation. If every layer computes $z^{(l)} = W^{(l)} z^{(l-1)}$, the entire network reduces to $z^{(L)} = (W^{(L)} \cdots W^{(1)}) x = Wx$. No amount of depth helps: you can express the same function with a single weight matrix.
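
The collapse can be checked numerically. The following is a minimal NumPy sketch; the layer sizes, the random weights, and the names `W1`, `W2`, `W3` are arbitrary choices for illustration, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three bias-free linear layers with arbitrary (hypothetical) sizes: 4 -> 5 -> 3 -> 2.
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(3, 5))
W3 = rng.normal(size=(2, 3))
x = rng.normal(size=4)

# Forward pass through the "deep" network, one layer at a time, no activation.
deep = W3 @ (W2 @ (W1 @ x))

# The same map written as a single weight matrix W = W3 W2 W1.
W = W3 @ W2 @ W1
shallow = W @ x

print(np.allclose(deep, shallow))
```

The two outputs agree up to floating-point rounding, so the three-layer stack adds no expressive power over the single 2×4 matrix `W`.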

Activation functions $\sigma$ break this linearity:

$$z^{(l)} = \sigma\!\left(W^{(l)} z^{(l-1)} + b^{(l)}\right)$$

This is what makes deep networks universal approximators: with enough width, a two-layer network with a non-linear activation can approximate any continuous function on a compact domain (Universal Approximation Theorem).

Backpropagation and the Chain Rule

Every activation function must be differentiable (or at least sub-differentiable) because training uses gradient descent via backpropagation. For a loss $\mathcal{L}$, the gradient at layer $l$ flows backward through $\sigma'$:

$$\frac{\partial \mathcal{L}}{\partial z^{(l-1)}} = W^{(l)\top} \left( \frac{\partial \mathcal{L}}{\partial z^{(l)}} \odot \sigma'\!\left(W^{(l)} z^{(l-1)} + b^{(l)}\right) \right)$$

Here $\odot$ is the element-wise product, which appears because $\sigma$ is applied element-wise. The factor $\sigma'$ is multiplied in at every layer, so it determines whether gradients grow, shrink, or stay stable as they propagate back through many layers.
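
One way to sanity-check the backward formula is to compare it against finite differences. The sketch below assumes a sigmoid activation, a toy loss equal to the sum of the layer outputs, and reads the products in the formula as a transpose-matrix product and an element-wise product; the function names `layer` and `backward` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer(z_prev, W, b):
    """Forward pass of one layer: z = sigmoid(W z_prev + b)."""
    return sigmoid(W @ z_prev + b)

def backward(z_prev, W, b, grad_z):
    """Chain rule: dL/dz_prev = W^T (dL/dz * sigmoid'(W z_prev + b))."""
    s = sigmoid(W @ z_prev + b)
    return W.T @ (grad_z * s * (1.0 - s))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
z_prev = rng.normal(size=4)

# Toy loss L = sum(z), so dL/dz is a vector of ones.
grad_z = np.ones(3)
analytic = backward(z_prev, W, b, grad_z)

# Central finite differences as an independent check of each component.
eps = 1e-6
numeric = np.zeros_like(z_prev)
for i in range(z_prev.size):
    step = np.zeros_like(z_prev)
    step[i] = eps
    numeric[i] = (layer(z_prev + step, W, b).sum()
                  - layer(z_prev - step, W, b).sum()) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

If the analytic and numeric gradients agree, the chain-rule expression is implemented consistently with the forward pass. Frameworks such as PyTorch compute the same quantity automatically.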

The Vanishing Gradient Problem

Saturating activations like Sigmoid and Tanh compress their input to a bounded range. Their derivatives approach zero for large $|x|$:

$$\sigma'(x) = \sigma(x)(1-\sigma(x)) \xrightarrow{|x| \to \infty} 0$$

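The sigmoid derivative never exceeds 0.25, its value at $x = 0$. In a 50-layer sigmoid network, the backward pass therefore multiplies 50 factors that are each at most 0.25. Ignoring the weight matrices, the product is at most $0.25^{50} \approx 8 \times 10^{-31}$, which is effectively zero, so the early layers receive almost no learning signal. ReLU and its variants avoid this because their derivative does not shrink for positive inputs.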
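
The arithmetic behind this estimate can be reproduced in a few lines. This sketch looks only at the activation derivatives and ignores the weight matrices, which is a simplifying assumption; the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)    # largest value the sigmoid derivative can take
bound_50 = max_grad ** 50       # upper bound on the product across 50 layers
saturated = sigmoid_grad(10.0)  # derivative deep in the saturated region

print(max_grad)   # 0.25
print(bound_50)   # about 7.9e-31
print(saturated)  # about 4.5e-05
```

Even in the best case, where every unit sits exactly at its steepest point, the product of 50 derivatives is around 30 orders of magnitude below 1. Saturated units make it smaller still.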

The Dying ReLU Problem

ReLU mitigates vanishing gradients but introduces a new failure mode: a neuron whose output is always 0 is dead. If a neuron's pre-activation is negative for every input (e.g., after a large update pushes its weights or bias strongly negative), its gradient is always 0, its weights stop updating, and it never recovers.

LeakyReLU, PReLU, ELU, and SELU all keep a non-zero gradient for negative inputs, which addresses the dying neuron problem. LeakyReLU and PReLU retain ReLU's computational simplicity; ELU and SELU add an exponential for negative inputs.
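
The difference between a dead ReLU neuron and its leaky counterpart can be illustrated with a single hypothetical neuron. The weight, bias, inputs, and the default slope of 0.01 below are made-up values chosen so that every pre-activation is negative.

```python
def relu_grad(a):
    """Sub-gradient of ReLU at pre-activation a (taking 0 at a == 0)."""
    return 1.0 if a > 0 else 0.0

def leaky_relu_grad(a, negative_slope=0.01):
    """Derivative of LeakyReLU: 1 for positive inputs, a small slope otherwise."""
    return 1.0 if a > 0 else negative_slope

# Hypothetical neuron whose bias has been pushed strongly negative.
w, b = 0.5, -10.0
inputs = [0.1, 1.0, 2.5, 4.0]
pre_activations = [w * x + b for x in inputs]  # all negative

relu_grads = [relu_grad(a) for a in pre_activations]
leaky_grads = [leaky_relu_grad(a) for a in pre_activations]

print(relu_grads)   # [0.0, 0.0, 0.0, 0.0]
print(leaky_grads)  # [0.01, 0.01, 0.01, 0.01]
```

With ReLU, no input in this set produces a gradient, so the weight and bias can never move back. With the leaky variant, a small gradient still flows and the neuron can recover.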

Key Properties to Compare

When choosing an activation function, four properties matter most:

| Property | Why It Matters |
| --- | --- |
| Output range | Bounded outputs (Sigmoid, Tanh) constrain values but may saturate; unbounded (ReLU) can explode |
| Smoothness | Smooth (GELU, SiLU, Mish) have well-defined gradients everywhere; non-smooth (ReLU) have sub-gradients at kinks |
| Monotonicity | Most activations are monotone; non-monotone ones (SiLU near $x \approx -1.3$) can represent more complex functions |
| Zero-centering | Zero-centered outputs (Tanh) reduce zig-zag gradient updates; non-zero-centered (ReLU, Sigmoid) can slow convergence |
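
The non-monotonicity of SiLU mentioned in the table can be seen with a simple grid search. This sketch uses the standard definition SiLU(x) = x · sigmoid(x); the grid range and step size are arbitrary choices.

```python
import math

def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Scan [-5, 5] in steps of 0.001 and pick the grid point with the lowest output.
xs = [i / 1000.0 for i in range(-5000, 5001)]
x_min = min(xs, key=silu)

print(x_min, silu(x_min))     # roughly -1.278 and -0.278
print(silu(-3.0), silu(0.0))  # both larger than the minimum value
```

The output decreases as x moves from 0 toward about -1.28 and then rises back toward 0 for more negative x, so the function is not monotone on the negative side.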

Practical Selection Guide

| Scenario | Recommended Activation |
| --- | --- |
| Default hidden layers | ReLU, then try GELU or SiLU |
| Transformers / NLP | GELU (BERT, GPT), SiLU (LLaMA) |
| Mobile / edge inference | ReLU6, Hardswish, Hardsigmoid |
| Multi-class output | Softmax |
| Binary output | Sigmoid |
| Large vocabulary NLP | AdaptiveLogSoftmax |
| Variance / positive params | Softplus |
| Self-normalizing networks | SELU |
| Sparse representations | Hardshrink, Softshrink |

Using Activations in Practice

PyTorch — inline with nn.Sequential:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.GELU(),
    nn.Linear(128, 10)
)

PyTorch — functional style (stateless activations only):

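import torch.nn.functional as F

# forward() belongs to an nn.Module subclass whose __init__ defines
# self.fc1, self.fc2, and self.fc3 (for example as nn.Linear layers).
def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.gelu(self.fc2(x))
    return self.fc3(x)  # raw logits, no output activation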

TensorFlow/Keras — string shorthand:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='gelu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

TensorFlow/Keras — layer objects (configurable):

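import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256),
    # Keras 3 names this argument negative_slope; Keras 2 (TF 2.15 and earlier) calls it alpha
    tf.keras.layers.LeakyReLU(negative_slope=0.01),
    tf.keras.layers.Dense(128),
    tf.keras.layers.ELU(alpha=1.0),
    tf.keras.layers.Dense(10, activation='softmax')
])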

References
Cybenko (1989) — Universal Approximation Theorem — Proved that a two-layer network with sigmoid activations can approximate any continuous function
Hochreiter (1991) — Vanishing Gradient Problem — Identified that saturating activations cause gradients to vanish in deep networks