What Is an Activation Function?
- Explain why non-linear activation functions are necessary and what a network without them can only represent
- Trace the gradient through a single activation using the chain rule and identify where vanishing gradients originate
- Distinguish the vanishing gradient problem from the dying ReLU problem and state a remedy for each
- Select an activation function family for a given architecture and task using the practical selection guide
The Role of Non-Linearity
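Without activation functions, a neural network with any number of layers collapses to a single linear transformation. If every layer computes $h_l = W_l h_{l-1} + b_l$ (with $h_0 = x$), the entire network reduces to $y = W x + b$ for a single matrix $W$ and vector $b$. No amount of depth helps: you can express the same function with a single weight matrix.
Activation functions break this linearity by applying a non-linear function $f$ elementwise after each linear step:
$$h_l = f(W_l h_{l-1} + b_l)$$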
This is what makes neural networks universal approximators: with enough width, even a two-layer network with a suitable (non-polynomial) non-linear activation can approximate any continuous function on a compact domain (Universal Approximation Theorem).
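The collapse of stacked linear layers can be checked numerically. The sketch below uses plain Python and small hand-picked 2x2 weights; the values and the helper names `matvec`, `matmul`, and `relu` are illustrative assumptions, and biases are left out for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, vi) for vi in v]

# Illustrative weights, chosen by hand (not taken from any trained model).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

# Two linear layers applied in sequence...
two_linear = matvec(W2, matvec(W1, x))
# ...give the same result as one layer whose weight matrix is the product W2 W1.
one_linear = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, the merged matrix no longer reproduces
# the output for this input.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_linear, one_linear, with_relu)
```

For this input the two purely linear paths agree, while the ReLU path differs because the first layer produces a negative component that ReLU clips to zero.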
Backpropagation and the Chain Rule
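Every activation function must be differentiable (or at least sub-differentiable) because training uses gradient descent via backpropagation. For a loss $L$, the gradient at layer $l$ flows backward through the activation $f$. Writing $z_l$ for the layer's pre-activation and $h_l = f(z_l)$ for its output:
$$\frac{\partial L}{\partial z_l} = \frac{\partial L}{\partial h_l} \odot f'(z_l)$$
The term $f'(z_l)$, multiplied in at every layer, determines whether gradients grow, shrink, or stay stable as they propagate back through many layers.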
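A single chain-rule step through an activation can be verified on scalars. The sketch below assumes a Sigmoid activation and a squared-error loss on its output; both choices, the example value of `z`, and the function names are illustrative. It compares the analytic gradient with a central finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Local derivative of the activation: f'(z) = f(z) * (1 - f(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def loss(h, target=1.0):
    # Squared error on the activation output (an illustrative choice of loss).
    return 0.5 * (h - target) ** 2

z = 0.3                            # pre-activation (arbitrary example value)
h = sigmoid(z)                     # forward pass through the activation
dL_dh = h - 1.0                    # upstream gradient of the loss w.r.t. h
dL_dz = dL_dh * sigmoid_grad(z)    # chain rule: multiply by the local derivative

# Central finite difference as an independent check of the analytic gradient.
eps = 1e-6
numeric = (loss(sigmoid(z + eps)) - loss(sigmoid(z - eps))) / (2 * eps)

print(f"analytic: {dL_dz:.8f}  numeric: {numeric:.8f}")
```

The two values should agree to several decimal places; a mismatch of this kind is a common way to catch an incorrectly derived activation gradient.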
The Vanishing Gradient Problem
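Saturating activations like Sigmoid and Tanh compress their input $z$ to a bounded range. Their derivatives approach zero for large $|z|$:
$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \le \tfrac{1}{4}, \qquad \tanh'(z) = 1 - \tanh^2(z) \le 1$$
In a 50-layer Sigmoid network, multiplying 50 such factors together, each at most 0.25, bounds the activation contribution to the gradient by $0.25^{50} \approx 8 \times 10^{-31}$, which is numerically zero: the early layers receive almost no learning signal. ReLU and its variants avoid this saturation on positive inputs, where their derivative is exactly 1.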
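The shrinkage is easy to reproduce. The sketch below isolates the product of activation derivatives across 50 layers and deliberately ignores the weight matrices, which also scale the gradient in a real network; the pre-activation values used are illustrative assumptions.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

depth = 50

# Best case for Sigmoid: every pre-activation is exactly 0, where f'(0) = 0.25.
best_case = sigmoid_grad(0.0) ** depth

# A mildly saturated case: every pre-activation is 2.0 (illustrative value).
saturated = sigmoid_grad(2.0) ** depth

# ReLU on positive pre-activations contributes a factor of exactly 1 per layer.
relu_active = 1.0 ** depth

print(f"sigmoid, z=0: {best_case:.3e}")
print(f"sigmoid, z=2: {saturated:.3e}")
print(f"relu, z>0:    {relu_active:.1f}")
```

Even the best case for Sigmoid leaves a factor far below anything a typical learning rate can compensate for, while the active ReLU path passes the gradient through unchanged.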
The Dying ReLU Problem
ReLU mitigates vanishing gradients, but introduces a new failure mode: a neuron that outputs 0 for every input is dead. If a neuron's pre-activation is always negative (e.g., after a large negative update), its gradient is always 0, its weights stop updating, and it never recovers.
LeakyReLU, PReLU, ELU, and SELU all keep a non-zero gradient for negative inputs, which addresses the dying neuron problem. LeakyReLU and PReLU retain ReLU's computational simplicity; ELU and SELU add an exponential on the negative side.
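The difference shows up directly in the local derivatives. The sketch below is a minimal scalar illustration, not framework code: the function names, the batch of pre-activations, and the slope of 0.01 are assumptions chosen for the example.

```python
def relu_grad(z):
    # Derivative of ReLU; 0 is used at z == 0, a common sub-gradient convention.
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, negative_slope=0.01):
    # Derivative of LeakyReLU: a small constant slope for non-positive inputs.
    return 1.0 if z > 0 else negative_slope

# Pre-activations of one neuron over a batch, all negative (illustrative values),
# as might happen after a large negative update to its bias.
pre_activations = [-3.2, -0.7, -5.1, -1.4]

relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

# The ReLU neuron gets no gradient from any example, so its weights cannot change.
relu_is_dead = all(g == 0.0 for g in relu_grads)
# The LeakyReLU neuron still gets a small signal and can move back into the active region.
leaky_is_dead = all(g == 0.0 for g in leaky_grads)

print(relu_is_dead, leaky_is_dead)
```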
Key Properties to Compare
When choosing an activation function, four properties matter most:
| Property | Why It Matters |
|---|---|
| Output range | Bounded outputs (Sigmoid, Tanh) constrain values but may saturate; unbounded (ReLU) can explode |
| Smoothness | Smooth (GELU, SiLU, Mish) have well-defined gradients everywhere; non-smooth (ReLU) have sub-gradients at kinks |
| Monotonicity | Most activations are monotone; non-monotone ones (SiLU, which dips to a minimum near an input of about -1.28) can represent more complex functions |
| Zero-centering | Zero-centered outputs (Tanh) reduce zig-zag gradient updates; non-zero-centered (ReLU, Sigmoid) can slow convergence |
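
The non-monotonicity of SiLU noted in the table can be confirmed by sampling the function. The sketch below is a rough numerical check in plain Python; the grid range and step size are arbitrary choices, so the located minimum is only as precise as the grid.

```python
import math

def silu(x):
    # SiLU (also called swish): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

# Sample the function on a grid over [-5, 5] with step 0.01.
xs = [i / 100.0 for i in range(-500, 501)]
ys = [silu(x) for x in xs]

# Locate the minimum; a monotone increasing function would have it at the left edge.
min_index = min(range(len(ys)), key=ys.__getitem__)
x_min, y_min = xs[min_index], ys[min_index]

print(f"SiLU minimum on the grid: x = {x_min:.2f}, value = {y_min:.4f}")
```

The minimum falls in the interior of the grid rather than at its left edge, so the function decreases and then increases: it is not monotone.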
Practical Selection Guide
| Scenario | Recommended Activation |
|---|---|
| Default hidden layers | ReLU, then try GELU or SiLU |
| Transformers / NLP | GELU (BERT, GPT), SiLU (LLaMA) |
| Mobile / edge inference | ReLU6, Hardswish, Hardsigmoid |
| Multi-class output | Softmax |
| Binary output | Sigmoid |
| Large vocabulary NLP | AdaptiveLogSoftmax |
| Variance / positive params | Softplus |
| Self-normalizing networks | SELU |
| Sparse representations | Hardshrink, Softshrink |
Using Activations in Practice
PyTorch — inline with nn.Sequential:
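import torch.nn as nn
# Activations are inserted as modules between the linear layers.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.GELU(),
    nn.Linear(128, 10),  # output layer returns raw logits
)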
PyTorch — functional style (stateless activations only):
import torch.nn.functional as F
# Method of an nn.Module subclass that defines the layers fc1, fc2, and fc3.
def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.gelu(self.fc2(x))
    return self.fc3(x)
TensorFlow/Keras — string shorthand:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='gelu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
TensorFlow/Keras — layer objects (configurable):
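# Dense layers without an activation argument are linear; the activation follows as its own layer.
# Note: negative_slope is the Keras 3 argument name; older Keras versions call it alpha.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256),
    tf.keras.layers.LeakyReLU(negative_slope=0.01),
    tf.keras.layers.Dense(128),
    tf.keras.layers.ELU(alpha=1.0),
    tf.keras.layers.Dense(10, activation='softmax'),
])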