Saturating Activations
- Derive the sigmoid derivative as sigma(x)(1 - sigma(x)) and explain why its maximum of 0.25 causes vanishing gradients in deep networks
- Contrast sigmoid and tanh in terms of output range and zero-centring, and explain why zero-centred outputs reduce gradient zig-zagging
- Identify Hardsigmoid and Hardtanh as piecewise-linear approximations and state their computational advantage for mobile inference
- Select among the six saturating activations given constraints on output range, computational budget, and gradient flow requirements
Sigmoid
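The sigmoid function $\sigma(x) = 1 / (1 + e^{-x})$ maps any real input to $(0, 1)$, making it a natural choice for binary probability outputs. Its S-shaped curve saturates at both ends: $\sigma(x) \to 0$ as $x \to -\infty$ and $\sigma(x) \to 1$ as $x \to +\infty$.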
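Derivative: The sigmoid has a clean self-referential derivative. Differentiating $\sigma(x) = (1 + e^{-x})^{-1}$ gives $e^{-x} / (1 + e^{-x})^2$, which factors as:
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
The maximum derivative is $0.25$ (at $x = 0$, where $\sigma(0) = 0.5$). In a 10-layer network, the product of 10 sigmoid derivatives is therefore at most $0.25^{10} \approx 9.5 \times 10^{-7}$, effectively zero, and smaller still once any unit saturates. This is the vanishing gradient problem in numbers.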
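The bound can be checked numerically. The following is a minimal sketch in plain Python; the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration, and the product deliberately ignores the weight matrices that also enter a real backward pass.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative written in terms of the function value itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks quickly on both sides.
for x in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"x = {x:+.1f}  sigmoid'(x) = {sigmoid_grad(x):.4f}")

# Best case for a chain of 10 sigmoids: every pre-activation sits at 0.
best_case = sigmoid_grad(0.0) ** 10
print(f"upper bound over 10 layers: {best_case:.3e}")
```

Any pre-activation away from zero makes the corresponding factor smaller than 0.25, so the printed bound is the most favourable case rather than a typical one.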
PyTorch:
import torch
import torch.nn as nn
x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Sigmoid()(x)) # tensor([0.0474, 0.2689, 0.5000, 0.7311, 0.9526])
print(torch.sigmoid(x)) # identical
TensorFlow:
import tensorflow as tf
x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.sigmoid(x)) # [0.0474 0.2689 0.5000 0.7311 0.9526]
# In a model output: tf.keras.layers.Dense(1, activation='sigmoid')
Tanh — Zero-Centered Sigmoid
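Tanh maps any real input to $(-1, 1)$ and is zero-centered, meaning its outputs average near zero when its inputs do. This is advantageous because non-zero-centered activations (like Sigmoid) feed the next layer inputs that are all positive, so the gradients with respect to all incoming weights of a given neuron share the same sign (all positive or all negative). The weights can then only move together in one direction per step, creating zig-zag updates in weight space.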
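The sign argument can be made concrete for a single neuron $z = \sum_i w_i a_i + b$, where $\partial L / \partial w_i = (\partial L / \partial z)\, a_i$. The sketch below uses made-up pre-activations and a made-up upstream gradient (both hypothetical values, as is the helper `weight_grads`) to compare the two cases.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def weight_grads(inputs, upstream):
    """dL/dw_i = upstream * a_i for a neuron z = sum_i w_i * a_i + b."""
    return [upstream * a for a in inputs]

# Hypothetical pre-activations of the previous layer.
pre_acts = [-2.0, -0.5, 0.3, 1.5]
upstream = -0.8  # hypothetical dL/dz arriving at the neuron

sig_inputs = [sigmoid(p) for p in pre_acts]     # all in (0, 1)
tanh_inputs = [math.tanh(p) for p in pre_acts]  # mixed signs

sig_grads = weight_grads(sig_inputs, upstream)
tanh_grads = weight_grads(tanh_inputs, upstream)

print("sigmoid inputs ->", [f"{g:+.3f}" for g in sig_grads])
print("tanh inputs    ->", [f"{g:+.3f}" for g in tanh_grads])
```

With sigmoid inputs every weight gradient takes the sign of the upstream gradient; with tanh inputs the signs are mixed, so individual weights can move in different directions within one update.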
Derivative:
$$\tanh'(x) = 1 - \tanh^2(x)$$
Maximum derivative is $1$ at $x = 0$, but it still vanishes as $|x| \to \infty$. Tanh still suffers from vanishing gradients in very deep networks.
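As a quick sanity check, the analytic derivative $1 - \tanh^2(x)$ can be compared against a finite-difference estimate. This is a sketch only; `tanh_grad` and `central_diff` are names introduced here, and the step size `h` is an arbitrary choice.

```python
import math

def tanh_grad(x: float) -> float:
    """Analytic derivative: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Finite-difference estimate of f'(x), used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (0.0, 1.0, 3.0, 6.0):
    analytic = tanh_grad(x)
    numeric = central_diff(math.tanh, x)
    print(f"x = {x:.1f}  analytic = {analytic:.6f}  numeric = {numeric:.6f}")
```

The two columns should agree closely, and both fall towards zero as the input moves away from the origin.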
PyTorch:
x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Tanh()(x)) # tensor([-0.9951, -0.7616, 0.0000, 0.7616, 0.9951])
print(torch.tanh(x)) # identical
TensorFlow:
x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.tanh(x)) # [-0.9951 -0.7616 0. 0.7616 0.9951]
# In an RNN cell: tf.keras.layers.SimpleRNN(64, activation='tanh')
Hardsigmoid — Piecewise Linear Sigmoid
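A piecewise linear approximation of Sigmoid:
$$\text{Hardsigmoid}(x) = \text{clamp}\!\left(\frac{x + 3}{6},\, 0,\, 1\right)$$
- $x \le -3$: output 0 (saturated low)
- $-3 < x < 3$: linear ramp $x/6 + 1/2$
- $x \ge 3$: output 1 (saturated high)
At $x = 0$: $\text{Hardsigmoid}(0) = 0.5$, matching sigmoid. The slope in the linear region is $1/6 \approx 0.167$, compared to sigmoid's maximum slope of $0.25$.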
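Note: TensorFlow's hard_sigmoid has historically used a different formula, $\text{clip}(0.2x + 0.5,\, 0,\, 1)$, which has slope $0.2$ and saturates at $x = \pm 2.5$ rather than $\pm 3$. Both versions are centered at $x = 0$ with output $0.5$. Keras 3 switched to the same $(x + 3)/6$ form as PyTorch, so check the installed version before assuming either formula.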
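To see how far the two piecewise-linear variants sit from the true sigmoid, both can be written out by hand. This is a framework-free sketch; the function names `hard_sigmoid_sixth` and `hard_sigmoid_fifth` are invented here and are not library APIs.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def hard_sigmoid_sixth(x: float) -> float:
    """clamp((x + 3) / 6, 0, 1): slope 1/6, saturates at x = -3 and x = 3."""
    return min(1.0, max(0.0, (x + 3.0) / 6.0))

def hard_sigmoid_fifth(x: float) -> float:
    """clip(0.2 * x + 0.5, 0, 1): slope 0.2, saturates at x = -2.5 and x = 2.5."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))

for x in (-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0):
    print(f"x = {x:+.1f}  sigmoid = {sigmoid(x):.4f}  "
          f"slope 1/6 = {hard_sigmoid_sixth(x):.4f}  "
          f"slope 0.2 = {hard_sigmoid_fifth(x):.4f}")
```

The variants agree at $x = 0$ and in the fully saturated regions, and differ only inside the ramp, which matters when porting a trained model between frameworks.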
PyTorch:
x = torch.tensor([-4., -3., 0., 3., 4.])
print(nn.Hardsigmoid()(x)) # tensor([0.0000, 0.0000, 0.5000, 1.0000, 1.0000])
# Formula: clamp((x+3)/6, 0, 1)
TensorFlow:
x = tf.constant([-4., -3., 0., 3., 4.])
# Legacy Keras formula: clip(0.2*x + 0.5, 0, 1); Keras 3 uses clip(x/6 + 0.5, 0, 1)
print(tf.keras.activations.hard_sigmoid(x)) # [0. 0. 0.5 1. 1.]
# Both formulas agree on these inputs; they differ inside the ramp, e.g. at x = 1: 0.7 vs 0.6667
Hardtanh — Piecewise Linear Tanh
Simply clips the input to the interval [min_val, max_val] (defaults: $[-1, 1]$). This is a piecewise linear approximation of Tanh:
- Below min_val: saturated at min_val
- In range: identity (gradient = 1)
- Above max_val: saturated at max_val
Used in quantization-aware training (the clipping models fixed-point saturation) and as a fast Tanh replacement in RNNs.
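The clipping rule and its gradient are simple enough to write out directly. The sketch below is plain Python rather than a framework call; `hardtanh_grad` is a name introduced here, and its value at the two kink points is a convention chosen for this example.

```python
def hardtanh(x: float, min_val: float = -1.0, max_val: float = 1.0) -> float:
    """Clip x to [min_val, max_val]."""
    return max(min_val, min(max_val, x))

def hardtanh_grad(x: float, min_val: float = -1.0, max_val: float = 1.0) -> float:
    """Slope is 1 strictly inside the range and 0 in the saturated regions.

    The kinks at min_val and max_val have no unique derivative; returning 0
    there is a convention chosen for this sketch.
    """
    return 1.0 if min_val < x < max_val else 0.0

for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"x = {x:+.1f}  hardtanh = {hardtanh(x):+.1f}  grad = {hardtanh_grad(x):.0f}")
```

Unlike Tanh, the gradient does not decay gradually: it is exactly 1 inside the range and exactly 0 outside it.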
PyTorch:
x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Hardtanh()(x)) # tensor([-1., -1., 0., 1., 1.])
# Custom range: nn.Hardtanh(min_val=-2.0, max_val=2.0)
TensorFlow:
x = tf.constant([-3., -1., 0., 1., 3.])
# No built-in Hardtanh; clip_by_value is equivalent
print(tf.clip_by_value(x, -1.0, 1.0)) # [-1. -1. 0. 1. 1.]
Softsign
A computationally simpler alternative to Tanh, $\text{softsign}(x) = x / (1 + |x|)$, with range $(-1, 1)$ and zero-centering. Where Tanh uses exponentials, Softsign uses only absolute value and division. It saturates more slowly (polynomial decay vs exponential for Tanh), meaning gradients remain non-negligible for larger inputs.
Derivative:
$$\text{softsign}'(x) = \frac{1}{(1 + |x|)^2}$$
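The difference between polynomial and exponential decay shows up clearly when the two derivatives, $1/(1+|x|)^2$ for Softsign and $1 - \tanh^2(x)$ for Tanh, are tabulated side by side. This is a minimal sketch with helper names chosen for illustration.

```python
import math

def softsign(x: float) -> float:
    """x / (1 + |x|)."""
    return x / (1.0 + abs(x))

def softsign_grad(x: float) -> float:
    """1 / (1 + |x|)^2, polynomial decay."""
    return 1.0 / (1.0 + abs(x)) ** 2

def tanh_grad(x: float) -> float:
    """1 - tanh(x)^2, exponential decay."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 3.0, 5.0, 10.0):
    print(f"x = {x:4.1f}  softsign' = {softsign_grad(x):.2e}  tanh' = {tanh_grad(x):.2e}")
```

Near the origin Tanh actually has the larger gradient; Softsign's advantage appears only for larger inputs, where Tanh's derivative collapses much faster.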
PyTorch:
import torch.nn.functional as F
x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Softsign()(x)) # tensor([-0.7500, -0.5000, 0.0000, 0.5000, 0.7500])
print(F.softsign(x)) # identical
TensorFlow:
x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.softsign(x)) # [-0.75 -0.5 0. 0.5 0.75]
print(tf.keras.activations.softsign(x)) # identical
LogSigmoid
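Outputs $\log \sigma(x)$, which is always negative because it is the logarithm of a value in $(0, 1)$. PyTorch implements this in a numerically stable way; one standard stable form follows from the log-sum-exp identity:
$$\log \sigma(x) = -\log\left(1 + e^{-x}\right) = \min(x, 0) - \log\left(1 + e^{-|x|}\right)$$
This avoids computing $e^{-x}$ for large negative $x$ (which would overflow), because the exponent $-|x|$ is never positive. LogSigmoid is useful wherever the log of a sigmoid probability is needed directly, for example binary cross-entropy computed from logits, $-[\,y \log \sigma(x) + (1 - y) \log \sigma(-x)\,]$. NLLLoss expects per-class log-probabilities and is normally paired with LogSoftmax rather than LogSigmoid.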
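The overflow issue is easy to reproduce in plain Python. The sketch below contrasts a direct translation of $\log(1 / (1 + e^{-x}))$ with the rearranged form $\min(x, 0) - \log(1 + e^{-|x|})$; both function names are introduced here, and the stable version is one common formulation, not necessarily the exact expression any particular framework uses internally.

```python
import math

def log_sigmoid_naive(x: float) -> float:
    """Direct translation of log(1 / (1 + e^(-x))); fragile for large |x|."""
    return math.log(1.0 / (1.0 + math.exp(-x)))

def log_sigmoid_stable(x: float) -> float:
    """min(x, 0) - log(1 + e^(-|x|)); the exponent is never positive."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

for x in (-3.0, 0.0, 3.0):
    print(f"x = {x:+.1f}  naive = {log_sigmoid_naive(x):.4f}  "
          f"stable = {log_sigmoid_stable(x):.4f}")

# For a very negative input the naive form fails, the stable one does not.
try:
    log_sigmoid_naive(-1000.0)
except OverflowError as exc:
    print("naive form failed:", exc)
print("stable form:", log_sigmoid_stable(-1000.0))
```

For moderate inputs the two versions agree; only the stable form survives extreme inputs, where it returns approximately $x$ for very negative $x$ and approximately $0$ for very positive $x$.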
PyTorch:
x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.LogSigmoid()(x)) # tensor([-3.0486, -1.3133, -0.6931, -0.3133, -0.0486])
# Binary cross-entropy from logits: -(y * logsigmoid(x) + (1 - y) * logsigmoid(-x))
TensorFlow:
x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.math.log_sigmoid(x)) # [-3.0486 -1.3133 -0.6931 -0.3133 -0.0486]
# Equivalent: -tf.math.softplus(-x)
Comparison Table
| Activation | Range | Zero-centered | Vanishes? | Exponentials |
|---|---|---|---|---|
| Sigmoid | $(0, 1)$ | No | Yes | Yes |
| Tanh | $(-1, 1)$ | Yes | Yes | Yes |
| Hardsigmoid | $[0, 1]$ | No | Yes (at ends) | No |
| Hardtanh | $[-1, 1]$ | Yes | Yes (at ends) | No |
| Softsign | $(-1, 1)$ | Yes | Slowly | No |
| LogSigmoid | $(-\infty, 0)$ | No | Yes | Yes (stable) |