Supplement · Activation Functions

Saturating Activations

14 min read

By the end of this reading you will be able to:

Derive the sigmoid derivative as $\sigma(x)(1 - \sigma(x))$ and explain why its maximum of 0.25 causes vanishing gradients in deep networks
Contrast sigmoid and tanh in terms of output range and zero-centering, and explain why zero-centered outputs reduce gradient zig-zagging
Identify Hardsigmoid and Hardtanh as piecewise-linear approximations and state their computational advantage for mobile inference
Select among the six saturating activations given constraints on output range, computational budget, and gradient flow requirements

Sigmoid

$\sigma(x) = \frac{1}{1 + e^{-x}}$

The sigmoid function maps any real input to $(0, 1)$, which makes it a natural choice for binary probability outputs. Its S-shaped curve saturates at both ends: $\sigma(x) \to 1$ as $x \to +\infty$ and $\sigma(x) \to 0$ as $x \to -\infty$.

Derivative: The sigmoid has a clean self-referential derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
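
For readers who want the intermediate steps, one way to reach this form is to differentiate $(1 + e^{-x})^{-1}$ with the chain rule and then split the result into two factors:

$\sigma'(x) = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,(1 - \sigma(x))$

The last step uses $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$. Viewed as a function of $s = \sigma(x)$, the product $s(1 - s)$ is a downward parabola that peaks at $s = \tfrac{1}{2}$, which corresponds to $x = 0$.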

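The maximum derivative is $0.25$ (at $x = 0$). In a 10-layer network, the product of 10 sigmoid derivatives is therefore at most $0.25^{10} \approx 9.5 \times 10^{-7}$, and it is smaller still whenever a unit sits away from $x = 0$. Setting the weight terms aside, the gradient that reaches the first layer is effectively zero. This is the vanishing gradient problem in numbers.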
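
The arithmetic can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the helper names `sigmoid` and `sigmoid_grad` are illustrative, and the calculation deliberately ignores the weight matrices that also appear in a real backpropagated product.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Closed-form derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks quickly away from it.
peak = sigmoid_grad(0.0)        # 0.25
off_peak = sigmoid_grad(3.0)    # roughly 0.045

# Best case for a chain of 10 sigmoids: every pre-activation sits exactly at 0.
best_case = peak ** 10          # 0.25 ** 10, just under 1e-6
# A less favourable case: every pre-activation sits at x = 3.
saturated = off_peak ** 10

print(peak, best_case, saturated)
```

Even the best case is below one millionth, and moving the pre-activations away from zero shrinks the product by many more orders of magnitude.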

PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Sigmoid()(x))      # tensor([0.0474, 0.2689, 0.5000, 0.7311, 0.9526])
print(torch.sigmoid(x))     # identical

TensorFlow:

import tensorflow as tf

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.sigmoid(x))                    # [0.0474 0.2689 0.5000 0.7311 0.9526]
# In a model output: tf.keras.layers.Dense(1, activation='sigmoid')

Tanh — Zero-Centered Sigmoid

$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$

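Tanh maps to $(-1, 1)$ and is zero-centered: for inputs spread symmetrically around zero, its outputs average near zero. This matters because the gradient with respect to each incoming weight of a neuron is the upstream gradient multiplied by the corresponding input. If every input is positive, as with Sigmoid outputs, all of those weight gradients share the same sign (all positive or all negative), so the weights can only move together and the updates zig-zag through weight space.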
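
The sign argument can be made concrete with a single neuron. The sketch below assumes a pre-activation $z = \sum_i w_i a_i$, so that $\partial L / \partial w_i = \delta \, a_i$ with $\delta = \partial L / \partial z$. The numeric values and the helper name `weight_grads` are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def weight_grads(delta: float, inputs: list) -> list:
    """Gradient w.r.t. each incoming weight of one neuron: delta * a_i."""
    return [delta * a for a in inputs]

pre_activations = [-2.0, -0.5, 0.3, 1.5]   # hypothetical previous-layer values
delta = -0.8                                # hypothetical upstream gradient dL/dz

sig_inputs = [sigmoid(z) for z in pre_activations]      # all in (0, 1)
tanh_inputs = [math.tanh(z) for z in pre_activations]   # mixed signs

sig_grads = weight_grads(delta, sig_inputs)
tanh_grads = weight_grads(delta, tanh_inputs)

print([g > 0 for g in sig_grads])    # every entry shares one sign
print([g > 0 for g in tanh_grads])   # signs differ from weight to weight
```

With sigmoid inputs, the sign of every weight gradient is dictated by `delta` alone. With tanh inputs, individual weights are free to move in different directions within the same update.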

Derivative: $\tanh'(x) = 1 - \tanh^2(x)$

The maximum derivative is 1 at $x = 0$, four times sigmoid's peak, but the derivative still decays toward zero as $|x|$ grows. Tanh therefore still suffers from vanishing gradients in very deep networks.
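
Both the identity $\tanh(x) = 2\sigma(2x) - 1$ and the derivative formula are easy to check numerically. The following is a small standard-library sketch; `tanh_via_sigmoid` and `tanh_grad` are illustrative names, not library functions.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a rescaled, shifted sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x: float) -> float:
    """Closed-form derivative 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    # The two expressions for tanh agree up to floating-point rounding.
    assert abs(math.tanh(x) - tanh_via_sigmoid(x)) < 1e-12
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.4f}  grad={tanh_grad(x):.4f}")
```

The printed gradients fall from 1 at the origin to below 0.01 at $|x| = 3$, which is the saturation behaviour described above.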

PyTorch:

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Tanh()(x))     # tensor([-0.9951, -0.7616,  0.0000,  0.7616,  0.9951])
print(torch.tanh(x))    # identical

TensorFlow:

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.tanh(x))                        # [-0.9951 -0.7616  0.     0.7616  0.9951]
# In an RNN cell: tf.keras.layers.SimpleRNN(64, activation='tanh')

Hardsigmoid — Piecewise Linear Sigmoid

$f(x) = \max\!\left(0,\, \min\!\left(1,\, \frac{x+3}{6}\right)\right)$

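A piecewise linear approximation of Sigmoid. It needs only an add, a multiply, and a clamp, with no exponential, which is what makes it cheap for mobile and quantized inference: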

$x \leq -3$ : output 0 (saturated low)
$-3 < x < 3$ : linear ramp $(x+3)/6$
$x \geq 3$ : output 1 (saturated high)

At $x=0$ : $f(0) = 0.5$ , matching sigmoid. The slope in the linear region is $1/6 \approx 0.167$ , compared to sigmoid's maximum slope of $0.25$ .

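Note: Some TensorFlow/Keras versions implement hard_sigmoid with a different formula, $\text{clip}(0.2x + 0.5, 0, 1)$, which has slope $0.2$, is still centered at $x = 0$, and saturates at $x = \pm 2.5$ rather than $\pm 3$. Newer Keras releases document the same $(x+3)/6$ form as PyTorch, so check the documentation for the installed version.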
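
The two piecewise-linear formulas mentioned in this section can be compared side by side without either framework. This sketch uses plain Python; the function names `hardsigmoid_sixth` and `hardsigmoid_fifth` are made up here to label the slope-$1/6$ and slope-$0.2$ variants.

```python
def hardsigmoid_sixth(x: float) -> float:
    """clip((x + 3) / 6, 0, 1): slope 1/6, saturates at x = -3 and x = 3."""
    return max(0.0, min(1.0, (x + 3.0) / 6.0))

def hardsigmoid_fifth(x: float) -> float:
    """clip(0.2 * x + 0.5, 0, 1): slope 0.2, saturates at x = -2.5 and x = 2.5."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))

for x in (-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0):
    a = hardsigmoid_sixth(x)
    b = hardsigmoid_fifth(x)
    print(f"x={x:+.1f}  (x+3)/6 -> {a:.4f}   0.2x+0.5 -> {b:.4f}")
```

Both variants pass through 0.5 at $x = 0$ and agree once both have saturated. They differ only inside the ramp, so a model ported between frameworks can shift slightly if the formulas do not match.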

PyTorch:

x = torch.tensor([-4., -3., 0., 3., 4.])
print(nn.Hardsigmoid()(x))   # tensor([0.0000, 0.0000, 0.5000, 1.0000, 1.0000])
# Formula: clamp((x+3)/6, 0, 1)

TensorFlow:

x = tf.constant([-4., -3., 0., 3., 4.])
# Older TF/Keras formula: clip(0.2*x + 0.5, 0, 1), a different slope than PyTorch
print(tf.keras.activations.hard_sigmoid(x))  # [0.  0.  0.5 1.  1. ]
# These inputs saturate under both formulas; they differ only inside the ramp (x = 1 gives 0.7 vs 0.6667)

Hardtanh — Piecewise Linear Tanh

$f(x) = \max(\text{min\_val},\, \min(\text{max\_val},\, x))$

Simply clips the input to $[\text{min\_val}, \text{max\_val}]$ (defaults: $[-1, 1]$ ). This is a piecewise linear approximation of Tanh:

Below $\text{min\_val}$ : saturated at $\text{min\_val}$
In range: identity (gradient = 1)
Above $\text{max\_val}$ : saturated at $\text{max\_val}$

Used in quantization-aware training (the clipping models fixed-point saturation) and as a fast Tanh replacement in RNNs.
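
The three regions and their gradients can be written out directly. This is a plain-Python sketch rather than any framework's implementation. The derivative at the two kinks is a matter of convention, and 0 is assumed here.

```python
def hardtanh(x: float, min_val: float = -1.0, max_val: float = 1.0) -> float:
    """Clip x to [min_val, max_val]."""
    return max(min_val, min(max_val, x))

def hardtanh_grad(x: float, min_val: float = -1.0, max_val: float = 1.0) -> float:
    """Derivative of the clip: 1 strictly inside the range, 0 otherwise.

    The value exactly at min_val and max_val is a convention (0 assumed here).
    """
    return 1.0 if min_val < x < max_val else 0.0

for x in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"x={x:+.1f}  out={hardtanh(x):+.1f}  grad={hardtanh_grad(x):.0f}")

# A wider range keeps more inputs in the gradient-carrying region.
print(hardtanh(1.5, min_val=-2.0, max_val=2.0), hardtanh_grad(1.5, -2.0, 2.0))
```

Unlike Tanh, the gradient does not shrink gradually: it is exactly 1 inside the range and exactly 0 once an input is clipped.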

PyTorch:

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Hardtanh()(x))   # tensor([-1., -1.,  0.,  1.,  1.])
# Custom range: nn.Hardtanh(min_val=-2.0, max_val=2.0)

TensorFlow:

x = tf.constant([-3., -1., 0., 1., 3.])
# No built-in Hardtanh; clip_by_value is equivalent
print(tf.clip_by_value(x, -1.0, 1.0))   # [-1. -1.  0.  1.  1.]

Softsign

$f(x) = \frac{x}{1 + |x|}$

A computationally simpler alternative to Tanh with range $(-1, 1)$ and zero-centering. Where Tanh uses exponentials, Softsign uses only absolute value and division. It saturates more slowly (polynomial decay vs exponential for Tanh), meaning gradients remain non-negligible for larger inputs.

Derivative: $f'(x) = 1/(1 + |x|)^2$
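
To see the difference between polynomial and exponential decay, the two derivatives can be evaluated at growing inputs. This is a standard-library sketch; `softsign_grad` and `tanh_grad` are illustrative helper names.

```python
import math

def softsign_grad(x: float) -> float:
    """Derivative of x / (1 + |x|), which is 1 / (1 + |x|)^2."""
    return 1.0 / (1.0 + abs(x)) ** 2

def tanh_grad(x: float) -> float:
    """Derivative of tanh, which is 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:4.1f}  softsign'={softsign_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

Both derivatives equal 1 at the origin. At $x = 10$ the Softsign derivative is $1/121$, while the Tanh derivative has already dropped by several more orders of magnitude.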

PyTorch:

import torch.nn.functional as F

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Softsign()(x))   # tensor([-0.7500, -0.5000,  0.0000,  0.5000,  0.7500])
print(F.softsign(x))      # identical

TensorFlow:

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.softsign(x))                    # [-0.75 -0.5   0.    0.5   0.75]
print(tf.keras.activations.softsign(x))     # identical

LogSigmoid

$f(x) = \log\sigma(x) = \log\frac{1}{1 + e^{-x}} = -\log(1 + e^{-x})$

Outputs lie in $(-\infty, 0)$: always negative, because this is the $\log$ of a value in $(0, 1)$. PyTorch computes it in a numerically stable way; one equivalent formulation, in the spirit of the log-sum-exp trick, splits on the sign of $x$:

$\log\sigma(x) = -\log(1 + e^{-x}) = \begin{cases} -\log(1 + e^{-x}) & x \geq 0 \\ x - \log(1 + e^x) & x < 0 \end{cases}$

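This avoids computing $e^{-x}$ for large negative $x$ (which would overflow): in each branch the exponent is non-positive, so the exponential never exceeds 1. Note that nn.NLLLoss expects per-class log-probabilities, such as the output of LogSoftmax, so LogSigmoid is not a drop-in input for it. For binary classification on raw logits, nn.BCEWithLogitsLoss combines the sigmoid and the log in one numerically stable step. LogSigmoid itself shows up more often inside custom losses, for example pairwise ranking objectives.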
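
The effect of the piecewise form is easiest to see at an extreme input. The sketch below is plain Python and not PyTorch's actual implementation. It uses `math.log1p` in place of `log(1 + ...)`, which is an extra precision choice made here, and the function names are illustrative.

```python
import math

def log_sigmoid_naive(x: float) -> float:
    """Direct translation of -log(1 + e^{-x}); overflows for very negative x."""
    return -math.log(1.0 + math.exp(-x))

def log_sigmoid_stable(x: float) -> float:
    """Piecewise form: the argument passed to exp() is never positive."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

# For moderate inputs the two forms agree.
print(log_sigmoid_naive(3.0), log_sigmoid_stable(3.0))

# For a very negative input only the stable form survives.
print(log_sigmoid_stable(-1000.0))
try:
    log_sigmoid_naive(-1000.0)
except OverflowError as err:
    print("naive form failed:", err)
```

For very negative $x$ the stable form returns approximately $x$ itself, which matches the asymptote $\log\sigma(x) \approx x$.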

PyTorch:

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.LogSigmoid()(x))   # tensor([-3.0486, -1.3133, -0.6931, -0.3133, -0.0486])
# For binary classification on raw logits, nn.BCEWithLogitsLoss is the usual choice

TensorFlow:

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.math.log_sigmoid(x))   # [-3.0486 -1.3133 -0.6931 -0.3133 -0.0486]
# Equivalent: -tf.math.softplus(-x)

Comparison Table

Activation	Range	Zero-centered	Gradient vanishes?	Uses exponentials?
Sigmoid	$(0,1)$	No	Yes	Yes
Tanh	$(-1,1)$	Yes	Yes	Yes
Hardsigmoid	$[0,1]$	No	Yes (at ends)	No
Hardtanh	$[-1,1]$	Yes	Yes (at ends)	No
Softsign	$(-1,1)$	Yes	Slowly	No
LogSigmoid	$(-\infty, 0)$	No	Yes (for $x \gg 0$)	Yes (stable)

References

Bengio et al. (1994) — Learning Long-Term Dependencies — Analyzed vanishing gradient problem with saturating activations in RNNs

LeCun et al. (1998) — Efficient BackProp — Recommended Tanh over Sigmoid for zero-centering advantages
