Supplement · Activation Functions

Saturating Activations

14 min read
By the end of this reading you will be able to:
  • Derive the sigmoid derivative as $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and explain why its maximum of 0.25 causes vanishing gradients in deep networks
  • Contrast sigmoid and tanh in terms of output range and zero-centering, and explain why zero-centered outputs reduce gradient zig-zagging
  • Identify Hardsigmoid and Hardtanh as piecewise-linear approximations and state their computational advantage for mobile inference
  • Select among the six saturating activations given constraints on output range, computational budget, and gradient flow requirements

Sigmoid

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid function maps any real input to $(0, 1)$, which makes it a natural fit for binary probability outputs. Its S-shaped curve saturates at both ends: $\sigma(x) \to 1$ as $x \to +\infty$ and $\sigma(x) \to 0$ as $x \to -\infty$.

Derivative: The sigmoid has a clean self-referential derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
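
The identity follows from the chain rule. A short derivation, using only the definition of $\sigma$ given earlier in this section:

```latex
\begin{aligned}
\sigma'(x) &= \frac{d}{dx}\left(1 + e^{-x}\right)^{-1}
            = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
           &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
            = \sigma(x)\,\bigl(1 - \sigma(x)\bigr),
\end{aligned}
\qquad \text{since } 1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}.
```

The product $p(1 - p)$ peaks at $p = 1/2$, which is why the derivative is largest where $\sigma(x) = 1/2$, that is, at $x = 0$.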

The maximum derivative is $0.25$ (at $x = 0$). In a 10-layer network, the product of 10 sigmoid derivatives is therefore at most $0.25^{10} \approx 10^{-6}$ (ignoring the weight terms in the chain rule), and it is far smaller once any layer saturates. This is the vanishing gradient problem in numbers.
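
The bound can be checked with a few lines of plain Python. This is a minimal sketch that uses only the standard library and looks only at the activation derivatives, not at the weight matrices that also appear in a real backward pass. The helper names and the pre-activation value of 2.0 are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Best case: every layer's pre-activation sits exactly at 0, where the slope peaks.
best_case = sigmoid_grad(0.0) ** 10

# A mildly saturated case: hypothetical pre-activation of 2.0 at every layer.
saturated_case = sigmoid_grad(2.0) ** 10

print(f"max slope        : {sigmoid_grad(0.0):.4f}")
print(f"10 layers, x = 0 : {best_case:.3e}")
print(f"10 layers, x = 2 : {saturated_case:.3e}")
```

Even the best case is already around one millionth, and moving the pre-activations slightly away from zero shrinks the product by several more orders of magnitude.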

PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Sigmoid()(x))      # tensor([0.0474, 0.2689, 0.5000, 0.7311, 0.9526])
print(torch.sigmoid(x))     # identical

TensorFlow:

import tensorflow as tf

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.sigmoid(x))                    # [0.0474 0.2689 0.5000 0.7311 0.9526]
# In a model output: tf.keras.layers.Dense(1, activation='sigmoid')

Tanh — Zero-Centered Sigmoid

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

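Tanh maps to $(-1, 1)$ and is zero-centered: its outputs are symmetric about zero, so they average near zero when the inputs are themselves centered. This matters because a non-zero-centered activation such as Sigmoid feeds strictly positive values to the next layer, so the gradients with respect to all incoming weights of a given neuron share the same sign (all positive or all negative). Those weights can then only move together in one direction per step, which creates zig-zag updates in weight space.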
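
The sign argument can be made concrete for a single neuron computing $z = \sum_i w_i a_i + b$, where $\partial L / \partial w_i = \delta \, a_i$ and $\delta = \partial L / \partial z$. The sketch below is illustrative only: the pre-activation values and the upstream gradient `delta` are made-up numbers, and `weight_grads` is a hypothetical helper, not a library function.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def weight_grads(inputs, delta):
    """dL/dw_i = delta * a_i for a neuron computing z = sum_i w_i * a_i + b."""
    return [delta * a for a in inputs]


# Hypothetical pre-activations of the previous layer (illustrative values only).
pre_activations = [-2.0, -0.5, 0.3, 1.5]

# Hypothetical upstream gradient dL/dz for one neuron in the next layer.
delta = -0.8

sigmoid_inputs = [sigmoid(z) for z in pre_activations]  # all in (0, 1)
tanh_inputs = [math.tanh(z) for z in pre_activations]   # mixed signs

grads_sigmoid = weight_grads(sigmoid_inputs, delta)
grads_tanh = weight_grads(tanh_inputs, delta)

print("sigmoid inputs ->", [round(g, 3) for g in grads_sigmoid])
print("tanh inputs    ->", [round(g, 3) for g in grads_tanh])
```

With sigmoid inputs every weight gradient has the sign of `delta`; with tanh inputs the signs are mixed, so individual weights can move in different directions within the same step.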

Derivative: $\tanh'(x) = 1 - \tanh^2(x)$

The maximum derivative is $1$ at $x = 0$, but the derivative still vanishes for $|x| \gg 0$, so Tanh still suffers from vanishing gradients in very deep networks.
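
Both the relation to sigmoid and the derivative formula are easy to check numerically. The following is a small standard-library sketch; the function names are chosen here for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh expressed through sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in (-3.0, -1.0, 0.0, 1.0, 3.0, 6.0):
    print(f"x = {x:5.1f}  tanh = {math.tanh(x):+.6f}  "
          f"2*sigma(2x)-1 = {tanh_via_sigmoid(x):+.6f}  grad = {tanh_grad(x):.2e}")
```

The two columns for the function value agree to floating-point precision, and the gradient column falls from 1 at the origin to a very small value by $x = 6$.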

PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Tanh()(x))     # tensor([-0.9951, -0.7616,  0.0000,  0.7616,  0.9951])
print(torch.tanh(x))    # identical

TensorFlow:

import tensorflow as tf

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.tanh(x))                        # [-0.9951 -0.7616  0.     0.7616  0.9951]
# In an RNN cell: tf.keras.layers.SimpleRNN(64, activation='tanh')

Hardsigmoid — Piecewise Linear Sigmoid

$$f(x) = \max\!\left(0,\, \min\!\left(1,\, \frac{x+3}{6}\right)\right)$$

A piecewise linear approximation of Sigmoid:

  • $x \leq -3$: output 0 (saturated low)
  • $-3 < x < 3$: linear ramp $(x+3)/6$
  • $x \geq 3$: output 1 (saturated high)

At $x = 0$: $f(0) = 0.5$, matching sigmoid. The slope in the linear region is $1/6 \approx 0.167$, compared to sigmoid's maximum slope of $0.25$.

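Note: TensorFlow's hard_sigmoid has historically used a different formula, $\text{clip}(0.2x + 0.5, 0, 1)$. It has slope $0.2$, is also centered at $x = 0$ (where it returns $0.5$), and saturates at $x = \pm 2.5$ rather than $\pm 3$. Newer Keras releases (Keras 3) document the $x/6 + 0.5$ form instead, so it is worth checking which formula your installed version uses.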
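
To see how the two piecewise-linear formulas relate to the true sigmoid, the sketch below evaluates all three in plain Python. The function names `hardsigmoid_sixth` and `hardsigmoid_fifth` are labels invented here for the $(x+3)/6$ and $0.2x + 0.5$ variants; they are not library APIs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def hardsigmoid_sixth(x: float) -> float:
    """clamp((x + 3) / 6, 0, 1): slope 1/6, saturates at x = -3 and x = 3."""
    return min(1.0, max(0.0, (x + 3.0) / 6.0))


def hardsigmoid_fifth(x: float) -> float:
    """clip(0.2 * x + 0.5, 0, 1): slope 0.2, saturates at x = -2.5 and x = 2.5."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))


print("    x  sigmoid  (x+3)/6  0.2x+0.5")
for x in (-3.0, -2.5, -2.0, 0.0, 2.0, 2.5, 3.0):
    print(f"{x:5.1f}  {sigmoid(x):.4f}   {hardsigmoid_sixth(x):.4f}   {hardsigmoid_fifth(x):.4f}")
```

Both approximations return 0.5 at the origin and agree once saturated; they differ only inside the ramp, where the slope differs.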

PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-4., -3., 0., 3., 4.])
print(nn.Hardsigmoid()(x))   # tensor([0.0000, 0.0000, 0.5000, 1.0000, 1.0000])
# Formula: clamp((x+3)/6, 0, 1)

TensorFlow:

import tensorflow as tf

x = tf.constant([-4., -3., 0., 3., 4.])
# Legacy TF formula: clip(0.2*x + 0.5, 0, 1), a steeper ramp than PyTorch's (x+3)/6
print(tf.keras.activations.hard_sigmoid(x))  # [0.  0.  0.5 1.  1. ]
# Both formulas agree at these inputs; they differ inside the ramp,
# e.g. at x = -2: 0.2x + 0.5 gives 0.1, while (x+3)/6 gives about 0.167

Hardtanh — Piecewise Linear Tanh

$$f(x) = \max(\text{min\_val},\, \min(\text{max\_val},\, x))$$

Simply clips the input to $[\text{min\_val}, \text{max\_val}]$ (defaults: $[-1, 1]$). This is a piecewise linear approximation of Tanh:

  • Below $\text{min\_val}$: saturated at $\text{min\_val}$
  • In range: identity (gradient = 1)
  • Above $\text{max\_val}$: saturated at $\text{max\_val}$

Used in quantization-aware training (the clipping models fixed-point saturation) and as a fast Tanh replacement in RNNs.
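
The clipping behavior and its gradient can be written out directly. This is a minimal pure-Python sketch, not the library implementation. Returning a gradient of 0 exactly at the boundaries is a convention assumed here.

```python
def hardtanh(x: float, min_val: float = -1.0, max_val: float = 1.0) -> float:
    """Clip x to the closed interval [min_val, max_val]."""
    return max(min_val, min(max_val, x))


def hardtanh_grad(x: float, min_val: float = -1.0, max_val: float = 1.0) -> float:
    """Gradient is 1 inside the interval and 0 once the input is clipped.

    The value at the exact boundary is a convention; this sketch uses 0.
    """
    return 1.0 if min_val < x < max_val else 0.0


for x in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"x = {x:5.1f}  hardtanh = {hardtanh(x):5.1f}  grad = {hardtanh_grad(x):.0f}")

# A wider custom range passes more of the input through unchanged.
print(hardtanh(1.5, min_val=-2.0, max_val=2.0))
```

Inside the interval the gradient passes through unchanged, which is the appeal over Tanh; outside it the gradient is exactly zero rather than merely small.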

PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Hardtanh()(x))   # tensor([-1., -1.,  0.,  1.,  1.])
# Custom range: nn.Hardtanh(min_val=-2.0, max_val=2.0)

TensorFlow:

import tensorflow as tf

x = tf.constant([-3., -1., 0., 1., 3.])
# No built-in Hardtanh; clip_by_value is equivalent
print(tf.clip_by_value(x, -1.0, 1.0))   # [-1. -1.  0.  1.  1.]

Softsign

$$f(x) = \frac{x}{1 + |x|}$$

A computationally simpler alternative to Tanh with range $(-1, 1)$ and zero-centering. Where Tanh uses exponentials, Softsign uses only absolute value and division. It saturates more slowly (polynomial decay versus exponential decay for Tanh), meaning gradients remain non-negligible for larger inputs.

Derivative: $f'(x) = 1/(1 + |x|)^2$
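
The difference between polynomial and exponential decay shows up clearly when the two derivatives are evaluated side by side. The sketch below uses only the standard library; the sample points are arbitrary.

```python
import math


def softsign(x: float) -> float:
    """x / (1 + |x|)."""
    return x / (1.0 + abs(x))


def softsign_grad(x: float) -> float:
    """Derivative of softsign: 1 / (1 + |x|)^2."""
    return 1.0 / (1.0 + abs(x)) ** 2


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


print("    x  softsign'   tanh'")
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"{x:5.1f}  {softsign_grad(x):.3e}  {tanh_grad(x):.3e}")
```

Both derivatives equal 1 at the origin, but by $x = 10$ the Softsign derivative is still of order $10^{-2}$ while the Tanh derivative has dropped by several more orders of magnitude.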

PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.Softsign()(x))             # tensor([-0.7500, -0.5000,  0.0000,  0.5000,  0.7500])
print(nn.functional.softsign(x))    # identical (functional form)

TensorFlow:

import tensorflow as tf

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.nn.softsign(x))                    # [-0.75 -0.5   0.    0.5   0.75]
print(tf.keras.activations.softsign(x))     # identical

LogSigmoid

$$f(x) = \log\sigma(x) = \log\frac{1}{1 + e^{-x}} = -\log(1 + e^{-x})$$

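Outputs lie in $(-\infty, 0)$: always negative, because this is the $\log$ of a value in $(0, 1)$ (in floating point the result can round to $0$ for large positive $x$). PyTorch implements this in a numerically stable way, using a rearrangement in the spirit of the log-sum-exp trick: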

$$\log\sigma(x) = -\log(1 + e^{-x}) = \begin{cases} -\log(1 + e^{-x}) & x \geq 0 \\ x - \log(1 + e^{x}) & x < 0 \end{cases}$$

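This keeps the argument of the exponential non-positive in both branches. The naive form computes $e^{-x}$, which overflows when $x$ is a large negative number; the second branch uses $e^{x}$ there instead. LogSigmoid appears wherever a loss needs the log-probability of a binary event, for example binary cross-entropy on logits, $-[y \log\sigma(x) + (1 - y)\log\sigma(-x)]$. In PyTorch, nn.BCEWithLogitsLoss computes that loss directly from logits; nn.NLLLoss instead expects per-class log-probabilities (for example from LogSoftmax), so a single LogSigmoid output is not a drop-in input for it.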
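
The overflow is easy to reproduce. The sketch below contrasts a direct transcription of $-\log(1 + e^{-x})$ with the piecewise form, using only the standard library. It is an illustration of the idea, not PyTorch's actual source, and the use of `math.log1p` for extra accuracy is a choice made here.

```python
import math


def log_sigmoid_naive(x: float) -> float:
    """Direct transcription of -log(1 + exp(-x)); overflows for very negative x."""
    return -math.log(1.0 + math.exp(-x))


def log_sigmoid_stable(x: float) -> float:
    """Piecewise form: the argument passed to exp is never positive."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


for x in (-1000.0, -3.0, 0.0, 3.0, 1000.0):
    try:
        naive = f"{log_sigmoid_naive(x):.4f}"
    except OverflowError:
        # math.exp raises instead of returning inf when the result is too large.
        naive = "OverflowError"
    print(f"x = {x:8.1f}  naive = {naive:>13}  stable = {log_sigmoid_stable(x):.4f}")
```

For moderate inputs the two versions agree; at $x = -1000$ the naive version fails inside `math.exp`, while the stable version returns a value equal to $x$ to within floating-point precision, as the asymptote $\log\sigma(x) \approx x$ predicts.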

PyTorch:

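import torch
import torch.nn as nn

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.LogSigmoid()(x))   # tensor([-3.0486, -1.3133, -0.6931, -0.3133, -0.0486])
# For binary classification on logits, prefer nn.BCEWithLogitsLoss (nn.NLLLoss expects per-class log-probabilities)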

TensorFlow:

import tensorflow as tf

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.math.log_sigmoid(x))   # [-3.0486 -1.3133 -0.6931 -0.3133 -0.0486]
# Equivalent: -tf.math.softplus(-x)

Comparison Table

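| Activation  | Range          | Zero-centered | Gradient vanishes?           | Uses exponentials? |
|-------------|----------------|---------------|------------------------------|--------------------|
| Sigmoid     | $(0, 1)$       | No            | Yes                          | Yes                |
| Tanh        | $(-1, 1)$      | Yes           | Yes                          | Yes                |
| Hardsigmoid | $[0, 1]$       | No            | Yes (at ends)                | No                 |
| Hardtanh    | $[-1, 1]$      | Yes           | Yes (at ends)                | No                 |
| Softsign    | $(-1, 1)$      | Yes           | Slowly                       | No                 |
| LogSigmoid  | $(-\infty, 0)$ | No            | Yes (for large positive $x$) | Yes (stable)       |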

References

Bengio et al. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. Analyzes the vanishing gradient problem with saturating activations in RNNs.
LeCun et al. (1998). Efficient BackProp. Recommends Tanh over Sigmoid for its zero-centering advantages.