Supplement · Activation Functions

Smooth Modern Activations

By the end of this reading you will be able to:

State the GELU formula and explain its probabilistic interpretation as an input-gated activation
Explain why SiLU/Swish is non-monotonic and describe the self-gating property x * sigmoid(x)
Compare ELU, CELU, and SELU in terms of negative saturation value and the conditions under which SELU induces self-normalisation
Select among GELU, SiLU, Mish, ELU, and SELU for a given architecture based on smoothness, monotonicity, and self-normalisation requirements

GELU — Gaussian Error Linear Unit

$f(x) = x \cdot \Phi(x)$

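where $\Phi(x)$ is the standard normal CDF. The interpretation: GELU weights the input by $\Phi(x) = P(Z \le x)$ with $Z \sim \mathcal{N}(0, 1)$, the probability that a standard normal draw falls below the input. Equivalently, it is the expected value of a stochastic gate that keeps $x$ with probability $\Phi(x)$ and zeroes it otherwise. For large negative $x$, the gate closes (output ≈ 0); for large positive $x$, the gate opens (output ≈ $x$).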
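
The gating reading of $x \cdot \Phi(x)$ can be checked by simulation. The sketch below is illustrative only: the helper names `phi`, `gelu_exact`, and `gelu_monte_carlo` are made up for this example, and the sample count and seed are arbitrary. It draws $Z \sim \mathcal{N}(0, 1)$, keeps $x$ whenever $Z \le x$, and compares the average with the closed form.

```python
import math
import random


def phi(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_exact(x: float) -> float:
    """Closed form: x * Phi(x)."""
    return x * phi(x)


def gelu_monte_carlo(x: float, n_samples: int = 100_000, seed: int = 0) -> float:
    """Average of x * 1[Z <= x] over draws Z ~ N(0, 1)."""
    rng = random.Random(seed)
    kept = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) <= x)
    return x * kept / n_samples


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  sampled={gelu_monte_carlo(x):+.4f}")
```

The two columns should agree up to sampling noise, which shrinks as `n_samples` grows.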

Approximation: Computing $\Phi(x)$ exactly requires the error function. In practice, the tanh approximation is used:

$f(x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,(x + 0.044715\,x^3)\right)\right)$

This approximation stays within roughly $10^{-3}$ of the exact value (absolute error) for all $x$. In PyTorch the exact erf-based form is the default (approximate='none'); pass approximate='tanh' to use the approximation.
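
The size of the approximation error can be measured directly. This is a minimal sketch using only the standard library; the function names and the grid on $[-6, 6]$ are choices made for this example, not part of any framework API.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU, x * Phi(x), via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation with the 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Evenly spaced grid on [-6, 6] with step 0.001.
grid = [i / 1000.0 for i in range(-6000, 6001)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-6, 6]: {max_err:.2e}")
```

On this grid the largest gap stays below $10^{-3}$, which is small next to typical activation magnitudes.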

Where it's used: BERT, GPT-2, GPT-3, and ViT all use GELU, and it remains a common default in transformer feed-forward blocks.

PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.GELU()(x))                        # exact (default): tensor([-0.0455, -0.1587,  0.0000,  0.8413,  1.9545])
print(nn.GELU(approximate='tanh')(x))      # near-identical (tanh approx)
print(F.gelu(x, approximate='none'))       # functional form, exact via erf

TensorFlow:

import tensorflow as tf

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.gelu(x, approximate=False))  # exact
print(tf.keras.activations.gelu(x, approximate=True))   # tanh approx
# In a model: tf.keras.layers.Dense(64, activation='gelu')

SiLU / Swish — Sigmoid Linear Unit

$f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$

Swish is GELU's close relative: instead of gating by the normal CDF, it gates by the sigmoid. SiLU is the special case of Swish, $x \cdot \sigma(\beta x)$, with $\beta = 1$. The output is the input scaled by a soft gate that ranges from 0 (as $x \to -\infty$) to 1 (as $x \to +\infty$).

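Key property (non-monotonic): Unlike ReLU, SiLU has a single minimum near $x \approx -1.28$, where $f'(x) = 0$ and $f(x) \approx -0.28$. Left of that point the function rises back toward 0 as $x \to -\infty$, so moderately negative inputs produce small negative outputs with nonzero gradient instead of being clipped to zero.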

Derivative: $f'(x) = \sigma(x) + x\,\sigma(x)(1 - \sigma(x)) = f(x) + \sigma(x)(1 - f(x))$
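
Both forms of the derivative, and the location of the minimum, can be confirmed numerically. This is a sketch under simple assumptions: the bisection bracket $[-3, 0]$ and the helper names are chosen for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """First form: sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def silu_grad_alt(x: float) -> float:
    """Second form: f(x) + sigma(x) * (1 - f(x))."""
    return silu(x) + sigmoid(x) * (1.0 - silu(x))


def silu_minimum(lo: float = -3.0, hi: float = 0.0, iters: int = 60) -> float:
    """Bisection on f'(x), which is negative at lo and positive at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if silu_grad(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x_min = silu_minimum()
print(f"minimum at x = {x_min:.4f}, f(x) = {silu(x_min):.4f}")
```

The bisection settles near $x \approx -1.278$, where $f(x) \approx -0.278$, matching the stationary point quoted above.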

Where it's used: EfficientNet; MobileNetV3 (as the cheaper hard-swish approximation); LLaMA and Mistral (as the gate inside SwiGLU feed-forward blocks).

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.SiLU()(x))   # tensor([-0.2384, -0.2689,  0.0000,  0.7311,  1.7616])
print(F.silu(x))      # identical
# Note: non-monotone; minimum near x ≈ -1.28

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.swish(x))   # [-0.2384 -0.2689  0.     0.7311  1.7616]
print(tf.nn.swish(x))                  # identical
# In a model: tf.keras.layers.Dense(64, activation='swish')

Mish — Self-Regularized Non-Monotone

$f(x) = x \cdot \tanh(\text{softplus}(x)) = x \cdot \tanh\!\left(\ln(1 + e^x)\right)$

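Mish combines the smoothness of Tanh with the unboundedness of ReLU. Like SiLU, it is non-monotone and self-gating: the gate $\tanh(\text{softplus}(x))$ lies between 0 and 1 and depends on the input itself. Because softplus is a smooth version of max(0,x), the gate is close to 0 for very negative inputs and close to 1 for large positive ones. The "self-regularized" label comes from the title of the paper that introduced it.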
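
A direct translation of the formula overflows for large positive inputs, because $e^x$ is evaluated before the logarithm. The sketch below shows one common way around that, assuming the rearrangement $\text{softplus}(x) = \max(x, 0) + \ln(1 + e^{-|x|})$; the function names are illustrative.

```python
import math


def softplus(x: float) -> float:
    """ln(1 + e^x), arranged so exp never sees a large positive argument."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """x * tanh(softplus(x)): the input scaled by a gate in (0, 1)."""
    return x * math.tanh(softplus(x))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 1000.0):
    print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

For large positive inputs the gate saturates at 1 and the function returns its input unchanged, with no overflow on the way.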

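Comparison to SiLU: Both functions are infinitely differentiable, so neither is smoother in the formal sense. Mish dips slightly lower in the negative region (minimum ≈ −0.31 near $x \approx -1.19$, versus ≈ −0.28 near $x \approx -1.28$ for SiLU) and has been reported to slightly outperform SiLU on some computer vision benchmarks, at higher computational cost.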

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.Mish()(x))   # tensor([-0.2525, -0.3034,  0.0000,  0.8651,  1.9440])
print(F.mish(x))      # identical

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
# Recent Keras releases ship a built-in mish activation; on older TensorFlow versions, implement it from the formula
mish = lambda x: x * tf.math.tanh(tf.math.softplus(x))
print(mish(x))   # [-0.2525 -0.3034  0.      0.8651  1.9440]

ELU — Exponential Linear Unit

$f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$

ELU uses an exponential curve for negative inputs, which makes it continuously differentiable at $x=0$ when $\alpha=1$: $f(0) = 0$ and $f'(0) = 1$ from both sides. As $x \to -\infty$ the output saturates at $-\alpha$ (typically $-1$), so ELU is bounded below.

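Advantage over ReLU: Because negative inputs produce negative outputs instead of zero, ELU pushes mean activations toward zero (similar in spirit to batch normalization, but built into the activation), which the ELU authors report speeds up convergence. The gradient on the negative side is also nonzero, so units do not go permanently inactive the way ReLU units can.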
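
The mean-shifting effect is easy to see on synthetic data. The sketch below assumes pre-activations drawn from a standard normal distribution, which is an idealisation chosen for this example and not a property of any particular network; the sample size and seed are arbitrary.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * math.expm1(x)  # expm1(x) = e^x - 1


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

relu_mean = sum(relu(z) for z in samples) / len(samples)
elu_mean = sum(elu(z) for z in samples) / len(samples)

print(f"mean ReLU output: {relu_mean:.3f}")
print(f"mean ELU output:  {elu_mean:.3f}")
```

Under this assumption the ReLU mean sits near $1/\sqrt{2\pi} \approx 0.40$, while the ELU mean comes out near 0.16: the negative branch cancels part of the positive mass.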

Gradient for $x \leq 0$ : $f'(x) = \alpha e^x = f(x) + \alpha$ .

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.ELU(alpha=1.0)(x))   # tensor([-0.8647, -0.6321,  0.0000,  1.0000,  2.0000])
# Gradient for x<=0: alpha * exp(x), never exactly zero but decaying toward 0 as the output saturates at -alpha

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.elu(x))         # [-0.8647 -0.6321  0.      1.      2.    ]
print(tf.nn.elu(x))                        # identical
# In a model: tf.keras.layers.ELU(alpha=1.0)

CELU — Continuously Differentiable ELU

$f(x) = \max(0, x) + \min\!\left(0,\, \alpha\left(e^{x/\alpha} - 1\right)\right)$

CELU fixes a subtle issue with ELU: ELU's derivative at $x=0$ from the left is $\alpha$ , which equals 1 only when $\alpha=1$ . For other $\alpha$ values, ELU is not $C^1$ (continuously differentiable). CELU reparameterizes the negative region with $e^{x/\alpha}$ so that $f'(0^-) = 1$ for any $\alpha$ .
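
The difference shows up in the one-sided slopes at zero. A minimal sketch, assuming $\alpha = 0.5$ and a backward finite difference with step $h = 10^{-6}$ (both arbitrary choices for this example):

```python
import math


def elu(x: float, alpha: float) -> float:
    return x if x > 0 else alpha * math.expm1(x)


def celu(x: float, alpha: float) -> float:
    # min(x, 0) keeps the exponent non-positive, so exp cannot overflow.
    return max(0.0, x) + min(0.0, alpha * math.expm1(min(x, 0.0) / alpha))


def left_slope_at_zero(f, alpha: float, h: float = 1e-6) -> float:
    """Backward finite difference approximating f'(0^-)."""
    return (f(0.0, alpha) - f(-h, alpha)) / h


alpha = 0.5
elu_slope = left_slope_at_zero(elu, alpha)
celu_slope = left_slope_at_zero(celu, alpha)
print(f"ELU  left slope at 0 (alpha={alpha}): {elu_slope:.4f}")
print(f"CELU left slope at 0 (alpha={alpha}): {celu_slope:.4f}")
```

The right-hand slope is 1 for both functions, so ELU has a kink at zero whenever $\alpha \neq 1$ (left slope $\alpha$), while CELU's left slope stays at 1.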

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.CELU(alpha=1.0)(x))     # tensor([-0.8647, -0.6321,  0.0000,  1.0000,  2.0000])
print(nn.CELU(alpha=0.5)(x))     # different scaling in negative region

TensorFlow:

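# If your TensorFlow version has no built-in CELU, implement it from the formula
def celu(x, alpha=1.0):
    # Clamp before exp so large positive x cannot overflow in the unselected branch
    neg = alpha * (tf.exp(tf.minimum(x, 0.0) / alpha) - 1.0)
    return tf.where(x > 0, x, neg)

x = tf.constant([-2., -1., 0., 1., 2.])
print(celu(x))   # [-0.8647 -0.6321  0.      1.      2.    ]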

SELU — Scaled Exponential Linear Unit

$f(x) = \lambda \cdot \text{ELU}(x, \alpha) = \lambda \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$

where $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ . These constants were derived analytically to produce a self-normalizing property: if the inputs to a SELU layer have mean 0 and variance 1, the outputs will also have mean ≈ 0 and variance ≈ 1 (a fixed-point argument).

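For fully connected networks this can remove the need for BatchNorm, since the activations self-regulate. The guarantee depends on LeCun-normal weight initialization (variance 1/fan_in, not He initialization) and, if dropout is used, on nn.AlphaDropout in place of standard dropout.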
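
The fixed point can be checked on a single layer's worth of values. The sketch below assumes the pre-activations are exactly standard normal, which in a real network is what LeCun-normal weights are meant to approximate; the constants are the published SELU values written to full double precision, and the sample size and seed are arbitrary.

```python
import math
import random

SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x: float) -> float:
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * math.expm1(x))


rng = random.Random(0)
outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(200_000)]

mean = sum(outputs) / len(outputs)
var = sum((y - mean) ** 2 for y in outputs) / len(outputs)

print(f"output mean:     {mean:+.4f}")
print(f"output variance: {var:.4f}")
```

With zero-mean, unit-variance inputs, the output statistics land close to mean 0 and variance 1, which is the fixed point the constants were derived for.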

PyTorch:

# Constants are fixed: lambda=1.0507, alpha=1.6733
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.SELU()(x))   # tensor([-1.5202, -1.1113,  0.0000,  1.0507,  2.1014])
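# For self-normalisation: LeCun-normal init, plus nn.AlphaDropout if dropout is used.
# kaiming_normal_ with nonlinearity='linear' has gain 1, i.e. std = 1/sqrt(fan_in) (LeCun normal):
# nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='linear')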

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.selu(x))   # [-1.5202 -1.1113  0.      1.0507  2.1014]
# In a model (with LeCun init):
# tf.keras.layers.Dense(64, activation='selu',
#     kernel_initializer='lecun_normal')

Comparison Table

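Activation	Smooth	Monotone	Self-gating	Negative-side behaviour
GELU	Yes	No (slight dip, minimum near $x \approx -0.75$)	Yes (via normal CDF)	Small negative dip, then decays to 0
SiLU	Yes	No (minimum near $x \approx -1.28$)	Yes (via sigmoid)	Small negative dip, then decays to 0
Mish	Yes	No (minimum near $x \approx -1.19$)	Yes (via tanh·softplus)	Small negative dip, then decays to 0
ELU	$C^1$ only if $\alpha = 1$	Yes	No	Exponential, saturates at $-\alpha$
CELU	Yes ( $C^1$ for any $\alpha$ )	Yes	No	Exponential, saturates at $-\alpha$
SELU	No (slope jumps from $\lambda\alpha$ to $\lambda$ at 0)	Yes	No	Exponential, saturates at $-\lambda\alpha \approx -1.76$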
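
The Monotone column can be spot-checked on a grid. This is a rough sketch, not a proof: it only tests sampled points on $[-6, 6]$, uses $\alpha = 1$ for ELU and an arbitrary $\alpha = 0.5$ for CELU, and all names are made up for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x: float) -> float:
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


ACTIVATIONS = {
    "GELU": lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
    "SiLU": lambda x: x * sigmoid(x),
    "Mish": lambda x: x * math.tanh(softplus(x)),
    "ELU": lambda x: x if x > 0 else math.expm1(x),
    "CELU": lambda x, a=0.5: max(0.0, x) + min(0.0, a * math.expm1(min(x, 0.0) / a)),
    "SELU": lambda x: 1.0507 * (x if x > 0 else 1.6733 * math.expm1(x)),
}


def is_monotone(f, lo: float = -6.0, hi: float = 6.0, steps: int = 2400) -> bool:
    """True if f never decreases between consecutive grid points."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [f(x) for x in xs]
    return all(b >= a for a, b in zip(ys, ys[1:]))


results = {name: is_monotone(f) for name, f in ACTIVATIONS.items()}
for name, monotone in results.items():
    print(f"{name:5s} monotone on grid: {monotone}")
```

The three self-gated functions (GELU, SiLU, Mish) fail the check because of their negative dip; the ELU family passes.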

References

Hendrycks & Gimpel (2016) — GELU — Introduced GELU; later adopted in BERT and GPT

Ramachandran et al. (2017) — Swish (SiLU) — Found Swish via neural architecture search; showed it outperforms ReLU on deep models

Misra (2019) — Mish — Introduced Mish; demonstrated self-regularizing properties

Clevert et al. (2016) — ELU — Introduced ELU; showed faster convergence due to mean-pushing property

Klambauer et al. (2017) — SELU (Self-Normalizing Networks) — Derived the exact λ and α constants that produce the self-normalizing fixed point
