Supplement · Activation Functions

Smooth Modern Activations

By the end of this reading you will be able to:
  • State the GELU formula and explain its probabilistic interpretation as an input-gated activation
  • Explain why SiLU/Swish is non-monotonic and describe the self-gating property x * sigmoid(x)
  • Compare ELU, CELU, and SELU in terms of negative saturation value and the conditions under which SELU induces self-normalisation
  • Select among GELU, SiLU, Mish, ELU, and SELU for a given architecture based on smoothness, monotonicity, and self-normalisation requirements

GELU — Gaussian Error Linear Unit

$$f(x) = x \cdot \Phi(x)$$

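where $\Phi(x)$ is the standard normal CDF, $\Phi(x) = P(Z \le x)$ for $Z \sim \mathcal{N}(0, 1)$. The interpretation: GELU weights the input by the probability that a standard normal draw falls below it. Equivalently, it is the expected value of a stochastic gate that passes $x$ with probability $\Phi(x)$ and outputs 0 otherwise; the activation itself is deterministic. For large negative $x$, the gate closes (output ≈ 0); for large positive $x$, the gate opens (output ≈ $x$).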
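
The gating interpretation can be checked numerically. The sketch below uses only the Python standard library and compares $x \cdot \Phi(x)$ with a Monte Carlo average of $x$ times a random 0/1 gate. The helper names (`phi_cdf`, `gelu_exact`, `stochastic_gate_mean`), the sample count, and the seed are illustrative choices, not part of any framework API.

```python
import math
import random


def phi_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * phi_cdf(x)


def stochastic_gate_mean(x: float, n_samples: int = 100_000, seed: int = 0) -> float:
    """Average of x * 1[Z <= x] over draws Z ~ N(0, 1).

    The indicator is a 0/1 gate that passes x with probability Phi(x),
    so the average should approach x * Phi(x) as n_samples grows.
    """
    rng = random.Random(seed)
    passed = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) <= x)
    return x * passed / n_samples


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  "
              f"monte_carlo={stochastic_gate_mean(x):+.4f}")
```

The two columns should agree to roughly two decimal places; the remaining gap is sampling noise, not a property of GELU.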

Approximation: Computing $\Phi(x)$ exactly requires the error function, $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. A common cheaper alternative is the tanh approximation:

$$f(x) \approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\left(x + 0.044715\,x^3\right)\right)\right)$$

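The absolute error of this approximation is on the order of $10^{-4}$ and stays below about $10^{-3}$ for all $x$. In PyTorch, nn.GELU defaults to the exact form (approximate='none'); pass approximate='tanh' to use the approximation.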
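
To get a feel for how close the tanh form is, the following standard-library sketch scans a uniform grid and reports the largest gap between the exact and approximate GELU. The function names, the interval [-6, 6], and the grid resolution are assumptions made for illustration.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def max_abs_error(lo: float = -6.0, hi: float = 6.0, steps: int = 12_000):
    """Return (largest |exact - approx|, x where it occurs) on a uniform grid."""
    worst_err, worst_x = 0.0, lo
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        err = abs(gelu_exact(x) - gelu_tanh(x))
        if err > worst_err:
            worst_err, worst_x = err, x
    return worst_err, worst_x


if __name__ == "__main__":
    err, at = max_abs_error()
    print(f"max |exact - tanh| on [-6, 6]: {err:.2e} at x = {at:.2f}")
```

A grid scan only bounds the error at the sampled points, so treat the printed value as an estimate of the worst case.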

Where it's used: BERT, GPT-2, GPT-3, ViT. GELU is the default in many transformer architectures, although some recent models use SiLU-based gating instead.

PyTorch:

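import torch
import torch.nn as nn
import torch.nn.functional as F  # these imports are reused by the later PyTorch snippets

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.GELU()(x))                        # exact (default): tensor([-0.0455, -0.1587,  0.0000,  0.8413,  1.9545])
print(nn.GELU(approximate='tanh')(x))      # near-identical (tanh approx)
print(F.gelu(x, approximate='none'))       # exact via erf, same as the default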

TensorFlow:

import tensorflow as tf  # reused by the later TensorFlow snippets

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.gelu(x, approximate=False))  # exact
print(tf.keras.activations.gelu(x, approximate=True))   # tanh approx
# In a model: tf.keras.layers.Dense(64, activation='gelu')

SiLU / Swish — Sigmoid Linear Unit

$$f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$

Swish is GELU's close relative: instead of gating by the normal CDF, it gates by the sigmoid. The output is the input scaled by a soft gate that ranges from 0 (as $x \to -\infty$) to 1 (as $x \to +\infty$).

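Key property (non-monotonic): Unlike ReLU, SiLU has a global minimum near $x \approx -1.28$, where $f'(x) = 0$ and $f(x) \approx -0.28$. To the left of that point the function rises back toward 0. Small negative inputs therefore map to small negative outputs instead of being clipped to zero, and gradients still flow for negative inputs.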

Derivative: $f'(x) = \sigma(x) + x\,\sigma(x)\,(1 - \sigma(x)) = f(x) + \sigma(x)\,(1 - f(x))$
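
The derivative formula and the location of the minimum can both be checked in a few lines. This is a minimal standard-library sketch: `silu_grad` implements the closed form, and `silu_argmin` (a hypothetical helper, with an assumed bracket of [-3, 0]) finds the root of the derivative by bisection.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """Closed-form derivative: sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def silu_argmin(lo: float = -3.0, hi: float = 0.0, iters: int = 60) -> float:
    """Bisection on f'(x): f' is negative at lo and positive at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if silu_grad(mid) < 0.0:
            lo = mid   # still on the descending side
        else:
            hi = mid   # past the minimum
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    x_min = silu_argmin()
    print(f"argmin = {x_min:.4f}, minimum value = {silu(x_min):.4f}")
    # Compare the closed form against a central finite difference.
    x, h = 0.7, 1e-6
    numeric = (silu(x + h) - silu(x - h)) / (2.0 * h)
    print(f"closed form = {silu_grad(x):.6f}, finite difference = {numeric:.6f}")
```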

Where it's used: EfficientNet; MobileNetV3 (as the piecewise-linear hard-swish variant); LLaMA and Mistral (as the gate inside SwiGLU feed-forward layers).

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.SiLU()(x))   # tensor([-0.2384, -0.2689,  0.0000,  0.7311,  1.7616])
print(F.silu(x))      # identical
# Note: non-monotone; minimum near x ≈ -1.28

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.swish(x))   # [-0.2384 -0.2689  0.      0.7311  1.7616]
print(tf.nn.swish(x))                  # identical
# In a model: tf.keras.layers.Dense(64, activation='swish')

Mish — Self-Regularized Non-Monotone

$$f(x) = x \cdot \tanh(\operatorname{softplus}(x)) = x \cdot \tanh\!\left(\ln(1 + e^x)\right)$$

Mish combines the smoothness of tanh with the unbounded positive range of ReLU. Like SiLU, it is non-monotone and self-gating. The "self-regularizing" label refers to its adaptive soft gate: softplus is a smooth version of $\max(0, x)$, so $\tanh(\operatorname{softplus}(x))$ rises smoothly from 0 to 1 as the input grows.

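Comparison to SiLU: Both functions are infinitely differentiable, so neither is smoother in the strict sense. Mish has a slightly deeper negative dip (minimum ≈ -0.31, versus ≈ -0.28 for SiLU) and is reported to slightly outperform SiLU in some computer vision benchmarks, at higher computational cost because it evaluates an exponential, a logarithm, and a tanh.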
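
The two curves can be compared directly. The sketch below is a standard-library illustration: `softplus` is written in an overflow-safe form, and `grid_minimum` is a hypothetical helper that locates each function's lowest point on an assumed grid over [-5, 0].

```python
import math


def softplus(x: float) -> float:
    """ln(1 + e^x), rearranged so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    return x * math.tanh(softplus(x))


def silu(x: float) -> float:
    # Use whichever form of the sigmoid keeps the exponent non-positive.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return x * ex / (1.0 + ex)


def grid_minimum(fn, lo: float = -5.0, hi: float = 0.0, steps: int = 50_000):
    """Return (x, fn(x)) at the smallest value of fn on a uniform grid."""
    best_x = min((lo + (hi - lo) * i / steps for i in range(steps + 1)), key=fn)
    return best_x, fn(best_x)


if __name__ == "__main__":
    for name, fn in (("Mish", mish), ("SiLU", silu)):
        x_min, f_min = grid_minimum(fn)
        print(f"{name}: minimum {f_min:.4f} at x = {x_min:.4f}")
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  mish={mish(x):+.4f}  silu={silu(x):+.4f}")
```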

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.Mish()(x))   # tensor([-0.2525, -0.3034,  0.0000,  0.8651,  1.9440])
print(F.mish(x))      # identical

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
# Mish is not built into every TensorFlow release; implement from formula
mish = lambda x: x * tf.math.tanh(tf.math.softplus(x))
print(mish(x))   # [-0.2525 -0.3034  0.      0.8651  1.9440]

ELU — Exponential Linear Unit

$$f(x) = \begin{cases} x & x > 0 \\ \alpha\,(e^x - 1) & x \leq 0 \end{cases}$$

ELU uses an exponential curve for negative inputs, making it continuously differentiable at $x = 0$ when $\alpha = 1$: $f(0) = 0$ and $f'(0) = 1$ from both sides. For negative inputs the output saturates toward $-\alpha$ (typically $-1$), so it is bounded below.

Advantage over ReLU: ELU pushes mean activations toward zero (like batch normalization, but built into the activation), which speeds convergence. The exponential negative region gives non-zero outputs and non-zero gradients for negative inputs, so units do not go permanently dead as they can with ReLU.

Gradient for $x \leq 0$: $f'(x) = \alpha e^x = f(x) + \alpha$.
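
The identity $f'(x) = f(x) + \alpha$ means the backward pass can reuse the forward output instead of recomputing the exponential. A minimal standard-library sketch, with illustrative function names and test points, checks it against a finite difference:

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU forward pass; expm1 keeps precision for small |x|."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Slope is 1 for x > 0; for x <= 0 it is alpha * e^x = elu(x) + alpha."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha


if __name__ == "__main__":
    alpha, h = 1.3, 1e-6
    for x in (-3.0, -1.5, -0.2, 0.5):
        numeric = (elu(x + h, alpha) - elu(x - h, alpha)) / (2.0 * h)
        print(f"x={x:+.1f}  identity={elu_grad(x, alpha):.6f}  "
              f"finite_difference={numeric:.6f}")
```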

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.ELU(alpha=1.0)(x))   # tensor([-0.8647, -0.6321,  0.0000,  1.0000,  2.0000])
# Gradient for x<=0: alpha * exp(x), never exactly zero but decaying toward 0 as the output saturates at -alpha

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.elu(x))         # [-0.8647 -0.6321  0.      1.      2.    ]
print(tf.nn.elu(x))                        # identical
# In a model: tf.keras.layers.ELU(alpha=1.0)

CELU — Continuously Differentiable ELU

$$f(x) = \max(0, x) + \min\!\left(0,\, \alpha\left(e^{x/\alpha} - 1\right)\right)$$

CELU fixes a subtle issue with ELU: ELU's derivative at $x = 0$ from the left is $\alpha$, which equals 1 only when $\alpha = 1$. For other values of $\alpha$, ELU is not $C^1$ (continuously differentiable). CELU reparameterizes the negative region with $e^{x/\alpha}$ so that $f'(0^-) = 1$ for any $\alpha$.
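
The difference shows up in the one-sided slopes at zero. The sketch below, using only the standard library, estimates them with finite differences for a few values of $\alpha$; `one_sided_slopes` and the step size `h` are illustrative assumptions.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * math.expm1(x)


def celu(x: float, alpha: float = 1.0) -> float:
    # Clamp the exponent's argument to <= 0 so large positive x cannot overflow.
    return max(0.0, x) + alpha * math.expm1(min(0.0, x) / alpha)


def one_sided_slopes(fn, alpha: float, h: float = 1e-6):
    """Finite-difference slopes just left and just right of x = 0."""
    left = (fn(0.0, alpha) - fn(-h, alpha)) / h
    right = (fn(h, alpha) - fn(0.0, alpha)) / h
    return left, right


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        e_left, e_right = one_sided_slopes(elu, alpha)
        c_left, c_right = one_sided_slopes(celu, alpha)
        print(f"alpha={alpha}: ELU slopes ({e_left:.4f}, {e_right:.4f})  "
              f"CELU slopes ({c_left:.4f}, {c_right:.4f})")
```

For ELU the left slope tracks $\alpha$ while the right slope stays at 1; for CELU both sides should print values close to 1.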

PyTorch:

x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.CELU(alpha=1.0)(x))     # tensor([-0.8647, -0.6321,  0.0000,  1.0000,  2.0000])
print(nn.CELU(alpha=0.5)(x))     # tensor([-0.4908, -0.4323,  0.0000,  1.0000,  2.0000]); saturates toward -alpha = -0.5

TensorFlow:

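# CELU is not built into every TensorFlow release; implement from formula
def celu(x, alpha=1.0):
    # Clamp before exp: tf.where evaluates both branches, and an overflowing
    # exp for large positive x would produce NaN gradients.
    neg = alpha * (tf.exp(tf.minimum(x, 0.0) / alpha) - 1.0)
    return tf.where(x > 0, x, neg)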

x = tf.constant([-2., -1., 0., 1., 2.])
print(celu(x))   # [-0.8647 -0.6321  0.      1.      2.    ]

SELU — Scaled Exponential Linear Unit

$$f(x) = \lambda \cdot \operatorname{ELU}(x, \alpha) = \lambda \begin{cases} x & x > 0 \\ \alpha\,(e^x - 1) & x \leq 0 \end{cases}$$

where $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$. These constants were derived analytically to produce a self-normalizing property: if the inputs to a SELU layer have mean 0 and variance 1, the outputs will also have mean ≈ 0 and variance ≈ 1, and nearby statistics are pulled back toward these values from layer to layer (a fixed-point argument).

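This means SELU networks can train without BatchNorm, because the activations self-regulate. The guarantee assumes fully connected layers with LeCun normal initialization (zero-mean weights with variance $1/\text{fan\_in}$) and, if dropout is needed, nn.AlphaDropout in place of standard dropout. He initialization uses variance $2/\text{fan\_in}$ and does not satisfy the assumption.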
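
The fixed point itself can be verified without a deep learning framework. The sketch below integrates SELU against a Gaussian density with a simple midpoint rule and reports the output mean and variance. The full-precision constants are the values commonly used in framework implementations; the function name `output_moments`, the integration range, and the step count are assumptions for illustration.

```python
import math

# Full-precision SELU constants (rounded in the text to 1.0507 and 1.6733).
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x: float) -> float:
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * math.expm1(x))


def output_moments(mean_in: float = 0.0, std_in: float = 1.0,
                   half_width: float = 10.0, steps: int = 20_000):
    """Mean and variance of selu(X) for X ~ N(mean_in, std_in^2).

    Uses a midpoint rule over the standardized variable u in
    [-half_width, half_width], with X = mean_in + std_in * u.
    """
    du = 2.0 * half_width / steps
    m1 = m2 = 0.0
    for i in range(steps):
        u = -half_width + (i + 0.5) * du
        weight = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi) * du
        y = selu(mean_in + std_in * u)
        m1 += weight * y
        m2 += weight * y * y
    return m1, m2 - m1 * m1


if __name__ == "__main__":
    mean_out, var_out = output_moments()
    print(f"input N(0, 1) -> output mean {mean_out:.5f}, variance {var_out:.5f}")
```

This only checks the activation in isolation. In a real network the result also depends on the weight statistics, which is why the initialization matters.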

PyTorch:

# Constants are fixed: lambda=1.0507, alpha=1.6733
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.SELU()(x))   # tensor([-1.5202, -1.1113,  0.0000,  1.0507,  2.1014])
# Self-normalisation needs LeCun normal init (and AlphaDropout if dropout is used).
# kaiming_normal_ with nonlinearity='linear' has gain 1, i.e. std = 1/sqrt(fan_in):
# nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='linear')

TensorFlow:

x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.selu(x))   # [-1.5202 -1.1113  0.      1.0507  2.1014]
# In a model (with LeCun init):
# tf.keras.layers.Dense(64, activation='selu',
#     kernel_initializer='lecun_normal')

Comparison Table

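| Activation | Smooth | Monotone | Self-gating | Non-linear negative |
|---|---|---|---|---|
| GELU | Yes | No (minimum near $x \approx -0.75$) | Yes (via CDF) | Yes |
| SiLU | Yes | No | Yes (via sigmoid) | Yes |
| Mish | Yes | No | Yes (via tanh·softplus) | Yes |
| ELU | Yes ($C^1$ when $\alpha = 1$) | Yes | No | Yes (exponential) |
| CELU | Yes ($C^1$) | Yes | No | Yes (exponential) |
| SELU | No (slope jumps from $\lambda\alpha$ to $\lambda$ at 0) | Yes | No | Yes (exponential; self-normalizing) |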
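
The monotonicity column can be reproduced numerically. This standard-library sketch evaluates each activation on a uniform grid and checks whether the values ever decrease. The grid range, the resolution, and the choice of $\alpha = 0.5$ for CELU are illustrative assumptions, and a grid check is evidence, not a proof.

```python
import math

SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x / (1.0 + math.exp(-x))


def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)


def celu(x, alpha=0.5):
    return max(0.0, x) + alpha * math.expm1(min(0.0, x) / alpha)


def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)


ACTIVATIONS = {"GELU": gelu, "SiLU": silu, "Mish": mish,
               "ELU": elu, "CELU": celu, "SELU": selu}


def is_non_decreasing(fn, lo=-6.0, hi=6.0, steps=2_400):
    """True if fn never decreases between neighbouring grid points."""
    values = [fn(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return all(b >= a for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    for name, fn in ACTIVATIONS.items():
        print(f"{name:5s} monotone on [-6, 6]: {is_non_decreasing(fn)}")
```
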
References
Hendrycks & Gimpel (2016). GELU. Introduced GELU; later adopted in BERT and GPT.
Ramachandran et al. (2017). Swish (SiLU). Found Swish through an automated search over candidate activation functions; reported that it matches or outperforms ReLU on deep models.
Misra (2019). Mish. Introduced Mish; demonstrated self-regularizing properties.
Clevert et al. (2016). ELU. Introduced ELU; showed faster convergence due to the mean-pushing property.
Klambauer et al. (2017). SELU (Self-Normalizing Networks). Derived the exact λ and α constants that produce the self-normalizing fixed point.