Smooth Modern Activations
- State the GELU formula and explain its probabilistic interpretation as an input-gated activation
- Explain why SiLU/Swish is non-monotonic and describe the self-gating property x * sigmoid(x)
- Compare ELU, CELU, and SELU in terms of negative saturation value and the conditions under which SELU induces self-normalisation
- Select among GELU, SiLU, Mish, ELU, and SELU for a given architecture based on smoothness, monotonicity, and self-normalisation requirements
GELU — Gaussian Error Linear Unit
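$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi$ is the standard normal CDF. The interpretation: GELU weights the input by $\Phi(x) = P(Z \le x)$ with $Z \sim \mathcal{N}(0, 1)$, which is the expected output if the input were kept with probability $\Phi(x)$ and zeroed otherwise. For large negative $x$, the gate closes (output ≈ 0); for large positive $x$, the gate opens (output ≈ $x$).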
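The gating interpretation can be checked numerically. The sketch below, which assumes only the Python standard library, compares the closed form x · Φ(x) with a Monte Carlo average in which the input is kept whenever a standard normal draw falls at or below it. The helper names `std_normal_cdf` and `gated_mean` are illustrative, not library functions.

```python
import math
import random


def std_normal_cdf(x: float) -> float:
    """Phi(x) = P(Z <= x) for Z ~ N(0, 1), computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact GELU: the input scaled by Phi(x)."""
    return x * std_normal_cdf(x)


def gated_mean(x: float, n_samples: int = 50_000, seed: int = 0) -> float:
    """Average of x * 1[Z <= x] over random draws Z ~ N(0, 1)."""
    rng = random.Random(seed)
    kept = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) <= x)
    return x * kept / n_samples


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  monte_carlo={gated_mean(x):+.4f}")
```

The two columns should agree up to sampling noise, which shrinks as `n_samples` grows.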
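Approximation: Computing $\Phi(x)$ exactly requires the error function. A cheaper tanh approximation is often used instead:
$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$$
This approximation stays close to the exact form for all $x$. In PyTorch, approximate='none' (the default) computes the exact form, and approximate='tanh' selects the approximation.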
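To see how closely the tanh form tracks the exact one, the following sketch evaluates both on a grid and reports the largest gap. It assumes only the Python standard library; the grid range [-6, 6] and step 0.01 are arbitrary choices for illustration.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU using the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Evaluate both forms on an evenly spaced grid and record the largest gap.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"largest |exact - tanh| on [-6, 6]: {max_gap:.2e}")
```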
Where it's used: BERT, GPT-2, GPT-3, ViT. GELU is the default activation in many transformer architectures.
PyTorch:
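# These imports are shared by the PyTorch snippets in this section
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2., -1., 0., 1., 2.])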
print(nn.GELU()(x)) # tensor([-0.0455, -0.1587, 0.0000, 0.8413, 1.9545])
print(nn.GELU(approximate='tanh')(x)) # near-identical (tanh approx)
print(F.gelu(x, approximate='none')) # exact via erf
TensorFlow:
import tensorflow as tf  # shared by the TensorFlow snippets in this section
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.gelu(x, approximate=False)) # exact
print(tf.keras.activations.gelu(x, approximate=True)) # tanh approx
# In a model: tf.keras.layers.Dense(64, activation='gelu')
SiLU / Swish — Sigmoid Linear Unit
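$$\text{SiLU}(x) = x \cdot \sigma(x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
Swish is GELU's close relative: instead of gating by the normal CDF, it gates by the sigmoid. The output is the input scaled by a soft gate that ranges from 0 (as $x \to -\infty$) to 1 (as $x \to +\infty$).
Key property, non-monotonic: Unlike ReLU, SiLU has a minimum near $x \approx -1.28$, where $\text{SiLU}(x) \approx -0.28$. The function decreases to the left of that point and increases to the right of it, so small negative inputs still produce a small negative output and a nonzero gradient instead of being zeroed.
Derivative:
$$\text{SiLU}'(x) = \sigma(x)\,\bigl(1 + x\,(1 - \sigma(x))\bigr)$$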
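The location of the dip follows from the derivative σ(x)(1 + x(1 - σ(x))), which changes sign once on the negative axis. As a rough sketch using only the standard library, bisection on that derivative locates the minimum; the bracket [-2, -1] is an assumption chosen because the derivative is negative at -2 and positive at -1.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


# Bisection: the derivative is negative at -2 and positive at -1.
lo, hi = -2.0, -1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if silu_grad(mid) < 0.0:
        lo = mid
    else:
        hi = mid

x_min = 0.5 * (lo + hi)
print(f"minimum at x = {x_min:.4f}, SiLU(x) = {silu(x_min):.4f}")
```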
Where it's used: EfficientNet, MobileNetV3 (as the cheaper hard-swish variant), and LLaMA and Mistral (inside their SwiGLU feed-forward blocks).
PyTorch:
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.SiLU()(x)) # tensor([-0.2384, -0.2689, 0.0000, 0.7311, 1.7616])
print(F.silu(x)) # identical
# Note: non-monotone; minimum near x ≈ -1.28
TensorFlow:
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.swish(x)) # [-0.2384 -0.2689 0. 0.7311 1.7616]
print(tf.nn.swish(x)) # identical
# In a model: tf.keras.layers.Dense(64, activation='swish')
Mish — Self-Regularized Non-Monotone
$$\text{Mish}(x) = x \cdot \tanh(\text{softplus}(x)), \qquad \text{softplus}(x) = \ln(1 + e^{x})$$
Mish combines the smoothness of tanh with the unboundedness of ReLU. Like SiLU, it is non-monotone and self-gating: the gate is $\tanh(\text{softplus}(x))$, and because softplus is a smooth version of max(0,x), the gate rises smoothly from 0 to 1. The "self-regularizing" label refers to this adaptive soft gate.
Comparison to SiLU: Both functions are infinitely differentiable and have nearly the same shape; Mish dips slightly lower in the negative region (about $-0.31$ versus $-0.28$ for SiLU). Mish has been reported to slightly outperform SiLU on some computer vision benchmarks, at higher computational cost.
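For a side-by-side look at the two curves, here is a small standard-library sketch that evaluates Mish and SiLU at the same sample points used elsewhere in this section. The function names are local helpers, not framework APIs.

```python
import math


def softplus(x: float) -> float:
    """ln(1 + e^x); log1p keeps precision when e^x is small."""
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    return x * math.tanh(softplus(x))


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  mish={mish(x):+.4f}  silu={silu(x):+.4f}")
```

At these points Mish sits slightly below SiLU for negative inputs and slightly above it for positive ones.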
PyTorch:
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.Mish()(x)) # tensor([-0.2525, -0.3034, 0.0000, 0.8651, 1.9440])
print(F.mish(x)) # identical
TensorFlow:
x = tf.constant([-2., -1., 0., 1., 2.])
# Mish may not be built into your TensorFlow version (recent Keras releases
# add keras.activations.mish); implementing it from the formula is simple
mish = lambda x: x * tf.math.tanh(tf.math.softplus(x))
print(mish(x)) # [-0.2525 -0.3034 0. 0.8651 1.9440]
ELU — Exponential Linear Unit
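$$\text{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$
ELU uses an exponential curve for negative inputs, making it smooth at $x = 0$ when $\alpha = 1$: $\text{ELU}'(0) = 1$ from both sides. The negative region saturates at $-\alpha$, typically with $\alpha = 1$.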
Advantage over ReLU: ELU pushes mean activations toward zero (like batch normalization, but inherent), which speeds convergence. The exponential negative region provides a useful non-zero negative output.
Gradient for $x \le 0$: $\text{ELU}'(x) = \alpha\,e^{x} = \text{ELU}(x) + \alpha$.
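The negative-side gradient α·exp(x) can be sanity-checked against a finite difference and contrasted with ReLU, whose gradient is exactly zero there. This is a sketch with only the standard library; `elu_grad` and `relu_grad` are hand-written helpers for illustration.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Analytic gradient: 1 for x > 0, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


h = 1e-6  # step for the central finite difference
for x in (-5.0, -2.0, -0.5):
    numeric = (elu(x + h) - elu(x - h)) / (2.0 * h)
    print(f"x={x:+.1f}  elu_grad={elu_grad(x):.5f}  "
          f"finite_diff={numeric:.5f}  relu_grad={relu_grad(x):.1f}")
```

The ELU gradient shrinks as the output approaches -α but stays positive, whereas ReLU passes no gradient at all for negative inputs.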
PyTorch:
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.ELU(alpha=1.0)(x)) # tensor([-0.8647, -0.6321, 0.0000, 1.0000, 2.0000])
# Gradient for x<=0: alpha * exp(x); never exactly zero, but it decays toward 0 as the output saturates at -alpha
TensorFlow:
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.elu(x)) # [-0.8647 -0.6321 0. 1. 2. ]
print(tf.nn.elu(x)) # identical
# In a model: tf.keras.layers.ELU(alpha=1.0)
CELU — Continuously Differentiable ELU
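$$\text{CELU}(x) = \max(0, x) + \min\bigl(0,\ \alpha\,(e^{x/\alpha} - 1)\bigr)$$
CELU fixes a subtle issue with ELU: ELU's derivative at $x = 0$ from the left is $\alpha$, which equals 1 only when $\alpha = 1$. For other values, ELU is not $C^1$ (continuously differentiable). CELU reparameterizes the negative region as $\alpha\,(e^{x/\alpha} - 1)$ so that the left derivative at 0 is 1 for any $\alpha > 0$.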
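The difference shows up in the one-sided slope at zero. The sketch below, standard library only, estimates the left derivative of ELU and CELU at 0 with a backward finite difference for several values of α; `left_slope` is a hypothetical helper written for this illustration.

```python
import math


def elu(x: float, alpha: float) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def celu(x: float, alpha: float) -> float:
    return x if x > 0 else alpha * (math.exp(x / alpha) - 1.0)


def left_slope(fn, alpha: float, h: float = 1e-6) -> float:
    """Backward finite-difference estimate of the derivative at 0 from the left."""
    return (fn(0.0, alpha) - fn(-h, alpha)) / h


for alpha in (0.5, 1.0, 2.0):
    print(f"alpha={alpha}:  ELU left slope={left_slope(elu, alpha):.4f}  "
          f"CELU left slope={left_slope(celu, alpha):.4f}")
```

Both functions have slope 1 on the right of zero, so only CELU matches on both sides for every α.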
PyTorch:
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.CELU(alpha=1.0)(x)) # tensor([-0.8647, -0.6321, 0.0000, 1.0000, 2.0000])
print(nn.CELU(alpha=0.5)(x)) # different scaling in negative region
TensorFlow:
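# CELU may not be built into your TensorFlow version; a direct implementation:
def celu(x, alpha=1.0):
    # Clamp before exp so large positive x cannot overflow in the unused branch
    neg = alpha * (tf.exp(tf.minimum(x, 0.0) / alpha) - 1.0)
    return tf.where(x > 0, x, neg)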
x = tf.constant([-2., -1., 0., 1., 2.])
print(celu(x)) # [-0.8647 -0.6321 0. 1. 2. ]
SELU — Scaled Exponential Linear Unit
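$$\text{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$
where $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$. These constants were derived analytically to produce a self-normalizing property: if the inputs to a SELU layer have mean 0 and variance 1, and the weights have zero mean and variance $1/\text{fan\_in}$, the outputs will also have mean ≈ 0 and variance ≈ 1 (a fixed-point argument).
This means deep fully connected SELU networks can train without BatchNorm, because the activations self-regulate. The property depends on LeCun normal initialization and on nn.AlphaDropout in place of standard dropout; He initialization doubles the weight variance and pushes the activations away from the fixed point.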
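The fixed point can be illustrated with a quick simulation: draw standard normal pre-activations, apply SELU, and measure the mean and variance of the result. This is a sketch using only the standard library; the constants are the full-precision values that the rounded figures 1.0507 and 1.6733 come from, and the sample size and seed are arbitrary.

```python
import math
import random

# Full-precision SELU constants (rounded in the text to 1.6733 and 1.0507).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x: float) -> float:
    return SCALE * x if x > 0 else SCALE * ALPHA * (math.exp(x) - 1.0)


rng = random.Random(0)
n = 200_000
outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(n)]

mean = sum(outputs) / n
var = sum((y - mean) ** 2 for y in outputs) / n
print(f"output mean = {mean:+.4f}, output variance = {var:.4f}")
```

Both statistics should land close to 0 and 1, up to sampling noise. Repeating the experiment with pre-activations of variance 2, as He initialization would give, moves the output statistics away from those values.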
PyTorch:
# Constants are fixed: lambda=1.0507, alpha=1.6733
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.SELU()(x)) # tensor([-1.5202, -1.1113, 0.0000, 1.0507, 2.1014])
# Must use AlphaDropout + LeCun init for self-normalisation to hold:
# nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='linear')
TensorFlow:
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.selu(x)) # [-1.5202 -1.1113 0. 1.0507 2.1014]
# In a model (with LeCun init):
# tf.keras.layers.Dense(64, activation='selu',
# kernel_initializer='lecun_normal')
Comparison Table
| Activation | Smooth | Monotone | Self-gating | Non-linear negative |
|---|---|---|---|---|
| GELU | Yes | No (slight dip near x ≈ -0.75) | Yes (via CDF) | Yes |
| SiLU | Yes | No | Yes (via sigmoid) | Yes |
| Mish | Yes | No | Yes (via tanh(softplus(x))) | Yes |
| ELU | Yes ($C^1$ only for $\alpha = 1$) | Yes | No | Yes (exponential) |
| CELU | Yes ($C^1$ for any $\alpha$) | Yes | No | Yes (exponential) |
| SELU | No (slope jumps at 0) | Yes | No | Yes (scaled exponential, self-normalizing) |
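The monotonicity column can be checked empirically. The following sketch, standard library only, evaluates each activation on a grid over [-6, 6] and tests whether the values ever decrease. The grid, the α values (1.0 for ELU, 0.5 for CELU), and the helper `is_monotone` are illustrative assumptions; a grid check is evidence, not a proof.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x * sigmoid(x)


def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def celu(x, alpha=0.5):
    return x if x > 0 else alpha * (math.exp(x / alpha) - 1.0)


def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    return scale * x if x > 0 else scale * alpha * (math.exp(x) - 1.0)


def is_monotone(fn, lo=-6.0, hi=6.0, steps=1200):
    """True if fn never decreases between consecutive grid points."""
    ys = [fn(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return all(b >= a for a, b in zip(ys, ys[1:]))


ACTIVATIONS = {"GELU": gelu, "SiLU": silu, "Mish": mish,
               "ELU": elu, "CELU": celu, "SELU": selu}
results = {name: is_monotone(fn) for name, fn in ACTIVATIONS.items()}
for name, monotone in results.items():
    print(f"{name:5s} monotone on [-6, 6]: {monotone}")
```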