Supplement · Activation Functions

Gating & Normalization Activations

By the end of this reading you will be able to:
  • Explain how GLU uses a sigmoid gate to control information flow and state which dimension is split
  • Apply Softmax to a logit vector and verify that the outputs are non-negative and sum to one
  • Distinguish Softmax from LogSoftmax in terms of output and explain why LogSoftmax is numerically preferred when combined with NLLLoss
  • Identify Hardswish as a cheap approximation of Swish built from a piecewise-linear gate and state its advantage for mobile deployment

GLU — Gated Linear Unit

$$\text{GLU}(X, W, V, b, c) = (XW + b) \otimes \sigma(XV + c)$$

GLU combines two linear projections of the input: a linear component $(XW + b)$ and a gate $\sigma(XV + c)$, multiplied element-wise ($\otimes$). The gate, a sigmoid applied to a learned linear projection, scales each element of the linear component by a factor in $(0, 1)$ and so controls how much of it flows through.

In practice, a single weight matrix $[W \; V] \in \mathbb{R}^{d \times 2d}$ (the two projections stored side by side) projects the input to twice the hidden size, then the output is split in half along the last dimension:

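# Body of a module's forward(); assumes self.linear = nn.Linear(d, 2 * d)
x_proj = self.linear(x)                  # shape: (B, 2d)
signal, gate = x_proj.chunk(2, dim=-1)   # same half ordering as nn.GLU
return signal * torch.sigmoid(gate)      # shape: (B, d)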

GLU halves the feature dimension, so the network must first project to $2d$ to produce a $d$-dimensional output. This is why a GLU layer has twice the parameters of a plain linear layer with the same output width.
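The parameter doubling can be checked directly. The following is a minimal sketch, assuming a hypothetical `GatedLinear` module (the name and the choice $d = 16$ are illustrative, not from the reading) that fuses both projections into one `nn.Linear` and compares its output with PyTorch's functional GLU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedLinear(nn.Module):
    """Hypothetical GLU layer: one Linear to 2 * d_out, then split and gate."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # One projection that stores the signal and gate weights side by side.
        self.linear = nn.Linear(d_in, 2 * d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        signal, gate = self.linear(x).chunk(2, dim=-1)
        return signal * torch.sigmoid(gate)


torch.manual_seed(0)
d = 16
layer = GatedLinear(d, d)
plain = nn.Linear(d, d)
x = torch.randn(4, d)

out = layer(x)
n_gated = sum(p.numel() for p in layer.parameters())
n_plain = sum(p.numel() for p in plain.parameters())
matches_builtin = torch.allclose(out, F.glu(layer.linear(x), dim=-1))

print(out.shape)          # torch.Size([4, 16])
print(n_gated, n_plain)   # 544 272
print(matches_builtin)    # True
```

The gated layer holds 16·32 + 32 = 544 parameters against 16·16 + 16 = 272 for the plain layer, exactly twice as many for the same output width.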

Where it's used: Language models (Dauphin et al., 2017). Later variants swap the sigmoid gate for another activation: GeGLU uses a GELU gate and SwiGLU uses a SiLU gate. SwiGLU is the feed-forward activation in PaLM, LLaMA, and Mistral.
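To make the relation between GLU and its variants concrete, here is a hedged sketch of a gated feed-forward block in which only the gate activation changes. The class name `GLUVariantFFN`, the bias-free layers, and the sizes are assumptions for illustration; the structure loosely follows the gated FFN form described by Shazeer (2020) and is not taken from any particular model's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUVariantFFN(nn.Module):
    """Hypothetical gated feed-forward block; `gate_fn` selects the variant."""

    def __init__(self, d_model: int, d_hidden: int, gate_fn=F.silu):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_signal = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)
        self.gate_fn = gate_fn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate branch is passed through the activation, signal branch stays linear.
        gated = self.gate_fn(self.w_gate(x)) * self.w_signal(x)
        return self.w_out(gated)


torch.manual_seed(0)
x = torch.randn(2, 5, 32)                             # (batch, tokens, d_model)
swiglu = GLUVariantFFN(32, 64, gate_fn=F.silu)        # SwiGLU: SiLU gate
geglu = GLUVariantFFN(32, 64, gate_fn=F.gelu)         # GeGLU: GELU gate
glu = GLUVariantFFN(32, 64, gate_fn=torch.sigmoid)    # original sigmoid gate

y_swiglu, y_geglu, y_glu = swiglu(x), geglu(x), glu(x)
print(y_swiglu.shape, y_geglu.shape, y_glu.shape)
```

All three blocks map `(2, 5, 32)` back to `(2, 5, 32)`; the only difference between them is the function applied to the gate branch.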

PyTorch:

import torch
import torch.nn as nn

# nn.GLU splits the last dim in half: output is first_half * sigmoid(second_half)
x = torch.randn(2, 8)         # input: (batch=2, features=8)
glu = nn.GLU(dim=-1)
print(glu(x).shape)           # torch.Size([2, 4])

# Manual equivalent (used when building custom GLU variants):
a, b = x.chunk(2, dim=-1)
out = a * torch.sigmoid(b)    # shape: (2, 4)

TensorFlow:

# No built-in GLU; implement via split + sigmoid
import tensorflow as tf

x = tf.random.normal((2, 8))
a, b = tf.split(x, 2, axis=-1)   # each: (2, 4)
out = a * tf.nn.sigmoid(b)        # GLU output: (2, 4)

Hardswish — Mobile-Optimized Swish

$$f(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$

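Hardswish is an approximation of SiLU/Swish designed for mobile inference. It replaces the sigmoid gate with a piecewise-linear "hard sigmoid", which makes the function zero for $x \leq -3$, quadratic in the middle, and the identity for $x \geq 3$: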

$$f(x) = \begin{cases} 0 & x \leq -3 \\ x(x+3)/6 & -3 < x < 3 \\ x & x \geq 3 \end{cases}$$

The formula $x \cdot \text{ReLU6}(x+3)/6$ reuses the ReLU6 operation (which hardware accelerators handle natively) to approximate the sigmoid gate, $\sigma(x) \approx (x+3)/6$, in the active region $-3 < x < 3$. This avoids expensive exponential/sigmoid computation.
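How close the approximation is can be measured numerically. The sketch below, on an arbitrarily chosen grid over $[-6, 6]$, rebuilds Hardswish from ReLU6, checks it against `F.hardswish`, and reports the largest gap to SiLU:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6.0, 6.0, steps=1201)    # grid with spacing 0.01

hard_gate = F.relu6(x + 3.0) / 6.0           # piecewise-linear "hard sigmoid"
manual_hardswish = x * hard_gate
builtin_hardswish = F.hardswish(x)

same_as_builtin = torch.allclose(manual_hardswish, builtin_hardswish, atol=1e-6)
max_gap_to_silu = (builtin_hardswish - F.silu(x)).abs().max().item()
min_value = builtin_hardswish.min().item()   # analytic minimum is -3/8 at x = -1.5

print(same_as_builtin)             # True
print(round(max_gap_to_silu, 3))   # 0.142, reached at x = -3 and x = 3
print(round(min_value, 3))         # -0.375
```

The gap peaks at the two kinks, where the hard gate saturates at exactly 0 or 1 while the true sigmoid is still about 0.047 away from its limit.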

Where it's used: MobileNetV3, EfficientNet-Lite.

PyTorch:

import torch.nn.functional as F   # needed for the functional form below
x = torch.tensor([-4., -3., 0., 3., 4.])
print(nn.Hardswish()(x))   # values 0, 0, 0, 3, 4 (the zeros for x <= -3 may print as -0.)
print(F.hardswish(x))      # identical

TensorFlow:

x = tf.constant([-4., -3., 0., 3., 4.])
# TF 2.x: tf.keras.activations.hard_swish (available from Keras 3 / TF 2.13+)
# Manual implementation:
hard_swish = lambda x: x * tf.nn.relu6(x + 3.0) / 6.0
print(hard_swish(x))   # [0. 0. 0. 3. 4.]

Softmax

$$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$

Softmax converts a vector of $K$ real numbers (logits) into a probability distribution: all outputs are in $(0,1)$ and they sum to 1. It is the canonical output activation for multi-class classification.

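Numerical stability: For large logits, $e^{x_i}$ overflows float32 (this happens once $x_i$ exceeds roughly 88.7). The standard fix, the same shift used in the log-sum-exp trick, subtracts the maximum logit before exponentiating: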

$$f(x_i) = \frac{e^{x_i - \max_j x_j}}{\sum_{j} e^{x_j - \max_j x_j}}$$

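This is mathematically identical, because the common factor $e^{-\max_j x_j}$ cancels between numerator and denominator, but every exponent is now at most 0, so the exponentials cannot overflow. PyTorch applies this internally.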
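The effect of the shift is easy to reproduce. The following sketch contrasts a deliberately naive softmax with a max-shifted one; both helper functions are written here for illustration and are not PyTorch APIs:

```python
import torch


def naive_softmax(x: torch.Tensor) -> torch.Tensor:
    e = torch.exp(x)                                    # inf for large float32 logits
    return e / e.sum(dim=-1, keepdim=True)              # inf / inf gives nan


def stable_softmax(x: torch.Tensor) -> torch.Tensor:
    shifted = x - x.max(dim=-1, keepdim=True).values    # largest entry becomes 0
    e = torch.exp(shifted)                              # every value lies in (0, 1]
    return e / e.sum(dim=-1, keepdim=True)


big = torch.tensor([1000.0, 999.0, 998.0])
naive = naive_softmax(big)
stable = stable_softmax(big)

print(naive)    # tensor([nan, nan, nan])
print(stable)   # tensor([0.6652, 0.2447, 0.0900])
print(torch.allclose(stable, torch.softmax(big, dim=-1)))   # True
```

The stable result equals the softmax of `[2, 1, 0]`, since only the differences between logits matter.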

Temperature scaling: $f(x_i, T) = e^{x_i/T} / \sum_j e^{x_j/T}$ for $T > 0$. As $T \to 0^+$, the output approaches a one-hot vector at the argmax; as $T \to \infty$, it approaches a uniform distribution.
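Both limits can be seen with a few concrete temperatures. In this sketch the helper `softmax_with_temperature` and the values 0.05 and 100 are illustrative choices, not part of any library:

```python
import torch


def softmax_with_temperature(logits: torch.Tensor, T: float) -> torch.Tensor:
    # Dividing the logits by T before Softmax implements the formula above.
    return torch.softmax(logits / T, dim=-1)


logits = torch.tensor([2.0, 1.0, 0.1])
sharp = softmax_with_temperature(logits, 0.05)    # low T: close to one-hot
base = softmax_with_temperature(logits, 1.0)      # T = 1: ordinary Softmax
flat = softmax_with_temperature(logits, 100.0)    # high T: close to uniform

print(sharp)
print(base)    # tensor([0.6590, 0.2424, 0.0986])
print(flat)
```

At $T = 0.05$ nearly all of the mass sits on the first entry, while at $T = 100$ the three probabilities differ by less than 0.01.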

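Important: Do not apply Softmax before nn.CrossEntropyLoss. The loss expects raw logits and applies LogSoftmax internally for numerical stability, so an extra Softmax normalizes the scores twice and distorts the loss and its gradients.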
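A small experiment shows what goes wrong. This sketch reuses the example logits from this section with an assumed target class of 0 and compares the loss on raw logits with the loss on already-normalized probabilities:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1]])    # (batch=1, classes=3)
target = torch.tensor([0])                  # assumed correct class for the demo
criterion = nn.CrossEntropyLoss()

loss_correct = criterion(logits, target)                          # raw logits in
loss_double = criterion(torch.softmax(logits, dim=-1), target)    # normalized twice

print(round(loss_correct.item(), 4))    # 0.417, i.e. -log(0.6590)
print(loss_double.item() > loss_correct.item())    # True
```

Because probabilities are confined to $[0, 1]$, the second Softmax sees nearly equal inputs and cannot express a confident prediction, so the loss stays high even when the model is right.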

PyTorch:

logits = torch.tensor([2.0, 1.0, 0.1])
probs = nn.Softmax(dim=-1)(logits)
print(probs)             # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())       # tensor(1.)

TensorFlow:

logits = tf.constant([2.0, 1.0, 0.1])
probs = tf.nn.softmax(logits, axis=-1)
print(probs)             # [0.659  0.2424 0.0986]
print(tf.reduce_sum(probs))  # 1.0
# In a model output: tf.keras.layers.Dense(K, activation='softmax')

LogSoftmax

$$f(x_i) = x_i - \log\!\left(\sum_{j=1}^{K} e^{x_j}\right) = \log\frac{e^{x_i}}{\sum_j e^{x_j}}$$

LogSoftmax computes the log of Softmax in a numerically stable single pass, outputting log-probabilities in $(-\infty, 0]$. Computing Softmax first and taking the log afterwards is less safe: a very small probability can underflow to 0, and its log is then $-\infty$. LogSoftmax is designed to pair with nn.NLLLoss: $\text{CrossEntropyLoss}(x, y) = \text{NLLLoss}(\text{LogSoftmax}(x), y)$.
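Both claims, the loss identity and the stability advantage, can be verified in a few lines. The extreme logits `[0, 200]` below are an arbitrary choice made to force underflow in float32:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

# Identity: CrossEntropyLoss(x, y) == NLLLoss(LogSoftmax(x), y)
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=-1), target)
same_loss = torch.allclose(ce, nll)

# Stability: the two-step route underflows, the fused route does not
extreme = torch.tensor([[0.0, 200.0]])
two_step = torch.log(torch.softmax(extreme, dim=-1))    # first entry: log(0) = -inf
one_step = F.log_softmax(extreme, dim=-1)               # first entry: -200.0

print(same_loss)    # True
print(two_step)
print(one_step)
```

An infinite log-probability would turn into an infinite loss if that class were the target, which is why the fused version is the one paired with NLLLoss.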

PyTorch:

logits = torch.tensor([2.0, 1.0, 0.1])
log_probs = nn.LogSoftmax(dim=-1)(logits)
print(log_probs)         # tensor([-0.4170, -1.4170, -2.3170])
# With NLLLoss: loss = nn.NLLLoss()(log_probs.unsqueeze(0), torch.tensor([0]))

TensorFlow:

logits = tf.constant([2.0, 1.0, 0.1])
log_probs = tf.nn.log_softmax(logits, axis=-1)
print(log_probs)         # [-0.4170 -1.4170 -2.3170]
# Combined loss: tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

Softmax2d

Applies Softmax over the channel dimension at each spatial location of a 4D tensor of shape $(N, C, H, W)$. Equivalent to F.softmax(x, dim=1). At each pixel $(h, w)$, the $C$ channel values are normalized to a probability distribution:

$$\text{Softmax2d}(x)_{n,c,h,w} = \frac{e^{x_{n,c,h,w}}}{\sum_{c'} e^{x_{n,c',h,w}}}$$

Used in semantic segmentation where each pixel needs a class probability distribution.
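As a sketch of the segmentation use, the snippet below treats a random tensor as stand-in per-pixel class scores (5 classes on an 8×8 grid, both assumed for illustration), checks the equivalence with `F.softmax(x, dim=1)`, and derives a class mask:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(2, 5, 8, 8)    # (N, C, H, W): 5 class scores per pixel

probs = nn.Softmax2d()(scores)
same_as_dim1 = torch.allclose(probs, F.softmax(scores, dim=1))
sums_to_one = torch.allclose(probs.sum(dim=1), torch.ones(2, 8, 8))

# Predicted class per pixel: index of the most probable channel
mask = probs.argmax(dim=1)          # shape (N, H, W), values in 0..4

print(same_as_dim1)    # True
print(sums_to_one)     # True
print(mask.shape)      # torch.Size([2, 8, 8])
```

Since Softmax preserves the ordering of the scores, the mask could equally be taken from the raw scores; the probabilities matter when a calibrated per-pixel confidence or a loss is needed.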

PyTorch:

# Normalizes across the channel dimension at each spatial location
x = torch.randn(1, 3, 4, 4)   # (N, C, H, W)
out = nn.Softmax2d()(x)
print(out.shape)               # torch.Size([1, 3, 4, 4])
print(out[0, :, 0, 0].sum())   # tensor(1.) — channels sum to 1 at each pixel

TensorFlow:

x = tf.random.normal((1, 4, 4, 3))  # TF uses (N, H, W, C)
out = tf.nn.softmax(x, axis=-1)     # normalize over channel axis
print(tf.reduce_sum(out[0, 0, 0]))  # 1.0

Softmin

$$f(x_i) = \frac{e^{-x_i}}{\sum_{j=1}^{K} e^{-x_j}}$$

Softmin is identical to Softmax applied to the negated input: Softmin(x) = Softmax(-x). It assigns higher probabilities to smaller values — useful when you want to emphasize the closest (minimum distance) element, e.g., nearest-neighbor attention.
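The distance-based use can be sketched with a toy example. The query point and the three candidates below are hypothetical values chosen so that the distances come out as 5, 1, and 2:

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: one query and three candidate points in 2-D
query = torch.tensor([0.0, 0.0])
candidates = torch.tensor([[3.0, 4.0],     # distance 5
                           [1.0, 0.0],     # distance 1
                           [0.0, 2.0]])    # distance 2
distances = (candidates - query).norm(dim=-1)

weights = F.softmin(distances, dim=-1)
same_as_neg_softmax = torch.allclose(weights, F.softmax(-distances, dim=-1))

print(distances)             # tensor([5., 1., 2.])
print(weights.argmax())      # tensor(1): the closest candidate gets the largest weight
print(same_as_neg_softmax)   # True
```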

PyTorch:

logits = torch.tensor([2.0, 1.0, 0.1])
print(nn.Softmin(dim=-1)(logits))   # tensor([0.0961, 0.2613, 0.6426])
# The smallest value gets the largest weight; these are not the Softmax outputs in reverse order

TensorFlow:

logits = tf.constant([2.0, 1.0, 0.1])
# Softmin(x) = Softmax(-x)
print(tf.nn.softmax(-logits, axis=-1))   # approx. [0.0961 0.2613 0.6426]

Comparison Table

| Function | Output | Use case |
|---|---|---|
| GLU | $\mathbb{R}^{d/2}$ (half the input width $d$) | Transformer FFN gating |
| Hardswish | $[-3/8, \infty)$ | Mobile backbone hidden layers |
| Softmax | $[0,1]^K$, sums to 1 | Multi-class output layer |
| LogSoftmax | $(-\infty, 0]^K$ | Paired with NLLLoss for classification |
| Softmax2d | $[0,1]$ per channel, sums to 1 at each pixel | Semantic segmentation output |
| Softmin | $[0,1]^K$, sums to 1 | Distance-based attention |

References
Dauphin et al. (2017). Language Modeling with Gated Convolutional Networks. Introduced GLU for language modeling; showed that the gating mechanism outperforms ReLU in text tasks.
Howard et al. (2019). Searching for MobileNetV3. Introduced Hardswish as a mobile-efficient approximation of Swish.
Shazeer (2020). GLU Variants Improve Transformer. Demonstrated that replacing the FFN activation with GLU variants (GeGLU, SwiGLU) consistently improves language model perplexity.