Supplement · Activation Functions

Gating & Normalization Activations

15 min read

By the end of this reading you will be able to:

Explain how GLU uses a sigmoid gate to control information flow and state which dimension is split
Apply Softmax to a logit vector and verify that the outputs are non-negative and sum to one
Distinguish Softmax from LogSoftmax in terms of output and explain why LogSoftmax is numerically preferred when combined with NLLLoss
Identify Hardswish as an approximation of Swish that replaces the sigmoid with a piecewise-linear gate, and state its advantage for mobile deployment

GLU — Gated Linear Unit

$\text{GLU}(X, W, V, b, c) = (XW + b) \otimes \sigma(XV + c)$

GLU splits an input transformation into two halves: a linear component $(XW + b)$ and a gate $\sigma(XV + c)$. The gate, a sigmoid applied to a learned linear projection, controls how much of the linear component flows through.

In practice, a single weight matrix $W \in \mathbb{R}^{d \times 2d}$ projects the input to twice the hidden size, then the output is split in half along the last dimension:

x_proj = self.linear(x)                  # shape: (B, 2d)
signal, gate = x_proj.chunk(2, dim=-1)   # first half: signal, second half: gate (same order as nn.GLU)
return signal * torch.sigmoid(gate)      # shape: (B, d)

GLU halves the width of whatever it receives, so the network must project to $2d$ first to obtain a $d$-dimensional output. This is why a GLU layer has roughly double the parameters of a plain linear layer with the same output size.
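To make the halving and the parameter cost concrete, here is a framework-free sketch in plain Python. The helper names `glu` and `linear_param_count` and the layer sizes are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a single float."""
    return 1.0 / (1.0 + math.exp(-z))

def glu(projected):
    """Split an even-length vector in half: the first half is the signal,
    the second half goes through a sigmoid and acts as the gate."""
    half = len(projected) // 2
    signal, gate = projected[:half], projected[half:]
    return [s * sigmoid(g) for s, g in zip(signal, gate)]

def linear_param_count(d_in, d_out):
    """Weights plus biases of a dense layer d_in -> d_out."""
    return d_in * d_out + d_out

# A projected vector of width 2d = 4 yields a GLU output of width d = 2.
out = glu([1.0, -2.0, 0.0, 100.0])
print(out)  # [0.5, -2.0]: gate 0.0 halves the signal, gate 100.0 passes it through

d_in, d = 16, 8
plain = linear_param_count(d_in, d)      # layer that outputs d directly
gated = linear_param_count(d_in, 2 * d)  # projection a GLU needs to output d
print(plain, gated, gated / plain)       # 136 272 2.0
```

The second gate value saturates the sigmoid at 1, so the signal passes through unchanged, while a gate of 0 lets exactly half of it through.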

Where it's used: Language models (Dauphin et al., 2017); later variants like GeGLU (GELU gate) and SwiGLU (SiLU gate) are used in PaLM, LLaMA, Mistral.

PyTorch:

import torch
import torch.nn as nn

# nn.GLU splits the last dim in half: output is first_half * sigmoid(second_half)
x = torch.randn(2, 8)         # input: (batch=2, features=8)
glu = nn.GLU(dim=-1)
print(glu(x).shape)           # torch.Size([2, 4])

# Manual equivalent (used when building custom GLU variants):
a, b = x.chunk(2, dim=-1)
out = a * torch.sigmoid(b)    # shape: (2, 4)

TensorFlow:

# No built-in GLU; implement via split + sigmoid
import tensorflow as tf

x = tf.random.normal((2, 8))
a, b = tf.split(x, 2, axis=-1)   # each: (2, 4)
out = a * tf.nn.sigmoid(b)        # GLU output: (2, 4)

Hardswish — Mobile-Optimized Swish

$f(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$

Hardswish approximates SiLU/Swish for mobile inference by replacing the sigmoid with a piecewise-linear gate, which makes the function itself piecewise quadratic:

$f(x) = \begin{cases} 0 & x \leq -3 \\ x(x+3)/6 & -3 < x < 3 \\ x & x \geq 3 \end{cases}$

The formula $x \cdot \text{ReLU6}(x+3)/6$ reuses the ReLU6 operation (which hardware accelerators handle natively) to build the gate $\text{ReLU6}(x+3)/6$, a clipped linear stand-in for $\sigma(x)$ that equals $(x+3)/6$ for $-3 < x < 3$ and saturates at 0 and 1 outside that range. This avoids the exponential that a true sigmoid requires.
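A quick way to see how close the approximation is: the sketch below evaluates both functions in plain Python and measures their largest gap on a grid. The function names and the grid over $[-6, 6]$ are illustrative choices, not taken from any library.

```python
import math

def relu6(z):
    """Clamp z to the interval [0, 6]."""
    return min(max(z, 0.0), 6.0)

def hardswish(x):
    """x * ReLU6(x + 3) / 6"""
    return x * relu6(x + 3.0) / 6.0

def silu(x):
    """x * sigmoid(x), the function Hardswish approximates."""
    return x / (1.0 + math.exp(-x))

# Compare the two functions at a few points, including the minimum at x = -1.5
for x in (-4.0, -1.5, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  hardswish={hardswish(x):+.4f}  silu={silu(x):+.4f}")

# Largest absolute gap on an evenly spaced grid over [-6, 6]
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(hardswish(x) - silu(x)) for x in grid)
print(f"max |hardswish - silu| on grid: {max_gap:.3f}")
```

The two curves agree exactly at 0 and differ most near the clipping points $x = \pm 3$, where the hard gate has already saturated but the sigmoid has not.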

Where it's used: MobileNetV3 (Howard et al., 2019).

PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-4., -3., 0., 3., 4.])
print(nn.Hardswish()(x))   # values 0, 0, 0, 3, 4 (zeros for negative inputs may print as -0.)
print(F.hardswish(x))      # identical

TensorFlow:

import tensorflow as tf

x = tf.constant([-4., -3., 0., 3., 4.])
# Keras 3 provides a built-in hard_swish activation; availability under tf.keras depends on the installed version.
# Manual implementation:
def hard_swish(x):
    return x * tf.nn.relu6(x + 3.0) / 6.0

print(hard_swish(x))   # values 0, 0, 0, 3, 4

Softmax

$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$

Softmax converts a vector of $K$ real numbers (logits) into a probability distribution: all outputs are in $(0,1)$ and they sum to 1. It is the canonical output activation for multi-class classification.

Numerical stability: For large logits, $e^{x_i}$ overflows float32 (roughly when $x_i > 88$). The standard fix, the same shift used in the log-sum-exp trick, subtracts the maximum logit from every entry:

$f(x_i) = \frac{e^{x_i - \max_j x_j}}{\sum_{j} e^{x_j - \max_j x_j}}$

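This is mathematically identical, because the factor $e^{-\max_j x_j}$ cancels between numerator and denominator, but it is safe for any input magnitude since every exponent is at most 0. PyTorch applies this internally.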
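The effect of the max shift can be reproduced without any framework. In this sketch (function names are illustrative), plain Python raises `OverflowError` where float32 tensor code would typically produce `inf` and then `nan`.

```python
import math

def softmax_naive(logits):
    """Direct transcription of the definition; breaks for large logits."""
    exps = [math.exp(v) for v in logits]  # math.exp raises OverflowError for large v
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    """Same function with the maximum subtracted before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # every exponent is <= 0
    total = sum(exps)                          # at least 1, since exp(0) is included
    return [e / total for e in exps]

small = [2.0, 1.0, 0.1]
print(softmax_naive(small))
print(softmax_stable(small))    # agrees with the naive version up to rounding

large = [1000.0, 999.0, 998.1]  # same gaps as `small`, shifted by 998
try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)
print(softmax_stable(large))    # same distribution as for `small`
```

Because Softmax depends only on the differences between logits, shifting all of them by the same constant leaves the result unchanged.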

Temperature scaling: $f(x_i, T) = e^{x_i/T} / \sum_j e^{x_j/T}$. As $T \to 0^+$, the output approaches a one-hot vector at the argmax; as $T \to \infty$, it approaches a uniform distribution.
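Both limits are easy to observe numerically. The following sketch uses a hypothetical helper `softmax_t` and three arbitrary temperatures chosen for illustration.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits / temperature, with the max shift for stability."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.05, 1.0, 100.0):
    probs = softmax_t(logits, t)
    # low T: nearly one-hot; T = 1: ordinary Softmax; high T: close to uniform
    print(t, [round(p, 4) for p in probs])
```

Dividing by a small $T$ stretches the gaps between logits, so the largest one dominates; a large $T$ shrinks the gaps toward zero.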

Important: Do not apply Softmax before nn.CrossEntropyLoss. It expects raw logits and applies LogSoftmax internally for numerical stability, so an extra Softmax silently distorts the loss.

PyTorch:

import torch
import torch.nn as nn

logits = torch.tensor([2.0, 1.0, 0.1])
probs = nn.Softmax(dim=-1)(logits)
print(probs)             # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())       # tensor(1.)

TensorFlow:

import tensorflow as tf

logits = tf.constant([2.0, 1.0, 0.1])
probs = tf.nn.softmax(logits, axis=-1)
print(probs)             # [0.659  0.2424 0.0986]
print(tf.reduce_sum(probs))  # 1.0
# In a model output: tf.keras.layers.Dense(K, activation='softmax')

LogSoftmax

$f(x_i) = x_i - \log\!\left(\sum_{j=1}^{K} e^{x_j}\right) = \log\frac{e^{x_i}}{\sum_j e^{x_j}}$

LogSoftmax computes the log of Softmax in a single numerically stable pass, outputting log-probabilities in $(-\infty, 0]$. It is designed to pair with nn.NLLLoss: $\text{CrossEntropyLoss}(x, y) = \text{NLLLoss}(\text{LogSoftmax}(x), y)$.
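The sketch below shows, in plain Python, why computing the log-probabilities directly is safer than taking the log of Softmax. The helpers `log_softmax` and `nll` are illustrative stand-ins for the library versions, and the extreme logits are chosen to force underflow.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """x_i - logsumexp(x), with the max factored out of the logsumexp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class index."""
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]
print([round(v, 4) for v in log_softmax(logits)])  # [-0.417, -1.417, -2.317]
print(nll(log_softmax(logits), 0))                 # cross-entropy for target class 0

# Extreme logits: Softmax underflows to exactly 0.0, so its log is undefined
extreme = [1000.0, 0.0]
print(softmax(extreme))      # [1.0, 0.0]
print(log_softmax(extreme))  # [0.0, -1000.0], still finite
```

Taking `math.log` of the underflowed probability `0.0` would fail, while the direct formula keeps the full log-probability of $-1000$.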

PyTorch:

import torch
import torch.nn as nn

logits = torch.tensor([2.0, 1.0, 0.1])
log_probs = nn.LogSoftmax(dim=-1)(logits)
print(log_probs)         # tensor([-0.4170, -1.4170, -2.3170])
# With NLLLoss: loss = nn.NLLLoss()(log_probs.unsqueeze(0), torch.tensor([0]))

TensorFlow:

import tensorflow as tf

logits = tf.constant([2.0, 1.0, 0.1])
log_probs = tf.nn.log_softmax(logits, axis=-1)
print(log_probs)         # [-0.4170 -1.4170 -2.3170]
# Combined loss: tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

Softmax2d

Applies Softmax over the channel dimension at each spatial location of a 4D tensor of shape $(N, C, H, W)$. Equivalent to F.softmax(x, dim=1). At each pixel $(h, w)$, the $C$ channel values are normalized to a probability distribution:

$\text{Softmax2d}(x)_{n,c,h,w} = \frac{e^{x_{n,c,h,w}}}{\sum_{c'} e^{x_{n,c',h,w}}}$

Used in semantic segmentation where each pixel needs a class probability distribution.
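The per-pixel normalization can be written out explicitly. This plain-Python sketch handles a single image stored as nested lists indexed `x[c][h][w]`; the function name `softmax2d` and the tiny 3-class, 1 x 2 input are assumptions made for illustration.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax2d(x):
    """Normalize over channels at every pixel of x[c][h][w] (no batch axis)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            probs = softmax([x[c][h][w] for c in range(C)])
            for c in range(C):
                out[c][h][w] = probs[c]
    return out

# 3 channels (classes) over a 1 x 2 image
x = [
    [[2.0, 0.0]],   # class 0 scores
    [[1.0, 0.0]],   # class 1 scores
    [[0.1, 0.0]],   # class 2 scores
]
out = softmax2d(x)
print([round(out[c][0][0], 4) for c in range(3)])  # [0.659, 0.2424, 0.0986]
print([round(out[c][0][1], 4) for c in range(3)])  # [0.3333, 0.3333, 0.3333]
```

Each pixel is normalized independently, so the second pixel, where all classes score the same, ends up uniform regardless of the first.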

PyTorch:

# Normalizes across the channel dimension at each spatial location
import torch
import torch.nn as nn

x = torch.randn(1, 3, 4, 4)   # (N, C, H, W)
out = nn.Softmax2d()(x)
print(out.shape)               # torch.Size([1, 3, 4, 4])
print(out[0, :, 0, 0].sum())   # tensor(1.) — channels sum to 1 at each pixel

TensorFlow:

import tensorflow as tf

x = tf.random.normal((1, 4, 4, 3))  # TF uses (N, H, W, C)
out = tf.nn.softmax(x, axis=-1)     # normalize over channel axis
print(tf.reduce_sum(out[0, 0, 0]))  # 1.0

Softmin

$f(x_i) = \frac{e^{-x_i}}{\sum_{j=1}^{K} e^{-x_j}}$

Softmin is identical to Softmax applied to the negated input: Softmin(x) = Softmax(-x). It assigns higher probabilities to smaller values — useful when you want to emphasize the closest (minimum distance) element, e.g., nearest-neighbor attention.
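As a small worked example, the sketch below turns a list of distances into weights using the identity Softmin(x) = Softmax(-x). The names `softmin` and `distances` are illustrative.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmin(values):
    """Softmax of the negated input: smaller values get larger weights."""
    return softmax([-v for v in values])

distances = [2.0, 1.0, 0.1]
weights = softmin(distances)
print([round(w, 4) for w in weights])            # [0.0961, 0.2613, 0.6426]
print([round(p, 4) for p in softmax(distances)]) # [0.659, 0.2424, 0.0986]
```

The closest element (distance 0.1) receives the largest weight. Note that the Softmin weights are not simply the Softmax weights in reverse order, because negating the inputs changes their position relative to the maximum.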

PyTorch:

import torch
import torch.nn as nn

logits = torch.tensor([2.0, 1.0, 0.1])
print(nn.Softmin(dim=-1)(logits))   # tensor([0.0961, 0.2613, 0.6426])
# Smaller inputs get larger weights; this is not simply Softmax's output reversed

TensorFlow:

import tensorflow as tf

logits = tf.constant([2.0, 1.0, 0.1])
# Softmin(x) = Softmax(-x)
print(tf.nn.softmax(-logits, axis=-1))   # [0.0961 0.2613 0.6426]

Comparison Table

Function	Output range	Use case
GLU	$\mathbb{R}^{d}$ from a $2d$-wide input	Transformer FFN gating
Hardswish	$[-0.375, \infty)$	Mobile backbone hidden layers
Softmax	$(0,1)^K$, sums to 1	Multi-class output layer
LogSoftmax	$(-\infty, 0]^K$	Paired with NLLLoss for classification
Softmax2d	$(0,1)^C$ per pixel, sums to 1 over channels	Semantic segmentation output
Softmin	$(0,1)^K$, sums to 1	Distance-based attention

References

Dauphin et al. (2017). Language Modeling with Gated Convolutional Networks (GLU). Introduced GLU for language modeling; showed that the gating mechanism outperforms ReLU on text tasks.

Howard et al. (2019). Searching for MobileNetV3 (Hardswish). Introduced Hardswish as a mobile-efficient approximation of Swish.

Shazeer (2020). GLU Variants Improve Transformer (GeGLU, SwiGLU). Demonstrated that replacing the FFN activation with GLU variants (GeGLU, SwiGLU) consistently improves language model perplexity.
