Gating & Normalization Activations
- Explain how GLU uses a sigmoid gate to control information flow and state which dimension is split
- Apply Softmax to a logit vector and verify that the outputs are non-negative and sum to one
- Distinguish Softmax from LogSoftmax in terms of output and explain why LogSoftmax is numerically preferred when combined with NLLLoss
- Identify Hardswish as an approximation of Swish that uses a piecewise-linear gate, and state its advantage for mobile deployment
GLU — Gated Linear Unit
GLU splits a linear transformation of the input into two halves: a linear component $a$ and a gate input $b$. The gate, a sigmoid applied to its own learned linear projection, controls how much of the linear component flows through:

$$\text{GLU}(a, b) = a \otimes \sigma(b)$$
In practice, a single weight matrix projects the input to twice the hidden size, then the output is split in half along the last dimension:
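# Inside a module's forward(); self.linear = nn.Linear(d_in, 2 * d)
x_proj = self.linear(x)                 # shape: (B, 2d)
signal, gate = x_proj.chunk(2, dim=-1)  # each (B, d); same half ordering as nn.GLU
return signal * torch.sigmoid(gate)     # shape: (B, d)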
GLU halves the output dimension, so to produce $d$ output features the network must first project to $2d$. This is why a GLU layer has twice the parameters of a plain linear layer with the same output width.
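To make the dimension halving and the parameter cost concrete, here is a minimal framework-free sketch in plain Python. The helper names (`sigmoid`, `glu`, `linear_param_count`) and the sizes `d_in = 16`, `d = 8` are illustrative assumptions, not part of any library.

```python
import math


def sigmoid(v: float) -> float:
    """Logistic sigmoid for a single float."""
    return 1.0 / (1.0 + math.exp(-v))


def glu(x: list[float]) -> list[float]:
    """Split x into halves (a, b) and return a * sigmoid(b) element-wise."""
    if len(x) % 2 != 0:
        raise ValueError("GLU needs an even number of features to split in half")
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [a_i * sigmoid(b_i) for a_i, b_i in zip(a, b)]


def linear_param_count(d_in: int, d_out: int) -> int:
    """Weights plus biases of a dense layer d_in -> d_out."""
    return d_in * d_out + d_out


d_in, d = 16, 8  # hypothetical sizes, chosen only for illustration
plain_params = linear_param_count(d_in, d)    # projects straight to d
glu_params = linear_param_count(d_in, 2 * d)  # projects to 2d, then GLU halves it

print(len(glu([0.5] * (2 * d))))  # 8 output features from 16 inputs
print(glu_params / plain_params)  # 2.0
```

The factor of two comes entirely from the wider projection; the gating step itself has no parameters.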
Where it's used: Gated convolutional language models (Dauphin et al., 2017); the later variants GeGLU (GELU gate) and SwiGLU (SiLU gate) appear in Transformer feed-forward blocks, with SwiGLU used in PaLM, LLaMA, and Mistral.
PyTorch:
import torch
import torch.nn as nn

# nn.GLU splits the last dim in half: output is first_half * sigmoid(second_half)
x = torch.randn(2, 8) # input: (batch=2, features=8)
glu = nn.GLU(dim=-1)
print(glu(x).shape) # torch.Size([2, 4])
# Manual equivalent (used when building custom GLU variants):
a, b = x.chunk(2, dim=-1)
out = a * torch.sigmoid(b) # shape: (2, 4)
TensorFlow:
# No built-in GLU; implement via split + sigmoid
import tensorflow as tf
x = tf.random.normal((2, 8))
a, b = tf.split(x, 2, axis=-1) # each: (2, 4)
out = a * tf.nn.sigmoid(b) # GLU output: (2, 4)
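The GeGLU and SwiGLU variants mentioned above differ from GLU only in the nonlinearity applied to the gate half. The following plain-Python sketch illustrates that idea; `gated_unit` is a hypothetical helper, and gating the second half follows the nn.GLU convention shown earlier rather than any particular model's code (published variants typically use two separate learned projections instead of splitting one tensor).

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def silu(v: float) -> float:
    """SiLU / Swish: v * sigmoid(v)."""
    return v * sigmoid(v)


def gelu(v: float) -> float:
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def gated_unit(x, gate_fn):
    """Split x into halves (a, b) and return a * gate_fn(b) element-wise."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [a_i * gate_fn(b_i) for a_i, b_i in zip(a, b)]


x = [1.0, -2.0, 0.5, 3.0]      # first half = signal, second half = gate input
print(gated_unit(x, sigmoid))  # GLU
print(gated_unit(x, gelu))     # GeGLU-style gate
print(gated_unit(x, silu))     # SwiGLU-style gate
```

Unlike the sigmoid gate, the GELU and SiLU gates are not confined to $(0, 1)$, so they can scale the signal by more than 1 or flip its sign slightly for negative gate inputs.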
Hardswish — Mobile-Optimized Swish
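Hardswish approximates SiLU/Swish for mobile inference by replacing the sigmoid gate with a piecewise-linear one:

$$\text{Hardswish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$

The result is $0$ for $x \le -3$, $x$ for $x \ge 3$, and $x(x + 3)/6$ in between. The formula reuses the ReLU6 operation (which hardware accelerators handle natively) to approximate the sigmoid gate, which avoids computing an expensive exponential.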
Where it's used: MobileNetV3, which introduced it. (EfficientNet-Lite replaces Swish with ReLU6 rather than Hardswish.)
PyTorch:
import torch.nn.functional as F

x = torch.tensor([-4., -3., 0., 3., 4.])
print(nn.Hardswish()(x))  # values 0, 0, 0, 3, 4 (zero for x <= -3, identity for x >= 3)
print(F.hardswish(x))     # identical
TensorFlow:
x = tf.constant([-4., -3., 0., 3., 4.])
# Keras 3 provides keras.activations.hard_swish; availability under tf.keras
# depends on the installed TF/Keras version.
# Manual implementation:
def hard_swish(x):
    return x * tf.nn.relu6(x + 3.0) / 6.0

print(hard_swish(x))  # values 0, 0, 0, 3, 4
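To get a feel for how close the approximation is, the sketch below compares Hardswish with Swish on a grid of inputs using only the Python standard library. The function names and the grid range of -6 to 6 are assumptions made for illustration.

```python
import math


def relu6(v: float) -> float:
    """Clamp v to the interval [0, 6]."""
    return min(max(v, 0.0), 6.0)


def hardswish(v: float) -> float:
    """x * ReLU6(x + 3) / 6: no exponential needed."""
    return v * relu6(v + 3.0) / 6.0


def swish(v: float) -> float:
    """x * sigmoid(x), the smooth function Hardswish approximates."""
    return v / (1.0 + math.exp(-v))


grid = [i / 100.0 for i in range(-600, 601)]  # x from -6 to 6 in steps of 0.01
max_gap = max(abs(hardswish(v) - swish(v)) for v in grid)

print(f"largest |hardswish - swish| on the grid: {max_gap:.4f}")
print(hardswish(-1.5))  # -0.375, the minimum of the quadratic middle piece
```

On this grid the largest gap occurs at the breakpoints $x = \pm 3$, where the hard gate saturates but the sigmoid has not yet done so.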
Softmax
Softmax converts a vector of real numbers (logits) $z \in \mathbb{R}^K$ into a probability distribution:

$$\text{Softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

All outputs lie in $(0, 1)$ and they sum to 1. It is the canonical output activation for multi-class classification.
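Numerical stability: For large logits (above roughly 88), $e^{z_i}$ overflows float32. The log-sum-exp trick subtracts the maximum logit $m = \max_j z_j$ before exponentiating:

$$\text{Softmax}(z)_i = \frac{e^{z_i - m}}{\sum_{j} e^{z_j - m}}$$

This is mathematically identical but safe for any input magnitude, because every exponent is at most 0. PyTorch applies this internally.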
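The effect of subtracting the maximum can be sketched in plain Python. This is an illustration of the idea only, not PyTorch's actual implementation; `softmax_naive` and `softmax_stable` are hypothetical names.

```python
import math


def softmax_naive(z):
    """Direct translation of the definition; exp() can overflow."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(z):
    """Subtract max(z) first, so every exponent is <= 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


big = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(big)
    naive_ok = True
except OverflowError:
    # Python floats raise here; float32 tensors would yield inf / nan instead.
    naive_ok = False

print(naive_ok)                         # False
print(softmax_stable(big))
print(softmax_stable([0.0, 1.0, 2.0]))  # same distribution: only differences matter
```

The last two lines print the same values because Softmax is invariant to adding a constant to every logit, which is exactly what makes the shift safe.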
Temperature scaling: $\text{Softmax}(z / T)$ divides the logits by a temperature $T > 0$. As $T \to 0^+$, the output approaches a one-hot vector (argmax); as $T \to \infty$, it approaches a uniform distribution.
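A short plain-Python sketch of the two temperature limits follows. The helper name `softmax_with_temperature` and the three temperature values are assumptions picked for illustration.

```python
import math


def softmax_with_temperature(z, T):
    """Softmax of z / T, computed with the max-subtraction trick."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
for T in (0.05, 1.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 4) for p in probs])
```

The low temperature puts nearly all the mass on the largest logit, while the high temperature yields values close to 1/3 each.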
Important: Do not apply Softmax before nn.CrossEntropyLoss — CrossEntropyLoss applies LogSoftmax internally for numerical stability.
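The cost of that mistake can be illustrated without any framework. The sketch below mimics the math of a cross-entropy loss that expects raw logits; `cross_entropy_from_logits` is a hypothetical stand-in, not the nn.CrossEntropyLoss implementation.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(z, target):
    """Cross-entropy on raw logits: -log_softmax(z)[target]."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[target]


logits = [2.0, 1.0, 0.1]
target = 0

correct = cross_entropy_from_logits(logits, target)
# Mistake: probabilities are passed where logits are expected,
# so Softmax is effectively applied twice.
double = cross_entropy_from_logits(softmax(logits), target)

# Even a perfectly confident prediction cannot drive the mistaken loss to zero.
floor = cross_entropy_from_logits([1.0, 0.0, 0.0], target)

print(round(correct, 4), round(double, 4), round(floor, 4))
```

Because probabilities are confined to $[0, 1]$, the second Softmax flattens them, so the mistaken loss is inflated and has a positive lower bound. Training still runs, which is why this bug is easy to miss.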
PyTorch:
logits = torch.tensor([2.0, 1.0, 0.1])
probs = nn.Softmax(dim=-1)(logits)
print(probs) # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum()) # tensor(1.)
TensorFlow:
logits = tf.constant([2.0, 1.0, 0.1])
probs = tf.nn.softmax(logits, axis=-1)
print(probs) # [0.659 0.2424 0.0986]
print(tf.reduce_sum(probs)) # 1.0
# In a model output: tf.keras.layers.Dense(K, activation='softmax')
LogSoftmax
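LogSoftmax computes the log of Softmax in a numerically stable single pass, outputting log-probabilities in $(-\infty, 0]$:

$$\text{LogSoftmax}(z)_i = z_i - \log \sum_{j} e^{z_j}$$

Taking the log of a separately computed Softmax can underflow to $\log 0 = -\infty$ for very unlikely classes; the fused form avoids this. It is designed to pair with nn.NLLLoss: CrossEntropyLoss(z, y) = NLLLoss(LogSoftmax(z), y).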
PyTorch:
logits = torch.tensor([2.0, 1.0, 0.1])
log_probs = nn.LogSoftmax(dim=-1)(logits)
print(log_probs) # tensor([-0.4170, -1.4170, -2.3170])
# With NLLLoss: loss = nn.NLLLoss()(log_probs.unsqueeze(0), torch.tensor([0]))
TensorFlow:
logits = tf.constant([2.0, 1.0, 0.1])
log_probs = tf.nn.log_softmax(logits, axis=-1)
print(log_probs) # [-0.4170 -1.4170 -2.3170]
# Combined loss: tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
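Why the fused form is preferred can be shown with an extreme pair of logits. The plain-Python sketch below contrasts the two-step log(Softmax) with a direct log-softmax; the helper names are assumptions, and `nll_loss` here is a one-line stand-in for the idea behind nn.NLLLoss, not its implementation.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(z):
    """z_i - logsumexp(z), never forming a probability that could underflow."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum_exp for v in z]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [0.0, -1000.0]

probs = softmax(logits)  # second entry underflows to exactly 0.0
try:
    two_step = [math.log(p) for p in probs]
except ValueError:
    two_step = None      # log(0) is undefined; tensors would give -inf

one_step = log_softmax(logits)  # stays finite
print(probs, two_step, one_step)
print(nll_loss(one_step, target=1))  # 1000.0
```

With the two-step route the loss for the unlikely class is lost (an error here, an infinite value with tensors), whereas the fused route returns a finite loss that can still produce useful gradients.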
Softmax2d
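Applies Softmax over the channel dimension at each spatial location of a 4D tensor of shape $(N, C, H, W)$. Equivalent to F.softmax(x, dim=1). At each pixel $(h, w)$, the $C$ channel values are normalized to a probability distribution:

$$\text{Softmax2d}(x)_{n,c,h,w} = \frac{e^{x_{n,c,h,w}}}{\sum_{c'=1}^{C} e^{x_{n,c',h,w}}}$$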
Used in semantic segmentation where each pixel needs a class probability distribution.
PyTorch:
# Normalizes across the channel dimension at each spatial location
x = torch.randn(1, 3, 4, 4) # (N, C, H, W)
out = nn.Softmax2d()(x)
print(out.shape) # torch.Size([1, 3, 4, 4])
print(out[0, :, 0, 0].sum()) # tensor(1.) — channels sum to 1 at each pixel
TensorFlow:
x = tf.random.normal((1, 4, 4, 3)) # TF uses (N, H, W, C)
out = tf.nn.softmax(x, axis=-1) # normalize over channel axis
print(tf.reduce_sum(out[0, 0, 0])) # 1.0
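The per-pixel normalization can also be written out by hand. The sketch below uses nested Python lists in (C, H, W) layout for a single image; `softmax2d` and the score values are hypothetical and only meant to show how a segmentation map falls out of the channel-wise probabilities.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax2d(x):
    """x is a nested list of shape (C, H, W); normalize over C at each (h, w)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            channel_probs = softmax([x[c][h][w] for c in range(C)])
            for c in range(C):
                out[c][h][w] = channel_probs[c]
    return out


# Hypothetical 3-class score map for a 2x2 image, layout (C, H, W)
scores = [
    [[2.0, 0.0], [1.0, -1.0]],   # class 0
    [[1.0, 2.0], [0.0, 0.5]],    # class 1
    [[0.1, 0.0], [-1.0, 3.0]],   # class 2
]
probs = softmax2d(scores)

# Per-pixel class prediction: argmax over the channel axis
pred = [[max(range(3), key=lambda c: probs[c][h][w]) for w in range(2)]
        for h in range(2)]
print(pred)  # [[0, 1], [0, 2]]
```

Each pixel is normalized independently, so the probabilities at one location are unaffected by scores elsewhere in the image.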
Softmin
Softmin is identical to Softmax applied to the negated input: Softmin(x) = Softmax(-x). It assigns higher probabilities to smaller values — useful when you want to emphasize the closest (minimum distance) element, e.g., nearest-neighbor attention.
PyTorch:
logits = torch.tensor([2.0, 1.0, 0.1])
print(nn.Softmin(dim=-1)(logits)) # tensor([0.0961, 0.2613, 0.6426])
# Higher weight to smaller values; note this is not the Softmax output in reverse order
TensorFlow:
logits = tf.constant([2.0, 1.0, 0.1])
# Softmin(x) = Softmax(-x)
print(tf.nn.softmax(-logits, axis=-1)) # approx. [0.0961 0.2613 0.6426]
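A small plain-Python check makes two points about Softmin explicit: it equals Softmax of the negated input, and it is not the same as reversing the Softmax output. The `distances` list is a hypothetical example of the distance-weighting use case mentioned above.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmin(z):
    """Softmax of the negated input: small values get large weights."""
    return softmax([-v for v in z])


values = [2.0, 1.0, 0.1]
print([round(p, 4) for p in softmin(values)])            # [0.0961, 0.2613, 0.6426]
print([round(p, 4) for p in reversed(softmax(values))])  # [0.0986, 0.2424, 0.659]

# Hypothetical distances from a query to three candidates
distances = [0.5, 2.0, 4.0]
weights = softmin(distances)
print(weights.index(max(weights)))  # 0, the closest candidate
```

The two printed distributions differ because negating the logits changes the gaps relative to the new maximum, not just the order.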
Comparison Table
| Function | Output range | Use case |
|---|---|---|
| GLU | $(-\infty, \infty)$, half the input width | Transformer FFN gating |
| Hardswish | $[-0.375, \infty)$ | Mobile backbone hidden layers |
| Softmax | $(0, 1)$, sums to 1 | Multi-class output layer |
| LogSoftmax | $(-\infty, 0]$ | With NLLLoss for classification |
| Softmax2d | $(0, 1)$ per pixel, sums to 1 over channels | Semantic segmentation output |
| Softmin | $(0, 1)$, sums to 1 | Distance-based attention |