Supplement · Activation Functions

NLP & Advanced Activations

By the end of this reading you will be able to:

Explain adaptive softmax cluster assignment and state why it reduces computation for large-vocabulary language modelling
Trace the multi-head attention forward pass from Q, K, V inputs to the output projection and identify where softmax and scaling appear
State the SwiGLU formula and explain why it has become the default FFN activation in modern transformer architectures
Compare the FFN activation choices across GPT-2, LLaMA, and PaLM and relate each to a specific activation function family

LogSigmoid

$f(x) = \log\sigma(x) = \log\frac{1}{1+e^{-x}} = -\log(1 + e^{-x})$

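LogSigmoid outputs values in $(-\infty, 0)$: the log of a probability. It plays the same role for binary classification that nn.LogSoftmax plays for multi-class problems. nn.NLLLoss expects one log-probability per class, so for a single logit $x$ the two class log-probabilities are $f(-x)$ (class 0) and $f(x)$ (class 1).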

Numerically stable implementation: PyTorch evaluates a form equivalent to:

$f(x) = \begin{cases} -\log(1 + e^{-x}) & x \geq 0 \\ x - \log(1 + e^x) & x < 0 \end{cases}$

The two-branch form avoids overflow: for large positive $x$ , computing $e^{-x}$ is safe (it's tiny); for large negative $x$ , computing $e^{x}$ is safe (also tiny).
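The two-branch form can be checked in plain Python. This is an illustrative sketch only: the function names are made up here and this is not PyTorch's kernel. It shows the direct formula overflowing where the branched version stays finite.

```python
import math

def naive_log_sigmoid(x: float) -> float:
    # Direct transcription of -log(1 + e^{-x}); e^{-x} overflows for very negative x.
    return -math.log(1.0 + math.exp(-x))

def stable_log_sigmoid(x: float) -> float:
    # Each branch only exponentiates a non-positive number, so exp() cannot overflow.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{x:+.1f} -> {stable_log_sigmoid(x):.4f}")

try:
    naive_log_sigmoid(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(naive_overflowed, stable_log_sigmoid(-1000.0))
```

For very negative inputs the stable version returns approximately $x$ itself, which is the correct asymptote of $\log\sigma(x)$.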

Relationship to BCEWithLogitsLoss: BCEWithLogitsLoss(logits, labels) ≡ $-[y \cdot f(x) + (1-y) \cdot f(-x)]$ where $f$ is LogSigmoid. The loss function and the activation are two sides of the same coin.
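The identity can be verified numerically. The sketch below (plain Python, with hypothetical helper names) compares the LogSigmoid form of the loss against textbook binary cross-entropy computed from $p = \sigma(x)$ for a few logit and label pairs.

```python
import math

def log_sigmoid(x: float) -> float:
    # Stable two-branch evaluation of log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bce_from_log_sigmoid(x: float, y: float) -> float:
    # -[y * f(x) + (1 - y) * f(-x)], with f = LogSigmoid
    return -(y * log_sigmoid(x) + (1.0 - y) * log_sigmoid(-x))

def bce_from_probability(x: float, y: float) -> float:
    # Textbook binary cross-entropy on p = sigmoid(x); adequate for moderate |x|.
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

pairs = [(-2.0, 0.0), (-2.0, 1.0), (0.5, 1.0), (3.0, 0.0)]
max_gap = max(
    abs(bce_from_log_sigmoid(x, y) - bce_from_probability(x, y)) for x, y in pairs
)
print(max_gap)
```

The two forms agree to floating-point precision on moderate logits; the LogSigmoid form is the one that remains usable when $|x|$ is large.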

PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3., -1., 0., 1., 3.])
print(nn.LogSigmoid()(x))   # tensor([-3.0486, -1.3133, -0.6931, -0.3133, -0.0486])
# With targets y in {0, 1}: BCEWithLogitsLoss = -(y * logsigmoid(x) + (1 - y) * logsigmoid(-x))

TensorFlow:

import tensorflow as tf

x = tf.constant([-3., -1., 0., 1., 3.])
print(tf.math.log_sigmoid(x))   # values: [-3.0486 -1.3133 -0.6931 -0.3133 -0.0486]
# Equivalent: -tf.math.softplus(-x)

AdaptiveLogSoftmaxWithLoss — Large Vocabulary NLP

For standard Softmax, computing the partition function $Z = \sum_{j=1}^{V} e^{x_j}$ over a vocabulary of size $V$ (often $V = 50{,}000$ or more) requires $O(V)$ logits per token, or $O(V \cdot d)$ multiply-adds for hidden size $d$, at every forward pass. Across a batch of 512 tokens, the output layer can dominate the compute.

Adaptive Softmax (Grave et al., 2017) reduces the expected cost per token well below $O(V)$ (roughly $O(\sqrt{V})$ for a balanced two-level hierarchy) by exploiting the skewed word-frequency distribution. Words are split into frequency clusters:

Head cluster: The most frequent $c_0$ words (e.g., top 2,000) get computed with a full softmax — these are the words that appear in nearly every batch.
Tail clusters: Rarer words are grouped into clusters with progressively smaller projection dimensions (controlled by div_value). Each tail cluster contributes one cluster token $c_k$ to the head softmax; a word in cluster $k$ is scored by multiplying the head's probability for $c_k$ by a second, smaller softmax over that cluster's words.

$P(w) = \begin{cases} P(\text{head} = w) & w \in \text{head cluster} \\ P(\text{head} = c_k) \cdot P(w \mid c_k) & w \in \text{tail cluster } k \end{cases}$

The probability is log-additive: $\log P(w) = \log P(c_k) + \log P(w \mid c_k)$ .
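A toy example makes the factorisation concrete. The sketch below uses made-up logits for a 9-word vocabulary (3 head words and two tail clusters) and checks that the combined distribution sums to 1. It illustrates the formula only and is not the library implementation.

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

# Toy vocabulary: 3 head words plus 2 tail clusters (all logits are made-up numbers).
head_logits = [2.0, 1.0, 0.5, 0.0, -1.0]   # 3 head words, then cluster tokens c_1, c_2
tail_logits = [
    [0.3, -0.2, 0.1, 0.0],                 # words in tail cluster 1
    [1.0, 0.0],                            # words in tail cluster 2
]
n_head_words = 3

head_lp = log_softmax(head_logits)
log_probs = list(head_lp[:n_head_words])           # log P(head = w)
for k, cluster in enumerate(tail_logits):
    log_p_cluster = head_lp[n_head_words + k]      # log P(head = c_k)
    # log P(w) = log P(c_k) + log P(w | c_k)
    log_probs += [log_p_cluster + lp for lp in log_softmax(cluster)]

total = sum(math.exp(lp) for lp in log_probs)
print(len(log_probs), round(total, 6))
```

Because each tail softmax sums to 1 within its cluster, the cluster token's probability mass is redistributed exactly over that cluster's words.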

PyTorch:

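import torch
import torch.nn as nn

adaptive_sm = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=512,        # input embedding size
    n_classes=50000,        # vocabulary size
    cutoffs=[2000, 10000],  # cluster boundaries
    div_value=4.0           # dimension reduction factor per cluster
)
embeddings = torch.randn(32, 512)           # placeholder hidden states
targets = torch.randint(0, 50000, (32,))    # placeholder token ids
output, loss = adaptive_sm(embeddings, targets)   # target log-probs (32,) and mean NLL loss
log_probs = adaptive_sm.log_prob(embeddings)      # inference: log P(w), shape (32, 50000)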
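To see where the savings come from, the following sketch counts multiply-adds per token for the configuration above. It assumes each tail cluster $i$ first projects the input down to in_features / div_value^(i+1) dimensions and then applies a linear layer over its own words; the helper name is hypothetical and the accounting ignores biases and the softmax itself.

```python
def adaptive_softmax_costs(in_features, n_classes, cutoffs, div_value):
    """Multiply-adds per token for the full softmax, the head, and each tail cluster."""
    bounds = list(cutoffs) + [n_classes]
    n_clusters = len(bounds) - 1
    full = in_features * n_classes
    # The head scores the frequent words plus one token per tail cluster.
    head = in_features * (bounds[0] + n_clusters)
    tails = []
    for i in range(n_clusters):
        proj_dim = int(in_features // (div_value ** (i + 1)))   # reduced dimension
        cluster_size = bounds[i + 1] - bounds[i]
        # Down-projection, then a linear layer over the cluster's words.
        tails.append(in_features * proj_dim + proj_dim * cluster_size)
    return full, head, tails

full, head, tails = adaptive_softmax_costs(512, 50000, [2000, 10000], 4.0)
print(full, head, tails)
```

Under these assumptions a head word costs about 1.0M multiply-adds against 25.6M for the full softmax, and even a word in the rarest cluster costs the head plus about 1.3M more.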

TensorFlow:

# No equivalent built-in; approximate with sampled softmax for large vocabularies
loss = tf.nn.sampled_softmax_loss(
    weights=embedding_matrix,   # (vocab_size, embed_dim)
    biases=bias,                # (vocab_size,)
    labels=targets,             # int64, shape (batch, 1)
    inputs=embeddings,          # (batch, embed_dim)
    num_sampled=1000,
    num_classes=50000
)   # per-example losses, shape (batch,); use during training only
# For inference: tf.nn.log_softmax(tf.matmul(embeddings, embedding_matrix, transpose_b=True) + bias)

MultiheadAttention

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$

$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

Multi-head attention is the core of the Transformer. Rather than computing one attention pattern over the full dimension, it runs $h$ attention heads in parallel, each projecting queries, keys, and values into a lower-dimensional subspace ( $d_k = d_{\text{model}} / h$ ). The heads learn different relationship patterns (positional, syntactic, semantic) and their outputs are concatenated and projected back.
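The forward pass can be traced end to end on tiny inputs. The sketch below is plain Python with lists standing in for tensors; the helper names and the toy projection matrices (each head simply selects two of the four input features, and $W^O$ is the identity) are assumptions chosen for readability, not a realistic parameterisation.

```python
import math

def matmul(a, b):
    # (n x m) times (m x p) with nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]  # 1/sqrt(d_k) scaling
    weights = [softmax(row) for row in scores]                               # row-wise softmax
    return matmul(weights, v)

def multi_head(x_q, x_k, x_v, w_q, w_k, w_v, w_o):
    # w_q, w_k, w_v: one (d_model x d_k) matrix per head; w_o: (h * d_k x d_model).
    heads = [
        attention(matmul(x_q, wq), matmul(x_k, wk), matmul(x_v, wv))
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the per-head outputs along the feature axis, token by token.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x_q))]
    return matmul(concat, w_o)

# d_model = 4, h = 2, d_k = 2: head 0 reads features 0-1, head 1 reads features 2-3.
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
w_q = w_k = w_v = [select_first, select_last]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

x = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
out = multi_head(x, x, x, w_q, w_k, w_v, w_o)
for row in out:
    print([round(v, 3) for v in row])
```

Softmax appears once per head, applied row-wise to the scaled scores; everything else in the pass is linear.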

Key design decisions:

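$1/\sqrt{d_k}$ scaling: For query and key components with unit variance, the dot product $q \cdot k$ has variance $d_k$, so unscaled logits grow with $d_k$ and push Softmax into saturation where gradients vanish. Dividing by $\sqrt{d_k}$ restores unit variance.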
Causal masking: For decoder/generation models, an attention mask sets future positions to $-\infty$ before Softmax, preventing the model from attending to future tokens.
batch_first=True: Tensors are (batch, seq, dim). nn.MultiheadAttention still defaults to batch_first=False, which expects (seq, batch, dim), so set the flag explicitly.
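The effect of the $1/\sqrt{d_k}$ factor can be observed empirically. The sketch below assumes query and key components are independent standard-normal draws (an idealisation, not a property of trained models) and measures the spread of the dot product with and without scaling; the function name is hypothetical.

```python
import math
import random

def dot_product_std(d_k, scaled, n_samples=2000, seed=0):
    """Empirical std of q.k for q, k with i.i.d. standard-normal components."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        values.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(values) / n_samples
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n_samples)

for d_k in (16, 64, 256):
    raw = dot_product_std(d_k, scaled=False)
    fixed = dot_product_std(d_k, scaled=True)
    print(d_k, round(raw, 2), round(fixed, 2))
```

The unscaled spread should track $\sqrt{d_k}$ while the scaled spread stays near 1, which keeps the softmax inputs in a range where gradients remain useful.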

Complexity: For sequence length $n$, the attention score matrix takes $O(n^2)$ memory per head and $O(n^2 \cdot d)$ compute. This quadratic cost is the scaling bottleneck that FlashAttention, sparse attention, and linear attention variants aim to address.
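A quick calculation shows how fast the quadratic term grows. The sketch below counts only the bytes needed to materialise the score matrix for every head, assuming 2-byte (half-precision) elements and a batch of one; the function name and defaults are illustrative.

```python
def attention_matrix_bytes(seq_len, num_heads, batch_size=1, bytes_per_element=2):
    """Bytes needed to materialise the (seq_len x seq_len) score matrix for every head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

for n in (1024, 4096, 16384):
    mib = attention_matrix_bytes(n, num_heads=8) / 2**20
    print(f"n={n:>6}: {mib:,.0f} MiB")
```

With 8 heads this gives 16 MiB at 1,024 tokens, 256 MiB at 4,096, and 4,096 MiB at 16,384: each 4x increase in sequence length costs 16x the memory.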

PyTorch:

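attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, dropout=0.1, batch_first=True)
# query, key, value: (batch, seq_len, embed_dim)
# Boolean attn_mask of shape (seq_len, seq_len): True marks positions that may NOT be attended to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
output, weights = attn(query, key, value, attn_mask=causal_mask)   # weights are averaged over heads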

TensorFlow:

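attn = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64, dropout=0.1)
# query, value: (batch, seq_len, embed_dim); key optional (defaults to value)
# attention_mask: (batch, target_len, source_len); unlike PyTorch, 1/True marks positions that MAY be attended to
output = attn(query, value, key=key, attention_mask=causal_mask)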

SwiGLU — The Modern FFN Standard

While not a standalone PyTorch module, SwiGLU is the modern replacement for the transformer FFN's ReLU/GELU:

$\text{SwiGLU}(x, W, V, b, c) = \text{SiLU}(xW + b) \otimes (xV + c)$

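It is GLU with SiLU replacing Sigmoid as the gate activation. Used in PaLM, LLaMA, and Mistral (Gemma uses the closely related GeGLU). A gated FFN has three weight matrices instead of two, so to match the parameter count of a standard FFN with hidden size $4d$, the hidden dimension is typically scaled to $8d/3$.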
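The $8d/3$ figure follows from equating weight counts: a plain FFN has $2 \cdot d \cdot 4d = 8d^2$ weights, and a gated one has $3 \cdot d \cdot d_{ff}$, so $d_{ff} = 8d/3$. The sketch below works this through for $d = 4096$. The rounding of the hidden size up to a multiple of 256 is an assumption added here for hardware-friendly shapes, and the helper names are hypothetical.

```python
def ffn_params(d_model, d_ff, gated):
    """Weight count of a bias-free FFN: 2 matrices if plain, 3 if gated (GLU-style)."""
    return (3 if gated else 2) * d_model * d_ff

def swiglu_hidden_dim(d_model, multiple_of=256):
    # 8d/3, rounded up to a multiple of `multiple_of` (the rounding is an assumption,
    # not something the formula requires).
    raw = (8 * d_model) // 3
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

d_model = 4096
d_ff = swiglu_hidden_dim(d_model)
plain = ffn_params(d_model, 4 * d_model, gated=False)
gated = ffn_params(d_model, d_ff, gated=True)
print(d_ff, plain, gated, round(gated / plain, 4))
```

With these settings the gated block ends up within about 1% of the plain FFN's weight count, the small excess coming from the rounding step.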

PyTorch (SwiGLU FFN block):

import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # Three projections: gate, value, output
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # output
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # value

    def forward(self, x):
        # SiLU-activated gate times the linear value branch, then project back to d_model
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

TensorFlow (SwiGLU FFN block):

class SwiGLUFFN(tf.keras.layers.Layer):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = tf.keras.layers.Dense(d_ff, use_bias=False)   # gate
        self.w2 = tf.keras.layers.Dense(d_model, use_bias=False) # output
        self.w3 = tf.keras.layers.Dense(d_ff, use_bias=False)   # value

    def call(self, x):
        return self.w2(tf.nn.swish(self.w1(x)) * self.w3(x))

Putting It Together: Transformer FFN Variants

FFN Type	Activation	Used In
Original	ReLU	Vaswani et al. 2017
Standard	GELU	BERT, GPT-2, RoBERTa
GeGLU	GELU gate	T5v1.1, Flan-T5
SwiGLU	SiLU gate	LLaMA, Mistral, PaLM
MixGLU	Mish gate	Research

The trend is clear: smooth, self-gating activations have replaced ReLU in large-scale NLP, with GLU-variant FFNs becoming the de facto standard.

References

Vaswani et al. (2017) — Attention Is All You Need — Introduced the Transformer with multi-head attention

Grave et al. (2017) — Efficient softmax approximation for GPUs (Adaptive Softmax) — Introduced AdaptiveLogSoftmax for large vocabulary NLP

Shazeer (2020) — GLU Variants Improve Transformer (SwiGLU) — Showed SwiGLU consistently outperforms GELU/ReLU FFNs

Touvron et al. (2023) — LLaMA — Popularized SwiGLU for open-weight large language models
