Supplement · Neural Network Architectures

The Transformer


By the end of this reading you will be able to:

Trace one token through a transformer encoder block — identifying the multi-head self-attention sub-layer, the FFN sub-layer, and the residual + layer-norm wrapping each — and state the dimensionality at each step
Explain why positional encodings are necessary, distinguish sinusoidal from learned positional encodings, and state what RoPE achieves that absolute encodings cannot
Distinguish encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformer variants by their attention masking, training objective, and typical use case
Explain the transformer's quadratic attention bottleneck and name two architectural changes (e.g., grouped-query attention, sparse attention) that reduce attention's time or memory cost

Overview

The transformer (Vaswani et al., 2017) replaced recurrence with self-attention and became the dominant architecture for sequence modeling. Its key properties:

Parallelizable — no sequential dependency, processes all positions simultaneously
O(1) path length between any two positions — direct long-range dependencies
Scalable — empirical scaling laws show predictable gains from more data, compute, and parameters

The original transformer has an encoder and a decoder, each composed of stacked identical blocks.

The Encoder Block

Each encoder block applies two sub-layers, each wrapped with a residual connection and layer normalization (pre-norm shown, which is now more common than post-norm):

  x → LayerNorm → Multi-Head Self-Attention → (+x) → z
  z → LayerNorm → Feed-Forward Network → (+z) → y

Sub-layer 1: Multi-Head Self-Attention

Each token attends to every token in the sequence, itself included. For a sequence of $n$ tokens with $d_{\text{model}}$-dimensional embeddings:

$\text{Attn}(X) = \text{MultiHead}(X, X, X)$

Output shape: $(n \times d_{\text{model}})$ — same as input.
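As a rough illustration of the shapes involved, here is a minimal NumPy sketch of multi-head self-attention. The sizes (n = 5, d_model = 16, 4 heads), the random weights, and the helper names are assumptions chosen for the example; biases, dropout, and masking are omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # Split the feature axis into heads: (n, d_model) -> (num_heads, n, d_head).
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                      # each row sums to 1
    heads = weights @ V                                     # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

The output keeps the input shape, which is what lets the residual connection add it straight back onto x.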

Sub-layer 2: Position-Wise Feed-Forward Network

Applied independently to each token's representation:

$\text{FFN}(\mathbf{x}) = W_2\, \phi(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, $\phi$ is a nonlinearity (ReLU in the original transformer, GELU in BERT and GPT-2), and $d_{\text{ff}}$ is typically $4 \times d_{\text{model}}$.

The FFN is a two-layer MLP applied pointwise — each token is processed independently, without interaction between tokens. The attention layer handles all cross-token communication; the FFN processes each token's representation in isolation.
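The "in isolation" claim can be checked directly. The sketch below is a NumPy illustration, with ReLU standing in for the nonlinearity and with toy sizes and random weights that are assumptions of the example. It changes one token and confirms that the FFN outputs for all other tokens stay the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 5, 16
d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_model)

def ffn(X):
    # X holds one token per row, so the column form W1 x becomes X @ W1.T.
    hidden = np.maximum(0.0, X @ W1.T + b1)   # (n, d_ff), ReLU as the nonlinearity
    return hidden @ W2.T + b2                 # (n, d_model)

X = rng.normal(size=(n, d_model))
Y = ffn(X)

# Replace token 3 only; every other row of the output should be unaffected.
X_changed = X.copy()
X_changed[3] = rng.normal(size=d_model)
Y_changed = ffn(X_changed)

others = [0, 1, 2, 4]
print(Y.shape)                                     # (5, 16)
print(np.allclose(Y[others], Y_changed[others]))   # True
```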

Parameter count per encoder block, ignoring biases and layer-norm parameters: $4 d_{\text{model}}^2$ (attention: the query, key, value, and output projections) $+ 2 \cdot 4 d_{\text{model}}^2$ (FFN) $= 12 d_{\text{model}}^2$. For $d_{\text{model}} = 768$ (BERT-base): about 7.1M parameters per block × 12 blocks ≈ 85M parameters (plus embeddings).
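The arithmetic can be reproduced in a few lines. The function name is hypothetical, and the count follows the same simplification as the formula (weight matrices only, no biases or layer-norm parameters).

```python
def encoder_block_params(d_model, ffn_mult=4):
    attention = 4 * d_model * d_model        # query, key, value, and output projections
    ffn = 2 * ffn_mult * d_model * d_model   # the two FFN weight matrices
    return attention + ffn

per_block = encoder_block_params(768)
total = 12 * per_block
print(per_block)  # 7077888
print(total)      # 84934656
```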

Positional Encoding

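Self-attention is permutation-equivariant: shuffling the input tokens shuffles the outputs in the same way and changes nothing else. Without explicit position information, the transformer therefore cannot tell "cat sat mat" from "mat sat cat": each token receives the same representation in both orderings.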
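A small numerical check makes the equivariance concrete. This is a single-head NumPy sketch with no positional encoding; the sizes, weights, and permutation are arbitrary choices for the example.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ (X @ Wv)

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.3 for _ in range(3))

perm = np.array([2, 0, 3, 1])
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the inputs gives exactly the shuffled outputs.
equivariant = np.allclose(out_shuffled, out[perm])
print(equivariant)  # True
```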

Positional encodings inject position information. Two strategies:

Sinusoidal (Original Transformer)

$\text{PE}(\text{pos}, 2i) = \sin\!\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$

$\text{PE}(\text{pos}, 2i{+}1) = \cos\!\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$

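The encoding is added to the token embeddings before the first layer and is fixed rather than learned. Different frequencies encode position at different scales, and the function is defined for any position, so it can be evaluated at sequence lengths not seen during training; how well a trained model actually extrapolates to those lengths is an empirical question.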

Learned Absolute Positions

A learnable embedding table indexed by position, used by BERT and GPT-2. It cannot represent positions beyond the training context length, because no embedding exists for them.
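The length limit follows directly from the table lookup. In this sketch a random matrix stands in for a trained table, and `max_len`, `pos_table`, and `add_learned_positions` are hypothetical names used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 8, 4                           # hypothetical training context length
pos_table = rng.normal(size=(max_len, d_model))   # stands in for a trained embedding table

def add_learned_positions(token_emb):
    n = token_emb.shape[0]
    if n > max_len:
        # There is no row in the table for positions >= max_len.
        raise ValueError(f"sequence length {n} exceeds max_len {max_len}")
    return token_emb + pos_table[:n]

ok = add_learned_positions(np.zeros((6, d_model)))
print(ok.shape)  # (6, 4)

try:
    add_learned_positions(np.zeros((9, d_model)))
    too_long_ok = True
except ValueError:
    too_long_ok = False
print(too_long_ok)  # False
```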

Rotary Position Embedding (RoPE)

Used by LLaMA, Mistral, and most modern LLMs. Rather than adding position encodings to the embeddings, RoPE rotates the query and key vectors by a position-dependent angle before the dot product:

$\mathbf{q}_m \cdot \mathbf{k}_n = (R_m \mathbf{q})^\top (R_n \mathbf{k}) = \mathbf{q}^\top R_{n-m} \mathbf{k}$

For fixed query and key content, the attention score depends on position only through the relative offset $(n - m)$, not on the absolute positions. RoPE is therefore inherently relative, which tends to help length generalization.
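The relative-offset property can be verified numerically. The sketch below rotates adjacent feature pairs (one common convention; some implementations pair the two halves of the vector instead) and uses base 10000. The function name and test positions are assumptions of the example.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per 2-D pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin      # 2-D rotation of each (even, odd) pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 3) @ rope(k, 7)        # positions (3, 7): offset 4
score_b = rope(q, 103) @ rope(k, 107)    # positions (103, 107): offset 4
score_c = rope(q, 3) @ rope(k, 8)        # positions (3, 8): offset 5

# Same content and same offset give the same score, wherever the pair sits.
print(np.isclose(score_a, score_b))  # True
```

Shifting both positions by 100 leaves the score unchanged, while changing the offset (score_c) generally does not.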

Decoder Block and Encoder-Decoder Interaction

The decoder block makes its self-attention sub-layer masked (causal) and adds a cross-attention sub-layer between it and the FFN:

  x → LayerNorm → Masked Self-Attention → (+x) → z
  z → LayerNorm → Cross-Attention(Q=z, K=encoder_out, V=encoder_out) → (+z) → w
  w → LayerNorm → Feed-Forward Network → (+w) → y

Masked self-attention ensures the decoder cannot attend to future tokens (autoregressive). Cross-attention lets the decoder query the encoder's representations at each step.
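One way to picture the causal mask: set the scores for future positions to negative infinity before the softmax, so their weights come out as exactly zero. This is a minimal NumPy sketch with random scores standing in for real query-key products.

```python
import numpy as np

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))   # stand-in for raw attention scores

# True above the diagonal marks "future" positions that must be hidden.
future = np.triu(np.ones((T, T), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(masked)                               # exp(-inf) = 0
weights = weights / weights.sum(axis=-1, keepdims=True)

print(np.allclose(np.triu(weights, k=1), 0.0))  # True
print(weights[0])                               # [1. 0. 0. 0.]
```

The first position can only attend to itself, so its weight row is a one-hot vector; each later row spreads its weight over itself and earlier positions only.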

The Three Transformer Families

Encoder-Only (BERT, RoBERTa)

Attention: bidirectional self-attention (every token sees every other)
Training objective: Masked Language Modeling (predict masked tokens), plus Next Sentence Prediction (NSP) in BERT; RoBERTa drops NSP
Use cases: classification, NER, question answering, sentence embeddings
Representative: BERT-base (110M), BERT-large (340M)

Decoder-Only (GPT series, LLaMA, Mistral, Gemma)

Attention: masked (causal) self-attention only
Training objective: next-token prediction (autoregressive language modeling)
Use cases: text generation, chat, code, reasoning
Representative: GPT-2 (1.5B), LLaMA 3 (8B–70B)

Encoder-Decoder (T5, BART, original transformer)

Attention: encoder has full self-attention; decoder has masked self-attention + cross-attention to encoder
Training objective: span-corruption text-to-text (T5), denoising autoencoding (BART), supervised translation (original transformer)
Use cases: translation, summarization, structured prediction
Representative: T5-base (220M), mT5

Scaling and Efficiency

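Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022): Model loss falls predictably as a power law in parameters, data, and compute. The exponents are small: with the parameter exponent of roughly 0.076 reported by Kaplan et al., doubling parameters (when data and compute are not limiting) lowers the loss by only about 5%, since $2^{-0.076} \approx 0.95$. Gains per doubling are modest but reliable across many orders of magnitude, which motivated the rapid growth from GPT-2 (1.5B) to GPT-3 (175B).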

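Quadratic bottleneck: Self-attention computes an $n \times n$ score matrix, so it is $O(n^2)$ in time and, implemented naively, in memory for sequence length $n$. The methods below attack different parts of the cost: sliding-window attention removes the quadratic term itself, GQA and MQA shrink the key/value cache that dominates inference memory, and FlashAttention keeps exact attention but avoids materializing the score matrix in HBM.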

Method	Idea	Used in
Grouped-Query Attention (GQA)	Share K/V heads across groups of Q heads	LLaMA 2/3, Mistral
Multi-Query Attention (MQA)	Single K/V head shared by all Q heads	PaLM, Falcon
Sliding Window Attention	Each token attends only to a local window	Mistral, Longformer
FlashAttention	Rewrite attention to minimize HBM reads (kernel fusion)	Most modern training
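To put rough numbers on two of the rows above: the sketch below counts cached key/value elements for a hypothetical model (32 layers, 32 query heads, head size 128, 4096-token context; all assumed for illustration) under standard multi-head attention, GQA, and MQA, then counts how many query-key pairs a causal sliding window keeps. The function names are illustrative.

```python
import numpy as np

def kv_cache_elements(n_tokens, n_layers, n_kv_heads, d_head):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * n_tokens * n_layers * n_kv_heads * d_head

# Hypothetical configuration; only the number of K/V heads differs.
mha = kv_cache_elements(4096, 32, n_kv_heads=32, d_head=128)  # one K/V head per Q head
gqa = kv_cache_elements(4096, 32, n_kv_heads=8, d_head=128)   # 4 Q heads share each K/V head
mqa = kv_cache_elements(4096, 32, n_kv_heads=1, d_head=128)   # all Q heads share one K/V head
print(mha // gqa, mha // mqa)  # 4 32

def sliding_window_mask(T, window):
    # Position i may attend to positions j with i - window < j <= i.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

allowed = sliding_window_mask(T=8, window=3)
full_causal = np.tril(np.ones((8, 8), dtype=bool))
print(int(allowed.sum()), int(full_causal.sum()))  # 21 36
```

The window version grows linearly with sequence length (about T × window pairs) instead of quadratically, whereas GQA and MQA leave the score matrix alone and only reduce what has to be stored per token.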

FlashAttention does not change the mathematical output — it is a hardware-efficient reimplementation that reduces memory IO by fusing the softmax and value aggregation into a single kernel pass.
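The reason the output is unchanged can be seen in the online-softmax trick that blockwise attention relies on. The following is a plain NumPy sketch of that idea for a single query, not the actual GPU kernel; the block size and function names are assumptions of the example.

```python
import numpy as np

def attention_reference(q, K, V):
    # Standard attention for one query: full score vector, then softmax.
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def attention_tiled(q, K, V, block=4):
    # Process K/V one block at a time, keeping a running max (m),
    # a running normalizer (l), and a running weighted sum (acc).
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(q.shape[0])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=16)
K, V = rng.normal(size=(10, 16)), rng.normal(size=(10, 16))

same = np.allclose(attention_reference(q, K, V), attention_tiled(q, K, V))
print(same)  # True
```

Because the running statistics are rescaled whenever a larger maximum appears, the blockwise result matches the full softmax up to floating-point rounding, without ever holding the whole score vector at once.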

PyTorch and TensorFlow

PyTorch — full transformer, encoder-only, decoder-only:

import torch
import torch.nn as nn

# Built-in full encoder-decoder transformer (Vaswani et al. architecture)
transformer = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, dropout=0.1,
    batch_first=True,
)
src = torch.randn(2, 10, 512)   # (B, src_len, d_model)
tgt = torch.randn(2,  7, 512)   # (B, tgt_len, d_model)
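tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))  # hide future target positions
out = transformer(src, tgt, tgt_mask=tgt_mask)                          # (2, 7, 512)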

# Encoder-only model (BERT-style) — stack of TransformerEncoderLayers
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                            dim_feedforward=2048,
                                            batch_first=True, norm_first=True)  # pre-norm
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
x   = torch.randn(2, 10, 512)
out = encoder(x)                # (2, 10, 512)

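# Decoder-only (GPT-style): self-attention blocks with a causal mask and no cross-attention.
# nn.TransformerDecoderLayer always includes a cross-attention sub-layer, so a GPT-style
# stack is built here from nn.TransformerEncoderLayer plus a causal mask instead.
gpt_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                        dim_feedforward=2048,
                                        batch_first=True, norm_first=True)
gpt = nn.TransformerEncoder(gpt_layer, num_layers=6)
T      = x.size(1)
mask   = nn.Transformer.generate_square_subsequent_mask(T)   # causal mask (T, T), -inf above the diagonal
out    = gpt(x, mask=mask)                                   # (2, 10, 512)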

# Sinusoidal positional encoding
class SinusoidalPE(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)              # (max_len, 1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe        = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(pos * div)
        pe[0, :, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe)                        # not a parameter

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]                     # broadcast over batch

TensorFlow / Keras:

import tensorflow as tf

# Transformer encoder block using Keras layers (pre-norm)
class TransformerEncoderBlock(tf.keras.layers.Layer):
    def __init__(self, d_model: int, num_heads: int, ffn_dim: int, dropout: float = 0.1):
        super().__init__()
        self.attn  = tf.keras.layers.MultiHeadAttention(num_heads, key_dim=d_model // num_heads)
        self.ffn   = tf.keras.Sequential([
            tf.keras.layers.Dense(ffn_dim, activation='gelu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.drop  = tf.keras.layers.Dropout(dropout)

    def call(self, x, training=False):
        # Pre-norm: x + SubLayer(LN(x))
        h = self.norm1(x)                                   # normalize once, reuse as query and value
        x = x + self.drop(self.attn(h, h), training=training)
        x = x + self.drop(self.ffn(self.norm2(x)), training=training)
        return x

# High-level option: pretrained models from the Hugging Face transformers library
# from transformers import BertModel, GPT2Model, T5ForConditionalGeneration
# model = GPT2Model.from_pretrained('gpt2')   # PyTorch class; TF-prefixed classes such as TFGPT2Model cover TensorFlow in releases that support it

References

Vaswani et al. 2017 — Attention Is All You Need

Devlin et al. 2019 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Dao et al. 2022 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
