The Transformer
- Trace one token through a transformer encoder block — identifying the multi-head self-attention sub-layer, the FFN sub-layer, and the residual + layer-norm wrapping each — and state the dimensionality at each step
- Explain why positional encodings are necessary, distinguish sinusoidal from learned positional encodings, and state what RoPE achieves that absolute encodings cannot
- Distinguish encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformer variants by their attention masking, training objective, and typical use case
- Explain the transformer's quadratic attention bottleneck and name two architectural changes (e.g., grouped-query attention, sparse attention) that reduce it
Overview
The transformer (Vaswani et al., 2017) replaced recurrence with self-attention and became the dominant architecture for sequence modeling. Its key properties:
- Parallelizable — no sequential dependency, processes all positions simultaneously
- O(1) path length between any two positions — direct long-range dependencies
- Scalable — empirical scaling laws show predictable gains from more data, compute, and parameters
The original transformer has an encoder and a decoder, each composed of stacked identical blocks.
The Encoder Block
Each encoder block applies two sub-layers, each wrapped with a residual connection and layer normalization (pre-norm shown, which is now more common than post-norm):
┌─────────────────────────────────────────────────────┐
│ x → LayerNorm → MultiHead Self-Attention → (+x) → z │
│ z → LayerNorm → Feed-Forward Network → (+z) → y     │
└─────────────────────────────────────────────────────┘
Sub-layer 1: Multi-Head Self-Attention
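Each token attends to every token in the sequence, itself included. For a sequence of $n$ tokens with $d$-dimensional embeddings, $X \in \mathbb{R}^{n \times d}$:
$$\text{head}_i = \text{softmax}\!\left(\frac{(XW_i^Q)(XW_i^K)^\top}{\sqrt{d_k}}\right) XW_i^V, \qquad \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$
with $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$, $d_k = d/h$ for $h$ heads, and $W^O \in \mathbb{R}^{d \times d}$.
Output shape: $n \times d$, the same as the input.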
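The shape bookkeeping for this sub-layer can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the sizes (`n = 5`, `d_model = 16`, `num_heads = 4`), the random weights, and the helper names `softmax` and `multi_head_self_attention` are assumptions chosen for the example, and biases, masking, and dropout are omitted.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # Split the feature axis into heads: (n, d_model) -> (num_heads, n, d_head).
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Scaled dot-product scores per head: (num_heads, n, n).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, n, d_head)
    # Concatenate heads back to (n, d_model) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4  # illustrative sizes, not from any real model
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 16): same shape as the input
print(weights.shape)  # (4, 5, 5): one n x n attention map per head
```

Each head works in a `d_model / num_heads`-dimensional subspace, so the total projection cost matches a single full-width head while allowing several distinct attention patterns.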
Sub-layer 2: Position-Wise Feed-Forward Network
Applied independently to each token's representation:
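$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
where $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$, and $d_{ff}$ is typically $4d$. The original transformer uses ReLU as shown; later models often use GELU.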
The FFN is a two-layer MLP applied pointwise — each token is processed independently, without interaction between tokens. The attention layer handles all cross-token communication; the FFN processes each token's representation in isolation.
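The claim that the FFN never mixes information across tokens can be checked numerically. The sketch below assumes a ReLU FFN with random weights and small illustrative sizes; the function name `ffn` and all dimensions are assumptions for the example.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Two-layer MLP with ReLU; x may be one token (d,) or a sequence (n, d).
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
n, d_model = 4, 8
d_ff = 4 * d_model  # the common 4x expansion
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(n, d_model))

# Applying the FFN to the whole sequence equals applying it token by token.
whole_sequence = ffn(x, w1, b1, w2, b2)
one_token_at_a_time = np.stack([ffn(row, w1, b1, w2, b2) for row in x])
same = np.allclose(whole_sequence, one_token_at_a_time)

# Changing one token leaves every other token's output untouched.
x_changed = x.copy()
x_changed[0] += 1.0
others_unchanged = np.allclose(ffn(x_changed, w1, b1, w2, b2)[1:], whole_sequence[1:])
print(same, others_unchanged)
```

The same perturbation fed through an attention layer would change every row of the output, which is the division of labor described above.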
Parameter count per encoder block: $4d^2$ (attention) $+\ 8d^2$ (FFN) $= 12d^2$, ignoring biases and layer-norm parameters. For $d = 768$ (BERT-base): ~7M parameters per block × 12 blocks ≈ 85M parameters (plus embeddings).
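The arithmetic behind this estimate is short enough to spell out. The helper below is a hypothetical function written for this note; it counts weight matrices only and assumes the feed-forward width is four times the model width.

```python
def encoder_block_params(d_model: int, d_ff: int) -> dict:
    # Four d_model x d_model projections: W_Q, W_K, W_V, W_O.
    attention = 4 * d_model * d_model
    # Two FFN matrices: d_model x d_ff and d_ff x d_model.
    ffn = 2 * d_model * d_ff
    return {"attention": attention, "ffn": ffn, "total": attention + ffn}

d_model = 768  # BERT-base hidden size
counts = encoder_block_params(d_model, d_ff=4 * d_model)
print(counts["total"])       # 7077888, about 7.1M per block
print(12 * counts["total"])  # 84934656, about 85M across 12 blocks
```

Biases and layer-norm parameters add only a few thousand values per block, which is why they are usually dropped from back-of-the-envelope counts.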
Positional Encoding
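Self-attention is permutation-equivariant: shuffling the input tokens gives the same outputs, shuffled in the same way. Without explicit position information, each word in "cat sat mat" receives exactly the representation it would receive in "mat sat cat", so the model cannot tell the two orderings apart.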
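Permutation equivariance is easy to verify on a toy example. The sketch below assumes a single attention head with random weights and random vectors standing in for the three word embeddings; `self_attention` is a helper defined only for this illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # Single-head self-attention with no positional information.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))  # stand-ins for "cat", "sat", "mat"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])   # "mat sat cat"
out_original = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input only shuffles the output rows in the same way.
equivariant = np.allclose(out_shuffled, out_original[perm])
print(equivariant)
```

Adding any position-dependent term to `x` before the projections breaks this symmetry, which is exactly what a positional encoding is for.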
Positional encodings inject position information. Two strategies:
Sinusoidal (Original Transformer)
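$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Added to token embeddings before the first layer. Fixed, not learned. Because the encoding is defined for every position, it can be evaluated at sequence lengths not seen during training (different frequencies encode different scales of position), although how well a trained model actually extrapolates is limited in practice.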
Learned Absolute Positions
A learnable embedding table indexed by position. BERT, GPT-2 use this. Cannot generalize beyond the training context length.
Rotary Position Embedding (RoPE)
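Used by LLaMA, Mistral, and many other modern LLMs. Rather than adding position encodings to the embeddings, RoPE rotates the query and key vectors by a position-dependent angle before the dot product:
$$\tilde{q}_m = R_m q_m, \quad \tilde{k}_n = R_n k_n, \qquad \tilde{q}_m^\top \tilde{k}_n = q_m^\top R_{n-m}\, k_n$$
where $R_m$ is a block-diagonal rotation matrix whose $2 \times 2$ blocks rotate by angles $m\theta_i$.
The attention score depends on position only through the relative offset $n - m$, not on the absolute positions. This makes it naturally relative, which tends to help length generalization.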
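The relative-position property can be checked directly. The sketch below is a simplified RoPE under stated assumptions: it rotates interleaved pairs of dimensions (real implementations differ in how they pair dimensions), uses base 10000, and works on single unbatched vectors. The function name `rope` is chosen for this example.

```python
import numpy as np

def rope(vec, position, base=10000.0):
    # Rotate each consecutive pair (vec[2i], vec[2i+1]) by position * theta_i.
    d = vec.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    even, odd = vec[0::2], vec[1::2]
    rotated = np.empty_like(vec)
    rotated[0::2] = even * cos - odd * sin
    rotated[1::2] = even * sin + odd * cos
    return rotated

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3) at two different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
# A different offset generally gives a different score.
score_other = rope(q, 5) @ rope(k, 4)

print(np.isclose(score_near, score_far))
print(np.isclose(score_near, score_other))
```

Shifting both positions by 100 leaves the score unchanged, while changing the gap between them does not, which is the sense in which the encoding is relative.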
Decoder Block and Encoder-Decoder Interaction
The decoder block uses masked (causal) self-attention in place of the encoder's bidirectional self-attention and adds a cross-attention sub-layer:
x → LayerNorm → Masked Self-Attention → (+x) → z
z → LayerNorm → Cross-Attention(Q=z, K=encoder_out, V=encoder_out) → (+z) → w
w → LayerNorm → Feed-Forward Network → (+w) → y
Masked self-attention ensures the decoder cannot attend to future tokens (autoregressive). Cross-attention lets the decoder query the encoder's representations at each step.
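The effect of the causal mask on attention weights can be seen on a small example. This sketch assumes a boolean mask (True means "may attend") and random scores standing in for the scaled query-key products; `causal_mask` and `masked_softmax` are helper names introduced here. PyTorch's built-in mask helper instead returns an additive float mask with negative infinity above the diagonal, which has the same effect.

```python
import numpy as np

def causal_mask(t):
    # True where attention is allowed: query i may see keys 0..i.
    return np.tril(np.ones((t, t), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
t = 4
scores = rng.normal(size=(t, t))  # stand-in for Q K^T / sqrt(d_k)
weights = masked_softmax(scores, causal_mask(t))
print(np.round(weights, 2))
```

The printed matrix is lower-triangular: the first token can only attend to itself, and each later token spreads its weight over itself and the tokens before it. Cross-attention uses no such mask because the whole encoder output is available at every decoding step.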
The Three Transformer Families
Encoder-Only (BERT, RoBERTa)
- Attention: bidirectional self-attention (every token sees every other)
- Training objective: Masked Language Modeling (predict masked tokens); BERT adds next-sentence prediction (NSP), which RoBERTa drops
- Use cases: classification, NER, question answering, sentence embeddings
- Representative: BERT-base (110M), BERT-large (340M)
Decoder-Only (GPT series, LLaMA, Mistral, Gemma)
- Attention: masked (causal) self-attention only
- Training objective: next-token prediction (autoregressive language modeling)
- Use cases: text generation, chat, code, reasoning
- Representative: GPT-2 (1.5B), LLaMA 3 (8B–70B)
Encoder-Decoder (T5, BART, original transformer)
- Attention: encoder has full self-attention; decoder has masked self-attention + cross-attention to encoder
- Training objective: text-to-text (T5), denoising (BART), translation loss
- Use cases: translation, summarization, structured prediction
- Representative: T5-base (220M), mT5
Scaling and Efficiency
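Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022): Model loss falls predictably as a power law in parameters, data, and compute. The exponents are small, so each doubling of parameters buys a modest but reliable reduction in loss (on the order of a few percent under Kaplan et al.'s fits, when data and compute are not the limiting factor). Hoffmann et al. further showed that, for a fixed compute budget, parameters and training tokens should be scaled up together. This predictability motivated the rapid growth from GPT-2 (1.5B) to GPT-3 (175B).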
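To get a feel for what a power law implies numerically, the sketch below evaluates the loss ratio for a given increase in parameter count. It assumes the simplest form, loss proportional to N raised to a negative exponent with no irreducible-loss term, and takes the parameter exponent to be roughly 0.076, the value reported by Kaplan et al. (2020); both the functional form and the constant are assumptions for illustration, and the helper `loss_ratio` is hypothetical.

```python
def loss_ratio(param_multiplier: float, alpha: float) -> float:
    # Under L(N) proportional to N ** (-alpha), scaling N by a factor m
    # multiplies the loss by m ** (-alpha).
    return param_multiplier ** (-alpha)

ALPHA_N = 0.076  # assumed parameter exponent (Kaplan et al., 2020)

doubling = loss_ratio(2.0, ALPHA_N)
gpt2_to_gpt3 = loss_ratio(175e9 / 1.5e9, ALPHA_N)
print(f"2x parameters -> loss x {doubling:.3f}")
print(f"1.5B -> 175B  -> loss x {gpt2_to_gpt3:.3f}")
```

Under these assumptions a doubling lowers the loss by about five percent, and even a hundredfold increase lowers it by roughly a third. The appeal of scaling laws is the predictability of the gain, not its size per doubling.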
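Quadratic bottleneck: Attention is $O(n^2)$ in both time and memory for sequence length $n$. The methods below attack different parts of the cost: sliding-window attention cuts the quadratic term itself, GQA and MQA shrink the key/value cache that dominates memory during generation, and FlashAttention avoids the $O(n^2)$ memory traffic without changing the arithmetic: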
| Method | Idea | Used in |
|---|---|---|
| Grouped-Query Attention (GQA) | Share K/V heads across groups of Q heads | LLaMA 2/3, Mistral |
| Multi-Query Attention (MQA) | Single K/V head shared by all Q heads | PaLM, Falcon |
| Sliding Window Attention | Each token attends only to a local window | Mistral, Longformer |
| FlashAttention | Rewrite attention to minimize HBM reads (kernel fusion) | Most modern training |
FlashAttention does not change the mathematical output — it is a hardware-efficient reimplementation that reduces memory IO by fusing the softmax and value aggregation into a single kernel pass.
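The memory saving from sharing key/value heads can be estimated with simple arithmetic. The configuration below (32 layers, 32 query heads of dimension 128, an 8192-token context, 2 bytes per value) is hypothetical and not taken from any specific model; `kv_cache_bytes` is a helper written for this note.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (factor 2) are cached per layer, per KV head, per position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads of size 128, 8192-token context, fp16.
config = dict(seq_len=8192, num_layers=32, head_dim=128)
mha = kv_cache_bytes(num_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # 4 query heads share each KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

For this configuration the cache shrinks from 4 GiB to 1 GiB with GQA and to 0.125 GiB with MQA. The number of query-key score computations is unchanged, so these methods relieve memory and bandwidth pressure during generation rather than the quadratic compute.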
PyTorch and TensorFlow
PyTorch — full transformer, encoder-only, decoder-only:
import torch
import torch.nn as nn

# Built-in full encoder-decoder transformer (Vaswani et al. architecture)
transformer = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, dropout=0.1,
    batch_first=True,
)
src = torch.randn(2, 10, 512)  # (B, src_len, d_model)
tgt = torch.randn(2, 7, 512)   # (B, tgt_len, d_model)
# Causal mask so each target position attends only to earlier target positions
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
out = transformer(src, tgt, tgt_mask=tgt_mask)  # (2, 7, 512)

# Encoder-only model (BERT-style): a stack of TransformerEncoderLayers
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    batch_first=True, norm_first=True,  # pre-norm
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
x = torch.randn(2, 10, 512)
out = encoder(x)  # (2, 10, 512)

# Decoder-only (GPT-style): self-attention blocks with a causal mask and no cross-attention.
# nn.TransformerDecoder is not used here because it always cross-attends to `memory`,
# which a decoder-only model does not have; a causally masked encoder stack is equivalent.
gpt_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    batch_first=True, norm_first=True,
)
gpt = nn.TransformerEncoder(gpt_layer, num_layers=6)
T = 10
mask = nn.Transformer.generate_square_subsequent_mask(T)  # causal mask (T, T)
out = gpt(x, mask=mask)  # (2, 10, 512)

# Sinusoidal positional encoding
class SinusoidalPE(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        # Frequencies 1 / 10000^(2i/d_model), one per sin/cos pair (assumes even d_model)
        div = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(pos * div)
        pe[0, :, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe)  # not a parameter

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]  # broadcast over batch
TensorFlow / Keras:
import tensorflow as tf

# Transformer encoder block using Keras layers (pre-norm)
class TransformerEncoderBlock(tf.keras.layers.Layer):
    def __init__(self, d_model: int, num_heads: int, ffn_dim: int, dropout: float = 0.1):
        super().__init__()
        self.attn = tf.keras.layers.MultiHeadAttention(num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ffn_dim, activation='gelu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.drop = tf.keras.layers.Dropout(dropout)

    def call(self, x, training=False):
        # Pre-norm: x + SubLayer(LN(x))
        h = self.norm1(x)  # normalize once, reuse as query and key/value
        x = x + self.drop(self.attn(h, h), training=training)
        x = x + self.drop(self.ffn(self.norm2(x)), training=training)
        return x

# Pretrained models are available through the Hugging Face transformers library
# from transformers import BertModel, GPT2Model, T5ForConditionalGeneration
# model = GPT2Model.from_pretrained('gpt2')  # PyTorch class; the TensorFlow counterpart is TFGPT2Model