The MLP — Layers, Activations, and Universal Approximation
- Trace the forward pass of an L-layer MLP from input to output, identifying the affine transformation and non-linearity at each hidden layer
- Explain what the Universal Approximation Theorem guarantees and what it does not guarantee about MLP expressivity
- Distinguish the effects of adding depth vs. adding width to an MLP and identify scenarios where each is preferred
- Explain why layer normalization and dropout are applied after (or within) MLP layers and state what problem each addresses
The Foundation: Affine Transformation + Non-Linearity
Every modern deep learning architecture — transformers, ResNets, RNNs — is assembled from a small number of primitives. The most fundamental is a single dense (fully connected) layer: an affine transformation followed by a non-linearity.
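For an input vector $x \in \mathbb{R}^{d_{\text{in}}}$, a layer computes:
$$h = \sigma(Wx + b)$$
where $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is the weight matrix, $b \in \mathbb{R}^{d_{\text{out}}}$ is the bias, and $\sigma$ is a pointwise non-linearity (ReLU, GELU, tanh, etc.).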
Why the non-linearity? Without $\sigma$, stacking multiple linear layers collapses: $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is a single affine transformation. Non-linearities are what give depth its expressivity.
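The collapse can be checked numerically. The sketch below uses plain Python lists and two hypothetical helpers, `affine` and `matmul`, with arbitrary example weights; it is an illustration, not part of any library.

```python
def affine(W, b, x):
    """Return W @ x + b for a list-of-rows matrix W and list vectors b, x."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def matmul(A, B):
    """Return the matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, v_i) for v_i in v]

# Two layers with arbitrary example weights: 2 -> 3 -> 2.
W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, -0.5, 1.0]
W2, b2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]], [1.0, 2.0]
x = [1.0, -2.0]

# Stacked with no activation in between.
stacked = affine(W2, b2, affine(W1, b1, x))

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = affine(W2, b2, b1)
collapsed = affine(W, b, x)

# With a ReLU between the layers the equivalence no longer holds.
with_relu = affine(W2, b2, relu(affine(W1, b1, x)))

print(stacked, collapsed, with_relu)  # [-2.0, 6.0] [-2.0, 6.0] [3.0, 3.5]
```

The first two results agree for every input, not just this one, because the identity is algebraic; the third differs because ReLU zeroes the negative pre-activation before the second layer sees it.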
The Multi-Layer Perceptron
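An MLP (multi-layer perceptron) chains $L$ such layers:
$$h^{(0)} = x, \qquad h^{(\ell)} = \sigma\!\left(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\right) \ \text{for } \ell = 1, \dots, L-1, \qquad \hat{y} = W^{(L)} h^{(L-1)} + b^{(L)}$$
The final layer is linear (no activation); the appropriate output non-linearity (softmax, sigmoid) is applied by the loss function or separately.
Key vocabulary:
- Input layer: $h^{(0)} = x$
- Hidden layers: $h^{(1)}, \dots, h^{(L-1)}$ (these are the learned representations)
- Output layer: $\hat{y}$, the prediction in task-specific space
- Width: $d$, the dimensionality of the hidden layers
- Depth: $L$, the number of weight layers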
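The forward pass can be traced layer by layer in a few lines. This is a minimal sketch in plain Python, assuming ReLU as the hidden non-linearity and a simple Gaussian initialization; `init_mlp` and `forward` are hypothetical names chosen for illustration.

```python
import random

def relu(v):
    return [max(0.0, v_i) for v_i in v]

def affine(W, b, h):
    """Return W @ h + b for a list-of-rows matrix W."""
    return [sum(w * h_j for w, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def init_mlp(sizes, seed=0):
    """Random (W, b) pairs for layer sizes [d_in, d, ..., d, d_out]."""
    rng = random.Random(seed)
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        W = [[rng.gauss(0.0, fan_in ** -0.5) for _ in range(fan_in)]
             for _ in range(fan_out)]
        params.append((W, [0.0] * fan_out))
    return params

def forward(params, x):
    """Return every layer's output: input, each hidden layer, then the logits."""
    activations = [x]
    h = x
    for W, b in params[:-1]:          # hidden layers: affine, then non-linearity
        h = relu(affine(W, b, h))
        activations.append(h)
    W, b = params[-1]                 # output layer: affine only
    activations.append(affine(W, b, h))
    return activations

params = init_mlp([4, 8, 8, 3])       # 3 weight layers, hidden width 8
acts = forward(params, [0.5, -1.0, 2.0, 0.0])
print([len(a) for a in acts])         # [4, 8, 8, 3]
```

Returning every intermediate activation, rather than only the final output, makes it easy to inspect the hidden representations at each depth.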
Universal Approximation
The Universal Approximation Theorem (Hornik 1989; Cybenko 1989) states that an MLP with a single hidden layer, a suitable non-linear activation (for example a sigmoid), and a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary precision.
What this guarantees: The function class of MLPs is expressive enough in principle.
What it does not guarantee:
- That shallow networks are practical — the required width may be exponential in the input dimension
- That gradient descent will find the right parameters
- That the network will generalize to unseen inputs
The theorem is primarily a theoretical existence result. In practice, adding depth is usually a more effective way to gain expressivity than widening a single hidden layer.
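A one-dimensional illustration of the existence claim: a single hidden layer of ReLU units can reproduce the piecewise-linear interpolant of a target function, and the error shrinks as the width grows. The sketch below is not the theorem's proof. It sets the weights by hand rather than by training, which also shows why existence says nothing about what gradient descent will find. `relu_interpolant` and `max_error` are hypothetical helper names, and `sin` on $[0, 2\pi]$ is an arbitrary target.

```python
import math

def relu_interpolant(f, a, b, width):
    """One-hidden-layer ReLU net that linearly interpolates f at width + 1 knots."""
    knots = [a + (b - a) * k / width for k in range(width + 1)]
    slopes = [(f(knots[k + 1]) - f(knots[k])) / (knots[k + 1] - knots[k])
              for k in range(width)]
    # Output weights: the first slope, then the change in slope at each knot.
    out_w = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, width)]
    out_b = f(a)

    def net(x):
        # Hidden unit k computes ReLU(1 * x - knots[k]).
        hidden = [max(0.0, x - knots[k]) for k in range(width)]
        return out_b + sum(w * h for w, h in zip(out_w, hidden))

    return net

def max_error(f, net, a, b, samples=1000):
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

errors = {}
for width in (4, 16, 64):
    net = relu_interpolant(math.sin, 0.0, 2 * math.pi, width)
    errors[width] = max_error(math.sin, net, 0.0, 2 * math.pi)
    print(width, errors[width])
```

Each fourfold increase in width cuts the error by roughly a factor of sixteen here, because linear interpolation error scales with the square of the knot spacing. In higher input dimensions the number of units needed for a given accuracy can grow much faster, which is the practicality caveat listed above.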
Depth vs. Width
Depth (more layers, fixed width) offers:
- Compositional representations: each layer builds on the previous one — edges → shapes → objects in vision; morphemes → words → phrases in NLP
- Parameter efficiency: some functions require exponentially more neurons to represent in a shallow network than a deep one
- Trainability at scale when combined with residual connections, which keep gradients flowing through many layers (see the ResNet reading)
Width (larger hidden dimension, fixed depth) offers:
- More capacity without gradient flow issues — easier to train
- Direct expressivity gain — more neurons in one layer
Practical guidance: modern architectures prefer moderate depth (4–48 layers) with sufficient width. For structured data, 2–4 layers is often enough; for unstructured data (images, text), depth is essential.
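The parameter-efficiency point can be made concrete with a classic sawtooth construction. A "tent" function needs only two ReLU units, and composing it `depth` times yields a sawtooth with $2^{\text{depth}}$ linear pieces on $[0, 1]$. A one-hidden-layer ReLU network with $n$ units on a scalar input has at most $n + 1$ linear pieces, so matching the deep network exactly would need on the order of $2^{\text{depth}}$ units. The sketch below uses hypothetical helper names and counts pieces by sampling at the dyadic breakpoints.

```python
def relu(x):
    return max(0.0, x)

def tent(x):
    """A width-2 ReLU layer: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times (2 * depth ReLU units in total)."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(depth):
    """Count direction changes of the sawtooth sampled at i / 2**depth."""
    n = 2 ** depth
    ys = [deep_sawtooth(i / n, depth) for i in range(n + 1)]
    rising = [ys[i + 1] > ys[i] for i in range(n)]
    return 1 + sum(rising[i] != rising[i + 1] for i in range(n - 1))

for depth in (1, 2, 3, 6):
    # Columns: depth, total ReLU units, linear pieces.
    print(depth, 2 * depth, count_linear_pieces(depth))
# 1 2 2
# 2 4 4
# 3 6 8
# 6 12 64
```

Twelve units arranged in six layers produce 64 pieces, while a single hidden layer would need at least 63 units for the same function. This is one specific function family; it illustrates the bullet above and is not a general claim about all tasks.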
Layer Normalization
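Raw pre-activations can have very different scales across layers, which destabilizes training. Layer normalization normalizes within each example across features:
$$\text{LayerNorm}(z) = \gamma \odot \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu$ and $\sigma$ are the mean and standard deviation computed over the feature dimension of a single example, $\epsilon$ is a small constant for numerical stability, and $\gamma$, $\beta$ are learned scale and shift parameters. (Here $\sigma$ denotes a standard deviation, not the activation function.)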
- BatchNorm (alternative) normalizes across the batch dimension — more common in CNNs
- LayerNorm normalizes across the feature dimension of each example — standard in transformers and RNNs, where batch statistics are unreliable
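Layer normalization for a single example can be sketched in plain Python. The function below assumes the population (biased) variance and a small `eps` for numerical stability, which are common choices; `layer_norm` is a hypothetical name, not a library call.

```python
import math

def layer_norm(z, gamma, beta, eps=1e-5):
    """Normalize one example's feature vector z, then scale and shift."""
    d = len(z)
    mu = sum(z) / d
    var = sum((z_i - mu) ** 2 for z_i in z) / d   # population variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (z_i - mu) * inv_std + b
            for z_i, g, b in zip(z, gamma, beta)]

ones, zeros = [1.0] * 4, [0.0] * 4

small = layer_norm([2.0, 4.0, 6.0, 8.0], ones, zeros)
large = layer_norm([20.0, 40.0, 60.0, 80.0], ones, zeros)

# Both inputs map to (almost) the same output: the scale has been removed.
print([round(v, 3) for v in small])
print([round(v, 3) for v in large])
```

Because the statistics come from one example's features only, the result does not depend on what else is in the batch, which is why this works when batch statistics are unreliable.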
Dropout
Dropout randomly zeros out neurons during training with probability $p$ (typically 0.1–0.5):
$$\tilde{h} = \frac{m \odot h}{1 - p}, \qquad m_i \sim \text{Bernoulli}(1 - p)$$
The division by $1 - p$ is inverted dropout: it preserves the expected value of each activation, so the same network can be used at inference time without scaling.
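A minimal sketch of inverted dropout in plain Python, assuming an explicit `training` flag and a seeded random generator for reproducibility. `dropout` here is a hypothetical stand-alone function, not a framework API.

```python
import random

def dropout(h, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale survivors."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)                     # identity at inference time
    keep = 1.0 - p
    return [h_i / keep if rng.random() < keep else 0.0 for h_i in h]

rng = random.Random(0)
h = [1.0] * 100_000

dropped = dropout(h, p=0.3, training=True, rng=rng)
mean_train = sum(dropped) / len(dropped)          # close to 1.0
zero_frac = dropped.count(0.0) / len(dropped)     # close to 0.3
print(mean_train, zero_frac)

print(dropout(h[:3], p=0.3, training=False, rng=rng))  # [1.0, 1.0, 1.0]
```

About 30% of the units are zeroed, yet the average activation stays near its original value because each survivor is scaled up by `1 / (1 - p)`. At inference the function returns its input unchanged.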
Why it works: Forces the network to learn redundant representations — no single neuron can be relied upon. Acts as a form of ensemble averaging over exponentially many sub-networks.
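Where to apply: After activation, before the next layer. Not typically used in the output layer. Often reduced or omitted entirely in large-scale transformer training, where abundant data makes overfitting less of a concern. Layer norm is not a replacement: it addresses training stability rather than regularization.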
Parameter Count
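For an MLP with input dim $d_{\text{in}}$, $L - 1$ hidden layers of width $d$, and output dim $d_{\text{out}}$:
$$\#\text{params} = \underbrace{(d_{\text{in}}\, d + d)}_{\text{first layer}} + \underbrace{(L - 2)(d^2 + d)}_{\text{hidden-to-hidden layers}} + \underbrace{(d\, d_{\text{out}} + d_{\text{out}})}_{\text{output layer}}$$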
Hidden layers dominate: each hidden-to-hidden weight matrix has $d^2$ parameters. A width of roughly $d = 2000$, for example, gives ~4M parameters per hidden layer.
This quadratic scaling in $d$ is what makes MLPs expensive to widen and motivates sparse architectures (Mixture of Experts) for very large models.
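The count is easy to compute directly. The helper below is a sketch with hypothetical names (`mlp_param_count`, `num_hidden`); it counts weights and biases for an MLP with `num_hidden` hidden layers of equal width.

```python
def mlp_param_count(d_in, d, d_out, num_hidden):
    """Weights plus biases for an MLP with `num_hidden` hidden layers of width d."""
    if num_hidden < 1:
        raise ValueError("need at least one hidden layer")
    first = d_in * d + d                       # input -> first hidden layer
    middle = (num_hidden - 1) * (d * d + d)    # hidden -> hidden layers
    last = d * d_out + d_out                   # last hidden -> output
    return first + middle + last

base = mlp_param_count(784, 512, 10, num_hidden=4)
wide = mlp_param_count(784, 1024, 10, num_hidden=4)
print(base)   # 1195018
print(wide)   # 3962890
```

Doubling the width from 512 to 1024 more than triples the total, since the hidden-to-hidden term grows with $d^2$. The first configuration has the same shape as the `MLP(in_dim=784, hidden=512, out_dim=10, depth=4)` model built in the PyTorch section.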
PyTorch and TensorFlow
PyTorch — building MLPs with nn.Sequential and nn.Module:
```python
import torch
import torch.nn as nn

# Concise MLP with nn.Sequential
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)  # batch of 32 flattened images
out = mlp(x)              # (32, 10) logits


# Custom Module gives more control (separate forward logic, weight sharing, etc.)
class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int, depth: int):
        super().__init__()
        # `depth` hidden layers in total: the first one here, the rest in the loop
        layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        # Final layer is linear: it returns raw logits
        layers.append(nn.Linear(hidden, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


model = MLP(in_dim=784, hidden=512, out_dim=10, depth=4)
print(sum(p.numel() for p in model.parameters()))  # parameter count
```
TensorFlow / Keras:
```python
import tensorflow as tf

# Sequential API
mlp = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),  # explicit Input instead of input_shape= on the first Dense
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10),     # linear output: raw logits
])

# Functional API: preferred when you need multiple inputs/outputs or skip connections
inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(256, activation='relu')(inputs)
x = tf.keras.layers.Dense(256, activation='relu')(x)
outputs = tf.keras.layers.Dense(10)(x)
model = tf.keras.Model(inputs, outputs)

model.summary()  # prints layer shapes and parameter counts
```