Supplement · Neural Network Architectures

The MLP — Layers, Activations, and Universal Approximation

14 min read

By the end of this reading you will be able to:

Trace the forward pass of an L-layer MLP from input to output, identifying the affine transformation and non-linearity at each hidden layer
Explain what the Universal Approximation Theorem guarantees and what it does not guarantee about MLP expressivity
Distinguish the effects of adding depth vs. adding width to an MLP and identify scenarios where each is preferred
Explain why layer normalization and dropout are applied after (or within) MLP layers and state what problem each addresses

The Foundation: Affine Transformation + Non-Linearity

Every modern deep learning architecture — transformers, ResNets, RNNs — is assembled from a small number of primitives. The most fundamental is a single dense (fully connected) layer: an affine transformation followed by a non-linearity.

For an input vector $\mathbf{x} \in \mathbb{R}^{d_{\text{in}}}$, a layer computes:

$\mathbf{h} = \phi(W\mathbf{x} + \mathbf{b})$

where $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is the weight matrix, $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$ is the bias, and $\phi$ is a pointwise non-linearity (ReLU, GELU, tanh, etc.).

Why the non-linearity? Without $\phi$, stacked layers collapse: $W_2(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = W'\mathbf{x} + \mathbf{b}'$ with $W' = W_2 W_1$ and $\mathbf{b}' = W_2\mathbf{b}_1 + \mathbf{b}_2$, which is a single affine transformation. Non-linearities are what give depth its expressivity.
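
The collapse can be checked numerically. The following is a minimal NumPy sketch; the matrices and the input are arbitrary values chosen for illustration (they do not come from any particular model), picked so that one pre-activation is negative and the ReLU actually changes the result.

```python
import numpy as np

# Two stacked affine layers with no activation in between (illustrative values).
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.5])

# The equivalent single affine layer: W' = W2 W1, b' = W2 b1 + b2.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2

x = np.array([1.0, 2.0])
stacked = W2 @ (W1 @ x + b1) + b2       # two layers, no non-linearity
collapsed = W_prime @ x + b_prime       # one layer

# Inserting a ReLU between the layers breaks the equivalence
# because the first pre-activation is negative for this input.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(np.allclose(stacked, collapsed))    # True
print(np.allclose(with_relu, collapsed))  # False
```

The first check holds for every input, since it is an algebraic identity. The second depends on the input: the ReLU only matters where some pre-activation is negative.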

The Multi-Layer Perceptron

An MLP (multi-layer perceptron) chains $L$ such layers:

$\mathbf{h}^{(0)} = \mathbf{x}$

$\mathbf{h}^{(\ell)} = \phi\bigl(W^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\bigr), \quad \ell = 1,\ldots,L-1$

$\hat{\mathbf{y}} = W^{(L)}\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}$

The final layer is linear (no activation) — the appropriate output non-linearity (softmax, sigmoid) is applied by the loss function or separately.
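
The recursion above can be traced in a few lines of NumPy. This is a sketch under stated assumptions rather than a reference implementation: `mlp_forward` is a hypothetical helper, the layer sizes are arbitrary, ReLU stands in for $\phi$, and the weights are random rather than trained.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases, phi=relu):
    """Forward pass of an L-layer MLP; returns the output and every h^(l)."""
    h = x
    activations = [h]                      # h^(0) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = phi(W @ h + b)                 # hidden layer: affine map, then phi
        activations.append(h)
    y_hat = weights[-1] @ h + biases[-1]   # final layer: affine only (logits)
    return y_hat, activations

# Hypothetical sizes: d_in = 4, two hidden layers of width 8, d_out = 3, so L = 3.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

y_hat, activations = mlp_forward(rng.normal(size=4), weights, biases)
print(y_hat.shape, [h.shape for h in activations])
```

Each weight matrix has shape (output size, input size), matching $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, so the list of weights has exactly $L$ entries.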

Key vocabulary:

Input layer: $\mathbf{h}^{(0)} = \mathbf{x}$
Hidden layers: $\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(L-1)}$ — these are the learned representations
Output layer: $\hat{\mathbf{y}}$ — prediction in task-specific space
Width: $d_h$, the dimensionality of the hidden layers
Depth: $L$, the number of weight layers

Universal Approximation

The Universal Approximation Theorem (Cybenko 1989; Hornik et al. 1989) states that an MLP with a single hidden layer, a suitable non-linear activation (for example, a sigmoid), and sufficiently many neurons can approximate any continuous function on a compact domain to arbitrary precision.

What this guarantees: The function class of MLPs is expressive enough in principle.

What it does not guarantee:

That shallow networks are practical — the required width may be exponential in the input dimension
That gradient descent will find the right parameters
That the network will generalize to unseen inputs

The theorem is primarily an existence result. In practice, adding depth is often a more parameter-efficient route to expressivity than widening a single hidden layer.
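
To make the existence claim concrete, here is one constructive illustration for a scalar input. It is not the theorem's proof. It builds a one-hidden-layer ReLU network that reproduces the piecewise-linear interpolant of a target function, so the error shrinks as the width grows. The helper `relu_interpolant`, the target `np.sin`, and the interval are assumptions made for the example. The weights are set by hand rather than learned, which is exactly the gap the caveats above describe.

```python
import numpy as np

def relu_interpolant(f, a, b, width):
    """One-hidden-layer ReLU net matching f at width + 1 equally spaced knots on [a, b]."""
    knots = np.linspace(a, b, width + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope on each segment
    coeffs = np.diff(slopes, prepend=0.0)       # output weight = change in slope at each knot
    bias_out = values[0]

    def net(x):
        x = np.asarray(x, dtype=float)
        hidden = np.maximum(0.0, x[..., None] - knots[:-1])   # ReLU(x - t_k), one unit per knot
        return hidden @ coeffs + bias_out

    return net

xs = np.linspace(0.0, 2.0 * np.pi, 1001)
for width in (4, 16, 64):
    net = relu_interpolant(np.sin, 0.0, 2.0 * np.pi, width)
    max_err = np.max(np.abs(net(xs) - np.sin(xs)))
    print(width, max_err)   # the error decreases as the hidden layer widens
```

In one dimension the required width grows modestly with the target accuracy. The exponential blow-up mentioned above concerns high-dimensional inputs, which this sketch does not demonstrate.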

Depth vs. Width

Depth (more layers, fixed width) offers:

Compositional representations: each layer builds on the previous one — edges → shapes → objects in vision; morphemes → words → phrases in NLP
Parameter efficiency: some functions require exponentially more neurons to represent in a shallow network than a deep one
Trainability at large depth when combined with residual connections, which counter the gradient-flow problems that depth otherwise introduces (see the ResNet reading)

Width (larger hidden dimension, fixed depth) offers:

More capacity without gradient flow issues — easier to train
Direct expressivity gain — more neurons in one layer

Practical guidance: modern architectures prefer moderate depth (4–48 layers) with sufficient width. For structured data, 2–4 layers is often enough; for unstructured data (images, text), depth is essential.
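
The parameter-efficiency point can be illustrated with a classic sawtooth construction. This is a sketch under stated assumptions: the `tent` layer and the helper names are chosen for this example. Composing a two-unit ReLU layer $k$ times yields a function with $2^k$ linear pieces using only $2k$ hidden units. A one-hidden-layer ReLU network on a scalar input with $n$ units has at most $n + 1$ linear pieces, so it would need at least $2^k - 1$ units to represent the same function.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # One hidden layer with two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Compose the same two-unit layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(y):
    # Count slope sign changes on a uniform grid; each change starts a new linear piece.
    slopes = np.sign(np.diff(y))
    return 1 + int(np.count_nonzero(np.diff(slopes)))

x = np.arange(1025) / 1024.0   # dyadic grid on [0, 1], so every kink lands on a grid point
for depth in (1, 2, 3, 4):
    pieces = count_linear_pieces(deep_sawtooth(x, depth))
    print(depth, 2 * depth, pieces)   # depth, hidden units used, linear pieces
```

The piece count doubles with each added layer while the unit count grows by two. This shows that such functions exist; it does not show that typical learned functions have this structure.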

Layer Normalization

Raw pre-activations can have very different scales across layers, which destabilizes training. Layer normalization normalizes within each example across features:

$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

where $\mu$ and $\sigma^2$ are the mean and variance computed over the feature dimension of a single example, $\epsilon$ is a small constant that prevents division by zero, and $\gamma$, $\beta$ are learned scale and shift parameters.

BatchNorm (alternative) normalizes across the batch dimension — more common in CNNs
LayerNorm normalizes across the feature dimension of each example — standard in transformers and RNNs, where batch statistics are unreliable
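
The difference between the two comes down to which axis the statistics are taken over. The NumPy sketch below assumes a 2-D input of shape (batch, features). The function names are hypothetical, `eps` is a small stabilizing constant commonly added under the square root in practice, and the batch-norm variant shows training-mode statistics only (no running averages).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics over the feature axis (last axis), separately for each example.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # Training-mode statistics over the batch axis (axis 0), separately for each feature.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(4, 16))   # batch of 4 examples, 16 features
gamma, beta = np.ones(16), np.zeros(16)

ln = layer_norm(x, gamma, beta)
bn = batch_norm_train(x, gamma, beta)
print(ln.mean(axis=1), ln.std(axis=1))   # per-example mean near 0, std near 1
print(bn.mean(axis=0))                   # per-feature mean near 0
```

Because `layer_norm` never looks across the batch axis, its output for one example is the same whether the batch holds 1 example or 1,000.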

Dropout

Dropout randomly zeros out neurons during training with probability $p$ (typically 0.1–0.5):

$\mathbf{h}_{\text{drop}} = \frac{\mathbf{h} \odot \mathbf{m}}{1-p}, \quad m_i \sim \text{Bernoulli}(1-p)$

The division by $(1-p)$ is what makes this inverted dropout: it keeps the expected value of each activation unchanged, so the same network can be used at inference time without any rescaling.
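
A minimal NumPy sketch of inverted dropout follows. The function name and the `training` flag are assumptions for this example; frameworks expose the same idea through their own layers.

```python
import numpy as np

def inverted_dropout(h, p, rng, training=True):
    if not training or p == 0.0:
        return h                          # inference: identity, no rescaling needed
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale survivors so the expectation is unchanged

rng = np.random.default_rng(0)
h = np.ones(100_000)
h_drop = inverted_dropout(h, p=0.3, rng=rng)

print(np.unique(h_drop))   # dropped units are 0, survivors are scaled to 1 / (1 - p)
print(h_drop.mean())       # close to 1.0, the mean of h
```

The mean matches only in expectation. On any single forward pass the activations are noisy, which is the source of the regularizing effect.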

Why it works: Forces the network to learn redundant representations — no single neuron can be relied upon. Acts as a form of ensemble averaging over exponentially many sub-networks.

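Where to apply: After the activation and before the next layer. Dropout is not typically used on the output layer. It is often reduced or omitted when pretraining large transformers on very large datasets, where overfitting is less of a concern. Layer normalization does not replace it: normalization targets training stability, while dropout targets overfitting.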

Parameter Count

For an MLP with input dim $d_{\text{in}}$, $L-1$ hidden layers of width $d_h$, and output dim $d_{\text{out}}$:

$\text{params} = d_{\text{in}} \cdot d_h + (L-2) \cdot d_h^2 + d_h \cdot d_{\text{out}} + \text{biases}, \quad \text{biases} = (L-1) \cdot d_h + d_{\text{out}}$

Hidden layers dominate: each $d_h \times d_h$ weight matrix has $d_h^2$ parameters. For $d_h = 1024$ and $L = 6$, that is about 1M parameters per hidden-to-hidden matrix ($1024^2 = 1{,}048{,}576$) and about 4.2M across the four such matrices.

This quadratic scaling in $d_h$ is what makes MLPs expensive to widen and motivates sparse architectures (Mixture of Experts) for very large models.
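
The count is easy to verify in plain Python. In this sketch the function names are hypothetical and the values $d_{\text{in}} = 784$, $d_{\text{out}} = 10$ are illustrative assumptions. The closed-form expression is cross-checked against a layer-by-layer sum.

```python
def mlp_param_count(d_in: int, d_h: int, d_out: int, L: int) -> dict:
    """Parameter count for an MLP with L weight layers (L - 1 hidden layers of width d_h)."""
    if L < 2:
        raise ValueError("this formula assumes at least one hidden layer (L >= 2)")
    weights = d_in * d_h + (L - 2) * d_h * d_h + d_h * d_out
    biases = (L - 1) * d_h + d_out
    return {"weights": weights, "biases": biases, "total": weights + biases}

def count_by_layer(d_in: int, d_h: int, d_out: int, L: int) -> int:
    # Cross-check: sum (fan_in + 1) * fan_out over every layer.
    sizes = [d_in] + [d_h] * (L - 1) + [d_out]
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

counts = mlp_param_count(d_in=784, d_h=1024, d_out=10, L=6)
print(counts)
print(counts["total"] == count_by_layer(784, 1024, 10, 6))   # True

# Doubling the width roughly quadruples the hidden-to-hidden weights.
wide = mlp_param_count(d_in=784, d_h=2048, d_out=10, L=6)
print(wide["weights"] / counts["weights"])
```

With these sizes the biases contribute about 5K of roughly 5M parameters, which is why they are usually ignored in back-of-the-envelope estimates.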

PyTorch and TensorFlow

PyTorch — building MLPs with nn.Sequential and nn.Module:

import torch
import torch.nn as nn

# Concise MLP with nn.Sequential
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

x   = torch.randn(32, 784)   # batch of 32 flattened images
out = mlp(x)                 # (32, 10) logits

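# Custom Module gives more control (separate forward logic, weight sharing, etc.)
# Note: `depth` here counts hidden layers, so the model has depth + 1 weight layers
# (that is, L = depth + 1 in the notation used earlier in this reading).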
class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int, depth: int):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP(in_dim=784, hidden=512, out_dim=10, depth=4)
print(sum(p.numel() for p in model.parameters()))  # parameter count
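
Layer normalization and dropout slot into the same pattern. The sketch below shows one common arrangement (Linear, then LayerNorm, then ReLU, then Dropout), not the only valid ordering. The class name `RegularizedMLP` and its arguments are assumptions for this example; `nn.LayerNorm` and `nn.Dropout` are standard PyTorch modules.

```python
import torch
import torch.nn as nn

class RegularizedMLP(nn.Module):
    """MLP whose hidden blocks are Linear -> LayerNorm -> ReLU -> Dropout."""

    def __init__(self, in_dim: int, hidden: int, out_dim: int,
                 num_hidden: int, p_drop: float = 0.1):
        super().__init__()
        layers = []
        d_prev = in_dim
        for _ in range(num_hidden):
            layers += [
                nn.Linear(d_prev, hidden),
                nn.LayerNorm(hidden),   # normalize pre-activations per example
                nn.ReLU(),
                nn.Dropout(p_drop),     # after the activation, before the next layer
            ]
            d_prev = hidden
        layers.append(nn.Linear(d_prev, out_dim))   # linear output layer, no dropout
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = RegularizedMLP(in_dim=784, hidden=256, out_dim=10, num_hidden=2)
x = torch.randn(32, 784)

model.train()            # dropout active: repeated calls sample different masks
train_out = model(x)

model.eval()             # dropout disabled: outputs are deterministic
with torch.no_grad():
    eval_out = model(x)
print(train_out.shape, eval_out.shape)
```

Switching between `model.train()` and `model.eval()` is what turns dropout on and off. Forgetting `model.eval()` at inference time is a common source of noisy predictions.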

TensorFlow / Keras:

import tensorflow as tf

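# Sequential API (an explicit Input object declares the input shape;
# recent Keras versions warn when input_shape is passed to a layer)
mlp = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10),   # logits: no activation on the output layer
])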

# Functional API — preferred when you need multiple inputs/outputs or skip connections
inputs  = tf.keras.Input(shape=(784,))
x       = tf.keras.layers.Dense(256, activation='relu')(inputs)
x       = tf.keras.layers.Dense(256, activation='relu')(x)
outputs = tf.keras.layers.Dense(10)(x)
model   = tf.keras.Model(inputs, outputs)

model.summary()   # prints layer shapes and parameter counts
