Supplement · Regularization

L1 and L2 Weight Penalties

15 min read

By the end of this reading you will be able to:

Derive the L2-regularized gradient update and show that it is equivalent to decaying every weight by (1 − αλ) before the gradient step — the origin of the term 'weight decay'
Explain why L1 regularization induces sparsity while L2 regularization does not, using the geometry of their constraint sets and the shape of their gradients near zero
Interpret L2 regularization as a Gaussian prior on weights (MAP estimation) and L1 as a Laplace prior, and state what the strength λ controls in both cases
Explain why weight decay and L2 regularization are equivalent for SGD but not for Adam, and state what AdamW fixes

The Idea: Penalize Large Weights

A model with large weights is one that can change its output dramatically in response to small input changes — a sign of high variance. Penalizing weight magnitude directly limits how complex the learned function can be.

The regularized objective adds a penalty term to the training loss:

$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \lambda\, \Omega(\theta)$

where $\Omega(\theta)$ is the penalty and $\lambda > 0$ controls its strength. A larger $\lambda$ means stronger regularization and, typically, a simpler model.

L2 Regularization (Weight Decay)

The L2 penalty adds the sum of squared weights:

$\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2 = \frac{1}{2}\sum_j \theta_j^2$

The gradient of the regularized loss:

$\nabla_\theta \tilde{\mathcal{L}} = \nabla_\theta \mathcal{L} + \lambda\theta$

The SGD update becomes:

$\theta \leftarrow \theta - \alpha(\nabla_\theta \mathcal{L} + \lambda\theta) = \underbrace{(1 - \alpha\lambda)}_{\text{decay factor}}\theta - \alpha\nabla_\theta\mathcal{L}$

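This is weight decay: every weight is multiplied by $(1-\alpha\lambda)$ before the gradient step. The factor lies strictly between 0 and 1 whenever $0 < \alpha\lambda < 1$, which holds for typical learning rates and penalty strengths. Weights are continuously pulled toward zero unless the gradient pushes them away.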
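The equivalence can be checked numerically. The following is a minimal sketch in plain Python; the function names and the sample weights, gradients, and hyperparameters are illustrative assumptions, not values from this reading.

```python
def l2_sgd_step(theta, grad, lr, lam):
    """SGD step on the L2-regularized loss: theta - lr * (grad + lam * theta)."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]


def weight_decay_step(theta, grad, lr, lam):
    """The same step written as shrink-then-update: (1 - lr*lam) * theta - lr * grad."""
    decay = 1.0 - lr * lam
    return [decay * t - lr * g for t, g in zip(theta, grad)]


theta = [0.5, -2.0, 3.0]   # illustrative weights
grad = [0.1, -0.4, 0.0]    # illustrative gradient of the unregularized loss
lr, lam = 0.1, 0.01        # alpha and lambda in the text

a = l2_sgd_step(theta, grad, lr, lam)
b = weight_decay_step(theta, grad, lr, lam)
max_gap = max(abs(x - y) for x, y in zip(a, b))
print(a, b, max_gap)
```

The two forms agree up to floating-point rounding. The third weight has zero data gradient, so its update is pure decay: it is simply multiplied by $1 - \alpha\lambda = 0.999$.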

Effect on the Loss Landscape

L2 adds a bowl-shaped penalty centered at the origin. The regularized minimum is pulled from the unregularized minimum toward zero. Directions in which the loss is nearly flat (parameters that barely affect the loss) shrink the most, because keeping them large is not worth the penalty.
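A one-dimensional quadratic makes the shrinkage concrete. Assume, purely for illustration, that the loss along one coordinate is $\frac{1}{2}h(\theta - \theta^*)^2$ with curvature $h > 0$. Setting the derivative of the regularized loss to zero, $h(\theta - \theta^*) + \lambda\theta = 0$, gives a minimum at $\frac{h}{h+\lambda}\theta^*$. The sketch below evaluates that expression for a sharp and a flat direction; the function name and numbers are hypothetical.

```python
def regularized_minimum(theta_star, curvature, lam):
    """Minimizer of 0.5*curvature*(theta - theta_star)**2 + 0.5*lam*theta**2."""
    return curvature / (curvature + lam) * theta_star


lam = 1.0
theta_star = 2.0  # unregularized minimum along this coordinate

sharp = regularized_minimum(theta_star, curvature=10.0, lam=lam)  # loss is sensitive
flat = regularized_minimum(theta_star, curvature=0.1, lam=lam)    # loss is nearly flat
print(sharp, flat)
```

With these values the sharp direction keeps about 91% of its unregularized value, while the flat direction keeps only about 9%.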

L2 as a Gaussian Prior (MAP Estimation)

Maximizing the posterior $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$ with a Gaussian prior $p(\theta) = \mathcal{N}(0, \lambda^{-1} I)$ (independent zero-mean weights, each with variance $1/\lambda$) is equivalent, up to an additive constant, to minimizing:

$-\log p(\mathcal{D} \mid \theta) + \frac{\lambda}{2}\|\theta\|_2^2$

The L2 penalty is a Gaussian prior. $\lambda$ is the precision (inverse variance) of the prior — larger $\lambda$ concentrates the prior more tightly around zero.
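As a quick sanity check, the negative log-density of a zero-mean Gaussian with variance $1/\lambda$ should differ from $\frac{\lambda}{2}\theta^2$ only by a constant that does not depend on $\theta$. The sketch below verifies this for a single scalar weight; the function names and test points are illustrative assumptions.

```python
import math


def gaussian_neg_log_prior(theta, lam):
    """-log N(theta; mean 0, variance 1/lam) for a scalar weight."""
    var = 1.0 / lam
    return 0.5 * math.log(2.0 * math.pi * var) + theta ** 2 / (2.0 * var)


def l2_penalty(theta, lam):
    """The L2 penalty term lam/2 * theta^2 for a scalar weight."""
    return 0.5 * lam * theta ** 2


lam = 4.0
points = (-1.5, 0.0, 0.3, 2.0)
offsets = [gaussian_neg_log_prior(t, lam) - l2_penalty(t, lam) for t in points]
print(offsets)  # the same value at every point
```

Because the offset is constant in $\theta$, it has no effect on where the minimum lies, which is why the MAP objective and the L2-regularized loss share the same minimizer.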

L1 Regularization (Lasso)

The L1 penalty uses the sum of absolute values:

$\Omega(\theta) = \|\theta\|_1 = \sum_j |\theta_j|$

Gradient (where defined, $\theta_j \neq 0$ ):

$\frac{\partial}{\partial \theta_j}\tilde{\mathcal{L}} = \frac{\partial\mathcal{L}}{\partial\theta_j} + \lambda\,\text{sign}(\theta_j)$

Why L1 Induces Sparsity

L2 gradient at small $\theta_j$ : $\lambda\theta_j \to 0$ . The penalty becomes negligible — small weights are barely penalized, so they never reach exactly zero.

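L1 gradient at small $\theta_j$: $\lambda\,\text{sign}(\theta_j) = \pm\lambda$. The penalty does not shrink as the weight approaches zero: at each step the penalty term subtracts a constant $\alpha\lambda$ from $|\theta_j|$, so a weight whose data gradient stays smaller than $\lambda$ in magnitude reaches zero in a finite number of steps. It can stay there because the subdifferential of $|\theta_j|$ at the origin is the whole interval $[-1, 1]$, which makes zero a stationary point whenever $|\partial\mathcal{L}/\partial\theta_j| \le \lambda$. In practice, plain subgradient steps tend to overshoot and oscillate around zero; exact zeros come from proximal updates (soft-thresholding) that set the weight to zero once $|\theta_j| \le \alpha\lambda$.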
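The contrast between multiplicative and constant shrinkage can be sketched by applying only the penalty, with the data gradient ignored. Note the assumption: the L1 update below is a soft-thresholding (proximal) step, a standard way to handle the kink at zero that this reading does not spell out. Function names and values are illustrative.

```python
def l2_shrink(theta, lr, lam):
    """Penalty-only L2 step: multiply the weight by (1 - lr*lam)."""
    return (1.0 - lr * lam) * theta


def l1_soft_threshold(theta, lr, lam):
    """Penalty-only L1 proximal step: move lr*lam toward zero, clipping at zero."""
    shrink = lr * lam
    if theta > shrink:
        return theta - shrink
    if theta < -shrink:
        return theta + shrink
    return 0.0


lr, lam = 0.1, 0.5  # each L1 step removes lr*lam = 0.05 from |theta|
w_l2 = w_l1 = 0.42
l1_steps_to_zero = None
for k in range(1, 101):
    w_l2 = l2_shrink(w_l2, lr, lam)
    w_l1 = l1_soft_threshold(w_l1, lr, lam)
    if w_l1 == 0.0 and l1_steps_to_zero is None:
        l1_steps_to_zero = k

print(l1_steps_to_zero, w_l1, w_l2)
```

The L1 weight hits exactly zero after nine steps and stays there, while the L2 weight is still positive after one hundred steps: it shrinks geometrically and only approaches zero.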

Geometric view: The L2 constraint set is a ball (smooth, no corners). The L1 constraint set is a diamond (cross-polytope) with corners on the axes. Convex optimization problems constrained to the L1 ball frequently have solutions at the corners or edges, which lie on the axes or coordinate subspaces, where many coordinates are exactly zero. On the smooth L2 ball the solution almost never lands exactly on an axis.

L1 as a Laplace Prior

A Laplace prior $p(\theta_j) \propto \exp(-\lambda|\theta_j|)$ gives MAP estimation equivalent to L1 regularization. The Laplace prior has a sharp peak at zero and heavier tails than Gaussian — it strongly prefers zeros but allows large values when the data demands them.
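The shape comparison can be made concrete by evaluating negative log-densities, where a lower value means higher prior density. The sketch below uses normalized densities and an illustrative $\lambda = 1$ for both priors; the function names and evaluation points are assumptions chosen for the example.

```python
import math


def laplace_neg_log_prior(theta, lam):
    """-log of the normalized Laplace density (lam/2) * exp(-lam*|theta|)."""
    return -math.log(lam / 2.0) + lam * abs(theta)


def gaussian_neg_log_prior(theta, lam):
    """-log N(theta; mean 0, variance 1/lam)."""
    return 0.5 * math.log(2.0 * math.pi / lam) + 0.5 * lam * theta ** 2


lam = 1.0
for theta in (0.0, 1.0, 5.0):
    print(theta, laplace_neg_log_prior(theta, lam), gaussian_neg_log_prior(theta, lam))

laplace_at_zero = laplace_neg_log_prior(0.0, lam)
gaussian_at_zero = gaussian_neg_log_prior(0.0, lam)
laplace_far = laplace_neg_log_prior(5.0, lam)
gaussian_far = gaussian_neg_log_prior(5.0, lam)
```

In this setting the Laplace prior assigns more density than the Gaussian both at zero and far out at $\theta = 5$, and less in between at $\theta = 1$. The Laplace cost grows linearly in $|\theta|$ while the Gaussian cost grows quadratically, which is the heavier-tail behavior described above.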

Elastic Net

Elastic net combines L1 and L2:

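$\Omega(\theta) = \rho\|\theta\|_1 + \frac{1-\rho}{2}\|\theta\|_2^2$

where $\rho \in [0, 1]$ is the mixing weight (written $\rho$ here to avoid a clash with the learning rate $\alpha$): $\rho = 1$ recovers the L1 penalty and $\rho = 0$ recovers the L2 penalty. Elastic net retains L1's sparsity-inducing property, while the L2 term makes a convex objective strictly convex, so the solution is unique and more stable when features are correlated. It is useful when groups of correlated features should either all be selected or all dropped.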
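A minimal sketch of the combined penalty in plain Python follows. The argument name `mix` stands for the mixing weight, that is, the coefficient on the L1 term in the formula above; the function name and sample vector are illustrative.

```python
def elastic_net_penalty(theta, mix):
    """mix * ||theta||_1 + (1 - mix)/2 * ||theta||_2^2, with mix in [0, 1]."""
    l1 = sum(abs(t) for t in theta)
    l2_squared = sum(t * t for t in theta)
    return mix * l1 + 0.5 * (1.0 - mix) * l2_squared


theta = [3.0, -4.0, 0.0]
pure_l1 = elastic_net_penalty(theta, mix=1.0)  # only the L1 term: 3 + 4
pure_l2 = elastic_net_penalty(theta, mix=0.0)  # only the L2 term: (9 + 16) / 2
blend = elastic_net_penalty(theta, mix=0.5)    # halfway between the two
print(pure_l1, pure_l2, blend)
```

The two endpoints recover the pure penalties, and intermediate values interpolate linearly between them. The overall strength is still set by $\lambda$ in the regularized objective.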

Weight Decay vs. L2: The Adam Distinction

For SGD, L2 regularization and weight decay are mathematically identical (as shown above). For adaptive optimizers like Adam, they diverge.

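Adam's update scales the gradient by $1/\sqrt{v_t}$, where $v_t$ is the running second-moment estimate of the gradient. When the L2 penalty is added to the loss, its gradient $\lambda\theta$ receives the same scaling, so parameters with historically large gradients are decayed less than parameters with small gradients. The decay is no longer uniform across weights, which defeats the purpose.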

AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient update. With $m_t$ and $v_t$ denoting Adam's (bias-corrected) first- and second-moment estimates:

$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \frac{m_t}{\sqrt{v_t} + \epsilon} - \alpha\lambda\,\theta_{t-1}$

The decay $\alpha\lambda\theta$ is applied directly to the weights, not to the gradient before the adaptive step. This restores the "equal decay for all weights" behavior intended by weight decay, and is now the standard for training transformers.
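The difference can be illustrated by isolating the part of one step that comes from the penalty. This is a simplified sketch, not a full optimizer: it treats the second-moment estimate as a given number (in real Adam the penalty gradient also feeds into that estimate), and the function names and values are hypothetical.

```python
import math


def adam_l2_penalty_step(theta, v_hat, lr, lam, eps=1e-8):
    """Share of an Adam step due to an L2 term in the loss: lr*lam*theta / (sqrt(v_hat) + eps)."""
    return lr * lam * theta / (math.sqrt(v_hat) + eps)


def adamw_decay_step(theta, lr, lam):
    """Decoupled decay applied directly to the weight: lr*lam*theta."""
    return lr * lam * theta


lr, lam, theta = 1e-3, 0.01, 1.0
v_hat_large_grads = 100.0  # parameter that has seen gradients of magnitude ~10
v_hat_small_grads = 1e-4   # parameter that has seen gradients of magnitude ~0.01

coupled_large = adam_l2_penalty_step(theta, v_hat_large_grads, lr, lam)
coupled_small = adam_l2_penalty_step(theta, v_hat_small_grads, lr, lam)
decoupled = adamw_decay_step(theta, lr, lam)
ratio = coupled_small / coupled_large
print(coupled_large, coupled_small, decoupled, ratio)
```

Two weights of identical size are decayed by amounts that differ by roughly a factor of 1000 under the coupled form, purely because of their gradient history. The decoupled form shrinks both by the same $\alpha\lambda\theta$.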

Note: the optimizer supplement covers the AdamW update rule in detail. This reading covers the regularization geometry behind why decoupling matters.

Practical Guidance

Property	L1	L2 / Weight Decay
Effect	Sparse solutions	Small but nonzero solutions
Penalty gradient near zero	Constant magnitude ($\pm\lambda$); undefined exactly at zero	Vanishes ($\lambda\theta_j \to 0$), so no push to exactly zero
Typical $\lambda$	$10^{-4}$ – $10^{-2}$	$10^{-4}$ – $10^{-2}$
Common in deep learning	Rarely (pruning methods preferred)	Almost universally
Common in linear models	Feature selection (Lasso)	Ridge regression

PyTorch and TensorFlow

PyTorch — L2 via weight_decay, L1 manually, AdamW:

import torch
import torch.nn as nn

model = nn.Linear(64, 10)

# L2 regularization: most built-in optimizers accept a weight_decay argument
# SGD update becomes: theta <- theta*(1 - lr*lambda) - lr*grad  (weight shrinkage)
optimizer_sgd   = torch.optim.SGD(model.parameters(),  lr=0.01, weight_decay=1e-4)
# Adam's weight_decay adds lambda*theta to the gradient (coupled L2, not decoupled decay)
optimizer_adam  = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# AdamW: decouples weight decay from the adaptive learning rate scaling
# PREFERRED over Adam + weight_decay for transformers and modern architectures
optimizer_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# L1 regularization: no built-in; add manually to the loss
def l1_penalty(model, lam=1e-4):
    return lam * sum(p.abs().sum() for p in model.parameters())

criterion = nn.CrossEntropyLoss()
x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
loss = criterion(model(x), y) + l1_penalty(model)
loss.backward()

# Elastic net: combine both terms
def elastic_net(model, l1=1e-5, l2=1e-4):
    return (l1 * sum(p.abs().sum()  for p in model.parameters()) +
            l2 * sum(p.pow(2).sum() for p in model.parameters()))

TensorFlow / Keras:

import tensorflow as tf

# Regularizers passed directly to layer constructors
dense_l2 = tf.keras.layers.Dense(64, activation='relu',
                                  kernel_regularizer=tf.keras.regularizers.L2(1e-4))
dense_l1 = tf.keras.layers.Dense(64,
                                  kernel_regularizer=tf.keras.regularizers.L1(1e-5))
dense_en = tf.keras.layers.Dense(64,
                                  kernel_regularizer=tf.keras.regularizers.L1L2(l1=1e-5, l2=1e-4))

# AdamW — built-in from TF 2.11+
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.01)

References

Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019. (Introduces AdamW.)
