L1 and L2 Weight Penalties
- Derive the L2-regularized gradient update and show that it is equivalent to decaying every weight by (1 − αλ) before the gradient step — the origin of the term 'weight decay'
- Explain why L1 regularization induces sparsity while L2 regularization does not, using the geometry of their constraint sets and the shape of their gradients near zero
- Interpret L2 regularization as a Gaussian prior on weights (MAP estimation) and L1 as a Laplace prior, and state what the strength λ controls in both cases
- Explain why weight decay and L2 regularization are equivalent for SGD but not for Adam, and state what AdamW fixes
The Idea: Penalize Large Weights
A model with large weights is one that can change its output dramatically in response to small input changes — a sign of high variance. Penalizing weight magnitude directly limits how complex the learned function can be.
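The regularized objective adds a penalty term to the training loss:
$$\tilde{L}(\theta) = L(\theta) + \lambda\,\Omega(\theta)$$
where $\Omega(\theta)$ is the penalty and $\lambda \ge 0$ controls its strength. A larger $\lambda$ means stronger regularization and a simpler model.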
L2 Regularization (Weight Decay)
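The L2 penalty adds the sum of squared weights (the factor $\tfrac{1}{2}$ is a convention that cancels when differentiating):
$$\tilde{L}(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_i \theta_i^2$$
The gradient of the regularized loss:
$$\nabla_\theta \tilde{L}(\theta) = \nabla_\theta L(\theta) + \lambda\theta$$
The SGD update with learning rate $\alpha$ becomes:
$$\theta \leftarrow \theta - \alpha\left(\nabla_\theta L(\theta) + \lambda\theta\right) = (1 - \alpha\lambda)\,\theta - \alpha\nabla_\theta L(\theta)$$
This is weight decay: every weight is multiplied by $(1 - \alpha\lambda)$ before the gradient step. Weights are continuously pulled toward zero unless the gradient pushes them away.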
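The equivalence can be checked numerically. The sketch below is a minimal pure-Python illustration rather than part of the derivation: the function names `sgd_step_l2` and `sgd_step_decay` and the toy numbers are made up for the example, and `grad` stands in for the data-loss gradient.

```python
def sgd_step_l2(theta, grad, lr, lam):
    """One SGD step on the L2-regularized loss: the gradient is grad + lam * theta."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]


def sgd_step_decay(theta, grad, lr, lam):
    """Shrink each weight by (1 - lr * lam), then take a plain gradient step."""
    return [(1.0 - lr * lam) * t - lr * g for t, g in zip(theta, grad)]


theta = [0.5, -1.2, 3.0]      # toy weights (assumed values)
grad = [0.1, -0.4, 0.25]      # stand-in for the data-loss gradient
lr, lam = 0.1, 0.01

a = sgd_step_l2(theta, grad, lr, lam)
b = sgd_step_decay(theta, grad, lr, lam)
max_diff = max(abs(x - y) for x, y in zip(a, b))

print("L2-in-gradient step:", a)
print("decay-then-step:    ", b)
print("largest difference: ", max_diff)
```

The two updates agree up to floating-point rounding, which is all the algebra claims for plain SGD.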
Effect on the Loss Landscape
L2 adds a bowl-shaped penalty centered at the origin. The regularized minimum is pulled from the unregularized minimum toward zero. Parameters that contribute little to the loss shrink the most — they are not worth the penalty.
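A one-dimensional quadratic makes the last claim concrete. The sketch below assumes a hypothetical loss of the form `0.5 * h * (theta - theta0)**2` per weight, where `h` is the curvature (how strongly the loss depends on that weight); adding the L2 penalty `0.5 * lam * theta**2` and setting the derivative to zero gives the minimizer `h * theta0 / (h + lam)`. The helper name and the values are illustrative only.

```python
def ridge_minimum(h, theta0, lam):
    """Minimizer of 0.5*h*(theta - theta0)**2 + 0.5*lam*theta**2 (1-D quadratic)."""
    return h * theta0 / (h + lam)


lam = 0.1
theta0 = 1.0                      # unregularized minimum for every weight
curvatures = [10.0, 1.0, 0.01]    # how strongly the loss depends on each weight

# Fraction of the unregularized value that survives regularization
shrink = [ridge_minimum(h, theta0, lam) / theta0 for h in curvatures]

for h, s in zip(curvatures, shrink):
    print(f"curvature {h:>5}: regularized minimum = {s:.3f} * theta0")
```

Weights in high-curvature directions barely move, while the weight with curvature well below `lam` is shrunk to a small fraction of its unregularized value.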
L2 as a Gaussian Prior (MAP Estimation)
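Maximizing the posterior with a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$ is equivalent to minimizing:
$$-\log p(\mathcal{D} \mid \theta) + \frac{1}{2\sigma^2}\|\theta\|_2^2$$
The L2 penalty is a Gaussian prior. When $L$ is the negative log-likelihood, matching this to $\frac{\lambda}{2}\|\theta\|_2^2$ gives $\lambda = 1/\sigma^2$, the precision (inverse variance) of the prior: a larger $\lambda$ concentrates the prior more tightly around zero.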
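The intermediate steps can be written out as follows. This is a sketch under the assumption that $L(\theta)$ is the total (not averaged) negative log-likelihood; if the loss is averaged over $N$ examples, the matching strength becomes $\lambda = 1/(N\sigma^2)$ instead.

```latex
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_\theta \; \log p(\mathcal{D} \mid \theta) + \log p(\theta) \\
\log p(\theta)
  &= -\frac{1}{2\sigma^2}\|\theta\|_2^2 + \text{const}
  \qquad \text{for } \theta \sim \mathcal{N}(0, \sigma^2 I) \\
\hat{\theta}_{\text{MAP}}
  &= \arg\min_\theta \; \underbrace{-\log p(\mathcal{D} \mid \theta)}_{L(\theta)}
     + \underbrace{\frac{1}{2\sigma^2}}_{\lambda/2}\,\|\theta\|_2^2
\end{aligned}
```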
L1 Regularization (Lasso)
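The L1 penalty uses the sum of absolute values:
$$\tilde{L}(\theta) = L(\theta) + \lambda\|\theta\|_1 = L(\theta) + \lambda\sum_i |\theta_i|$$
Gradient of the penalty (where defined, $\theta_i \neq 0$):
$$\frac{\partial}{\partial\theta_i}\,\lambda\|\theta\|_1 = \lambda\,\operatorname{sign}(\theta_i)$$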
Why L1 Induces Sparsity
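L2 gradient at small $|\theta_i|$: $\lambda\theta_i \to 0$. The penalty's pull becomes negligible, so small weights are barely penalized and never reach exactly zero.
L1 gradient at small $|\theta_i|$: $\lambda\,\operatorname{sign}(\theta_i) = \pm\lambda$. The pull does not shrink as the weight approaches zero. Each step subtracts a constant $\alpha\lambda$ from $|\theta_i|$, so unless the data gradient pushes back, the weight reaches zero in a finite number of steps. It then stays there whenever $|\partial L / \partial\theta_i| \le \lambda$, because the subdifferential of $\lambda|\theta_i|$ at the origin is the whole interval $[-\lambda, \lambda]$ and therefore contains a value that cancels the data gradient. (A plain subgradient step can overshoot zero; proximal, or soft-thresholding, updates clamp the weight at exactly zero.)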
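The contrast between the two penalty gradients can be simulated directly. The sketch below is a toy illustration that ignores the data gradient and applies only the penalty to a single weight. The L1 step uses soft-thresholding so that it clamps at zero rather than overshooting; the function names and the values `lr = 0.125`, `lam = 1.0` are assumptions chosen so the arithmetic is exact in binary floating point.

```python
def l2_penalty_step(theta, lr, lam):
    """Gradient step on (lam/2)*theta**2 alone: multiply by (1 - lr*lam)."""
    return (1.0 - lr * lam) * theta


def l1_penalty_step(theta, lr, lam):
    """Soft-thresholding (proximal) step on lam*|theta| alone.

    Subtracts lr*lam from |theta| and clamps at zero instead of overshooting.
    """
    magnitude = max(abs(theta) - lr * lam, 0.0)
    return magnitude if theta >= 0 else -magnitude


lr, lam = 0.125, 1.0
theta_l1 = theta_l2 = 1.0
l1_zero_step = None       # first step at which the L1 weight is exactly zero

for step in range(1, 101):
    theta_l1 = l1_penalty_step(theta_l1, lr, lam)
    theta_l2 = l2_penalty_step(theta_l2, lr, lam)
    if l1_zero_step is None and theta_l1 == 0.0:
        l1_zero_step = step

print("L1 weight is exactly zero from step", l1_zero_step)
print("L2 weight after 100 steps:", theta_l2)
```

The L1 weight loses a fixed `lr * lam` per step and lands on zero after eight steps, while the L2 weight shrinks geometrically and is still positive after a hundred.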
Geometric view: The L2 constraint set is a ball (smooth, no corners). The L1 constraint set is a diamond (cross-polytope) with corners on the axes. Convex optimization problems constrained to the L1 ball frequently have solutions at the corners or edges, which lie on the axes or coordinate subspaces, where many coordinates are exactly zero.
L1 as a Laplace Prior
A Laplace prior $p(\theta_i) \propto \exp(-|\theta_i|/b)$ gives MAP estimation equivalent to L1 regularization with $\lambda = 1/b$. The Laplace prior has a sharp peak at zero and heavier tails than a Gaussian: it strongly prefers zeros but allows large values when the data demands them.
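The "sharp peak, heavy tails" description can be checked by evaluating the two densities at a few points. The sketch below is illustrative only: it compares a Gaussian and a Laplace density with the same variance (an assumption made here so the comparison is fair), using hand-written density functions rather than any library.

```python
import math


def gaussian_pdf(x, sigma):
    """Density of N(0, sigma**2) at x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, b):
    """Density of Laplace(0, b) at x."""
    return math.exp(-abs(x) / b) / (2.0 * b)


sigma = 1.0
b = 1.0 / math.sqrt(2.0)   # Laplace variance is 2*b**2, so this matches variance 1

for x in (0.0, 1.0, 4.0):
    print(f"x={x}: gaussian={gaussian_pdf(x, sigma):.5f}  laplace={laplace_pdf(x, b):.5f}")
```

At equal variance the Laplace density is higher both at zero and far out in the tail, and lower in between, which is the shape that favors a few large weights and many weights at zero.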
Elastic Net
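Elastic net combines L1 and L2:
$$\tilde{L}(\theta) = L(\theta) + \lambda_1\|\theta\|_1 + \frac{\lambda_2}{2}\|\theta\|_2^2$$
Elastic net retains L1's sparsity-inducing property, while the L2 term makes a convex objective strictly convex, so the solution is unique, and stabilizes it when features are correlated. It is useful when groups of correlated features should either all be selected or all dropped.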
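For a single weight with a quadratic data term, the elastic net minimizer has a closed form that shows both effects at once. The sketch below assumes the hypothetical 1-D objective `0.5*(theta - c)**2 + lam1*|theta| + 0.5*lam2*theta**2`; the helper names and values are illustrative, and the example does not attempt to show the correlated-feature behavior.

```python
def soft_threshold(c, t):
    """Shrink c toward zero by t, returning exactly 0.0 when |c| <= t."""
    magnitude = max(abs(c) - t, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if c > 0 else -magnitude


def elastic_net_minimum(c, lam1, lam2):
    """Minimizer of 0.5*(theta - c)**2 + lam1*|theta| + 0.5*lam2*theta**2."""
    # The L1 term thresholds; the L2 term then rescales what survives.
    return soft_threshold(c, lam1) / (1.0 + lam2)


lam1, lam2 = 0.5, 1.0
targets = [0.05, -0.3, 2.0]       # unregularized minimizers (toy values)
solutions = [elastic_net_minimum(c, lam1, lam2) for c in targets]

print(solutions)
```

Targets smaller in magnitude than `lam1` are set exactly to zero (the L1 effect), and the surviving weight is additionally divided by `1 + lam2` (the L2 effect).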
Weight Decay vs. L2: The Adam Distinction
For SGD, L2 regularization and weight decay are mathematically identical (as shown above). For adaptive optimizers like Adam, they diverge.
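Adam's update scales the gradient by $1/(\sqrt{\hat{v}} + \epsilon)$, where $\hat{v}$ is the bias-corrected second-moment estimate. When the L2 penalty is added to the loss, its gradient $\lambda\theta$ gets this same scaling, so parameters with large gradients receive less effective weight decay than parameters with small gradients, instead of all weights shrinking at the same rate. This defeats the purpose.
AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient update:
$$\theta \leftarrow \theta - \alpha\,\frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} - \alpha\lambda\theta$$
where the moment estimates $\hat{m}$ and $\hat{v}$ are computed from $\nabla_\theta L$ alone, without the penalty term.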
The decay is applied directly to the weights, not to the gradient before the adaptive step. This restores the "equal decay for all weights" behavior intended by weight decay, and is now the standard for training transformers.
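The difference shows up even for a single weight. The sketch below is a simplified, hand-written scalar version of the two update rules, not a reproduction of any library's implementation. It feeds each variant a hypothetical zero-mean data gradient that alternates between `+grad_scale` and `-grad_scale`, so any net shrinkage comes mostly from the decay term. The function name `run` and all hyperparameter values are assumptions for the example.

```python
import math


def run(decoupled, grad_scale, steps=200, lr=0.01, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Track one weight under Adam with an alternating-sign data gradient.

    decoupled=False: L2 penalty folded into the gradient (Adam + L2).
    decoupled=True:  decay applied directly to the weight (AdamW-style).
    """
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = grad_scale if t % 2 == 1 else -grad_scale   # zero-mean toy gradient
        if not decoupled:
            grad = grad + lam * theta      # penalty gradient goes through Adam's scaling
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        m_hat = m / (1.0 - beta1 ** t)     # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)     # bias-corrected second moment
        if decoupled:
            theta -= lr * lam * theta      # decay bypasses the adaptive scaling
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


l2_small, l2_large = run(False, 0.1), run(False, 10.0)
w_small, w_large = run(True, 0.1), run(True, 10.0)

print("Adam + L2: small-gradient weight", l2_small, "| large-gradient weight", l2_large)
print("AdamW:     small-gradient weight", w_small, "| large-gradient weight", w_large)
```

With the penalty folded into the gradient, the weight with large gradients is left almost undecayed while the one with small gradients shrinks far more. With decoupled decay, both weights end at essentially the same value regardless of gradient scale.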
Note: the optimizer supplement covers the AdamW update rule in detail. This reading covers the regularization geometry behind why decoupling matters.
Practical Guidance
| | L1 | L2 / Weight Decay |
|---|---|---|
| Effect | Sparse solutions | Small but nonzero solutions |
| Gradient at zero | Discontinuous (±λ) | Zero (no push to zero) |
| Typical $\lambda$ | – | – |
| Common in DL | Rarely (pruning methods preferred) | Almost universally |
| Common in linear models | Feature selection (Lasso) | Ridge regression |
PyTorch and TensorFlow
PyTorch — L2 via weight_decay, L1 manually, AdamW:
import torch
import torch.nn as nn
model = nn.Linear(64, 10)
# L2 regularization: most built-in optimizers accept a weight_decay argument
# SGD update becomes: theta <- theta*(1 - lr*lambda) - lr*grad (weight shrinkage)
optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# Adam's weight_decay adds lambda*theta to the gradient (coupled L2), so it is rescaled adaptively
optimizer_adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# AdamW: decouples weight decay from the adaptive learning rate scaling
# PREFERRED over Adam + weight_decay for transformers and modern architectures
optimizer_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# L1 regularization: no built-in; add manually to the loss
def l1_penalty(model, lam=1e-4):
    # Sum of absolute values over all parameters (weights and biases), scaled by lam
    return lam * sum(p.abs().sum() for p in model.parameters())
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
loss = criterion(model(x), y) + l1_penalty(model)
loss.backward()
# Elastic net: combine both terms
def elastic_net(model, l1=1e-5, l2=1e-4):
    # l1 * ||theta||_1 + l2 * ||theta||_2^2, summed over all parameters
    return (l1 * sum(p.abs().sum() for p in model.parameters()) +
            l2 * sum(p.pow(2).sum() for p in model.parameters()))
TensorFlow / Keras:
import tensorflow as tf
# Regularizers passed directly to layer constructors
dense_l2 = tf.keras.layers.Dense(
    64, activation='relu',
    kernel_regularizer=tf.keras.regularizers.L2(1e-4))
dense_l1 = tf.keras.layers.Dense(
    64, kernel_regularizer=tf.keras.regularizers.L1(1e-5))
dense_en = tf.keras.layers.Dense(
    64, kernel_regularizer=tf.keras.regularizers.L1L2(l1=1e-5, l2=1e-4))
# AdamW — built-in from TF 2.11+
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.01)