Supplement · Regularization

Regularization in TensorFlow

Colab Notebook · Python · ~45 min
Lab Objectives
1. Apply L1 and L2 penalties via kernel_regularizer and compare them against optimizer-level weight decay in AdamW; verify parity with the PyTorch results from Lab 1.
2. Use tf.keras.layers.Dropout in a custom training loop with explicit training=True/False flags; implement MC Dropout by forcing training=True at inference time.
3. Build a custom BatchNorm layer by subclassing tf.keras.layers.Layer and train it in a tf.GradientTape loop; demonstrate the difference between model(x, training=True) and model(x, training=False).
4. Implement Mixup and CutMix as custom tf.keras.layers.Layer preprocessing layers applied in the tf.data pipeline that feeds model.fit; train on CIFAR-10 and reproduce the Lab 1 accuracy comparison.
5. Use the label_smoothing argument of CategoricalCrossentropy (it takes one-hot targets; SparseCategoricalCrossentropy does not accept label_smoothing in current Keras releases); implement temperature scaling as a post-training logit rescaling step and evaluate calibration using Expected Calibration Error.
6. Apply tf.keras.layers.SpectralNormalization to a discriminator; use clipnorm and clipvalue on the optimizer and compare their effects on gradient magnitude distributions.

Lab Overview

This notebook is the TensorFlow/Keras companion to the PyTorch lab. Every technique is re-implemented using the TF API, with emphasis on Keras-specific patterns: kernel_regularizer, training= flags, custom layer subclassing, and model.compile integration.

Key API Differences vs PyTorch

Concept | PyTorch | TensorFlow / Keras
L2 regularization | optimizer weight_decay or manual penalty | kernel_regularizer=tf.keras.regularizers.L2(lam)
AdamW | torch.optim.AdamW | tf.keras.optimizers.AdamW
Dropout training mode | model.train() / model.eval() | layer(x, training=True/False)
BatchNorm mode | model.train() / model.eval() | layer(x, training=True/False)
Label smoothing | custom or nn.CrossEntropyLoss(label_smoothing=) | CategoricalCrossentropy(label_smoothing=) with one-hot labels
Spectral norm | nn.utils.spectral_norm(layer) | tf.keras.layers.SpectralNormalization(layer)
Gradient clipping | clip_grad_norm_ before optimizer.step() | optimizer = Adam(global_clipnorm=1.0)

Sections

Section | Topic | Key experiment
1 | kernel_regularizer, AdamW | L2 via regularizer vs optimizer weight_decay
2 | Dropout, MC Dropout | training=True at inference for uncertainty
3 | BatchNorm custom layer | Reproduce eval-mode bug; GradientTape training loop
4 | Mixup & CutMix as Keras layers | CIFAR-10 accuracy comparison
5 | Label smoothing, temperature scaling | ECE calibration curves
6 | SpectralNormalization, gradient clipping | Lipschitz verification; clipnorm vs clipvalue

Section 1 — Weight Penalties in Keras

The cleanest Keras pattern uses kernel_regularizer at the layer level:

tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L2(1e-4))

Each regularized layer's penalty is collected in model.losses, and model.fit adds their sum to the training loss automatically; in a custom training loop you must add the sum of model.losses to the loss yourself. Contrast this with a manual penalty in the loss, and with tf.keras.optimizers.AdamW(weight_decay=1e-4), which applies decay directly in the parameter update and is the TF equivalent of PyTorch's AdamW.
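As a quick check of what kernel_regularizer actually contributes, the sketch below builds a single Dense layer and compares the value Keras stores in layer.losses with lam times the sum of squared kernel entries. The layer size, the input, and the value of lam are illustrative assumptions, and tf.keras.optimizers.AdamW is assumed to be available in the installed TensorFlow version.

```python
import numpy as np
import tensorflow as tf

lam = 1e-4  # illustrative penalty strength (assumption)

# Dense layer with an L2 penalty on the kernel only; the bias is not penalized.
layer = tf.keras.layers.Dense(
    4, kernel_regularizer=tf.keras.regularizers.L2(lam)
)
x = tf.ones((2, 3))
_ = layer(x)  # the first call builds the weights

# Keras keeps one scalar penalty tensor per regularized weight in `layer.losses`.
penalty = float(tf.add_n(layer.losses))
expected = lam * float(tf.reduce_sum(tf.square(layer.kernel)))
print(penalty, expected)

# In a custom loop the penalties must be added by hand, for example:
#   loss = task_loss + tf.add_n(model.losses)

# Decoupled alternative: no penalty in the loss, decay applied in the update step.
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=lam)
```

One thing to watch when checking parity with Lab 1: L2(lam) adds lam * sum(w**2) to the loss, whose gradient is 2 * lam * w, while PyTorch's SGD weight_decay=wd adds wd * w to the gradient. Under plain SGD the two should therefore match when wd = 2 * lam, not when wd = lam.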

Section 2 — Dropout and MC Dropout

Keras Dropout is controlled by the training argument, not a global mode flag. In a custom training loop:

with tf.GradientTape() as tape:
    logits = model(x, training=True)   # dropout active
preds = model(x, training=False)       # dropout inactive

For MC Dropout, force training=True at inference, collect T=100 stochastic forward passes, then compute the mean and variance of the softmax outputs across the passes. Conceptually this is identical to the PyTorch version.
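A minimal sketch of the MC Dropout procedure follows. The toy model, the input shape, and the helper name mc_dropout_predict are assumptions made for illustration; only the training=True call pattern comes from the lab text.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Toy classifier (hypothetical architecture) with one dropout layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),  # logits
])

def mc_dropout_predict(model, x, num_samples=100):
    """Mean and variance of softmax outputs over stochastic forward passes."""
    probs = tf.stack(
        [tf.nn.softmax(model(x, training=True), axis=-1)
         for _ in range(num_samples)],
        axis=0,
    )  # shape: (num_samples, batch, classes)
    mean = tf.reduce_mean(probs, axis=0)
    variance = tf.math.reduce_variance(probs, axis=0)
    return mean, variance

x = tf.random.normal((8, 20))
mean_probs, var_probs = mc_dropout_predict(model, x, num_samples=100)
print(mean_probs.shape, var_probs.shape)
```

Note that training=True is passed to the whole model, so any BatchNormalization layers would also switch to batch statistics. If the model contains them, it may be safer to enable the stochastic path only on the dropout layers.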

Section 3 — Batch Normalization

Implement MyBatchNorm(num_features) as a tf.keras.layers.Layer with self.gamma, self.beta, self.running_mean, self.running_var. In call(self, x, training=False), branch on training to use batch vs running statistics.

The TF-specific gotcha: when using model.fit, Keras automatically passes the correct training flag. In a custom tf.GradientTape loop you must pass it explicitly — forgetting to do so is the TF analogue of forgetting model.eval() in PyTorch.
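One possible shape for MyBatchNorm is sketched below, assuming 2-D (batch, features) inputs, an exponential moving average with momentum 0.9, and eps=1e-5. These defaults are assumptions, not values taken from the lab. The last lines show how the same input produces different outputs depending on the training flag.

```python
import tensorflow as tf

class MyBatchNorm(tf.keras.layers.Layer):
    """Batch normalization over the feature axis of a (batch, features) input."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.momentum = momentum
        self.eps = eps
        self.gamma = self.add_weight(
            name="gamma", shape=(num_features,), initializer="ones", trainable=True)
        self.beta = self.add_weight(
            name="beta", shape=(num_features,), initializer="zeros", trainable=True)
        self.running_mean = self.add_weight(
            name="running_mean", shape=(num_features,), initializer="zeros", trainable=False)
        self.running_var = self.add_weight(
            name="running_var", shape=(num_features,), initializer="ones", trainable=False)

    def call(self, x, training=False):
        if training:
            # Use statistics of the current batch and update the running averages.
            mean = tf.reduce_mean(x, axis=0)
            var = tf.math.reduce_variance(x, axis=0)
            self.running_mean.assign(
                self.momentum * self.running_mean + (1.0 - self.momentum) * mean)
            self.running_var.assign(
                self.momentum * self.running_var + (1.0 - self.momentum) * var)
        else:
            # Use the accumulated running statistics.
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / tf.sqrt(var + self.eps) + self.beta

bn = MyBatchNorm(3)
x = tf.constant([[1.0, 2.0, 3.0], [3.0, 6.0, 9.0]])
y_train_mode = bn(x, training=True)   # normalized with batch statistics
y_eval_mode = bn(x, training=False)   # normalized with running statistics
print(y_train_mode.numpy())
print(y_eval_mode.numpy())
```

After a single training-mode call the running statistics are still close to their initial values, so the eval-mode output differs noticeably from the training-mode output. This is the behavior the "eval-mode bug" experiment is meant to expose.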

Section 4 — Mixup and CutMix as Keras Preprocessing Layers

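Implement both as tf.keras.layers.Layer subclasses that operate on batched (image, label) pairs inside a tf.data pipeline. The labels must be one-hot (float) vectors so that they can be interpolated along with the images: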

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.batch(128).map(mixup_layer)

Compare CIFAR-10 validation accuracy after 30 epochs across the same four conditions as Lab 1 (baseline, flips+crops, +Mixup, +CutMix).
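The following is a sketch of a Mixup layer that can be passed to dataset.map, under a few assumptions: labels are already one-hot, a separate mixing coefficient is drawn per example, and the coefficient is sampled from Beta(alpha, alpha) by normalizing two Gamma draws (TensorFlow core has no direct Beta sampler). Random tensors stand in for CIFAR-10 so the block runs on its own; alpha=0.2 is an illustrative value.

```python
import tensorflow as tf

class Mixup(tf.keras.layers.Layer):
    """Blend each example with a randomly chosen partner from the same batch."""

    def __init__(self, alpha=0.2, **kwargs):
        super().__init__(**kwargs)
        self.alpha = alpha

    def call(self, images, labels):
        batch_size = tf.shape(images)[0]
        # Beta(alpha, alpha) sample built from two Gamma(alpha) samples.
        g1 = tf.random.gamma([batch_size], self.alpha)
        g2 = tf.random.gamma([batch_size], self.alpha)
        lam = g1 / (g1 + g2)
        lam_img = tf.reshape(lam, [-1, 1, 1, 1])  # broadcast over H, W, C
        lam_lbl = tf.reshape(lam, [-1, 1])        # broadcast over classes
        # Partner examples: a random permutation of the batch.
        perm = tf.random.shuffle(tf.range(batch_size))
        mixed_images = lam_img * images + (1.0 - lam_img) * tf.gather(images, perm)
        mixed_labels = lam_lbl * labels + (1.0 - lam_lbl) * tf.gather(labels, perm)
        return mixed_images, mixed_labels

# Stand-in data with CIFAR-10 shapes (random values, for illustration only).
x_train = tf.random.uniform((256, 32, 32, 3))
y_train = tf.one_hot(
    tf.random.uniform((256,), maxval=10, dtype=tf.int32), depth=10)

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(256).batch(128).map(Mixup(alpha=0.2))

mixed_images, mixed_labels = next(iter(dataset))
print(mixed_images.shape, mixed_labels.shape)
```

Because the mixed labels are soft, the model should be compiled with CategoricalCrossentropy rather than a sparse loss. A CutMix layer can follow the same structure, replacing the pixel-wise blend with a rectangular patch and setting the label weight from the patch area.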

Section 5 — Label Smoothing and Temperature Scaling

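CategoricalCrossentropy(label_smoothing=0.1) applies smoothing to one-hot targets automatically. SparseCategoricalCrossentropy does not accept a label_smoothing argument, so convert integer labels with tf.one_hot first. Verify that the output logit gap is bounded after training, consistent with the theoretical bound from the readings.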
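To make the effect of label_smoothing concrete, the sketch below compares the Keras loss against a manual computation on smoothed targets y * (1 - eps) + eps / K. It uses CategoricalCrossentropy with one-hot targets, since that is the Keras loss that exposes label_smoothing; the logits and labels are made-up values.

```python
import tensorflow as tf

num_classes, eps = 4, 0.1
labels = tf.constant([0, 2])                      # integer class ids
logits = tf.constant([[2.0, 0.5, -1.0, 0.0],
                      [0.1, 0.2, 3.0, -0.5]])     # made-up model outputs
one_hot = tf.one_hot(labels, num_classes)

# Built-in smoothing (mean over the batch by default).
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=eps)
smoothed_loss = loss_fn(one_hot, logits)

# Manual equivalent: soften the targets, then take ordinary cross-entropy.
soft_targets = one_hot * (1.0 - eps) + eps / num_classes
manual_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=soft_targets, logits=logits))

print(float(smoothed_loss), float(manual_loss))
```

The two values should agree up to floating-point error, which confirms that smoothing only changes the targets and leaves the loss function itself unchanged.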

For temperature scaling, implement a thin calibration wrapper:

calibrated_logits = raw_logits / T   # T is a scalar you tune post-training

Evaluate calibration with Expected Calibration Error (ECE) on a validation set before and after temperature scaling. Plot reliability diagrams (confidence vs accuracy per bin) to visualise miscalibration.
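The sketch below shows one common way to compute ECE (equal-width confidence bins) and to choose T by a grid search on validation negative log-likelihood. It is written in NumPy so that it can be applied to logits exported from any model, for example via model.predict. The bin count, the temperature grid, and the synthetic overconfident logits are assumptions; for brevity T is fitted and evaluated on the same synthetic set, whereas in the lab a held-out split should be used.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, num_bins=15):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's share of samples
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the scalar T on a grid that minimizes negative log-likelihood."""
    def nll(T):
        p = softmax(logits / T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return min(grid, key=nll)

# Synthetic example: labels are drawn from softmax(true_logits), while the
# "model" reports logits that are three times too large (overconfident).
rng = np.random.default_rng(0)
n, k = 5000, 10
true_logits = rng.normal(size=(n, k))
true_probs = softmax(true_logits)
labels = (true_probs.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
raw_logits = 3.0 * true_logits

T = fit_temperature(raw_logits, labels)
ece_before = expected_calibration_error(softmax(raw_logits), labels)
ece_after = expected_calibration_error(softmax(raw_logits / T), labels)
print(T, ece_before, ece_after)
```

Dividing by T does not change the argmax, so accuracy is unaffected; only the confidence values move. The per-bin accuracy and confidence computed inside the loop are also the quantities plotted in a reliability diagram.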

Section 6 — Spectral Normalization and Gradient Clipping

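tf.keras.layers.SpectralNormalization(layer) is the TF equivalent of nn.utils.spectral_norm. Wrap every dense layer in a discriminator and verify that each kernel's spectral norm stays close to 1.0 after training. The wrapper estimates the largest singular value by power iteration and rescales the kernel only on training=True calls, so check against a small tolerance instead of a strict ≤ 1.0 bound.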

For gradient clipping, compare the two Keras modes on a deep model:

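Adam(clipnorm=1.0)    # rescales each weight's gradient tensor so its own norm is at most 1.0
Adam(clipvalue=0.5)   # clips each component independently to [-0.5, 0.5]

clipnorm preserves the direction of each gradient tensor (it only scales its magnitude), but because tensors are rescaled separately, the direction of the full gradient can still change. To clip the global norm across all weights, as PyTorch's clip_grad_norm_ does, use global_clipnorm instead. clipvalue can distort direction even within a single tensor by clipping components independently. Plot gradient norm distributions to see the difference.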
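The three clipping behaviors can be compared directly on a pair of hand-made gradient tensors. The sketch uses the low-level functions tf.clip_by_norm, tf.clip_by_global_norm, and tf.clip_by_value as stand-ins for what the optimizer arguments clipnorm, global_clipnorm, and clipvalue are documented to do. The gradient values and the helper name cosine_to_original are assumptions chosen for illustration.

```python
import tensorflow as tf

# Two "layers" with very different gradient scales: norms 5.0 and 0.5.
grads = [tf.constant([3.0, 4.0]), tf.constant([0.3, 0.4])]

per_tensor = [tf.clip_by_norm(g, 1.0) for g in grads]        # like clipnorm=1.0
global_clipped, _ = tf.clip_by_global_norm(grads, 1.0)       # like global_clipnorm=1.0
per_value = [tf.clip_by_value(g, -0.5, 0.5) for g in grads]  # like clipvalue=0.5

def cosine_to_original(clipped):
    """Cosine similarity between the full clipped and full original gradient."""
    a = tf.concat(grads, axis=0)
    b = tf.concat(clipped, axis=0)
    return float(tf.reduce_sum(a * b) / (tf.norm(a) * tf.norm(b)))

cos_per_tensor = cosine_to_original(per_tensor)
cos_global = cosine_to_original(global_clipped)
cos_value = cosine_to_original(per_value)
print(cos_per_tensor, cos_global, cos_value)
```

Only global clipping keeps the cosine at 1.0. Per-tensor clipping shrinks the large tensor but leaves the small one alone, and value clipping flattens the large components, so both rotate the overall update direction.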