Regularization in TensorFlow
Apply L2 regularization via kernel_regularizer and compare against optimizer-level weight decay in AdamW; verify parity with PyTorch results from Lab 1
Use tf.keras.layers.Dropout in a custom training loop with explicit training=True/False flags; implement MC Dropout by forcing training=True at inference time
Implement batch normalization as a custom tf.keras.layers.Layer trained with tf.GradientTape; demonstrate the training argument difference in model(x, training=True/False)
Implement Mixup and CutMix as tf.keras.layers.Layer preprocessing layers that operate inside model.fit; train on CIFAR-10 and reproduce the Lab 1 accuracy comparison
Use the label_smoothing argument of CategoricalCrossentropy (with one-hot labels; SparseCategoricalCrossentropy does not accept it); implement temperature scaling as a post-training logit rescaling step and evaluate calibration using Expected Calibration Error
Apply tf.keras.layers.SpectralNormalization to a discriminator; use clipnorm and clipvalue on the optimizer and compare their effects on gradient magnitude distributions
Lab Overview
This notebook is the TensorFlow/Keras companion to the PyTorch lab. Every technique is re-implemented using the TF API, with emphasis on Keras-specific patterns: kernel_regularizer, training= flags, custom layer subclassing, and model.compile integration.
Key API Differences vs PyTorch
| Concept | PyTorch | TensorFlow / Keras |
|---|---|---|
| L2 regularization | optimizer weight_decay or manual penalty | kernel_regularizer=tf.keras.regularizers.L2(lam) |
| AdamW | torch.optim.AdamW | tf.keras.optimizers.AdamW |
| Dropout training mode | model.train() / model.eval() | layer(x, training=True/False) |
| BatchNorm mode | model.train() / model.eval() | layer(x, training=True/False) |
| Label smoothing | custom or nn.CrossEntropyLoss(label_smoothing=) | CategoricalCrossentropy(label_smoothing=) with one-hot labels |
| Spectral norm | nn.utils.spectral_norm(layer) | tf.keras.layers.SpectralNormalization(layer) |
| Gradient clipping | clip_grad_norm_ before optimizer.step() | optimizer = Adam(global_clipnorm=1.0) |
Sections
| Section | Topic | Key experiment |
|---|---|---|
| 1 | kernel_regularizer, AdamW | L2 via regularizer vs optimizer weight_decay |
| 2 | Dropout, MC Dropout | training=True at inference for uncertainty |
| 3 | BatchNorm custom layer | Reproduce eval-mode bug; GradientTape training loop |
| 4 | Mixup & CutMix as Keras layers | CIFAR-10 accuracy comparison |
| 5 | Label smoothing, temperature scaling | ECE calibration curves |
| 6 | SpectralNormalization, gradient clipping | Lipschitz verification; clipnorm vs clipvalue |
Section 1 — Weight Penalties in Keras
The cleanest Keras pattern uses kernel_regularizer at the layer level:
tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.L2(1e-4))
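The regularization loss is collected in model.losses, and model.fit adds it to the training loss automatically; in a custom training loop you must add it yourself. Contrast this with a manual penalty in the loss, and with tf.keras.optimizers.AdamW(weight_decay=1e-4), which applies decay directly in the parameter update and is the TF equivalent of PyTorch's AdamW. When checking parity with PyTorch, note that L2(lam) adds lam * sum(w**2) with no factor of 1/2, so its gradient is 2 * lam * w, whereas weight_decay=wd in torch.optim.SGD adds wd * w to the gradient; the two match when wd = 2 * lam.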
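As a minimal sketch of the pattern above, the snippet below checks that the penalty Keras collects in model.losses matches a hand-computed lam * sum(w**2), and shows where the penalty enters a custom training step. The toy model, the value of `lam`, the SGD optimizer, and the `train_step` helper are illustrative assumptions, not part of the lab code.

```python
import tensorflow as tf

lam = 1e-4  # illustrative penalty strength (assumption)

# Hypothetical toy model with an L2 penalty on both kernels.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.L2(lam)),
    tf.keras.layers.Dense(3, kernel_regularizer=tf.keras.regularizers.L2(lam)),
])

# Keras collects one scalar penalty per regularized kernel in model.losses.
reg_loss = tf.add_n(model.losses)

# The same penalty by hand: lam * sum(w**2) over the kernels (biases excluded).
manual_penalty = tf.add_n(
    [lam * tf.reduce_sum(tf.square(layer.kernel)) for layer in model.layers]
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

def train_step(x, y):
    """One custom-loop step; the penalty must be added explicitly here."""
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits) + tf.add_n(model.losses)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Swapping the optimizer for tf.keras.optimizers.AdamW(weight_decay=...) and dropping the kernel_regularizer arguments gives the decoupled variant to compare against.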
Section 2 — Dropout and MC Dropout
Keras Dropout is controlled by the training argument, not a global mode flag. In a custom training loop:
with tf.GradientTape() as tape:
    logits = model(x, training=True)  # dropout active
preds = model(x, training=False)  # dropout inactive
For MC Dropout, force training=True at inference and collect T=100 predictions, then compute the mean and variance of the softmax output — identical conceptually to the PyTorch version.
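The MC Dropout procedure can be sketched roughly as follows. The helper name `mc_dropout_predict`, the toy model, and the reduced sample count are assumptions for illustration. One caveat worth keeping in mind: training=True also switches any BatchNormalization layers in the model to batch statistics, so this simple version is safest on models without them.

```python
import tensorflow as tf

def mc_dropout_predict(model, x, num_samples=100):
    """Mean and variance of softmax outputs over stochastic forward passes."""
    probs = tf.stack(
        [tf.nn.softmax(model(x, training=True), axis=-1)  # dropout stays active
         for _ in range(num_samples)],
        axis=0,
    )  # shape: (num_samples, batch, classes)
    return tf.reduce_mean(probs, axis=0), tf.math.reduce_variance(probs, axis=0)

# Hypothetical toy classifier, only to exercise the function.
toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(3),  # logits
])
x_batch = tf.random.normal((4, 8))
mean_probs, var_probs = mc_dropout_predict(toy_model, x_batch, num_samples=20)
```

The per-class variance is one simple uncertainty signal; predictive entropy of `mean_probs` is a common alternative.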
Section 3 — Batch Normalization
Implement MyBatchNorm(num_features) as a tf.keras.layers.Layer with self.gamma, self.beta, self.running_mean, self.running_var. In call(self, x, training=False), branch on training to use batch vs running statistics.
The TF-specific gotcha: when using model.fit, Keras automatically passes the correct training flag. In a custom tf.GradientTape loop you must pass it explicitly — forgetting to do so is the TF analogue of forgetting model.eval() in PyTorch.
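One possible shape for such a layer is sketched below, assuming 2-D inputs of shape (batch, num_features). The `momentum` and `eps` defaults and the Keras-style running-average convention are assumptions; the lab's reference solution may differ.

```python
import tensorflow as tf

class MyBatchNorm(tf.keras.layers.Layer):
    """Batch norm sketch for (batch, num_features) inputs."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.momentum = momentum
        self.eps = eps
        self.gamma = self.add_weight(
            name="gamma", shape=(num_features,), initializer="ones", trainable=True)
        self.beta = self.add_weight(
            name="beta", shape=(num_features,), initializer="zeros", trainable=True)
        self.running_mean = self.add_weight(
            name="running_mean", shape=(num_features,), initializer="zeros",
            trainable=False)
        self.running_var = self.add_weight(
            name="running_var", shape=(num_features,), initializer="ones",
            trainable=False)

    def call(self, x, training=False):
        if training:
            # Normalize with batch statistics and update the running averages.
            mean, var = tf.nn.moments(x, axes=[0])
            self.running_mean.assign(
                self.momentum * self.running_mean + (1.0 - self.momentum) * mean)
            self.running_var.assign(
                self.momentum * self.running_var + (1.0 - self.momentum) * var)
        else:
            # Inference: use the accumulated running statistics.
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / tf.sqrt(var + self.eps) + self.beta

# Same input, different flag: the outputs differ until the running stats converge.
bn = MyBatchNorm(3)
x_demo = tf.random.normal((64, 3), mean=5.0, stddev=2.0)
out_train = bn(x_demo, training=True)
out_eval = bn(x_demo, training=False)
```

Calling the layer once with each flag, as in the last lines, is a quick way to reproduce the mismatch that appears when the flag is forgotten in a custom loop.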
Section 4 — Mixup and CutMix as Keras Preprocessing Layers
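Implement both as tf.keras.layers.Layer subclasses that operate on batched (image, label) pairs inside a tf.data pipeline. Both techniques blend labels, so the labels must be one-hot floats rather than integer class ids: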
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.batch(128).map(mixup_layer)
Compare CIFAR-10 validation accuracy after 30 epochs across the same four conditions as Lab 1 (baseline, flips+crops, +Mixup, +CutMix).
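A rough sketch of the Mixup half is shown below, written as a plain function for brevity; the same body could sit inside a layer's call method. The function name `mixup`, the default `alpha=0.2`, and the synthetic CIFAR-shaped tensors are assumptions for illustration. It expects one-hot float labels and draws the mixing weight from Beta(alpha, alpha) as a ratio of two Gamma samples.

```python
import tensorflow as tf

def mixup(images, labels, alpha=0.2):
    """Mix each example with a random partner from the same batch.

    Assumes `labels` are one-hot floats and `images` has shape (B, H, W, C).
    """
    batch_size = tf.shape(images)[0]
    # Beta(alpha, alpha) sampled as a ratio of two Gamma(alpha, 1) draws.
    g1 = tf.random.gamma([batch_size], alpha)
    g2 = tf.random.gamma([batch_size], alpha)
    lam = g1 / (g1 + g2)
    lam_x = tf.reshape(lam, [-1, 1, 1, 1])  # broadcast over H, W, C
    lam_y = tf.reshape(lam, [-1, 1])        # broadcast over classes
    perm = tf.random.shuffle(tf.range(batch_size))
    mixed_images = lam_x * images + (1.0 - lam_x) * tf.gather(images, perm)
    mixed_labels = lam_y * labels + (1.0 - lam_y) * tf.gather(labels, perm)
    return mixed_images, mixed_labels

# Synthetic CIFAR-shaped data, only to show where the function plugs in.
x_fake = tf.random.uniform((256, 32, 32, 3))
y_fake = tf.one_hot(tf.random.uniform((256,), maxval=10, dtype=tf.int32), depth=10)
dataset = (
    tf.data.Dataset.from_tensor_slices((x_fake, y_fake))
    .shuffle(256)
    .batch(128)
    .map(mixup, num_parallel_calls=tf.data.AUTOTUNE)
)
```

Because the mixed labels are soft, the model should be trained with CategoricalCrossentropy rather than the sparse variant. CutMix follows the same structure but pastes a rectangular patch instead of blending whole images.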
Section 5 — Label Smoothing and Temperature Scaling
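CategoricalCrossentropy(label_smoothing=0.1) applies smoothing automatically. It expects one-hot labels, and SparseCategoricalCrossentropy has no label_smoothing argument, so convert integer labels with tf.one_hot first. Verify that the output logit gap is bounded after training, consistent with the theoretical bound from the readings.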
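To make the smoothing concrete, the sketch below compares the built-in label_smoothing option of CategoricalCrossentropy with targets smoothed by hand as y * (1 - eps) + eps / K. The logits, labels, and `eps` value are made-up illustrative numbers.

```python
import tensorflow as tf

eps, num_classes = 0.1, 4  # illustrative values (assumption)
logits = tf.constant([[2.0, 0.5, -1.0, 0.0],
                      [0.1, 0.2, 3.0, -0.5]])
int_labels = tf.constant([0, 2])
one_hot = tf.one_hot(int_labels, depth=num_classes)

# Built-in smoothing on one-hot targets.
builtin_loss = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=eps)(one_hot, logits)

# Manual smoothing: move eps of the probability mass to a uniform distribution.
smoothed = one_hot * (1.0 - eps) + eps / num_classes
manual_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=smoothed, logits=logits))
```

The two losses should agree up to floating-point error, which is a handy sanity check before comparing against the PyTorch run.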
For temperature scaling, implement a thin calibration wrapper:
calibrated_logits = raw_logits / T # T is a scalar you tune post-training
Evaluate calibration with Expected Calibration Error (ECE) on a validation set before and after temperature scaling. Plot reliability diagrams (confidence vs accuracy per bin) to visualise miscalibration.
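The evaluation step might look like the NumPy sketch below, which works on logits exported from the model (for example via model.predict). The helper names, the equal-width 15-bin scheme, the grid search over T, and the synthetic overconfident logits are all assumptions chosen for illustration; a gradient-based fit of T is equally valid.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=15):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the scalar T on a grid that minimizes validation NLL."""
    def nll(t):
        p = softmax(logits / t)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return float(min(grid, key=nll))

# Synthetic validation set: labels drawn from calibrated logits, then the
# logits are inflated to mimic an overconfident network.
rng = np.random.default_rng(0)
true_logits = 2.0 * rng.normal(size=(5000, 10))
val_labels = np.array([rng.choice(10, p=p) for p in softmax(true_logits)])
raw_logits = 3.0 * true_logits

T = fit_temperature(raw_logits, val_labels)
ece_before = expected_calibration_error(softmax(raw_logits), val_labels)
ece_after = expected_calibration_error(softmax(raw_logits / T), val_labels)
```

Dividing by T does not change the argmax, so accuracy is unaffected; only the confidence values move. The per-bin accuracy and confidence computed inside the loop are the same quantities a reliability diagram plots.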
Section 6 — Spectral Normalization and Gradient Clipping
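tf.keras.layers.SpectralNormalization(layer) is the TF equivalent of nn.utils.spectral_norm. Wrap every dense layer in a discriminator and verify that the largest singular value of each wrapped kernel stays close to 1.0 after training. The layer estimates the spectral norm by power iteration, so expect approximate rather than exact equality.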
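For the verification step, a small NumPy helper along these lines can be used. Both function names are hypothetical. The first computes the exact largest singular value by SVD; the second is a power-iteration estimate of the kind spectral-normalization layers rely on, included to show why the normalized kernel is only approximately 1-Lipschitz.

```python
import numpy as np

def spectral_norm(weight):
    """Exact largest singular value of a kernel, flattened to 2-D."""
    w = np.asarray(weight, dtype=np.float64)
    w2d = w.reshape(-1, w.shape[-1])
    return float(np.linalg.svd(w2d, compute_uv=False)[0])

def power_iteration_estimate(weight, n_iters=1, seed=0):
    """Power-iteration estimate of the spectral norm (never above the exact value)."""
    w = np.asarray(weight, dtype=np.float64)
    w2d = w.reshape(-1, w.shape[-1])
    rng = np.random.default_rng(seed)
    u = rng.normal(size=w2d.shape[1])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = w2d @ u
        v /= np.linalg.norm(v)
        u = w2d.T @ v
        u /= np.linalg.norm(u)
    return float(v @ w2d @ u)

# Made-up kernel standing in for a trained dense layer's weights.
demo_kernel = np.random.default_rng(1).normal(size=(64, 32))
exact = spectral_norm(demo_kernel)
rough = power_iteration_estimate(demo_kernel, n_iters=1)
```

In the notebook, the kernel would come from the wrapped layer, for example `spectral_norm(sn_layer.layer.kernel.numpy())`, assuming `sn_layer` is a SpectralNormalization wrapper that has been called at least once.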
For gradient clipping, compare the two Keras modes on a deep model:
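Adam(clipnorm=1.0) # clips each variable's gradient so its norm is at most 1.0
Adam(clipvalue=0.5) # clips each component to [-0.5, 0.5]
clipnorm rescales each variable's gradient separately, so it preserves direction within a tensor but can change the relative scale between layers; clipvalue can distort direction further by clipping components independently. To clip the global norm across all variables, as PyTorch's clip_grad_norm_ does, use Adam(global_clipnorm=1.0). Plot gradient norm distributions to see the difference.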
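The effect on direction can be checked on a tiny hand-made gradient list using the underlying TF clipping ops. The gradient values and helper names below are illustrative assumptions. Per-variable norm clipping corresponds to clipnorm, global-norm clipping to global_clipnorm, and element-wise clipping to clipvalue.

```python
import tensorflow as tf

# Two hypothetical per-variable gradients with very different scales.
grads = [tf.constant([3.0, 4.0]), tf.constant([0.3, 0.4])]

def flatten(tensors):
    return tf.concat([tf.reshape(t, [-1]) for t in tensors], axis=0)

def cosine(a, b):
    """Cosine similarity between two gradient lists, treated as one long vector."""
    a, b = flatten(a), flatten(b)
    return float(tf.reduce_sum(a * b) / (tf.norm(a) * tf.norm(b)))

# clipnorm=1.0: each variable's gradient is rescaled on its own.
per_variable = [tf.clip_by_norm(g, 1.0) for g in grads]
# global_clipnorm=1.0: one shared scale factor for all variables.
global_clipped, global_norm = tf.clip_by_global_norm(grads, 1.0)
# clipvalue=0.5: each component is clamped to [-0.5, 0.5].
by_value = [tf.clip_by_value(g, -0.5, 0.5) for g in grads]

cos_per_variable = cosine(grads, per_variable)
cos_global = cosine(grads, global_clipped)
cos_value = cosine(grads, by_value)
```

Only the global variant keeps the overall update direction unchanged; the other two trade that for tighter per-tensor or per-component bounds.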