Supplement · Regularization

Label Smoothing and Output Regularizers

By the end of this reading you will be able to:
  • Derive the label-smoothed cross-entropy target distribution and explain why it bounds the maximum logit difference, preventing overconfident outputs
  • Explain how training on one-hot targets shapes penultimate-layer representations, why label smoothing improves calibration, and what it implies for transfer and distillation
  • Distinguish label smoothing from temperature scaling — identifying which is applied at training time vs. inference time and what each controls
  • Explain the confidence penalty as an entropy regularizer on the output distribution and relate it to maximum-entropy reinforcement learning policies

The Overconfidence Problem

A neural network trained with one-hot targets and cross-entropy loss has a strong incentive to push the correct-class logit toward $+\infty$ while pushing all others toward $-\infty$. The softmax output approaches a unit spike: the predicted probability for the correct class approaches 1.

This creates several problems:

  1. Poor calibration: the model's confidence does not match actual accuracy
  2. Brittle representations: the penultimate-layer features are driven to produce outputs at the corners of the probability simplex, leaving little room for uncertainty
  3. Gradient vanishing near convergence: once softmax saturates, gradients approach zero and training stalls
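
To make the incentive concrete, the sketch below uses plain Python and a hypothetical 3-class example (the logits and gap values are illustrative assumptions, not taken from any model). It grows the gap between the correct logit and the others and prints the one-hot cross-entropy together with its gradient with respect to the correct logit, which is $\hat{p}_{k^*} - 1$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_ce(logits, target):
    """Cross-entropy against a one-hot target: -log p[target]."""
    return -math.log(softmax(logits)[target])

# Hypothetical 3-class example: class 0 is correct, and the logit gap is scaled up.
losses, grads = [], []
for gap in [1.0, 2.0, 4.0, 8.0, 16.0]:
    logits = [gap, 0.0, 0.0]
    p = softmax(logits)
    losses.append(one_hot_ce(logits, target=0))
    grads.append(p[0] - 1.0)  # d(loss)/d(z_correct) = p_correct - 1
    print(f"gap={gap:5.1f}  p_correct={p[0]:.6f}  "
          f"loss={losses[-1]:.6f}  grad={grads[-1]:+.6f}")
```

The loss keeps falling as the gap grows, so nothing stops the logits from diverging, while the gradient shrinks toward zero.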

Label Smoothing

Label smoothing (Szegedy et al., 2016; first popularized in Inception-v3) replaces the one-hot target $\mathbf{y}$ with a soft distribution:

$$\tilde{y}_k = (1-\varepsilon)\, y_k + \frac{\varepsilon}{K}$$

For the correct class $k^*$: $\tilde{y}_{k^*} = 1 - \varepsilon + \varepsilon/K \approx 1-\varepsilon$ (for large $K$).
For all other classes $k \neq k^*$: $\tilde{y}_k = \varepsilon/K$.

Typically $\varepsilon = 0.1$, and $K$ is the number of classes.
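
As a minimal illustration of the target definition, the helper below builds the smoothed distribution for one example. The function name `smooth_labels` and the choice of 5 classes are assumptions made for this sketch.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Return the smoothed target (1 - eps) * one_hot + eps / K as a list."""
    return [
        (1.0 - eps) * (1.0 if k == target else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]

# Hypothetical example: 5 classes, class 2 is correct.
y_smooth = smooth_labels(target=2, num_classes=5, eps=0.1)
print(y_smooth)  # correct class gets about 0.92, every other class about 0.02
```

The result still sums to 1, so it remains a valid target distribution for cross-entropy.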

The cross-entropy loss with smoothed labels:

$$\mathcal{L}_{\text{LS}} = (1-\varepsilon)\,\mathcal{L}_{\text{CE}}(\hat{\mathbf{p}}, \mathbf{y}) + \varepsilon\,\mathcal{L}_{\text{CE}}(\hat{\mathbf{p}}, \mathbf{u}_K), \qquad \mathcal{L}_{\text{CE}}(\hat{\mathbf{p}}, \mathbf{u}_K) = -\frac{1}{K}\sum_{k=1}^K \log \hat{p}_k$$

where $\mathbf{u}_K$ is the uniform distribution over all $K$ classes. The second term penalizes predictions that assign very low probability to any class, which regularizes the output distribution toward uniformity.
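
The decomposition can be checked numerically: cross-entropy against the smoothed target should equal $(1-\varepsilon)$ times the one-hot loss plus $\varepsilon$ times the loss against the uniform target. The sketch below uses arbitrary illustrative logits and hypothetical helper names.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target_dist):
    """-sum_k target_k * log p_k for an arbitrary target distribution."""
    logp = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target_dist, logp))

K, eps, correct = 4, 0.1, 1
logits = [0.5, 2.0, -1.0, 0.3]  # arbitrary illustrative logits
one_hot = [1.0 if k == correct else 0.0 for k in range(K)]
uniform = [1.0 / K] * K
smoothed = [(1 - eps) * y + eps / K for y in one_hot]

loss_direct = cross_entropy(logits, smoothed)
loss_split = ((1 - eps) * cross_entropy(logits, one_hot)
              + eps * cross_entropy(logits, uniform))
print(loss_direct, loss_split)  # the two values agree up to rounding
```

The identity holds because cross-entropy is linear in the target distribution.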

Effect on Logits

The optimal logits under label-smoothed cross-entropy satisfy:

$$z_{k^*} - z_k = \log\frac{\tilde{y}_{k^*}}{\tilde{y}_k} = \log\frac{K(1-\varepsilon) + \varepsilon}{\varepsilon} \quad \text{(for all } k \neq k^*\text{)}$$

This is finite. The loss is minimized when $\hat{\mathbf{p}} = \tilde{\mathbf{y}}$, so label smoothing sets a finite optimal logit gap: the model cannot drive $z_{k^*} - z_k \to \infty$ without increasing the loss. This prevents the saturation that halts training and the overconfidence that hurts calibration.
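
A quick numerical check of the finite gap, assuming the target definition given earlier (mass $\varepsilon/K$ added to every class, including the correct one): the minimizer is $\hat{\mathbf{p}} = \tilde{\mathbf{y}}$, so the gap is $\log(\tilde{y}_{k^*}/\tilde{y}_k)$. The sketch also computes the gap for a common variant that spreads $\varepsilon$ over the $K-1$ incorrect classes only, which gives $\log\frac{(1-\varepsilon)(K-1)}{\varepsilon}$. The values of `K` and `eps` are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

K, eps = 10, 0.1
y_correct = 1 - eps + eps / K        # smoothed target for the true class
y_other = eps / K                    # smoothed target for every other class
gap = math.log(y_correct / y_other)  # optimal z_correct - z_other

# Logits with exactly this gap reproduce the smoothed target under softmax.
logits = [gap] + [0.0] * (K - 1)
probs = softmax(logits)

# Variant (assumption): eps is spread over the K-1 incorrect classes only.
gap_variant = math.log((1 - eps) * (K - 1) / eps)

print(f"gap = {gap:.4f}, p_correct = {probs[0]:.4f}, "
      f"variant gap = {gap_variant:.4f}")
```

With 10 classes and $\varepsilon = 0.1$ the optimal gap is $\log 91$, a little over 4.5, so the correct-class probability settles at 0.91 rather than approaching 1.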

Effect on Representations

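Müller et al. (2019) showed that label smoothing pulls the penultimate-layer representations of each class into tight, well-separated clusters, with examples roughly equidistant from the templates of the other classes. This goes along with better calibration, but it also erases information about inter-class similarity from the logits: the same study found that teachers trained with label smoothing are less effective for knowledge distillation, so the effect on transfer is not uniformly positive.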


Temperature Scaling

Temperature scaling is a post-hoc calibration technique applied at inference, not during training. Divide logits by a temperature $T > 0$ before softmax:

$$\hat{p}_k = \text{Softmax}(\mathbf{z}/T)_k$$

  • $T > 1$: softer distribution; spreads probability mass across classes and reduces confidence
  • $T < 1$: sharper distribution; concentrates probability on the top class and increases confidence
  • $T = 1$: original distribution

$T$ is fit on a held-out validation set by minimizing NLL, with the network weights frozen. Because dividing by a positive $T$ preserves the ordering of the logits, it does not change predictions (the argmax is the same), only the calibration of the probabilities.
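
The fitting step can be sketched in plain Python. The example below is only an illustration under stated assumptions: the validation logits and labels are made up, and a grid search stands in for the gradient-based optimizer that would normally be used. The names `nll` and `fit_temperature` are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def nll(val_logits, val_labels, T):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    total = 0.0
    for z, y in zip(val_logits, val_labels):
        total -= log_softmax([v / T for v in z])[y]
    return total / len(val_labels)

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick T > 0 from a grid by minimizing validation NLL."""
    if grid is None:
        grid = [0.05 * i for i in range(2, 201)]  # 0.10, 0.15, ..., 10.0
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

def argmax(values):
    return max(range(len(values)), key=lambda k: values[k])

# Hypothetical overconfident validation set: large margins, but 2 of 6 are wrong.
val_logits = [[8.0, 0.0, 0.0], [0.0, 7.0, 1.0], [6.0, 0.0, 1.0],
              [0.0, 0.0, 9.0], [7.0, 1.0, 0.0], [0.0, 8.0, 0.0]]
val_labels = [0, 1, 1, 2, 2, 1]

T_star = fit_temperature(val_logits, val_labels)
same_preds = all(argmax(z) == argmax([v / T_star for v in z]) for z in val_logits)
print(f"T* = {T_star:.2f}")
print(f"NLL at T=1:  {nll(val_logits, val_labels, 1.0):.3f}")
print(f"NLL at T*:   {nll(val_logits, val_labels, T_star):.3f}")
print(f"predictions unchanged: {same_preds}")
```

Because the toy model is confidently wrong on some examples, the fitted temperature comes out above 1, and the predicted classes are identical before and after scaling.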

Label smoothing (training-time): changes the target distribution; affects learned features. Temperature scaling (inference-time): re-scales already-trained logits; does not affect features.


The Confidence Penalty

Pereyra et al. (2017) add an explicit entropy bonus to the loss, penalizing low-entropy output distributions:

$$\mathcal{L}_{\text{conf}} = \mathcal{L}_{\text{CE}} - \beta H(\hat{\mathbf{p}}) = \mathcal{L}_{\text{CE}} + \beta\sum_k \hat{p}_k \log \hat{p}_k$$

  • When $\hat{\mathbf{p}}$ is peaked (confident), $H(\hat{\mathbf{p}})$ is low and the penalty is large
  • When $\hat{\mathbf{p}}$ is flat (uncertain), $H(\hat{\mathbf{p}})$ is high and the penalty is small

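This is closely related to label smoothing with a uniform smoothing distribution, though not identical to it. Up to additive constants and a rescaling of the cross-entropy term, label smoothing adds $\varepsilon\,\mathrm{KL}(\mathbf{u}_K \,\|\, \hat{\mathbf{p}})$ to the loss, while the confidence penalty adds $\beta\,\mathrm{KL}(\hat{\mathbf{p}} \,\|\, \mathbf{u}_K)$. Both pull the output toward uniform; they differ in the direction of the KL divergence.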
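
How the two regularizers relate can be checked with two identities: $-H(\hat{\mathbf{p}}) = \mathrm{KL}(\hat{\mathbf{p}} \,\|\, \mathbf{u}_K) - \log K$ for the confidence penalty, and $-\frac{1}{K}\sum_k \log \hat{p}_k = \mathrm{KL}(\mathbf{u}_K \,\|\, \hat{\mathbf{p}}) + \log K$ for the uniform term in label smoothing. The sketch below verifies both on an arbitrary illustrative prediction `p_hat`.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

K = 4
p_hat = [0.7, 0.2, 0.05, 0.05]  # arbitrary illustrative prediction
u = [1.0 / K] * K

neg_entropy = -entropy(p_hat)   # confidence-penalty term (per unit beta)
kl_p_u = kl(p_hat, u)           # KL(p_hat || u)
ce_u_p = -sum(ui * math.log(pi) for ui, pi in zip(u, p_hat))  # uniform-target CE
kl_u_p = kl(u, p_hat)           # KL(u || p_hat)

print(f"-H(p)    = {neg_entropy:.4f}   KL(p||u) - log K = {kl_p_u - math.log(K):.4f}")
print(f"CE(u, p) = {ce_u_p:.4f}   KL(u||p) + log K = {kl_u_p + math.log(K):.4f}")
print(f"KL(p||u) = {kl_p_u:.4f}   KL(u||p) = {kl_u_p:.4f}")
```

The two KL values differ for any non-uniform prediction, which is why the regularizers behave similarly but are not interchangeable.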

Maximum-Entropy RL Policies

The confidence penalty appears prominently in reinforcement learning. Soft Actor-Critic (SAC) adds an entropy bonus to the RL objective:

$$J(\pi) = \sum_t \mathbb{E}\bigl[r_t + \alpha H(\pi(\cdot \mid s_t))\bigr]$$

Maximizing entropy encourages the policy to be as random as possible while still collecting rewards. This provides natural exploration, prevents premature convergence to a suboptimal deterministic policy, and improves robustness. The temperature $\alpha$ balances reward maximization against entropy maximization.
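
The role of $\alpha$ can be illustrated in a single-state setting, where the maximizer of $\mathbb{E}_\pi[Q] + \alpha H(\pi)$ is the softmax policy $\pi(a) \propto \exp(Q(a)/\alpha)$. The sketch below is not SAC itself: the action values `q` are hypothetical, and there is no learning, only the closed-form policy for several temperatures.

```python
import math

def soft_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha), computed stably."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def objective(pi, q_values, alpha):
    """Single-state max-entropy objective: E_pi[Q] + alpha * H(pi)."""
    return sum(p * q for p, q in zip(pi, q_values)) + alpha * entropy(pi)

q = [1.0, 0.8, 0.1]  # hypothetical action values for one state
entropies = []
for alpha in [0.05, 0.2, 1.0, 5.0]:
    pi = soft_policy(q, alpha)
    entropies.append(entropy(pi))
    print(f"alpha={alpha:4.2f}  pi={[round(p, 3) for p in pi]}  "
          f"H={entropies[-1]:.3f}  J={objective(pi, q, alpha):.3f}")
```

Small $\alpha$ gives a nearly greedy policy; large $\alpha$ gives a nearly uniform one, trading expected value for entropy.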


PyTorch and TensorFlow

PyTorch — label smoothing, temperature scaling, confidence penalty:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Label smoothing — built into CrossEntropyLoss since PyTorch 1.10
# Internally computes: (1-eps)*one_hot + eps/K for K classes
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(16, 10)          # raw logits from model
targets = torch.randint(0, 10, (16,))
loss = criterion(logits, targets)     # smoothed cross-entropy

# Temperature scaling at inference (post-training calibration)
def temperature_scale(logits, T=1.5):
    """Divide logits by T before softmax to soften the probability distribution."""
    return F.softmax(logits / T, dim=-1)

probs_sharp = F.softmax(logits, dim=-1)          # T = 1: unscaled probabilities
probs_soft  = temperature_scale(logits, T=2.0)   # T > 1: softer; in practice T is fit on validation data

# Confidence penalty: subtract beta * entropy from the loss (discourages overconfident outputs)
def confidence_penalty(logits, beta=0.1):
    log_probs = F.log_softmax(logits, dim=-1)              # stable log p; avoids log(0) -> NaN
    probs     = log_probs.exp()
    entropy   = -(probs * log_probs).sum(dim=-1).mean()    # mean H(p) over the batch
    return -beta * entropy     # low entropy (overconfidence) gives a larger loss

loss_total = criterion(logits, targets) + confidence_penalty(logits)

TensorFlow / Keras:

import tensorflow as tf

# Label smoothing is built into CategoricalCrossentropy, which expects one-hot targets.
# SparseCategoricalCrossentropy may not accept label_smoothing (depending on the Keras
# version), so integer labels are one-hot encoded here.
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True,
    label_smoothing=0.1,    # y_smoothed = (1-0.1)*y_onehot + 0.1/K
)

logits  = tf.random.normal((16, 10))
targets = tf.random.uniform((16,), 0, 10, dtype=tf.int32)
loss    = loss_fn(tf.one_hot(targets, depth=10), logits)

# Temperature scaling
def temperature_scale(logits, T=1.5):
    return tf.nn.softmax(logits / T, axis=-1)