Exponentials, Logarithms, and Their Derivatives
- State the derivatives of e^x and ln(x) and apply them with the chain rule to differentiate compositions involving exponentials and logarithms
- Derive the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) using the chain rule and the exponential derivative
- Explain why log appears in cross-entropy loss and why numerical stability requires computing log-probabilities rather than raw probabilities
The Natural Exponential Function
The exponential function $e^x$ (where $e \approx 2.71828$) has a remarkable property: it is its own derivative.
$$\frac{d}{dx} e^x = e^x$$
No other base has this property. The number $e$ is defined precisely so that this holds: it is the unique base $a$ for which the derivative of $a^x$ at $x = 0$ equals 1.
Why does this matter? Because a function that equals its own rate of change is the natural model for anything that grows (or decays) proportionally to its current size: compound interest, population growth, radioactive decay, and — centrally for ML — the exponential in the softmax function.
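The claim that e is the only base whose exponential has slope 1 at zero can be checked numerically. The sketch below uses a central finite difference; the helper name `slope_at_zero` and the step size `h` are illustrative choices, not part of any library.

```python
import math

def slope_at_zero(base, h=1e-6):
    """Central-difference estimate of d/dx base**x at x = 0."""
    return (base ** h - base ** (-h)) / (2 * h)

# The slope at 0 equals ln(base), so it is 1 only when base = e.
for base in (2.0, math.e, 3.0):
    print(f"base={base:.5f}  slope at 0 ~ {slope_at_zero(base):.6f}  ln(base)={math.log(base):.6f}")
```

The estimates should track ln(2), 1, and ln(3) respectively, which is why bases other than e pick up an extra constant factor when differentiated.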
The Natural Logarithm
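The natural logarithm $\ln(x)$ is the inverse of $e^x$: $\ln(e^x) = x$ and $e^{\ln x} = x$ (for $x > 0$). Its derivative:
$$\frac{d}{dx} \ln(x) = \frac{1}{x}, \qquad x > 0$$
Notice that as $x$ grows, $1/x$ shrinks toward zero: the logarithm's slope becomes shallower and shallower, reflecting how slowly $\ln(x)$ grows for large $x$. For small positive $x$ near zero, $1/x$ is large: $\ln(x)$ falls steeply toward $-\infty$.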
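To make the slope behaviour of the logarithm concrete, the following sketch compares a finite-difference estimate of the derivative of ln with 1/x at a small, a moderate, and a large input. The helper `numeric_derivative` is a hypothetical name introduced for this illustration.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Steep slope near zero, nearly flat slope for large x.
for x in (0.01, 1.0, 100.0):
    approx = numeric_derivative(math.log, x)
    print(f"x={x:<6}  numeric slope={approx:.6f}  1/x={1 / x:.6f}")
```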
Chain Rule Combinations
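With the chain rule from r3, we can differentiate any composition involving $e^x$ and $\ln(x)$:
$$\frac{d}{dx} e^{f(x)} = e^{f(x)} f'(x), \qquad \frac{d}{dx} \ln f(x) = \frac{f'(x)}{f(x)} \quad (f(x) > 0)$$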
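Examples:
| Function | Derivative |
|---|---|
| $e^{-x}$ | $-e^{-x}$ |
| $e^{x^2}$ | $2x\,e^{x^2}$ |
| $\ln(x^2 + 1)$ | $\dfrac{2x}{x^2 + 1}$ |
| $\ln(1 + e^x)$ | $\dfrac{e^x}{1 + e^x}$ |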
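Hand-derived chain-rule results are easy to get wrong by a sign or a factor, so a numeric spot check is a useful habit. The sketch below assumes the two patterns d/dx e^f = e^f · f' and d/dx ln f = f'/f, and tests them on four compositions chosen purely for illustration. The helper `numeric_derivative` and the test point `x0` are assumptions of this example.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (label, function, hand-derived derivative); illustrative examples only.
cases = [
    ("exp(-x)",        lambda x: math.exp(-x),              lambda x: -math.exp(-x)),
    ("exp(x**2)",      lambda x: math.exp(x ** 2),          lambda x: 2 * x * math.exp(x ** 2)),
    ("ln(x**2 + 1)",   lambda x: math.log(x ** 2 + 1),      lambda x: 2 * x / (x ** 2 + 1)),
    ("ln(1 + exp(x))", lambda x: math.log(1 + math.exp(x)), lambda x: math.exp(x) / (1 + math.exp(x))),
]

x0 = 0.7  # arbitrary test point
for label, f, df in cases:
    print(f"{label:<15} numeric={numeric_derivative(f, x0):.6f}  analytic={df(x0):.6f}")
```

Note that the derivative of ln(1 + e^x) is the sigmoid function, a connection that reappears in the stability discussion later in this lesson.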
Deriving the Sigmoid Derivative
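The sigmoid function $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ is one of the most important functions in ML. Let's derive its derivative step by step.
Write $\sigma(x) = (1 + e^{-x})^{-1}$ and apply the chain rule:
$$\sigma'(x) = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2}$$
Now rewrite using a clever algebraic trick. Factor the fraction and recognize $\sigma(x)$:
$$\sigma'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x) \cdot \frac{e^{-x}}{1 + e^{-x}}$$
Note that $\dfrac{e^{-x}}{1 + e^{-x}} = \dfrac{(1 + e^{-x}) - 1}{1 + e^{-x}} = 1 - \sigma(x)$. Therefore:
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
This is the result previewed in r3. The sigmoid derivative is expressible entirely in terms of the sigmoid output itself, so no separate computation is needed. At $x = 0$ (the steepest point), $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$. At extreme inputs, where $\sigma(x)$ is near 0 or 1, $\sigma'(x)$ collapses toward zero: this is the vanishing gradient problem.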
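The identity σ'(x) = σ(x)(1 − σ(x)) and the vanishing-gradient behaviour can both be seen in a few lines of plain Python. This is a minimal sketch: `sigmoid`, `sigmoid_grad`, and `numeric_derivative` are illustrative helper names, and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative written purely in terms of the sigmoid output."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# The gradient peaks at x = 0 and shrinks rapidly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:>6}  sigma={sigmoid(x):.6f}  "
          f"analytic={sigmoid_grad(x):.6f}  numeric={numeric_derivative(sigmoid, x):.6f}")
```

In a network with many stacked sigmoid layers, these small factors multiply together during backpropagation, which is the mechanism behind vanishing gradients.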
Cross-Entropy Loss and the Log Derivative
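For a binary classifier with prediction $p \in (0, 1)$ and true label $y \in \{0, 1\}$, the cross-entropy loss is:
$$L = -\big[\, y \ln p + (1 - y) \ln(1 - p) \,\big]$$
When $y = 1$, this reduces to $L = -\ln p$. Why log?
- Probabilistic motivation: Maximizing the likelihood of the observed labels is equivalent to minimizing the negative log-likelihood $-\ln p$, since $\ln$ is monotonically increasing.
- Gradient behavior: The gradient of $-\ln p$ with respect to $p$ is $-1/p$. When the model is confidently wrong ($p \to 0$), this gradient is very large in magnitude, so the loss sends a strong correction signal. When the model is correct ($p \to 1$), the magnitude drops to about 1, comparatively small, so only a small nudge is applied.
The chain rule then gives the gradient with respect to the network's output $z$ (before sigmoid), where $p = \sigma(z)$:
$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial p} \cdot \frac{dp}{dz} = \left( -\frac{y}{p} + \frac{1 - y}{1 - p} \right) p\,(1 - p) = p - y$$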
This elegantly simple form — prediction minus label — is why logistic regression with cross-entropy loss trains so cleanly.
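The prediction-minus-label form can be confirmed numerically. The sketch below assumes the notation p = σ(z) for the prediction, y for the label, and z for the pre-sigmoid output; the helpers `bce_from_logit` and `numeric_derivative` are hypothetical names for this example, and the (z, y) pairs are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy for label y, with prediction p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare dL/dz from finite differences with the closed form p - y.
for z, y in [(-2.0, 1), (0.5, 1), (3.0, 0)]:
    numeric = numeric_derivative(lambda t: bce_from_logit(t, y), z)
    analytic = sigmoid(z) - y
    print(f"z={z:>5}  y={y}  numeric={numeric:.6f}  p - y={analytic:.6f}")
```

The large factor 1/p from the log and the small factor p(1 − p) from the sigmoid cancel, so the gradient stays well scaled even when the sigmoid itself is saturated.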
Numerical Stability: Log-Probabilities
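Computing $\log p$ directly from the logits, instead of exponentiating to get $p$ and then taking the log, prevents underflow: a probability that rounds to 0 would give $\log 0 = -\infty$. Likewise, the softmax denominator $\sum_j e^{z_j}$ can overflow for large logits. The standard fix is to stay in log space. For the log-sigmoid of a logit $z$:
$$\log \sigma(z) = -\log\left(1 + e^{-z}\right) = \min(z, 0) - \log\left(1 + e^{-|z|}\right)$$
For large positive $z$: $\log \sigma(z) \approx -e^{-z} \approx 0$. For large negative $z$: $\log \sigma(z) \approx z$. Both cases are numerically stable in the right-hand form, because the exponent $-|z|$ is never positive. This is why torch.nn.BCEWithLogitsLoss and log_softmax exist: they fuse the log and exp operations to avoid intermediate overflow.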
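The difference between naive and log-space computation shows up as soon as the logits get large. The following is a sketch in plain Python, not the implementation used by any framework: `log_sigmoid_naive`, `log_sigmoid_stable`, and `log_softmax` are illustrative names, and the max-subtraction trick in `log_softmax` is the usual log-sum-exp rearrangement, assumed here as the softmax counterpart of the same idea.

```python
import math

def log_sigmoid_naive(z):
    # Computes the probability first, then its log; exp(-z) can overflow.
    return math.log(1.0 / (1.0 + math.exp(-z)))

def log_sigmoid_stable(z):
    # min(z, 0) - log(1 + exp(-|z|)); the exponent is never positive.
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def log_softmax(logits):
    # Subtract the max logit so every exponent is <= 0 (log-sum-exp trick).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

for z in (-800.0, -5.0, 5.0, 800.0):
    try:
        naive = f"{log_sigmoid_naive(z):.6f}"
    except (OverflowError, ValueError) as err:
        naive = f"fails ({type(err).__name__})"
    print(f"z={z:>7}  naive={naive:<22} stable={log_sigmoid_stable(z):.6f}")

# Logits this large would overflow exp() without the max subtraction.
print(log_softmax([1000.0, 1001.0, 1002.0]))
```

For moderate logits the two log-sigmoid versions agree; they diverge only where the naive one breaks down, which is exactly the regime a confidently wrong model produces.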
PyTorch and TensorFlow
import torch
import torch.nn.functional as F
# Sigmoid derivative: σ(x)(1 - σ(x))
x = torch.tensor(0.0, requires_grad=True) # at x=0, σ=0.5
a = torch.sigmoid(x)
a.backward()
print(x.grad.item()) # 0.25 = 0.5 * (1 - 0.5) ✓
# Numerically stable log-softmax over a vector of logits
logits = torch.tensor([2.0, -3.0, 0.5])
log_probs = F.log_softmax(logits, dim=0)
print(log_probs) # safe: never computes raw softmax
import tensorflow as tf
# Sigmoid derivative at x=0
x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    a = tf.sigmoid(x)  # must run inside the tape context to be recorded
print(tape.gradient(a, x).numpy()) # 0.25
# Numerically stable cross-entropy (pass logits, not probabilities)
logits = tf.constant([[2.0, -3.0, 0.5]])
labels = tf.constant([[1, 0, 0]])
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(loss(labels, logits).numpy()) # computed via log-softmax internally