Prerequisite · Calculus Foundations

Exponentials, Logarithms, and Their Derivatives

15 min read
By the end of this reading you will be able to:
  • State the derivatives of e^x and ln(x) and apply them with the chain rule to differentiate compositions involving exponentials and logarithms
  • Derive the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) using the chain rule (or, equivalently, the quotient rule) and the exponential derivative
  • Explain why log appears in cross-entropy loss and why numerical stability requires computing log-probabilities rather than raw probabilities

The Natural Exponential Function

The exponential function $e^x$ (where $e \approx 2.71828\ldots$) has a remarkable property: it is its own derivative.

$$\frac{d}{dx}\left[e^x\right] = e^x$$

No other base has this property. The number $e$ is defined precisely so that this holds: it is the unique base $a$ for which the derivative of $a^x$ at $x = 0$ equals 1.
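The defining property can be checked numerically. The sketch below estimates the slope of $a^x$ at $x = 0$ for a few bases with a central difference; the helper name `slope_at_zero` and the step size are illustrative choices, not part of the reading. The slope equals $\ln a$, which is 1 only for $a = e$.

```python
import math

def slope_at_zero(a, h=1e-6):
    """Central-difference estimate of d/dx a^x at x = 0 (h is an assumed step size)."""
    return (a ** h - a ** (-h)) / (2 * h)

# The slope at 0 matches ln(a); it crosses 1 exactly at a = e.
for a in (2.0, math.e, 3.0):
    print(f"a = {a:.5f}  slope at 0 = {slope_at_zero(a):.6f}  ln(a) = {math.log(a):.6f}")
```

Bases below $e$ give a slope under 1 and bases above $e$ give a slope over 1, which is what singles out $e$.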

Why does this matter? Because a function that equals its own rate of change is the natural model for anything that grows (or decays) proportionally to its current size: compound interest, population growth, radioactive decay, and — centrally for ML — the exponential in the softmax function.


The Natural Logarithm

The natural logarithm $\ln x$ is the inverse of $e^x$: $\ln(e^x) = x$ and $e^{\ln x} = x$. Its derivative:

$$\frac{d}{dx}[\ln x] = \frac{1}{x} \qquad (x > 0)$$

Notice that as $x$ grows, $1/x$ shrinks toward zero: the logarithm's slope becomes shallower and shallower, reflecting how slowly $\ln x$ grows for large $x$. For small positive $x$ near zero, $1/x$ is large, and $\ln x$ falls steeply toward $-\infty$.


Chain Rule Combinations

With the chain rule from r3, we can differentiate any composition involving $e^x$ and $\ln x$:

$$\frac{d}{dx}\left[e^{g(x)}\right] = g'(x)\,e^{g(x)}$$

$$\frac{d}{dx}\left[\ln g(x)\right] = \frac{g'(x)}{g(x)}$$

Examples:

| Function | Derivative |
| --- | --- |
| $e^{3x}$ | $3e^{3x}$ |
| $e^{-x^2}$ | $-2x\,e^{-x^2}$ |
| $\ln(x^2 + 1)$ | $\frac{2x}{x^2+1}$ |
| $\ln(\sin x)$ | $\frac{\cos x}{\sin x} = \cot x$ |
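Each row of the examples can be sanity-checked against a finite-difference estimate. This is a minimal sketch using only the standard library; the helper `numeric_derivative` and the test point `x0 = 0.7` are arbitrary illustrative choices.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (label, function, derivative given by the chain-rule formulas)
table = [
    ("e^(3x)",      lambda x: math.exp(3 * x),       lambda x: 3 * math.exp(3 * x)),
    ("e^(-x^2)",    lambda x: math.exp(-x * x),      lambda x: -2 * x * math.exp(-x * x)),
    ("ln(x^2 + 1)", lambda x: math.log(x * x + 1),   lambda x: 2 * x / (x * x + 1)),
    ("ln(sin x)",   lambda x: math.log(math.sin(x)), lambda x: math.cos(x) / math.sin(x)),
]

x0 = 0.7  # sin(0.7) > 0, so ln(sin x) is defined here
for label, f, df in table:
    print(f"{label:12s} numeric = {numeric_derivative(f, x0):.6f}  formula = {df(x0):.6f}")
```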

Deriving the Sigmoid Derivative

The sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ is one of the most important functions in ML. Let's derive its derivative from first principles.

Write $\sigma(x) = (1 + e^{-x})^{-1}$ and apply the chain rule:

$$\sigma'(x) = -(1+e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^2}$$

Now rewrite using a clever algebraic trick. Factor and recognize $\sigma$:

$$\sigma'(x) = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) \cdot \frac{e^{-x}}{1+e^{-x}}$$

Note that $\frac{e^{-x}}{1+e^{-x}} = 1 - \frac{1}{1+e^{-x}} = 1 - \sigma(x)$. Therefore:

$$\boxed{\sigma'(x) = \sigma(x)\,(1 - \sigma(x))}$$

This is the result previewed in r3. The sigmoid derivative is expressible entirely in terms of the sigmoid output itself, so no separate computation is needed. At $\sigma = 0.5$ (the steepest point), $\sigma' = 0.25$. At extreme values near 0 or 1, $\sigma'$ collapses toward zero: this is the vanishing gradient problem.
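The identity and the vanishing-gradient behavior can both be seen in a few lines. The sketch below compares $\sigma(x)(1 - \sigma(x))$ with a finite-difference slope at several points; the function names are illustrative and the plain `math.exp` form of the sigmoid is only suitable for moderate inputs like these.

```python
import math

def sigmoid(x):
    """Textbook sigmoid; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative written purely in terms of the sigmoid output."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}  formula = {sigmoid_grad(x):.6e}  numeric = {numeric_derivative(sigmoid, x):.6e}")
```

The gradient peaks at 0.25 for $x = 0$ and is already below $10^{-4}$ at $|x| = 10$.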


Cross-Entropy Loss and the Log Derivative

For a binary classifier with prediction $\hat{p} \in (0,1)$ and true label $y \in \{0,1\}$, the cross-entropy loss is:

$$\mathcal{L} = -y \ln \hat{p} - (1-y) \ln(1-\hat{p})$$

When $y = 1$, this reduces to $-\ln \hat{p}$. Why log?

  1. Probabilistic motivation: Maximizing the likelihood of the observed label is equivalent to minimizing its negative logarithm, which is $-\ln \hat{p}$ when $y = 1$, since $\ln$ is monotonically increasing.
  2. Gradient behavior: The gradient of $-\ln \hat{p}$ with respect to $\hat{p}$ is $-1/\hat{p}$. When the model is confidently wrong ($\hat{p} \approx 0$), this gradient is very large in magnitude, so the loss sends a strong correction signal. When the model is correct ($\hat{p} \approx 1$), the magnitude shrinks to about 1, the smallest it can be, so only a small nudge results.

The chain rule then gives the gradient with respect to the network's output $z$ (before the sigmoid, so $\hat{p} = \sigma(z)$):

$$\frac{\partial \mathcal{L}}{\partial z} = \hat{p} - y$$
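The intermediate steps, written out as a sketch under the assumption that $\hat{p} = \sigma(z)$ and using the sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ derived earlier:

```latex
\begin{align*}
\frac{\partial \mathcal{L}}{\partial \hat{p}} &= -\frac{y}{\hat{p}} + \frac{1-y}{1-\hat{p}},
\qquad
\frac{\partial \hat{p}}{\partial z} = \hat{p}\,(1-\hat{p}) \\
\frac{\partial \mathcal{L}}{\partial z}
  &= \left(-\frac{y}{\hat{p}} + \frac{1-y}{1-\hat{p}}\right)\hat{p}\,(1-\hat{p}) \\
  &= -y\,(1-\hat{p}) + (1-y)\,\hat{p} \\
  &= \hat{p} - y
\end{align*}
```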

This elegantly simple form — prediction minus label — is why logistic regression with cross-entropy loss trains so cleanly.
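The prediction-minus-label form can be confirmed numerically. The sketch below differentiates the loss with respect to the logit by finite differences and compares it with $\hat{p} - y$; the function name `bce_from_logit` and the sample $(z, y)$ pairs are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy as a function of the pre-sigmoid output z (textbook form)."""
    p = sigmoid(z)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def numeric_grad(z, y, h=1e-6):
    """Central-difference estimate of dL/dz."""
    return (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)

for z, y in [(-2.0, 1), (0.5, 0), (3.0, 1)]:
    analytic = sigmoid(z) - y  # prediction minus label
    print(f"z = {z:5.1f}, y = {y}: numeric = {numeric_grad(z, y):.6f}  p_hat - y = {analytic:.6f}")
```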


Numerical Stability: Log-Probabilities

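Working with log-probabilities directly, rather than computing a probability and then taking its logarithm, prevents underflow: a very small probability rounds to 0 in floating point, and $\ln 0 = -\infty$. Similarly, the softmax denominator $\sum_k e^{z_k}$ can overflow for large logits. For the sigmoid, the standard fix starts from:

$$\ln \sigma(x) = \ln\!\left(\frac{1}{1+e^{-x}}\right) = -\ln(1 + e^{-x})$$

For large positive $x$, this is $\approx -e^{-x} \approx 0$. For large negative $x$, it is $\approx x$. Both limits are finite and well behaved, but evaluating the formula literally still overflows for large negative $x$ because $e^{-x}$ becomes huge. A common stable rearrangement is $\min(x, 0) - \ln(1 + e^{-|x|})$, in which the exponent is never positive. This is why torch.nn.BCEWithLogitsLoss and log_softmax exist: they fuse the log and exp operations to avoid intermediate overflow.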
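A minimal pure-Python sketch of both ideas follows. It illustrates the general technique and is not a description of how PyTorch implements these functions; the names `log_sigmoid` and `log_softmax` are local helpers defined here. The log-sigmoid uses the rearrangement $\min(x, 0) - \ln(1 + e^{-|x|})$, and the log-softmax subtracts the largest logit before exponentiating (the log-sum-exp trick).

```python
import math

def log_sigmoid(x):
    """ln sigma(x) as min(x, 0) - log1p(exp(-|x|)); exp never receives a positive argument."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def log_softmax(logits):
    """Log-softmax via log-sum-exp: shift by the max logit so every exponent is <= 0."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

# The literal formula -log(1 + exp(-x)) raises OverflowError at x = -1000 in Python.
print(log_sigmoid(-1000.0))
print(log_sigmoid(1000.0))
print(log_softmax([1000.0, 1001.0, 1002.0]))
```

Shifting by the maximum leaves the result unchanged mathematically, because the shift cancels between the numerator and the denominator of the softmax.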


PyTorch and TensorFlow

import torch
import torch.nn.functional as F

# Sigmoid derivative: σ(x)(1 - σ(x))
x = torch.tensor(0.0, requires_grad=True)  # at x=0, σ=0.5
a = torch.sigmoid(x)
a.backward()
print(x.grad.item())   # 0.25  = 0.5 * (1 - 0.5)  ✓

# Numerically stable log-probabilities via log-softmax
logits = torch.tensor([2.0, -3.0, 0.5])
log_probs = F.log_softmax(logits, dim=0)
print(log_probs)       # safe: never forms raw softmax probabilities

# TensorFlow version of the same checks
import tensorflow as tf

# Sigmoid derivative at x=0
x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    a = tf.sigmoid(x)
print(tape.gradient(a, x).numpy())  # 0.25

# Numerically stable cross-entropy (pass logits, not probabilities)
logits = tf.constant([[2.0, -3.0, 0.5]])
labels = tf.constant([[1, 0, 0]])
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(loss(labels, logits).numpy())  # computed via log-softmax internally