Prerequisite · Calculus Foundations

Exponentials, Logarithms, and Their Derivatives

By the end of this reading you will be able to:

State the derivatives of e^x and ln(x) and apply them with the chain rule to differentiate compositions involving exponentials and logarithms
Derive the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) using the chain rule and the exponential derivative
Explain why log appears in cross-entropy loss and why numerical stability requires computing log-probabilities rather than raw probabilities

The Natural Exponential Function

The exponential function $e^x$ (where $e \approx 2.71828...$ ) has a remarkable property: it is its own derivative.

$\frac{d}{dx}\left[e^x\right] = e^x$

No other base has this property. $e$ is defined precisely so that this holds — it is the unique base for which the derivative of $a^x$ at $x = 0$ equals 1.
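A quick numerical check makes the defining property concrete. The sketch below is illustrative only: `slope_at_zero` is a hypothetical helper that estimates the slope of $a^x$ at $x = 0$ with a central difference. That slope equals $\ln a$, so it is 1 exactly when $a = e$.

```python
import math

def slope_at_zero(a, h=1e-6):
    """Central-difference estimate of d/dx a**x at x = 0 (hypothetical helper)."""
    return (a ** h - a ** (-h)) / (2 * h)

# Compare the estimated slope with ln(a) for three bases.
for a in (2.0, math.e, 3.0):
    print(f"a={a:.5f}  slope at 0 ~ {slope_at_zero(a):.6f}  ln(a) = {math.log(a):.6f}")
```

Only the row for $a = e$ should show a slope of approximately 1; smaller bases give a slope below 1 and larger bases a slope above 1.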

Why does this matter? Because a function that equals its own rate of change is the natural model for anything that grows (or decays) proportionally to its current size: compound interest, population growth, radioactive decay, and — centrally for ML — the exponential in the softmax function.

The Natural Logarithm

The natural logarithm $\ln x$ is the inverse of $e^x$ : $\ln(e^x) = x$ and $e^{\ln x} = x$ . Its derivative:

$\frac{d}{dx}[\ln x] = \frac{1}{x} \qquad (x > 0)$

Notice that as $x$ grows, $1/x$ shrinks toward zero — the logarithm's slope becomes shallower and shallower, reflecting how slowly $\ln x$ grows for large $x$ . For small positive $x$ near zero, $1/x$ is large — $\ln x$ falls steeply toward $-\infty$ .

Chain Rule Combinations

With the chain rule from r3, we can differentiate any composition involving $e^x$ and $\ln x$ :

$\frac{d}{dx}\left[e^{g(x)}\right] = g'(x)\,e^{g(x)}$

$\frac{d}{dx}\left[\ln g(x)\right] = \frac{g'(x)}{g(x)} \qquad (g(x) > 0)$

Examples:

Function	Derivative
$e^{3x}$	$3e^{3x}$
$e^{-x^2}$	$-2x\,e^{-x^2}$
$\ln(x^2 + 1)$	$\frac{2x}{x^2+1}$
$\ln(\sin x)$	$\frac{\cos x}{\sin x} = \cot x$
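The table entries can be sanity-checked numerically. The following is a minimal sketch, not a rigorous test: `central_diff` is a hypothetical helper, and the test point `x0 = 0.7` is an arbitrary choice at which $\sin x > 0$, so that $\ln(\sin x)$ is defined.

```python
import math

def central_diff(f, x, h=1e-6):
    """Central-difference estimate of f'(x) (hypothetical helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Each entry pairs a function from the table with its claimed derivative.
cases = {
    "exp(3x)": (lambda x: math.exp(3 * x), lambda x: 3 * math.exp(3 * x)),
    "exp(-x^2)": (lambda x: math.exp(-x * x), lambda x: -2 * x * math.exp(-x * x)),
    "ln(x^2+1)": (lambda x: math.log(x * x + 1), lambda x: 2 * x / (x * x + 1)),
    "ln(sin x)": (lambda x: math.log(math.sin(x)), lambda x: math.cos(x) / math.sin(x)),
}

x0 = 0.7  # arbitrary test point; sin(0.7) > 0
for name, (f, df) in cases.items():
    print(f"{name:10s} numeric={central_diff(f, x0):.6f}  formula={df(x0):.6f}")
```

The two columns should agree to the printed precision at any point where the function is defined.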

Deriving the Sigmoid Derivative

The sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ is one of the most important functions in ML. Let's derive its derivative from first principles.

Write $\sigma(x) = (1 + e^{-x})^{-1}$ and apply the chain rule:

$\sigma'(x) = -(1+e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^2}$

Now split the fraction into two factors and recognize $\sigma$ in the first:

$\sigma'(x) = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) \cdot \frac{e^{-x}}{1+e^{-x}}$

Note that $\frac{e^{-x}}{1+e^{-x}} = 1 - \frac{1}{1+e^{-x}} = 1 - \sigma(x)$ . Therefore:

$\boxed{\sigma'(x) = \sigma(x)\,(1 - \sigma(x))}$

This is the result previewed in r3. The sigmoid derivative is expressed entirely in terms of the sigmoid output itself, so once $\sigma(x)$ is known, no further exponentials need to be evaluated. At $x = 0$, where $\sigma = 0.5$ and the curve is steepest, $\sigma' = 0.25$ . At extreme inputs, where $\sigma$ is near 0 or 1, $\sigma'$ collapses toward zero, which is the root of the vanishing gradient problem.
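To see the boxed identity and the collapse of the gradient in numbers, here is a small standard-library sketch. The function names `sigmoid`, `sigmoid_grad`, and `central_diff` are illustrative choices, not part of any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_diff(f, x, h=1e-6):
    """Central-difference estimate of f'(x), used as an independent check."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  formula={sigmoid_grad(x):.6f}  numeric={central_diff(sigmoid, x):.6f}")
```

The formula and the numerical estimate should match, with the largest value (0.25) at $x = 0$ and values close to zero at $x = \pm 10$.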

Cross-Entropy Loss and the Log Derivative

For a binary classifier with prediction $\hat{p} \in (0,1)$ and true label $y \in \{0,1\}$ , the cross-entropy loss is:

$\mathcal{L} = -y \ln \hat{p} - (1-y) \ln(1-\hat{p})$

When $y = 1$ , this reduces to $-\ln \hat{p}$ . Why log?

Probabilistic motivation: For $y = 1$ the likelihood of the observed label is $\hat{p}$ . Maximizing that likelihood is equivalent to minimizing $-\ln \hat{p}$ , since $\ln$ is monotonically increasing, and the logarithm turns a product of likelihoods over independent examples into a sum.
Gradient behavior: The gradient of $-\ln \hat{p}$ with respect to $\hat{p}$ is $-1/\hat{p}$ . When the model is confidently wrong ( $\hat{p} \approx 0$ ), its magnitude is very large, so the loss sends a strong correction signal. When the model is correct ( $\hat{p} \approx 1$ ), the magnitude falls to about 1, its smallest value on $(0, 1]$ , so only a small nudge is applied.
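The gradient behavior can be tabulated directly for the $y = 1$ case. This is a minimal sketch; `loss_y1` and `grad_y1` are hypothetical names introduced here for illustration.

```python
import math

def loss_y1(p_hat):
    """Cross-entropy loss when the true label is y = 1."""
    return -math.log(p_hat)

def grad_y1(p_hat):
    """Derivative of the y = 1 loss with respect to p_hat."""
    return -1.0 / p_hat

# Confidently wrong, uncertain, and confidently right predictions.
for p_hat in (0.01, 0.5, 0.99):
    print(f"p_hat={p_hat:.2f}  loss={loss_y1(p_hat):.4f}  dL/dp_hat={grad_y1(p_hat):.2f}")
```

The gradient magnitude at $\hat{p} = 0.01$ is roughly a hundred times larger than at $\hat{p} = 0.99$.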

With $\hat{p} = \sigma(z)$ , the chain rule then gives the gradient with respect to the network's pre-sigmoid output $z$ : $\frac{\partial \mathcal{L}}{\partial z} = \hat{p} - y$

This elegantly simple form — prediction minus label — is why logistic regression with cross-entropy loss trains so cleanly.
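The intermediate steps behind prediction minus label are short enough to write out. The derivation below assumes $\hat{p} = \sigma(z)$ and uses the sigmoid derivative $\sigma'(z) = \hat{p}\,(1-\hat{p})$ obtained earlier:

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \hat{p}} &= -\frac{y}{\hat{p}} + \frac{1-y}{1-\hat{p}},
\qquad
\frac{\partial \hat{p}}{\partial z} = \hat{p}\,(1-\hat{p}) \\
\frac{\partial \mathcal{L}}{\partial z}
&= \left(-\frac{y}{\hat{p}} + \frac{1-y}{1-\hat{p}}\right)\hat{p}\,(1-\hat{p}) \\
&= -y\,(1-\hat{p}) + (1-y)\,\hat{p} \\
&= \hat{p} - y
\end{aligned}
```

The factors $1/\hat{p}$ and $1/(1-\hat{p})$ from the logarithm cancel against the sigmoid derivative, so the large gradients of the log near 0 and the vanishing gradients of the sigmoid offset each other.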

Numerical Stability: Log-Probabilities

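Computing $\ln p$ directly from the logits, instead of forming $p$ and then taking its logarithm, prevents underflow: a probability that rounds to 0 in floating point gives $\ln 0 = -\infty$ . The softmax denominator $\sum_k e^{z_k}$ can likewise overflow for large logits. For the sigmoid, the standard fix starts by simplifying the logarithm analytically: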

$\ln \sigma(x) = \ln\!\left(\frac{1}{1+e^{-x}}\right) = -\ln(1 + e^{-x})$

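For large positive $x$ this is $\approx -e^{-x} \approx 0$ ; for large negative $x$ it is $\approx x$ . Both limits are finite and well behaved, but evaluating the formula literally still overflows in $e^{-x}$ when $x$ is very negative, so implementations typically use an equivalent rearrangement such as $\min(x, 0) - \ln(1 + e^{-|x|})$ . This is why torch.nn.BCEWithLogitsLoss and log_softmax exist: they fuse the log and exp operations to avoid intermediate overflow and underflow.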
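The stability issue is easy to reproduce with the standard library alone. The sketch below is an illustration under stated assumptions, not the implementation any framework uses: `log_sigmoid_stable` applies the rearrangement $\min(x, 0) - \ln(1 + e^{-|x|})$, and `log_softmax_stable` subtracts the largest logit before exponentiating (the log-sum-exp trick). All three function names are hypothetical.

```python
import math

def log_sigmoid_naive(x):
    """Literal -ln(1 + e^{-x}); math.exp(-x) overflows for very negative x."""
    return -math.log(1.0 + math.exp(-x))

def log_sigmoid_stable(x):
    """min(x, 0) - log1p(exp(-|x|)); the exponent is never positive."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def log_softmax_stable(logits):
    """Log-softmax with the max logit subtracted before exponentiating."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

for x in (-1000.0, 0.0, 1000.0):
    try:
        naive = log_sigmoid_naive(x)
    except OverflowError:
        naive = "OverflowError"
    print(f"x={x:8.1f}  naive={naive}  stable={log_sigmoid_stable(x)}")

# Logits this large would overflow math.exp if exponentiated directly.
print(log_softmax_stable([1000.0, 1001.0, 1002.0]))
```

The naive version fails at $x = -1000$, while the stable version returns approximately $x$ there and approximately 0 at $x = 1000$, matching the two limits described above.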

PyTorch and TensorFlow

import torch
import torch.nn.functional as F

# Sigmoid derivative: σ(x)(1 - σ(x))
x = torch.tensor(0.0, requires_grad=True)  # at x=0, σ=0.5
a = torch.sigmoid(x)
a.backward()
print(x.grad.item())   # 0.25  = 0.5 * (1 - 0.5)  ✓

# Numerically stable log-softmax over a vector of logits
logits = torch.tensor([2.0, -3.0, 0.5])
log_probs = F.log_softmax(logits, dim=0)
print(log_probs)       # safe: never computes raw softmax

# TensorFlow equivalent
import tensorflow as tf

# Sigmoid derivative at x=0
x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    a = tf.sigmoid(x)
print(tape.gradient(a, x).numpy())  # 0.25

# Numerically stable cross-entropy (pass logits, not probabilities)
logits = tf.constant([[2.0, -3.0, 0.5]])
labels = tf.constant([[1, 0, 0]])
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(loss(labels, logits).numpy())  # computed via log-softmax internally

References

MIT 18.01SC — Sessions 16–18 — Exponential Functions and Their Derivatives
