Prerequisite · Probability Foundations

Expected Value, Variance, and Standard Deviation

By the end of this reading you will be able to:

Compute the expected value of a discrete or continuous random variable and apply the linearity of expectation to simplify E[aX + bY + c]
Compute variance using both Var(X) = E[(X−μ)²] and the shortcut Var(X) = E[X²] − (E[X])², and derive standard deviation from variance
Explain why Var(X + Y) = Var(X) + Var(Y) holds when X and Y are independent (more generally, whenever their covariance is zero), and define covariance and correlation as measures of linear dependence
Identify where expected value and variance appear in ML: loss as E[ℓ], gradient variance in SGD, and weight initialization schemes derived from variance preservation

Expected Value

The expected value (or expectation, or mean) of a random variable $X$ is its probability-weighted average. It answers: if I were to draw $X$ many times and average the results, what number would the average converge to?

Discrete:

$\mathbb{E}[X] = \sum_x x \cdot p(x)$

Continuous:

$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x)\, dx$

Example: For a fair die with $p(x) = 1/6$ for $x \in \{1,\ldots,6\}$:

$\mathbb{E}[X] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + \cdots + 6 \cdot \tfrac{1}{6} = \frac{21}{6} = 3.5$
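As a quick check of the arithmetic above, here is a minimal Python sketch that computes the probability-weighted average exactly and compares it with the average of many simulated rolls. The helper name `expected_value`, the seed, and the sample size are illustrative choices, not part of the original reading.

```python
import random
from fractions import Fraction

def expected_value(pmf):
    """Probability-weighted average of a discrete distribution.

    pmf maps each outcome x to its probability p(x).
    """
    return sum(x * p for x, p in pmf.items())

# Fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}
exact_mean = expected_value(die_pmf)  # Fraction(7, 2), i.e. 3.5

# Long-run average of simulated rolls (assumed seed and sample size).
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)

print(exact_mean, float(exact_mean), round(sample_mean, 3))
```

The simulated average should land close to 3.5, and it gets closer as the number of rolls grows.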

Expectation of a Function

For a function $g(X)$, you do not need to first compute the distribution of $g(X)$:

$\mathbb{E}[g(X)] = \sum_x g(x)\, p(x) \qquad \text{(discrete)}$

$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \qquad \text{(continuous)}$

This is the Law of the Unconscious Statistician (LOTUS) — useful for computing moments like $\mathbb{E}[X^2]$ directly.
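A small sketch of LOTUS in the discrete case, reusing the fair die: the function `expect` (a hypothetical helper name) weights $g(x)$ by the probabilities of $X$ itself, so the distribution of $g(X)$ is never built.

```python
from fractions import Fraction

def expect(g, pmf):
    """E[g(X)] = sum over x of g(x) * p(x), using the pmf of X directly."""
    return sum(g(x) * p for x, p in pmf.items())

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = expect(lambda x: x, die_pmf)                # E[X]   = 7/2
second_moment = expect(lambda x: x ** 2, die_pmf)  # E[X^2] = 91/6

print(mean, second_moment)
```

Note that $\mathbb{E}[X^2] = 91/6$ is not the same as $(\mathbb{E}[X])^2 = 49/4$: in general the expectation of a function is not the function of the expectation.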

Linearity of Expectation

This is one of the most useful facts in probability:

$\mathbb{E}[aX + bY + c] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] + c$

This holds for any two random variables $X$ and $Y$ with finite expectations, whether or not they are independent.
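To see that independence really is not needed, the following sketch checks the identity on a deliberately dependent pair: $X$ is a fair die roll and $Y = 7 - X$. The pair and the constants `a`, `b`, `c` are arbitrary choices made for illustration.

```python
from fractions import Fraction

# X is a fair die roll; Y = 7 - X is fully determined by X (not independent).
outcomes = [(x, 7 - x) for x in range(1, 7)]
p = Fraction(1, 6)  # each (x, y) pair is equally likely

a, b, c = 2, 3, 1  # arbitrary constants for illustration

e_x = sum(x * p for x, _ in outcomes)
e_y = sum(y * p for _, y in outcomes)

# Left side: expectation of the combined expression, computed directly.
lhs = sum((a * x + b * y + c) * p for x, y in outcomes)
# Right side: combine the individual expectations.
rhs = a * e_x + b * e_y + c

print(lhs, rhs)
```

Both sides come out to 37/2, even though knowing $X$ tells you $Y$ exactly.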

Linearity of expectation means you can break apart complex expressions into simpler pieces. For example, if each per-example loss $\ell_i$ is computed on an example drawn from the same data distribution, the expected loss over a batch of $N$ examples is:

$\mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^N \ell_i\right] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[\ell_i] = \mathbb{E}[\ell]$

The per-batch mean is an unbiased estimator of the true expected loss — this justifies training with mini-batches.

Variance

Expected value tells you where the distribution is centered. Variance tells you how spread out it is.

$\text{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right]$

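Variance is the average squared deviation from the mean. Writing $\mu = \mathbb{E}[X]$, expanding the square, and applying linearity of expectation gives $\mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2$, which simplifies to the shortcut: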

$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$

This shortcut is often easier to compute: find $\mathbb{E}[X^2]$ and subtract the square of the mean.

Example — fair die:

$\mathbb{E}[X^2] = \frac{1^2 + 2^2 + \cdots + 6^2}{6} = \frac{91}{6} \approx 15.17$

$\text{Var}(X) = 15.17 - 3.5^2 = 15.17 - 12.25 = 2.92$
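The two variance formulas can be compared directly. This sketch evaluates both for the fair die using exact fractions, so no rounding is involved; the variable names are illustrative.

```python
from fractions import Fraction

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in die_pmf.items())

# Definition: E[(X - mu)^2]
var_definition = sum((x - mean) ** 2 * p for x, p in die_pmf.items())

# Shortcut: E[X^2] - (E[X])^2
second_moment = sum(x ** 2 * p for x, p in die_pmf.items())
var_shortcut = second_moment - mean ** 2

print(var_definition, var_shortcut, float(var_shortcut))
```

Both routes give exactly 35/12, which rounds to the 2.92 shown above.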

Standard Deviation

Variance is in squared units, which is hard to interpret alongside the original scale. The standard deviation returns to the original units:

$\sigma_X = \sqrt{\text{Var}(X)}$

For the die: $\sigma = \sqrt{2.92} \approx 1.71$. Roughly speaking, individual rolls are about 1.71 units away from the mean of 3.5.
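Python's standard library can reproduce this number. In the sketch below, `statistics.pstdev` treats the six faces as the whole (equally likely) population, which matches $\sqrt{\text{Var}(X)}$ for a fair die. The mean absolute deviation is included only as a point of comparison for the "roughly speaking" reading; it is a different measure of spread.

```python
import math
import statistics

faces = [1, 2, 3, 4, 5, 6]  # equally likely outcomes of a fair die

# Population standard deviation: sqrt of the average squared deviation.
sigma = statistics.pstdev(faces)
sigma_from_variance = math.sqrt(35 / 12)  # same value from Var(X) = 35/12

# Mean absolute deviation from the mean, for comparison with sigma.
mean = statistics.mean(faces)
mean_abs_dev = sum(abs(x - mean) for x in faces) / len(faces)

print(round(sigma, 3), round(sigma_from_variance, 3), mean_abs_dev)
```

The standard deviation (about 1.71) is somewhat larger than the mean absolute deviation (1.5) because squaring gives extra weight to the faces farthest from 3.5.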

Variance of Linear Combinations

Variance is not linear — squaring introduces cross terms.

$\text{Var}(aX + b) = a^2 \text{Var}(X)$

Adding a constant $b$ shifts the distribution but does not change its spread; scaling by $a$ scales variance by $a^2$.
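A minimal check of the scaling rule on the fair die, with arbitrarily chosen constants `a = 3` and `b = 10` and a hypothetical helper `variance`:

```python
from fractions import Fraction

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2 for a discrete pmf {x: p(x)}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

a, b = 3, 10  # arbitrary scale and shift
# Distribution of aX + b: same probabilities, transformed outcomes.
transformed_pmf = {a * x + b: p for x, p in die_pmf.items()}

var_x = variance(die_pmf)             # 35/12
var_ax_b = variance(transformed_pmf)  # a^2 * 35/12 = 105/4

print(var_x, var_ax_b)
```

The shift `b` has no effect on the result, while the scale `a` enters squared.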

$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$

If $X$ and $Y$ are independent, then $\text{Cov}(X,Y) = 0$, so

$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \qquad (X \perp Y)$

This is why variances add for independent random variables, including independent gradient estimates in mini-batch SGD. Independence is sufficient but not necessary: the identity needs only $\text{Cov}(X,Y) = 0$.
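The role of the covariance term shows up clearly when an independent pair is compared with a fully dependent one. This sketch enumerates two cases exactly: two separate dice, and the extreme case where $Y$ is the very same roll as $X$. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

def mean(values_probs):
    """Mean of a distribution given as a list of (value, probability)."""
    return sum(v * p for v, p in values_probs)

def variance(values_probs):
    """Variance of a distribution given as a list of (value, probability)."""
    m = mean(values_probs)
    return sum((v - m) ** 2 * p for v, p in values_probs)

faces = range(1, 7)
var_x = variance([(x, Fraction(1, 6)) for x in faces])  # 35/12

# Independent case: two separate dice, all 36 pairs equally likely.
indep_sum = [(x + y, Fraction(1, 36)) for x, y in product(faces, faces)]
var_indep = variance(indep_sum)  # Var(X) + Var(Y) = 35/6

# Fully dependent case: Y is the same roll as X, so Cov(X, Y) = Var(X).
dep_sum = [(x + x, Fraction(1, 6)) for x in faces]
var_dep = variance(dep_sum)  # Var(X) + Var(X) + 2 Var(X) = 35/3

print(var_x, var_indep, var_dep)
```

With independent dice the variances simply add; with $Y = X$ the cross term doubles the total.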

Covariance and Correlation

Covariance measures how two random variables move together:

$\text{Cov}(X, Y) = \mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$

$\text{Cov}(X,Y) > 0$: $X$ and $Y$ tend to be large or small together
$\text{Cov}(X,Y) < 0$: when $X$ is large, $Y$ tends to be small
$\text{Cov}(X,Y) = 0$: no linear relationship (but they might still be dependent in a nonlinear way)

Note: $\text{Cov}(X,X) = \text{Var}(X)$.

Correlation normalizes covariance to $[-1, 1]$ (it is defined when both standard deviations are nonzero):

$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$

$\rho = 1$ means perfect positive linear relationship; $\rho = -1$ means perfect negative; $\rho = 0$ means no linear relationship.
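The zero-covariance case noted above can be made concrete. In this sketch, $X$ is uniform on $\{-1, 0, 1\}$; the pair with $Y = 2X + 1$ has correlation 1, while the pair with $Y = X^2$ has covariance 0 even though $Y$ is completely determined by $X$. Both toy distributions and the helper names are assumptions chosen for illustration.

```python
import math
from fractions import Fraction

def expect(g, joint_pmf):
    """E[g(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

def covariance(joint_pmf):
    """Cov(X, Y) = E[XY] - E[X] E[Y]."""
    e_x = expect(lambda x, y: x, joint_pmf)
    e_y = expect(lambda x, y: y, joint_pmf)
    return expect(lambda x, y: x * y, joint_pmf) - e_x * e_y

def correlation(joint_pmf):
    """rho = Cov(X, Y) / (sigma_X * sigma_Y); needs nonzero variances."""
    e_x = expect(lambda x, y: x, joint_pmf)
    e_y = expect(lambda x, y: y, joint_pmf)
    var_x = expect(lambda x, y: x * x, joint_pmf) - e_x ** 2
    var_y = expect(lambda x, y: y * y, joint_pmf) - e_y ** 2
    return float(covariance(joint_pmf)) / math.sqrt(var_x * var_y)

third = Fraction(1, 3)

# Y = 2X + 1: an exact increasing linear relationship.
linear = {(x, 2 * x + 1): third for x in (-1, 0, 1)}

# Y = X^2: Y is determined by X, yet the covariance is zero.
quadratic = {(x, x * x): third for x in (-1, 0, 1)}

print(covariance(linear), correlation(linear), covariance(quadratic))
```

The second case is the standard warning: zero covariance rules out a linear trend, not dependence.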

Where These Appear in ML

Loss as expectation. The training objective is an expected loss over the data distribution:

$\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}}[\ell(f_\theta(x), y)]$

Stochastic gradient descent approximates $\nabla_\theta \mathcal{L}(\theta)$ with a mini-batch estimate. When the batch is sampled i.i.d. from the data distribution, this estimate is unbiased by linearity of expectation, and its variance shrinks in proportion to $1/B$ for batch size $B$.
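The way the variance of a batch average falls with batch size can be checked with a small simulation. This is only a sketch: each "per-example gradient" is replaced by a scalar standard-normal draw (an assumption made for simplicity, not a real model), and the seed and trial count are arbitrary.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary seed for reproducibility

def batch_mean(batch_size):
    """Mean of `batch_size` i.i.d. standard-normal stand-in values."""
    return sum(rng.gauss(0.0, 1.0) for _ in range(batch_size)) / batch_size

def variance_of_batch_mean(batch_size, trials=5_000):
    """Empirical variance of the batch mean across many sampled batches."""
    return statistics.pvariance([batch_mean(batch_size) for _ in range(trials)])

var_b1 = variance_of_batch_mean(1)    # close to 1
var_b16 = variance_of_batch_mean(16)  # close to 1/16
var_b64 = variance_of_batch_mean(64)  # close to 1/64

print(round(var_b1, 3), round(var_b16 * 16, 3), round(var_b64 * 64, 3))
```

After rescaling by the batch size, all three printed values should sit near 1, which is the per-example variance divided out. Quadrupling the batch size therefore only halves the standard deviation of the estimate.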

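Gradient variance. High gradient variance in SGD causes noisy updates and can slow convergence. Larger batches reduce this variance directly, while techniques like gradient clipping, batch normalization, and careful learning rate scheduling are commonly used to keep training stable in the presence of the remaining noise.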

Weight initialization. The Xavier/Glorot initialization sets $\text{Var}(W_{ij}) = 2/(n_{\text{in}} + n_{\text{out}})$ to preserve signal variance through layers. The He initialization sets $\text{Var}(W_{ij}) = 2/n_{\text{in}}$ for ReLU networks. Both are derived by asking: what variance should the weights have so that the output variance equals the input variance?
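The variance-preservation idea can be illustrated for the simplest case: a single linear layer $y = Wx$ with no activation, zero-mean Gaussian weights, unit-variance inputs, and $n_{\text{in}} = n_{\text{out}}$, where the Xavier/Glorot value reduces to $1/n_{\text{in}}$. All of these settings, along with the layer size and seed, are assumptions made for this sketch.

```python
import math
import random
import statistics

rng = random.Random(0)  # arbitrary seed for reproducibility
n_in = n_out = 128      # illustrative layer sizes
n_inputs = 20           # number of random input vectors

def output_variance(weight_var):
    """Empirical variance of y = W x for unit-variance inputs."""
    std = math.sqrt(weight_var)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    outputs = []
    for _ in range(n_inputs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        outputs.extend(sum(w * xi for w, xi in zip(row, x)) for row in weights)
    return statistics.pvariance(outputs)

xavier = output_variance(2 / (n_in + n_out))  # close to 1: variance preserved
too_large = output_variance(1.0)              # close to n_in: variance blows up

print(round(xavier, 2), round(too_large, 1))
```

Each output is a sum of $n_{\text{in}}$ independent terms, so its variance is $n_{\text{in}} \cdot \text{Var}(W_{ij})$ times the input variance. Choosing $\text{Var}(W_{ij}) = 1/n_{\text{in}}$ keeps that product at 1, whereas unit-variance weights multiply the signal variance by $n_{\text{in}}$ at every layer.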

Adam optimizer. Maintains exponential moving averages $m_t \approx \mathbb{E}[g_t]$ (first moment, the mean) and $v_t \approx \mathbb{E}[g_t^2]$ (second moment, the uncentered variance) of the gradient, and uses these running estimates to adapt the step size per parameter.
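A minimal sketch of how Adam's two running moment estimates are formed, for a single scalar parameter. The function name `adam_moments` is hypothetical, the decay rates are the commonly used defaults, and the parameter update itself is omitted to keep the focus on the moments.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Bias-corrected running estimates of E[g] and E[g^2].

    `grads` must be a non-empty sequence of scalar gradients.
    """
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g      # moving average of g
        v = beta2 * v + (1 - beta2) * g * g  # moving average of g^2
    # Correct the bias caused by starting both averages at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat, v_hat

# With a constant gradient the estimates recover g and g^2.
m_hat, v_hat = adam_moments([2.0] * 50)
print(m_hat, v_hat)
```

For a constant gradient of 2, the corrected estimates come out at (up to floating-point error) 2 and 4, matching $\mathbb{E}[g]$ and $\mathbb{E}[g^2]$. Because $v_t$ tracks $\mathbb{E}[g^2]$ rather than $\mathbb{E}[(g - \mu)^2]$, it is an uncentered second moment, not a variance in the sense defined earlier.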
