Prerequisite · Probability Foundations

Expected Value, Variance, and Standard Deviation

By the end of this reading you will be able to:
  • Compute the expected value of a discrete or continuous random variable and apply the linearity of expectation to simplify E[aX + bY + c]
  • Compute variance using both Var(X) = E[(X−μ)²] and the shortcut Var(X) = E[X²] − (E[X])², and derive standard deviation from variance
  • Explain why Var(X + Y) = Var(X) + Var(Y) holds when X and Y are independent (or, more generally, uncorrelated) but not in general, and define covariance and correlation as measures of linear dependence
  • Identify where expected value and variance appear in ML: loss as E[ℓ], gradient variance in SGD, and weight initialization schemes derived from variance preservation

Expected Value

The expected value (or expectation, or mean) of a random variable $X$ is its probability-weighted average. It answers: if I were to draw $X$ many times and average the results, what number would the average converge to?

Discrete:

$$\mathbb{E}[X] = \sum_x x \cdot p(x)$$

Continuous:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x)\, dx$$

Example: For a fair die with $p(x) = 1/6$ for $x \in \{1,\ldots,6\}$:

$$\mathbb{E}[X] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + \cdots + 6 \cdot \tfrac{1}{6} = \frac{21}{6} = 3.5$$
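The die example can be checked numerically. The sketch below computes the exact expectation from the probability mass function and then compares it with the average of many simulated rolls; the seed and the number of rolls are arbitrary choices made for illustration.

```python
from fractions import Fraction
import random

# Fair six-sided die: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Discrete expected value: sum of x * p(x) over all outcomes.
expected = sum(x * p for x, p in pmf.items())
print(expected, float(expected))  # 7/2 3.5

# Empirical check: the average of many rolls should land close to 3.5.
rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
n_rolls = 100_000
sample_mean = sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls
print(sample_mean)
```

The exact value uses `Fraction` to avoid rounding; the simulated average only approximates it and tightens as `n_rolls` grows.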

Expectation of a Function

For a function $g(X)$, you do not need to first compute the distribution of $g(X)$:

$$\mathbb{E}[g(X)] = \sum_x g(x)\, p(x) \qquad \text{(discrete)}$$

This is the Law of the Unconscious Statistician (LOTUS), which is useful for computing moments like $\mathbb{E}[X^2]$ directly.
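As an illustration of LOTUS, the sketch below computes the second moment of a fair die in two ways: directly from the pmf of X, and the long way by first building the pmf of Y = X². The helper name `expect` is hypothetical, introduced only for this example.

```python
from fractions import Fraction

# pmf of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g, pmf):
    """E[g(X)] for a discrete pmf: sum of g(x) * p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

# LOTUS: weight g(x) = x^2 by the pmf of X itself.
second_moment = expect(lambda x: x ** 2, pmf)
print(second_moment)  # 91/6

# Long way: derive the pmf of Y = X^2 first, then take its mean.
pmf_y = {}
for x, p in pmf.items():
    pmf_y[x ** 2] = pmf_y.get(x ** 2, 0) + p
long_way = sum(y * p for y, p in pmf_y.items())
print(long_way == second_moment)  # True
```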


Linearity of Expectation

This is one of the most useful facts in probability:

$$\mathbb{E}[aX + bY + c] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] + c$$

This holds for any two random variables $X$ and $Y$, whether or not they are independent.
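To see that independence is not required, the sketch below uses a deliberately dependent pair: X is a fair die roll and Y = 7 − X is fully determined by X. The constants a, b, c are arbitrary values chosen for the example.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
a, b, c = 2, 3, 1  # arbitrary constants for illustration

def y_of(x):
    # Y is a deterministic function of X, so X and Y are strongly dependent.
    return 7 - x

# Left side: expectation of the combined expression, computed directly.
lhs = sum((a * x + b * y_of(x) + c) * p for x, p in pmf.items())

# Right side: combine the individual expectations.
e_x = sum(x * p for x, p in pmf.items())
e_y = sum(y_of(x) * p for x, p in pmf.items())
rhs = a * e_x + b * e_y + c

print(lhs, rhs)  # 37/2 37/2
```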

Linearity of expectation means you can break apart complex expressions into simpler pieces. For example, when the $N$ examples in a batch are all drawn from the same data distribution, the expected batch loss is:

$$\mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^N \ell_i\right] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[\ell_i] = \mathbb{E}[\ell]$$

The per-batch mean is therefore an unbiased estimator of the true expected loss, which justifies training with mini-batches.
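Unbiasedness can be verified exactly on a toy case. The sketch below takes a made-up population of four per-example losses, enumerates every equally likely batch of size 2 drawn with replacement, and compares the average of the batch means with the population mean. The loss values and batch size are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical per-example losses for a tiny "dataset".
losses = [Fraction(1, 5), Fraction(1), Fraction(1, 2), Fraction(23, 10)]
true_mean = sum(losses) / len(losses)

# Every ordered batch of size 2, sampled with replacement, is equally likely.
batch_size = 2
batches = list(product(losses, repeat=batch_size))
batch_means = [sum(batch) / batch_size for batch in batches]

# Expected value of the batch mean = average over all equally likely batches.
mean_of_batch_means = sum(batch_means) / len(batch_means)

print(true_mean, mean_of_batch_means)
```

Individual batch means differ from the population mean, but their average over all possible batches matches it, which is what "unbiased" means here.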


Variance

Expected value tells you where the distribution is centered. Variance tells you how spread out it is.

$$\text{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right]$$

Variance is the average squared deviation from the mean. Expanding the square and applying linearity of expectation gives:

$$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

This shortcut is often easier to compute: find $\mathbb{E}[X^2]$ and subtract the square of the mean.
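The shortcut follows from the definition in a few lines. Writing $\mu = \mathbb{E}[X]$ (a constant) and using linearity of expectation:

```latex
\begin{aligned}
\text{Var}(X) &= \mathbb{E}\!\left[(X - \mu)^2\right] \\
              &= \mathbb{E}\!\left[X^2 - 2\mu X + \mu^2\right] \\
              &= \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 \\
              &= \mathbb{E}[X^2] - 2\mu^2 + \mu^2 \\
              &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2
\end{aligned}
```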

Example — fair die:

$$\mathbb{E}[X^2] = \frac{1^2 + 2^2 + \cdots + 6^2}{6} = \frac{91}{6} \approx 15.17$$

$$\text{Var}(X) = \frac{91}{6} - 3.5^2 = \frac{35}{12} \approx 15.17 - 12.25 = 2.92$$

Standard Deviation

Variance is in squared units, which is hard to interpret alongside the original scale. The standard deviation returns to the original units:

$$\sigma_X = \sqrt{\text{Var}(X)}$$

For the die: $\sigma = \sqrt{2.92} \approx 1.71$. Roughly speaking, individual rolls are about 1.71 units away from the mean of 3.5.
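The die numbers can be reproduced with a short sketch that computes the variance both from the definition and from the shortcut, then takes the square root for the standard deviation.

```python
from fractions import Fraction
import math

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())

# Definition: average squared deviation from the mean.
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Shortcut: E[X^2] minus the square of the mean.
var_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

# Standard deviation: back in the original units.
std = math.sqrt(var_shortcut)

print(var_definition, var_shortcut)  # 35/12 35/12
print(round(std, 2))                 # 1.71
```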


Variance of Linear Combinations

Variance is not linear — squaring introduces cross terms.

$$\text{Var}(aX + b) = a^2\,\text{Var}(X)$$

Adding a constant $b$ shifts the distribution but does not change its spread; scaling by $a$ scales variance by $a^2$.

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$$
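The cross term comes from squaring a sum. With $\mu_X = \mathbb{E}[X]$, $\mu_Y = \mathbb{E}[Y]$, and the covariance defined as $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$, the expansion is:

```latex
\begin{aligned}
\text{Var}(X + Y)
  &= \mathbb{E}\!\left[\big((X - \mu_X) + (Y - \mu_Y)\big)^2\right] \\
  &= \mathbb{E}\!\left[(X - \mu_X)^2\right]
     + \mathbb{E}\!\left[(Y - \mu_Y)^2\right]
     + 2\,\mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right] \\
  &= \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)
\end{aligned}
```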

If $X$ and $Y$ are independent: $\text{Cov}(X,Y) = 0$, so

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \qquad (X \perp Y)$$

This is why variances add for independent random variables — including independent gradient estimates in mini-batch SGD.
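A small sketch contrasting the two cases: the sum of two independent dice, where variances add, and the fully dependent case Y = X, where the covariance term doubles the result. The helper `variance` is hypothetical and takes a list of (value, probability) pairs.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def variance(weighted_values):
    """Variance of a discrete distribution given as (value, probability) pairs."""
    mean = sum(v * p for v, p in weighted_values)
    return sum((v - mean) ** 2 * p for v, p in weighted_values)

var_x = variance([(x, Fraction(1, 6)) for x in faces])

# Independent dice: each of the 36 (x, y) pairs has probability 1/36.
var_sum_independent = variance(
    [(x + y, Fraction(1, 36)) for x, y in product(faces, repeat=2)]
)

# Fully dependent: Y = X, so X + Y = 2X and Cov(X, Y) = Var(X).
var_sum_dependent = variance([(x + x, Fraction(1, 6)) for x in faces])

print(var_x, var_sum_independent, var_sum_dependent)  # 35/12 35/6 35/3
```

In the dependent case the result is 4·Var(X), matching Var(X) + Var(X) + 2·Cov(X, X).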


Covariance and Correlation

Covariance measures how two random variables move together:

$$\text{Cov}(X, Y) = \mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$

  • $\text{Cov}(X,Y) > 0$: $X$ and $Y$ tend to be large or small together
  • $\text{Cov}(X,Y) < 0$: when $X$ is large, $Y$ tends to be small
  • $\text{Cov}(X,Y) = 0$: no linear relationship (but they might still be dependent in a nonlinear way)

Note: $\text{Cov}(X,X) = \text{Var}(X)$.

Correlation normalizes covariance to $[-1, 1]$:

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

$\rho = 1$ means perfect positive linear relationship; $\rho = -1$ means perfect negative; $\rho = 0$ means no linear relationship.
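The sketch below illustrates both points on a three-point distribution: a nonlinear relationship (Y = X²) with zero covariance despite full dependence, and a linear one (Y = 2X + 1) with correlation 1. The helpers `cov` and `corr` are hypothetical and assume all listed pairs are equally likely.

```python
import math
from fractions import Fraction

def cov(pairs):
    """Cov(X, Y) = E[XY] - E[X]E[Y] for equally likely (x, y) pairs."""
    n = len(pairs)
    e_x = Fraction(sum(x for x, _ in pairs), n)
    e_y = Fraction(sum(y for _, y in pairs), n)
    e_xy = Fraction(sum(x * y for x, y in pairs), n)
    return e_xy - e_x * e_y

def corr(pairs):
    """Correlation: covariance divided by the product of standard deviations."""
    std_x = math.sqrt(cov([(x, x) for x, _ in pairs]))  # Cov(X, X) = Var(X)
    std_y = math.sqrt(cov([(y, y) for _, y in pairs]))
    return float(cov(pairs)) / (std_x * std_y)

xs = [-1, 0, 1]
nonlinear = [(x, x ** 2) for x in xs]   # Y is determined by X, but not linearly
linear = [(x, 2 * x + 1) for x in xs]   # Y is an increasing linear function of X

print(cov(nonlinear))  # 0
print(corr(linear))    # approximately 1.0 (floating point)
```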


Where These Appear in ML

Loss as expectation. The training objective is an expected loss over the data distribution:

$$\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}}[\ell(f_\theta(x), y)]$$

Stochastic gradient descent approximates $\nabla_\theta \mathcal{L}(\theta)$ with a mini-batch estimate. This estimate is unbiased by linearity of expectation, and for a batch of $N$ independently sampled examples its variance shrinks in proportion to $1/N$.
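A rough simulation of how the batch size affects the estimate. It stands in for per-example gradients with independent draws from a normal distribution with mean 1 and standard deviation 2 (both hypothetical values), so the variance of a batch mean of size N should come out near 4/N while the mean stays near 1.

```python
import random
import statistics

rng = random.Random(0)           # fixed seed (arbitrary) for repeatability
true_mean, true_std = 1.0, 2.0   # hypothetical per-example gradient distribution
n_trials = 2000                  # number of simulated mini-batches per batch size

def batch_mean(n):
    """Average of n independent per-example 'gradients'."""
    return sum(rng.gauss(true_mean, true_std) for _ in range(n)) / n

results = {}
for n in (1, 4, 16, 64):
    means = [batch_mean(n) for _ in range(n_trials)]
    results[n] = (statistics.mean(means), statistics.variance(means))
    est_mean, est_var = results[n]
    # Compare the empirical variance with the predicted true_std**2 / n.
    print(n, round(est_mean, 3), round(est_var, 4), true_std ** 2 / n)
```

Real gradients are vectors and are not normally distributed in general; the scalar normal model here is only a stand-in to show the 1/N scaling.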

Gradient variance. High gradient variance in SGD causes noisy updates and can slow convergence. Larger batches reduce the variance directly, while techniques like gradient clipping, batch normalization, and careful learning rate scheduling are commonly used to limit its effect on training.

Weight initialization. The Xavier/Glorot initialization sets $\text{Var}(W_{ij}) = 2/(n_{\text{in}} + n_{\text{out}})$ to preserve signal variance through layers. The He initialization sets $\text{Var}(W_{ij}) = 2/n_{\text{in}}$ for ReLU networks. Both are derived by asking: what variance should the weights have so that the output variance equals the input variance?
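The variance-preservation idea can be sketched for a single linear unit $y = \sum_i W_i x_i$ with no activation. For independent zero-mean weights and inputs, $\text{Var}(y) = n_{\text{in}}\,\text{Var}(W)\,\text{Var}(x)$, so a weight variance of $1/n_{\text{in}}$ (which is what the Glorot formula gives when $n_{\text{in}} = n_{\text{out}}$) keeps the output variance near the input variance. The layer width, trial count, and unit-variance normal inputs below are assumptions made for the simulation.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed (arbitrary) for repeatability
n_in = 64               # hypothetical fan-in
n_trials = 2000         # number of simulated (weights, input) draws

def output_variance(weight_var):
    """Empirical Var(y) for y = sum_i w_i * x_i with x_i ~ N(0, 1)."""
    w_std = math.sqrt(weight_var)
    outputs = []
    for _ in range(n_trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        w = [rng.gauss(0.0, w_std) for _ in range(n_in)]
        outputs.append(sum(wi * xi for wi, xi in zip(w, x)))
    return statistics.variance(outputs)

# Glorot with n_in == n_out: 2 / (n_in + n_out) = 1 / n_in.
var_preserving = output_variance(2 / (n_in + n_in))

# Unit-variance weights: output variance grows by a factor of about n_in.
var_unscaled = output_variance(1.0)

print(round(var_preserving, 2), round(var_unscaled, 1))
```

The He variant doubles the weight variance to compensate for ReLU zeroing out roughly half of the activations; that case is not simulated here.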

Adam optimizer. Maintains exponential moving averages $m_t \approx \mathbb{E}[g_t]$ (first moment, the mean) and $v_t \approx \mathbb{E}[g_t^2]$ (second moment, the uncentered variance) of the gradient to adapt the learning rate per parameter.
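A minimal sketch of the two running moment estimates for a single scalar parameter, using exponential moving averages with bias correction. The function name `adam_moments` is hypothetical, the decay rates 0.9 and 0.999 are the commonly used defaults, and the parameter update step itself is omitted.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return bias-corrected (m_hat, v_hat) after each gradient in `grads`."""
    m, v = 0.0, 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving averages of g and g^2.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: offsets the zero initialization in early steps.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        history.append((m_hat, v_hat))
    return history

# With a constant gradient g = 2, the estimates should recover
# E[g] = 2 and E[g^2] = 4 (up to floating-point error).
history = adam_moments([2.0] * 10)
m_hat, v_hat = history[-1]
print(m_hat, v_hat)
```

Without the bias correction, both averages would start near zero and underestimate the moments for the first several steps.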