Expected Value, Variance, and Standard Deviation
- Compute the expected value of a discrete or continuous random variable and apply the linearity of expectation to simplify E[aX + bY + c]
- Compute variance using both Var(X) = E[(X−μ)²] and the shortcut Var(X) = E[X²] − (E[X])², and derive standard deviation from variance
- Explain why Var(X + Y) = Var(X) + Var(Y) holds when X and Y are independent (or, more generally, uncorrelated) but not in general, and define covariance and correlation as measures of linear dependence
- Identify where expected value and variance appear in ML: loss as E[ℓ], gradient variance in SGD, and weight initialization schemes derived from variance preservation
Expected Value
The expected value (or expectation, or mean) of a random variable is its probability-weighted average. It answers: if I were to draw many times and average the results, what number would the average converge to?
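Discrete:
$$E[X] = \sum_x x \, P(X = x)$$
Continuous:
$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
where $f$ is the probability density function of $X$.
Example: For a fair die with $P(X = k) = \frac{1}{6}$ for $k = 1, \dots, 6$:
$$E[X] = \sum_{k=1}^{6} k \cdot \frac{1}{6} = \frac{21}{6} = 3.5$$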
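The die example can be checked numerically. The sketch below computes the exact expectation from the probability mass function and compares it with a simulated average; the helper name `expected_value`, the seed, and the number of rolls are illustrative choices, not part of the lesson.

```python
from fractions import Fraction
import random

# Fair six-sided die: each face k = 1..6 has probability 1/6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expected_value(pmf):
    """Probability-weighted average of a discrete distribution {value: probability}."""
    return sum(x * p for x, p in pmf.items())

exact_mean = expected_value(pmf)

# Monte Carlo check: the average of many rolls should approach the exact mean.
rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)

print("exact:", exact_mean, "simulated:", sample_mean)
```

Using `Fraction` keeps the exact result free of floating-point rounding, while the simulation illustrates the "average of many draws" interpretation.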
Expectation of a Function
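For a function $g$, you do not need to first compute the distribution of $g(X)$:
$$E[g(X)] = \sum_x g(x) \, P(X = x)$$
(in the continuous case, $E[g(X)] = \int g(x) \, f(x) \, dx$).
This is the Law of the Unconscious Statistician (LOTUS), useful for computing moments like $E[X^2]$ directly.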
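To see that LOTUS agrees with the long route, the following sketch computes the expectation of a function of a random variable both ways. The distribution (uniform on five integers) and the function `g` are hypothetical examples chosen so that `g` maps several inputs to the same output.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical example: X is uniform on {-2, -1, 0, 1, 2}.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

def g(x):
    return x * x

# LOTUS: weight g(x) by the probabilities of X directly.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Long route: first build the distribution of Y = g(X), then average over Y.
pmf_y = defaultdict(Fraction)  # Fraction() is 0, so missing keys start at 0
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
long_route = sum(y * p for y, p in pmf_y.items())

print(lotus, long_route)
```

Both routes give the same number; LOTUS simply skips the step of merging the inputs that share an output.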
Linearity of Expectation
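This is one of the most useful facts in probability:
$$E[aX + bY + c] = a\,E[X] + b\,E[Y] + c$$
This holds for any two random variables $X$ and $Y$, whether or not they are independent.
Linearity of expectation means you can break apart complex expressions into simpler pieces. For example, the expected loss over a batch of $B$ examples, each drawn from the data distribution with per-example loss $\ell_i$, is:
$$E\left[\frac{1}{B} \sum_{i=1}^{B} \ell_i\right] = \frac{1}{B} \sum_{i=1}^{B} E[\ell_i] = E[\ell]$$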
The per-batch mean is an unbiased estimator of the true expected loss — this justifies training with mini-batches.
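Linearity does not need independence, which a small exact computation can illustrate. In this sketch the second variable is the square of the first, so the two are clearly dependent; the coefficients `a`, `b`, `c` are arbitrary illustrative values.

```python
from fractions import Fraction

p = Fraction(1, 6)      # probability of each face of a fair die
faces = range(1, 7)
a, b, c = 2, 3, 1       # arbitrary coefficients for illustration

# X is the die roll and Y = X^2, so Y is fully determined by X.
e_x = sum(k * p for k in faces)
e_y = sum(k * k * p for k in faces)

# Direct computation of E[aX + bY + c] over the outcomes of the roll.
direct = sum((a * k + b * k * k + c) * p for k in faces)

# Same quantity via linearity of expectation.
via_linearity = a * e_x + b * e_y + c

print(direct, via_linearity)
```

The two values coincide even though the variables are dependent, which is exactly what the identity promises.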
Variance
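Expected value tells you where the distribution is centered. Variance tells you how spread out it is.
$$\mathrm{Var}(X) = E[(X - \mu)^2], \quad \mu = E[X]$$
Variance is the average squared deviation from the mean. By LOTUS it is the expectation of $g(X) = (X - \mu)^2$, and expanding the square with linearity of expectation gives:
$$\mathrm{Var}(X) = E[X^2] - 2\mu\,E[X] + \mu^2 = E[X^2] - (E[X])^2$$
This shortcut is often easier to compute: find $E[X^2]$ and subtract the square of the mean.
Example (fair die):
$$E[X^2] = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6} \approx 15.17$$
$$\mathrm{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{35}{12} \approx 2.92$$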
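The definition and the shortcut can be compared directly for the fair die. This is a minimal sketch with exact arithmetic; the variable names are illustrative.

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die

mu = sum(x * p for x, p in pmf.items())

# Definition: average squared deviation from the mean.
var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Shortcut: second moment minus the squared mean.
second_moment = sum(x ** 2 * p for x, p in pmf.items())
var_shortcut = second_moment - mu ** 2

print(var_definition, var_shortcut)
```

Both expressions evaluate to the same fraction, as the algebra requires.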
Standard Deviation
Variance is in squared units, which is hard to interpret alongside the original scale. The standard deviation returns to the original units:
$$\sigma = \sqrt{\mathrm{Var}(X)}$$
For the die: $\sigma = \sqrt{35/12} \approx 1.71$. Roughly speaking, individual rolls are about 1.71 units away from the mean of 3.5.
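Python's standard `statistics` module can reproduce the die's standard deviation if the six faces are treated as an equally likely population. This sketch uses `pvariance` and `pstdev`, which divide by N and so match the definition used here; `variance` and `stdev` divide by N − 1 and are meant for samples.

```python
import math
import statistics

faces = [1, 2, 3, 4, 5, 6]  # equally likely outcomes of a fair die

variance = statistics.pvariance(faces)  # population variance (divides by N)
sigma = statistics.pstdev(faces)        # population standard deviation

print(variance, sigma, math.sqrt(variance))
```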
Variance of Linear Combinations
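Variance is not linear: squaring introduces cross terms.
$$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$$
Adding a constant $b$ shifts the distribution but does not change its spread; scaling by $a$ scales variance by $a^2$.
For a sum of two random variables, the cross term is a covariance:
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$$
If $X$ and $Y$ are independent: $\mathrm{Cov}(X, Y) = 0$, so
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
Independence is sufficient but not necessary: the same identity holds whenever $X$ and $Y$ are uncorrelated.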
This is why variances add for independent random variables — including independent gradient estimates in mini-batch SGD.
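The rules for linear combinations can be verified exactly with dice. The sketch below checks the affine rule, the independent sum, and a fully dependent sum where the second variable equals the first; the helper `variance` and the constants `a`, `b` are illustrative.

```python
from fractions import Fraction
from itertools import product

def variance(outcomes):
    """Variance of a discrete distribution given as (value, probability) pairs."""
    mean = sum(v * p for v, p in outcomes)
    return sum((v - mean) ** 2 * p for v, p in outcomes)

faces = range(1, 7)
p = Fraction(1, 6)
var_x = variance([(x, p) for x in faces])

# Affine rule: the shift b has no effect, the scale a enters squared.
a, b = 3, 10
var_affine = variance([(a * x + b, p) for x in faces])

# Two independent dice: every pair (x, y) has probability 1/36.
var_sum_indep = variance([(x + y, p * p) for x, y in product(faces, faces)])

# Fully dependent case: Y = X, so X + Y = 2X and the cross term is not zero.
var_sum_dep = variance([(x + x, p) for x in faces])

print(var_x, var_affine, var_sum_indep, var_sum_dep)
```

The independent sum has twice the single-die variance, while the dependent sum has four times it, because the covariance term contributes the difference.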
Covariance and Correlation
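Covariance measures how two random variables move together:
$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$
- $\mathrm{Cov}(X, Y) > 0$: $X$ and $Y$ tend to be large or small together
- $\mathrm{Cov}(X, Y) < 0$: when $X$ is large, $Y$ tends to be small
- $\mathrm{Cov}(X, Y) = 0$: no linear relationship (but they might still be dependent in a nonlinear way)
Note: $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.
Correlation normalizes covariance to $[-1, 1]$:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
$\rho = 1$ means perfect positive linear relationship; $\rho = -1$ means perfect negative; $\rho = 0$ means no linear relationship.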
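A few tiny joint distributions make these cases concrete. In this sketch the first variable is uniform on three values and the second is a function of it; the helper names `moments` and `correlation` and the three example relationships are illustrative assumptions.

```python
from fractions import Fraction
import math

def moments(joint):
    """Return (Cov(X, Y), Var(X), Var(Y)) for a joint pmf {(x, y): probability}."""
    def e(f):
        return sum(f(x, y) * p for (x, y), p in joint.items())
    ex, ey = e(lambda x, y: x), e(lambda x, y: y)
    cov = e(lambda x, y: x * y) - ex * ey
    var_x = e(lambda x, y: x * x) - ex ** 2
    var_y = e(lambda x, y: y * y) - ey ** 2
    return cov, var_x, var_y

def correlation(joint):
    cov, var_x, var_y = moments(joint)
    return float(cov) / math.sqrt(var_x * var_y)

third = Fraction(1, 3)
xs = (-1, 0, 1)  # X is uniform on these three values

linear_up = {(x, 2 * x + 1): third for x in xs}   # Y = 2X + 1
linear_down = {(x, -x): third for x in xs}        # Y = -X
parabola = {(x, x * x): third for x in xs}        # Y = X^2, dependent on X

print(correlation(linear_up), correlation(linear_down), moments(parabola)[0])
```

The parabola case has zero covariance even though the second variable is completely determined by the first, which shows why zero covariance does not imply independence.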
Where These Appear in ML
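Loss as expectation. The training objective is an expected loss over the data distribution:
$$\mathcal{L}(\theta) = E_{(x, y) \sim \mathcal{D}}\left[\ell(f_\theta(x), y)\right]$$
Stochastic gradient descent approximates $\nabla_\theta \mathcal{L}(\theta)$ with a mini-batch estimate. Provided the batch is sampled from the data distribution, this estimate is unbiased by linearity of expectation, and for $B$ independently sampled examples its variance shrinks in proportion to $1/B$.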
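The effect of batch size on a mini-batch estimate can be simulated without any model. In this sketch the per-example "losses" are stand-ins drawn i.i.d. from an exponential distribution with mean 1 and variance 1; that distribution, the batch sizes, and the number of trials are assumptions made purely for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility

def batch_mean(batch_size):
    # Stand-in per-example losses: exponential with mean 1 and variance 1.
    return sum(rng.expovariate(1.0) for _ in range(batch_size)) / batch_size

def summarize(batch_size, trials=5000):
    """Mean and variance of the batch-mean estimator across many batches."""
    means = [batch_mean(batch_size) for _ in range(trials)]
    return statistics.fmean(means), statistics.pvariance(means)

results = {B: summarize(B) for B in (1, 4, 16, 64)}

for B, (mean, var) in results.items():
    print(f"B={B:3d}  mean of estimates={mean:.3f}  variance={var:.4f}  1/B={1 / B:.4f}")
```

Every batch size gives estimates centered near the true mean (unbiasedness), while the spread of those estimates falls roughly as one over the batch size.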
Gradient variance. High gradient variance in SGD causes noisy updates and can slow convergence. Larger batches reduce this variance directly, while techniques like gradient clipping, batch normalization, and careful learning rate scheduling help limit the damage that noisy updates can do.
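Weight initialization. The Xavier/Glorot initialization sets $\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$ to preserve signal variance through layers, where $n_{\text{in}}$ and $n_{\text{out}}$ are the layer's input and output widths. The He initialization sets $\mathrm{Var}(W) = \frac{2}{n_{\text{in}}}$ for ReLU networks. Both are derived by asking: what variance should the weights have so that the output variance equals the input variance?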
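The variance-preservation idea can be checked with a small simulation. This is a simplified sketch, assuming a single linear layer with no activation, zero-mean unit-variance Gaussian inputs, and equal input and output widths, in which case the commonly cited Glorot variance 2 / (n_in + n_out) reduces to 1 / n_in. The layer width, sample count, and function name are illustrative, and the ReLU case that motivates He initialization is not simulated here.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n_in = n_out = 64       # illustrative layer widths

def layer_output_variance(weight_var, n_samples=200):
    """Empirical variance of y = W x for Gaussian W and unit-variance inputs x."""
    w_std = weight_var ** 0.5
    weights = [[rng.gauss(0.0, w_std) for _ in range(n_in)] for _ in range(n_out)]
    outputs = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        outputs.extend(sum(w * xi for w, xi in zip(row, x)) for row in weights)
    return statistics.pvariance(outputs)

glorot_var = 2 / (n_in + n_out)  # equals 1 / n_in when n_in == n_out
var_glorot = layer_output_variance(glorot_var)
var_unit = layer_output_variance(1.0)  # naive choice: unit-variance weights

print(f"Glorot-scaled weights: output variance ~ {var_glorot:.2f}")
print(f"Unit-variance weights: output variance ~ {var_unit:.2f}")
```

With the scaled weights the output variance stays close to the input variance of 1, whereas unit-variance weights inflate it by roughly the layer width, which would compound across layers.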
Adam optimizer. Maintains running estimates $m_t$ (first moment, the mean) and $v_t$ (second moment, the uncentered variance) of the gradient to adapt the learning rate per parameter.
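The two moment estimates can be sketched for a single scalar parameter. The update below follows the standard published Adam rule with its usual default hyperparameters; the function names, the synthetic Gaussian gradients (mean 1, standard deviation 2), and the step count are assumptions for illustration only.

```python
import random

def update_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and of its square."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    return m, v

def bias_corrected(m, v, t, beta1=0.9, beta2=0.999):
    """Undo the bias toward zero caused by initializing m and v at 0."""
    return m / (1 - beta1 ** t), v / (1 - beta2 ** t)

def adam_step(theta, grad, m, v, t, lr=1e-3, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m, v = update_moments(m, v, grad)
    m_hat, v_hat = bias_corrected(m, v, t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# Feed synthetic noisy gradients and inspect what the moment estimates track.
rng = random.Random(0)
m = v = 0.0
n_steps = 20_000
for t in range(1, n_steps + 1):
    grad = rng.gauss(1.0, 2.0)  # E[g] = 1, Var(g) = 4, so E[g^2] = 5
    m, v = update_moments(m, v, grad)
m_hat, v_hat = bias_corrected(m, v, n_steps)

print(f"m_hat ~ {m_hat:.2f} (tracks E[g]), v_hat ~ {v_hat:.2f} (tracks E[g^2])")
```

The second estimate settles near the raw second moment of the gradient rather than its centered variance, which is why it is described as uncentered.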