Prerequisite · Probability Foundations

Importance Sampling

14 min read
By the end of this reading you will be able to:
  • Derive the importance sampling identity E_p[f(x)] = E_q[f(x) w(x)] from a change of measure, where w(x) = p(x)/q(x) are the importance weights
  • Explain when vanilla IS fails (high-variance weights from a poor proposal) and state the self-normalized IS estimator used when the normalizing constant of p is unknown
  • Identify importance sampling in off-policy reinforcement learning and policy gradient algorithms, explaining how the importance weight corrects for the mismatch between behavior and target policies
  • Explain why effective sample size (ESS) quantifies the quality of an IS estimate and state what ESS → 1/N signals about proposal quality

The Problem: Sampling From the Wrong Distribution

Many quantities in ML take the form of an expectation:

Exp[f(x)]=f(x)p(x)dx\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx

For example:

  • Expected reward under a policy: Eτπ[R(τ)]\mathbb{E}_{\tau \sim \pi}[R(\tau)]
  • Expected gradient in policy gradient methods
  • Marginal likelihood in a latent variable model

The standard Monte Carlo estimator draws NN samples xipx_i \sim p and averages:

Ep[f(x)]1Ni=1Nf(xi)\mathbb{E}_p[f(x)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i)

But what if sampling from pp is expensive, impossible, or dangerous?

  • In off-policy RL, we collected data under a behavior policy qq (what we did in the past), but want to evaluate a target policy pp (what we want to do now)
  • The normalizing constant of pp might be unknown (common in Bayesian inference)
  • pp might be a rare-event distribution where direct sampling yields too few hits

Importance sampling solves all of these by sampling from a different distribution qq instead of pp.


The Importance Sampling Identity

The key insight is a change of measure. Multiply and divide by q(x)q(x):

Exp[f(x)]=f(x)p(x)dx=f(x)p(x)q(x)q(x)dx=Exq ⁣[f(x)p(x)q(x)]\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x) \frac{p(x)}{q(x)} q(x)\, dx = \mathbb{E}_{x \sim q}\!\left[f(x) \cdot \frac{p(x)}{q(x)}\right]

Define the importance weight w(x)=p(x)/q(x)w(x) = p(x)/q(x). Then:

Exp[f(x)]=Exq[f(x)w(x)]\boxed{\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}[f(x)\, w(x)]}

This is exact — no approximation has been made. The estimator is:

μ^IS=1Ni=1Nf(xi)w(xi),xiq\hat{\mu}_{\text{IS}} = \frac{1}{N} \sum_{i=1}^N f(x_i)\, w(x_i), \quad x_i \sim q

This estimator is unbiased: E[μ^IS]=Ep[f(x)]\mathbb{E}[\hat{\mu}_{\text{IS}}] = \mathbb{E}_p[f(x)].

Requirement: the proposal qq must have support wherever pp has support — q(x)>0q(x) > 0 whenever p(x)>0p(x) > 0. Otherwise weights become infinite or undefined.


Variance of the IS Estimator

Unbiasedness is necessary but not sufficient — we also need low variance. The variance of μ^IS\hat{\mu}_{\text{IS}} is:

Var(μ^IS)=1NVarq[f(x)w(x)]\text{Var}(\hat{\mu}_{\text{IS}}) = \frac{1}{N} \text{Var}_q[f(x)\, w(x)]

The variance can be very large if w(x)=p(x)/q(x)w(x) = p(x)/q(x) varies wildly. This happens when qq and pp are very different — some samples get enormous weights while most get near-zero weights.

Optimal proposal (zero variance): If q(x)f(x)p(x)q^*(x) \propto |f(x)| p(x), the estimator has zero variance. But this requires knowing the answer — it's a theoretical baseline, not a practical one.

Rule of thumb: The proposal qq should be close to pp (ideally similar shape and scale) and should cover all high-probability regions of f(x)p(x)f(x)\, p(x).


Self-Normalized Importance Sampling

In Bayesian inference, pp is often known only up to a normalizing constant: p(x)=p~(x)/Zp(x) = \tilde{p}(x) / Z where ZZ is intractable. The raw weights w(x)=p~(x)/(Zq(x))w(x) = \tilde{p}(x) / (Z\, q(x)) depend on the unknown ZZ.

Self-normalized IS avoids this by dividing by the sum of weights:

μ^SNIS=i=1Nf(xi)w~(xi)i=1Nw~(xi),w~(xi)=p~(xi)q(xi)\hat{\mu}_{\text{SNIS}} = \frac{\sum_{i=1}^N f(x_i)\, \tilde{w}(x_i)}{\sum_{i=1}^N \tilde{w}(x_i)}, \quad \tilde{w}(x_i) = \frac{\tilde{p}(x_i)}{q(x_i)}

Note: SNIS is biased (the ratio of expectations is not the expectation of the ratio), but the bias decreases as O(1/N)O(1/N) and is often negligible in practice.


Effective Sample Size

How many effective samples does an IS estimate correspond to? The effective sample size (ESS) measures this:

ESS=(i=1Nwi)2i=1Nwi2\text{ESS} = \frac{\left(\sum_{i=1}^N w_i\right)^2}{\sum_{i=1}^N w_i^2}

where wiw_i are the normalized weights (wi=1\sum w_i = 1).

  • ESS=N\text{ESS} = N: all weights are equal (proposal = target). Perfect.
  • ESS=1\text{ESS} = 1: one weight dominates. Effectively only one sample is contributing — almost useless.
  • ESS/N\text{ESS}/N is the efficiency of the IS estimator relative to direct Monte Carlo.

Importance Sampling in Reinforcement Learning

This is the primary reason Spinning Up lists importance sampling as a prerequisite.

Off-Policy Evaluation

You collected NN trajectories τi\tau_i by following a behavior policy q(τ)q(\tau). You want to evaluate the expected return under a different target policy p(τ)p(\tau):

Eτp[R(τ)]=Eτq ⁣[R(τ)p(τ)q(τ)]\mathbb{E}_{\tau \sim p}[R(\tau)] = \mathbb{E}_{\tau \sim q}\!\left[R(\tau) \cdot \frac{p(\tau)}{q(\tau)}\right]

For a trajectory τ=(s0,a0,s1,a1,)\tau = (s_0, a_0, s_1, a_1, \ldots), the importance weight factorizes as a product of per-step ratios:

p(τ)q(τ)=t=0T1π(atst)q(atst)\frac{p(\tau)}{q(\tau)} = \prod_{t=0}^{T-1} \frac{\pi(a_t \mid s_t)}{q(a_t \mid s_t)}

Proximal Policy Optimization (PPO)

PPO uses the IS ratio rt(θ)=πθ(atst)/πθold(atst)r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t) to reuse experience from the old policy:

LCLIP(θ)=Et ⁣[min(rt(θ)At,  clip(rt(θ),1ε,1+ε)At)]\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\left(r_t(\theta)\, A_t,\; \text{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, A_t\right)\right]

The clipping prevents the IS weights from going so large that the optimization step overshoots — a direct consequence of the variance problem with IS weights.

Trust Region Policy Optimization (TRPO)

TRPO explicitly constrains the KL divergence between old and new policy to keep the IS weights close to 1, preventing high-variance updates.