Importance Sampling
- Derive the importance sampling identity E_p[f(x)] = E_q[f(x) w(x)] from a change of measure, where w(x) = p(x)/q(x) are the importance weights
- Explain when vanilla IS fails (high-variance weights from a poor proposal) and state the self-normalized IS estimator used when the normalizing constant of p is unknown
- Identify importance sampling in off-policy reinforcement learning and policy gradient algorithms, explaining how the importance weight corrects for the mismatch between behavior and target policies
- Explain why effective sample size (ESS) quantifies the quality of an IS estimate and state what ESS → 1 (equivalently ESS/N → 1/N) signals about proposal quality
The Problem: Sampling From the Wrong Distribution
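Many quantities in ML take the form of an expectation:
$$\mu = \mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx$$
For example:
- Expected reward under a policy: $J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$
- Expected gradient in policy gradient methods
- Marginal likelihood in a latent variable model
The standard Monte Carlo estimator draws samples $x_1, \dots, x_N \sim p$ and averages:
$$\hat{\mu}_{\text{MC}} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$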
But what if sampling from $p$ is expensive, impossible, or dangerous?
- In off-policy RL, we collected data under a behavior policy (what we did in the past), but want to evaluate a target policy (what we want to do now)
- The normalizing constant of $p$ might be unknown (common in Bayesian inference)
- $p$ might be a rare-event distribution where direct sampling yields too few hits
Importance sampling addresses all of these by sampling from a different distribution $q$ (the proposal) instead of $p$, then reweighting the samples to correct for the mismatch.
The Importance Sampling Identity
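The key insight is a change of measure. Multiply and divide by $q(x)$:
$$\mathbb{E}_p[f(x)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_q\!\left[f(x)\, \frac{p(x)}{q(x)}\right]$$
Define the importance weight $w(x) = p(x)/q(x)$. Then:
$$\mathbb{E}_p[f(x)] = \mathbb{E}_q[f(x)\, w(x)]$$
This is exact: no approximation has been made. The estimator is:
$$\hat{\mu}_{\text{IS}} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, w(x_i), \quad x_i \sim q$$
This estimator is unbiased: $\mathbb{E}_q[\hat{\mu}_{\text{IS}}] = \mathbb{E}_p[f(x)]$.
Requirement: the proposal $q$ must have support wherever $p$ has support, i.e. $q(x) > 0$ whenever $p(x) > 0$. Otherwise regions where $p$ has mass are never sampled (the weight there would be infinite or undefined) and the estimator is biased.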
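As a rough illustration of the estimator above, the following sketch applies it to a toy rare-event problem chosen here for illustration (it is not from the text): estimating $P(X > 4)$ for $X \sim N(0, 1)$ with the proposal $q = N(4, 1)$. The function names and the sample size are assumptions, and only the Python standard library is used.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(f, p_pdf, q_pdf, q_sample, n, rng):
    """Average f(x_i) * p(x_i) / q(x_i) over n draws x_i from the proposal q."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n


rng = random.Random(0)
threshold = 4.0  # illustrative rare-event threshold

estimate = importance_sampling_estimate(
    f=lambda x: 1.0 if x > threshold else 0.0,    # indicator of the rare event
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),      # target p = N(0, 1)
    q_pdf=lambda x: normal_pdf(x, threshold, 1.0),  # proposal q = N(4, 1)
    q_sample=lambda r: r.gauss(threshold, 1.0),
    n=50_000,
    rng=rng,
)

# Closed-form tail probability of a standard normal, for comparison.
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
print(estimate, exact)
```

Direct Monte Carlo with the same budget would see only one or two samples above the threshold on average, whereas the shifted proposal places about half of its samples there.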
Variance of the IS Estimator
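Unbiasedness is necessary but not sufficient; we also need low variance. The variance of $\hat{\mu}_{\text{IS}}$ (for i.i.d. samples from $q$) is:
$$\mathrm{Var}_q[\hat{\mu}_{\text{IS}}] = \frac{1}{N} \mathrm{Var}_q[f(x)\, w(x)] = \frac{1}{N}\left(\mathbb{E}_q[f(x)^2\, w(x)^2] - \mathbb{E}_p[f(x)]^2\right)$$
The variance can be very large if $f(x)\, w(x)$ varies wildly. This happens when $p$ and $q$ are very different: a few samples get enormous weights while most get near-zero weights.
Optimal proposal (zero variance): If $q^*(x) \propto |f(x)|\, p(x)$, the variance is minimized, and it is exactly zero when $f$ does not change sign. But normalizing $q^*$ requires $\int |f(x)|\, p(x)\, dx$, which for nonnegative $f$ is the very quantity being estimated, so it is a theoretical baseline, not a practical one.
Rule of thumb: The proposal $q$ should be close to $p$ (ideally similar shape and scale), should cover all high-probability regions of $p$, and should have tails at least as heavy as those of $p$.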
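The effect of proposal choice on variance can be seen in a small experiment. The sketch below is an illustrative setup assumed here, not taken from the text: it estimates $\mathbb{E}_p[x^2] = 1$ for $p = N(0, 1)$ using a proposal wider than the target ($N(0, 1.5^2)$) and one narrower than the target ($N(0, 0.5^2)$), and compares the empirical variance of the per-sample terms $f(x_i)\, w(x_i)$.

```python
import math
import random
import statistics


def normal_pdf(x, std):
    """Density of a zero-mean normal with the given standard deviation."""
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def weighted_terms(proposal_std, n, rng):
    """Return f(x_i) * w(x_i) with f(x) = x^2, p = N(0, 1), q = N(0, proposal_std^2)."""
    terms = []
    for _ in range(n):
        x = rng.gauss(0.0, proposal_std)
        w = normal_pdf(x, 1.0) / normal_pdf(x, proposal_std)
        terms.append(x * x * w)
    return terms


rng = random.Random(0)
wide = weighted_terms(1.5, 20_000, rng)    # proposal covers the target's tails
narrow = weighted_terms(0.5, 20_000, rng)  # proposal misses the target's tails

wide_mean, wide_var = statistics.fmean(wide), statistics.pvariance(wide)
narrow_mean, narrow_var = statistics.fmean(narrow), statistics.pvariance(narrow)

print("wide proposal:   mean", wide_mean, "per-sample variance", wide_var)
print("narrow proposal: mean", narrow_mean, "per-sample variance", narrow_var)
```

With the narrow proposal the weights grow like $\exp(1.5 x^2)$, so the rare samples in its tails dominate the sum. In this particular setup the true variance is infinite, and the empirical variance keeps growing with the sample size instead of settling.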
Self-Normalized Importance Sampling
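In Bayesian inference, $p$ is often known only up to a normalizing constant: $p(x) = \tilde{p}(x)/Z$, where $Z = \int \tilde{p}(x)\, dx$ is intractable. The raw weights $w(x) = p(x)/q(x)$ depend on the unknown $Z$.
Self-normalized IS avoids this by using the unnormalized weights $\tilde{w}(x) = \tilde{p}(x)/q(x)$ and dividing by their sum, so that $Z$ cancels:
$$\hat{\mu}_{\text{SNIS}} = \frac{\sum_{i=1}^{N} \tilde{w}(x_i)\, f(x_i)}{\sum_{j=1}^{N} \tilde{w}(x_j)}, \quad x_i \sim q$$
Note: SNIS is biased (the ratio of expectations is not the expectation of the ratio), but it is consistent: the bias decreases as $O(1/N)$ and is often negligible in practice.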
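A minimal sketch of self-normalized IS follows, under assumptions made here for illustration: the unnormalized target is $\exp(-(x - 1)^2 / 2)$ (a $N(1, 1)$ density with its constant dropped, so the true mean is 1), and the proposal is $N(0, 2^2)$. The function names are hypothetical.

```python
import math
import random


def unnormalized_target(x):
    """p_tilde(x) = exp(-(x - 1)^2 / 2); the constant Z = sqrt(2 * pi) is never used."""
    return math.exp(-0.5 * (x - 1.0) ** 2)


def proposal_pdf(x, std):
    """Density of the zero-mean normal proposal q."""
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def snis_estimate(f, n, rng, proposal_std=2.0):
    """Self-normalized IS: sum(w_i * f(x_i)) / sum(w_i) with unnormalized weights."""
    weighted_sum = 0.0
    weight_sum = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, proposal_std)
        w = unnormalized_target(x) / proposal_pdf(x, proposal_std)
        weighted_sum += w * f(x)
        weight_sum += w
    return weighted_sum / weight_sum


rng = random.Random(0)
target_mean = snis_estimate(lambda x: x, 50_000, rng)
print(target_mean)  # should land close to the true mean of 1.0
```

Multiplying `unnormalized_target` by any positive constant leaves the estimate unchanged, which is exactly why the unknown $Z$ does not matter.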
Effective Sample Size
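How many effective samples does an IS estimate correspond to? The effective sample size (ESS) measures this:
$$\text{ESS} = \frac{1}{\sum_{i=1}^{N} \bar{w}_i^2}$$
where $\bar{w}_i = w(x_i) / \sum_{j=1}^{N} w(x_j)$ are the normalized weights ($\sum_{i=1}^{N} \bar{w}_i = 1$).
- $\text{ESS} = N$: all weights are equal (proposal = target). Perfect.
- $\text{ESS} \approx 1$: one weight dominates. Effectively only one sample is contributing, so the estimate is almost useless.
- $\text{ESS}/N$ is the approximate efficiency of the IS estimator relative to direct Monte Carlo.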
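The ESS formula is short enough to compute directly. The sketch below uses a hypothetical helper name and made-up weight vectors purely to illustrate the two extreme cases listed above.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum(normalized_w_i^2); the weights need not be normalized beforehand."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


# Equal weights: every sample contributes, so ESS equals N.
ess_equal = effective_sample_size([1.0, 1.0, 1.0, 1.0])
print(ess_equal)  # 4.0

# One dominant weight: ESS collapses to just above 1.
ess_collapsed = effective_sample_size([1000.0, 1e-3, 1e-3, 1e-3])
print(ess_collapsed)
```

Because the function normalizes internally, it works with either raw or unnormalized weights, which makes it a convenient diagnostic to log alongside any IS or SNIS estimate.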
Importance Sampling in Reinforcement Learning
This is the primary reason Spinning Up lists importance sampling as a prerequisite.
Off-Policy Evaluation
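You collected trajectories by following a behavior policy $\pi_b$. You want to evaluate the expected return under a different target policy $\pi$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_b}[w(\tau)\, R(\tau)]$$
For a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$, the importance weight factorizes as a product of per-step ratios, because the initial-state and transition probabilities are the same under both policies and cancel:
$$w(\tau) = \prod_{t=0}^{T-1} \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$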
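The per-step product can be sketched as follows. Everything concrete here is a made-up toy: the tabular policies (dictionaries mapping state to action probabilities), the two trajectories, and their returns are hypothetical, and the function names are not from any library. Log-probabilities are summed to avoid underflow on long trajectories.

```python
import math


def trajectory_weight(trajectory, target_policy, behavior_policy):
    """Product over steps of target(a|s) / behavior(a|s), accumulated in log space."""
    log_weight = 0.0
    for state, action in trajectory:
        target_prob = target_policy[state][action]
        behavior_prob = behavior_policy[state][action]
        if behavior_prob <= 0.0:
            raise ValueError("behavior policy assigns zero probability to an observed action")
        if target_prob == 0.0:
            return 0.0  # the target policy would never produce this trajectory
        log_weight += math.log(target_prob) - math.log(behavior_prob)
    return math.exp(log_weight)


def ordinary_is_value(trajectories, returns, target_policy, behavior_policy):
    """Ordinary (unnormalized) IS estimate of the target policy's expected return."""
    weights = [trajectory_weight(t, target_policy, behavior_policy) for t in trajectories]
    return sum(w * g for w, g in zip(weights, returns)) / len(trajectories)


# Hypothetical two-state, two-action policies.
behavior = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 0.5, "right": 0.5}}
target = {"s0": {"left": 0.9, "right": 0.1}, "s1": {"left": 0.2, "right": 0.8}}

# Hypothetical logged data: (state, action) pairs and one return per trajectory.
trajectories = [
    [("s0", "left"), ("s1", "right")],
    [("s0", "right"), ("s1", "left")],
]
returns = [1.0, 0.0]

first_weight = trajectory_weight(trajectories[0], target, behavior)
value = ordinary_is_value(trajectories, returns, target, behavior)
print(first_weight, value)  # (0.9/0.5)*(0.8/0.5) = 2.88, and (2.88*1 + 0.08*0)/2 = 1.44
```

Because the weight is a product over time steps, its variance tends to grow quickly with trajectory length, which is one reason the methods below keep the new policy close to the old one.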
Proximal Policy Optimization (PPO)
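PPO uses the IS ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ to reuse experience collected under the old policy:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_t\right)\right]$$
where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is a small clipping parameter. The clipping removes the incentive to push $r_t(\theta)$ outside $[1 - \epsilon, 1 + \epsilon]$, so the IS weights cannot grow so large that the optimization step overshoots. This is a direct response to the variance problem with IS weights.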
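To make the role of the ratio concrete, here is a per-sample sketch of a clipped surrogate term in the style of PPO. It is a scalar illustration only, not a training loop: the function name, the default `epsilon=0.2`, and the example inputs are assumptions made here.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single (state, action) sample."""
    ratio = math.exp(log_prob_new - log_prob_old)  # IS ratio pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)

# Ratio 1.5 with a negative advantage: the min keeps the unclipped, more pessimistic term.
uncapped_loss = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)

# Ratio 0.5 with a negative advantage: the term is floored at (1 - eps) * A.
floored = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)

print(capped_gain, uncapped_loss, floored)
```

Taking the minimum makes the objective a pessimistic bound: the ratio is clipped only when moving it further would improve the objective, never when it would make the objective worse.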
Trust Region Policy Optimization (TRPO)
TRPO explicitly constrains the KL divergence between the old and new policies, which keeps the IS weights close to 1 and prevents high-variance updates.
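In its usual presentation, the TRPO update can be written as a constrained problem over the same IS ratio. The notation below is an assumption made here for illustration: $\hat{A}_t$ denotes an advantage estimate and $\delta$ the trust-region size.

$$\max_\theta\ \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right] \quad \text{subject to} \quad \mathbb{E}_t\left[D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta$$

A small $\delta$ keeps the two action distributions close on average, which is what keeps the typical ratio near 1.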