Prerequisite · Probability Foundations

Importance Sampling

14 min read

By the end of this reading you will be able to:

Derive the importance sampling identity E_p[f(x)] = E_q[f(x) w(x)] from a change of measure, where w(x) = p(x)/q(x) are the importance weights
Explain when vanilla IS fails (high-variance weights from a poor proposal) and state the self-normalized IS estimator used when the normalizing constant of p is unknown
Identify importance sampling in off-policy reinforcement learning and policy gradient algorithms, explaining how the importance weight corrects for the mismatch between behavior and target policies
Explain why effective sample size (ESS) quantifies the quality of an IS estimate and state what ESS → 1/N signals about proposal quality

The Problem: Sampling From the Wrong Distribution

Many quantities in ML take the form of an expectation:

$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx$

For example:

Expected reward under a policy: $\mathbb{E}_{\tau \sim \pi}[R(\tau)]$
Expected gradient in policy gradient methods
Marginal likelihood in a latent variable model

The standard Monte Carlo estimator draws $N$ samples $x_i \sim p$ and averages:

$\mathbb{E}_p[f(x)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i)$

But what if sampling from $p$ is expensive, impossible, or dangerous?

In off-policy RL, we collected data under a behavior policy $q$ (what we did in the past), but want to evaluate a target policy $p$ (what we want to do now)
The normalizing constant of $p$ might be unknown (common in Bayesian inference)
$p$ might be a rare-event distribution where direct sampling yields too few hits

Importance sampling solves all of these by sampling from a different distribution $q$ instead of $p$ .

The Importance Sampling Identity

The key insight is a change of measure. Multiply and divide by $q(x)$ :

$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x) \frac{p(x)}{q(x)} q(x)\, dx = \mathbb{E}_{x \sim q}\!\left[f(x) \cdot \frac{p(x)}{q(x)}\right]$

Define the importance weight $w(x) = p(x)/q(x)$ . Then:

$\boxed{\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}[f(x)\, w(x)]}$

This is exact — no approximation has been made. The estimator is:

$\hat{\mu}_{\text{IS}} = \frac{1}{N} \sum_{i=1}^N f(x_i)\, w(x_i), \quad x_i \sim q$

This estimator is unbiased: $\mathbb{E}[\hat{\mu}_{\text{IS}}] = \mathbb{E}_p[f(x)]$ .

Requirement: the proposal $q$ must have support wherever $p$ has support — $q(x) > 0$ whenever $p(x) > 0$ . Otherwise weights become infinite or undefined.

Variance of the IS Estimator

Unbiasedness is necessary but not sufficient — we also need low variance. The variance of $\hat{\mu}_{\text{IS}}$ is:

$\text{Var}(\hat{\mu}_{\text{IS}}) = \frac{1}{N} \text{Var}_q[f(x)\, w(x)]$

The variance can be very large if $w(x) = p(x)/q(x)$ varies wildly. This happens when $q$ and $p$ are very different — some samples get enormous weights while most get near-zero weights.

Optimal proposal (zero variance): If $q^*(x) \propto |f(x)| p(x)$ , the estimator has zero variance. But this requires knowing the answer — it's a theoretical baseline, not a practical one.

Rule of thumb: The proposal $q$ should be close to $p$ (ideally similar shape and scale) and should cover all high-probability regions of $f(x)\, p(x)$ .

Self-Normalized Importance Sampling

In Bayesian inference, $p$ is often known only up to a normalizing constant: $p(x) = \tilde{p}(x) / Z$ where $Z$ is intractable. The raw weights $w(x) = \tilde{p}(x) / (Z\, q(x))$ depend on the unknown $Z$ .

Self-normalized IS avoids this by dividing by the sum of weights:

$\hat{\mu}_{\text{SNIS}} = \frac{\sum_{i=1}^N f(x_i)\, \tilde{w}(x_i)}{\sum_{i=1}^N \tilde{w}(x_i)}, \quad \tilde{w}(x_i) = \frac{\tilde{p}(x_i)}{q(x_i)}$

Note: SNIS is biased (the ratio of expectations is not the expectation of the ratio), but the bias decreases as $O(1/N)$ and is often negligible in practice.

Effective Sample Size

How many effective samples does an IS estimate correspond to? The effective sample size (ESS) measures this:

$\text{ESS} = \frac{\left(\sum_{i=1}^N w_i\right)^2}{\sum_{i=1}^N w_i^2}$

where $w_i$ are the normalized weights ( $\sum w_i = 1$ ).

$\text{ESS} = N$ : all weights are equal (proposal = target). Perfect.
$\text{ESS} = 1$ : one weight dominates. Effectively only one sample is contributing — almost useless.
$\text{ESS}/N$ is the efficiency of the IS estimator relative to direct Monte Carlo.

Importance Sampling in Reinforcement Learning

This is the primary reason Spinning Up lists importance sampling as a prerequisite.

Off-Policy Evaluation

You collected $N$ trajectories $\tau_i$ by following a behavior policy $q(\tau)$ . You want to evaluate the expected return under a different target policy $p(\tau)$ :

$\mathbb{E}_{\tau \sim p}[R(\tau)] = \mathbb{E}_{\tau \sim q}\!\left[R(\tau) \cdot \frac{p(\tau)}{q(\tau)}\right]$

For a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ , the importance weight factorizes as a product of per-step ratios:

$\frac{p(\tau)}{q(\tau)} = \prod_{t=0}^{T-1} \frac{\pi(a_t \mid s_t)}{q(a_t \mid s_t)}$

Proximal Policy Optimization (PPO)

PPO uses the IS ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ to reuse experience from the old policy:

$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\left(r_t(\theta)\, A_t,\; \text{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, A_t\right)\right]$

The clipping prevents the IS weights from going so large that the optimization step overshoots — a direct consequence of the variance problem with IS weights.

Trust Region Policy Optimization (TRPO)

TRPO explicitly constrains the KL divergence between old and new policy to keep the IS weights close to 1, preventing high-variance updates.

References

OpenAI Spinning Up — Prerequisites — The Right Background for Deep RL

Previous Next →

Importance Sampling

The Problem: Sampling From the Wrong Distribution

The Importance Sampling Identity

Variance of the IS Estimator

Self-Normalized Importance Sampling

Effective Sample Size

Importance Sampling in Reinforcement Learning

Off-Policy Evaluation

Proximal Policy Optimization (PPO)

Trust Region Policy Optimization (TRPO)

Privacy Policy

What we collect

What we don't collect

Your choices

Contact