TD3 & Soft Actor-Critic (SAC)
- Describe TD3's three tricks (clipped double-Q, delayed policy updates, target policy smoothing) and explain the failure mode each addresses
- State SAC's entropy-regularized objective and explain how it balances expected return against policy entropy
- Explain the reparameterization trick in SAC and why squashed Gaussian policies are used for bounded action spaces
- Compare TD3 and SAC on stability, sample efficiency, and the role of stochasticity in exploration
Twin Delayed DDPG (TD3)
TD3 addresses DDPG's brittleness through three targeted improvements, each attacking a specific failure mode.
Trick 1: Clipped Double-Q Learning
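Problem: DDPG tends to overestimate Q-values, because the policy is trained to maximize a learned Q-function whose errors are noisy. The policy then exploits these inflated estimates, producing suboptimal behavior.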
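Fix: Learn two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, and use the smaller of their target-network estimates when computing the Bellman target:

$$y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', a'(s'))$$

Here $d$ is the done flag and $a'(s')$ is the action chosen by the target policy at the next state $s'$. Both Q-functions are trained to regress to this shared target. Taking the minimum biases the target toward underestimation: if one Q-function overestimates, the other caps the value that is bootstrapped.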
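As a rough illustration of the target computation, the sketch below works on plain floats for a single transition. The function name `clipped_double_q_target` and its arguments are hypothetical and chosen for this example only; a real implementation would operate on batched tensors from the two target critics.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bellman target shared by both critics (single transition, plain floats).

    q1_next and q2_next are the two target critics' estimates at the next
    state and target action; done is 1.0 for terminal transitions.
    """
    # Bootstrap from the more pessimistic of the two estimates.
    min_q_next = min(q1_next, q2_next)
    return reward + gamma * (1.0 - done) * min_q_next


# One critic is inflated (12.0), the other is not (9.0).
target = clipped_double_q_target(
    reward=1.0, done=0.0, q1_next=12.0, q2_next=9.0, gamma=0.5
)
print(target)  # 1.0 + 0.5 * 9.0 = 5.5
```

The inflated estimate of 12.0 never enters the target, which is the corrective effect described above.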
Trick 2: Delayed Policy Updates
Problem: When the Q-function has high error, policy updates derived from it push the policy in the wrong direction. These bad policy updates then corrupt future Q-estimates — a destabilizing feedback loop.
Fix: Update the policy (and target networks) less frequently than the Q-functions. Default: 1 policy update per 2 Q-function updates (policy_delay=2).
This gives the Q-function time to settle before the policy sees a gradient from it.
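The update schedule can be sketched as a simple counter. This is a minimal illustration, assuming the policy is updated whenever the gradient-step index is a multiple of `policy_delay`; the function `run_updates` is hypothetical and only counts updates rather than performing them.

```python
def run_updates(num_steps, policy_delay=2):
    """Count critic and actor updates over num_steps gradient steps."""
    q_updates = 0
    pi_updates = 0
    for step in range(num_steps):
        q_updates += 1  # both Q-functions are updated on every gradient step
        if step % policy_delay == 0:
            pi_updates += 1  # policy and target networks only every policy_delay-th step
    return q_updates, pi_updates


q_updates, pi_updates = run_updates(num_steps=50, policy_delay=2)
print(q_updates, pi_updates)  # 50 25
```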
Trick 3: Target Policy Smoothing
Problem: The policy can exploit sharp peaks in the Q-function, that is, narrow regions of action space where Q is erroneously high. Because the deterministic policy is trained to output the action that maximizes Q, it gets "stuck" exploiting these artifacts.
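Fix: Add clipped noise to the target actions used in the Bellman update:

$$a'(s') = \text{clip}\big(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon, -c, c),\ a_{\text{low}},\ a_{\text{high}}\big), \quad \epsilon \sim \mathcal{N}(0, \sigma)$$

Here $\mu_{\theta_{\text{targ}}}$ is the target policy, $\sigma$ is the noise scale, $c$ is the clip limit on the noise, and $[a_{\text{low}}, a_{\text{high}}]$ is the valid action range.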
This smooths the Q-function over nearby actions, making it harder to exploit spurious peaks.
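A minimal sketch of the smoothing step for a scalar action follows. The names `smoothed_target_action`, `act_low`, and `act_high` are assumptions made for illustration, and the default values simply mirror the `target_noise` and `noise_clip` settings used in these notes.

```python
import random


def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(mu_target, target_noise=0.2, noise_clip=0.5,
                           act_low=-1.0, act_high=1.0, rng=random):
    """Target action with clipped Gaussian noise (scalar action)."""
    eps = rng.gauss(0.0, target_noise)        # noise with std target_noise
    eps = clip(eps, -noise_clip, noise_clip)  # keep the perturbation small
    return clip(mu_target + eps, act_low, act_high)  # stay in the valid range


rng = random.Random(0)
samples = [smoothed_target_action(0.8, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
```

With a target action of 0.8, every sample falls between roughly 0.3 (the noise clip) and 1.0 (the action bound), so the Q-target is averaged over a small neighborhood of the target action rather than a single point.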
TD3 PyTorch
import gym
from spinup import td3_pytorch as td3

td3(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=100,
    polyak=0.995,        # averaging coefficient for target networks
    pi_lr=1e-3,          # policy learning rate
    q_lr=1e-3,           # Q-function learning rate
    act_noise=0.1,       # std of exploration noise added at action time
    target_noise=0.2,    # σ for target policy smoothing
    noise_clip=0.5,      # c: clip limit on smoothing noise
    policy_delay=2,      # update policy every 2nd Q step
    start_steps=10000,   # uniform-random actions before using the policy
    logger_kwargs=dict(output_dir='/tmp/td3', exp_name='td3')
)
Soft Actor-Critic (SAC)
SAC is an off-policy algorithm that trains a stochastic policy — a significant departure from DDPG and TD3. Rather than just maximizing expected return, SAC optimizes a trade-off between return and policy entropy.
Entropy-Regularized Objective
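The agent seeks to maximize:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(R(s_t, a_t, s_{t+1}) + \alpha H\big(\pi(\cdot \mid s_t)\big)\Big)\right]$$

where $H(\pi(\cdot \mid s_t)) = \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}[-\log \pi(a \mid s_t)]$ is the policy entropy and $\alpha > 0$ is the temperature. A larger $\alpha$ weights entropy more heavily relative to reward; as $\alpha \to 0$ the objective reduces to the standard expected return.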
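To make the trade-off concrete, the sketch below evaluates the discounted sum of reward plus temperature-weighted entropy for a short trajectory. It uses discrete action distributions purely for illustration (SAC itself works with continuous actions and differential entropy), and the helper names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_return(rewards, action_dists, gamma=0.99, alpha=0.2):
    """Sum over t of gamma^t * (r_t + alpha * H(pi(.|s_t)))."""
    total = 0.0
    for t, (r, probs) in enumerate(zip(rewards, action_dists)):
        total += gamma ** t * (r + alpha * entropy(probs))
    return total


rewards = [1.0, 1.0]
greedy = [[1.0, 0.0], [1.0, 0.0]]   # zero entropy at every step
uniform = [[0.5, 0.5], [0.5, 0.5]]  # maximum entropy over two actions

print(entropy_regularized_return(rewards, greedy))   # plain discounted return
print(entropy_regularized_return(rewards, uniform))  # adds alpha * ln(2) per step
```

For the same rewards, the uniform policy scores higher under this objective, so a policy only benefits from becoming more deterministic when the gain in reward outweighs the lost entropy bonus.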
Why entropy regularization?
- More exploration: a high-entropy policy tries many actions, avoiding premature convergence.
- Robustness: spreading probability mass over multiple good actions avoids overfitting to a single mode.
- Better stability: the entropy bonus prevents the policy from becoming deterministic too quickly.
Value Functions under Entropy Regularization
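With entropy bonuses added at every timestep, the value functions change slightly. $V^\pi$ includes the entropy bonus from every timestep, while $Q^\pi$ includes it from every timestep except the first:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(R(s_t, a_t, s_{t+1}) + \alpha H\big(\pi(\cdot \mid s_t)\big)\Big) \,\Big|\, s_0 = s\right]$$

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H\big(\pi(\cdot \mid s_t)\big) \,\Big|\, s_0 = s, a_0 = a\right]$$

so that $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)] + \alpha H(\pi(\cdot \mid s))$.

The Bellman equation becomes:

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P,\, a' \sim \pi}\Big[R(s, a, s') + \gamma \big(Q^\pi(s', a') + \alpha H(\pi(\cdot \mid s'))\big)\Big] = \mathbb{E}_{s' \sim P}\big[R(s, a, s') + \gamma V^\pi(s')\big]$$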
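The relationship between the soft value and the soft Q-function can be checked numerically. The sketch below assumes a discrete action set for simplicity and uses the identity that the soft state value is the expected Q-value plus the temperature times the policy entropy; `soft_value` and `soft_bellman_backup` are hypothetical helper names.

```python
import math


def soft_value(q_values, probs, alpha=0.2):
    """V(s) = E_{a~pi}[Q(s, a) - alpha * log pi(a|s)] for a discrete policy."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)


def soft_bellman_backup(reward, next_q_values, next_probs, gamma=0.99, alpha=0.2):
    """One-sample soft backup for Q(s, a): r + gamma * V(s')."""
    return reward + gamma * soft_value(next_q_values, next_probs, alpha)


q_next = [1.0, 3.0]
pi_next = [0.5, 0.5]

v_next = soft_value(q_next, pi_next, alpha=0.2)
print(v_next)  # expected Q (2.0) plus 0.2 * ln(2)
print(soft_bellman_backup(1.0, q_next, pi_next, gamma=0.5, alpha=0.2))
```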
Learning Q in SAC
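SAC uses the clipped double-Q trick from TD3. The Bellman target is:

$$y(r, s', d) = r + \gamma (1 - d) \Big(\min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s')\Big), \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s')$$

where $\tilde{a}'$ is sampled fresh from the current policy (not from the replay buffer). The $-\alpha \log \pi_\theta(\tilde{a}' \mid s')$ term is a single-sample estimate of the entropy bonus at $s'$.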
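A minimal sketch of this target for a single transition is shown below, using plain floats. The function name `sac_target` and its arguments are assumptions for illustration; `logp_next` stands for the log-probability of the freshly sampled next action under the current policy.

```python
def sac_target(reward, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """SAC Bellman target: clipped double-Q plus a sampled entropy bonus."""
    min_q_next = min(q1_next, q2_next)             # clipped double-Q
    soft_value_next = min_q_next - alpha * logp_next  # -log pi estimates the entropy
    return reward + gamma * (1.0 - done) * soft_value_next


y = sac_target(reward=1.0, done=0.0, q1_next=5.0, q2_next=4.0,
               logp_next=-1.0, gamma=0.5, alpha=0.2)
print(y)  # 1.0 + 0.5 * (4.0 + 0.2), approximately 3.1
```

A less likely next action (more negative `logp_next`) raises the target, which is how the entropy bonus enters the Q-function.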
The Reparameterization Trick
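SAC trains a squashed Gaussian policy: the network outputs a mean $\mu_\theta(s)$ and a log standard deviation $\log \sigma_\theta(s)$, and actions are sampled using:

$$\tilde{a}_\theta(s, \xi) = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\big), \quad \xi \sim \mathcal{N}(0, I)$$

The tanh squashing ensures actions stay in $(-1, 1)$ (or the valid range after rescaling). More importantly, the reparameterization trick separates the randomness ($\xi$) from the parameters ($\theta$), making the expectation over actions differentiable:

$$\mathbb{E}_{a \sim \pi_\theta}\big[Q_\phi(s, a) - \alpha \log \pi_\theta(a \mid s)\big] = \mathbb{E}_{\xi \sim \mathcal{N}(0, I)}\big[Q_\phi(s, \tilde{a}_\theta(s, \xi)) - \alpha \log \pi_\theta(\tilde{a}_\theta(s, \xi) \mid s)\big]$$

Because the right-hand expectation is taken over a distribution that does not depend on $\theta$, the gradient with respect to $\theta$ can be moved inside it and estimated from samples.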
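The sampling step and the matching log-probability can be sketched for a scalar action as follows. This is an illustrative version in plain Python, not a library API: the helper names are hypothetical, and a real implementation would use an autodiff framework so that gradients flow through `mu` and `log_std`. The log-probability needs a change-of-variables correction for the tanh, written here in a numerically stable form.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_log_det(u):
    """log(1 - tanh(u)^2), rewritten as 2 * (log 2 - u - softplus(-2u))."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def sample_squashed_gaussian(mu, log_std, rng=random):
    """Reparameterized sample a = tanh(mu + std * xi) and its log-probability."""
    std = math.exp(log_std)
    xi = rng.gauss(0.0, 1.0)  # noise, independent of the policy parameters
    u = mu + std * xi         # pre-squash sample, a deterministic function of (mu, log_std, xi)
    action = math.tanh(u)     # squashed into (-1, 1)
    # Gaussian log-density of u, using (u - mu) / std == xi.
    logp_u = -0.5 * xi ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables for the tanh squashing.
    logp_action = logp_u - tanh_log_det(u)
    return action, logp_action


rng = random.Random(0)
action, logp = sample_squashed_gaussian(mu=0.0, log_std=-1.0, rng=rng)
print(action, logp)
```

Since the squashing compresses density near the action bounds, omitting the correction term would misstate the entropy that SAC is trying to maximize.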
SAC PyTorch
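import gym
from spinup import sac_pytorch as sac

sac(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=100,
    alpha=0.2,           # entropy temperature (fixed in Spinning Up)
    gamma=0.99,          # discount factor
    polyak=0.995,        # averaging coefficient for target networks
    lr=1e-3,             # shared learning rate for policy and Q-functions
    batch_size=100,
    start_steps=10000,   # uniform-random actions before using the policy
    update_after=1000,   # env steps to collect before gradient updates begin
    update_every=50,     # env steps between rounds of gradient updates
    logger_kwargs=dict(output_dir='/tmp/sac', exp_name='sac')
)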
TD3 vs SAC
| | TD3 | SAC |
|---|---|---|
| Policy type | Deterministic | Stochastic (squashed Gaussian) |
| Exploration | Gaussian noise added externally | Inherent from stochastic policy |
| Q-overestimation fix | Clipped double-Q | Clipped double-Q |
| Target smoothing | Explicit (target policy noise) | Implicit (stochastic policy) |
| Entropy | Not considered | Explicitly maximized |
| Typical performance | Strong | State-of-the-art for continuous control |

Rule of thumb: SAC often matches or outperforms TD3 on standard MuJoCo benchmarks and tends to be less sensitive to hyperparameter choices, which makes it a reasonable default starting point for continuous control tasks.