Deep Reinforcement Learning · Off-Policy Methods & Tooling

TD3 & Soft Actor-Critic (SAC)

By the end of this reading you will be able to:

Describe TD3's three tricks (clipped double-Q, delayed policy updates, target policy smoothing) and explain the failure mode each addresses
State SAC's entropy-regularized objective and explain how it balances expected return against policy entropy
Explain the reparameterization trick in SAC and why squashed Gaussian policies are used for bounded action spaces
Compare TD3 and SAC on stability, sample efficiency, and the role of stochasticity in exploration

Twin Delayed DDPG (TD3)

TD3 addresses DDPG's brittleness through three targeted improvements, each attacking a specific failure mode.

Trick 1: Clipped Double-Q Learning

Problem: DDPG overestimates Q-values. The policy exploits these inflated estimates, producing suboptimal behavior.

Fix: Learn two Q-functions $Q_{\phi_1}$ and $Q_{\phi_2}$, and use the minimum for computing Bellman targets: $y(r,s',d) = r + \gamma(1-d)\min_{i=1,2} Q_{\phi_{i,\text{targ}}}(s', a'(s'))$

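Both Q-functions are trained to regress to this shared target. Because the target uses the smaller of the two estimates, an overestimate in one critic only reaches the target if the other critic overestimates as well. The result is a target biased toward underestimation, which is the less harmful error: the policy does not seek out actions whose value is underestimated.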
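The target above can be sketched for a single transition in plain Python. This is a minimal illustration, not the Spinning Up implementation: the function name `clipped_double_q_target` and the example numbers are hypothetical, and the two target-critic values are passed in as plain floats rather than computed by networks.

```python
def clipped_double_q_target(r, d, q1_targ_next, q2_targ_next, gamma=0.99):
    """Bellman target y = r + gamma * (1 - d) * min(Q1_targ, Q2_targ).

    r: reward; d: done flag (1.0 if s' is terminal, else 0.0);
    q1_targ_next, q2_targ_next: the two target critics evaluated at (s', a'(s')).
    All names here are illustrative, not part of any library API.
    """
    return r + gamma * (1.0 - d) * min(q1_targ_next, q2_targ_next)


# One critic is inflated (12.0) relative to the other (10.0): the target uses 10.0.
y = clipped_double_q_target(r=1.0, d=0.0, q1_targ_next=12.0, q2_targ_next=10.0)

# On a terminal transition the bootstrap term is dropped entirely.
y_terminal = clipped_double_q_target(r=1.0, d=1.0, q1_targ_next=12.0, q2_targ_next=10.0)
```

In a batched implementation the same expression would be applied elementwise to tensors, with the minimum taken per transition.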

Trick 2: Delayed Policy Updates

Problem: When the Q-function has high error, policy updates derived from it push the policy in the wrong direction. These bad policy updates then corrupt future Q-estimates — a destabilizing feedback loop.

Fix: Update the policy (and target networks) less frequently than the Q-functions. Default: 1 policy update per 2 Q-function updates (policy_delay=2).

This gives the Q-function time to settle before the policy sees a gradient from it.
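The update cadence can be illustrated with a small scheduling sketch. It only records which components would be updated at each gradient step; the function `update_schedule` and the component labels are hypothetical, and the choice to update the policy on steps where the step index is divisible by `policy_delay` is an assumption about how the delay is counted.

```python
def update_schedule(num_steps, policy_delay=2):
    """Return, for each gradient step, the list of components updated.

    Critics are updated on every step; the policy and the target networks
    are updated only on every policy_delay-th step.
    """
    schedule = []
    for j in range(num_steps):
        updates = ["q1", "q2"]
        if j % policy_delay == 0:
            updates += ["pi", "targets"]
        schedule.append(updates)
    return schedule


schedule = update_schedule(num_steps=6, policy_delay=2)
num_policy_updates = sum("pi" in step for step in schedule)
num_critic_updates = sum("q1" in step for step in schedule)
```

With six gradient steps and `policy_delay=2`, the critics receive six updates while the policy and targets receive three.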

Trick 3: Target Policy Smoothing

Problem: The policy can exploit sharp peaks in the Q-function, that is, narrow regions where Q is erroneously high. Because the deterministic policy is trained to output the action that maximizes the learned Q, it gets "stuck" exploiting these artifacts.

Fix: Add clipped noise to the target actions used in the Bellman update: $a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon, -c, c), a_{Low}, a_{High}\right), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$

This smooths the Q-function over nearby actions, making it harder to exploit spurious peaks.
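A minimal sketch of the smoothed target action for a one-dimensional action, using only the standard library. The helper names `clip` and `smoothed_target_action` are hypothetical, the action bounds of -1 and 1 are assumed for illustration, and the target policy output is passed in as a plain float.

```python
import random


def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(mu_targ, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0, rng=random):
    """a' = clip(mu_targ + clip(eps, -c, c), a_low, a_high), eps ~ N(0, sigma^2)."""
    eps = rng.gauss(0.0, sigma)
    return clip(mu_targ + clip(eps, -c, c), a_low, a_high)


rng = random.Random(0)  # seeded generator so the sketch is reproducible
samples = [smoothed_target_action(0.9, rng=rng) for _ in range(1000)]
```

With a target action of 0.9, every sample lies between 0.4 (the noise clip) and 1.0 (the action bound), so the Bellman target averages Q over a small neighborhood of the target action instead of a single point.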

TD3 PyTorch

from spinup import td3_pytorch as td3
import gym

td3(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=100,
    polyak=0.995,
    pi_lr=1e-3,
    q_lr=1e-3,
    act_noise=0.1,
    target_noise=0.2,   # σ for target policy smoothing
    noise_clip=0.5,     # c: clip limit on smoothing noise
    policy_delay=2,     # update policy every 2nd Q step
    start_steps=10000,
    logger_kwargs=dict(output_dir='/tmp/td3', exp_name='td3')
)

Soft Actor-Critic (SAC)

SAC is an off-policy algorithm that trains a stochastic policy — a significant departure from DDPG and TD3. Rather than just maximizing expected return, SAC optimizes a trade-off between return and policy entropy.

Entropy-Regularized Objective

The agent seeks to maximize: $\pi^* = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^\infty \gamma^t \left(r_t + \alpha H(\pi(\cdot|s_t))\right)\right]$

where $H(\pi(\cdot|s)) = \mathbb{E}_{a \sim \pi}[-\log \pi(a|s)]$ is the policy entropy and $\alpha > 0$ is the temperature.
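To make the trade-off concrete, the sketch below evaluates the bracketed sum for one short trajectory. It assumes, purely for illustration, a one-dimensional unsquashed Gaussian policy, whose entropy has the closed form $\tfrac{1}{2}\log(2\pi e \sigma^2)$; the function names and the reward and standard-deviation values are hypothetical.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def entropy_regularized_return(rewards, sigmas, alpha=0.2, gamma=0.99):
    """Discounted sum of r_t + alpha * H(pi(.|s_t)) along one trajectory."""
    return sum(
        gamma ** t * (r + alpha * gaussian_entropy(s))
        for t, (r, s) in enumerate(zip(rewards, sigmas))
    )


rewards = [1.0, 1.0, 1.0]
narrow = entropy_regularized_return(rewards, sigmas=[0.05] * 3)  # near-deterministic policy
wide = entropy_regularized_return(rewards, sigmas=[0.5] * 3)     # more random policy
```

For identical rewards, the wider policy scores higher under this objective. The temperature `alpha` sets how much reward the agent is willing to give up for that extra randomness.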

Why entropy regularization?

More exploration: a high-entropy policy tries many actions, avoiding premature convergence.
Robustness: spreading probability mass over multiple good actions avoids overfitting to a single mode.
Better stability: the entropy bonus prevents the policy from becoming deterministic too quickly.

Value Functions under Entropy Regularization

With entropy bonuses added at every timestep, the value functions change slightly: $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^\infty \gamma^t (r_t + \alpha H(\pi(\cdot|s_t))) \mid s_0=s\right]$

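The Bellman equation becomes: $Q^\pi(s,a) = \mathbb{E}_{s' \sim P,\, a' \sim \pi}\left[r + \gamma\left(Q^\pi(s',a') - \alpha \log \pi(a'|s')\right)\right]$

In practice the expectation is approximated with samples: $s'$ comes from the replay buffer and $a'$ is drawn from the current policy.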
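The step from $V^\pi$ to this recursion can be written out explicitly. The derivation below assumes the convention, not stated above but used in the Spinning Up documentation, that $Q^\pi$ includes entropy bonuses from every timestep except the first.

```latex
% Assumed convention: Q^\pi omits the entropy bonus at the first timestep.
\begin{aligned}
V^\pi(s)   &= \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s,a) \right] + \alpha H(\pi(\cdot|s)) \\
           &= \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s,a) - \alpha \log \pi(a|s) \right] \\[4pt]
Q^\pi(s,a) &= \mathbb{E}_{s' \sim P}\left[ r + \gamma V^\pi(s') \right] \\
           &= \mathbb{E}_{s' \sim P,\, a' \sim \pi}\left[ r + \gamma \left( Q^\pi(s',a') - \alpha \log \pi(a'|s') \right) \right]
\end{aligned}
```

The second line uses the definition $H(\pi(\cdot|s)) = \mathbb{E}_{a \sim \pi}[-\log \pi(a|s)]$, which is why the entropy bonus appears as a $-\alpha \log \pi$ term inside the expectation.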

Learning Q in SAC

SAC uses the clipped double-Q trick from TD3. The Bellman target is: $y(r,s',d) = r + \gamma(1-d)\left(\min_{j=1,2} Q_{\phi_{j,\text{targ}}}(s',\tilde{a}') - \alpha \log \pi_\theta(\tilde{a}'|s')\right)$

where $\tilde{a}' \sim \pi_\theta(\cdot|s')$ is sampled fresh from the current policy (not from the replay buffer).
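A single-transition sketch of this target in plain Python follows. The function name `sac_target` and the numbers are hypothetical; the target-critic values and the log-probability of the freshly sampled action are passed in as floats rather than computed by networks.

```python
def sac_target(r, d, q1_targ_next, q2_targ_next, logp_next, alpha=0.2, gamma=0.99):
    """y = r + gamma * (1 - d) * (min(Q1_targ, Q2_targ) - alpha * log pi(a'|s')).

    logp_next is log pi(a'|s') for an action a' sampled from the current
    policy at s', not the action stored in the replay buffer.
    """
    soft_value = min(q1_targ_next, q2_targ_next) - alpha * logp_next
    return r + gamma * (1.0 - d) * soft_value


# Same critic values as a TD3-style target, plus an entropy term.
y_sac = sac_target(r=1.0, d=0.0, q1_targ_next=12.0, q2_targ_next=10.0, logp_next=-1.5)

# With alpha = 0 the entropy term vanishes and only the clipped double-Q part remains.
y_no_entropy = sac_target(1.0, 0.0, 12.0, 10.0, logp_next=-1.5, alpha=0.0)
```

When the sampled action has low log-probability (here -1.5), the $-\alpha \log \pi$ term raises the target, rewarding states where the policy remains spread out.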

The Reparameterization Trick

SAC trains a squashed Gaussian policy — the network outputs mean and log-std, and actions are sampled using: $\tilde{a}_\theta(s, \xi) = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \xi), \quad \xi \sim \mathcal{N}(0, I)$

The $\tanh$ squashing maps every sample into $(-1, 1)$, so actions stay within bounds (or within the valid range after rescaling). More importantly, the reparameterization trick separates the randomness ($\xi$) from the parameters ($\mu_\theta, \sigma_\theta$), making the expectation over actions differentiable: $\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[Q(s,a)] = \mathbb{E}_{\xi}\left[\nabla_\theta Q(s, \tilde{a}_\theta(s,\xi))\right]$
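The sampling rule can be sketched for a one-dimensional action using only the standard library. The function names are hypothetical, and the mean and log-std are passed in as floats instead of network outputs. The sketch also returns the log-probability of the squashed action, which requires the change-of-variables correction $-\log(1 - \tanh^2(u))$. That correction is not covered in the text above and is included here as an assumption about what a full implementation needs.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_gaussian_sample(mu, log_std, xi):
    """Return (action, log_prob) for a 1-D tanh-squashed Gaussian policy.

    xi is standard normal noise drawn by the caller, so the action is a
    deterministic function of (mu, log_std) once xi is fixed.
    """
    std = math.exp(log_std)
    u = mu + std * xi                 # pre-squash sample
    a = math.tanh(u)                  # squashed action, inside (-1, 1)
    # Log-density of u under N(mu, std^2); note (u - mu) / std == xi.
    logp_u = -0.5 * xi ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables for tanh: subtract log(1 - tanh(u)^2),
    # written in the stable form 2 * (log 2 - u - softplus(-2u)).
    logp_a = logp_u - 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return a, logp_a


def d_action_d_mu(mu, log_std, xi):
    """Pathwise derivative da/dmu = 1 - tanh(u)^2 with xi held fixed."""
    u = mu + math.exp(log_std) * xi
    return 1.0 - math.tanh(u) ** 2


rng = random.Random(0)
xi = rng.gauss(0.0, 1.0)
action, logp = squashed_gaussian_sample(mu=0.3, log_std=-0.5, xi=xi)

# Finite-difference check: reuse the same xi and shift mu slightly.
h = 1e-6
a_plus, _ = squashed_gaussian_sample(0.3 + h, -0.5, xi)
a_minus, _ = squashed_gaussian_sample(0.3 - h, -0.5, xi)
fd_grad = (a_plus - a_minus) / (2.0 * h)
analytic_grad = d_action_d_mu(0.3, -0.5, xi)
```

Because the noise is held fixed, the finite-difference slope agrees with the analytic derivative. This is the property that lets an autodiff framework push gradients of $Q$ through the sampled action into the policy parameters.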

SAC PyTorch

from spinup import sac_pytorch as sac
import gym

sac(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=100,
    alpha=0.2,          # entropy temperature (fixed in Spinning Up)
    gamma=0.99,
    polyak=0.995,
    lr=1e-3,
    batch_size=100,
    start_steps=10000,
    update_after=1000,
    update_every=50,
    logger_kwargs=dict(output_dir='/tmp/sac', exp_name='sac')
)

TD3 vs SAC

Aspect	TD3	SAC
Policy type	Deterministic	Stochastic (squashed Gaussian)
Exploration	Gaussian noise added externally	Inherent from stochastic policy
Q-overestimation fix	Clipped double-Q	Clipped double-Q
Target smoothing	Explicit (target policy noise)	Implicit (stochastic policy)
Entropy	Not considered	Explicitly maximized
Typical performance	Strong	Strong; often matches or exceeds TD3 on MuJoCo benchmarks

Rule of thumb: SAC often matches or outperforms TD3 on standard MuJoCo benchmarks and tends to be less sensitive to hyperparameter choices, which makes it a reasonable default starting point for continuous control tasks.

References

Fujimoto et al. 2018 — Addressing Function Approximation Error in Actor-Critic Methods (TD3)

Haarnoja et al. 2018 — Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning

OpenAI Spinning Up — TD3 Documentation

OpenAI Spinning Up — SAC Documentation
