Deep Reinforcement Learning · Off-Policy Methods & Tooling

TD3 & Soft Actor-Critic (SAC)

By the end of this reading you will be able to:
  • Describe TD3's three tricks (clipped double-Q, delayed policy updates, target policy smoothing) and explain the failure mode each addresses
  • State SAC's entropy-regularized objective and explain how it balances expected return against policy entropy
  • Explain the reparameterization trick in SAC and why squashed Gaussian policies are used for bounded action spaces
  • Compare TD3 and SAC on stability, sample efficiency, and the role of stochasticity in exploration

Twin Delayed DDPG (TD3)

TD3 addresses DDPG's brittleness through three targeted improvements, each attacking a specific failure mode.

Trick 1: Clipped Double-Q Learning

Problem: DDPG overestimates Q-values. The policy exploits these inflated estimates, producing suboptimal behavior.

Fix: Learn two Q-functions $Q_{\phi_1}$ and $Q_{\phi_2}$, and use the minimum for computing Bellman targets:

$$y(r,s',d) = r + \gamma(1-d)\min_{i=1,2} Q_{\phi_{i,\text{targ}}}(s', a'(s'))$$

Both Q-functions are trained to regress to this shared target. Taking the minimum does not remove estimation error, but it biases the target toward underestimation rather than overestimation: if one Q-function is too high at some action, the other acts as a corrector.
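The target computation can be sketched in a few lines of NumPy. This is a minimal illustration, not the Spinning Up implementation: the function name `clipped_double_q_target` and the example numbers are made up, and the arrays `q1_targ` and `q2_targ` stand in for the two target networks evaluated at $(s', a'(s'))$.

```python
import numpy as np

def clipped_double_q_target(r, d, q1_targ, q2_targ, gamma=0.99):
    """Bellman target y = r + gamma * (1 - d) * min(Q1_targ, Q2_targ)."""
    r = np.asarray(r, dtype=float)
    d = np.asarray(d, dtype=float)
    # Element-wise minimum over the two target critics
    q_min = np.minimum(q1_targ, q2_targ)
    # (1 - d) removes the bootstrap term on terminal transitions
    return r + gamma * (1.0 - d) * q_min

# Hypothetical batch of three transitions; the last one is terminal (d = 1)
r = np.array([1.0, 0.5, 2.0])
d = np.array([0.0, 0.0, 1.0])
q1 = np.array([10.0, 4.0, 7.0])
q2 = np.array([8.0, 5.0, 9.0])

y = clipped_double_q_target(r, d, q1, q2, gamma=0.9)
print(y)
```

In the first transition the critics disagree (10 vs 8) and the target uses the lower value. In the terminal transition the target reduces to the reward alone.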

Trick 2: Delayed Policy Updates

Problem: When the Q-function has high error, policy updates derived from it push the policy in the wrong direction. These bad policy updates then corrupt future Q-estimates — a destabilizing feedback loop.

Fix: Update the policy (and target networks) less frequently than the Q-functions. Default: 1 policy update per 2 Q-function updates (policy_delay=2).

This gives the Q-function time to settle before the policy sees a gradient from it.
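The update schedule can be sketched as a simple counter loop. This is an assumption-laden skeleton: `run_updates` is a hypothetical helper, and the actual gradient steps are replaced by counters so that only the schedule is visible.

```python
def run_updates(num_steps, policy_delay=2):
    """Count critic and actor updates under a delayed-policy schedule."""
    q_updates = 0
    pi_updates = 0
    for j in range(num_steps):
        # Placeholder: one gradient step on both Q-functions
        q_updates += 1
        if j % policy_delay == 0:
            # Placeholder: one policy gradient step, followed by the
            # target-network update, which runs on the same delayed schedule
            pi_updates += 1
    return q_updates, pi_updates

counts = run_updates(10, policy_delay=2)
print(counts)  # (10, 5)
```

With `policy_delay=2`, ten critic steps produce five policy steps, matching the 1:2 ratio described above.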

Trick 3: Target Policy Smoothing

Problem: The policy can exploit sharp peaks in the Q-function, that is, narrow regions where Q is erroneously high. Because the deterministic policy is trained to output the action that maximizes the learned Q, it can get "stuck" exploiting these artifacts.

Fix: Add clipped noise to the target actions used in the Bellman update:

$$a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon, -c, c),\; a_{Low},\; a_{High}\right), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

This smooths the Q-function over nearby actions, making it harder to exploit spurious peaks.
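A minimal NumPy sketch of the smoothed target action follows. The function name `smoothed_target_action` and the example batch are hypothetical. The defaults `sigma=0.2` and `noise_clip=0.5` mirror the `target_noise` and `noise_clip` values in the TD3 call in this reading. `mu_targ` stands in for the target policy's output $\mu_{\theta_{\text{targ}}}(s')$.

```python
import numpy as np

def smoothed_target_action(mu_targ, act_low, act_high,
                           sigma=0.2, noise_clip=0.5, rng=None):
    """a'(s') = clip(mu_targ + clip(eps, -c, c), a_low, a_high)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gaussian noise with standard deviation sigma, one value per action dim
    eps = rng.normal(0.0, sigma, size=np.shape(mu_targ))
    # Inner clip: bound the perturbation itself to [-c, c]
    eps = np.clip(eps, -noise_clip, noise_clip)
    # Outer clip: keep the perturbed action inside the valid action range
    return np.clip(mu_targ + eps, act_low, act_high)

rng = np.random.default_rng(0)
mu = np.array([[0.9, -0.95],
               [0.0, 0.3]])          # hypothetical target-policy actions
a_next = smoothed_target_action(mu, -1.0, 1.0, rng=rng)
print(a_next)
```

The inner clip limits how far the target action can move from the policy's output. The outer clip guarantees the result is still a legal action.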

TD3 PyTorch

from spinup import td3_pytorch as td3
import gym

td3(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=100,
    polyak=0.995,
    pi_lr=1e-3,
    q_lr=1e-3,
    act_noise=0.1,
    target_noise=0.2,   # σ for target policy smoothing
    noise_clip=0.5,     # c: clip limit on smoothing noise
    policy_delay=2,     # update policy every 2nd Q step
    start_steps=10000,
    logger_kwargs=dict(output_dir='/tmp/td3', exp_name='td3')
)

Soft Actor-Critic (SAC)

SAC is an off-policy algorithm that trains a stochastic policy — a significant departure from DDPG and TD3. Rather than just maximizing expected return, SAC optimizes a trade-off between return and policy entropy.

Entropy-Regularized Objective

The agent seeks to maximize:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^\infty \gamma^t \left(r_t + \alpha H(\pi(\cdot|s_t))\right)\right]$$

where $H(\pi(\cdot|s)) = \mathbb{E}_{a \sim \pi}[-\log \pi(a|s)]$ is the policy entropy and $\alpha > 0$ is the temperature.
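To make the entropy term concrete, the sketch below estimates $H = \mathbb{E}_{a \sim \pi}[-\log \pi(a|s)]$ by Monte Carlo for a one-dimensional Gaussian policy and compares it with the known closed form $\tfrac{1}{2}\log(2\pi e \sigma^2)$. The helper names and the two standard deviations are illustrative choices, not part of SAC itself.

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at a."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2 * np.pi))

def mc_entropy(mu, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of E_{a ~ pi}[-log pi(a)]."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mu, sigma, size=n)
    return float(np.mean(-gaussian_log_prob(a, mu, sigma)))

def closed_form_entropy(sigma):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * sigma^2)."""
    return float(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))

h_narrow = mc_entropy(0.0, 0.1)   # nearly deterministic policy
h_wide = mc_entropy(0.0, 1.0)     # broad, exploratory policy
print(h_narrow, closed_form_entropy(0.1))
print(h_wide, closed_form_entropy(1.0))
```

The wider policy has the larger entropy, so it earns a larger bonus $\alpha H$ at each step. For continuous actions this is a differential entropy, which can be negative for narrow distributions.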

Why entropy regularization?

  • More exploration: a high-entropy policy tries many actions, avoiding premature convergence.
  • Robustness: spreading probability mass over multiple good actions avoids overfitting to a single mode.
  • Better stability: the entropy bonus prevents the policy from becoming deterministic too quickly.

Value Functions under Entropy Regularization

With entropy bonuses added at every timestep, the value functions change slightly:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^\infty \gamma^t \left(r_t + \alpha H(\pi(\cdot|s_t))\right) \,\middle|\, s_0=s\right]$$

The Bellman equation becomes:

$$Q^\pi(s,a) \approx r + \gamma \, \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s',a') - \alpha \log \pi(a'|s')\right]$$

Learning Q in SAC

SAC uses the clipped double-Q trick from TD3. The Bellman target is:

$$y(r,s',d) = r + \gamma(1-d)\left(\min_{j=1,2} Q_{\phi_{j,\text{targ}}}(s',\tilde{a}') - \alpha \log \pi_\theta(\tilde{a}'|s')\right)$$

where $\tilde{a}' \sim \pi_\theta(\cdot|s')$ is sampled fresh from the current policy (not from the replay buffer).
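The SAC target differs from the TD3 target only by the entropy term. A minimal NumPy sketch follows. The function name `sac_target` and the example numbers are hypothetical, and `logp_next` stands in for $\log \pi_\theta(\tilde{a}'|s')$ evaluated at the freshly sampled next action.

```python
import numpy as np

def sac_target(r, d, q1_targ, q2_targ, logp_next, alpha=0.2, gamma=0.99):
    """y = r + gamma * (1 - d) * (min_j Q_j_targ - alpha * log pi(a'|s'))."""
    # Clipped double-Q: element-wise minimum over the two target critics
    q_min = np.minimum(q1_targ, q2_targ)
    # Subtracting alpha * log pi adds an entropy bonus when log pi is negative
    soft_value = q_min - alpha * np.asarray(logp_next, dtype=float)
    return np.asarray(r, dtype=float) + gamma * (1.0 - np.asarray(d, dtype=float)) * soft_value

# Hypothetical batch of two transitions; the second is terminal
r = np.array([1.0, 0.0])
d = np.array([0.0, 1.0])
q1 = np.array([5.0, 3.0])
q2 = np.array([4.0, 6.0])
logp_next = np.array([-1.0, -2.0])

y = sac_target(r, d, q1, q2, logp_next, alpha=0.2, gamma=0.9)
print(y)
```

For the first transition the soft value is $4 - 0.2 \cdot (-1) = 4.2$, so the target is $1 + 0.9 \cdot 4.2 = 4.78$.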

The Reparameterization Trick

SAC trains a squashed Gaussian policy: the network outputs a mean and a log-std, and actions are sampled using:

$$\tilde{a}_\theta(s, \xi) = \tanh\left(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\right), \quad \xi \sim \mathcal{N}(0, I)$$

The $\tanh$ squashing ensures actions stay in $[-1, 1]$ (or the valid range after rescaling). More importantly, the reparameterization trick separates the randomness ($\xi$) from the parameters ($\mu_\theta, \sigma_\theta$), making the expectation over actions differentiable:

$$\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[Q(s,a)] = \mathbb{E}_{\xi}\left[\nabla_\theta Q(s, \tilde{a}_\theta(s,\xi))\right]$$
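The following NumPy sketch illustrates both ideas under stated assumptions. `mu` and `log_std` are plain arrays standing in for the policy network's outputs, and `q_toy` is a hypothetical differentiable stand-in for $Q(s,a)$. The log-probability uses the standard change-of-variables correction for $\tanh$, which this reading does not derive. The pathwise gradient with respect to the mean is checked against finite differences that reuse the same noise $\xi$.

```python
import numpy as np

def sample_squashed_gaussian(mu, log_std, xi):
    """Return a = tanh(mu + sigma * xi) and log pi(a|s) for the given noise xi."""
    sigma = np.exp(log_std)
    u = mu + sigma * xi                      # pre-squash Gaussian sample
    a = np.tanh(u)
    # Gaussian log-density at u, summed over action dims ((u - mu) / sigma == xi)
    logp_u = np.sum(-0.5 * xi ** 2 - log_std - 0.5 * np.log(2 * np.pi), axis=-1)
    # tanh change of variables: subtract sum of log(1 - tanh(u)^2)
    logp_a = logp_u - np.sum(np.log(1.0 - a ** 2 + 1e-6), axis=-1)
    return a, logp_a

def q_toy(a):
    """Hypothetical stand-in for Q(s, a); peaks at a = 0.5 in every dimension."""
    return -np.sum((a - 0.5) ** 2, axis=-1)

def pathwise_grad_mu(mu, log_std, xi):
    """Estimate d/dmu E_xi[Q(a(mu, xi))] by differentiating through the sample."""
    a, _ = sample_squashed_gaussian(mu, log_std, xi)
    dq_da = -2.0 * (a - 0.5)                 # dQ/da for the toy Q
    da_dmu = 1.0 - a ** 2                    # d tanh(u)/du, with du/dmu = 1
    return np.mean(dq_da * da_dmu, axis=0)

rng = np.random.default_rng(0)
xi = rng.standard_normal((5000, 2))          # noise is independent of mu, log_std
mu = np.array([0.1, -0.3])
log_std = np.array([-0.5, -1.0])

a, logp = sample_squashed_gaussian(mu, log_std, xi)
grad = pathwise_grad_mu(mu, log_std, xi)

# Central finite differences on the same noise samples, for comparison
h = 1e-5
fd = np.zeros_like(mu)
for i in range(mu.size):
    step = np.zeros_like(mu)
    step[i] = h
    q_plus = q_toy(sample_squashed_gaussian(mu + step, log_std, xi)[0]).mean()
    q_minus = q_toy(sample_squashed_gaussian(mu - step, log_std, xi)[0]).mean()
    fd[i] = (q_plus - q_minus) / (2 * h)

print(grad, fd)
```

Because $\xi$ is held fixed, the sampled action is a deterministic, differentiable function of the parameters, which is what lets the gradient move inside the expectation. In a PyTorch implementation, autograd would take the place of the hand-written chain rule.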

SAC PyTorch

from spinup import sac_pytorch as sac
import gym

sac(
    env_fn=lambda: gym.make('HalfCheetah-v2'),
    ac_kwargs=dict(hidden_sizes=[256, 256]),
    epochs=100,
    alpha=0.2,          # entropy temperature (fixed in Spinning Up)
    gamma=0.99,
    polyak=0.995,
    lr=1e-3,
    batch_size=100,
    start_steps=10000,
    update_after=1000,
    update_every=50,
    logger_kwargs=dict(output_dir='/tmp/sac', exp_name='sac')
)

TD3 vs SAC

| Aspect | TD3 | SAC |
|---|---|---|
| Policy type | Deterministic | Stochastic (squashed Gaussian) |
| Exploration | Gaussian noise added externally | Inherent from stochastic policy |
| Q-overestimation fix | Clipped double-Q | Clipped double-Q |
| Target smoothing | Explicit (target policy noise) | Implicit (stochastic policy) |
| Entropy | Not considered | Explicitly maximized |
| Typical performance | Strong | State-of-the-art for continuous control |

Rule of thumb: SAC generally outperforms TD3 on standard MuJoCo benchmarks and is more robust to hyperparameter choices, making it the preferred starting point for continuous control tasks.