Shrinkage & Threshold Functions

By the end of this reading you will be able to:
  • Explain the shrinkage principle: setting small-magnitude inputs to zero while shifting larger inputs toward zero by a threshold
  • Distinguish Hardshrink (hard threshold, discontinuous) from Softshrink (soft threshold, continuous) in terms of their effect on near-zero inputs
  • Apply Softplus as a smooth, differentiable approximation to ReLU and explain when differentiability at zero matters for gradient-based optimisation
  • Connect Softshrink to the proximal operator of the L1 norm and explain its role in sparse signal recovery

The Shrinkage Principle

Shrinkage functions selectively suppress activations based on magnitude. They are connected to sparsity-inducing regularization: Hardshrink corresponds to hard thresholding, Softshrink is the proximal operator of the L1 norm (appearing in LASSO), and Tanhshrink is a smooth alternative. These functions are most useful in sparse autoencoders, compressed sensing models, and signal denoising networks.

Hardshrink

$$f(x) = \begin{cases} x & |x| > \lambda \\ 0 & |x| \leq \lambda \end{cases}$$

Hardshrink zeroes out all values within $[-\lambda, \lambda]$ and passes values outside this band through unchanged. The output range is $(-\infty, -\lambda) \cup \{0\} \cup (\lambda, \infty)$: no nonzero output has magnitude at or below $\lambda$, which leaves a gap around the origin.

Gradient: $f'(x) = 1$ for $|x| > \lambda$, $f'(x) = 0$ for $|x| < \lambda$. The function jumps at $x = \pm\lambda$, so it is not differentiable there. This jump discontinuity can cause gradient issues, but the sparsity it induces is exact: activations either survive unchanged or are fully zeroed.
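To make the jump at the threshold concrete, the following plain-Python sketch evaluates Hardshrink just inside, exactly on, and just outside the boundary. The helper name `hardshrink` and the probe offset `eps` are illustrative choices, not part of any library.

```python
def hardshrink(x: float, lambd: float = 0.5) -> float:
    """Pass x through unchanged if |x| > lambd, otherwise return 0."""
    return x if abs(x) > lambd else 0.0

lambd = 0.5
eps = 1e-6  # small probe offset (illustrative)

inside = hardshrink(lambd - eps, lambd)    # just inside the band
at_edge = hardshrink(lambd, lambd)         # exactly on the boundary
outside = hardshrink(lambd + eps, lambd)   # just outside the band

jump = outside - inside
print(inside, at_edge, outside)
print(f"jump across the threshold: {jump:.6f}")
```

The output moves from 0 to roughly `lambd` over an arbitrarily small change in the input, which is the discontinuity described above.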

Use case: Sparse coding, denoising autoencoders where you want exactly zero activations for small features.

PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2., -0.3, 0., 0.3, 2.])
print(nn.Hardshrink(lambd=0.5)(x))   # tensor([-2., 0., 0., 0., 2.])
print(F.hardshrink(x, lambd=0.5))    # identical

TensorFlow:

import tensorflow as tf
# No built-in Hardshrink; implement with tf.where
x = tf.constant([-2., -0.3, 0., 0.3, 2.])
lambd = 0.5
hardshrink = lambda x, l: tf.where(tf.abs(x) > l, x, tf.zeros_like(x))
print(hardshrink(x, lambd))   # [-2.  0.  0.  0.  2.]

Softshrink — Soft Thresholding

$$f(x) = \text{sign}(x) \cdot \max(0, |x| - \lambda) = \begin{cases} x - \lambda & x > \lambda \\ 0 & |x| \leq \lambda \\ x + \lambda & x < -\lambda \end{cases}$$

Softshrink is the proximal operator of the L1 norm: minimizing $\frac{1}{2}\|u - x\|^2 + \lambda\|u\|_1$ with respect to $u$ yields $u = \text{Softshrink}(x, \lambda)$. This connects directly to LASSO regression and compressed sensing.
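The proximal-operator claim can be checked numerically in one dimension. The sketch below is a brute-force check, not how the operator is computed in practice: it minimizes $\frac{1}{2}(u - x)^2 + \lambda|u|$ over a fine grid of $u$ and compares the minimizer with the closed-form soft threshold. The function names, grid range, and grid resolution are illustrative assumptions.

```python
def softshrink(x: float, lambd: float) -> float:
    """Closed-form soft threshold: sign(x) * max(0, |x| - lambd)."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0

def prox_l1_grid(x: float, lambd: float, lo: float = -3.0,
                 hi: float = 3.0, steps: int = 6000) -> float:
    """Brute-force argmin of 0.5*(u - x)**2 + lambd*|u| over a uniform grid."""
    best_u, best_val = lo, float("inf")
    for i in range(steps + 1):
        u = lo + (hi - lo) * i / steps
        val = 0.5 * (u - x) ** 2 + lambd * abs(u)
        if val < best_val:
            best_u, best_val = u, val
    return best_u

lambd = 0.5
for x in [-2.0, -0.3, 0.0, 0.3, 2.0]:
    grid_min = prox_l1_grid(x, lambd)
    closed = softshrink(x, lambd)
    print(f"x={x:+.1f}  grid argmin={grid_min:+.3f}  softshrink={closed:+.3f}")
```

Up to the grid spacing, the two columns agree: inputs inside the band are pulled to exactly zero, and inputs outside it are moved toward zero by `lambd`.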

Gradient: $f'(x) = 1$ for $|x| > \lambda$, $0$ for $|x| < \lambda$. Non-differentiable at $\pm\lambda$.

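Difference from Hardshrink: Hardshrink passes values as-is when their magnitude exceeds $\lambda$. Softshrink also shifts them toward zero by $\lambda$, so every surviving value is smaller in magnitude. That shift makes Softshrink continuous at $\pm\lambda$, at the cost of a constant bias of $\lambda$ on large inputs that Hardshrink does not have.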
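A side-by-side evaluation makes the difference visible. In this plain-Python sketch (the helper names and input values are illustrative), both functions zero the same small inputs, but only the soft threshold reduces the magnitude of the survivors.

```python
def hardshrink(x: float, lambd: float = 0.5) -> float:
    """Keep x unchanged outside the band [-lambd, lambd], else 0."""
    return x if abs(x) > lambd else 0.0

def softshrink(x: float, lambd: float = 0.5) -> float:
    """Move x toward zero by lambd outside the band, else 0."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0

xs = [-2.0, -0.6, -0.3, 0.0, 0.3, 0.6, 2.0]
hard = [hardshrink(x) for x in xs]
soft = [softshrink(x) for x in xs]

for x, h, s in zip(xs, hard, soft):
    print(f"x={x:+.1f}  hard={h:+.2f}  soft={s:+.2f}")
```

Inputs just outside the band, such as 0.6, show the contrast most clearly: the hard threshold returns them at full size, while the soft threshold returns a value close to zero.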

PyTorch:

import torch
import torch.nn as nn
x = torch.tensor([-2., -0.3, 0., 0.3, 2.])
print(nn.Softshrink(lambd=0.5)(x))   # tensor([-1.5,  0.0,  0.0,  0.0,  1.5])
# Values that survive are shifted toward zero by lambd

TensorFlow:

import tensorflow as tf
# No built-in Softshrink; implement as proximal operator of L1
x = tf.constant([-2., -0.3, 0., 0.3, 2.])
lambd = 0.5
softshrink = lambda x, l: tf.math.sign(x) * tf.nn.relu(tf.abs(x) - l)
print(softshrink(x, lambd))   # [-1.5  0.   0.   0.   1.5]
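Because Softshrink is the proximal operator of the L1 norm, it can serve as the inner step of proximal-gradient solvers for LASSO-type problems. The sketch below uses ISTA (iterative shrinkage-thresholding), a standard algorithm that this reading does not otherwise cover, on a tiny hand-made problem in plain Python. The matrix `A`, the signal `u_true`, the value of `lam`, and the iteration count are made-up illustration values. The step size is a conservative assumption: the squared Frobenius norm of `A` upper-bounds the Lipschitz constant of the gradient.

```python
def softshrink(x, lambd):
    """Soft threshold: sign(x) * max(0, |x| - lambd)."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def objective(A, y, u, lam):
    """0.5 * ||A u - y||^2 + lam * ||u||_1"""
    r = [p - q for p, q in zip(matvec(A, u), y)]
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(ui) for ui in u)

# Made-up 3x4 problem: fewer measurements than unknowns, sparse ground truth.
A = [[1.0, 0.5, 0.0, 0.2],
     [0.0, 1.0, 0.5, 0.0],
     [0.3, 0.0, 1.0, 0.5]]
u_true = [2.0, 0.0, 0.0, -1.0]
y = matvec(A, u_true)
lam = 0.1

# 1 / ||A||_F^2 is no larger than 1 / L, with L the largest eigenvalue of A^T A.
step = 1.0 / sum(a * a for row in A for a in row)

At = transpose(A)
u = [0.0] * 4
history = [objective(A, y, u, lam)]
for _ in range(500):
    residual = [p - q for p, q in zip(matvec(A, u), y)]
    grad = matvec(At, residual)                   # gradient of the smooth term
    u = [softshrink(ui - step * gi, step * lam)   # proximal (Softshrink) step
         for ui, gi in zip(u, grad)]
    history.append(objective(A, y, u, lam))

print("estimate:", [round(ui, 3) for ui in u])
print("objective: first", round(history[0], 4), "last", round(history[-1], 4))
```

With a step size no larger than $1/L$, ISTA does not increase the objective from one iteration to the next, which the recorded `history` lets you verify. Whether the estimate matches `u_true` depends on the problem; this toy setup is not meant as a recovery guarantee.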

Tanhshrink

$$f(x) = x - \tanh(x)$$

Tanhshrink computes the residual between the identity and Tanh. Since $\tanh(x) \approx x$ near the origin, $f(x) \approx 0$ for small $x$: the function shrinks small values without a hard threshold. For large $|x|$, $\tanh(x) \to \pm 1$, so $f(x) \approx x \mp 1$.

Gradient: $f'(x) = 1 - \tanh'(x) = 1 - (1 - \tanh^2(x)) = \tanh^2(x)$

Smooth everywhere; the gradient always lies in $[0, 1)$ and vanishes only at $x = 0$, so there is no dead zone of finite width. Shrinkage is strongest near the origin, where $f(x) \approx x^3/3$.
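The gradient formula can be sanity-checked with a central finite difference. This is a minimal plain-Python sketch; the function names and the step `h` are illustrative choices.

```python
import math

def tanhshrink(x: float) -> float:
    """x - tanh(x)"""
    return x - math.tanh(x)

def tanhshrink_grad(x: float) -> float:
    """Analytic derivative: tanh(x)**2."""
    return math.tanh(x) ** 2

h = 1e-5  # finite-difference step (illustrative)
for x in [-2.0, -1.0, 0.0, 0.1, 1.0, 2.0]:
    numeric = (tanhshrink(x + h) - tanhshrink(x - h)) / (2 * h)
    analytic = tanhshrink_grad(x)
    print(f"x={x:+.1f}  numeric={numeric:.6f}  analytic={analytic:.6f}")
```

The two columns agree to within the finite-difference error, and the gradient is zero only at the origin.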

PyTorch:

import torch
import torch.nn as nn
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.Tanhshrink()(x))   # tensor([-1.0360, -0.2384,  0.0000,  0.2384,  1.0360])
# = x - tanh(x); smooth; no hard threshold

TensorFlow:

import tensorflow as tf
x = tf.constant([-2., -1., 0., 1., 2.])
tanhshrink = lambda x: x - tf.math.tanh(x)
print(tanhshrink(x))   # [-1.0360 -0.2384  0.      0.2384  1.0360]

Threshold

$$f(x) = \begin{cases} x & x > \text{threshold} \\ \text{value} & x \leq \text{threshold} \end{cases}$$

Threshold is a one-sided cutoff: values above the threshold pass through unchanged; values at or below it are replaced with a specified constant. Unlike Hardshrink, it tests $x$ rather than $|x|$, and the replacement value can be anything (not necessarily 0).

Example: nn.Threshold(0.1, 20) replaces all $x \leq 0.1$ with 20. This is an unusual but valid setup, for instance for binary indicator logic.

Gradient: 1 for $x > \text{threshold}$, 0 otherwise. The function is never differentiable at the threshold itself.
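The definition translates directly into plain Python. This sketch uses a hypothetical helper `threshold_fn` (not a library function) to reproduce both the zero-replacement case and the replace-with-20 configuration mentioned above.

```python
def threshold_fn(x: float, threshold: float, value: float) -> float:
    """Return x if x > threshold, otherwise the constant `value`."""
    return x if x > threshold else value

xs = [-1.0, 0.0, 0.1, 0.5, 2.0]

zeroed = [threshold_fn(x, 0.1, 0.0) for x in xs]
print(zeroed)      # [0.0, 0.0, 0.0, 0.5, 2.0]

replaced = [threshold_fn(x, 0.1, 20.0) for x in xs]
print(replaced)    # [20.0, 20.0, 20.0, 0.5, 2.0]
```

Note that the input equal to the threshold (0.1) is replaced, and so is the large negative input: the comparison is on $x$ itself, not on its magnitude.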

PyTorch:

import torch
import torch.nn as nn
x = torch.tensor([-1., 0., 0.1, 0.5, 2.])
# Values <= 0.1 replaced with 0.0
print(nn.Threshold(threshold=0.1, value=0.0)(x))   # tensor([0.0, 0.0, 0.0, 0.5, 2.0])

TensorFlow:

import tensorflow as tf
x = tf.constant([-1., 0., 0.1, 0.5, 2.])
threshold, value = 0.1, 0.0
print(tf.where(x > threshold, x, tf.fill(tf.shape(x), value)))
# [0.  0.  0.  0.5 2. ]

Softplus — Smooth ReLU

$$f(x) = \frac{1}{\beta} \log\!\left(1 + e^{\beta x}\right)$$

Softplus is a smooth, always-positive approximation of ReLU. As $\beta \to \infty$, Softplus converges to ReLU. The default $\beta = 1$ gives a smooth curve that passes through $(0, \ln 2)$ and grows linearly for large $x$.
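The convergence to ReLU can be quantified: the gap between the two is $\frac{1}{\beta}\log(1 + e^{-\beta|x|})$, which peaks at $x = 0$ with value $\ln 2 / \beta$. The plain-Python sketch below measures that gap on a grid for several $\beta$ values; the function names, the grid, and the chosen $\beta$ values are illustrative.

```python
import math

def softplus(x: float, beta: float = 1.0) -> float:
    """(1/beta) * log(1 + exp(beta * x)), written to avoid overflow."""
    z = beta * x
    # Identity used: log(1 + e^z) = max(z, 0) + log(1 + e^(-|z|))
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta

def relu(x: float) -> float:
    return max(x, 0.0)

xs = [i / 10 for i in range(-50, 51)]  # grid from -5.0 to 5.0, includes 0.0
gaps = {}
for beta in [1.0, 5.0, 25.0, 100.0]:
    gaps[beta] = max(softplus(x, beta) - relu(x) for x in xs)
    print(f"beta={beta:>5}: max gap={gaps[beta]:.5f}  ln2/beta={math.log(2) / beta:.5f}")
```

The maximum gap shrinks in proportion to $1/\beta$, so larger $\beta$ gives a sharper corner at the origin.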

Always positive: Unlike ReLU, Softplus is strictly $> 0$ for all $x$. This makes it well suited to network outputs that must be positive, such as:

  • Variance parameters in VAEs: var = F.softplus(raw_var)
  • Scale parameters in normalizing flows
  • Poisson rate parameters in count models

Note on threshold parameter: When $\beta x > \text{threshold}$ (default 20), PyTorch reverts to the linear function $f(x) = x$ for numerical stability. The test is one-sided: large negative inputs still go through the logarithmic formula, where the output simply approaches 0.
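The linear fallback costs almost nothing in accuracy, because $\log(1 + e^{z}) - z = \log(1 + e^{-z})$, which is already tiny at $z = 20$. The sketch below mimics the behavior in plain Python; `softplus_with_threshold` is a hypothetical helper written for illustration, not PyTorch's actual implementation.

```python
import math

def softplus_with_threshold(x: float, beta: float = 1.0,
                            threshold: float = 20.0) -> float:
    """Softplus that switches to the identity once beta * x exceeds threshold."""
    z = beta * x
    if z > threshold:
        return x  # linear regime: avoids evaluating exp(z) for large z
    return math.log1p(math.exp(z)) / beta

# Error of the linear approximation right at the switch point z = 20
exact_at_20 = math.log1p(math.exp(20.0))
approx_error = exact_at_20 - 20.0
print(f"exact - linear at z=20: {approx_error:.3e}")

print(softplus_with_threshold(25.0))     # 25.0
print(softplus_with_threshold(1000.0))   # 1000.0; math.exp(1000.0) alone would overflow
print(softplus_with_threshold(-1000.0))  # 0.0; exp underflows harmlessly
```

The approximation error at the switch point is on the order of $10^{-9}$, well below float32 resolution for values near 20.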

Gradient: $f'(x) = \sigma(\beta x)$. The gradient of Softplus is exactly the sigmoid of $\beta x$.
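As a quick numerical check of this identity, the following plain-Python sketch compares a central finite difference of Softplus with the sigmoid for a non-default $\beta$. The helper names, the step `h`, and the value `beta = 2.0` are illustrative assumptions.

```python
import math

def softplus(x: float, beta: float = 1.0) -> float:
    """Overflow-safe (1/beta) * log(1 + exp(beta * x))."""
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

h = 1e-5     # finite-difference step (illustrative)
beta = 2.0   # non-default beta to exercise the scaling
errors = []
for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    numeric = (softplus(x + h, beta) - softplus(x - h, beta)) / (2 * h)
    analytic = sigmoid(beta * x)
    errors.append(abs(numeric - analytic))
    print(f"x={x:+.1f}  numeric={numeric:.6f}  sigmoid(beta*x)={analytic:.6f}")
```

Because the sigmoid is strictly positive, the gradient never reaches exactly zero, in contrast to ReLU on negative inputs.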

PyTorch:

import torch
import torch.nn as nn
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.Softplus(beta=1)(x))   # tensor([0.1269, 0.3133, 0.6931, 1.3133, 2.1269])
# Always positive; approaches ReLU as beta → ∞
# Use for variance outputs: var = F.softplus(raw_var)

TensorFlow:

import tensorflow as tf
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.softplus(x))   # [0.1269 0.3133 0.6931 1.3133 2.1269]
print(tf.nn.softplus(x))                  # identical

Comparison Table

| Function | Formula | Gradient at small $x$ | Smoothness |
|---|---|---|---|
| Hardshrink | $x$ if $\lvert x \rvert > \lambda$, else $0$ | $0$ | Discontinuous at $\pm\lambda$ |
| Softshrink | $\text{sign}(x)\max(0, \lvert x \rvert - \lambda)$ | $0$ | Continuous, with kinks at $\pm\lambda$ |
| Tanhshrink | $x - \tanh(x)$ | $\tanh^2(x) \to 0$ | Smooth |
| Threshold | $x$ if $x > t$, else $v$ | $0$ for $x \leq t$ | Discontinuous at $t$ (unless $v = t$) |
| Softplus | $(1/\beta)\log(1+e^{\beta x})$ | $\sigma(\beta x) > 0$ | Smooth |