Supplement · Activation Functions

Shrinkage & Threshold Functions

By the end of this reading you will be able to:

Explain the shrinkage principle: setting small-magnitude inputs to zero while shifting larger inputs toward zero by a threshold
Distinguish Hardshrink (hard threshold, discontinuous) from Softshrink (soft threshold, continuous) in terms of their effect on near-zero inputs
Apply Softplus as a smooth differentiable approximation to ReLU and explain when differentiability at zero matters for gradient-based optimisation
Connect Softshrink to the proximal operator of the L1 norm and explain its role in sparse signal recovery

The Shrinkage Principle

Shrinkage functions selectively suppress activations based on magnitude. They are connected to sparsity-inducing regularization: Hardshrink corresponds to hard thresholding, Softshrink is the proximal operator of the L1 norm (appearing in LASSO), and Tanhshrink is a smooth alternative. These functions are most useful in sparse autoencoders, compressed sensing models, and signal denoising networks.

Hardshrink

$f(x) = \begin{cases} x & |x| > \lambda \\ 0 & |x| \leq \lambda \end{cases}$

Hardshrink zeroes out all values within $[-\lambda, \lambda]$ and passes values outside this band through unchanged. The output range is $(-\infty, -\lambda) \cup \{0\} \cup (\lambda, \infty)$: apart from exact zeros, no output has magnitude at or below $\lambda$.

Gradient: $f'(x) = 1$ for $|x| > \lambda$, $f'(x) = 0$ for $|x| < \lambda$. The function jumps at $x = \pm\lambda$, so it is neither continuous nor differentiable there. Inputs inside the band receive zero gradient, which can stall learning for those units, but the sparsity is exact: an activation either survives unchanged or is set to exactly zero.
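To make the jump at $\pm\lambda$ concrete, the following framework-free sketch evaluates Hardshrink just inside and just outside the threshold. The helper name `hardshrink` and the step size `eps` are illustrative choices, not part of any library API.

```python
def hardshrink(x: float, lambd: float = 0.5) -> float:
    """Pass x through unchanged if |x| > lambd, otherwise return 0."""
    return x if abs(x) > lambd else 0.0


lambd = 0.5
eps = 1e-6  # illustrative step size around the threshold

inside = hardshrink(lambd - eps, lambd)    # still inside the zeroed band
outside = hardshrink(lambd + eps, lambd)   # just past the threshold
jump = outside - inside                    # close to lambd, not close to 0

print(inside, outside, jump)
```

An arbitrarily small change in the input moves the output by about $\lambda$, which is the discontinuity described above.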

Use case: Sparse coding, denoising autoencoders where you want exactly zero activations for small features.

PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2., -0.3, 0., 0.3, 2.])
print(nn.Hardshrink(lambd=0.5)(x))   # tensor([-2., 0., 0., 0., 2.])
print(F.hardshrink(x, lambd=0.5))    # identical

TensorFlow:

import tensorflow as tf

# tf.nn has no hardshrink op; implement with tf.where
def hardshrink(x, lambd=0.5):
    return tf.where(tf.abs(x) > lambd, x, tf.zeros_like(x))

x = tf.constant([-2., -0.3, 0., 0.3, 2.])
print(hardshrink(x, lambd=0.5))   # [-2.  0.  0.  0.  2.]

Softshrink — Soft Thresholding

$f(x) = \text{sign}(x) \cdot \max(0, |x| - \lambda) = \begin{cases} x - \lambda & x > \lambda \\ 0 & |x| \leq \lambda \\ x + \lambda & x < -\lambda \end{cases}$

Softshrink is the proximal operator of the L1 norm: minimizing $\frac{1}{2}\|u - x\|^2 + \lambda\|u\|_1$ with respect to $u$ yields $u = \text{Softshrink}(x, \lambda)$ . This connects directly to LASSO regression and compressed sensing.
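Because the objective above separates across coordinates, the claim can be checked one scalar at a time. The sketch below compares the closed-form soft threshold with a brute-force grid minimiser of $\frac{1}{2}(u - x)^2 + \lambda|u|$. The function names, grid bounds, and grid resolution are assumptions made for illustration only.

```python
def softshrink(x: float, lambd: float) -> float:
    """Closed-form soft threshold: sign(x) * max(0, |x| - lambd)."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0


def prox_objective(u: float, x: float, lambd: float) -> float:
    """Scalar version of 0.5 * (u - x)^2 + lambd * |u|."""
    return 0.5 * (u - x) ** 2 + lambd * abs(u)


def argmin_on_grid(x: float, lambd: float, lo: float = -3.0,
                   hi: float = 3.0, steps: int = 6001) -> float:
    """Brute-force minimiser over an evenly spaced grid (spacing 0.001 here)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda u: prox_objective(u, x, lambd))


lambd = 0.5
inputs = [-2.0, -0.3, 0.0, 0.3, 2.0]
closed_form = [softshrink(x, lambd) for x in inputs]
brute_force = [argmin_on_grid(x, lambd) for x in inputs]

for x, c, b in zip(inputs, closed_form, brute_force):
    print(f"x={x:5.2f}  closed-form={c:5.2f}  grid-argmin={b:6.3f}")
```

The two columns should agree to within the grid spacing, which is a numerical sanity check rather than a proof.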

Gradient: $f'(x) = 1$ for $|x| > \lambda$ , 0 for $|x| < \lambda$ . Non-differentiable at $\pm\lambda$ .

Difference from Hardshrink: Hardshrink passes values as-is when they exceed $\lambda$ in magnitude. Softshrink shifts them toward zero by $\lambda$, which makes the function continuous at $\pm\lambda$ but biases every surviving value: its magnitude is smaller by exactly $\lambda$.
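The difference is easiest to see side by side. This is a minimal pure-Python sketch, with helper names chosen for illustration, that applies both functions to the same inputs.

```python
def hardshrink(x: float, lambd: float) -> float:
    """Keep x unchanged outside the band [-lambd, lambd], else return 0."""
    return x if abs(x) > lambd else 0.0


def softshrink(x: float, lambd: float) -> float:
    """Move x toward zero by lambd, clamping at 0 inside the band."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0


lambd = 0.5
rows = []
for x in [0.4, 0.6, 2.0, 10.0]:
    h, s = hardshrink(x, lambd), softshrink(x, lambd)
    rows.append((x, h, s))
    print(f"x={x:5.1f}  hard={h:5.2f}  soft={s:5.2f}")
```

For every input that survives, the two outputs differ by $\lambda$ regardless of how large the input is. Just above the threshold, Softshrink is close to zero while Hardshrink has already jumped to roughly $\lambda$.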

PyTorch:

import torch
import torch.nn as nn
x = torch.tensor([-2., -0.3, 0., 0.3, 2.])
print(nn.Softshrink(lambd=0.5)(x))   # tensor([-1.5,  0.0,  0.0,  0.0,  1.5])
# Values that survive are shifted toward zero by lambd

TensorFlow:

import tensorflow as tf

# tf.nn has no softshrink op; implement the closed-form L1 proximal operator
def softshrink(x, lambd=0.5):
    return tf.math.sign(x) * tf.nn.relu(tf.abs(x) - lambd)

x = tf.constant([-2., -0.3, 0., 0.3, 2.])
print(softshrink(x, lambd=0.5))   # [-1.5 -0.   0.   0.   1.5]  (sign(-0.3) * 0 yields -0.)

Tanhshrink

$f(x) = x - \tanh(x)$

Tanhshrink computes the residual between the identity and Tanh. Since $\tanh(x) \approx x$ near the origin, $f(x) \approx 0$ for small $x$ — the function shrinks small values without a hard threshold. For large $|x|$ , $\tanh(x) \to \pm 1$ , so $f(x) \approx x \mp 1$ .

Gradient: $f'(x) = 1 - \tanh'(x) = 1 - (1 - \tanh^2(x)) = \tanh^2(x)$

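Smooth everywhere; the gradient always lies in $[0, 1)$. There is no dead zone of exactly zero output, but the gradient vanishes at the origin ($f'(0) = 0$) and $f(x) \approx x^3/3$ for small $|x|$, so near-zero inputs are suppressed most strongly in relative terms.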
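The derivative identity can be sanity-checked numerically. The sketch below compares a central finite difference of $x - \tanh(x)$ with $\tanh^2(x)$ using only the standard library. The helper names and the step size `h` are illustrative assumptions.

```python
import math


def tanhshrink(x: float) -> float:
    """Residual between the identity and tanh."""
    return x - math.tanh(x)


def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


checks = []
for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    numeric = central_diff(tanhshrink, x)
    analytic = math.tanh(x) ** 2
    checks.append((numeric, analytic))
    print(f"x={x:5.2f}  numeric={numeric:.6f}  tanh^2={analytic:.6f}")

# Near the origin the function behaves like x^3 / 3.
small = 0.1
print(tanhshrink(small), small ** 3 / 3)
```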

PyTorch:

import torch
import torch.nn as nn
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.Tanhshrink()(x))   # tensor([-1.0360, -0.2384,  0.0000,  0.2384,  1.0360])
# = x - tanh(x); smooth; no hard threshold

TensorFlow:

import tensorflow as tf
x = tf.constant([-2., -1., 0., 1., 2.])
tanhshrink = lambda x: x - tf.math.tanh(x)
print(tanhshrink(x))   # approx. [-1.0360 -0.2384  0.      0.2384  1.0360]

Threshold

$f(x) = \begin{cases} x & x > \text{threshold} \\ \text{value} & x \leq \text{threshold} \end{cases}$

Threshold is a one-sided cutoff: values above the threshold pass through unchanged; values at or below it are replaced with a specified constant value. Unlike Hardshrink, the cutoff is not symmetric in $|x|$, and the replacement value can be anything (not necessarily 0).

Example: nn.Threshold(0.1, 20) replaces all $x \leq 0.1$ with 20. This is an unusual but valid configuration; every input at or below 0.1, including large negative ones, maps to the same constant.

Gradient: 1 for $x > \text{threshold}$, 0 otherwise. The function is always non-differentiable at the threshold, and it is also discontinuous there unless value equals threshold.
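A small pure-Python sketch of the definition shows two configurations: threshold 0 with value 0, which reproduces ReLU, and the threshold 0.1 with value 20 example. The helper name `threshold_fn` is hypothetical and exists only for this illustration.

```python
def threshold_fn(x: float, threshold: float, value: float) -> float:
    """Return x if it is strictly above the threshold, else the constant value."""
    return x if x > threshold else value


xs = [-1.0, 0.0, 0.1, 0.5, 2.0]

relu_like = [threshold_fn(x, 0.0, 0.0) for x in xs]
unusual = [threshold_fn(x, 0.1, 20.0) for x in xs]

print(relu_like)  # [0.0, 0.0, 0.1, 0.5, 2.0]
print(unusual)    # [20.0, 20.0, 20.0, 0.5, 2.0]
```

In the second configuration the replacement constant is larger than every surviving value, so the output is not monotonic in the input.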

PyTorch:

import torch
import torch.nn as nn
x = torch.tensor([-1., 0., 0.1, 0.5, 2.])
# Values <= 0.1 replaced with 0.0
print(nn.Threshold(threshold=0.1, value=0.0)(x))   # tensor([0.0, 0.0, 0.0, 0.5, 2.0])

TensorFlow:

import tensorflow as tf
x = tf.constant([-1., 0., 0.1, 0.5, 2.])
threshold, value = 0.1, 0.0
print(tf.where(x > threshold, x, tf.fill(tf.shape(x), value)))
# [0.  0.  0.  0.5 2. ]

Softplus — Smooth ReLU

$f(x) = \frac{1}{\beta} \log\!\left(1 + e^{\beta x}\right)$

Softplus is a smooth, strictly positive approximation of ReLU. As $\beta \to \infty$, Softplus converges to ReLU. The default $\beta = 1$ gives a smooth curve that passes through $(0, \ln 2)$ and approaches $x$ for large $x$.
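The convergence to ReLU can be illustrated numerically. In the sketch below, the gap between Softplus and ReLU is largest at $x = 0$, where it equals $\ln 2 / \beta$. The helper names and the particular $\beta$ values are illustrative assumptions, and the overflow-safe rewrite $\log(1 + e^{z}) = \max(z, 0) + \log(1 + e^{-|z|})$ is one common way to evaluate the formula, not necessarily what any framework does internally.

```python
import math


def softplus(x: float, beta: float = 1.0) -> float:
    """(1/beta) * log(1 + exp(beta * x)), evaluated in an overflow-safe form."""
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta


def relu(x: float) -> float:
    return max(x, 0.0)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
gaps = {}
for beta in [1.0, 5.0, 50.0]:
    # Largest difference between Softplus and ReLU over the sample points.
    gaps[beta] = max(softplus(x, beta) - relu(x) for x in xs)
    print(f"beta={beta:5.1f}  max gap={gaps[beta]:.6f}  ln(2)/beta={math.log(2) / beta:.6f}")
```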

Always positive: Unlike ReLU, Softplus is strictly $> 0$ for all $x$ (in floating point it can still underflow to 0 for very negative inputs). This makes it a common choice for network outputs that must be positive, such as:

Variance parameters in VAEs: var = F.softplus(raw_var)
Scale parameters in normalizing flows
Poisson rate parameters in count models

Note on threshold parameter: For $\beta x > \text{threshold}$ (default 20), PyTorch reverts to the linear function $f(x) = x$ for numerical stability. In that regime the exact value and $x$ differ by far less than float32 precision.
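The reason for a fallback becomes clear when the textbook formula is evaluated directly. The following is a sketch of the idea in plain Python, not PyTorch's actual source; the function names are hypothetical, and the default `threshold=20.0` simply mirrors the default quoted above.

```python
import math


def softplus_naive(x: float, beta: float = 1.0) -> float:
    """Direct translation of the formula; exp overflows for large beta * x."""
    return math.log(1.0 + math.exp(beta * x)) / beta


def softplus_with_threshold(x: float, beta: float = 1.0,
                            threshold: float = 20.0) -> float:
    """Return x itself once beta * x exceeds the threshold."""
    if beta * x > threshold:
        return x
    return math.log1p(math.exp(beta * x)) / beta


try:
    softplus_naive(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True

safe_value = softplus_with_threshold(1000.0)
# Error introduced by the linear fallback at a point just past the threshold.
fallback_error = softplus_with_threshold(25.0) - softplus_naive(25.0)

print(overflowed, safe_value, fallback_error)
```

The naive version fails for large inputs, while the fallback returns the input unchanged at a cost that is negligible at this scale.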

Gradient: $f'(x) = \sigma(\beta x)$ — the gradient of Softplus is exactly the sigmoid.
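As a quick check of this identity, the sketch below compares a central finite difference of Softplus with $\sigma(\beta x)$. The helper names, the choice $\beta = 2$, and the step size are assumptions for illustration.

```python
import math


def softplus(x: float, beta: float = 1.0) -> float:
    """(1/beta) * log(1 + exp(beta * x)), evaluated in an overflow-safe form."""
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


beta = 2.0
checks = []
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    numeric = central_diff(lambda t: softplus(t, beta), x)
    analytic = sigmoid(beta * x)
    checks.append((numeric, analytic))
    print(f"x={x:5.2f}  numeric={numeric:.6f}  sigmoid(beta*x)={analytic:.6f}")
```

Since the sigmoid is strictly positive, every input receives a nonzero gradient, in contrast to the zero-gradient bands of the shrinkage functions above.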

PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2., -1., 0., 1., 2.])
print(nn.Softplus(beta=1)(x))   # tensor([0.1269, 0.3133, 0.6931, 1.3133, 2.1269])
# Always positive; approaches ReLU as beta → ∞
# Use for variance outputs: var = F.softplus(raw_var)

TensorFlow:

import tensorflow as tf
x = tf.constant([-2., -1., 0., 1., 2.])
print(tf.keras.activations.softplus(x))   # [0.1269 0.3133 0.6931 1.3133 2.1269]
print(tf.nn.softplus(x))                  # identical

Comparison Table

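| Function | Formula | Gradient at small $x$ | Smoothness |
| --- | --- | --- | --- |
| Hardshrink | $x$ if $\lvert x\rvert > \lambda$, else $0$ | $0$ for $\lvert x\rvert < \lambda$ | Discontinuous at $\pm\lambda$ |
| Softshrink | $\text{sign}(x)\max(0, \lvert x\rvert - \lambda)$ | $0$ for $\lvert x\rvert < \lambda$ | Continuous, kinks at $\pm\lambda$ |
| Tanhshrink | $x - \tanh(x)$ | $\tanh^2(x) \to 0$ | Smooth |
| Threshold | $x$ if $x > t$, else $v$ | $0$ for $x \leq t$ | Discontinuous at $t$ unless $v = t$ |
| Softplus | $(1/\beta)\log(1+e^{\beta x})$ | $\sigma(\beta x) > 0$ | Smooth |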

