Supplement · Neural Network Architectures

Vanilla RNNs and the Vanishing Gradient

14 min read

By the end of this reading you will be able to:

Trace the recurrent forward pass h_t = tanh(W_h h_{t-1} + W_x x_t + b) through a sequence of T timesteps, identifying the shared weights and the hidden state
Explain why gradients vanish through long RNN sequences by tracing the repeated multiplication of ∂h_t/∂h_{t-1} in backpropagation through time
Distinguish many-to-one, one-to-many, many-to-many (synced), and many-to-many (encoder-decoder) RNN configurations and give an example task for each
Explain exploding gradients and state the standard engineering fix (gradient clipping by norm), and state why gradient clipping does not solve vanishing gradients

Why Sequences Need Special Treatment

An MLP expects a fixed-size input. Language, audio, time series, and video are sequences — they have variable length and their meaning depends on order. Two approaches exist: (1) process the entire sequence at once with attention (transformers), or (2) process it step by step with a recurrent state.

Recurrent neural networks (RNNs) take the second approach: they maintain a hidden state that is updated at each timestep, encoding a summary of everything seen so far.

The Vanilla RNN

A vanilla (Elman) RNN processes a sequence $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ one element at a time:

$\mathbf{h}_t = \phi\bigl(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}\bigr)$

$\hat{\mathbf{y}}_t = W_y \mathbf{h}_t + \mathbf{c}$

where:

$\mathbf{h}_t \in \mathbb{R}^{d_h}$ — the hidden state at time $t$ , the RNN's "memory"
$\mathbf{x}_t \in \mathbb{R}^{d_x}$ — the input at time $t$
$W_h \in \mathbb{R}^{d_h \times d_h}$ — recurrent weight matrix (shared across all $t$ )
$W_x \in \mathbb{R}^{d_h \times d_x}$ — input weight matrix (shared across all $t$ )
$\phi$ — typically tanh (saturating, keeps values in $[-1,1]$ , which stabilizes the state)
$\hat{\mathbf{y}}_t$ — output at time $t$ (not always used at every step)

Weight sharing across time is the RNN's equivalent of CNN weight sharing across space: the same function is applied at every position, regardless of $T$ .

Initial hidden state $\mathbf{h}_0$ is typically set to $\mathbf{0}$ .
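The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the sizes, and the random initialization scale are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, h0=None):
    """Run a vanilla RNN over xs of shape (T, d_x); return hidden states of shape (T, d_h)."""
    d_h = W_h.shape[0]
    h = np.zeros(d_h) if h0 is None else h0      # h_0 = 0 by default
    hs = []
    for x_t in xs:                               # one timestep at a time
        h = np.tanh(W_h @ h + W_x @ x_t + b)     # the same W_h, W_x, b at every step
        hs.append(h)
    return np.stack(hs)

# Illustrative sizes and random weights (assumptions, not from the reading)
rng = np.random.default_rng(0)
T, d_x, d_h = 5, 3, 4
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
W_x = rng.normal(scale=0.5, size=(d_h, d_x))
b = np.zeros(d_h)
xs = rng.normal(size=(T, d_x))

hs = rnn_forward(xs, W_h, W_x, b)
print(hs.shape)   # (5, 4): one hidden state per timestep
```

Because the loop reuses the same three parameters, the function accepts a sequence of any length without changing the parameter count.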

Architectural Configurations

Configuration	Structure	Task examples
Many-to-one	Process full sequence, output at $T$ only	Sentiment analysis, document classification
One-to-many	Single input, generate sequence	Image captioning, music generation
Many-to-many (synced)	Output at every $t$	Sequence tagging, language modeling
Many-to-many (seq2seq)	Encoder (many-to-one) + decoder (one-to-many)	Machine translation, summarization

The seq2seq configuration, introduced by Sutskever et al. (2014), encodes the entire input sequence into a single context vector $\mathbf{h}_T$ , then decodes it into the output sequence. This became the dominant approach for translation before transformers.
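The configurations differ mainly in which hidden states are read out and what is fed in at each step. The sketch below illustrates three of them with one shared cell. The weights, sizes, and the zero-input decoder loop are simplifying assumptions; practical one-to-many decoders usually feed the previous output back in as the next input, and a seq2seq model would use separate encoder and decoder weights.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h, d_y = 6, 3, 4, 2                    # illustrative sizes
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
W_x = rng.normal(scale=0.5, size=(d_h, d_x))
W_y = rng.normal(scale=0.5, size=(d_y, d_h))
b, c = np.zeros(d_h), np.zeros(d_y)

def step(h, x):
    """One recurrent update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h + W_x @ x + b)

# Run the cell over a full input sequence and keep every hidden state
xs = rng.normal(size=(T, d_x))
h = np.zeros(d_h)
hs = []
for x_t in xs:
    h = step(h, x_t)
    hs.append(h)
hs = np.stack(hs)                                # (T, d_h)

# Many-to-one: read out only the final hidden state
y_many_to_one = W_y @ hs[-1] + c                 # (d_y,)

# Many-to-many (synced): read out at every timestep
y_synced = hs @ W_y.T + c                        # (T, d_y)

# One-to-many (simplified): one input seeds the state, then the cell runs on zero inputs
n_out = 4
h = step(np.zeros(d_h), xs[0])
ys = []
for _ in range(n_out):
    ys.append(W_y @ h + c)
    h = step(h, np.zeros(d_x))
y_one_to_many = np.stack(ys)                     # (n_out, d_y)
```

A seq2seq model chains the first and third patterns: the encoder's final state `hs[-1]` would initialize the decoder loop in place of the single seeded input.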

Backpropagation Through Time

Training an RNN requires differentiating the loss through the recurrent computation. Backpropagation through time (BPTT) unrolls the RNN into a deep feedforward network and applies standard backprop.

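With the total loss written as a sum of per-step losses, $\mathcal{L} = \sum_k \mathcal{L}_k$, the gradient with respect to the hidden state at time $t$ collects a contribution from every step $k \geq t$, each passed back through the intervening hidden states: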

$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} = \sum_{k=t}^T \frac{\partial \mathcal{L}_k}{\partial \mathbf{h}_k} \prod_{j=t}^{k-1} \frac{\partial \mathbf{h}_{j+1}}{\partial \mathbf{h}_j}$

The Jacobian of one step:

$\frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_t} = \text{diag}(\phi'(\cdot)) \cdot W_h$

For the gradient to flow from step $T$ back to step $1$ , $T-1$ such Jacobians, each evaluated at its own timestep, are multiplied together.
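The one-step Jacobian formula can be checked numerically. The sketch below compares $\text{diag}(\phi'(\cdot)) \cdot W_h$ for tanh against central finite differences. The sizes and random values are assumptions for illustration, and the input term $W_x \mathbf{x}_t$ is folded into the bias to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = 4
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
b = rng.normal(scale=0.1, size=d_h)      # stands in for W_x x_t + b (assumption for brevity)
h = rng.uniform(-0.5, 0.5, size=d_h)     # an arbitrary previous hidden state

def step(h_prev):
    return np.tanh(W_h @ h_prev + b)

# Analytic Jacobian: diag(phi'(a)) W_h, with a = W_h h + b and tanh'(a) = 1 - tanh(a)^2
a = W_h @ h + b
J_analytic = np.diag(1.0 - np.tanh(a) ** 2) @ W_h

# Central finite differences: perturb one coordinate of h per column
eps = 1e-6
J_numeric = np.zeros((d_h, d_h))
for j in range(d_h):
    e = np.zeros(d_h)
    e[j] = eps
    J_numeric[:, j] = (step(h + e) - step(h - e)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))   # should be tiny (finite-difference error only)
```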

The Vanishing Gradient Problem

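Taking norms of the one-step Jacobian gives $\left\| \partial \mathbf{h}_{j+1} / \partial \mathbf{h}_j \right\| \leq \|W_h\| \cdot \max |\phi'|$ , where $\|W_h\|$ is the spectral norm (largest singular value) of $W_h$ . If this bound is below 1, the product of $T-t$ Jacobians shrinks exponentially in $T-t$ :

$\left\| \prod_{j=t}^{T-1} \frac{\partial \mathbf{h}_{j+1}}{\partial \mathbf{h}_j} \right\| \leq \left(\|W_h\| \cdot \max |\phi'|\right)^{T-t}$

For tanh, $\phi'(a) = 1 - \tanh^2(a)$ , so $\max |\phi'| = 1$ is attained only at $a = 0$ and $\phi'$ decays toward 0 as $|a|$ grows. Once units saturate, $\phi'(\cdot) \approx 0$ for those units and the gradient flowing through them vanishes, even when $\|W_h\|$ is not small.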

The consequence: Gradients from late timesteps do not reach early timesteps. The RNN cannot learn to correlate events separated by many steps. The hidden state $\mathbf{h}_T$ encodes only recent history — the network has a short effective memory.
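The exponential shrinkage is easy to observe numerically. The sketch below accumulates the Jacobian product along a random input sequence and compares its spectral norm with the bound $\|W_h\|^t$ . The dimensions, the target spectral norm of 0.9, and the random inputs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 16, 8, 50
W_h = rng.normal(size=(d_h, d_h))
W_h *= 0.9 / np.linalg.norm(W_h, 2)      # rescale so the spectral norm is 0.9
W_x = rng.normal(scale=0.5, size=(d_h, d_x))
b = np.zeros(d_h)

h = np.zeros(d_h)
J_prod = np.eye(d_h)                     # running product, equals d h_t / d h_0
norms = []
for t in range(T):
    h = np.tanh(W_h @ h + W_x @ rng.normal(size=d_x) + b)
    J = np.diag(1.0 - h ** 2) @ W_h      # one-step Jacobian at this timestep
    J_prod = J @ J_prod
    norms.append(np.linalg.norm(J_prod, 2))

# Compare the measured norm with the bound 0.9^t at a few horizons
for t in (1, 10, 25, 50):
    print(t, norms[t - 1], 0.9 ** t)
```

The measured norm typically falls well below the bound, because the tanh derivative factor is strictly less than 1 whenever a unit is away from zero.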

The Exploding Gradient Problem

If $\|W_h\| > 1$ and the units are not saturated, the product can instead grow exponentially: gradients become enormous, and the resulting weight updates destabilize training, often ending in a NaN loss.

Fix: gradient clipping by norm.

Before the optimizer step, if the gradient norm exceeds a threshold $\tau$ :

$\mathbf{g} \leftarrow \tau \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|}$

This is standard practice when training RNNs and transformers. In PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).

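Why clipping does not fix vanishing gradients: Clipping only rescales gradients whose norm exceeds $\tau$ ; it never enlarges a small gradient. It also rescales the whole gradient vector uniformly, so the contribution of distant timesteps stays negligible relative to that of recent ones. The vanishing gradient problem requires architectural changes, specifically gated architectures (LSTM and GRU, covered in the next reading).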
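The asymmetry is visible in a direct implementation of the clipping rule. The helper name `clip_by_norm` and the two example gradients are hypothetical, chosen only to show one gradient above the threshold and one far below it.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g to norm tau if its norm exceeds tau; otherwise return it unchanged."""
    norm = np.linalg.norm(g)
    if norm > tau:
        return g * (tau / norm)
    return g

big = np.array([30.0, 40.0])       # norm 50: an "exploded" gradient
small = np.array([3e-8, 4e-8])     # norm 5e-8: a "vanished" gradient

clipped_big = clip_by_norm(big, tau=1.0)      # direction kept, norm reduced to 1
clipped_small = clip_by_norm(small, tau=1.0)  # below the threshold, returned as-is

print(np.linalg.norm(clipped_big), np.linalg.norm(clipped_small))
```

The large gradient keeps its direction and drops to norm $\tau$ , while the small one passes through untouched, which is why clipping offers no help when the signal from early timesteps has already decayed.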

Bidirectional RNNs

A standard RNN only sees past context at each step. For tasks where future context matters (e.g., part-of-speech tagging, NER), a bidirectional RNN runs two RNNs in opposite directions and concatenates their hidden states:

$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t\, ;\, \overleftarrow{\mathbf{h}}_t]$

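This doubles the dimension of the per-step representation. BERT achieves a comparable bidirectional view without recurrence: its self-attention layers attend to both left and right context, and it is trained with a masked language modeling objective.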
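A bidirectional pass can be sketched by running the same kind of cell twice, once over the reversed sequence, and re-aligning the backward states before concatenating. The helper names `make_params` and `run_rnn`, the sizes, and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 3, 4                            # illustrative sizes

def make_params():
    """Random (W_h, W_x, b) for one direction."""
    return (rng.normal(scale=0.5, size=(d_h, d_h)),
            rng.normal(scale=0.5, size=(d_h, d_x)),
            np.zeros(d_h))

def run_rnn(xs, W_h, W_x, b):
    """Vanilla tanh RNN over xs of shape (T, d_x); returns states of shape (T, d_h)."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

xs = rng.normal(size=(T, d_x))
fwd_params, bwd_params = make_params(), make_params()   # separate weights per direction

h_fwd = run_rnn(xs, *fwd_params)                 # h_fwd[t] summarizes x_1..x_t
h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]     # h_bwd[t] summarizes x_t..x_T
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)    # (T, 2 * d_h)
print(h_bi.shape)
```

The second `[::-1]` matters: without it, position `t` of the backward output would correspond to input position `T - 1 - t` and the concatenation would pair mismatched timesteps.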

PyTorch and TensorFlow

PyTorch (nn.RNN, bidirectional, gradient clipping):

```python
import torch
import torch.nn as nn

# nn.RNN: input_size, hidden_size, num_layers
rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=2,
             batch_first=True,    # input shape: (B, T, input_size)
             nonlinearity='tanh', # 'relu' also available
             dropout=0.2)         # applied between layers (not on final output)

x        = torch.randn(8, 20, 32)    # (batch=8, seq_len=20, input_size=32)
out, h_n = rnn(x)                    # out: (8,20,64)  h_n: (num_layers,8,64)
# out: top-layer hidden state at every timestep
# h_n: hidden state at the final timestep for each layer

# Bidirectional RNN: concatenates forward and backward hidden states
bi_rnn = nn.RNN(32, 64, batch_first=True, bidirectional=True)
bi_out, bi_h_n = bi_rnn(x)           # bi_out: (8,20,128)  bi_h_n: (2,8,64)
# bi_out last dim = 2 * hidden_size; bi_h_n first dim = num_layers * 2 directions

# Gradient clipping to handle exploding gradients
optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-3)
optimizer.zero_grad()
loss = out.sum()                     # placeholder loss on the unidirectional RNN's output
loss.backward()
# Clip after backward() and before step(), on the same parameters the loss depends on
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
optimizer.step()
```

TensorFlow / Keras:

```python
import tensorflow as tf

# SimpleRNN: rarely used in practice but matches the vanilla RNN formulation
rnn = tf.keras.layers.SimpleRNN(units=64, return_sequences=True, dropout=0.2)
x   = tf.random.normal((8, 20, 32))
out = rnn(x)   # (8, 20, 64) with return_sequences=True

# Bidirectional wrapper
bi_rnn = tf.keras.layers.Bidirectional(
    tf.keras.layers.SimpleRNN(64, return_sequences=True)
)
bi_out = bi_rnn(x)   # (8, 20, 128): forward and backward states concatenated

# Gradient clipping on the optimizer
# clipnorm clips each weight's gradient separately; global_clipnorm clips the
# combined norm of all gradients, which is closer to PyTorch's clip_grad_norm_
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
```
