Supplement · Neural Network Architectures

Vanilla RNNs and the Vanishing Gradient

14 min read
By the end of this reading you will be able to:
  • Trace the recurrent forward pass h_t = tanh(W_h h_{t-1} + W_x x_t + b) through a sequence of T timesteps, identifying the shared weights and the hidden state
  • Explain why gradients vanish through long RNN sequences by tracing the repeated multiplication of ∂h_t/∂h_{t-1} in backpropagation through time
  • Distinguish many-to-one, one-to-many, many-to-many (synced), and many-to-many (encoder-decoder) RNN configurations and give an example task for each
  • Explain exploding gradients and state the standard engineering fix (gradient clipping by norm), and state why gradient clipping does not solve vanishing gradients

Why Sequences Need Special Treatment

An MLP expects a fixed-size input. Language, audio, time series, and video are sequences — they have variable length and their meaning depends on order. Two approaches exist: (1) process the entire sequence at once with attention (transformers), or (2) process it step by step with a recurrent state.

Recurrent neural networks (RNNs) take the second approach: they maintain a hidden state that is updated at each timestep, encoding a summary of everything seen so far.


The Vanilla RNN

A vanilla (Elman) RNN processes a sequence $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ one element at a time:

$$\mathbf{h}_t = \phi\bigl(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}\bigr)$$

$$\hat{\mathbf{y}}_t = W_y \mathbf{h}_t + \mathbf{c}$$

where:

  • $\mathbf{h}_t \in \mathbb{R}^{d_h}$: the hidden state at time $t$, the RNN's "memory"
  • $\mathbf{x}_t \in \mathbb{R}^{d_x}$: the input at time $t$
  • $W_h \in \mathbb{R}^{d_h \times d_h}$: recurrent weight matrix (shared across all $t$)
  • $W_x \in \mathbb{R}^{d_h \times d_x}$: input weight matrix (shared across all $t$)
  • $W_y$, $\mathbf{b}$, $\mathbf{c}$: output weight matrix and bias vectors (also shared across all $t$)
  • $\phi$: typically tanh (saturating, keeps values in $[-1, 1]$, which stabilizes the state)
  • $\hat{\mathbf{y}}_t$: output at time $t$ (not always used at every step)

Weight sharing across time is the RNN's equivalent of CNN weight sharing across space: the same function is applied at every position, regardless of $T$.

The initial hidden state $\mathbf{h}_0$ is typically set to $\mathbf{0}$.
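
The recurrence above can be traced by hand with a short loop. The following is a minimal NumPy sketch, not a library implementation; the function name rnn_forward, the random weights, and the sizes (T = 5, d_x = 3, d_h = 4) are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, h0=None):
    """Run h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a sequence.

    xs has shape (T, d_x); the result has shape (T, d_h).
    """
    d_h = W_h.shape[0]
    h = np.zeros(d_h) if h0 is None else h0   # h_0 = 0 by default
    hs = []
    for x_t in xs:                            # the same W_h, W_x, b at every step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 3, 4                         # illustrative sizes
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
W_x = rng.normal(scale=0.5, size=(d_h, d_x))
b = np.zeros(d_h)
xs = rng.normal(size=(T, d_x))

hs = rnn_forward(xs, W_h, W_x, b)
print(hs.shape)                               # (5, 4)
```

Because the loop reuses one set of weights, the parameter count does not depend on the sequence length; only the number of loop iterations does.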


Architectural Configurations

| Configuration | Structure | Task examples |
| Many-to-one | Process full sequence, output at $T$ only | Sentiment analysis, document classification |
| One-to-many | Single input, generate sequence | Image captioning, music generation |
| Many-to-many (synced) | Output at every $t$ | Sequence tagging, language modeling |
| Many-to-many (seq2seq) | Encoder (many-to-one) + decoder (one-to-many) | Machine translation, summarization |
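
The many-to-one and synced many-to-many rows differ only in which hidden states are passed to the output layer. A minimal NumPy sketch of that difference follows; the sizes, random weights, and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h, d_y = 6, 3, 4, 2                 # illustrative sizes
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
W_x = rng.normal(scale=0.5, size=(d_h, d_x))
W_y = rng.normal(scale=0.5, size=(d_y, d_h))
b, c = np.zeros(d_h), np.zeros(d_y)
xs = rng.normal(size=(T, d_x))

# Shared recurrent pass: collect every hidden state h_1 .. h_T
h = np.zeros(d_h)
hs = []
for x_t in xs:
    h = np.tanh(W_h @ h + W_x @ x_t + b)
    hs.append(h)
hs = np.stack(hs)                             # shape (T, d_h)

y_many_to_one = W_y @ hs[-1] + c              # one output, read from h_T only
y_synced = hs @ W_y.T + c                     # one output per timestep, shape (T, d_y)
print(y_many_to_one.shape, y_synced.shape)    # (2,) (6, 2)
```

In a sentiment classifier only y_many_to_one would enter the loss, while a tagger would score every row of y_synced.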

The seq2seq configuration, introduced by Sutskever et al. (2014), encodes the entire input sequence into a single context vector $\mathbf{h}_T$, then decodes it into the output sequence. This became the dominant approach for translation before transformers.


Backpropagation Through Time

Training an RNN requires differentiating the loss through the recurrent computation. Backpropagation through time (BPTT) unrolls the RNN into a deep feedforward network and applies standard backprop.

The gradient of the loss with respect to the hidden state at time $t$ depends on all future hidden states:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} = \sum_{k=t}^{T} \frac{\partial \mathcal{L}_k}{\partial \mathbf{h}_k} \prod_{j=t}^{k-1} \frac{\partial \mathbf{h}_{j+1}}{\partial \mathbf{h}_j}$$

The Jacobian of one step:

$$\frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_t} = \text{diag}(\phi'(\cdot)) \cdot W_h$$

For the gradient to flow from step $T$ back to step $1$, this Jacobian is multiplied $T-1$ times.
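
The one-step Jacobian can be checked numerically. The sketch below, which assumes tanh as the activation and uses made-up weights, compares the analytic form diag(φ') · W_h against central finite differences; the vector u stands in for the fixed term W_x x_{t+1} + b.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = 4                                       # illustrative hidden size
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
u = rng.normal(size=d_h)                      # stands in for W_x x_{t+1} + b
h_t = np.tanh(rng.normal(size=d_h))           # some hidden state in (-1, 1)

def step(h):
    """One recurrent step h_{t+1} = tanh(W_h h_t + u)."""
    return np.tanh(W_h @ h + u)

h_next = step(h_t)
# For tanh, phi'(a) = 1 - tanh(a)^2 = 1 - h_next^2
J_analytic = np.diag(1.0 - h_next**2) @ W_h

# Central finite differences, one column per coordinate of h_t
eps = 1e-6
J_numeric = np.zeros((d_h, d_h))
for i in range(d_h):
    e = np.zeros(d_h)
    e[i] = eps
    J_numeric[:, i] = (step(h_t + e) - step(h_t - e)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))  # small, limited by floating-point error
```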


The Vanishing Gradient Problem

If the product of the spectral norm $\|W_h\|$ and the largest activation derivative $\max |\phi'|$ is below 1, the product of $T - t$ Jacobians shrinks exponentially in $T - t$:

$$\left\| \prod_{j=t}^{T-1} \frac{\partial \mathbf{h}_{j+1}}{\partial \mathbf{h}_j} \right\| \leq \left(\lambda_{\max} \cdot \max |\phi'|\right)^{T-t}$$

where $\lambda_{\max}$ denotes the largest singular value of $W_h$ (its spectral norm $\|W_h\|$).

For tanh: $\max |\phi'| = 1$ (at 0), and $\phi'$ decays to 0 as the pre-activation grows in magnitude. Once units saturate, $\phi'(\cdot) \approx 0$, and the gradient flowing through them vanishes.

The consequence: Gradients from late timesteps do not reach early timesteps. The RNN cannot learn to correlate events separated by many steps. The hidden state $\mathbf{h}_T$ encodes only recent history, so the network has a short effective memory.
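
The exponential decay is easy to observe numerically. The sketch below is an illustration under assumed settings (random weights rescaled to spectral norm 0.9, tanh activation, 50 steps, input size equal to hidden size); it accumulates the product of one-step Jacobians and prints its norm next to the bound 0.9^k.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 8, 50                                # illustrative sizes
W_h = rng.normal(size=(d_h, d_h))
W_h *= 0.9 / np.linalg.norm(W_h, 2)           # rescale so the spectral norm is 0.9
W_x = rng.normal(scale=0.5, size=(d_h, d_h))  # d_x = d_h purely for simplicity
xs = rng.normal(size=(T, d_h))

h = np.zeros(d_h)
prod = np.eye(d_h)                            # running product, equals d h_k / d h_0
norms = []
for x_t in xs:
    h = np.tanh(W_h @ h + W_x @ x_t)
    J = np.diag(1.0 - h**2) @ W_h             # one-step Jacobian d h_k / d h_{k-1}
    prod = J @ prod
    norms.append(np.linalg.norm(prod, 2))

for k in (1, 10, 25, 50):
    print(k, norms[k - 1], 0.9**k)            # measured norm vs. the bound 0.9^k
```

The measured norm is typically well below the bound, because the tanh derivative is strictly less than 1 whenever the hidden state is nonzero.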


The Exploding Gradient Problem

If $\|W_h\| > 1$, the product can grow exponentially: gradients can become enormous, causing weight updates that destabilize training entirely (NaN loss).

Fix: gradient clipping by norm.

Before the optimizer step, if the gradient norm $\|\mathbf{g}\|$ exceeds a threshold $\tau$:

$$\mathbf{g} \leftarrow \tau \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|}$$

This is standard practice when training RNNs and transformers. In PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).

Why clipping does not fix vanishing gradients: Clipping rescales large gradients downward but cannot make small gradients larger. The vanishing-gradient problem requires architectural changes, specifically gated architectures (LSTM, GRU; next reading).
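
The clipping rule and its one-sided effect can be sketched in a few lines of NumPy. This is an illustration of the formula, not the PyTorch utility; the helper name clip_by_norm and the example vectors are assumptions.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g to norm tau if its norm exceeds tau; otherwise return g unchanged."""
    norm = np.linalg.norm(g)
    if norm > tau:
        return g * (tau / norm)
    return g

big = np.array([30.0, 40.0])                  # norm 50, an "exploded" gradient
tiny = np.array([3e-9, 4e-9])                 # norm 5e-9, a "vanished" gradient

clipped_big = clip_by_norm(big, tau=1.0)
clipped_tiny = clip_by_norm(tiny, tau=1.0)

print(np.linalg.norm(clipped_big))            # about 1.0: scaled down, direction kept
print(np.array_equal(clipped_tiny, tiny))     # True: small gradients pass through untouched
```

In a real model the norm is usually computed over all parameter gradients together, which is what torch.nn.utils.clip_grad_norm_ does.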


Bidirectional RNNs

A standard RNN only sees past context at each step. For tasks where future context matters (e.g., part-of-speech tagging, NER), a bidirectional RNN runs two RNNs in opposite directions and concatenates their hidden states:

$$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t\, ;\, \overleftarrow{\mathbf{h}}_t]$$

This doubles the hidden dimension. BERT also conditions on both left and right context, but it does so with self-attention over the whole sequence (trained with a masked-language-modeling objective), not with recurrence.
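
The concatenation can be sketched by running the same recurrence twice, once over the reversed sequence. The NumPy code below is a minimal illustration; the helper names run_rnn and init_params, the random weights, and the sizes are assumptions.

```python
import numpy as np

def run_rnn(xs, W_h, W_x, b):
    """Vanilla tanh RNN over xs of shape (T, d_x); returns states of shape (T, d_h)."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 3, 4                         # illustrative sizes

def init_params():
    return (rng.normal(scale=0.5, size=(d_h, d_h)),
            rng.normal(scale=0.5, size=(d_h, d_x)),
            np.zeros(d_h))

fwd_params, bwd_params = init_params(), init_params()  # two independent RNNs
xs = rng.normal(size=(T, d_x))

h_fwd = run_rnn(xs, *fwd_params)              # reads x_1 .. x_T
h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]  # reads x_T .. x_1, then re-aligned to t
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)
print(h_bi.shape)                             # (5, 8), i.e. (T, 2 * d_h)
```

At each position t the first half of h_bi summarizes x_1 .. x_t and the second half summarizes x_t .. x_T.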


PyTorch and TensorFlow

PyTorch: nn.RNN, bidirectional, gradient clipping:

import torch
import torch.nn as nn

# nn.RNN: input_size, hidden_size, num_layers
rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=2,
             batch_first=True,    # input shape: (B, T, input_size)
             nonlinearity='tanh', # 'relu' also available
             dropout=0.2)          # applied between layers (not on final output)

x        = torch.randn(8, 20, 32)    # (batch=8, seq_len=20, input_size=32)
out, h_n = rnn(x)                    # out: (8,20,64)  h_n: (num_layers,8,64)
# out: hidden state at every timestep
# h_n: hidden state at the final timestep for each layer

# Bidirectional RNN: concatenates forward and backward hidden states
bi_rnn = nn.RNN(32, 64, batch_first=True, bidirectional=True)
bi_out, bi_h_n = bi_rnn(x)           # bi_out: (8,20,128)  bi_h_n: (2,8,64)
# bi_out last dim = 2 * hidden_size

# Gradient clipping to handle exploding gradients
optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-3)
optimizer.zero_grad()                # clear any stale gradients
loss = out.sum()                     # placeholder loss on the 2-layer rnn's output
loss.backward()
# Clip after backward() and before step(), on the same parameters the loss depends on
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
optimizer.step()

TensorFlow / Keras:

import tensorflow as tf

# SimpleRNN — rarely used in practice but matches the vanilla RNN formulation
rnn = tf.keras.layers.SimpleRNN(units=64, return_sequences=True, dropout=0.2)
x   = tf.random.normal((8, 20, 32))
out = rnn(x)   # (8, 20, 64) with return_sequences=True

# Bidirectional wrapper (default merge mode concatenates both directions)
bi_rnn = tf.keras.layers.Bidirectional(
    tf.keras.layers.SimpleRNN(64, return_sequences=True)
)
bi_out = bi_rnn(x)   # (8, 20, 128)

# Gradient clipping on the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)