Vanilla RNNs and the Vanishing Gradient
- Trace the recurrent forward pass h_t = tanh(W_h h_{t-1} + W_x x_t + b) through a sequence of T timesteps, identifying the shared weights and the hidden state
- Explain why gradients vanish through long RNN sequences by tracing the repeated multiplication of ∂h_t/∂h_{t-1} in backpropagation through time
- Distinguish many-to-one, one-to-many, many-to-many (synced), and many-to-many (encoder-decoder) RNN configurations and give an example task for each
- Explain exploding gradients and state the standard engineering fix (gradient clipping by norm), and state why gradient clipping does not solve vanishing gradients
Why Sequences Need Special Treatment
An MLP expects a fixed-size input. Language, audio, time series, and video are sequences — they have variable length and their meaning depends on order. Two approaches exist: (1) process the entire sequence at once with attention (transformers), or (2) process it step by step with a recurrent state.
Recurrent neural networks (RNNs) take the second approach: they maintain a hidden state that is updated at each timestep, encoding a summary of everything seen so far.
The Vanilla RNN
A vanilla (Elman) RNN processes a sequence one element at a time:

$$h_t = \phi(W_h h_{t-1} + W_x x_t + b)$$

$$y_t = W_y h_t + b_y$$

where:
- $h_t$: the hidden state at time $t$, the RNN's "memory"
- $x_t$: the input at time $t$
- $W_h$: recurrent weight matrix (shared across all $t$)
- $W_x$: input weight matrix (shared across all $t$)
- $\phi$: the activation, typically tanh (saturating, keeps values in $(-1, 1)$, which stabilizes the state)
- $y_t$: output at time $t$ (not always used at every step)
Weight sharing across time is the RNN's equivalent of CNN weight sharing across space: the same function is applied at every position, regardless of $t$.
The initial hidden state $h_0$ is typically set to $\mathbf{0}$.
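The recurrence h_t = tanh(W_h h_{t-1} + W_x x_t + b) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `rnn_forward`, the sizes, and the initialization scale of 0.1 are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b, h0=None):
    """Run a vanilla RNN over x_seq of shape (T, input_size).

    Returns the stacked hidden states, shape (T, hidden_size).
    """
    hidden_size = W_h.shape[0]
    h = np.zeros(hidden_size) if h0 is None else h0
    states = []
    for x_t in x_seq:                      # one timestep at a time
        # The same W_h, W_x, b are reused at every step (weight sharing).
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_size, hidden_size = 20, 32, 64    # illustrative sizes
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

x_seq = rng.normal(size=(T, input_size))
H = rnn_forward(x_seq, W_h, W_x, b)
print(H.shape)   # (20, 64)
```

The loop body never indexes the parameters by timestep, which is what weight sharing means in practice: the parameter count is independent of the sequence length T.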
Architectural Configurations
| Configuration | Structure | Task examples |
|---|---|---|
| Many-to-one | Process full sequence, output at $t = T$ only | Sentiment analysis, document classification |
| One-to-many | Single input, generate sequence | Image captioning, music generation |
| Many-to-many (synced) | Output at every $t$ | Sequence tagging, language modeling |
| Many-to-many (seq2seq) | Encoder (many-to-one) + decoder (one-to-many) | Machine translation, summarization |
The seq2seq configuration, introduced by Sutskever et al. (2014), encodes the entire input sequence into a single context vector $c$, then decodes it into the output sequence. This became the dominant approach for translation before transformers.
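The four configurations differ mainly in which hidden states are read out and what the network consumes at each step. The sketch below shows this with one toy tanh RNN. Everything in it is an assumption made for illustration: the linear readout `W_y`, the feedback weights `W_fb`, the helper names `step` and `generate`, and the simplification that encoder and decoder share `W_h` (real seq2seq models normally use separate parameters).

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_size, hidden_size, output_size = 5, 3, 4, 2   # illustrative sizes
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_y = rng.normal(scale=0.5, size=(output_size, hidden_size))   # assumed linear readout
W_fb = rng.normal(scale=0.5, size=(hidden_size, output_size))  # assumed output-feedback weights
b = np.zeros(hidden_size)

def step(h, x_t):
    """One recurrent update driven by an external input x_t."""
    return np.tanh(W_h @ h + W_x @ x_t + b)

def generate(h, n_steps):
    """Emit n_steps outputs from state h, feeding each output back as the next input."""
    y = np.zeros(output_size)
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(W_h @ h + W_fb @ y + b)
        y = W_y @ h
        outputs.append(y)
    return np.stack(outputs)

x_seq = rng.normal(size=(T, input_size))

# Many-to-many (synced): one output per input step.
h = np.zeros(hidden_size)
synced_outputs = []
for x_t in x_seq:
    h = step(h, x_t)
    synced_outputs.append(W_y @ h)
synced_outputs = np.stack(synced_outputs)             # (T, output_size)

# Many-to-one: read out only from the final hidden state h_T.
many_to_one_output = W_y @ h                          # (output_size,)

# One-to-many: a single input sets the state, then the model generates on its own.
h_init = step(np.zeros(hidden_size), x_seq[0])
one_to_many_outputs = generate(h_init, n_steps=7)     # (7, output_size)

# Many-to-many (seq2seq): encode everything into a context vector c, then decode.
c = np.zeros(hidden_size)
for x_t in x_seq:
    c = step(c, x_t)                                  # c is the encoder's final state
seq2seq_outputs = generate(c, n_steps=3)              # output length need not equal T
```

Note that the seq2seq decoder sees the input only through `c`, so the whole source sequence must fit in one fixed-size vector.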
Backpropagation Through Time
Training an RNN requires differentiating the loss through the recurrent computation. Backpropagation through time (BPTT) unrolls the RNN into a deep feedforward network and applies standard backprop.
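The gradient of the loss $L$ with respect to the hidden state at time $t$ depends on all future hidden states. For a loss computed from the final state $h_T$:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$

The Jacobian of one step (for tanh, where $1 - h_k^2$ is $\tanh'$ evaluated at the pre-activation):

$$\frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}\left(1 - h_k^2\right) W_h$$

For the gradient to flow from step $T$ back to step $t$, this Jacobian is multiplied $T - t$ times.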
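To make the repeated multiplication concrete, the sketch below builds the Jacobian of the final state with respect to the initial state as a product of one-step tanh Jacobians, diag(1 - h_k^2) W_h, and checks it against central finite differences. The sizes, the random seed, and the helper name `final_state` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_size, hidden_size = 6, 3, 4       # illustrative sizes
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)
x_seq = rng.normal(size=(T, input_size))

def final_state(h0):
    """Run the RNN from h0 over x_seq and return h_T."""
    h = h0
    for x_t in x_seq:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
    return h

# Analytic: accumulate the product of one-step Jacobians for k = 1..T.
h = np.zeros(hidden_size)
J = np.eye(hidden_size)
for x_t in x_seq:
    h = np.tanh(W_h @ h + W_x @ x_t + b)
    J_step = np.diag(1.0 - h**2) @ W_h     # dh_k / dh_{k-1}
    J = J_step @ J                         # later steps multiply on the left

# Numerical check: central finite differences of h_T with respect to h_0.
eps = 1e-6
h0 = np.zeros(hidden_size)
J_fd = np.zeros((hidden_size, hidden_size))
for i in range(hidden_size):
    d = np.zeros(hidden_size)
    d[i] = eps
    J_fd[:, i] = (final_state(h0 + d) - final_state(h0 - d)) / (2 * eps)

print(np.allclose(J, J_fd, atol=1e-6))
```

The two matrices should agree up to finite-difference error, which confirms that backpropagating through T steps amounts to T matrix multiplications by the same form of Jacobian.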
The Vanishing Gradient Problem
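If the spectral norm $\|W_h\|_2 < 1$, or the activations saturate so that $|\tanh'| \ll 1$, the product of Jacobians shrinks exponentially:

$$\left\| \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} \right\| \le \left( \gamma \, \|W_h\|_2 \right)^{T-t} \to 0 \quad \text{when } \gamma \, \|W_h\|_2 < 1$$

where $\gamma$ is an upper bound on $|\tanh'|$ along the sequence.
For tanh: $\tanh'(z) = 1 - \tanh^2(z) \le 1$, with the maximum of 1 reached only at $z = 0$, and it saturates to 0 at large $|z|$. Once activations saturate, $\tanh'(z) \approx 0$, and the entire gradient vanishes.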
The consequence: Gradients from late timesteps do not reach early timesteps. The RNN cannot learn to correlate events separated by many steps. The hidden state encodes only recent history — the network has a short effective memory.
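The exponential decay is easy to observe numerically. The sketch below rescales a random recurrent matrix to spectral norm 0.9 (an arbitrary value chosen for the demonstration) and measures the norm of the Jacobian of h_T with respect to h_0 for increasing T. The helper name `jacobian_norm`, the sizes, and the input distribution are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 16, 8            # illustrative sizes
W_h = rng.normal(size=(hidden_size, hidden_size))
W_h *= 0.9 / np.linalg.norm(W_h, 2)        # rescale so the spectral norm is 0.9
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))

def jacobian_norm(T):
    """Spectral norm of dh_T/dh_0 for a fixed random input sequence of length T."""
    x_seq = np.random.default_rng(1).normal(size=(T, input_size))
    h = np.zeros(hidden_size)
    J = np.eye(hidden_size)
    for x_t in x_seq:
        h = np.tanh(W_h @ h + W_x @ x_t)
        J = np.diag(1.0 - h**2) @ W_h @ J  # multiply in one more one-step Jacobian
    return np.linalg.norm(J, 2)

for T in (1, 5, 10, 20, 50):
    print(f"T={T:3d}  ||dh_T/dh_0|| = {jacobian_norm(T):.3e}  bound 0.9^T = {0.9**T:.3e}")
```

Because tanh' is at most 1, each factor has norm at most 0.9, so the measured norm can never exceed 0.9^T. In practice it falls faster than the bound, since tanh' is usually well below 1.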
The Exploding Gradient Problem
If $\|W_h\|_2 > 1$ and the activations are not saturated, the product can grow exponentially: gradients become enormous, causing weight updates that destabilize training entirely (NaN loss).
Fix: gradient clipping by norm.
Before the optimizer step, if the gradient norm $\|g\|$ exceeds a threshold $\tau$, rescale the gradient:

$$g \leftarrow \tau \, \frac{g}{\|g\|}$$
This is standard practice when training RNNs and transformers. In PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
Why clipping does not fix vanishing gradients: Clipping rescales large gradients downward but never scales small gradients up. The vanishing-gradient problem requires architectural changes, specifically gated architectures (LSTM, GRU; see the next reading).
Bidirectional RNNs
A standard RNN only sees past context at each step. For tasks where future context matters (e.g., part-of-speech tagging, NER), a bidirectional RNN runs two RNNs in opposite directions and concatenates their hidden states:

$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$

This doubles the size of the per-step representation. BERT is also bidirectional, but it gets there through self-attention over the whole sequence, trained with a masked-language-modeling objective, not through recurrence.
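The concatenation can be sketched by running one toy tanh RNN over the sequence and a second, independently parameterized one over the reversed sequence. The helper names `make_params` and `run` and all sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_size, hidden_size = 5, 3, 4       # illustrative sizes

def make_params():
    """Random (W_h, W_x, b) for one direction."""
    return (rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
            rng.normal(scale=0.5, size=(hidden_size, input_size)),
            np.zeros(hidden_size))

def run(x_seq, params):
    """Return the hidden state after each element of x_seq, in the order given."""
    W_h, W_x, b = params
    h = np.zeros(hidden_size)
    states = []
    for x_t in x_seq:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

fwd_params, bwd_params = make_params(), make_params()   # two independent RNNs
x_seq = rng.normal(size=(T, input_size))

h_fwd = run(x_seq, fwd_params)                  # reads x_1 ... x_T
h_bwd = run(x_seq[::-1], bwd_params)[::-1]      # reads x_T ... x_1, then flipped back to time order
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2 * hidden_size)
print(h_bi.shape)   # (5, 8)
```

After the flip, row t of `h_bwd` summarizes inputs t through T, so each row of `h_bi` carries both past and future context for that position.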
PyTorch and TensorFlow
PyTorch (nn.RNN, bidirectional, gradient clipping):
```python
import torch
import torch.nn as nn

# nn.RNN: input_size, hidden_size, num_layers
rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=2,
             batch_first=True,     # input shape: (B, T, input_size)
             nonlinearity='tanh',  # 'relu' also available
             dropout=0.2)          # applied between layers (not on final output)

x = torch.randn(8, 20, 32)         # (batch=8, seq_len=20, input_size=32)
out, h_n = rnn(x)                  # out: (8,20,64)  h_n: (num_layers,8,64)
# out: top-layer hidden state at every timestep
# h_n: hidden state at the final timestep for each layer

# Bidirectional RNN: concatenates forward and backward hidden states
bi_rnn = nn.RNN(32, 64, batch_first=True, bidirectional=True)
bi_out, bi_h_n = bi_rnn(x)         # bi_out: (8,20,128)  bi_h_n: (2,8,64)
# bi_out last dim = 2 * hidden_size

# Gradient clipping to handle exploding gradients
optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-3)
optimizer.zero_grad()
loss = out.sum()                   # placeholder loss on rnn's output (not bi_rnn's)
loss.backward()
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
optimizer.step()
```
TensorFlow / Keras:
```python
import tensorflow as tf

# SimpleRNN: rarely used in practice but matches the vanilla RNN formulation
rnn = tf.keras.layers.SimpleRNN(units=64, return_sequences=True, dropout=0.2)
x = tf.random.normal((8, 20, 32))
out = rnn(x)                       # (8, 20, 64) with return_sequences=True

# Bidirectional wrapper
bi_rnn = tf.keras.layers.Bidirectional(
    tf.keras.layers.SimpleRNN(64, return_sequences=True)
)
bi_out = bi_rnn(x)                 # (8, 20, 128)

# Gradient clipping on the optimizer.
# clipnorm clips each weight's gradient separately; global_clipnorm clips the
# norm over all gradients together, which is closer to PyTorch's clip_grad_norm_.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
```