Recurrent Neural Networks
Recurrent neural networks represent a fundamental departure from feedforward architectures by introducing connections that loop back through time, enabling the network to maintain memory of previous inputs when processing sequences. While convolutional networks excel at spatial patterns in fixed-size inputs, recurrent networks are designed for sequential data where the length varies and the order of elements carries meaning. From natural language and speech to time series and genomic sequences, RNNs provide the foundational architecture for learning from ordered data, though they have been largely superseded by transformers for many applications.
The Need for Sequential Processing
Traditional feedforward networks treat each input independently, making them fundamentally unsuitable for sequential data where context matters. Consider the task of predicting the next word in a sentence: "The clouds darkened and it began to _." The appropriate prediction depends entirely on the preceding words. A feedforward network processing each word in isolation cannot capture these dependencies. Similarly, understanding whether a stock price movement is anomalous requires knowledge of recent price history, and recognizing speech requires tracking phoneme sequences over time.
Sequential data exhibits temporal dependencies that span varying distances. Some patterns involve only adjacent elements, like the spelling rules that constrain which letters can follow others. Other patterns span many time steps, like the subject-verb agreement in "The student who excelled in all the advanced mathematics courses was..." where "was" must agree with "student" despite many intervening words. Effective sequence models must capture both short-range and long-range dependencies.
The key insight of recurrent networks is that the same operation can be applied repeatedly at each time step, with information passed from one step to the next through a hidden state. This hidden state acts as the network's memory, accumulating relevant information from past inputs to inform processing of the current input. The recurrent structure enables weight sharing across time steps, making the model applicable to sequences of any length.
import torch
import torch.nn as nn
import numpy as np
# Demonstrate why feedforward fails for sequences
def feedforward_limitation():
"""Show that feedforward networks can't handle variable-length sequences."""
# Feedforward expects fixed input size
ff_network = nn.Sequential(
nn.Linear(10, 32), # Expects exactly 10 inputs
nn.ReLU(),
nn.Linear(32, 1)
)
# Works for length 10
x_10 = torch.randn(1, 10)
print(f"Input length 10: {ff_network(x_10).shape}")
# Fails for length 15
x_15 = torch.randn(1, 15)
try:
ff_network(x_15)
except RuntimeError as e:
print(f"Input length 15: Error - {str(e)[:50]}...")
# Each position processed independently - no context
print("\nFeedforward treats each position independently:")
print("Cannot learn that 'it began to ___' needs rain/snow context")
feedforward_limitation()
The Vanilla RNN Architecture
The basic recurrent neural network, often called vanilla RNN, processes sequences by maintaining a hidden state that evolves at each time step. At time $t$, the network receives the current input $x_t$ and the previous hidden state $h_{t-1}$, combining them to produce a new hidden state $h_t$. This hidden state can then be used to generate an output $y_t$ or passed to the next time step.
The mathematical formulation of a vanilla RNN is:
$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
Here, $W_{xh}$ transforms the input, $W_{hh}$ transforms the previous hidden state, and $W_{hy}$ produces the output. The hyperbolic tangent activation squashes values to the range [-1, 1], preventing hidden state values from growing unboundedly. The same weights are applied at every time step, making the network applicable to sequences of any length.
The hidden state serves as a compressed summary of the sequence seen so far. At each step, the network must decide what information from the new input to incorporate and what to retain from its previous memory. This compression is both the power and limitation of RNNs: they can process arbitrarily long sequences with fixed memory, but that fixed capacity limits what they can remember.
import torch
import torch.nn as nn
class VanillaRNN(nn.Module):
"""
Vanilla RNN implementation from scratch.
"""
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.hidden_size = hidden_size
# Weight matrices
self.W_xh = nn.Linear(input_size, hidden_size) # Input to hidden
self.W_hh = nn.Linear(hidden_size, hidden_size) # Hidden to hidden
self.W_hy = nn.Linear(hidden_size, output_size) # Hidden to output
def forward(self, x, h_prev=None):
"""
Process one time step.
Args:
x: Input at current time step (batch, input_size)
h_prev: Previous hidden state (batch, hidden_size)
Returns:
output: Output at current time step
h: New hidden state
"""
if h_prev is None:
h_prev = torch.zeros(x.size(0), self.hidden_size, device=x.device)
# Combine input and previous hidden state
h = torch.tanh(self.W_xh(x) + self.W_hh(h_prev))
# Compute output
output = self.W_hy(h)
return output, h
def forward_sequence(self, x_seq):
"""
Process entire sequence.
Args:
x_seq: Sequence tensor (batch, seq_len, input_size)
Returns:
outputs: All outputs (batch, seq_len, output_size)
h_final: Final hidden state
"""
batch_size, seq_len, _ = x_seq.shape
h = None
outputs = []
for t in range(seq_len):
output, h = self.forward(x_seq[:, t, :], h)
outputs.append(output)
outputs = torch.stack(outputs, dim=1)
return outputs, h
# Test the vanilla RNN
rnn = VanillaRNN(input_size=10, hidden_size=32, output_size=5)
x_seq = torch.randn(4, 20, 10) # Batch of 4, sequence length 20, 10 features
outputs, h_final = rnn.forward_sequence(x_seq)
print(f"Input sequence: {x_seq.shape}")
print(f"Output sequence: {outputs.shape}")
print(f"Final hidden state: {h_final.shape}")Unfolding Through Time
Understanding how RNNs process sequences becomes clearer when we "unfold" the network through time, visualizing each time step as a separate layer. The unfolded view reveals that an RNN processing a sequence of length $T$ is equivalent to a deep feedforward network with $T$ layers, where all layers share the same weights. This perspective illuminates both the power of RNNs and their training challenges.
When unfolded, the RNN forms a computational graph connecting inputs to outputs through a chain of hidden states. The initial hidden state $h_0$ (often initialized to zeros) connects to $h_1$, which connects to $h_2$, and so on. Each hidden state receives input from both the current input $x_t$ and the previous hidden state, creating paths through which information flows across time.
This unfolded view directly informs how we train RNNs through backpropagation through time (BPTT). Gradients flow backward through the unfolded network, from the loss at the final output back through each time step to the beginning of the sequence. Because each hidden state depends on the previous one, both the forward pass and the backward pass must step through the sequence in order, and the gradient at each time step accumulates contributions from all future time steps, making RNN training inherently sequential.
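To make the weight-sharing claim concrete, here is a minimal check, assuming PyTorch's single-layer nn.RNN with its default tanh activation and its standard parameter attributes (weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0): manually applying the same cell at every time step should reproduce nn.RNN's output, which is exactly the unfolded, shared-weight view described above.
import torch
import torch.nn as nn
# Minimal check: unrolling the recurrence by hand with the SAME weights at
# every step reproduces nn.RNN's output (single layer, default tanh).
torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 6, 4)  # batch=2, seq_len=6, features=4
ref_out, ref_h = rnn(x)  # PyTorch's recurrent forward pass
# Manual unroll: one "layer" per time step, all sharing the same parameters
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0
h = torch.zeros(2, 3)
manual_out = []
for t in range(x.size(1)):
    h = torch.tanh(x[:, t, :] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    manual_out.append(h)
manual_out = torch.stack(manual_out, dim=1)
print("Manual unroll matches nn.RNN:",
      torch.allclose(manual_out, ref_out, atol=1e-6))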
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
def visualize_unfolding():
"""
Demonstrate RNN unfolding conceptually.
"""
print("RNN Unfolding Through Time:")
print("=" * 60)
print()
print("Folded (recurrent) view:")
print(" ┌────────────────┐")
print(" │ │")
print(" │ ┌──────┐ │")
print("x_t ─────►│ RNN │───┼───► y_t")
print(" │ └──────┘ │")
print(" │ ▲ │")
print(" │ │ │")
print(" └─────────┴──────┘")
print(" h_t (loops back)")
print()
print("Unfolded view (for sequence length 4):")
print()
print("x_0 x_1 x_2 x_3")
print(" │ │ │ │")
print(" ▼ ▼ ▼ ▼")
print("┌───┐ ┌───┐ ┌───┐ ┌───┐")
print("│RNN│───►│RNN│───►│RNN│───►│RNN│")
print("└───┘ └───┘ └───┘ └───┘")
print(" │ │ │ │")
print(" ▼ ▼ ▼ ▼")
print("y_0 y_1 y_2 y_3")
print()
print("h_0 ──► h_1 ──► h_2 ──► h_3 ──► h_4")
print()
print("All RNN blocks share the SAME weights!")
print("Gradient flows backward through all time steps (BPTT)")
visualize_unfolding()
# Demonstrate gradient flow
def trace_gradient_flow():
"""Show how gradients flow through time."""
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(1, 5, 10, requires_grad=True)
output, h_n = rnn(x)
# Loss at final time step
loss = output[:, -1, :].sum()
loss.backward()
print("\nGradient flow demonstration:")
print(f"Input shape: {x.shape} (batch=1, seq_len=5, features=10)")
print(f"Input gradient shape: {x.grad.shape}")
print(f"Gradient magnitude at each time step:")
for t in range(5):
grad_mag = x.grad[0, t, :].norm().item()
print(f" t={t}: {grad_mag:.4f}")
trace_gradient_flow()
The Vanishing Gradient Problem
The fundamental limitation of vanilla RNNs emerges from the dynamics of gradient flow through the unfolded network. When gradients backpropagate through many time steps, they pass through repeated matrix multiplications with the hidden-to-hidden weight matrix $W_{hh}$ and the derivative of the activation function. If the largest singular value of $W_{hh}$ is less than 1, gradients shrink exponentially; if greater than 1, they explode exponentially.
The vanishing gradient problem manifests as the network's inability to learn long-range dependencies. When gradients from distant time steps shrink to near zero by the time they reach early time steps, the network cannot effectively credit or blame early inputs for outcomes at the end of the sequence. In practice, vanilla RNNs struggle with dependencies spanning more than 10-20 time steps.
Mathematically, consider the gradient of the loss with respect to the hidden state at time $t$:
$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$
Each factor $\frac{\partial h_k}{\partial h_{k-1}}$ involves $W_{hh}$ and the derivative of tanh, which is at most 1. When these factors consistently have norm less than 1, their product shrinks exponentially toward zero; when they consistently have norm greater than 1, it explodes.
import torch
import torch.nn as nn
import numpy as np
def demonstrate_vanishing_gradients():
"""
Show how gradients vanish in vanilla RNNs.
"""
hidden_size = 50
seq_lengths = [10, 50, 100, 200]
print("Vanishing Gradient Demonstration")
print("=" * 50)
for seq_len in seq_lengths:
# Create RNN and process sequence
rnn = nn.RNN(input_size=10, hidden_size=hidden_size, batch_first=True)
x = torch.randn(1, seq_len, 10, requires_grad=True)
output, _ = rnn(x)
loss = output[:, -1, :].sum()
loss.backward()
# Measure gradient magnitude at each time step
grad_norms = [x.grad[0, t, :].norm().item() for t in range(seq_len)]
# Compare first and last
ratio = grad_norms[0] / grad_norms[-1] if grad_norms[-1] > 0 else float('inf')
print(f"\nSequence length {seq_len}:")
print(f" Gradient at t=0: {grad_norms[0]:.6f}")
print(f" Gradient at t={seq_len-1}: {grad_norms[-1]:.6f}")
print(f" Ratio (first/last): {ratio:.2f}x")
# Theoretical analysis
print("\n" + "=" * 50)
print("Theoretical Analysis:")
print("For tanh activation, derivative is in (0, 1]")
print("If ||W_hh|| < 1, gradients vanish exponentially")
print("After T steps: gradient ≈ (||W_hh|| * tanh_derivative)^T")
W = torch.randn(hidden_size, hidden_size) * 0.4 / hidden_size ** 0.5 # Scaled so the largest singular value is below 1
singular_values = torch.linalg.svdvals(W)
print(f"\nExample W_hh largest singular value: {singular_values[0]:.3f}")
for T in [10, 50, 100]:
decay = (singular_values[0].item() * 0.5) ** T # Approximate, assuming an average tanh derivative of about 0.5
print(f" After {T} steps: gradient scale ≈ {decay:.2e}")
demonstrate_vanishing_gradients()
Gradient Clipping
While the vanishing gradient problem leads to slow learning of long-range dependencies, the exploding gradient problem causes training instability where gradients become so large that weight updates overshoot drastically. Gradient clipping provides a simple but effective solution by rescaling gradients when their norm exceeds a threshold.
The most common approach, global gradient clipping, computes the norm of all gradients concatenated into a single vector $g$ and rescales if this norm exceeds a threshold $\theta$:
$$g \leftarrow \frac{\theta}{\lVert g \rVert}\, g \quad \text{if } \lVert g \rVert > \theta$$
This preserves the direction of the gradient update while limiting its magnitude. The threshold $\theta$ is a hyperparameter, typically set between 1 and 5. Gradient clipping is essential for training RNNs and remains important even for more advanced architectures like LSTMs.
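To tie the formula to code, here is a minimal hand-written sketch of global-norm clipping (the helper name clip_gradients_ is illustrative, not a library function); the demonstration below uses PyTorch's built-in torch.nn.utils.clip_grad_norm_, which performs the same rescaling.
import torch
def clip_gradients_(parameters, threshold):
    """Sketch of global gradient clipping: rescale all gradients in place
    if their combined norm exceeds the threshold."""
    grads = [p.grad for p in parameters if p.grad is not None]
    # Norm of all gradients treated as one concatenated vector
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > threshold:
        scale = threshold / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)  # Shrink the magnitude, keep the direction
    return total_norm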
import torch
import torch.nn as nn
def demonstrate_gradient_clipping():
"""
Show gradient clipping in action.
"""
# Create a simple RNN
rnn = nn.RNN(10, 20, batch_first=True)
x = torch.randn(1, 100, 10)
target = torch.randn(1, 100, 20)
# Forward pass
output, _ = rnn(x)
loss = nn.MSELoss()(output, target)
# Backward pass
loss.backward()
# Compute gradient norm before clipping
total_norm_before = 0
for p in rnn.parameters():
if p.grad is not None:
total_norm_before += p.grad.norm().item() ** 2
total_norm_before = total_norm_before ** 0.5
print("Gradient Clipping Demonstration")
print("=" * 50)
print(f"Gradient norm before clipping: {total_norm_before:.4f}")
# Apply gradient clipping
max_norm = 1.0
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm)
# Compute gradient norm after clipping
total_norm_after = 0
for p in rnn.parameters():
if p.grad is not None:
total_norm_after += p.grad.norm().item() ** 2
total_norm_after = total_norm_after ** 0.5
print(f"Gradient norm after clipping: {total_norm_after:.4f}")
print(f"Clipping threshold: {max_norm}")
print(f"Clipped: {total_norm_before > max_norm}")
demonstrate_gradient_clipping()
# Gradient clipping in training loop
def train_with_clipping(model, optimizer, data_loader, clip_value=1.0):
"""Training loop with gradient clipping."""
model.train()
total_loss = 0
for batch_x, batch_y in data_loader:
optimizer.zero_grad()
output, _ = model(batch_x)
loss = nn.CrossEntropyLoss()(output.view(-1, output.size(-1)),
batch_y.view(-1))
loss.backward()
# Clip gradients before optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
optimizer.step()
total_loss += loss.item()
return total_loss / len(data_loader)
RNN Variants and Configurations
RNNs can be configured in various ways depending on the task. The relationship between input and output sequences determines the architecture: many-to-one for classification of entire sequences, one-to-many for generation from a single seed, many-to-many for sequence labeling or translation. Stacking multiple RNN layers creates deeper networks that can learn more complex transformations.
Bidirectional RNNs process sequences in both forward and backward directions, concatenating hidden states from both passes. This allows each position to access context from both past and future, beneficial for tasks like named entity recognition where understanding the full sentence helps identify entities. Bidirectional processing is only possible when the full sequence is available before generating outputs.
Deep RNNs stack multiple recurrent layers, where the output sequence of one layer becomes the input to the next. Each layer can learn different levels of abstraction, similar to layers in feedforward networks. Residual connections between layers help gradient flow in deep stacks.
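As a sketch of that last idea (the class ResidualStackedRNN below is illustrative, not part of PyTorch), each recurrent layer's output is added back to its input. This requires the layer width to stay constant across the stack, so an initial linear projection maps the input features to the hidden size before the residual layers.
import torch
import torch.nn as nn
class ResidualStackedRNN(nn.Module):
    """Stacked RNN with residual connections between recurrent layers."""
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.input_proj = nn.Linear(input_size, hidden_size)
        self.layers = nn.ModuleList(
            nn.RNN(hidden_size, hidden_size, batch_first=True)
            for _ in range(num_layers)
        )
    def forward(self, x):
        h = self.input_proj(x)  # Match dimensions for the residual sums
        for layer in self.layers:
            out, _ = layer(h)
            h = h + out  # Residual connection around each recurrent layer
        return h
res_rnn = ResidualStackedRNN(input_size=32, hidden_size=64, num_layers=3)
y = res_rnn(torch.randn(4, 20, 32))
print(f"Residual stacked RNN output: {y.shape}")  # (4, 20, 64)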
import torch
import torch.nn as nn
# Different RNN configurations
# Many-to-one: Sequence classification
class SentimentClassifier(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_size, num_classes):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
self.classifier = nn.Linear(hidden_size, num_classes)
def forward(self, x):
embedded = self.embedding(x)
_, h_n = self.rnn(embedded) # Only use final hidden state
return self.classifier(h_n.squeeze(0))
# One-to-many: Sequence generation
class TextGenerator(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_size):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
self.output = nn.Linear(hidden_size, vocab_size)
def forward(self, seed, length):
# Start with seed token
x = self.embedding(seed)
h = None
outputs = []
for _ in range(length):
out, h = self.rnn(x, h)
logits = self.output(out)
next_token = logits.argmax(dim=-1)
outputs.append(next_token)
x = self.embedding(next_token)
return torch.cat(outputs, dim=1)
# Many-to-many: Sequence labeling
class POSTagger(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_size, num_tags):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
self.tagger = nn.Linear(hidden_size, num_tags)
def forward(self, x):
embedded = self.embedding(x)
output, _ = self.rnn(embedded) # Output at each position
return self.tagger(output)
# Bidirectional RNN
class BiRNNEncoder(nn.Module):
def __init__(self, input_size, hidden_size):
super().__init__()
self.rnn = nn.RNN(input_size, hidden_size,
bidirectional=True, batch_first=True)
def forward(self, x):
# Output has 2*hidden_size features (forward + backward)
output, h_n = self.rnn(x)
# h_n shape: (2, batch, hidden) for forward and backward
return output, h_n
# Deep (stacked) RNN
class DeepRNN(nn.Module):
def __init__(self, input_size, hidden_size, num_layers):
super().__init__()
self.rnn = nn.RNN(input_size, hidden_size,
num_layers=num_layers, batch_first=True)
def forward(self, x):
# h_n shape: (num_layers, batch, hidden)
output, h_n = self.rnn(x)
return output, h_n
# Test configurations
print("RNN Configuration Examples:")
print("=" * 50)
x = torch.randn(4, 20, 32) # batch=4, seq_len=20, features=32
bi_rnn = BiRNNEncoder(32, 64)
output, h_n = bi_rnn(x)
print(f"\nBidirectional RNN:")
print(f" Input: {x.shape}")
print(f" Output: {output.shape} (hidden*2 = 128)")
print(f" Hidden: {h_n.shape} (2 directions)")
deep_rnn = DeepRNN(32, 64, num_layers=3)
output, h_n = deep_rnn(x)
print(f"\nDeep RNN (3 layers):")
print(f" Input: {x.shape}")
print(f" Output: {output.shape}")
print(f" Hidden: {h_n.shape} (3 layers)")Using PyTorch's RNN Module
PyTorch provides optimized RNN implementations that handle batching, padding, and GPU acceleration efficiently. The nn.RNN, nn.LSTM, and nn.GRU modules support variable-length sequences through packed sequences, multiple layers, bidirectional processing, and dropout between layers.
When working with variable-length sequences in batches, sequences must be padded to the same length but we want the RNN to ignore padding tokens. PyTorch's pack_padded_sequence and pad_packed_sequence utilities handle this efficiently, ensuring the RNN only processes actual tokens.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
# Using PyTorch's RNN with packed sequences
class EfficientRNN(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_size, num_layers=1,
bidirectional=False, dropout=0.0):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.rnn = nn.RNN(
input_size=embed_dim,
hidden_size=hidden_size,
num_layers=num_layers,
batch_first=True,
bidirectional=bidirectional,
dropout=dropout if num_layers > 1 else 0
)
def forward(self, x, lengths):
"""
Forward pass with variable-length sequences.
Args:
x: Padded input (batch, max_seq_len)
lengths: Actual lengths of each sequence
"""
# Embed tokens
embedded = self.embedding(x)
# Pack sequences for efficient processing
packed = pack_padded_sequence(embedded, lengths.cpu(),
batch_first=True, enforce_sorted=False)
# Process with RNN
packed_output, h_n = self.rnn(packed)
# Unpack back to padded tensor
output, _ = pad_packed_sequence(packed_output, batch_first=True)
return output, h_n
# Example with variable-length sequences
model = EfficientRNN(vocab_size=1000, embed_dim=64, hidden_size=128)
# Batch with different lengths (padded with 0)
sequences = torch.tensor([
[1, 2, 3, 4, 5, 0, 0, 0], # Length 5
[1, 2, 3, 0, 0, 0, 0, 0], # Length 3
[1, 2, 3, 4, 5, 6, 7, 8], # Length 8
[1, 2, 0, 0, 0, 0, 0, 0], # Length 2
])
lengths = torch.tensor([5, 3, 8, 2])
output, h_n = model(sequences, lengths)
print(f"Variable-length sequence handling:")
print(f" Input shape: {sequences.shape}")
print(f" Lengths: {lengths.tolist()}")
print(f" Output shape: {output.shape}")
print(f" Hidden shape: {h_n.shape}")Key Takeaways
Recurrent neural networks process sequential data by maintaining hidden states that carry information across time steps, enabling learning from variable-length sequences where order matters. The vanilla RNN applies the same transformation at each step, combining the current input with the previous hidden state through learned weight matrices. When unfolded through time, RNNs reveal their structure as deep networks with shared weights, trained through backpropagation through time. The vanishing gradient problem severely limits vanilla RNNs' ability to learn long-range dependencies, as gradients shrink exponentially when backpropagating through many time steps. Gradient clipping addresses the complementary exploding gradient problem by rescaling large gradients. Various RNN configurations including bidirectional and deep stacked architectures adapt the basic structure for different tasks. Despite their historical importance, vanilla RNNs have been largely replaced by LSTMs, GRUs, and more recently transformers, which better address the long-range dependency problem.