The Transformer Architecture
The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaced recurrence with self-attention as the primary mechanism for sequence modeling. This architectural shift enabled unprecedented parallelization during training and led to dramatic improvements in translation, language modeling, and eventually nearly every area of natural language processing.
From RNNs to Transformers
Recurrent neural networks process sequences step by step, maintaining hidden state that accumulates information. This sequential nature creates two fundamental limitations: training cannot be parallelized across time steps, and information must traverse many steps to connect distant positions.
Transformers address both limitations by processing all positions simultaneously through self-attention. Every position can directly attend to every other position in a single operation, eliminating the need for sequential processing.
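To make the contrast concrete, the short sketch below (illustrative only, with arbitrary sizes, and not part of the architecture built later in this section) compares an RNN, which must iterate over time steps, with a bare attention computation, which relates all positions in a single batched matrix multiplication.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 128, 64)                    # (batch, seq_len, features)

# RNN: the output at step t depends on step t-1, so the 128 steps run sequentially.
rnn = nn.RNN(64, 64, batch_first=True)
out_rnn, _ = rnn(x)

# Attention: all 128 x 128 pairwise interactions are computed in one matmul.
scores = x @ x.transpose(-2, -1) / 64 ** 0.5   # (1, 128, 128) similarity scores
weights = scores.softmax(dim=-1)               # each row sums to 1
out_attn = weights @ x                         # weighted sum over all positions at once
```

The rest of this section builds up a full transformer in PyTorch, starting with a single block: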
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward with residual connection
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)
        return x
```

Core Components
The transformer consists of several key components working together:
Multi-Head Self-Attention: Allows each position to gather information from all other positions, with multiple heads capturing different relationship types.
Position-wise Feed-Forward Networks: Two-layer networks applied identically to each position, providing non-linear transformation capacity.
Layer Normalization: Stabilizes training by normalizing activations within each layer.
Residual Connections: Enable gradient flow through deep networks by adding inputs to outputs.
```python
class Transformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers,
                 max_len=512, dropout=0.1):
        super().__init__()
        # Embeddings
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        self.dropout = nn.Dropout(dropout)
        # Transformer blocks
        self.layers = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_layers)
        ])
        # Output
        self.norm = nn.LayerNorm(embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)
        self.embed_dim = embed_dim

    def forward(self, x, mask=None):
        batch_size, seq_len = x.shape
        # Token + position embeddings
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        x = self.token_embedding(x) + self.position_embedding(positions)
        x = self.dropout(x)
        # Apply transformer blocks
        for layer in self.layers:
            x = layer(x, mask)
        x = self.norm(x)
        return self.output(x)
```
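A minimal usage sketch, assuming the classes defined above; the hyperparameters are arbitrary illustrative values. Because TransformerBlock delegates to nn.MultiheadAttention, a boolean attn_mask marks positions that may not be attended to, so a causal (decoder-style) mask is the strict upper triangle:

```python
model = Transformer(vocab_size=10000, embed_dim=256, num_heads=8,
                    ff_dim=1024, num_layers=4)

tokens = torch.randint(0, 10000, (2, 32))   # (batch, seq_len) of token ids

# Causal mask for nn.MultiheadAttention: True means "do not attend".
seq_len = tokens.size(1)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

logits = model(tokens, mask=causal_mask)    # vocabulary logits at every position
print(logits.shape)                         # torch.Size([2, 32, 10000])
```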
The Attention Mechanism in Detail

Self-attention computes a weighted sum of values based on query-key compatibility:
- Project input to queries, keys, and values
- Compute attention scores as scaled dot products
- Apply softmax to get attention weights
- Multiply weights by values to get output
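In matrix form, these four steps are the scaled dot-product attention of the original paper:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where d_k is the key dimension (the per-head dimension in multi-head attention). Dividing by sqrt(d_k) keeps the dot products from growing with dimension and pushing the softmax into regions with vanishingly small gradients.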
```python
class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, mask=None):
        B, N, C = x.shape
        # Project to queries, keys, values and split into heads:
        # (B, N, C) -> (3, B, num_heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)
        # Scaled dot-product scores: (B, num_heads, N, N)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        if mask is not None:
            # Positions where mask == 0 are blocked from being attended to
            attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = attn.softmax(dim=-1)
        # Weighted sum of values, with heads merged back: (B, N, C)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(x)
```
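A quick usage sketch for this module (dimensions are arbitrary). Note that its mask convention is the opposite of nn.MultiheadAttention's: because of the mask == 0 check above, a value of 1 marks positions that may be attended to.

```python
attn = SelfAttention(embed_dim=64, num_heads=4)
x = torch.randn(2, 10, 64)                  # (batch, seq_len, embed_dim)

# Causal mask in this module's convention: 1 = allowed, 0 = blocked.
causal = torch.tril(torch.ones(10, 10))     # broadcasts over batch and heads

out = attn(x, mask=causal)
print(out.shape)                            # torch.Size([2, 10, 64])
```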
Feed-Forward Networks

Each transformer block includes a position-wise feed-forward network, typically expanding the dimension by 4x before projecting back:
```python
class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```
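A short sketch of what "position-wise" means in practice (sizes are illustrative): nn.Linear acts on the last dimension, so the same two-layer MLP is applied independently at every position in the sequence.

```python
ffn = FeedForward(embed_dim=512, ff_dim=2048)   # 4x expansion
x = torch.randn(2, 16, 512)                     # (batch, seq_len, embed_dim)

out = ffn(x)                                    # identical weights applied at all 16 positions
print(out.shape)                                # torch.Size([2, 16, 512])
```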
Why Transformers Work

Several factors contribute to transformer effectiveness:
Parallelization: All positions processed simultaneously during training, enabling efficient GPU utilization.
Direct connections: Any two positions connected through a single attention operation, avoiding vanishing gradients over long distances.
Flexibility: Attention patterns learned from data rather than fixed by architecture.
Scalability: Performance improves predictably with more parameters and data.
Key Takeaways
The transformer architecture replaces recurrence with self-attention, enabling parallel training and direct modeling of long-range dependencies. Core components include multi-head attention, feed-forward networks, layer normalization, and residual connections.
This architecture forms the foundation for modern language models including BERT, GPT, and their successors. Understanding transformer fundamentals is essential for working with contemporary NLP systems.