Regularization Techniques
Neural networks are powerful function approximators—sometimes too powerful. Given enough parameters, a network can memorize the training data perfectly while learning nothing generalizable. This is overfitting: excellent training performance, poor test performance.
Regularization techniques constrain the network to prevent overfitting. They add friction that discourages memorization and encourages learning patterns that generalize. The goal is to find the sweet spot where the network is expressive enough to capture real patterns but constrained enough to ignore noise.
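Most of the penalty-based techniques below share one template: keep the ordinary data-fitting loss and add a weighted penalty on the parameters (writing $\theta$ for the parameters, $\Omega$ for the penalty, and $\lambda$ for its strength):
$L_{\text{total}}(\theta) = L_{\text{data}}(\theta) + \lambda \, \Omega(\theta)$
L2 and L1 regularization are this template with different choices of $\Omega$; dropout, early stopping, and data augmentation constrain training in other ways.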
The Overfitting Problem
A neural network with millions of parameters can fit almost any dataset. If your training set has 50,000 images, a large network could assign a unique weight pattern to recognize each one individually—without learning general features like "edges," "textures," or "shapes."
The training loss would be near zero, but the network would fail on new images it hasn't memorized. This is the fundamental tension in machine learning: we want models complex enough to capture real patterns, but simple enough to generalize beyond the training data.
import numpy as np
# Demonstrate overfitting with polynomial regression
np.random.seed(42)
# True function: simple sine wave
x = np.linspace(0, 2*np.pi, 20)
y_true = np.sin(x)
y_noisy = y_true + np.random.randn(20) * 0.3 # Add noise
# Fit polynomials of different degrees
def fit_polynomial(x, y, degree):
"""Fit polynomial and return predictions."""
coeffs = np.polyfit(x, y, degree)
return np.polyval(coeffs, x)
print("Overfitting with Polynomial Regression")
print("=" * 50)
print("True function: sin(x)")
print("Training data: 20 noisy samples\n")
for degree in [1, 3, 5, 15, 19]:
y_pred = fit_polynomial(x, y_noisy, degree)
train_error = np.mean((y_pred - y_noisy)**2)
true_error = np.mean((y_pred - y_true)**2)
status = ""
if degree <= 3:
status = "(underfitting)"
elif degree >= 15:
status = "(overfitting)"
else:
status = "(good fit)"
print(f"Degree {degree:2d}: train_error={train_error:.4f}, true_error={true_error:.4f} {status}")
print("\nDegree 19 has near-zero training error but high true error!")
print("It memorized the noise instead of learning the pattern.")L2 Regularization (Weight Decay)
L2 regularization adds a penalty proportional to the squared magnitude of the weights:
$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2$
This pushes weights toward zero, preferring simpler models with smaller weights. The intuition: if two models fit the data equally well, prefer the one with smaller weights—it's probably more generalizable.
In neural networks, L2 regularization is often called "weight decay" because, under plain gradient descent, it is equivalent to multiplying the weights by a factor slightly less than 1 at each update.
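To see why, substitute the L2 gradient $2\lambda w$ into a plain SGD step with learning rate $\eta$ (a one-line derivation; the equivalence holds exactly only for vanilla SGD):
$w \leftarrow w - \eta(\nabla L_{\text{data}} + 2\lambda w) = (1 - 2\eta\lambda)\, w - \eta \nabla L_{\text{data}}$
Each update therefore shrinks the weights by the constant factor $(1 - 2\eta\lambda)$ before applying the usual data gradient.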
import numpy as np
def l2_penalty(weights, lambda_reg):
"""L2 regularization penalty."""
return lambda_reg * np.sum(weights**2)
def l2_gradient(weights, lambda_reg):
"""Gradient of L2 penalty."""
return 2 * lambda_reg * weights
# Simulate training with L2 regularization
np.random.seed(42)
weights_no_reg = np.random.randn(5) * 2 # Start with large weights
weights_l2 = weights_no_reg.copy()
lambda_reg = 0.1
lr = 0.1
print("L2 Regularization (Weight Decay)")
print("=" * 50)
print(f"Initial weights: {weights_l2.round(3)}")
print(f"Lambda: {lambda_reg}\n")
# Simulate gradient descent with L2 regularization
# Assume data gradient is small (to isolate regularization effect)
data_gradient = np.array([0.1, -0.05, 0.02, -0.1, 0.05])
for step in range(10):
# Without regularization
weights_no_reg = weights_no_reg - lr * data_gradient
# With L2 regularization
total_gradient = data_gradient + l2_gradient(weights_l2, lambda_reg)
weights_l2 = weights_l2 - lr * total_gradient
if step % 3 == 0:
print(f"Step {step}: no_reg = {weights_no_reg.round(3)}, L2 = {weights_l2.round(3)}")
print(f"\nL2 penalty pulls weights toward zero")
print(f"Final weight magnitudes: no_reg={np.abs(weights_no_reg).sum():.2f}, L2={np.abs(weights_l2).sum():.2f}")L1 Regularization (Sparsity)
L1 regularization penalizes the absolute value of the weights:
$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i |w_i|$
Unlike L2, L1 encourages sparsity: it pushes many weights to exactly zero rather than merely making them small. This can be useful for feature selection: weights that become zero indicate unimportant features.
L1 is less common in deep learning because the non-smooth gradient at zero can cause optimization issues. But it's valuable when you want interpretable, sparse models.
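A common way around the non-smooth point, sketched here rather than taken from this chapter's demo, is to apply the L1 shrinkage as a separate soft-thresholding step after the gradient update; the crude zero-clipping in the demo that follows approximates the same idea:
import numpy as np
def soft_threshold(weights, threshold):
    """Shrink each weight toward zero by `threshold`; weights that would
    cross zero are set to exactly zero (the proximal step for the L1 penalty)."""
    return np.sign(weights) * np.maximum(np.abs(weights) - threshold, 0.0)
w = np.array([2.0, 0.5, -0.3, 0.1, -1.5])
lr, lambda_reg = 0.1, 0.1
for _ in range(20):  # repeated shrinkage steps, no data gradient
    w = soft_threshold(w, lr * lambda_reg)
print(w)  # the smallest weight lands on exactly 0.0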
import numpy as np
def l1_penalty(weights, lambda_reg):
"""L1 regularization penalty."""
return lambda_reg * np.sum(np.abs(weights))
def l1_gradient(weights, lambda_reg):
"""Gradient of L1 penalty (subgradient at 0)."""
return lambda_reg * np.sign(weights)
# Compare L1 vs L2
np.random.seed(42)
weights_l1 = np.array([2.0, 0.5, -0.3, 0.1, -1.5])
weights_l2 = weights_l1.copy()
lambda_reg = 0.1
lr = 0.1
print("L1 vs L2 Regularization")
print("=" * 50)
print(f"Initial weights: {weights_l1}")
print(f"Lambda: {lambda_reg}\n")
# Assume no data gradient (pure regularization effect)
for step in range(20):
# L1 update
weights_l1 = weights_l1 - lr * l1_gradient(weights_l1, lambda_reg)
# Clip to zero (weights don't oscillate around zero)
weights_l1 = np.where(np.abs(weights_l1) < lr * lambda_reg, 0, weights_l1)
# L2 update
weights_l2 = weights_l2 - lr * (2 * lambda_reg * weights_l2)  # L2 gradient: 2*lambda*w
if step % 5 == 0:
n_zero_l1 = np.sum(weights_l1 == 0)
print(f"Step {step:2d}: L1 = {weights_l1.round(3)} ({n_zero_l1} zeros)")
print(f" L2 = {weights_l2.round(3)}")
print("\nL1 drives weights to exactly zero (sparsity)")
print("L2 shrinks weights but rarely makes them exactly zero")Dropout
Dropout randomly sets neurons to zero during training. Each forward pass uses a random subset of neurons, forcing the network to learn redundant representations.
At training time, each neuron is zeroed with probability $p$ (typically 0.5 for hidden layers, lower for input). At inference time all neurons are used; with classic dropout their outputs are scaled by $(1-p)$ to compensate, while the more common inverted dropout (used in the code below) instead scales activations by $1/(1-p)$ during training so that inference needs no adjustment.
Dropout prevents co-adaptation: neurons can't rely on specific other neurons being present, so each must learn useful features independently.
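A quick sanity check on the scaling used below: averaged over many random masks, inverted dropout leaves the expected activation unchanged (the mask count here is an arbitrary choice for the illustration):
import numpy as np
np.random.seed(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p = 0.5
# Many inverted-dropout masks: keep with probability 1-p, then scale by 1/(1-p)
masks = np.random.binomial(1, 1 - p, size=(100000, x.size)) / (1 - p)
print((masks * x).mean(axis=0).round(2))  # close to [1, 2, 3, 4, 5]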
import numpy as np
def dropout(x, p=0.5, training=True):
"""
Apply dropout to activations.
p: probability of dropping (zeroing) each neuron
"""
if not training:
return x # No dropout at inference
# Create random mask
mask = np.random.binomial(1, 1-p, size=x.shape)
# Scale by 1/(1-p) so expected value is unchanged
return x * mask / (1 - p)
np.random.seed(42)
# Simulate layer activations
activations = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print("Dropout Regularization")
print("=" * 50)
print(f"Original activations: {activations}")
print(f"Dropout probability: 0.5\n")
print("Training mode (random neurons dropped):")
for i in range(5):
dropped = dropout(activations, p=0.5, training=True)
print(f" Forward pass {i+1}: {dropped.round(2)}")
print(f"\nInference mode (all neurons, no scaling needed with inverted dropout):")
inference = dropout(activations, p=0.5, training=False)
print(f" {inference}")
print("\nEach training pass sees different network 'architectures'")
print("This prevents co-adaptation and improves generalization")Dropout in Practice
Dropout rates vary by layer type. Input layers typically use lower dropout (0.2) to avoid losing too much information. Hidden layers commonly use 0.5. Final layers may use lower dropout or none.
Modern architectures often replace dropout with other techniques like batch normalization, but dropout remains effective for fully connected layers and in domains like NLP.
import numpy as np
def create_dropout_masks(layer_sizes, dropout_rates):
"""Create dropout masks for a network."""
masks = []
for size, rate in zip(layer_sizes, dropout_rates):
if rate > 0:
mask = np.random.binomial(1, 1-rate, size=size) / (1 - rate)
else:
mask = np.ones(size)
masks.append(mask)
return masks
np.random.seed(42)
# Network architecture with typical dropout rates
layer_sizes = [784, 512, 256, 128, 10]
dropout_rates = [0.2, 0.5, 0.5, 0.5, 0.0] # Lower for input, none for output
print("Dropout Rates by Layer")
print("=" * 50)
for i, (size, rate) in enumerate(zip(layer_sizes, dropout_rates)):
layer_type = "input" if i == 0 else "output" if i == len(layer_sizes)-1 else "hidden"
print(f"Layer {i} ({layer_type}): {size} neurons, dropout={rate}")
print("\nSimulating one forward pass:")
masks = create_dropout_masks(layer_sizes, dropout_rates)
for i, (mask, size, rate) in enumerate(zip(masks, layer_sizes, dropout_rates)):
active = np.sum(mask > 0)
print(f" Layer {i}: {active}/{size} neurons active ({100*active/size:.0f}%)")Early Stopping
Early stopping monitors validation loss and stops training when it starts increasing. The model is saved at its best validation performance.
Training loss typically decreases continuously. Validation loss decreases initially, then plateaus or increases as the model starts overfitting. Early stopping captures the model at the generalization sweet spot.
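The simulation below finds the best epoch in hindsight; in a real training loop you stop on the fly with a patience counter. Here is a minimal sketch (the patience value and the toy loss curve are illustrative, not taken from the simulation below):
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch): stop once validation loss has not
    improved for `patience` consecutive epochs; keep weights from best_epoch."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0  # improvement: reset counter
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch
toy_val = [1.0, 0.8, 0.6, 0.55, 0.52, 0.53, 0.55, 0.58, 0.62, 0.67, 0.73]
stop, best = early_stopping_epoch(toy_val, patience=5)
print(f"Stop at epoch {stop}, restore weights from epoch {best}")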
import numpy as np
def simulate_training(epochs, overfit_start=30):
"""Simulate training and validation loss curves."""
train_losses = []
val_losses = []
for e in range(epochs):
# Training loss always decreases
train_loss = 2.0 * np.exp(-e/20) + 0.1
# Validation loss decreases then increases (overfitting)
if e < overfit_start:
val_loss = 2.0 * np.exp(-e/25) + 0.2
else:
val_loss = 0.3 + 0.01 * (e - overfit_start)
train_losses.append(train_loss)
val_losses.append(val_loss)
return train_losses, val_losses
print("Early Stopping")
print("=" * 50)
train_losses, val_losses = simulate_training(60)
best_epoch = np.argmin(val_losses)
best_val_loss = val_losses[best_epoch]
print(f"{'Epoch':<8} {'Train':<10} {'Val':<10} {'Status':<15}")
print("-" * 43)
for e in [0, 10, 20, 30, 40, 50, 59]:
status = ""
if e == best_epoch:
status = "BEST (save model)"
elif e > best_epoch and val_losses[e] > best_val_loss * 1.1:
status = "overfitting"
print(f"{e:<8} {train_losses[e]:<10.4f} {val_losses[e]:<10.4f} {status}")
print(f"\nBest model at epoch {best_epoch} with val_loss={best_val_loss:.4f}")
print("Training continues improving, but validation gets worse after epoch 30")
print("Early stopping prevents overfitting by selecting the best validation model")Data Augmentation
Data augmentation artificially expands the training set by applying transformations that preserve labels. For images: rotation, flipping, cropping, color jitter. For text: synonym replacement, back-translation.
This forces the network to learn invariances. If rotated images still have the same label, the network must learn rotation-invariant features.
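The demonstration below uses placeholder strings; on an actual image array the same idea looks like this (a minimal NumPy sketch with a tiny 3x3 "image"; real pipelines typically use a library such as torchvision or albumentations):
import numpy as np
image = np.arange(9).reshape(3, 3)   # stand-in for a real image
label = "cat"                        # the label is unchanged by the transforms
flipped = np.fliplr(image)           # horizontal flip
rotated = np.rot90(image)            # 90-degree rotation
print("original:\n", image)
print("flipped:\n", flipped, "-> label:", label)
print("rotated:\n", rotated, "-> label:", label)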
import numpy as np
def augment_image(image, transform):
"""Apply transformation to image (simulated)."""
# In practice, these would be actual image operations
return f"{image}_{transform}"
def augment_batch(images, labels, augmentation_factor=4):
"""Augment a batch of images."""
transforms = ['original', 'flip_h', 'flip_v', 'rotate_90', 'rotate_180',
'crop_center', 'brightness_up', 'brightness_down']
augmented_images = []
augmented_labels = []
for img, label in zip(images, labels):
# Original
augmented_images.append(img)
augmented_labels.append(label)
# Random augmentations
selected = np.random.choice(transforms[1:], augmentation_factor-1, replace=False)
for t in selected:
augmented_images.append(augment_image(img, t))
augmented_labels.append(label) # Label stays the same!
return augmented_images, augmented_labels
print("Data Augmentation")
print("=" * 50)
# Original training set
original_images = ['cat_01', 'dog_01', 'bird_01']
original_labels = ['cat', 'dog', 'bird']
print(f"Original training set: {len(original_images)} images")
# Augment
aug_images, aug_labels = augment_batch(original_images, original_labels)
print(f"After augmentation: {len(aug_images)} images")
print("\nAugmented samples:")
for img, label in zip(aug_images[:8], aug_labels[:8]):
print(f" {img} -> {label}")
print("\nThe network sees more variety while labels stay consistent")
print("This teaches invariance to transformations")Label Smoothing
Label smoothing replaces hard labels (0 or 1) with soft labels. With smoothing of 0.1 over three classes, instead of training on [0, 0, 1] for class 2 you train on something like [0.05, 0.05, 0.9].
This prevents the network from becoming overconfident. Hard labels push the network to output extreme probabilities, which can hurt generalization. Soft labels encourage more calibrated confidence.
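One way to see the effect numerically: with hard labels the cross-entropy keeps dropping as the prediction gets more extreme, while with a smoothed target the loss bottoms out at a finite confidence and then rises (the predicted distributions here are made up for illustration):
import numpy as np
def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -np.sum(target * np.log(predicted))
hard = np.array([0.0, 0.0, 1.0])
soft = np.array([0.05, 0.05, 0.9])  # smoothing=0.1, 3 classes (convention used below)
for confidence in [0.9, 0.99, 0.999]:
    pred = np.array([(1 - confidence) / 2, (1 - confidence) / 2, confidence])
    print(f"confidence {confidence}: hard loss={cross_entropy(hard, pred):.4f}, "
          f"smoothed loss={cross_entropy(soft, pred):.4f}")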
import numpy as np
def smooth_labels(labels, num_classes, smoothing=0.1):
"""
Convert hard labels to soft labels.
smoothing=0.1 means true class gets 0.9, others share 0.1
"""
n_samples = len(labels)
soft_labels = np.ones((n_samples, num_classes)) * smoothing / (num_classes - 1)
for i, label in enumerate(labels):
soft_labels[i, label] = 1 - smoothing
return soft_labels
# Example
labels = np.array([0, 1, 2]) # True class indices
num_classes = 3
print("Label Smoothing")
print("=" * 50)
print("Hard labels:")
hard = np.eye(num_classes)[labels]
print(hard)
print("\nSoft labels (smoothing=0.1):")
soft = smooth_labels(labels, num_classes, smoothing=0.1)
print(soft.round(3))
print("\nSoft labels (smoothing=0.2):")
soft2 = smooth_labels(labels, num_classes, smoothing=0.2)
print(soft2.round(3))
print("\nBenefits:")
print("- Prevents overconfident predictions")
print("- Better calibrated probabilities")
print("- Acts as regularization")Combining Regularization Techniques
In practice, you combine multiple regularization techniques. A typical setup might use:
- Weight decay (L2 regularization) in the optimizer
- Dropout in fully connected layers
- Data augmentation for images
- Early stopping based on validation loss
The key is balancing regularization strength. Too little and the model overfits; too much and it underfits.
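As a concrete illustration of how these pieces are wired together, here is a hedged sketch assuming a PyTorch setup (the architecture, hyperparameters, and fake batch are placeholders, not taken from this chapter's NumPy code):
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),  # dropout in FC layers
    nn.Linear(256, 10),
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)        # label smoothing
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3, weight_decay=1e-2)   # decoupled weight decay
# One illustrative training step; augmentation would live in the data pipeline,
# and validation loss after each epoch would drive early stopping.
x = torch.randn(32, 1, 28, 28)       # fake batch of 28x28 images
y = torch.randint(0, 10, (32,))      # fake labels
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {loss.item():.4f}")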
import numpy as np
print("Combined Regularization Strategy")
print("=" * 60)
print("\nTypical Setup for Image Classification:")
print("-" * 60)
print("1. Data Augmentation")
print(" - Random crop, horizontal flip, color jitter")
print(" - RandAugment or AutoAugment for automatic selection")
print("\n2. Model Regularization")
print(" - Weight decay: 1e-4 to 1e-2 (higher for smaller datasets)")
print(" - Dropout: 0.5 for FC layers, less common in convolutions")
print(" - DropPath/Stochastic Depth for residual networks")
print("\n3. Training Regularization")
print(" - Label smoothing: 0.1")
print(" - Mixup or CutMix: blend images during training")
print("\n4. Early Stopping")
print(" - Monitor validation loss")
print(" - Patience: 10-20 epochs without improvement")
print("\nTypical Setup for Language Models:")
print("-" * 60)
print("1. Weight decay: 0.01 to 0.1")
print("2. Dropout: 0.1 in attention and feed-forward")
print("3. Label smoothing: 0.1")
print("4. Gradient clipping: max_norm=1.0")
print("\nGuidelines:")
print("-" * 60)
print("- Start with light regularization")
print("- Increase if validation loss >> training loss (overfitting)")
print("- Decrease if training loss stays high (underfitting)")
print("- More data = less regularization needed")Key Takeaways
- Overfitting occurs when models memorize training data instead of learning generalizable patterns
- L2 regularization (weight decay) penalizes large weights, preferring simpler models
- L1 regularization encourages sparsity, pushing weights to exactly zero
- Dropout randomly zeros neurons during training, preventing co-adaptation
- Early stopping saves the model at best validation performance
- Data augmentation expands training data with label-preserving transformations
- Label smoothing prevents overconfident predictions with soft labels
- Combine techniques: weight decay + dropout + augmentation + early stopping
- Balance regularization strength to avoid both overfitting and underfitting