
Chapter 10: Training Deep Networks

Initialization, optimizers, regularization, and experiment tracking.

Learning Objectives

["Initialize weights properly", "Use modern optimizers", "Apply regularization techniques"]


10.3 Regularization Techniques

Regularization Techniques

Neural networks are powerful function approximators—sometimes too powerful. Given enough parameters, a network can memorize the training data perfectly while learning nothing generalizable. This is overfitting: excellent training performance, poor test performance.

Regularization techniques constrain the network to prevent overfitting. They add friction that discourages memorization and encourages learning patterns that generalize. The goal is to find the sweet spot where the network is expressive enough to capture real patterns but constrained enough to ignore noise.

The Overfitting Problem

A neural network with millions of parameters can fit almost any dataset. If your training set has 50,000 images, a large network could assign a unique weight pattern to recognize each one individually—without learning general features like "edges," "textures," or "shapes."

The training loss would be near zero, but the network would fail on new images it hasn't memorized. This is the fundamental tension in machine learning: we want models complex enough to capture real patterns, but simple enough to generalize beyond the training data.

PYTHON
import numpy as np

# Demonstrate overfitting with polynomial regression
np.random.seed(42)

# True function: simple sine wave
x = np.linspace(0, 2*np.pi, 20)
y_true = np.sin(x)
y_noisy = y_true + np.random.randn(20) * 0.3  # Add noise

# Fit polynomials of different degrees
def fit_polynomial(x, y, degree):
    """Fit polynomial and return predictions."""
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x)

print("Overfitting with Polynomial Regression")
print("=" * 50)
print("True function: sin(x)")
print("Training data: 20 noisy samples\n")

for degree in [1, 3, 5, 15, 19]:
    y_pred = fit_polynomial(x, y_noisy, degree)
    train_error = np.mean((y_pred - y_noisy)**2)
    true_error = np.mean((y_pred - y_true)**2)

    status = ""
    if degree <= 3:
        status = "(underfitting)"
    elif degree >= 15:
        status = "(overfitting)"
    else:
        status = "(good fit)"

    print(f"Degree {degree:2d}: train_error={train_error:.4f}, true_error={true_error:.4f} {status}")

print("\nDegree 19 has near-zero training error but high true error!")
print("It memorized the noise instead of learning the pattern.")

L2 Regularization (Weight Decay)

L2 regularization adds a penalty proportional to the squared magnitude of weights:

$$L_{total} = L_{data} + \lambda \sum_i w_i^2$$

This pushes weights toward zero, preferring simpler models with smaller weights. The intuition: if two models fit the data equally well, prefer the one with smaller weights—it's probably more generalizable.

In neural networks, L2 regularization is often called "weight decay" because it's equivalent to multiplying weights by a factor slightly less than 1 each update.
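
To see why, write out a single gradient descent step with learning rate $\eta$ on the regularized loss (a quick derivation using the notation above):

$$w \leftarrow w - \eta\left(\frac{\partial L_{data}}{\partial w} + 2\lambda w\right) = (1 - 2\eta\lambda)\,w - \eta\frac{\partial L_{data}}{\partial w}$$

Each update first shrinks the weight by the constant factor $(1 - 2\eta\lambda)$, slightly less than 1, before applying the usual data gradient.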

PYTHON
import numpy as np

def l2_penalty(weights, lambda_reg):
    """L2 regularization penalty."""
    return lambda_reg * np.sum(weights**2)

def l2_gradient(weights, lambda_reg):
    """Gradient of L2 penalty."""
    return 2 * lambda_reg * weights

# Simulate training with L2 regularization
np.random.seed(42)

weights_no_reg = np.random.randn(5) * 2  # Start with large weights
weights_l2 = weights_no_reg.copy()
lambda_reg = 0.1
lr = 0.1

print("L2 Regularization (Weight Decay)")
print("=" * 50)
print(f"Initial weights: {weights_l2.round(3)}")
print(f"Lambda: {lambda_reg}\n")

# Simulate gradient descent with L2 regularization
# Assume data gradient is small (to isolate regularization effect)
data_gradient = np.array([0.1, -0.05, 0.02, -0.1, 0.05])

for step in range(10):
    # Without regularization
    weights_no_reg = weights_no_reg - lr * data_gradient

    # With L2 regularization
    total_gradient = data_gradient + l2_gradient(weights_l2, lambda_reg)
    weights_l2 = weights_l2 - lr * total_gradient

    if step % 3 == 0:
        print(f"Step {step}: no_reg = {weights_no_reg.round(3)}, L2 = {weights_l2.round(3)}")

print(f"\nL2 penalty pulls weights toward zero")
print(f"Final weight magnitudes: no_reg={np.abs(weights_no_reg).sum():.2f}, L2={np.abs(weights_l2).sum():.2f}")

L1 Regularization (Sparsity)

L1 regularization penalizes the absolute value of weights:

$$L_{total} = L_{data} + \lambda \sum_i |w_i|$$

Unlike L2, L1 encourages sparsity: it pushes many weights to exactly zero rather than merely making them small. This can be useful for feature selection: weights that become zero indicate unimportant features.

L1 is less common in deep learning because the non-smooth gradient at zero can cause optimization issues. But it's valuable when you want interpretable, sparse models.

PYTHON
import numpy as np

def l1_penalty(weights, lambda_reg):
    """L1 regularization penalty."""
    return lambda_reg * np.sum(np.abs(weights))

def l1_gradient(weights, lambda_reg):
    """Gradient of L1 penalty (subgradient at 0)."""
    return lambda_reg * np.sign(weights)

# Compare L1 vs L2
np.random.seed(42)

weights_l1 = np.array([2.0, 0.5, -0.3, 0.1, -1.5])
weights_l2 = weights_l1.copy()
lambda_reg = 0.1
lr = 0.1

print("L1 vs L2 Regularization")
print("=" * 50)
print(f"Initial weights: {weights_l1}")
print(f"Lambda: {lambda_reg}\n")

# Assume no data gradient (pure regularization effect)
for step in range(20):
    # L1 update
    weights_l1 = weights_l1 - lr * l1_gradient(weights_l1, lambda_reg)
    # Clip to zero (weights don't oscillate around zero)
    weights_l1 = np.where(np.abs(weights_l1) < lr * lambda_reg, 0, weights_l1)

    # L2 update (gradient of lambda * sum(w^2) is 2 * lambda * w)
    weights_l2 = weights_l2 - lr * 2 * lambda_reg * weights_l2

    if step % 5 == 0:
        n_zero_l1 = np.sum(weights_l1 == 0)
        print(f"Step {step:2d}: L1 = {weights_l1.round(3)} ({n_zero_l1} zeros)")
        print(f"         L2 = {weights_l2.round(3)}")

print("\nL1 drives weights to exactly zero (sparsity)")
print("L2 shrinks weights but rarely makes them exactly zero")

Dropout

Dropout randomly sets neurons to zero during training. Each forward pass uses a random subset of neurons, forcing the network to learn redundant representations.

At training time, each neuron is zeroed with probability $p$ (typically 0.5 for hidden layers, lower for input). At inference time, all neurons are used; with classic dropout their outputs are scaled by $(1-p)$ to compensate, while the more common inverted dropout (used in the code below) instead scales the surviving activations by $1/(1-p)$ during training, so no adjustment is needed at inference.

Dropout prevents co-adaptation: neurons can't rely on specific other neurons being present, so each must learn useful features independently.

PYTHON
import numpy as np

def dropout(x, p=0.5, training=True):
    """
    Apply dropout to activations.
    p: probability of dropping (zeroing) each neuron
    """
    if not training:
        return x  # No dropout at inference

    # Create random mask
    mask = np.random.binomial(1, 1-p, size=x.shape)

    # Scale by 1/(1-p) so expected value is unchanged
    return x * mask / (1 - p)

np.random.seed(42)

# Simulate layer activations
activations = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print("Dropout Regularization")
print("=" * 50)
print(f"Original activations: {activations}")
print(f"Dropout probability: 0.5\n")

print("Training mode (random neurons dropped):")
for i in range(5):
    dropped = dropout(activations, p=0.5, training=True)
    print(f"  Forward pass {i+1}: {dropped.round(2)}")

print(f"\nInference mode (all neurons, no scaling needed with inverted dropout):")
inference = dropout(activations, p=0.5, training=False)
print(f"  {inference}")

print("\nEach training pass sees different network 'architectures'")
print("This prevents co-adaptation and improves generalization")

Dropout in Practice

Dropout rates vary by layer type. Input layers typically use lower dropout (0.2) to avoid losing too much information. Hidden layers commonly use 0.5. Final layers may use lower dropout or none.

Modern architectures often replace dropout with other techniques like batch normalization, but dropout remains effective for fully connected layers and in domains like NLP.

PYTHON
import numpy as np

def create_dropout_masks(layer_sizes, dropout_rates):
    """Create dropout masks for a network."""
    masks = []
    for size, rate in zip(layer_sizes, dropout_rates):
        if rate > 0:
            mask = np.random.binomial(1, 1-rate, size=size) / (1 - rate)
        else:
            mask = np.ones(size)
        masks.append(mask)
    return masks

np.random.seed(42)

# Network architecture with typical dropout rates
layer_sizes = [784, 512, 256, 128, 10]
dropout_rates = [0.2, 0.5, 0.5, 0.5, 0.0]  # Lower for input, none for output

print("Dropout Rates by Layer")
print("=" * 50)

for i, (size, rate) in enumerate(zip(layer_sizes, dropout_rates)):
    layer_type = "input" if i == 0 else "output" if i == len(layer_sizes)-1 else "hidden"
    print(f"Layer {i} ({layer_type}): {size} neurons, dropout={rate}")

print("\nSimulating one forward pass:")
masks = create_dropout_masks(layer_sizes, dropout_rates)
for i, (mask, size, rate) in enumerate(zip(masks, layer_sizes, dropout_rates)):
    active = np.sum(mask > 0)
    print(f"  Layer {i}: {active}/{size} neurons active ({100*active/size:.0f}%)")

Early Stopping

Early stopping monitors validation loss and stops training when it starts increasing. The model is saved at its best validation performance.

Training loss typically decreases continuously. Validation loss decreases initially, then plateaus or increases as the model starts overfitting. Early stopping captures the model at the generalization sweet spot.

PYTHON
import numpy as np

def simulate_training(epochs, overfit_start=30):
    """Simulate training and validation loss curves."""
    train_losses = []
    val_losses = []

    for e in range(epochs):
        # Training loss always decreases
        train_loss = 2.0 * np.exp(-e/20) + 0.1

        # Validation loss decreases then increases (overfitting)
        if e < overfit_start:
            val_loss = 2.0 * np.exp(-e/25) + 0.2
        else:
            val_loss = 0.3 + 0.01 * (e - overfit_start)

        train_losses.append(train_loss)
        val_losses.append(val_loss)

    return train_losses, val_losses

print("Early Stopping")
print("=" * 50)

train_losses, val_losses = simulate_training(60)

best_epoch = np.argmin(val_losses)
best_val_loss = val_losses[best_epoch]

print(f"{'Epoch':<8} {'Train':<10} {'Val':<10} {'Status':<15}")
print("-" * 43)

for e in [0, 10, 20, 30, 40, 50, 59]:
    status = ""
    if e == best_epoch:
        status = "BEST (save model)"
    elif e > best_epoch and val_losses[e] > best_val_loss * 1.1:
        status = "overfitting"

    print(f"{e:<8} {train_losses[e]:<10.4f} {val_losses[e]:<10.4f} {status}")

print(f"\nBest model at epoch {best_epoch} with val_loss={best_val_loss:.4f}")
print("Training continues improving, but validation gets worse after epoch 30")
print("Early stopping prevents overfitting by selecting the best validation model")

Data Augmentation

Data augmentation artificially expands the training set by applying transformations that preserve labels. For images: rotation, flipping, cropping, color jitter. For text: synonym replacement, back-translation.

This forces the network to learn invariances. If rotated images still have the same label, the network must learn rotation-invariant features.

PYTHON
import numpy as np

def augment_image(image, transform):
    """Apply transformation to image (simulated)."""
    # In practice, these would be actual image operations
    return f"{image}_{transform}"

def augment_batch(images, labels, augmentation_factor=4):
    """Augment a batch of images."""
    transforms = ['original', 'flip_h', 'flip_v', 'rotate_90', 'rotate_180',
                  'crop_center', 'brightness_up', 'brightness_down']

    augmented_images = []
    augmented_labels = []

    for img, label in zip(images, labels):
        # Original
        augmented_images.append(img)
        augmented_labels.append(label)

        # Random augmentations
        selected = np.random.choice(transforms[1:], augmentation_factor-1, replace=False)
        for t in selected:
            augmented_images.append(augment_image(img, t))
            augmented_labels.append(label)  # Label stays the same!

    return augmented_images, augmented_labels

print("Data Augmentation")
print("=" * 50)

# Original training set
original_images = ['cat_01', 'dog_01', 'bird_01']
original_labels = ['cat', 'dog', 'bird']

print(f"Original training set: {len(original_images)} images")

# Augment
aug_images, aug_labels = augment_batch(original_images, original_labels)

print(f"After augmentation: {len(aug_images)} images")
print("\nAugmented samples:")
for img, label in zip(aug_images[:8], aug_labels[:8]):
    print(f"  {img} -> {label}")

print("\nThe network sees more variety while labels stay consistent")
print("This teaches invariance to transformations")

Label Smoothing

Label smoothing replaces hard labels (0 or 1) with soft labels. Instead of training on [0, 0, 1] for class 2, you train on something like [0.05, 0.05, 0.9]: the true class gets $1 - \epsilon$ and the remaining probability mass $\epsilon$ is spread evenly over the other classes.

This prevents the network from becoming overconfident. Hard labels push the network to output extreme probabilities, which can hurt generalization. Soft labels encourage more calibrated confidence.

PYTHON
import numpy as np

def smooth_labels(labels, num_classes, smoothing=0.1):
    """
    Convert hard labels to soft labels.
    smoothing=0.1 means true class gets 0.9, others share 0.1
    """
    n_samples = len(labels)
    soft_labels = np.ones((n_samples, num_classes)) * smoothing / (num_classes - 1)

    for i, label in enumerate(labels):
        soft_labels[i, label] = 1 - smoothing

    return soft_labels

# Example
labels = np.array([0, 1, 2])  # True class indices
num_classes = 3

print("Label Smoothing")
print("=" * 50)

print("Hard labels:")
hard = np.eye(num_classes)[labels]
print(hard)

print("\nSoft labels (smoothing=0.1):")
soft = smooth_labels(labels, num_classes, smoothing=0.1)
print(soft.round(3))

print("\nSoft labels (smoothing=0.2):")
soft2 = smooth_labels(labels, num_classes, smoothing=0.2)
print(soft2.round(3))

print("\nBenefits:")
print("- Prevents overconfident predictions")
print("- Better calibrated probabilities")
print("- Acts as regularization")

Combining Regularization Techniques

In practice, you combine multiple regularization techniques. A typical setup might use:

  • Weight decay (L2 regularization) in the optimizer
  • Dropout in fully connected layers
  • Data augmentation for images
  • Early stopping based on validation loss

The key is balancing regularization strength. Too little and the model overfits; too much and it underfits.

PYTHON
import numpy as np

print("Combined Regularization Strategy")
print("=" * 60)

print("\nTypical Setup for Image Classification:")
print("-" * 60)
print("1. Data Augmentation")
print("   - Random crop, horizontal flip, color jitter")
print("   - RandAugment or AutoAugment for automatic selection")

print("\n2. Model Regularization")
print("   - Weight decay: 1e-4 to 1e-2 (higher for smaller datasets)")
print("   - Dropout: 0.5 for FC layers, less common in convolutions")
print("   - DropPath/Stochastic Depth for residual networks")

print("\n3. Training Regularization")
print("   - Label smoothing: 0.1")
print("   - Mixup or CutMix: blend images during training")

print("\n4. Early Stopping")
print("   - Monitor validation loss")
print("   - Patience: 10-20 epochs without improvement")

print("\nTypical Setup for Language Models:")
print("-" * 60)
print("1. Weight decay: 0.01 to 0.1")
print("2. Dropout: 0.1 in attention and feed-forward")
print("3. Label smoothing: 0.1")
print("4. Gradient clipping: max_norm=1.0")

print("\nGuidelines:")
print("-" * 60)
print("- Start with light regularization")
print("- Increase if validation loss >> training loss (overfitting)")
print("- Decrease if training loss stays high (underfitting)")
print("- More data = less regularization needed")

Key Takeaways

  • Overfitting occurs when models memorize training data instead of learning generalizable patterns
  • L2 regularization (weight decay) penalizes large weights, preferring simpler models
  • L1 regularization encourages sparsity, pushing weights to exactly zero
  • Dropout randomly zeros neurons during training, preventing co-adaptation
  • Early stopping saves the model at best validation performance
  • Data augmentation expands training data with label-preserving transformations
  • Label smoothing prevents overconfident predictions with soft labels
  • Combine techniques: weight decay + dropout + augmentation + early stopping
  • Balance regularization strength to avoid both overfitting and underfitting

10.4 Batch Normalization

Batch Normalization

Batch normalization transformed deep learning by solving a fundamental training problem: internal covariate shift. As weights update during training, the distribution of inputs to each layer changes. Each layer must constantly adapt to these shifting distributions, making training slow and unstable.

Batch normalization fixes this by normalizing layer inputs to have zero mean and unit variance. This simple technique dramatically accelerates training, allows higher learning rates, reduces sensitivity to initialization, and even provides regularization.

The Internal Covariate Shift Problem

Consider a deep network during training. When we update the weights of layer 1, the outputs of layer 1 change. Layer 2 receives different inputs than before. Its weights were tuned for the old distribution, so it performs poorly on the new one. Layer 2 must adapt, which changes layer 3's inputs, and so on.

This cascading effect means each layer is chasing a moving target. The deeper the network, the worse the problem. Training becomes slow and unstable, requiring tiny learning rates and careful initialization.

PYTHON
import numpy as np

def simulate_layer_output(weights, inputs):
    """Simulate layer output changing as weights update."""
    return np.dot(inputs, weights)

np.random.seed(42)

# Initial weights and inputs
inputs = np.random.randn(100, 10)  # 100 samples, 10 features
weights = np.random.randn(10, 5)   # 10 inputs, 5 outputs

print("Internal Covariate Shift")
print("=" * 50)

# Compute initial output statistics
output_before = simulate_layer_output(weights, inputs)
print(f"Output stats BEFORE weight update:")
print(f"  Mean: {output_before.mean(axis=0).round(3)}")
print(f"  Std:  {output_before.std(axis=0).round(3)}")

# Simulate weight update (gradient descent step)
weight_change = np.random.randn(10, 5) * 0.5
weights_updated = weights + weight_change

# Compute output with updated weights
output_after = simulate_layer_output(weights_updated, inputs)
print(f"\nOutput stats AFTER weight update:")
print(f"  Mean: {output_after.mean(axis=0).round(3)}")
print(f"  Std:  {output_after.std(axis=0).round(3)}")

print("\nThe next layer's inputs have completely different statistics!")
print("It must constantly adapt to these shifting distributions.")

How Batch Normalization Works

Batch normalization normalizes each feature across the mini-batch to have zero mean and unit variance:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance computed over the mini-batch, and $\epsilon$ is a small constant for numerical stability.

To preserve the network's representational power, batch norm adds learnable parameters: scale ($\gamma$) and shift ($\beta$):

$$y_i = \gamma \hat{x}_i + \beta$$

If the network needs the original distribution, it can learn $\gamma = \sigma$ and $\beta = \mu$.

PYTHON
import numpy as np

def batch_norm(x, gamma, beta, epsilon=1e-5):
    """
    Apply batch normalization.
    x: input with shape (batch_size, features)
    gamma: scale parameter (features,)
    beta: shift parameter (features,)
    """
    # Compute batch statistics
    mu = np.mean(x, axis=0)
    var = np.var(x, axis=0)

    # Normalize
    x_norm = (x - mu) / np.sqrt(var + epsilon)

    # Scale and shift
    y = gamma * x_norm + beta

    return y, mu, var

np.random.seed(42)

# Simulated layer activations with various scales
x = np.random.randn(32, 4)  # Batch of 32, 4 features
x[:, 0] *= 10   # Feature 0 has large scale
x[:, 1] += 5    # Feature 1 has large mean
x[:, 2] *= 0.1  # Feature 2 has small scale
# Feature 3 is standard normal

print("Batch Normalization in Action")
print("=" * 50)

print("Before batch norm:")
print(f"  Means: {x.mean(axis=0).round(3)}")
print(f"  Stds:  {x.std(axis=0).round(3)}")

# Initialize gamma=1, beta=0 (identity transform on normalized values)
gamma = np.ones(4)
beta = np.zeros(4)

y, mu, var = batch_norm(x, gamma, beta)

print("\nAfter batch norm:")
print(f"  Means: {y.mean(axis=0).round(3)}")
print(f"  Stds:  {y.std(axis=0).round(3)}")

print("\nAll features now have ~0 mean and ~1 std")
print("Regardless of their original distributions!")

Batch Norm During Training vs Inference

During training, batch norm computes statistics from the current mini-batch. This works well but creates a problem: at inference time, we might have a single sample or a very small batch. The statistics would be unstable or meaningless.

The solution: during training, batch norm maintains running averages of mean and variance. At inference time, it uses these running statistics instead of computing from the batch.

PYTHON
import numpy as np

class BatchNorm:
    def __init__(self, num_features, momentum=0.1, epsilon=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.epsilon = epsilon
        self.momentum = momentum

        # Running statistics for inference
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            # Compute batch statistics
            mu = np.mean(x, axis=0)
            var = np.var(x, axis=0)

            # Update running statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Use running statistics for inference
            mu = self.running_mean
            var = self.running_var

        # Normalize
        x_norm = (x - mu) / np.sqrt(var + self.epsilon)

        # Scale and shift
        return self.gamma * x_norm + self.beta

np.random.seed(42)

bn = BatchNorm(num_features=3)

print("Training vs Inference Mode")
print("=" * 50)

# Training: process several batches
print("Training phase (accumulating running statistics):")
for i in range(5):
    x = np.random.randn(32, 3) * (i + 1) + i  # Varying distributions
    y = bn.forward(x, training=True)
    print(f"  Batch {i+1}: input mean={x.mean():.2f}, output mean={y.mean():.4f}")

print(f"\nRunning mean: {bn.running_mean.round(3)}")
print(f"Running var:  {bn.running_var.round(3)}")

# Inference: single sample
print("\nInference phase (using running statistics):")
x_single = np.random.randn(1, 3) * 3 + 2
y_single = bn.forward(x_single, training=False)
print(f"  Single sample input: {x_single.round(3)}")
print(f"  Single sample output: {y_single.round(3)}")

Where to Place Batch Norm

The original paper placed batch norm between the linear transformation and the activation function:

$$y = \text{Activation}(\text{BatchNorm}(Wx + b))$$

Modern practice often places it after the activation:

$$y = \text{BatchNorm}(\text{Activation}(Wx + b))$$

Both work well. The key is consistency within your architecture.

Note: when using batch norm, the bias term $b$ becomes redundant—batch norm's $\beta$ serves the same purpose. Many implementations omit the bias when using batch norm.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def batch_norm_simple(x, gamma, beta):
    mu = np.mean(x, axis=0)
    var = np.var(x, axis=0)
    x_norm = (x - mu) / np.sqrt(var + 1e-5)
    return gamma * x_norm + beta

np.random.seed(42)

# Compare placement strategies
x = np.random.randn(32, 10)
W = np.random.randn(10, 5)
gamma = np.ones(5)
beta = np.zeros(5)

print("Batch Norm Placement")
print("=" * 50)

# Original: BN before activation
linear_out = x @ W
bn_out = batch_norm_simple(linear_out, gamma, beta)
y_original = relu(bn_out)
print(f"BN BEFORE activation:")
print(f"  Linear -> BN -> ReLU")
print(f"  Output mean: {y_original.mean():.4f}, std: {y_original.std():.4f}")

# Modern: BN after activation
linear_out = x @ W
relu_out = relu(linear_out)
y_modern = batch_norm_simple(relu_out, gamma, beta)
print(f"\nBN AFTER activation:")
print(f"  Linear -> ReLU -> BN")
print(f"  Output mean: {y_modern.mean():.4f}, std: {y_modern.std():.4f}")

print("\nBoth are valid approaches")
print("After activation is more common in modern architectures")

Benefits of Batch Normalization

Batch norm provides multiple benefits:

  1. Faster training: Normalized activations keep gradients in a healthy range, allowing higher learning rates.
  2. Reduced sensitivity to initialization: Bad initialization is "corrected" by normalization.
  3. Regularization effect: Batch statistics add noise (different statistics per batch), which helps prevent overfitting.
  4. Enables deeper networks: Stable activations make very deep networks trainable.

PYTHON
import numpy as np

def simulate_deep_network(depth, use_batch_norm, x):
    """Simulate forward pass through deep network."""
    activations = []

    for layer in range(depth):
        # Random weights
        W = np.random.randn(x.shape[1], x.shape[1]) * 0.5

        # Linear transformation
        x = x @ W

        # Optionally apply batch norm
        if use_batch_norm:
            mu = np.mean(x, axis=0)
            var = np.var(x, axis=0) + 1e-5
            x = (x - mu) / np.sqrt(var)

        # ReLU activation
        x = np.maximum(0, x)

        activations.append(x.std())

    return activations

np.random.seed(42)
x = np.random.randn(32, 64)

print("Benefits of Batch Normalization")
print("=" * 50)

# Without batch norm
print("\nWithout BatchNorm (20 layers):")
acts_no_bn = simulate_deep_network(20, use_batch_norm=False, x=x.copy())
print(f"  Layer activation stds: {[f'{a:.2e}' for a in acts_no_bn[::5]]}")
if acts_no_bn[-1] < 1e-10 or acts_no_bn[-1] > 1e10:
    print("  Activations vanished or exploded!")

# With batch norm
print("\nWith BatchNorm (20 layers):")
acts_bn = simulate_deep_network(20, use_batch_norm=True, x=x.copy())
print(f"  Layer activation stds: {[f'{a:.2e}' for a in acts_bn[::5]]}")
print("  Activations remain stable throughout!")

print("\nThis is why batch norm enables much deeper networks")

Layer Normalization

Batch normalization depends on batch statistics, which causes problems with small batches or recurrent networks (where batch norm doesn't make sense across time steps).

Layer normalization normalizes across features instead of across the batch. Each sample is normalized independently:

$$\hat{x}_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$

where $\mu_i$ and $\sigma_i$ are computed over features for sample $i$.

Layer norm is the standard choice for transformers and other sequence models.

PYTHON
import numpy as np

def batch_norm(x, epsilon=1e-5):
    """Normalize across batch (axis 0)."""
    mu = np.mean(x, axis=0, keepdims=True)
    var = np.var(x, axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + epsilon)

def layer_norm(x, epsilon=1e-5):
    """Normalize across features (axis 1)."""
    mu = np.mean(x, axis=1, keepdims=True)
    var = np.var(x, axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + epsilon)

np.random.seed(42)

# Input: (batch_size, features) = (4, 8)
x = np.random.randn(4, 8) * np.array([[1], [2], [3], [4]])  # Different scales per sample

print("Batch Norm vs Layer Norm")
print("=" * 50)

print(f"Input shape: {x.shape}")
print(f"Input sample stds: {x.std(axis=1).round(3)}")

# Batch normalization
x_bn = batch_norm(x)
print(f"\nBatch Norm (normalize across batch):")
print(f"  Each feature has mean~0, std~1 across samples")
print(f"  Feature stds: {x_bn.std(axis=0).round(3)}")

# Layer normalization
x_ln = layer_norm(x)
print(f"\nLayer Norm (normalize across features):")
print(f"  Each sample has mean~0, std~1 across features")
print(f"  Sample stds: {x_ln.std(axis=1).round(3)}")

print("\nBatch norm: good for CNNs, depends on batch")
print("Layer norm: good for transformers, independent per sample")

Group Normalization and Instance Normalization

Other normalization variants address specific use cases:

Group normalization divides channels into groups and normalizes within each group. It doesn't depend on batch size, making it suitable for small batches or object detection.

Instance normalization normalizes each sample and each channel independently. It's popular in style transfer where you want to normalize style features without mixing across channels.

PYTHON
import numpy as np

def instance_norm(x, epsilon=1e-5):
    """
    Normalize each sample and channel independently.
    x shape: (batch, channels, height, width)
    """
    # Normalize over spatial dimensions (H, W)
    mu = np.mean(x, axis=(2, 3), keepdims=True)
    var = np.var(x, axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + epsilon)

def group_norm(x, num_groups, epsilon=1e-5):
    """
    Normalize within channel groups.
    x shape: (batch, channels, height, width)
    """
    batch, channels, h, w = x.shape
    x = x.reshape(batch, num_groups, channels // num_groups, h, w)

    # Normalize over (C/G, H, W)
    mu = np.mean(x, axis=(2, 3, 4), keepdims=True)
    var = np.var(x, axis=(2, 3, 4), keepdims=True)
    x = (x - mu) / np.sqrt(var + epsilon)

    return x.reshape(batch, channels, h, w)

np.random.seed(42)

# Simulated feature maps: (batch=2, channels=8, height=4, width=4)
x = np.random.randn(2, 8, 4, 4)

print("Normalization Variants")
print("=" * 50)

print("Instance Norm: normalize each (sample, channel) pair")
x_in = instance_norm(x)
print(f"  Sample 0, Channel 0: mean={x_in[0,0].mean():.4f}, std={x_in[0,0].std():.4f}")
print(f"  Sample 0, Channel 1: mean={x_in[0,1].mean():.4f}, std={x_in[0,1].std():.4f}")

print("\nGroup Norm (2 groups of 4 channels): normalize within groups")
x_gn = group_norm(x, num_groups=2)
print(f"  Sample 0, Channels 0-3 (group 1): std={x_gn[0,:4].std():.4f}")
print(f"  Sample 0, Channels 4-7 (group 2): std={x_gn[0,4:].std():.4f}")

print("\nUse cases:")
print("  Batch Norm: CNNs with large batches")
print("  Layer Norm: Transformers, RNNs")
print("  Instance Norm: Style transfer")
print("  Group Norm: Small batches, detection")

Batch Norm in PyTorch

PyTorch provides easy-to-use normalization layers. Here's how to use them in practice:

PYTHON
import torch
import torch.nn as nn

print("Batch Normalization in PyTorch")
print("=" * 50)

# For fully connected layers: BatchNorm1d
bn1d = nn.BatchNorm1d(num_features=64)
x_fc = torch.randn(32, 64)  # (batch, features)
y_fc = bn1d(x_fc)
print(f"BatchNorm1d: {x_fc.shape} -> {y_fc.shape}")

# For convolutional layers: BatchNorm2d
bn2d = nn.BatchNorm2d(num_features=64)
x_conv = torch.randn(32, 64, 28, 28)  # (batch, channels, H, W)
y_conv = bn2d(x_conv)
print(f"BatchNorm2d: {x_conv.shape} -> {y_conv.shape}")

# Layer norm (for transformers)
ln = nn.LayerNorm(normalized_shape=64)
x_ln = torch.randn(32, 10, 64)  # (batch, seq_len, features)
y_ln = ln(x_ln)
print(f"LayerNorm: {x_ln.shape} -> {y_ln.shape}")

# Example model with batch norm
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

print(f"\nModel with BatchNorm:\n{model}")

# Important: set model.eval() for inference!
model.train()  # Training mode: uses batch statistics
model.eval()   # Eval mode: uses running statistics
print("\nRemember: model.eval() for inference!")

Key Takeaways

  • Internal covariate shift makes deep network training difficult as layer input distributions change
  • Batch normalization normalizes inputs to have zero mean and unit variance, stabilizing training
  • Learnable scale (γ) and shift (β) parameters preserve representational power
  • During training, batch norm uses mini-batch statistics; during inference, it uses running averages
  • Benefits: faster training, higher learning rates, less sensitivity to initialization, regularization
  • Layer norm normalizes across features (good for transformers, doesn't depend on batch)
  • Instance norm normalizes each channel independently (good for style transfer)
  • Group norm normalizes within channel groups (good for small batches)
  • Use model.eval() in PyTorch to switch from batch to running statistics for inference

10.5 Hyperparameter Tuning

Hyperparameter Tuning

Hyperparameters are the settings we configure before training begins: learning rate, batch size, network architecture, and regularization strength. Finding the right hyperparameters can make the difference between a model that barely works and one that achieves state-of-the-art performance.

Hyperparameters vs Parameters

PYTHON
import torch
import torch.nn as nn

# Parameters: learned during training (weights, biases)
# Hyperparameters: set before training

model = nn.Sequential(
    nn.Linear(784, 256),  # 256 hidden units is a hyperparameter
    nn.ReLU(),
    nn.Dropout(0.3),      # 0.3 dropout rate is a hyperparameter
    nn.Linear(256, 10)
)

# Parameters (learned)
num_parameters = sum(p.numel() for p in model.parameters())
print(f"Learnable parameters: {num_parameters:,}")

# Common hyperparameters
hyperparameters = {
    # Architecture
    'num_layers': 2,
    'hidden_size': 256,
    'activation': 'relu',

    # Training
    'learning_rate': 0.001,
    'batch_size': 32,
    'num_epochs': 100,
    'optimizer': 'adam',

    # Regularization
    'dropout_rate': 0.3,
    'weight_decay': 1e-4,

    # Data
    'train_split': 0.8,
    'augmentation': True
}

print("\nHyperparameters:")
for name, value in hyperparameters.items():
    print(f"  {name}: {value}")

Learning Rate: The Most Important Hyperparameter

The learning rate controls how much weights change during each update:

PYTHON
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

def train_with_lr(lr, n_steps=200):
    """Train a simple model and return loss history"""
    torch.manual_seed(42)

    # Simple regression problem
    X = torch.randn(1000, 10)
    y = X[:, 0] * 2 + X[:, 1] - X[:, 2] * 0.5 + torch.randn(1000) * 0.1

    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    losses = []
    for step in range(n_steps):
        optimizer.zero_grad()
        pred = model(X).squeeze()
        loss = criterion(pred, y)
        losses.append(loss.item())
        loss.backward()
        optimizer.step()

    return losses

# Compare different learning rates
learning_rates = [0.0001, 0.001, 0.01, 0.1, 1.0]

plt.figure(figsize=(12, 5))
for lr in learning_rates:
    losses = train_with_lr(lr)
    plt.plot(losses, label=f'LR = {lr}')

plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Effect of Learning Rate on Training')
plt.legend()
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.savefig('learning_rate_comparison.png', dpi=100)
plt.show()

print("Learning Rate Guidelines:")
print("  Too small: Slow convergence")
print("  Too large: Divergence or oscillation")
print("  Just right: Fast, stable convergence")

Learning Rate Finder

A systematic way to find a good learning rate:

PYTHON
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

def lr_finder(model, train_loader, criterion, init_lr=1e-7, final_lr=10, num_steps=100):
    """
    Find optimal learning rate by gradually increasing LR and monitoring loss
    """
    optimizer = optim.SGD(model.parameters(), lr=init_lr)

    # Multiplicative factor
    mult = (final_lr / init_lr) ** (1 / num_steps)

    lrs = []
    losses = []
    best_loss = float('inf')

    model.train()
    iterator = iter(train_loader)

    for step in range(num_steps):
        try:
            X, y = next(iterator)
        except StopIteration:
            iterator = iter(train_loader)
            X, y = next(iterator)

        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)

        # Smoothed loss
        if step == 0:
            smoothed_loss = loss.item()
        else:
            smoothed_loss = 0.98 * smoothed_loss + 0.02 * loss.item()

        # Stop if loss explodes
        if smoothed_loss > 4 * best_loss:
            break

        if smoothed_loss < best_loss:
            best_loss = smoothed_loss

        lrs.append(optimizer.param_groups[0]['lr'])
        losses.append(smoothed_loss)

        loss.backward()
        optimizer.step()

        # Increase learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] *= mult

    return lrs, losses

# Example usage
from torch.utils.data import TensorDataset, DataLoader

# Create dataset
X = torch.randn(1000, 20)
y = torch.randint(0, 5, (1000,))
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Create model
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 5)
)
criterion = nn.CrossEntropyLoss()

# Find learning rate
lrs, losses = lr_finder(model, loader, criterion)

# Plot
plt.figure(figsize=(10, 5))
plt.plot(lrs, losses)
plt.xscale('log')
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.title('Learning Rate Finder')
plt.grid(True, alpha=0.3)

# Suggested LR: a common heuristic is the LR at minimum loss divided by 10
min_idx = np.argmin(losses)
suggested_lr = lrs[min_idx] / 10
plt.axvline(x=suggested_lr, color='r', linestyle='--', label=f'Suggested LR: {suggested_lr:.2e}')
plt.legend()
plt.savefig('lr_finder.png', dpi=100)
plt.show()

print(f"Suggested learning rate: {suggested_lr:.2e}")

Grid Search

Systematically try all combinations of hyperparameters:

PYTHON
import itertools
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                            n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter grid
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
    'learning_rate_init': [0.001, 0.01],
    'alpha': [0.0001, 0.001, 0.01],  # L2 regularization
}

# Calculate total combinations
total_combinations = np.prod([len(v) for v in param_grid.values()])
print(f"Total hyperparameter combinations: {total_combinations}")

# Grid search with cross-validation
model = MLPClassifier(max_iter=500, random_state=42)
grid_search = GridSearchCV(
    model,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")

# All results
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
print("\nTop 5 configurations:")
print(results.nsmallest(5, 'rank_test_score')[['params', 'mean_test_score', 'std_test_score']])

Random Search

Often more efficient than grid search for high-dimensional hyperparameter spaces:

PYTHON
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from scipy.stats import uniform, randint, loguniform

# Define distributions for hyperparameters
param_distributions = {
    'hidden_layer_sizes': [(50,), (100,), (200,), (50, 50), (100, 50), (100, 100)],
    'learning_rate_init': loguniform(1e-4, 1e-1),  # Log-uniform distribution
    'alpha': loguniform(1e-5, 1e-1),
    'batch_size': [16, 32, 64, 128],
}

# Random search
model = MLPClassifier(max_iter=500, random_state=42)
random_search = RandomizedSearchCV(
    model,
    param_distributions,
    n_iter=20,  # Number of random combinations to try
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
print(f"Test score: {random_search.score(X_test, y_test):.4f}")

# Why random search works well:
print("\n" + "="*50)
print("Why Random Search Often Beats Grid Search:")
print("="*50)
print("""
1. Not all hyperparameters are equally important
2. Grid search wastes time on unimportant dimensions
3. Random search explores more unique values of important parameters
4. Same compute budget often gives better results
""")

Bayesian Optimization with Optuna

Optuna uses Bayesian optimization to intelligently explore the hyperparameter space:

PYTHON
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def create_model(trial):
    """Create model with hyperparameters suggested by Optuna"""
    n_layers = trial.suggest_int('n_layers', 1, 3)
    layers = []

    in_features = 20
    for i in range(n_layers):
        out_features = trial.suggest_int(f'n_units_l{i}', 32, 256)
        layers.append(nn.Linear(in_features, out_features))
        layers.append(nn.ReLU())
        dropout = trial.suggest_float(f'dropout_l{i}', 0.1, 0.5)
        layers.append(nn.Dropout(dropout))
        in_features = out_features

    layers.append(nn.Linear(in_features, 3))
    return nn.Sequential(*layers)

def objective(trial):
    """Objective function for Optuna"""
    # Suggest hyperparameters
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64, 128])
    optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'SGD', 'RMSprop'])

    # Create model
    model = create_model(trial)

    # Create optimizer
    if optimizer_name == 'Adam':
        optimizer = optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == 'SGD':
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    else:
        optimizer = optim.RMSprop(model.parameters(), lr=lr)

    criterion = nn.CrossEntropyLoss()

    # Create data loaders
    X_tensor = torch.FloatTensor(X_train)
    y_tensor = torch.LongTensor(y_train)
    train_loader = DataLoader(TensorDataset(X_tensor, y_tensor),
                               batch_size=batch_size, shuffle=True)

    # Training loop
    for epoch in range(20):
        model.train()  # re-enable dropout each epoch (the eval pass below disables it)
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_X)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

        # Pruning: stop unpromising trials early
        model.eval()
        with torch.no_grad():
            val_output = model(torch.FloatTensor(X_test))
            val_acc = (val_output.argmax(dim=1) == torch.LongTensor(y_test)).float().mean()

        trial.report(val_acc, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_acc.item()

# Run optimization
study = optuna.create_study(direction='maximize',
                             pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"\nBest trial:")
print(f"  Value (accuracy): {study.best_trial.value:.4f}")
print(f"  Params: {study.best_trial.params}")

# Visualizations
fig1 = plot_optimization_history(study)
fig1.write_image('optuna_history.png')

fig2 = plot_param_importances(study)
fig2.write_image('optuna_importance.png')

Learning Rate Schedulers

Dynamically adjust learning rate during training:

PYTHON
import torch
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

def visualize_scheduler(scheduler_class, scheduler_kwargs, n_epochs=100):
    """Visualize learning rate schedule"""
    model = torch.nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = scheduler_class(optimizer, **scheduler_kwargs)

    lrs = []
    for epoch in range(n_epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        # Simulate training step
        optimizer.step()
        scheduler.step()

    return lrs

# Different schedulers
schedulers = {
    'StepLR (step=30, gamma=0.1)': (
        optim.lr_scheduler.StepLR,
        {'step_size': 30, 'gamma': 0.1}
    ),
    'ExponentialLR (gamma=0.95)': (
        optim.lr_scheduler.ExponentialLR,
        {'gamma': 0.95}
    ),
    'CosineAnnealingLR (T_max=100)': (
        optim.lr_scheduler.CosineAnnealingLR,
        {'T_max': 100}
    ),
    'OneCycleLR': (
        optim.lr_scheduler.OneCycleLR,
        {'max_lr': 0.1, 'total_steps': 100}
    )
}

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, (name, (sched_class, kwargs)) in zip(axes.flatten(), schedulers.items()):
    lrs = visualize_scheduler(sched_class, kwargs)
    ax.plot(lrs)
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Learning Rate')
    ax.set_title(name)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('lr_schedulers.png', dpi=100)
plt.show()

# ReduceLROnPlateau (requires validation loss)
print("\nReduceLROnPlateau Example:")
print("""
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',      # Minimize validation loss
    factor=0.1,      # Reduce LR by 10x
    patience=10,     # Wait 10 epochs without improvement
    verbose=True
)

# In training loop:
val_loss = validate(model, val_loader)
scheduler.step(val_loss)
""")

Hyperparameter Tuning Best Practices

PYTHON
# Complete hyperparameter tuning workflow
hyperparameter_guide = """
=== Hyperparameter Tuning Best Practices ===

1. START WITH DEFAULTS
   - Use proven defaults (e.g., Adam with lr=0.001)
   - Get a baseline before tuning

2. PRIORITIZE HYPERPARAMETERS
   High impact:
   - Learning rate (most important!)
   - Batch size
   - Network architecture (depth, width)

   Medium impact:
   - Regularization (dropout, weight decay)
   - Optimizer choice

   Lower impact:
   - Activation functions
   - Weight initialization

3. SEARCH STRATEGY
   - Start with coarse random search
   - Refine with finer search in promising regions
   - Use Bayesian optimization for expensive evaluations

4. VALIDATION STRATEGY
   - Always use held-out validation set
   - Use cross-validation for small datasets
   - Never tune on test set!

5. COMMON RANGES TO TRY

   Learning rate: [1e-5, 1e-1] (log scale)
   Batch size: [16, 32, 64, 128, 256]
   Hidden units: [32, 64, 128, 256, 512]
   Dropout: [0.1, 0.2, 0.3, 0.5]
   Weight decay: [1e-5, 1e-4, 1e-3, 1e-2]
   Number of layers: [1, 2, 3, 4, 5]

6. EFFICIENCY TIPS
   - Train on subset first for quick iterations
   - Use early stopping to save time
   - Parallelize trials when possible
   - Use learning rate finder before tuning
"""

print(hyperparameter_guide)

# Quick hyperparameter tuning template
import torch
import torch.nn as nn
import torch.optim as optim

class QuickTuner:
    """Simple hyperparameter tuning framework"""

    def __init__(self, train_loader, val_loader, n_classes):
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.n_classes = n_classes
        self.results = []

    def create_model(self, config):
        layers = []
        in_features = next(iter(self.train_loader))[0].shape[1]

        for i in range(config['n_layers']):
            out_features = config['hidden_size']
            layers.extend([
                nn.Linear(in_features, out_features),
                nn.BatchNorm1d(out_features),
                nn.ReLU(),
                nn.Dropout(config['dropout'])
            ])
            in_features = out_features

        layers.append(nn.Linear(in_features, self.n_classes))
        return nn.Sequential(*layers)

    def train_and_evaluate(self, config, n_epochs=10):
        model = self.create_model(config)
        optimizer = optim.Adam(
            model.parameters(),
            lr=config['lr'],
            weight_decay=config['weight_decay']
        )
        criterion = nn.CrossEntropyLoss()

        for epoch in range(n_epochs):
            model.train()
            for X, y in self.train_loader:
                optimizer.zero_grad()
                loss = criterion(model(X), y)
                loss.backward()
                optimizer.step()

        # Evaluate
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for X, y in self.val_loader:
                pred = model(X).argmax(dim=1)
                correct += (pred == y).sum().item()
                total += len(y)

        return correct / total

    def random_search(self, n_trials=20):
        import random

        for i in range(n_trials):
            config = {
                'lr': 10 ** random.uniform(-4, -1),
                'hidden_size': random.choice([64, 128, 256]),
                'n_layers': random.randint(1, 3),
                'dropout': random.uniform(0.1, 0.5),
                'weight_decay': 10 ** random.uniform(-5, -2)
            }

            acc = self.train_and_evaluate(config)
            self.results.append((config, acc))
            print(f"Trial {i+1}/{n_trials}: acc={acc:.4f}")

        # Best result
        best_config, best_acc = max(self.results, key=lambda x: x[1])
        print(f"\nBest accuracy: {best_acc:.4f}")
        print(f"Best config: {best_config}")
        return best_config

Summary

| Method | Pros | Cons | Best For |
|--------|------|------|----------|
| Grid Search | Exhaustive, reproducible | Exponential cost | Few hyperparameters |
| Random Search | Efficient, parallel | May miss optima | Many hyperparameters |
| Bayesian (Optuna) | Smart exploration | More complex | Expensive evaluations |
| Population-based | Adaptive | Resource intensive | Large-scale training |

Key hyperparameters by priority:

  1. Learning rate - Start with LR finder
  2. Architecture - Depth, width, activation
  3. Regularization - Dropout, weight decay
  4. Optimization - Batch size, optimizer, scheduler

Remember: Good hyperparameters often transfer between similar problems!

10.1 Loss Functions

Loss Functions

Loss functions measure how wrong a neural network's predictions are. They quantify the gap between what the network outputs and what it should output, providing the signal that drives learning. Without a loss function, there's no way to know if one set of weights is better than another.

The choice of loss function shapes what the network learns. Different tasks require different losses: classification needs losses that penalize incorrect class assignments, regression needs losses that measure distance from target values, and generative models need losses that capture distributional differences. Understanding loss functions helps you choose the right one and debug when training goes wrong.

The Role of Loss in Learning

During training, the network makes predictions, the loss function scores those predictions, and backpropagation uses the loss to compute gradients. The optimizer then adjusts weights to reduce the loss. This cycle repeats until the loss is acceptably low.

A good loss function has several properties. It should be differentiable (almost everywhere) so gradients can flow. It should be minimal when predictions are correct and increase as predictions worsen. It should provide useful gradients—not too large, not too small, and pointing in the right direction.

PYTHON
import numpy as np

# Loss function intuition
# The loss measures "how wrong" we are

def squared_error(prediction, target):
    """Simple squared error loss."""
    return (prediction - target) ** 2

# As prediction approaches target, loss approaches 0
target = 5.0
predictions = [0, 2, 4, 4.5, 4.9, 5.0, 5.1, 6, 8]

print("Squared Error Loss")
print("=" * 40)
print(f"Target: {target}")
print(f"{'Prediction':<12} {'Loss':<12} {'Gradient':<12}")
print("-" * 36)

for pred in predictions:
    loss = squared_error(pred, target)
    gradient = 2 * (pred - target)  # d/d(pred) of (pred - target)^2
    print(f"{pred:<12} {loss:<12.4f} {gradient:<12.4f}")

print("\nNotice: loss is 0 when prediction = target")
print("Gradient points toward target (negative when pred < target)")

Mean Squared Error (MSE)

Mean Squared Error is the standard loss for regression. It averages the squared differences between predictions and targets:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Squaring serves two purposes: it makes all errors positive, and it penalizes large errors more than small ones. A prediction that's off by 10 contributes 100 to the loss, while two predictions each off by 5 contribute only 50 total.

The gradient of MSE is proportional to the error: $\frac{\partial}{\partial \hat{y}} = \frac{2}{n}(\hat{y} - y)$. Large errors produce large gradients, pushing the network strongly toward correction. As predictions improve, gradients shrink, allowing fine-tuning.

PYTHON
import numpy as np

def mse_loss(predictions, targets):
    """Mean Squared Error loss."""
    return np.mean((predictions - targets) ** 2)

def mse_gradient(predictions, targets):
    """Gradient of MSE w.r.t. predictions."""
    return 2 * (predictions - targets) / len(targets)

# Example: predicting house prices (in $100k)
targets = np.array([3.0, 4.5, 2.8, 5.2, 3.9])  # True prices
predictions = np.array([2.8, 4.2, 3.0, 4.8, 4.1])  # Network predictions

loss = mse_loss(predictions, targets)
gradients = mse_gradient(predictions, targets)

print("MSE for Regression")
print("=" * 50)
print(f"Targets:     {targets}")
print(f"Predictions: {predictions}")
print(f"Errors:      {predictions - targets}")
print(f"\nMSE Loss: {loss:.4f}")
print(f"Gradients: {gradients.round(4)}")
print("\nGradients push predictions toward targets")

Mean Absolute Error (MAE)

Mean Absolute Error uses absolute differences instead of squared differences:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

MAE is more robust to outliers than MSE. An outlier with error 100 contributes 100 to MAE but 10,000 to MSE. If your data has outliers you can't remove, MAE might be better.

The downside: MAE's gradient is constant (±1/n), not proportional to error size. The network gets the same push whether it's far from the target or close. This can make final convergence slower.

PYTHON
import numpy as np

def mae_loss(predictions, targets):
    """Mean Absolute Error loss."""
    return np.mean(np.abs(predictions - targets))

def mse_loss(predictions, targets):
    """Mean Squared Error loss."""
    return np.mean((predictions - targets) ** 2)

# Compare MAE vs MSE with outliers
targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # One outlier!
predictions = np.array([1.1, 2.2, 2.8, 3.9, 5.0])  # Reasonable predictions

mae = mae_loss(predictions, targets)
mse = mse_loss(predictions, targets)

print("MAE vs MSE with Outliers")
print("=" * 50)
print(f"Targets:     {targets}")
print(f"Predictions: {predictions}")
print(f"\nErrors: {predictions - targets}")
print(f"\nMAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print("\nThe outlier (target=100, pred=5) dominates MSE!")
print("MSE = 1805, but without outlier it would be ~0.03")
print("MAE is much less affected by the outlier")

Cross-Entropy Loss

Cross-entropy is the standard loss for classification. For binary classification:

$$\text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$

For multi-class classification with one-hot targets:

$$\text{CE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})$$

Cross-entropy measures how different two probability distributions are. When the predicted distribution matches the true distribution, cross-entropy is minimized. It heavily penalizes confident wrong predictions: predicting 0.01 probability for the true class costs much more than predicting 0.4.

PYTHON
import numpy as np

def binary_cross_entropy(predictions, targets, epsilon=1e-15):
    """Binary cross-entropy loss."""
    # Clip predictions to avoid log(0)
    predictions = np.clip(predictions, epsilon, 1 - epsilon)
    return -np.mean(
        targets * np.log(predictions) +
        (1 - targets) * np.log(1 - predictions)
    )

# Binary classification example
targets = np.array([1, 1, 0, 0, 1])  # True labels
predictions = np.array([0.9, 0.8, 0.2, 0.1, 0.7])  # Predicted probabilities

loss = binary_cross_entropy(predictions, targets)

print("Binary Cross-Entropy")
print("=" * 50)
print(f"Targets:     {targets}")
print(f"Predictions: {predictions}")
print(f"\nBCE Loss: {loss:.4f}")

# Show how wrong confidence hurts
print("\nEffect of confident wrong predictions:")
wrong_confident = np.array([0.9, 0.8, 0.2, 0.1, 0.1])  # Last one wrong & confident
wrong_uncertain = np.array([0.9, 0.8, 0.2, 0.1, 0.4])  # Last one wrong but uncertain

loss_confident = binary_cross_entropy(wrong_confident, targets)
loss_uncertain = binary_cross_entropy(wrong_uncertain, targets)
print(f"  Confident wrong (0.1 for true class 1): loss = {loss_confident:.4f}")
print(f"  Uncertain wrong (0.4 for true class 1): loss = {loss_uncertain:.4f}")
print("  Confident mistakes are penalized more!")

Softmax and Cross-Entropy Together

In practice, classification networks output raw scores (logits), and we apply softmax to get probabilities. Combining softmax with cross-entropy gives numerical stability and a clean gradient.

The combined gradient is beautifully simple: $\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i$. The gradient for each class is just the difference between predicted probability and target (1 for true class, 0 otherwise).

PYTHON
import numpy as np

def softmax(logits):
    """Stable softmax."""
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

def cross_entropy_loss(logits, targets):
    """Cross-entropy loss from logits."""
    probs = softmax(logits)
    n_samples = len(targets)
    # targets are class indices
    log_probs = np.log(probs[np.arange(n_samples), targets] + 1e-15)
    return -np.mean(log_probs)

def cross_entropy_gradient(logits, targets):
    """Gradient of CE loss w.r.t. logits."""
    probs = softmax(logits)
    n_samples = len(targets)
    grad = probs.copy()
    grad[np.arange(n_samples), targets] -= 1
    return grad / n_samples

# Multi-class example: 3 classes
logits = np.array([
    [2.0, 1.0, 0.1],   # Sample 1: class 0 has highest logit
    [0.5, 2.5, 0.3],   # Sample 2: class 1 has highest logit
    [0.1, 0.2, 3.0],   # Sample 3: class 2 has highest logit
])
targets = np.array([0, 1, 2])  # True classes (all predictions correct)

probs = softmax(logits)
loss = cross_entropy_loss(logits, targets)
grad = cross_entropy_gradient(logits, targets)

print("Softmax + Cross-Entropy")
print("=" * 50)
print("Logits:")
print(logits)
print("\nProbabilities (after softmax):")
print(probs.round(3))
print(f"\nTargets: {targets}")
print(f"Cross-Entropy Loss: {loss:.4f}")
print("\nGradient (pred - target):")
print(grad.round(4))

Focal Loss for Imbalanced Data

Standard cross-entropy treats all samples equally, which is problematic when classes are imbalanced. If 99% of samples are class A, the network can achieve low loss by always predicting A.

Focal loss down-weights easy examples, focusing training on hard ones:

$$\text{FL} = -\alpha (1 - \hat{y})^\gamma \log(\hat{y})$$

The $(1 - \hat{y})^\gamma$ factor reduces the loss for confident correct predictions. With $\gamma = 2$, a prediction of 0.9 for the true class is scaled by $(1 - 0.9)^2 = 0.01$, so it contributes 100x less than it would under plain cross-entropy, while a hard example predicted at 0.1 keeps 81% of its cross-entropy loss.

PYTHON
import numpy as np

def focal_loss(predictions, targets, gamma=2.0, alpha=0.25):
    """
    Focal loss for handling class imbalance.
    gamma: focusing parameter (higher = more focus on hard examples)
    alpha: class weight
    """
    epsilon = 1e-15
    predictions = np.clip(predictions, epsilon, 1 - epsilon)

    # For positive class (target = 1)
    pos_loss = -alpha * ((1 - predictions) ** gamma) * np.log(predictions)
    # For negative class (target = 0)
    neg_loss = -(1 - alpha) * (predictions ** gamma) * np.log(1 - predictions)

    loss = targets * pos_loss + (1 - targets) * neg_loss
    return np.mean(loss)

def bce_loss(predictions, targets):
    """Standard binary cross-entropy."""
    epsilon = 1e-15
    predictions = np.clip(predictions, epsilon, 1 - epsilon)
    return -np.mean(targets * np.log(predictions) + (1 - targets) * np.log(1 - predictions))

# Compare BCE vs Focal Loss
# Easy example: high confidence correct prediction
easy_pred = 0.95
easy_target = 1

# Hard example: low confidence for true class
hard_pred = 0.3
hard_target = 1

print("Focal Loss vs BCE")
print("=" * 50)

for gamma in [0, 1, 2, 5]:
    easy_fl = focal_loss(np.array([easy_pred]), np.array([easy_target]), gamma=gamma, alpha=1.0)
    hard_fl = focal_loss(np.array([hard_pred]), np.array([hard_target]), gamma=gamma, alpha=1.0)
    ratio = hard_fl / easy_fl if easy_fl > 0 else float('inf')
    print(f"gamma={gamma}: easy_loss={easy_fl:.4f}, hard_loss={hard_fl:.4f}, ratio={ratio:.1f}x")

print("\nHigher gamma focuses more on hard examples")
print("BCE (gamma=0) treats easy and hard equally relative to confidence")

Contrastive and Triplet Loss

For learning embeddings (like face recognition or sentence similarity), we use losses that compare samples rather than predict labels.

Contrastive loss pulls similar pairs together and pushes dissimilar pairs apart:

$$L = y \cdot d^2 + (1-y) \cdot \max(0, m - d)^2$$

Triplet loss uses anchor, positive, and negative samples:

$$L = \max(0, d(a, p) - d(a, n) + m)$$

The network learns to embed similar items close together and dissimilar items far apart in the embedding space.

PYTHON
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between embeddings."""
    return np.sqrt(np.sum((a - b) ** 2))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Triplet loss for learning embeddings.
    anchor: embedding of reference sample
    positive: embedding of similar sample
    negative: embedding of dissimilar sample
    """
    pos_dist = euclidean_distance(anchor, positive)
    neg_dist = euclidean_distance(anchor, negative)
    loss = max(0, pos_dist - neg_dist + margin)
    return loss

# Example: face embeddings
np.random.seed(42)

# Anchor: person A's face
anchor = np.array([0.5, 0.3, 0.8, 0.2])

# Positive: another photo of person A (should be similar)
positive = np.array([0.52, 0.28, 0.79, 0.22])

# Negative: person B's face (should be different)
negative = np.array([0.1, 0.9, 0.2, 0.7])

pos_dist = euclidean_distance(anchor, positive)
neg_dist = euclidean_distance(anchor, negative)
loss = triplet_loss(anchor, positive, negative, margin=0.5)

print("Triplet Loss for Embeddings")
print("=" * 50)
print(f"Anchor embedding:   {anchor}")
print(f"Positive embedding: {positive}")
print(f"Negative embedding: {negative}")
print(f"\nDistance to positive: {pos_dist:.4f}")
print(f"Distance to negative: {neg_dist:.4f}")
print(f"Triplet loss (margin=0.5): {loss:.4f}")

if loss == 0:
    print("\nLoss is 0: negative is already far enough from anchor")
else:
    print("\nLoss > 0: network needs to push negative further away")

Choosing the Right Loss Function

Different tasks call for different losses:

| Task | Recommended Loss |
|------|-----------------|
| Regression | MSE (standard) or MAE (robust to outliers) |
| Binary Classification | Binary Cross-Entropy |
| Multi-class Classification | Categorical Cross-Entropy |
| Imbalanced Classification | Focal Loss or Weighted Cross-Entropy |
| Embedding Learning | Triplet Loss or Contrastive Loss |
| Sequence Generation | Cross-Entropy (per token) |

PYTHON
import numpy as np

print("Loss Function Selection Guide")
print("=" * 50)

print("\n1. REGRESSION TASKS")
print("   Use MSE for most cases")
print("   Use MAE if outliers are present")
print("   Use Huber loss for robustness with smooth gradients")

print("\n2. CLASSIFICATION TASKS")
print("   Binary: Binary Cross-Entropy (BCE)")
print("   Multi-class: Cross-Entropy with Softmax")
print("   Multi-label: BCE per label")

print("\n3. IMBALANCED DATA")
print("   Use Focal Loss (gamma=2 is common)")
print("   Or use class weights in standard CE")

print("\n4. EMBEDDING LEARNING")
print("   Use Triplet Loss for similarity learning")
print("   Use Contrastive Loss for pairwise similarity")

print("\n5. CUSTOM OBJECTIVES")
print("   Can combine losses: L = L1 + λ*L2")
print("   Example: reconstruction + regularization")

Key Takeaways

  • Loss functions quantify prediction errors and drive learning through gradients
  • MSE is standard for regression; it penalizes large errors heavily
  • MAE is robust to outliers but has constant gradients
  • Cross-entropy is standard for classification; it penalizes confident wrong predictions
  • Softmax + Cross-entropy together give clean gradients: (predicted - target)
  • Focal loss handles class imbalance by down-weighting easy examples
  • Triplet loss learns embeddings by comparing anchor, positive, and negative samples
  • Choose loss based on task: regression vs classification, balanced vs imbalanced, prediction vs embedding

10.2 Optimizers Intermediate

Optimizers

Optimizers are the algorithms that actually update neural network weights during training. Given gradients from backpropagation, the optimizer decides how to adjust each weight to reduce the loss. This seemingly simple task—"move weights in the direction that decreases loss"—has spawned decades of research and dozens of algorithms.

The choice of optimizer affects training speed, final model quality, and hyperparameter sensitivity. Understanding how optimizers work helps you choose the right one and debug training problems.

Gradient Descent Fundamentals

The foundation of all neural network optimizers is gradient descent. The gradient of the loss with respect to a weight tells us the direction of steepest increase. We move in the opposite direction to decrease the loss:

$$w \leftarrow w - \eta \nabla_w L$$

The learning rate $\eta$ controls step size. Too large and training diverges; too small and it takes forever. Finding the right learning rate is one of deep learning's persistent challenges.

PYTHON
import numpy as np

# Simple gradient descent on a 1D function
def f(x):
    """Function to minimize: f(x) = x^2 - 4x + 5"""
    return x**2 - 4*x + 5

def gradient(x):
    """Gradient of f: f'(x) = 2x - 4"""
    return 2*x - 4

# Gradient descent
x = 10.0  # Start far from minimum
learning_rate = 0.1

print("Gradient Descent")
print("=" * 50)
print(f"Minimizing f(x) = x² - 4x + 5")
print(f"Minimum at x = 2 (where f'(x) = 0)")
print(f"\n{'Step':<6} {'x':<12} {'f(x)':<12} {'gradient':<12}")
print("-" * 42)

for step in range(15):
    fx = f(x)
    grad = gradient(x)
    print(f"{step:<6} {x:<12.4f} {fx:<12.4f} {grad:<12.4f}")
    x = x - learning_rate * grad  # Gradient descent step

print(f"\nFinal x: {x:.4f} (should be close to 2)")

Stochastic Gradient Descent (SGD)

Pure gradient descent computes gradients over the entire dataset—expensive for large datasets. Stochastic Gradient Descent (SGD) estimates gradients from mini-batches, updating weights after each batch.

SGD introduces noise into gradient estimates, which has two effects. First, it enables training on datasets too large to fit in memory. Second, the noise can help escape shallow local minima, often leading to better generalization.

PYTHON
import numpy as np

def sgd_step(weights, gradients, learning_rate):
    """Simple SGD update."""
    return weights - learning_rate * gradients

# Simulate SGD with noisy gradients
np.random.seed(42)

true_gradient = np.array([0.5, -0.3, 0.2])  # True gradient direction
weights = np.array([1.0, 1.0, 1.0])
learning_rate = 0.1

print("SGD with Noisy Gradients")
print("=" * 50)
print("True gradient direction:", true_gradient)
print("\nSGD estimates gradient from mini-batches, adding noise:")

for step in range(5):
    # Simulate noisy gradient estimate from mini-batch
    noise = np.random.randn(3) * 0.1
    estimated_gradient = true_gradient + noise

    weights = sgd_step(weights, estimated_gradient, learning_rate)
    print(f"Step {step + 1}: weights = {weights.round(4)}")

print("\nDespite noise, weights move in roughly the right direction")
print("Noise can actually help escape poor local minima")

Learning Rate Effects

The learning rate is perhaps the most important hyperparameter. It determines how big each update step is.

With a too-large learning rate, updates overshoot the minimum, and training may diverge—loss increases instead of decreasing. With a too-small learning rate, training converges but takes many iterations. The loss decreases slowly, and you might stop before reaching a good solution.

PYTHON
import numpy as np

def f(x):
    return x**2

def gradient(x):
    return 2*x

print("Effect of Learning Rate")
print("=" * 50)

for lr in [0.01, 0.1, 0.5, 1.0, 1.1]:
    x = 10.0
    print(f"\nLearning rate = {lr}:")

    history = []
    for step in range(100):  # enough steps for lr=1.1 to clearly diverge
        history.append(x)
        x = x - lr * gradient(x)

        # Check for divergence
        if abs(x) > 1e6:
            print(f"  DIVERGED after {step} steps!")
            break

    if abs(x) <= 1e6:
        print(f"  Final x = {x:.4f}, loss = {f(x):.4f}")
        if abs(x) < 0.01:
            print(f"  Converged in ~{len([h for h in history if abs(h) > 0.01])} steps")

Momentum

SGD can oscillate in ravines—narrow valleys where the surface curves much more steeply in one direction than another. Momentum helps by accumulating a velocity vector that smooths out oscillations.

$$v_t = \beta v_{t-1} + \nabla_w L$$
$$w \leftarrow w - \eta v_t$$

Momentum is like a ball rolling downhill: it builds up speed in consistent directions and dampens oscillations. A typical momentum value is 0.9: each update, the velocity retains 90% of its previous value and adds the current gradient, so gradients that keep pointing the same way compound while alternating ones partially cancel.

PYTHON
import numpy as np

def sgd_momentum_step(weights, gradients, velocity, lr, momentum):
    """SGD with momentum update."""
    velocity = momentum * velocity + gradients
    weights = weights - lr * velocity
    return weights, velocity

# Compare SGD vs SGD with momentum on oscillating gradients
np.random.seed(42)

# Simulate gradients in a ravine (large in one direction, small in another)
weights_sgd = np.array([10.0, 10.0])
weights_mom = np.array([10.0, 10.0])
velocity = np.array([0.0, 0.0])

learning_rate = 0.1
momentum = 0.9

print("SGD vs SGD with Momentum")
print("=" * 50)
print("In a ravine, gradients oscillate perpendicular to the minimum")

for step in range(10):
    # Ravine-shaped loss: shallow curvature in the first dimension, steep in the second
    grad = np.array([weights_sgd[0] * 0.1, weights_sgd[1] * 2.0])

    # Plain SGD
    weights_sgd = weights_sgd - learning_rate * grad

    # SGD with momentum
    grad_mom = np.array([weights_mom[0] * 0.1, weights_mom[1] * 2.0])
    weights_mom, velocity = sgd_momentum_step(weights_mom, grad_mom, velocity, learning_rate, momentum)

    if step % 2 == 0:
        print(f"Step {step}: SGD = {weights_sgd.round(3)}, Momentum = {weights_mom.round(3)}")

print("\nMomentum dampens oscillations and accelerates convergence")

Nesterov Accelerated Gradient

Nesterov momentum looks ahead before computing the gradient. Instead of computing the gradient at the current position, it computes it at the position we would reach if we continued with our current velocity.

$$v_t = \beta v_{t-1} + \nabla_w L(w - \beta v_{t-1})$$
$$w \leftarrow w - \eta v_t$$

This "lookahead" provides a correction—if we're about to overshoot, the gradient at the lookahead position tells us to slow down.

PYTHON
import numpy as np

def nesterov_step(weights, velocity, lr, momentum, grad_fn):
    """Nesterov accelerated gradient step."""
    # Look ahead
    lookahead = weights - momentum * velocity

    # Compute gradient at lookahead position
    gradient = grad_fn(lookahead)

    # Update velocity and weights
    velocity = momentum * velocity + lr * gradient
    weights = weights - velocity

    return weights, velocity

# Simple quadratic to minimize
def loss_gradient(w):
    return 2 * w  # Gradient of w^2

np.random.seed(42)
w_momentum = 10.0
w_nesterov = 10.0
v_momentum = 0.0
v_nesterov = 0.0

lr = 0.1
momentum = 0.9

print("Nesterov vs Standard Momentum")
print("=" * 50)

for step in range(10):
    # Standard momentum
    grad = loss_gradient(w_momentum)
    v_momentum = momentum * v_momentum + grad
    w_momentum = w_momentum - lr * v_momentum

    # Nesterov momentum
    w_nesterov, v_nesterov = nesterov_step(w_nesterov, v_nesterov, lr, momentum, loss_gradient)

    print(f"Step {step}: Momentum w={w_momentum:.4f}, Nesterov w={w_nesterov:.4f}")

print("\nNesterov often converges faster by anticipating the gradient")

AdaGrad: Adaptive Learning Rates

AdaGrad adapts the learning rate for each parameter based on historical gradients. Parameters with large past gradients get smaller learning rates; parameters with small past gradients get larger ones.

$$G_t = G_{t-1} + g_t^2$$
$$w \leftarrow w - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

This is useful when features have different frequencies. Rare features get larger updates; common features get smaller ones. However, AdaGrad's learning rate can decrease too aggressively, causing training to stop prematurely.

PYTHON
import numpy as np

def adagrad_step(weights, gradients, accumulated_sq_grad, lr, epsilon=1e-8):
    """AdaGrad update."""
    accumulated_sq_grad = accumulated_sq_grad + gradients**2
    adjusted_lr = lr / (np.sqrt(accumulated_sq_grad) + epsilon)
    weights = weights - adjusted_lr * gradients
    return weights, accumulated_sq_grad

np.random.seed(42)

# Simulate parameters with different gradient scales
weights = np.array([5.0, 5.0])
accumulated = np.zeros(2)
lr = 1.0

print("AdaGrad: Adaptive Learning Rates")
print("=" * 50)
print("Parameter 1 has small gradients, Parameter 2 has large gradients\n")

for step in range(10):
    # Different gradient scales for each parameter
    gradients = np.array([0.1, 2.0])  # Much larger gradient for param 2

    weights, accumulated = adagrad_step(weights, gradients, accumulated, lr)

    effective_lr = lr / (np.sqrt(accumulated) + 1e-8)
    print(f"Step {step}: weights = {weights.round(4)}, effective_lr = {effective_lr.round(4)}")

print("\nParam 2's learning rate decreases faster due to larger gradients")
print("This helps balance updates across parameters")

RMSprop: Fixing AdaGrad's Decay

RMSprop fixes AdaGrad's aggressive learning rate decay by using an exponential moving average of squared gradients instead of the sum:

$$E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$$
$$w \leftarrow w - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$

The decay rate $\beta$ (typically 0.9 or 0.99) controls how much history to consider. This prevents the learning rate from shrinking to zero, allowing continued learning.

PYTHON
import numpy as np

def rmsprop_step(weights, gradients, sq_grad_avg, lr, beta=0.9, epsilon=1e-8):
    """RMSprop update."""
    sq_grad_avg = beta * sq_grad_avg + (1 - beta) * gradients**2
    weights = weights - lr / (np.sqrt(sq_grad_avg) + epsilon) * gradients
    return weights, sq_grad_avg

np.random.seed(42)

weights = np.array([5.0])
sq_avg = np.array([0.0])
lr = 0.1

print("RMSprop: Exponential Moving Average")
print("=" * 50)

# Simulate varying gradient magnitudes
for step in range(15):
    # Gradient varies in magnitude
    gradient = np.array([np.sin(step) * 2 + 0.5])

    weights, sq_avg = rmsprop_step(weights, gradient, sq_avg, lr)

    effective_lr = lr / (np.sqrt(sq_avg) + 1e-8)
    print(f"Step {step:2d}: grad = {gradient[0]:6.3f}, eff_lr = {effective_lr[0]:.4f}, w = {weights[0]:.4f}")

print("\nEffective learning rate adapts but doesn't decay to zero")

Adam: Combining Momentum and Adaptive Rates

Adam (Adaptive Moment Estimation) combines momentum with adaptive learning rates. It maintains both a moving average of gradients (first moment, like momentum) and a moving average of squared gradients (second moment, like RMSprop):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$w \leftarrow w - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

The bias correction (dividing by $1-\beta^t$) counteracts the zero initialization of moment estimates, especially important in early training.

PYTHON
import numpy as np

class AdamOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step

    def step(self, weights, gradients):
        if self.m is None:
            self.m = np.zeros_like(weights)
            self.v = np.zeros_like(weights)

        self.t += 1

        # Update moments
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients**2

        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)

        # Update weights
        weights = weights - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

        return weights

# Compare SGD vs Adam
np.random.seed(42)

weights_sgd = np.array([5.0, 5.0])
weights_adam = np.array([5.0, 5.0])
adam = AdamOptimizer(lr=0.5)

print("Adam vs SGD")
print("=" * 50)

for step in range(10):
    # Gradient with different scales
    grad = np.array([0.5, 2.0])

    # SGD update
    weights_sgd = weights_sgd - 0.1 * grad

    # Adam update
    weights_adam = adam.step(weights_adam, grad)

    print(f"Step {step}: SGD = {weights_sgd.round(3)}, Adam = {weights_adam.round(3)}")

print("\nAdam adapts to gradient scales automatically")

AdamW: Weight Decay Done Right

Standard Adam implements L2 regularization by adding the weight to the gradient before computing moment estimates. AdamW instead applies weight decay directly to the weights after the Adam update:

$$w \leftarrow w - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w \right)$$

This decouples weight decay from the adaptive learning rate, leading to better regularization. AdamW has become the default optimizer for many applications, especially language models.

PYTHON
import numpy as np

class AdamWOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.weight_decay = weight_decay
        self.m = None
        self.v = None
        self.t = 0

    def step(self, weights, gradients):
        if self.m is None:
            self.m = np.zeros_like(weights)
            self.v = np.zeros_like(weights)

        self.t += 1

        # Update moments (without weight decay in gradient)
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients**2

        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)

        # Update weights with decoupled weight decay
        weights = weights - self.lr * (m_hat / (np.sqrt(v_hat) + self.epsilon) + self.weight_decay * weights)

        return weights

np.random.seed(42)
adamw = AdamWOptimizer(lr=0.1, weight_decay=0.1)
weights = np.array([5.0, 3.0])

print("AdamW: Decoupled Weight Decay")
print("=" * 50)

for step in range(10):
    grad = np.array([0.5, 0.3])
    weights = adamw.step(weights, grad)
    print(f"Step {step}: weights = {weights.round(4)}")

print("\nWeight decay is applied directly, not through gradient")
print("This gives better regularization than L2 in Adam")

Learning Rate Scheduling

Even with adaptive optimizers, adjusting the learning rate during training often helps. Common schedules:

  • Step decay: reduce the learning rate by a fixed factor at specific epochs.
  • Exponential decay: $\eta_t = \eta_0 \cdot \gamma^t$.
  • Cosine annealing: smoothly decrease the learning rate following a cosine curve.
  • Warmup: start with a tiny learning rate and increase it over the first few epochs, typically before a decay schedule.

PYTHON
import numpy as np

def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
    """Reduce LR by drop_rate every epochs_drop epochs."""
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(initial_lr, epoch, decay_rate=0.95):
    """Exponential decay."""
    return initial_lr * (decay_rate ** epoch)

def cosine_annealing(initial_lr, epoch, total_epochs):
    """Cosine annealing to zero."""
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))

def warmup_cosine(initial_lr, epoch, warmup_epochs=5, total_epochs=50):
    """Linear warmup then cosine decay."""
    if epoch < warmup_epochs:
        return initial_lr * epoch / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return initial_lr * 0.5 * (1 + np.cos(np.pi * progress))

print("Learning Rate Schedules")
print("=" * 50)

initial_lr = 0.1
epochs = [0, 5, 10, 15, 20, 30, 40, 50]

print(f"{'Epoch':<8} {'Step':<10} {'Exp':<10} {'Cosine':<10} {'Warmup':<10}")
print("-" * 48)
for e in epochs:
    s = step_decay(initial_lr, e)
    exp = exponential_decay(initial_lr, e)
    cos = cosine_annealing(initial_lr, e, 50)
    warm = warmup_cosine(initial_lr, e, 5, 50)
    print(f"{e:<8} {s:<10.4f} {exp:<10.4f} {cos:<10.4f} {warm:<10.4f}")

Choosing an Optimizer

For most applications, start with AdamW. It works well across tasks with minimal tuning. Use learning rate 1e-3 to 1e-4, weight decay 0.01 to 0.1.

SGD with momentum can sometimes find better solutions than Adam, especially for image classification. It requires more tuning but may generalize better.

Adam (without W) is fine when you're not using weight decay or for quick experiments.

PYTHON
print("Optimizer Selection Guide")
print("=" * 60)

print("\n1. DEFAULT CHOICE: AdamW")
print("   - Works well for most tasks")
print("   - lr=1e-3 to 1e-4, weight_decay=0.01 to 0.1")
print("   - Standard for language models, transformers")

print("\n2. COMPUTER VISION: SGD with Momentum")
print("   - Often better generalization than Adam")
print("   - lr=0.1, momentum=0.9, with cosine schedule")
print("   - Requires more tuning but can achieve better results")

print("\n3. QUICK EXPERIMENTS: Adam")
print("   - Fast convergence, less tuning needed")
print("   - lr=1e-3 works well as default")
print("   - May not generalize as well as SGD")

print("\n4. LARGE BATCH TRAINING: LAMB or LARS")
print("   - Designed for very large batch sizes")
print("   - Layer-wise adaptive rates")

print("\n5. TIPS:")
print("   - Always use learning rate warmup for large models")
print("   - Cosine decay often beats step decay")
print("   - If training is unstable, reduce learning rate")
print("   - If training is too slow, increase learning rate")

Key Takeaways

  • SGD computes gradients on mini-batches; noise can help generalization
  • Momentum accumulates velocity, smoothing oscillations and accelerating convergence
  • Nesterov momentum looks ahead before computing gradients, improving convergence
  • AdaGrad adapts learning rates per parameter but can decay too aggressively
  • RMSprop uses exponential moving average to prevent learning rate collapse
  • Adam combines momentum with adaptive rates; the most popular optimizer
  • AdamW decouples weight decay from adaptive learning; preferred for regularization
  • Learning rate schedules (warmup, cosine annealing) often improve final performance
  • Start with AdamW for most tasks; use SGD+momentum if you have time to tune