Enthusiast · Intermediate · 90 min read

Chapter 9: Neural Network Foundations

Perceptrons, MLPs, activation functions, and backpropagation.

Libraries covered: PyTorch, NumPy

Learning Objectives

["Understand neural network architecture", "Implement backpropagation", "Build networks in PyTorch"]


9.1 Perceptrons and Neurons (Beginner)


The perceptron, invented by Frank Rosenblatt in 1958, is the simplest neural network—a single artificial neuron that makes decisions by weighing evidence. While limited on its own, the perceptron introduces concepts that underpin all modern deep learning: weighted inputs, activation functions, and learning through weight adjustment.

Perceptrons provide the foundation for understanding neural networks. Every deep network, no matter how complex, is built from these simple computational units.

Biological Inspiration

Artificial neurons draw loose inspiration from biological neurons in the brain. A biological neuron receives signals through dendrites, processes them in the cell body, and transmits output through the axon to other neurons. The connection strength between neurons (synaptic strength) determines how much influence one neuron has on another.

Artificial neurons abstract this process. They receive numerical inputs, multiply each by a weight (analogous to synaptic strength), sum the results, and apply an activation function to produce output. Learning happens by adjusting the weights.

The analogy shouldn't be taken too literally—artificial neurons are mathematical functions, not accurate brain models. But the metaphor helps build intuition for how networks of simple units can produce complex behavior.

PYTHON
import numpy as np

# A biological neuron receives signals and fires if they exceed a threshold
# An artificial neuron does something similar mathematically

def simple_neuron(inputs, weights, threshold):
    """
    Simple threshold neuron.
    Fires (outputs 1) if weighted sum exceeds threshold.
    """
    weighted_sum = np.dot(inputs, weights)
    if weighted_sum >= threshold:
        return 1  # Neuron fires
    else:
        return 0  # Neuron doesn't fire

# Example: neuron with 3 inputs
inputs = np.array([0.5, 0.3, 0.8])
weights = np.array([0.4, 0.6, 0.2])
threshold = 0.5

output = simple_neuron(inputs, weights, threshold)
print(f"Inputs: {inputs}")
print(f"Weights: {weights}")
print(f"Weighted sum: {np.dot(inputs, weights):.2f}")
print(f"Threshold: {threshold}")
print(f"Output: {output}")

The Perceptron Model

The perceptron takes multiple inputs, multiplies each by a weight, sums them, adds a bias term, and passes the result through a step function:

$$y = \begin{cases} 1 & \text{if } \sum_i w_i x_i + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

The weights ($w_i$) determine the importance of each input. The bias ($b$) shifts the decision boundary—it's like having an input that's always 1 with its own weight.
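To make the "input that's always 1" picture concrete, here is a small check (an illustrative snippet, not part of the perceptron code below) showing that folding the bias into an extra weight gives the same weighted sum:

PYTHON
import numpy as np

x = np.array([0.5, 0.3, 0.8])
w = np.array([0.4, 0.6, 0.2])
b = -0.1

# Standard form: weighted sum plus an explicit bias
z = np.dot(x, w) + b

# Equivalent form: append a constant input of 1 and treat the bias as its weight
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)
z_aug = np.dot(x_aug, w_aug)

print(f"{z:.6f}  {z_aug:.6f}")  # the two forms agree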

The step function makes the output binary: either the neuron fires (1) or it doesn't (0).

PYTHON
import numpy as np

class Perceptron:
    def __init__(self, n_inputs):
        # Initialize weights randomly, bias to zero
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0

    def predict(self, x):
        """Compute perceptron output."""
        linear = np.dot(x, self.weights) + self.bias
        return 1 if linear >= 0 else 0

    def predict_batch(self, X):
        """Predict for multiple samples."""
        linear = np.dot(X, self.weights) + self.bias
        return (linear >= 0).astype(int)

# Create a perceptron with 2 inputs
perceptron = Perceptron(n_inputs=2)

# Test on some inputs
test_inputs = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

print("Perceptron outputs (random weights):")
for x in test_inputs:
    y = perceptron.predict(x)
    print(f"  Input {x} -> Output {y}")

Learning: The Perceptron Algorithm

The perceptron learns by adjusting weights based on errors. When it makes a wrong prediction, it updates weights to do better next time:

If output is 0 but should be 1: Increase weights for active inputs (move toward firing)

If output is 1 but should be 0: Decrease weights for active inputs (move away from firing)

The update rule: $w_i \leftarrow w_i + \eta (y_{true} - y_{pred}) x_i$

where $\eta$ is the learning rate controlling step size.

PYTHON
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.1):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.lr = learning_rate

    def predict(self, x):
        linear = np.dot(x, self.weights) + self.bias
        return 1 if linear >= 0 else 0

    def train(self, X, y, epochs=100):
        """Train perceptron using the perceptron learning rule."""
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.predict(xi)
                error = yi - prediction

                if error != 0:
                    # Update weights and bias
                    self.weights += self.lr * error * xi
                    self.bias += self.lr * error
                    errors += 1

            if errors == 0:
                print(f"Converged at epoch {epoch + 1}")
                break

        return self

# Train on AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND function

perceptron = Perceptron(n_inputs=2)
perceptron.train(X, y, epochs=100)

print("\nLearned AND gate:")
for xi, yi in zip(X, y):
    pred = perceptron.predict(xi)
    print(f"  {xi} -> {pred} (expected {yi})")

print(f"\nLearned weights: {perceptron.weights}")
print(f"Learned bias: {perceptron.bias}")

Linear Decision Boundaries

A perceptron divides the input space with a linear boundary (a line in 2D, a plane in 3D, a hyperplane in higher dimensions). Points on one side are classified as 1, points on the other as 0.

The decision boundary is where $\sum_i w_i x_i + b = 0$. In 2D with inputs $x_1$ and $x_2$:

$$w_1 x_1 + w_2 x_2 + b = 0$$

This is the equation of a line with slope $-w_1/w_2$ and intercept $-b/w_2$.

PYTHON
import numpy as np

# Train perceptron on OR gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])  # OR function

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.1):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return 1 if np.dot(x, self.weights) + self.bias >= 0 else 0

    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            for xi, yi in zip(X, y):
                error = yi - self.predict(xi)
                self.weights += self.lr * error * xi
                self.bias += self.lr * error

perceptron = Perceptron(2)
perceptron.train(X, y)

# The decision boundary: w1*x1 + w2*x2 + b = 0
w1, w2 = perceptron.weights
b = perceptron.bias
print(f"Decision boundary: {w1:.2f}*x1 + {w2:.2f}*x2 + {b:.2f} = 0")
print(f"As a line: x2 = {-w1/w2:.2f}*x1 + {-b/w2:.2f}")
print("This line separates class 0 from class 1")

The XOR Problem: Limitations of Perceptrons

A single perceptron can only solve linearly separable problems—where a straight line can separate the classes. This is a severe limitation.

The classic example is XOR (exclusive OR). XOR outputs 1 when inputs differ and 0 when they're the same. No single line can separate the 1s from the 0s in XOR. This was famously demonstrated by Minsky and Papert in 1969, contributing to the first "AI winter."

PYTHON
import numpy as np

# XOR is not linearly separable
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])  # XOR function

print("XOR truth table:")
for xi, yi in zip(X_xor, y_xor):
    print(f"  {xi} -> {yi}")

print("\nTry to visualize: plot (0,0) and (1,1) as class 0")
print("Plot (0,1) and (1,0) as class 1")
print("No straight line can separate them!")

# Try to train a perceptron on XOR
class Perceptron:
    def __init__(self, n_inputs, lr=0.1):
        self.weights = np.zeros(n_inputs)
        self.bias = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(x, self.weights) + self.bias >= 0 else 0

    def train(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                error = yi - self.predict(xi)
                self.weights += self.lr * error * xi
                self.bias += self.lr * error

perceptron = Perceptron(2)
perceptron.train(X_xor, y_xor, epochs=1000)

print("\nPerceptron predictions on XOR (after 1000 epochs):")
correct = 0
for xi, yi in zip(X_xor, y_xor):
    pred = perceptron.predict(xi)
    correct += (pred == yi)
    print(f"  {xi} -> {pred} (expected {yi})")
print(f"\nAccuracy: {correct}/{len(y_xor)} - Perceptron CANNOT learn XOR!")

From Perceptrons to Neurons

The step function's hard threshold makes learning difficult—small weight changes produce no change in output until you cross the threshold. Modern neural networks replace the step function with smooth, differentiable activation functions.

A neuron with a smooth activation function can compute gradients, enabling the backpropagation algorithm to train multi-layer networks. This seemingly small change unlocks the power of deep learning.

PYTHON
import numpy as np

def step(x):
    """Step function - original perceptron."""
    return np.where(x >= 0, 1, 0)

def sigmoid(x):
    """Sigmoid - smooth version of step."""
    return 1 / (1 + np.exp(-x))

def relu(x):
    """ReLU - most common in modern networks."""
    return np.maximum(0, x)

# Compare activations
x = np.linspace(-5, 5, 100)

print("Activation function comparison:")
print("Step: Hard 0/1 threshold, not differentiable at 0")
print("Sigmoid: Smooth, outputs between 0 and 1")
print("ReLU: Simple, fast, most popular today")

# Show values at a few points
test_points = [-2, -1, 0, 1, 2]
print(f"\n{'x':<6} {'Step':<8} {'Sigmoid':<10} {'ReLU':<8}")
print("-" * 32)
for x_val in test_points:
    print(f"{x_val:<6} {int(step(x_val)):<8} {sigmoid(x_val):<10.4f} {relu(x_val):<8}")

Multi-Layer Networks Solve XOR

The XOR problem is solved by adding a hidden layer. Two neurons in the hidden layer can each learn a linear boundary, and the output neuron combines them to create a non-linear decision region.

This is the key insight: stacking layers of neurons creates networks that can learn arbitrarily complex functions.

PYTHON
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Manual two-layer network for XOR
# Hidden layer: 2 neurons
# Output layer: 1 neuron

# These weights are hand-crafted to solve XOR
# Rows of W1 correspond to the two inputs, columns to the two hidden neurons
W1 = np.array([[20, 20],
               [20, 20]])
b1 = np.array([-30, -10])  # bias -30 makes neuron 1 AND-like, bias -10 makes neuron 2 OR-like

W2 = np.array([[-60], [60]])  # Output neuron
b2 = np.array([-30])

def forward(x):
    """Forward pass through 2-layer network."""
    # Hidden layer
    h = sigmoid(np.dot(x, W1) + b1)
    # Output layer
    y = sigmoid(np.dot(h, W2) + b2)
    return y

# Test on XOR
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

print("Two-layer network solving XOR:")
for xi, yi in zip(X_xor, y_xor):
    pred = forward(xi)[0]
    pred_class = 1 if pred > 0.5 else 0
    print(f"  {xi} -> {pred:.4f} -> class {pred_class} (expected {yi})")

print("\nWith a hidden layer, neural networks can learn XOR!")
print("This is why depth matters in deep learning.")

Weights, Biases, and Parameters

Every connection in a neural network has a weight—a number that determines how much one neuron influences another. Every neuron (except inputs) has a bias—a number added before the activation function.

Together, weights and biases are the network's parameters. Learning is the process of finding parameter values that make the network produce correct outputs. A network with millions of parameters can learn extremely complex functions.

PYTHON
import numpy as np

def count_parameters(layer_sizes):
    """Count total parameters in a feedforward network."""
    total_weights = 0
    total_biases = 0

    for i in range(len(layer_sizes) - 1):
        n_in = layer_sizes[i]
        n_out = layer_sizes[i + 1]

        weights = n_in * n_out
        biases = n_out

        total_weights += weights
        total_biases += biases

        print(f"Layer {i+1}: {n_in} -> {n_out}")
        print(f"  Weights: {n_in} x {n_out} = {weights}")
        print(f"  Biases: {n_out}")

    print(f"\nTotal parameters: {total_weights + total_biases}")
    print(f"  Weights: {total_weights}")
    print(f"  Biases: {total_biases}")

    return total_weights + total_biases

# Example: small network
print("Small network: 10 inputs, 2 hidden layers (20, 10), 1 output")
count_parameters([10, 20, 10, 1])

print("\n" + "="*50 + "\n")

# Example: larger network
print("Larger network: 784 inputs (28x28 image), hidden (256, 128), 10 outputs")
count_parameters([784, 256, 128, 10])

Key Takeaways

  • A perceptron is a single artificial neuron: weighted sum of inputs plus bias, passed through an activation function
  • Perceptrons can only learn linearly separable functions
  • The XOR problem showed perceptrons' limitations—they cannot separate non-linear patterns
  • Multi-layer networks overcome this by combining multiple linear boundaries
  • Modern neurons use smooth activation functions (sigmoid, ReLU) instead of step functions
  • Weights control connection strengths; biases shift activation thresholds
  • The number of parameters (weights + biases) determines a network's capacity
  • Deep learning's power comes from stacking layers of simple neurons

9.2 Activation Functions (Beginner)


Activation functions are the nonlinear transformations that give neural networks their power. Without them, stacking layers would be pointless—a composition of linear functions is just another linear function. Activation functions introduce the nonlinearity that allows networks to learn complex patterns, from image recognition to language understanding.

The choice of activation function significantly impacts how well a network learns. Different functions have different properties: some work better in hidden layers, others in output layers; some train faster, others are more stable. Understanding these tradeoffs is essential for building effective neural networks.

Why Nonlinearity Matters

Consider what happens without activation functions. Each layer computes a linear transformation: $y = Wx + b$. If you stack two such layers:

$$y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2 = W' x + b'$$

The result is just another linear transformation. No matter how many layers you stack, the network can only learn linear relationships. It could never learn XOR, recognize faces, or understand language.

Activation functions break this limitation. By applying a nonlinear function after each linear transformation, the network can approximate arbitrarily complex functions. The Universal Approximation Theorem proves that a single hidden layer with nonlinear activations can approximate any continuous function—given enough neurons.

PYTHON
import numpy as np

# Without activation functions, stacking layers is pointless
W1 = np.array([[1, 2], [3, 4]])
b1 = np.array([1, 1])
W2 = np.array([[0.5, -0.5], [0.5, 0.5]])
b2 = np.array([0, 0])

x = np.array([1, 2])

# Two linear layers
h = W1 @ x + b1
y = W2 @ h + b2
print(f"Two linear layers: {y}")

# Equivalent single layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
y_single = W_combined @ x + b_combined
print(f"Single equivalent layer: {y_single}")
print("They're identical! Multiple linear layers collapse into one.")

The Sigmoid Function

The sigmoid function was the original activation function for neural networks, inspired by biological neurons that either fire or don't. Sigmoid squashes any input into the range (0, 1):

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its output can be interpreted as a probability, making it natural for binary classification. The function is smooth and differentiable everywhere, which is essential for gradient-based learning.

The derivative has a convenient form: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. This means you can compute the gradient using just the output value, without recalculating the sigmoid.

PYTHON
import numpy as np

def sigmoid(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of sigmoid."""
    s = sigmoid(x)
    return s * (1 - s)

# Demonstrate sigmoid behavior
x = np.array([-5, -2, -1, 0, 1, 2, 5])
print("Sigmoid function:")
print(f"{'x':<6} {'sigmoid(x)':<12} {'derivative':<12}")
print("-" * 30)
for xi in x:
    print(f"{xi:<6} {sigmoid(xi):<12.4f} {sigmoid_derivative(xi):<12.4f}")

print("\nKey properties:")
print("- Output always between 0 and 1")
print("- sigmoid(0) = 0.5 (centered)")
print("- Large positive x → 1, large negative x → 0")
print("- Derivative is largest at x=0, vanishes for large |x|")

The Vanishing Gradient Problem

Sigmoid has a critical flaw that limited deep learning for decades: the vanishing gradient problem. Look at the derivative—it's at most 0.25 (when x=0) and approaches zero for large positive or negative inputs.

During backpropagation, gradients are multiplied through layers. If each layer multiplies by a small number, the gradient shrinks exponentially. After just a few layers, gradients become so tiny that early layers barely learn.

This is why deep networks with sigmoid activations train poorly. The gradients "vanish" before reaching the early layers. Networks deeper than 3-4 layers were nearly impossible to train effectively.

PYTHON
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Demonstrate vanishing gradients
print("Gradient flow through sigmoid layers:")
print("Starting with gradient = 1.0\n")

gradient = 1.0
for layer in range(10):
    # Assume pre-activations are around 2 (not extreme but not zero)
    local_gradient = sigmoid_derivative(2.0)  # About 0.1
    gradient *= local_gradient
    print(f"After layer {layer + 1}: gradient = {gradient:.2e}")

print(f"\nAfter 10 layers, gradient is {gradient:.2e}")
print("Early layers receive almost no learning signal!")

The Hyperbolic Tangent (Tanh)

Tanh is a scaled and shifted sigmoid that outputs values between -1 and 1:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

Being zero-centered is a significant advantage. With sigmoid, outputs are always positive, which can cause gradients to be all positive or all negative, slowing learning through zig-zagging updates. Tanh's outputs center around zero, allowing more balanced gradient flow.

Tanh still suffers from vanishing gradients, but less severely than sigmoid. Its derivative reaches 1 at x=0 (compared to sigmoid's 0.25), giving stronger gradients near the center.

PYTHON
import numpy as np

def tanh(x):
    """Hyperbolic tangent activation."""
    return np.tanh(x)

def tanh_derivative(x):
    """Derivative of tanh."""
    return 1 - np.tanh(x)**2

# Compare sigmoid and tanh
x = np.array([-3, -1, 0, 1, 3])

print("Comparison: Sigmoid vs Tanh")
print(f"{'x':<6} {'sigmoid':<10} {'tanh':<10} {'sig_deriv':<12} {'tanh_deriv':<12}")
print("-" * 50)
for xi in x:
    sig = 1 / (1 + np.exp(-xi))
    sig_d = sig * (1 - sig)
    th = np.tanh(xi)
    th_d = 1 - th**2
    print(f"{xi:<6} {sig:<10.4f} {th:<10.4f} {sig_d:<12.4f} {th_d:<12.4f}")

print("\nAdvantages of tanh:")
print("- Zero-centered outputs")
print("- Stronger gradients (max derivative = 1 vs 0.25)")
print("- Still saturates for large |x|")

ReLU: The Modern Default

The Rectified Linear Unit (ReLU) revolutionized deep learning. It's embarrassingly simple:

$$\text{ReLU}(x) = \max(0, x)$$

ReLU outputs zero for negative inputs and the input itself for positive inputs. This simplicity brings major advantages. Computation is fast—just a comparison. More importantly, for positive inputs, the gradient is exactly 1. Gradients don't shrink as they flow backward, solving the vanishing gradient problem.

ReLU enabled training of truly deep networks for the first time. Networks with 10, 50, even 100+ layers became practical. This breakthrough catalyzed the deep learning revolution.

PYTHON
import numpy as np

def relu(x):
    """ReLU activation function."""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU."""
    return (x > 0).astype(float)

# Demonstrate ReLU behavior
x = np.array([-3, -1, -0.1, 0, 0.1, 1, 3])

print("ReLU function:")
print(f"{'x':<8} {'ReLU(x)':<10} {'derivative':<10}")
print("-" * 28)
for xi in x:
    print(f"{xi:<8} {relu(xi):<10} {relu_derivative(xi):<10}")

print("\nKey advantages:")
print("- Gradient is 1 for positive inputs (no vanishing)")
print("- Extremely fast to compute")
print("- Sparse activation (many zeros) is efficient")
print("- Enables very deep networks")

The Dying ReLU Problem

ReLU has its own flaw: dying neurons. If a neuron's weighted sum becomes negative, its output is zero. Zero output means zero gradient flowing backward through that neuron. If this happens consistently, the weights never update, and the neuron is "dead"—it contributes nothing to the network.

This can happen during training if learning rates are too high. A large negative weight update can push a neuron into the dead zone permanently. In some networks, a significant fraction of neurons can die, wasting capacity.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Simulate dying ReLU problem
np.random.seed(42)

# A neuron with weights that lead to negative pre-activations
weights = np.array([-0.5, -0.3, 0.1])
bias = -1.0

# Sample inputs (all positive, like after a previous ReLU)
X = np.abs(np.random.randn(100, 3))

# Compute pre-activations
pre_activations = X @ weights + bias

print("Dying ReLU demonstration:")
print(f"Number of inputs: {len(X)}")
print(f"Pre-activations < 0: {(pre_activations < 0).sum()}")
print(f"Pre-activations >= 0: {(pre_activations >= 0).sum()}")

if (pre_activations < 0).all():
    print("\nThis neuron is DEAD!")
    print("All outputs are 0, all gradients are 0")
    print("Weights will never update from backprop through this neuron")

Leaky ReLU and Variants

Leaky ReLU fixes the dying ReLU problem by allowing a small gradient for negative inputs:

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$$

Typically $\alpha = 0.01$. Negative inputs produce small but non-zero outputs, keeping the gradient alive. The neuron can still recover from negative weights.

Parametric ReLU (PReLU) learns the slope $\alpha$ during training. ELU (Exponential Linear Unit) uses an exponential curve for negative inputs, which can speed training by pushing mean activations closer to zero.

PYTHON
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU activation."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """Exponential Linear Unit."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# Compare ReLU variants
x = np.array([-3, -1, -0.5, 0, 0.5, 1, 3])

print("ReLU Variants Comparison:")
print(f"{'x':<8} {'ReLU':<10} {'LeakyReLU':<12} {'ELU':<10}")
print("-" * 40)
for xi in x:
    r = max(0, xi)
    lr = leaky_relu(xi)
    e = elu(xi)
    print(f"{xi:<8} {r:<10.3f} {lr:<12.3f} {e:<10.3f}")

print("\nLeaky ReLU: small negative slope prevents dead neurons")
print("ELU: exponential negative part, can speed training")

Softmax for Classification

Softmax is special—it's used in the output layer for multi-class classification. Unlike other activations that work element-wise, softmax operates on a vector, converting raw scores (logits) into a probability distribution:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

The outputs are all positive and sum to 1, so they can be interpreted as class probabilities. The exponential amplifies differences: the largest input gets the highest probability, and the gap between classes becomes more pronounced.

PYTHON
import numpy as np

def softmax(x):
    """Softmax activation for classification."""
    # Subtract max for numerical stability
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print("Softmax for classification:")
print(f"Raw logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum of probabilities: {probs.sum():.4f}")
print(f"Predicted class: {np.argmax(probs)}")

print("\nHow softmax amplifies differences:")
logits2 = np.array([3.0, 1.0, 0.1])  # Slightly higher first logit
probs2 = softmax(logits2)
print(f"Logits: {logits} -> Probs: {probs.round(3)}")
print(f"Logits: {logits2} -> Probs: {probs2.round(3)}")

GELU and Swish: Modern Activations

Recent architectures like Transformers use GELU (Gaussian Error Linear Unit) and Swish. These smooth approximations to ReLU have shown better performance in many applications.

GELU can be thought of as a smooth, probabilistic version of ReLU. Instead of a hard cutoff at zero, it weights inputs by their probability under a Gaussian distribution:

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.

Swish (also called SiLU) is even simpler: $\text{Swish}(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function.

PYTHON
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gelu(x):
    """Gaussian Error Linear Unit (approximation)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    """Swish / SiLU activation."""
    return x * sigmoid(x)

def relu(x):
    return np.maximum(0, x)

# Compare modern activations
x = np.linspace(-3, 3, 7)

print("Modern Activation Functions:")
print(f"{'x':<8} {'ReLU':<10} {'GELU':<10} {'Swish':<10}")
print("-" * 38)
for xi in x:
    print(f"{xi:<8.1f} {relu(xi):<10.3f} {gelu(xi):<10.3f} {swish(xi):<10.3f}")

print("\nGELU: Used in BERT, GPT, and most Transformers")
print("Swish: Smooth, self-gated, competitive with GELU")
print("Both avoid hard zero for negative inputs")

Choosing Activation Functions

Different situations call for different activations. Here are practical guidelines:

Hidden layers: Start with ReLU. It's fast, effective, and rarely goes wrong. If you see dying neurons (many zero activations), try Leaky ReLU or ELU. For Transformer architectures, use GELU.

Output layer: Depends on the task. For multi-class classification, use softmax. For binary classification, use sigmoid. For regression, use no activation (linear output) or ReLU if outputs must be non-negative.

RNNs and LSTMs: Use tanh for hidden states. The bounded output prevents values from exploding over long sequences.

PYTHON
import numpy as np

def relu(x): return np.maximum(0, x)
def sigmoid(x): return 1 / (1 + np.exp(-x))
def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Task-specific activation choices
print("Activation Function Selection Guide:")
print("=" * 50)

# Classification example
print("\n1. Multi-class Classification (3 classes)")
logits = np.array([2.1, 0.5, -1.0])
print(f"   Final layer logits: {logits}")
print(f"   Apply softmax: {softmax(logits).round(3)}")
print(f"   Interpretation: 81% class 0, 14% class 1, 5% class 2")

# Binary classification example
print("\n2. Binary Classification")
logit = 1.5
print(f"   Final layer logit: {logit}")
print(f"   Apply sigmoid: {sigmoid(logit):.3f}")
print(f"   Interpretation: 82% probability of positive class")

# Regression example
print("\n3. Regression (predicting house price)")
print("   Use linear output (no activation)")
print("   Network can output any real value")

# Hidden layers
print("\n4. Hidden Layers")
print("   Default: ReLU")
print("   Transformers: GELU")
print("   RNNs/LSTMs: tanh")

Key Takeaways

  • Nonlinearity is essential—without activation functions, deep networks would just be linear models
  • Sigmoid was the original but causes vanishing gradients in deep networks
  • Tanh is zero-centered but still saturates
  • ReLU revolutionized deep learning by enabling gradient flow, but can have dying neurons
  • Leaky ReLU and ELU fix the dying ReLU problem with non-zero negative slopes
  • Softmax converts logits to probabilities for multi-class classification
  • GELU and Swish are modern smooth activations used in Transformers
  • Choose activation based on layer type and task: ReLU for hidden layers, softmax/sigmoid for classification outputs

9.3 Feedforward Neural Networks (Intermediate)


A feedforward neural network is the simplest architecture for deep learning: information flows in one direction, from input through hidden layers to output, with no loops or cycles. Despite this simplicity, feedforward networks can learn remarkably complex functions. They form the foundation upon which more sophisticated architectures—convolutional networks, recurrent networks, transformers—are built.

Understanding feedforward networks means understanding how layers transform data, how representations are built hierarchically, and how the interplay of weights, biases, and activations creates powerful function approximators.

Network Architecture

A feedforward network consists of layers of neurons. The input layer receives the raw features. One or more hidden layers transform these features into increasingly abstract representations. The output layer produces the final prediction.

Each layer is fully connected (or "dense"): every neuron in one layer connects to every neuron in the next. These connections have weights that the network learns during training. The term "feedforward" means data moves only forward—from input to output—never backward or in loops.

PYTHON
import numpy as np

class FeedforwardNetwork:
    """Simple feedforward neural network."""

    def __init__(self, layer_sizes):
        """
        Initialize network with given layer sizes.
        layer_sizes: list like [input_dim, hidden1, hidden2, ..., output_dim]
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)

        # Initialize weights and biases for each layer
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            # He initialization for ReLU
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros(layer_sizes[i+1])
            self.weights.append(w)
            self.biases.append(b)

    def describe(self):
        """Print network architecture."""
        print(f"Feedforward Network: {self.layer_sizes}")
        total_params = 0
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            params = w.size + b.size
            total_params += params
            print(f"  Layer {i+1}: {w.shape[0]} -> {w.shape[1]} ({params} parameters)")
        print(f"  Total parameters: {total_params}")

# Create a network: 784 inputs (28x28 image), two hidden layers, 10 outputs
net = FeedforwardNetwork([784, 256, 128, 10])
net.describe()

The Forward Pass

The forward pass computes the network's output for a given input. Each layer performs two operations: a linear transformation followed by an activation function.

For layer $l$, the computation is:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$
$$a^{(l)} = f(z^{(l)})$$

Here $a^{(l-1)}$ is the previous layer's output (or the input for the first hidden layer), $W^{(l)}$ is the weight matrix, $b^{(l)}$ is the bias vector, $z^{(l)}$ is the pre-activation, and $f$ is the activation function.

The output of one layer becomes the input to the next, creating a chain of transformations that gradually refines the representation.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

class FeedforwardNetwork:
    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * 0.1
            b = np.zeros(layer_sizes[i+1])
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, x):
        """
        Forward pass through the network.
        Returns all intermediate activations for use in backprop.
        """
        activations = [x]
        pre_activations = []

        a = x
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            z = a @ W + b  # Linear transformation
            pre_activations.append(z)

            # Use ReLU for hidden layers, softmax for output
            if i < len(self.weights) - 1:
                a = relu(z)
            else:
                a = softmax(z)
            activations.append(a)

        return activations, pre_activations

# Example forward pass
net = FeedforwardNetwork([4, 8, 8, 3])
x = np.array([[0.5, 0.3, 0.8, 0.1]])  # Single sample, 4 features

activations, pre_activations = net.forward(x)
print("Forward pass through network [4, 8, 8, 3]:")
for i, a in enumerate(activations):
    print(f"  Layer {i} output shape: {a.shape}")
print(f"\nFinal output (probabilities): {activations[-1].round(3)}")
print(f"Predicted class: {np.argmax(activations[-1])}")

Hidden Representations

The magic of neural networks lies in their hidden layers. Each layer learns to represent the data in a way that makes the next layer's job easier. Early layers might detect simple patterns; later layers combine these into complex features.

For image classification, early layers might learn edge detectors. Middle layers combine edges into shapes. Later layers recognize object parts. The final layer uses these high-level features to classify the image.

This hierarchical feature learning is what makes deep networks powerful. Instead of engineering features by hand, the network learns the right representations automatically.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Demonstrate how representations evolve
np.random.seed(42)

# Simulated input: 2D points from two classes
# Class 0: points in a ring
# Class 1: points in the center
n_samples = 100
theta = np.random.uniform(0, 2*np.pi, n_samples // 2)
class0 = np.column_stack([np.cos(theta), np.sin(theta)]) * 2
class1 = np.random.randn(n_samples // 2, 2) * 0.5

X = np.vstack([class0, class1])
y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

print("Hidden Representation Learning:")
print(f"Input: 2D points, {n_samples} samples")
print(f"Problem: Ring (class 0) vs Center (class 1)")
print("\nA linear classifier cannot separate these classes!")
print("But a network with hidden layers can learn a nonlinear boundary.")

# Random network (untrained) - just to show the transformation
W1 = np.random.randn(2, 10) * 0.5
b1 = np.zeros(10)
W2 = np.random.randn(10, 2) * 0.5
b2 = np.zeros(2)

# Transform through hidden layer
h = relu(X @ W1 + b1)
print(f"\nAfter hidden layer: {X.shape} -> {h.shape}")
print("The hidden layer projects data to 10 dimensions")
print("In this higher-dimensional space, the classes may become separable")

Network Depth and Width

Depth is the number of layers. Width is the number of neurons per layer. Both affect what a network can learn.

Deeper networks can learn more abstract, hierarchical representations. Each layer builds on the previous, enabling increasingly sophisticated transformations. However, deeper networks are harder to train—gradients must flow through more layers.

Wider networks have more parameters per layer, increasing capacity. But width alone doesn't create hierarchical representations. A shallow but very wide network might memorize training data without learning generalizable patterns.

Modern architectures balance depth and width. ResNets go very deep (100+ layers) with moderate width. Transformers are moderately deep with very wide layers.

PYTHON
import numpy as np

def count_params(layers):
    """Count parameters for a feedforward network."""
    total = 0
    for i in range(len(layers) - 1):
        weights = layers[i] * layers[i+1]
        biases = layers[i+1]
        total += weights + biases
    return total

# Compare architectures
print("Comparing Network Architectures")
print("=" * 50)

# Same input/output, different depth/width tradeoffs
architectures = [
    ("Shallow-Wide", [784, 1024, 10]),
    ("Medium-Medium", [784, 256, 128, 10]),
    ("Deep-Narrow", [784, 128, 64, 32, 16, 10]),
    ("Very Deep", [784, 64, 64, 64, 64, 64, 64, 10])
]

for name, layers in architectures:
    params = count_params(layers)
    depth = len(layers) - 1
    print(f"\n{name}:")
    print(f"  Layers: {layers}")
    print(f"  Depth: {depth} layers")
    print(f"  Parameters: {params:,}")

Batch Processing

Neural networks process data in batches—multiple samples at once. This isn't just for efficiency; it's fundamental to how modern deep learning works.

With a batch of inputs, the forward pass becomes matrix multiplication. Instead of processing one input vector, we process a matrix where each row is a sample. This parallelizes computation on GPUs, which excel at matrix operations.

Batches also provide more stable gradients. The gradient computed over a batch averages the gradients of individual samples, reducing noise and leading to smoother training.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Network weights
np.random.seed(42)
W1 = np.random.randn(10, 32) * 0.1
b1 = np.zeros(32)
W2 = np.random.randn(32, 5) * 0.1
b2 = np.zeros(5)

# Single sample forward pass
x_single = np.random.randn(1, 10)
h_single = relu(x_single @ W1 + b1)
y_single = softmax(h_single @ W2 + b2)
print("Single sample processing:")
print(f"  Input shape: {x_single.shape}")
print(f"  Output shape: {y_single.shape}")

# Batch forward pass - same computation, but for many samples at once
batch_size = 64
X_batch = np.random.randn(batch_size, 10)
H_batch = relu(X_batch @ W1 + b1)
Y_batch = softmax(H_batch @ W2 + b2)
print(f"\nBatch processing ({batch_size} samples):")
print(f"  Input shape: {X_batch.shape}")
print(f"  Output shape: {Y_batch.shape}")
print("\nThe same operations work for batches!")
print("Matrix multiplication handles multiple samples efficiently")

Weight Initialization

How you initialize weights matters enormously. Bad initialization can prevent learning entirely.

If weights are too small, signals shrink as they pass through layers. Gradients vanish, and the network doesn't learn. If weights are too large, signals explode, and training becomes unstable.

Xavier/Glorot initialization (for sigmoid/tanh) scales weights by $\sqrt{1/n_{in}}$ or $\sqrt{2/(n_{in}+n_{out})}$. He initialization (for ReLU) scales by $\sqrt{2/n_{in}}$. These keep signal variance stable across layers.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def forward_pass(x, weights, activation):
    """Forward pass through multiple layers."""
    a = x
    for W in weights:
        z = a @ W
        a = activation(z)
    return a

# Compare initializations
np.random.seed(42)
n_layers = 10
layer_size = 256
x = np.random.randn(100, layer_size)  # 100 samples

print("Effect of Weight Initialization")
print("=" * 50)

# Bad: too small
weights_small = [np.random.randn(layer_size, layer_size) * 0.01 for _ in range(n_layers)]
out_small = forward_pass(x, weights_small, relu)
print(f"\nToo small (0.01):")
print(f"  Output mean: {np.mean(np.abs(out_small)):.2e}")
print(f"  Signal vanished!")

# Bad: too large
weights_large = [np.random.randn(layer_size, layer_size) * 1.0 for _ in range(n_layers)]
out_large = forward_pass(x, weights_large, relu)
print(f"\nToo large (1.0):")
print(f"  Output mean: {np.mean(np.abs(out_large)):.2e}")
print(f"  Signal exploded!")

# Good: He initialization for ReLU
weights_he = [np.random.randn(layer_size, layer_size) * np.sqrt(2.0/layer_size) for _ in range(n_layers)]
out_he = forward_pass(x, weights_he, relu)
print(f"\nHe initialization (sqrt(2/n)):")
print(f"  Output mean: {np.mean(np.abs(out_he)):.2e}")
print(f"  Signal preserved!")

Universal Approximation

The Universal Approximation Theorem states that a feedforward network with a single hidden layer can approximate any continuous function to arbitrary accuracy—given enough neurons. This is remarkable: a simple architecture with sufficient width can learn any function.

However, the theorem doesn't say how many neurons you need. In practice, approximating complex functions with a single layer might require exponentially many neurons. Deep networks often achieve the same with far fewer parameters by learning hierarchical representations.

The theorem guarantees that networks are powerful enough. Practical success depends on training dynamics, data quantity, and architecture choices.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Demonstrate: a wide single-layer network can approximate functions
# We'll approximate sin(x) with a ReLU network

np.random.seed(42)

# Target function
x = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
y_true = np.sin(x)

# Single hidden layer with many neurons
n_hidden = 50
W1 = np.random.randn(1, n_hidden) * 2
b1 = np.random.randn(n_hidden) * 2
W2 = np.random.randn(n_hidden, 1) * 0.1
b2 = np.zeros(1)

# Forward pass (random weights - not trained!)
h = relu(x @ W1 + b1)
y_pred = h @ W2 + b2

print("Universal Approximation Demonstration")
print("=" * 50)
print(f"Target: sin(x) on [-π, π]")
print(f"Network: 1 -> {n_hidden} (ReLU) -> 1")
print(f"\nWith random weights, output is just random:")
print(f"  MSE: {np.mean((y_pred - y_true)**2):.4f}")
print(f"\nBut with training, this network CAN approximate sin(x)")
print("The Universal Approximation Theorem guarantees this is possible")
print("given enough hidden neurons and proper training.")

A Complete Example

Let's build a complete feedforward network for classification, implementing all the pieces we've discussed.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def cross_entropy_loss(y_pred, y_true):
    """Categorical cross-entropy loss."""
    n_samples = y_pred.shape[0]
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    # One-hot encode y_true if needed
    if y_true.ndim == 1:
        y_one_hot = np.zeros_like(y_pred)
        y_one_hot[np.arange(n_samples), y_true] = 1
        y_true = y_one_hot
    return -np.sum(y_true * np.log(y_pred)) / n_samples

class SimpleNeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.01):
        self.lr = learning_rate
        self.weights = []
        self.biases = []

        # He initialization
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros(layer_sizes[i+1])
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass, storing activations for backprop."""
        self.activations = [X]
        self.pre_activations = []

        a = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            z = a @ W + b
            self.pre_activations.append(z)

            if i < len(self.weights) - 1:
                a = relu(z)
            else:
                a = softmax(z)
            self.activations.append(a)

        return self.activations[-1]

    def predict(self, X):
        """Return class predictions."""
        probs = self.forward(X)
        return np.argmax(probs, axis=1)

    def accuracy(self, X, y):
        """Compute classification accuracy."""
        predictions = self.predict(X)
        return np.mean(predictions == y)

# Create a simple dataset
np.random.seed(42)
n_samples = 300
n_features = 4
n_classes = 3

# Generate random data
X = np.random.randn(n_samples, n_features)
y = np.random.randint(0, n_classes, n_samples)

# Create network
net = SimpleNeuralNetwork([n_features, 16, 8, n_classes])

# Forward pass
output = net.forward(X[:5])
print("Complete Feedforward Network Example")
print("=" * 50)
print(f"Architecture: {n_features} -> 16 -> 8 -> {n_classes}")
print(f"\nSample predictions (first 5):")
print(f"Probabilities:\n{output.round(3)}")
print(f"Predicted classes: {net.predict(X[:5])}")
print(f"True classes: {y[:5]}")
print(f"\nInitial accuracy (random weights): {net.accuracy(X, y):.1%}")
print("(Expected ~33% for 3 classes with random weights)")

Key Takeaways

  • Feedforward networks pass data one direction: input → hidden layers → output
  • Fully connected layers connect every neuron to every neuron in the next layer
  • The forward pass alternates linear transformations and activation functions
  • Hidden layers learn hierarchical representations—simple features combine into complex ones
  • Depth enables abstraction; width increases capacity
  • Batch processing enables efficient GPU computation and stable gradients
  • Proper weight initialization (He or Xavier) prevents vanishing/exploding signals
  • The Universal Approximation Theorem guarantees networks can learn any function
  • These networks form the foundation for all deep learning architectures

9.4 Backpropagation (Intermediate)


Backpropagation is the algorithm that makes deep learning possible. It computes gradients—how much each weight contributes to the error—by propagating error signals backward through the network. These gradients tell us how to adjust weights to reduce errors.

Without backpropagation, training neural networks would be computationally intractable. Trying every possible weight combination is impossible. Estimating gradients by perturbing each weight individually would require millions of forward passes. Backpropagation computes all gradients in just two passes through the network: one forward, one backward.

The Gradient Descent Framework

Neural network training follows gradient descent: compute how wrong the network is (the loss), compute which direction improves each weight (the gradient), then nudge weights in that direction.

For a loss function $L$ and weight $w$, the gradient $\frac{\partial L}{\partial w}$ tells us: if we increase $w$ slightly, how much does the loss change? A positive gradient means increasing $w$ increases loss—so we should decrease $w$. A negative gradient means the opposite.

The update rule is: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$

where $\eta$ is the learning rate. We move weights in the direction that decreases loss.

PYTHON
import numpy as np

# Gradient descent intuition
def f(x):
    """A simple function we want to minimize."""
    return x**2 + 2*x + 1  # Minimum at x = -1

def gradient(x):
    """Gradient of f."""
    return 2*x + 2

# Start at a random point
x = 5.0
learning_rate = 0.1

print("Gradient Descent Example: minimizing f(x) = x² + 2x + 1")
print(f"{'Step':<6} {'x':<12} {'f(x)':<12} {'gradient':<12}")
print("-" * 42)

for step in range(10):
    fx = f(x)
    grad = gradient(x)
    print(f"{step:<6} {x:<12.4f} {fx:<12.4f} {grad:<12.4f}")

    # Update x in direction that reduces f
    x = x - learning_rate * grad

print(f"\nFinal x: {x:.4f}")
print(f"True minimum at x = -1.0")

The Chain Rule

Backpropagation is just the chain rule from calculus, applied systematically. When functions are composed, the derivative of the composition is the product of derivatives.

If $y = f(g(x))$, then $\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$

In neural networks, the output is a long composition of functions: activations, linear transformations, more activations, and so on. To find how a weight deep in the network affects the final loss, we chain together all the intermediate derivatives.

PYTHON
import numpy as np

# Chain rule example
# y = relu(w * x + b)
# loss = (y - target)^2

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# Forward pass
x = 2.0
w = 0.5
b = 0.1
target = 1.0

z = w * x + b           # Pre-activation
y = relu(z)             # Activation
loss = (y - target)**2  # Loss

print("Chain Rule in Action")
print("=" * 40)
print(f"Forward pass:")
print(f"  z = w*x + b = {w}*{x} + {b} = {z}")
print(f"  y = relu(z) = {y}")
print(f"  loss = (y - target)² = ({y} - {target})² = {loss}")

# Backward pass (chain rule)
dloss_dy = 2 * (y - target)     # d(loss)/d(y)
dy_dz = relu_derivative(z)       # d(relu)/d(z)
dz_dw = x                        # d(w*x + b)/d(w)

# Chain them together
dloss_dw = dloss_dy * dy_dz * dz_dw

print(f"\nBackward pass (chain rule):")
print(f"  ∂loss/∂y = 2(y - target) = {dloss_dy}")
print(f"  ∂y/∂z = relu'(z) = {dy_dz}")
print(f"  ∂z/∂w = x = {dz_dw}")
print(f"\n  ∂loss/∂w = {dloss_dy} × {dy_dz} × {dz_dw} = {dloss_dw}")

Computational Graphs

It helps to visualize neural networks as computational graphs. Each node performs an operation; edges carry values forward and gradients backward.

During the forward pass, we compute and cache intermediate values at each node. During the backward pass, gradients flow from the loss back through each node. Each node multiplies the incoming gradient by its local derivative and passes the result to its inputs.

This structure makes backpropagation modular. Each operation only needs to know its own local derivative. The chain rule automatically combines everything.

PYTHON
import numpy as np

class ComputeNode:
    """Base class for computational graph nodes."""
    def forward(self, *inputs):
        raise NotImplementedError

    def backward(self, grad_output):
        raise NotImplementedError

class Multiply(ComputeNode):
    """Multiplication: z = x * y"""
    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, grad_output):
        # d(xy)/dx = y, d(xy)/dy = x
        return grad_output * self.y, grad_output * self.x

class Add(ComputeNode):
    """Addition: z = x + y"""
    def forward(self, x, y):
        return x + y

    def backward(self, grad_output):
        # d(x+y)/dx = 1, d(x+y)/dy = 1
        return grad_output, grad_output

class ReLU(ComputeNode):
    """ReLU activation"""
    def forward(self, x):
        self.x = x
        return np.maximum(0, x)

    def backward(self, grad_output):
        return grad_output * (self.x > 0)

# Build a simple graph: y = relu(w * x + b)
mult = Multiply()
add = Add()
relu = ReLU()

x, w, b = 2.0, 0.5, 0.1
target = 1.0

# Forward pass
z1 = mult.forward(w, x)
z2 = add.forward(z1, b)
y = relu.forward(z2)
loss = (y - target) ** 2

print("Computational Graph Example")
print("=" * 40)
print(f"y = relu(w * x + b)")
print(f"Forward: {w} * {x} = {z1} → + {b} = {z2} → relu = {y}")

# Backward pass
grad_loss = 2 * (y - target)
grad_z2 = relu.backward(grad_loss)
grad_z1, grad_b = add.backward(grad_z2)
grad_w, grad_x = mult.backward(grad_z1)

print(f"\nBackward pass:")
print(f"  ∂loss/∂y = {grad_loss}")
print(f"  ∂loss/∂w = {grad_w}")
print(f"  ∂loss/∂b = {grad_b}")

Forward and Backward Pass

In practice, backpropagation has two phases:

Forward pass: Compute the output, caching intermediate values needed for gradients. This is the normal network computation—input goes in, prediction comes out, loss is computed.

Backward pass: Starting from the loss, compute gradients layer by layer going backward. Each layer receives the gradient from the layer above, multiplies by its local gradient, and passes the result to the layer below.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_backward(grad, x):
    return grad * (x > 0)

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Simple 2-layer network: input(3) -> hidden(4) -> output(2)
np.random.seed(42)

# Weights
W1 = np.random.randn(3, 4) * 0.5
b1 = np.zeros(4)
W2 = np.random.randn(4, 2) * 0.5
b2 = np.zeros(2)

# Input and target
x = np.array([1.0, 0.5, -0.3])
target = 0  # Class 0

print("Forward and Backward Pass")
print("=" * 50)

# FORWARD PASS
print("\n--- Forward Pass ---")

# Layer 1
z1 = x @ W1 + b1
print(f"z1 = x @ W1 + b1: shape {z1.shape}")
a1 = relu(z1)
print(f"a1 = relu(z1): {a1.round(3)}")

# Layer 2
z2 = a1 @ W2 + b2
print(f"z2 = a1 @ W2 + b2: shape {z2.shape}")
a2 = softmax(z2)
print(f"a2 = softmax(z2): {a2.round(3)}")

# Loss (cross-entropy for class 0)
loss = -np.log(a2[target])
print(f"\nLoss = -log(p[{target}]) = {loss:.4f}")

# BACKWARD PASS
print("\n--- Backward Pass ---")

# Gradient of cross-entropy + softmax combined
dz2 = a2.copy()
dz2[target] -= 1  # Softmax + cross-entropy derivative
print(f"dL/dz2 = {dz2.round(4)}")

# Gradient w.r.t. W2 and b2
dW2 = np.outer(a1, dz2)
db2 = dz2
print(f"dL/dW2 shape: {dW2.shape}")
print(f"dL/db2: {db2.round(4)}")

# Gradient flowing to a1
da1 = dz2 @ W2.T
print(f"dL/da1: {da1.round(4)}")

# Through ReLU
dz1 = relu_backward(da1, z1)
print(f"dL/dz1: {dz1.round(4)}")

# Gradient w.r.t. W1 and b1
dW1 = np.outer(x, dz1)
db1 = dz1
print(f"dL/dW1 shape: {dW1.shape}")

Gradient Flow and Layer Interactions

During backpropagation, gradients must flow through every layer. If any layer severely shrinks or amplifies gradients, training fails.

Each layer transforms gradients by multiplying by the transpose of its weight matrix. If weights are too small, gradients shrink. If too large, they explode. Activation functions also affect gradient flow—sigmoid squashes gradients toward zero for large inputs.

Understanding gradient flow explains many training tricks: why we use ReLU (preserves gradients), why we initialize carefully (prevents initial gradient explosion/vanishing), why batch normalization helps (stabilizes gradient magnitudes).

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

# Track gradient flow through layers
np.random.seed(42)

n_layers = 5
layer_size = 100
x = np.random.randn(layer_size)

print("Gradient Flow Through Layers")
print("=" * 50)

# With ReLU
print("\nWith ReLU activation:")
grad = np.ones(layer_size)  # Start with gradient = 1
for i in range(n_layers):
    W = np.random.randn(layer_size, layer_size) * np.sqrt(2.0 / layer_size)
    grad = (grad @ W.T) * (np.random.rand(layer_size) > 0.3)  # Simulate ReLU mask
    print(f"  Layer {i+1}: gradient magnitude = {np.linalg.norm(grad):.4f}")

# With sigmoid (problematic)
print("\nWith Sigmoid activation:")
grad = np.ones(layer_size)
for i in range(n_layers):
    W = np.random.randn(layer_size, layer_size) * 0.5
    # Sigmoid derivative is at most 0.25
    grad = (grad @ W.T) * 0.2  # Typical sigmoid derivative
    print(f"  Layer {i+1}: gradient magnitude = {np.linalg.norm(grad):.4f}")

print("\nSigmoid gradients vanish rapidly!")
print("This is why ReLU revolutionized deep learning.")

Backpropagation for a Full Network

Let's implement complete backpropagation for a multi-layer network. We'll compute gradients for all weights and biases, then update them.

PYTHON
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def cross_entropy_loss(y_pred, y_true):
    n_samples = len(y_true)
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    loss = -np.sum(np.log(y_pred[np.arange(n_samples), y_true])) / n_samples
    return loss

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros(layer_sizes[i+1])
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass with caching."""
        self.activations = [X]
        self.pre_activations = []

        a = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            z = a @ W + b
            self.pre_activations.append(z)

            if i < len(self.weights) - 1:
                a = relu(z)
            else:
                a = softmax(z)
            self.activations.append(a)

        return a

    def backward(self, y_true):
        """Backward pass computing all gradients."""
        n_samples = len(y_true)
        self.weight_gradients = []
        self.bias_gradients = []

        # Start: gradient of loss w.r.t. final layer output
        y_pred = self.activations[-1]
        dz = y_pred.copy()
        dz[np.arange(n_samples), y_true] -= 1
        dz /= n_samples

        # Go backward through layers
        for i in range(len(self.weights) - 1, -1, -1):
            a_prev = self.activations[i]

            # Gradients for this layer's weights and biases
            dW = a_prev.T @ dz
            db = np.sum(dz, axis=0)
            self.weight_gradients.insert(0, dW)
            self.bias_gradients.insert(0, db)

            if i > 0:
                # Gradient flowing to previous layer
                da = dz @ self.weights[i].T
                # Through ReLU
                dz = da * (self.pre_activations[i-1] > 0)

    def update(self, learning_rate):
        """Update weights using computed gradients."""
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * self.weight_gradients[i]
            self.biases[i] -= learning_rate * self.bias_gradients[i]

    def train_step(self, X, y, learning_rate):
        """One complete training step."""
        # Forward
        y_pred = self.forward(X)
        loss = cross_entropy_loss(y_pred, y)

        # Backward
        self.backward(y)

        # Update
        self.update(learning_rate)

        return loss

# Train on a simple dataset
np.random.seed(42)

# Create data: 3 classes, 4 features
n_samples = 200
X = np.random.randn(n_samples, 4)
y = np.random.randint(0, 3, n_samples)

# Create network
net = NeuralNetwork([4, 16, 8, 3])

print("Training with Backpropagation")
print("=" * 50)

# Train for several epochs
for epoch in range(5):
    loss = net.train_step(X, y, learning_rate=0.1)
    pred = np.argmax(net.forward(X), axis=1)
    acc = np.mean(pred == y)
    print(f"Epoch {epoch+1}: loss = {loss:.4f}, accuracy = {acc:.1%}")

Mini-Batch Gradient Descent

In practice, we don't compute gradients over the entire dataset at once (too expensive) or one sample at a time (too noisy). We use mini-batches: compute gradients over a small batch of samples, update, repeat.

Mini-batches balance efficiency and stability. They're small enough to fit in memory and allow frequent updates, but large enough to provide stable gradient estimates. Typical batch sizes are 32, 64, 128, or 256.

PYTHON
import numpy as np

def relu(x): return np.maximum(0, x)
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Create larger dataset
np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 10)
y = np.random.randint(0, 5, n_samples)

# Simple network weights
W1 = np.random.randn(10, 32) * np.sqrt(2.0 / 10)
b1 = np.zeros(32)
W2 = np.random.randn(32, 5) * np.sqrt(2.0 / 32)
b2 = np.zeros(5)

def forward(X):
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    return softmax(z2), a1, z1

print("Mini-Batch Gradient Descent")
print("=" * 50)

batch_size = 64
n_batches = n_samples // batch_size
learning_rate = 0.01

for epoch in range(3):
    # Shuffle data each epoch
    idx = np.random.permutation(n_samples)
    X_shuffled = X[idx]
    y_shuffled = y[idx]

    total_loss = 0
    for batch in range(n_batches):
        # Get mini-batch
        start = batch * batch_size
        end = start + batch_size
        X_batch = X_shuffled[start:end]
        y_batch = y_shuffled[start:end]

        # Forward pass
        y_pred, a1, z1 = forward(X_batch)

        # Compute loss
        loss = -np.mean(np.log(y_pred[np.arange(batch_size), y_batch] + 1e-15))
        total_loss += loss

        # Backward pass (simplified, updating global weights)
        dz2 = y_pred.copy()
        dz2[np.arange(batch_size), y_batch] -= 1
        dz2 /= batch_size

        dW2 = a1.T @ dz2
        db2 = np.sum(dz2, axis=0)

        da1 = dz2 @ W2.T
        dz1 = da1 * (z1 > 0)
        dW1 = X_batch.T @ dz1
        db1 = np.sum(dz1, axis=0)

        # Update
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2

    avg_loss = total_loss / n_batches
    y_pred_all, _, _ = forward(X)
    acc = np.mean(np.argmax(y_pred_all, axis=1) == y)
    print(f"Epoch {epoch+1}: avg loss = {avg_loss:.4f}, accuracy = {acc:.1%}")

Common Pitfalls

Several issues can derail backpropagation:

Vanishing gradients: Gradients shrink to near-zero, preventing learning in early layers. Use ReLU, careful initialization, or techniques like batch normalization.

Exploding gradients: Gradients grow without bound, causing unstable updates and often NaN losses. Use gradient clipping to cap gradient magnitudes.

Dead neurons: With ReLU, neurons can get stuck outputting zero permanently—and since their gradient is then also zero, they never recover. Use Leaky ReLU or careful learning rate tuning; see the short sketch after the code below.

Numerical instability: Softmax with large inputs can overflow. Subtract the maximum before exponentiating for stability.

PYTHON
import numpy as np

print("Common Backpropagation Pitfalls")
print("=" * 50)

# 1. Exploding gradients
print("\n1. Exploding Gradients:")
grad = 1.0
for i in range(10):
    grad *= 2.0  # Each layer doubles the gradient
print(f"   After 10 layers (multiplier 2.0): gradient = {grad}")
print("   Solution: Gradient clipping")
clipped = np.clip(grad, -10, 10)
print(f"   Clipped to [-10, 10]: {clipped}")

# 2. Vanishing gradients
print("\n2. Vanishing Gradients:")
grad = 1.0
for i in range(10):
    grad *= 0.25  # Sigmoid derivative at saturation
print(f"   After 10 layers (multiplier 0.25): gradient = {grad:.2e}")
print("   Solution: Use ReLU activation")

# 3. Numerical stability in softmax
print("\n3. Softmax Stability:")
x = np.array([1000, 1001, 1002])  # Large values
print(f"   Logits: {x}")
with np.errstate(over='ignore'):
    unstable = np.exp(x)  # Overflows to inf (NumPy warns rather than raising)
print(f"   Naive exp: {unstable}")
stable = np.exp(x - np.max(x))
stable = stable / np.sum(stable)
print(f"   Stable softmax: {stable.round(4)}")
print("   Solution: Subtract max before exp")

Key Takeaways

  • Backpropagation computes gradients by applying the chain rule backward through the network
  • The forward pass computes outputs while caching values for gradient computation
  • The backward pass propagates gradients from loss to weights, layer by layer
  • Computational graphs make backpropagation modular—each operation handles its own gradients
  • Mini-batch gradient descent balances computational efficiency with gradient stability
  • Proper initialization and activation functions prevent vanishing/exploding gradients
  • Gradient clipping handles exploding gradients; Leaky ReLU prevents dead neurons
  • Understanding gradient flow is key to debugging and improving neural network training

9.5 Building Networks with PyTorch Intermediate

Building Networks with PyTorch

PyTorch is the deep learning framework that has become the standard for research and increasingly for production. It provides everything we've built by hand—layers, activations, automatic differentiation, optimizers—in an elegant, well-tested package. Understanding the manual implementations helps you appreciate what PyTorch does, but using PyTorch lets you focus on architecture and experiments rather than implementation details.

This section bridges from raw NumPy implementations to professional deep learning code. The concepts are identical; the tools are more powerful.

PyTorch Tensors

PyTorch's fundamental data structure is the tensor—a multi-dimensional array similar to NumPy's ndarray. Tensors support GPU acceleration and automatic differentiation, the two features that make PyTorch powerful.

The key difference from NumPy: tensors can track their computation history. When you set requires_grad=True, PyTorch remembers every operation performed on that tensor. This enables automatic backpropagation.

PYTHON
import torch

# Creating tensors
x = torch.tensor([1.0, 2.0, 3.0])
print(f"1D tensor: {x}")

# Random tensor
X = torch.randn(3, 4)  # 3 rows, 4 columns
print(f"\nRandom 3x4 tensor:\n{X}")

# Tensor with gradient tracking
w = torch.randn(4, requires_grad=True)
print(f"\nTensor with gradients: {w}")
print(f"Requires grad: {w.requires_grad}")

# Basic operations (like NumPy)
a = torch.tensor([1.0, 2.0])
b = torch.tensor([3.0, 4.0])
print(f"\na + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a @ b (dot product) = {a @ b}")

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nDevice: {device}")
x_gpu = X.to(device)

Automatic Differentiation

PyTorch's automatic differentiation (autograd) computes gradients automatically. Perform a forward pass, call .backward() on the loss, and gradients appear in each tensor's .grad attribute.

This replaces all the manual backpropagation we implemented earlier. PyTorch builds a computational graph during the forward pass and traverses it backward during .backward().

PYTHON
import torch

# Simple example: y = x^2, find dy/dx at x=3
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()  # Compute gradient

print("Automatic Differentiation")
print("=" * 40)
print(f"x = {x.item()}")
print(f"y = x² = {y.item()}")
print(f"dy/dx = 2x = {x.grad.item()}")  # Should be 6.0

# More complex example: neural network forward pass
print("\nNeural network gradient example:")
W = torch.randn(3, 2, requires_grad=True)
b = torch.zeros(2, requires_grad=True)
x = torch.randn(3)

# Forward pass
z = x @ W + b
y = torch.relu(z)
loss = y.sum()

# Backward pass
loss.backward()

print(f"Input x: {x}")
print(f"Output y: {y}")
print(f"W.grad shape: {W.grad.shape}")
print(f"b.grad: {b.grad}")

nn.Module: Building Blocks

PyTorch organizes neural networks using nn.Module. A module can be a single layer (like nn.Linear) or a complex network containing other modules. Every layer, activation, and model is a module.

You create custom networks by subclassing nn.Module, defining layers in __init__, and defining the forward pass in forward(). PyTorch handles everything else: tracking parameters, moving to GPU, saving/loading.

PYTHON
import torch
import torch.nn as nn

# Using built-in modules
linear = nn.Linear(10, 5)  # 10 inputs, 5 outputs
print("nn.Linear module:")
print(f"  Weight shape: {linear.weight.shape}")
print(f"  Bias shape: {linear.bias.shape}")

# Forward pass through the module
x = torch.randn(3, 10)  # Batch of 3 samples
y = linear(x)
print(f"\n  Input shape: {x.shape}")
print(f"  Output shape: {y.shape}")

# Custom network as nn.Module
class SimpleNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create and use the network
net = SimpleNetwork(10, 20, 3)
print(f"\nCustom network: {net}")

# List all parameters
print("\nParameters:")
for name, param in net.named_parameters():
    print(f"  {name}: {param.shape}")

# Total parameter count
total = sum(p.numel() for p in net.parameters())
print(f"\nTotal parameters: {total}")

Sequential Models

For simple feedforward networks, nn.Sequential provides a concise way to stack layers. You list the layers in order, and PyTorch chains them together automatically.

This is ideal for prototyping and simple architectures. For complex networks with branching or residual connections, you'll need custom modules.

PYTHON
import torch
import torch.nn as nn

# Simple sequential model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

print("Sequential Model:")
print(model)

# Forward pass
x = torch.randn(32, 784)  # Batch of 32 images (flattened 28x28)
y = model(x)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {y.shape}")

# Named sequential (for accessing layers by name)
model_named = nn.Sequential()
model_named.add_module('fc1', nn.Linear(784, 256))
model_named.add_module('relu1', nn.ReLU())
model_named.add_module('fc2', nn.Linear(256, 128))
model_named.add_module('relu2', nn.ReLU())
model_named.add_module('output', nn.Linear(128, 10))

print("\nNamed Sequential Model:")
for name, layer in model_named.named_children():
    print(f"  {name}: {layer}")

Loss Functions

PyTorch provides loss functions as modules. The most common: nn.CrossEntropyLoss for classification (combines softmax + negative log likelihood) and nn.MSELoss for regression.

CrossEntropyLoss expects raw logits (before softmax) and integer labels. It's numerically stable and efficient.

PYTHON
import torch
import torch.nn as nn

# Classification loss
print("Classification: CrossEntropyLoss")
print("=" * 50)

criterion = nn.CrossEntropyLoss()

# Logits (raw network output, before softmax)
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 1, 2])  # True classes

loss = criterion(logits, labels)
print(f"Logits:\n{logits}")
print(f"Labels: {labels}")
print(f"Loss: {loss.item():.4f}")

# Probabilities (for reference)
probs = torch.softmax(logits, dim=1)
print(f"\nProbabilities (from logits):\n{probs}")

# Regression loss
print("\nRegression: MSELoss")
print("=" * 50)

mse_criterion = nn.MSELoss()
predictions = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])

mse_loss = mse_criterion(predictions, targets)
print(f"Predictions: {predictions}")
print(f"Targets: {targets}")
print(f"MSE Loss: {mse_loss.item():.4f}")

Optimizers

Optimizers update weights based on gradients. PyTorch provides all common optimizers: SGD, Adam, AdamW, RMSprop, and more. You pass the model's parameters to the optimizer, and it handles updates.

The typical pattern: zero gradients, forward pass, compute loss, backward pass, optimizer step.

PYTHON
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 2)
)

# Different optimizers
sgd = optim.SGD(model.parameters(), lr=0.01)
adam = optim.Adam(model.parameters(), lr=0.001)
adamw = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

print("Common Optimizers:")
print("  SGD: Simple gradient descent with optional momentum")
print("  Adam: Adaptive learning rates, usually works well")
print("  AdamW: Adam with proper weight decay (recommended)")

# Training step pattern
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Dummy data
X = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))

print("\nTraining step pattern:")
for step in range(3):
    # 1. Zero gradients
    optimizer.zero_grad()

    # 2. Forward pass
    outputs = model(X)

    # 3. Compute loss
    loss = criterion(outputs, y)

    # 4. Backward pass
    loss.backward()

    # 5. Update weights
    optimizer.step()

    print(f"  Step {step + 1}: loss = {loss.item():.4f}")

The Complete Training Loop

Putting it all together: data loading, forward pass, loss computation, backpropagation, and weight updates. This is the pattern you'll use for every PyTorch project.

PYTHON
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Create synthetic dataset
torch.manual_seed(42)
n_samples = 1000
n_features = 20
n_classes = 5

X = torch.randn(n_samples, n_features)
y = torch.randint(0, n_classes, (n_samples,))

# Create DataLoader
dataset = TensorDataset(X, y)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

print("Complete Training Loop")
print("=" * 50)

# Define model
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, n_classes)
)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
n_epochs = 5
for epoch in range(n_epochs):
    total_loss = 0
    correct = 0
    total = 0

    for batch_X, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track statistics
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += batch_y.size(0)
        correct += predicted.eq(batch_y).sum().item()

    avg_loss = total_loss / len(train_loader)
    accuracy = 100 * correct / total
    print(f"Epoch {epoch + 1}/{n_epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.1f}%")

GPU Training

PyTorch makes GPU training straightforward. Move your model and data to the GPU with .to(device), and everything else stays the same. GPU acceleration can provide 10-100x speedups for large networks.

PYTHON
import torch
import torch.nn as nn
import torch.optim as optim

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create model and move to GPU
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
).to(device)

print("\nModel on device:", next(model.parameters()).device)

# Training with GPU
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Create data and move to GPU
X = torch.randn(64, 784).to(device)
y = torch.randint(0, 10, (64,)).to(device)

print(f"Data on device: {X.device}")

# Training step (same as before, but on GPU)
optimizer.zero_grad()
outputs = model(X)
loss = criterion(outputs, y)
loss.backward()
optimizer.step()

print(f"Loss: {loss.item():.4f}")

# Note: If no GPU available, code runs on CPU transparently
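
As a rough, unscientific illustration of the speedup claim, the sketch below times one large matrix multiply on CPU and (if available) on GPU. Exact numbers depend heavily on hardware, and the first GPU call includes one-time startup cost, so treat the output as indicative only.

PYTHON
import time
import torch

A = torch.randn(2000, 2000)
B = torch.randn(2000, 2000)

# CPU timing
start = time.time()
_ = A @ B
print(f"CPU matmul: {(time.time() - start) * 1000:.1f} ms")

# GPU timing (skipped if no CUDA device is present)
if torch.cuda.is_available():
    A_gpu, B_gpu = A.cuda(), B.cuda()
    _ = A_gpu @ B_gpu          # warm-up call to absorb startup cost
    torch.cuda.synchronize()   # wait for asynchronous GPU work to finish
    start = time.time()
    _ = A_gpu @ B_gpu
    torch.cuda.synchronize()
    print(f"GPU matmul: {(time.time() - start) * 1000:.1f} ms")
else:
    print("No GPU available; skipping GPU timing")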

Saving and Loading Models

Save trained models for later use. PyTorch saves the state_dict (a dictionary of parameters) rather than the entire model, which is more flexible and portable.

PYTHON
import torch
import torch.nn as nn

# Create and "train" a model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5)
)

# Pretend we trained it
print("Model state_dict:")
for name, param in model.state_dict().items():
    print(f"  {name}: {param.shape}")

# Save the model
torch.save(model.state_dict(), 'model_weights.pth')
print("\nModel saved to 'model_weights.pth'")

# Load the model
# First, create the same architecture
loaded_model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5)
)

# Then load the weights
loaded_model.load_state_dict(torch.load('model_weights.pth'))
print("Model loaded successfully")

# Verify weights match
print("\nWeights match:",
      torch.equal(model[0].weight, loaded_model[0].weight))

# For inference, set to evaluation mode
loaded_model.eval()
print("\nModel in eval mode (disables dropout, etc.)")

Common Layers and Activations

PyTorch provides every layer and activation you'll need. Here are the most common ones.

PYTHON
import torch
import torch.nn as nn

# Create sample input
x = torch.randn(4, 10)  # Batch of 4, 10 features

print("Common PyTorch Layers")
print("=" * 50)

# Linear (Dense) layer
linear = nn.Linear(10, 5)
print(f"\nLinear: {x.shape} -> {linear(x).shape}")

# Convolutional layer
x_2d = torch.randn(4, 3, 32, 32)  # Batch of 4, 3 channels, 32x32
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
print(f"Conv2d: {x_2d.shape} -> {conv(x_2d).shape}")

# Batch Normalization
bn = nn.BatchNorm1d(10)
print(f"BatchNorm1d: normalizes features across batch")

# Dropout
dropout = nn.Dropout(p=0.5)
print(f"Dropout: randomly zeros 50% of inputs during training")

# Activations
print("\nCommon Activations:")
print(f"  ReLU: {nn.ReLU()}")
print(f"  LeakyReLU: {nn.LeakyReLU(0.01)}")
print(f"  GELU: {nn.GELU()}")
print(f"  Sigmoid: {nn.Sigmoid()}")
print(f"  Softmax: {nn.Softmax(dim=1)}")

# Embedding (for NLP)
embedding = nn.Embedding(1000, 64)  # 1000 words, 64-dim vectors
word_ids = torch.tensor([1, 42, 999, 0])
embedded = embedding(word_ids)
print(f"\nEmbedding: word IDs {word_ids.shape} -> vectors {embedded.shape}")

A Real Example: MNIST Classifier

Let's build a complete MNIST digit classifier—a standard first deep learning project.

PYTHON
import torch
import torch.nn as nn
import torch.optim as optim
# torchvision's datasets/transforms are only needed for real MNIST (see the loading sketch after this example)
from torch.utils.data import DataLoader

# For this example, we'll simulate MNIST-like data
torch.manual_seed(42)

# Simulated MNIST: 28x28 images, 10 classes
n_train = 1000
n_test = 200
X_train = torch.randn(n_train, 1, 28, 28)
y_train = torch.randint(0, 10, (n_train,))
X_test = torch.randn(n_test, 1, 28, 28)
y_test = torch.randint(0, 10, (n_test,))

# Create data loaders
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
test_dataset = torch.utils.data.TensorDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Define the network
class MNISTClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.network = nn.Sequential(
            nn.Linear(28 * 28, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.network(x)

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MNISTClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print("MNIST-style Classifier Training")
print("=" * 50)
print(f"Model: {sum(p.numel() for p in model.parameters())} parameters")
print(f"Device: {device}")

# Training
def train_epoch(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for X, y in loader:
        X, y = X.to(device), y.to(device)

        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += y.size(0)
        correct += predicted.eq(y).sum().item()

    return total_loss / len(loader), 100 * correct / total

# Evaluation
def evaluate(model, loader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(device), y.to(device)
            outputs = model(X)
            _, predicted = outputs.max(1)
            total += y.size(0)
            correct += predicted.eq(y).sum().item()

    return 100 * correct / total

# Train for 5 epochs
for epoch in range(5):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer)
    test_acc = evaluate(model, test_loader)
    print(f"Epoch {epoch + 1}: Train Loss = {train_loss:.4f}, "
          f"Train Acc = {train_acc:.1f}%, Test Acc = {test_acc:.1f}%")

print("\nTraining complete!")
print("(With real MNIST data, you'd see ~98% accuracy)")

Key Takeaways

  • PyTorch tensors are like NumPy arrays with GPU support and automatic differentiation
  • Autograd computes gradients automatically—just call .backward() on the loss
  • nn.Module is the building block for all neural networks in PyTorch
  • nn.Sequential provides quick model building for simple architectures
  • Loss functions and optimizers are modules with standard interfaces
  • The training loop: zero_grad → forward → loss → backward → step
  • GPU training: just move model and data to the device with .to(device)
  • Save/load models using state_dict for flexibility and portability
  • PyTorch handles the complexity so you can focus on model design and experimentation