
Chapter 2: Calculus and Optimization

Derivatives, gradients, and optimization algorithms that power neural network training.

Libraries covered: NumPy, SciPy, SymPy

Learning Objectives

["Compute gradients and partial derivatives", "Understand gradient descent and its variants", "Apply optimization to ML problems"]


2.1 Derivatives and Gradients

Every time a machine learning model learns from data, it relies on derivatives to guide its adjustments. The derivative tells the model which direction to move and by how much. Without derivatives, we would have no systematic way to improve our predictions—training would be reduced to random guessing.

This section builds your understanding of derivatives from first principles and extends them to the multivariate case. By the end, you'll understand the mathematical machinery that powers gradient descent and backpropagation.

What is a Derivative?

At its core, a derivative measures how a function's output changes as its input changes. If you have a function $f(x)$ that maps inputs to outputs, the derivative $f'(x)$ tells you the instantaneous rate of change at any point.

Mathematically, the derivative is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This formula captures a simple intuition: take two nearby points, measure how much the output changed, divide by how much the input changed, and take the limit as those points get infinitely close.

Geometric interpretation: The derivative at a point equals the slope of the tangent line to the function at that point. A positive derivative means the function is increasing; a negative derivative means it's decreasing; a zero derivative indicates a flat spot (potentially a minimum, maximum, or inflection point).

Why does this matter for ML? Consider a loss function $L(w)$ that measures how wrong our model's predictions are, where $w$ represents a model weight. The derivative $\frac{dL}{dw}$ tells us:

  • If positive: Increasing $w$ increases the loss, so we should decrease $w$
  • If negative: Increasing $w$ decreases the loss, so we should increase $w$
  • Magnitude: How sensitive the loss is to changes in $w$

This is the fundamental insight behind gradient descent: move the parameters in the direction that reduces the loss.

Common Derivatives in Machine Learning

Several derivatives appear repeatedly in ML. Memorizing these will help you understand neural network computations:

| Function | Derivative | Where It Appears |
|----------|------------|------------------|
| $x^n$ | $nx^{n-1}$ | Polynomial features, weight regularization |
| $e^x$ | $e^x$ | Softmax, exponential learning rate schedules |
| $\ln(x)$ | $\frac{1}{x}$ | Log-likelihood, cross-entropy loss |
| $\frac{1}{1+e^{-x}}$ (sigmoid) | $\sigma(x)(1-\sigma(x))$ | Binary classification output |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | RNN activations |
| $\max(0, x)$ (ReLU) | $1$ if $x > 0$, else $0$ | Hidden layer activations |

Notice that sigmoid and tanh have derivatives expressed in terms of themselves—this makes computation efficient during backpropagation since we already computed the forward pass values.
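
As a quick check, a couple of these table entries can be verified numerically. The sketch below compares the analytical sigmoid and tanh derivatives against a finite-difference approximation at an arbitrary test point; the helper functions and the test point are illustrative choices, not library APIs.

PYTHON
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def central_diff(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7

# Sigmoid: the table says the derivative is sigma(x) * (1 - sigma(x))
print(f"sigmoid'({x}): analytical = {sigmoid(x) * (1 - sigmoid(x)):.8f}, "
      f"numerical = {central_diff(sigmoid, x):.8f}")

# Tanh: the table says the derivative is 1 - tanh^2(x)
print(f"tanh'({x}):    analytical = {1 - np.tanh(x)**2:.8f}, "
      f"numerical = {central_diff(np.tanh, x):.8f}")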

Computing Derivatives Numerically

While analytical derivatives are exact and efficient, numerical derivatives are invaluable for verification and debugging. The central difference formula provides a good approximation:

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

This is more accurate than the forward difference $\frac{f(x+h) - f(x)}{h}$ because errors cancel out symmetrically.

PYTHON
import numpy as np

def numerical_derivative(f, x, h=1e-5):
    """Compute derivative using central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Verify: derivative of x^2 should be 2x
f = lambda x: x**2
x = 3.0

numerical = numerical_derivative(f, x)
analytical = 2 * x  # We know d/dx[x^2] = 2x

print(f"At x = {x}:")
print(f"  Numerical derivative:  {numerical:.10f}")
print(f"  Analytical derivative: {analytical:.10f}")
print(f"  Absolute error: {abs(numerical - analytical):.2e}")

For smooth functions the error is typically around $10^{-10}$: accurate enough for gradient checking, though numerical differentiation is far too slow for training large models.

Partial Derivatives: Functions of Multiple Variables

Real ML models have thousands or millions of parameters. A loss function doesn't just depend on one weight—it depends on all of them simultaneously: $L(w_1, w_2, \ldots, w_n)$.

A partial derivative measures how the output changes when we vary just one input while holding all others fixed:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$

The symbol $\partial$ (rather than $d$) signals that other variables exist and are being held constant.

Example: Consider $f(x, y) = x^2 y + 3xy^2$

To find $\frac{\partial f}{\partial x}$, treat $y$ as a constant:

$$\frac{\partial f}{\partial x} = 2xy + 3y^2$$

To find $\frac{\partial f}{\partial y}$, treat $x$ as a constant:

$$\frac{\partial f}{\partial y} = x^2 + 6xy$$

Each partial derivative tells us the sensitivity of $f$ to one specific variable.

PYTHON
import numpy as np

def f(x, y):
    """Example function: f(x,y) = x²y + 3xy²"""
    return x**2 * y + 3 * x * y**2

def df_dx_analytical(x, y):
    """Analytical: ∂f/∂x = 2xy + 3y²"""
    return 2*x*y + 3*y**2

def df_dy_analytical(x, y):
    """Analytical: ∂f/∂y = x² + 6xy"""
    return x**2 + 6*x*y

def partial_derivative(f, x, y, var='x', h=1e-5):
    """Numerical partial derivative."""
    if var == 'x':
        return (f(x + h, y) - f(x - h, y)) / (2 * h)
    else:
        return (f(x, y + h) - f(x, y - h)) / (2 * h)

# Test at (x, y) = (2, 3)
x, y = 2.0, 3.0

print(f"f(x, y) = x²y + 3xy² at ({x}, {y})")
print(f"f = {f(x, y)}")
print(f"\n∂f/∂x: numerical = {partial_derivative(f, x, y, 'x'):.4f}, "
      f"analytical = {df_dx_analytical(x, y):.4f}")
print(f"∂f/∂y: numerical = {partial_derivative(f, x, y, 'y'):.4f}, "
      f"analytical = {df_dy_analytical(x, y):.4f}")

The Gradient: A Vector of Partial Derivatives

When we collect all partial derivatives into a single vector, we get the gradient:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient is one of the most important concepts in optimization. It has two crucial properties:

  1. Direction: The gradient points in the direction of steepest increase of $f$
  2. Magnitude: $\|\nabla f\|$ tells us how steep that increase is

Since we want to minimize the loss function, we move in the opposite direction of the gradient. This gives us the gradient descent update rule:

$$\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \eta \nabla L(\mathbf{w}_{\text{old}})$$

where $\eta$ is the learning rate—a small positive number that controls step size.

Intuition: Imagine standing on a hilly landscape in fog. You can't see the lowest point, but you can feel which direction slopes downward at your feet. The gradient tells you the steepest uphill direction; walking the opposite way takes you downhill. Repeat until you reach a valley.

PYTHON
import numpy as np

def compute_gradient(f, params, h=1e-5):
    """
    Compute gradient numerically for any scalar function of a vector.

    Args:
        f: Function that takes a numpy array and returns a scalar
        params: Current parameter vector
        h: Step size for finite differences

    Returns:
        Gradient vector (same shape as params)
    """
    gradient = np.zeros_like(params, dtype=float)

    for i in range(len(params)):
        params_plus = params.copy()
        params_minus = params.copy()
        params_plus[i] += h
        params_minus[i] -= h
        gradient[i] = (f(params_plus) - f(params_minus)) / (2 * h)

    return gradient

# Example: quadratic loss L(w) = w₁² + 2w₂²
# Gradient: ∇L = [2w₁, 4w₂]
def quadratic_loss(w):
    return w[0]**2 + 2*w[1]**2

w = np.array([3.0, 2.0])
grad = compute_gradient(quadratic_loss, w)

print(f"Loss function: L(w) = w₁² + 2w₂²")
print(f"At w = {w}:")
print(f"  Loss = {quadratic_loss(w)}")
print(f"  Gradient (numerical):  {grad}")
print(f"  Gradient (analytical): {np.array([2*w[0], 4*w[1]])}")

Gradient Descent in Action

Let's trace through a few steps of gradient descent to see how the gradient guides us toward the minimum:

PYTHON
import numpy as np

def gradient_descent_demo():
    """Demonstrate gradient descent on a simple quadratic."""

    # Loss: L(w) = (w₁ - 3)² + 2(w₂ - 1)²
    # Minimum at w* = [3, 1] with L(w*) = 0
    def loss(w):
        return (w[0] - 3)**2 + 2*(w[1] - 1)**2

    def gradient(w):
        return np.array([2*(w[0] - 3), 4*(w[1] - 1)])

    # Start far from optimum
    w = np.array([0.0, 0.0])
    learning_rate = 0.1

    print("Gradient Descent Progress")
    print("-" * 50)
    print(f"{'Step':<6} {'w₁':>8} {'w₂':>8} {'Loss':>12} {'|∇L|':>10}")
    print("-" * 50)

    for step in range(10):
        grad = gradient(w)
        grad_norm = np.linalg.norm(grad)
        print(f"{step:<6} {w[0]:>8.4f} {w[1]:>8.4f} {loss(w):>12.6f} {grad_norm:>10.4f}")

        # Update: move opposite to gradient
        w = w - learning_rate * grad

    print("-" * 50)
    print(f"Final w = [{w[0]:.4f}, {w[1]:.4f}], approaching optimum [3, 1]")

gradient_descent_demo()

Notice how the loss decreases and the gradient magnitude shrinks as we approach the minimum. This is typical behavior: far from the optimum, gradients are large and we make big steps; near the optimum, gradients become small and steps shrink, allowing precise convergence.

The Jacobian: Gradients for Vector-Valued Functions

When a function outputs a vector instead of a scalar (like a neural network layer), we need the Jacobian matrix instead of a gradient vector. If $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ maps $n$ inputs to $m$ outputs, the Jacobian is an $m \times n$ matrix:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Each row is the gradient of one output component. Entry $J_{ij}$ tells us how output $i$ changes when input $j$ changes.

In neural networks: A layer transforms inputs $\mathbf{x}$ into outputs $\mathbf{y}$. The Jacobian $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ describes how each output neuron responds to each input. For a linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$, the Jacobian is simply the weight matrix $\mathbf{W}$.

PYTHON
import numpy as np

def compute_jacobian(f, x, h=1e-5):
    """
    Compute Jacobian matrix numerically.

    Args:
        f: Function mapping R^n -> R^m
        x: Input vector of shape (n,)

    Returns:
        Jacobian matrix of shape (m, n)
    """
    f_x = f(x)
    m, n = len(f_x), len(x)
    jacobian = np.zeros((m, n))

    for j in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += h
        x_minus[j] -= h
        jacobian[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)

    return jacobian

# Example: linear layer f(x) = Wx
W = np.array([[1, 2],
              [3, 4],
              [5, 6]], dtype=float)

def linear_layer(x):
    return W @ x

x = np.array([1.0, 2.0])

print("Linear layer: f(x) = Wx")
print(f"W = \n{W}")
print(f"x = {x}")
print(f"\nJacobian (numerical):\n{compute_jacobian(linear_layer, x)}")
print(f"\nWeight matrix W:\n{W}")
print("\nFor linear layers, Jacobian = W (as expected)")

The Hessian: Second-Order Information

The Hessian matrix contains all second-order partial derivatives:

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

While the gradient tells us the slope at a point, the Hessian tells us the curvature—how the slope itself changes. This information is valuable for optimization:

  • Positive definite Hessian (all positive eigenvalues): Local minimum, bowl-shaped
  • Negative definite Hessian (all negative eigenvalues): Local maximum, hill-shaped
  • Indefinite Hessian (mixed eigenvalues): Saddle point

In deep learning, saddle points are far more common than local minima in high dimensions. The Hessian helps identify these, though computing it exactly is often prohibitively expensive.
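
Although exact Hessians are rarely computed for large models, a numerical approximation is easy to sketch for small examples. The code below, a minimal illustration using second-order central differences, estimates the Hessian of the saddle-shaped function $f(w) = w_1^2 - w_2^2$ at the origin and classifies the critical point from the eigenvalue signs.

PYTHON
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the Hessian of a scalar function via central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += h; x_pp[j] += h
            x_pm = x.copy(); x_pm[i] += h; x_pm[j] -= h
            x_mp = x.copy(); x_mp[i] -= h; x_mp[j] += h
            x_mm = x.copy(); x_mm[i] -= h; x_mm[j] -= h
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h**2)
    return H

# Saddle-shaped function: f(w) = w1^2 - w2^2, critical point at the origin
f = lambda w: w[0]**2 - w[1]**2

H = numerical_hessian(f, np.array([0.0, 0.0]))
eigenvalues = np.linalg.eigvalsh(H)

print(f"Hessian:\n{H}")
print(f"Eigenvalues: {eigenvalues}")
print("Mixed signs -> saddle point" if eigenvalues.min() < 0 < eigenvalues.max()
      else "Definite -> minimum or maximum")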

Practical Application: Linear Regression Gradients

Let's derive the gradient for mean squared error in linear regression—one of the most fundamental calculations in ML.

Given data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ and predictions $\hat{y}_i = \mathbf{w}^T\mathbf{x}_i$, the MSE loss is:

$$L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$$

To find the gradient, we differentiate with respect to $\mathbf{w}$:

$$\nabla_{\mathbf{w}} L = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i) \mathbf{x}_i = -\frac{2}{n} \mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$

This formula tells us exactly how to update weights to reduce prediction error.

PYTHON
import numpy as np

# Generate synthetic linear regression data
np.random.seed(42)
n_samples, n_features = 100, 3
X = np.random.randn(n_samples, n_features)
true_weights = np.array([2.0, -1.0, 0.5])
y = X @ true_weights + 0.1 * np.random.randn(n_samples)

def mse_loss(w, X, y):
    """Mean Squared Error loss."""
    predictions = X @ w
    return np.mean((y - predictions)**2)

def mse_gradient(w, X, y):
    """Analytical gradient of MSE."""
    n = len(y)
    predictions = X @ w
    return -2/n * X.T @ (y - predictions)

# Gradient descent to find optimal weights
w = np.zeros(n_features)  # Start at origin
learning_rate = 0.1

print("Learning linear regression weights via gradient descent")
print("-" * 55)

for epoch in range(5):
    loss = mse_loss(w, X, y)
    grad = mse_gradient(w, X, y)
    print(f"Epoch {epoch}: Loss = {loss:.6f}, |grad| = {np.linalg.norm(grad):.6f}")
    w = w - learning_rate * grad

print("-" * 55)
print(f"Learned weights: {w}")
print(f"True weights:    {true_weights}")

Key Takeaways

  • Derivatives measure instantaneous rates of change and tell us how to adjust parameters
  • Partial derivatives extend this to functions of many variables by varying one at a time
  • The gradient vector points toward steepest increase; we descend by moving opposite to it
  • Numerical differentiation via central differences is slower but essential for verification
  • The Jacobian generalizes gradients for vector-valued functions (like neural network layers)
  • The Hessian captures curvature information useful for advanced optimization

These concepts form the mathematical backbone of machine learning optimization. In the next section, we'll see how the chain rule allows us to efficiently compute gradients through compositions of functions—the key insight behind backpropagation.

2.2 The Chain Rule

The chain rule is arguably the most important calculus concept for deep learning. Without it, we couldn't train neural networks. This single rule enables backpropagation—the algorithm that revolutionized AI by making it possible to efficiently compute gradients through deep compositions of functions.

Why the Chain Rule Matters

Neural networks are compositions of functions. A simple two-layer network computes:

$$\hat{y} = f_2(f_1(\mathbf{x}))$$

where $f_1$ might be a linear transformation followed by ReLU, and $f_2$ might be another linear transformation followed by softmax. To train this network, we need to compute $\frac{\partial L}{\partial \mathbf{W}_1}$—how the loss changes with respect to the first layer's weights.

The challenge is that $\mathbf{W}_1$ doesn't directly affect the loss. It affects $f_1$'s output, which affects $f_2$'s output, which affects the loss. We need a way to propagate gradient information backward through this chain. That's exactly what the chain rule provides.

The Chain Rule for Single Variables

For composed functions $y = f(g(x))$, the chain rule states:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

where $u = g(x)$ is the intermediate variable.

Intuition: If $u$ changes by a small amount $\Delta u$ when $x$ changes by $\Delta x$, and $y$ changes by $\Delta y$ when $u$ changes by $\Delta u$, then the total change in $y$ per unit change in $x$ is the product of these rates.

Example: Let $y = (3x + 1)^2$

We can write this as $y = u^2$ where $u = 3x + 1$:

  • $\frac{dy}{du} = 2u$
  • $\frac{du}{dx} = 3$
  • $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 3 = 6(3x + 1)$

PYTHON
import numpy as np

def chain_rule_example():
    """Demonstrate the chain rule on a simple composition."""

    # y = (3x + 1)^2
    def y(x):
        return (3*x + 1)**2

    # Derivative via chain rule: dy/dx = 6(3x + 1)
    def dy_dx_analytical(x):
        return 6 * (3*x + 1)

    # Numerical derivative for verification
    def dy_dx_numerical(x, h=1e-5):
        return (y(x + h) - y(x - h)) / (2 * h)

    x = 2.0

    print("Chain Rule Example: y = (3x + 1)²")
    print(f"At x = {x}:")
    print(f"  y = {y(x)}")
    print(f"  dy/dx (chain rule): {dy_dx_analytical(x)}")
    print(f"  dy/dx (numerical):  {dy_dx_numerical(x):.6f}")

chain_rule_example()

Extending to Longer Chains

The chain rule naturally extends to longer compositions. For $y = f(g(h(x)))$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx}$$

where $v = h(x)$, $u = g(v)$, and $y = f(u)$.

This is crucial for deep networks. A 100-layer network is a composition of 100 functions, and we can still compute gradients by applying the chain rule 100 times.

Key insight: We multiply the local derivatives along the chain. Each layer only needs to know its own derivative—we don't need to understand the entire network to compute local gradients.

PYTHON
import numpy as np

def deep_chain_example():
    """Chain rule through multiple compositions."""

    # y = exp(sin(x^2))
    # v = x^2, u = sin(v), y = exp(u)
    def forward(x):
        v = x**2
        u = np.sin(v)
        y = np.exp(u)
        return v, u, y

    def chain_derivative(x):
        v, u, y = forward(x)

        # Local derivatives
        dy_du = np.exp(u)       # d/du[exp(u)] = exp(u)
        du_dv = np.cos(v)       # d/dv[sin(v)] = cos(v)
        dv_dx = 2 * x           # d/dx[x^2] = 2x

        # Chain rule: multiply along the path
        dy_dx = dy_du * du_dv * dv_dx
        return dy_dx

    def numerical_derivative(x, h=1e-5):
        _, _, y_plus = forward(x + h)
        _, _, y_minus = forward(x - h)
        return (y_plus - y_minus) / (2 * h)

    x = 1.0
    print("Deep Chain: y = exp(sin(x²))")
    print(f"At x = {x}:")
    print(f"  Chain rule: dy/dx = {chain_derivative(x):.6f}")
    print(f"  Numerical:  dy/dx = {numerical_derivative(x):.6f}")

deep_chain_example()

The Multivariate Chain Rule

In machine learning, functions typically have many inputs. The multivariate chain rule handles this:

If $y = f(u_1, u_2, \ldots, u_m)$ and each $u_i = g_i(x_1, x_2, \ldots, x_n)$, then:

$$\frac{\partial y}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial y}{\partial u_i} \cdot \frac{\partial u_i}{\partial x_j}$$

Interpretation: To find how $y$ changes with $x_j$, we sum up contributions from all paths through which $x_j$ influences $y$. Each path contributes the product of derivatives along that path.

This summation is essential when a variable affects the output through multiple intermediate nodes, which happens constantly in neural networks (think of how a single input pixel affects multiple neurons in the next layer).
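
To make the sum-over-paths formula concrete, here is a small symbolic check with SymPy. The function is a hypothetical example chosen for illustration, with intermediates $u_1 = x_1 + x_2$ and $u_2 = x_1 x_2$: summing the contribution of each path reproduces direct differentiation of the composed expression.

PYTHON
import sympy as sp

x1, x2, u1, u2 = sp.symbols('x1 x2 u1 u2')

# Intermediates and output: y depends on x1 through both u1 and u2
g1 = x1 + x2          # u1 = g1(x1, x2)
g2 = x1 * x2          # u2 = g2(x1, x2)
y = u1 * u2           # y written in terms of the intermediates

# Multivariate chain rule: dy/dx1 = dy/du1 * du1/dx1 + dy/du2 * du2/dx1
chain = sp.diff(y, u1) * sp.diff(g1, x1) + sp.diff(y, u2) * sp.diff(g2, x1)
chain = chain.subs({u1: g1, u2: g2})

# Direct differentiation of the fully composed expression
direct = sp.diff(y.subs({u1: g1, u2: g2}), x1)

print(sp.simplify(chain - direct))   # 0: the path contributions add up correctly
print(sp.expand(direct))             # 2*x1*x2 + x2**2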

Computational Graphs

Understanding the chain rule becomes easier when we visualize computations as directed graphs:

  • Nodes represent variables (inputs, intermediates, outputs)
  • Edges represent operations

For $y = (a + b) \cdot (b + c)$:

    a ──┐
        ├──► u = a+b ──┐
    b ──┤              ├──► y = u·v
        ├──► v = b+c ──┘
    c ──┘

Notice that $b$ appears in both intermediate computations. The multivariate chain rule handles this:

$$\frac{\partial y}{\partial b} = \frac{\partial y}{\partial u}\frac{\partial u}{\partial b} + \frac{\partial y}{\partial v}\frac{\partial v}{\partial b}$$

PYTHON
import numpy as np

def computational_graph_example():
    """Gradient computation through a computational graph."""

    # y = (a + b) * (b + c)
    # Expanded: y = ab + b^2 + ac + bc
    # Analytical gradients: dy/da = b + c, dy/db = a + 2b + c, dy/dc = a + b

    def forward(a, b, c):
        u = a + b      # Intermediate 1
        v = b + c      # Intermediate 2
        y = u * v      # Output
        return u, v, y

    def backward(a, b, c):
        u, v, y = forward(a, b, c)

        # Derivatives of final output
        dy_du = v       # d/du[u*v] = v
        dy_dv = u       # d/dv[u*v] = u

        # Derivatives of intermediates
        du_da = 1       # d/da[a+b] = 1
        du_db = 1       # d/db[a+b] = 1
        dv_db = 1       # d/db[b+c] = 1
        dv_dc = 1       # d/dc[b+c] = 1

        # Apply chain rule (sum over all paths)
        dy_da = dy_du * du_da                     # Only one path: y <- u <- a
        dy_db = dy_du * du_db + dy_dv * dv_db    # Two paths: y <- u <- b, y <- v <- b
        dy_dc = dy_dv * dv_dc                     # Only one path: y <- v <- c

        return dy_da, dy_db, dy_dc

    a, b, c = 2.0, 3.0, 4.0
    u, v, y = forward(a, b, c)
    da, db, dc = backward(a, b, c)

    print("Computational Graph: y = (a + b) * (b + c)")
    print(f"Inputs: a={a}, b={b}, c={c}")
    print(f"Intermediates: u=a+b={u}, v=b+c={v}")
    print(f"Output: y={y}")
    print(f"\nGradients via chain rule:")
    print(f"  dy/da = {da} (via path y ← u ← a)")
    print(f"  dy/db = {db} (via paths y ← u ← b and y ← v ← b)")
    print(f"  dy/dc = {dc} (via path y ← v ← c)")

    # Verify with numerical gradients
    h = 1e-5
    da_num = (forward(a+h, b, c)[2] - forward(a-h, b, c)[2]) / (2*h)
    db_num = (forward(a, b+h, c)[2] - forward(a, b-h, c)[2]) / (2*h)
    dc_num = (forward(a, b, c+h)[2] - forward(a, b, c-h)[2]) / (2*h)

    print(f"\nNumerical verification:")
    print(f"  dy/da ≈ {da_num:.4f}")
    print(f"  dy/db ≈ {db_num:.4f}")
    print(f"  dy/dc ≈ {dc_num:.4f}")

computational_graph_example()

Forward vs Backward Mode

There are two ways to apply the chain rule through a computational graph:

Forward mode (forward accumulation):

  • Start from inputs, propagate derivatives forward
  • Compute $\frac{\partial \mathbf{y}}{\partial x_i}$ for one input $x_i$ and all outputs
  • Efficient when there are few inputs and many outputs

Backward mode (reverse accumulation):

  • Start from outputs, propagate derivatives backward
  • Compute $\frac{\partial y_j}{\partial \mathbf{x}}$ for one output $y_j$ and all inputs
  • Efficient when there are many inputs and few outputs

Neural networks have millions of parameters (inputs to the loss function) but only one output (the scalar loss). Backward mode is dramatically more efficient—this is why we use backpropagation, not forward propagation of gradients.
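
Forward mode rarely appears in deep learning code, but a minimal sketch clarifies what "propagating derivatives forward" means. The toy Dual class below is an illustration, not a library API: it carries a value together with its derivative through each operation, so $f'(x)$ comes out of a single forward pass.

PYTHON
import numpy as np

class Dual:
    """Toy dual number: carries a value and the derivative of that value w.r.t. x."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule propagates derivatives forward
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def sin(d):
    """Elementary sin, applying its local derivative to the carried derivative."""
    return Dual(np.sin(d.value), np.cos(d.value) * d.deriv)

# Differentiate f(x) = x*sin(x) + 3x at x = 2 by seeding dx/dx = 1
x = Dual(2.0, 1.0)
y = x * sin(x) + 3 * x

print(f"f(2)  = {y.value:.6f}")
print(f"f'(2) = {y.deriv:.6f}")
print(f"Analytical f'(2) = sin(2) + 2*cos(2) + 3 = {np.sin(2.0) + 2.0*np.cos(2.0) + 3:.6f}")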

Backpropagation: Chain Rule in Action

Backpropagation is simply the chain rule applied in backward mode through a neural network. Let's trace through a simple network:

PYTHON
import numpy as np

def backprop_demo():
    """
    Complete backpropagation through a tiny network.

    Network: input x -> linear (w1, b1) -> ReLU -> linear (w2, b2) -> MSE loss
    """

    # Network parameters (tiny for demonstration)
    w1, b1 = 0.5, 0.1    # First layer
    w2, b2 = -0.3, 0.2   # Second layer

    # Data
    x = 2.0              # Input
    y_true = 1.0         # Target

    # ==================== FORWARD PASS ====================
    # Compute outputs and save intermediates for backward pass

    # Layer 1: linear
    z1 = w1 * x + b1
    # Layer 1: ReLU activation
    a1 = max(0, z1)
    # Layer 2: linear (output)
    z2 = w2 * a1 + b2
    y_pred = z2
    # Loss: MSE
    loss = (y_pred - y_true)**2

    print("=" * 60)
    print("FORWARD PASS")
    print("=" * 60)
    print(f"Input: x = {x}")
    print(f"Layer 1: z1 = w1*x + b1 = {w1}*{x} + {b1} = {z1}")
    print(f"ReLU:    a1 = max(0, z1) = {a1}")
    print(f"Layer 2: z2 = w2*a1 + b2 = {w2}*{a1} + {b2} = {z2}")
    print(f"Output:  y_pred = {y_pred}")
    print(f"Loss:    L = (y_pred - y_true)² = ({y_pred} - {y_true})² = {loss}")

    # ==================== BACKWARD PASS ====================
    # Apply chain rule in reverse order

    print("\n" + "=" * 60)
    print("BACKWARD PASS (Chain Rule)")
    print("=" * 60)

    # Start from loss
    # dL/d(y_pred) = 2(y_pred - y_true)
    dL_dy = 2 * (y_pred - y_true)
    print(f"\n∂L/∂y_pred = 2(y_pred - y_true) = {dL_dy}")

    # Layer 2: y_pred = z2 = w2*a1 + b2
    # dL/dz2 = dL/dy_pred * dy_pred/dz2 = dL/dy * 1
    dL_dz2 = dL_dy * 1
    # dL/dw2 = dL/dz2 * dz2/dw2 = dL/dz2 * a1
    dL_dw2 = dL_dz2 * a1
    # dL/db2 = dL/dz2 * dz2/db2 = dL/dz2 * 1
    dL_db2 = dL_dz2 * 1
    # dL/da1 = dL/dz2 * dz2/da1 = dL/dz2 * w2
    dL_da1 = dL_dz2 * w2

    print(f"\nLayer 2 gradients (via chain rule):")
    print(f"  ∂L/∂w2 = ∂L/∂z2 · ∂z2/∂w2 = {dL_dz2} · {a1} = {dL_dw2}")
    print(f"  ∂L/∂b2 = ∂L/∂z2 · ∂z2/∂b2 = {dL_dz2} · 1 = {dL_db2}")
    print(f"  ∂L/∂a1 = ∂L/∂z2 · ∂z2/∂a1 = {dL_dz2} · {w2} = {dL_da1}")

    # ReLU: a1 = max(0, z1)
    # da1/dz1 = 1 if z1 > 0 else 0
    relu_grad = 1.0 if z1 > 0 else 0.0
    dL_dz1 = dL_da1 * relu_grad

    print(f"\nReLU gradient:")
    print(f"  ∂a1/∂z1 = {relu_grad} (since z1 = {z1} {'>' if z1 > 0 else '≤'} 0)")
    print(f"  ∂L/∂z1 = ∂L/∂a1 · ∂a1/∂z1 = {dL_da1} · {relu_grad} = {dL_dz1}")

    # Layer 1: z1 = w1*x + b1
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1 * 1

    print(f"\nLayer 1 gradients:")
    print(f"  ∂L/∂w1 = ∂L/∂z1 · ∂z1/∂w1 = {dL_dz1} · {x} = {dL_dw1}")
    print(f"  ∂L/∂b1 = ∂L/∂z1 · ∂z1/∂b1 = {dL_dz1} · 1 = {dL_db1}")

    print("\n" + "=" * 60)
    print("GRADIENT SUMMARY")
    print("=" * 60)
    print(f"∂L/∂w1 = {dL_dw1:.4f}")
    print(f"∂L/∂b1 = {dL_db1:.4f}")
    print(f"∂L/∂w2 = {dL_dw2:.4f}")
    print(f"∂L/∂b2 = {dL_db2:.4f}")

backprop_demo()

The Vanishing and Exploding Gradient Problems

The chain rule reveals a potential issue: gradients are products of many terms. In a deep network:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot \ldots \cdot \frac{\partial a_2}{\partial W_1}$$

If each factor is slightly less than 1 (say, 0.9), the product shrinks exponentially:

  • 10 layers: $0.9^{10} \approx 0.35$
  • 50 layers: $0.9^{50} \approx 0.005$
  • 100 layers: $0.9^{100} \approx 0.00003$

This is the vanishing gradient problem—gradients become so small that early layers barely learn.

Conversely, if factors are slightly greater than 1, gradients explode. The exploding gradient problem causes unstable training with NaN losses.

Solutions developed over the years:

  • ReLU activation: Derivative is exactly 1 for positive inputs (no shrinkage)
  • Skip connections (ResNets): Provide gradient highways that bypass layers
  • Batch normalization: Keeps activations in a reasonable range
  • Careful initialization: Start with weights that preserve gradient magnitude
  • Gradient clipping: Cap gradient magnitude to prevent explosions

PYTHON
import numpy as np

def gradient_flow_demo():
    """Demonstrate vanishing gradients through a deep network."""

    def sigmoid(x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def sigmoid_derivative(x):
        s = sigmoid(x)
        return s * (1 - s)  # Maximum value is 0.25 at x=0

    def relu_derivative(x):
        return (x > 0).astype(float)  # Either 0 or 1

    # Simulate gradient flow through n layers
    def gradient_magnitude(n_layers, activation='sigmoid'):
        gradient = 1.0  # Start with gradient of 1

        for i in range(n_layers):
            if activation == 'sigmoid':
                # Sigmoid derivative max is 0.25
                local_grad = 0.25
            else:
                # ReLU derivative is 1 (assuming positive path)
                local_grad = 1.0

            gradient *= local_grad

        return gradient

    print("Gradient Magnitude vs Network Depth")
    print("-" * 50)
    print(f"{'Layers':<10} {'Sigmoid':<20} {'ReLU':<20}")
    print("-" * 50)

    for n in [1, 5, 10, 20, 50, 100]:
        sig_grad = gradient_magnitude(n, 'sigmoid')
        relu_grad = gradient_magnitude(n, 'relu')
        print(f"{n:<10} {sig_grad:<20.2e} {relu_grad:<20.2e}")

    print("-" * 50)
    print("\nSigmoid gradients vanish exponentially!")
    print("ReLU maintains gradient flow (ideal case).")

gradient_flow_demo()

Vector and Matrix Forms of Chain Rule

In practice, we work with vectors and matrices, not scalars. The chain rule extends naturally:

For $\mathbf{y} = f(\mathbf{u})$ and $\mathbf{u} = g(\mathbf{x})$:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$$

where these are Jacobian matrices, and the $\cdot$ is matrix multiplication.

For a scalar loss $L$ with respect to a weight matrix $\mathbf{W}$:

$$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{W}}$$

This matrix calculus is what makes efficient batch processing possible in deep learning frameworks.
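
As a sanity check on the matrix form, the sketch below composes a linear map with an elementwise tanh, multiplies the two Jacobians ($\text{diag}(1 - \tanh^2(\mathbf{u}))$ for the activation and $\mathbf{W}$ for the linear map), and compares the product against a numerical Jacobian. The weight matrix and input are arbitrary illustrative values.

PYTHON
import numpy as np

# Composition: u = g(x) = Wx (linear), y = f(u) = tanh(u) (elementwise)
W = np.array([[1.0, 2.0],
              [-0.5, 0.3],
              [0.7, -1.2]])

def composed(x):
    return np.tanh(W @ x)

def numerical_jacobian(f, x, h=1e-5):
    """Central-difference Jacobian, as in Section 2.1."""
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += h
        xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

x = np.array([0.4, -0.8])
u = W @ x

# Matrix chain rule: dy/dx = (dy/du)(du/dx) = diag(1 - tanh²(u)) @ W
J_chain = np.diag(1 - np.tanh(u)**2) @ W
J_numerical = numerical_jacobian(composed, x)

print("Max difference between chain-rule and numerical Jacobian:",
      np.max(np.abs(J_chain - J_numerical)))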

Key Takeaways

  • The chain rule allows computing derivatives of composed functions by multiplying local derivatives
  • Computational graphs visualize how variables depend on each other
  • Backpropagation is the chain rule applied in reverse mode—efficient for many-input, one-output scenarios
  • When a variable affects output through multiple paths, we sum the contributions from each path
  • Vanishing/exploding gradients occur because gradients are products of many factors
  • Modern architectures (ReLU, skip connections, normalization) are designed to maintain healthy gradient flow

The chain rule transforms the seemingly impossible task of computing gradients through millions of parameters into a systematic, efficient algorithm. Every time you call loss.backward() in PyTorch, you're witnessing the chain rule at work.

2.3 Optimization Fundamentals

Training a machine learning model means finding parameters that minimize a loss function. This is an optimization problem, and understanding optimization fundamentals helps you diagnose training issues, choose appropriate hyperparameters, and understand why certain architectures work better than others.

The Optimization Landscape

When we train a model, we're searching for the lowest point in a high-dimensional landscape defined by the loss function. For a model with $n$ parameters, the loss function $L(\theta)$ maps points in $n$-dimensional space to scalar loss values.

Visualizing low dimensions: In 2D, the loss landscape is a surface over the parameter plane—think of a terrain with hills, valleys, and ridges. The goal is to find the lowest valley. In 3D and beyond, direct visualization fails, but the intuition carries over.

The landscape's shape profoundly affects optimization:

  • Smooth valleys are easy to descend into
  • Narrow ravines cause oscillation
  • Flat plateaus slow convergence
  • Many local minima can trap the optimizer

Modern neural networks create astronomically high-dimensional landscapes (millions of dimensions), yet gradient descent often finds good solutions. Understanding why requires examining the geometry of these landscapes.

Critical Points: Where Gradients Vanish

A critical point is any location where the gradient equals zero: $\nabla L(\theta) = 0$. These points come in three flavors:

Local minimum: The loss is lower than all nearby points. The Hessian (matrix of second derivatives) has all positive eigenvalues.

Local maximum: The loss is higher than all nearby points. The Hessian has all negative eigenvalues.

Saddle point: The loss decreases in some directions but increases in others. The Hessian has both positive and negative eigenvalues.

A remarkable finding in deep learning is that saddle points vastly outnumber local minima in high dimensions. If you have $n$ parameters, a random critical point is exponentially more likely to be a saddle point than a local minimum. This is actually good news—saddle points can be escaped, while true local minima trap the optimizer.

PYTHON
import numpy as np

def critical_points_demo():
    """Demonstrate different types of critical points."""

    # Create a function with a saddle point at origin
    def saddle(x, y):
        return x**2 - y**2  # Saddle at (0,0)

    def bowl(x, y):
        return x**2 + y**2  # Minimum at (0,0)

    def monkey_saddle(x, y):
        return x**3 - 3*x*y**2  # More complex saddle

    # Analyze critical points via Hessian
    def analyze_critical_point(name, hessian_eigenvalues):
        print(f"\n{name}:")
        print(f"  Hessian eigenvalues: {hessian_eigenvalues}")
        if all(e > 0 for e in hessian_eigenvalues):
            print(f"  Classification: Local MINIMUM (all positive)")
        elif all(e < 0 for e in hessian_eigenvalues):
            print(f"  Classification: Local MAXIMUM (all negative)")
        else:
            print(f"  Classification: SADDLE POINT (mixed signs)")

    print("Critical Point Analysis at Origin")
    print("=" * 50)

    # Bowl: f = x² + y², Hessian = [[2, 0], [0, 2]]
    analyze_critical_point("Bowl (x² + y²)", [2, 2])

    # Saddle: f = x² - y², Hessian = [[2, 0], [0, -2]]
    analyze_critical_point("Saddle (x² - y²)", [2, -2])

    # Hill: f = -x² - y², Hessian = [[-2, 0], [0, -2]]
    analyze_critical_point("Hill (-x² - y²)", [-2, -2])

critical_points_demo()

Convexity: The Ideal Case

A function is convex if any line segment between two points on the function lies above (or on) the function itself. Mathematically, for all $\theta_1, \theta_2$ and $t \in [0, 1]$:

$$f(t\theta_1 + (1-t)\theta_2) \leq tf(\theta_1) + (1-t)f(\theta_2)$$

Why convexity matters: A convex function has exactly one minimum (the global minimum), and any local minimum is also the global minimum. This means gradient descent is guaranteed to find the optimal solution.

Examples of convex losses:

  • Mean squared error (linear regression)
  • Logistic loss (logistic regression)
  • Hinge loss (SVM)

The catch: Deep neural network loss functions are not convex. They have complex landscapes with many critical points. Yet somehow, gradient descent still works remarkably well—an observation that continues to drive research in optimization theory.

PYTHON
import numpy as np

def convexity_demo():
    """Demonstrate convex vs non-convex functions."""

    def convex_quadratic(x):
        """Convex: f(x) = x² + 1"""
        return x**2 + 1

    def non_convex(x):
        """Non-convex: f(x) = x⁴ - 2x² + 1 (has two minima)"""
        return x**4 - 2*x**2 + 1

    def check_convexity(f, x1, x2, num_points=10):
        """Check if convexity inequality holds between two points."""
        violations = 0
        for t in np.linspace(0, 1, num_points):
            midpoint = t * x1 + (1-t) * x2
            f_midpoint = f(midpoint)
            linear_interp = t * f(x1) + (1-t) * f(x2)

            if f_midpoint > linear_interp + 1e-10:  # Small tolerance
                violations += 1

        return violations == 0

    print("Convexity Check")
    print("=" * 50)

    # Test between x=-2 and x=2
    x1, x2 = -2.0, 2.0

    convex_result = check_convexity(convex_quadratic, x1, x2)
    print(f"f(x) = x² + 1: {'Convex' if convex_result else 'Not convex'}")

    nonconvex_result = check_convexity(non_convex, x1, x2)
    print(f"f(x) = x⁴ - 2x² + 1: {'Convex' if nonconvex_result else 'Not convex'}")

    print("\nImplication for optimization:")
    print("  Convex: Any local minimum is global minimum")
    print("  Non-convex: Multiple local minima possible")

convexity_demo()

The Learning Rate: Step Size Matters

The learning rate $\eta$ controls how far we move in each gradient descent step:

$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla L(\theta_{\text{old}})$$

Choosing the learning rate is one of the most important—and difficult—decisions in training:

Too large: The optimizer overshoots minima, oscillates wildly, or diverges entirely. The loss may increase or become NaN.

Too small: Convergence is painfully slow. Training might get stuck in flat regions or shallow local minima.

Just right: The optimizer makes steady progress, decreasing loss consistently without oscillation.

The "right" learning rate isn't a single value—it depends on the model architecture, batch size, current parameter values, and even the phase of training (early vs. late). This is why learning rate schedules and adaptive methods are so important.

PYTHON
import numpy as np

def learning_rate_demo():
    """Demonstrate effect of learning rate on optimization."""

    def loss(x):
        """Simple quadratic: L(x) = x²"""
        return x**2

    def gradient(x):
        return 2 * x

    def train(lr, x_init=5.0, steps=20):
        """Run gradient descent and return trajectory."""
        x = x_init
        trajectory = [x]

        for _ in range(steps):
            x = x - lr * gradient(x)
            trajectory.append(x)

            # Check for divergence
            if abs(x) > 1000:
                break

        return trajectory

    learning_rates = [0.01, 0.1, 0.5, 0.9, 1.1]

    print("Effect of Learning Rate on Gradient Descent")
    print("Loss function: L(x) = x², starting at x = 5.0")
    print("=" * 60)

    for lr in learning_rates:
        traj = train(lr)
        final_x = traj[-1]
        final_loss = loss(final_x)

        if abs(final_x) > 100:
            status = "DIVERGED"
        elif abs(final_x) < 0.01:
            status = f"Converged in {len([x for x in traj if abs(x) > 0.01])} steps"
        else:
            status = f"Slow (x = {final_x:.4f})"

        print(f"  LR = {lr}: {status}")

    print("\nKey insight: LR must be < 1/L where L is the Lipschitz")
    print("constant of the gradient. For f(x) = x², L = 2, so LR < 1.")

learning_rate_demo()

Condition Number and Convergence Speed

The condition number of the Hessian matrix measures how "stretched" the loss landscape is:

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues.

Low condition number ($\kappa \approx 1$): The landscape is roughly spherical. Gradient descent converges quickly because all directions have similar curvature.

High condition number ($\kappa \gg 1$): The landscape is elongated like a narrow valley. Gradient descent zigzags—the gradient points toward the nearest wall rather than toward the minimum.

This explains why feature scaling matters: unscaled features with different magnitudes create ill-conditioned problems, slowing convergence dramatically.

PYTHON
import numpy as np

def condition_number_demo():
    """Demonstrate how condition number affects optimization."""

    def loss_well_conditioned(w):
        """κ ≈ 1: L = w₁² + w₂²"""
        return w[0]**2 + w[1]**2

    def loss_ill_conditioned(w):
        """κ = 100: L = w₁² + 100w₂²"""
        return w[0]**2 + 100*w[1]**2

    def gradient_descent(loss_fn, grad_fn, w_init, lr, steps):
        w = w_init.copy()
        trajectory = [w.copy()]

        for _ in range(steps):
            w = w - lr * grad_fn(w)
            trajectory.append(w.copy())

        return np.array(trajectory)

    # Gradients
    grad_well = lambda w: np.array([2*w[0], 2*w[1]])
    grad_ill = lambda w: np.array([2*w[0], 200*w[1]])

    w_init = np.array([10.0, 10.0])
    steps = 50

    # For well-conditioned, we can use larger LR
    traj_well = gradient_descent(loss_well_conditioned, grad_well, w_init, lr=0.4, steps=steps)

    # For ill-conditioned, must use smaller LR (dictated by largest eigenvalue)
    traj_ill = gradient_descent(loss_ill_conditioned, grad_ill, w_init, lr=0.004, steps=steps)

    print("Condition Number Effect on Convergence")
    print("=" * 60)
    print(f"\nWell-conditioned (κ = 1):")
    print(f"  Hessian eigenvalues: [2, 2]")
    print(f"  Using LR = 0.4")
    print(f"  Final loss: {loss_well_conditioned(traj_well[-1]):.6f}")

    print(f"\nIll-conditioned (κ = 100):")
    print(f"  Hessian eigenvalues: [2, 200]")
    print(f"  Using LR = 0.004 (limited by largest eigenvalue)")
    print(f"  Final loss: {loss_ill_conditioned(traj_ill[-1]):.6f}")

    print(f"\nNote: The ill-conditioned problem converges much slower")
    print(f"because the learning rate must be small enough for the")
    print(f"steep direction, making progress slow in the flat direction.")

condition_number_demo()

Local vs Global Minima

In non-convex optimization, we distinguish between:

Global minimum: The absolute lowest point of the entire loss function.

Local minimum: A point lower than all nearby points, but possibly not the absolute lowest.

Classical optimization theory worried greatly about local minima trapping optimizers. However, modern deep learning research has revealed surprising findings:

  1. Most local minima are nearly as good as the global minimum in large networks—they achieve similar test performance.
  2. Saddle points, not local minima, are the main obstacle. Gradient descent slows dramatically near saddle points where the gradient is small.
  3. Overparameterization helps: Networks with more parameters than necessary tend to have loss landscapes where almost all local minima are global minima.

This partly explains why larger models often train more easily—their optimization landscapes are more benign.

Momentum: Escaping Saddle Points and Ravines

Pure gradient descent has two problems:

  1. It gets stuck at saddle points where gradients vanish
  2. It oscillates in narrow ravines

Momentum addresses both by accumulating a velocity that smooths out the gradient:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta \mathbf{v}_t$$

The velocity accumulates gradients from previous steps (controlled by $\beta$, typically 0.9). This has three effects:

  1. Accelerates through consistent gradients: If gradients point the same direction repeatedly, velocity builds up, speeding convergence.
  2. Dampens oscillations: In ravines, the oscillating component averages out while the consistent component (toward the minimum) accumulates.
  3. Escapes saddle points: Even when the current gradient is near zero, accumulated momentum can carry the optimizer past saddle points.

PYTHON
import numpy as np

def momentum_demo():
    """Compare gradient descent with and without momentum."""

    # Ill-conditioned quadratic (ravine shape)
    def loss(w):
        return 0.5 * w[0]**2 + 10 * w[1]**2

    def gradient(w):
        return np.array([w[0], 20*w[1]])

    def gd_vanilla(w_init, lr, steps):
        w = w_init.copy()
        losses = [loss(w)]

        for _ in range(steps):
            w = w - lr * gradient(w)
            losses.append(loss(w))

        return losses

    def gd_momentum(w_init, lr, beta, steps):
        w = w_init.copy()
        v = np.zeros_like(w)
        losses = [loss(w)]

        for _ in range(steps):
            v = beta * v + gradient(w)
            w = w - lr * v
            losses.append(loss(w))

        return losses

    w_init = np.array([10.0, 1.0])
    steps = 100

    losses_vanilla = gd_vanilla(w_init, lr=0.04, steps=steps)
    losses_momentum = gd_momentum(w_init, lr=0.04, beta=0.9, steps=steps)

    print("Momentum vs Vanilla Gradient Descent")
    print("Loss function: L = 0.5w₁² + 10w₂² (ravine-shaped)")
    print("=" * 50)
    print(f"\n{'Step':<10} {'Vanilla':>15} {'Momentum':>15}")
    print("-" * 50)

    for step in [0, 10, 25, 50, 100]:
        print(f"{step:<10} {losses_vanilla[step]:>15.6f} {losses_momentum[step]:>15.6f}")

    print("\nMomentum converges faster by damping oscillations")
    print("and accelerating in consistent directions.")

momentum_demo()

Stochastic vs Batch Optimization

Batch gradient descent computes the gradient using all training examples:

$$\nabla L = \frac{1}{n}\sum_{i=1}^n \nabla L_i$$

Stochastic gradient descent (SGD) uses a single random example:

$$\nabla L \approx \nabla L_i \quad \text{(random } i \text{)}$$

Mini-batch gradient descent uses a small subset (typically 32-256 examples):

$$\nabla L \approx \frac{1}{|B|}\sum_{i \in B} \nabla L_i$$

The stochastic approach has several advantages:

  1. Computational efficiency: One gradient step costs $O(1)$ instead of $O(n)$, enabling more frequent updates.
  2. Noise as regularization: The noisy gradients act as implicit regularization, helping generalization.
  3. Escape local minima: Random fluctuations can bump the optimizer out of shallow local minima.
  4. Better hardware utilization: Mini-batches parallelize well on GPUs.

The noise level is controlled by batch size: larger batches give more accurate gradients but less regularization effect.
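
The sketch below compares full-batch, mini-batch, and single-example updates on a small synthetic linear-regression problem; the data and hyperparameters are assumed for illustration. With the same number of passes over the data, smaller batches perform many more parameter updates per epoch.

PYTHON
import numpy as np

# Synthetic linear-regression data (assumed setup for illustration)
np.random.seed(0)
X = np.random.randn(1000, 3)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(1000)

def mse_gradient(w, Xb, yb):
    """Gradient of MSE on a batch (Xb, yb)."""
    return -2 / len(yb) * Xb.T @ (yb - Xb @ w)

def train(batch_size, lr=0.05, epochs=5):
    """Gradient descent with the given batch size for a fixed number of epochs."""
    w = np.zeros(3)
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * mse_gradient(w, X[batch], y[batch])
    return w

for bs in [1000, 32, 1]:   # full batch, mini-batch, single-example SGD
    w = train(batch_size=bs)
    print(f"batch size {bs:4d}: final MSE = {np.mean((y - X @ w)**2):.5f}, "
          f"w = {np.round(w, 3)}")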

Key Takeaways

  • The loss landscape geometry determines optimization difficulty
  • Saddle points are more problematic than local minima in deep learning
  • Convex functions guarantee global convergence; neural networks are non-convex
  • Learning rate is critical: too high causes divergence, too low causes slow convergence
  • Condition number measures landscape "stretching"—ill-conditioned problems converge slowly
  • Momentum accelerates convergence and helps escape saddle points
  • Stochasticity provides computational efficiency and implicit regularization

Understanding these fundamentals helps you diagnose training problems (is the learning rate too high? is the problem ill-conditioned?) and appreciate why techniques like batch normalization, skip connections, and adaptive learning rates were developed.

2.4 Gradient Descent Variants

Vanilla gradient descent, while theoretically sound, struggles with the complex loss landscapes of deep learning. Over the years, researchers have developed increasingly sophisticated variants that adapt the learning rate, incorporate momentum, and handle ill-conditioned problems. This section surveys the major algorithms you'll encounter in practice.

Vanilla Gradient Descent

The basic update rule moves parameters opposite to the gradient:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

This simple rule works well for convex problems but has significant limitations:

  1. Single learning rate: All parameters share the same step size, regardless of their gradient magnitudes or curvature.
  2. No memory: Each step depends only on the current gradient—past information is discarded.
  3. Sensitive to scaling: Poorly scaled features create ill-conditioned landscapes that slow convergence.
  4. Stuck at saddle points: When gradients become small, progress halts even if we're not at a minimum.

These limitations motivated the development of more sophisticated optimizers.
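
For reference, and to match the optimizer classes used later in this section, here is the vanilla update written as a minimal class. This is a sketch mirroring the style of the implementations below, not a framework API.

PYTHON
import numpy as np

class VanillaGD:
    """Plain gradient descent: one global learning rate, no memory of past gradients."""

    def __init__(self, learning_rate=0.01):
        self.lr = learning_rate

    def step(self, params, gradients):
        return params - self.lr * gradients

# Demo on L(x) = x²
x = np.array([5.0])
optimizer = VanillaGD(learning_rate=0.1)

for step in range(5):
    grad = 2 * x                      # dL/dx = 2x
    x = optimizer.step(x, grad)
    print(f"Step {step+1}: x = {x[0]:.6f}")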

SGD with Momentum

Momentum adds a "velocity" term that accumulates past gradients:

$$v_t = \beta v_{t-1} + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta v_t$$

The hyperparameter $\beta$ (typically 0.9) controls how much history to retain. Momentum provides two key benefits:

Acceleration: When gradients consistently point in the same direction, velocity builds up, effectively increasing the learning rate in that direction.

Noise reduction: In noisy gradient directions (from mini-batch variance or ravine oscillations), the noise averages out while consistent signals accumulate.

Think of momentum like a ball rolling downhill—it picks up speed on consistent slopes and smooths over small bumps.

PYTHON
import numpy as np

class SGDMomentum:
    """SGD with momentum optimizer."""

    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.beta = momentum
        self.velocity = None

    def step(self, params, gradients):
        """Perform one optimization step."""
        if self.velocity is None:
            self.velocity = np.zeros_like(params)

        # Update velocity
        self.velocity = self.beta * self.velocity + gradients

        # Update parameters
        params = params - self.lr * self.velocity
        return params

# Demo
def momentum_example():
    # Simple quadratic loss: L = x²
    x = 5.0
    optimizer = SGDMomentum(learning_rate=0.1, momentum=0.9)

    print("SGD with Momentum")
    print("-" * 40)

    for step in range(10):
        gradient = 2 * x  # d/dx[x²] = 2x
        x = optimizer.step(np.array([x]), np.array([gradient]))[0]
        print(f"Step {step+1}: x = {x:.6f}, velocity = {optimizer.velocity[0]:.6f}")

momentum_example()

Nesterov Accelerated Gradient (NAG)

Nesterov momentum is a subtle but important improvement. Instead of computing the gradient at the current position, we compute it at the "lookahead" position where momentum would take us:

$$v_t = \beta v_{t-1} + \nabla L(\theta_t - \beta v_{t-1})$$
$$\theta_{t+1} = \theta_t - \eta v_t$$

Intuition: Standard momentum is like a ball rolling blindly, only adjusting after it's already moved. Nesterov momentum looks ahead to where the ball is going, then corrects. This "lookahead" provides better responsiveness near minima—the optimizer can slow down before overshooting.

In practice, Nesterov momentum often converges slightly faster and with less oscillation than standard momentum.
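
A minimal sketch of the lookahead update is shown below. It follows the formulas above, so step() takes a gradient function rather than a precomputed gradient, since the gradient must be evaluated at the lookahead point.

PYTHON
import numpy as np

class NesterovMomentum:
    """Nesterov momentum, following the update rule above."""

    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.beta = momentum
        self.velocity = None

    def step(self, params, grad_fn):
        """grad_fn maps a parameter vector to its gradient (needed for the lookahead)."""
        if self.velocity is None:
            self.velocity = np.zeros_like(params)

        lookahead = params - self.beta * self.velocity
        self.velocity = self.beta * self.velocity + grad_fn(lookahead)
        return params - self.lr * self.velocity

# Demo on L(x) = x², where the gradient is 2x
x = np.array([5.0])
optimizer = NesterovMomentum(learning_rate=0.1, momentum=0.9)

for step in range(10):
    x = optimizer.step(x, grad_fn=lambda p: 2 * p)
    print(f"Step {step+1}: x = {x[0]:.6f}")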

AdaGrad: Adaptive Learning Rates

AdaGrad adapts the learning rate for each parameter based on historical gradients:

$$G_t = G_{t-1} + g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

where $g_t = \nabla L(\theta_t)$, and $\epsilon \approx 10^{-8}$ prevents division by zero.

Key insight: Parameters with large accumulated gradients get smaller learning rates; parameters with small accumulated gradients get larger learning rates. This automatically handles sparse features—infrequent features get larger updates when they do appear.

The problem: The accumulated gradient $G_t$ only grows, never shrinks. Eventually, learning rates become so small that training effectively stops. This makes AdaGrad unsuitable for deep learning with many iterations.

PYTHON
import numpy as np

class AdaGrad:
    """AdaGrad optimizer."""

    def __init__(self, learning_rate=0.01, epsilon=1e-8):
        self.lr = learning_rate
        self.epsilon = epsilon
        self.G = None

    def step(self, params, gradients):
        if self.G is None:
            self.G = np.zeros_like(params)

        # Accumulate squared gradients
        self.G += gradients ** 2

        # Adaptive learning rate update
        params = params - self.lr * gradients / (np.sqrt(self.G) + self.epsilon)
        return params

def adagrad_example():
    # Parameters with different gradient scales
    params = np.array([10.0, 10.0])

    # Simulate gradients with different magnitudes
    def get_gradients(params):
        # First param has 10x larger gradient
        return np.array([10 * params[0], params[1]])

    optimizer = AdaGrad(learning_rate=1.0)

    print("AdaGrad: Adapts LR per parameter")
    print("-" * 50)

    for step in range(10):
        grads = get_gradients(params)
        params = optimizer.step(params, grads)
        effective_lr = optimizer.lr / (np.sqrt(optimizer.G) + optimizer.epsilon)
        print(f"Step {step+1}: params = [{params[0]:.3f}, {params[1]:.3f}], "
              f"effective LR = [{effective_lr[0]:.4f}, {effective_lr[1]:.4f}]")

adagrad_example()

RMSprop: Fixing AdaGrad's Decay

RMSprop (Root Mean Square Propagation) addresses AdaGrad's aggressive learning rate decay by using an exponentially decaying average instead of a sum:

$$G_t = \gamma G_{t-1} + (1 - \gamma) g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

The decay rate $\gamma$ (typically 0.9) controls how quickly old gradients are forgotten. This keeps learning rates from vanishing while still providing per-parameter adaptation.

Intuition: RMSprop maintains a moving average of squared gradients, giving recent gradients more weight. This allows the optimizer to adapt to the current landscape rather than being dominated by ancient history.

RMSprop was introduced by Geoffrey Hinton in a Coursera lecture (not even a paper!) and quickly became widely adopted for its simplicity and effectiveness.

PYTHON
import numpy as np

class RMSprop:
    """RMSprop optimizer."""

    def __init__(self, learning_rate=0.01, decay=0.9, epsilon=1e-8):
        self.lr = learning_rate
        self.gamma = decay
        self.epsilon = epsilon
        self.G = None

    def step(self, params, gradients):
        if self.G is None:
            self.G = np.zeros_like(params)

        # Exponentially decaying average of squared gradients
        self.G = self.gamma * self.G + (1 - self.gamma) * gradients ** 2

        # Update with adaptive learning rate
        params = params - self.lr * gradients / (np.sqrt(self.G) + self.epsilon)
        return params

def rmsprop_vs_adagrad():
    """Compare RMSprop and AdaGrad over many steps."""

    params_adagrad = np.array([10.0])
    params_rmsprop = np.array([10.0])

    adagrad = AdaGrad(learning_rate=1.0)
    rmsprop = RMSprop(learning_rate=0.1)

    print("Comparison: AdaGrad vs RMSprop (100 steps)")
    print("-" * 50)

    for step in range(100):
        grad = 2 * params_adagrad  # Gradient of x²
        params_adagrad = adagrad.step(params_adagrad, grad)

        grad = 2 * params_rmsprop
        params_rmsprop = rmsprop.step(params_rmsprop, grad)

        if step in [0, 9, 24, 49, 99]:
            print(f"Step {step+1:3d}: AdaGrad = {params_adagrad[0]:.6f}, "
                  f"RMSprop = {params_rmsprop[0]:.6f}")

    print("\nAdaGrad's learning rate decays too aggressively")
    print("RMSprop maintains reasonable learning rates")

rmsprop_vs_adagrad()

Adam: Combining Momentum and Adaptation

Adam (Adaptive Moment Estimation) combines the best of momentum and adaptive learning rates:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

These are exponentially decaying averages of the gradient (first moment, like momentum) and squared gradient (second moment, like RMSprop).

Bias correction accounts for initialization at zero:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The parameter update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.001$

Adam has become the most popular optimizer in deep learning because it:

  • Works well out-of-the-box with default parameters
  • Handles sparse gradients well
  • Adapts learning rates per-parameter
  • Incorporates momentum for acceleration

PYTHON
import numpy as np

class Adam:
    """Adam optimizer."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step

    def step(self, params, gradients):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)

        self.t += 1

        # Update biased first and second moment estimates
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients ** 2

        # Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Update parameters
        params = params - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params

def adam_example():
    """Demonstrate Adam on a simple problem."""

    # Loss: L = x² + 100y² (ill-conditioned)
    params = np.array([10.0, 10.0])

    def gradient(p):
        return np.array([2 * p[0], 200 * p[1]])

    optimizer = Adam(learning_rate=0.5)

    print("Adam Optimizer")
    print("Loss: L = x² + 100y² (ill-conditioned)")
    print("-" * 50)

    for step in range(20):
        grad = gradient(params)
        params = optimizer.step(params, grad)
        loss = params[0]**2 + 100 * params[1]**2

        if step % 4 == 0 or step == 19:
            print(f"Step {step+1:2d}: params = [{params[0]:7.4f}, {params[1]:7.4f}], loss = {loss:.4f}")

adam_example()

AdamW: Fixing Weight Decay

A subtle issue with Adam is that L2 regularization doesn't work as intended. The adaptive scaling interferes with the regularization gradient, causing it to be rescaled inappropriately.

AdamW fixes this by decoupling weight decay from the gradient:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\theta_t\right)$$

Rather than adding $\lambda\theta$ to the gradient before adaptive scaling, we add the decay term directly to the update. This seemingly small change significantly improves generalization, especially for training large language models.

Rule of thumb: Use AdamW instead of Adam when using weight decay regularization.
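
The sketch below is a rough illustration of the decoupled update, not PyTorch's implementation. It extends the Adam class defined above; the weight_decay parameter name and its default value are assumptions made for this example.

PYTHON
import numpy as np

class AdamW(Adam):
    """Adam with decoupled weight decay (illustrative sketch)."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, weight_decay=0.01):
        super().__init__(learning_rate, beta1, beta2, epsilon)
        self.weight_decay = weight_decay  # assumed hyperparameter name

    def step(self, params, gradients):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)

        self.t += 1

        # Moment estimates and bias correction, exactly as in Adam
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Decoupled weight decay: the decay term is applied directly to the
        # parameters, not folded into the gradient before adaptive scaling
        params = params - self.lr * (m_hat / (np.sqrt(v_hat) + self.epsilon)
                                     + self.weight_decay * params)
        return params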

Learning Rate Schedules

Even adaptive optimizers benefit from learning rate schedules—predefined rules for changing the learning rate during training:

Step decay: Reduce LR by a factor at fixed epochs

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

Exponential decay: Continuous exponential reduction

$$\eta_t = \eta_0 \cdot e^{-\lambda t}$$

Cosine annealing: Smooth decrease following a cosine curve

$$\eta_t = \eta_{\min} + \frac{\eta_0 - \eta_{\min}}{2}\left(1 + \cos\frac{\pi t}{T}\right)$$

Warmup: Start with small LR and gradually increase, preventing instability in early training when gradients are large and unreliable.

One-cycle: Warmup to a maximum LR, then decay. Often achieves faster convergence than monotonic schedules.

PYTHON
import numpy as np

def learning_rate_schedules():
    """Visualize different learning rate schedules."""

    epochs = 100
    initial_lr = 0.1

    def step_decay(epoch, drop=0.5, epochs_drop=20):
        return initial_lr * (drop ** (epoch // epochs_drop))

    def exponential_decay(epoch, decay_rate=0.02):
        return initial_lr * np.exp(-decay_rate * epoch)

    def cosine_annealing(epoch, total_epochs=100, min_lr=0.001):
        return min_lr + (initial_lr - min_lr) * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))

    def warmup_cosine(epoch, warmup_epochs=10, total_epochs=100):
        if epoch < warmup_epochs:
            return initial_lr * epoch / warmup_epochs
        else:
            return cosine_annealing(epoch - warmup_epochs, total_epochs - warmup_epochs)

    print("Learning Rate Schedules")
    print("-" * 60)
    print(f"{'Epoch':<8} {'Step':>10} {'Exp':>10} {'Cosine':>10} {'Warmup':>10}")
    print("-" * 60)

    for epoch in [0, 10, 25, 50, 75, 99]:
        print(f"{epoch:<8} {step_decay(epoch):>10.4f} {exponential_decay(epoch):>10.4f} "
              f"{cosine_annealing(epoch):>10.4f} {warmup_cosine(epoch):>10.4f}")

learning_rate_schedules()

Comparing Optimizers

Different optimizers suit different situations:

| Optimizer | Best For | Key Properties |
|-----------|----------|----------------|
| SGD + Momentum | When you need simplicity and control | Requires tuning, but can achieve best results |
| Adam | Default choice, quick experiments | Works well out-of-the-box |
| AdamW | Training with weight decay | Better generalization than Adam |
| SGD + warmup | Large batch training | More stable at scale |

General advice:

  • Start with Adam/AdamW for quick prototyping
  • Switch to SGD + momentum for final training if you can tune it well
  • Use learning rate warmup for transformers and large models
  • Always use a learning rate schedule for long training runs

PYTHON
import numpy as np

def optimizer_comparison():
    """Compare optimizers on a challenging function."""

    # Rosenbrock function: f(x,y) = (1-x)² + 100(y-x²)²
    # Minimum at (1, 1), famous for testing optimizers
    def rosenbrock(p):
        return (1 - p[0])**2 + 100 * (p[1] - p[0]**2)**2

    def rosenbrock_grad(p):
        dx = -2 * (1 - p[0]) - 400 * p[0] * (p[1] - p[0]**2)
        dy = 200 * (p[1] - p[0]**2)
        return np.array([dx, dy])

    def train(optimizer, name, steps=1000):
        params = np.array([-1.0, 1.0])  # Start point
        best_loss = rosenbrock(params)

        for _ in range(steps):
            grad = rosenbrock_grad(params)
            params = optimizer.step(params, grad)
            loss = rosenbrock(params)
            best_loss = min(best_loss, loss)

        return best_loss, params

    print("Optimizer Comparison on Rosenbrock Function")
    print("Optimum at (1, 1) with f = 0")
    print("=" * 60)

    # Test each optimizer
    results = []

    sgd = SGDMomentum(learning_rate=0.0001, momentum=0.9)
    loss, final = train(sgd, "SGD+Momentum")
    results.append(("SGD+Momentum", loss, final))

    rmsprop = RMSprop(learning_rate=0.01)
    loss, final = train(rmsprop, "RMSprop")
    results.append(("RMSprop", loss, final))

    adam = Adam(learning_rate=0.01)
    loss, final = train(adam, "Adam")
    results.append(("Adam", loss, final))

    for name, loss, final in results:
        print(f"{name:15s}: final = ({final[0]:.4f}, {final[1]:.4f}), loss = {loss:.6f}")

optimizer_comparison()

Key Takeaways

  • Vanilla SGD is simple but requires careful tuning and handles ill-conditioned problems poorly
  • Momentum accelerates convergence by accumulating gradient history
  • AdaGrad adapts learning rates per-parameter but decays too aggressively
  • RMSprop fixes AdaGrad with exponentially decaying gradient averages
  • Adam combines momentum and adaptive rates—the current default choice
  • AdamW properly handles weight decay and is preferred for regularized models
  • Learning rate schedules (warmup, cosine decay) improve final performance
  • Different optimizers suit different situations—experiment to find what works

Understanding these variants helps you diagnose training issues (is Adam failing due to weight decay interaction?) and make informed choices about which optimizer to use.

2.5 Automatic Differentiation Intermediate

Automatic Differentiation

Computing gradients by hand becomes intractable for modern neural networks with millions of parameters and complex architectures. Automatic differentiation (autodiff) solves this by computing exact derivatives automatically—not approximately like numerical differentiation, and not requiring manual derivation like symbolic differentiation.

Every deep learning framework (PyTorch, TensorFlow, JAX) is built around autodiff. Understanding how it works demystifies the "magic" of .backward() and helps you write more efficient code.

Three Approaches to Computing Derivatives

Numerical differentiation approximates derivatives using finite differences:

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

This is simple but slow (requires multiple function evaluations per parameter) and introduces approximation error. For a model with millions of parameters, numerical gradients would be prohibitively expensive.

Symbolic differentiation applies calculus rules to derive exact derivative expressions. Given $f(x) = x^2 \sin(x)$, it produces $f'(x) = 2x\sin(x) + x^2\cos(x)$. While exact, symbolic derivatives can "explode" in size for complex expressions, and repeatedly simplifying expressions is computationally expensive.

Automatic differentiation is the best of both worlds: it computes exact derivatives (up to floating-point precision) with computational cost proportional to the original function evaluation. This is what makes training large neural networks feasible.
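
To make the contrast concrete, here is a small sketch comparing a central-difference estimate with SymPy's exact symbolic derivative for the $f(x) = x^2 \sin(x)$ example above. The evaluation point is arbitrary.

PYTHON
import numpy as np
import sympy as sp

# Symbolic differentiation with SymPy: exact, but expressions can grow large
x = sp.Symbol('x')
f_sym = x**2 * sp.sin(x)
df_sym = sp.diff(f_sym, x)          # 2*x*sin(x) + x**2*cos(x)
exact = float(df_sym.subs(x, 1.5))

# Numerical differentiation: simple, but approximate and needs
# two function evaluations per parameter
def f(v):
    return v**2 * np.sin(v)

h = 1e-5
numeric = (f(1.5 + h) - f(1.5 - h)) / (2 * h)

print(f"Symbolic derivative: {df_sym}")
print(f"At x=1.5: exact = {exact:.8f}, central difference = {numeric:.8f}")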

The Key Insight: Decompose and Apply the Chain Rule

Every complex computation can be broken into a sequence of elementary operations (addition, multiplication, sin, exp, etc.), each with known derivatives. Autodiff:

  1. Records the sequence of operations during forward computation
  2. Applies the chain rule through this sequence to compute derivatives

For $f(x) = \sin(x^2)$:

  • Forward: compute $u = x^2$, then $f = \sin(u)$
  • Backward: $\frac{df}{du} = \cos(u)$, $\frac{du}{dx} = 2x$
  • Chain rule: $\frac{df}{dx} = \cos(x^2) \cdot 2x$

The magic is that this process is completely automatic—the framework handles the bookkeeping.
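
As a quick sanity check, the short sketch below carries out exactly these forward and backward steps for $f(x) = \sin(x^2)$ by hand and compares the result against a central-difference estimate; the evaluation point is arbitrary.

PYTHON
import numpy as np

x = 1.2

# Forward pass: record intermediate values
u = x ** 2             # u = x²
f = np.sin(u)          # f = sin(u)

# Backward pass: local derivatives, combined by the chain rule
df_du = np.cos(u)      # d sin(u)/du
du_dx = 2 * x          # d x²/dx
df_dx = df_du * du_dx  # chain rule: cos(x²) · 2x

# Numerical check with a central difference
h = 1e-6
numeric = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)

print(f"Chain-rule derivative:    {df_dx:.8f}")
print(f"Central difference check: {numeric:.8f}")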

Forward Mode vs Reverse Mode

Autodiff has two modes that differ in how they traverse the computational graph:

Forward mode propagates derivatives alongside values during forward computation. For each intermediate variable, we track both its value and its derivative with respect to the input.

Reverse mode first completes the forward pass, then propagates derivatives backward from output to inputs. This is what "backpropagation" means.

The efficiency difference is crucial:

  • Forward mode: cost scales with number of inputs (one forward pass per input)
  • Reverse mode: cost scales with number of outputs (one backward pass per output)

Neural networks have many inputs (millions of parameters) but typically a single output (the scalar loss). Reverse mode is therefore dramatically more efficient: one backward pass yields the gradient with respect to every parameter, whereas forward mode would need one pass per parameter.

PYTHON
import numpy as np

def forward_vs_reverse_mode():
    """
    Demonstrate forward mode vs reverse mode autodiff.

    Compute df/dx for f(x, y, z) = (x + y) * z
    where x=2, y=3, z=4, so f = (2+3)*4 = 20
    """

    print("Computing derivatives of f(x,y,z) = (x + y) * z")
    print("at x=2, y=3, z=4")
    print("=" * 60)

    # Values
    x, y, z = 2.0, 3.0, 4.0

    # Forward pass (same for both modes)
    u = x + y      # u = 5
    f = u * z      # f = 20

    print(f"\nForward pass:")
    print(f"  u = x + y = {u}")
    print(f"  f = u * z = {f}")

    # FORWARD MODE: compute df/dx
    # Alongside each value, track derivative with respect to x
    print(f"\nForward Mode (for df/dx):")
    print(f"  Seed: dx/dx = 1, dy/dx = 0, dz/dx = 0")

    du_dx = 1 + 0   # du/dx = dx/dx + dy/dx = 1
    df_dx = du_dx * z + u * 0  # df/dx = du/dx * z + u * dz/dx = 4
    print(f"  du/dx = {du_dx}")
    print(f"  df/dx = du/dx * z + u * dz/dx = {du_dx} * {z} + {u} * 0 = {df_dx}")

    # REVERSE MODE: compute df/dx, df/dy, df/dz all at once
    print(f"\nReverse Mode (all gradients at once):")
    print(f"  Seed: df/df = 1")

    # Backpropagate from f
    df_df = 1
    df_du = df_df * z   # f = u * z, so df/du = z
    df_dz = df_df * u   # f = u * z, so df/dz = u
    print(f"  df/du = {df_du} (from f = u*z)")
    print(f"  df/dz = {df_dz} (from f = u*z)")

    # Continue backprop through u = x + y
    df_dx = df_du * 1   # u = x + y, so du/dx = 1
    df_dy = df_du * 1   # u = x + y, so du/dy = 1
    print(f"  df/dx = df/du * du/dx = {df_du} * 1 = {df_dx}")
    print(f"  df/dy = df/du * du/dy = {df_du} * 1 = {df_dy}")

    print(f"\nSummary:")
    print(f"  Forward mode: one pass per input → 3 passes for 3 inputs")
    print(f"  Reverse mode: one pass for ALL inputs → 1 pass total")

forward_vs_reverse_mode()

Building a Simple Autodiff Engine

Let's build a minimal autodiff engine to understand the mechanics. The key components are:

  1. Value wrapper that tracks computation history
  2. Gradient storage for each value
  3. Backward functions that propagate gradients

PYTHON
import numpy as np

class Value:
    """A value that tracks its computation graph for autodiff."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0  # Gradient of final output with respect to this value
        self._backward = lambda: None  # Function to propagate gradients
        self._prev = set(_children)  # Parent nodes in computation graph
        self._op = _op  # Operation that produced this value

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward

        return out

    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')

        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += out.grad * (n * self.data ** (n - 1))
        out._backward = _backward

        return out

    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')

        def _backward():
            self.grad += out.grad * (self.data > 0)
        out._backward = _backward

        return out

    def backward(self):
        """Compute gradients via reverse-mode autodiff."""
        # Build topological order
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)

        # Seed gradient and backpropagate
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Demo: compute gradients of f = (a + b) * c
print("Mini Autodiff Engine Demo")
print("=" * 50)

a = Value(2.0)
b = Value(3.0)
c = Value(4.0)

# Forward pass
d = a + b    # d = 5
f = d * c    # f = 20

print(f"Forward pass:")
print(f"  a = {a.data}, b = {b.data}, c = {c.data}")
print(f"  d = a + b = {d.data}")
print(f"  f = d * c = {f.data}")

# Backward pass
f.backward()

print(f"\nBackward pass (gradients):")
print(f"  df/da = {a.grad}")
print(f"  df/db = {b.grad}")
print(f"  df/dc = {c.grad}")

# Verify: df/da = c = 4, df/db = c = 4, df/dc = d = 5
print(f"\nVerification:")
print(f"  df/da = d(d*c)/da = c * da/da = c = 4 ✓")
print(f"  df/db = d(d*c)/db = c * db/db = c = 4 ✓")
print(f"  df/dc = d(d*c)/dc = d = 5 ✓")

Autodiff in PyTorch

PyTorch makes autodiff seamless with its autograd system. Tensors with requires_grad=True track operations, and calling .backward() on the final output computes all gradients.

PYTHON
import torch

def pytorch_autodiff_demo():
    """Demonstrate PyTorch's automatic differentiation."""

    # Create tensors that track gradients
    x = torch.tensor(2.0, requires_grad=True)
    y = torch.tensor(3.0, requires_grad=True)

    # Computation (PyTorch builds graph automatically)
    z = x**2 + 3*x*y + y**2

    print("PyTorch Autodiff")
    print("=" * 50)
    print(f"z = x² + 3xy + y² at x={x.item()}, y={y.item()}")
    print(f"z = {z.item()}")

    # Compute gradients
    z.backward()

    print(f"\nGradients computed automatically:")
    print(f"  dz/dx = 2x + 3y = {x.grad.item()} (expected: {2*x.item() + 3*y.item()})")
    print(f"  dz/dy = 3x + 2y = {y.grad.item()} (expected: {3*x.item() + 2*y.item()})")

    # More complex example: gradient through operations
    print("\n" + "=" * 50)
    print("Gradient through neural network operations:")

    w = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
    x = torch.tensor([1.0, 1.0])

    # Forward pass
    y = w @ x           # Matrix-vector multiply
    loss = y.sum()      # Sum to get scalar

    print(f"w = \n{w.detach().numpy()}")
    print(f"x = {x.numpy()}")
    print(f"y = w @ x = {y.detach().numpy()}")
    print(f"loss = sum(y) = {loss.item()}")

    # Backward pass
    loss.backward()

    print(f"\nGradient of loss with respect to w:")
    print(f"{w.grad.numpy()}")
    print("(Each row gradient equals x, since d(w@x)/dw_ij = x_j)")

pytorch_autodiff_demo()

The Computation Graph

When you perform operations on tensors with gradient tracking, PyTorch builds a computation graph—a directed acyclic graph (DAG) where:

  • Nodes represent tensors (inputs, intermediates, outputs)
  • Edges represent operations

During backward pass, PyTorch traverses this graph in reverse topological order, applying the chain rule at each node.

Dynamic vs Static Graphs: PyTorch uses dynamic graphs—the graph is built fresh for each forward pass. This allows flexible control flow (if statements, loops with variable iterations). TensorFlow 1.x used static graphs that were defined once and reused, which enabled more optimization but was less flexible. Modern TensorFlow (2.x) also supports dynamic graphs via eager execution.
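
The short sketch below illustrates the dynamic-graph behavior: the loop runs a data-dependent number of times, and gradients still flow through whatever path actually executed. The starting value and threshold are arbitrary choices for this example.

PYTHON
import torch

def dynamic_graph_demo():
    """Gradients flow through data-dependent control flow (dynamic graph)."""
    x = torch.tensor(1.5, requires_grad=True)

    # The number of iterations depends on the data itself; the graph is
    # rebuilt each forward pass for whatever path actually executes.
    y = x
    steps = 0
    while y.item() < 10.0:
        y = y * 2
        steps += 1

    y.backward()
    # y = x * 2^steps, so dy/dx = 2^steps
    print(f"steps = {steps}, dy/dx = {x.grad.item()} (expected {2**steps})")

dynamic_graph_demo()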

Gradient Checkpointing: Trading Compute for Memory

Standard backpropagation stores all intermediate activations for the backward pass. For very deep networks or long sequences, this memory cost becomes prohibitive.

Gradient checkpointing trades memory for compute by:

  1. Only storing activations at certain "checkpoint" layers
  2. Recomputing intermediate activations during backward pass as needed

This reduces memory from $O(n)$ to $O(\sqrt{n})$ for $n$ layers, at the cost of roughly one additional forward pass.

PYTHON
def gradient_checkpointing_concept():
    """Explain gradient checkpointing memory trade-off."""

    print("Gradient Checkpointing: Memory vs Compute Trade-off")
    print("=" * 60)

    n_layers = 100

    print(f"\nFor a {n_layers}-layer network:")
    print(f"\nStandard Backprop:")
    print(f"  Memory: Store all {n_layers} activations")
    print(f"  Compute: 1 forward + 1 backward = 2 passes")

    checkpoint_interval = 10
    n_checkpoints = n_layers // checkpoint_interval

    print(f"\nWith checkpointing every {checkpoint_interval} layers:")
    print(f"  Memory: Store only {n_checkpoints} activations")
    print(f"  Compute: 1 forward + ~1.5 backward (recomputing segments)")
    print(f"  Memory reduction: {n_layers/n_checkpoints}x")

    print(f"\nUse cases:")
    print(f"  - Training very deep networks (1000+ layers)")
    print(f"  - Long sequences in transformers")
    print(f"  - When GPU memory is the bottleneck")

gradient_checkpointing_concept()
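
For reference, here is a minimal usage sketch with torch.utils.checkpoint.checkpoint. The layer sizes are arbitrary, and the use_reentrant=False flag assumes a reasonably recent PyTorch version.

PYTHON
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A toy block whose activations we would rather recompute than store
block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

x = torch.randn(8, 64, requires_grad=True)

# Checkpointed forward: activations inside `block` are not stored;
# they are recomputed during the backward pass instead.
y = checkpoint(block, x, use_reentrant=False)
loss = y.sum()
loss.backward()

print(f"Gradient shape flowing back to the input: {tuple(x.grad.shape)}")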

Higher-Order Derivatives

Sometimes we need derivatives of derivatives—the Hessian (matrix of second derivatives) or gradients of gradients (for meta-learning).

PyTorch supports this by setting create_graph=True during backward:

PYTHON
import torch

def higher_order_demo():
    """Compute second derivatives using autodiff."""

    x = torch.tensor(3.0, requires_grad=True)

    # f(x) = x³
    f = x ** 3

    # First derivative: df/dx = 3x²
    df_dx = torch.autograd.grad(f, x, create_graph=True)[0]

    # Second derivative: d²f/dx² = 6x
    d2f_dx2 = torch.autograd.grad(df_dx, x)[0]

    print("Higher-Order Derivatives")
    print("=" * 50)
    print(f"f(x) = x³ at x = {x.item()}")
    print(f"f = {f.item()}")
    print(f"df/dx = 3x² = {df_dx.item()} (expected: {3 * x.item()**2})")
    print(f"d²f/dx² = 6x = {d2f_dx2.item()} (expected: {6 * x.item()})")

higher_order_demo()

Common Pitfalls and Best Practices

Pitfall 1: Forgetting to zero gradients. Gradients accumulate by default. Always call optimizer.zero_grad() before computing new gradients.

Pitfall 2: In-place operations. Operations like x += 1 modify tensors in-place, which can break gradient computation. Use x = x + 1 instead.

Pitfall 3: Detaching when you shouldn't. .detach() removes a tensor from the computation graph. Use it intentionally (e.g., for target values in RL) but not accidentally.

Pitfall 4: Memory leaks. Holding references to computation graph nodes prevents memory from being freed. For inference, use torch.no_grad().

PYTHON
import torch

def autodiff_best_practices():
    """Demonstrate common autodiff patterns."""

    print("Autodiff Best Practices")
    print("=" * 50)

    # 1. Zero gradients before backward
    print("\n1. Always zero gradients:")
    w = torch.tensor([1.0], requires_grad=True)

    for i in range(3):
        loss = w ** 2
        loss.backward()
        print(f"  Iteration {i+1}: grad = {w.grad.item()} (accumulates!)")

    print("  Solution: call w.grad.zero_() or optimizer.zero_grad()")

    # 2. Inference mode
    print("\n2. Use no_grad() for inference:")
    print("  with torch.no_grad():")
    print("      # No gradient tracking, saves memory")
    print("      output = model(input)")

    # 3. Detach for stop-gradient
    print("\n3. Use detach() to stop gradients:")
    print("  target = model(x).detach()  # Target doesn't get gradients")
    print("  loss = criterion(prediction, target)")

autodiff_best_practices()

How Modern Frameworks Optimize Autodiff

Production frameworks include many optimizations beyond basic reverse-mode autodiff:

  1. Operator fusion: Combine multiple operations into single optimized kernels
  2. In-place gradient accumulation: Avoid allocating new memory for gradient updates
  3. Gradient scaling: For mixed-precision training, scale gradients to avoid underflow
  4. Distributed gradient synchronization: Efficiently average gradients across multiple GPUs

These optimizations are why PyTorch and TensorFlow are so much faster than naive implementations—they're built by teams of engineers optimizing every detail.

Key Takeaways

  • Automatic differentiation computes exact gradients efficiently via the chain rule
  • Reverse mode (backpropagation) is efficient for many-input, few-output functions like neural networks
  • Autodiff works by recording operations and propagating gradients through the computation graph
  • PyTorch's autograd makes this seamless: just use tensors with requires_grad=True and call .backward()
  • Gradient checkpointing trades computation for memory in deep networks
  • Higher-order derivatives are possible by creating graphs of gradient computations
  • Common pitfalls include forgetting to zero gradients, in-place operations, and memory leaks

Autodiff is the engine that makes deep learning practical. Every time you train a model, millions of gradients are computed automatically, enabling the optimization that produces intelligent behavior from data.