Derivatives and Gradients
Every time a machine learning model learns from data, it relies on derivatives to guide its adjustments. The derivative tells the model which direction to move and by how much. Without derivatives, we would have no systematic way to improve our predictions—training would be reduced to random guessing.
This section builds your understanding of derivatives from first principles and extends them to the multivariate case. By the end, you'll understand the mathematical machinery that powers gradient descent and backpropagation.
What is a Derivative?
At its core, a derivative measures how a function's output changes as its input changes. If you have a function $f(x)$ that maps inputs to outputs, the derivative $f'(x)$ tells you the instantaneous rate of change at any point.
Mathematically, the derivative is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
This formula captures a simple intuition: take two nearby points, measure how much the output changed, divide by how much the input changed, and take the limit as those points get infinitely close.
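For example, applying this definition to $f(x) = x^2$ gives the familiar power-rule result:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$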
Geometric interpretation: The derivative at a point equals the slope of the tangent line to the function at that point. A positive derivative means the function is increasing; a negative derivative means it's decreasing; a zero derivative indicates a flat spot (potentially a minimum, maximum, or inflection point).
Why does this matter for ML? Consider a loss function $L(w)$ that measures how wrong our model's predictions are, where $w$ represents a model weight. The derivative $\frac{dL}{dw}$ tells us:
- If positive: Increasing $w$ increases the loss, so we should decrease $w$
- If negative: Increasing $w$ decreases the loss, so we should increase $w$
- Magnitude: How sensitive the loss is to changes in $w$
This is the fundamental insight behind gradient descent: move the parameters in the direction that reduces the loss.
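To make the sign rule concrete, here is a minimal sketch with a toy one-parameter loss $L(w) = (w - 2)^2$ (a hypothetical example chosen purely for illustration). Each update moves $w$ against the sign of the derivative, and the loss falls:

```python
# Toy one-parameter loss L(w) = (w - 2)^2, minimized at w = 2
# (hypothetical example for illustrating the sign rule)
def loss(w):
    return (w - 2)**2

def dloss_dw(w):
    return 2 * (w - 2)

w = 5.0     # derivative is positive here, so the update decreases w
step = 0.1
for _ in range(3):
    w = w - step * dloss_dw(w)   # move opposite to the sign of the derivative
    print(f"w = {w:.3f}, L(w) = {loss(w):.3f}")
```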
Common Derivatives in Machine Learning
Several derivatives appear repeatedly in ML. Memorizing these will help you understand neural network computations:
| Function | Derivative | Where It Appears |
|----------|------------|------------------|
| $x^n$ | $nx^{n-1}$ | Polynomial features, weight regularization |
| $e^x$ | $e^x$ | Softmax, exponential learning rate schedules |
| $\ln(x)$ | $\frac{1}{x}$ | Log-likelihood, cross-entropy loss |
| $\frac{1}{1+e^{-x}}$ (sigmoid) | $\sigma(x)(1-\sigma(x))$ | Binary classification output |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | RNN activations |
| $\max(0, x)$ (ReLU) | $1$ if $x > 0$, else $0$ | Hidden layer activations |
Notice that sigmoid and tanh have derivatives expressed in terms of themselves—this makes computation efficient during backpropagation since we already computed the forward pass values.
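As a quick sanity check on the self-referential forms in the table, the sketch below (an illustration, not one of the section's own examples; the test point is arbitrary) compares the sigmoid and tanh identities against a finite-difference estimate, using the central-difference idea covered in the next subsection:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def central_diff(f, x, h=1e-5):
    """Numerical derivative via central differences."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7  # arbitrary test point
# Sigmoid: derivative should equal sigma(x) * (1 - sigma(x))
print(central_diff(sigmoid, x), sigmoid(x) * (1 - sigmoid(x)))
# Tanh: derivative should equal 1 - tanh(x)**2
print(central_diff(np.tanh, x), 1 - np.tanh(x)**2)
```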
Computing Derivatives Numerically
While analytical derivatives are exact and efficient, numerical derivatives are invaluable for verification and debugging. The central difference formula provides a good approximation:

$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$
This is more accurate than the forward difference $\frac{f(x+h) - f(x)}{h}$ because errors cancel out symmetrically.
```python
import numpy as np

def numerical_derivative(f, x, h=1e-5):
    """Compute derivative using central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Verify: derivative of x^2 should be 2x
f = lambda x: x**2
x = 3.0

numerical = numerical_derivative(f, x)
analytical = 2 * x  # We know d/dx[x^2] = 2x

print(f"At x = {x}:")
print(f"  Numerical derivative:  {numerical:.10f}")
print(f"  Analytical derivative: {analytical:.10f}")
print(f"  Absolute error: {abs(numerical - analytical):.2e}")
```

The error is typically around $10^{-10}$ for smooth functions—small enough for gradient checking but too slow for training large models.
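To see the accuracy gap between the forward and central differences mentioned above, here is a brief comparison sketch (an illustration; the test function, point, and step size are arbitrary choices):

```python
import numpy as np

f = lambda x: np.sin(x)   # smooth test function with known derivative cos(x)
x, h = 1.0, 1e-5

forward = (f(x + h) - f(x)) / h
central = (f(x + h) - f(x - h)) / (2 * h)
exact = np.cos(x)

print(f"Forward difference error: {abs(forward - exact):.2e}")   # roughly 1e-6
print(f"Central difference error: {abs(central - exact):.2e}")   # roughly 1e-11
```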
Partial Derivatives: Functions of Multiple Variables
Real ML models have thousands or millions of parameters. A loss function doesn't just depend on one weight—it depends on all of them simultaneously: $L(w_1, w_2, \ldots, w_n)$.
A partial derivative measures how the output changes when we vary just one input while holding all others fixed:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$
The symbol $\partial$ (rather than $d$) signals that other variables exist and are being held constant.
Example: Consider $f(x, y) = x^2 y + 3xy^2$
To find $\frac{\partial f}{\partial x}$, treat $y$ as a constant:

$$\frac{\partial f}{\partial x} = 2xy + 3y^2$$
To find $\frac{\partial f}{\partial y}$, treat $x$ as a constant:

$$\frac{\partial f}{\partial y} = x^2 + 6xy$$
Each partial derivative tells us the sensitivity of $f$ to one specific variable.
```python
import numpy as np

def f(x, y):
    """Example function: f(x, y) = x²y + 3xy²"""
    return x**2 * y + 3 * x * y**2

def df_dx_analytical(x, y):
    """Analytical: ∂f/∂x = 2xy + 3y²"""
    return 2*x*y + 3*y**2

def df_dy_analytical(x, y):
    """Analytical: ∂f/∂y = x² + 6xy"""
    return x**2 + 6*x*y

def partial_derivative(f, x, y, var='x', h=1e-5):
    """Numerical partial derivative."""
    if var == 'x':
        return (f(x + h, y) - f(x - h, y)) / (2 * h)
    else:
        return (f(x, y + h) - f(x, y - h)) / (2 * h)

# Test at (x, y) = (2, 3)
x, y = 2.0, 3.0
print(f"f(x, y) = x²y + 3xy² at ({x}, {y})")
print(f"f = {f(x, y)}")
print(f"\n∂f/∂x: numerical = {partial_derivative(f, x, y, 'x'):.4f}, "
      f"analytical = {df_dx_analytical(x, y):.4f}")
print(f"∂f/∂y: numerical = {partial_derivative(f, x, y, 'y'):.4f}, "
      f"analytical = {df_dy_analytical(x, y):.4f}")
```

The Gradient: A Vector of Partial Derivatives
When we collect all partial derivatives into a single vector, we get the gradient:

$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$$
The gradient is one of the most important concepts in optimization. It has two crucial properties:
- Direction: The gradient points in the direction of steepest increase of $f$
- Magnitude: $\|\nabla f\|$ tells us how steep that increase is
Since we want to minimize the loss function, we move in the opposite direction of the gradient. This gives us the gradient descent update rule:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L(\mathbf{w})$$
where $\eta$ is the learning rate—a small positive number that controls step size.
Intuition: Imagine standing on a hilly landscape in fog. You can't see the lowest point, but you can feel which direction slopes downward at your feet. The gradient tells you the steepest uphill direction; walking the opposite way takes you downhill. Repeat until you reach a valley.
```python
import numpy as np

def compute_gradient(f, params, h=1e-5):
    """
    Compute gradient numerically for any scalar function of a vector.

    Args:
        f: Function that takes a numpy array and returns a scalar
        params: Current parameter vector
        h: Step size for finite differences

    Returns:
        Gradient vector (same shape as params)
    """
    gradient = np.zeros_like(params, dtype=float)
    for i in range(len(params)):
        params_plus = params.copy()
        params_minus = params.copy()
        params_plus[i] += h
        params_minus[i] -= h
        gradient[i] = (f(params_plus) - f(params_minus)) / (2 * h)
    return gradient

# Example: quadratic loss L(w) = w₁² + 2w₂²
# Gradient: ∇L = [2w₁, 4w₂]
def quadratic_loss(w):
    return w[0]**2 + 2*w[1]**2

w = np.array([3.0, 2.0])
grad = compute_gradient(quadratic_loss, w)

print("Loss function: L(w) = w₁² + 2w₂²")
print(f"At w = {w}:")
print(f"  Loss = {quadratic_loss(w)}")
print(f"  Gradient (numerical):  {grad}")
print(f"  Gradient (analytical): {np.array([2*w[0], 4*w[1]])}")
```

Gradient Descent in Action
Let's trace through a few steps of gradient descent to see how the gradient guides us toward the minimum:
```python
import numpy as np

def gradient_descent_demo():
    """Demonstrate gradient descent on a simple quadratic."""
    # Loss: L(w) = (w₁ - 3)² + 2(w₂ - 1)²
    # Minimum at w* = [3, 1] with L(w*) = 0
    def loss(w):
        return (w[0] - 3)**2 + 2*(w[1] - 1)**2

    def gradient(w):
        return np.array([2*(w[0] - 3), 4*(w[1] - 1)])

    # Start far from the optimum
    w = np.array([0.0, 0.0])
    learning_rate = 0.1

    print("Gradient Descent Progress")
    print("-" * 50)
    print(f"{'Step':<6} {'w₁':>8} {'w₂':>8} {'Loss':>12} {'|∇L|':>10}")
    print("-" * 50)

    for step in range(10):
        grad = gradient(w)
        grad_norm = np.linalg.norm(grad)
        print(f"{step:<6} {w[0]:>8.4f} {w[1]:>8.4f} {loss(w):>12.6f} {grad_norm:>10.4f}")
        # Update: move opposite to the gradient
        w = w - learning_rate * grad

    print("-" * 50)
    print(f"Final w = [{w[0]:.4f}, {w[1]:.4f}], approaching optimum [3, 1]")

gradient_descent_demo()
```

Notice how the loss decreases and the gradient magnitude shrinks as we approach the minimum. This is typical behavior: far from the optimum, gradients are large and we make big steps; near the optimum, gradients become small and steps shrink, allowing precise convergence.
The Jacobian: Gradients for Vector-Valued Functions
When a function outputs a vector instead of a scalar (like a neural network layer), we need the Jacobian matrix instead of a gradient vector. If $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ maps $n$ inputs to $m$ outputs, the Jacobian is an $m \times n$ matrix:

$$\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
Each row is the gradient of one output component. Entry $J_{ij}$ tells us how output $i$ changes when input $j$ changes.
In neural networks: A layer transforms inputs $\mathbf{x}$ into outputs $\mathbf{y}$. The Jacobian $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ describes how each output neuron responds to each input. For a linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$, the Jacobian is simply the weight matrix $\mathbf{W}$.
```python
import numpy as np

def compute_jacobian(f, x, h=1e-5):
    """
    Compute Jacobian matrix numerically.

    Args:
        f: Function mapping R^n -> R^m
        x: Input vector of shape (n,)

    Returns:
        Jacobian matrix of shape (m, n)
    """
    f_x = f(x)
    m, n = len(f_x), len(x)
    jacobian = np.zeros((m, n))
    for j in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += h
        x_minus[j] -= h
        jacobian[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)
    return jacobian

# Example: linear layer f(x) = Wx
W = np.array([[1, 2],
              [3, 4],
              [5, 6]], dtype=float)

def linear_layer(x):
    return W @ x

x = np.array([1.0, 2.0])
print("Linear layer: f(x) = Wx")
print(f"W =\n{W}")
print(f"x = {x}")
print(f"\nJacobian (numerical):\n{compute_jacobian(linear_layer, x)}")
print(f"\nWeight matrix W:\n{W}")
print("\nFor linear layers, Jacobian = W (as expected)")
```

The Hessian: Second-Order Information
The Hessian matrix contains all second-order partial derivatives:

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}, \qquad H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
While the gradient tells us the slope at a point, the Hessian tells us the curvature—how the slope itself changes. This information is valuable for optimization:
- Positive definite Hessian (all positive eigenvalues): Local minimum, bowl-shaped
- Negative definite Hessian (all negative eigenvalues): Local maximum, hill-shaped
- Indefinite Hessian (mixed eigenvalues): Saddle point
In deep learning, saddle points are far more common than local minima in high dimensions. The Hessian helps identify these, though computing it exactly is often prohibitively expensive.
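As an illustration (not an example from the original text), the Hessian can be approximated by applying the central-difference idea twice. The sketch below does this for the quadratic loss $L(w) = w_1^2 + 2w_2^2$ used earlier and reads off the curvature type from the eigenvalue signs:

```python
import numpy as np

def compute_hessian(f, x, h=1e-4):
    """Approximate the Hessian of a scalar function via central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += h; x_pp[j] += h
            x_pm = x.copy(); x_pm[i] += h; x_pm[j] -= h
            x_mp = x.copy(); x_mp[i] -= h; x_mp[j] += h
            x_mm = x.copy(); x_mm[i] -= h; x_mm[j] -= h
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h**2)
    return H

# Quadratic loss from earlier: L(w) = w₁² + 2w₂², exact Hessian [[2, 0], [0, 4]]
loss = lambda w: w[0]**2 + 2*w[1]**2
H = compute_hessian(loss, np.array([3.0, 2.0]))
print(f"Hessian:\n{H}")
print(f"Eigenvalues: {np.linalg.eigvalsh(H)}")  # all positive -> local minimum
```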
Practical Application: Linear Regression Gradients
Let's derive the gradient for mean squared error in linear regression—one of the most fundamental calculations in ML.
Given data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ and predictions $\hat{y}_i = \mathbf{w}^T\mathbf{x}_i$, the MSE loss is:

$$L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \mathbf{w}^T\mathbf{x}_i \right)^2$$
To find the gradient, we differentiate with respect to $\mathbf{w}$:

$$\nabla_{\mathbf{w}} L = -\frac{2}{n} \sum_{i=1}^n \left( y_i - \mathbf{w}^T\mathbf{x}_i \right) \mathbf{x}_i = -\frac{2}{n} \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w})$$
This formula tells us exactly how to update weights to reduce prediction error.
```python
import numpy as np

# Generate synthetic linear regression data
np.random.seed(42)
n_samples, n_features = 100, 3
X = np.random.randn(n_samples, n_features)
true_weights = np.array([2.0, -1.0, 0.5])
y = X @ true_weights + 0.1 * np.random.randn(n_samples)

def mse_loss(w, X, y):
    """Mean Squared Error loss."""
    predictions = X @ w
    return np.mean((y - predictions)**2)

def mse_gradient(w, X, y):
    """Analytical gradient of MSE: -2/n · Xᵀ(y - Xw)."""
    n = len(y)
    predictions = X @ w
    return -2/n * X.T @ (y - predictions)

# Gradient descent to find optimal weights
w = np.zeros(n_features)  # Start at origin
learning_rate = 0.1

print("Learning linear regression weights via gradient descent")
print("-" * 55)
for epoch in range(5):
    loss = mse_loss(w, X, y)
    grad = mse_gradient(w, X, y)
    print(f"Epoch {epoch}: Loss = {loss:.6f}, |grad| = {np.linalg.norm(grad):.6f}")
    w = w - learning_rate * grad
print("-" * 55)
print(f"Learned weights: {w}")
print(f"True weights:    {true_weights}")
```

Key Takeaways
- Derivatives measure instantaneous rates of change and tell us how to adjust parameters
- Partial derivatives extend this to functions of many variables by varying one at a time
- The gradient vector points toward steepest increase; we descend by moving opposite to it
- Numerical differentiation via central differences is slower but essential for verification
- The Jacobian generalizes gradients for vector-valued functions (like neural network layers)
- The Hessian captures curvature information useful for advanced optimization
These concepts form the mathematical backbone of machine learning optimization. In the next section, we'll see how the chain rule allows us to efficiently compute gradients through compositions of functions—the key insight behind backpropagation.