
Chapter 5: Supervised Learning Fundamentals

Linear regression, logistic regression, model evaluation, and cross-validation.

Libraries covered: Scikit-learn

Learning Objectives

["Implement linear and logistic regression", "Evaluate models using proper metrics", "Apply cross-validation techniques"]


5.1 The Supervised Learning Framework (Beginner)

The Supervised Learning Framework

Supervised learning is the most widely used paradigm in machine learning. The word "supervised" comes from the presence of a teacher—labeled examples that show the algorithm what correct answers look like. Given enough examples of inputs paired with their correct outputs, the algorithm learns patterns that let it predict outputs for new, unseen inputs.

This framework underlies everything from spam detection to medical diagnosis to self-driving cars. Understanding it deeply is essential for any ML practitioner.

What Makes Learning "Supervised"

In supervised learning, we have access to a dataset where each example consists of two parts: features (also called inputs, predictors, or independent variables) and a label (also called the target, output, or dependent variable). The features describe the example; the label is what we want to predict.

Consider predicting house prices. The features might include square footage, number of bedrooms, location, and age of the house. The label is the sale price. We have historical data where we know both the features and the actual sale prices. Our goal is to learn a function that maps features to prices, so we can predict prices for houses we haven't seen before.

Mathematically, we have a dataset of $n$ examples:

$$\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$

where $x_i \in \mathbb{R}^d$ is a feature vector with $d$ dimensions, and $y_i$ is the corresponding label. Our goal is to learn a function $f: \mathbb{R}^d \rightarrow \mathcal{Y}$ that accurately predicts labels for new inputs.

The "supervision" comes from having access to the true labels $y_i$ during training. This contrasts with unsupervised learning, where we only have features and must discover structure without labeled guidance, and reinforcement learning, where we learn from rewards rather than explicit labels.

Classification vs Regression

Supervised learning problems fall into two categories based on the nature of the label:

Classification: The label is categorical—it belongs to one of a finite set of classes. Binary classification has two classes (spam/not spam, fraud/legitimate, disease/healthy). Multiclass classification has more than two classes (digit recognition with classes 0-9, species identification, sentiment as positive/neutral/negative).

Regression: The label is continuous—a real number. House prices, temperature predictions, stock returns, and age estimation are regression problems.

The distinction matters because it determines which algorithms apply, which loss functions make sense, and how we evaluate performance. Some algorithms work for both (decision trees, neural networks), while others are specific to one type (logistic regression for classification, linear regression for regression—despite its name).

PYTHON
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split

# Generate a classification dataset
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, random_state=42
)
print(f"Classification: {X_clf.shape[0]} samples, {X_clf.shape[1]} features")
print(f"Labels: {np.unique(y_clf)} (binary)")

# Generate a regression dataset
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=20, n_informative=10,
    noise=10, random_state=42
)
print(f"\nRegression: {X_reg.shape[0]} samples, {X_reg.shape[1]} features")
print(f"Labels: continuous values ranging from {y_reg.min():.1f} to {y_reg.max():.1f}")

The Learning Process

Supervised learning follows a systematic process:

1. Data Collection: Gather examples with features and labels. Data quality is crucial—noisy labels, missing values, or unrepresentative samples will hurt performance.

2. Data Preprocessing: Clean and transform the data. This includes handling missing values, encoding categorical variables, scaling features, and potentially creating new features.

3. Model Selection: Choose an algorithm appropriate for your problem. Consider the nature of your data, interpretability requirements, computational constraints, and the bias-variance tradeoff.

4. Training: Feed the training data to the algorithm. The algorithm adjusts its internal parameters to minimize prediction errors on the training set.

5. Validation: Evaluate the trained model on data it hasn't seen. This estimates how well the model will perform on truly new data.

6. Iteration: Based on validation results, adjust preprocessing, try different algorithms, tune hyperparameters, or gather more data.

7. Deployment: Once satisfied with performance, deploy the model to make predictions on new data in production.

PYTHON
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Simulating the learning process
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Step 1-2: Split data (simulating train/test separation)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: Preprocess (scale features)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training statistics!

# Step 3-4: Select model and train
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Step 5: Validate
train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
test_acc = accuracy_score(y_test, model.predict(X_test_scaled))

print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")

The Hypothesis Space

When we choose a model, we're implicitly defining a hypothesis space—the set of all possible functions the model can represent. Linear regression can only represent linear functions. Decision trees represent piecewise constant functions. Neural networks can represent arbitrarily complex functions.

The hypothesis space is a fundamental choice. If the true relationship between features and labels isn't in your hypothesis space, no amount of training data will let you learn it perfectly. A linear model cannot capture a quadratic relationship, no matter how much data you have.

Formally, we're searching for a function $f \in \mathcal{H}$ (from the hypothesis space $\mathcal{H}$) that minimizes some loss function:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} L(y_i, f(x_i))$$

where $L$ measures the discrepancy between predictions $f(x_i)$ and true labels $y_i$.

The hypothesis space determines the model's capacity—its ability to fit complex patterns. Too little capacity and the model can't capture the true relationship (underfitting). Too much capacity and the model memorizes training data without learning generalizable patterns (overfitting).
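
To make the capacity point concrete, here is a minimal sketch with a synthetic quadratic target (an assumption of this example, not data from the text): a linear model fit to data generated from $y = x^2$ keeps a large training error no matter how many samples it sees, because no function in its hypothesis space matches the truth.

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# True relationship is quadratic; the linear hypothesis space cannot represent it
np.random.seed(0)
for n in [100, 1000, 10000]:
    X = np.random.uniform(-3, 3, size=(n, 1))
    y = X.ravel() ** 2 + np.random.randn(n) * 0.1  # quadratic signal, small noise

    model = LinearRegression()
    model.fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    print(f"n={n:6d}: training MSE = {mse:.3f}  (stays far above the noise level)")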

Inductive Bias

Every learning algorithm embodies inductive biases—assumptions about what kinds of patterns are more likely or preferable. These biases allow generalization from finite training data to infinite possible inputs.

Without inductive bias, learning is impossible. Given any finite training set, infinitely many functions could perfectly fit the data but differ wildly on new inputs. We need assumptions to prefer some functions over others.

Common inductive biases include:

Smoothness: Nearby inputs should have nearby outputs. This is why many algorithms struggle with discontinuous functions.

Simplicity: Simpler explanations are preferred (Occam's razor). Regularization implements this by penalizing complexity.

Linearity: Many algorithms assume linear relationships or transform data to make relationships approximately linear.

Feature independence: Naive Bayes assumes features are conditionally independent given the label.

Understanding your algorithm's inductive bias helps you choose appropriate models. If you believe the true relationship is linear, use linear models. If you expect complex nonlinear interactions, consider tree ensembles or neural networks.
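
As a small illustration of how different inductive biases play out, the sketch below compares a linear classifier with a shallow decision tree on the same nonlinear problem. The use of scikit-learn's make_moons toy dataset and the specific depth are assumptions of this example.

PYTHON
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Same nonlinear problem, two different inductive biases
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    'Logistic regression (linear-boundary bias)': LogisticRegression(),
    'Decision tree (axis-aligned partition bias)': DecisionTreeClassifier(max_depth=5, random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")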

The Bias-Variance Tradeoff

Model performance depends on two sources of error:

Bias measures systematic error—how far off the model's average prediction is from the truth. High-bias models make strong assumptions that may not hold, leading to underfitting. A linear model has high bias when the true relationship is nonlinear.

Variance measures sensitivity to training data—how much predictions change with different training samples. High-variance models are flexible and fit the training data closely, but may not generalize. They overfit to noise and idiosyncrasies of the specific training set.

For a given prediction, the expected error decomposes as:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

The irreducible noise comes from inherent randomness in the problem—variation that can't be predicted from the features alone.

The tradeoff arises because reducing bias (by using more flexible models) typically increases variance, and vice versa. Simple models have high bias but low variance. Complex models have low bias but high variance. The sweet spot depends on your data size and the true complexity of the relationship.

PYTHON
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate data with nonlinear relationship
np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y_true = np.sin(2 * np.pi * X).ravel()
y = y_true + np.random.normal(0, 0.2, 100)

# Split into train/test
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit models with different complexities
for degree in [1, 4, 15]:
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train_poly))
    test_mse = mean_squared_error(y_test, model.predict(X_test_poly))

    print(f"Degree {degree:2d}: Train MSE = {train_mse:.4f}, Test MSE = {test_mse:.4f}")

Generalization: The Ultimate Goal

The purpose of supervised learning isn't to memorize training data—it's to generalize to new, unseen examples. A model that perfectly predicts training labels but fails on new data is useless. This is why we always evaluate on held-out test data.

Generalization error is the expected error on the true data distribution, not just the training set:

$$\text{Generalization Error} = \mathbb{E}_{(x,y) \sim P}[L(y, f(x))]$$

We can't compute this directly (we don't know the true distribution $P$), but we estimate it using test data. The gap between training error and test error reveals overfitting.

Several factors affect generalization:

Training data size: More data generally improves generalization by better representing the true distribution and reducing variance (illustrated in the sketch after these factors).

Model complexity: Must match the problem's complexity. Too simple underfits; too complex overfits.

Data quality: Noisy labels, outliers, and distribution shift hurt generalization.

Regularization: Techniques that constrain the model, reducing overfitting.
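
Here is the sketch referenced above: a rough illustration of the training-data-size factor on synthetic data. The sample sizes, classifier choice, and dataset parameters are illustrative assumptions, not values taken from the text.

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Effect of training set size on generalization (synthetic data)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for n in [50, 200, 1000, 3000]:
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_full[:n], y_train_full[:n])  # train on the first n examples only
    print(f"n_train={n:5d}: test accuracy = {model.score(X_test, y_test):.3f}")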

Train/Validation/Test Split

Proper data splitting is essential for reliable model development:

Training set (typically 60-80%): Used to fit model parameters. The model sees these examples during learning.

Validation set (typically 10-20%): Used for hyperparameter tuning and model selection. We evaluate on validation to choose between models without contaminating test evaluation.

Test set (typically 10-20%): Held out completely until final evaluation. Provides an unbiased estimate of real-world performance.

The key principle is that test data must remain unseen until final evaluation. If you tune hyperparameters based on test performance, you're implicitly fitting to the test set, and your test error becomes an optimistic estimate.

PYTHON
from sklearn.model_selection import train_test_split

# Full dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First split: separate test set (held out until final evaluation)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: separate validation set (for hyperparameter tuning)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 of 0.8 = 0.2
)

print(f"Training set:   {len(X_train)} samples ({len(X_train)/len(X):.0%})")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X):.0%})")
print(f"Test set:       {len(X_test)} samples ({len(X_test)/len(X):.0%})")

Putting It Together

Here's a complete example demonstrating the supervised learning workflow:

PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load real dataset
data = load_breast_cancer()
X, y = data.data, data.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {data.target_names}")

# Split data
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Try different models (model selection using validation set)
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42)
}

print("\nModel Selection (using validation set):")
best_model_name = None
best_val_acc = 0

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val_scaled))
    print(f"  {name}: {val_acc:.3f}")

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_model_name = name

# Final evaluation on test set (only once!)
best_model = models[best_model_name]
test_acc = accuracy_score(y_test, best_model.predict(X_test_scaled))
print(f"\nBest model: {best_model_name}")
print(f"Final test accuracy: {test_acc:.3f}")

Key Takeaways

  • Supervised learning uses labeled data to learn a mapping from features to labels
  • Classification predicts categories; regression predicts continuous values
  • The hypothesis space defines what functions your model can represent
  • Inductive bias encodes assumptions that enable generalization
  • The bias-variance tradeoff balances underfitting and overfitting
  • Generalization to new data is the ultimate goal, not training performance
  • Proper train/validation/test splits prevent overfitting during model development

5.2 Linear Regression (Beginner)

Linear Regression

Linear regression is the most fundamental algorithm in machine learning. Despite its simplicity, it remains remarkably powerful and widely used. Understanding linear regression deeply—its assumptions, derivation, and limitations—provides a foundation for comprehending more complex models. Many concepts introduced here (loss functions, optimization, regularization) apply throughout machine learning.

The Linear Model

Linear regression assumes that the relationship between features and target is approximately linear. Given input features $x = (x_1, x_2, \ldots, x_d)$, the model predicts:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w_0 + \mathbf{w}^T \mathbf{x}$$

Here $w_0$ is the intercept (or bias term), and $w_1, \ldots, w_d$ are weights (or coefficients) that determine how much each feature contributes to the prediction. The model is "linear" because the prediction is a linear combination of the features—each feature is multiplied by its weight and summed.

In matrix form, if we include the intercept by prepending a 1 to each feature vector, the model becomes:

$$\hat{y} = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} = (w_0, w_1, \ldots, w_d)^T$ and $\mathbf{x} = (1, x_1, \ldots, x_d)^T$.

The geometric interpretation is elegant: in 2D (one feature), the model is a line; in 3D (two features), it's a plane; in higher dimensions, it's a hyperplane. The weights define the orientation of this hyperplane, and our goal is to find the hyperplane that best fits the data.

PYTHON
import numpy as np
import matplotlib.pyplot as plt

# Generate simple linear data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.flatten() + np.random.randn(100)  # y = 4 + 3x + noise

# Visualize
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.6)
plt.xlabel('Feature (x)')
plt.ylabel('Target (y)')
plt.title('Linear Relationship with Noise')
plt.show()

The Least Squares Objective

How do we find the best weights? We need to define what "best" means. The standard approach is least squares: find weights that minimize the sum of squared differences between predictions and actual values.

Given $n$ training examples $(x_i, y_i)$, the residual for example $i$ is the prediction error:

$$r_i = y_i - \hat{y}_i = y_i - \mathbf{w}^T \mathbf{x}_i$$

The sum of squared residuals (SSR), also called the residual sum of squares (RSS), is:

$$\text{SSR} = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$

We minimize this because squared errors have nice mathematical properties: they're differentiable everywhere, they penalize large errors more heavily than small ones, and the solution has a closed-form expression.

In matrix notation, with $\mathbf{X}$ as the $n \times (d+1)$ design matrix (each row is an example) and $\mathbf{y}$ as the $n$-dimensional target vector:

$$\text{SSR}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$
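
As a quick numerical check, the sketch below evaluates the SSR objective for a few hand-picked weight vectors on the same kind of synthetic data used elsewhere in this section; the candidate weights are arbitrary, and the one closest to the data-generating values should give the smallest SSR.

PYTHON
import numpy as np

# Evaluate the SSR objective for several candidate weight vectors
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.flatten() + np.random.randn(100)
X_b = np.c_[np.ones((100, 1)), X]  # design matrix with a bias column

def ssr(w):
    residuals = y - X_b @ w
    return residuals @ residuals

for w in [np.array([0.0, 0.0]), np.array([4.0, 2.0]), np.array([4.0, 3.0])]:
    print(f"w = {w}: SSR = {ssr(w):.1f}")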

The Normal Equation

The beauty of least squares is that it has an analytical solution. Taking the gradient of SSR with respect to $\mathbf{w}$ and setting it to zero:

$$\nabla_\mathbf{w} \text{SSR} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0$$

Solving for $\mathbf{w}$:

$$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

If $\mathbf{X}^T\mathbf{X}$ is invertible, the optimal weights are:

$$\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

This is the normal equation. The matrix $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the pseudo-inverse of $\mathbf{X}$.

PYTHON
import numpy as np

# Generate data
np.random.seed(42)
n_samples = 100
X = 2 * np.random.rand(n_samples, 1)
y = 4 + 3 * X.flatten() + np.random.randn(n_samples)

# Add bias term (column of ones)
X_b = np.c_[np.ones((n_samples, 1)), X]

# Normal equation solution
w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

print(f"Estimated weights: w0 = {w[0]:.4f}, w1 = {w[1]:.4f}")
print(f"True values:       w0 = 4.0000, w1 = 3.0000")

Gradient Descent Alternative

While the normal equation gives an exact solution, it requires computing a matrix inverse, which is $O(d^3)$ for $d$ features. For high-dimensional data or when the matrix is ill-conditioned, gradient descent is more practical.

The gradient tells us the direction of steepest increase in the loss. We move in the opposite direction to decrease the loss:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla_\mathbf{w} \text{SSR}(\mathbf{w}_t)$$

where $\alpha$ is the learning rate—a hyperparameter controlling step size. The gradient is:

$$\nabla_\mathbf{w} \text{SSR} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = 2\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y})$$

For large datasets, computing this full gradient is expensive. Stochastic gradient descent (SGD) approximates the gradient using a single example or small batch, trading accuracy for speed.

PYTHON
import numpy as np

# Gradient descent implementation
def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # Initialize weights to zero

    history = []
    for i in range(n_iterations):
        # Compute predictions
        y_pred = X @ w

        # Compute gradient
        gradient = (2/n_samples) * X.T @ (y_pred - y)

        # Update weights
        w = w - learning_rate * gradient

        # Track loss
        if i % 100 == 0:
            loss = np.mean((y_pred - y)**2)
            history.append(loss)

    return w, history

# Apply to our data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.flatten() + np.random.randn(100)
X_b = np.c_[np.ones((100, 1)), X]

w_gd, loss_history = gradient_descent(X_b, y, learning_rate=0.1, n_iterations=1000)
print(f"Gradient descent solution: w0 = {w_gd[0]:.4f}, w1 = {w_gd[1]:.4f}")

Assumptions of Linear Regression

Linear regression makes several assumptions. When these are violated, the model may produce unreliable estimates:

Linearity: The relationship between features and target is linear. Nonlinear relationships require feature transformations or different models.

Independence: Observations are independent of each other. Violated in time series data where consecutive observations are correlated.

Homoscedasticity: The variance of residuals is constant across all levels of the features. When variance changes with feature values (heteroscedasticity), standard errors are unreliable.

Normality of residuals: For inference (confidence intervals, p-values), residuals should be approximately normally distributed. Less critical for pure prediction.

No multicollinearity: Features shouldn't be highly correlated with each other. High multicollinearity inflates variance of coefficient estimates, making them unstable.

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression

# Example: Checking assumptions
np.random.seed(42)
X = np.random.randn(100, 3)

# Create multicollinearity: X[:,2] is almost a copy of X[:,1]
X[:, 2] = X[:, 1] + 0.01 * np.random.randn(100)

y = 2 + 3*X[:, 0] + 1.5*X[:, 1] + np.random.randn(100)

# Fit model
model = LinearRegression()
model.fit(X, y)

print("Coefficients (true: 3, 1.5, 0):")
print(f"  Feature 0: {model.coef_[0]:.2f}")
print(f"  Feature 1: {model.coef_[1]:.2f}")
print(f"  Feature 2: {model.coef_[2]:.2f}")
print("\nNote: Features 1 and 2 are highly correlated.")
print("Coefficients for collinear features are unstable and hard to interpret.")

Evaluating Linear Regression

Several metrics assess regression model performance:

Mean Squared Error (MSE): Average of squared errors. Same units as variance of target.

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE): Square root of MSE. Same units as target, more interpretable.

$$\text{RMSE} = \sqrt{\text{MSE}}$$

Mean Absolute Error (MAE): Average of absolute errors. Less sensitive to outliers than MSE.

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

R-squared ($R^2$): Proportion of variance explained by the model. A perfect fit gives 1, always predicting the mean gives 0, and values can be negative for models that do worse than the mean (which can happen on test data).

$$R^2 = 1 - \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

where SST is the total sum of squares (variance of target).

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Generate data
np.random.seed(42)
X = np.random.randn(200, 5)
y = 2 + X @ np.array([1.5, -2, 0.5, 0, 3]) + np.random.randn(200)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Evaluate
print("Training Set:")
print(f"  MSE:  {mean_squared_error(y_train, y_pred_train):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_train, y_pred_train):.4f}")
print(f"  R²:   {r2_score(y_train, y_pred_train):.4f}")

print("\nTest Set:")
print(f"  MSE:  {mean_squared_error(y_test, y_pred_test):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred_test):.4f}")
print(f"  R²:   {r2_score(y_test, y_pred_test):.4f}")

Polynomial Regression

Linear regression is linear in the weights, not necessarily in the features. By creating polynomial features, we can fit nonlinear relationships while still using the linear regression framework.

For a single feature $x$, polynomial regression of degree $p$ uses features $(1, x, x^2, \ldots, x^p)$:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_p x^p$$

This is still linear regression—we're just fitting a linear model in an expanded feature space. The flexibility increases with degree, but so does the risk of overfitting.

PYTHON
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate nonlinear data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.flatten()**2 + X.flatten() + 2 + np.random.randn(100) * 0.5

# Split
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit models with different degrees
for degree in [1, 2, 5, 15]:
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train_poly))
    test_mse = mean_squared_error(y_test, model.predict(X_test_poly))

    print(f"Degree {degree:2d}: Train MSE = {train_mse:.4f}, Test MSE = {test_mse:.4f}")

Feature Scaling for Linear Regression

Linear regression is sensitive to feature scales when using gradient descent (the normal equation is scale-invariant but practically limited to smaller problems). Features with larger magnitudes dominate the gradient, leading to slow or unstable convergence.

Standardization (z-score normalization) transforms features to have zero mean and unit variance:

$$x' = \frac{x - \mu}{\sigma}$$

After standardization, coefficients are directly comparable: a larger magnitude indicates a feature with a stronger effect on the prediction, since all features are now on the same scale.

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Features with very different scales
np.random.seed(42)
X = np.column_stack([
    np.random.randn(100) * 1,        # Scale ~1
    np.random.randn(100) * 1000,     # Scale ~1000
    np.random.randn(100) * 0.001     # Scale ~0.001
])
y = 3*X[:, 0] + 0.003*X[:, 1] + 3000*X[:, 2] + np.random.randn(100)

# Without scaling
model_unscaled = LinearRegression()
model_unscaled.fit(X, y)
print("Coefficients without scaling:")
print(f"  {model_unscaled.coef_}")

# With scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)
print("\nCoefficients with scaling:")
print(f"  {model_scaled.coef_}")
print("\nScaled coefficients show relative importance on comparable scale.")

Interpreting Coefficients

Linear regression coefficients have a clear interpretation: $w_j$ represents the expected change in $y$ for a one-unit increase in $x_j$, holding all other features constant.

This "holding constant" interpretation assumes no multicollinearity. With correlated features, changing one feature while holding others constant is unrealistic, and coefficients become harder to interpret.

For categorical features encoded with one-hot encoding, each coefficient represents the difference from the reference category.

PYTHON
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create interpretable example
np.random.seed(42)
n = 200

data = pd.DataFrame({
    'sqft': np.random.uniform(1000, 3000, n),
    'bedrooms': np.random.randint(1, 6, n),
    'age': np.random.uniform(0, 50, n)
})

# Price = 100*sqft + 20000*bedrooms - 1000*age + noise
data['price'] = (100 * data['sqft'] +
                 20000 * data['bedrooms'] -
                 1000 * data['age'] +
                 np.random.randn(n) * 10000)

# Fit model
X = data[['sqft', 'bedrooms', 'age']]
y = data['price']

model = LinearRegression()
model.fit(X, y)

print("Coefficient Interpretation:")
print(f"  Intercept: ${model.intercept_:,.0f}")
print(f"  sqft:      ${model.coef_[0]:,.0f} per square foot")
print(f"  bedrooms:  ${model.coef_[1]:,.0f} per bedroom")
print(f"  age:       ${model.coef_[2]:,.0f} per year of age")

Scikit-learn Implementation

Scikit-learn provides a clean, consistent interface for linear regression:

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Generate data
np.random.seed(42)
X = np.random.randn(500, 10)
true_weights = np.array([3, -2, 1.5, 0, 0, 4, -1, 0.5, 0, 2])
y = X @ true_weights + 5 + np.random.randn(500)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create pipeline with scaling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
print(f"Test R²:   {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# Access coefficients
model = pipeline.named_steps['regressor']
print(f"\nIntercept: {model.intercept_:.4f}")
print(f"Top 3 coefficients (by magnitude):")
coef_idx = np.argsort(np.abs(model.coef_))[::-1][:3]
for idx in coef_idx:
    print(f"  Feature {idx}: {model.coef_[idx]:.4f}")

Key Takeaways

  • Linear regression predicts a continuous target as a weighted sum of features
  • The least squares objective minimizes sum of squared residuals
  • The normal equation gives a closed-form solution: $\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
  • Gradient descent is an alternative for large-scale problems
  • Key assumptions: linearity, independence, homoscedasticity, no multicollinearity
  • Polynomial regression fits nonlinear relationships by expanding the feature space
  • Coefficients represent the effect of each feature, holding others constant
  • Feature scaling is important for gradient descent and coefficient comparison
  • $R^2$ measures proportion of variance explained; RMSE measures prediction error in target units

5.3 Logistic Regression (Intermediate)

Logistic Regression

Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It's one of the most important algorithms in machine learning, serving as the foundation for understanding neural networks and providing a principled approach to binary classification. Logistic regression models the probability that an input belongs to a particular class, making it both a classifier and a probabilistic model.

From Regression to Classification

Linear regression predicts continuous values, but what if we want to predict a binary outcome—yes/no, spam/not spam, disease/healthy? We could try using linear regression and thresholding at 0.5, but this approach has problems: linear regression can predict values outside [0, 1], and it's sensitive to outliers in ways that distort the decision boundary.

Logistic regression solves this by predicting probabilities directly. Instead of predicting $y$, it predicts $P(y = 1 | x)$—the probability that the positive class is true given the input features.

The key insight is to transform the linear model's output through a function that maps any real number to the (0, 1) interval. The logistic function (also called the sigmoid function) does exactly this:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This S-shaped function has several desirable properties: it outputs values between 0 and 1, it's differentiable everywhere, and it has a simple derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

PYTHON
import numpy as np
import matplotlib.pyplot as plt

# The sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)

plt.figure(figsize=(8, 5))
plt.plot(z, sigmoid(z), 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('z (linear combination)')
plt.ylabel('σ(z) (probability)')
plt.title('The Sigmoid Function')
plt.grid(True, alpha=0.3)
plt.show()

print("Key properties:")
print(f"  σ(-∞) → 0")
print(f"  σ(0)  = 0.5")
print(f"  σ(+∞) → 1")

The Logistic Regression Model

Logistic regression combines a linear model with the sigmoid function. Given input features $\mathbf{x}$, the model computes:

  1. A linear combination: $z = \mathbf{w}^T \mathbf{x}$
  2. Transform through sigmoid: $P(y=1|\mathbf{x}) = \sigma(z) = \sigma(\mathbf{w}^T \mathbf{x})$

The weights $\mathbf{w}$ determine the decision boundary. In 2D, this boundary is a line; in 3D, a plane; in higher dimensions, a hyperplane. The sigmoid function provides a smooth transition from "almost certainly class 0" to "almost certainly class 1" as we cross this boundary.

To make a prediction, we threshold the probability:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1|\mathbf{x}) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Since $\sigma(0) = 0.5$, this is equivalent to predicting class 1 when $\mathbf{w}^T \mathbf{x} \geq 0$.
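
A tiny sketch of these two steps, using made-up weights and inputs purely for illustration:

PYTHON
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative weights (w[0] is the intercept) and a few inputs with a bias term prepended
w = np.array([-1.0, 2.0, 0.5])
X = np.array([[1.0, 0.2, 0.1],
              [1.0, 0.9, 1.5],
              [1.0, 0.1, -2.0]])

z = X @ w                        # linear combination w^T x
p = sigmoid(z)                   # P(y = 1 | x)
y_hat = (p >= 0.5).astype(int)   # threshold at 0.5, equivalent to z >= 0

for zi, pi, yi in zip(z, p, y_hat):
    print(f"z = {zi:6.2f}  P(y=1|x) = {pi:.3f}  prediction = {yi}")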

The Log-Odds Interpretation

There's an elegant interpretation of logistic regression in terms of odds and log-odds. The odds of an event are the ratio of the probability it occurs to the probability it doesn't:

$$\text{odds} = \frac{P(y=1)}{P(y=0)} = \frac{P(y=1)}{1 - P(y=1)}$$

Taking the logarithm gives the log-odds (or logit):

$$\text{logit}(P) = \log\frac{P}{1-P}$$

In logistic regression, the log-odds are a linear function of the features:

$$\log\frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})} = \mathbf{w}^T \mathbf{x}$$

This means a one-unit increase in feature $x_j$ changes the log-odds by $w_j$, or equivalently, multiplies the odds by $e^{w_j}$. This interpretation makes coefficients meaningful for understanding feature effects.

PYTHON
import numpy as np

# Interpreting coefficients
w = np.array([0.5, -0.3, 1.2])  # Example coefficients

print("Coefficient Interpretation (log-odds):")
print(f"  w[0] = {w[0]:.1f}: Each unit increase in x0 increases log-odds by {w[0]:.1f}")
print(f"  w[1] = {w[1]:.1f}: Each unit increase in x1 decreases log-odds by {abs(w[1]):.1f}")
print(f"  w[2] = {w[2]:.1f}: Each unit increase in x2 increases log-odds by {w[2]:.1f}")

print("\nOdds ratio interpretation:")
for i, wi in enumerate(w):
    odds_ratio = np.exp(wi)
    if odds_ratio > 1:
        print(f"  x{i}: Odds multiply by {odds_ratio:.2f} per unit increase")
    else:
        print(f"  x{i}: Odds multiply by {odds_ratio:.2f} (decrease) per unit increase")

Maximum Likelihood Estimation

Unlike linear regression, logistic regression doesn't have a closed-form solution. We find the optimal weights using maximum likelihood estimation (MLE).

The likelihood function measures how probable the observed data is under a given set of parameters. For binary classification with independent observations, the likelihood is:

$$L(\mathbf{w}) = \prod_{i=1}^{n} P(y_i | \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1-y_i}$$

where $\hat{p}_i = \sigma(\mathbf{w}^T \mathbf{x}_i)$ is the predicted probability for example $i$.

Taking the logarithm (which preserves the maximum) gives the log-likelihood:

$$\ell(\mathbf{w}) = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1 - \hat{p}_i) \right]$$

We maximize log-likelihood, or equivalently, minimize the negative log-likelihood (also called binary cross-entropy loss):

$$\mathcal{L}(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1 - \hat{p}_i) \right]$$

This loss function has intuitive properties: when $y_i = 1$, we want $\hat{p}_i$ close to 1 (making $-\log(\hat{p}_i)$ small); when $y_i = 0$, we want $\hat{p}_i$ close to 0 (making $-\log(1-\hat{p}_i)$ small).
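
The short sketch below evaluates the per-example cross-entropy for a few illustrative predicted probabilities (the values are arbitrary) to make those properties concrete: confident, correct predictions incur small loss, while confident, wrong ones are heavily penalized.

PYTHON
import numpy as np

def cross_entropy(y, p):
    """Per-example binary cross-entropy loss."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

for y_true, p_hat in [(1, 0.99), (1, 0.6), (1, 0.01), (0, 0.01), (0, 0.6), (0, 0.99)]:
    print(f"y = {y_true}, p_hat = {p_hat:.2f}: loss = {cross_entropy(y_true, p_hat):.3f}")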

Gradient Descent for Logistic Regression

The gradient of the log-likelihood has a beautifully simple form:

$$\nabla_\mathbf{w} \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) \mathbf{x}_i = \frac{1}{n} \mathbf{X}^T (\hat{\mathbf{p}} - \mathbf{y})$$

This is identical in form to the gradient for linear regression, just with predictions $\hat{p}_i$ instead of $\hat{y}_i$. This simplicity is one reason logistic regression is so elegant.

We update weights iteratively:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla_\mathbf{w} \mathcal{L}$$

where $\alpha$ is the learning rate.

PYTHON
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def logistic_regression_gd(X, y, learning_rate=0.1, n_iterations=1000):
    """Train logistic regression using gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    losses = []
    for i in range(n_iterations):
        # Forward pass
        z = X @ w
        p = sigmoid(z)

        # Compute loss
        epsilon = 1e-15  # Prevent log(0)
        loss = -np.mean(y * np.log(p + epsilon) + (1 - y) * np.log(1 - p + epsilon))
        losses.append(loss)

        # Compute gradient
        gradient = (1 / n_samples) * X.T @ (p - y)

        # Update weights
        w = w - learning_rate * gradient

    return w, losses

# Generate sample data
np.random.seed(42)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)

# Add bias term
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Train
w, losses = logistic_regression_gd(X_b, y, learning_rate=0.5, n_iterations=500)

print(f"Learned weights: {w}")
print(f"Final loss: {losses[-1]:.4f}")

Decision Boundary

The decision boundary is where the predicted probability equals 0.5, which occurs when $\mathbf{w}^T \mathbf{x} = 0$. For 2D features with weights $(w_0, w_1, w_2)$, this gives:

$$w_0 + w_1 x_1 + w_2 x_2 = 0$$

Rearranging: $x_2 = -\frac{w_0}{w_2} - \frac{w_1}{w_2} x_1$

This is a line in 2D (hyperplane in higher dimensions). The coefficients $w_1$ and $w_2$ determine the slope, while $w_0$ shifts the boundary.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate and fit
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)

model = LogisticRegression()
model.fit(X, y)

# Plot decision boundary
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', alpha=0.6)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', alpha=0.6)

# Draw boundary line
x1_range = np.array([X[:, 0].min() - 1, X[:, 0].max() + 1])
w0, w1, w2 = model.intercept_[0], model.coef_[0, 0], model.coef_[0, 1]
x2_boundary = -(w0 + w1 * x1_range) / w2
plt.plot(x1_range, x2_boundary, 'k--', linewidth=2, label='Decision Boundary')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.title('Logistic Regression Decision Boundary')
plt.show()

Multiclass Classification

Logistic regression naturally handles binary classification, but many problems have more than two classes. Two main approaches extend it to multiclass:

One-vs-Rest (OvR): Train $K$ binary classifiers, each distinguishing one class from all others. For prediction, choose the class with the highest probability. Simple but can produce miscalibrated probabilities.

Softmax (Multinomial): Generalize the sigmoid to multiple classes using the softmax function:

$$P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T \mathbf{x}}}$$

This ensures all class probabilities sum to 1. The loss function becomes cross-entropy:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})$$

where $y_{ik}$ is 1 if example $i$ belongs to class $k$, and 0 otherwise.

PYTHON
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Multiclass problem
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=42)

# One-vs-Rest
ovr_model = LogisticRegression(multi_class='ovr', solver='lbfgs')
ovr_model.fit(X, y)

# Multinomial (softmax)
softmax_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
softmax_model.fit(X, y)

# Compare predictions
sample = X[0:1]
print(f"Sample prediction:")
print(f"  OvR probabilities:      {ovr_model.predict_proba(sample)[0].round(3)}")
print(f"  Softmax probabilities:  {softmax_model.predict_proba(sample)[0].round(3)}")
print(f"  Both predict class:     {ovr_model.predict(sample)[0]}")

Regularization in Logistic Regression

Like linear regression, logistic regression can overfit, especially with many features or limited data. Regularization prevents this by penalizing large weights.

L2 regularization (Ridge) adds a penalty term $\lambda \|\mathbf{w}\|_2^2$ to the loss:

$$\mathcal{L}_{L2} = \mathcal{L} + \lambda \sum_{j=1}^{d} w_j^2$$

This shrinks all weights toward zero but rarely to exactly zero.

L1 regularization (Lasso) adds penalty $\lambda \|\mathbf{w}\|_1$:

$$\mathcal{L}_{L1} = \mathcal{L} + \lambda \sum_{j=1}^{d} |w_j|$$

L1 can drive weights exactly to zero, performing feature selection.

Elastic Net combines both: $\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$

In scikit-learn, the regularization strength is controlled by parameter C = 1/λ. Smaller C means stronger regularization.

PYTHON
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# High-dimensional data (many features, prone to overfitting)
X, y = make_classification(n_samples=100, n_features=50, n_informative=10,
                           n_redundant=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different regularization strengths
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, penalty='l2', solver='lbfgs', max_iter=1000)
    model.fit(X_train, y_train)

    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    n_nonzero = np.sum(np.abs(model.coef_) > 0.01)

    print(f"C={C:5.2f}: Train acc={train_acc:.3f}, Test acc={test_acc:.3f}, "
          f"Non-zero weights≈{n_nonzero}")

Evaluating Classification Models

Classification models need different metrics than regression. Key metrics include:

Accuracy: Proportion of correct predictions. Can be misleading with imbalanced classes.

Precision: Of predicted positives, how many are actually positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (Sensitivity): Of actual positives, how many did we predict positive?

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1 Score: Harmonic mean of precision and recall.

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

ROC-AUC: Area under the ROC curve. Measures discrimination ability across all thresholds.

PYTHON
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Classification Metrics:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"  Precision: {precision_score(y_test, y_pred):.3f}")
print(f"  Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred):.3f}")
print(f"  ROC-AUC:   {roc_auc_score(y_test, y_prob):.3f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Choosing the Classification Threshold

By default, we predict class 1 when $P(y=1) \geq 0.5$. But this threshold can be adjusted based on the costs of different errors.

If false negatives are costly (e.g., missing a disease), lower the threshold to increase recall at the cost of precision. If false positives are costly (e.g., spam filter blocking important emails), raise the threshold.

PYTHON
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

print("Effect of threshold on precision/recall:")
print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12} {'F1':<12}")
print("-" * 48)

for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    print(f"{threshold:<12.1f} {prec:<12.3f} {rec:<12.3f} {f1:<12.3f}")

Full Example with Scikit-learn

Here's a complete workflow for logistic regression:

PYTHON
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score

# Load real dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print("Breast Cancer Classification Results:")
print(f"Classes: {data.target_names}")
print(f"\nROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Feature importance (by coefficient magnitude)
model = pipeline.named_steps['classifier']
importance = np.abs(model.coef_[0])
top_features = np.argsort(importance)[::-1][:5]

print("Top 5 most important features:")
for idx in top_features:
    print(f"  {data.feature_names[idx]}: {model.coef_[0][idx]:.4f}")

Key Takeaways

  • Logistic regression models the probability of class membership using the sigmoid function
  • The log-odds of the positive class are a linear function of features
  • Maximum likelihood estimation finds optimal weights by minimizing cross-entropy loss
  • Gradient descent optimizes the convex cross-entropy loss, since no closed-form solution exists
  • The decision boundary is a hyperplane where $\mathbf{w}^T \mathbf{x} = 0$
  • Softmax extends logistic regression to multiclass problems
  • Regularization (L1/L2) prevents overfitting and can perform feature selection
  • Choose the classification threshold based on the relative costs of false positives vs false negatives
  • Evaluate with precision, recall, F1, and ROC-AUC, not just accuracy

5.4 Decision Trees (Intermediate)

Decision Trees

Decision trees are among the most intuitive machine learning algorithms. They learn a hierarchy of if-then-else rules that partition the feature space into regions, making predictions based on which region an input falls into. Unlike linear models that learn smooth decision boundaries, decision trees create axis-aligned rectangular partitions—a fundamentally different approach with unique strengths and weaknesses.

The Tree Structure

A decision tree consists of nodes and branches. Each internal node represents a decision based on a feature: "Is feature $x_j$ less than threshold $t$?" Each branch represents the outcome of that decision, and each leaf node represents a prediction (a class for classification, a value for regression).

To classify a new example, we start at the root and follow branches based on the example's feature values until we reach a leaf. The prediction at that leaf becomes our output.

Consider predicting whether someone will default on a loan. A decision tree might learn rules like: "If income > $50,000 and debt ratio < 0.3, predict no default; otherwise if credit score < 600, predict default..." These rules are human-interpretable, making decision trees popular when explainability matters.

PYTHON
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Train a simple tree
iris = load_iris()
X, y = iris.data[:, :2], iris.target  # Use only 2 features for visualization

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Visualize the tree
plt.figure(figsize=(12, 8))
plot_tree(tree, feature_names=iris.feature_names[:2],
          class_names=iris.target_names, filled=True, rounded=True)
plt.title('Decision Tree for Iris Classification')
plt.show()

How Trees Learn: Recursive Partitioning

Decision trees are built top-down through recursive partitioning. Starting with all training data at the root, the algorithm:

  1. Selects the best feature and threshold to split the data into two groups
  2. Creates child nodes for each group
  3. Recursively repeats on each child until stopping criteria are met

The key question is: what makes a split "good"? We want splits that create groups with homogeneous labels—ideally, each group contains only one class. Two common measures quantify this:

Gini Impurity: Measures the probability of misclassifying a randomly chosen element if it were labeled according to the class distribution in that node.

$$\text{Gini}(S) = 1 - \sum_{k=1}^{K} p_k^2$$

where $p_k$ is the proportion of class $k$ in set $S$. A pure node (all one class) has Gini = 0; a maximally impure binary node (50-50 split) has Gini = 0.5.

Entropy: From information theory, measures the uncertainty in the class distribution.

$$\text{Entropy}(S) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$

A pure node has entropy = 0; a maximally impure binary node has entropy = 1.

For a potential split that divides $S$ into $S_{\text{left}}$ and $S_{\text{right}}$, we compute the weighted average impurity after splitting:

$$\text{Impurity}_{\text{split}} = \frac{|S_{\text{left}}|}{|S|} \text{Impurity}(S_{\text{left}}) + \frac{|S_{\text{right}}|}{|S|} \text{Impurity}(S_{\text{right}})$$

The best split minimizes this value, or equivalently, maximizes the information gain (reduction in impurity).

PYTHON
import numpy as np

def gini_impurity(y):
    """Calculate Gini impurity of a set."""
    if len(y) == 0:
        return 0
    proportions = np.bincount(y) / len(y)
    return 1 - np.sum(proportions ** 2)

def entropy(y):
    """Calculate entropy of a set."""
    if len(y) == 0:
        return 0
    proportions = np.bincount(y) / len(y)
    proportions = proportions[proportions > 0]  # Avoid log(0)
    return -np.sum(proportions * np.log2(proportions))

# Example: Compare impurity measures
y_pure = np.array([1, 1, 1, 1, 1])
y_mixed = np.array([1, 1, 1, 0, 0])
y_balanced = np.array([1, 1, 0, 0])

print("Impurity Comparison:")
print(f"{'Distribution':<20} {'Gini':<10} {'Entropy':<10}")
print("-" * 40)
print(f"{'Pure (all 1s)':<20} {gini_impurity(y_pure):<10.3f} {entropy(y_pure):<10.3f}")
print(f"{'60-40 split':<20} {gini_impurity(y_mixed):<10.3f} {entropy(y_mixed):<10.3f}")
print(f"{'50-50 split':<20} {gini_impurity(y_balanced):<10.3f} {entropy(y_balanced):<10.3f}")

Regression Trees

Decision trees naturally extend to regression by predicting the average target value in each leaf region. Instead of impurity, we minimize variance or mean squared error within nodes:

$$\text{MSE}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2$$

where $\bar{y}_S$ is the mean target value in set $S$.

The best split minimizes the weighted average MSE after splitting. Leaf predictions are simply the average of training targets that fall into that region.

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Generate nonlinear data
np.random.seed(42)
X = np.sort(5 * np.random.rand(200, 1), axis=0)
y = np.sin(X).ravel() + np.random.randn(200) * 0.1

# Fit trees with different depths
plt.figure(figsize=(12, 4))

for i, depth in enumerate([2, 5, 10]):
    plt.subplot(1, 3, i + 1)
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X, y)

    X_test = np.linspace(0, 5, 100).reshape(-1, 1)
    y_pred = tree.predict(X_test)

    plt.scatter(X, y, alpha=0.5, s=10)
    plt.plot(X_test, y_pred, 'r-', linewidth=2)
    plt.title(f'Depth = {depth}')

plt.tight_layout()
plt.show()

Stopping Criteria and Pruning

Without constraints, a decision tree can grow until each leaf contains only one training example—perfectly fitting the training data but likely overfitting badly. We control tree complexity through:

Stopping criteria (pre-pruning): Stop growing before the tree becomes too complex.

  • max_depth: Maximum depth of the tree
  • min_samples_split: Minimum samples required to split a node
  • min_samples_leaf: Minimum samples required in a leaf
  • max_leaf_nodes: Maximum number of leaf nodes
  • min_impurity_decrease: Only split if impurity decreases by at least this amount

Post-pruning: Grow the full tree, then remove branches that don't improve validation performance. Cost-complexity pruning adds a penalty for tree size.

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Effect of max_depth on overfitting:")
print(f"{'max_depth':<12} {'Train Acc':<12} {'Test Acc':<12}")
print("-" * 36)

for depth in [2, 5, 10, 20, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)

    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)

    depth_str = str(depth) if depth else "None (full)"
    print(f"{depth_str:<12} {train_acc:<12.3f} {test_acc:<12.3f}")

Feature Importance

Decision trees provide a natural measure of feature importance: how much each feature contributes to reducing impurity across all splits. Features used higher in the tree or that create larger impurity reductions are more important.

The importance of feature $j$ is computed as:

$$\text{Importance}(j) = \sum_{\text{nodes using } j} \frac{n_{\text{node}}}{n} \cdot \Delta\text{Impurity}$$

where the sum is over all nodes that split on feature $j$, and $\Delta\text{Impurity}$ is the impurity reduction at that split.

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Train tree
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X, y)

# Get feature importances
importances = tree.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # Top 10

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(10), importances[indices])
plt.xticks(range(10), [data.feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Top 10 Feature Importances (Decision Tree)')
plt.tight_layout()
plt.show()

Advantages of Decision Trees

Decision trees have several compelling properties:

Interpretability: The decision rules can be visualized and explained to non-technical stakeholders. This is crucial in regulated industries like healthcare and finance.

No feature scaling required: Trees split on thresholds, so the scale of features doesn't matter. You can mix features with different units without preprocessing.

Handles nonlinear relationships: Trees can approximate any function by creating enough rectangular partitions. They don't assume linearity.

Handles mixed feature types: Both numerical and categorical features can be used directly (with appropriate encoding).

Implicit feature selection: Unimportant features are simply not used in splits.

Fast prediction: Prediction requires only $O(\log n)$ comparisons for a balanced tree.

Limitations of Decision Trees

However, trees have significant weaknesses:

High variance: Small changes in training data can produce very different trees. This instability makes single trees unreliable.

Prone to overfitting: Without careful tuning, trees easily memorize training data. Deep trees with many leaves overfit.

Axis-aligned boundaries: Splits are perpendicular to feature axes. Diagonal boundaries require many splits to approximate, leading to complex trees.

Greedy optimization: Trees are built greedily, making locally optimal splits. This doesn't guarantee a globally optimal tree.

Poor extrapolation: For regression, trees predict constant values in each region. They can't extrapolate beyond the range of training data.

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Demonstrate high variance: seed NumPy so the demo is reproducible
np.random.seed(0)
X = np.random.randn(100, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Train multiple trees on bootstrap samples
print("Variance in tree predictions:")
trees = []
for i in range(5):
    # Bootstrap sample
    idx = np.random.choice(100, 100, replace=True)
    X_boot, y_boot = X[idx], y[idx]

    tree = DecisionTreeClassifier(max_depth=5, random_state=i)
    tree.fit(X_boot, y_boot)
    trees.append(tree)

# Compare predictions on same test point
test_point = np.array([[0.5, 0.3, 0, 0, 0]])
predictions = [tree.predict(test_point)[0] for tree in trees]
print(f"Predictions from 5 trees on same point: {predictions}")
print("Trees can disagree significantly!")

Decision Boundary Visualization

Understanding how decision trees partition space helps build intuition:

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

# Generate data
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Fit tree
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X, y)

# Create mesh for decision boundary
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolors='black')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Tree Boundary (Axis-Aligned Rectangles)')
plt.show()

Practical Tips

Start simple: Begin with shallow trees (max_depth of 3-5) and increase complexity only if underfitting.

Use cross-validation: Tune hyperparameters using cross-validation to find the right balance of complexity.

Consider class weights: For imbalanced datasets, use class_weight='balanced' to adjust for class frequencies.

Look at feature importances: They help understand what the tree learned and can guide feature engineering.

Use trees as building blocks: Single trees are often outperformed by ensemble methods (Random Forests, Gradient Boosting) that combine many trees.

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

tree = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(tree, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")

Key Takeaways

  • Decision trees learn hierarchical if-then-else rules that partition feature space
  • Trees are built by recursive partitioning, choosing splits that maximize impurity reduction
  • Gini impurity and entropy measure node homogeneity for classification
  • Regression trees minimize variance within nodes, predicting averages in leaf regions
  • Control overfitting with stopping criteria (max_depth, min_samples_leaf) or pruning
  • Feature importance shows which features contribute most to predictions
  • Trees are interpretable and require no feature scaling, but have high variance and create axis-aligned boundaries
  • Single trees are often surpassed by ensemble methods that aggregate many trees

5.5 Ensemble Methods Intermediate

Ensemble Methods

Individual models have limitations. Linear models can't capture nonlinearity. Decision trees have high variance. Neural networks require massive data. Ensemble methods overcome these limitations by combining multiple models, leveraging the wisdom of crowds: while individual models make errors, their collective predictions are often more accurate than any single model.

Ensemble methods dominate machine learning competitions and real-world applications. Random Forests and Gradient Boosting are among the most powerful algorithms for structured data.

The Power of Combining Models

Why do ensembles work? Consider a simple scenario: you have three independent classifiers, each with 70% accuracy. If they vote on each prediction, the ensemble is correct when at least two agree:

$$P(\text{majority correct}) = P(3 \text{ correct}) + P(2 \text{ correct}) = 0.7^3 + 3 \cdot 0.7^2 \cdot 0.3 = 0.784$$

The ensemble achieves 78.4% accuracy—better than any individual model. This improvement grows with more diverse models.
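
A quick check of this arithmetic, and of how the benefit grows with more voters (assuming independent errors and the same 70% individual accuracy):

PYTHON
from math import comb

p = 0.7  # accuracy of each individual classifier

def majority_accuracy(k, p):
    """Probability that more than half of k independent classifiers are correct."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1, k + 1))

print(f" 3 voters: {majority_accuracy(3, p):.3f}")   # matches the 0.784 above
for k in [5, 11, 21]:
    print(f"{k:>2} voters: {majority_accuracy(k, p):.3f}")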

The key insight is that ensembles reduce variance without substantially increasing bias. Individual models might overfit to different parts of the data, but their errors tend to cancel out when averaged.

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Generate data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10, random_state=42)

# Single tree (high variance)
single_tree = DecisionTreeClassifier(max_depth=10, random_state=42)
single_scores = cross_val_score(single_tree, X, y, cv=10)

# Simple ensemble: average predictions from multiple trees
n_trees = 10
ensemble_predictions = np.zeros((len(y), n_trees))

for i in range(n_trees):
    tree = DecisionTreeClassifier(max_depth=10, random_state=i)
    tree.fit(X, y)
    ensemble_predictions[:, i] = tree.predict(X)

# Majority vote
ensemble_vote = (ensemble_predictions.mean(axis=1) >= 0.5).astype(int)
ensemble_acc = (ensemble_vote == y).mean()

print(f"Single tree CV accuracy: {single_scores.mean():.3f} (+/- {single_scores.std():.3f})")
print(f"Simple ensemble accuracy (same data): {ensemble_acc:.3f}")
print("\nNote: True ensembles use different training subsets.")

Bagging: Bootstrap Aggregating

Bagging (Bootstrap Aggregating) creates diversity by training each model on a different bootstrap sample of the training data. A bootstrap sample is drawn with replacement, so each sample contains about 63% of unique training examples (some appear multiple times, others not at all).
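
A quick simulation of the ~63% figure (the sample size of 10,000 is arbitrary):

PYTHON
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# A bootstrap sample: n draws with replacement from n original examples
bootstrap_idx = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(bootstrap_idx)) / n

print(f"Unique examples in one bootstrap sample: {unique_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e:               {1 - np.exp(-1):.3f}")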

For prediction:

  • Classification: Majority vote across all models
  • Regression: Average predictions across all models

Bagging reduces variance because different samples lead to different models, and averaging cancels out their individual errors.

PYTHON
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Single tree
single_tree = DecisionTreeClassifier(max_depth=10, random_state=42)
single_tree.fit(X_train, y_train)
single_acc = single_tree.score(X_test, y_test)

# Bagged ensemble
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=50,
    max_samples=1.0,  # Bootstrap sample size
    bootstrap=True,   # Sample with replacement
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_acc = bagging.score(X_test, y_test)

print(f"Single tree accuracy:    {single_acc:.4f}")
print(f"Bagged ensemble (50 trees): {bagging_acc:.4f}")

Random Forests

Random Forests extend bagging with an additional source of randomness: at each split, only a random subset of features is considered. This decorrelates the trees, making the ensemble more diverse and further reducing variance.

The algorithm:

  1. For each tree, draw a bootstrap sample of the training data
  2. Grow a tree, but at each split, randomly select $m$ features and find the best split among only those features
  3. Grow trees fully (no pruning) to reduce bias
  4. Aggregate predictions by voting (classification) or averaging (regression)

The key hyperparameter is max_features ($m$): the number of features to consider at each split. Common defaults are $\sqrt{d}$ for classification and $d/3$ for regression, where $d$ is the total number of features.

PYTHON
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,       # Number of trees
    max_features='sqrt',    # Features to consider at each split
    max_depth=None,         # Grow full trees
    min_samples_leaf=1,     # Allow small leaves
    bootstrap=True,         # Use bootstrap samples
    random_state=42,
    n_jobs=-1               # Parallelize
)
rf.fit(X_train, y_train)

print(f"Random Forest accuracy: {rf.score(X_test, y_test):.4f}")
print(f"Number of trees: {rf.n_estimators}")
print(f"Features per split: {int(np.sqrt(X_train.shape[1]))}")

# Feature importance
importances = rf.feature_importances_
top_idx = np.argsort(importances)[::-1][:5]
print("\nTop 5 features:")
for idx in top_idx:
    print(f"  {data.feature_names[idx]}: {importances[idx]:.4f}")

Out-of-Bag Estimation

In bagging and Random Forests, each tree sees only about 63% of training examples. The remaining 37% (out-of-bag samples) can be used for validation without a separate holdout set.

For each training example, we predict using only trees that didn't include it in their bootstrap sample. This provides an unbiased estimate of generalization error—a free cross-validation!

PYTHON
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Random Forest with OOB estimation
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,  # Enable OOB estimation
    random_state=42,
    n_jobs=-1
)
rf.fit(X, y)

# Compare OOB score with cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5)

print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"5-Fold CV: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print("\nOOB score closely approximates CV without the computational cost!")

Boosting: Sequential Learning

While bagging trains models independently, boosting trains them sequentially. Each new model focuses on correcting the errors of the previous models. This reduces bias (not just variance) and can achieve very low error rates.

The idea: start with a weak learner (slightly better than random). Train it on the data. Find the examples it gets wrong. Train the next model to focus more on those hard examples. Repeat, combining all models with appropriate weights.

AdaBoost

Adaptive Boosting (AdaBoost) maintains weights for each training example. Initially, all examples have equal weight. After each model:

  1. Increase weights of misclassified examples
  2. Decrease weights of correctly classified examples
  3. Train the next model on the reweighted data

Final prediction is a weighted vote, where better models (lower error) get higher voting weight.

PYTHON
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# AdaBoost with decision stumps (depth-1 trees)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learner
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)

print(f"AdaBoost accuracy: {ada.score(X_test, y_test):.4f}")

# Track performance vs number of estimators
staged_scores = list(ada.staged_score(X_test, y_test))
print(f"After 10 trees:  {staged_scores[9]:.4f}")
print(f"After 50 trees:  {staged_scores[49]:.4f}")
print(f"After 100 trees: {staged_scores[99]:.4f}")

Gradient Boosting

Gradient Boosting generalizes boosting to arbitrary loss functions. Instead of reweighting examples, each new model is trained to predict the negative gradient (residuals) of the loss function with respect to the current predictions.

For squared error loss, the negative gradient is simply the residual: $y_i - \hat{y}_i$. Each new tree learns to predict these residuals, gradually improving the ensemble's predictions.

The update rule: $\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta \cdot h_t(x)$

where $\eta$ is the learning rate (shrinkage) and $h_t$ is the new tree predicting residuals.
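
Before turning to scikit-learn's implementation below, here is a minimal from-scratch sketch of this residual-fitting loop for squared-error regression (the sine data, learning rate, and depth-2 trees are illustrative choices):

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=300)

pred = np.full_like(y, y.mean())   # start from the constant mean prediction
eta = 0.1                          # learning rate (shrinkage)
trees = []

for t in range(100):
    residuals = y - pred                          # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += eta * h.predict(X)                    # shrunken additive update
    trees.append(h)
    if t in (0, 9, 99):
        print(f"Round {t + 1:>3}: training MSE = {np.mean((y - pred) ** 2):.4f}")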

PYTHON
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,     # Shrinkage factor
    max_depth=3,           # Shallow trees (unlike RF)
    min_samples_leaf=5,
    subsample=0.8,         # Stochastic GB: use 80% of data per tree
    random_state=42
)
gb.fit(X_train, y_train)

print(f"Gradient Boosting accuracy: {gb.score(X_test, y_test):.4f}")

# Track performance
staged_scores = list(gb.staged_score(X_test, y_test))
print(f"After 10 trees:  {staged_scores[9]:.4f}")
print(f"After 50 trees:  {staged_scores[49]:.4f}")
print(f"After 100 trees: {staged_scores[99]:.4f}")

XGBoost and Modern Implementations

XGBoost (Extreme Gradient Boosting) is an optimized implementation that dominated Kaggle competitions. It includes:

  • Regularization to prevent overfitting
  • Parallel tree construction
  • Handling of missing values
  • Tree pruning based on gain
  • Built-in cross-validation

Other popular implementations include LightGBM (faster, histogram-based) and CatBoost (handles categorical features natively).

PYTHON
# XGBoost example (if installed)
try:
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    xgb_model = xgb.XGBClassifier(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,  # Feature subsampling
        reg_lambda=1.0,        # L2 regularization
        random_state=42,
        eval_metric='logloss'
    )
    xgb_model.fit(X_train, y_train)
    print(f"XGBoost accuracy: {xgb_model.score(X_test, y_test):.4f}")
except ImportError:
    print("XGBoost not installed. Install with: pip install xgboost")

Random Forest vs Gradient Boosting

Both are powerful, but they differ:

| Aspect | Random Forest | Gradient Boosting |
|--------|---------------|-------------------|
| Training | Parallel (fast) | Sequential (slower) |
| Overfitting | Resistant (many deep trees) | Sensitive (shallow trees, regularize) |
| Hyperparameters | Fewer, robust defaults | Many, requires tuning |
| Interpretability | Feature importance | Feature importance, SHAP |
| Performance | Very good baseline | Often best on tabular data |

Use Random Forest when: you want a quick, robust baseline; you have limited time for tuning; you need parallelization.

Use Gradient Boosting when: you need maximum performance; you can invest time in tuning; the data is large and complex.

PYTHON
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
import time

X, y = make_classification(n_samples=2000, n_features=30, n_informative=15, random_state=42)

# Compare training time and accuracy
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

print(f"{'Model':<20} {'CV Accuracy':<15} {'Training Time':<15}")
print("-" * 50)

for name, model in models.items():
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    elapsed = time.time() - start

    print(f"{name:<20} {scores.mean():.4f}          {elapsed:.2f}s")

Stacking

Stacking (Stacked Generalization) combines diverse models using a meta-learner. The base models make predictions, which become features for the meta-model that produces the final prediction.

To avoid leakage, base model predictions must be made on data they weren't trained on—typically using cross-validation.

PYTHON
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base learners
base_learners = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

# Meta-learner
stacking = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,  # Use 5-fold CV for base predictions
    stack_method='predict_proba',
    n_jobs=-1
)

stacking.fit(X_train, y_train)
print(f"Stacking accuracy: {stacking.score(X_test, y_test):.4f}")

# Compare with individual models
for name, model in base_learners:
    model.fit(X_train, y_train)
    print(f"{name} accuracy:       {model.score(X_test, y_test):.4f}")

Practical Tips

Start with Random Forest: It's a robust baseline that requires minimal tuning.

For best performance, use Gradient Boosting: But invest time in hyperparameter tuning (learning rate, max depth, number of trees).

Watch for overfitting in boosting: Use early stopping based on validation error. Monitor train vs validation loss.

Use cross-validation: Ensemble methods can still overfit. Validate properly.

Feature importance: Both methods provide feature importance scores—use them for understanding and feature selection.

Learning rate and n_estimators trade-off: Lower learning rate requires more trees but often improves generalization.
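
For the early-stopping tip above, scikit-learn's GradientBoostingClassifier supports it through validation_fraction and n_iter_no_change. A minimal sketch; the parameter values are illustrative, not tuned:

PYTHON
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hold out 20% of the training data internally and stop when the validation
# score has not improved for 10 consecutive boosting rounds
gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; early stopping usually ends far sooner
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=42
)
gb.fit(X_train, y_train)

print(f"Boosting rounds actually fitted: {gb.n_estimators_}")
print(f"Test accuracy: {gb.score(X_test, y_test):.4f}")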

Key Takeaways

  • Ensembles combine multiple models to reduce variance and improve predictions
  • Bagging creates diversity through bootstrap sampling; predictions are averaged
  • Random Forests add feature randomness to bagging, decorrelating trees
  • Boosting trains models sequentially, each correcting previous errors
  • AdaBoost reweights examples; Gradient Boosting fits residuals
  • XGBoost/LightGBM are optimized, regularized implementations of gradient boosting
  • Random Forests are robust and parallel; Gradient Boosting often achieves best performance with tuning
  • Stacking combines diverse models with a meta-learner
  • Both methods provide feature importance for interpretability

5.6 Support Vector Machines Intermediate

Support Vector Machines

Support Vector Machines (SVMs) are powerful classifiers based on elegant geometric and mathematical principles. They find the optimal decision boundary by maximizing the margin between classes—the distance from the boundary to the nearest training examples. This margin maximization provides theoretical guarantees and often excellent generalization.

SVMs also introduce the kernel trick, which allows them to learn nonlinear decision boundaries by implicitly mapping data to high-dimensional feature spaces. This combination of margin maximization and kernels made SVMs dominant in machine learning for decades.

The Maximum Margin Principle

Consider a binary classification problem where classes are linearly separable. Infinitely many hyperplanes could separate them, but which is best? SVMs choose the hyperplane that maximizes the margin—the distance to the nearest training point from either class.

Why maximize the margin? Intuition: a larger margin means the classifier is more confident; small perturbations in the data won't change the classification. Theory: larger margins correspond to lower VC dimension, providing better generalization bounds.

The decision boundary is defined by $\mathbf{w}^T \mathbf{x} + b = 0$, where $\mathbf{w}$ is the normal vector and $b$ is the bias. For a point $\mathbf{x}_i$, its distance to the hyperplane is:

$$\text{distance} = \frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\|\mathbf{w}\|}$$

The support vectors are the training examples closest to the decision boundary. They "support" the hyperplane—removing other training points wouldn't change it.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Generate linearly separable data
X, y = make_blobs(n_samples=100, centers=2, random_state=42, cluster_std=1.5)

# Train SVM with linear kernel
svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', s=50)

# Highlight support vectors
plt.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
            s=200, facecolors='none', edgecolors='black', linewidths=2)

# Draw decision boundary and margins
ax = plt.gca()
xlim, ylim = ax.get_xlim(), ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
XX, YY = np.meshgrid(xx, yy)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = svm.decision_function(xy).reshape(XX.shape)

# Plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1],
           alpha=0.5, linestyles=['--', '-', '--'])

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM with Maximum Margin (Support Vectors Circled)')
plt.show()

print(f"Number of support vectors: {len(svm.support_vectors_)}")

The SVM Optimization Problem

To maximize the margin, SVMs solve a constrained optimization problem. Let $y_i \in \{-1, +1\}$ denote class labels. We want:

$$\text{maximize} \quad \frac{2}{\|\mathbf{w}\|}$$

subject to $y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$ for all $i$ (all points correctly classified and outside the margin).

Equivalently, we minimize $\frac{1}{2}\|\mathbf{w}\|^2$ (a convex problem):

$$\text{minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$$

This is a quadratic programming problem with a global optimum, solvable efficiently.

Soft Margin SVM

Real data is rarely linearly separable. Soft margin SVMs allow some misclassifications by introducing slack variables $\xi_i \geq 0$:

$$\text{minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i$$

The hyperparameter $C$ controls the trade-off:

  • Large C: Strict margin, few violations (risk of overfitting)
  • Small C: Wider margin, more violations (risk of underfitting)

Points with $\xi_i = 0$ are correctly classified outside the margin. Points with $0 < \xi_i < 1$ are within the margin but correctly classified. Points with $\xi_i \geq 1$ are misclassified.

PYTHON
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Effect of C (regularization strength):")
print(f"{'C':<10} {'Train Acc':<12} {'Test Acc':<12} {'# Support Vectors'}")
print("-" * 50)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_train, y_train)

    train_acc = svm.score(X_train, y_train)
    test_acc = svm.score(X_test, y_test)
    n_sv = len(svm.support_vectors_)

    print(f"{C:<10} {train_acc:<12.4f} {test_acc:<12.4f} {n_sv}")

The Dual Formulation and Support Vectors

The SVM optimization problem is typically solved in its dual form, which reveals why it's called "Support Vector" Machine. The dual involves Lagrange multipliers $\alpha_i$ for each training example:

$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^T \mathbf{x}_j)$$
$$\text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$

The optimal $\mathbf{w}$ is a linear combination of training examples:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$$

Crucially, most $\alpha_i = 0$. Only the support vectors (examples with $\alpha_i > 0$) contribute to $\mathbf{w}$. These are the points on or inside the margin.

This sparsity is computationally advantageous: the model depends only on support vectors, not all training data.
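
A small check of this sparsity using scikit-learn's fitted attributes: for a linear kernel, dual_coef_ stores $\alpha_i y_i$ for the support vectors, so the weight vector can be rebuilt from them alone (the blob data below is illustrative):

PYTHON
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, random_state=42, cluster_std=1.5)

svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)

# w = sum_i alpha_i * y_i * x_i, but only support vectors have nonzero alpha
w_from_sv = svm.dual_coef_ @ svm.support_vectors_

print(f"Support vectors: {len(svm.support_vectors_)} of {len(X)} training points")
print(f"w rebuilt from support vectors matches coef_: {np.allclose(w_from_sv, svm.coef_)}")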

The Kernel Trick

Linear SVMs find hyperplane boundaries. But what about nonlinear relationships? The kernel trick allows SVMs to learn complex boundaries without explicitly computing in high-dimensional space.

The key insight: the dual formulation depends on data only through dot products $\mathbf{x}_i^T \mathbf{x}_j$. If we map data to a higher-dimensional space $\phi(\mathbf{x})$, we need $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.

A kernel function computes this dot product directly: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$

We never explicitly compute $\phi$—just the kernel. This is efficient even when $\phi$ maps to infinite-dimensional spaces!
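
To make this concrete, here is a small sketch for the degree-2 polynomial kernel $(\mathbf{x}^T \mathbf{x}' + 1)^2$ on two-dimensional inputs (the explicit feature map below is one standard choice, with $\gamma = 1$ and $r = 1$ assumed), showing that the explicit mapping and the kernel give the same value:

PYTHON
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D input vector."""
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)        # dot product in the 6-dimensional mapped space
kernel = (x @ z + 1) ** 2         # kernel evaluated directly in the original space

print(f"Explicit feature map: {explicit:.4f}")
print(f"Kernel evaluation:    {kernel:.4f}")   # identical values, no phi needed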

Common kernels:

Linear: $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$

Polynomial: $K(\mathbf{x}, \mathbf{x}') = (\gamma \mathbf{x}^T \mathbf{x}' + r)^d$

RBF (Radial Basis Function): $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$

Sigmoid: $K(\mathbf{x}, \mathbf{x}') = \tanh(\gamma \mathbf{x}^T \mathbf{x}' + r)$

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Generate nonlinearly separable data
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Compare kernels
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

kernels = ['linear', 'poly', 'rbf']
titles = ['Linear Kernel', 'Polynomial Kernel (d=3)', 'RBF Kernel']

for ax, kernel, title in zip(axes, kernels, titles):
    if kernel == 'poly':
        svm = SVC(kernel='poly', degree=3, C=1.0)
    else:
        svm = SVC(kernel=kernel, C=1.0)

    svm.fit(X, y)

    # Create mesh
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-0.5, X[:, 0].max()+0.5, 100),
                         np.linspace(X[:, 1].min()-0.5, X[:, 1].max()+0.5, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolors='black')
    ax.set_title(f'{title}\nAccuracy: {svm.score(X, y):.3f}')

plt.tight_layout()
plt.show()

The RBF Kernel in Depth

The RBF kernel (also called Gaussian kernel) is the most popular choice for nonlinear SVMs. It measures similarity based on Euclidean distance:

$$K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$$

The parameter $\gamma$ controls the kernel's "reach":

  • Large $\gamma$: Tight influence, each point only affects nearby points (complex, wiggly boundary)
  • Small $\gamma$: Wide influence, decisions based on distant points (smoother boundary)

The RBF kernel implicitly maps to an infinite-dimensional feature space, allowing it to fit arbitrarily complex boundaries.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
gammas = [0.1, 1.0, 10.0, 100.0]

for ax, gamma in zip(axes, gammas):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X, y)

    xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100),
                         np.linspace(-1.5, 1.5, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolors='black', s=20)
    ax.set_title(f'gamma={gamma}\nAcc: {svm.score(X, y):.3f}')

plt.tight_layout()
plt.show()

Multiclass Classification

SVMs are inherently binary classifiers. For multiclass problems, two strategies are common:

One-vs-Rest (OvR): Train $K$ binary classifiers, each distinguishing one class from all others. Predict the class whose classifier gives the highest score.

One-vs-One (OvO): Train $\binom{K}{2}$ classifiers, one for each pair of classes. Predict by voting. More classifiers but each uses less data.

Scikit-learn's SVC always uses OvO internally for multiclass problems; the decision_function_shape parameter (default 'ovr') only changes the shape of the returned decision function.

PYTHON
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Multiclass problem
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

svm = SVC(kernel='rbf', C=1.0, gamma='scale', decision_function_shape='ovr')
svm.fit(X_train, y_train)

print(f"Multiclass accuracy: {svm.score(X_test, y_test):.4f}")
print(f"Classes: {iris.target_names}")
print(f"Number of classifiers (OvO): {svm.n_support_.sum()} support vectors total")

Feature Scaling is Critical

SVMs are sensitive to feature scales because the kernel functions depend on distances. Features with larger magnitudes dominate the distance calculations, making other features irrelevant.

Always standardize features before training SVMs (zero mean, unit variance).

PYTHON
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create data with different scales
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
X[:, 0] *= 1000  # Make first feature much larger

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
svm_unscaled = SVC(kernel='rbf')
svm_unscaled.fit(X_train, y_train)
print(f"Without scaling: {svm_unscaled.score(X_test, y_test):.4f}")

# With scaling (pipeline ensures proper handling)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf'))
])
pipeline.fit(X_train, y_train)
print(f"With scaling:    {pipeline.score(X_test, y_test):.4f}")

Hyperparameter Tuning

SVM performance depends critically on hyperparameters:

  • C: Regularization (margin strictness)
  • gamma: RBF kernel width (for RBF kernel)
  • degree: Polynomial degree (for polynomial kernel)

Grid search with cross-validation is standard practice:

PYTHON
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Parameter grid
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.01, 0.1, 1],
    'svm__kernel': ['rbf', 'linear']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")

SVMs for Regression (SVR)

Support Vector Regression uses the same principles for regression. Instead of finding a margin that separates classes, SVR finds a tube of width $\epsilon$ that captures most training points. Points outside the tube incur a loss.

PYTHON
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate regression data
np.random.seed(42)
X = np.sort(5 * np.random.rand(200, 1), axis=0)
y = np.sin(X).ravel() + np.random.randn(200) * 0.1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVR requires scaling
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).ravel()

svr = SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.1)
svr.fit(X_train_scaled, y_train_scaled)

y_pred_scaled = svr.predict(X_test_scaled)
y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

print(f"SVR R²: {r2_score(y_test, y_pred):.4f}")
print(f"SVR RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

When to Use SVMs

Strengths:

  • Effective in high-dimensional spaces
  • Memory efficient (only store support vectors)
  • Versatile through different kernels
  • Strong theoretical foundation

Weaknesses:

  • Slow training for large datasets (roughly $O(n^2)$ to $O(n^3)$ in the number of samples)
  • Requires feature scaling
  • Less effective with noisy data (many overlapping classes)
  • Kernel and hyperparameter selection requires expertise
  • No direct probability estimates (though can be calibrated)

Use SVMs when:

  • Dataset is small to medium-sized
  • Features are high-dimensional
  • Clear margin of separation exists
  • You need a well-founded algorithm with guarantees

Consider alternatives when:

  • Dataset is very large (use linear SVM or gradient boosting)
  • You need probability estimates (use logistic regression or random forests)
  • Interpretability is critical (use decision trees or linear models)

Key Takeaways

  • SVMs maximize the margin between classes for robust classification
  • Support vectors are training examples that define the decision boundary
  • The C parameter controls the trade-off between margin width and training errors
  • The kernel trick enables nonlinear boundaries without explicit feature mapping
  • RBF kernel is most common; gamma controls its flexibility
  • Feature scaling is essential for SVMs due to distance-based computations
  • Hyperparameter tuning (C, gamma, kernel) significantly impacts performance
  • SVMs work well for small to medium datasets with clear class separation
  • For large datasets, consider linear SVMs or alternative algorithms