Intermediate–Advanced · 105 min read

Chapter 7: Ensemble Methods

Random forests, gradient boosting, XGBoost, and model stacking.

Learning Objectives

["Build ensemble models", "Master gradient boosting", "Optimize hyperparameters with Optuna"]


7.1 Train/Validation/Test Split Beginner


How do you know if your machine learning model will work on new data it has never seen? You can't wait until deployment to find out—by then, poor predictions could have real consequences. The solution is to simulate the future by holding out part of your data during development.

Proper data splitting is fundamental to building reliable models. It's also one of the most common sources of mistakes, leading to overly optimistic performance estimates that collapse when models meet reality.

Why Split Data?

Machine learning models are excellent at finding patterns—sometimes too excellent. Given enough capacity, a model can memorize the training data perfectly, achieving zero training error while learning nothing generalizable. This is overfitting: the model captures noise and idiosyncrasies of the training set rather than the underlying signal.

To detect overfitting and estimate real-world performance, we need data the model has never seen during training. This is the purpose of data splitting: reserve some data exclusively for evaluation.

Consider this analogy: if a student sees the exact exam questions while studying, their test score won't reflect their true understanding. We need unseen questions to assess genuine learning.

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generate simple data
np.random.seed(42)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.randn(30) * 0.2

# Fit a high-degree polynomial (will overfit)
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)

# Training error looks great!
train_pred = model.predict(X_poly)
train_mse = mean_squared_error(y, train_pred)
print(f"Training MSE: {train_mse:.4f}")
print("Looks great! But is this model actually good?")
print("We can't know without testing on unseen data.")

The Three-Way Split

The standard approach divides data into three sets:

Training set (60-80%): Used to fit model parameters. The model sees these examples during learning. All gradient computations, weight updates, and pattern extraction happen here.

Validation set (10-20%): Used for model selection and hyperparameter tuning. We evaluate candidate models on validation data to choose the best one. The model never trains on this data, but we use it to make decisions.

Test set (10-20%): Held out completely until final evaluation. Touched only once to get an unbiased estimate of real-world performance. Never used for any decisions during development.

The key insight is that any data used to make decisions becomes "seen" in some sense. If you tune hyperparameters based on test performance, you're implicitly fitting to the test set, making your final estimate optimistic.

PYTHON
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# First split: separate test set (final evaluation only)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2
)

print("Data Split:")
print(f"  Training:   {len(X_train):4d} samples ({len(X_train)/len(X):.0%})")
print(f"  Validation: {len(X_val):4d} samples ({len(X_val)/len(X):.0%})")
print(f"  Test:       {len(X_test):4d} samples ({len(X_test)/len(X):.0%})")
print(f"  Total:      {len(X):4d} samples")

The Workflow

A proper machine learning workflow respects the data splits:

  1. Split data into train/validation/test before any analysis
  2. Explore and preprocess using only training data statistics
  3. Train models on training data
  4. Evaluate and tune using validation data
  5. Select final model based on validation performance
  6. Report final performance on test data (once!)

The test set is like a sealed envelope. Opening it should happen only at the very end, and only once.

PYTHON
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np

# Setup: split data
np.random.seed(42)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

# Step 2: Preprocess (fit on training only!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Steps 3-4: Train and evaluate multiple models on validation
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

print("Model Selection (using validation set):")
best_model = None
best_val_acc = 0

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val_scaled))
    print(f"  {name}: {val_acc:.4f}")

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_model = model
        best_name = name

# Step 5-6: Final evaluation on test set (only once!)
test_acc = accuracy_score(y_test, best_model.predict(X_test_scaled))
print(f"\nSelected model: {best_name}")
print(f"Final test accuracy: {test_acc:.4f}")

Stratified Splitting

For classification with imbalanced classes, random splitting might put all rare class examples in one set by chance. Stratified splitting ensures each split has the same class proportions as the original data.

PYTHON
from sklearn.model_selection import train_test_split
import numpy as np

# Imbalanced dataset: 90% class 0, 10% class 1
np.random.seed(42)
X = np.random.randn(1000, 5)
y = np.array([0]*900 + [1]*100)
np.random.shuffle(y)

# Regular split (might have imbalanced splits)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified split (preserves class proportions)
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Class Distribution Comparison:")
print(f"Original:           {np.mean(y):.2%} class 1")
print(f"Regular test set:   {np.mean(y_test_reg):.2%} class 1")
print(f"Stratified test set: {np.mean(y_test_strat):.2%} class 1")
print("\nStratified splitting preserves class balance!")

Time Series Splitting

For time series data, random splitting violates temporal ordering—you'd be using future data to predict the past. Instead, use temporal splitting: train on earlier data, test on later data.

PYTHON
import numpy as np
import pandas as pd

# Simulated time series
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = np.cumsum(np.random.randn(365)) + np.arange(365) * 0.1

df = pd.DataFrame({'date': dates, 'value': values})

# Wrong: random split (data leakage!)
# train, test = train_test_split(df, test_size=0.2)

# Correct: temporal split
train_size = int(len(df) * 0.8)
train = df.iloc[:train_size]
test = df.iloc[train_size:]

print("Time Series Split:")
print(f"Training: {train['date'].min()} to {train['date'].max()} ({len(train)} days)")
print(f"Test:     {test['date'].min()} to {test['date'].max()} ({len(test)} days)")
print("\nTraining data is strictly before test data.")

Data Leakage

Data leakage occurs when information from outside the training set improperly influences model training. This leads to overly optimistic performance estimates that don't reflect real-world performance.

Common sources of leakage:

Preprocessing on full data: Fitting scalers, encoders, or imputers on all data before splitting. The training set then contains information about the test set.

Feature engineering using future information: For time series, using features that wouldn't be available at prediction time.

Target leakage: Features that encode the target variable or are caused by it rather than predicting it.

Duplicate or near-duplicate data: Same examples appearing in both train and test sets.

PYTHON
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# WRONG: Scale before splitting (leakage!)
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X)  # Sees all data!
X_train_wrong, X_test_wrong, y_train, y_test = train_test_split(
    X_scaled_wrong, y, test_size=0.2, random_state=42
)

# CORRECT: Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_correct = StandardScaler()
X_train_correct = scaler_correct.fit_transform(X_train)  # Fit only on train
X_test_correct = scaler_correct.transform(X_test)  # Transform test

print("Data Leakage Example:")
print("In this simple case, results are similar, but with real data,")
print("leakage can give overly optimistic estimates that fail in production.")
print("\nAlways: Split first, then preprocess!")

Using Pipelines to Prevent Leakage

Scikit-learn Pipelines help prevent leakage by bundling preprocessing with modeling. The pipeline ensures preprocessing is fit only on training data.

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline ensures proper ordering
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(random_state=42))
])

# During cross-validation, preprocessing is fit fresh on each fold's training data
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Final fit and evaluation
pipeline.fit(X_train, y_train)
test_acc = pipeline.score(X_test, y_test)

print("Pipeline ensures no data leakage:")
print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Test Accuracy: {test_acc:.4f}")

When Data is Limited

With small datasets, holding out 20% for validation and 20% for testing leaves little for training. Alternatives:

Cross-validation for validation: Use k-fold cross-validation instead of a fixed validation set. More robust estimates with all data used for training.

Nested cross-validation: Outer loop for test evaluation, inner loop for hyperparameter tuning. Most rigorous but computationally expensive.

Bootstrap estimation: Sample with replacement to create training sets; use out-of-bag samples for validation (sketched after the cross-validation example below).

PYTHON
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

# Small dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# With limited data, use cross-validation instead of fixed splits
model = LogisticRegression(random_state=42)
cv_scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV

print("Cross-Validation with Limited Data:")
print(f"10-Fold CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Each fold uses {len(X)*0.9:.0f} samples for training")
print("\nMore reliable than a single 80/20 split with small data.")

Common Mistakes

Using test data for decisions: If you check test performance and then change your model, the test set is no longer unbiased.

Forgetting to stratify: With imbalanced classes, splits might not represent the true distribution.

Ignoring temporal structure: Random splitting time series creates unrealistic train/test scenarios.

Data leakage in preprocessing: Fitting transformers on full data before splitting.

Multiple test evaluations: Each time you evaluate on the test set and adjust, you're implicitly fitting to it.

PYTHON
# Anti-pattern: "Peeking" at test data
# DON'T DO THIS:

# model_v1.fit(X_train, y_train)
# print(f"Test acc: {model_v1.score(X_test, y_test)}")  # Peek at test
#
# # "Hmm, not good enough, let me try something else"
# model_v2.fit(X_train, y_train)  # Adjusted based on test result
# print(f"Test acc: {model_v2.score(X_test, y_test)}")  # Peek again
#
# # This process makes test accuracy optimistic!

print("Don't peek at test data multiple times!")
print("Use validation set for model development.")
print("Test set is for final, one-time evaluation only.")

Practical Guidelines

Decide splits before looking at data: Avoid the temptation to adjust splits based on results.

Use stratification for classification: Especially with imbalanced classes.

Respect temporal order: For time series, always test on future data.

Use pipelines: They help prevent preprocessing leakage.

Document your splits: Record random seeds and methodology for reproducibility.

When in doubt, use cross-validation: More robust than single splits, especially with limited data.

Key Takeaways

  • Training data fits the model; validation data selects hyperparameters; test data estimates final performance
  • The test set should be touched only once for final evaluation
  • Stratified splitting preserves class proportions in each split
  • Time series requires temporal splitting—train on past, test on future
  • Data leakage occurs when test information influences training; use pipelines to prevent it
  • With limited data, cross-validation is more reliable than fixed splits
  • Common mistakes: peeking at test data, forgetting stratification, preprocessing before splitting
  • Proper splitting is essential for realistic performance estimates

7.2 Classification Metrics Intermediate


Evaluating classification models requires careful consideration of what "good" means for your specific problem. A model with 99% accuracy might be excellent or useless depending on the context. This section dives deep into classification metrics, helping you choose the right ones and interpret them correctly.

Beyond Accuracy: Why One Metric Isn't Enough

Accuracy—the proportion of correct predictions—is intuitive but often misleading. Consider a medical screening test where 1% of patients have the disease. A model that always predicts "healthy" achieves 99% accuracy while being completely useless for its purpose.

Different errors have different costs. Missing a fraud transaction (false negative) might cost thousands of dollars. Blocking a legitimate transaction (false positive) might cost a customer. These costs determine which metrics matter most.

PYTHON
import numpy as np

# Imbalanced scenario: 1% fraud rate
n_transactions = 10000
n_fraud = 100
n_legitimate = 9900

# Model A: Always predicts legitimate
pred_a = np.zeros(n_transactions)
actual = np.array([1]*n_fraud + [0]*n_legitimate)

accuracy_a = np.mean(pred_a == actual)
fraud_caught_a = 0

# Model B: Catches 80% of fraud, 5% false positive rate
tp_b = 80  # Caught fraud
fn_b = 20  # Missed fraud
fp_b = 495  # False alarms (5% of 9900)
tn_b = 9405  # Correct legitimate

accuracy_b = (tp_b + tn_b) / n_transactions
fraud_caught_b = tp_b

print("Fraud Detection Comparison:")
print(f"Model A (always 'legitimate'): {accuracy_a:.1%} accuracy, {fraud_caught_a} frauds caught")
print(f"Model B (actual classifier):   {accuracy_b:.1%} accuracy, {fraud_caught_b} frauds caught")
print("\nModel A has higher accuracy but catches zero fraud!")
print("Accuracy alone is misleading for imbalanced problems.")

The Confusion Matrix in Depth

The confusion matrix reveals exactly where your model succeeds and fails. For binary classification:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Each cell tells a story:

  • TP: Correctly identified positives (hits)
  • TN: Correctly identified negatives (correct rejections)
  • FP: False alarms (Type I error)
  • FN: Misses (Type II error)

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate imbalanced data
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and predict
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Confusion Matrix Analysis:")
print(f"True Negatives:  {tn} (correctly identified negatives)")
print(f"False Positives: {fp} (false alarms)")
print(f"False Negatives: {fn} (missed positives)")
print(f"True Positives:  {tp} (correctly identified positives)")

# Visualize
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

Precision, Recall, and Their Tradeoff

Precision answers: "When the model says positive, how often is it right?"

$$\text{Precision} = \frac{TP}{TP + FP}$$

High precision means few false positives. Important when false positives are costly (spam filter blocking important email, recommending surgery that isn't needed).

Recall (Sensitivity) answers: "Of all actual positives, how many did we find?"

$$\text{Recall} = \frac{TP}{TP + FN}$$

High recall means few false negatives. Important when false negatives are costly (missing cancer, failing to detect fraud).

These metrics are inversely related: improving one typically worsens the other. The classification threshold controls this tradeoff.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Precision-Recall at different thresholds
thresholds = [0.3, 0.5, 0.7]
print("Precision-Recall Tradeoff:")
print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12}")
print("-" * 36)

for t in thresholds:
    y_pred = (y_proba >= t).astype(int)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    print(f"{t:<12.1f} {prec:<12.3f} {rec:<12.3f}")

print("\nLower threshold → Higher recall, lower precision")
print("Higher threshold → Higher precision, lower recall")

F1 Score and Fβ Variants

The F1 score balances precision and recall using their harmonic mean:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The harmonic mean penalizes extreme imbalances—you can't get a good F1 by excelling at one while failing at the other.

The Fβ score generalizes this with a parameter β:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

  • β = 1: Standard F1, equal weight
  • β = 2: F2, recall is twice as important
  • β = 0.5: F0.5, precision is twice as important

PYTHON
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score
import numpy as np

# Simulated predictions
np.random.seed(42)
y_true = np.array([1]*100 + [0]*400)
y_pred = np.array([1]*70 + [0]*30 + [1]*50 + [0]*350)  # Some errors

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)
f05 = fbeta_score(y_true, y_pred, beta=0.5)

print("F-Score Variants:")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 (balanced):      {f1:.3f}")
print(f"F2 (recall-heavy):  {f2:.3f}")
print(f"F0.5 (prec-heavy):  {f05:.3f}")

print("\nUse F2 when missing positives is costly (medical diagnosis)")
print("Use F0.5 when false alarms are costly (spam filter)")

ROC Curve and AUC

The ROC curve (Receiver Operating Characteristic) shows the tradeoff between true positive rate (recall) and false positive rate across all thresholds.

  • True Positive Rate (TPR) = TP/(TP+FN) = Recall
  • False Positive Rate (FPR) = FP/(FP+TN)

The curve shows how much "true positive catching" you get for each unit of "false positive acceptance."

AUC (Area Under the ROC Curve) summarizes this curve:

  • AUC = 1.0: Perfect separation
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random (flip predictions!)

AUC measures ranking quality: the probability that a randomly chosen positive ranks higher than a randomly chosen negative.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare two models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

plt.figure(figsize=(8, 6))

for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
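
The ranking interpretation of AUC can be checked directly on synthetic scores: comparing every positive score against every negative score (ties counted as half) reproduces roc_auc_score. A small sketch:

PYTHON
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = y_true * 0.8 + rng.normal(0, 0.7, size=500)   # noisy scores, higher for positives

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs where the positive ranks higher
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(f"Pairwise ranking probability: {pairwise:.4f}")
print(f"roc_auc_score:                {roc_auc_score(y_true, scores):.4f}")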

Precision-Recall Curves

For imbalanced datasets, the PR curve is often more informative than ROC. ROC can look good even when a model performs poorly on the minority class because TN dominates the FPR calculation.

The PR curve plots precision vs recall at various thresholds. Average Precision (AP) summarizes the curve.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Highly imbalanced data
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP = {ap:.3f})')
plt.axhline(y=y_test.mean(), color='r', linestyle='--',
            label=f'Baseline (positive rate = {y_test.mean():.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve (Imbalanced Data)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Class imbalance: {y_test.mean():.1%} positive")
print(f"Average Precision: {ap:.3f}")
print("\nPR curves are more informative for imbalanced data than ROC.")

Multiclass Classification Metrics

For multiple classes, metrics can be computed and aggregated in different ways:

Macro averaging: Compute metric for each class, then average. Treats all classes equally regardless of size.

Micro averaging: Aggregate TP, FP, FN across all classes, then compute metric. Weighted by class frequency.

Weighted averaging: Average per-class metrics weighted by class support.

PYTHON
from sklearn.metrics import classification_report, f1_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

print("Averaging Methods for F1:")
print(f"  Macro:    {f1_score(y_test, y_pred, average='macro'):.3f} (all classes equal)")
print(f"  Micro:    {f1_score(y_test, y_pred, average='micro'):.3f} (weighted by frequency)")
print(f"  Weighted: {f1_score(y_test, y_pred, average='weighted'):.3f} (weighted by support)")

Choosing the Right Metric

| Scenario                 | Recommended Metrics    |
|--------------------------|------------------------|
| Balanced classes         | Accuracy, F1           |
| Imbalanced classes       | PR-AUC, F1, Recall     |
| False positives costly   | Precision, F0.5        |
| False negatives costly   | Recall, F2             |
| Ranking/scoring needed   | ROC-AUC                |
| Probability calibration  | Log Loss, Brier Score  |

PYTHON
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score, log_loss)
import numpy as np

def comprehensive_evaluation(y_true, y_pred, y_proba):
    """Evaluate with multiple metrics."""
    return {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1': f1_score(y_true, y_pred),
        'ROC-AUC': roc_auc_score(y_true, y_proba),
        'PR-AUC': average_precision_score(y_true, y_proba),
        'Log Loss': log_loss(y_true, y_proba)
    }

# Example usage
np.random.seed(42)
y_true = np.array([1]*50 + [0]*450)
y_proba = np.clip(np.where(y_true == 1, 0.7, 0.2) + np.random.randn(500)*0.2, 0.01, 0.99)
y_pred = (y_proba >= 0.5).astype(int)

metrics = comprehensive_evaluation(y_true, y_pred, y_proba)

print("Comprehensive Evaluation:")
for name, value in metrics.items():
    print(f"  {name:<12}: {value:.4f}")

Key Takeaways

  • Accuracy alone is insufficient, especially for imbalanced classes
  • Precision measures how often positive predictions are correct (few false positives); recall measures how many actual positives are found (few false negatives)
  • There's a fundamental tradeoff between precision and recall
  • F1 balances both; Fβ allows custom weighting
  • ROC-AUC measures ranking quality across all thresholds
  • PR-AUC is more informative for imbalanced data
  • For multiclass, choose averaging method based on whether class sizes matter
  • Choose metrics based on business costs of different error types
  • Always examine the confusion matrix for detailed error analysis

7.3 Regression Metrics Intermediate


Regression models predict continuous values—house prices, temperatures, stock returns, customer lifetime value. Evaluating these predictions requires different metrics than classification. We need to measure how far predictions are from actual values and whether these errors are acceptable for our application.

This section covers the essential regression metrics, their interpretations, and when to use each one.

Mean Squared Error (MSE)

Mean Squared Error is the average of squared differences between predictions and actual values:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

MSE squares each error, so large errors contribute disproportionately. This makes MSE sensitive to outliers—a few bad predictions can dominate the metric.

The units of MSE are the square of the target units (e.g., dollars² for price prediction), making direct interpretation difficult.

PYTHON
import numpy as np
from sklearn.metrics import mean_squared_error

# Example predictions
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 190, 260, 290])

# Manual calculation
errors = y_true - y_pred
squared_errors = errors ** 2
mse = np.mean(squared_errors)

print("MSE Calculation:")
print(f"True values:     {y_true}")
print(f"Predictions:     {y_pred}")
print(f"Errors:          {errors}")
print(f"Squared errors:  {squared_errors}")
print(f"MSE:             {mse:.2f}")
print(f"MSE (sklearn):   {mean_squared_error(y_true, y_pred):.2f}")

Root Mean Squared Error (RMSE)

RMSE is the square root of MSE, bringing the metric back to the original units:

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

RMSE is more interpretable than MSE. If predicting house prices in dollars, RMSE is also in dollars—you can say "the model is off by about $X on average" (though it's not exactly the average error).

Like MSE, RMSE penalizes large errors heavily due to the squaring.

PYTHON
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000, 350000, 500000, 450000, 600000])  # House prices
y_pred = np.array([210000, 340000, 480000, 460000, 580000])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print("House Price Prediction:")
print(f"MSE:  {mse:,.0f} dollars²  (hard to interpret)")
print(f"RMSE: ${rmse:,.0f}  (interpretable: typical error magnitude)")

Mean Absolute Error (MAE)

Mean Absolute Error averages the absolute differences:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

MAE treats all errors equally—a $100 error contributes the same whether it's on a $200 or $200,000 prediction. This makes MAE more robust to outliers than MSE/RMSE.

MAE is directly interpretable as the average prediction error in the target's units.

PYTHON
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100, 150, 200, 250, 1000])  # Last value is an outlier
y_pred = np.array([110, 140, 190, 260, 500])   # Prediction misses the outlier badly

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print("Impact of Outliers:")
print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
print(f"\nRMSE is {rmse/mae:.1f}x larger than MAE due to the outlier.")
print("MAE is more robust to outliers than RMSE.")

MSE vs MAE: When to Use Each

Use MSE/RMSE when:

  • Large errors are disproportionately bad
  • You want to penalize outlier predictions more
  • Optimizing squared loss (many algorithms do this naturally)

Use MAE when:

  • All errors should be weighted equally
  • Outliers in the data shouldn't dominate evaluation
  • You want a robust, interpretable metric

PYTHON
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Two models with same MAE but different error distributions
y_true = np.array([100]*10)

# Model A: consistent small errors
y_pred_a = np.array([90, 110, 90, 110, 90, 110, 90, 110, 90, 110])

# Model B: mostly perfect, one big error
y_pred_b = np.array([100, 100, 100, 100, 100, 100, 100, 100, 100, 0])

print("Same MAE, Different Error Patterns:")
print(f"\nModel A (consistent errors):")
print(f"  MAE:  {mean_absolute_error(y_true, y_pred_a):.1f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_true, y_pred_a)):.1f}")

print(f"\nModel B (one big error):")
print(f"  MAE:  {mean_absolute_error(y_true, y_pred_b):.1f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_true, y_pred_b)):.1f}")

print("\nRMSE penalizes the concentrated error in Model B more heavily.")

R-Squared (Coefficient of Determination)

R² measures the proportion of variance in the target that's explained by the model:

$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{SS_{res}}{SS_{tot}}$$

where $SS_{res}$ is the residual sum of squares and $SS_{tot}$ is the total sum of squares.

Interpretation:

  • R² = 1: Perfect predictions
  • R² = 0: Model predicts as well as always guessing the mean
  • R² < 0: Model is worse than predicting the mean (possible on test data!)

R² is scale-independent, making it useful for comparing models across different datasets. However, it can be misleading—a high R² doesn't mean predictions are accurate enough for the application.

PYTHON
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Generate data with known relationship
np.random.seed(42)
X = np.linspace(0, 10, 100)
y_true = 2 * X + 5 + np.random.randn(100) * 2  # True relationship with noise

# Different quality predictions
y_perfect = 2 * X + 5  # Perfect model
y_good = 2 * X + 5 + np.random.randn(100) * 1  # Good model
y_mean = np.full(100, y_true.mean())  # Always predict mean
y_bad = np.random.randn(100) * 10  # Random noise

print("R² Interpretation:")
print(f"Perfect model:     R² = {r2_score(y_true, y_perfect):.4f}")
print(f"Good model:        R² = {r2_score(y_true, y_good):.4f}")
print(f"Mean prediction:   R² = {r2_score(y_true, y_mean):.4f}")
print(f"Random prediction: R² = {r2_score(y_true, y_bad):.4f}")
print("\nR² < 0 means worse than predicting the mean!")

Adjusted R-Squared

Standard R² always increases (or stays the same) when adding more features, even useless ones. Adjusted R² penalizes for additional features:

$$R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}$$

where $n$ is sample size and $p$ is number of features.

Adjusted R² decreases if a new feature doesn't improve the model enough to justify its inclusion.

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    """Calculate adjusted R-squared."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Generate data
np.random.seed(42)
n = 100
X_useful = np.random.randn(n, 2)  # 2 useful features
X_noise = np.random.randn(n, 5)   # 5 noise features
y = X_useful @ [3, -2] + np.random.randn(n)

# Model with useful features only
model1 = LinearRegression().fit(X_useful, y)
r2_1 = r2_score(y, model1.predict(X_useful))
adj_r2_1 = adjusted_r2(r2_1, n, 2)

# Model with all features (useful + noise)
X_all = np.hstack([X_useful, X_noise])
model2 = LinearRegression().fit(X_all, y)
r2_2 = r2_score(y, model2.predict(X_all))
adj_r2_2 = adjusted_r2(r2_2, n, 7)

print("R² vs Adjusted R²:")
print(f"\n2 useful features:")
print(f"  R²:     {r2_1:.4f}")
print(f"  Adj R²: {adj_r2_1:.4f}")

print(f"\n7 features (2 useful + 5 noise):")
print(f"  R²:     {r2_2:.4f}")
print(f"  Adj R²: {adj_r2_2:.4f}")

print("\nR² increased with noise features, but Adjusted R² decreased!")

Mean Absolute Percentage Error (MAPE)

MAPE expresses errors as percentages of actual values:

$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

MAPE is intuitive ("the model is off by X%") but has problems:

  • Undefined when actual values are zero
  • Asymmetric: overestimates are penalized more than underestimates
  • Problematic with values near zero

PYTHON
import numpy as np

def mape(y_true, y_pred):
    """Calculate Mean Absolute Percentage Error."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

y_true = np.array([100, 200, 300, 400, 500])
y_pred = np.array([110, 180, 290, 420, 480])

mape_value = mape(y_true, y_pred)
mae_value = np.mean(np.abs(y_true - y_pred))

print(f"MAPE: {mape_value:.1f}%")
print(f"MAE:  {mae_value:.1f}")
print("\nMAPE provides percentage interpretation.")

# Problem with small values
y_small = np.array([0.1, 0.2, 0.3])
y_pred_small = np.array([0.15, 0.25, 0.35])
print(f"\nWith small values, same absolute error gives huge MAPE:")
print(f"MAPE: {mape(y_small, y_pred_small):.1f}%")

Symmetric Mean Absolute Percentage Error (SMAPE)

SMAPE addresses MAPE's asymmetry by using the average of actual and predicted in the denominator:

$$\text{SMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|) / 2}$$

SMAPE is bounded between 0% and 200% and treats over- and under-predictions more symmetrically.

PYTHON
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (defined earlier, repeated so this block runs standalone)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape(y_true, y_pred):
    """Calculate Symmetric Mean Absolute Percentage Error."""
    return np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))) * 100

y_true = np.array([100, 200, 300])

# Overestimate by 50%
y_over = y_true * 1.5
# Underestimate by 33% (same ratio)
y_under = y_true / 1.5

print("MAPE Asymmetry vs SMAPE Symmetry:")
print(f"Overestimate by 50%:  MAPE={mape(y_true, y_over):.1f}%, SMAPE={smape(y_true, y_over):.1f}%")
print(f"Underestimate by 33%: MAPE={mape(y_true, y_under):.1f}%, SMAPE={smape(y_true, y_under):.1f}%")
print("\nSMAPE treats over and under-predictions more symmetrically.")

Comparing Metrics on Real Data

PYTHON
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Train models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

print("Regression Metrics Comparison (California Housing):")
print(f"{'Model':<20} {'RMSE':<10} {'MAE':<10} {'R²':<10}")
print("-" * 50)

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{name:<20} {rmse:<10.4f} {mae:<10.4f} {r2:<10.4f}")

Choosing the Right Metric

| Metric | Units      | Outlier Sensitivity | Best For                                |
|--------|------------|---------------------|-----------------------------------------|
| MSE    | Squared    | High                | Optimization, penalizing large errors   |
| RMSE   | Original   | High                | Interpretable error magnitude           |
| MAE    | Original   | Low                 | Robust evaluation, median-like behavior |
| R²     | Unitless   | Medium              | Comparing across datasets               |
| MAPE   | Percentage | Varies              | Business reporting (with caveats)       |

PYTHON
import numpy as np

# Decision guide
print("Choosing a Regression Metric:")
print("\n1. Do large errors matter more than small ones?")
print("   Yes → Use RMSE or MSE")
print("   No  → Use MAE")

print("\n2. Is your data prone to outliers?")
print("   Yes → Use MAE (more robust)")
print("   No  → RMSE is fine")

print("\n3. Need to compare across different scales?")
print("   Yes → Use R² or MAPE")
print("   No  → RMSE or MAE work")

print("\n4. Target values near zero?")
print("   Yes → Avoid MAPE (use MAE or RMSE)")
print("   No  → MAPE can work for percentage interpretation")

Key Takeaways

  • MSE squares errors, penalizing large errors heavily; units are squared
  • RMSE is the square root of MSE, interpretable in original units
  • MAE averages absolute errors, more robust to outliers
  • R² measures variance explained, useful for comparing across datasets
  • Adjusted R² penalizes unnecessary features
  • MAPE gives percentage errors but has issues with small values
  • Choose based on: outlier sensitivity, interpretability needs, and error cost structure
  • Multiple metrics provide a fuller picture than any single number

7.4 Hyperparameter Tuning Intermediate


Machine learning models have two types of values that affect their behavior: parameters learned from data (like neural network weights) and hyperparameters set before training (like learning rate or tree depth). While the model figures out parameters automatically, choosing hyperparameters falls on you—and the choice matters enormously. A poorly tuned model might achieve 70% accuracy while the same architecture, properly tuned, reaches 95%.

Hyperparameter tuning is the process of systematically searching for the best configuration. This section covers the major strategies, from simple grid search to sophisticated Bayesian optimization.

Parameters vs Hyperparameters

Parameters are learned during training. In linear regression, the coefficients are parameters. In neural networks, the weights and biases are parameters. You don't set these—the optimization algorithm discovers them by minimizing the loss function on training data.

Hyperparameters control the learning process itself. They're set before training begins and remain fixed throughout. Examples include learning rate (how big each optimization step is), regularization strength (how much to penalize complexity), number of trees in a random forest, and the architecture of a neural network.

The distinction matters because different search strategies apply. Parameters are found by gradient descent or similar optimization on training data. Hyperparameters require a meta-level search—trying different configurations and evaluating each one's performance.

PYTHON
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Small dataset so these illustrative fits can run standalone
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression example
ridge = Ridge(alpha=1.0)  # alpha is a HYPERPARAMETER
ridge.fit(X_train, y_train)
# After fitting, ridge.coef_ contains PARAMETERS (learned from data)

# Random Forest example
rf = RandomForestClassifier(
    n_estimators=100,     # HYPERPARAMETER: number of trees
    max_depth=10,         # HYPERPARAMETER: tree depth limit
    min_samples_split=5,  # HYPERPARAMETER: minimum samples to split
    random_state=42
)
rf.fit(X_train, y_train)
# The decision rules in each tree are PARAMETERS (learned from data)

print("Hyperparameters are set before training.")
print("Parameters are discovered during training.")

The Hyperparameter Search Problem

Finding optimal hyperparameters is challenging for several reasons. First, the search space is often large—even a model with just five hyperparameters, each with ten possible values, has 100,000 combinations. Second, evaluating each combination requires training a model and measuring validation performance, which can take minutes to hours per configuration. Third, the relationship between hyperparameters and performance is complex and non-smooth, with interactions between different settings.

The naive approach—trying every combination—quickly becomes infeasible. We need smarter strategies.

PYTHON
import numpy as np

# Simple example: just 3 hyperparameters with 10 values each
hp1_options = 10
hp2_options = 10
hp3_options = 10

total_combinations = hp1_options * hp2_options * hp3_options
time_per_eval = 30  # seconds

total_time_hours = (total_combinations * time_per_eval) / 3600

print(f"Total combinations: {total_combinations:,}")
print(f"Time per evaluation: {time_per_eval} seconds")
print(f"Total search time: {total_time_hours:.1f} hours")
print("\nThis is just 3 hyperparameters!")
print("Real models often have 10+ hyperparameters.")

Grid Search

Grid search is the simplest approach: define a grid of values for each hyperparameter and try every combination. It's exhaustive and guaranteed to find the best configuration within the grid.

The strength of grid search is completeness—it won't miss any combination you specified. The weakness is inefficiency. Grid search spends equal time on promising and unpromising regions of the search space. If the first hyperparameter you try happens to be optimal, you still evaluate every other value.

Grid search works well when you have few hyperparameters, reasonable guesses about their ranges, and enough computational budget to try many combinations.

PYTHON
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the grid
param_grid = {
    'C': [0.1, 1, 10, 100],           # Regularization strength
    'kernel': ['rbf', 'poly'],         # Kernel type
    'gamma': ['scale', 'auto', 0.1, 0.01]  # Kernel coefficient
}

# Count combinations
n_combinations = 4 * 2 * 4  # 32 combinations

# Grid search with cross-validation
grid_search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,         # Use all CPU cores
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nTotal combinations evaluated: {n_combinations}")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")

Random Search

Random search samples hyperparameter configurations randomly from specified distributions. Instead of exhaustively covering a grid, it explores the space more broadly but less thoroughly.

Surprisingly, random search often outperforms grid search in practice. The reason is that some hyperparameters matter more than others. Grid search wastes evaluations on unimportant dimensions. If only two of five hyperparameters significantly affect performance, grid search tries many combinations that differ only in the unimportant three. Random search, by contrast, gives each evaluation a fresh value for every hyperparameter.

Random search is also more flexible. You can specify continuous distributions (like log-uniform for learning rate) rather than discrete grids.

PYTHON
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform
import numpy as np

# Define distributions for hyperparameters
param_distributions = {
    'n_estimators': randint(50, 500),           # Discrete uniform
    'max_depth': randint(3, 30),                # Discrete uniform
    'min_samples_split': randint(2, 20),        # Discrete uniform
    'min_samples_leaf': randint(1, 10),         # Discrete uniform
    'max_features': uniform(0.1, 0.9)           # Continuous uniform
}

# Random search
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,         # Try 50 random configurations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"\nConfigurations evaluated: 50")
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
print(f"Test accuracy: {random_search.score(X_test, y_test):.4f}")

Why Random Search Often Wins

The key insight is that hyperparameters have varying importance. In many problems, one or two hyperparameters dominate performance while others have minimal effect. Grid search allocates evaluations evenly across all dimensions, wasting budget on unimportant ones. Random search spreads samples across the important dimensions regardless of what else is being varied.

Consider a case where learning rate matters a lot and momentum barely matters. A 10x10 grid search gives you only 10 distinct learning rate values, repeated 10 times with different momentum. Random search with 100 samples gives you 100 distinct learning rate values. You're more likely to find a good learning rate.

PYTHON
import numpy as np
import matplotlib.pyplot as plt

# Simulate: hyperparameter 1 matters, hyperparameter 2 doesn't
np.random.seed(42)

def true_performance(h1, h2):
    # Performance depends mostly on h1, barely on h2
    return -((h1 - 0.3)**2) + 0.01 * np.sin(h2 * 10) + np.random.randn() * 0.01

# Grid search: 9 points (3x3)
h1_grid = [0.1, 0.5, 0.9]
h2_grid = [0.1, 0.5, 0.9]
grid_h1 = [h1 for h1 in h1_grid for h2 in h2_grid]
grid_h2 = [h2 for h1 in h1_grid for h2 in h2_grid]
grid_perf = [true_performance(h1, h2) for h1, h2 in zip(grid_h1, grid_h2)]

# Random search: 9 points
rand_h1 = np.random.uniform(0, 1, 9)
rand_h2 = np.random.uniform(0, 1, 9)
rand_perf = [true_performance(h1, h2) for h1, h2 in zip(rand_h1, rand_h2)]

print("Grid search explores only 3 unique values of h1 (the important one)")
print(f"Unique h1 values in grid: {sorted(set(grid_h1))}")

print(f"\nRandom search explores 9 unique values of h1")
print(f"Unique h1 values in random: {sorted(rand_h1)[:5]}...")

print(f"\nBest performance - Grid: {max(grid_perf):.4f}")
print(f"Best performance - Random: {max(rand_perf):.4f}")

Bayesian Optimization

Bayesian optimization uses past evaluations to guide future ones. Instead of sampling blindly (random) or exhaustively (grid), it builds a probabilistic model of how hyperparameters relate to performance. This model predicts which configurations are likely to be good, allowing intelligent exploration.

The algorithm maintains two things: a surrogate model (often Gaussian Process or Tree-structured Parzen Estimator) that estimates performance across the hyperparameter space, and an acquisition function that balances exploration (trying uncertain regions) with exploitation (trying regions predicted to be good).

Bayesian optimization shines when evaluations are expensive. By learning from previous trials, it often finds good configurations in far fewer evaluations than random or grid search.

PYTHON
# Using Optuna for Bayesian optimization
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Suppress Optuna's logging for cleaner output
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    # Define hyperparameter search space
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0)
    }

    model = GradientBoostingClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy')
    return scores.mean()

# Run optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30, show_progress_bar=True)

print(f"\nBest trial accuracy: {study.best_trial.value:.4f}")
print(f"Best hyperparameters:")
for key, value in study.best_params.items():
    print(f"  {key}: {value}")

Understanding Optuna's Search Strategy

Optuna uses a Tree-structured Parzen Estimator (TPE) by default. TPE models the distribution of hyperparameters that led to good performance versus poor performance. It then samples configurations likely to come from the "good" distribution.

The key advantage is that Optuna learns which regions of the search space are promising. Early trials explore broadly; later trials focus on promising regions while occasionally exploring elsewhere to avoid missing good areas.

PYTHON
import optuna
import matplotlib.pyplot as plt

# Create a study and examine the search process
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=False)

# Plot optimization history
trials = study.trials
trial_numbers = [t.number for t in trials]
trial_values = [t.value for t in trials]
best_so_far = [max(trial_values[:i+1]) for i in range(len(trial_values))]

print("Optimization Progress:")
print(f"Trial 10 best: {best_so_far[9]:.4f}")
print(f"Trial 25 best: {best_so_far[24]:.4f}")
print(f"Trial 50 best: {best_so_far[49]:.4f}")
print("\nBayesian optimization typically finds good solutions early,")
print("then continues improving as it refines its model of the search space.")

Cross-Validation in Hyperparameter Tuning

Every hyperparameter configuration must be evaluated somehow. Using a single validation split is fast but noisy—you might pick hyperparameters that happen to work well on that particular split but don't generalize. Cross-validation provides more reliable estimates at the cost of more computation.

The standard approach is k-fold cross-validation within the hyperparameter search. For each configuration, train k models and average their validation scores. This reduces the variance in your performance estimates, making the search less likely to be fooled by lucky or unlucky splits.

PYTHON
from sklearn.model_selection import cross_val_score, train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Compare single split vs cross-validation for hyperparameter evaluation
def evaluate_single_split(model, X_train, y_train, X_val, y_val):
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)

def evaluate_cv(model, X, y, seed, n_splits=5):
    # Shuffle the folds with a different seed each time so the comparison is fair
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()

# Test with different random seeds to see variance
single_split_scores = []
cv_scores = []

for seed in range(10):
    # Different random split each time
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=seed
    )

    model = RandomForestClassifier(n_estimators=100, random_state=42)

    single_split_scores.append(evaluate_single_split(model, X_tr, y_tr, X_val, y_val))
    cv_mean, cv_std = evaluate_cv(model, X_train, y_train, seed)
    cv_scores.append(cv_mean)

print("Variance in evaluation methods:")
print(f"Single split std: {np.std(single_split_scores):.4f}")
print(f"CV score std:     {np.std(cv_scores):.4f}")
print("\nCross-validation gives more stable estimates.")

Nested Cross-Validation

When you use cross-validation for hyperparameter tuning, you're optimizing on the validation folds. This means your final CV score is optimistically biased—the hyperparameters were chosen to maximize it. For an unbiased estimate of generalization performance, you need nested cross-validation.

Nested CV has two loops: an outer loop for performance estimation and an inner loop for hyperparameter selection. The outer loop holds out test folds that are never used for any training or tuning decisions. The inner loop performs hyperparameter search on the remaining data.

This is computationally expensive but provides honest performance estimates.

PYTHON
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.svm import SVC
import numpy as np

# Inner CV: hyperparameter tuning
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Outer CV: performance estimation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Simple param grid for demonstration
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1]}

# Nested cross-validation
nested_scores = []

for train_idx, test_idx in outer_cv.split(X_train):
    X_tr, X_te = X_train[train_idx], X_train[test_idx]
    y_tr, y_te = y_train[train_idx], y_train[test_idx]

    # Inner loop: find best hyperparameters
    grid_search = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')
    grid_search.fit(X_tr, y_tr)

    # Evaluate on outer test fold
    score = grid_search.score(X_te, y_te)
    nested_scores.append(score)

print("Nested Cross-Validation Results:")
print(f"Scores per fold: {[f'{s:.4f}' for s in nested_scores]}")
print(f"Mean: {np.mean(nested_scores):.4f} (+/- {np.std(nested_scores)*2:.4f})")
print("\nThis is an unbiased estimate of generalization performance.")

Early Stopping

Training deep learning models involves hyperparameters like the number of epochs. Rather than fixing this, early stopping monitors validation performance during training and stops when it stops improving. This automatically tunes the training duration.

Early stopping prevents overfitting (training stops before the model memorizes noise) and saves computation (no wasted epochs after convergence). It's one of the most important techniques in deep learning.

The typical setup uses a patience parameter: stop if validation loss hasn't improved for N consecutive epochs.

PYTHON
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# Simulate early stopping behavior
np.random.seed(42)

# Generate data
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# MLPClassifier with early stopping
mlp_early = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    max_iter=500,
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # Hold out 10% for validation
    n_iter_no_change=10,       # Patience: stop after 10 epochs with no improvement
    random_state=42
)

mlp_early.fit(X_train, y_train)

print(f"Early stopping after {mlp_early.n_iter_} iterations")
print(f"Test accuracy: {mlp_early.score(X_test, y_test):.4f}")

# Without early stopping
mlp_full = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    max_iter=500,
    early_stopping=False,
    random_state=42
)

mlp_full.fit(X_train, y_train)

print(f"\nWithout early stopping: {mlp_full.n_iter_} iterations")
print(f"Test accuracy: {mlp_full.score(X_test, y_test):.4f}")

Learning Rate Schedules

Learning rate is often the most important hyperparameter in neural network training. Rather than using a fixed value, learning rate schedules adjust the rate during training—typically starting higher and decreasing over time.

Common schedules include step decay (reduce by factor every N epochs), exponential decay, and cosine annealing. The intuition is that large steps help early exploration while small steps help fine-tuning near convergence.

PYTHON
import numpy as np

def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_drop=10):
    return initial_lr * (drop ** (epoch // epochs_drop))

def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

def cosine_annealing(epoch, initial_lr=0.1, total_epochs=50):
    return initial_lr * (1 + np.cos(np.pi * epoch / total_epochs)) / 2

print("Learning Rate Schedules:")
print(f"Step decay at epoch 25: {step_decay(25):.4f}")
print(f"Exponential decay at epoch 25: {exponential_decay(25):.4f}")
print(f"Cosine annealing at epoch 25: {cosine_annealing(25):.4f}")

print("\nSchedules help training by:")
print("- Large initial LR: fast progress early")
print("- Decreasing LR: fine-tuning near convergence")

Hyperparameter Importance

Not all hyperparameters matter equally. Understanding which ones are important helps focus your search. Some techniques provide hyperparameter importance analysis.

PYTHON
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Create study with importance analysis
study = optuna.create_study(direction='maximize')

def objective_importance(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
    }

    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X_train, y_train, cv=3).mean()

study.optimize(objective_importance, n_trials=50, show_progress_bar=False)

# Get hyperparameter importance
importance = optuna.importance.get_param_importances(study)

print("Hyperparameter Importance:")
for param, imp in importance.items():
    print(f"  {param}: {imp:.4f}")

print("\nFocus tuning effort on the most important hyperparameters.")

Practical Guidelines

Start with defaults: Many libraries have well-tuned defaults. Start there and only tune if needed.

Use random search first: It's simple, parallelizable, and often good enough. Switch to Bayesian optimization only if you need to squeeze out more performance.

Define sensible ranges: Use domain knowledge to constrain the search. Learning rates below 1e-6 or above 1 are rarely useful.

Use log scales for certain parameters: Learning rate, regularization strength, and similar parameters often benefit from log-uniform sampling, which gives each order of magnitude (e.g., 0.001, 0.01, 0.1) equal probability.

Start coarse, then refine: Do a broad search first, then narrow the range around promising regions.

Consider computational budget: With limited time, fewer hyperparameters with wider ranges may beat many hyperparameters with narrow ranges.

PYTHON
from scipy.stats import loguniform, uniform, randint

# Good parameter distributions for common hyperparameters
param_distributions_guide = {
    # Learning rate: log scale, typically 1e-5 to 1e-1
    'learning_rate': loguniform(1e-5, 1e-1),

    # Regularization: log scale
    'alpha': loguniform(1e-6, 1e1),

    # Tree depth: integer, usually 3-30
    'max_depth': randint(3, 31),

    # Number of estimators: integer, usually 50-500
    'n_estimators': randint(50, 501),

    # Dropout: uniform, usually 0.1-0.5
    'dropout': uniform(0.1, 0.4),

    # Batch size: powers of 2
    'batch_size': [16, 32, 64, 128, 256]
}

print("Hyperparameter Distribution Guide:")
for param, dist in param_distributions_guide.items():
    if hasattr(dist, 'rvs'):
        samples = dist.rvs(size=3)
        print(f"  {param}: e.g., " + ", ".join(f"{s:g}" for s in samples))
    else:
        print(f"  {param}: choose from {dist}")

Key Takeaways

  • Hyperparameters control learning; parameters are learned from data
  • Grid search is exhaustive but inefficient for many hyperparameters
  • Random search often outperforms grid search by exploring important dimensions better
  • Bayesian optimization (Optuna) learns from past trials to search intelligently
  • Cross-validation provides stable hyperparameter evaluation; nested CV gives unbiased performance estimates
  • Early stopping automatically tunes training duration and prevents overfitting
  • Learning rate schedules adapt the learning rate during training
  • Not all hyperparameters matter equally—focus on the important ones
  • Start with defaults, use random search broadly, refine with Bayesian optimization if needed

7.5 Bias-Variance Tradeoff Intermediate

Bias-Variance Tradeoff

Every machine learning model faces a fundamental tension. Simple models might miss important patterns in the data—they're too rigid to capture the underlying relationship. Complex models might capture noise along with signal—they're so flexible they fit random fluctuations. This tension is the bias-variance tradeoff, one of the most important concepts in machine learning.

Understanding this tradeoff explains why adding more parameters doesn't always help, why training accuracy can be misleading, and how to diagnose and fix model performance issues.

What is Bias?

Bias is the error from overly simplistic assumptions in the learning algorithm. A model with high bias fails to capture the true relationship between features and target—it underfits the data.

Imagine trying to fit a straight line to data that follows a curved pattern. No matter how much data you have or how carefully you train, the linear model will systematically miss the curvature. This systematic error is bias. The model is too simple for the problem.

High-bias models produce similar predictions regardless of the training data. They're consistent, but consistently wrong.

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# True relationship is quadratic
np.random.seed(42)
X = np.linspace(0, 1, 50).reshape(-1, 1)
y_true = 3 * X.ravel()**2 - 2 * X.ravel() + 1
y = y_true + np.random.randn(50) * 0.1

# Fit linear model (high bias - underfitting)
linear_model = LinearRegression()
linear_model.fit(X, y)
y_pred_linear = linear_model.predict(X)

# Fit quadratic model (appropriate complexity)
quad_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
quad_model.fit(X, y)
y_pred_quad = quad_model.predict(X)

# Calculate errors
mse_linear = np.mean((y - y_pred_linear)**2)
mse_quad = np.mean((y - y_pred_quad)**2)

print("High Bias Example (Linear fit to quadratic data):")
print(f"Linear model MSE: {mse_linear:.4f}")
print(f"Quadratic model MSE: {mse_quad:.4f}")
print("\nThe linear model has high bias - it can't capture the curved relationship.")

What is Variance?

Variance is the error from sensitivity to small fluctuations in the training data. A model with high variance changes dramatically based on which particular samples it sees—it overfits to noise in the training set.

If you train a complex model on different random subsets of your data and get wildly different predictions, that's high variance. The model is fitting not just the underlying pattern but also the random noise specific to each training set.

High-variance models fit training data very well but fail on new data. They've memorized rather than learned.

PYTHON
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Generate data with noise
np.random.seed(42)
X = np.linspace(0, 1, 20).reshape(-1, 1)
y_true = np.sin(2 * np.pi * X.ravel())

# Train high-degree polynomial on different data samples
predictions = []

for seed in range(5):
    # Different noise each time
    np.random.seed(seed)
    y = y_true + np.random.randn(20) * 0.3

    # High degree polynomial (high variance)
    model = make_pipeline(PolynomialFeatures(15), LinearRegression())
    model.fit(X, y)

    X_test = np.linspace(0, 1, 100).reshape(-1, 1)
    predictions.append(model.predict(X_test))

# Check variance in predictions
predictions = np.array(predictions)
variance_at_each_point = np.var(predictions, axis=0)

print("High Variance Example (Degree-15 polynomial):")
print(f"Average prediction variance: {np.mean(variance_at_each_point):.4f}")
print("Different training samples → wildly different predictions")
print("\nThis is overfitting: the model fits noise specific to each training set.")

The Bias-Variance Decomposition

The expected prediction error can be mathematically decomposed into three components:

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

Bias² measures how far the average prediction is from the true value. It's the systematic error that remains even with infinite data.

Variance measures how much predictions vary across different training sets. It's the error from fitting to noise.

Irreducible noise is the inherent randomness in the problem—error that no model can eliminate because the true relationship itself has randomness.

The tradeoff arises because reducing one often increases the other. Simple models have high bias but low variance. Complex models have low bias but high variance.

PYTHON
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def bias_variance_simulation(degree, n_simulations=100):
    """Estimate bias and variance for polynomial of given degree."""
    np.random.seed(42)

    # True function
    def f_true(x):
        return np.sin(2 * np.pi * x)

    # Test points
    X_test = np.array([[0.3], [0.5], [0.7]])
    y_true_test = f_true(X_test.ravel())

    predictions = []

    for _ in range(n_simulations):
        # Generate new training data each time
        X_train = np.random.uniform(0, 1, 30).reshape(-1, 1)
        y_train = f_true(X_train.ravel()) + np.random.randn(30) * 0.3

        # Fit polynomial
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)

        predictions.append(model.predict(X_test))

    predictions = np.array(predictions)

    # Average prediction
    avg_pred = np.mean(predictions, axis=0)

    # Bias: difference between average prediction and true value
    bias_sq = np.mean((avg_pred - y_true_test)**2)

    # Variance: spread of predictions
    variance = np.mean(np.var(predictions, axis=0))

    return bias_sq, variance

print("Bias-Variance Decomposition by Model Complexity:")
print(f"{'Degree':<10} {'Bias²':<12} {'Variance':<12} {'Total':<12}")
print("-" * 46)

for degree in [1, 3, 5, 10, 15]:
    bias_sq, variance = bias_variance_simulation(degree)
    total = bias_sq + variance
    print(f"{degree:<10} {bias_sq:<12.4f} {variance:<12.4f} {total:<12.4f}")

print("\nLow complexity → high bias, low variance")
print("High complexity → low bias, high variance")
print("Sweet spot minimizes total error")

Underfitting and Overfitting

Underfitting occurs when a model is too simple to capture the underlying pattern. Signs include poor performance on both training and test data. The model has high bias.

Overfitting occurs when a model is too complex and captures noise along with signal. Signs include excellent training performance but poor test performance—a large gap between the two. The model has high variance.

The goal is to find the sweet spot: a model complex enough to capture the true pattern but not so complex that it fits noise.

PYTHON
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate data
np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = 3*X.ravel()**2 - X.ravel() + np.random.randn(100) * 0.2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Underfitting vs Overfitting:")
print(f"{'Degree':<10} {'Train MSE':<12} {'Test MSE':<12} {'Gap':<12} {'Diagnosis':<15}")
print("-" * 61)

for degree in [1, 2, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    gap = test_mse - train_mse

    if train_mse > 0.1 and test_mse > 0.1:
        diagnosis = "Underfitting"
    elif gap > 0.1:
        diagnosis = "Overfitting"
    else:
        diagnosis = "Good fit"

    print(f"{degree:<10} {train_mse:<12.4f} {test_mse:<12.4f} {gap:<12.4f} {diagnosis:<15}")

Model Complexity and the U-Curve

As model complexity increases, training error typically decreases (more flexibility to fit the data) while test error first decreases then increases (U-shaped curve). The optimal complexity is where test error is minimized.

This U-curve is a visual representation of the bias-variance tradeoff. On the left (low complexity), high bias dominates. On the right (high complexity), high variance dominates. The minimum is the optimal balance.

PYTHON
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate data
np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + np.random.randn(100) * 0.3

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

degrees = range(1, 16)
train_errors = []
test_errors = []

for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)

    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

best_degree = degrees[np.argmin(test_errors)]

print("Model Complexity vs Error (U-Curve):")
print(f"Optimal polynomial degree: {best_degree}")
print(f"Min test MSE: {min(test_errors):.4f}")
print(f"\nDegree 1 test MSE: {test_errors[0]:.4f} (underfitting)")
print(f"Degree 15 test MSE: {test_errors[14]:.4f} (overfitting)")
print(f"\nToo simple → high bias → poor test error")
print(f"Too complex → high variance → poor test error")

Learning Curves

Learning curves plot model performance versus training set size. They're diagnostic tools for understanding whether your model has bias or variance issues.

High bias (underfitting): Both training and validation errors are high and converge quickly. Adding more data doesn't help much because the model can't capture the pattern regardless of sample size.

High variance (overfitting): Training error is low but validation error is high, with a large gap between them. Adding more data helps because it gives the model more signal to learn from and less ability to memorize noise.

PYTHON
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

# Generate data
np.random.seed(42)
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + np.random.randn(200) * 0.3

# High bias model (too simple)
simple_model = LinearRegression()

# High variance model (too complex)
complex_model = make_pipeline(PolynomialFeatures(15), LinearRegression())

# Appropriate model
good_model = make_pipeline(PolynomialFeatures(4), Ridge(alpha=0.1))

def analyze_learning_curve(model, X, y, name):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5,
        train_sizes=np.linspace(0.2, 1.0, 5),
        scoring='neg_mean_squared_error'
    )

    train_mse = -train_scores.mean(axis=1)
    val_mse = -val_scores.mean(axis=1)

    print(f"\n{name}:")
    print(f"  Smallest training set - Train: {train_mse[0]:.3f}, Val: {val_mse[0]:.3f}")
    print(f"  Largest training set  - Train: {train_mse[-1]:.3f}, Val: {val_mse[-1]:.3f}")
    print(f"  Gap at largest: {val_mse[-1] - train_mse[-1]:.3f}")

    return train_mse[-1], val_mse[-1], val_mse[-1] - train_mse[-1]

print("Learning Curve Analysis:")

analyze_learning_curve(simple_model, X, y, "High Bias (Linear)")
analyze_learning_curve(complex_model, X, y, "High Variance (Degree 15)")
analyze_learning_curve(good_model, X, y, "Good Fit (Degree 4 + Regularization)")

print("\n\nDiagnostic Guide:")
print("High bias: Both errors high, small gap → need more complex model")
print("High variance: Low train error, high val error → need simpler model or more data")

Regularization: Controlling Variance

Regularization adds a penalty term to the loss function that discourages complex models. This reduces variance at the cost of slightly increased bias—often a good tradeoff.

L2 Regularization (Ridge) adds a penalty proportional to the squared magnitude of coefficients:

$$\text{Loss} = \text{MSE} + \lambda \sum w_i^2$$

This shrinks all coefficients toward zero but rarely makes them exactly zero. It's good when all features are potentially relevant.

L1 Regularization (Lasso) adds a penalty proportional to the absolute magnitude:

$$\text{Loss} = \text{MSE} + \lambda \sum |w_i|$$

This can shrink coefficients exactly to zero, effectively performing feature selection. It's good when you suspect many features are irrelevant.

The regularization strength λ controls the bias-variance tradeoff: higher λ means more regularization, lower variance, higher bias.

PYTHON
import numpy as np
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate data with many features (some irrelevant)
np.random.seed(42)
n_samples, n_features = 100, 50
X = np.random.randn(n_samples, n_features)
# Only first 5 features matter
y = X[:, :5] @ np.array([3, -2, 1, 0.5, -1]) + np.random.randn(n_samples) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Regularization Comparison:")
print(f"{'Model':<30} {'Train MSE':<12} {'Test MSE':<12}")
print("-" * 54)

# No regularization (will overfit)
lr = LinearRegression()
lr.fit(X_train, y_train)
print(f"{'Linear (no regularization)':<30} {mean_squared_error(y_train, lr.predict(X_train)):<12.4f} {mean_squared_error(y_test, lr.predict(X_test)):<12.4f}")

# L2 regularization
for alpha in [0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    print(f"{f'Ridge (alpha={alpha})':<30} {mean_squared_error(y_train, ridge.predict(X_train)):<12.4f} {mean_squared_error(y_test, ridge.predict(X_test)):<12.4f}")

# L1 regularization
for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train, y_train)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"{f'Lasso (alpha={alpha}, {n_nonzero} features)':<30} {mean_squared_error(y_train, lasso.predict(X_train)):<12.4f} {mean_squared_error(y_test, lasso.predict(X_test)):<12.4f}")

Cross-Validation for Regularization Tuning

The regularization strength is a hyperparameter that must be tuned. Cross-validation finds the value that best balances bias and variance on held-out data.

PYTHON
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Use same data as before
np.random.seed(42)
n_samples, n_features = 100, 50
X = np.random.randn(n_samples, n_features)
y = X[:, :5] @ np.array([3, -2, 1, 0.5, -1]) + np.random.randn(n_samples) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Cross-validated Ridge
alphas = np.logspace(-4, 4, 50)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)

print("Cross-Validated Regularization:")
print(f"Best Ridge alpha: {ridge_cv.alpha_:.4f}")
print(f"Test MSE: {mean_squared_error(y_test, ridge_cv.predict(X_test)):.4f}")

# Cross-validated Lasso
lasso_cv = LassoCV(cv=5, max_iter=10000)
lasso_cv.fit(X_train, y_train)

print(f"\nBest Lasso alpha: {lasso_cv.alpha_:.4f}")
print(f"Test MSE: {mean_squared_error(y_test, lasso_cv.predict(X_test)):.4f}")
print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")

Other Techniques to Manage Variance

Beyond explicit regularization, several techniques help control variance:

Early stopping: Stop training before the model fully fits the training data. Acts as implicit regularization.

Dropout (neural networks): Randomly drop units during training, preventing co-adaptation and reducing overfitting.

Data augmentation: Generate additional training examples through transformations. More data means less overfitting.

Ensemble methods: Combine multiple models to reduce variance. Random forests average many high-variance trees to get a lower-variance ensemble.

PYTHON
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate data
np.random.seed(42)
X = np.random.randn(200, 10)
y = X[:, 0]**2 + X[:, 1] - X[:, 2] + np.random.randn(200) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single deep tree (high variance)
tree = DecisionTreeRegressor(max_depth=20, random_state=42)
tree.fit(X_train, y_train)

# Bagging (average multiple trees)
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=20),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)

# Random forest
rf = RandomForestRegressor(n_estimators=50, max_depth=20, random_state=42)
rf.fit(X_train, y_train)

print("Ensemble Methods Reduce Variance:")
print(f"{'Model':<25} {'Train MSE':<12} {'Test MSE':<12}")
print("-" * 49)
print(f"{'Single Tree':<25} {mean_squared_error(y_train, tree.predict(X_train)):<12.4f} {mean_squared_error(y_test, tree.predict(X_test)):<12.4f}")
print(f"{'Bagging (50 trees)':<25} {mean_squared_error(y_train, bagging.predict(X_train)):<12.4f} {mean_squared_error(y_test, bagging.predict(X_test)):<12.4f}")
print(f"{'Random Forest (50 trees)':<25} {mean_squared_error(y_train, rf.predict(X_train)):<12.4f} {mean_squared_error(y_test, rf.predict(X_test)):<12.4f}")

print("\nEnsembles average out the variance of individual high-variance models.")

Diagnosing and Fixing Issues

When your model isn't performing well, diagnose whether it's a bias or variance problem:

High training error + high test error (both poor):

  • Problem: High bias (underfitting)
  • Solutions: Use more complex model, add features, reduce regularization

Low training error + high test error (large gap):

  • Problem: High variance (overfitting)
  • Solutions: Use simpler model, add regularization, get more data, use ensemble methods

Both errors acceptable and close:

  • The model is well-tuned!

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def diagnose_model(train_mse, test_mse, threshold=0.1):
    """Diagnose bias vs variance issues."""
    gap = test_mse - train_mse

    print(f"Train MSE: {train_mse:.4f}")
    print(f"Test MSE:  {test_mse:.4f}")
    print(f"Gap:       {gap:.4f}")

    if train_mse > threshold and test_mse > threshold:
        print("\nDiagnosis: HIGH BIAS (underfitting)")
        print("Solutions:")
        print("  - Use more complex model")
        print("  - Add more/better features")
        print("  - Reduce regularization")
    elif gap > threshold:
        print("\nDiagnosis: HIGH VARIANCE (overfitting)")
        print("Solutions:")
        print("  - Use simpler model")
        print("  - Add regularization")
        print("  - Get more training data")
        print("  - Use ensemble methods")
    else:
        print("\nDiagnosis: Good balance!")
        print("Model complexity is appropriate for the data.")

# Example
np.random.seed(42)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + np.random.randn(100) * 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("=== Scenario 1: High Bias ===")
model = LinearRegression()
model.fit(X_train, y_train)
diagnose_model(
    mean_squared_error(y_train, model.predict(X_train)),
    mean_squared_error(y_test, model.predict(X_test))
)

print("\n=== Scenario 2: High Variance ===")
model = make_pipeline(PolynomialFeatures(15), LinearRegression())
model.fit(X_train, y_train)
diagnose_model(
    mean_squared_error(y_train, model.predict(X_train)),
    mean_squared_error(y_test, model.predict(X_test))
)

print("\n=== Scenario 3: Good Balance ===")
model = make_pipeline(PolynomialFeatures(4), Ridge(alpha=0.01))
model.fit(X_train, y_train)
diagnose_model(
    mean_squared_error(y_train, model.predict(X_train)),
    mean_squared_error(y_test, model.predict(X_test))
)

The Double Descent Phenomenon

In deep learning, a fascinating phenomenon challenges the classical bias-variance picture. As model complexity increases far beyond the interpolation threshold (where the model can fit training data perfectly), test error sometimes decreases again—the "double descent" curve.

This happens because extremely overparameterized models can find simpler solutions despite having the capacity for complex ones. The dynamics of gradient descent in high dimensions favor certain types of solutions.

This doesn't invalidate the bias-variance tradeoff but shows that the relationship is more nuanced in very high-dimensional settings typical of modern deep learning.

PYTHON
print("Double Descent Phenomenon:")
print("\nClassical View:")
print("  Complexity ↑ → First test error ↓, then test error ↑")
print("  (The U-curve)")

print("\nModern Deep Learning View:")
print("  As complexity goes far beyond interpolation:")
print("  Test error can decrease again!")

print("\nImplications:")
print("  - Very large models can generalize well")
print("  - Implicit regularization from optimization matters")
print("  - The bias-variance story is nuanced in high dimensions")
print("  - Still: regularization and good practices help")

Practical Guidelines

Start simple: Begin with a simple model to establish a baseline. Add complexity only if needed.

Compare training and test error: The gap tells you about variance. Both being high tells you about bias.

Use validation properly: Never tune on test data. Use cross-validation for reliable estimates.

Regularize by default: Some regularization rarely hurts and often helps. L2 is a safe default.

Get more data if possible: More data almost always helps with variance (overfitting).

Use ensembles for high-variance models: Bagging and boosting can dramatically reduce variance.

Plot learning curves: They reveal whether more data would help or if you need a different model.

PYTHON
print("Quick Reference - Bias vs Variance:")
print("\n" + "="*60)
print(f"{'Symptom':<30} {'Likely Issue':<15} {'Solution':<20}")
print("="*60)
print(f"{'Both errors high':<30} {'High bias':<15} {'More complexity':<20}")
print(f"{'Large train-test gap':<30} {'High variance':<15} {'Regularize/simplify':<20}")
print(f"{'More data helps a lot':<30} {'High variance':<15} {'Get more data':<20}")
print(f"{'More data doesn\\'t help':<30} {'High bias':<15} {'Better features':<20}")
print("="*60)

Key Takeaways

  • Bias is systematic error from overly simple models (underfitting)
  • Variance is error from sensitivity to training data (overfitting)
  • Total error = Bias² + Variance + Irreducible noise
  • As complexity increases, bias decreases but variance increases
  • The optimal model balances bias and variance to minimize total error
  • Learning curves diagnose whether you have bias or variance problems
  • Regularization (L1, L2) controls variance by penalizing complexity
  • Ensemble methods reduce variance by averaging multiple models
  • High training error + high test error = underfitting (need more complexity)
  • Low training error + high test error = overfitting (need more regularization)
  • The bias-variance tradeoff is fundamental to machine learning—understanding it helps you build better models