
Chapter 6: Classification Algorithms

KNN, decision trees, SVMs, and Naive Bayes classifiers.

Libraries covered: Scikit-learn

Learning Objectives

["Understand distance-based classifiers", "Build decision trees", "Apply kernel methods"]


6.5 Classification Metrics and Evaluation

Evaluating classification models requires more than just counting correct predictions. Different applications have different costs for different types of errors. A spam filter that blocks important emails is worse than one that lets some spam through. A medical test that misses cancer is worse than one that triggers unnecessary follow-up tests.

Understanding classification metrics helps you choose the right metric for your problem and interpret model performance correctly.

The Confusion Matrix

The confusion matrix is the foundation of classification evaluation. For binary classification, it's a 2×2 table showing how predictions compare to actual labels:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

  • True Positive (TP): Correctly predicted positive
  • True Negative (TN): Correctly predicted negative
  • False Positive (FP): Incorrectly predicted positive (Type I error)
  • False Negative (FN): Incorrectly predicted negative (Type II error)

All classification metrics derive from these four numbers.

PYTHON
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Generate and split data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=2, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTN={cm[0,0]}, FP={cm[0,1]}")
print(f"FN={cm[1,0]}, TP={cm[1,1]}")

# Visualize
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
plt.title('Confusion Matrix')
plt.show()
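
To make that claim concrete, here is a small hand computation (a sketch with hypothetical counts, not taken from the model above) showing how accuracy, precision, and recall all come out of the four cells:

PYTHON
# Hypothetical confusion-matrix counts for illustration
TP, TN, FP, FN = 80, 850, 50, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(f"Accuracy:  {accuracy:.3f}")   # (80 + 850) / 1000 = 0.930
print(f"Precision: {precision:.3f}")  # 80 / 130 ≈ 0.615
print(f"Recall:    {recall:.3f}")     # 80 / 100 = 0.800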

Accuracy: The Intuitive but Flawed Metric

Accuracy is the proportion of correct predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy is intuitive but can be misleading with imbalanced classes. If 99% of emails are not spam, a model that predicts "not spam" for everything achieves 99% accuracy while being completely useless.

PYTHON
import numpy as np
from sklearn.metrics import accuracy_score

# Imbalanced scenario: 95% negative, 5% positive
y_true = np.array([0]*950 + [1]*50)

# Naive model: always predict negative
y_pred_naive = np.zeros(1000)

# Slightly better model: catches 40 of the 50 positives but makes 20 false positives
y_pred_better = np.array([0]*930 + [1]*20 + [0]*10 + [1]*40)

print("The Accuracy Paradox:")
print(f"Always predict negative: {accuracy_score(y_true, y_pred_naive):.1%} accuracy")
print(f"Better model:            {accuracy_score(y_true, y_pred_better):.1%} accuracy")
print("The naive model's high accuracy hides the fact that it detects zero positives!")
print("\nAccuracy is misleading with imbalanced classes.")

Precision and Recall

Precision answers: "Of all positive predictions, how many were actually positive?"

$$\text{Precision} = \frac{TP}{TP + FP}$$

High precision means few false positives. Important when false positives are costly (spam filter blocking legitimate email).

Recall (also called sensitivity or true positive rate) answers: "Of all actual positives, how many did we correctly identify?"

$$\text{Recall} = \frac{TP}{TP + FN}$$

High recall means few false negatives. Important when false negatives are costly (cancer screening, fraud detection).

PYTHON
from sklearn.metrics import precision_score, recall_score
import numpy as np

# Simulated predictions
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0])

# TP=2, FP=1, FN=3, TN=4
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print("Precision vs Recall:")
print(f"True labels:  {y_true}")
print(f"Predictions:  {y_pred}")
print(f"\nPrecision: {precision:.2f} (2 of 3 positive predictions were correct)")
print(f"Recall:    {recall:.2f} (2 of 5 actual positives were found)")

The Precision-Recall Tradeoff

There's an inherent tradeoff between precision and recall. Lowering the classification threshold catches more positives (higher recall) but also increases false positives (lower precision). Raising the threshold increases precision but decreases recall.

The right balance depends on the application. Cancer screening should favor recall (don't miss cancer). Spam filtering might favor precision (don't block important emails).

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Generate data
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model and get probabilities
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Vary threshold
thresholds = np.arange(0.1, 0.9, 0.05)
precisions, recalls = [], []

for t in thresholds:
    y_pred = (y_proba >= t).astype(int)
    precisions.append(precision_score(y_test, y_pred, zero_division=0))
    recalls.append(recall_score(y_test, y_pred))

plt.figure(figsize=(8, 5))
plt.plot(thresholds, precisions, 'b-', label='Precision')
plt.plot(thresholds, recalls, 'r-', label='Recall')
plt.xlabel('Classification Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("As threshold increases:")
print("  - Precision tends to increase (fewer false positives)")
print("  - Recall tends to decrease (more false negatives)")

F1 Score: Balancing Precision and Recall

The F1 score is the harmonic mean of precision and recall:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$

The harmonic mean penalizes extreme values—both precision and recall must be high for a good F1 score. F1 ranges from 0 to 1, with 1 being perfect.

More generally, the Fβ score allows weighting:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

  • β = 1: Equal weight (standard F1)
  • β = 2: Recall is twice as important (F2)
  • β = 0.5: Precision is twice as important (F0.5)

PYTHON
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score
import numpy as np

y_true = np.array([1]*100 + [0]*900)  # 10% positive
y_pred = np.array([1]*80 + [0]*20 + [1]*50 + [0]*850)  # Some errors

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)
f05 = fbeta_score(y_true, y_pred, beta=0.5)

print("F-scores:")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 (balanced):           {f1:.3f}")
print(f"F2 (recall-weighted):    {f2:.3f}")
print(f"F0.5 (precision-weighted): {f05:.3f}")

ROC Curve and AUC

The ROC curve (Receiver Operating Characteristic) plots True Positive Rate vs False Positive Rate at various thresholds:

  • True Positive Rate (TPR) = Recall = TP/(TP+FN)
  • False Positive Rate (FPR) = FP/(FP+TN)

A random classifier produces a diagonal line. Better classifiers curve toward the upper-left corner.

AUC (Area Under the ROC Curve) summarizes the curve in a single number:

  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random classifier
  • AUC < 0.5: Worse than random (invert predictions!)

AUC measures the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.
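
This interpretation can be checked directly. The sketch below (an illustrative check on simulated scores, separate from the example that follows) estimates the probability that a random positive outscores a random negative and compares it to roc_auc_score:

PYTHON
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
# Noisy scores that are higher, on average, for positives
y_scores = y_true + rng.normal(0, 1.5, size=1000)

# Estimate P(score of random positive > score of random negative) by sampling pairs
pos = rng.choice(y_scores[y_true == 1], size=100_000)
neg = rng.choice(y_scores[y_true == 0], size=100_000)
pairwise_prob = np.mean(pos > neg)

print(f"roc_auc_score:             {roc_auc_score(y_true, y_scores):.3f}")
print(f"P(positive ranked higher): {pairwise_prob:.3f}")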

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate data
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"AUC: {auc:.3f}")
print("AUC represents the probability of ranking a random positive above a random negative")

Precision-Recall Curve

For imbalanced datasets, the Precision-Recall curve is often more informative than ROC. It plots precision vs recall at various thresholds.

The area under the PR curve (PR-AUC or Average Precision) summarizes performance. Unlike ROC-AUC, PR-AUC is sensitive to class imbalance—a random classifier achieves PR-AUC equal to the positive class proportion, not 0.5.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP = {ap:.3f})')
plt.axhline(y=y_test.mean(), color='r', linestyle='--', label=f'Random (baseline = {y_test.mean():.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve (Imbalanced Data)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Average Precision: {ap:.3f}")
print(f"Positive class proportion: {y_test.mean():.2f}")

Multiclass Metrics

For multiclass classification, metrics can be computed in several ways:

Macro averaging: Compute metric for each class, then average. Treats all classes equally.

Micro averaging: Aggregate TP, FP, FN across classes, then compute metric. Weighted by class frequency.

Weighted averaging: Average metrics weighted by class support (number of samples per class).

PYTHON
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Multiclass problem
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Multiclass Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

print("\nAveraging Methods:")
print(f"Macro F1:    {f1_score(y_test, y_pred, average='macro'):.3f}")
print(f"Micro F1:    {f1_score(y_test, y_pred, average='micro'):.3f}")
print(f"Weighted F1: {f1_score(y_test, y_pred, average='weighted'):.3f}")

Log Loss (Cross-Entropy Loss)

Log loss evaluates predicted probabilities, not just class predictions. It penalizes confident wrong predictions heavily.

$$\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i)]$$

Lower log loss is better. Log loss captures calibration—whether predicted probabilities match actual frequencies.

PYTHON
from sklearn.metrics import log_loss
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0])

# Well-calibrated predictions
y_proba_good = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.3])

# Overconfident wrong predictions
y_proba_bad = np.array([0.99, 0.99, 0.01, 0.01, 0.01, 0.99])

print("Log Loss Comparison:")
print(f"Well-calibrated: {log_loss(y_true, y_proba_good):.4f}")
print(f"Overconfident wrong: {log_loss(y_true, y_proba_bad):.4f}")
print("\nOverconfident wrong predictions are heavily penalized!")

Choosing the Right Metric

The right metric depends on your problem:

| Scenario                                       | Recommended Metric |
|------------------------------------------------|--------------------|
| Balanced classes, care about overall accuracy  | Accuracy           |
| Imbalanced classes                             | F1, PR-AUC         |
| Cost of FP >> Cost of FN                       | Precision, or F0.5 |
| Cost of FN >> Cost of FP                       | Recall, or F2      |
| Need to rank predictions                       | ROC-AUC            |
| Probability calibration matters                | Log Loss           |
| Compare models across thresholds               | AUC (ROC or PR)    |

PYTHON
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

def evaluate_model(y_true, y_pred, y_proba):
    """Compute multiple metrics."""
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1': f1_score(y_true, y_pred),
        'ROC-AUC': roc_auc_score(y_true, y_proba),
        'PR-AUC': average_precision_score(y_true, y_proba)
    }
    return metrics

# Simulated predictions
np.random.seed(42)
y_true = np.array([1]*100 + [0]*900)  # 10% positive
y_proba = np.random.beta(2, 5, 1000)  # Random probabilities
y_proba[:100] += 0.3  # Boost positives slightly
y_proba = np.clip(y_proba, 0, 1)
y_pred = (y_proba >= 0.5).astype(int)

metrics = evaluate_model(y_true, y_pred, y_proba)

print("Comprehensive Evaluation:")
for name, value in metrics.items():
    print(f"  {name:<12}: {value:.3f}")

print("\nChoose metrics based on your problem's costs and constraints!")

Practical Guidelines

Always examine the confusion matrix first: It shows exactly where your model makes errors.

Don't rely on accuracy alone: Especially with imbalanced classes.

Consider business costs: Weight metrics by the actual cost of each error type.

Use multiple metrics: Different metrics reveal different aspects of performance.

Compare to baselines: How does your model compare to always predicting the majority class?

Look at calibration: If you need probabilities, verify they're well-calibrated.
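
A short sketch of the last two guidelines, assuming a synthetic dataset for illustration: DummyClassifier provides the majority-class baseline, and calibration_curve gives a quick reliability check of the predicted probabilities:

PYTHON
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = LogisticRegression(random_state=42).fit(X_train, y_train)

for name, clf in [('Majority baseline', baseline), ('Logistic regression', model)]:
    y_pred = clf.predict(X_test)
    print(f"{name:<20} accuracy={accuracy_score(y_test, y_pred):.3f}  "
          f"F1={f1_score(y_test, y_pred, zero_division=0):.3f}")

# Reliability diagram: predicted probability vs observed frequency of positives
prob_true, prob_pred = calibration_curve(y_test, model.predict_proba(X_test)[:, 1], n_bins=10)
plt.plot(prob_pred, prob_true, 'bo-', label='Model')
plt.plot([0, 1], [0, 1], 'r--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()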

Key Takeaways

  • The confusion matrix shows TP, TN, FP, FN—the foundation of all metrics
  • Accuracy is misleading with imbalanced classes
  • Precision measures how many predicted positives are correct (penalizes false positives); recall measures how many actual positives are found (penalizes false negatives)
  • There's a tradeoff between precision and recall controlled by the classification threshold
  • F1 score balances precision and recall; Fβ allows weighting
  • ROC-AUC measures ranking ability across all thresholds
  • PR-AUC is more informative for imbalanced data
  • Log loss evaluates probability calibration
  • Choose metrics based on business costs and class balance
  • Always examine the confusion matrix and use multiple metrics

6.1 Clustering Fundamentals

Clustering is the task of grouping similar objects together without any labeled examples to guide the process. Unlike supervised learning where we have target labels, clustering discovers structure in data purely from the features. This makes it a foundational technique in unsupervised learning—learning patterns from unlabeled data.

Clustering appears everywhere: customer segmentation in marketing, document organization, image compression, gene expression analysis, social network communities, and anomaly detection. Understanding clustering algorithms and their assumptions is essential for any data scientist.

What is Clustering?

Given a dataset of $n$ examples, each described by $d$ features, clustering assigns each example to one of $K$ groups (clusters) such that examples within the same cluster are more similar to each other than to examples in other clusters.

But what does "similar" mean? This is the fundamental question in clustering. Different algorithms make different assumptions about cluster shapes, sizes, and densities. The choice of algorithm should match your assumptions about the data's structure.

Clustering has no ground truth labels, making evaluation challenging. A clustering might be "good" for one purpose but poor for another. This subjectivity distinguishes clustering from classification, where we can objectively measure prediction accuracy.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate clustered data
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Visualize the natural clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50, alpha=0.7)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data with Natural Cluster Structure')
plt.colorbar(label='True Cluster')
plt.show()

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"True number of clusters: {len(np.unique(y_true))}")

K-Means Clustering

K-Means is the most widely used clustering algorithm due to its simplicity and efficiency. It partitions data into exactly $K$ clusters, where each cluster is represented by its centroid (mean of all points in the cluster).

The algorithm iterates between two steps:

  1. Assignment: Assign each point to the nearest centroid
  2. Update: Recalculate centroids as the mean of assigned points

This continues until assignments no longer change (convergence).

K-Means minimizes the within-cluster sum of squares (WCSS), also called inertia:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$

where $C_k$ is cluster $k$ and $\mu_k$ is its centroid.
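
Before handing this to scikit-learn, here is a minimal NumPy sketch of the two-step loop described above (illustrative only; it omits refinements such as K-Means++ initialization and empty-cluster handling):

PYTHON
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
rng = np.random.default_rng(42)
K = 4

# Initialize centroids by picking K distinct data points
centroids = X[rng.choice(len(X), size=K, replace=False)]

for iteration in range(100):
    # Assignment step: each point joins its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

wcss = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K))
print(f"Converged after {iteration + 1} iterations, WCSS = {wcss:.2f}")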

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(X)

# Visualize results
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=50, alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, edgecolors='black', linewidths=2)
plt.title('K-Means Clustering (K=4)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Show convergence
plt.subplot(1, 2, 2)
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 10), inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method for Optimal K')

plt.tight_layout()
plt.show()

print(f"Inertia: {kmeans.inertia_:.2f}")
print(f"Iterations to converge: {kmeans.n_iter_}")

Choosing K: The Elbow Method

K-Means requires specifying $K$ in advance. The elbow method plots inertia against $K$ and looks for an "elbow"—a point where adding more clusters provides diminishing returns.

However, the elbow isn't always clear. Other methods include:

Silhouette Score: Measures how similar points are to their own cluster versus other clusters. Ranges from -1 to 1; higher is better.

Gap Statistic: Compares inertia to that expected from uniformly distributed data.

PYTHON
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

print("Silhouette Scores for Different K:")
print(f"{'K':<5} {'Silhouette':<12} {'Inertia':<12}")
print("-" * 30)

for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    silhouette = silhouette_score(X, labels)
    print(f"{k:<5} {silhouette:<12.4f} {kmeans.inertia_:<12.2f}")

K-Means Limitations

K-Means makes strong assumptions that don't always hold:

Spherical clusters: K-Means assumes clusters are roughly spherical (isotropic). It struggles with elongated or irregular shapes.

Equal-sized clusters: The algorithm tends to create clusters of similar size, even when natural clusters differ in size.

Sensitivity to initialization: Different random initializations can lead to different results. The K-Means++ initialization (default in scikit-learn) helps but doesn't eliminate this issue.

Sensitivity to outliers: Outliers pull centroids toward them, distorting clusters.

Fixed K: You must specify the number of clusters in advance.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Generate non-spherical data (K-Means will struggle)
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50)
axes[0].set_title('True Clusters (Non-Spherical)')

axes[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=50)
axes[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c='red', marker='X', s=200)
axes[1].set_title('K-Means Result (Struggles with Shape)')

plt.tight_layout()
plt.show()

Hierarchical Clustering

Hierarchical clustering builds a tree of clusters, either bottom-up (agglomerative) or top-down (divisive). The agglomerative approach is more common:

  1. Start with each point as its own cluster
  2. Repeatedly merge the two closest clusters
  3. Continue until all points form one cluster

The result is a dendrogram—a tree showing the hierarchy of merges. You can cut the dendrogram at any level to get different numbers of clusters.

The key choice is how to measure distance between clusters (linkage):

  • Single linkage: Distance between closest points (can create elongated clusters)
  • Complete linkage: Distance between farthest points (tends to create compact clusters)
  • Average linkage: Average distance between all pairs
  • Ward's method: Minimizes variance increase when merging (tends to create equal-sized clusters)

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Generate data
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.60, random_state=42)

# Compute linkage matrix for dendrogram
Z = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
dendrogram(Z, truncate_mode='level', p=5)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')

# Apply clustering with 3 clusters
plt.subplot(1, 2, 2)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Agglomerative Clustering (K=3)')

plt.tight_layout()
plt.show()
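
The example above uses Ward linkage; the sketch below applies the four linkage criteria described earlier to the same kind of data so their tendencies can be compared side by side:

PYTHON
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, linkage_name in zip(axes, ['single', 'complete', 'average', 'ward']):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage_name).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
    ax.set_title(f'{linkage_name} linkage')

plt.tight_layout()
plt.show()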

DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions separated by sparse regions. Unlike K-Means, it:

  • Discovers the number of clusters automatically
  • Can find arbitrarily shaped clusters
  • Identifies outliers as noise points

DBSCAN has two parameters:

  • eps (ε): Maximum distance between two points to be considered neighbors
  • min_samples: Minimum number of points required to form a dense region

Points are classified as:

  • Core points: Have at least min_samples points (including themselves) within eps
  • Border points: Within eps of a core point but not core themselves
  • Noise points: Neither core nor border (labeled as -1)

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate non-spherical data
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Visualize
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50)
plt.title('True Clusters')

plt.subplot(1, 2, 2)
# Color noise points differently
colors = ['gray' if l == -1 else plt.cm.viridis(l / max(labels)) for l in labels]
plt.scatter(X[:, 0], X[:, 1], c=colors, s=50)
plt.title(f'DBSCAN (Found {len(set(labels)) - (1 if -1 in labels else 0)} clusters)')

plt.tight_layout()
plt.show()

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

Choosing DBSCAN Parameters

Selecting eps and min_samples requires understanding your data:

k-distance plot: Plot the distance to the k-th nearest neighbor for each point, sorted in ascending order. The distance at the "elbow" is a good candidate for eps.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# Compute k-nearest neighbor distances
k = 5  # Use min_samples
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X)
distances, _ = neighbors.kneighbors(X)

# Sort distances to k-th neighbor
k_distances = np.sort(distances[:, k-1])

plt.figure(figsize=(8, 5))
plt.plot(k_distances)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'Distance to {k}th Nearest Neighbor')
plt.title('K-Distance Plot for Choosing eps')
plt.axhline(y=0.2, color='r', linestyle='--', label='Suggested eps=0.2')
plt.legend()
plt.show()

Gaussian Mixture Models

Gaussian Mixture Models (GMM) assume data comes from a mixture of $K$ Gaussian distributions. Unlike K-Means which makes hard assignments, GMM provides soft assignments—probability of belonging to each cluster.

GMM learns:

  • Means $\mu_k$ of each Gaussian
  • Covariances $\Sigma_k$ (shape of each cluster)
  • Mixing weights $\pi_k$ (relative size of each cluster)

The Expectation-Maximization (EM) algorithm iterates:

  1. E-step: Compute probability each point belongs to each cluster
  2. M-step: Update parameters based on weighted assignments

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Generate data with clusters of different shapes
np.random.seed(42)
X1 = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 100)
X2 = np.random.multivariate_normal([4, 4], [[1, -0.5], [-0.5, 1]], 100)
X3 = np.random.multivariate_normal([0, 4], [[0.5, 0], [0, 2]], 100)
X = np.vstack([X1, X2, X3])

# Fit GMM
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
probs = gmm.predict_proba(X)

# Visualize
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50, alpha=0.7)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='red', marker='X', s=200)
plt.title('GMM Hard Assignments')

plt.subplot(1, 2, 2)
# Show uncertainty: points colored by max probability
max_prob = probs.max(axis=1)
plt.scatter(X[:, 0], X[:, 1], c=max_prob, cmap='RdYlGn', s=50, alpha=0.7)
plt.colorbar(label='Assignment Confidence')
plt.title('GMM Soft Assignment Confidence')

plt.tight_layout()
plt.show()

print(f"Cluster sizes: {np.bincount(labels)}")
print(f"Average assignment confidence: {max_prob.mean():.3f}")

Comparing Clustering Algorithms

Different algorithms suit different data characteristics:

| Algorithm    | Cluster Shape | Outliers  | Scalability         | Parameters         |
|--------------|---------------|-----------|---------------------|--------------------|
| K-Means      | Spherical     | Sensitive | Excellent           | K                  |
| Hierarchical | Any           | Moderate  | Poor for large data | Linkage, K         |
| DBSCAN       | Arbitrary     | Robust    | Good                | eps, min_samples   |
| GMM          | Elliptical    | Moderate  | Good                | K, covariance type |

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_moons, make_blobs
import warnings
warnings.filterwarnings('ignore')

# Create different datasets
datasets = {
    'Blobs': make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)[0],
    'Moons': make_moons(n_samples=300, noise=0.05, random_state=42)[0],
}

algorithms = {
    'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
    'Hierarchical': AgglomerativeClustering(n_clusters=3),
    'GMM': GaussianMixture(n_components=3, random_state=42),
}

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for i, (data_name, X) in enumerate(datasets.items()):
    for j, (alg_name, alg) in enumerate(algorithms.items()):
        ax = axes[i, j]
        labels = alg.fit_predict(X)
        ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
        ax.set_title(f'{data_name}: {alg_name}')
        ax.set_xticks([])
        ax.set_yticks([])

plt.tight_layout()
plt.show()

Evaluating Clustering Quality

Without labels, evaluation is challenging. Common metrics:

Silhouette Score: Measures cohesion (within-cluster distance) vs separation (between-cluster distance). Range: -1 to 1.

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where $a(i)$ is mean distance to same-cluster points, $b(i)$ is mean distance to nearest other cluster.

Calinski-Harabasz Index: Ratio of between-cluster variance to within-cluster variance. Higher is better.

Davies-Bouldin Index: Average similarity between clusters. Lower is better.

PYTHON
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

print("Clustering Evaluation Metrics:")
print(f"{'K':<5} {'Silhouette':<12} {'Calinski-Harabasz':<18} {'Davies-Bouldin':<15}")
print("-" * 55)

for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)

    sil = silhouette_score(X, labels)
    ch = calinski_harabasz_score(X, labels)
    db = davies_bouldin_score(X, labels)

    print(f"{k:<5} {sil:<12.4f} {ch:<18.2f} {db:<15.4f}")

Practical Tips

Preprocess your data: Scale features so all contribute equally to distance calculations.

Visualize first: Use dimensionality reduction (PCA, t-SNE) to visualize clusters before and after.

Try multiple algorithms: Different algorithms may reveal different structure.

Validate with domain knowledge: Do the clusters make sense for your application?

Consider stability: Run clustering multiple times with different initializations.

PYTHON
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load and scale data
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

# Cluster
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Visualize with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis', s=50)
plt.title('True Labels')

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=50)
plt.title('K-Means Clusters')

plt.tight_layout()
plt.show()
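
A quick sketch of the stability check mentioned above: rerun K-Means with different seeds and compare the labelings with the adjusted Rand index (values near 1.0 indicate a stable clustering):

PYTHON
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Five runs with different random seeds
labelings = [KMeans(n_clusters=3, random_state=seed, n_init=10).fit_predict(X)
             for seed in range(5)]

print("Pairwise ARI between runs (1.0 = identical clusterings):")
for i in range(len(labelings)):
    for j in range(i + 1, len(labelings)):
        print(f"  run {i} vs run {j}: {adjusted_rand_score(labelings[i], labelings[j]):.3f}")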

Key Takeaways

  • Clustering groups similar objects without labeled examples—a core unsupervised learning task
  • K-Means is fast and effective for spherical clusters but requires specifying K
  • The elbow method and silhouette score help choose the number of clusters
  • Hierarchical clustering builds a tree of clusters; dendrograms show the hierarchy
  • DBSCAN finds arbitrary-shaped clusters and identifies outliers automatically
  • GMM provides soft cluster assignments and handles elliptical cluster shapes
  • Evaluation without labels uses metrics like silhouette score, but domain knowledge is crucial
  • Feature scaling is essential for distance-based clustering algorithms
  • Different algorithms suit different data structures—experiment and validate

6.2 Dimensionality Reduction

High-dimensional data is everywhere: images with millions of pixels, text documents with thousands of word counts, gene expression profiles with tens of thousands of genes. Working directly with such data is challenging—visualization is impossible, distances become meaningless (the "curse of dimensionality"), and models overfit easily.

Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving important structure. This enables visualization, speeds up learning algorithms, reduces storage requirements, and can even improve model performance by removing noise.

Why Reduce Dimensions?

The curse of dimensionality refers to phenomena that arise in high dimensions:

Sparsity: As dimensions increase, data points spread out. The volume of the space grows exponentially, making data sparse. Most of the space is empty.

Distance concentration: In high dimensions, distances between points become similar. The difference between the nearest and farthest neighbor shrinks, making distance-based methods less effective.

Computational cost: Many algorithms scale poorly with dimensionality. Reducing dimensions speeds up training and inference.

Visualization: Humans can only perceive 2D or 3D. Reducing to 2-3 dimensions enables visualization of high-dimensional patterns.

Noise reduction: Real data often contains redundant or noisy features. Dimensionality reduction can filter these out, improving signal-to-noise ratio.

PYTHON
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate distance concentration in high dimensions
np.random.seed(42)

dims = [2, 10, 50, 100, 500, 1000]
n_points = 1000

print("Distance Concentration in High Dimensions:")
print(f"{'Dim':<8} {'Mean Dist':<12} {'Std Dist':<12} {'Ratio (std/mean)'}")
print("-" * 45)

for d in dims:
    X = np.random.randn(n_points, d)
    # Compute pairwise distances for a subset of 100 points (keeps this fast)
    dists = np.sqrt(((X[:100, np.newaxis] - X[np.newaxis, :100]) ** 2).sum(axis=2))
    dists = dists[np.triu_indices(100, k=1)]  # Upper triangle, no diagonal

    print(f"{d:<8} {dists.mean():<12.2f} {dists.std():<12.2f} {dists.std()/dists.mean():<12.4f}")

print("\nAs dimensions increase, distances become more uniform (lower ratio).")

Principal Component Analysis (PCA)

PCA is the most fundamental dimensionality reduction technique. It finds new axes (principal components) that capture maximum variance in the data. The first component captures the most variance, the second captures the most remaining variance orthogonal to the first, and so on.

Mathematically, PCA finds the eigenvectors of the covariance matrix. These eigenvectors become the new coordinate axes, and their eigenvalues indicate the variance explained by each component.

To reduce from $d$ to $k$ dimensions, we keep only the top $k$ principal components. The projection loses information, but by choosing $k$ such that most variance is retained, we preserve the essential structure.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load digits dataset (64-dimensional: 8x8 images)
digits = load_digits()
X, y = digits.data, digits.target

print(f"Original shape: {X.shape}")

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(f"Reduced shape: {X_pca.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('PCA of Handwritten Digits (64D → 2D)')
plt.show()
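
To connect the scikit-learn result to the eigenvector view described above, here is a short check (illustrative only, not something you need in practice): the fitted components should match the top eigenvectors of the covariance matrix up to sign, and explained_variance_ should match the eigenvalues:

PYTHON
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
X_centered = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix (eigh returns ascending order, so reverse it)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA(n_components=2).fit(X)

print("Top-2 covariance eigenvalues:", np.round(eigvals[:2], 2))
print("PCA explained_variance_:     ", np.round(pca.explained_variance_, 2))
# Components equal the eigenvectors up to a sign flip
diff = np.abs(np.abs(pca.components_[0]) - np.abs(eigvecs[:, 0])).max()
print(f"Max |PC1 - top eigenvector| (ignoring sign): {diff:.2e}")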

Choosing the Number of Components

How many components to keep? Two common approaches:

Cumulative explained variance: Keep enough components to explain a target percentage (e.g., 95%) of total variance.

Scree plot: Plot variance explained by each component. Look for an "elbow" where additional components add little.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X = load_digits().data

# Fit PCA with all components
pca = PCA()
pca.fit(X)

# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Scree plot
axes[0].bar(range(1, 21), pca.explained_variance_ratio_[:20])
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained')
axes[0].set_title('Scree Plot')

# Cumulative variance
cumvar = np.cumsum(pca.explained_variance_ratio_)
axes[1].plot(range(1, len(cumvar)+1), cumvar, 'b-')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance Explained')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()

plt.tight_layout()
plt.show()

# Find components needed for 95% variance
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"Components needed for 95% variance: {n_95} (out of {X.shape[1]})")

PCA for Preprocessing

PCA is often used as a preprocessing step to reduce dimensionality before training models. This can speed up training and reduce overfitting, especially with limited data.

PYTHON
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import time

X, y = load_digits().data, load_digits().target

print("Effect of PCA on Classification:")
print(f"{'Components':<12} {'CV Accuracy':<15} {'Time (s)'}")
print("-" * 40)

for n_comp in [10, 20, 30, 64]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=n_comp if n_comp < 64 else None)),
        ('clf', LogisticRegression(max_iter=1000, random_state=42))
    ])

    start = time.time()
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    elapsed = time.time() - start

    n_str = str(n_comp) if n_comp < 64 else "64 (no PCA)"
    print(f"{n_str:<12} {scores.mean():.4f}         {elapsed:.2f}")

t-SNE: Visualizing High-Dimensional Data

t-SNE (t-distributed Stochastic Neighbor Embedding) is designed specifically for visualization. Unlike PCA, which preserves global structure, t-SNE focuses on preserving local structure—keeping similar points close together.

t-SNE converts high-dimensional distances into probabilities, then finds a low-dimensional arrangement that matches these probabilities. It's particularly good at revealing clusters and is widely used for visualizing embeddings, single-cell RNA data, and neural network activations.

However, t-SNE has limitations:

  • Results depend on hyperparameters (perplexity, learning rate)
  • Different runs can give different results (non-deterministic)
  • Distances between clusters aren't meaningful
  • Slow for large datasets

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X, y = load_digits().data, load_digits().target

# Compare PCA and t-SNE
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
axes[0].set_title('PCA')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
scatter = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
axes[1].set_title('t-SNE')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')

plt.colorbar(scatter, ax=axes[1], label='Digit')
plt.tight_layout()
plt.show()

print("t-SNE reveals clearer cluster separation for visualization")
print("But: distances between clusters aren't meaningful in t-SNE")

UMAP: A Modern Alternative

UMAP (Uniform Manifold Approximation and Projection) is a newer technique that often produces better visualizations than t-SNE while being faster and preserving more global structure.

UMAP is based on manifold learning and topological data analysis. It works well for both visualization (2-3D) and general dimensionality reduction (higher dimensions).

PYTHON
try:
    import umap
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits

    X, y = load_digits().data, load_digits().target

    # Apply UMAP
    reducer = umap.UMAP(n_components=2, random_state=42)
    X_umap = reducer.fit_transform(X)

    plt.figure(figsize=(8, 6))
    plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=10, alpha=0.7)
    plt.colorbar(label='Digit')
    plt.title('UMAP of Handwritten Digits')
    plt.xlabel('UMAP 1')
    plt.ylabel('UMAP 2')
    plt.show()
except ImportError:
    print("UMAP not installed. Install with: pip install umap-learn")

Linear Discriminant Analysis (LDA)

While PCA is unsupervised, LDA is a supervised dimensionality reduction technique. It finds projections that maximize class separation—the ratio of between-class variance to within-class variance.

LDA is useful when you have labels and want dimensions that are most discriminative for classification. It can reduce to at most $K-1$ dimensions for $K$ classes.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine

# Load wine dataset (3 classes)
wine = load_wine()
X, y = wine.data, wine.target

# Compare PCA vs LDA
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# PCA (unsupervised)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', s=50)
axes[0].set_title('PCA (Unsupervised)')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# LDA (supervised)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
scatter = axes[1].scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis', s=50)
axes[1].set_title('LDA (Supervised)')
axes[1].set_xlabel('LD1')
axes[1].set_ylabel('LD2')

plt.colorbar(scatter, ax=axes[1], label='Class')
plt.tight_layout()
plt.show()

print("LDA maximizes class separation, often giving better clustering by class")

Autoencoders for Nonlinear Reduction

Neural network autoencoders learn nonlinear dimensionality reduction. An autoencoder has an encoder that compresses input to a low-dimensional "bottleneck" and a decoder that reconstructs the original from this compressed representation.

The bottleneck layer contains the reduced representation. By training to minimize reconstruction error, the autoencoder learns to capture the most important information in fewer dimensions.

PYTHON
import numpy as np
import matplotlib.pyplot as plt

# Simple autoencoder concept (using sklearn's MLPRegressor as approximation)
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X = load_digits().data
y = load_digits().target

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train an autoencoder-style network with a 2D bottleneck
# Architecture (as built below with MLPRegressor): 64 -> 32 -> 2 -> 64
# This is a simplified demonstration; use PyTorch or TensorFlow for real autoencoders
class SimpleAutoencoder:
    def __init__(self, encoding_dim=2):
        self.encoding_dim = encoding_dim
        self.net = MLPRegressor(
            hidden_layer_sizes=(32, encoding_dim),
            activation='relu',
            max_iter=500,
            random_state=42
        )

    def fit_transform(self, X):
        # Train the network to reconstruct its own input
        self.net.fit(X, X)
        # Run the encoder half manually to read the bottleneck activations
        h = np.maximum(0, X @ self.net.coefs_[0] + self.net.intercepts_[0])  # 64 -> 32
        z = np.maximum(0, h @ self.net.coefs_[1] + self.net.intercepts_[1])  # 32 -> 2 (bottleneck)
        return z

# Encode the scaled digits into 2 dimensions
X_encoded = SimpleAutoencoder(encoding_dim=2).fit_transform(X_scaled)
print(f"Encoded shape: {X_encoded.shape}")

print("Autoencoders learn nonlinear dimensionality reduction")
print("For production use, implement with PyTorch or TensorFlow:")
print("  - Encoder: Input -> Hidden layers -> Bottleneck")
print("  - Decoder: Bottleneck -> Hidden layers -> Reconstruction")
print("  - Loss: Reconstruction error (MSE between input and output)")

Comparing Methods

| Method       | Type               | Preserves           | Speed  | Use Case                         |
|--------------|--------------------|---------------------|--------|----------------------------------|
| PCA          | Linear             | Global variance     | Fast   | Preprocessing, noise reduction   |
| t-SNE        | Nonlinear          | Local structure     | Slow   | Visualization only               |
| UMAP         | Nonlinear          | Local + some global | Fast   | Visualization, general reduction |
| LDA          | Linear, supervised | Class separation    | Fast   | Classification preprocessing     |
| Autoencoders | Nonlinear          | Learned features    | Varies | Complex nonlinear patterns       |

PYTHON
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import time

X, y = load_digits().data, load_digits().target

print("Comparison of Dimensionality Reduction Methods:")
print(f"{'Method':<15} {'Dim':<8} {'Time (s)':<12} {'Class. Accuracy'}")
print("-" * 50)

methods = [
    ('PCA', PCA(n_components=10)),
    ('PCA', PCA(n_components=30)),
    ('LDA', LinearDiscriminantAnalysis(n_components=9)),  # max K-1 = 9
]

for name, reducer in methods:
    start = time.time()
    if name == 'LDA':
        X_reduced = reducer.fit_transform(X, y)
    else:
        X_reduced = reducer.fit_transform(X)
    elapsed = time.time() - start

    # Evaluate with logistic regression
    clf = LogisticRegression(max_iter=1000, random_state=42)
    scores = cross_val_score(clf, X_reduced, y, cv=5)

    dim = X_reduced.shape[1]
    print(f"{name:<15} {dim:<8} {elapsed:<12.3f} {scores.mean():.4f}")

Practical Guidelines

Start with PCA: It's fast, deterministic, and often sufficient. Check explained variance to choose components.

For visualization, use t-SNE or UMAP: They reveal cluster structure better than PCA but aren't suitable for preprocessing.

Scale your data: Most methods assume features are on similar scales. Standardize first.

For classification tasks, consider LDA: It incorporates label information to find discriminative dimensions.

Check reconstruction error: If using PCA for compression, verify that important information is retained by checking reconstruction.

PYTHON
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

X = load_digits().data

# Standard pipeline
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA with different components
for n in [10, 20, 30, 50]:
    pca = PCA(n_components=n)
    X_reduced = pca.fit_transform(X_scaled)
    X_reconstructed = pca.inverse_transform(X_reduced)

    # Reconstruction error
    mse = np.mean((X_scaled - X_reconstructed) ** 2)
    var_explained = pca.explained_variance_ratio_.sum()

    print(f"n={n:2d}: Variance explained={var_explained:.2%}, Reconstruction MSE={mse:.4f}")

Key Takeaways

  • Dimensionality reduction transforms high-dimensional data to lower dimensions while preserving structure
  • The curse of dimensionality makes high-dimensional data challenging: sparsity, distance concentration, computational cost
  • PCA finds principal components capturing maximum variance—use for preprocessing and noise reduction
  • Choose components using explained variance (e.g., keep 95%) or the scree plot elbow
  • t-SNE and UMAP excel at visualization by preserving local structure
  • LDA is supervised, finding projections that maximize class separation
  • Autoencoders learn nonlinear reductions through reconstruction
  • Always scale data before applying dimensionality reduction
  • PCA for preprocessing; t-SNE/UMAP for visualization; LDA when labels are available

6.3 Anomaly Detection

Anomaly detection identifies data points that deviate significantly from the expected pattern. These outliers, novelties, or anomalies may indicate fraud, system failures, disease, or other rare but important events. Unlike classification where we have labeled examples of each class, anomaly detection typically works with mostly normal data and must identify the rare abnormal instances.

This asymmetry—abundant normal data but scarce anomalies—makes anomaly detection a unique and challenging problem with important real-world applications.

What is an Anomaly?

An anomaly (or outlier) is an observation that deviates so much from other observations that it raises suspicions it was generated by a different mechanism. The challenge is defining "deviate so much"—this depends on context and the assumed model of normal behavior.

Anomalies come in different types:

Point anomalies: Individual instances that are anomalous with respect to the rest of the data. Example: a single fraudulent credit card transaction.

Contextual anomalies: Instances anomalous in a specific context but not otherwise. Example: 80°F temperature is normal in summer but anomalous in winter.

Collective anomalies: Collections of instances that are anomalous together. Example: a sequence of low-value transactions that together constitute money laundering.

PYTHON
import numpy as np
import matplotlib.pyplot as plt

# Generate normal data with a few anomalies
np.random.seed(42)

# Normal data: clustered around (0, 0)
normal = np.random.randn(200, 2)

# Anomalies: scattered far from the cluster
anomalies = np.array([
    [4, 4], [-4, 3], [3, -4], [-3, -3], [5, 0], [0, 5]
])

# Combine
X = np.vstack([normal, anomalies])
y = np.array([0] * 200 + [1] * 6)  # 0=normal, 1=anomaly

plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Normal', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', s=100, marker='x', label='Anomaly')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Normal Data with Point Anomalies')
plt.legend()
plt.show()

print(f"Dataset: {len(X)} points, {sum(y)} anomalies ({100*sum(y)/len(y):.1f}%)")

Statistical Approaches

The simplest anomaly detection methods use statistical properties of the data:

Z-score method: Flag points more than $k$ standard deviations from the mean.

$$z = \frac{x - \mu}{\sigma}$$

Points with $|z| > k$ (typically $k = 3$) are considered anomalies.

IQR method: Use the interquartile range. Points below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$ are anomalies.

These methods work well for univariate data with approximately normal distributions but struggle with multivariate data or complex distributions.

PYTHON
import numpy as np
from scipy import stats

# Generate data with outliers
np.random.seed(42)
normal_data = np.random.randn(100) * 10 + 50
outliers = np.array([150, 160, -20, -30])
data = np.concatenate([normal_data, outliers])

# Z-score method
z_scores = np.abs(stats.zscore(data))
z_anomalies = z_scores > 3

# IQR method
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
iqr_anomalies = (data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)

print("Statistical Anomaly Detection:")
print(f"Data range: [{data.min():.1f}, {data.max():.1f}]")
print(f"Mean: {data.mean():.1f}, Std: {data.std():.1f}")
print(f"\nZ-score method (|z| > 3): {sum(z_anomalies)} anomalies detected")
print(f"IQR method: {sum(iqr_anomalies)} anomalies detected")
print(f"IQR bounds: [{Q1 - 1.5*IQR:.1f}, {Q3 + 1.5*IQR:.1f}]")

Isolation Forest

Isolation Forest is based on the principle that anomalies are "few and different"—they're easier to isolate than normal points.

The algorithm builds random trees that recursively split the data. Anomalies, being rare and distinct, require fewer splits to isolate (shorter path from root to leaf). Normal points, being clustered, require more splits.

The anomaly score is based on the average path length across many trees. Shorter paths indicate anomalies.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Generate data
np.random.seed(42)
normal = np.random.randn(200, 2)
anomalies = np.array([[4, 4], [-4, 3], [3, -4], [-3, -3], [5, 0], [0, 5]])
X = np.vstack([normal, anomalies])

# Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
predictions = iso_forest.fit_predict(X)  # -1 for anomaly, 1 for normal
scores = iso_forest.decision_function(X)  # Lower = more anomalous

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predictions
axes[0].scatter(X[predictions==1, 0], X[predictions==1, 1],
                c='blue', label='Normal', alpha=0.5)
axes[0].scatter(X[predictions==-1, 0], X[predictions==-1, 1],
                c='red', s=100, marker='x', label='Detected Anomaly')
axes[0].set_title('Isolation Forest Predictions')
axes[0].legend()

# Anomaly scores
scatter = axes[1].scatter(X[:, 0], X[:, 1], c=scores, cmap='RdYlBu', s=50)
plt.colorbar(scatter, ax=axes[1], label='Anomaly Score')
axes[1].set_title('Anomaly Scores (lower = more anomalous)')

plt.tight_layout()
plt.show()

n_detected = sum(predictions == -1)
print(f"Detected {n_detected} anomalies")

One-Class SVM

One-Class SVM learns a boundary around normal data. It finds the smallest region containing most of the training data, then classifies points outside this region as anomalies.

Using the RBF kernel, One-Class SVM can learn complex, nonlinear boundaries around normal data clusters.

The key parameter is nu, which controls the trade-off between the fraction of outliers and the tightness of the boundary.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import OneClassSVM

# Generate data
np.random.seed(42)
normal = np.random.randn(200, 2)
anomalies = np.array([[4, 4], [-4, 3], [3, -4], [-3, -3]])
X_train = normal  # Train only on normal data
X_test = np.vstack([normal, anomalies])

# Fit One-Class SVM
ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
ocsvm.fit(X_train)

# Predict on all data
predictions = ocsvm.predict(X_test)

# Visualize decision boundary
xx, yy = np.meshgrid(np.linspace(-5, 6, 100), np.linspace(-5, 6, 100))
Z = ocsvm.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap='Blues_r', alpha=0.5)
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')
plt.scatter(X_test[predictions==1, 0], X_test[predictions==1, 1],
            c='blue', label='Normal', alpha=0.5)
plt.scatter(X_test[predictions==-1, 0], X_test[predictions==-1, 1],
            c='red', s=100, marker='x', label='Anomaly')
plt.title('One-Class SVM Decision Boundary')
plt.legend()
plt.show()

Local Outlier Factor (LOF)

Local Outlier Factor measures the local density deviation of a point compared to its neighbors. A point in a sparse region surrounded by dense regions has a high LOF score and is likely an anomaly.

LOF is effective at detecting local anomalies—points that are anomalous relative to their neighborhood even if they're not globally extreme.
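
Before reaching for scikit-learn, here is a from-scratch sketch of the LOF computation on a tiny dataset (for simplicity it ignores ties in the k-distance neighborhood, which the full definition handles):

PYTHON
import numpy as np

def lof_scores(X, k=3):
    """Local Outlier Factor by direct computation (small-data sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)                                  # exclude self-distance
    knn_idx = np.argsort(D, axis=1)[:, :k]                       # k nearest neighbors of each point
    k_dist = D[np.arange(n), knn_idx[:, -1]]                     # k-distance of each point

    # Local reachability density: inverse of the mean reachability distance to the neighbors
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[knn_idx[i]], D[i, knn_idx[i]])
        lrd[i] = 1.0 / reach.mean()

    # LOF: average ratio of the neighbors' density to the point's own density
    return np.array([lrd[knn_idx[i]].mean() / lrd[i] for i in range(n)])

# Five clustered points and one isolated point
pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [5.0, 5.0]])
for p, score in zip(pts, lof_scores(pts, k=3)):
    print(f"  point {p}: LOF = {score:.2f}")  # the isolated point gets the largest score

scikit-learn's LocalOutlierFactor performs the same computation efficiently (and handles the tie cases the sketch ignores):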

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# Generate data with local anomalies
np.random.seed(42)
# Two clusters with different densities
cluster1 = np.random.randn(100, 2) * 0.5
cluster2 = np.random.randn(100, 2) * 0.5 + [4, 0]
# Anomaly between clusters
anomalies = np.array([[2, 0], [2, 0.5], [2, -0.5]])

X = np.vstack([cluster1, cluster2, anomalies])

# Fit LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
predictions = lof.fit_predict(X)
scores = -lof.negative_outlier_factor_  # Convert to positive (higher = more anomalous)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(X[predictions==1, 0], X[predictions==1, 1],
                c='blue', label='Normal', alpha=0.5)
axes[0].scatter(X[predictions==-1, 0], X[predictions==-1, 1],
                c='red', s=100, marker='x', label='Anomaly')
axes[0].set_title('LOF Predictions')
axes[0].legend()

scatter = axes[1].scatter(X[:, 0], X[:, 1], c=scores, cmap='RdYlBu_r', s=50)
plt.colorbar(scatter, ax=axes[1], label='LOF Score')
axes[1].set_title('LOF Scores (higher = more anomalous)')

plt.tight_layout()
plt.show()

Gaussian Mixture Models for Anomaly Detection

GMM can model the distribution of normal data. Points with low probability under the learned distribution are flagged as anomalies.

This approach works well when normal data follows a mixture of Gaussian distributions.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Generate data
np.random.seed(42)
normal = np.vstack([
    np.random.randn(100, 2) * 0.5 + [-2, 0],
    np.random.randn(100, 2) * 0.5 + [2, 0]
])
anomalies = np.array([[0, 3], [0, -3], [4, 2], [-4, -2]])
X = np.vstack([normal, anomalies])

# Fit GMM on normal data only (it should model what "normal" looks like)
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(normal)

# Compute log-probability
log_probs = gmm.score_samples(X)

# Threshold: points with low probability are anomalies
# (here taken as the bottom 2% of all points for the demo; in practice derive
#  the cut-off from normal/validation data or domain knowledge)
threshold = np.percentile(log_probs, 2)
predictions = (log_probs < threshold).astype(int)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[predictions==0, 0], X[predictions==0, 1],
            c='blue', label='Normal', alpha=0.5)
plt.scatter(X[predictions==1, 0], X[predictions==1, 1],
            c='red', s=100, marker='x', label='Anomaly')
plt.title(f'GMM Anomaly Detection (threshold: log_prob < {threshold:.2f})')
plt.legend()
plt.show()

print(f"Detected {sum(predictions)} anomalies")

Autoencoders for Anomaly Detection

Autoencoders learn to compress and reconstruct normal data. When presented with an anomaly, the reconstruction is poor (high reconstruction error), allowing detection.

This approach is powerful for complex, high-dimensional data like images or time series.

PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Simulate autoencoder with MLP (for demonstration)
np.random.seed(42)

# Normal data: samples from a specific pattern
normal = np.column_stack([
    np.sin(np.linspace(0, 4*np.pi, 100)),
    np.cos(np.linspace(0, 4*np.pi, 100))
]) + np.random.randn(100, 2) * 0.1

# Anomalies: random points
anomalies = np.random.uniform(-2, 2, (10, 2))

# Scale data
scaler = StandardScaler()
normal_scaled = scaler.fit_transform(normal)
anomalies_scaled = scaler.transform(anomalies)

# Train autoencoder (MLP that reconstructs input)
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=1000,
                           random_state=42, activation='tanh')
autoencoder.fit(normal_scaled, normal_scaled)

# Compute reconstruction error
def reconstruction_error(model, X):
    X_reconstructed = model.predict(X)
    return np.mean((X - X_reconstructed) ** 2, axis=1)

normal_errors = reconstruction_error(autoencoder, normal_scaled)
anomaly_errors = reconstruction_error(autoencoder, anomalies_scaled)

print("Autoencoder Reconstruction Errors:")
print(f"Normal data:   mean={normal_errors.mean():.4f}, max={normal_errors.max():.4f}")
print(f"Anomalies:     mean={anomaly_errors.mean():.4f}, max={anomaly_errors.max():.4f}")
print("\nAnomalies have much higher reconstruction error!")

Comparing Methods

| Method | Strengths | Weaknesses | When to Use |
|--------|-----------|------------|-------------|
| Z-score/IQR | Simple, interpretable | Univariate, assumes normality | Quick baseline, simple data |
| Isolation Forest | Fast, handles high dimensions | Struggles with local anomalies | General purpose, first choice |
| One-Class SVM | Flexible boundaries | Requires parameter tuning | Known normal data distribution |
| LOF | Detects local anomalies | Slow for large datasets | Local density variations |
| GMM | Probabilistic interpretation | Assumes Gaussian mixture | Multi-modal normal data |
| Autoencoder | Handles complex patterns | Requires more data, tuning | High-dimensional, complex data |

PYTHON
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import precision_score, recall_score, f1_score

# Generate data with known anomalies
np.random.seed(42)
normal = np.random.randn(500, 5)
anomalies = np.random.uniform(-5, 5, (25, 5))
X = np.vstack([normal, anomalies])
y_true = np.array([0]*500 + [1]*25)  # 0=normal, 1=anomaly

print("Anomaly Detection Methods Comparison:")
print(f"{'Method':<20} {'Precision':<12} {'Recall':<12} {'F1':<12}")
print("-" * 55)

methods = [
    ('Isolation Forest', IsolationForest(contamination=0.05, random_state=42)),
    ('One-Class SVM', OneClassSVM(nu=0.05)),
    ('LOF', LocalOutlierFactor(n_neighbors=20, contamination=0.05)),
]

for name, model in methods:
    # All three estimators expose fit_predict (LOF has no separate predict in its default mode)
    y_pred = model.fit_predict(X)

    # Convert: -1 (anomaly) -> 1, 1 (normal) -> 0
    y_pred = (y_pred == -1).astype(int)

    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    print(f"{name:<20} {prec:<12.3f} {rec:<12.3f} {f1:<12.3f}")

Practical Considerations

Contamination parameter: Most methods require specifying the expected fraction of anomalies. If unknown, start with a small value (1-5%).

Training data: Some methods (One-Class SVM, autoencoders) assume training data is mostly normal. Others (Isolation Forest, LOF) can handle some contamination.

Feature scaling: Most methods are sensitive to feature scales. Standardize features before applying.

Threshold selection: Many methods output scores. Choosing the threshold requires domain knowledge or validation data.

Evaluation: Without labeled anomalies, evaluation is difficult. Use domain expertise or inject synthetic anomalies.

PYTHON
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Best practices example
np.random.seed(42)
X = np.random.randn(1000, 10)
X[::100] = X[::100] * 5  # Add some anomalies

# 1. Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Try different contamination values
print("Effect of contamination parameter:")
for contamination in [0.01, 0.05, 0.10]:
    iso = IsolationForest(contamination=contamination, random_state=42)
    predictions = iso.fit_predict(X_scaled)
    n_anomalies = sum(predictions == -1)
    print(f"  contamination={contamination}: {n_anomalies} anomalies detected")

# 3. Use scores for flexible thresholding
iso = IsolationForest(random_state=42)
iso.fit(X_scaled)
scores = iso.decision_function(X_scaled)

print(f"\nAnomaly scores range: [{scores.min():.3f}, {scores.max():.3f}]")
print("Use score distribution to set custom threshold based on domain knowledge")

Key Takeaways

  • Anomaly detection identifies rare, unusual observations that deviate from expected patterns
  • Point anomalies are individual outliers; contextual and collective anomalies depend on context
  • Statistical methods (Z-score, IQR) are simple baselines for univariate data
  • Isolation Forest is fast, effective, and a good first choice for most problems
  • One-Class SVM learns a boundary around normal data using kernel methods
  • LOF detects local anomalies based on density deviation from neighbors
  • Autoencoders detect anomalies through high reconstruction error
  • The contamination parameter specifies expected anomaly fraction
  • Feature scaling is important for distance-based methods
  • Evaluation is challenging without labeled anomalies—use domain knowledge

6.4 Association Rules Intermediate

Association Rules

Association rule mining discovers interesting relationships between variables in large datasets. The classic example is market basket analysis: finding which products are frequently purchased together. If customers who buy bread also tend to buy butter, a retailer might place these items near each other or offer bundle discounts.

Beyond retail, association rules apply to web usage patterns, medical diagnosis, recommendation systems, and any domain where discovering co-occurrence patterns is valuable.

The Market Basket Problem

Imagine a supermarket with thousands of transactions, each listing the items purchased. We want to find rules like "customers who buy {diapers} also buy {beer}"—patterns that might not be obvious but are statistically significant.

Each transaction is a set of items. We look for itemsets (combinations of items) that appear frequently, then derive rules about which items imply the purchase of others.

PYTHON
# Sample transaction data
transactions = [
    ['bread', 'milk', 'eggs'],
    ['bread', 'butter'],
    ['milk', 'butter', 'eggs'],
    ['bread', 'milk', 'butter'],
    ['bread', 'milk', 'eggs', 'butter'],
    ['milk', 'eggs'],
    ['bread', 'milk'],
    ['bread', 'butter', 'eggs'],
    ['milk', 'butter'],
    ['bread', 'milk', 'butter', 'eggs']
]

print("Sample Transactions:")
for i, t in enumerate(transactions[:5]):
    print(f"  Transaction {i+1}: {t}")
print(f"  ... ({len(transactions)} total transactions)")

# Count item frequencies
from collections import Counter
all_items = [item for t in transactions for item in t]
item_counts = Counter(all_items)

print("\nItem Frequencies:")
for item, count in item_counts.most_common():
    print(f"  {item}: {count} ({100*count/len(transactions):.0f}%)")

Key Metrics: Support, Confidence, Lift

Three metrics evaluate the interestingness of association rules:

Support: How frequently an itemset appears in all transactions.

$$\text{Support}(X) = \frac{\text{Transactions containing } X}{\text{Total transactions}}$$

Support measures the significance of an itemset. Low support means the pattern is rare.

Confidence: Given X, how often does Y also appear? This is a conditional probability.

$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} = P(Y|X)$$

High confidence means Y usually appears when X appears, but this can be misleading if Y is already common.

Lift: Does X actually increase the likelihood of Y, or is Y just common anyway?

$$\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)} = \frac{P(X \cap Y)}{P(X) \cdot P(Y)}$$

  • Lift > 1: X and Y appear together more than expected (positive association)
  • Lift = 1: X and Y are independent
  • Lift < 1: X and Y appear together less than expected (negative association)

PYTHON
def calculate_metrics(transactions, X, Y):
    """Calculate support, confidence, and lift for rule X -> Y."""
    n = len(transactions)
    X_set = set(X) if isinstance(X, list) else {X}
    Y_set = set(Y) if isinstance(Y, list) else {Y}
    XY_set = X_set | Y_set

    # Count occurrences
    X_count = sum(1 for t in transactions if X_set.issubset(set(t)))
    Y_count = sum(1 for t in transactions if Y_set.issubset(set(t)))
    XY_count = sum(1 for t in transactions if XY_set.issubset(set(t)))

    # Calculate metrics
    support_X = X_count / n
    support_Y = Y_count / n
    support_XY = XY_count / n
    confidence = support_XY / support_X if support_X > 0 else 0
    lift = confidence / support_Y if support_Y > 0 else 0

    return {
        'support_X': support_X,
        'support_Y': support_Y,
        'support_XY': support_XY,
        'confidence': confidence,
        'lift': lift
    }

# Example rules
rules = [
    (['bread'], ['milk']),
    (['milk'], ['bread']),
    (['bread', 'milk'], ['butter']),
    (['butter'], ['eggs']),
]

print("Association Rule Metrics:")
print(f"{'Rule':<30} {'Support':<10} {'Confidence':<12} {'Lift':<8}")
print("-" * 60)

for X, Y in rules:
    metrics = calculate_metrics(transactions, X, Y)
    rule_str = f"{X} -> {Y}"
    print(f"{rule_str:<30} {metrics['support_XY']:<10.2f} {metrics['confidence']:<12.2f} {metrics['lift']:<8.2f}")

The Apriori Algorithm

Finding all association rules is computationally expensive—there are $2^n$ possible itemsets for $n$ items. The Apriori algorithm uses a clever pruning strategy based on the downward closure property: if an itemset is infrequent, all its supersets must also be infrequent.

The algorithm proceeds level-by-level:

  1. Find all 1-itemsets with support ≥ minimum threshold
  2. Generate candidate 2-itemsets from frequent 1-itemsets
  3. Prune candidates that contain infrequent subsets
  4. Count support for remaining candidates
  5. Repeat for larger itemsets until no frequent itemsets remain

PYTHON
from itertools import combinations

def apriori(transactions, min_support):
    """Simple Apriori implementation."""
    n = len(transactions)
    min_count = min_support * n

    # Convert transactions to sets
    transaction_sets = [set(t) for t in transactions]

    # Find frequent 1-itemsets
    all_items = set(item for t in transactions for item in t)
    frequent = {}

    # Level 1: single items
    for item in all_items:
        count = sum(1 for t in transaction_sets if item in t)
        if count >= min_count:
            frequent[frozenset([item])] = count / n

    k = 2
    current_frequent = list(frequent.keys())

    while current_frequent:
        # Generate candidates of size k
        items = set(item for itemset in current_frequent for item in itemset)
        candidates = [frozenset(c) for c in combinations(items, k)]

        # Prune: remove candidates with infrequent subsets
        candidates = [c for c in candidates
                     if all(frozenset(sub) in frequent
                           for sub in combinations(c, k-1))]

        # Count support
        new_frequent = []
        for candidate in candidates:
            count = sum(1 for t in transaction_sets if candidate.issubset(t))
            if count >= min_count:
                frequent[candidate] = count / n
                new_frequent.append(candidate)

        current_frequent = new_frequent
        k += 1

    return frequent

# Run Apriori
min_support = 0.3
frequent_itemsets = apriori(transactions, min_support)

print(f"Frequent Itemsets (min_support={min_support}):")
for itemset, support in sorted(frequent_itemsets.items(), key=lambda x: (-len(x[0]), -x[1])):
    print(f"  {set(itemset)}: support={support:.2f}")

Using mlxtend for Association Rules

The mlxtend library provides efficient implementations of Apriori and related algorithms:

PYTHON
try:
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules
    from mlxtend.preprocessing import TransactionEncoder

    # Prepare data
    transactions = [
        ['bread', 'milk', 'eggs'],
        ['bread', 'butter'],
        ['milk', 'butter', 'eggs'],
        ['bread', 'milk', 'butter'],
        ['bread', 'milk', 'eggs', 'butter'],
        ['milk', 'eggs'],
        ['bread', 'milk'],
        ['bread', 'butter', 'eggs'],
        ['milk', 'butter'],
        ['bread', 'milk', 'butter', 'eggs']
    ]

    # Encode transactions
    te = TransactionEncoder()
    te_array = te.fit_transform(transactions)
    df = pd.DataFrame(te_array, columns=te.columns_)

    # Find frequent itemsets
    frequent = apriori(df, min_support=0.3, use_colnames=True)
    print("Frequent Itemsets:")
    print(frequent)

    # Generate association rules
    rules = association_rules(frequent, metric='lift', min_threshold=1.0)
    print("\nAssociation Rules (lift >= 1):")
    print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

except ImportError:
    print("mlxtend not installed. Install with: pip install mlxtend")

FP-Growth: A Faster Alternative

FP-Growth (Frequent Pattern Growth) is more efficient than Apriori because it avoids candidate generation. It compresses the database into a compact data structure called an FP-tree, then mines patterns directly from this tree.

FP-Growth typically runs faster than Apriori, especially on large datasets with many frequent itemsets.

PYTHON
try:
    from mlxtend.frequent_patterns import fpgrowth, association_rules
    from mlxtend.preprocessing import TransactionEncoder
    import pandas as pd

    transactions = [
        ['bread', 'milk', 'eggs'],
        ['bread', 'butter'],
        ['milk', 'butter', 'eggs'],
        ['bread', 'milk', 'butter'],
        ['bread', 'milk', 'eggs', 'butter'],
        ['milk', 'eggs'],
        ['bread', 'milk'],
        ['bread', 'butter', 'eggs'],
        ['milk', 'butter'],
        ['bread', 'milk', 'butter', 'eggs']
    ]

    te = TransactionEncoder()
    df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

    # FP-Growth
    frequent = fpgrowth(df, min_support=0.3, use_colnames=True)
    print("FP-Growth Frequent Itemsets:")
    print(frequent)

    rules = association_rules(frequent, metric='confidence', min_threshold=0.6)
    print("\nRules with confidence >= 0.6:")
    print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

except ImportError:
    print("mlxtend not installed. Install with: pip install mlxtend")

Interpreting Rules

Not all statistically significant rules are useful. Consider:

Actionability: Can you act on the rule? "Customers who buy {milk} buy {bread}" is actionable (cross-promotions); "customers who buy {product A} buy {product A}" is not, since it is trivially true.

Domain relevance: Does the rule make sense? A rule might be statistically valid but represent spurious correlation.

Redundancy: Avoid redundant rules. If {A, B} → {C} is strong, do we also need {A} → {C} and {B} → {C}?

Rule interestingness: Beyond lift, other measures like conviction, leverage, and Zhang's metric capture different aspects of interestingness.

PYTHON
def evaluate_rule(support_X, support_Y, support_XY):
    """Calculate various interestingness measures."""
    confidence = support_XY / support_X if support_X > 0 else 0
    lift = confidence / support_Y if support_Y > 0 else 0

    # Leverage: difference between observed and expected
    leverage = support_XY - (support_X * support_Y)

    # Conviction: how often X would occur without Y under independence, relative to how often it actually does
    if confidence < 1:
        conviction = (1 - support_Y) / (1 - confidence)
    else:
        conviction = float('inf')

    return {
        'confidence': confidence,
        'lift': lift,
        'leverage': leverage,
        'conviction': conviction
    }

# Example
metrics = evaluate_rule(0.5, 0.6, 0.4)
print("Rule Interestingness Measures:")
for name, value in metrics.items():
    print(f"  {name}: {value:.3f}")

print("\nInterpretation:")
print("  - Lift > 1: Positive association")
print("  - Leverage > 0: Co-occurrence above independence")
print("  - Conviction > 1: Rule is not due to chance")

Applications Beyond Market Basket

Association rules apply to many domains:

Web mining: Pages frequently visited together, click patterns.

Medical diagnosis: Symptoms that co-occur with diseases.

Bioinformatics: Genes that are expressed together.

Recommendation systems: Items that users frequently rate together.

Fraud detection: Transaction patterns associated with fraud.

PYTHON
# Example: Web page associations
web_sessions = [
    ['home', 'products', 'cart', 'checkout'],
    ['home', 'products', 'product_detail'],
    ['home', 'about', 'contact'],
    ['home', 'products', 'cart', 'checkout', 'confirmation'],
    ['home', 'products', 'product_detail', 'cart'],
    ['home', 'blog', 'about'],
    ['home', 'products', 'cart'],
    ['home', 'products', 'product_detail', 'cart', 'checkout'],
]

print("Web Session Analysis:")
print("Finding pages that lead to checkout...")

# Count checkout sessions
checkout_sessions = [s for s in web_sessions if 'checkout' in s]
non_checkout = [s for s in web_sessions if 'checkout' not in s]

print(f"Sessions with checkout: {len(checkout_sessions)}/{len(web_sessions)}")

# What pages precede checkout?
from collections import Counter
pages_before_checkout = Counter()
for session in checkout_sessions:
    checkout_idx = session.index('checkout')
    for page in session[:checkout_idx]:
        pages_before_checkout[page] += 1

print("\nPages frequently appearing before checkout:")
for page, count in pages_before_checkout.most_common():
    pct = 100 * count / len(checkout_sessions)
    print(f"  {page}: {count} times ({pct:.0f}% of checkout sessions)")

Handling Large Datasets

For large datasets, consider:

Sampling: Mine rules from a representative sample.

Distributed computing: Use Spark's MLlib for distributed Apriori/FP-Growth.

Approximate algorithms: Trade accuracy for speed.

Incremental mining: Update rules as new transactions arrive.

PYTHON
# Sampling approach for large datasets
import random

def sample_transactions(transactions, sample_size):
    """Sample transactions for approximate rule mining."""
    if len(transactions) <= sample_size:
        return transactions
    return random.sample(transactions, sample_size)

# Simulated large dataset
large_transactions = [['item_' + str(random.randint(1, 100)) for _ in range(random.randint(2, 10))]
                      for _ in range(100000)]

print(f"Original dataset: {len(large_transactions)} transactions")

# Sample for faster mining
sample = sample_transactions(large_transactions, 5000)
print(f"Sample size: {len(sample)} transactions")
print("Mine rules from sample, then validate on full dataset")

Practical Guidelines

Start with reasonable thresholds: Begin with min_support=0.01-0.05 and min_confidence=0.5. Adjust based on results.

Filter by lift: Rules with lift < 1 indicate negative association—usually not actionable.

Consider the business context: A rule with low support but high confidence for high-value items may be more valuable than a high-support rule for low-value items.

Validate rules: Statistical significance doesn't guarantee practical significance. Test rules with A/B experiments.

Update regularly: Consumer behavior changes. Re-mine rules periodically.

PYTHON
# Practical workflow
def mine_actionable_rules(transactions, min_support=0.01, min_confidence=0.5, min_lift=1.2):
    """Mine and filter actionable association rules."""
    try:
        from mlxtend.frequent_patterns import fpgrowth, association_rules
        from mlxtend.preprocessing import TransactionEncoder
        import pandas as pd

        te = TransactionEncoder()
        df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

        frequent = fpgrowth(df, min_support=min_support, use_colnames=True)
        rules = association_rules(frequent, metric='confidence', min_threshold=min_confidence)

        # Filter by lift
        actionable = rules[rules['lift'] >= min_lift].copy()

        # Sort by a combined score
        actionable['score'] = actionable['support'] * actionable['lift']
        actionable = actionable.sort_values('score', ascending=False)

        return actionable

    except ImportError:
        print("mlxtend required for this function")
        return None

print("Actionable rules would be filtered by:")
print("  - Minimum support (rules must be common enough)")
print("  - Minimum confidence (rules must be reliable)")
print("  - Minimum lift (rules must show positive association)")

Key Takeaways

  • Association rules discover co-occurrence patterns in transactional data
  • Support measures frequency; confidence measures reliability; lift measures strength of association
  • Lift > 1 indicates positive association; lift = 1 indicates independence
  • Apriori uses the downward closure property to efficiently prune candidates
  • FP-Growth is faster by avoiding candidate generation
  • Not all rules are useful—filter by lift, validate with domain knowledge
  • Applications include market basket analysis, recommendations, web mining, and medical diagnosis
  • For large datasets, use sampling, distributed computing, or approximate algorithms
  • Regularly update rules as patterns change over time