
Chapter 8: Unsupervised Learning

Clustering, dimensionality reduction, and anomaly detection.

Libraries covered: Scikit-learn

Learning Objectives

["Apply k-means and hierarchical clustering", "Reduce dimensions with PCA and t-SNE", "Detect anomalies"]


8.1 Feature Selection


Real-world datasets often contain dozens, hundreds, or even thousands of features. Not all of them are useful. Some features are redundant (highly correlated with others), some are irrelevant (no relationship with the target), and some are noise (random variation that hurts generalization). Feature selection identifies which features to keep and which to discard.

Good feature selection improves model performance, reduces overfitting, speeds up training, and makes models more interpretable. This section covers the three main approaches: filter methods, wrapper methods, and embedded methods.

Why Feature Selection Matters

Adding more features seems like it should help—more information is better, right? But in practice, irrelevant features hurt performance. They add noise that the model might accidentally fit, leading to overfitting. They increase computational cost. They make the model harder to interpret.

The curse of dimensionality makes this worse. As the number of features grows, the data becomes increasingly sparse in the high-dimensional space. Points that seem nearby in low dimensions become far apart when you add irrelevant dimensions. Models need exponentially more data to maintain the same density of coverage.

Feature selection addresses these problems by identifying a subset of features that captures the signal while discarding the noise.

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Create dataset: 10 informative features, 90 noise features
X, y = make_classification(
    n_samples=500,
    n_features=100,
    n_informative=10,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

# Model with all features
model_all = LogisticRegression(max_iter=1000, random_state=42)
scores_all = cross_val_score(model_all, X, y, cv=5)

# Model with only informative features
X_informative = X[:, :10]
model_subset = LogisticRegression(max_iter=1000, random_state=42)
scores_subset = cross_val_score(model_subset, X_informative, y, cv=5)

print("Impact of Irrelevant Features:")
print(f"All 100 features:         {scores_all.mean():.4f} (+/- {scores_all.std()*2:.4f})")
print(f"Only 10 informative:      {scores_subset.mean():.4f} (+/- {scores_subset.std()*2:.4f})")
print("\nFewer features can mean better performance!")

Filter Methods

Filter methods evaluate features independently of any specific model. They compute a score for each feature based on its relationship with the target, then select the top-scoring features. Filter methods are fast and model-agnostic.

The downside is that they ignore feature interactions and don't consider how features work together within a specific model.

Variance Threshold removes features with low variance. If a feature has nearly the same value for all samples, it provides no discriminating power.

PYTHON
from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Create data with some constant features
np.random.seed(42)
X = np.random.randn(100, 10)
X[:, 5] = 1.0  # Constant feature
X[:, 8] = X[:, 8] * 0.01  # Near-constant feature

print("Original feature variances:")
for i, var in enumerate(np.var(X, axis=0)):
    print(f"  Feature {i}: variance = {var:.4f}")

# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)

print(f"\nFeatures before: {X.shape[1]}")
print(f"Features after:  {X_selected.shape[1]}")
print(f"Removed features: {np.where(~selector.get_support())[0]}")

Correlation-based selection removes highly correlated features. When two features are strongly correlated, they provide redundant information. Keeping both adds complexity without adding predictive power.

PYTHON
import numpy as np
import pandas as pd

# Create correlated features
np.random.seed(42)
n_samples = 200

# Independent features
X1 = np.random.randn(n_samples)
X2 = np.random.randn(n_samples)
X3 = np.random.randn(n_samples)

# Correlated features
X4 = X1 + np.random.randn(n_samples) * 0.1  # Highly correlated with X1
X5 = X2 * 0.5 + np.random.randn(n_samples) * 0.5  # Moderately correlated with X2

df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'X4': X4, 'X5': X5})

# Correlation matrix
corr_matrix = df.corr().abs()
print("Correlation Matrix:")
print(corr_matrix.round(3))

# Find highly correlated pairs
def find_correlated_features(corr_matrix, threshold=0.8):
    """Find pairs of features with correlation above threshold."""
    pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if corr_matrix.iloc[i, j] > threshold:
                pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))
    return pairs

high_corr = find_correlated_features(corr_matrix, threshold=0.8)
print(f"\nHighly correlated pairs (>0.8):")
for f1, f2, corr in high_corr:
    print(f"  {f1} and {f2}: {corr:.3f}")

Mutual Information measures how much information a feature provides about the target. Unlike correlation, it captures non-linear relationships. A feature with high mutual information is predictive of the target, even if the relationship isn't linear.

PYTHON
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.datasets import make_classification
import numpy as np

# Create dataset with different types of relationships
np.random.seed(42)
n_samples = 1000

# Linear relationship
X1 = np.random.randn(n_samples)
# Non-linear relationship (quadratic)
X2 = np.random.randn(n_samples)
# No relationship
X3 = np.random.randn(n_samples)

# Target combines a linear effect of X1 and a non-linear (quadratic) effect of X2
y = ((X1 + X2**2) > 2).astype(int)

X = np.column_stack([X1, X2, X3])

# Compute mutual information
mi_scores = mutual_info_classif(X, y, random_state=42)

print("Mutual Information Scores:")
for i, score in enumerate(mi_scores):
    relationship = ["linear", "non-linear (quadratic)", "none"][i]
    print(f"  Feature {i} ({relationship}): {score:.4f}")

print("\nMutual information captures both linear and non-linear relationships.")

Chi-squared test measures dependence between categorical features and a categorical target. It's commonly used for text classification where features are word counts.

PYTHON
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Load text data (subset)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))

# Convert to word counts
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Chi-squared feature selection
chi2_scores, p_values = chi2(X, y)

# Get feature names
feature_names = vectorizer.get_feature_names_out()

# Top features by chi-squared score
top_indices = np.argsort(chi2_scores)[-10:][::-1]

print("Top 10 Features by Chi-Squared Score:")
for idx in top_indices:
    print(f"  {feature_names[idx]:<15}: chi2 = {chi2_scores[idx]:.2f}, p = {p_values[idx]:.2e}")

Wrapper Methods

Wrapper methods evaluate feature subsets by training a model and measuring its performance. They search through possible feature combinations, keeping those that improve the model. Wrapper methods consider feature interactions and are tailored to the specific model being used.

The downside is computational cost—training a model for each candidate subset is expensive. Wrapper methods also risk overfitting to the validation set used for evaluation.

Forward Selection starts with no features and adds one at a time. At each step, it tries adding each remaining feature and keeps the one that improves performance most. This continues until adding features no longer helps.

PYTHON
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
import numpy as np

# Create dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)

# Forward selection
model = LogisticRegression(max_iter=1000, random_state=42)

forward_selector = SequentialFeatureSelector(
    model,
    n_features_to_select=5,
    direction='forward',
    cv=3,
    scoring='accuracy'
)

forward_selector.fit(X, y)
selected_features = np.where(forward_selector.get_support())[0]

print("Forward Selection:")
print(f"Selected features: {selected_features}")

# Compare performance
X_selected = forward_selector.transform(X)
scores_all = cross_val_score(model, X, y, cv=5)
scores_selected = cross_val_score(model, X_selected, y, cv=5)

print(f"All 20 features:      {scores_all.mean():.4f}")
print(f"Selected 5 features:  {scores_selected.mean():.4f}")

Backward Elimination starts with all features and removes one at a time. At each step, it tries removing each feature and removes the one whose absence hurts performance least. This continues until removing features starts hurting performance significantly.

PYTHON
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

# Same dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)

# Backward elimination
model = RandomForestClassifier(n_estimators=50, random_state=42)

backward_selector = SequentialFeatureSelector(
    model,
    n_features_to_select=5,
    direction='backward',
    cv=3,
    scoring='accuracy'
)

backward_selector.fit(X, y)
selected_features = np.where(backward_selector.get_support())[0]

print("Backward Elimination:")
print(f"Selected features: {selected_features}")

# Compare with forward selection results
print("\nNote: Forward and backward selection may select different features")
print("depending on feature interactions and the order of evaluation.")

Recursive Feature Elimination (RFE) trains a model, ranks features by importance, removes the least important, and repeats. It's more efficient than forward/backward selection because it uses the model's own feature importances to decide what to drop, rather than retraining once for every candidate feature at each step.

PYTHON
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

# Create dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)

# RFE with cross-validation to find optimal number of features
model = RandomForestClassifier(n_estimators=50, random_state=42)

rfecv = RFECV(
    estimator=model,
    step=1,          # Remove one feature at a time
    cv=5,
    scoring='accuracy',
    min_features_to_select=1
)

rfecv.fit(X, y)

print("Recursive Feature Elimination with CV:")
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {np.where(rfecv.support_)[0]}")
print(f"Feature ranking: {rfecv.ranking_}")

Embedded Methods

Embedded methods perform feature selection as part of the model training process. They're more efficient than wrappers because selection happens during training rather than requiring separate model evaluations.

L1 Regularization (Lasso) shrinks some coefficients exactly to zero, effectively removing those features from the model. The regularization strength controls how aggressively features are eliminated.

PYTHON
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import numpy as np

# Create regression data with sparse true coefficients
np.random.seed(42)
n_samples, n_features = 200, 50
X = np.random.randn(n_samples, n_features)

# Only 5 features matter
true_coef = np.zeros(n_features)
true_coef[:5] = [3, -2, 1, 0.5, -1.5]
y = X @ true_coef + np.random.randn(n_samples) * 0.5

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal regularization with CV
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)

# Features selected (non-zero coefficients)
selected = np.where(lasso_cv.coef_ != 0)[0]
n_selected = len(selected)

print("L1 Regularization (Lasso) Feature Selection:")
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Features selected: {n_selected} out of {n_features}")
print(f"Selected feature indices: {selected}")
print(f"True informative features: [0, 1, 2, 3, 4]")

# Show coefficients for first 10 features
print("\nCoefficients (first 10 features):")
for i in range(10):
    true = true_coef[i]
    learned = lasso_cv.coef_[i]
    print(f"  Feature {i}: true = {true:6.2f}, learned = {learned:6.2f}")

Tree-based feature importance measures how much each feature contributes to reducing impurity (for classification) or variance (for regression) across all trees in an ensemble. Features that frequently appear in splits near the root are considered more important.

PYTHON
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
import numpy as np

# Create dataset
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    random_state=42
)

# Random Forest importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Sort by importance
importance_rf = rf.feature_importances_
indices_rf = np.argsort(importance_rf)[::-1]

print("Random Forest Feature Importance:")
print(f"{'Rank':<6} {'Feature':<10} {'Importance':<12}")
print("-" * 28)
for rank, idx in enumerate(indices_rf[:10], 1):
    print(f"{rank:<6} Feature {idx:<3} {importance_rf[idx]:<12.4f}")

# Select top features
threshold = 0.05
selected = np.where(importance_rf > threshold)[0]
print(f"\nFeatures with importance > {threshold}: {selected}")

Permutation Importance measures feature importance by shuffling each feature and observing how much model performance degrades. Unlike built-in importance, it works with any model and measures actual predictive contribution rather than how often a feature is used.

PYTHON
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np

# Create dataset
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compute permutation importance on test set
perm_importance = permutation_importance(
    model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

# Sort by importance
sorted_idx = perm_importance.importances_mean.argsort()[::-1]

print("Permutation Importance (on test set):")
print(f"{'Rank':<6} {'Feature':<10} {'Importance':<12} {'Std':<10}")
print("-" * 38)
for rank, idx in enumerate(sorted_idx[:10], 1):
    mean = perm_importance.importances_mean[idx]
    std = perm_importance.importances_std[idx]
    print(f"{rank:<6} Feature {idx:<3} {mean:<12.4f} {std:<10.4f}")

Comparing Selection Methods

Different methods have different strengths. Filter methods are fast but ignore interactions. Wrapper methods consider interactions but are slow. Embedded methods balance both concerns.

PYTHON
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
import numpy as np
import time

# Create dataset
X, y = make_classification(
    n_samples=500,
    n_features=50,
    n_informative=10,
    random_state=42
)

# Scale for Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

methods = {}

# Filter: Mutual Information
start = time.time()
mi_selector = SelectKBest(mutual_info_classif, k=10)
X_mi = mi_selector.fit_transform(X, y)
methods['Filter (MI)'] = {
    'features': np.where(mi_selector.get_support())[0],
    'time': time.time() - start,
    'X_selected': X_mi
}

# Wrapper: RFE
start = time.time()
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
methods['Wrapper (RFE)'] = {
    'features': np.where(rfe.support_)[0],
    'time': time.time() - start,
    'X_selected': X_rfe
}

# Embedded: Lasso
start = time.time()
lasso = LassoCV(cv=3)
lasso.fit(X_scaled, y)
# Get top 10 by absolute coefficient
top_10 = np.argsort(np.abs(lasso.coef_))[-10:]
X_lasso = X_scaled[:, top_10]
methods['Embedded (Lasso)'] = {
    'features': top_10,
    'time': time.time() - start,
    'X_selected': X_lasso
}

# Compare
print("Feature Selection Method Comparison:")
print(f"{'Method':<20} {'Time (s)':<12} {'CV Accuracy':<12}")
print("-" * 44)

model = LogisticRegression(max_iter=1000)
for name, data in methods.items():
    score = cross_val_score(model, data['X_selected'], y, cv=5).mean()
    print(f"{name:<20} {data['time']:<12.4f} {score:<12.4f}")

# Baseline with all features
score_all = cross_val_score(model, X, y, cv=5).mean()
print(f"{'All features':<20} {'-':<12} {score_all:<12.4f}")

Practical Guidelines

Start simple: Begin with variance threshold and correlation analysis to remove obvious noise and redundancy.

Match method to scale: Filter methods for high-dimensional data (fast). Wrapper/embedded methods for moderate dimensions (more accurate).

Use cross-validation: Always evaluate feature selection within cross-validation to avoid overfitting to the feature selection process.

Consider domain knowledge: Statistical methods complement but don't replace human understanding. A feature with low statistical importance might still be essential for interpretability.

Beware of data leakage: Feature selection must happen inside cross-validation folds, not before. Selecting features on the full dataset before splitting leaks information.

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# CORRECT: Feature selection inside pipeline (inside CV)
pipeline = Pipeline([
    ('select', SelectKBest(mutual_info_classif, k=10)),
    ('classify', LogisticRegression(max_iter=1000))
])

# Cross-validation will apply feature selection separately on each fold
scores = cross_val_score(pipeline, X, y, cv=5)
print("Correct approach (selection inside CV):")
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

# WRONG: Feature selection before cross-validation
# selector = SelectKBest(mutual_info_classif, k=10)
# X_selected = selector.fit_transform(X, y)  # Sees all data!
# scores_wrong = cross_val_score(LogisticRegression(), X_selected, y, cv=5)
# This gives optimistically biased results

print("\nAlways use pipelines to prevent data leakage in feature selection!")

Key Takeaways

  • Feature selection removes irrelevant, redundant, and noisy features
  • Too many features leads to overfitting and increased computational cost
  • Filter methods (variance, correlation, mutual information) are fast and model-agnostic
  • Wrapper methods (forward/backward selection, RFE) consider interactions but are slower
  • Embedded methods (L1 regularization, tree importance) perform selection during training
  • Permutation importance measures actual predictive contribution for any model
  • Always perform feature selection inside cross-validation to prevent leakage
  • Combine statistical methods with domain knowledge for best results
  • Use pipelines to keep feature selection inside the training process

8.2 Feature Extraction


Feature selection keeps a subset of existing features. Feature extraction creates new features by transforming or combining the original ones. The goal is to represent the data more effectively—capturing the essential information in fewer dimensions or in a form more suitable for the learning algorithm.

Feature extraction is particularly valuable when dealing with high-dimensional data (images, text, time series) or when the original features don't directly represent the patterns you want to learn.

Selection vs Extraction

Feature selection answers: "Which of these features should I use?"

Feature extraction answers: "What new features can I create that better represent the data?"

Selection keeps interpretability—you know which original features matter. Extraction often sacrifices interpretability for performance—the new features are combinations that may not have intuitive meaning.

Both approaches reduce dimensionality, but extraction can capture patterns that no single original feature represents.

PYTHON
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load digit images (64 features = 8x8 pixels)
digits = load_digits()
X, y = digits.data, digits.target

# Feature Selection: keep 10 best original pixels
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Feature Extraction: create 10 principal components
pca = PCA(n_components=10)
X_extracted = pca.fit_transform(X)

# Compare
model = LogisticRegression(max_iter=5000, random_state=42)

score_all = cross_val_score(model, X, y, cv=5).mean()
score_selected = cross_val_score(model, X_selected, y, cv=5).mean()
score_extracted = cross_val_score(model, X_extracted, y, cv=5).mean()

print("Selection vs Extraction (Digit Recognition):")
print(f"All 64 features:         {score_all:.4f}")
print(f"10 selected pixels:      {score_selected:.4f}")
print(f"10 PCA components:       {score_extracted:.4f}")
print("\nExtraction often captures more information in fewer dimensions.")

Principal Component Analysis (PCA)

PCA finds new axes (principal components) that capture the maximum variance in the data. The first component points in the direction of highest variance, the second in the direction of second-highest variance (orthogonal to the first), and so on.

By keeping only the top components, you reduce dimensionality while preserving most of the data's variance. PCA is unsupervised—it doesn't use labels, only the structure of the feature space.

PCA works best when the data's variance corresponds to its information content. This isn't always true (a noisy feature has high variance but low information), but it's a reasonable assumption in many cases.

PYTHON
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load data
digits = load_digits()
X = digits.data  # 64 features

# Fit PCA
pca = PCA()
pca.fit(X)

# Explained variance ratio
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

print("PCA Explained Variance:")
print(f"First component:    {explained_var[0]:.1%}")
print(f"First 5 components: {cumulative_var[4]:.1%}")
print(f"First 10 components: {cumulative_var[9]:.1%}")
print(f"First 20 components: {cumulative_var[19]:.1%}")

# How many components for 95% variance?
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_95} (out of {X.shape[1]})")

Choosing the Number of Components

There's no single right answer for how many components to keep. Common approaches:

Explained variance threshold: Keep enough components to explain a target percentage (e.g., 95%) of variance.

Elbow method: Plot explained variance vs number of components and look for an "elbow" where adding more components yields diminishing returns.

Cross-validation: Choose the number that maximizes downstream model performance.

PYTHON
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Test different numbers of components
n_components_range = [5, 10, 20, 30, 40, 50, 64]
scores = []

for n in n_components_range:
    pipeline = Pipeline([
        ('pca', PCA(n_components=n)),
        ('clf', LogisticRegression(max_iter=5000, random_state=42))
    ])
    score = cross_val_score(pipeline, X, y, cv=5).mean()
    scores.append(score)

print("Cross-Validation Accuracy by Number of PCA Components:")
print(f"{'Components':<12} {'Accuracy':<12}")
print("-" * 24)
for n, score in zip(n_components_range, scores):
    print(f"{n:<12} {score:.4f}")

best_n = n_components_range[np.argmax(scores)]
print(f"\nBest number of components: {best_n}")

Linear Discriminant Analysis (LDA)

Unlike PCA (unsupervised), LDA is supervised—it uses class labels to find components that maximize class separability. LDA projects data onto axes that maximize the ratio of between-class variance to within-class variance.

LDA produces at most (number of classes - 1) components. For binary classification, that's just one dimension. This makes LDA particularly useful when you want to reduce to very few dimensions while preserving discriminative information.

PYTHON
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load wine data (3 classes, 13 features)
wine = load_wine()
X, y = wine.data, wine.target

# Compare PCA and LDA
n_components = 2  # LDA can produce at most 2 for 3 classes

# PCA (unsupervised)
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)

# LDA (supervised)
lda = LinearDiscriminantAnalysis(n_components=n_components)
X_lda = lda.fit_transform(X, y)

# Evaluate
model = LogisticRegression(max_iter=5000, random_state=42)

score_original = cross_val_score(model, X, y, cv=5).mean()
score_pca = cross_val_score(model, X_pca, y, cv=5).mean()
score_lda = cross_val_score(model, X_lda, y, cv=5).mean()

print("PCA vs LDA (Wine Dataset):")
print(f"Original (13 features): {score_original:.4f}")
print(f"PCA (2 components):     {score_pca:.4f}")
print(f"LDA (2 components):     {score_lda:.4f}")
print("\nLDA uses class information to find better projections for classification.")

t-SNE for Visualization

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear technique designed for visualization. It maps high-dimensional data to 2D or 3D while preserving local structure—points that are similar in the original space remain similar in the reduced space.

t-SNE is not suitable for feature extraction in a pipeline because it's computationally expensive, non-deterministic, and doesn't provide a transformation that can be applied to new data. Use it only for visualization and exploration.

PYTHON
import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

# Load digit images
digits = load_digits()
X, y = digits.data, digits.target

# Apply t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    n_iter=1000
)

X_tsne = tsne.fit_transform(X)

print("t-SNE Visualization:")
print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions:  {X_tsne.shape[1]}")
print("\nt-SNE reveals cluster structure for visualization.")
print("Note: t-SNE is for visualization only, not for pipeline feature extraction.")

Text Feature Extraction

Text data requires special feature extraction. Raw text can't be fed directly to most models—it must be converted to numerical features.

Bag of Words represents documents as vectors of word counts. Each unique word becomes a feature.

TF-IDF (Term Frequency-Inverse Document Frequency) weights words by their importance. Common words get lower weights; distinctive words get higher weights.

PYTHON
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing handles text data.",
    "Computer vision processes images and videos."
]

# Bag of Words
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(documents)

print("Bag of Words:")
print(f"Documents: {len(documents)}")
print(f"Vocabulary size: {len(bow_vectorizer.get_feature_names_out())}")
print(f"Feature matrix shape: {X_bow.shape}")

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

print("\nTF-IDF:")
print(f"Feature matrix shape: {X_tfidf.shape}")

Word Embeddings

Word embeddings represent words as dense vectors where similar words have similar vectors. Unlike TF-IDF (sparse, high-dimensional), embeddings are dense and low-dimensional (typically 50-300 dimensions).

Pre-trained embeddings (Word2Vec, GloVe, FastText) capture semantic relationships learned from massive text corpora. The famous example: king - man + woman = queen.

PYTHON
import numpy as np

# Simulating word embeddings (in practice, load pre-trained)
embeddings = {
    'king': np.array([0.8, 0.2, 0.9]),
    'queen': np.array([0.7, 0.8, 0.9]),
    'man': np.array([0.9, 0.1, 0.2]),
    'woman': np.array([0.8, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Word Embedding Similarities:")
pairs = [('king', 'queen'), ('king', 'man')]
for w1, w2 in pairs:
    sim = cosine_similarity(embeddings[w1], embeddings[w2])
    print(f"  {w1} - {w2}: {sim:.4f}")

Image Feature Extraction

Images are high-dimensional (hundreds of thousands of pixels) but the raw pixel values are often not the best features. Feature extraction creates more useful representations.

Traditional methods like HOG (Histogram of Oriented Gradients) extract edge and texture patterns.

Deep learning features use pre-trained CNNs as feature extractors. The activations from intermediate layers capture hierarchical visual patterns.

PYTHON
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X = digits.data

def extract_gradient_features(images):
    n_samples = images.shape[0]
    images_2d = images.reshape(n_samples, 8, 8)
    features = []
    for img in images_2d:
        h_grad = np.abs(np.diff(img, axis=1)).mean()
        v_grad = np.abs(np.diff(img, axis=0)).mean()
        features.append([h_grad, v_grad])
    return np.array(features)

X_gradient = extract_gradient_features(X)

print("Image Feature Extraction:")
print(f"Original pixels: {X.shape[1]}")
print(f"Gradient features: {X_gradient.shape[1]}")

Domain-Specific Feature Engineering

Often the most powerful features come from domain knowledge.

Time series: Lag features, rolling statistics, Fourier coefficients, trend indicators.

Tabular data: Ratios, interactions, polynomial features, binning.

PYTHON
import numpy as np
import pandas as pd

np.random.seed(42)
prices = 100 + np.cumsum(np.random.randn(100) * 2)
df = pd.DataFrame({'price': prices})

df['lag1'] = df['price'].shift(1)
df['ma5'] = df['price'].rolling(5).mean()
df['momentum'] = df['price'] - df['price'].shift(5)

print("Domain-Specific Features:")
for col in df.columns:
    print(f"  - {col}")

Pipeline Integration

Feature extraction should be part of your pipeline to prevent data leakage.

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', LogisticRegression(max_iter=5000))
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

Key Takeaways

  • Feature extraction creates new features by transforming existing ones
  • PCA finds orthogonal directions of maximum variance (unsupervised)
  • LDA finds directions that maximize class separation (supervised)
  • t-SNE/UMAP are for visualization and exploring cluster structure
  • Text features: Bag of Words, TF-IDF, Word Embeddings
  • Image features: Raw pixels, HOG, Deep learning (CNN) features
  • Domain knowledge often yields the most powerful features
  • Transfer learning uses pre-trained models as feature extractors
  • Always include feature extraction in pipelines to prevent data leakage

8.3 Handling Missing Data


Real-world data is messy. Sensors fail, users skip questions, records get corrupted, and databases have gaps. Missing data is the norm, not the exception. How you handle these gaps significantly affects model performance—ignoring the problem or handling it poorly can introduce bias, reduce statistical power, or lead to models that fail in production.

This section covers why data goes missing, how to detect it, and strategies for dealing with it.

Why Data Goes Missing

Understanding why data is missing helps you choose the right strategy. Statisticians categorize missing data into three types:

Missing Completely at Random (MCAR): The probability of data being missing is unrelated to any observed or unobserved data. A sensor randomly fails regardless of what it's measuring. MCAR is the least problematic case—the missing data is essentially a random sample of the complete data.

Missing at Random (MAR): The probability of data being missing depends on observed data but not on the missing values themselves. Older patients might skip certain questions more often, but within any age group, missingness is random. MAR can be handled well if you condition on the related variables.

Missing Not at Random (MNAR): The probability of data being missing depends on the missing values themselves. People with high incomes might be less likely to report their income. MNAR is the hardest case—the missingness pattern contains information about the missing values.

PYTHON
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000

# Create complete data
data = {
    'age': np.random.randint(20, 80, n),
    'income': np.random.lognormal(10, 1, n),
    'satisfaction': np.random.randint(1, 11, n)
}
df = pd.DataFrame(data)

# MCAR: Random 10% missing
df_mcar = df.copy()
mask_mcar = np.random.random(n) < 0.1
df_mcar.loc[mask_mcar, 'income'] = np.nan

# MAR: Older people more likely to have missing income
df_mar = df.copy()
prob_missing = (df['age'] - 20) / 100
mask_mar = np.random.random(n) < prob_missing
df_mar.loc[mask_mar, 'income'] = np.nan

# MNAR: High earners more likely to hide income
df_mnar = df.copy()
prob_missing = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())
mask_mnar = np.random.random(n) < prob_missing * 0.3
df_mnar.loc[mask_mnar, 'income'] = np.nan

print("Missing Data Patterns:")
print(f"MCAR: {df_mcar['income'].isna().sum()} missing ({df_mcar['income'].isna().mean():.1%})")
print(f"MAR:  {df_mar['income'].isna().sum()} missing ({df_mar['income'].isna().mean():.1%})")
print(f"MNAR: {df_mnar['income'].isna().sum()} missing ({df_mnar['income'].isna().mean():.1%})")

Detecting Missing Data

Before handling missing data, you need to understand its extent and pattern.

PYTHON
import numpy as np
import pandas as pd

# Sample dataset with missing values
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(20, 70, 100),
    'income': np.random.choice([np.nan, 50000, 75000, 100000], 100, p=[0.15, 0.3, 0.35, 0.2]),
    'education': np.random.choice(np.array(['HS', 'BS', 'MS', np.nan], dtype=object), 100, p=[0.3, 0.35, 0.25, 0.1]),
    'score': np.random.choice([np.nan, 60, 70, 80, 90], 100, p=[0.2, 0.2, 0.25, 0.25, 0.1])
})

# Summary of missing data
print("Missing Data Summary:")
print(df.isnull().sum())
print(f"\nTotal cells: {df.size}")
print(f"Missing cells: {df.isnull().sum().sum()}")
print(f"Missing percentage: {df.isnull().sum().sum() / df.size:.1%}")

# Rows with any missing data
print(f"\nRows with any missing: {df.isnull().any(axis=1).sum()}")
print(f"Complete rows: {df.dropna().shape[0]}")

Strategy 1: Deletion

The simplest approach is to remove rows or columns with missing values. This is appropriate when missing data is MCAR and the proportion is small.

Listwise deletion (complete case analysis) removes any row with missing values.

Column deletion removes features with too many missing values.

PYTHON
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'A': np.random.choice([1, 2, 3, np.nan], 100, p=[0.3, 0.3, 0.35, 0.05]),
    'B': np.random.choice([10, 20, np.nan], 100, p=[0.4, 0.35, 0.25]),
    'C': np.random.choice([100, 200, 300, np.nan], 100, p=[0.25, 0.25, 0.25, 0.25]),
    'target': np.random.randint(0, 2, 100)
})

print("Original data:")
print(f"Shape: {df.shape}")
print(f"Missing per column: {df.isnull().sum().to_dict()}")

# Listwise deletion
df_listwise = df.dropna()
print(f"\nAfter listwise deletion: {df_listwise.shape}")

# Drop columns with >20% missing
threshold = 0.2
cols_to_keep = df.columns[df.isnull().mean() < threshold]
df_colwise = df[cols_to_keep]
print(f"After column deletion (>{threshold:.0%} missing): {df_colwise.shape}")

print("\nDeletion is simple but loses information.")
print("Use only when missing is MCAR and proportion is small.")

Strategy 2: Simple Imputation

Imputation fills in missing values with estimated values. Simple imputation uses basic statistics from the observed data.

Mean/Median imputation replaces missing values with the column mean (continuous) or median (robust to outliers).

Mode imputation replaces missing values with the most frequent value (categorical).

Constant imputation replaces with a fixed value (e.g., 0, "Unknown").

PYTHON
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.choice([25, 35, 45, 55, np.nan], 20, p=[0.2, 0.25, 0.25, 0.2, 0.1]),
    'income': np.random.choice([30000, 50000, 80000, np.nan], 20, p=[0.3, 0.35, 0.2, 0.15]),
    'category': np.random.choice(np.array(['A', 'B', 'C', np.nan], dtype=object), 20, p=[0.35, 0.35, 0.2, 0.1])
})

print("Before imputation:")
print(df.head(10))

# Numeric imputation with mean
imputer_mean = SimpleImputer(strategy='mean')
df_numeric = df[['age', 'income']]
df_imputed_numeric = pd.DataFrame(
    imputer_mean.fit_transform(df_numeric),
    columns=df_numeric.columns
)

# Categorical imputation with mode
imputer_mode = SimpleImputer(strategy='most_frequent')
df_cat = df[['category']]
df_imputed_cat = pd.DataFrame(
    imputer_mode.fit_transform(df_cat),
    columns=df_cat.columns
)

print("\nAfter mean/mode imputation:")
print(pd.concat([df_imputed_numeric, df_imputed_cat], axis=1).head(10))
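
Constant imputation works the same way through SimpleImputer; a short sketch (fill_value=0 and "Unknown" are illustrative choices):

PYTHON
import numpy as np
from sklearn.impute import SimpleImputer

X_num = np.array([[1.0], [np.nan], [3.0]])
X_cat = np.array([['A'], [np.nan], ['B']], dtype=object)

# Replace missing numeric values with 0 and missing categories with "Unknown"
print(SimpleImputer(strategy='constant', fill_value=0).fit_transform(X_num).ravel())
print(SimpleImputer(strategy='constant', fill_value='Unknown').fit_transform(X_cat).ravel())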

Problems with Simple Imputation

Simple imputation has significant drawbacks:

Reduced variance: Imputing with the mean creates values at the center, underestimating the true spread.

Distorted correlations: The relationship between features gets weakened because imputed values don't reflect the actual covariance structure.

Biased estimates: If data is MAR or MNAR, simple imputation introduces systematic bias.

PYTHON
import numpy as np
import pandas as pd

np.random.seed(42)

# True data: age and income are correlated
n = 500
age = np.random.normal(40, 10, n)
income = 1000 * age + np.random.normal(0, 5000, n)

df_true = pd.DataFrame({'age': age, 'income': income})

# Create missing data (MAR: older people more likely missing income)
df_missing = df_true.copy()
prob_missing = (df_true['age'] - df_true['age'].min()) / (df_true['age'].max() - df_true['age'].min())
mask = np.random.random(n) < prob_missing * 0.3
df_missing.loc[mask, 'income'] = np.nan

# Mean imputation
df_imputed = df_missing.copy()
df_imputed['income'] = df_imputed['income'].fillna(df_imputed['income'].mean())

print("Impact of Mean Imputation:")
print(f"\nTrue correlation: {df_true['age'].corr(df_true['income']):.3f}")
print(f"After imputation: {df_imputed['age'].corr(df_imputed['income']):.3f}")

print(f"\nTrue income std: {df_true['income'].std():.0f}")
print(f"After imputation: {df_imputed['income'].std():.0f}")

print("\nMean imputation understates variance and weakens correlations.")

Strategy 3: Multivariate Imputation

Multivariate imputation uses relationships between features to predict missing values. Instead of using just the column mean, it uses other columns as predictors.

Iterative imputation (MICE - Multiple Imputation by Chained Equations) models each feature with missing values as a function of other features, iterating until convergence.

PYTHON
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

np.random.seed(42)

# Correlated features
n = 200
age = np.random.normal(40, 10, n)
income = 1000 * age + np.random.normal(0, 5000, n)
education_years = 0.2 * age + np.random.normal(12, 2, n)

df = pd.DataFrame({
    'age': age,
    'income': income,
    'education': education_years
})

# Create missing values
df_missing = df.copy()
df_missing.loc[np.random.choice(n, 30, replace=False), 'income'] = np.nan
df_missing.loc[np.random.choice(n, 20, replace=False), 'education'] = np.nan

# Simple imputation
from sklearn.impute import SimpleImputer
simple_imputer = SimpleImputer(strategy='mean')
df_simple = pd.DataFrame(simple_imputer.fit_transform(df_missing), columns=df.columns)

# Iterative imputation
iter_imputer = IterativeImputer(random_state=42, max_iter=10)
df_iterative = pd.DataFrame(iter_imputer.fit_transform(df_missing), columns=df.columns)

print("Comparison of Imputation Methods:")
print(f"\nTrue age-income correlation:      {df['age'].corr(df['income']):.3f}")
print(f"Simple imputation correlation:    {df_simple['age'].corr(df_simple['income']):.3f}")
print(f"Iterative imputation correlation: {df_iterative['age'].corr(df_iterative['income']):.3f}")

print("\nIterative imputation better preserves relationships.")

Strategy 4: KNN Imputation

K-Nearest Neighbors imputation finds the k most similar complete rows and uses their values to impute the missing ones. This captures local patterns in the data.

PYTHON
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

np.random.seed(42)

# Create data with cluster structure
n = 200
cluster = np.random.choice([0, 1], n, p=[0.5, 0.5])
feature1 = np.where(cluster == 0, np.random.normal(10, 2, n), np.random.normal(20, 2, n))
feature2 = np.where(cluster == 0, np.random.normal(100, 10, n), np.random.normal(200, 10, n))

df = pd.DataFrame({'f1': feature1, 'f2': feature2, 'cluster': cluster})

# Add missing values
df_missing = df.copy()
df_missing.loc[np.random.choice(n, 30, replace=False), 'f2'] = np.nan

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df_missing[['f1', 'f2']]), columns=['f1', 'f2'])
df_knn['cluster'] = df_missing['cluster']

# Compare to simple mean imputation
df_mean = df_missing.copy()
df_mean['f2'] = df_mean['f2'].fillna(df_mean['f2'].mean())

print("KNN vs Mean Imputation with Cluster Structure:")
print(f"\nTrue f2 std: {df['f2'].std():.1f}")
print(f"Mean imputation f2 std: {df_mean['f2'].std():.1f}")
print(f"KNN imputation f2 std: {df_knn['f2'].std():.1f}")

print("\nKNN better preserves cluster structure.")

Strategy 5: Indicator Variables

Sometimes the fact that data is missing carries information. You can add a binary indicator variable that flags whether each value was missing before imputation.

PYTHON
import numpy as np
import pandas as pd

np.random.seed(42)
n = 100

# Income missing more often for self-employed
employed = np.random.choice([0, 1], n, p=[0.3, 0.7])
income = np.where(employed == 1, np.random.normal(60000, 15000, n), np.random.normal(50000, 20000, n))

# Self-employed more likely to have missing income (MNAR)
prob_missing = np.where(employed == 0, 0.4, 0.1)
mask = np.random.random(n) < prob_missing
income_with_missing = income.copy()
income_with_missing[mask] = np.nan

df = pd.DataFrame({
    'employed': employed,
    'income': income_with_missing
})

# Add missing indicator
df['income_missing'] = df['income'].isna().astype(int)

# Impute the missing values
df['income_imputed'] = df['income'].fillna(df['income'].mean())

print("Missing Indicator Variables:")
print(df.head(15))

print(f"\nMissing rate by employment:")
print(f"Employed:     {df[df['employed']==1]['income'].isna().mean():.1%}")
print(f"Self-employed: {df[df['employed']==0]['income'].isna().mean():.1%}")

print("\nThe indicator captures the relationship between employment and missingness.")

Imputation in Pipelines

Imputation must happen inside your pipeline to prevent data leakage. The imputation statistics (mean, median, etc.) should be learned from training data only.

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
import numpy as np

# Create data with missing values
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
mask = np.random.random(X.shape) < 0.1
X[mask] = np.nan

# Pipeline handles imputation inside CV
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

scores = cross_val_score(pipeline, X, y, cv=5)
print("Imputation in Pipeline:")
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

print("\nPipeline ensures imputation stats are learned only from training folds.")

Practical Guidelines

Understand the mechanism: Investigate why data is missing. Different mechanisms require different strategies.

Examine patterns: Look at which combinations of values are missing together. Use visualization and summary statistics (a short example follows this list).

Don't over-impute: If too much data is missing, imputation estimates become unreliable. Consider dropping features or rows.

Use appropriate methods: Simple imputation for MCAR, multivariate methods for MAR, be cautious with MNAR.

Add missing indicators: Especially when missingness itself may be informative.

Include in pipeline: Always impute inside the training process to prevent leakage.

Validate: Compare model performance with different imputation strategies.
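
A quick way to examine missingness patterns (mentioned above) is to count how often each combination of missing columns occurs. A small pandas sketch on made-up data:

PYTHON
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.choice([30.0, 40.0, np.nan], 200, p=[0.45, 0.45, 0.1]),
    'income': np.random.choice([50000.0, 80000.0, np.nan], 200, p=[0.4, 0.4, 0.2]),
})

# Each row of df.isnull() is a True/False missingness pattern;
# value_counts() tallies how often each pattern appears
print(df.isnull().value_counts())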

PYTHON
print("Quick Reference - Handling Missing Data:")
print("\n" + "="*60)
print(f"{'Scenario':<30} {'Recommended Approach':<30}")
print("="*60)
print(f"{'< 5% missing, MCAR':<30} {'Listwise deletion or mean':<30}")
print(f"{'5-20% missing, MCAR':<30} {'Mean/median imputation':<30}")
print(f"{'Any amount, MAR':<30} {'Multivariate imputation':<30}")
print(f"{'Cluster structure':<30} {'KNN imputation':<30}")
print(f"{'Missingness informative':<30} {'Add indicator variables':<30}")
print(f"{'> 50% missing in column':<30} {'Consider dropping column':<30}")
print("="*60)

Key Takeaways

  • Missing data is common—you need a strategy
  • Understand the missing mechanism: MCAR, MAR, or MNAR
  • Deletion is simple but loses information; use for small MCAR
  • Simple imputation (mean/median/mode) is fast but distorts variance and correlations
  • Multivariate imputation uses feature relationships for better estimates
  • KNN imputation captures local structure
  • Missing indicators preserve information when missingness is informative
  • Always impute inside pipelines to prevent data leakage
  • Validate your imputation strategy with downstream model performance

8.4 Encoding Categorical Variables


Machine learning algorithms work with numbers, but real-world data includes categories. Converting these categorical variables into numerical form is called encoding.

Types of Categorical Variables

Nominal variables have no inherent order. Colors, countries, or product categories are nominal.

Ordinal variables have a meaningful order. Education levels or satisfaction ratings are ordinal.
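
For ordinal variables, the usual tool is scikit-learn's OrdinalEncoder with an explicit category order. A minimal sketch (the education levels are illustrative):

PYTHON
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({'education': ['HS', 'MS', 'BS', 'HS', 'PhD']})

# Pass the order explicitly so HS < BS < MS < PhD maps to 0 < 1 < 2 < 3
encoder = OrdinalEncoder(categories=[['HS', 'BS', 'MS', 'PhD']])
df['encoded'] = encoder.fit_transform(df[['education']]).ravel()
print(df)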

Label Encoding

Label encoding assigns a unique integer to each category. It's simple but implies an ordering.

PYTHON
from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['color'])
print(df)

One-Hot Encoding

One-hot encoding creates a binary column for each category. It's the standard choice for nominal variables.

PYTHON
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['color']])
print(encoded)

Target Encoding

Target encoding replaces each category with the mean of the target variable computed over that category.

PYTHON
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'city': np.random.choice(['NYC', 'LA', 'Chicago'], 100),
    'target': np.random.randint(0, 2, 100)
})
target_means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(target_means)
print(df.head())
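
The naive version above computes category means on the full dataset, which leaks target information and overfits rare categories. A common remedy is smoothing: blend each category's mean toward the global mean in proportion to how few rows the category has. A rough sketch (the smoothing weight m=10 is an arbitrary choice, and in practice the encoding should be fit only on training folds):

PYTHON
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    'city': np.random.choice(['NYC', 'LA', 'Chicago'], 100),
    'target': np.random.randint(0, 2, 100)
})

global_mean = df['target'].mean()
stats = df.groupby('city')['target'].agg(['mean', 'count'])

# Smoothed estimate: (count * category_mean + m * global_mean) / (count + m)
m = 10
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['city_encoded'] = df['city'].map(smoothed)
print(smoothed.round(3))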

Encoding in Pipelines

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(20, 60, 500),
    'city': np.random.choice(['NYC', 'LA', 'Chicago'], 500),
    'target': np.random.randint(0, 2, 500)
})

X = df.drop('target', axis=1)
y = df['target']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Accuracy: {scores.mean():.4f}")

Key Takeaways

  • Nominal variables: Use one-hot encoding
  • Ordinal variables: Use ordinal encoding with explicit order
  • Label encoding implies order—problematic for nominal variables
  • Target encoding for high cardinality (with regularization)
  • Include encoding in pipelines to prevent leakage

8.5 Feature Scaling


Features in real-world datasets often have different scales. Age might range from 0 to 100, income from 20,000 to 500,000, and temperature from -40 to 120. Many machine learning algorithms are sensitive to these scale differences—features with larger ranges can dominate the learning process, even if they're not more important.

Feature scaling transforms features to comparable ranges. This section covers why scaling matters, the main scaling methods, and when to use each.

Why Scaling Matters

Gradient-based algorithms (neural networks, logistic regression, SVM) update parameters proportionally to feature values. Large-scale features produce large gradients, dominating the optimization and potentially causing slow or unstable convergence.

Distance-based algorithms (k-NN, K-means, SVM with RBF kernel) compute distances between points. A feature with range 0-1000 will dominate distance calculations over a feature with range 0-1.

Regularization penalizes large coefficients. If features are on different scales, regularization affects them unequally.

PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X[:, 0] = X[:, 0] * 1000  # Scale first feature to large range

# Without scaling
model = LogisticRegression(max_iter=100, random_state=42)
score_unscaled = cross_val_score(model, X, y, cv=5).mean()

# With scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
score_scaled = cross_val_score(model, X_scaled, y, cv=5).mean()

print(f"Without scaling: {score_unscaled:.4f}")
print(f"With scaling:    {score_scaled:.4f}")

Standardization (Z-Score Normalization)

Standardization transforms features to have zero mean and unit variance:

$$z = \frac{x - \mu}{\sigma}$$

After standardization, features have mean 0 and standard deviation 1. This is the most common scaling method.

PYTHON
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1, 100], [2, 200], [3, 300], [4, 400], [5, 500]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original data:")
print(f"Mean: {X.mean(axis=0)}, Std: {X.std(axis=0)}")
print("\nAfter standardization:")
print(f"Mean: {X_scaled.mean(axis=0).round(2)}, Std: {X_scaled.std(axis=0).round(2)}")

Min-Max Normalization

Min-Max scaling transforms features to a fixed range, typically [0, 1]:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

PYTHON
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[1, 100], [2, 200], [3, 300], [4, 400], [5, 500]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print("After Min-Max scaling:")
print(f"Min: {X_scaled.min(axis=0)}, Max: {X_scaled.max(axis=0)}")
print(X_scaled)

Robust Scaling

Robust scaling uses median and IQR instead of mean and std, making it robust to outliers:

$$x_{scaled} = \frac{x - \text{median}}{\text{IQR}}$$

PYTHON
from sklearn.preprocessing import RobustScaler, StandardScaler
import numpy as np

# Data with outliers
X = np.array([[1], [2], [3], [4], [5], [100]])  # 100 is an outlier

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

print("Data with outlier (100):")
print(f"StandardScaler range: [{standard.min():.2f}, {standard.max():.2f}]")
print(f"RobustScaler range:   [{robust.min():.2f}, {robust.max():.2f}]")
print("\nRobust scaling is less affected by the outlier.")

When Not to Scale

Tree-based models (Decision Trees, Random Forest, Gradient Boosting) are invariant to feature scaling. They make splits based on thresholds, not distances or gradients.

Naive Bayes treats features independently and doesn't benefit from scaling.

PYTHON
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X[:, 0] = X[:, 0] * 1000  # Large scale feature

# Random Forest - scaling doesn't matter
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_unscaled = cross_val_score(rf, X, y, cv=5).mean()
rf_scaled = cross_val_score(rf, StandardScaler().fit_transform(X), y, cv=5).mean()

print("Random Forest (tree-based):")
print(f"Unscaled: {rf_unscaled:.4f}, Scaled: {rf_scaled:.4f}")
print("Scaling doesn't affect tree-based models.")

Scaling in Pipelines

Fit the scaler only on training data to prevent data leakage.

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Accuracy: {scores.mean():.4f}")
print("\nPipeline ensures scaler is fit only on training data each fold.")

Choosing a Scaling Method

| Method | Best For | Notes |
|--------|----------|-------|
| StandardScaler | General use | Assumes roughly Gaussian distribution |
| MinMaxScaler | Neural networks, bounded features | Sensitive to outliers |
| RobustScaler | Data with outliers | Uses median and IQR |
| None | Tree-based models | Trees don't need scaling |

Key Takeaways

  • Gradient-based and distance-based algorithms need scaling
  • Standardization (z-score) is the most common method
  • Min-Max scaling bounds features to [0, 1]
  • Robust scaling handles outliers better
  • Tree-based models don't need scaling
  • Always scale inside pipelines to prevent data leakage
  • Fit scaler on training data only, transform both train and test