Feature Selection
Real-world datasets often contain dozens, hundreds, or even thousands of features. Not all of them are useful. Some features are redundant (highly correlated with others), some are irrelevant (no relationship with the target), and some are noise (random variation that hurts generalization). Feature selection identifies which features to keep and which to discard.
Good feature selection improves model performance, reduces overfitting, speeds up training, and makes models more interpretable. This section covers the three main approaches: filter methods, wrapper methods, and embedded methods.
Why Feature Selection Matters
Adding more features seems like it should help—more information is better, right? But in practice, irrelevant features hurt performance. They add noise that the model might accidentally fit, leading to overfitting. They increase computational cost. They make the model harder to interpret.
The curse of dimensionality makes this worse. As the number of features grows, the data becomes increasingly sparse in the high-dimensional space. Points that seem nearby in low dimensions become far apart when you add irrelevant dimensions. Models need exponentially more data to maintain the same density of coverage.
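This is easy to see numerically. The sketch below is an illustrative addition (not part of the original example): it draws random points and measures how the gap between the nearest and farthest neighbor shrinks as dimensions are added.
import numpy as np
# Distance concentration: with more dimensions, all points look almost equally far apart
rng = np.random.default_rng(42)
for n_dims in [2, 10, 100, 1000]:
    X = rng.standard_normal((200, n_dims))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from one point to all the others
    relative_spread = (dists.max() - dists.min()) / dists.min()
    print(f"{n_dims:>5} dims: relative spread of distances = {relative_spread:.2f}")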
Feature selection addresses these problems by identifying a subset of features that captures the signal while discarding the noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Create dataset: 10 informative features, 90 noise features
X, y = make_classification(
    n_samples=500,
    n_features=100,
    n_informative=10,
    n_redundant=0,
    n_clusters_per_class=1,
    shuffle=False,  # keep the 10 informative features in the first columns
    random_state=42
)
# Model with all features
model_all = LogisticRegression(max_iter=1000, random_state=42)
scores_all = cross_val_score(model_all, X, y, cv=5)
# Model with only informative features
X_informative = X[:, :10]
model_subset = LogisticRegression(max_iter=1000, random_state=42)
scores_subset = cross_val_score(model_subset, X_informative, y, cv=5)
print("Impact of Irrelevant Features:")
print(f"All 100 features: {scores_all.mean():.4f} (+/- {scores_all.std()*2:.4f})")
print(f"Only 10 informative: {scores_subset.mean():.4f} (+/- {scores_subset.std()*2:.4f})")
print("\nFewer features can mean better performance!")Filter Methods
Filter methods evaluate features independently of any specific model. They compute a score for each feature based on its relationship with the target, then select the top-scoring features. Filter methods are fast and model-agnostic.
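As a minimal sketch of this score-and-keep-top-k pattern, here is scikit-learn's SelectKBest on a small synthetic dataset (the dataset and parameter values are illustrative assumptions, not taken from this section's examples):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import numpy as np
# Score every feature against the target, then keep the k highest-scoring ones
X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=42)
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_top = selector.fit_transform(X, y)
print("Kept feature indices:", np.where(selector.get_support())[0])
print("Shape before/after:", X.shape, X_top.shape)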
The downside is that they ignore feature interactions and don't consider how features work together within a specific model.
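A quick illustration of that blind spot, using an assumed XOR-style toy dataset (not one used elsewhere in this section): each feature alone tells you essentially nothing about the target, so a per-feature filter score ranks both near zero, yet a model given both features together separates the classes almost perfectly.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 2000)
x2 = rng.integers(0, 2, 2000)
X = np.column_stack([x1, x2]).astype(float)
y = x1 ^ x2  # the target is the interaction (XOR) of the two features
# Individually, each feature looks useless to a filter...
print("Per-feature mutual information:", mutual_info_classif(X, y, discrete_features=True, random_state=0))
# ...but a model that uses both features recovers the target almost perfectly
print("CV accuracy using both features:", cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean())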
Variance Threshold removes features with low variance. If a feature has nearly the same value for all samples, it provides no discriminating power.
from sklearn.feature_selection import VarianceThreshold
import numpy as np
# Create data with some constant features
np.random.seed(42)
X = np.random.randn(100, 10)
X[:, 5] = 1.0 # Constant feature
X[:, 8] = X[:, 8] * 0.01 # Near-constant feature
print("Original feature variances:")
for i, var in enumerate(np.var(X, axis=0)):
    print(f" Feature {i}: variance = {var:.4f}")
# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)
print(f"\nFeatures before: {X.shape[1]}")
print(f"Features after: {X_selected.shape[1]}")
print(f"Removed features: {np.where(~selector.get_support())[0]}")Correlation-based selection removes highly correlated features. When two features are strongly correlated, they provide redundant information. Keeping both adds complexity without adding predictive power.
import numpy as np
import pandas as pd
# Create correlated features
np.random.seed(42)
n_samples = 200
# Independent features
X1 = np.random.randn(n_samples)
X2 = np.random.randn(n_samples)
X3 = np.random.randn(n_samples)
# Correlated features
X4 = X1 + np.random.randn(n_samples) * 0.1 # Highly correlated with X1
X5 = X2 * 0.5 + np.random.randn(n_samples) * 0.5 # Moderately correlated with X2
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'X4': X4, 'X5': X5})
# Correlation matrix
corr_matrix = df.corr().abs()
print("Correlation Matrix:")
print(corr_matrix.round(3))
# Find highly correlated pairs
def find_correlated_features(corr_matrix, threshold=0.8):
    """Find pairs of features with correlation above threshold."""
    pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            if corr_matrix.iloc[i, j] > threshold:
                pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))
    return pairs
high_corr = find_correlated_features(corr_matrix, threshold=0.8)
print(f"\nHighly correlated pairs (>0.8):")
for f1, f2, corr in high_corr:
    print(f" {f1} and {f2}: {corr:.3f}")

Mutual Information measures how much information a feature provides about the target. Unlike correlation, it captures non-linear relationships. A feature with high mutual information is predictive of the target, even if the relationship isn't linear.
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.datasets import make_classification
import numpy as np
# Create dataset with different types of relationships
np.random.seed(42)
n_samples = 1000
# Feature with a roughly linear relationship to the target
X1 = np.random.randn(n_samples)
# Feature with a non-linear (quadratic) relationship
X2 = np.random.randn(n_samples)
# Feature with no relationship
X3 = np.random.randn(n_samples)
# Target depends on X1 directly and on X2 only through its square
y = (X1 + X2**2 > 1).astype(int)
X = np.column_stack([X1, X2, X3])
# Compute mutual information
mi_scores = mutual_info_classif(X, y, random_state=42)
print("Mutual Information Scores:")
for i, score in enumerate(mi_scores):
    relationship = ["linear", "non-linear (quadratic)", "none"][i]
    print(f" Feature {i} ({relationship}): {score:.4f}")
print("\nMutual information captures both linear and non-linear relationships.")

The chi-squared test measures dependence between non-negative features (such as counts or indicator variables) and a categorical target. It's commonly used for text classification, where features are word counts.
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# Load text data (subset)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
# Convert to word counts
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target
# Chi-squared feature selection
chi2_scores, p_values = chi2(X, y)
# Get feature names
feature_names = vectorizer.get_feature_names_out()
# Top features by chi-squared score
top_indices = np.argsort(chi2_scores)[-10:][::-1]
print("Top 10 Features by Chi-Squared Score:")
for idx in top_indices:
    print(f" {feature_names[idx]:<15}: chi2 = {chi2_scores[idx]:.2f}, p = {p_values[idx]:.2e}")

Wrapper Methods
Wrapper methods evaluate feature subsets by training a model and measuring its performance. They search through possible feature combinations, keeping those that improve the model. Wrapper methods consider feature interactions and are tailored to the specific model being used.
The downside is computational cost—training a model for each candidate subset is expensive. Wrapper methods also risk overfitting to the validation set used for evaluation.
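To get a feel for that cost, here is a rough fit count for greedy forward selection under the settings used in the example below (20 candidate features, 5 to select, 3-fold CV); the bookkeeping assumes every remaining candidate is re-evaluated with cross-validation at each step, which is how SequentialFeatureSelector behaves.
# Rough count of model fits for greedy forward selection
n_features, n_select, cv_folds = 20, 5, 3
fits = sum((n_features - i) * cv_folds for i in range(n_select))
print(f"Approximate model fits: {fits}")  # (20 + 19 + 18 + 17 + 16) * 3 = 270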
Forward Selection starts with no features and adds one at a time. At each step, it tries adding each remaining feature and keeps the one that improves performance most. This continues until adding features no longer helps.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
import numpy as np
# Create dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)
# Forward selection
model = LogisticRegression(max_iter=1000, random_state=42)
forward_selector = SequentialFeatureSelector(
    model,
    n_features_to_select=5,
    direction='forward',
    cv=3,
    scoring='accuracy'
)
forward_selector.fit(X, y)
selected_features = np.where(forward_selector.get_support())[0]
print("Forward Selection:")
print(f"Selected features: {selected_features}")
# Compare performance
X_selected = forward_selector.transform(X)
scores_all = cross_val_score(model, X, y, cv=5)
scores_selected = cross_val_score(model, X_selected, y, cv=5)
print(f"All 20 features: {scores_all.mean():.4f}")
print(f"Selected 5 features: {scores_selected.mean():.4f}")Backward Elimination starts with all features and removes one at a time. At each step, it tries removing each feature and removes the one whose absence hurts performance least. This continues until removing features starts hurting performance significantly.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Same dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)
# Backward elimination
model = RandomForestClassifier(n_estimators=50, random_state=42)
backward_selector = SequentialFeatureSelector(
    model,
    n_features_to_select=5,
    direction='backward',
    cv=3,
    scoring='accuracy'
)
backward_selector.fit(X, y)
selected_features = np.where(backward_selector.get_support())[0]
print("Backward Elimination:")
print(f"Selected features: {selected_features}")
# Compare with forward selection results
print("\nNote: Forward and backward selection may select different features")
print("depending on feature interactions and the order of evaluation.")Recursive Feature Elimination (RFE) trains a model, ranks features by importance, removes the least important, and repeats. It's more efficient than exhaustive forward/backward selection because it uses feature importance from the model rather than retraining for each feature.
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
# Create dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)
# RFE with cross-validation to find optimal number of features
model = RandomForestClassifier(n_estimators=50, random_state=42)
rfecv = RFECV(
    estimator=model,
    step=1,  # Remove one feature at a time
    cv=5,
    scoring='accuracy',
    min_features_to_select=1
)
rfecv.fit(X, y)
print("Recursive Feature Elimination with CV:")
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {np.where(rfecv.support_)[0]}")
print(f"Feature ranking: {rfecv.ranking_}")Embedded Methods
Embedded methods perform feature selection as part of the model training process. They're more efficient than wrappers because selection happens during training rather than requiring separate model evaluations.
L1 Regularization (Lasso) shrinks some coefficients exactly to zero, effectively removing those features from the model. The regularization strength controls how aggressively features are eliminated.
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import numpy as np
# Create regression data with sparse true coefficients
np.random.seed(42)
n_samples, n_features = 200, 50
X = np.random.randn(n_samples, n_features)
# Only 5 features matter
true_coef = np.zeros(n_features)
true_coef[:5] = [3, -2, 1, 0.5, -1.5]
y = X @ true_coef + np.random.randn(n_samples) * 0.5
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find optimal regularization with CV
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
# Features selected (non-zero coefficients)
selected = np.where(lasso_cv.coef_ != 0)[0]
n_selected = len(selected)
print("L1 Regularization (Lasso) Feature Selection:")
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Features selected: {n_selected} out of {n_features}")
print(f"Selected feature indices: {selected}")
print(f"True informative features: [0, 1, 2, 3, 4]")
# Show coefficients for first 10 features
print("\nCoefficients (first 10 features):")
for i in range(10):
    true = true_coef[i]
    learned = lasso_cv.coef_[i]
    print(f" Feature {i}: true = {true:6.2f}, learned = {learned:6.2f}")

Tree-based feature importance measures how much each feature contributes to reducing impurity (for classification) or variance (for regression) across all trees in an ensemble. Features that frequently appear in splits near the root are considered more important.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
import numpy as np
# Create dataset
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    random_state=42
)
# Random Forest importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Sort by importance
importance_rf = rf.feature_importances_
indices_rf = np.argsort(importance_rf)[::-1]
print("Random Forest Feature Importance:")
print(f"{'Rank':<6} {'Feature':<10} {'Importance':<12}")
print("-" * 28)
for rank, idx in enumerate(indices_rf[:10], 1):
    print(f"{rank:<6} Feature {idx:<3} {importance_rf[idx]:<12.4f}")
# Select top features
threshold = 0.05
selected = np.where(importance_rf > threshold)[0]
print(f"\nFeatures with importance > {threshold}: {selected}")Permutation Importance measures feature importance by shuffling each feature and observing how much model performance degrades. Unlike built-in importance, it works with any model and measures actual predictive contribution rather than how often a feature is used.
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
# Create dataset
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Compute permutation importance on test set
perm_importance = permutation_importance(
    model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)
# Sort by importance
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
print("Permutation Importance (on test set):")
print(f"{'Rank':<6} {'Feature':<10} {'Importance':<12} {'Std':<10}")
print("-" * 38)
for rank, idx in enumerate(sorted_idx[:10], 1):
    mean = perm_importance.importances_mean[idx]
    std = perm_importance.importances_std[idx]
    print(f"{rank:<6} Feature {idx:<3} {mean:<12.4f} {std:<10.4f}")

Comparing Selection Methods
Different methods have different strengths. Filter methods are fast but ignore interactions. Wrapper methods consider interactions but are slow. Embedded methods balance both concerns.
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
import numpy as np
import time
# Create dataset
X, y = make_classification(
    n_samples=500,
    n_features=50,
    n_informative=10,
    random_state=42
)
# Scale for Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
methods = {}
# Filter: Mutual Information
start = time.time()
mi_selector = SelectKBest(mutual_info_classif, k=10)
X_mi = mi_selector.fit_transform(X, y)
methods['Filter (MI)'] = {
    'features': np.where(mi_selector.get_support())[0],
    'time': time.time() - start,
    'X_selected': X_mi
}
# Wrapper: RFE
start = time.time()
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
methods['Wrapper (RFE)'] = {
    'features': np.where(rfe.support_)[0],
    'time': time.time() - start,
    'X_selected': X_rfe
}
# Embedded: Lasso
start = time.time()
lasso = LassoCV(cv=3)  # note: Lasso treats the 0/1 labels as a regression target; selection still works
lasso.fit(X_scaled, y)
# Get top 10 by absolute coefficient
top_10 = np.argsort(np.abs(lasso.coef_))[-10:]
X_lasso = X_scaled[:, top_10]
methods['Embedded (Lasso)'] = {
    'features': top_10,
    'time': time.time() - start,
    'X_selected': X_lasso
}
# Compare
print("Feature Selection Method Comparison:")
print(f"{'Method':<20} {'Time (s)':<12} {'CV Accuracy':<12}")
print("-" * 44)
model = LogisticRegression(max_iter=1000)
for name, data in methods.items():
    score = cross_val_score(model, data['X_selected'], y, cv=5).mean()
    print(f"{name:<20} {data['time']:<12.4f} {score:<12.4f}")
# Baseline with all features
score_all = cross_val_score(model, X, y, cv=5).mean()
print(f"{'All features':<20} {'-':<12} {score_all:<12.4f}")Practical Guidelines
Start simple: Begin with variance threshold and correlation analysis to remove obvious noise and redundancy.
Match method to scale: Filter methods are the practical choice for very high-dimensional data because they are fast. Wrapper methods are more faithful to a specific model but only scale to moderate numbers of features. Embedded methods such as Lasso handle high dimensions while still accounting for the model.
Use cross-validation: Always evaluate feature selection within cross-validation to avoid overfitting to the feature selection process.
Consider domain knowledge: Statistical methods complement but don't replace human understanding. A feature with low statistical importance might still be essential for interpretability.
Beware of data leakage: Feature selection must happen inside cross-validation folds, not before. Selecting features on the full dataset before splitting leaks information.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
# Example dataset (same settings as the comparison above)
X, y = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=42)
# CORRECT: Feature selection inside pipeline (inside CV)
pipeline = Pipeline([
    ('select', SelectKBest(mutual_info_classif, k=10)),
    ('classify', LogisticRegression(max_iter=1000))
])
# Cross-validation will apply feature selection separately on each fold
scores = cross_val_score(pipeline, X, y, cv=5)
print("Correct approach (selection inside CV):")
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# WRONG: Feature selection before cross-validation
# selector = SelectKBest(mutual_info_classif, k=10)
# X_selected = selector.fit_transform(X, y) # Sees all data!
# scores_wrong = cross_val_score(LogisticRegression(), X_selected, y, cv=5)
# This gives optimistically biased results
print("\nAlways use pipelines to prevent data leakage in feature selection!")Key Takeaways
- Feature selection removes irrelevant, redundant, and noisy features
- Too many features lead to overfitting and increased computational cost
- Filter methods (variance, correlation, mutual information) are fast and model-agnostic
- Wrapper methods (forward/backward selection, RFE) consider interactions but are slower
- Embedded methods (L1 regularization, tree importance) perform selection during training
- Permutation importance measures actual predictive contribution for any model
- Always perform feature selection inside cross-validation to prevent leakage
- Combine statistical methods with domain knowledge for best results
- Use pipelines to keep feature selection inside the training process