Feature Selection

More features are not always better. Irrelevant features add noise, increase training time, and can hurt model performance, especially for algorithms sensitive to the curse of dimensionality. Feature selection identifies which features actually contribute to prediction.

Why Feature Selection Matters

100 features, 1000 samples:
  Genuine predictors: 10 features
  Noise: 90 random features

Without selection: model learns spurious correlations from noise features
With selection:    model focuses on the 10 genuine predictors → better generalization
                   + faster training + more interpretable
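
The scenario above can be reproduced on synthetic data. The following is a minimal sketch rather than a benchmark: it assumes scikit-learn's make_classification as the data source and k-nearest neighbours as an example of a distance-based model that tends to suffer from noise dimensions. The choice of k=10 and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# 1000 samples, 100 features: 10 informative, the other 90 pure noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=100, n_informative=10,
    n_redundant=0, n_repeated=0, random_state=0,
)

# Baseline: k-NN on all 100 features
baseline = KNeighborsClassifier()
acc_all = cross_val_score(baseline, X_demo, y_demo, cv=5).mean()

# Same model, but keep only the 10 highest-scoring features (refit per fold)
selected = make_pipeline(SelectKBest(f_classif, k=10), KNeighborsClassifier())
acc_selected = cross_val_score(selected, X_demo, y_demo, cv=5).mean()

print(f"All 100 features:  {acc_all:.3f}")
print(f"Top 10 by F-test:  {acc_selected:.3f}")
```

The exact numbers depend on the random seed and the model, so treat the comparison as illustrative rather than as a guaranteed improvement.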

Filter Methods (Univariate)

Evaluate each feature independently of the model:

import pandas as pd
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import f_classif, mutual_info_classif, chi2

# Assumes X_train, y_train, X_test, and feature_names are already defined

# F-test (linear associations)
selector_f = SelectKBest(score_func=f_classif, k=20)
X_train_f = selector_f.fit_transform(X_train, y_train)

# Mutual information (captures nonlinear dependencies)
selector_mi = SelectKBest(score_func=mutual_info_classif, k=20)
X_train_mi = selector_mi.fit_transform(X_train, y_train)

# Chi-squared (classification only; features must be non-negative, e.g. counts)
selector_chi2 = SelectKBest(score_func=chi2, k=20)

# SelectPercentile keeps a fraction of the features instead of a fixed k
selector_pct = SelectPercentile(score_func=f_classif, percentile=10)

# View scores from the fitted mutual-information selector
scores = pd.Series(selector_mi.scores_, index=feature_names).sort_values(ascending=False)
print(scores.head(20))

# Apply the selector fitted on training data to the test data (do not refit on test)
X_test_f = selector_f.transform(X_test)
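
To make the difference between the F-test and mutual information concrete, here is a small sketch on made-up data. The three features (x_quad, x_lin, x_noise) and the labelling rule are assumptions chosen for illustration: the label depends on x_quad only through its absolute value, so the class means of x_quad coincide and the F-test has little to detect, while mutual information should still register the dependency.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

x_quad = rng.normal(size=n)                 # drives the label through |x| only
y_toy = (np.abs(x_quad) > 1.0).astype(int)  # symmetric rule, so class means of x_quad match
x_lin = y_toy + rng.normal(size=n)          # class means differ by 1: a linear signal
x_noise = rng.normal(size=n)                # unrelated to the label

X_toy = np.column_stack([x_quad, x_lin, x_noise])
names = ["x_quad", "x_lin", "x_noise"]

f_scores, _ = f_classif(X_toy, y_toy)
mi_scores = mutual_info_classif(X_toy, y_toy, random_state=0)

for name, f, mi in zip(names, f_scores, mi_scores):
    print(f"{name:8s} F={f:8.2f}  MI={mi:.3f}")
```

In this setup the F-test ranks x_lin far above x_quad, whereas mutual information ranks x_quad well above the noise feature. Mutual information is estimated with a nearest-neighbour method, so its scores vary slightly with random_state.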

Variance Threshold

Remove features with very low variance; they are nearly constant and provide little information. The threshold is compared against the raw variance, so it depends on each feature's units:

from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance is below 0.01 (an absolute variance, not a percentage)
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_train)

print(f"Kept {selector.get_support().sum()} of {X_train.shape[1]} features")
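
Because the threshold applies to raw variance, the same quantity can pass or fail depending on its units. The sketch below uses a hypothetical height column, expressed once in metres and once in centimetres, plus a constant column; the data and names are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.08, size=500)  # std 0.08 m, so variance is about 0.0064

X_units = np.column_stack([
    height_m,          # metres: variance below 0.01
    height_m * 100.0,  # centimetres: same information, variance about 64
    np.ones(500),      # constant column: variance 0
])

vt = VarianceThreshold(threshold=0.01)
vt.fit(X_units)

kept = vt.get_support()
print(vt.variances_)
print(kept)
```

Only the centimetre column survives, even though it carries exactly the same information as the metre column. Note also that standardizing features first gives every non-constant feature unit variance, which leaves this filter able to remove only constant columns.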

Recursive Feature Elimination (RFE)

RFE is a wrapper method: it trains a model, removes the least important features, and repeats until the requested number of features remains:

from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier

# RFE: specify number of features to select
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=20,
    step=5  # Remove 5 features per iteration
)
rfe.fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)

# RFECV: cross-validated RFE (finds optimal number of features)
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1, cv=5, scoring='roc_auc', min_features_to_select=5, n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
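
The elimination loop itself can be written out by hand to show what happens at each iteration. This is a simplified sketch, not scikit-learn's implementation: manual_rfe is a hypothetical helper, it assumes the estimator exposes feature_importances_, and the synthetic data is only there to make the block runnable.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def manual_rfe(estimator, X, y, n_keep, step=1):
    """Return the column indices that survive a simple elimination loop."""
    remaining = np.arange(X.shape[1])
    while remaining.size > n_keep:
        # Refit a fresh copy of the model on the surviving columns only
        model = clone(estimator).fit(X[:, remaining], y)
        importances = model.feature_importances_
        # Never drop below n_keep, even if step would overshoot
        n_drop = min(step, remaining.size - n_keep)
        # argsort is ascending, so the first n_drop positions are the weakest
        weakest = np.argsort(importances)[:n_drop]
        remaining = np.delete(remaining, weakest)
    return remaining

X_small, y_small = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=25, random_state=0)
kept_idx = manual_rfe(forest, X_small, y_small, n_keep=5, step=5)
print(kept_idx)
```

Importances are recomputed after every removal, which is what distinguishes this from ranking features once and taking the top k. A larger step means fewer model fits at the cost of a coarser search.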

Embedded Methods: Feature Importance from Trees

Tree-based models compute feature importance as a byproduct of training:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train model
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Select features above median importance
selector = SelectFromModel(rf, threshold='median', prefit=True)
X_train_selected = selector.transform(X_train)

print(f"Selected {selector.get_support().sum()} features")

SHAP-Based Feature Selection

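SHAP (SHapley Additive exPlanations) attributes each prediction to individual features. Averaging the absolute attributions gives an importance ranking that is often more consistent than impurity-based importance: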

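import numpy as np
import pandas as pd
import shap

# Compute SHAP values for the fitted random forest (rf) from the previous section
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_train)

# Depending on the shap version, a classifier yields either a list with one
# array per class or a single array of shape (n_samples, n_features, n_classes)
if isinstance(shap_values, list):
    shap_pos = shap_values[1]
else:
    shap_pos = shap_values[..., 1] if shap_values.ndim == 3 else shap_values

# Mean |SHAP| across samples for the positive class
mean_shap = np.abs(shap_pos).mean(axis=0)
feature_importance = pd.Series(mean_shap, index=feature_names).sort_values(ascending=False)

# Select top K by SHAP
top_k = 30
selected_features = feature_importance.head(top_k).index.tolist()
X_train_shap = X_train[selected_features]  # assumes X_train is a DataFrame with these columns

# Summary plot
shap.summary_plot(shap_pos, X_train, feature_names=feature_names, plot_type='bar')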

Correlation-Based Removal

Remove one of each pair of highly correlated features, since they provide redundant information:

import numpy as np
import pandas as pd

# Work on a DataFrame so columns can be dropped by name
X_train_df = pd.DataFrame(X_train, columns=feature_names)
corr_matrix = X_train_df.corr().abs()

# Upper triangle (excluding the diagonal) so each pair is counted once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Drop the later feature of every pair with |correlation| > 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(f"Removing {len(to_drop)} highly correlated features: {to_drop}")

X_train_uncorrelated = X_train_df.drop(columns=to_drop)
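
A tiny worked example shows which member of a correlated pair is removed. The helper name drop_correlated and the toy columns are hypothetical; the logic is the same upper-triangle rule as above, wrapped in a function.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Return (reduced_df, dropped_columns) using the upper-triangle rule."""
    corr = df.corr().abs()
    # True strictly above the diagonal, so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    dropped = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=dropped), dropped

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,   # perfectly correlated with "a"
    "b": rng.normal(size=200),   # independent of "a"
})

reduced, dropped = drop_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

The rule keeps whichever column appears first and drops the later one, regardless of which is more predictive. If that matters, reorder the columns (for example by a univariate score) before applying it.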

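Feature Selection in a Pipeline

Placing the selector inside a pipeline means it is refit on each training fold during cross-validation, so the held-out fold never influences which features are kept: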

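from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline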

feature_selection_pipeline = Pipeline([
    ('selector', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42), threshold='median'
    )),
    ('model', GradientBoostingClassifier(n_estimators=200))
])

scores = cross_val_score(feature_selection_pipeline, X, y, cv=5, scoring='roc_auc')
print(f"AUC with feature selection: {scores.mean():.4f} ± {scores.std():.4f}")
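
The reason for keeping selection inside the pipeline can be illustrated with pure noise. In the sketch below, which uses invented data and an arbitrary k=20, the labels are random, so no feature is truly predictive. Selecting features on the full dataset before cross-validation lets the held-out rows influence the selection, which tends to inflate the score; selecting inside the pipeline does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # 2000 features of pure noise
y_noise = rng.integers(0, 2, size=100)  # labels unrelated to the features

# Leaky: choose features using ALL rows, then cross-validate on the reduced matrix
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: the selector sits in the pipeline and is refit inside each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV (leaky): {leaky:.3f}")
print(f"Selection inside pipeline:   {honest:.3f}")
```

With random labels, an honest estimate should sit near chance (about 0.5 accuracy), so a clearly higher score from the first approach is an artifact of leakage rather than real signal.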

As a rule of thumb, feature selection is most valuable when you have many features (roughly more than 50), limited data (fewer than about 10,000 samples), or a need for interpretability. With large datasets and gradient boosting, the algorithm’s built-in regularization often makes explicit feature selection unnecessary.