Feature Selection
More features are not always better. Irrelevant features add noise, increase training time, and can hurt model performance, especially for algorithms sensitive to the curse of dimensionality. Feature selection identifies which features actually contribute to prediction.
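The effect of noise features on a distance-based model can be checked directly. The following is a minimal sketch, not a benchmark: it assumes a synthetic dataset from scikit-learn's make_classification (1000 samples, 10 informative features, 90 noise features) and a k-nearest-neighbors classifier, both chosen here purely for illustration. Because the data is synthetic, the informative columns are known in advance, which is never the case with real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: with shuffle=False the 10 informative columns come first
# and the remaining 90 columns are pure noise.
X, y = make_classification(
    n_samples=1000,
    n_features=100,
    n_informative=10,
    n_redundant=0,
    n_repeated=0,
    shuffle=False,
    random_state=0,
)

knn = KNeighborsClassifier(n_neighbors=5)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated accuracy using all 100 features
acc_all = cross_val_score(knn, X, y, cv=cv).mean()

# Same model restricted to the 10 informative columns (an "oracle" selection,
# possible only because the data is synthetic)
acc_informative = cross_val_score(knn, X[:, :10], y, cv=cv).mean()

print(f"All 100 features:        {acc_all:.3f}")
print(f"10 informative features: {acc_informative:.3f}")
```

The gap between the two scores gives a rough sense of how much the noise columns cost this particular model. Tree ensembles are usually less sensitive to irrelevant features than nearest-neighbor methods, so the same comparison with a different estimator may show a smaller difference.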
Why Feature Selection Matters
100 features, 1000 samples:
  Genuine predictors: 10 features
  Noise: 90 random features

Without selection: model learns spurious correlations from noise features
With selection: model focuses on the 10 genuine predictors
  → better generalization + faster training + more interpretable

Filter Methods (Univariate)
Evaluate each feature independently of the model:
import pandas as pd
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import f_classif, mutual_info_classif, chi2

# F-test (linear associations)
selector_f = SelectKBest(score_func=f_classif, k=20)
X_train_selected = selector_f.fit_transform(X_train, y_train)

# Mutual information (captures nonlinear dependencies)
selector_mi = SelectKBest(score_func=mutual_info_classif, k=20)
X_train_selected_mi = selector_mi.fit_transform(X_train, y_train)

# Chi-squared (for non-negative features, classification)
selector_chi2 = SelectKBest(score_func=chi2, k=20)

# View scores
scores = pd.Series(selector_mi.scores_, index=feature_names).sort_values(ascending=False)
print(scores.head(20))

# Apply to test data: reuse the selector fitted on the training set, never refit on test data
X_test_selected = selector_f.transform(X_test)

Variance Threshold
Remove features with low variance — they’re nearly constant and provide little information:
from sklearn.feature_selection import VarianceThreshold

# Remove features with variance below 0.01
# (an absolute threshold, not a percentage, so it depends on feature scale)
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_train)

print(f"Kept {selector.get_support().sum()} of {X_train.shape[1]} features")

Recursive Feature Elimination (RFE)
RFE is a wrapper method: it trains a model, removes the least important features, and repeats until the requested number of features remains:
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier

# RFE: specify number of features to select
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=20,
    step=5,  # Remove 5 features per iteration
)
rfe.fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)

# RFECV: cross-validated RFE (finds optimal number of features)
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1,
    cv=5,
    scoring='roc_auc',
    min_features_to_select=5,
    n_jobs=-1,
)
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")

Embedded Methods: Feature Importance from Trees
Tree-based models compute feature importance as a byproduct of training:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train model
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Select features above median importance
selector = SelectFromModel(rf, threshold='median', prefit=True)
X_train_selected = selector.transform(X_train)

print(f"Selected {selector.get_support().sum()} features")

SHAP-Based Feature Selection
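SHAP (SHapley Additive exPlanations) attributes each prediction to individual features. Averaging the absolute SHAP values per feature gives an importance ranking that is often more reliable than impurity-based importance: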
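import numpy as np
import pandas as pd
import shap

# Compute SHAP values
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_train)

# Older SHAP releases return a list with one array per class; newer ones
# return a single array of shape (n_samples, n_features, n_classes)
shap_pos = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Mean |SHAP| across samples for the positive class
mean_shap = np.abs(shap_pos).mean(axis=0)
feature_importance = pd.Series(mean_shap, index=feature_names).sort_values(ascending=False)

# Select top K by SHAP (assumes X_train is a DataFrame with these column names)
top_k = 30
selected_features = feature_importance.head(top_k).index.tolist()
X_train_shap = X_train[selected_features]

# Summary plot
shap.summary_plot(shap_pos, X_train, feature_names=feature_names, plot_type='bar')

Correlation-Based Removal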
Remove one of each pair of highly correlated features — they provide redundant information:
import numpy as np
import pandas as pd

# Wrap in a DataFrame so the code works whether X_train is an array or a DataFrame
X_train_df = pd.DataFrame(X_train, columns=feature_names)
corr_matrix = X_train_df.corr().abs()

# Upper triangle
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Remove features with correlation > 0.95
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(f"Removing {len(to_drop)} highly correlated features: {to_drop}")

X_train_uncorrelated = X_train_df.drop(columns=to_drop)

Feature Selection in a Pipeline
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# The selector is refit inside each CV fold, so validation data never influences selection
feature_selection_pipeline = Pipeline([
    ('selector', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold='median',
    )),
    ('model', GradientBoostingClassifier(n_estimators=200)),
])

scores = cross_val_score(feature_selection_pipeline, X, y, cv=5, scoring='roc_auc')
print(f"AUC with feature selection: {scores.mean():.4f} ± {scores.std():.4f}")

Feature selection is most valuable when you have many features (>50) or limited data (<10,000 samples), or when interpretability is required. With large datasets and gradient boosting, the algorithm’s built-in regularization often makes explicit feature selection unnecessary.
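Whether explicit selection helps on a given dataset is an empirical question, and one way to answer it is to cross-validate the same model with and without a selection step. The sketch below is an illustration under stated assumptions, not a recommended configuration: the synthetic dataset, the names X_demo and y_demo, the choice of SelectKBest with k=10, and the small n_estimators value are all picked here to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical synthetic dataset: 10 informative features out of 50
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=50,
    n_informative=10,
    n_redundant=0,
    random_state=0,
)

# Baseline: gradient boosting on all features
baseline = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Same model preceded by a univariate filter; the filter is fitted on the
# training part of each fold only, because it lives inside the pipeline
with_selection = Pipeline([
    ('selector', SelectKBest(score_func=f_classif, k=10)),
    ('model', GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

auc_baseline = cross_val_score(baseline, X_demo, y_demo, cv=3, scoring='roc_auc')
auc_selected = cross_val_score(with_selection, X_demo, y_demo, cv=3, scoring='roc_auc')

print(f"AUC without selection: {auc_baseline.mean():.4f} ± {auc_baseline.std():.4f}")
print(f"AUC with selection:    {auc_selected.mean():.4f} ± {auc_selected.std():.4f}")
```

If the two scores are within each other's fold-to-fold spread, the selection step is probably not buying accuracy, though it may still be worth keeping for training speed or interpretability.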