Feature Selection: Identifying the Most Predictive Variables

Learn feature selection methods — filter methods, wrapper methods, embedded methods, SHAP-based selection, and how to remove irrelevant features for better model performance.

More features are not always better. Irrelevant features add noise, increase training time, and can hurt model performance, especially for algorithms that are sensitive to the curse of dimensionality, such as k-nearest neighbors. Feature selection identifies which features actually contribute to prediction.


Why Feature Selection Matters

Consider a dataset with 100 features and 1,000 samples:
    Genuine predictors: 10 features
    Noise: 90 random features
    Without selection: the model learns spurious correlations from the noise features
    With selection: the model focuses on the 10 genuine predictors → better generalization, faster training, and a more interpretable model
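
The scenario above can be sketched with synthetic data. This is a minimal illustration, not a benchmark: the dataset comes from scikit-learn's make_classification, and the choice of k-nearest neighbors, k=10, and 5-fold accuracy are assumptions made only for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# 1,000 samples, 100 features: 10 informative, the remaining 90 pure noise
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=100, n_informative=10,
    n_redundant=0, n_repeated=0, shuffle=False, random_state=0,
)

# Baseline: a distance-based model on all 100 features
all_scores = cross_val_score(KNeighborsClassifier(), X_syn, y_syn, cv=5)

# Same model, but a univariate filter keeps 10 features inside each CV fold
selected_model = make_pipeline(SelectKBest(f_classif, k=10), KNeighborsClassifier())
selected_scores = cross_val_score(selected_model, X_syn, y_syn, cv=5)

print(f"All 100 features:  {all_scores.mean():.3f}")
print(f"Top 10 by F-test:  {selected_scores.mean():.3f}")
```

On data like this one would expect the filtered model to score higher, but the size of the gap depends on the model and the random seed.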

Filter Methods (Univariate)

Evaluate each feature independently of the model:

import pandas as pd
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import f_classif, mutual_info_classif, chi2
# X_train, X_test, y_train, and feature_names are assumed to be defined already
# F-test (linear associations)
selector_f = SelectKBest(score_func=f_classif, k=20)
X_train_f = selector_f.fit_transform(X_train, y_train)
# Mutual information (captures nonlinear dependencies)
selector_mi = SelectKBest(score_func=mutual_info_classif, k=20)
X_train_mi = selector_mi.fit_transform(X_train, y_train)
# Chi-squared (classification only; fit it the same way, but only on
# non-negative features such as counts, otherwise it raises an error)
selector_chi2 = SelectKBest(score_func=chi2, k=20)
# SelectPercentile(score_func=f_classif, percentile=10) keeps the top 10% instead of a fixed k
# View scores
scores = pd.Series(selector_mi.scores_, index=feature_names).sort_values(ascending=False)
print(scores.head(20))
# Apply the same fitted selector to the test data (never refit on the test set)
X_test_mi = selector_mi.transform(X_test)
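
To see why the choice of score function matters, here is a small sketch on synthetic data. The three features and the rule that generates the labels are invented for illustration: one feature affects the label linearly, one only through its square, and one not at all.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
x_linear = rng.normal(size=n)     # shifts the class means
x_quadratic = rng.normal(size=n)  # matters only through its square
x_noise = rng.normal(size=n)      # unrelated to the label

# Hypothetical labeling rule, chosen only for this demonstration
y_demo = (x_linear + x_quadratic**2 > 1.0).astype(int)
X_demo = np.column_stack([x_linear, x_quadratic, x_noise])

f_scores, _ = f_classif(X_demo, y_demo)
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

for name, f, mi in zip(["linear", "quadratic", "noise"], f_scores, mi_scores):
    print(f"{name:>9}: F = {f:8.2f}   MI = {mi:.3f}")
```

The F-test compares class means, so the quadratic feature, whose mean is roughly the same in both classes, should score close to the noise feature. Mutual information should rank it clearly above noise.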

Variance Threshold

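Remove features with low variance: they are nearly constant and provide little information. The threshold is compared against each feature's raw variance, so it depends on the feature's scale: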

from sklearn.feature_selection import VarianceThreshold
# Remove features whose variance is below 0.01 (an absolute value in squared feature units, not 1%)
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_train)
print(f"Kept {selector.get_support().sum()} of {X_train.shape[1]} features")
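
The following sketch shows how the same threshold treats features on different scales. The four columns are made up for illustration; the point is that a feature measured in small units can be dropped even when it varies as much, relatively, as one that is kept.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000
X_var = np.column_stack([
    np.ones(n),                            # constant: variance 0
    (rng.random(n) < 0.002).astype(float), # rare binary flag: variance about p(1-p) = 0.002
    rng.normal(0.0, 1.0, n),               # unit-scale feature: variance about 1
    rng.normal(0.0, 1.0, n) * 0.05,        # same shape in small units: variance about 0.0025
])

vt = VarianceThreshold(threshold=0.01)
vt.fit(X_var)

print("Variances:", vt.variances_.round(4))
print("Kept:     ", vt.get_support())
```

Only the unit-scale column survives. If features are standardized first, every variance becomes 1 and the filter removes nothing, so it is usually applied to raw or min-max-scaled features.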

Recursive Feature Elimination (RFE)

RFE is a wrapper method: it trains a model, removes the least important features, and repeats until the requested number of features remains:

from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
# RFE: specify number of features to select
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=20,
    step=5,  # Remove 5 features per iteration
)
rfe.fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)
# RFECV: cross-validated RFE (finds optimal number of features)
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1, cv=5, scoring='roc_auc', min_features_to_select=5, n_jobs=-1,
)
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
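
After fitting, RFE exposes which features survived and the order in which the others were eliminated. The sketch below uses a small synthetic dataset and swaps in LogisticRegression for speed; both are assumptions for the example, since any estimator that exposes coef_ or feature_importances_ works.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 12 features, of which the first 4 are informative (shuffle=False keeps them first)
X_rfe, y_rfe = make_classification(
    n_samples=300, n_features=12, n_informative=4,
    n_redundant=0, shuffle=False, random_state=0,
)

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=2)
rfe_demo.fit(X_rfe, y_rfe)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
# and grows for features eliminated in earlier rounds
print("Kept mask:", rfe_demo.support_)
print("Ranking:  ", rfe_demo.ranking_)

X_rfe_reduced = rfe_demo.transform(X_rfe)
```

Features removed in the same round share a rank, so a larger step gives a coarser ranking in exchange for fewer model fits.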

Embedded Methods: Feature Importance from Trees

Tree-based models compute feature importance as a byproduct of training:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Train model
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Select features above median importance
selector = SelectFromModel(rf, threshold='median', prefit=True)
X_train_selected = selector.transform(X_train)
print(f"Selected {selector.get_support().sum()} features")

SHAP-Based Feature Selection

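SHAP (SHapley Additive exPlanations) attributes each prediction to individual features. Averaging the absolute attributions gives an importance ranking that is often more reliable than impurity-based importances, though it is slower to compute: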

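import numpy as np
import pandas as pd
import shap
# Compute SHAP values (rf is the fitted random forest from the previous section)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_train)
# Depending on the shap version, a classifier yields either a list with one array
# per class or a single array of shape (n_samples, n_features, n_classes).
# Take the positive class (index 1) in both cases.
if isinstance(shap_values, list):
    shap_pos = shap_values[1]
else:
    shap_pos = np.asarray(shap_values)[..., 1]
# Mean |SHAP| across samples for the positive class
mean_shap = np.abs(shap_pos).mean(axis=0)
feature_importance = pd.Series(mean_shap, index=feature_names).sort_values(ascending=False)
# Select top K by SHAP
top_k = 30
selected_features = feature_importance.head(top_k).index.tolist()
# Assumes X_train is a pandas DataFrame whose columns match feature_names
X_train_shap = X_train[selected_features]
# Summary plot
shap.summary_plot(shap_pos, X_train, feature_names=feature_names, plot_type='bar')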

Correlation-Based Removal

Remove one of each pair of highly correlated features — they provide redundant information:

import numpy as np
import pandas as pd
# Work on a DataFrame so columns can be dropped by name
X_train_df = pd.DataFrame(X_train, columns=feature_names)
corr_matrix = X_train_df.corr().abs()
# Upper triangle (excludes the diagonal and mirrored pairs)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Remove features with correlation > 0.95
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(f"Removing {len(to_drop)} highly correlated features: {to_drop}")
X_train_uncorrelated = X_train_df.drop(columns=to_drop)
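
A toy example makes the upper-triangle logic easier to follow. The column names and data here are invented: "a_scaled" is a near-copy of "a", and the other columns are independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)

df_corr = pd.DataFrame({
    "a": a,
    "b": rng.normal(size=n),
    "a_scaled": 3.0 * a + rng.normal(scale=0.01, size=n),  # near-duplicate of "a"
    "c": rng.normal(size=n),
})

corr_demo = df_corr.corr().abs()
# Keep only entries above the diagonal so each pair is examined once
upper_demo = corr_demo.where(np.triu(np.ones(corr_demo.shape, dtype=bool), k=1))
# A column is dropped if it correlates strongly with any EARLIER column
drop_demo = [col for col in upper_demo.columns if (upper_demo[col] > 0.95).any()]

print(drop_demo)
```

Because each column is compared only with the columns before it, the later member of a correlated pair is the one removed. Reordering the columns changes which feature survives, so place the features you prefer to keep first.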

Feature Selection in a Pipeline

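Put the selector inside the pipeline so that it is refit on each training fold. Selecting features on the full dataset before cross-validation leaks information from the validation folds into the score:

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
feature_selection_pipeline = Pipeline([
    ('selector', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42), threshold='median'
    )),
    ('model', GradientBoostingClassifier(n_estimators=200))
])
# X and y are the full feature matrix and labels
scores = cross_val_score(feature_selection_pipeline, X, y, cv=5, scoring='roc_auc')
print(f"AUC with feature selection: {scores.mean():.4f} ± {scores.std():.4f}")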
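
The effect of selecting features outside the cross-validation loop can be sketched on pure noise, where no model should beat chance. The data, the SelectKBest filter, and the logistic regression below are assumptions chosen to keep the example small; they are not the pipeline above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # 2,000 features with no signal
y_noise = np.array([0, 1] * 50)         # balanced labels unrelated to X_noise

# Leaky: pick the 20 "best" features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: the selector is refit on the training part of every fold
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest_model, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV: {leaky_acc:.2f}")
print(f"Selection inside CV: {honest_acc:.2f}")
```

The leaky estimate should land well above 50% even though the labels are unrelated to the features, while the pipeline version should stay near chance.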

As a rough guide, feature selection is most valuable when you have many features (more than about 50), limited data (fewer than about 10,000 samples), or a need for interpretability. With large datasets and gradient boosting, the algorithm's built-in regularization often makes explicit feature selection unnecessary.