Missing Data Handling
Missing data is one of the most common data quality problems in real-world ML projects. How you handle it matters: a poor imputation choice can introduce bias or destroy predictive signal. The right approach depends on how much data is missing and why it is missing.
Types of Missingness
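MCAR (Missing Completely At Random): The probability that a value is missing is unrelated to any observed or unobserved values. Example: a sensor failure randomly drops readings. This is the safest case: dropping the affected rows costs sample size but introduces no bias, and simple imputation preserves the center of the distribution while understating its variance.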
MAR (Missing At Random): The probability that a value is missing depends on observed data, not on the missing value itself. Example: older patients are less likely to have their weight recorded (missingness depends on age, not weight). Can be imputed using the other features.
MNAR (Missing Not At Random): The probability that a value is missing depends on the value itself. Example: high-income earners skip the income field. Hardest to handle; may require domain knowledge or additional data.
Exploratory Analysis of Missing Data
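Before exploring a real dataset, it can help to see how the three mechanisms described above distort what is observed. The following is a minimal sketch on synthetic data; the variable names, the age/income relationship, and the missingness rates are all illustrative assumptions rather than anything taken from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic population: income rises with age (illustrative assumption)
age = rng.uniform(20, 70, size=n)
income = 1_000 * age + rng.normal(0, 10_000, size=n)

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.30

# MAR: missingness depends only on the observed feature `age`
mar_missing = rng.random(n) < np.where(age > 45, 0.50, 0.10)

# MNAR: missingness depends on the income value itself
mnar_missing = rng.random(n) < np.where(income > np.median(income), 0.50, 0.10)

true_mean = income.mean()
observed_means = {
    "MCAR": income[~mcar_missing].mean(),
    "MAR": income[~mar_missing].mean(),
    "MNAR": income[~mnar_missing].mean(),
}
for name, observed in observed_means.items():
    print(f"{name}: observed mean {observed:,.0f} vs true mean {true_mean:,.0f}")
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means drift downward. The MAR bias can in principle be corrected by conditioning on age, since age is observed. The MNAR bias cannot be detected from the observed values alone, which is why it usually calls for domain knowledge.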
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')

# Summary of missingness
missing_summary = pd.DataFrame({
    'count': df.isnull().sum(),
    'pct': df.isnull().mean() * 100
}).sort_values('pct', ascending=False)
print(missing_summary[missing_summary['count'] > 0])

# Missingness heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
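# Do columns tend to be missing together? Co-occurring gaps suggest the data is not MCAR.
# Restrict to columns that have missing values; fully observed columns would yield NaN correlations.
cols_with_na = df.columns[df.isnull().any()]
missing_corr = df[cols_with_na].isnull().corr()
plt.figure()  # New figure so this plot does not draw over the previous heatmap
sns.heatmap(missing_corr, annot=True, cmap='coolwarm')
plt.title('Missing Value Correlation')
plt.show()
Simple Imputation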
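As a self-contained illustration of the fit-on-train, transform-on-test pattern used in this section, here is a minimal sketch on synthetic data. It also shows one side effect of filling every gap with a single statistic: the imputed column has less spread than the observed values. The array names, the 20% missing rate, and the 800/200 split are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)

# Synthetic single-feature data; values and missing rate are illustrative
X_full = rng.normal(loc=50, scale=10, size=(1_000, 1))
X_missing = X_full.copy()
X_missing[rng.random(1_000) < 0.20, 0] = np.nan  # roughly 20% missing, MCAR

X_train_toy, X_test_toy = X_missing[:800], X_missing[800:]

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train_toy)  # learns the median from train only
X_test_filled = imputer.transform(X_test_toy)        # reuses the training median

train_median = imputer.statistics_[0]
std_before = np.nanstd(X_train_toy)   # spread of the observed training values
std_after = X_train_filled.std()      # spread after filling gaps with the median
print(f"training median: {train_median:.2f}")
print(f"std before imputation: {std_before:.2f}, after: {std_after:.2f}")
```

The standard deviation drops after imputation because every filled value sits at the same point. If downstream steps are sensitive to variance or correlations, that is one reason to consider the KNN or iterative approaches instead.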
from sklearn.impute import SimpleImputer
# Numeric features
num_imputer = SimpleImputer(strategy='median')  # Robust to outliers
# Alternative strategies: 'mean', 'constant', 'most_frequent'

X_num_imputed = num_imputer.fit_transform(X_numeric_train)
X_num_test = num_imputer.transform(X_numeric_test)

# Categorical features
cat_imputer = SimpleImputer(strategy='most_frequent')  # Most common value
# Or use constant: SimpleImputer(strategy='constant', fill_value='Unknown')

X_cat_imputed = cat_imputer.fit_transform(X_categorical_train)
KNN Imputation
Fills each missing value with the average of that feature across the K nearest rows in which it is observed. This tends to work better than a single column statistic when features are correlated:
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(
    n_neighbors=5,
    weights='uniform',       # Or 'distance': closer neighbors get more weight
    metric='nan_euclidean'   # Handles NaN in distance computation
)

X_imputed = knn_imputer.fit_transform(X_train)
When to use: Features are correlated (knowing similar patients’ values helps estimate the missing one). Slower than SimpleImputer for large datasets.
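Because KNN imputation picks neighbors by distance, a feature measured in large units can dominate the neighbor search. One common way to handle this, sketched below on a toy array, is to standardize before imputing and then map the result back to the original units. The data and step names are hypothetical. StandardScaler is used here because it disregards NaN when fitting and leaves NaN in place when transforming.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 in small units, column 1 in large units (illustrative)
X_toy = np.array([
    [1.0, 20_000.0],
    [2.0, np.nan],
    [3.0, 60_000.0],
    [np.nan, 80_000.0],
    [5.0, 100_000.0],
])

knn_pipeline = Pipeline([
    ("scale", StandardScaler()),            # ignores NaN in fit, keeps NaN in output
    ("impute", KNNImputer(n_neighbors=2)),  # neighbors found on the scaled features
])

X_scaled_filled = knn_pipeline.fit_transform(X_toy)

# Map back to the original units for inspection
X_filled = knn_pipeline.named_steps["scale"].inverse_transform(X_scaled_filled)
print(np.round(X_filled, 1))
```

Whether to invert the scaling depends on what comes next. Many models are happy to consume the standardized values directly.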
Multivariate Imputation (MICE)
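The most flexible approach covered here: each column with missing data is modeled as a function of the other columns, and the estimates are refined over several rounds. scikit-learn's IterativeImputer is inspired by MICE but returns a single imputation by default, and it is still marked experimental, hence the extra enabling import: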
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the IterativeImputer import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Default: Bayesian Ridge for each column
mice_imputer = IterativeImputer(max_iter=10, random_state=42)

# Custom estimator for each column
mice_rf = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42
)

X_imputed = mice_rf.fit_transform(X_train)
Missing Indicator Features
For MNAR data, the fact that a value is missing can itself be informative. Add a binary “was this missing?” flag so the model can use that signal:
from sklearn.impute import MissingIndicator
indicator = MissingIndicator(features='missing-only')  # Only add flags for columns with NaN
missing_flags = indicator.fit_transform(X_train)

import numpy as np
# X_imputed is the imputed version of the same X_train rows from an earlier step
X_with_flags = np.hstack([X_imputed, missing_flags])

# Or use add_indicator parameter in SimpleImputer
imputer_with_flags = SimpleImputer(strategy='median', add_indicator=True)
X_imputed_with_flags = imputer_with_flags.fit_transform(X_train)
When to Drop vs. Impute
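One input to the drop-or-impute decision is whether a column's missingness carries signal about the target. A minimal sketch of that check on synthetic data follows. The column names and missingness rates are hypothetical, and a gap between the two rates is only a hint worth investigating, not a formal test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000

# Synthetic data where `income` is missing more often for positive targets
target = rng.integers(0, 2, size=n)
income = rng.normal(50_000, 15_000, size=n)
p_missing = np.where(target == 1, 0.40, 0.10)
income[rng.random(n) < p_missing] = np.nan
toy = pd.DataFrame({"income": income, "target": target})

# Compare the target rate between rows with and without the value
is_missing = toy["income"].isnull()
rate_when_missing = toy.loc[is_missing, "target"].mean()
rate_when_present = toy.loc[~is_missing, "target"].mean()
print(f"target rate when income is missing: {rate_when_missing:.2f}")
print(f"target rate when income is present: {rate_when_present:.2f}")
```

When the two rates differ noticeably, dropping the column would discard that signal, so imputing and keeping a missing indicator is usually the safer choice.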
# Drop columns with too much missing data
threshold = 0.50  # Drop if >50% missing (stricter than the 80% cutoff in the guide below; tune per project)
cols_to_drop = [col for col in df.columns if df[col].isnull().mean() > threshold]
df_cleaned = df.drop(columns=cols_to_drop)

# Drop rows where target is missing (never impute target)
df_cleaned = df_cleaned.dropna(subset=['target'])

# Keep but impute features with moderate missingness (<50%)
# The missing indicator captures the signal from the missingness pattern
Decision guide:
- >80% missing: drop the column
- 30–80% missing: impute + add missing indicator
- <30% missing: simple or KNN imputation usually sufficient
- Target variable missing: drop the row
- MNAR data: domain knowledge required; consider adding missing indicator
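The percentage bands in the guide can be turned into a small helper that labels each feature. This is only a sketch: the function name, the toy columns, and the action labels are hypothetical, and it assumes the top band means more than 80% missing. It cannot detect MNAR, which still needs domain knowledge.

```python
import numpy as np
import pandas as pd

def suggest_strategy(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Map each feature's missing fraction to the decision guide's suggestion."""
    rows = []
    for col in df.columns:
        if col == target:
            continue  # rows with a missing target are dropped, not imputed
        frac = df[col].isnull().mean()
        if frac == 0:
            action = "nothing to impute"
        elif frac > 0.80:
            action = "drop column"
        elif frac >= 0.30:
            action = "impute + missing indicator"
        else:
            action = "simple or KNN imputation"
        rows.append({"column": col, "missing_frac": frac, "action": action})
    return pd.DataFrame(rows).set_index("column")

# Toy frame with one column per band
toy_df = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,     # 90% missing
    "half_missing": [1.0] * 5 + [np.nan] * 5,   # 50% missing
    "few_missing": [1.0] * 9 + [np.nan],        # 10% missing
    "complete": [1.0] * 10,
    "target": [0, 1] * 5,
})

result = suggest_strategy(toy_df, target="target")
print(result)
```

The cutoffs are rules of thumb, so treat the output as a starting point for review rather than a final decision.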
Cross-Validation Safe Imputation
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', GradientBoostingClassifier())
])

# cross_val_score correctly re-fits imputer on each training fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
Using a Pipeline ensures your imputation statistics (mean, median, etc.) are computed from training data only during cross-validation, avoiding leakage from validation folds.
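Real datasets usually mix numeric and categorical columns, which need different imputers. One way to keep everything inside a single leakage-safe pipeline is a ColumnTransformer, sketched here on synthetic data. The column names, missing rates, and the choice of one-hot encoding are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400

# Synthetic mixed-type data; column names are hypothetical
toy = pd.DataFrame({
    "age": rng.uniform(20, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "segment": pd.Series(rng.choice(["a", "b", "c"], size=n), dtype=object),
})
y_toy = (toy["age"] + rng.normal(0, 10, size=n) > 45).astype(int)

# Knock out roughly 15% of the values in two columns
toy.loc[rng.random(n) < 0.15, "income"] = np.nan
toy.loc[rng.random(n) < 0.15, "segment"] = np.nan

preprocess = ColumnTransformer([
    # Median fill plus "was missing" flags for numeric columns
    ("num", SimpleImputer(strategy="median", add_indicator=True), ["age", "income"]),
    # Explicit "Unknown" category, then one-hot encoding, for categoricals
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["segment"]),
])

full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Every imputer and the encoder are re-fit on each training fold
cv_scores = cross_val_score(full_pipeline, toy, y_toy, cv=5, scoring="roc_auc")
print(f"mean ROC AUC: {cv_scores.mean():.3f}")
```

Swapping SimpleImputer for KNNImputer or IterativeImputer in the numeric branch keeps the same leakage guarantees, since the whole preprocessing step lives inside the pipeline.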