Data Preprocessing
Raw data is rarely ready for machine learning. It contains missing values, mixed scales, categorical strings, and outliers. Preprocessing transforms raw data into a format that models can learn from. Done correctly inside a pipeline, it prevents data leakage and makes the workflow reproducible.
The Preprocessing Workflow
Raw Data → Handle Missing → Encode Categoricals → Scale Numerics → Feature Engineering → Model
Each step must be fitted on training data only, then applied identically to validation and test data.
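The fit-on-train rule can be illustrated with a single scaler. This is a minimal sketch on synthetic data; the array `X`, the split sizes, and the seeds are assumptions made for illustration, not part of any particular dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # same statistics reused, no refitting

print(scaler.mean_, scaler.scale_)
```

Calling `fit_transform` on the test split instead would recompute the mean and standard deviation from held-out rows, which is exactly the leakage the rule is meant to avoid.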
Checking Your Data
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# Overview
print(df.dtypes)
print(df.describe())
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # % missing per column

# Unique values in categorical columns
for col in df.select_dtypes('object').columns:
    print(f"{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head())

Handling Missing Values
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Simple imputation
numeric_imputer = SimpleImputer(strategy='median')  # Robust to outliers
categorical_imputer = SimpleImputer(strategy='most_frequent')

# KNN imputation (uses similar samples)
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')

# Multivariate imputation (MICE)
iterative_imputer = IterativeImputer(max_iter=10, random_state=42)

# Add indicator for which values were imputed
from sklearn.impute import MissingIndicator
indicator = MissingIndicator(features='missing-only')  # Adds binary flags

When to drop vs. impute:
- Drop columns with >80% missing (rarely informative)
- Drop rows with missing target variable
- Impute feature columns when less than 30% of values are missing and the missingness appears random
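The rules of thumb above can be applied in a few lines of pandas. The sketch below uses a hypothetical toy frame; the column names (`age`, `notes`, `target`) and the values are assumptions chosen only so that each rule fires once. For brevity the imputer is fitted on the whole toy frame, whereas in a real workflow it would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; column names and values are assumptions
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33, 51, 29, np.nan, 45, 38, 60],
    "notes":  [np.nan] * 9 + [1.0],                 # almost entirely missing
    "target": [0, 1, 0, np.nan, 1, 0, 1, 1, 0, 1],
})

# 1. Drop rows whose target is missing
df = df.dropna(subset=["target"])

# 2. Drop columns with more than 80% missing values
missing_share = df.isnull().mean()
df = df.drop(columns=missing_share[missing_share > 0.80].index)

# 3. Impute the remaining feature columns with the median
features = df.drop(columns="target")
imputer = SimpleImputer(strategy="median")
features_imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print(features_imputed)
```

The thresholds are heuristics, so a column between 30% and 80% missing still needs a judgment call, for example keeping it with a missing-value indicator.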
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler: zero mean, unit variance
# Use for: SVM, KNN, PCA, neural networks, logistic regression
standard = StandardScaler()  # (x - mean) / std

# MinMaxScaler: scales to [0, 1]
# Use for: neural networks with sigmoid/tanh activations, bounded features
minmax = MinMaxScaler()  # (x - min) / (max - min)

# RobustScaler: uses median and IQR instead of mean/std
# Use for: data with outliers (doesn't get pulled by extreme values)
robust = RobustScaler()  # (x - median) / IQR
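To see how the three scalers react to an outlier, the following sketch runs each one on a small made-up feature. The six values are illustrative assumptions, not data from any real source.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy feature with one extreme value (illustrative numbers only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

# Fit each scaler on the same column and print the scaled values side by side
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    print(f"{type(scaler).__name__:>14}: {np.round(scaled, 2)}")

# Keep two results around for a direct comparison
robust_scaled = RobustScaler().fit_transform(x).ravel()
minmax_scaled = MinMaxScaler().fit_transform(x).ravel()
```

With this input, MinMaxScaler squeezes the five ordinary values into a narrow band near zero because the outlier defines the maximum, while RobustScaler keeps them spread out and lets the outlier land far from the rest.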
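# Scaling is unnecessary for tree-based models (splits depend only on the ordering of feature values)
# and matters little for Gaussian Naive Bayes, which models each feature separately

Encoding Categorical Variables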
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import TargetEncoder  # sklearn 1.3+

# One-Hot Encoding: for nominal categories (no order)
# Creates N binary columns for N categories (N - 1 when drop='first')
ohe = OneHotEncoder(drop='first',             # Avoid multicollinearity
                    sparse_output=False,      # Dense output (parameter name in sklearn 1.2+)
                    handle_unknown='ignore')  # Unseen categories at test time become all zeros,
                                              # the same encoding as the dropped first category

# Ordinal Encoding: for ordinal categories (have a natural order)
# Low, Medium, High → 0, 1, 2
ord_enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])

# Target Encoding: replaces category with a smoothed mean of the target
# Use for high-cardinality categoricals (hundreds of categories)
target_enc = TargetEncoder(cv=5, random_state=42)  # sklearn 1.3+

# Label Encoding: for the target variable only (not features)
label_enc = LabelEncoder()
y_encoded = label_enc.fit_transform(y)

Building a Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier

# Define column groups
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'job_type', 'marital_status']
ordinal_features = ['satisfaction_level']

# Numeric pipeline
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))
])

# Combine all transformations
# Note: ordinal_features is not wired in here, so remainder='drop' discards
# 'satisfaction_level'; add an OrdinalEncoder branch to keep it
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_features),
    ('categorical', categorical_pipeline, categorical_features),
], remainder='drop')

# Full ML pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, random_state=42))
])

# X_train, y_train, and X_test are assumed to come from an earlier train/test split
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)

The ColumnTransformer applies each transformation to its own group of columns. Because preprocessing lives inside the Pipeline, passing the whole pipeline to a cross-validation routine refits every preprocessing step on each training fold only, so no statistics from held-out data leak into training.