Data Preprocessing: Preparing Raw Data for Machine Learning

Learn data preprocessing for ML — handling missing values, scaling, encoding categoricals, outlier detection, pipelines, and building reproducible preprocessing workflows.

Data Preprocessing

Raw data is rarely ready for machine learning. It contains missing values, mixed scales, categorical strings, and outliers. Preprocessing transforms raw data into a format that models can learn from. Done correctly inside a pipeline, it prevents data leakage and makes the workflow reproducible.


The Preprocessing Workflow

Raw Data → Handle Missing → Encode Categoricals → Scale Numerics → Feature Engineering → Model

Each step must be fitted on training data only, then applied identically to validation and test data.
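
As a minimal sketch of the fit-on-train, apply-everywhere rule, the snippet below uses synthetic data (an assumption for illustration) and a StandardScaler. The same pattern would apply to any imputer or encoder.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 100 samples, 2 numeric features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; no refit

# The stored statistics come from the training split alone
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Calling fit_transform on the test split instead would recompute the mean and standard deviation from test rows, which is the leakage this rule is meant to avoid.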


Checking Your Data

import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# Overview
print(df.dtypes)
print(df.describe())
print(df.isnull().sum())
print(df.isnull().mean() * 100) # % missing per column
# Unique values in categorical columns
for col in df.select_dtypes('object').columns:
    print(f"{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head())

Handling Missing Values

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Simple imputation
numeric_imputer = SimpleImputer(strategy='median') # Robust to outliers
categorical_imputer = SimpleImputer(strategy='most_frequent')
# KNN imputation (uses similar samples)
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')
# Multivariate imputation (MICE)
iterative_imputer = IterativeImputer(max_iter=10, random_state=42)
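# Flag which values were missing (returns binary indicator columns only)
# To append these flags to imputed data, pass add_indicator=True to the imputer
from sklearn.impute import MissingIndicator
indicator = MissingIndicator(features='missing-only') # Only columns with missing values at fit time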
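
To make the imputers above concrete, here is a small sketch on a made-up array (the values are assumptions chosen for illustration). It shows median imputation combined with add_indicator=True, so the model can still see which values were originally missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data: column 0 has an outlier (100.0), column 1 has a missing value
X_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0],
                    [100.0, 40.0]])
X_test = np.array([[5.0, np.nan]])

imputer = SimpleImputer(strategy='median', add_indicator=True)
imputer.fit(X_train)                        # medians are learned from training rows
X_test_imputed = imputer.transform(X_test)  # missing value filled, flag column appended

print(imputer.statistics_)  # per-column medians learned at fit time
print(X_test_imputed)       # original columns followed by the missing-value flag
```

The median of column 0 is 2.5, whereas its mean would be 26.5 because of the single outlier, which is why the median is the more robust default here.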

When to drop vs. impute:

  • Drop columns with >80% missing (rarely informative)
  • Drop rows with missing target variable
  • Impute feature columns when less than roughly 30% is missing and the missingness appears random (these thresholds are rules of thumb, not hard limits)
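
The rules above could be applied along these lines. This is a sketch on a hypothetical toy DataFrame; the column names (including 'target') and the exact cutoffs are assumptions to adjust for your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 'target' is an assumed name for the label column
df = pd.DataFrame({
    'age':    [25, np.nan, 40, 31, 58, 47],                   # ~17% missing
    'notes':  [np.nan, np.nan, np.nan, np.nan, np.nan, 'vip'],  # ~83% missing
    'target': [0, 1, np.nan, 1, 0, 1],
})

# Share of missing values per feature column
missing_share = df.drop(columns='target').isnull().mean()

# Rule 1: drop feature columns that are more than 80% missing
sparse_cols = missing_share[missing_share > 0.80].index.tolist()

# Rule 2: drop rows where the target itself is missing
df_clean = df.drop(columns=sparse_cols).dropna(subset=['target'])

# Rule 3: remaining columns with some (but under 30%) missing values are imputation candidates
to_impute = missing_share[(missing_share > 0) & (missing_share < 0.30)].index.tolist()

print(sparse_cols, to_impute, df_clean.shape)
```

Columns that fall between the two thresholds need a case-by-case decision, for example by checking whether missingness correlates with the target.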

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler: zero mean, unit variance
# Use for: SVM, KNN, PCA, neural networks, logistic regression
standard = StandardScaler() # (x - mean) / std
# MinMaxScaler: scales to [0, 1]
# Use for: neural networks with sigmoid/tanh activations, bounded features
minmax = MinMaxScaler() # (x - min) / (max - min)
# RobustScaler: uses median and IQR instead of mean/std
# Use for: data with outliers (doesn't get pulled by extreme values)
robust = RobustScaler() # (x - median) / IQR
# Scaling is usually unnecessary for tree-based models (splits depend only on the ordering of values) and for Naive Bayes
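
To see how the three scalers differ in practice, the sketch below applies each to a single made-up feature containing one outlier (the values are an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature with four typical values and one outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X).ravel()
    results[type(scaler).__name__] = scaled
    print(f"{type(scaler).__name__:>14}: {np.round(scaled, 2)}")
```

With MinMaxScaler the four typical values are squeezed into roughly the bottom 3% of the [0, 1] range because the outlier defines the maximum. RobustScaler centers on the median (3) and divides by the IQR (2), so the typical values stay spread between -1 and 0.5 while the outlier remains clearly separated.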

Encoding Categorical Variables

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import TargetEncoder
# One-Hot Encoding: for nominal categories (no order)
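# Creates one binary column per category (N-1 columns when drop='first')
ohe = OneHotEncoder(drop='first',            # Avoid multicollinearity in linear models
                    sparse_output=False,     # Return a dense array (sklearn 1.2+)
                    handle_unknown='ignore') # Unseen categories encode as all zeros, the same as the dropped category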
# Ordinal Encoding: for ordinal categories (have a natural order)
# Low, Medium, High → 0, 1, 2
ord_enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
# Target Encoding: replaces category with mean target value
# Use for high-cardinality categoricals (hundreds of categories)
target_enc = TargetEncoder(cv=5, random_state=42) # sklearn 1.3+
# Label Encoding: for the target variable only (not features)
label_enc = LabelEncoder()
y_encoded = label_enc.fit_transform(y)

Building a Preprocessing Pipeline

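from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler
# Define column groups
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'job_type', 'marital_status']
ordinal_features = ['satisfaction_level']
# Numeric pipeline
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler())
])
# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))
])
# Ordinal pipeline (assumes satisfaction_level takes the values Low, Medium, High;
# replace the list with the real category order for your data)
ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]))
])
# Combine all transformations; columns not listed here are dropped
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_features),
    ('categorical', categorical_pipeline, categorical_features),
    ('ordinal', ordinal_pipeline, ordinal_features),
], remainder='drop')
# Full ML pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, random_state=42))
])
# X_train, y_train, X_test come from your own train/test split
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)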

The ColumnTransformer applies each transformation only to the columns assigned to it. Because the preprocessing lives inside the Pipeline, cross-validation refits it on the training folds of each split, so statistics from the held-out fold never leak into the fitted transformers.
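
As a sketch of how this plays out under cross-validation, the example below passes a whole pipeline to cross_val_score. The data is synthetic, and the smaller column set and LogisticRegression model are assumptions made to keep the example short and fast.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (hypothetical columns and values)
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.integers(18, 70, size=n).astype(float),
    'income': rng.normal(50_000, 15_000, size=n),
    'job_type': rng.choice(['office', 'field', 'remote'], size=n),
})
X.loc[rng.choice(n, size=20, replace=False), 'income'] = np.nan  # inject missing values
y = (X['age'] > 40).astype(int)  # toy target that depends on age

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, ['age', 'income']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['job_type']),
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(max_iter=1000)),
])

# Each of the 5 splits refits the imputer, scaler, and encoder on its training folds only
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Scaling or imputing X once before calling cross_val_score would instead compute statistics over all rows, including those later used for validation.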