Outlier Detection
Outliers are data points that differ significantly from other observations. They can be errors (miscoded measurements), rare but valid events (fraud), or important signals (equipment failure). Before removing outliers, understand what they represent — removing valid outliers destroys real information.
Statistical Methods
IQR (Interquartile Range)
Robust to extreme values because it relies on quartiles rather than the mean and standard deviation. Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers:
import pandas as pd
import numpy as np
def iqr_outliers(series, factor=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    return (series < lower) | (series > upper)
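To make the fences concrete, here is a minimal sketch on a toy series. The helper `iqr_bounds` and the `readings` data are hypothetical and only serve the illustration; the rule is the same one used in `iqr_outliers`.

```python
import pandas as pd

def iqr_bounds(series, factor=1.5):
    """Return the (lower, upper) fences used by the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Hypothetical toy data: eight ordinary readings and one extreme value
readings = pd.Series([10, 12, 12, 13, 14, 14, 15, 16, 95])

lower, upper = iqr_bounds(readings)
flagged = readings[(readings < lower) | (readings > upper)]

print(f"fences: [{lower}, {upper}]")
print(f"flagged values: {flagged.tolist()}")
```

On this series Q1 is 12 and Q3 is 15, so the fences are 7.5 and 19.5 and only the value 95 is flagged.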
# Apply to each numeric column (assumes df is an existing DataFrame)
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    mask = iqr_outliers(df[col])
    print(f"{col}: {mask.sum()} outliers ({mask.mean()*100:.1f}%)")

Z-Score
Flags points more than N standard deviations from the mean (N = 3 is a common choice). The method is itself sensitive to outliers, because extreme values inflate both the mean and the standard deviation:
from scipy import stats
z_scores = np.abs(stats.zscore(X_numeric))
outlier_mask = (z_scores > 3).any(axis=1)
print(f"Outliers (Z>3): {outlier_mask.sum()} rows")
# More robust: modified Z-score using the median absolute deviation (MAD)
def modified_zscore(series):
    median = series.median()
    mad = (series - median).abs().median()
    return 0.6745 * (series - median) / (mad + 1e-8)
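The difference between the two scores is easiest to see on a small sample. The sketch below uses a hypothetical ten-value series with one extreme point and computes both scores by hand, using the population standard deviation (`ddof=0`) to match the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical toy data: nine ordinary values and one extreme value
values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 500])

# Classic z-score: the extreme value inflates both the mean and the std
z = (values - values.mean()) / values.std(ddof=0)

# Modified z-score: median and MAD are barely affected by the extreme value
median = values.median()
mad = (values - median).abs().median()
mz = 0.6745 * (values - median) / (mad + 1e-8)

z_flags = z.abs() > 3
mz_flags = mz.abs() > 3.5

print(f"z-score flags:          {z_flags.sum()}")
print(f"modified z-score flags: {mz_flags.sum()}")
```

In this sample the classic z-score of 500 stays just below the 3.0 cutoff, so nothing is flagged, while the modified z-score flags it clearly.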
mz = modified_zscore(df['column']).abs()
outliers = mz > 3.5

Machine Learning Methods
Isolation Forest
The algorithm isolates points with random splits. Outliers sit in sparse regions, so they need fewer splits to isolate than inliers do:
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # Expected fraction of outliers (0.05 = 5%)
    random_state=42
)

labels = iso_forest.fit_predict(X)  # -1 = outlier, 1 = inlier
scores = iso_forest.score_samples(X) # More negative = more anomalous
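As a rough illustration of how the labels and scores relate, the sketch below fits the same estimator on synthetic data. The arrays `inliers`, `far_points`, and `X_demo` are made up for the example, and the variable names are kept separate from the snippet above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical data: 200 points near the origin plus 5 points far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X_demo = np.vstack([inliers, far_points])

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
demo_labels = forest.fit_predict(X_demo)      # -1 = outlier, 1 = inlier
demo_scores = forest.score_samples(X_demo)    # lower = more anomalous

# contamination sets the score threshold (offset_) used to assign labels
print(f"threshold (offset_): {forest.offset_:.3f}")
print(f"flagged: {(demo_labels == -1).sum()} of {len(X_demo)}")
print(f"mean score, inliers:    {demo_scores[:200].mean():.3f}")
print(f"mean score, far points: {demo_scores[200:].mean():.3f}")
```

Because `contamination` fixes the share of points labelled -1, treat it as an assumption about the data rather than something the model discovers. The raw scores are often more useful for ranking points for review.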
outlier_mask = labels == -1
print(f"Detected {outlier_mask.sum()} outliers ({outlier_mask.mean()*100:.1f}%)")

Local Outlier Factor (LOF)
Compares the local density of each point with that of its neighbors. This suits datasets where outliers are local, meaning a value that is normal in one region is anomalous in another:
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.05,
    metric='euclidean'
)

labels = lof.fit_predict(X)  # -1 = outlier
lof_scores = lof.negative_outlier_factor_  # More negative = more anomalous

One-Class SVM
Learns a boundary around normal data and flags everything outside it:
from sklearn.svm import OneClassSVM
oc_svm = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale')
labels = oc_svm.fit_predict(X_train)  # Fit on normal data only
test_labels = oc_svm.predict(X_test)  # -1 = anomaly

Visualizing Outliers
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for idx, col in enumerate(numeric_cols[:6]):
    ax = axes[idx // 3][idx % 3]

    # Box plot
    ax.boxplot(df[col].dropna())
    ax.set_title(f'{col}')

    # Identify outliers with IQR
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[col] < q1 - 1.5*iqr) | (df[col] > q3 + 1.5*iqr)][col]
    ax.set_xlabel(f'{len(outliers)} outliers')

plt.tight_layout()

Handling Detected Outliers
# Option 1: Remove outliers (only if you're confident they're errors)
X_clean = X[~outlier_mask]
y_clean = y[~outlier_mask]

# Option 2: Cap (Winsorize): replace extreme values with boundary values
from scipy.stats.mstats import winsorize

# Cap at the 5th and 95th percentiles
df['capped_col'] = winsorize(df['col'], limits=[0.05, 0.05])
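A similar kind of capping can be sketched with plain pandas, which keeps the bounds available as ordinary numbers. The `income` series below is hypothetical, and the results will not match `winsorize` exactly because the percentile bounds are interpolated.

```python
import pandas as pd

# Hypothetical column with one extreme value at each end
income = pd.Series([1, 20, 21, 22, 23, 24, 25, 26, 27, 300], name="income")

# Percentile bounds computed from the data (5th and 95th)
low, high = income.quantile([0.05, 0.95])

# clip() replaces values outside [low, high] with the nearest bound
capped = income.clip(lower=low, upper=high)

print(f"bounds: [{low:.2f}, {high:.2f}]")
print(capped.tolist())
```

Keeping `low` and `high` as variables makes it possible to compute them once on training data and reuse them on new data.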
# Option 3: Transform (log/sqrt reduces the impact of outliers)
df['log_income'] = np.log1p(df['income'])  # log1p handles 0 values

# Option 4: Use algorithms that are less sensitive to outliers
# → Random Forest, Gradient Boosting, RobustScaler + Ridge

Decision Framework
Outlier detected
  → Is it a data entry error?
      Yes → Correct or remove
      No  → Is it rare but real?
              Yes → Keep it (contains signal)
              Rare and the model shouldn't generalize to it? → Remove from training, flag in deployment

Never remove outliers from a test set: your model will see them in production. The goal is a model that handles outliers gracefully, not one that only works on clean data.
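The "remove from training, flag in deployment" branch can be sketched as follows. This is one possible arrangement, assuming the IQR rule and a hypothetical `amount` column: the bounds are computed on the training data only, training rows outside them are dropped, and test rows are only flagged.

```python
import pandas as pd

def fit_iqr_bounds(train_series, factor=1.5):
    """Compute IQR fences from training data only."""
    q1, q3 = train_series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Hypothetical train and test sets
train = pd.DataFrame({"amount": [10, 12, 12, 13, 14, 14, 15, 16, 95]})
test = pd.DataFrame({"amount": [11, 13, 120]})

lower, upper = fit_iqr_bounds(train["amount"])

# Training: drop rows outside the fences
train_clean = train[train["amount"].between(lower, upper)]

# Test / production: keep every row and add a flag instead of removing
test = test.assign(is_outlier=~test["amount"].between(lower, upper))

print(f"train rows kept: {len(train_clean)} of {len(train)}")
print(test)
```

The test set keeps all of its rows, so evaluation still reflects what the model will encounter in production, and the `is_outlier` column can drive monitoring or a fallback path.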