Outlier Detection

Outliers are data points that differ significantly from other observations. They can be errors (miscoded measurements), rare but valid events (fraud), or important signals (equipment failure). Before removing outliers, understand what they represent — removing valid outliers destroys real information.

Statistical Methods

IQR (Interquartile Range)

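Flags values that lie more than 1.5× the IQR below Q1 or above Q3. Quartiles barely move when extreme values are present, so the fences themselves are robust to outliers. On heavily skewed data the symmetric fences tend to over-flag the long tail, so check the distribution before trusting the counts: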

import pandas as pd
import numpy as np

def iqr_outliers(series, factor=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    return (series < lower) | (series > upper)

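# Apply to each numeric column (df is assumed to be an existing DataFrame)
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    mask = iqr_outliers(df[col])
    print(f"{col}: {mask.sum()} outliers ({mask.mean()*100:.1f}%)")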
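
To make the fence arithmetic concrete, here is a small worked example on made-up data. The `readings` series is hypothetical; the calculation is the same one `iqr_outliers` performs.

```python
import pandas as pd

# Hypothetical toy data: nine typical readings and one extreme value
readings = pd.Series([10, 12, 11, 13, 12, 11, 14, 12, 13, 100])

q1 = readings.quantile(0.25)   # 11.25
q3 = readings.quantile(0.75)   # 13.0
iqr = q3 - q1                  # 1.75
lower = q1 - 1.5 * iqr         # 8.625
upper = q3 + 1.5 * iqr         # 15.625

flagged = readings[(readings < lower) | (readings > upper)]
print(f"fences: [{lower}, {upper}]")
print(flagged.tolist())  # [100]
```

Only the value 100 falls outside the fences. The extreme value barely shifts Q1 and Q3, so it does not widen the fences enough to hide itself.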

Z-Score

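Flags points more than N standard deviations from the mean. Both the mean and the standard deviation are pulled by the outliers themselves, so extreme values inflate the std and can mask one another, especially in small samples: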

from scipy import stats

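# X_numeric: DataFrame or 2-D array holding only numeric features, no missing values
z_scores = np.abs(stats.zscore(X_numeric))
outlier_mask = (z_scores > 3).any(axis=1)  # flag a row if any feature exceeds the threshold
print(f"Outliers (Z>3): {outlier_mask.sum()} rows")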

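# More robust: modified Z-score using the median absolute deviation (MAD)
def modified_zscore(series):
    median = series.median()
    mad = (series - median).abs().median()
    # 0.6745 rescales the MAD so scores are comparable to standard Z-scores on normal data.
    # If MAD is 0 (more than half the values equal the median), the epsilon makes every
    # other value score as extreme, so inspect such columns by hand.
    return 0.6745 * (series - median) / (mad + 1e-8)

mz = modified_zscore(df['column']).abs()  # 'column' is a placeholder name
outliers = mz > 3.5  # 3.5 is the commonly recommended cutoff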
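
The masking effect is easy to reproduce. With the population standard deviation, no value in a sample of n points can have |z| above sqrt(n - 1), so in a sample of nine points the Z > 3 rule can never fire. The sketch below uses hypothetical values to compare both scores on the same data.

```python
import pandas as pd

# Hypothetical sample: eight ordinary values and one gross error
values = pd.Series([10, 11, 12, 11, 10, 12, 11, 12, 1000])

# Classic Z-score (population std, matching scipy.stats.zscore's default ddof=0)
z = ((values - values.mean()) / values.std(ddof=0)).abs()

# Modified Z-score based on the median and the MAD
median = values.median()
mad = (values - median).abs().median()
mz = (0.6745 * (values - median) / mad).abs()

z_flags = z > 3
mz_flags = mz > 3.5
print(f"max |z| = {z.max():.2f}, flagged: {z_flags.sum()}")
print(f"max |modified z| = {mz.max():.1f}, flagged: {mz_flags.sum()}")
```

The classic score misses the value 1000 because that value inflates the very std it is measured against. The median and MAD are unaffected by it, so the modified score flags it clearly.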

Machine Learning Methods

Isolation Forest

Builds trees from random feature splits. Outliers sit in sparse regions, so they are isolated in fewer splits (a shorter path from the root) than typical points:

from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # Expected fraction of outliers (0.05 = 5%)
    random_state=42
)

labels = iso_forest.fit_predict(X)
# -1 = outlier, 1 = inlier

scores = iso_forest.score_samples(X)  # More negative = more anomalous

outlier_mask = labels == -1
print(f"Detected {outlier_mask.sum()} outliers ({outlier_mask.mean()*100:.1f}%)")
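
The `contamination` setting only moves the decision threshold; it does not change the scores. When the true outlier rate is unknown, it can help to look at the scores directly. The following is a minimal sketch on synthetic data, where the planted points and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: 200 points from one Gaussian cluster plus 5 planted far-away points
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8, 8], [-8, 8], [8, -8], [-8, -8], [0, 10]], dtype=float)
X_demo = np.vstack([inliers, planted])

forest = IsolationForest(n_estimators=100, random_state=0)  # contamination left at 'auto'
forest.fit(X_demo)
scores = forest.score_samples(X_demo)  # lower = more anomalous

# Rank by score instead of relying on a fixed contamination rate
most_anomalous = np.argsort(scores)[:5]
print("lowest-scoring rows:", sorted(most_anomalous.tolist()))
print(f"median inlier score: {np.median(scores[:200]):.3f}")
print("planted scores:", np.round(scores[200:], 3))
```

The planted rows (indices 200 to 204) should score well below the bulk of the cluster. A histogram of `scores` is often a better guide to a sensible cutoff than a guessed contamination value.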

Local Outlier Factor (LOF)

Compares the local density of each point to that of its neighbors. Useful when clusters differ in density: a point can look unremarkable by global distance yet be anomalous relative to its own neighborhood:

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.05,
    metric='euclidean'
)

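labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
lof_scores = lof.negative_outlier_factor_  # More negative = more anomalous
# To score unseen data, create the model with novelty=True, fit on training data, then call predict()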
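
A small synthetic example shows what "local" means here. The data below is hypothetical: a tight cluster, a loose cluster, and one probe point that sits about 1.4 units from the tight cluster's center. That distance would be ordinary inside the loose cluster but is far outside the tight one.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical data: a tight cluster, a loose cluster, and one point just outside the tight one
tight = rng.normal(loc=0.0, scale=0.1, size=(100, 2))
loose = rng.normal(loc=10.0, scale=1.5, size=(100, 2))
probe = np.array([[1.0, 1.0]])
X_demo = np.vstack([tight, loose, probe])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_demo)
scores = lof.negative_outlier_factor_

probe_idx = len(X_demo) - 1
print(f"probe label: {labels[probe_idx]}, score: {scores[probe_idx]:.1f}")
print(f"median score in loose cluster: {np.median(scores[100:200]):.2f}")
```

The probe is flagged because its neighbors in the tight cluster are far denser than it is. Points in the loose cluster that lie just as far from their own center score close to -1, which is the typical inlier value.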

One-Class SVM

Learns a boundary around normal data and flags everything outside it. The nu parameter is an upper bound on the fraction of training points treated as outliers:

from sklearn.svm import OneClassSVM

oc_svm = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale')
labels = oc_svm.fit_predict(X_train)  # Fit on normal data only
test_labels = oc_svm.predict(X_test)   # -1 = anomaly
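
The RBF kernel is distance-based, so features on larger scales dominate unless the inputs are standardized first. A minimal sketch of one way to handle this wraps the detector in a scikit-learn pipeline; the two features and their values are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "normal" training data with features on very different scales
X_train_demo = np.column_stack([
    rng.normal(loc=50.0, scale=5.0, size=300),     # e.g. temperature
    rng.normal(loc=0.02, scale=0.005, size=300),   # e.g. vibration amplitude
])

# Scaling is fit on the training data and reused at prediction time
detector = make_pipeline(
    StandardScaler(),
    OneClassSVM(nu=0.05, kernel='rbf', gamma='scale'),
)
detector.fit(X_train_demo)

X_new = np.array([
    [50.0, 0.02],   # typical reading
    [50.0, 0.08],   # typical temperature, extreme vibration
])
predictions = detector.predict(X_new)  # 1 = inlier, -1 = anomaly
print(predictions)
```

Without the scaler, the small-valued vibration feature would contribute almost nothing to the kernel distance, and the second reading could pass as normal.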

Visualizing Outliers

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for idx, col in enumerate(numeric_cols[:6]):
    ax = axes[idx // 3][idx % 3]

    # Box plot
    ax.boxplot(df[col].dropna())
    ax.set_title(f'{col}')

    # Identify outliers with IQR
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[col] < q1 - 1.5*iqr) | (df[col] > q3 + 1.5*iqr)][col]
    ax.set_xlabel(f'{len(outliers)} outliers')

plt.tight_layout()
plt.show()

Handling Detected Outliers

# Option 1: Remove outliers (only if you're confident they're errors)
X_clean = X[~outlier_mask]
y_clean = y[~outlier_mask]

# Option 2: Cap (Winsorize) — replace with boundary values
from scipy.stats.mstats import winsorize

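# Cap at 5th and 95th percentiles
# (winsorize returns a NumPy masked array; handle NaN values beforehand)
df['capped_col'] = winsorize(df['col'], limits=[0.05, 0.05])

# Option 3: Transform (log/sqrt compresses large values and reduces their leverage)
df['log_income'] = np.log1p(df['income'])  # log1p(x) = log(1 + x): defined at 0, requires x > -1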

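# Option 4: Use models that are less sensitive to outliers
# → Tree-based models (Random Forest, Gradient Boosting) are largely unaffected by extreme feature values
# → For linear models, RobustScaler only makes the scaling robust; pair it with a robust loss (e.g. HuberRegressor)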
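
As one illustration of Option 4, the sketch below compares ordinary least squares with scikit-learn's `HuberRegressor` on synthetic data containing a single corrupted target. The data and the choice of `HuberRegressor` are assumptions made for this example, not a recommendation for every problem.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x + 1 plus small noise, with one corrupted target
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[-1] += 200.0  # a single gross error at the last point

X_demo = x.reshape(-1, 1)
ols = LinearRegression().fit(X_demo, y)
huber = HuberRegressor().fit(X_demo, y)

print("true slope:          2.00")
print(f"least squares slope: {ols.coef_[0]:.2f}")
print(f"Huber slope:         {huber.coef_[0]:.2f}")
```

Squared loss lets one bad point drag the fitted line toward it. The Huber loss grows only linearly for large residuals, so the same point has far less pull on the slope.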

Decision Framework

Outlier detected →  Is it a data entry or measurement error?
                        Yes → Correct or remove
                        No  → It is rare but real. Should the model generalize to it?
                                  Yes → Keep it (contains signal)
                                  No  → Remove from training, flag in deployment

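Never remove outliers from a test set: your model will see them in production, and a cleaned test set overstates performance. Fit any detection thresholds on the training data only, then apply them unchanged to validation and test data. The goal is a model that handles outliers gracefully, not one that only works on clean data.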
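
One way to keep test-set outliers while still handling them is to learn bounds on the training split and then only flag or cap values at prediction time. A minimal sketch, where `fit_iqr_bounds` and the toy splits are hypothetical:

```python
import pandas as pd

def fit_iqr_bounds(train_series, factor=1.5):
    """Learn IQR fences from training data only."""
    q1 = train_series.quantile(0.25)
    q3 = train_series.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Hypothetical splits of a single feature
train = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 12.0, 13.0, 12.0])
test = pd.Series([11.0, 13.0, 250.0, 12.0])

lower, upper = fit_iqr_bounds(train)  # 9.0 and 15.0 for this training split

# Keep every test row: add a flag and a capped copy instead of dropping anything
test_frame = pd.DataFrame({
    "value": test,
    "is_outlier": (test < lower) | (test > upper),
    "value_capped": test.clip(lower=lower, upper=upper),
})
print(test_frame)
```

All four test rows survive. The extreme value is flagged and capped at the training-derived upper bound, so evaluation still reflects how the pipeline behaves on the kind of input it will meet in production.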