Outlier Detection: Identifying and Handling Anomalous Data Points

Learn outlier detection methods — IQR, Z-score, Isolation Forest, Local Outlier Factor, DBSCAN, and how to decide whether to remove, cap, or keep outliers in ML.

Outlier Detection

Outliers are data points that differ significantly from other observations. They can be errors (miscoded measurements), rare but valid events (fraud), or important signals (equipment failure). Before removing outliers, understand what they represent — removing valid outliers destroys real information.


Statistical Methods

IQR (Interquartile Range)

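Flags values that fall more than 1.5× the IQR below Q1 or above Q3. Because quartiles barely move when a few extreme values are present, the fences are robust to the outliers themselves. On strongly skewed data the symmetric fences can flag much of the long tail, so check the distribution before trusting the counts: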

import pandas as pd
import numpy as np

def iqr_outliers(series, factor=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    return (series < lower) | (series > upper)

# Apply to each numeric column (assumes df is an existing DataFrame)
numeric_cols = df.select_dtypes(include=np.number).columns
for col in numeric_cols:
    mask = iqr_outliers(df[col])
    print(f"{col}: {mask.sum()} outliers ({mask.mean()*100:.1f}%)")
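
To make the fence arithmetic concrete, here is a minimal sketch on a small hypothetical series (the values are invented for illustration). It computes the quartiles, the IQR, and both fences by hand, then lists the flagged values.

```python
import pandas as pd

# Hypothetical toy data: nine typical values and one extreme one
values = pd.Series([10, 12, 12, 13, 13, 14, 14, 15, 16, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

flagged = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower}, {upper}]")
print(f"flagged: {flagged.tolist()}")
```

With pandas' default linear interpolation the fences come out at 8.5 and 18.5, so only 95 is flagged. The smallest value, 10, stays inside the lower fence.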

Z-Score

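Flags points more than N standard deviations from the mean. The method is itself sensitive to outliers: extreme values pull the mean toward them and inflate the standard deviation, which can hide the very points you are looking for: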

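from scipy import stats

# X_numeric: numeric columns only, with no missing values (NaNs propagate through zscore)
z_scores = np.abs(stats.zscore(X_numeric))
outlier_mask = (z_scores > 3).any(axis=1)  # A row is flagged if any feature exceeds 3
print(f"Outliers (Z>3): {outlier_mask.sum()} rows")

# More robust: modified Z-score using the median absolute deviation (MAD)
def modified_zscore(series):
    median = series.median()
    mad = (series - median).abs().median()
    # 0.6745 rescales MAD to be comparable to the standard deviation for normal data;
    # the small constant avoids division by zero when MAD is 0
    return 0.6745 * (series - median) / (mad + 1e-8)

mz = modified_zscore(df['column']).abs()
outliers = mz > 3.5  # Commonly used cutoff for the modified Z-score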
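
The inflation problem is easy to reproduce. The sketch below uses a hypothetical ten-value sample (invented for illustration) with two extreme readings and compares the classic Z-score against the modified Z-score on the same data.

```python
import numpy as np

# Hypothetical toy sample: eight typical readings plus two extreme ones
x = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 13.0, 14.0, 15.0, 60.0, 65.0])

# Classic Z-score: the mean and std are both pulled toward the extremes
z = np.abs((x - x.mean()) / x.std())

# Modified Z-score: the median and MAD barely move
median = np.median(x)
mad = np.median(np.abs(x - median))
mz = np.abs(0.6745 * (x - median) / mad)

print("classic  |z| > 3  :", x[z > 3])
print("modified |z| > 3.5:", x[mz > 3.5])
```

Here the two extreme readings raise the standard deviation to about 20, so neither of them reaches a Z-score of 3 and the classic rule flags nothing. The median (13) and MAD (1.5) are unaffected, and the modified score flags both 60 and 65.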

Machine Learning Methods

Isolation Forest

Anomalies are isolated by random splits — outliers require fewer splits to isolate:

from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # Expected fraction of outliers (0.05 = 5%)
    random_state=42
)
labels = iso_forest.fit_predict(X)  # -1 = outlier, 1 = inlier
scores = iso_forest.score_samples(X)  # More negative = more anomalous
outlier_mask = labels == -1
print(f"Detected {outlier_mask.sum()} outliers ({outlier_mask.mean()*100:.1f}%)")
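
The following self-contained sketch runs the same estimator on synthetic data so the behavior can be checked end to end. The data, the planted points, and the names `X_demo` and `planted` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical data: 200 points around the origin plus 5 planted far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]])
X_demo = np.vstack([inliers, planted])

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = forest.fit_predict(X_demo)      # -1 = outlier, 1 = inlier
scores = forest.score_samples(X_demo)    # lower = more anomalous

# Rank rows from most to least anomalous
ranking = np.argsort(scores)
print("planted rows flagged:", int((labels[200:] == -1).sum()), "of 5")
print("total rows flagged:", int((labels == -1).sum()))
```

Note that `contamination` fixes the share of rows that get flagged, so about 5% of the rows are labelled -1 here even though only five points were planted. When the true outlier rate is unknown, ranking by `score_samples` and inspecting the top of the list is often more informative than the hard labels.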

Local Outlier Factor (LOF)

Compares local density of each point to its neighbors. Good for datasets where outliers are local — normal in one region, anomalous in another:

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.05,
    metric='euclidean'
)
labels = lof.fit_predict(X)  # -1 = outlier
lof_scores = lof.negative_outlier_factor_  # More negative = more anomalous
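
A small synthetic case illustrates what "local" means. In this sketch (all data and names are hypothetical) a probe point sits close to a very tight cluster in absolute terms, yet it is far away relative to that cluster's spacing, while a second, much looser cluster is spread over a wider area.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical data: one tight cluster, one loose cluster, one point in between
tight = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
loose = rng.normal(loc=[10.0, 10.0], scale=2.0, size=(100, 2))
probe = np.array([[1.5, 1.5]])   # near the tight cluster, but far by its standards
X_demo = np.vstack([tight, loose, probe])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X_demo)
scores = lof.negative_outlier_factor_

print("probe label:", labels[-1])
print("probe score:", scores[-1])
print("median score in loose cluster:", np.median(scores[100:200]))
```

Points in the loose cluster are often several units from their neighbors and still score close to -1, because their neighbors are equally spread out. The probe is only about two units from the tight cluster, but its neighbors are packed far more densely than it is, so it receives by far the most negative score.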

One-Class SVM

Learns a boundary around normal data, flags everything outside:

from sklearn.svm import OneClassSVM

# nu is an upper bound on the fraction of training points treated as outliers
oc_svm = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale')
labels = oc_svm.fit_predict(X_train)  # Fit on normal data only
test_labels = oc_svm.predict(X_test)  # -1 = anomaly, 1 = normal
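
Because the RBF kernel works on Euclidean distances, features on larger scales dominate the boundary unless the data is standardized first. The sketch below is one way to handle that, assuming a scikit-learn pipeline is acceptable in your setup. The two features and their scales are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical "normal" training data with features on very different scales
X_train_demo = np.column_stack([
    rng.normal(50.0, 5.0, size=300),       # e.g. a temperature-like feature
    rng.normal(0.002, 0.0005, size=300),   # e.g. a small vibration-like feature
])

# Scale first, then learn the boundary on the scaled data
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(nu=0.05, kernel="rbf", gamma="scale"),
)
model.fit(X_train_demo)

X_new = np.array([
    [50.0, 0.002],   # typical on both features
    [50.0, 0.010],   # typical first feature, extreme second feature
])
pred = model.predict(X_new)   # 1 = inside the boundary, -1 = outside
print(pred)
```

The scaler is fitted on the training data only and reused at prediction time, so new points are measured against the same notion of "normal" that the boundary was learned on.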

Visualizing Outliers

import matplotlib.pyplot as plt

# One box plot per numeric column (first six), annotated with the IQR outlier count
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for idx, col in enumerate(numeric_cols[:6]):
    ax = axes[idx // 3][idx % 3]
    # Box plot
    ax.boxplot(df[col].dropna())
    ax.set_title(f'{col}')
    # Identify outliers with IQR
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[col] < q1 - 1.5*iqr) | (df[col] > q3 + 1.5*iqr)][col]
    ax.set_xlabel(f'{len(outliers)} outliers')
plt.tight_layout()
plt.show()

Handling Detected Outliers

# Option 1: Remove outliers (only if you're confident they're errors)
X_clean = X[~outlier_mask]
y_clean = y[~outlier_mask]

# Option 2: Cap (Winsorize): replace extreme values with boundary values
from scipy.stats.mstats import winsorize
# Cap the lowest 5% and highest 5% of values at the 5th and 95th percentiles
df['capped_col'] = np.asarray(winsorize(df['col'].to_numpy(), limits=[0.05, 0.05]))

# Option 3: Transform: log/sqrt compresses large values and reduces their impact
df['log_income'] = np.log1p(df['income'])  # log1p handles 0 values

# Option 4: Use models that are less sensitive to outliers
# → Random Forest, Gradient Boosting (splits depend on value order, not magnitude),
#   or RobustScaler before a linear model such as Ridge
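
When capping is used inside a modeling workflow, the bounds are usually learned from the training split and then reused on new data. The sketch below shows one way to do that with `Series.clip`, as an alternative to `winsorize`. The `income` series and the 80/20 split are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical skewed feature, split into train and test parts
income = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=1000))
train, test = income.iloc[:800], income.iloc[800:]

# Learn the caps from the training split only
lower, upper = train.quantile([0.05, 0.95])

train_capped = train.clip(lower=lower, upper=upper)
test_capped = test.clip(lower=lower, upper=upper)   # same bounds, no refitting

print(f"caps: [{lower:.0f}, {upper:.0f}]")
print(f"test max before: {test.max():.0f}, after: {test_capped.max():.0f}")
```

No rows are dropped in either split. Extreme values are pulled to the nearest cap, and the test data never influences where the caps sit.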

Decision Framework

Outlier detected → Is it a data entry error?
    Yes → Correct or remove
    No → Is it rare but real?
        Yes → Keep it (contains signal)
        Rare, and the model shouldn't generalize to it?
            → Remove from training, flag in deployment

Never remove outliers in a test set — your model will see them in production. The goal is a model that handles outliers gracefully, not one that only works on clean data.
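
One way to follow this in practice is to fit the detector on training data and use it only to flag test or production rows, not to drop them. The sketch below assumes an Isolation Forest as the detector. The data, column names, and the `is_outlier` flag are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical training data, and a test set containing two extreme rows
X_train_demo = rng.normal(size=(500, 3))
X_test_demo = np.vstack([
    rng.normal(size=(98, 3)),
    [[9.0, 9.0, 9.0], [-9.0, 9.0, -9.0]],
])

# The detector only ever sees training data during fitting
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_train_demo)

test_report = pd.DataFrame(X_test_demo, columns=["f0", "f1", "f2"])
test_report["is_outlier"] = detector.predict(X_test_demo) == -1

# Every test row is kept; the flag only lets metrics be split by group
n_flagged = int(test_report["is_outlier"].sum())
print(len(test_report), "rows kept,", n_flagged, "flagged")
```

With the flag in place, evaluation metrics can be reported for all rows and separately for flagged rows, which shows how the model behaves on the unusual inputs it will meet in production.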