Imbalanced Datasets: Handling Class Imbalance in Machine Learning

Master imbalanced dataset techniques — class weights, SMOTE oversampling, undersampling, threshold tuning, proper evaluation metrics, and strategies for fraud and medical ML.

Most real-world classification problems are imbalanced: fraud might be 0.1% of transactions, defective products 2% of production, and cancer is rare in screening datasets. In the fraud case, a model that predicts the majority class for every input achieves 99.9% accuracy and is completely useless, because it never flags a single fraudulent transaction.


Why Accuracy Fails

Dataset: 9,900 normal, 100 fraud (1% fraud rate)
Naive model: predict "normal" for everything
Accuracy: 9900 / 10000 = 99%
But: Recall for fraud = 0/100 = 0% (caught zero fraud cases)
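Precision for fraud = 0 / 0, undefined (the model never predicts fraud)
Accuracy on its own is misleading for imbalanced datasets, because the majority class dominates it.
Use instead: F1 score, PR-AUC, or cost-sensitive metrics.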
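The arithmetic above can be checked by hand. The following is a minimal sketch in plain Python using the same 9,900 / 100 split; the variable names are illustrative and not part of any library.

```python
# Labels from the example: 0 = normal, 1 = fraud
y_true = [0] * 9900 + [1] * 100
# Naive model: always predict "normal"
y_pred = [0] * len(y_true)

# Confusion-matrix counts for the fraud (positive) class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision is 0/0 here because the model never predicts fraud
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print(f"Accuracy: {accuracy:.2%}")
print(f"Fraud recall: {recall:.2%}")
print(f"Fraud precision: {precision}")
```

Accuracy comes out at 99% while fraud recall is 0%, which is the gap the example describes.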

Using Class Weights

Class weighting is the simplest and often the most effective fix. It increases the loss penalty for errors on the minority class:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
# Compute balanced weights: n_samples / (n_classes * samples in that class)
classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
# Map each label to its weight (also works when labels are not 0..n-1)
weight_dict = dict(zip(classes, class_weights))
print(f"Class weights: {weight_dict}")
# e.g., roughly {0: 0.505, 1: 50.0} for 9,900 vs 100 samples: the minority class gets about 99× more weight
# Apply to models
log_reg = LogisticRegression(class_weight='balanced')
rf = RandomForestClassifier(class_weight='balanced', n_estimators=100)
# For XGBoost, use scale_pos_weight (assumes binary 0/1 labels)
import xgboost as xgb
neg_count, pos_count = np.bincount(y_train)
model = xgb.XGBClassifier(scale_pos_weight=neg_count / pos_count)
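To make the 'balanced' heuristic concrete, here is a small sketch that reproduces the formula n_samples / (n_classes * class_count) in plain Python. The helper name balanced_class_weights is hypothetical; it is meant to mirror what compute_class_weight('balanced', ...) computes, not to replace it.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight for class c = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# Same split as the accuracy example: 9,900 normal (0), 100 fraud (1)
y_example = [0] * 9900 + [1] * 100
weights = balanced_class_weights(y_example)
print(weights)
print(f"Minority/majority weight ratio: {weights[1] / weights[0]:.0f}x")
```

With this split the minority weight is 50.0 and the majority weight is about 0.505, so the weight ratio equals the class-count ratio of 99.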

Oversampling with SMOTE

SMOTE creates synthetic minority-class samples by interpolating between existing minority samples and their nearest minority-class neighbors:

from imblearn.over_sampling import SMOTE, BorderlineSMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Basic SMOTE
smote = SMOTE(sampling_strategy=0.1,  # Oversample minority to 10% of majority
              k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Borderline SMOTE: focuses on samples near the decision boundary
bsmote = BorderlineSMOTE(kind='borderline-1', random_state=42)
# Imbalanced-learn Pipeline: resamples only the training folds during CV,
# so synthetic samples never end up in a validation fold
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', GradientBoostingClassifier())
])
scores = cross_val_score(pipeline, X_train, y_train, cv=StratifiedKFold(5), scoring='roc_auc')
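The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration in plain Python, not the imbalanced-learn implementation: the helper names and toy points are made up, and details such as how many synthetic samples each point receives are omitted.

```python
import random

def nearest_neighbors(point, candidates, k):
    """k nearest candidates by squared Euclidean distance, excluding the point itself."""
    others = [c for c in candidates if c is not point]
    others.sort(key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))
    return others[:k]

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]  # toy minority samples

synthetic = []
for _ in range(6):
    base = rng.choice(minority)
    neighbor = rng.choice(nearest_neighbors(base, minority, k=2))
    synthetic.append(smote_like_sample(base, neighbor, rng))

for s in synthetic:
    print([round(v, 3) for v in s])
```

Every synthetic point lies on a line segment between two real minority samples, which is why SMOTE can blur the class boundary when minority points sit inside majority regions.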

Undersampling

Undersampling reduces the majority class to bring the class ratio closer to balance:

from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
# Random undersampling: drop majority samples until minority/majority = 0.5
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
# Tomek Links: remove majority samples that form a nearest-neighbor pair with a minority sample
tomek = TomekLinks()
X_clean, y_clean = tomek.fit_resample(X_train, y_train)
# Combined: oversample the minority with SMOTE, then clean up with Tomek Links
smt = SMOTETomek(random_state=42)
X_combined, y_combined = smt.fit_resample(X_train, y_train)
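For intuition about what a float sampling_strategy means in random undersampling, here is a plain-Python sketch. The function random_undersample and the toy data are hypothetical; the sketch assumes the ratio is defined as minority count divided by majority count after resampling.

```python
import random

def random_undersample(X, y, ratio, minority_label=1, seed=42):
    """Keep all minority rows; sample majority rows so that
    n_minority / n_majority is approximately `ratio`."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    n_majority_keep = min(len(majority_idx), int(len(minority_idx) / ratio))
    kept = minority_idx + rng.sample(majority_idx, n_majority_keep)
    kept.sort()  # preserve the original row order
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy data: 1,000 rows, 5% minority
y_toy = [1] * 50 + [0] * 950
X_toy = [[float(i)] for i in range(len(y_toy))]

X_small, y_small = random_undersample(X_toy, y_toy, ratio=0.5)
print(f"Before: {y_toy.count(0)} majority, {y_toy.count(1)} minority")
print(f"After:  {y_small.count(0)} majority, {y_small.count(1)} minority")
```

With ratio=0.5, the 50 minority rows are kept and the majority is cut from 950 to 100 rows. The cost is visible in the numbers: 850 majority rows are discarded, which is why undersampling tends to suit large datasets.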

Threshold Tuning

Instead of changing the data, change the decision threshold:

from sklearn.metrics import precision_recall_curve, f1_score
import numpy as np
# Get probability scores from a fitted model on a validation set (not the test set)
y_prob = model.predict_proba(X_val)[:, 1]
# Find threshold maximizing F1
precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)
# precisions and recalls have one more entry than thresholds; drop the final (1, 0) point
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-8)
best_threshold = thresholds[f1_scores.argmax()]
print(f"Default threshold (0.5): F1 = {f1_score(y_val, (y_prob >= 0.5).astype(int)):.4f}")
print(f"Optimal threshold ({best_threshold:.3f}): F1 = {f1_scores.max():.4f}")
# Apply custom threshold in production
y_pred = (y_prob >= best_threshold).astype(int)
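To see why moving the threshold helps, the sweep can be written out without any library. This is a toy sketch: the scores and labels are invented for illustration, and precision_recall_f1 is a hypothetical helper, not an sklearn function.

```python
def precision_recall_f1(y_true, y_score, threshold):
    """Precision, recall, and F1 for the positive class at a given threshold."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical validation scores: on imbalanced data, positives often
# receive probabilities well below 0.5
y_val_toy = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
y_score_toy = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.22, 0.25, 0.30, 0.35, 0.40, 0.70]

# Try every observed score as a candidate threshold and keep the best F1
candidates = sorted(set(y_score_toy))
best_t = max(candidates, key=lambda t: precision_recall_f1(y_val_toy, y_score_toy, t)[2])

for t in (0.5, best_t):
    p, r, f1 = precision_recall_f1(y_val_toy, y_score_toy, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

On this toy data the default threshold of 0.5 catches one of four positives (F1 of 0.40), while the F1-optimal threshold of 0.25 catches all four at the cost of one false positive.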

Evaluation Metrics for Imbalanced Data

from sklearn.metrics import (classification_report, roc_auc_score,
average_precision_score, balanced_accuracy_score)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Never use accuracy alone
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))
# Better metrics for imbalanced data
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"PR-AUC: {average_precision_score(y_test, y_prob):.4f}") # Better than ROC for imbalanced
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}") # Average recall per class
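Balanced accuracy is the unweighted mean of per-class recall, which is easy to verify by hand. The sketch below uses plain Python and made-up predictions; the helper names are illustrative.

```python
def per_class_recall(y_true, y_pred):
    """Recall for each class label present in y_true."""
    recalls = {}
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls[cls] = correct / total
    return recalls

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical test set: 990 normal, 10 fraud. The model raises 20 false
# alarms and catches 6 of the 10 fraud cases.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 970 + [1] * 20 + [1] * 6 + [0] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy:          {accuracy:.4f}")
print(f"Per-class recall:  {per_class_recall(y_true, y_pred)}")
print(f"Balanced accuracy: {balanced_accuracy(y_true, y_pred):.4f}")
```

Plain accuracy is 97.6% here, but balanced accuracy is about 79% because the 60% fraud recall counts as much as the majority-class recall.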

Strategy Selection Guide

Imbalance Ratio | Primary Strategy
Mild (1:3 to 1:10) | Class weights (simplest, often sufficient)
Moderate (1:10 to 1:100) | Class weights + threshold tuning
Severe (1:100 to 1:1000) | SMOTE + class weights + threshold tuning
Extreme (1:1000+) | Anomaly detection approaches (Isolation Forest, One-Class SVM)
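The guide can be encoded as a small lookup, which is handy as a first check on a new dataset. This is only a sketch: the function name suggest_strategy is hypothetical, and both the handling of ratios below 1:3 and the assignment of exact boundary values (such as 1:10) to the milder bucket are assumptions, since the guide does not specify them.

```python
def suggest_strategy(n_majority, n_minority):
    """Map a majority:minority count ratio to the guide's suggested strategy."""
    if n_minority <= 0:
        raise ValueError("need at least one minority sample")
    ratio = n_majority / n_minority
    if ratio < 3:
        # Assumption: the guide starts at 1:3, so nothing is suggested below it
        return "Roughly balanced: no special handling suggested by the guide"
    if ratio <= 10:
        return "Class weights"
    if ratio <= 100:
        return "Class weights + threshold tuning"
    if ratio <= 1000:
        return "SMOTE + class weights + threshold tuning"
    return "Anomaly detection approaches (Isolation Forest, One-Class SVM)"

for maj, mino in [(900, 100), (9900, 100), (99_900, 100), (999_900, 100)]:
    print(f"1:{maj // mino:<5} -> {suggest_strategy(maj, mino)}")
```

The ratio is a starting heuristic only; the absolute number of minority samples and the cost of each error type matter as well.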

The safest starting point is class_weight='balanced' in sklearn models. It is free (no data modification), avoids the leakage risk that comes with resampling before a train/validation split, and works well with cross-validation.