Imbalanced Datasets

Many real-world classification problems are imbalanced: fraud is 0.1% of transactions, defective products are 2% of production, cancer is rare in screening datasets. A model that predicts the majority class for every input achieves 99.9% accuracy on the fraud example and is completely useless, because it never flags a single fraudulent transaction.

Why Accuracy Fails

Dataset: 9,900 normal, 100 fraud (1% fraud rate)

Naive model: predict "normal" for everything
  Accuracy: 9900 / 10000 = 99%

But: Recall for fraud = 0/100 = 0%  (caught zero fraud cases)
     Precision for fraud = 0/0 = undefined  (no fraud predictions were made)

Accuracy is misleading on imbalanced datasets because the majority class dominates it.
Use instead: F1 score, PR-AUC, or cost-sensitive metrics
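
The arithmetic above can be checked in a few lines. This is a minimal sketch on synthetic labels built to match the 9,900 / 100 example; the arrays are illustrative and not taken from any real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels matching the example: 9,900 normal (0) and 100 fraud (1)
y_true = np.array([0] * 9900 + [1] * 100)

# Naive model: predict "normal" for every transaction
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# Precision is 0/0 here; zero_division=0 reports it as 0.0 without a warning
precision = precision_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {accuracy:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"Precision: {precision:.2%}")
```

Note that sklearn reports the undefined precision as 0.0 only because zero_division=0 was passed; the underlying ratio is still 0/0.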

Using Class Weights

Class weights are the simplest and often the most effective fix. They increase the loss penalty for errors on the minority class:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Compute balanced weights: n_samples / (n_classes * class_count)
classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
weight_dict = dict(zip(classes, class_weights))  # map each label to its weight
print(f"Class weights: {weight_dict}")
# e.g., {0: 0.505, 1: 50.0} for 9,900 vs. 100 samples: the minority class gets 99× more weight

# Apply to models
log_reg = LogisticRegression(class_weight='balanced')
rf = RandomForestClassifier(class_weight='balanced', n_estimators=100)

# For XGBoost, use scale_pos_weight (ratio of negative to positive samples)
import xgboost as xgb
neg_count, pos_count = np.bincount(y_train)  # assumes binary labels encoded as 0/1
model = xgb.XGBClassifier(scale_pos_weight=neg_count/pos_count)
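
To make the 'balanced' heuristic concrete, the weights can be computed by hand. The sketch below uses illustrative labels (9,900 negatives, 100 positives) and the formula from the scikit-learn documentation, n_samples / (n_classes * class_count); it also shows the ratio that would be passed as scale_pos_weight.

```python
import numpy as np

# Illustrative labels: 9,900 negatives (0) and 100 positives (1)
y_train = np.array([0] * 9900 + [1] * 100)

counts = np.bincount(y_train)  # samples per class
n_samples, n_classes = len(y_train), len(counts)

# 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = n_samples / (n_classes * counts)
weight_dict = dict(zip(range(n_classes), weights))

# Negative-to-positive ratio, as used for XGBoost's scale_pos_weight
scale_pos_weight = counts[0] / counts[1]

print(weight_dict)
print(scale_pos_weight)
```

Both approaches encode the same 99:1 ratio: the per-class weights differ by a factor of 99, and scale_pos_weight is 99.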

Oversampling with SMOTE

SMOTE creates synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors:

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Basic SMOTE
smote = SMOTE(sampling_strategy=0.1,  # Oversample minority to 10% of majority
              k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Borderline SMOTE: focuses on samples near the decision boundary
bsmote = BorderlineSMOTE(kind='borderline-1', random_state=42)

# Imbalanced-learn Pipeline: SMOTE is fit on the training folds only during CV,
# so no synthetic samples leak into the validation folds
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', GradientBoostingClassifier())
])
scores = cross_val_score(pipeline, X_train, y_train, cv=StratifiedKFold(5), scoring='roc_auc')
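
The interpolation step itself is small enough to sketch in plain NumPy. The following is a simplified illustration of the idea, not imbalanced-learn's implementation: the data points, the value of k, and the helper name smote_like_sample are all made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in 2-D (illustrative data only)
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9], [1.8, 1.6]])
k = 2  # number of minority-class neighbors to consider


def smote_like_sample(X_min, k, rng):
    """Create one synthetic point between a random minority sample and a neighbor."""
    i = rng.integers(len(X_min))
    # Distances from sample i to every minority sample
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    dists[i] = np.inf                    # exclude the sample itself
    neighbors = np.argsort(dists)[:k]    # indices of the k nearest neighbors
    j = rng.choice(neighbors)
    lam = rng.random()                   # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])


synthetic = np.array([smote_like_sample(X_min, k, rng) for _ in range(10)])
print(synthetic)
```

Every synthetic point lies on the line segment between two real minority samples, which is why SMOTE can blur the class boundary when minority points sit inside majority regions.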

Undersampling

Undersampling removes majority-class samples to bring the class ratio closer to balance:

from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours

# Random undersampling: drop majority samples until minority/majority = 0.5
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Tomek Links: remove majority samples that form a nearest-neighbor pair with a minority sample
tomek = TomekLinks()
X_clean, y_clean = tomek.fit_resample(X_train, y_train)

# Combined: oversample minority + undersample majority
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=42)
X_combined, y_combined = smt.fit_resample(X_train, y_train)
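
To show what a float sampling_strategy means in practice, here is a NumPy-only sketch of random undersampling. It is an illustration under assumed data (9,900 majority and 100 minority labels, random placeholder features), not the library's code.

```python
import numpy as np

rng = np.random.default_rng(42)

y = np.array([0] * 9900 + [1] * 100)   # illustrative labels
X = rng.normal(size=(len(y), 3))       # placeholder features

ratio = 0.5                            # target minority / majority ratio
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep every minority sample; keep just enough majority samples to hit the ratio
n_majority_keep = int(len(minority_idx) / ratio)
kept_majority = rng.choice(majority_idx, size=n_majority_keep, replace=False)

keep = np.sort(np.concatenate([minority_idx, kept_majority]))
X_res, y_res = X[keep], y[keep]

print(np.bincount(y_res))
```

With a ratio of 0.5, the 100 minority samples are paired with 200 majority samples, so 9,700 majority rows are discarded. That loss of data is the main cost of undersampling.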

Threshold Tuning

Instead of changing the data, change the decision threshold:

from sklearn.metrics import precision_recall_curve, f1_score
import numpy as np

# Get probability scores from model
y_prob = model.predict_proba(X_val)[:, 1]

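# Find threshold maximizing F1 on the validation set.
# precision_recall_curve returns one more precision/recall value than thresholds,
# so drop the final (precision=1, recall=0) point before indexing.
precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-8)
best_threshold = thresholds[f1_scores.argmax()]

print(f"Default threshold (0.5): F1 = {f1_score(y_val, y_prob >= 0.5):.4f}")
print(f"Optimal threshold ({best_threshold:.3f}): F1 = {f1_scores.max():.4f}")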

# Apply custom threshold in production
y_pred = (y_prob >= best_threshold).astype(int)
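
A toy example makes the effect visible. The labels and scores below are invented for illustration; the point is that a model whose positive scores mostly fall below 0.5 can still rank well, and lowering the threshold recovers the missed positives. The slicing with [:-1] accounts for precision_recall_curve returning one more precision/recall value than thresholds.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Illustrative validation labels and scores (made up for this sketch)
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.48, 0.90])

precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)

# Drop the final (precision=1, recall=0) point, which has no matching threshold
p, r = precisions[:-1], recalls[:-1]
f1 = 2 * p * r / (p + r + 1e-8)
best_threshold = thresholds[f1.argmax()]

f1_default = f1_score(y_val, y_prob >= 0.5)
f1_tuned = f1_score(y_val, y_prob >= best_threshold)

print(f"Best threshold: {best_threshold:.2f}")
print(f"F1 at 0.5:      {f1_default:.3f}")
print(f"F1 at best:     {f1_tuned:.3f}")
```

Because the threshold is chosen on the validation set, the tuned F1 is an optimistic estimate; a held-out test set gives the honest number.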

Evaluation Metrics for Imbalanced Data

from sklearn.metrics import (classification_report, roc_auc_score,
                               average_precision_score, balanced_accuracy_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Never use accuracy alone
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))

# Better metrics for imbalanced data
print(f"ROC-AUC:           {roc_auc_score(y_test, y_prob):.4f}")
print(f"PR-AUC:            {average_precision_score(y_test, y_prob):.4f}")  # More informative than ROC-AUC when positives are rare
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")   # Mean of per-class recall
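
It helps to know what these metrics report for a model with no skill. The sketch below scores the always-"normal" baseline on illustrative labels (1% positive) with a constant probability for every sample; the data is synthetic.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             recall_score, roc_auc_score)

y_test = np.array([0] * 9900 + [1] * 100)  # illustrative labels, 1% positive
y_pred = np.zeros_like(y_test)             # always predicts "normal"
y_prob = np.full(len(y_test), 0.01)        # same score for every sample

# Balanced accuracy is the mean of the per-class recalls
per_class_recall = recall_score(y_test, y_pred, average=None)
bal_acc = balanced_accuracy_score(y_test, y_pred)

roc_auc = roc_auc_score(y_test, y_prob)
pr_auc = average_precision_score(y_test, y_prob)

print(f"Per-class recall:  {per_class_recall}")
print(f"Balanced Accuracy: {bal_acc:.4f}")
print(f"ROC-AUC:           {roc_auc:.4f}")
print(f"PR-AUC:            {pr_auc:.4f}")
```

The no-skill baseline for ROC-AUC and balanced accuracy is 0.5 regardless of imbalance, while the no-skill PR-AUC equals the positive rate (0.01 here). A PR-AUC of 0.30 on this data is therefore a large improvement, even though the number looks low.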

Strategy Selection Guide

Imbalance Ratio	Primary Strategy
Mild (1:3 to 1:10)	Class weights — simplest, often sufficient
Moderate (1:10 to 1:100)	Class weights + threshold tuning
Severe (1:100 to 1:1000)	SMOTE + class weights + threshold tuning
Extreme (1:1000+)	Anomaly detection approaches (Isolation Forest, One-Class SVM)

The safest starting point is class_weight='balanced' in sklearn models. It is free (no data modification), avoids the leakage risk that comes with resampling before a train/validation split, and works well with cross-validation.