Imbalanced Datasets
Most real-world classification problems are imbalanced: fraud is 0.1% of transactions, defective products are 2% of production, cancer is rare in screening datasets. A model that predicts the majority class for every input achieves 99.9% accuracy on the first example — and is completely useless.
Why Accuracy Fails
    Dataset: 9,900 normal, 100 fraud (1% fraud rate)
    Naive model: predict "normal" for everything
    Accuracy: 9900 / 10000 = 99%

    But:
    Recall for fraud = 0/100 = 0% (caught zero fraud cases)
    Precision for fraud = undefined
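The arithmetic above can be reproduced in a few lines. This is a minimal sketch that rebuilds the 9,900/100 label vector from the example and scores an always-"normal" predictor by hand with NumPy. It also reports balanced accuracy (the mean of per-class recall), which exposes the failure that plain accuracy hides. All variable names are illustrative.

```python
import numpy as np

# Labels from the example: 9,900 normal (0) and 100 fraud (1)
y_true = np.array([0] * 9900 + [1] * 100)

# Naive model: predict "normal" for every transaction
y_pred = np.zeros_like(y_true)

# Confusion-matrix counts, treating fraud as the positive class
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # fraud correctly flagged
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # fraud missed
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # normal correctly passed
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # normal wrongly flagged

accuracy = (tp + tn) / len(y_true)
fraud_recall = tp / (tp + fn)
normal_recall = tn / (tn + fp)
balanced_accuracy = (fraud_recall + normal_recall) / 2

# Nothing was predicted as fraud, so precision is 0/0 (undefined)
fraud_precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")

print(f"Accuracy:          {accuracy:.2%}")
print(f"Fraud recall:      {fraud_recall:.2%}")
print(f"Fraud precision:   {fraud_precision}")
print(f"Balanced accuracy: {balanced_accuracy:.2%}")
```

Balanced accuracy comes out at 50%, the same as random guessing on a two-class problem, even though plain accuracy reads 99%.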
Accuracy is meaningless for imbalanced datasets. Use F1 score, PR-AUC, or cost-sensitive metrics instead.
Using Class Weights
Class weights are the simplest and often the most effective solution. They increase the loss penalty for errors on the minority class.
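To see where the weights come from, the sketch below applies the "balanced" heuristic by hand: each class gets `n_samples / (n_classes * class_count)`. It reuses the 9,900/100 split from the accuracy example. The variable names are illustrative.

```python
import numpy as np

# Same split as the accuracy example: 9,900 normal (0), 100 fraud (1)
y = np.array([0] * 9900 + [1] * 100)

classes, counts = np.unique(y, return_counts=True)

# "balanced" heuristic: n_samples / (n_classes * count of that class)
weights = len(y) / (len(classes) * counts)

weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
minority_to_majority = weights[1] / weights[0]

print(weight_by_class)
print(f"Minority errors are penalized about {minority_to_majority:.0f}x more")
```

The rarer the class, the larger its weight, so the ratio between the two weights equals the imbalance ratio itself (99 to 1 here).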
    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression

    # Compute balanced weights
    classes = np.unique(y_train)
    class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
    weight_dict = dict(zip(classes, class_weights))  # map each label to its weight
    print(f"Class weights: {weight_dict}")
    # e.g., {0: 0.505, 1: 50.0} for a 9,900/100 split: minority class gets ~99× more weight

    # Apply to models
    log_reg = LogisticRegression(class_weight='balanced')
    rf = RandomForestClassifier(class_weight='balanced', n_estimators=100)

    # For XGBoost, use scale_pos_weight (assumes labels are 0 = negative, 1 = positive)
    import xgboost as xgb
    neg_count, pos_count = np.bincount(y_train)
    model = xgb.XGBClassifier(scale_pos_weight=neg_count / pos_count)

Oversampling with SMOTE
SMOTE creates synthetic minority-class samples by interpolating between existing minority samples and their nearest minority-class neighbors.
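The interpolation step can be sketched with NumPy alone. This is a simplified illustration of the idea, not imbalanced-learn's implementation: pick a minority sample, pick one of its `k` nearest minority neighbors, and place a new point at a random position on the line segment between them. The toy data and the helper name `smote_like_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny hypothetical minority class: 6 points in 2-D
X_min = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(6, 2))

def smote_like_sample(X, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbors."""
    i = rng.integers(len(X))
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(X_min, k=3, rng=rng) for _ in range(4)])
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority samples, it never falls outside the region the minority class already occupies. That is also why SMOTE can amplify noise when minority samples sit inside majority territory.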
    from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Basic SMOTE
    smote = SMOTE(
        sampling_strategy=0.1,  # Oversample minority to 10% of majority
        k_neighbors=5,
        random_state=42,
    )
    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

    # Borderline SMOTE: focuses on samples near the decision boundary
    bsmote = BorderlineSMOTE(kind='borderline-1', random_state=42)

    # Imbalanced-learn Pipeline (handles SMOTE correctly in CV):
    # resampling is applied to the training folds only, never to the validation fold
    pipeline = ImbPipeline([
        ('smote', SMOTE(random_state=42)),
        ('model', GradientBoostingClassifier()),
    ])
    scores = cross_val_score(pipeline, X_train, y_train,
                             cv=StratifiedKFold(5), scoring='roc_auc')

Undersampling
Undersampling reduces the majority class to bring the class ratio closer to balance.
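Random undersampling is simple enough to write by hand, which makes the trade-off visible. The sketch below assumes a target minority-to-majority ratio of 0.5 and uses placeholder random features. It keeps every minority row and a random subset of majority rows.

```python
import numpy as np

rng = np.random.default_rng(0)

y = np.array([0] * 9900 + [1] * 100)
X = rng.normal(size=(len(y), 3))  # placeholder features, for illustration only

target_ratio = 0.5  # desired minority / majority ratio after resampling

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep all minority rows; keep just enough majority rows to hit the ratio
n_majority_keep = int(len(minority_idx) / target_ratio)
kept_majority = rng.choice(majority_idx, size=n_majority_keep, replace=False)

keep = np.sort(np.concatenate([minority_idx, kept_majority]))
X_under, y_under = X[keep], y[keep]

print(np.bincount(y_under))  # [200 100]
```

Reaching a 0.5 ratio here means discarding 9,700 of the 9,900 majority rows. That cost is the main reason undersampling suits large datasets better than small ones.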
    from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours
    from imblearn.combine import SMOTETomek

    # Random undersampling
    rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
    X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

    # Tomek Links: remove majority samples that are borderline (close to minority)
    tomek = TomekLinks()
    X_clean, y_clean = tomek.fit_resample(X_train, y_train)

    # Combined: oversample minority + undersample majority
    smt = SMOTETomek(random_state=42)
    X_combined, y_combined = smt.fit_resample(X_train, y_train)

Threshold Tuning
Instead of changing the data, change the decision threshold applied to the model's predicted probabilities.
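A small hand-made example shows what moving the threshold does. The labels and probabilities below are hypothetical validation data, and `precision_recall_at` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical validation labels and predicted fraud probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.28, 0.35, 0.40, 0.45, 0.80])

def precision_recall_at(threshold):
    """Return (precision, recall) for the positive class at a given threshold."""
    y_pred = y_prob >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for t in (0.5, 0.3):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

At 0.5 only the highest-scoring case is flagged (precision 1.00, recall 0.33). Lowering the threshold to 0.3 catches all three positives at the cost of one false alarm (precision 0.75, recall 1.00). Imbalanced models often assign low probabilities to the minority class, so the default 0.5 cutoff tends to sit too high.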
    import numpy as np
    from sklearn.metrics import precision_recall_curve, f1_score

    # Get probability scores from model
    y_prob = model.predict_proba(X_val)[:, 1]

    # Find threshold maximizing F1
    precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)
    # precisions and recalls have one more entry than thresholds; drop the final
    # (precision=1, recall=0) point so the indices line up with thresholds
    precisions, recalls = precisions[:-1], recalls[:-1]
    f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-8)
    best_threshold = thresholds[f1_scores.argmax()]

    print(f"Default threshold (0.5): F1 = {f1_score(y_val, y_prob >= 0.5):.4f}")
    print(f"Optimal threshold ({best_threshold:.3f}): F1 = {f1_scores.max():.4f}")

    # Apply custom threshold in production
    y_pred = (y_prob >= best_threshold).astype(int)

Evaluation Metrics for Imbalanced Data
    from sklearn.metrics import (
        classification_report,
        roc_auc_score,
        average_precision_score,
        balanced_accuracy_score,
    )

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    # Never use accuracy alone
    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))

    # Better metrics for imbalanced data
    print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"PR-AUC: {average_precision_score(y_test, y_prob):.4f}")  # Better than ROC for imbalanced
    print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")  # Average recall per class

Strategy Selection Guide
| Imbalance Ratio | Primary Strategy |
|---|---|
| Mild (1:3 to 1:10) | Class weights — simplest, often sufficient |
| Moderate (1:10 to 1:100) | Class weights + threshold tuning |
| Severe (1:100 to 1:1000) | SMOTE + class weights + threshold tuning |
| Extreme (1:1000+) | Anomaly detection approaches (Isolation Forest, One-Class SVM) |
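The table can be encoded as a small helper for quick checks on a new dataset. This is a sketch only: `suggest_strategy` is a hypothetical function, and how it treats the shared boundaries (1:10, 1:100, 1:1000) and ratios below 1:3 is an assumption, since the table does not specify them.

```python
import numpy as np

def suggest_strategy(y):
    """Map the majority:minority ratio of labels y to a row of the guide above."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    ratio = counts.max() / counts.min()  # majority samples per minority sample

    if ratio < 3:
        # Assumption: the table does not cover ratios below 1:3
        return "No special handling suggested"
    if ratio <= 10:
        return "Class weights"
    if ratio <= 100:
        return "Class weights + threshold tuning"
    if ratio <= 1000:
        return "SMOTE + class weights + threshold tuning"
    return "Anomaly detection approaches"

# The 9,900 / 100 fraud example has a ratio of 99:1
print(suggest_strategy([0] * 9900 + [1] * 100))
```

Treat the result as a starting point to validate with PR-AUC, not as a rule.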
The safest starting point is class_weight='balanced' in sklearn models. It is free (the data is not modified), it avoids the leakage risk that resampling carries when applied before a train/validation split, and it works well with cross-validation.