ROC Curve and AUC
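The ROC (Receiver Operating Characteristic) curve shows how a binary classifier's true positive rate and false positive rate change as the classification threshold varies. The AUC (Area Under the Curve) summarizes the curve as a single number that does not depend on any particular threshold. Because TPR and FPR are each computed within one class, ROC AUC is largely insensitive to the class ratio, which is also why it can look optimistic on heavily imbalanced data.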
TPR and FPR
TPR (True Positive Rate) = Recall = TP / (TP + FN) → "What fraction of actual positives did we correctly identify?"
FPR (False Positive Rate) = FP / (FP + TN) → "What fraction of actual negatives did we incorrectly classify as positive?"
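As a minimal sketch of the two definitions above, the snippet below counts TP, FN, FP, and TN at a few thresholds. The helper `tpr_fpr` and the toy labels and scores are hypothetical and exist only for illustration; they are not part of scikit-learn.

```python
def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = fn = fp = tn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy data: 4 positives followed by 4 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

for t in (0.75, 0.5, 0.25):
    tpr, fpr = tpr_fpr(y_true, y_score, t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
# threshold=0.75  TPR=0.50  FPR=0.00
# threshold=0.50  TPR=0.75  FPR=0.25
# threshold=0.25  TPR=1.00  FPR=0.50
```

Lowering the threshold can only add predicted positives, so both rates rise together. Each threshold gives one point on the ROC curve.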
ROC curve: TPR on the Y-axis, FPR on the X-axis, traced out as the threshold is lowered from 1 → 0.
Plotting the ROC Curve
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
import matplotlib.pyplot as plt

# Get probability scores for the positive class
# (assumes `model` is an already fitted binary classifier)
y_prob = model.predict_proba(X_test)[:, 1]

# Compute ROC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.3f})', color='steelblue')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Or use sklearn's built-in display.
# from_predictions already draws the curve, so no extra .plot() call is needed.
RocCurveDisplay.from_predictions(y_test, y_prob)
plt.show()
Interpreting AUC
The bands below are rough rules of thumb; what counts as acceptable depends on the application.
AUC = 1.0: Perfect classifier (TPR = 1, FPR = 0 at some threshold)
AUC = 0.9: Excellent
AUC = 0.8: Good
AUC = 0.7: Fair
AUC = 0.6: Poor
AUC = 0.5: No better than random guessing
AUC < 0.5: Worse than random (inverting the scores gives an AUC above 0.5)
Probabilistic interpretation: AUC = P(a random positive scores higher than a random negative). For AUC = 0.85: if you pick one positive and one negative sample at random, the model scores the positive higher 85% of the time.
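The probabilistic interpretation can be checked directly by counting pairs. The sketch below is only an illustration: `pairwise_auc` is a hypothetical helper, the data are made up, and ties are counted as half a win (a common convention, assumed here).

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 positives followed by 4 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

print(pairwise_auc(y_true, y_score))  # 13 of 16 pairs are ordered correctly -> 0.8125
```

On the same inputs, `roc_auc_score(y_true, y_score)` should return the same value. The pairwise loop costs O(n_pos × n_neg), so it is meant for building intuition rather than for large datasets.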
Comparing Multiple Classifiers
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100)
}

plt.figure(figsize=(8, 6))

# Fit each model and draw its ROC curve on the same axes
for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_prob = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    plt.plot(fpr, tpr, label=f'{name} (AUC={auc:.3f})')

# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], 'k--')
plt.legend()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('Model Comparison: ROC Curves')
plt.show()
Finding the Optimal Threshold
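import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Youden's J statistic: maximize TPR - FPR
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]

print(f"Optimal threshold: {best_threshold:.4f}")
print(f"TPR at threshold: {tpr[best_idx]:.4f}")
print(f"FPR at threshold: {fpr[best_idx]:.4f}")

# Apply the chosen threshold to get hard predictions.
# Note: a threshold tuned on the test set gives an optimistic estimate;
# a separate validation set is preferable for choosing it.
y_pred_optimal = (y_prob >= best_threshold).astype(int)
Multiclass ROC: One-vs-Rest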
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# For a 3-class problem (assumes `model` is a fitted multiclass classifier)
classes = [0, 1, 2]
y_test_bin = label_binarize(y_test, classes=classes)  # shape (n_samples, 3)
y_prob_multi = model.predict_proba(X_test)             # shape (n_samples, 3)

# Each column of y_test_bin is scored against the matching probability
# column, i.e. one class versus the rest.

# Macro AUC: unweighted average over classes
macro_auc = roc_auc_score(y_test_bin, y_prob_multi, multi_class='ovr', average='macro')

# Weighted AUC: weighted by class support
weighted_auc = roc_auc_score(y_test_bin, y_prob_multi, multi_class='ovr', average='weighted')
ROC AUC vs. PR AUC: When to Use Each
| Metric | Best When |
|---|---|
| ROC AUC | Classes are roughly balanced; false positives and false negatives have similar cost |
| PR AUC | Classes are very imbalanced; positive class is rare and important (fraud, disease) |
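For highly imbalanced datasets, ROC AUC can be misleadingly high: the large number of true negatives keeps FPR = FP / (FP + TN) small even when false positives outnumber true positives. PR AUC (commonly estimated with average precision) is built from precision and recall, neither of which uses the TN count, so it reflects how well the rare positive class is actually retrieved.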
from sklearn.metrics import average_precision_score

pr_auc = average_precision_score(y_test, y_prob)
print(f"PR AUC: {pr_auc:.4f}")
Use both metrics when reporting: they tell complementary stories about classifier performance.
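To make the contrast concrete, here is a small constructed example in which the two metrics diverge sharply. The labels and scores are hypothetical, chosen by hand so that the arithmetic is easy to follow. They do not come from a real model.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical scores: 10 positives among 1000 samples (1% prevalence)
y_true = np.concatenate([np.ones(10, dtype=int), np.zeros(990, dtype=int)])
y_score = np.concatenate([
    np.full(10, 0.90),   # every positive scores 0.90
    np.full(50, 0.95),   # 50 negatives outrank every positive
    np.full(940, 0.10),  # the remaining negatives score low
])

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

print(f"ROC AUC: {roc_auc:.3f}")  # 940/990 -> 0.949
print(f"PR AUC:  {pr_auc:.3f}")   # 10/60  -> 0.167
```

Only 50 of the 990 negatives outrank the positives, so ROC AUC stays near 0.95. Those same 50 false positives sit ahead of just 10 true positives, so precision at full recall is 10/60 and the PR AUC is low.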