ROC Curve and AUC: Threshold-Invariant Classification Evaluation

Understand ROC curves and AUC — TPR vs FPR tradeoff, threshold selection, comparing classifiers, multiclass ROC, and when AUC is preferable to accuracy.

ROC Curve and AUC

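The ROC (Receiver Operating Characteristic) curve shows how a binary classifier’s true positive rate and false positive rate change as the classification threshold varies. The AUC (Area Under the Curve) summarizes the curve as a single number that does not depend on any one threshold. Because TPR and FPR are each computed within a single class, the curve does not shift when the class ratio changes, although under heavy imbalance this same property can hide poor precision (see the ROC AUC vs. PR AUC section at the end).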


TPR and FPR

TPR (True Positive Rate) = Recall = TP / (TP + FN)
→ "What fraction of actual positives did we correctly identify?"
FPR (False Positive Rate) = FP / (FP + TN)
→ "What fraction of actual negatives did we incorrectly classify as positive?"
ROC curve: TPR on the Y-axis, FPR on the X-axis. As the threshold is lowered from 1 → 0, more samples are predicted positive and the curve moves from (0, 0) to (1, 1).
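
To make the two definitions concrete, here is a minimal pure-Python sketch that counts TP, FP, FN, and TN at a few thresholds. The helper name tpr_fpr and the toy labels and scores are made up for illustration; they are not part of scikit-learn.

```python
def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when every score >= threshold is predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy data: 4 positives followed by 4 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

for t in (0.75, 0.5, 0.25):
    tpr, fpr = tpr_fpr(y_true, y_score, t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
# threshold=0.75  TPR=0.50  FPR=0.00
# threshold=0.50  TPR=0.75  FPR=0.25
# threshold=0.25  TPR=1.00  FPR=0.50
```

Lowering the threshold raises both rates together, which is the tradeoff the ROC curve traces out.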

Plotting the ROC Curve

from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay
import matplotlib.pyplot as plt
# Get probability scores
y_prob = model.predict_proba(X_test)[:, 1]
# Compute ROC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.3f})', color='steelblue')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Or use sklearn's built-in display.
# from_predictions() already draws the curve, so no extra .plot() call is needed.
RocCurveDisplay.from_predictions(y_test, y_prob)
plt.show()
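
For intuition about what a call like roc_curve computes, the sketch below builds the curve by hand: sort samples by descending score, lower the threshold one distinct score at a time, and record (FPR, TPR) after each step. The area is then taken with the trapezoid rule. This is a simplified illustration of the idea, not scikit-learn's actual implementation; the function names and toy data are hypothetical.

```python
def roc_points(y_true, y_score):
    """Return the list of (FPR, TPR) points from the highest threshold to the lowest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(y_score, y_true), reverse=True)  # highest score first
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples tied at the same score cross the threshold together
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

points = roc_points(y_true, y_score)
print(points[:4])             # [(0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.25, 0.5)]
print(trapezoid_auc(points))  # 0.8125
```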

Interpreting AUC

AUC = 1.0: Perfect classifier (TPR=1, FPR=0 at some threshold)
AUC = 0.9: Excellent
AUC = 0.8: Good
AUC = 0.7: Fair
AUC = 0.6: Poor
AUC = 0.5: No better than random guessing
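AUC < 0.5: Worse than random (inverting the scores gives an AUC of 1 - AUC)

The labels between 0.6 and 0.9 are rough rules of thumb; what counts as a good AUC depends on the domain and on the cost of each kind of error.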

Probabilistic interpretation: AUC = P(random positive scores higher than random negative). For AUC=0.85: if you pick one positive and one negative sample at random, the model scores the positive higher 85% of the time.
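
The probabilistic reading can be checked directly by comparing every positive score with every negative score and counting how often the positive wins (ties count as half). The sketch below is a brute-force illustration with a hypothetical helper name and toy data; it is quadratic in the number of samples, so it is meant for understanding rather than production use.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.8125 (13 of the 16 pairs are ranked correctly)

# Reversing the ranking turns every win into a loss, so the AUC becomes 1 - AUC
flipped_auc = pairwise_auc(y_true, [-s for s in y_score])
print(flipped_auc)  # 0.1875
```

The second call shows why a model with AUC below 0.5 still carries signal: its ranking is informative, just reversed.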


Comparing Multiple Classifiers

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100)
}
plt.figure(figsize=(8, 6))
for name, clf in models.items():
    # Fit each model, then score the test set with positive-class probabilities
    clf.fit(X_train, y_train)
    y_prob = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    plt.plot(fpr, tpr, label=f'{name} (AUC={auc:.3f})')
# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], 'k--')
plt.legend()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('Model Comparison: ROC Curves')
plt.show()

Finding the Optimal Threshold

import numpy as np
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
# Youden's J statistic: maximize TPR - FPR
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]
print(f"Optimal threshold: {best_threshold:.4f}")
print(f"TPR at threshold: {tpr[best_idx]:.4f}")
print(f"FPR at threshold: {fpr[best_idx]:.4f}")
y_pred_optimal = (y_prob >= best_threshold).astype(int)
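
To see what Youden's J does at each candidate threshold, the following pure-Python sketch evaluates J = TPR - FPR for every distinct score and keeps the best one. The function name and the toy data are hypothetical. In practice the threshold is usually chosen on a validation split rather than on the test set, since picking it on the test set makes the reported TPR and FPR optimistic.

```python
def youden_best_threshold(y_true, y_score):
    """Return (threshold, J) for the distinct score that maximizes TPR - FPR."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict '>' keeps the highest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical toy data: 5 positives and 5 negatives, already sorted by score
y_true = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
y_score = [0.95, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

threshold, j = youden_best_threshold(y_true, y_score)
print(f"best threshold={threshold:.2f}  J={j:.2f}")  # best threshold=0.60  J=0.60
```

At the chosen threshold, 4 of 5 positives are caught (TPR = 0.8) while 1 of 5 negatives is flagged (FPR = 0.2).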

Multiclass ROC: One-vs-Rest

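from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize
# For a 3-class problem
classes = [0, 1, 2]
# Probability matrix of shape (n_samples, 3); columns follow model.classes_
y_prob_multi = model.predict_proba(X_test)
# With multi_class='ovr', pass the raw 1-D class labels:
# roc_auc_score handles the one-vs-rest binarization itself.
# Macro AUC: unweighted mean of the per-class AUCs
macro_auc = roc_auc_score(y_test, y_prob_multi, multi_class='ovr', average='macro')
# Weighted AUC: per-class AUCs weighted by class support
weighted_auc = roc_auc_score(y_test, y_prob_multi, multi_class='ovr', average='weighted')
# label_binarize is only needed for per-class curves, for example
# roc_curve(y_test_bin[:, k], y_prob_multi[:, k]) for class k
y_test_bin = label_binarize(y_test, classes=classes)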
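
The one-vs-rest averaging can be spelled out by hand: treat each class in turn as the positive class, compute a binary AUC from that class's probability column, then average. The sketch below does this in pure Python with a brute-force pairwise AUC. All names and the toy probabilities are hypothetical, and it is intended only to show what the macro and weighted averages mean.

```python
def binary_auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties = 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_aucs(y_true, y_prob, classes):
    """Return {class: AUC of 'this class vs. the rest'}."""
    result = {}
    for col, k in enumerate(classes):
        labels = [1 if y == k else 0 for y in y_true]
        scores = [row[col] for row in y_prob]
        result[k] = binary_auc(labels, scores)
    return result


# Hypothetical toy data: 7 samples, 3 classes, one probability row per sample
classes = [0, 1, 2]
y_true = [0, 0, 1, 1, 2, 2, 0]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]

per_class = ovr_aucs(y_true, y_prob, classes)
support = {k: y_true.count(k) for k in classes}

# Macro: every class counts equally
macro_auc = sum(per_class.values()) / len(classes)
# Weighted: each class counts in proportion to its number of samples
weighted_auc = sum(per_class[k] * support[k] for k in classes) / len(y_true)

print(per_class)
print(f"macro={macro_auc:.4f}  weighted={weighted_auc:.4f}")
```

Macro averaging treats a rare class the same as a common one, while weighted averaging lets the common classes dominate, so the two can diverge when class sizes differ.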

ROC AUC vs. PR AUC: When to Use Each

Metric  | Best When
ROC AUC | Classes are roughly balanced; false positives and false negatives have similar cost
PR AUC  | Classes are very imbalanced; positive class is rare and important (fraud, disease)

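For highly imbalanced datasets, ROC AUC can look misleadingly high. FPR = FP / (FP + TN) has the large negative count in its denominator, so even a sizeable number of false positives produces only a small FPR and barely moves the curve. Precision = TP / (TP + FP) instead compares those false positives with the few true positives. PR AUC, commonly estimated with average precision, never uses TN, so it focuses on the positive class and exposes the problem.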
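
A short arithmetic sketch of this effect: hold a classifier's operating point fixed at TPR = 0.8 and FPR = 0.05 (so its position on the ROC curve is unchanged) and vary only the number of negatives. The function name and the numbers are illustrative assumptions, not measurements.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a (TPR, FPR) operating point and the class counts."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    return tp / (tp + fp)


balanced = precision_at(0.8, 0.05, n_pos=100, n_neg=100)
imbalanced = precision_at(0.8, 0.05, n_pos=100, n_neg=10_000)

print(f"balanced (1:1):     precision={balanced:.3f}")    # 0.941
print(f"imbalanced (1:100): precision={imbalanced:.3f}")  # 0.138
```

The ROC point is identical in both cases, yet under 1:100 imbalance most flagged samples are false alarms, which only the precision-based view reveals.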

from sklearn.metrics import average_precision_score
pr_auc = average_precision_score(y_test, y_prob)
print(f"PR AUC: {pr_auc:.4f}")
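
For reference, average precision is the sum of precision values at each rank where a positive is found, weighted by the recall gained at that rank. The sketch below implements that definition in pure Python under the simplifying assumption that no two samples share the same score (tied scores need to be processed as a group, which this sketch does not do). The function name and toy data are hypothetical.

```python
def average_precision(y_true, y_score):
    """AP = sum over positives of (precision at that rank) / (number of positives).

    Assumes all scores are distinct.
    """
    ranked = sorted(zip(y_score, y_true), reverse=True)  # highest score first
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank  # precision among the top `rank` samples
    return total / n_pos  # each positive contributes a recall step of 1 / n_pos


# Hypothetical toy data
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

ap = average_precision(y_true, y_score)
print(f"AP = {ap:.4f}")  # AP = 0.8542
```

A useful baseline when reading this number: a classifier that ranks at random has an expected average precision close to the fraction of positives in the data, not 0.5.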

Use both metrics when reporting — they tell complementary stories about classifier performance.