Evaluation Metrics: Accuracy, Precision, Recall, F1, and ROC-AUC Explained
A fraud detection model that predicts “not fraud” for every single transaction achieves 99% accuracy if only 1% of transactions are actually fraudulent — and it’s also completely useless, since it catches zero fraud. This single example is why accuracy alone is one of the most dangerously misleading metrics in machine learning, and why precision, recall, F1-score, and ROC-AUC exist as more informative alternatives for exactly the situations where accuracy breaks down.
Accuracy: Simple, But Dangerous on Imbalanced Data
Accuracy is the fraction of predictions that were correct — intuitive, but only meaningful when classes are roughly balanced.
```python
correct_predictions = 990
total_predictions = 1000
accuracy = correct_predictions / total_predictions  # 0.99

# But if 990 of 1000 transactions are genuinely "not fraud",
# a model that always predicts "not fraud" also scores 99% accuracy
# while catching zero actual fraud cases
```
Whenever your classes are imbalanced (fraud detection, disease diagnosis, defect detection), accuracy alone should be treated as close to meaningless, and the metrics below become essential.
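To make the arithmetic concrete, here is a minimal sketch of the always-negative baseline described above. The 1,000-transaction dataset with 10 fraud cases is made up for illustration, and encoding fraud as 1 and "not fraud" as 0 is an assumed convention.

```python
# Hypothetical dataset: 1,000 transactions, 10 of them fraudulent (1 = fraud).
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts "not fraud".
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Fraud cases the model actually flagged.
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2f}")
print(f"fraud cases caught: {fraud_caught} of {sum(y_true)}")
```

The accuracy figure looks excellent only because the majority class dominates the count of correct predictions.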
The Confusion Matrix: The Foundation Everything Else Builds On
Every other classification metric is derived from four counts: true positives, true negatives, false positives, and false negatives.
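| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

```python
from sklearn.metrics import confusion_matrix

# y_true and y_pred are arrays of 0/1 labels (1 = positive).
# scikit-learn orders rows and columns by label value, so for binary
# 0/1 labels the flattened matrix reads tn, fp, fn, tp.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```
Precision: Of Everything You Flagged, How Much Was Actually Right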
Precision answers: “when the model predicts positive, how often is it actually correct?”
```python
precision = tp / (tp + fp)
```
High precision matters when false positives are costly: flagging a legitimate transaction as fraud (annoying a real customer) or flagging a healthy patient as sick (unnecessary invasive follow-up testing).
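As a rough illustration of the formula, the sketch below tallies true and false positives by hand from two label lists. The function name `precision_from_labels` and the toy labels are hypothetical, and returning `None` when nothing is flagged is one possible convention rather than a standard.

```python
def precision_from_labels(y_true, y_pred):
    """Return TP / (TP + FP), or None when nothing was predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return None  # precision is undefined without any positive predictions
    return tp / (tp + fp)


# Hypothetical labels: 1 = fraud, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Three transactions were flagged; two were real fraud, one was a false alarm.
example_precision = precision_from_labels(y_true, y_pred)

# A model that flags nothing has no precision to report.
no_flags_precision = precision_from_labels(y_true, [0] * len(y_true))

print(example_precision, no_flags_precision)
```

The undefined case is worth handling explicitly. scikit-learn's `precision_score` exposes a `zero_division` parameter for the same situation.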
Recall: Of Everything That Was Actually Positive, How Much Did You Catch
Recall (also called sensitivity) answers: “of all the actual positive cases, how many did the model successfully identify?”
```python
recall = tp / (tp + fn)
```
High recall matters when false negatives are costly: missing an actual fraud case (direct financial loss) or missing an actual disease diagnosis (a genuinely dangerous outcome for the patient).
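The following sketch works through recall on a small, made-up screening example, and also reports precision on the same predictions to show that the two can diverge sharply. The labels and the helper name `recall_from_labels` are assumptions for illustration only.

```python
def recall_from_labels(y_true, y_pred):
    """Return TP / (TP + FN), or None when there are no actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        return None  # recall is undefined without any actual positives
    return tp / (tp + fn)


# Hypothetical screening results: 1 = disease present, 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

recall = recall_from_labels(y_true, y_pred)

# Precision on the same predictions: the single flagged patient was truly sick.
flagged = sum(y_pred)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
precision = true_positives / flagged

print(f"recall: {recall:.2f}, precision: {precision:.2f}")
```

Here the model is never wrong when it raises a flag, yet it misses three of the four sick patients, which is exactly the failure that recall is designed to expose.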
The Precision-Recall Tradeoff
Precision and recall typically trade off against each other: a model tuned to catch every possible fraud case (high recall) will usually flag more legitimate transactions as false positives too (lower precision), and vice versa. Lowering the decision threshold can never reduce recall, but it usually costs precision.
```python
# Adjusting the classification threshold shifts the precision-recall tradeoff
probabilities = model.predict_proba(X_test)[:, 1]

threshold_low = 0.3   # more permissive -- higher recall, lower precision
threshold_high = 0.7  # more conservative -- higher precision, lower recall

predictions_low = (probabilities >= threshold_low).astype(int)
predictions_high = (probabilities >= threshold_high).astype(int)
```
Choosing the right threshold is a business decision, not a purely technical one: it depends entirely on the relative cost of false positives versus false negatives for your specific application.
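One way to turn that business decision into something computable is to attach a cost to each error type and compare thresholds by total cost. The sketch below does this on hand-written scores. The probabilities, labels, cost values, and the `evaluate` helper are all hypothetical, and real costs would have to come from the application itself.

```python
# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
probabilities = [0.95, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10, 0.05]
y_true = [1, 1, 0, 1, 0, 0, 0, 0]

# Assumed business costs: a missed fraud case hurts far more than a false alarm.
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 10.0


def evaluate(threshold):
    """Return (precision, recall, total_cost) at the given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE
    return precision, recall, cost


candidate_thresholds = [0.3, 0.5, 0.7]
results = {t: evaluate(t) for t in candidate_thresholds}

# Pick the candidate with the lowest total cost under the assumed costs.
best_threshold = min(candidate_thresholds, key=lambda t: results[t][2])

for t in candidate_thresholds:
    print(t, results[t])
print("lowest-cost threshold:", best_threshold)
```

With false negatives priced at ten times a false positive, the permissive threshold wins despite its lower precision. Reversing the cost ratio would favor the conservative one instead.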
F1-Score: Balancing Precision and Recall Into One Number
F1-score is the harmonic mean of precision and recall, useful when you need a single number that penalizes models doing poorly on either metric.
```python
f1 = 2 * (precision * recall) / (precision + recall)
```
The harmonic mean (rather than a simple average) specifically penalizes extreme imbalance between precision and recall: a model with 100% precision and 1% recall gets an F1-score of about 0.02, whereas a simple average would report roughly 0.5. The low score correctly reflects that the model is not actually useful despite one metric looking perfect.
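The difference between the two kinds of mean is easy to check numerically. This short sketch compares them for the 100% precision, 1% recall case. The helper name `f1_score_from` is hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.00, 0.01

arithmetic_mean = (precision + recall) / 2
harmonic_mean = f1_score_from(precision, recall)

print(f"arithmetic mean: {arithmetic_mean:.3f}")
print(f"harmonic mean (F1): {harmonic_mean:.3f}")
```

The harmonic mean stays close to the weaker of the two values, so a single strong metric cannot compensate for a very weak one.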
ROC-AUC: Evaluating Across All Possible Thresholds at Once
The ROC curve plots the true positive rate against the false positive rate across every possible classification threshold, and the Area Under the Curve (AUC) summarizes overall discriminative ability into a single number, independent of any specific threshold choice.
```python
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_true, probabilities)
# 1.0 = perfect discrimination, 0.5 = no better than random guessing
```
An AUC of 0.5 means the model has no genuine discriminative power (equivalent to random guessing), while 1.0 means perfect separation between classes across every possible threshold. AUC is particularly useful for comparing two models’ fundamental discriminative ability without committing to a specific decision threshold upfront.
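A useful way to read the number: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by brute force. The function name `roc_auc_pairwise` and the toy scores are hypothetical, and the pairwise loop is meant for intuition rather than for large datasets.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores (1 = positive class).
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1]

auc = roc_auc_pairwise(y_true, scores)
print(f"AUC: {auc:.2f}")
```

Because only the ordering of scores matters, no threshold appears anywhere in the calculation.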
Choosing the Right Metric for Your Problem
| Situation | Recommended primary metric |
|---|---|
| Balanced classes, equal error cost | Accuracy |
| False positives are the bigger concern | Precision |
| False negatives are the bigger concern | Recall |
| Need one balanced number | F1-score |
| Comparing models independent of a specific threshold | ROC-AUC |
| Regression task | Mean absolute error, RMSE, R² (not classification metrics) |
Metrics for Multi-Class Problems: Macro vs. Micro Averaging
When extending precision, recall, and F1-score beyond binary classification to problems with several classes, there’s an additional choice to make. Macro averaging computes the metric separately for each class and then averages those scores equally, treating every class as equally important regardless of how many examples it has. Micro averaging aggregates the raw counts (true positives, false positives, false negatives) across all classes first, then computes the metric once, effectively weighting by how frequently each class appears.

On an imbalanced multi-class dataset, these two approaches can produce meaningfully different numbers: macro averaging will visibly penalize poor performance on a rare class, while micro averaging can mask that same poor performance if the rare class makes up a small fraction of the overall data. Choosing macro averaging by default is generally the safer choice whenever every class genuinely matters, regardless of its frequency in the dataset.
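The gap between the two averages can be seen on a small, made-up three-class example. The sketch below computes recall only, to keep it short; the class names and counts are hypothetical, and the same pattern applies to precision and F1-score.

```python
# Hypothetical imbalanced dataset: classes "a" and "b" are common, "c" is rare.
y_true = ["a"] * 45 + ["b"] * 45 + ["c"] * 10
# The model handles "a" and "b" perfectly but mislabels 8 of the 10 "c" cases.
y_pred = ["a"] * 45 + ["b"] * 45 + ["c"] * 2 + ["a"] * 8

classes = sorted(set(y_true))
tp = {c: 0 for c in classes}  # correctly identified examples of each class
fn = {c: 0 for c in classes}  # missed examples of each class

for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1
    else:
        fn[t] += 1

# Macro: compute recall per class, then average the per-class scores equally.
per_class_recall = {c: tp[c] / (tp[c] + fn[c]) for c in classes}
macro_recall = sum(per_class_recall.values()) / len(classes)

# Micro: pool the raw counts across classes, then compute recall once.
micro_recall = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))

print(per_class_recall)
print(f"macro recall: {macro_recall:.3f}")
print(f"micro recall: {micro_recall:.3f}")
```

The rare class drags the macro average well below the micro average, which barely registers the eight missed cases.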
Summary
| Metric | Answers |
|---|---|
| Accuracy | What fraction of all predictions were correct? |
| Precision | Of predicted positives, how many were actually positive? |
| Recall | Of actual positives, how many were caught? |
| F1-score | A single balance between precision and recall |
| ROC-AUC | Overall discriminative power, across all thresholds |
Picking the wrong evaluation metric doesn’t just give you a misleading number — it can lead you to ship a model that looks great on paper (99% accuracy) while being functionally useless for the actual problem it was built to solve.