Evaluation Metrics: Accuracy, Precision, Recall, F1, and ROC-AUC Explained

When accuracy is misleading and what to use instead — precision, recall, F1-score, and ROC-AUC explained with real classification examples.

A fraud detection model that predicts “not fraud” for every single transaction achieves 99% accuracy if only 1% of transactions are actually fraudulent, and it is also completely useless, since it catches zero fraud. This example shows why accuracy alone can be dangerously misleading, and why precision, recall, F1-score, and ROC-AUC exist as more informative alternatives for the situations where accuracy breaks down.


Accuracy: Simple, But Dangerous on Imbalanced Data

Accuracy is the fraction of predictions that were correct — intuitive, but only meaningful when classes are roughly balanced.

correct_predictions = 990
total_predictions = 1000
accuracy = correct_predictions / total_predictions # 0.99
# But if 990 of 1000 transactions are genuinely "not fraud",
# a model that always predicts "not fraud" also scores 99% accuracy
# while catching zero actual fraud cases

Whenever your classes are imbalanced — fraud detection, disease diagnosis, defect detection — accuracy alone should be treated as close to meaningless, and the metrics below become essential.
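The always-"not fraud" baseline described above can be reproduced in a few lines. This is a minimal sketch in plain Python; the dataset (1,000 transactions, 10 of them fraudulent) is made up for illustration.

```python
# Hypothetical data: 1,000 transactions, 10 of them fraudulent (label 1).
y_true = [1] * 10 + [0] * 990

# A "model" that predicts "not fraud" (label 0) for every transaction.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of actual fraud cases the model flagged.
fraud_total = sum(y_true)
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fraud_recall = fraud_caught / fraud_total

print(f"accuracy: {accuracy:.2f}")                          # accuracy: 0.99
print(f"fraud cases caught: {fraud_caught}/{fraud_total}")  # fraud cases caught: 0/10
```

The accuracy looks excellent only because the majority class dominates the count of correct predictions.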


The Confusion Matrix: The Foundation Everything Else Builds On

Every other classification metric is derived from four counts: true positives, true negatives, false positives, and false negatives.

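| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

from sklearn.metrics import confusion_matrix
# y_true / y_pred: true and predicted labels (0 = negative, 1 = positive).
# scikit-learn sorts rows and columns by label, so the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()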
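To make the four counts concrete, here is a small sketch that tallies them by hand. The labels are hypothetical (1 = positive, 0 = negative), and the tallying is done in plain Python only to show what each cell of the matrix means.

```python
# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# Each cell counts one combination of (actual, predicted).
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)

print(f"TP={tp} FN={fn}")  # TP=3 FN=1
print(f"FP={fp} TN={tn}")  # FP=1 TN=5
```

Every prediction lands in exactly one cell, so the four counts always sum to the number of examples.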

Precision: Of Everything You Flagged, How Much Was Actually Right

Precision answers: “when the model predicts positive, how often is it actually correct?”

precision = tp / (tp + fp)

High precision matters when false positives are costly — flagging a legitimate transaction as fraud (annoying a real customer) or flagging a healthy patient as sick (unnecessary invasive follow-up testing).


Recall: Of Everything That Was Actually Positive, How Much Did You Catch

Recall (also called sensitivity) answers: “of all the actual positive cases, how many did the model successfully identify?”

recall = tp / (tp + fn)

High recall matters when false negatives are costly — missing an actual fraud case (direct financial loss) or missing an actual disease diagnosis (a genuinely dangerous outcome for the patient).
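The two formulas can be compared side by side on the same counts. The sketch below assumes a hypothetical fraud model; the helper name `precision_recall` and the choice to report an undefined ratio as 0.0 are conventions chosen for this example.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); an undefined ratio is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 80 frauds flagged correctly, 20 legitimate
# transactions flagged by mistake, 120 frauds missed.
precision, recall = precision_recall(tp=80, fp=20, fn=120)

print(f"precision: {precision:.2f}")  # precision: 0.80
print(f"recall:    {recall:.2f}")     # recall:    0.40
```

Here most flags are correct (high precision), yet the model still misses the majority of actual fraud (low recall), which is why the two numbers need to be read together.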


The Precision-Recall Tradeoff

Precision and recall typically trade off against each other: a model tuned to catch every possible fraud case (high recall) will usually flag more legitimate transactions as false positives too (lower precision), and vice versa.

# Adjusting the classification threshold shifts the precision-recall tradeoff
probabilities = model.predict_proba(X_test)[:, 1]
threshold_low = 0.3 # more permissive -- higher recall, lower precision
threshold_high = 0.7 # more conservative -- higher precision, lower recall
predictions_low = (probabilities >= threshold_low).astype(int)
predictions_high = (probabilities >= threshold_high).astype(int)

Choosing the right threshold is a business decision, not a purely technical one — it depends entirely on the relative cost of false positives versus false negatives for your specific application.
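The effect of moving the threshold can be seen on a tiny, self-contained example. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper, not part of any library.

```python
# Hypothetical (predicted fraud probability, true label) pairs; 1 = fraud.
examples = [
    (0.95, 1), (0.85, 1), (0.65, 1), (0.45, 1), (0.35, 1),
    (0.60, 0), (0.40, 0), (0.20, 0), (0.10, 0), (0.05, 0),
]


def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are flagged as positive."""
    tp = sum(1 for score, label in examples if score >= threshold and label == 1)
    fp = sum(1 for score, label in examples if score >= threshold and label == 0)
    fn = sum(1 for score, label in examples if score < threshold and label == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


for threshold in (0.3, 0.7):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
# threshold=0.3: precision=0.71 recall=1.00
# threshold=0.7: precision=1.00 recall=0.40
```

On this toy data, the lower threshold catches every fraud case at the cost of two false alarms, while the higher threshold raises no false alarms but misses three of the five fraud cases.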


F1-Score: Balancing Precision and Recall Into One Number

F1-score is the harmonic mean of precision and recall, useful when you need a single number that penalizes models doing poorly on either metric.

f1 = 2 * (precision * recall) / (precision + recall)

The harmonic mean (rather than a simple average) penalizes extreme imbalance between precision and recall: a model with 100% precision and 1% recall gets an F1-score of about 0.02, correctly reflecting that it’s not actually useful despite one metric looking perfect.
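A quick calculation shows how differently the two kinds of mean treat that lopsided model. This is a small sketch; the helper name `f1_score_from` is made up for the example, and returning 0.0 when both inputs are zero is a convention chosen here.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.00, 0.01

arithmetic_mean = (precision + recall) / 2
f1 = f1_score_from(precision, recall)

print(f"arithmetic mean: {arithmetic_mean:.3f}")  # arithmetic mean: 0.505
print(f"F1 (harmonic):   {f1:.3f}")               # F1 (harmonic):   0.020
```

A simple average would suggest a mediocre but usable model, while the harmonic mean stays close to the weaker of the two numbers.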


ROC-AUC: Evaluating Across All Possible Thresholds at Once

The ROC curve plots the true positive rate (recall, TP / (TP + FN)) against the false positive rate (FP / (FP + TN)) across every possible classification threshold. The Area Under the Curve (AUC) summarizes overall discriminative ability in a single number, independent of any specific threshold choice.

from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_true, probabilities)
# 1.0 = perfect discrimination, 0.5 = no better than random guessing

An AUC of 0.5 means the model has no genuine discriminative power (equivalent to random guessing), while 1.0 means perfect separation between classes across every possible threshold. AUC is particularly useful for comparing two models’ fundamental discriminative ability without committing to a specific decision threshold upfront.
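One way to build intuition for what AUC measures: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that directly on made-up scores; `pairwise_auc` is a hypothetical helper written for illustration, and on the same inputs it should agree with `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc = pairwise_auc(y_true, scores)
print(f"AUC: {auc:.3f}")  # AUC: 0.889
```

Eight of the nine positive/negative pairs are ordered correctly; the only mistake is the positive scored 0.4 sitting below the negative scored 0.6. The double loop is quadratic in the number of examples, so it is meant for understanding rather than for large datasets.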


Choosing the Right Metric for Your Problem

| Situation | Recommended primary metric |
| --- | --- |
| Balanced classes, equal error cost | Accuracy |
| False positives are the bigger concern | Precision |
| False negatives are the bigger concern | Recall |
| Need one balanced number | F1-score |
| Comparing models independent of a specific threshold | ROC-AUC |
| Regression task | Mean absolute error, RMSE, R² (not classification metrics) |

Metrics for Multi-Class Problems: Macro vs. Micro Averaging

When extending precision, recall, and F1-score beyond binary classification to problems with several classes, there’s an additional choice to make: macro averaging computes the metric separately for each class and then averages those scores equally, treating every class as equally important regardless of how many examples it has. Micro averaging aggregates the raw counts (true positives, false positives, false negatives) across all classes first, then computes the metric once — effectively weighting by how frequently each class appears. On an imbalanced multi-class dataset, these two approaches can produce meaningfully different numbers: macro averaging will visibly penalize poor performance on a rare class, while micro averaging can mask that same poor performance if the rare class makes up a small fraction of the overall data. Choosing macro averaging by default is generally the safer choice whenever every class genuinely matters, regardless of its frequency in the dataset.
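The gap between the two averaging modes is easy to see on a small imbalanced example. This sketch uses scikit-learn's `recall_score` with its `average` argument; the three-class dataset is hypothetical, with a rare class 2 that the model always misses.

```python
from sklearn.metrics import recall_score

# Hypothetical 3-class data: class 0 is common, class 2 is rare.
y_true = [0] * 90 + [1] * 8 + [2] * 2
# The model gets classes 0 and 1 right but predicts class 0 for every class-2 example.
y_pred = [0] * 90 + [1] * 8 + [0] * 2

# Macro: compute recall per class (1.0, 1.0, 0.0), then take the unweighted mean.
macro = recall_score(y_true, y_pred, average="macro")

# Micro: pool the counts across classes first (98 of 100 examples recovered).
micro = recall_score(y_true, y_pred, average="micro")

print(f"macro recall: {macro:.2f}")  # macro recall: 0.67
print(f"micro recall: {micro:.2f}")  # micro recall: 0.98
```

The macro score drops by a third because one of three classes is never detected, while the micro score barely moves because that class accounts for only 2% of the examples.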

Summary

| Metric | Answers |
| --- | --- |
| Accuracy | What fraction of all predictions were correct? |
| Precision | Of predicted positives, how many were actually positive? |
| Recall | Of actual positives, how many were caught? |
| F1-score | A single balance between precision and recall |
| ROC-AUC | Overall discriminative power, across all thresholds |

Picking the wrong evaluation metric doesn’t just give you a misleading number — it can lead you to ship a model that looks great on paper (99% accuracy) while being functionally useless for the actual problem it was built to solve.