Precision, Recall, and F1 Score: Classification Metrics That Matter

Master precision, recall, F1 score, and F-beta — when to optimize each metric, macro vs micro vs weighted averaging, and precision-recall tradeoffs in imbalanced datasets.

Accuracy is often the wrong metric. When classes are imbalanced (say, 99% negative and 1% positive), a model that predicts negative for every example reaches 99% accuracy while never detecting a single positive. Precision, recall, and F1 score measure performance on the class that matters.
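
The accuracy trap described above can be reproduced in a few lines. This is a minimal sketch on made-up labels (990 negatives, 10 positives) with a hypothetical "model" that always predicts the negative class; no library is needed.

```python
# Toy labels: 990 negatives (0) and 10 positives (1), i.e. 1% positive.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual positives that were predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.99
print(f"Recall:   {recall:.2f}")    # Recall:   0.00
```

Accuracy looks excellent, yet recall shows that not one positive was found.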


The Metrics

TP = true positives, FP = false positives, FN = false negatives.

Precision = TP / (TP + FP)
→ "Of all my positive predictions, how many are actually positive?"
→ Measures prediction reliability

Recall (Sensitivity) = TP / (TP + FN)
→ "Of all actual positives, how many did I find?"
→ Measures coverage

F1 Score = 2 × Precision × Recall / (Precision + Recall)
→ Harmonic mean: a single score balancing both
→ Pulled toward the lower of the two, so a large gap between precision and recall is penalized
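
As a worked example of the three formulas, the sketch below computes them from raw confusion-matrix counts. The helper name `precision_recall_f1` and the counts are hypothetical, chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"Precision: {p:.3f}, Recall: {r:.3f}, F1: {f1:.3f}")
# Precision: 0.800, Recall: 0.667, F1: 0.727

# Lopsided case: precision 0.9 but recall only 0.1.
p2, r2, f1_lopsided = precision_recall_f1(tp=9, fp=1, fn=81)
arithmetic_mean = (p2 + r2) / 2
print(f"F1: {f1_lopsided:.2f} vs arithmetic mean: {arithmetic_mean:.2f}")
# F1: 0.18 vs arithmetic mean: 0.50
```

The second case shows why the harmonic mean is used: a plain average would report 0.50, while F1 stays near the weaker of the two values.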

The Precision-Recall Trade-off

Raising the classification threshold lowers recall and usually raises precision; lowering it does the reverse. The precision-recall curve shows this trade-off across all thresholds:

import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve
# Probability scores for the positive class (assumes a fitted binary classifier `model`)
y_prob = model.predict_proba(X_test)[:, 1]
# precisions and recalls each have len(thresholds) + 1 elements
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
plt.plot(recalls, precisions)
plt.fill_between(recalls, precisions, alpha=0.2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
# Average precision summarizes the PR curve (an estimate of AUPRC); higher is better
auprc = average_precision_score(y_test, y_prob)
print(f"AUPRC: {auprc:.4f}")
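
To make the trade-off concrete, the following sketch sweeps three thresholds over a small hand-made set of labels and scores. The data and the helper `precision_recall_at` are illustrative assumptions, not part of scikit-learn. When nothing is predicted positive, the helper reports precision as 1.0 by convention.

```python
import numpy as np

# Made-up labels and scores, for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.70, 0.80, 0.90])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

results = {t: precision_recall_at(t) for t in (0.25, 0.50, 0.75)}
for t, (p, r) in results.items():
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
# threshold=0.25  precision=0.714  recall=1.000
# threshold=0.50  precision=0.750  recall=0.600
# threshold=0.75  precision=1.000  recall=0.400
```

Recall can only fall as the threshold rises. Precision rises here, but on real data it can dip locally, which is why the curve is often jagged.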

Choosing a Threshold

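import numpy as np
# precisions and recalls have one more element than thresholds: the final
# (precision=1, recall=0) point has no threshold, so drop it before indexing
p, r = precisions[:-1], recalls[:-1]
# Find threshold that maximizes F1
f1_scores = 2 * p * r / (p + r + 1e-8)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
print(f"Best threshold: {best_threshold:.3f}, F1: {f1_scores[best_idx]:.4f}")
# Apply custom threshold
y_pred_custom = (y_prob >= best_threshold).astype(int)
# Or optimize for recall (catch as many positives as possible).
# Recall falls as the threshold rises, so the last index with recall >= 0.95
# gives the highest threshold that still meets the target
recall_target_idx = np.where(r >= 0.95)[0][-1]
high_recall_threshold = thresholds[recall_target_idx]
y_pred_high_recall = (y_prob >= high_recall_threshold).astype(int)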
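
The index alignment between `precisions`, `recalls`, and `thresholds` is easy to get wrong, so here is a self-contained sketch of both selection strategies on made-up data. The labels, scores, and the recall target of 0.8 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and scores, for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.70, 0.80, 0.90])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)

# The curve ends with an extra (precision=1, recall=0) point that has no threshold.
print(len(precisions), len(recalls), len(thresholds))

# Drop that final point so every index maps to a real threshold.
p, r = precisions[:-1], recalls[:-1]
f1_scores = 2 * p * r / np.maximum(p + r, 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]

# Highest threshold whose recall still meets the target.
recall_target = 0.8
meets_target = np.where(r >= recall_target)[0]
high_recall_threshold = thresholds[meets_target[-1]]

print(f"Best-F1 threshold: {best_threshold:.2f}")
print(f"Highest threshold with recall >= {recall_target}: {high_recall_threshold:.2f}")
```

On this toy data the F1-optimal threshold is 0.30 and the recall-targeted threshold is 0.45. In practice, thresholds are best chosen on a validation set rather than the test set, so the reported test metrics stay unbiased.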

sklearn Classification Report

from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))
# Example output:
#               precision    recall  f1-score   support
#
#       Normal       0.99      0.99      0.99      9800
#        Fraud       0.72      0.68      0.70       200
#
#     accuracy                           0.98     10000
#    macro avg       0.86      0.84      0.85     10000
# weighted avg       0.98      0.98      0.98     10000

Macro vs. Micro vs. Weighted Averaging

For multiclass problems, you need to combine per-class metrics:

from sklearn.metrics import f1_score
# Macro: simple average of per-class scores
# Use when all classes matter equally, regardless of size
macro_f1 = f1_score(y_test, y_pred, average='macro')
# Micro: aggregate TP/FP/FN across all classes, then compute
# Use when each instance matters equally; for single-label problems this equals accuracy
micro_f1 = f1_score(y_test, y_pred, average='micro')
# Weighted: weighted average by class support (number of samples)
# Use when class imbalance exists and larger classes matter more
weighted_f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Macro F1: {macro_f1:.4f}")
print(f"Micro F1: {micro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
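
To see how the three averages differ, the sketch below computes them by hand from a small confusion matrix. The matrix is made up for illustration (three classes with supports 80, 15, and 5), and the arithmetic is meant to mirror what the `average` options describe rather than reproduce scikit-learn's implementation.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [70, 5, 5],   # class A, support 80
    [5, 9, 1],    # class B, support 15
    [2, 1, 2],    # class C, support 5
])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp      # predicted as the class but actually another
fn = cm.sum(axis=1) - tp      # actually the class but predicted as another
support = cm.sum(axis=1)

per_class_f1 = 2 * tp / (2 * tp + fp + fn)

macro_f1 = per_class_f1.mean()                                   # every class counts equally
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())   # every sample counts equally
weighted_f1 = np.average(per_class_f1, weights=support)          # classes weighted by size
accuracy = tp.sum() / cm.sum()

print("Per-class F1:", np.round(per_class_f1, 4))
print(f"Macro F1:    {macro_f1:.4f}")     # Macro F1:    0.5998
print(f"Micro F1:    {micro_f1:.4f}")     # Micro F1:    0.8100
print(f"Weighted F1: {weighted_f1:.4f}")  # Weighted F1: 0.8188
print(f"Accuracy:    {accuracy:.4f}")     # Accuracy:    0.8100
```

The rare class C has an F1 of about 0.31. Macro averaging exposes that weakness, while micro and weighted averaging are dominated by the large class A.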

F-Beta Score: Weighting Recall vs. Precision

F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

β < 1: Precision matters more (penalize false positives)
β = 1: Equal weight (standard F1)
β > 1: Recall matters more (penalize false negatives)

from sklearn.metrics import fbeta_score
# Medical screening: missing a disease (FN) is much worse than a false alarm (FP)
# β=2 means recall is twice as important as precision
f2 = fbeta_score(y_test, y_pred, beta=2)
print(f"F2 Score: {f2:.4f}")
# Spam filtering: user tolerates missed spam, hates losing legitimate mail
# β=0.5 means precision is twice as important as recall
f_half = fbeta_score(y_test, y_pred, beta=0.5)
print(f"F0.5 Score: {f_half:.4f}")
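
Plugging numbers into the F_β formula shows how β shifts the score. This sketch uses the Fraud row from the classification report example (precision 0.72, recall 0.68); the helper `f_beta` is a hypothetical name that implements the formula directly.

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall, per the formula above."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

precision, recall = 0.72, 0.68

scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f"F{beta}: {score:.3f}")
# F0.5: 0.712
# F1: 0.699
# F2: 0.688
```

With β = 0.5 the score moves toward precision (0.72), and with β = 2 it moves toward recall (0.68).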

Which Metric to Optimize?

| Use Case | Primary Metric | Why |
|---|---|---|
| Medical diagnosis | Recall (sensitivity) | Missing a disease is costly |
| Spam detection | Precision | Blocking real emails is worse than missing spam |
| Fraud detection | F1 or AUPRC | Balance both, use PR curve to tune threshold |
| Information retrieval | Precision@K | User cares about top K results |
| Balanced classes | Accuracy | Simple, meaningful when balanced |
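
Precision@K, listed in the table for information retrieval, is the fraction of the top K ranked results that are relevant. A minimal sketch, assuming results are already sorted by score and relevance is given as 0/1 flags (the function name and the data are hypothetical):

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant (1 = relevant)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_relevance[:k]
    # Dividing by k (not len(top_k)) penalizes result lists shorter than k.
    return sum(top_k) / k

# Relevance flags for ten results, best-ranked first.
ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

p_at_3 = precision_at_k(ranked, 3)
p_at_5 = precision_at_k(ranked, 5)
print(f"P@3: {p_at_3:.3f}")  # P@3: 0.667
print(f"P@5: {p_at_5:.3f}")  # P@5: 0.600
```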

When classes are imbalanced, prefer F1 or AUPRC over accuracy. When the cost of FP ≠ cost of FN, tune the threshold explicitly using the precision-recall curve.
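
One possible way to tune the threshold for unequal error costs is to assign each error type a cost and pick the threshold with the lowest total. The sketch below does this on made-up labels and scores; the cost values and the helper `cheapest_threshold` are assumptions for illustration, and real costs would come from the application.

```python
import numpy as np

# Made-up labels and scores, for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.70, 0.80, 0.90])

def cheapest_threshold(cost_fp, cost_fn):
    """Return (threshold, total_cost) minimizing cost over the observed scores."""
    candidates = np.unique(y_score)
    costs = []
    for t in candidates:
        y_pred = y_score >= t
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))
    return float(candidates[best]), float(costs[best])

# A missed positive costs 5x a false alarm (e.g. screening).
fn_costly = cheapest_threshold(cost_fp=1, cost_fn=5)
# A false alarm costs 5x a missed positive (e.g. spam filtering).
fp_costly = cheapest_threshold(cost_fp=5, cost_fn=1)

print(f"FN costly: threshold={fn_costly[0]:.2f}, cost={fn_costly[1]:.0f}")
# FN costly: threshold=0.30, cost=2
print(f"FP costly: threshold={fp_costly[0]:.2f}, cost={fp_costly[1]:.0f}")
# FP costly: threshold=0.70, cost=2
```

Expensive false negatives push the threshold down to favor recall, and expensive false positives push it up to favor precision.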