Precision, Recall, and F1 Score
Accuracy is often the wrong metric. When classes are imbalanced (say 99% negative and 1% positive), a model that predicts negative for every input reaches 99% accuracy while never finding a single positive. Precision, recall, and F1 score instead measure performance on the positive class, which is usually the one that matters.
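The 99% figure is easy to reproduce by hand. The sketch below uses made-up data (the names `toy_true` and `toy_pred` are illustrative, not from any library) and a "model" that always predicts negative:

```python
# Hypothetical data: 990 negatives (0) and 10 positives (1), a 99% / 1% split
toy_true = [0] * 990 + [1] * 10

# A "model" that predicts negative for every sample
toy_pred = [0] * len(toy_true)

# Accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(toy_true, toy_pred))
accuracy = correct / len(toy_true)

# How many of the actual positives were caught?
positives_total = sum(toy_true)
positives_found = sum(1 for t, p in zip(toy_true, toy_pred) if t == 1 and p == 1)

print(f"Accuracy: {accuracy:.2%}")                                  # Accuracy: 99.00%
print(f"Positives found: {positives_found} of {positives_total}")   # Positives found: 0 of 10
```

The accuracy looks excellent even though the model never detects the class of interest.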
The Metrics
TP, FP, and FN below are the counts of true positives, false positives, and false negatives for the positive class.

Precision = TP / (TP + FP)
→ "Of all my positive predictions, how many are actually positive?"
→ Measures prediction reliability

Recall (Sensitivity) = TP / (TP + FN)
→ "Of all actual positives, how many did I find?"
→ Measures coverage

F1 Score = 2 × Precision × Recall / (Precision + Recall)
→ Harmonic mean: a single score balancing both
→ Close to the lower of the two (punishes imbalance between precision and recall)

The Precision-Recall Trade-off
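Adjusting the classification threshold typically moves precision and recall in opposite directions. Raising the threshold can only lower recall (or leave it unchanged) and usually, though not always, raises precision; lowering it does the reverse. The precision-recall curve traces this trade-off across all thresholds: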
```python
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Get probability scores for the positive class
# (assumes a fitted binary classifier `model` and held-out X_test, y_test)
y_prob = model.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.fill_between(recalls, precisions, alpha=0.2)
plt.show()
```
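The same trade-off can be seen without a trained model. The following is a minimal sketch on hand-made scores; the helper `precision_recall_at` and the `toy_*` arrays are hypothetical names introduced only for this illustration, and the convention of reporting precision as 1.0 when nothing is predicted positive is an assumption.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: precision is 1.0 when there are no positive predictions
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities, for illustration only
toy_true = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
toy_score = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]

for threshold in (0.3, 0.5, 0.75):
    p, r = precision_recall_at(toy_true, toy_score, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")

# threshold=0.30  precision=0.57  recall=1.00
# threshold=0.50  precision=0.60  recall=0.75
# threshold=0.75  precision=1.00  recall=0.50
```

On this toy data, each increase in the threshold trades recall for precision.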
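```python
# Area Under PR Curve (AUPRC): higher = better
# average_precision_score summarizes the PR curve as a weighted mean of
# the precision at each threshold, weighted by the increase in recall
from sklearn.metrics import average_precision_score

auprc = average_precision_score(y_test, y_prob)
print(f"AUPRC: {auprc:.4f}")
```

Choosing a Threshold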
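```python
import numpy as np

# Find the threshold that maximizes F1.
# precisions and recalls have one more element than thresholds (the final
# point, precision=1 and recall=0, has no threshold), so drop it first.
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-8)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
print(f"Best threshold: {best_threshold:.3f}, F1: {f1_scores[best_idx]:.4f}")

# Apply custom threshold
y_pred_custom = (y_prob >= best_threshold).astype(int)
```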
```python
# Or optimize for recall (catch as many positives as possible)
# Find the highest threshold where recall >= 0.95
# (recalls are returned in decreasing order, so take the last matching index)
recall_target_idx = np.where(recalls >= 0.95)[0][-1]
high_recall_threshold = thresholds[recall_target_idx]
```

sklearn Classification Report
```python
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))

# Output:
#               precision    recall  f1-score   support
#
#       Normal       0.99      0.99      0.99      9800
#        Fraud       0.72      0.68      0.70       200
#
#     accuracy                           0.98     10000
#    macro avg       0.86      0.84      0.85     10000
# weighted avg       0.98      0.98      0.98     10000
```

Macro vs. Micro vs. Weighted Averaging
For multiclass problems, you need to combine per-class metrics:
```python
from sklearn.metrics import f1_score

# Macro: simple average of per-class scores
# Use when all classes matter equally, regardless of size
macro_f1 = f1_score(y_test, y_pred, average='macro')

# Micro: aggregate TP/FP/FN across all classes, then compute
# Use when each instance matters equally
# (for single-label multiclass problems this equals accuracy)
micro_f1 = f1_score(y_test, y_pred, average='micro')

# Weighted: average of per-class scores, weighted by class support (number of samples)
# Use when class imbalance exists and larger classes matter more
weighted_f1 = f1_score(y_test, y_pred, average='weighted')
```
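To make the three averaging modes concrete, here is a minimal hand computation on a small three-class example. The labels and the helper names (`per_class_counts`, `f1_from_counts`, `toy_true`, `toy_pred`) are hypothetical and exist only for this sketch; it uses the identity F1 = 2·TP / (2·TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall.

```python
def per_class_counts(y_true, y_pred, label):
    """TP, FP, FN for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels: class 0 is common, class 2 is rare and never predicted
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
labels = [0, 1, 2]

counts = {c: per_class_counts(toy_true, toy_pred, c) for c in labels}
f1s = {c: f1_from_counts(*counts[c]) for c in labels}
support = {c: toy_true.count(c) for c in labels}

# Macro: unweighted mean of the per-class F1 scores
macro = sum(f1s.values()) / len(labels)

# Micro: pool TP/FP/FN over all classes, then compute a single F1
tp_all = sum(tp for tp, _, _ in counts.values())
fp_all = sum(fp for _, fp, _ in counts.values())
fn_all = sum(fn for _, _, fn in counts.values())
micro = f1_from_counts(tp_all, fp_all, fn_all)

# Weighted: mean of per-class F1 scores, weighted by class support
weighted = sum(f1s[c] * support[c] for c in labels) / len(toy_true)

print(f"macro={macro:.4f}  micro={micro:.4f}  weighted={weighted:.4f}")
# macro=0.4786  micro=0.7000  weighted=0.6615
```

The rare class, which the classifier misses entirely, drags the macro average down far more than the other two. The micro value (0.70) matches the accuracy on this data, since 7 of the 10 predictions are correct.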
```python
print(f"Macro F1: {macro_f1:.4f}")
print(f"Micro F1: {micro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

F-Beta Score: Weighting Recall vs. Precision
F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
β < 1: Precision matters more (penalize false positives)
β = 1: Equal weight (standard F1)
β > 1: Recall matters more (penalize false negatives)

```python
from sklearn.metrics import fbeta_score

# Medical screening: missing a disease (FN) is much worse than a false alarm (FP)
# β=2 means recall is twice as important as precision
f2 = fbeta_score(y_test, y_pred, beta=2)
print(f"F2 Score: {f2:.4f}")
```
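The effect of β is easiest to see by holding precision and recall fixed and varying β alone. This is a small sketch of the F_β formula above; the function name `fbeta_from_pr` and the example values (precision 0.9, recall 0.5) are made up for illustration.

```python
def fbeta_from_pr(precision, recall, beta):
    """F-beta from precision and recall; returns 0.0 when the denominator is 0."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Hypothetical classifier: high precision, low recall
p, r = 0.9, 0.5

for beta in (0.5, 1, 2):
    print(f"beta={beta}: F={fbeta_from_pr(p, r, beta):.3f}")

# beta=0.5: F=0.776
# beta=1: F=0.643
# beta=2: F=0.549
```

Because this hypothetical classifier is weak on recall, its score drops as β grows and recall receives more weight.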
```python
# Spam filtering: user tolerates missed spam, hates losing legitimate mail
# β=0.5 means precision is twice as important as recall
f_half = fbeta_score(y_test, y_pred, beta=0.5)
```

Which Metric to Optimize?
| Use Case | Primary Metric | Why |
|---|---|---|
| Medical diagnosis | Recall (sensitivity) | Missing a disease is costly |
| Spam detection | Precision | Blocking real emails is worse than missing spam |
| Fraud detection | F1 or AUPRC | Balance both, use PR curve to tune threshold |
| Information retrieval | Precision@K | User cares about top K results |
| Balanced classes | Accuracy | Simple, meaningful when balanced |
When classes are imbalanced, prefer F1 or AUPRC over accuracy. When the cost of FP ≠ cost of FN, tune the threshold explicitly using the precision-recall curve.
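When the two error costs are known, one possible way to tune the threshold is to pick the one that minimizes total cost on validation data. The sketch below assumes made-up per-error costs and toy scores; `total_cost`, `best_cost_threshold`, and the `toy_*` arrays are hypothetical names, and in practice the costs would come from the application.

```python
def total_cost(y_true, y_score, threshold, cost_fp, cost_fn):
    """Total misclassification cost when scores >= threshold are predicted positive."""
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    return cost_fp * fp + cost_fn * fn

def best_cost_threshold(y_true, y_score, cost_fp, cost_fn):
    """Try each distinct score as a threshold and return the cheapest one."""
    candidates = sorted(set(y_score))
    return min(candidates, key=lambda th: total_cost(y_true, y_score, th, cost_fp, cost_fn))

# Hypothetical validation labels and predicted probabilities
toy_true = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
toy_score = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]

# Missed positives cost 5x a false alarm: the threshold moves down (favor recall)
recall_heavy = best_cost_threshold(toy_true, toy_score, cost_fp=1.0, cost_fn=5.0)

# False alarms cost 5x a miss: the threshold moves up (favor precision)
precision_heavy = best_cost_threshold(toy_true, toy_score, cost_fp=5.0, cost_fn=1.0)

print(f"FN-expensive threshold: {recall_heavy:.2f}")     # FN-expensive threshold: 0.40
print(f"FP-expensive threshold: {precision_heavy:.2f}")  # FP-expensive threshold: 0.80
```

The same scores yield different operating points depending on which error is more expensive.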