Precision, Recall, and F1 Score

Accuracy is often the wrong metric. When classes are imbalanced (say 99% negative, 1% positive), a model that predicts negative for every input reaches 99% accuracy without finding a single positive. Precision, recall, and F1 score focus on the positive class, which is usually the one that matters.
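To make the 99% figure concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (990 negatives, 10 positives), and the "model" is simply a constant prediction.

```python
# Hypothetical labels: 990 negatives (0) and 10 positives (1), a 99%/1% split.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual positives that were predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.2f}")  # 0.99
print(f"Recall:   {recall:.2f}")    # 0.00
```

The accuracy looks excellent, yet none of the 10 positives is detected.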

The Metrics

Precision = TP / (TP + FP)
→ "Of all my positive predictions, how many are actually positive?"
→ Measures prediction reliability

Recall (Sensitivity) = TP / (TP + FN)
→ "Of all actual positives, how many did I find?"
→ Measures coverage

F1 Score = 2 × Precision × Recall / (Precision + Recall)
→ Harmonic mean of precision and recall: a single score balancing both
→ Pulled toward the lower of the two, so a model cannot score well by maximizing only one of them
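The three formulas above can be sketched directly from confusion-matrix counts. The helper name `precision_recall_f1` and the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute (precision, recall, F1) from confusion-matrix counts.

    Returns 0.0 for any metric whose denominator is zero
    (a convention chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 120 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=120)
arithmetic_mean = (p + r) / 2

print(f"Precision: {p:.3f}")                      # 0.800
print(f"Recall:    {r:.3f}")                      # 0.400
print(f"F1:        {f1:.3f}")                     # 0.533
print(f"Arithmetic mean: {arithmetic_mean:.3f}")  # 0.600
```

With precision 0.8 and recall 0.4, F1 (0.533) sits below the arithmetic mean (0.6), closer to the weaker of the two metrics.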

The Precision-Recall Trade-off

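Adjusting the classification threshold moves precision and recall in opposite directions. Raising the threshold makes positive predictions more selective, which lowers recall and usually raises precision; lowering it does the reverse: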

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Get probability scores
y_prob = model.predict_proba(X_test)[:, 1]

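# precisions and recalls have len(thresholds) + 1 elements: the final point
# (precision=1, recall=0) has no corresponding threshold
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.fill_between(recalls, precisions, alpha=0.2)
plt.show()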

# Average precision summarizes the PR curve as a single number
# (commonly reported as AUPRC); higher is better
from sklearn.metrics import average_precision_score
auprc = average_precision_score(y_test, y_prob)
print(f"AUPRC: {auprc:.4f}")
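The trade-off can also be seen by hand on a tiny example. The sketch below uses plain Python with made-up scores and labels. The helper `precision_recall_at` is hypothetical, and treating precision as 1.0 when nothing is predicted positive is a convention chosen here.

```python
# Hypothetical labels and predicted probabilities for eight examples.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.85, 0.95]

def precision_recall_at(threshold, y_true, y_prob):
    """Precision and recall when predicting positive for scores >= threshold."""
    y_pred = [int(score >= threshold) for score in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # Convention for this sketch: precision is 1.0 with no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(threshold, y_true, y_prob)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.3  precision=0.57  recall=1.00
# threshold=0.5  precision=0.60  recall=0.75
# threshold=0.9  precision=1.00  recall=0.25
```

On this toy data, each increase in the threshold buys precision at the cost of recall. On real data precision does not always rise monotonically, which is one reason to inspect the full curve.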

Choosing a Threshold

import numpy as np

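# Find the threshold that maximizes F1.
# precisions and recalls have one more element than thresholds (the final
# precision=1, recall=0 point has no threshold), so drop it before indexing.
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-8)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
print(f"Best threshold: {best_threshold:.3f}, F1: {f1_scores[best_idx]:.4f}")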

# Apply custom threshold
y_pred_custom = (y_prob >= best_threshold).astype(int)

# Or optimize for recall (catch as many positives as possible):
# pick the highest threshold that still gives recall >= 0.95.
# Recall never increases as the threshold rises, so take the last match.
recall_target_idx = np.where(recalls[:-1] >= 0.95)[0][-1]
high_recall_threshold = thresholds[recall_target_idx]

sklearn Classification Report

from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))

# Example output (illustrative):
#               precision    recall  f1-score   support
#
#       Normal       0.99      0.99      0.99      9800
#        Fraud       0.72      0.68      0.70       200
#
#     accuracy                           0.99     10000
#    macro avg       0.86      0.84      0.85     10000
# weighted avg       0.99      0.99      0.99     10000

Macro vs. Micro vs. Weighted Averaging

For multiclass problems, per-class precision, recall, and F1 must be combined into a single number. The averaging method determines how much each class counts:

from sklearn.metrics import f1_score

# Macro: simple average of per-class scores
# Use when all classes matter equally, regardless of size
macro_f1 = f1_score(y_test, y_pred, average='macro')

# Micro: aggregate TP/FP/FN across all classes, then compute
# Use when each instance matters equally
# (for single-label multiclass data, micro F1 equals accuracy)
micro_f1 = f1_score(y_test, y_pred, average='micro')

# Weighted: weighted average by class support (number of samples)
# Use when class imbalance exists and larger classes matter more
weighted_f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Macro F1:    {macro_f1:.4f}")
print(f"Micro F1:    {micro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
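To show what the three averaging modes compute, the sketch below recomputes them by hand in plain Python. The 3-class labels are hypothetical, and the helper `f1_from_counts` uses the equivalent form F1 = 2TP / (2TP + FP + FN).

```python
from collections import Counter

# Hypothetical 3-class labels, imbalanced toward class "a".
y_true = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
y_pred = ["a", "a", "a", "a", "a", "b",   # class a: 5 of 6 correct
          "b", "b", "a",                  # class b: 2 of 3 correct
          "a"]                            # class c: 0 of 1 correct

def f1_from_counts(tp, fp, fn):
    """F1 written as 2TP / (2TP + FP + FN); 0.0 if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

classes = sorted(set(y_true))
counts = {}
for c in classes:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    counts[c] = (tp, fp, fn)

support = Counter(y_true)
per_class_f1 = {c: f1_from_counts(*counts[c]) for c in classes}

# Macro: unweighted mean of per-class F1.
macro_f1 = sum(per_class_f1.values()) / len(classes)
# Weighted: per-class F1 weighted by the number of true samples in each class.
weighted_f1 = sum(per_class_f1[c] * support[c] for c in classes) / len(y_true)
# Micro: pool TP/FP/FN over all classes, then compute F1 once.
total_tp = sum(v[0] for v in counts.values())
total_fp = sum(v[1] for v in counts.values())
total_fn = sum(v[2] for v in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Macro F1:    {macro_f1:.4f}")     # 0.4786
print(f"Micro F1:    {micro_f1:.4f}")     # 0.7000
print(f"Weighted F1: {weighted_f1:.4f}")  # 0.6615
print(f"Accuracy:    {accuracy:.4f}")     # 0.7000
```

The missed rare class "c" drags macro F1 down sharply, barely affects weighted F1, and micro F1 coincides with accuracy.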

F-Beta Score: Weighting Recall vs. Precision

F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

β < 1: Precision matters more (penalize false positives)
β = 1: Equal weight (standard F1)
β > 1: Recall matters more (penalize false negatives)

from sklearn.metrics import fbeta_score

# Medical screening: missing a disease (FN) is much worse than a false alarm (FP)
# β=2 means recall is twice as important as precision
f2 = fbeta_score(y_test, y_pred, beta=2)
print(f"F2 Score: {f2:.4f}")

# Spam filtering: user tolerates missed spam, hates losing legitimate mail
# β=0.5 means precision is twice as important as recall
f_half = fbeta_score(y_test, y_pred, beta=0.5)
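The effect of β is easiest to see by evaluating the formula directly. This is a minimal sketch with a hypothetical helper `f_beta` and made-up precision and recall values, not the scikit-learn implementation.

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall; 0.0 if the denominator is zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

# Hypothetical classifier: high precision, modest recall.
precision, recall = 0.9, 0.5

f_half_manual = f_beta(precision, recall, beta=0.5)
f1_manual = f_beta(precision, recall, beta=1.0)
f2_manual = f_beta(precision, recall, beta=2.0)

print(f"F0.5: {f_half_manual:.4f}")  # 0.7759
print(f"F1:   {f1_manual:.4f}")      # 0.6429
print(f"F2:   {f2_manual:.4f}")      # 0.5488
```

Because this classifier is stronger on precision than recall, its score drops as β grows and recall receives more weight.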

Which Metric to Optimize?

Use Case	Primary Metric	Why
Medical diagnosis	Recall (sensitivity)	Missing a disease is costly
Spam detection	Precision	Blocking real emails is worse than missing spam
Fraud detection	F1 or AUPRC	Balance both, use PR curve to tune threshold
Information retrieval	Precision@K	User cares about top K results
Balanced classes	Accuracy	Simple, meaningful when balanced
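Precision@K from the table can be sketched in a few lines. The helper `precision_at_k` and the relevance list are hypothetical. Dividing by K even when fewer than K results are returned is one common convention, assumed here.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked results that are relevant.

    ranked_relevance: 0/1 relevance flags, ordered from best-ranked to worst.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(ranked_relevance[:k]) / k

# Hypothetical ranking of ten results: 1 = relevant, 0 = not relevant.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(f"P@3: {precision_at_k(ranked, 3):.3f}")  # 0.667
print(f"P@5: {precision_at_k(ranked, 5):.3f}")  # 0.600
```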

When classes are imbalanced, prefer F1 or AUPRC over accuracy. When the cost of FP ≠ cost of FN, tune the threshold explicitly using the precision-recall curve.
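When the two error costs are known, one possible way to tune the threshold is to minimize total cost directly rather than reading a point off the PR curve. The sketch below assumes made-up scores, labels, and cost values. The helpers `total_cost` and `best_threshold` are hypothetical.

```python
# Hypothetical labels and predicted probabilities for eight examples.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.85, 0.95]

def total_cost(threshold, cost_fp, cost_fn):
    """Total misclassification cost when predicting positive for scores >= threshold."""
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, y_prob))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, y_prob))
    return cost_fp * fp + cost_fn * fn

def best_threshold(cost_fp, cost_fn):
    """Try each observed score as a threshold and keep the cheapest one."""
    candidates = sorted(set(y_prob))
    return min(candidates, key=lambda t: total_cost(t, cost_fp, cost_fn))

# A false negative costs 5x a false positive (e.g. screening): low threshold.
print(best_threshold(cost_fp=1.0, cost_fn=5.0))  # 0.4
# A false positive costs 5x a false negative (e.g. spam): high threshold.
print(best_threshold(cost_fp=5.0, cost_fn=1.0))  # 0.95
```

In practice the threshold would be chosen on a validation set rather than the test set, so the reported metrics stay unbiased.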