Confusion Matrix: Understanding Classification Model Performance

Master confusion matrices — true/false positives and negatives, reading a confusion matrix, multi-class variants, normalization, and extracting actionable insights.


A confusion matrix shows exactly where a classifier makes mistakes — not just how often, but what kind. A model with 90% accuracy can still be useless if it fails systematically on the minority class. The confusion matrix reveals what accuracy hides.
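To make the accuracy claim concrete, here is a minimal sketch using made-up data: a hypothetical test set with 90 negatives and 10 positives, and a degenerate "model" that always predicts the majority class. The labels and counts are assumptions chosen purely for illustration.

```python
# Hypothetical imbalanced test set: 90 negatives (0) and 10 positives (1)
y_true = [0] * 90 + [1] * 10
# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

# Fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of actual positives that were predicted positive
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"Accuracy: {accuracy:.0%}")
print(f"Recall on the minority class: {recall:.0%}")
```

Accuracy looks healthy here even though not a single positive case is detected, which is exactly the failure a confusion matrix exposes.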


Structure

For binary classification:

                       Predicted
                       Positive    Negative
Actual    Positive     [ TP ]      [ FN ]      ← Actual positives
          Negative     [ FP ]      [ TN ]      ← Actual negatives

TP = True Positive: Predicted Positive, Actually Positive ✓
TN = True Negative: Predicted Negative, Actually Negative ✓
FP = False Positive: Predicted Positive, Actually Negative ✗ (Type I Error)
FN = False Negative: Predicted Negative, Actually Positive ✗ (Type II Error)
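The four definitions can be checked by counting them directly. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the label lists and the helper name `count_cells` are hypothetical, introduced only for illustration.

```python
# Hypothetical labels for illustration: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

def count_cells(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary labels encoded as 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted positive, actually positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted negative, actually positive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, actually negative
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted negative, actually negative
    return tp, fn, fp, tn

tp, fn, fp, tn = count_cells(y_true, y_pred)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

The four counts always add up to the number of samples, which is a quick sanity check on any confusion matrix.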

Computing a Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Assumes a fitted classifier `model` and a held-out test split (X_test, y_test)
y_pred = model.predict(X_test)

# Raw confusion matrix (rows = actual, columns = predicted).
# scikit-learn sorts the labels, so with 0/1 labels the negative class comes first:
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN FP]
#  [FN TP]]

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues', colorbar=False)
plt.title('Confusion Matrix')
plt.show()
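Note that the scikit-learn layout, with the negative class first, is the mirror image of the positive-first diagram in the Structure section. The sketch below uses hypothetical counts in a plain NumPy array (standing in for the output of `confusion_matrix`) to show how the four cells can be unpacked by name and how to flip the matrix into the positive-first layout.

```python
import numpy as np

# Hypothetical counts in scikit-learn's layout (labels sorted as 0, 1):
# [[TN, FP],
#  [FN, TP]]
cm = np.array([[50, 5],
               [8, 37]])

# Flatten row by row to unpack the four cells by name
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Reverse both axes to obtain the positive-first layout:
# [[TP, FN],
#  [FP, TN]]
cm_positive_first = cm[::-1, ::-1]
print(cm_positive_first)
```

Passing `labels=[1, 0]` to `confusion_matrix` should produce the positive-first ordering directly, assuming the classes are encoded as 0 and 1.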

Derived Metrics

Accuracy = (TP + TN) / Total — Overall correct rate
Precision = TP / (TP + FP) — Of predicted positives, how many are actually positive?
Recall = TP / (TP + FN) — Of actual positives, how many did we catch?
Specificity = TN / (TN + FP) — Of actual negatives, how many did we correctly exclude?
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
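As a rough sketch, the five formulas translate into a small helper. The function name `derived_metrics` and the example counts are hypothetical; returning `None` when a denominator is zero is one possible design choice, not the only one.

```python
def derived_metrics(tp, fn, fp, tn):
    """Compute the five metrics from the four cell counts.

    A metric whose denominator is zero is returned as None.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": f1,
    }

# Hypothetical counts for illustration
metrics = derived_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Guarding the denominators matters in practice: a model that never predicts the positive class has TP + FP = 0, so its precision is undefined rather than zero.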

The trade-off: Raising recall often lowers precision (and vice versa). Lowering the classification threshold catches more positives (higher recall) but also more false alarms (lower precision).
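The trade-off can be seen by sweeping the threshold over a handful of scored examples. The scores, labels, and the helper `precision_recall_at` below are hypothetical and serve only to illustrate the pattern.

```python
# Hypothetical predicted probabilities and true labels (1 = positive)
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
y_true = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

def precision_recall_at(threshold, scores, y_true):
    """Predict positive when score >= threshold, then compute precision and recall."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Lower thresholds flag more samples as positive
for threshold in (0.8, 0.5, 0.15):
    precision, recall = precision_recall_at(threshold, scores, y_true)
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this toy data, recall climbs from 0.4 to 1.0 as the threshold drops, while precision falls from 1.0 to 0.625.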


Reading a Confusion Matrix: Example

Medical test: Disease detection

                    Predicted Disease    Predicted Healthy
Actual Disease             85                   15            ← 100 sick patients
Actual Healthy             10                   90            ← 100 healthy patients

Precision = 85 / (85+10) = 89.5% (most positive predictions are correct)
Recall = 85 / (85+15) = 85.0% (caught 85% of sick patients)
Accuracy = (85+90) / 200 = 87.5%

The 15 False Negatives (sick patients diagnosed as healthy) are the critical failures.
The 10 False Positives (healthy patients flagged) cause unnecessary anxiety and follow-ups.
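The same table also yields the two remaining metrics from the Derived Metrics list. As a worked check using only the counts in the table:

```latex
\begin{align*}
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{90}{90 + 10} = 90.0\% \\
F_1 &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2 \times 0.895 \times 0.850}{0.895 + 0.850} \approx 87.2\%
\end{align*}
```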

Multiclass Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

y_pred = model.predict(X_test)
# Display names must follow the sorted label order used by confusion_matrix
class_names = ['Cat', 'Dog', 'Bird']
cm = confusion_matrix(y_test, y_pred)

# Normalize by true class (row normalization): each row sums to 1
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
ConfusionMatrixDisplay(cm, display_labels=class_names).plot(ax=axes[0], cmap='Blues')
axes[0].set_title('Raw Counts')

# Normalized
ConfusionMatrixDisplay(cm_normalized, display_labels=class_names).plot(ax=axes[1], cmap='Blues')
axes[1].set_title('Normalized (Per True Class)')
plt.show()
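Row normalization is only one of the possible views. The sketch below, using hypothetical three-class counts, contrasts it with column normalization: after dividing by row totals the diagonal holds per-class recall, and after dividing by column totals it holds per-class precision.

```python
import numpy as np

# Hypothetical counts: rows = actual, columns = predicted (Cat, Dog, Bird)
cm = np.array([[40, 8, 2],
               [10, 35, 5],
               [1, 4, 45]])

# Row normalization: each row sums to 1, diagonal = per-class recall
cm_by_true = cm / cm.sum(axis=1, keepdims=True)

# Column normalization: each column sums to 1, diagonal = per-class precision
cm_by_pred = cm / cm.sum(axis=0, keepdims=True)

print("Recall per class:   ", np.round(np.diag(cm_by_true), 3))
print("Precision per class:", np.round(np.diag(cm_by_pred), 3))
```

scikit-learn's `confusion_matrix` also accepts a `normalize` argument (`'true'`, `'pred'`, or `'all'`) that performs these divisions for you.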

What to Look For

Patterns in the confusion matrix:

  1. Diagonal dominance: Most predictions fall on the diagonal (correct predictions), which indicates a well-performing model
  2. Off-diagonal clusters: Systematic confusions (e.g., cats confused with dogs but not birds)
  3. Row with many errors: The model struggles with a specific class
  4. Column with many false positives: Model over-predicts a specific class
# Find the most confused class pairs
import numpy as np
# Zero out diagonal (true positives)
cm_errors = cm.copy()
np.fill_diagonal(cm_errors, 0)
# Find worst confusion pair
row, col = np.unravel_index(cm_errors.argmax(), cm_errors.shape)
print(f"Most confused: {class_names[row]} predicted as {class_names[col]} ({cm_errors[row, col]} errors)")
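Patterns 3 and 4 from the list can be quantified in the same way. This sketch, on hypothetical counts, sums the off-diagonal entries per row (samples of a class that were missed) and per column (samples wrongly assigned to a class), then ranks the three largest confusion pairs.

```python
import numpy as np

class_names = ['Cat', 'Dog', 'Bird']
# Hypothetical counts: rows = actual, columns = predicted
cm = np.array([[40, 8, 2],
               [12, 33, 5],
               [1, 2, 47]])

# Keep only the errors by zeroing the diagonal
errors = cm.copy()
np.fill_diagonal(errors, 0)

missed_per_class = errors.sum(axis=1)        # row totals: errors on each true class
false_alarms_per_class = errors.sum(axis=0)  # column totals: wrong predictions of each class

hardest = class_names[int(missed_per_class.argmax())]
over_predicted = class_names[int(false_alarms_per_class.argmax())]
print(f"Hardest class: {hardest}, most over-predicted class: {over_predicted}")

# Rank the three largest off-diagonal cells (flattened indices, descending)
top_pairs = []
for flat_index in np.argsort(errors, axis=None)[::-1][:3]:
    r, c = np.unravel_index(flat_index, errors.shape)
    top_pairs.append((class_names[r], class_names[c], int(errors[r, c])))
    print(f"{class_names[r]} predicted as {class_names[c]}: {errors[r, c]} errors")
```

Ranking several pairs rather than only the worst one helps separate a single dominant confusion from errors that are spread evenly across classes.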

Adjusting for Business Context

Not all errors are equal. The confusion matrix helps quantify the cost of each error type:

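# Cost matrix: rows = actual, cols = predicted, in the same order as cm
# (scikit-learn layout: [[TN, FP], [FN, TP]], negative class first)
cost_matrix = np.array([
    [0, 10],   # Actual legitimate: false alarm (FP) costs $10 (investigation time)
    [500, 0]   # Actual fraud: missing a fraud case (FN) costs $500
])
# Element-wise product pairs each count with its cost
total_cost = np.sum(cm * cost_matrix)
print(f"Total cost on the test set: ${total_cost}")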

This lets you optimize the decision threshold to minimize business cost, not just maximize accuracy.
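One way to carry out that optimization is to evaluate the total cost at several candidate thresholds and keep the cheapest. The sketch below assumes the model outputs a fraud probability per transaction; the scores, labels, candidate thresholds, and the helper `total_cost_at` are hypothetical, while the $500 and $10 costs are taken from the example above.

```python
# Hypothetical fraud probabilities and true labels (1 = fraud)
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.25, 0.15, 0.08, 0.02]
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

COST_FALSE_NEGATIVE = 500  # missed fraud case
COST_FALSE_POSITIVE = 10   # false alarm (investigation time)

def total_cost_at(threshold):
    """Total cost when every score >= threshold is flagged as fraud."""
    fn = sum(1 for s, t in zip(scores, y_true) if t == 1 and s < threshold)
    fp = sum(1 for s, t in zip(scores, y_true) if t == 0 and s >= threshold)
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE

# Evaluate a small grid of candidate thresholds
candidate_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
costs = {t: total_cost_at(t) for t in candidate_thresholds}
best_threshold = min(costs, key=costs.get)

for t, cost in costs.items():
    print(f"threshold={t:.1f}  cost=${cost}")
print(f"Cheapest threshold: {best_threshold}")
```

Because a missed fraud case costs fifty times more than a false alarm, the cheapest threshold on this toy data sits well below the default of 0.5. In practice the threshold should be chosen on a validation set rather than on the test set.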