Confusion Matrix

A confusion matrix shows exactly where a classifier makes mistakes — not just how often, but what kind. A model with 90% accuracy can still be useless if it fails systematically on the minority class. The confusion matrix reveals what accuracy hides.

Structure

For binary classification:

                    Predicted
                  Positive  Negative
Actual  Positive  [  TP  ]  [  FN  ]   ← Actual positives
        Negative  [  FP  ]  [  TN  ]   ← Actual negatives

TP = True Positive:  Predicted Positive, Actually Positive  ✓
TN = True Negative:  Predicted Negative, Actually Negative  ✓
FP = False Positive: Predicted Positive, Actually Negative  ✗ (Type I Error)
FN = False Negative: Predicted Negative, Actually Positive  ✗ (Type II Error)

Computing a Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

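# Assumes a fitted classifier `model` and a held-out test set X_test, y_test
y_pred = model.predict(X_test)

# Raw confusion matrix (rows = actual, columns = predicted)
cm = confusion_matrix(y_test, y_pred)
print(cm)
# scikit-learn sorts the labels, so with labels 0 (negative) and 1 (positive)
# the negative class comes first, unlike the diagram above:
# [[TN FP]
#  [FN TP]]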

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues', colorbar=False)
plt.title('Confusion Matrix')
plt.show()
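
For a binary problem the four cells can be unpacked by name. The sketch below uses made-up labels (y_true and y_pred here are illustrative, not the output of any real model) to show the order in which scikit-learn returns them:

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
# TP=3 FN=1 FP=1 TN=5
```

Passing labels=[1, 0] to confusion_matrix puts the positive class first, which reproduces the [[TP, FN], [FP, TN]] layout of the diagram in the Structure section.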

Derived Metrics

Accuracy    = (TP + TN) / Total       — Overall correct rate
Precision   = TP / (TP + FP)          — Of predicted positives, how many are actually positive?
Recall      = TP / (TP + FN)          — Of actual positives, how many did we catch?
Specificity = TN / (TN + FP)          — Of actual negatives, how many did we correctly exclude?
F1 Score    = 2 × (Precision × Recall) / (Precision + Recall)

The trade-off: Raising recall often lowers precision (and vice versa). Lowering the classification threshold catches more positives (higher recall) but also more false alarms (lower precision).
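
The trade-off can be seen by sweeping the threshold over a small set of scores. This is a minimal sketch on hypothetical data: y_true, scores, and the helper precision_recall_at are invented for illustration.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities (illustration only)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

def precision_recall_at(threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Lower thresholds flag more samples as positive
for t in (0.7, 0.5, 0.3):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.7  precision=0.67  recall=0.50
# threshold=0.5  precision=0.60  recall=0.75
# threshold=0.3  precision=0.57  recall=1.00
```

On this toy data, each step down in threshold raises recall and lowers precision. Real score distributions need not be this monotonic, which is why the text says "often".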

Reading a Confusion Matrix: Example

Medical test: Disease detection
              Predicted Disease  Predicted Healthy
Actual Disease      85               15          ← 100 sick patients
Actual Healthy      10               90          ← 100 healthy patients

Precision = 85 / (85+10) = 89.5%  (most positive predictions are correct)
Recall    = 85 / (85+15) = 85.0%  (caught 85% of sick patients)
Accuracy  = (85+90) / 200 = 87.5%

The 15 False Negatives (sick patients diagnosed as healthy) are the critical failures.
The 10 False Positives (healthy patients flagged) cause unnecessary anxiety and follow-ups.
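
The arithmetic above can be checked directly from the four counts in the table. The short sketch below also computes specificity and F1 for the same example, using the formulas from the Derived Metrics section:

```python
# Counts taken from the disease-detection table above
tp, fn = 85, 15   # actual disease row
fp, tn = 10, 90   # actual healthy row

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision   = {precision:.1%}")    # 89.5%
print(f"Recall      = {recall:.1%}")       # 85.0%
print(f"Specificity = {specificity:.1%}")  # 90.0%
print(f"Accuracy    = {accuracy:.1%}")     # 87.5%
print(f"F1 Score    = {f1:.1%}")           # 87.2%
```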

Multiclass Confusion Matrix

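from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

# Assumes a fitted classifier `model` and a held-out test set X_test, y_test
y_pred = model.predict(X_test)

# Must follow the label order confusion_matrix uses (sorted unique labels),
# e.g. integer-encoded labels 0 = Cat, 1 = Dog, 2 = Bird
class_names = ['Cat', 'Dog', 'Bird']

cm = confusion_matrix(y_test, y_pred)

# Normalize by true class (row normalization): each row sums to 1,
# so the diagonal holds the per-class recall
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)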

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
ConfusionMatrixDisplay(cm, display_labels=class_names).plot(ax=axes[0], cmap='Blues')
axes[0].set_title('Raw Counts')

# Normalized
ConfusionMatrixDisplay(cm_normalized, display_labels=class_names).plot(ax=axes[1], cmap='Blues')
axes[1].set_title('Normalized (Per True Class)')
plt.show()
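
Row normalization does not have to be done by hand: confusion_matrix accepts a normalize argument. The sketch below, on made-up integer labels (the 0 = Cat, 1 = Dog, 2 = Bird encoding is an assumption for illustration), checks that normalize='true' matches the manual division:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative integer labels: 0 = Cat, 1 = Dog, 2 = Bird
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0, 2, 2]

# Manual row normalization
cm = confusion_matrix(y_true, y_pred)
manual = cm.astype(float) / cm.sum(axis=1, keepdims=True)

# normalize='true' divides each row by the number of true samples in that class
builtin = confusion_matrix(y_true, y_pred, normalize='true')

print(np.allclose(manual, builtin))  # True
print(builtin)
```

normalize='pred' divides each column by the number of predictions for that class instead, and normalize='all' divides every cell by the total sample count.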

What to Look For

Patterns in the confusion matrix:

Diagonal dominance: Most predictions fall on the diagonal (the correct predictions), the sign of a well-performing model
Off-diagonal clusters: Systematic confusions (e.g., cats confused with dogs but not birds)
Row with many errors: The model struggles with a specific class
Column with many false positives: Model over-predicts a specific class

# Find the most confused class pairs
import numpy as np

# Zero out the diagonal (correct predictions) so only errors remain
cm_errors = cm.copy()
np.fill_diagonal(cm_errors, 0)

# Find worst confusion pair
row, col = np.unravel_index(cm_errors.argmax(), cm_errors.shape)
print(f"Most confused: {class_names[row]} → {class_names[col]} ({cm_errors[row, col]} errors)")
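
The row and column patterns listed above can also be put into numbers: dividing the diagonal by the row sums gives per-class recall, and dividing it by the column sums gives per-class precision. A minimal sketch on a hypothetical 3-class matrix (the counts are invented for illustration):

```python
import numpy as np

# Hypothetical 3-class matrix (rows = actual, cols = predicted)
class_names = ['Cat', 'Dog', 'Bird']
cm = np.array([
    [40,  9,  1],   # actual Cat
    [12, 35,  3],   # actual Dog
    [ 2,  3, 45],   # actual Bird
])

correct = np.diag(cm)
recall_per_class = correct / cm.sum(axis=1)      # row view: how much of each class was caught
precision_per_class = correct / cm.sum(axis=0)   # column view: how trustworthy each prediction is

for name, p, r in zip(class_names, precision_per_class, recall_per_class):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# Cat: precision=0.74 recall=0.80
# Dog: precision=0.74 recall=0.70
# Bird: precision=0.92 recall=0.90
```

Here Dog has the lowest recall (a row with many errors), while Cat and Dog share the lowest precision because each absorbs the other's misclassifications.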

Adjusting for Business Context

Not all errors are equal. The confusion matrix helps quantify the cost of each error type:

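# Cost matrix: rows = actual, cols = predicted, in the same order as the
# binary confusion matrix cm: [[TN, FP], [FN, TP]] with 0 = legitimate, 1 = fraud
cost_matrix = np.array([
    [0, 10],    # False alarm costs $10 (investigation time)
    [500, 0]    # Missing a fraud case costs $500
])

# Element-wise product: each cell count times the cost of that outcome
total_cost = np.sum(cm * cost_matrix)
print(f"Total cost on the test set: ${total_cost}")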

Recomputing this cost at different decision thresholds lets you choose the threshold that minimizes business cost rather than the one that maximizes accuracy.
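
A cost-based threshold search can be sketched as follows. The labels, scores, candidate thresholds, and the helper cost_at are all hypothetical. The $10 and $500 figures come from the example above, placed so that the cost matrix lines up with scikit-learn's [[TN, FP], [FN, TP]] layout.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical fraud labels (1 = fraud) and model scores, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.02, 0.10, 0.20, 0.30, 0.40, 0.55, 0.45, 0.70, 0.80, 0.95])

# Same layout as confusion_matrix output: [[TN, FP], [FN, TP]]
cost_matrix = np.array([
    [0, 10],    # false alarm: $10
    [500, 0],   # missed fraud: $500
])

def cost_at(threshold):
    """Total cost on this data when scores >= threshold are flagged as fraud."""
    y_pred = (scores >= threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return int(np.sum(cm * cost_matrix))

thresholds = [0.25, 0.5, 0.75]
costs = {t: cost_at(t) for t in thresholds}
best = min(costs, key=costs.get)
print(costs, "best threshold:", best)
# {0.25: 40, 0.5: 520, 0.75: 500} best threshold: 0.25
```

On this toy data the threshold of 0.75 gives the highest accuracy (9 of 10 correct) but misses a fraud case. The threshold of 0.25 is less accurate (6 of 10) yet far cheaper, because four $10 false alarms cost less than one $500 miss.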