Confusion Matrix
A confusion matrix shows exactly where a classifier makes mistakes — not just how often, but what kind. A model with 90% accuracy can still be useless if it fails systematically on the minority class. The confusion matrix reveals what accuracy hides.
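To make the accuracy trap concrete, here is a minimal sketch using made-up labels: 90 negatives, 10 positives, and a hypothetical "model" that always predicts the negative class. The data and variable names are assumptions chosen for illustration, not taken from any real classifier.

```python
# Toy, made-up data: 90 negative samples (0) and 10 positive samples (1).
y_true = [0] * 90 + [1] * 10

# Hypothetical degenerate classifier: always predicts the majority class.
y_pred = [0] * 100

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of actual positives that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
minority_recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.0%}")                         # Accuracy: 90%
print(f"Recall on positive class: {minority_recall:.0%}")  # Recall on positive class: 0%
```

Accuracy looks respectable, yet every positive case is missed. That gap is exactly what the off-diagonal cells of a confusion matrix expose.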
Structure
For binary classification:
                      Predicted
                      Positive   Negative
Actual   Positive     [ TP ]     [ FN ]     ← Actual positives
         Negative     [ FP ]     [ TN ]     ← Actual negatives

TP = True Positive: Predicted Positive, Actually Positive ✓
TN = True Negative: Predicted Negative, Actually Negative ✓
FP = False Positive: Predicted Positive, Actually Negative ✗ (Type I Error)
FN = False Negative: Predicted Negative, Actually Positive ✗ (Type II Error)

Computing a Confusion Matrix
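The four counts defined above can be tallied directly from paired labels, which may help clarify what a library call does internally. The sketch below uses made-up labels (1 = positive, 0 = negative); it is an illustration in plain Python, not how any particular library implements it.

```python
# Made-up labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp = fp = fn = tn = 0
for actual, predicted in zip(y_true, y_pred):
    if actual == 1 and predicted == 1:
        tp += 1  # correctly flagged positive
    elif actual == 0 and predicted == 1:
        fp += 1  # false alarm (Type I error)
    elif actual == 1 and predicted == 0:
        fn += 1  # missed positive (Type II error)
    else:
        tn += 1  # correctly cleared negative

print(f"TP={tp} FN={fn}")  # TP=3 FN=1
print(f"FP={fp} TN={tn}")  # FP=1 TN=3
```

Every sample lands in exactly one cell, so the four counts always sum to the number of samples.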
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Assumes a fitted classifier `model` and a held-out test split (X_test, y_test)
y_pred = model.predict(X_test)

# Raw confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# scikit-learn sorts the labels, so the negative class (0) comes first:
# [[TN FP]
#  [FN TP]]

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues', colorbar=False)
plt.title('Confusion Matrix')
plt.show()

Derived Metrics
Accuracy = (TP + TN) / Total: overall correct rate
Precision = TP / (TP + FP): of predicted positives, how many are actually positive?
Recall = TP / (TP + FN): of actual positives, how many did we catch?
Specificity = TN / (TN + FP): of actual negatives, how many did we correctly exclude?
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

The trade-off: Raising recall often lowers precision (and vice versa). Lowering the classification threshold catches more positives (higher recall) but also more false alarms (lower precision).
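The trade-off can be seen by recomputing precision and recall at several thresholds. The sketch below is a minimal illustration with made-up scores; the helper name `precision_recall` and the data are assumptions, and in practice the scores would come from something like a classifier's predicted probabilities.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up true labels and model scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

for threshold in (0.75, 0.5, 0.25):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.75  precision=1.00  recall=0.50
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.25  precision=0.67  recall=1.00
```

On this toy data, each step down in threshold recovers more positives while admitting more false alarms. Real data is rarely this tidy, which is why the text says "often" rather than "always".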
Reading a Confusion Matrix: Example
Medical test: Disease detection

                  Predicted Disease   Predicted Healthy
Actual Disease           85                  15            ← 100 sick patients
Actual Healthy           10                  90            ← 100 healthy patients

Precision = 85 / (85+10) = 89.5% (most positive predictions are correct)
Recall = 85 / (85+15) = 85.0% (caught 85% of sick patients)
Accuracy = (85+90) / 200 = 87.5%
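The figures above can be checked in a few lines, and the same four counts also yield specificity and F1, following the formulas in the Derived Metrics section. This is a plain-arithmetic sketch; the variable names are illustrative.

```python
# Counts taken from the medical-test table above.
tp, fn = 85, 15   # actual disease: caught vs. missed
fp, tn = 10, 90   # actual healthy: flagged vs. cleared

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + fp + tn)
f1 = 2 * precision * recall / (precision + recall)

metrics = [
    ("Precision", precision),
    ("Recall", recall),
    ("Specificity", specificity),
    ("Accuracy", accuracy),
    ("F1", f1),
]
for name, value in metrics:
    print(f"{name:<12}{value:.1%}")
# Precision   89.5%
# Recall      85.0%
# Specificity 90.0%
# Accuracy    87.5%
# F1          87.2%
```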
The 15 False Negatives (sick patients diagnosed as healthy) are the critical failures.
The 10 False Positives (healthy patients flagged) cause unnecessary anxiety and follow-ups.

Multiclass Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

# Assumes a fitted multiclass classifier `model` and a test split (X_test, y_test)
y_pred = model.predict(X_test)
class_names = ['Cat', 'Dog', 'Bird']

cm = confusion_matrix(y_test, y_pred)

# Normalize by true class (row normalization): each row sums to 1
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
ConfusionMatrixDisplay(cm, display_labels=class_names).plot(ax=axes[0], cmap='Blues')
axes[0].set_title('Raw Counts')

# Normalized
ConfusionMatrixDisplay(cm_normalized, display_labels=class_names).plot(ax=axes[1], cmap='Blues')
axes[1].set_title('Normalized (Per True Class)')
plt.show()

What to Look For
Patterns in the confusion matrix:
- Diagonal dominance: Most predictions fall on the diagonal (correct), the sign of a well-performing model
- Off-diagonal clusters: Systematic confusions (e.g., cats confused with dogs but not birds)
- Row with many errors: The model struggles with a specific class
- Column with many false positives: Model over-predicts a specific class
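The row and column patterns in the list above correspond to per-class recall and precision, respectively. The sketch below computes both with NumPy from a made-up 3-class matrix; the counts are invented for illustration and assume rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = actual, columns = predicted).
class_names = ["Cat", "Dog", "Bird"]
cm = np.array([
    [40,  9,  1],   # actual Cat
    [12, 35,  3],   # actual Dog
    [ 2,  1, 47],   # actual Bird
])

correct = np.diag(cm)
# Row view: of each true class, what fraction was predicted correctly?
recall = correct / cm.sum(axis=1)
# Column view: of each predicted class, what fraction was actually that class?
precision = correct / cm.sum(axis=0)

for name, p, r in zip(class_names, precision, recall):
    print(f"{name:<5} precision={p:.2f}  recall={r:.2f}")
```

A class with low recall has a row full of errors (the model struggles to recognize it). A class with low precision has a column full of errors (the model over-predicts it). In this toy matrix, Dog has the lowest recall and Cat the lowest precision, both driven by the Cat/Dog confusion.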
# Find the most confused class pairs
import numpy as np

# Zero out the diagonal (correct predictions) so only errors remain
cm_errors = cm.copy()
np.fill_diagonal(cm_errors, 0)

# Find worst confusion pair
row, col = np.unravel_index(cm_errors.argmax(), cm_errors.shape)
print(f"Most confused: {class_names[row]} → {class_names[col]} ({cm_errors[row, col]} errors)")

Adjusting for Business Context
Not all errors are equal. The confusion matrix helps quantify the cost of each error type:
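# Cost matrix: rows = actual, cols = predicted, matching scikit-learn's
# layout [[TN, FP], [FN, TP]] and assuming fraud is the positive class (label 1)
cost_matrix = np.array([
    [0, 10],    # Actual legitimate: a false alarm costs $10 (investigation time)
    [500, 0]    # Actual fraud: missing a fraud case costs $500
])

# Element-wise product weights each cell's count by its cost
total_cost = np.sum(cm * cost_matrix)
print(f"Total cost on the test set: ${total_cost}")

This lets you optimize the decision threshold to minimize business cost, not just maximize accuracy.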
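One way to act on this is to sweep candidate thresholds and keep the one with the lowest total cost. The sketch below is a minimal illustration under stated assumptions: the labels and scores are made up, the helper `total_cost_at` is hypothetical, and the $500 / $10 costs are the ones from the fraud example.

```python
import numpy as np

# Costs from the fraud example: a missed fraud case (FN) costs $500,
# a false alarm (FP) costs $10.
COST_FN, COST_FP = 500, 10

def total_cost_at(y_true, scores, threshold):
    """Total error cost when scores >= threshold are flagged as fraud."""
    flagged = scores >= threshold
    fn = np.sum((y_true == 1) & ~flagged)  # fraud that slipped through
    fp = np.sum((y_true == 0) & flagged)   # legitimate cases flagged
    return int(fn * COST_FN + fp * COST_FP)

# Made-up labels (1 = fraud) and model scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.55, 0.40, 0.60, 0.70, 0.90])

thresholds = [0.25, 0.35, 0.5, 0.65, 0.8]
costs = [total_cost_at(y_true, scores, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]

for t, c in zip(thresholds, costs):
    print(f"threshold={t:.2f}  cost=${c}")
print(f"Lowest-cost threshold: {best}")  # Lowest-cost threshold: 0.35
```

On this toy data, the threshold 0.65 makes the fewest mistakes (one missed fraud case), yet it costs $500. The threshold 0.35 makes three mistakes, all cheap false alarms, for a total of $30. In practice the sweep should run on a validation set rather than the test set, so that the chosen threshold is not tuned to the data used for the final evaluation.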