Bias-Variance Tradeoff

A model's expected prediction error (under squared-error loss) can be decomposed into three parts: squared bias, variance, and irreducible noise. This decomposition tells you not just that a model is performing poorly, but why, and what to do about it.

The Decomposition

Expected Test Error = Bias² + Variance + Irreducible Noise

Bias²:             Error from wrong assumptions about the data
                   (underfitting — model too simple)

Variance:          Error from sensitivity to training data fluctuations
                   (overfitting — model too complex)

Irreducible Noise: Error from noise in the data itself — can't be reduced
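
Stated more formally, the decomposition can be written as below. This is a sketch under the standard assumption (not spelled out in the text above) that targets follow y = f(x) + ε with zero-mean noise of variance σ², and that \hat{f} is a model fit to a randomly drawn training set. The symbols f, \hat{f}, ε, and σ² are notation introduced here for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible Noise}}
```

The expectations are taken over training sets and over the noise in y. The identity is exact for squared error; for other losses, such as 0-1 loss, the split is less clean.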

Intuition with a Dartboard

Imagine throwing darts, where the bull’s-eye is the true function and each dart is the prediction of a model trained on a different sample of the data:

High Bias, Low Variance:    Low Bias, High Variance:    Low Bias, Low Variance:
(Consistent but wrong)      (Accurate but inconsistent)  (Ideal)

    ○ ○                          × ×                          ●
    ○●○                        ×●    ×                         ●●
    ○ ○                          ×   ×                          ●

Darts cluster far from        Darts spread widely            Darts cluster near
center — underfitting          around center — overfitting    center — just right
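
The dartboard picture can be made concrete with a small simulation. The sketch below is illustrative only: the sine target, noise level, polynomial degrees, and the function name estimate_bias_variance are all assumptions introduced here. Each trial plays the role of one dart, a model fit to a freshly drawn noisy training set. To keep the example simple, the inputs stay on a fixed grid and only the noise is redrawn.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_fn(x):
    # Assumed "bull's-eye": the function the models are trying to recover.
    return np.sin(2 * np.pi * x)


def estimate_bias_variance(degree, n_trials=200, n_train=30, noise_std=0.3, seed=0):
    """Estimate average bias^2 and variance of a polynomial fit by simulation."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed inputs (simplifying assumption)
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # One "dart": refit the model on a new noisy draw of the targets.
        y_train = true_fn(x_train) + rng.normal(0.0, noise_std, n_train)
        model = Polynomial.fit(x_train, y_train, degree, domain=[0, 1])
        preds[t] = model(x_test)
    mean_pred = preds.mean(axis=0)
    # Bias^2: how far the average dart lands from the bull's-eye.
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
    # Variance: how widely the darts scatter around their own average.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


for degree in (1, 3, 9):
    b, v = estimate_bias_variance(degree)
    print(f"degree={degree}  bias^2={b:.4f}  variance={v:.4f}")
```

In a setup like this, the straight line (degree 1) should show the largest squared bias, while the degree-9 fit should show the largest variance. The exact numbers depend on the assumed target and noise level.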

High Bias vs. High Variance Indicators

High Bias (Underfitting):

Training error is high
Validation error is close to training error (both bad)
Learning curve: both train/val curves plateau at high error
Fix: more complex model, better features, less regularization

High Variance (Overfitting):

Training error is low
Validation error is much higher than training error
Learning curve: large gap between train and val curves
Fix: more data, regularization, dropout, simpler model
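
The indicators above can be turned into a rough rule of thumb. The helper below is a hypothetical sketch, not a standard API: the name diagnose and the two thresholds are assumptions chosen for illustration, and sensible values depend on the task and its noise floor.

```python
def diagnose(train_error, val_error, acceptable_error=0.10, gap_tolerance=0.05):
    """Label a model from its training and validation error rates.

    acceptable_error and gap_tolerance are hypothetical thresholds; tune them
    to the problem at hand.
    """
    high_bias = train_error > acceptable_error                 # training error itself is high
    high_variance = (val_error - train_error) > gap_tolerance  # large train/val gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "ok"


print(diagnose(0.30, 0.32))   # high bias
print(diagnose(0.02, 0.20))   # high variance
```

A single pair of numbers is only a starting point; learning curves give a fuller picture of how both errors evolve with more data.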

Diagnosing with Learning Curves

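import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the snippet needs some X and y to run, so a synthetic binary
# classification dataset stands in here. Swap in your own data.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

def scaled_svc(**svc_params):
    # SVMs are sensitive to feature scale, so standardize inside the pipeline
    # (the scaler is then refit on each cross-validation training fold).
    return Pipeline([('scaler', StandardScaler()), ('svc', SVC(**svc_params))])

def plot_learning_curves(estimator, X, y, title):
    # Cross-validated accuracy at 10 sizes, from 10% to 100% of each training fold
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y,
        cv=5, scoring='accuracy',
        train_sizes=np.linspace(0.1, 1.0, 10),
        n_jobs=-1
    )

    # Mean and spread across the 5 folds at each training set size
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
    plt.plot(train_sizes, val_mean, 'o-', color='green', label='Validation Score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='green')

    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

plot_learning_curves(scaled_svc(kernel='rbf', C=100), X, y, 'High Variance (C=100)')
plot_learning_curves(scaled_svc(kernel='linear', C=0.01), X, y, 'High Bias (C=0.01)')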

How Model Complexity Affects Bias and Variance

Model Complexity →  Simple ─────────────────────── Complex

Bias:              High ───────────────────────── Low
Variance:          Low ──────────────────────────  High
Total Error:       High ──── optimal ───────────── High
                            ↑
                    Sweet spot: lowest total error

The “sweet spot” is the model complexity that minimizes validation error — neither too simple nor too complex.
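
One way to locate the sweet spot is to sweep a complexity knob and keep the setting with the lowest validation error. The sketch below uses polynomial degree as that knob on synthetic data. The sine target, noise level, and 80/40 split are assumptions for illustration, and a single hold-out split stands in for the cross-validation you would normally prefer.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of a sine wave.
x = rng.uniform(0.0, 1.0, 120)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
x_train, y_train = x[:80], y[:80]
x_val, y_val = x[80:], y[80:]

degrees = list(range(0, 13))   # model complexity, from constant to degree 12
train_mse, val_mse = [], []
for d in degrees:
    model = Polynomial.fit(x_train, y_train, d, domain=[0, 1])
    train_mse.append(np.mean((model(x_train) - y_train) ** 2))
    val_mse.append(np.mean((model(x_val) - y_val) ** 2))

# The sweet spot is the complexity with the lowest validation error.
best_degree = degrees[int(np.argmin(val_mse))]

print("degree  train_mse  val_mse")
for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"{d:6d}  {tr:9.4f}  {va:7.4f}")
print("lowest validation MSE at degree", best_degree)
```

Training error can only go down as the degree grows, because each model contains the previous one. Validation error is what reveals where extra complexity stops paying off.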

The Double Descent Phenomenon

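In heavily overparameterized models (for example, neural networks with more parameters than training samples), the classic U-shaped curve is not the whole story. Test error peaks near the interpolation threshold, where the model first fits the training data exactly, and then falls a second time as complexity keeps growing:

Error
  |  \                 /\
  |   \   classic     /  \    second descent
  |    \  U-shape    /    \
  |     \___________/      \_______________
  |
  └──────────────────────────────────────── Model Complexity
                       ↑
            interpolation threshold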

Many modern neural networks operate in this “beyond interpolation” regime, where the classical tradeoff doesn’t directly apply: they can fit the training data exactly and still generalize well when sufficiently overparameterized.
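
A toy experiment can hint at this behavior without training a neural network. The sketch below is an assumption-laden stand-in, not the setup of any particular study. It fits minimum-norm least squares on random ReLU features and increases the feature count past the number of training samples. The data-generating process, sizes, and variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_inputs = 40, 500, 5

# Hypothetical data: a noisy linear target on Gaussian inputs.
w_true = rng.normal(size=n_inputs)


def make_data(n_samples):
    X = rng.normal(size=(n_samples, n_inputs))
    y = X @ w_true + rng.normal(0.0, 0.5, size=n_samples)
    return X, y


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

results = []  # (n_features, train_mse, test_mse)
for n_features in (5, 10, 20, 40, 80, 160, 640):
    # Fixed random first layer; only the linear readout on top is fitted.
    W = rng.normal(size=(n_inputs, n_features))
    phi_train = np.maximum(X_train @ W, 0.0)
    phi_test = np.maximum(X_test @ W, 0.0)
    # lstsq returns the minimum-norm solution once n_features exceeds n_train.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mse = np.mean((phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    results.append((n_features, train_mse, test_mse))
    print(f"features={n_features:4d}  train_mse={train_mse:.4f}  test_mse={test_mse:.4f}")
```

In experiments of this kind, test error often spikes when the feature count is close to n_train and drops again well beyond it, while training error falls to essentially zero. The exact shape depends on the seed, noise level, and feature map, so treat any single run as suggestive rather than conclusive.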

Strategies by Diagnosis

Problem	Root Cause	Solutions
High training + high val error	High Bias	More features, more complex model, less regularization, more epochs
Low training + high val error	High Variance	More training data, regularization, dropout, simpler model, early stopping
Both errors unstable across runs	High Variance	Fix random seeds, more data, gradient clipping

The diagnosis comes first — every fix is specific to whether you have a bias problem or a variance problem. Misdiagnosis leads to applying the wrong fix and making things worse.