🌟 Overfitting, Underfitting & the Bias–Variance Trade-Off


In machine learning (ML), most projects succeed or fail based on how well the model generalises — that is, how well it performs on new, unseen data after being trained on known data. Two fundamental issues threaten generalisation:

  • Underfitting: when the model is too simple to capture the underlying pattern.
  • Overfitting: when the model is too complex and fits the training data (including noise) too closely.

Closely tied to these is the bias–variance trade-off, a core principle that explains why you can’t simply minimise training error and expect perfect generalisation. The goal is to find the “sweet spot” in model complexity: complex enough to capture the underlying patterns, but simple enough to generalise.


1. Underfitting

What is underfitting?

A model underfits when it is too simple to learn the underlying structure of the data. It yields high error on both the training data and unseen test data. Key features: high bias, low variance. Because it cannot capture the complexity of the target function, it performs poorly even on the data it was trained on. ([IBM][1])

Why it happens

  • Too few features or weak feature extraction
  • An overly simple model (e.g., linear regression to model non-linear data)
  • Insufficient training time or data
  • Excessive regularisation (which forces the model to be simple)

Example Programs

Here are three code examples that illustrate underfitting.

Example 1A: Linear Regression on Non-linear Data (Python/sklearn)

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Noisy sine wave: a clearly non-linear target
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A straight line cannot follow the sine curve
model = LinearRegression()
model.fit(X_train, y_train)
print("Train R^2:", model.score(X_train, y_train))
print("Test R^2:", model.score(X_test, y_test))

Here a linear model tries to fit a sine wave. Because the relationship is non-linear, a straight line cannot capture it, so the model underfits.

Example 1B: Decision Tree with Shallow Depth (classification)

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=1)  # extremely shallow tree
model.fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy :", model.score(X_test, y_test))

With max_depth=1, the tree can make only a single split, so it cannot separate all three iris classes and underfits.

Example 1C: High regularisation in Logistic Regression

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(C=1e-6)  # very strong regularisation (small C)
model.fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy :", model.score(X_test, y_test))

Here we constrain the model so heavily (via a very small C, which in scikit-learn means strong regularisation) that it cannot learn meaningful patterns, so it underfits.

How to detect underfitting

  • Training error is high (model not learning well)
  • Test error is similar to training error (both poor)
  • Learning curves: training and validation error both remain high and stay close together, whether you add more data or more model complexity (see the sketch below)
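
The learning-curve diagnostic can be automated with scikit-learn’s learning_curve helper. Below is a minimal sketch, reusing the shallow-tree setup from Example 1B; the exact scores depend on the cross-validation splits, so treat the numbers as illustrative.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
# Cross-validated train/validation accuracy for growing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=1),  # deliberately underfitting model
    iris.data, iris.target,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, shuffle=True, random_state=0,
)
print("Train accuracy:", train_scores.mean(axis=1))  # stays mediocre
print("Val accuracy  :", val_scores.mean(axis=1))    # similarly mediocre

Both curves flattening out at a mediocre score, with little gap between them, is the classic underfitting signature.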

How to fix underfitting

  • Increase model complexity (e.g., deeper tree, more layers)
  • Add more features or transform features (polynomial, interaction terms; see the sketch after this list)
  • Reduce regularisation
  • Ensure sufficient data and training time
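
As a concrete illustration of the feature-transformation fix, here is a hedged sketch that revisits Example 1A and adds polynomial features so the linear model can follow the sine curve. The degree of 7 is an assumption for illustration, not a tuned value.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same noisy sine data as Example 1A
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Polynomial feature expansion gives the linear model enough flexibility
model = make_pipeline(PolynomialFeatures(degree=7), LinearRegression())
model.fit(X_train, y_train)
print("Train R^2:", model.score(X_train, y_train))
print("Test R^2 :", model.score(X_test, y_test))

Both R² scores should now be far higher than in Example 1A, because the added features remove the bias without (at this degree and sample size) introducing much variance.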

2. Overfitting

What is overfitting?

Overfitting occurs when the model is too complex and fits the training data (including noise and outliers) very well, but generalises poorly to new data. Training error is low, test error is high. This corresponds to low bias but high variance. ([IBM][1])

Why it happens

  • Model is too flexible (too many parameters)
  • Insufficient or noisy data
  • Too many features relative to number of observations (curse of dimensionality)
  • Training for too long without validation, or no regularisation

Example Programs

Here are three code examples of overfitting.

Example 2A: Polynomial Regression with High Degree

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Only 30 noisy samples of a sine wave
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 30)

# Degree-15 polynomial: far more flexibility than the data supports
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("Train R^2:", model.score(X_train, y_train))
print("Test R^2:", model.score(X_test, y_test))

With degree=15 on just 30 samples, the model will very likely overfit.

Example 2B: Decision Tree Without Depth Limit (classification)

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier()  # no depth limit: the tree grows until it fits the training set
model.fit(X_train, y_train)
print("Train acc:", model.score(X_train, y_train))
print("Test acc :", model.score(X_test, y_test))

The tree becomes extremely deep and fits the noise in the training points, leading to overfitting.

Example 2C: Neural Network Too Many Epochs Without Regularisation

import tensorflow as tf

# Load MNIST and flatten/scale the images
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 28*28) / 255.0
X_test = X_test.reshape(-1, 28*28) / 255.0

# Large dense network with no dropout, weight penalty or early stopping
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))

Without regularisation or early stopping, the network will eventually overfit the training digits: training accuracy keeps climbing while accuracy on the unseen test set stalls or degrades.

How to detect overfitting

  • Training error is low (model fits training data well)
  • Validation/test error is much worse than training error
  • Learning curve (error vs epochs or model complexity): training error keeps dropping while validation error plateaus or rises (see the sketch below)
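
One convenient way to expose that gap is scikit-learn’s validation_curve, which sweeps a single hyperparameter and reports cross-validated train and validation scores. A minimal sketch on the make_moons data from Example 2B (exact numbers will vary):

from sklearn.datasets import make_moons
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
depths = list(range(1, 16))
# Cross-validated train vs validation accuracy as the tree gets deeper
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train={tr:.3f}  val={va:.3f}")

A training score that keeps rising towards 1.0 while the validation score stalls or falls is the overfitting signature.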

How to fix overfitting

  • Reduce model complexity (prune the tree, reduce layers/parameters; illustrated below)
  • Regularisation (L1, L2, dropout)
  • Use more data
  • Use cross-validation, early stopping
  • Use simpler feature set or dimensionality reduction
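
To make the first two fixes concrete, here is a hedged sketch that revisits Example 2B and simply caps the tree depth; cost-complexity pruning via ccp_alpha or a grid search over depths would be natural next steps. max_depth=4 is an assumed value for illustration, not a tuned one.

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Capping the depth constrains complexity; compare with the unlimited tree in Example 2B
model = DecisionTreeClassifier(max_depth=4, random_state=0)
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
model.fit(X_train, y_train)
print("Train acc  :", model.score(X_train, y_train))  # lower than before, by design
print("Test acc   :", model.score(X_test, y_test))    # typically closer to the train score

The train/test gap should shrink compared with Example 2B, usually with a similar or better test score.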

3. The Bias–Variance Trade-Off

What is the bias–variance trade-off?

Bias refers to errors introduced by approximating a real‐world problem (which may be complex) by a too‐simple model. Variance refers to errors introduced by the model’s sensitivity to small fluctuations in the training set. The total expected error can be decomposed roughly into:

Total Error ≈ Bias² + Variance + Irreducible Error ([Wikipedia][2])

Put simply:

  • High bias → underfitting (too simple)
  • High variance → overfitting (too complex)
  • The trade-off: as you reduce bias by adding complexity, you increase variance; as you reduce variance with a simpler model, you increase bias. The ideal model minimises their sum, and therefore the total error.
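
To make the decomposition concrete, here is a hedged sketch that estimates bias² and variance empirically for squared-error loss: we refit the same model on many training sets drawn from a known noisy sine function and look at how its predictions behave at fixed test points. The sample sizes, noise level and tree depth are arbitrary choices for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 10, 50).reshape(-1, 1)
f_true = np.sin(x_test).ravel()                 # the (normally unknown) target function

preds = []
for _ in range(200):                            # 200 independent training sets
    x_tr = rng.uniform(0, 10, 40).reshape(-1, 1)
    y_tr = np.sin(x_tr).ravel() + rng.normal(0, 0.3, 40)
    model = DecisionTreeRegressor(max_depth=3)  # change the depth to move along the trade-off
    preds.append(model.fit(x_tr, y_tr).predict(x_test))

preds = np.array(preds)                         # shape: (200, 50)
bias_sq = ((preds.mean(axis=0) - f_true) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"Bias^2 ≈ {bias_sq:.3f}   Variance ≈ {variance:.3f}")

Re-running the sketch with max_depth=1 should push bias² up and variance down; removing the depth limit should do the opposite.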

Why it matters

Finding that balance means better generalisation — the performance on unseen data. A model might show excellent training performance, but if variance is high, it will fail on real‐world data. Conversely, a model might perform poorly across the board because it is too biased (underfitting). ([GeeksforGeeks][3])

Example Programs

Here are three examples that directly illustrate the bias–variance trade-off.

Example 3A: Varying Tree Depth to Visualise Trade-Off

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes  # load_boston has been removed from scikit-learn
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
train_errors = []
test_errors = []
depths = range(1, 21)
for d in depths:
    model = DecisionTreeRegressor(max_depth=d, random_state=0)
    model.fit(X_train, y_train)
    train_errors.append(((y_train - model.predict(X_train))**2).mean())
    test_errors.append(((y_test - model.predict(X_test))**2).mean())
plt.plot(depths, train_errors, label='Train MSE')
plt.plot(depths, test_errors, label='Test MSE')
plt.xlabel('Max Depth')
plt.ylabel('Mean Squared Error')
plt.legend()
plt.show()

The plot typically shows a U-shape for test error: first decreasing (reducing bias), then increasing (increasing variance).

Example 3B: Regularisation Strength Sweep in Ridge Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
alphas = np.logspace(-4, 4, 50)
train_scores = []
test_scores = []
for a in alphas:
    model = Ridge(alpha=a)
    model.fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))
plt.semilogx(alphas, train_scores, label='Train score')
plt.semilogx(alphas, test_scores, label='Test score')
plt.xlabel('Alpha')
plt.ylabel('R^2 Score')
plt.legend()
plt.show()

Here, varying the regularisation strength alpha moves the model between high variance (small alpha) and high bias (large alpha), visualising the trade-off.

Example 3C: K‐Nearest Neighbours with Varying K

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
train_acc = []
test_acc = []
ks = range(1, 31)
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    train_acc.append(model.score(X_train, y_train))
    test_acc.append(model.score(X_test, y_test))
plt.plot(ks, train_acc, label='Train acc')
plt.plot(ks, test_acc, label='Test acc')
plt.xlabel('Number of neighbours K')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

At very low k you get overfitting (low bias, high variance); at very high k you get underfitting (high bias, low variance); the optimal k is the trade-off point.

Visualising the Bias–Variance Trade-Off

Here’s a diagram to illustrate how model complexity influences bias, variance, and error:

Error
  │\
  │ \   Test error (U-shape)
  │  \                               __/
  │   \__                         __/
  │      \___                 ___/
  │          \_______________/    ← sweet spot (best generalisation)
  │
  │   ----____
  │           ----____
  │                   ----________________   Training error
  └─────────────────────────────────────────→  Model complexity
   low                                    high

More intuitively:

Low complexity model → High Bias, Low Variance → Underfitting
Medium complexity → Balanced Bias & Variance → Good Generalisation
High complexity → Low Bias, High Variance → Overfitting

4. How to Remember These Concepts (for Interview/Exam)

Mnemonics

  • “BAV”: Bias, Accuracy, Variance — The interplay between them
  • “Too Simple → Under-fit; Too Complex → Over-fit; Just Right → Balance”
  • “Bias High = Under-fit; Variance High = Over-fit”

Flashcard Style Q&A

  • Q: What indicates a model is underfitting? A: Poor performance on both training and test data (high bias).
  • Q: What indicates overfitting? A: Great training performance but poor test performance (high variance).
  • Q: What is the bias–variance trade-off? A: The tension between bias and variance; reducing one often increases the other.
  • Q: How do you reduce overfitting? A: Use more data, regularisation, simpler model, cross-validation.
  • Q: How do you reduce underfitting? A: Add complexity, more features, reduce regularisation.

Interview/Exam Talking Points

  • Explain how learning curves (plots of error vs epoch or complexity) help diagnose under/over-fitting.
  • Mention specific techniques: cross‐validation, regularisation (L1, L2), pruning, drop‐out, early stopping.
  • Be ready to discuss how the bias–variance trade‐off explains why you can’t simply minimise training error.
  • Use analogies: Overfitting = memorising answers for one test; Underfitting = not studying enough.
  • Show awareness of modern nuances (e.g., double descent) though not required for all interviews.

5. Why It Is Important to Learn This Concept

Real-World Impact

  • A model that overfits can appear perfect in training but fail catastrophically in production — expensive and risky. ([Cross Validated][4])
  • A model that underfits discards potential insights, resulting in poor performance and misguided decisions.
  • Understanding the bias–variance trade‐off helps you design models that perform well on unseen data, not just the training set.

Project Efficiency

  • Helps in choosing the right model complexity, hyperparameters, and feature engineering strategy.
  • Aids in diagnosing issues when a model behaves unexpectedly (e.g., training accuracy high but test accuracy low).
  • Supports effective use of cross‐validation, regularisation and better ML workflow.

Professional & Career Growth

  • Employers value ML practitioners who know how to generalise models rather than simply fit the training set.
  • Demonstrating ability to trade off bias vs variance shows maturity in ML thinking.
  • Fundamental for roles in data science, machine learning engineering, model governance and reliability.

Theoretical Foundations

  • Bias–variance decomposition is a core theoretical framework in supervised learning. ([Wikipedia][2])
  • Helps understand why some algorithms generalise better than others, and why bigger is not always better in ML.

6. Summary & Key Takeaways

  • Underfitting: too simple a model → high bias → poor training & test performance.
  • Overfitting: too complex a model (or too little data) → high variance → good training, poor test performance.
  • Bias–Variance Trade-Off: managing model complexity to balance bias and variance → best generalisation.
  • Use diagnostics: learning curves, train/test error gap.
  • Techniques: regularisation, more data, simpler model, feature engineering, cross‐validation.
  • For interviews: talk about error vs complexity curves, examples, and how you would fix issues.
  • For exams: know definitions, diagrams, signs, causes & remedies.

7. Final Thoughts

The journey from underfitting to overfitting is one of finding “just enough” complexity. Think of a recipe: too bland (underfit) or too spicy (overfit) – the goal is the perfect balance. In machine learning, that balance is your model’s ability to generalise well.

Always ask:

“Is my model learning the pattern or just memorising noise?”

By mastering these concepts — underfitting, overfitting and the bias–variance trade-off — you’ll be equipped to build robust, real‐world machine learning models that perform well beyond the training data.