Overfitting

You train a model that achieves 98% accuracy on your training data. Then you deploy it — and it performs at 67% on real inputs. What happened?

Overfitting. The model memorized the training data, including its noise and idiosyncrasies, instead of learning general patterns. It’s one of the most common reasons ML models fail in production, and most practitioners run into it sooner or later.

What Overfitting Looks Like

Training loss:   ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼  (keeps going down)
Validation loss: ▼▼▼▼▼▼▼ then ▲▲▲▲▲    (starts rising)
                                ↑
                        Overfitting begins here

The model is learning things specific to the training set — noise, outliers, coincidental patterns — that don’t generalize to new data.

Why It Happens

Too much model capacity: A neural network with 1,000 hidden units fitting 100 training examples has far more parameters than data points, so it can memorize all 100 examples exactly while generalizing poorly.

Too little data: Small datasets increase the chance the model picks up statistical flukes.

Too many training epochs: Training too long allows the model to squeeze out every bit of training-set information, including noise.

No regularization: Nothing penalizing the model for becoming too complex.
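
To make the capacity point concrete, here is a minimal sketch in plain Python. It compares a straight line with a degree-7 polynomial that passes through all 8 training points. The data, the alternating noise pattern, and the helper names (fit_line, fit_interpolant, mse) are assumptions chosen for illustration, not a benchmark.

```python
# Sketch: low-capacity model (line) vs. high-capacity model (interpolating
# polynomial) on the same noisy data. All names and data are illustrative.

def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# True relationship is y = 2x; training labels carry alternating +/-0.5 noise.
train_x = [float(i) for i in range(8)]
train_y = [2 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]
# Validation points sit between the training points and are noise-free.
val_x = [x + 0.5 for x in train_x[:-1]]
val_y = [2 * x for x in val_x]

line = fit_line(train_x, train_y)
interpolant = fit_interpolant(train_x, train_y)

for name, model in [("line", line), ("interpolant", interpolant)]:
    print(f"{name:12s} train MSE={mse(model, train_x, train_y):.4f} "
          f"val MSE={mse(model, val_x, val_y):.4f}")
```

The interpolant reaches zero training error because it has enough freedom to fit the noise, and that same freedom makes it swing away from the true line between the training points.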

Diagnosing Overfitting

Learning Curves

Plot training and validation loss (or accuracy) as a function of training size or epochs:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

# model is any scikit-learn estimator; X and y are the features and labels
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10)
)

# Scores have shape (n_sizes, n_folds); average over the folds
plt.plot(train_sizes, train_scores.mean(axis=1), label='Train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training set size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Overfitting signature: Training performance is high while validation performance lags well behind, leaving a large gap between the curves.
Underfitting signature: Both curves plateau close together at low performance.

Techniques to Prevent Overfitting

1. Get More Data

Often the most reliable fix. With more data, it becomes harder for the model to memorize individual examples and easier to pick up patterns that hold in general.

When you can’t collect more data:

Data augmentation (images): random crops, flips, rotations, color jitter
Synthetic data: SMOTE for imbalanced tabular datasets
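
As a rough illustration of image augmentation, the sketch below applies a random crop and a random horizontal flip to a toy image stored as a list of pixel rows. The function names and the 4x4 "image" are assumptions made for this example. In practice you would more likely use the transforms that ship with your deep learning framework.

```python
import random

def horizontal_flip(image):
    """Mirror an image left-to-right; image is a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, rng, crop_h, crop_w, flip_prob=0.5):
    """Random crop, then a horizontal flip with probability flip_prob."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out

rng = random.Random(0)  # seeded so the example is repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
variants = [augment(image, rng, crop_h=3, crop_w=3) for _ in range(3)]

for variant in variants:
    print(variant)
```

Each call produces a slightly different view of the same underlying image, so the model sees more variety without any new labels being collected.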

2. Regularization

Add a penalty to the loss function that discourages large weights.

import torch
from sklearn.linear_model import Ridge, Lasso

# L2 regularization (Ridge): penalizes large weights
ridge = Ridge(alpha=1.0)

# L1 regularization (Lasso): drives some weights to zero (sparse)
lasso = Lasso(alpha=0.1)

# For neural networks, via weight decay in the optimizer.
# model is an existing torch.nn.Module. In Adam, weight_decay acts as an
# L2 penalty; torch.optim.AdamW applies decoupled weight decay instead.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
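
To see how the penalty shrinks a weight, consider the simplest possible case: one feature, no intercept. Minimizing sum((y - w*x)^2) + alpha * w^2 has the closed-form solution w = sum(x*y) / (sum(x*x) + alpha). The sketch below assumes that simplified setting, and the data values are made up for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Illustrative data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# alpha = 0 recovers ordinary least squares; larger alpha pulls w toward 0.
for alpha in (0.0, 1.0, 10.0, 100.0):
    print(f"alpha={alpha:6.1f}  w={ridge_slope(xs, ys, alpha):.3f}")
```

The weight moves steadily toward zero as alpha grows, which is the trade-off regularization makes: a slightly worse fit to the training data in exchange for a less extreme model.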

3. Dropout (Neural Networks)

Randomly zero out neurons during training, forcing the network to learn redundant representations.

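import torch.nn as nn

n_classes = 10  # placeholder; set to the number of target classes

# Dropout is active only in training mode; model.eval() turns it off.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each activation is zeroed with probability 0.5
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # lighter dropout in the narrower layer
    nn.Linear(128, n_classes)
)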
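
For intuition about what a dropout layer does, here is a minimal plain-Python sketch of "inverted" dropout, the variant nn.Dropout uses: during training each activation is zeroed with probability p and the survivors are scaled by 1/(1-p), and at evaluation time the input passes through unchanged. The function name and arguments are assumptions for this illustration, not PyTorch's implementation.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout on a list of activations (illustrative sketch)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    keep_scale = 1.0 / (1.0 - p)  # keeps the expected activation unchanged
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)  # seeded so the example is repeatable
activations = [1.0] * 8

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print("train:", train_out)  # a mix of 0.0 and 2.0
print("eval: ", eval_out)   # unchanged
```

Scaling the surviving activations during training is what lets the network be used at inference time without any extra correction.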

4. Early Stopping

Stop training when validation loss stops improving.

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,        # Stop after 10 epochs of no improvement
    restore_best_weights=True  # Keep the best model
)

model.fit(X_train, y_train, validation_split=0.2,
          epochs=1000, callbacks=[early_stop])
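
The logic behind a patience-based callback can be sketched in a few lines of plain Python. This is an illustration of the idea rather than Keras internals. The function name, the min_delta argument, and the loss values are assumptions for the example.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on
    the best loss by more than `min_delta`. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    # Ran out of epochs before patience was exhausted
    return best_epoch, len(val_losses) - 1

# Validation loss falls, then starts rising (illustrative numbers)
losses = [1.0, 0.8, 0.7, 0.75, 0.8, 0.9]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
print(f"best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```

Restoring the weights from best_epoch rather than keeping those from stop_epoch is the role restore_best_weights=True plays in the Keras callback.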

5. Reduce Model Complexity

Use a simpler model with fewer parameters.

Fewer layers / neurons in neural networks
Shallower trees with max_depth limits
Fewer polynomial features
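
One quick way to compare candidate architectures is to count their trainable parameters. The sketch below does this for fully connected networks, where each layer has (inputs + 1) * outputs parameters including the bias. The helper name is hypothetical, and the 10 output classes are an assumed value.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# The 512 -> 256 -> 128 -> n_classes network from the dropout section,
# assuming 10 output classes, versus a smaller single-hidden-layer variant.
large = count_dense_parameters([512, 256, 128, 10])
small = count_dense_parameters([512, 64, 10])

print(f"large: {large:,} parameters")
print(f"small: {small:,} parameters")
```

Parameter count is only a rough proxy for capacity, but a large drop like this is usually a reasonable first experiment when a model overfits a small dataset.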

6. Cross-Validation

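Cross-validation does not prevent overfitting by itself, but it evaluates generalization across multiple train/validation splits. That gives a more honest picture than a single split and makes overfitting easier to catch.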
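
The splitting itself is simple enough to sketch in plain Python. The generator below produces k train/validation index splits without shuffling, and the function name is an assumption for this example. With scikit-learn you would normally reach for KFold or cross_val_score instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
    print(f"fold {fold}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Every sample lands in a validation fold exactly once, so the averaged score depends less on which examples happened to fall into a single held-out split.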

How Much Gap Is Acceptable?

A small train/validation gap is fine and expected, because the model has already seen the training data and tends to score better on it than on new data. The question is whether the gap is disproportionately large.

Rules of thumb:

For classification: a gap > 5–10 percentage points usually warrants investigation
For regression: validation RMSE > 150% of training RMSE suggests overfitting
Always: if validation performance doesn’t justify deployment, the model isn’t ready
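
These rules of thumb are easy to turn into an automated check. The sketch below encodes them with hypothetical function names and with the thresholds exposed as parameters, since the right cutoff depends on the problem.

```python
def classification_gap_flag(train_acc, val_acc, max_gap_points=5.0):
    """True if the accuracy gap exceeds max_gap_points percentage points.

    Accuracies are fractions in [0, 1].
    """
    gap_points = (train_acc - val_acc) * 100
    return gap_points > max_gap_points

def regression_gap_flag(train_rmse, val_rmse, max_ratio=1.5):
    """True if validation RMSE exceeds max_ratio times training RMSE."""
    return val_rmse > max_ratio * train_rmse

# The opening example: 98% on training data, 67% on real inputs
print(classification_gap_flag(0.98, 0.67))  # gap of about 31 points
print(classification_gap_flag(0.91, 0.89))  # gap of about 2 points
print(regression_gap_flag(train_rmse=1.0, val_rmse=1.6))
```

A flag from either check is a prompt to investigate, not a verdict. The final rule in the list still applies regardless of what the gap looks like.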

Overfitting vs. Overfitting-on-Evaluation-Data

A subtle but important point: even your validation set can become contaminated if you repeatedly evaluate on it and select the best model. This leads to “overfitting to the validation set.”

The test set should be evaluated once, only when you’re ready to report final numbers. If you’ve run 100 experiments and tweaked your model based on validation results, the validation score is optimistically biased, and the test set is your only truly held-out evaluation. Never touch it until the end.
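
One way to enforce this discipline is to carve out the test set once, up front, and keep its indices away from the experiment loop. The sketch below is a minimal version in plain Python. The function name, the 60/20/20 proportions, and the fixed seed are assumptions for illustration.

```python
import random

def train_val_test_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle indices once and split them into train, validation, and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split stable
    n_test = round(n_samples * test_fraction)
    n_val = round(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))

# Tune against val_idx as often as needed; score on test_idx once, at the end.
```

Fixing the seed matters here: if the split changes between runs, examples from the test set can drift into training or validation and quietly contaminate the final number.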