Overfitting
You train a model that achieves 98% accuracy on your training data. Then you deploy it — and it performs at 67% on real inputs. What happened?
Overfitting. The model memorized the training data, including its noise and idiosyncrasies, instead of learning general patterns. It’s one of the most common reasons ML models fail in production, and every practitioner eventually deals with it.
What Overfitting Looks Like
    Training loss:    ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼ (keeps going down)
    Validation loss:  ▼▼▼▼▼▼▼ then ▲▲▲▲▲ (starts rising)
                              ↑ Overfitting begins here
The model is learning things specific to the training set (noise, outliers, coincidental patterns) that don’t generalize to new data.
Why It Happens
Too much model capacity: A 1000-node neural network fit to 100 training examples has enough parameters to memorize all 100 examples exactly while learning little that generalizes.
Too little data: Small datasets increase the chance the model picks up statistical flukes.
Too many training epochs: Training too long allows the model to squeeze out every bit of training-set information, including noise.
No regularization: Nothing penalizes the model for becoming too complex.
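The capacity point can be illustrated with a deliberately extreme sketch. It assumes a 1-nearest-neighbor classifier (chosen here only for illustration, since it stores every training example) and a synthetic dataset where the features are noise and the labels are coin flips. There is nothing to learn, so any training accuracy above chance is pure memorization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Features are pure noise and labels are coin flips, so there is
# no real pattern: accuracy above ~50% on the training set is memorization.
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(500, 5))
y_test = rng.integers(0, 2, size=500)

def predict_1nn(X_ref, y_ref, X_query):
    """Label each query point with the label of its nearest reference point."""
    # Pairwise squared Euclidean distances, shape (n_query, n_ref).
    d2 = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    return y_ref[d2.argmin(axis=1)]

train_acc = (predict_1nn(X_train, y_train, X_train) == y_train).mean()
test_acc = (predict_1nn(X_train, y_train, X_test) == y_test).mean()
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Every training point is its own nearest neighbor, so training accuracy is perfect, while accuracy on fresh data should hover around chance.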
Diagnosing Overfitting
Learning Curves
Plot training and validation loss (or accuracy) as a function of training size or epochs:
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.model_selection import learning_curve

    # model is any scikit-learn estimator; X and y are the full features and labels
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring='accuracy',
        train_sizes=np.linspace(0.1, 1.0, 10))

    # Average the scores over the 5 cross-validation folds
    plt.plot(train_sizes, train_scores.mean(axis=1), label='Train')
    plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
    plt.xlabel('Training set size')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()
Overfitting signature: Large gap between training and validation performance.
Underfitting signature: Both curves plateau at low performance.
Techniques to Prevent Overfitting
1. Get More Data
Often the most reliable fix. More data makes it harder for the model to memorize individual examples.
When you can’t collect more data:
- Data augmentation (images): random crops, flips, rotations, color jitter
- Synthetic data: SMOTE for tabular imbalanced datasets
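As a rough sketch of what image augmentation does, the function below applies a random horizontal flip and a random crop using plain NumPy. The function name `augment`, the 32x32 input, and the 28-pixel crop size are assumptions made for the example. In practice you would more likely use a library such as torchvision's transforms.

```python
import numpy as np

def augment(image, rng, crop=28):
    """Return a randomly flipped and cropped copy of an H x W x C image."""
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop: pick a top-left corner so the window stays inside the image.
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real image
batch = np.stack([augment(image, rng) for _ in range(8)])
print(batch.shape)  # (8, 28, 28, 3)
```

Each call produces a slightly different view of the same image, so the model rarely sees the exact same input twice.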
2. Regularization
Add a penalty to the loss function that discourages large weights.
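To make the effect of the penalty concrete, here is a minimal sketch of L2-regularized least squares solved in closed form with NumPy. The synthetic data, the helper name `fit_linear`, and the choice `alpha=10.0` are assumptions for illustration. Unlike scikit-learn's estimators, this sketch does not fit an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small, noisy problem: 20 samples, 10 features, only the first feature matters.
X = rng.normal(size=(20, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=20)

def fit_linear(X, y, alpha=0.0):
    """Minimize ||Xw - y||^2 + alpha * ||w||^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_plain = fit_linear(X, y, alpha=0.0)   # no penalty
w_ridge = fit_linear(X, y, alpha=10.0)  # L2 penalty

print("weight norm without penalty:", np.linalg.norm(w_plain))
print("weight norm with penalty:   ", np.linalg.norm(w_ridge))
```

The penalized solution always has the smaller weight norm. A larger `alpha` shrinks the weights further, trading some training fit for stability.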
    import torch
    from sklearn.linear_model import Ridge, Lasso

    # L2 regularization (Ridge): penalizes large weights
    ridge = Ridge(alpha=1.0)

    # L1 regularization (Lasso): drives some weights to zero (sparse)
    lasso = Lasso(alpha=0.1)

    # For neural networks, via weight decay in the optimizer
    # (model is an existing torch.nn.Module)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
3. Dropout (Neural Networks)
Randomly zero out neurons during training, forcing the network to learn redundant representations.
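The mechanism itself fits in a few lines. The sketch below implements "inverted" dropout in NumPy. The function name and signature are assumptions for illustration, not a library API.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each element with probability p while training."""
    if not training or p == 0.0:
        return x  # inference: the layer is a no-op
    keep = rng.random(x.shape) >= p  # True for the elements that survive
    # Rescale survivors so the expected activation matches inference.
    return x * keep / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones((4, 8))
dropped = dropout(activations, p=0.5, rng=rng)
print(dropped)  # entries are either 0.0 or 2.0
```

PyTorch's nn.Dropout behaves the same way: it is active only in training mode (`model.train()`) and is disabled by `model.eval()`.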
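    import torch.nn as nn

    n_classes = 10  # example value; set to the number of classes in your task

    model = nn.Sequential(
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),  # each activation is zeroed with probability 0.5 during training
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(128, n_classes))
4. Early Stopping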
Stop training when validation loss stops improving.
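The underlying logic is a patience counter. Here is a framework-free sketch. The function name and the list of validation losses are hypothetical, made up to show the control flow.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on
    the best loss seen so far. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs before patience was exhausted.
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: improve, then drift upward.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
best, stopped = early_stopping_epoch(losses, patience=3)
print(best, stopped)  # 3 6
```

Training halts at epoch 6, but the weights worth keeping are those from epoch 3. That is why frameworks offer an option to restore the best weights.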
    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=10,                # Stop after 10 epochs of no improvement
        restore_best_weights=True)  # Keep the best model

    model.fit(X_train, y_train, validation_split=0.2, epochs=1000, callbacks=[early_stop])
5. Reduce Model Complexity
Use a simpler model with fewer parameters.
- Fewer layers / neurons in neural networks
- Shallower trees with max_depth limits
- Fewer polynomial features
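The polynomial case makes a compact illustration. The sketch below fits a degree-1 and a degree-9 polynomial to ten noisy samples of a straight line. The data, noise level, and helper name are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples of a straight line (the "true" pattern is y = 2x).
x = np.linspace(-1.0, 1.0, 10)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)
x_new = np.linspace(-1.0, 1.0, 200)  # unseen inputs in the same range

def fit_and_score(degree):
    """Return (training MSE, MSE against the true line on unseen inputs)."""
    coeffs = np.polyfit(x, y, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    new_mse = np.mean((np.polyval(coeffs, x_new) - 2.0 * x_new) ** 2)
    return float(train_mse), float(new_mse)

simple_train, simple_new = fit_and_score(1)    # 2 parameters
complex_train, complex_new = fit_and_score(9)  # 10 parameters, one per point
print(f"degree 1: train={simple_train:.4f}  unseen={simple_new:.4f}")
print(f"degree 9: train={complex_train:.2e}  unseen={complex_new:.4f}")
```

The degree-9 fit passes through every noisy point, so its training error is essentially zero. That says nothing about how it behaves between the points, where it typically does worse than the straight line.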
6. Cross-Validation
Cross-validation doesn’t prevent overfitting by itself, but it evaluates generalization across multiple train/validation splits, which gives a more honest picture than a single split.
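A minimal sketch of k-fold splitting, assuming a plain least-squares model on synthetic data. The helper `k_fold_indices` is written for this example. scikit-learn provides the same idea through `KFold` and `cross_val_score`.

```python
import numpy as np

def k_fold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

fold_rmse = []
for train_idx, val_idx in k_fold_indices(len(X), k=5, rng=rng):
    # Fit ordinary least squares on the training folds only.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    residual = X[val_idx] @ w - y[val_idx]
    fold_rmse.append(float(np.sqrt(np.mean(residual ** 2))))

print(f"RMSE per fold: {np.round(fold_rmse, 3)}")
print(f"mean +/- std: {np.mean(fold_rmse):.3f} +/- {np.std(fold_rmse):.3f}")
```

The spread across folds is as informative as the mean: a large standard deviation suggests that a single split could have been misleading.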
How Much Gap Is Acceptable?
A small train/validation gap is fine and expected, since the model was optimized on the training data and has already seen it. The question is whether the gap is disproportionately large.
Rules of thumb:
- For classification: a gap > 5–10 percentage points usually warrants investigation
- For regression: validation RMSE > 150% of training RMSE suggests overfitting
- Always: if validation performance doesn’t justify deployment, the model isn’t ready
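These rules of thumb are easy to encode as quick checks. The function names and default thresholds below are assumptions that mirror the numbers above. Treat a flag as a prompt to investigate, not as a verdict.

```python
def classification_gap_flag(train_acc, val_acc, threshold=0.05):
    """Flag a train/validation accuracy gap above `threshold`.

    Accuracies are fractions, so threshold=0.05 means 5 percentage points.
    """
    return (train_acc - val_acc) > threshold

def regression_gap_flag(train_rmse, val_rmse, max_ratio=1.5):
    """Flag validation RMSE above `max_ratio` times the training RMSE."""
    return val_rmse > max_ratio * train_rmse

print(classification_gap_flag(0.98, 0.67))  # True
print(classification_gap_flag(0.91, 0.89))  # False
print(regression_gap_flag(2.0, 3.5))        # True
```

The first call uses the 98% and 67% figures from the opening example, which fall well past the threshold.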
Overfitting vs. Overfitting-on-Evaluation-Data
A subtle but important point: even your validation set can become contaminated if you repeatedly evaluate on it and select the best model. This leads to “overfitting to the validation set.”
The test set should be evaluated once, only when you’re ready to report final numbers. If you’ve run 100 experiments and tweaked your model based on validation results, the validation score is optimistically biased and the test set is your only truly held-out evaluation, so don’t touch it until the end.
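One way to enforce this discipline is to carve out the test indices once, up front, and keep them out of every experiment. A minimal sketch, assuming a 60/20/20 split and a hypothetical helper name:

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices once and cut them into train, validation, and test sets."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Fixing the seed keeps the test set identical across experiments, so nothing from it leaks into training or model selection by accident.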