Dataset Preparation: Train, Validation, and Test Splits Done Correctly
A model that scores 99% accuracy during development and then performs poorly once deployed is one of the most common and most avoidable failures in machine learning, and the cause is often how the data was split rather than the model architecture. Splitting data into train, validation, and test sets is unglamorous and easy to get subtly wrong, yet it is one of the highest-leverage steps to settle before any modeling work begins.
Why Three Splits, Not Two
Training set: the data the model actually learns from — its weights are updated based on this data’s loss.
Validation set: data the model never trains on, but that you (the practitioner) use repeatedly to make decisions — which hyperparameters to use, when to stop training, which architecture performs better.
Test set: data used exactly once, at the very end, to report a final, honest estimate of how the model performs on genuinely unseen data.
    from sklearn.model_selection import train_test_split

    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
    # Result: 70% train, 15% validation, 15% test

The critical distinction between validation and test: the validation set gets “used up” informationally over time, because every decision you make based on it (this learning rate looks better, this architecture scores higher) is implicitly fitting your choices to that specific data. The test set stays untouched specifically so it can give an honest, unbiased final answer.
Data Leakage: The Silent Killer of Valid Evaluation
Data leakage happens when information from outside the training set — often from the validation or test set — influences the model during training, producing inflated, unrealistic performance metrics that don’t hold up once the model sees genuinely new data.
Leakage via preprocessing computed on the full dataset before splitting:
    # WRONG: normalization statistics computed using test data too
    mean = X.mean(axis=0)  # includes test set data in this calculation!
    std = X.std(axis=0)
    X_normalized = (X - mean) / std
    X_train, X_test, y_train, y_test = train_test_split(X_normalized, y)

    # RIGHT: compute statistics only from training data, apply to all splits
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    mean = X_train.mean(axis=0)  # only training data informs this
    std = X_train.std(axis=0)
    X_train_normalized = (X_train - mean) / std
    X_test_normalized = (X_test - mean) / std  # test set normalized using train's stats

The wrong version lets test set values quietly influence the mean and standard deviation used for normalization. This is a subtle form of the model “seeing” test data before evaluation, and it inflates reported performance.
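One way to make the train-only rule hard to violate is to bundle the preprocessing and the model into a single scikit-learn pipeline. The following is a minimal sketch, not the only way to do it: the synthetic data, the choice of `LogisticRegression`, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaler's mean and std from X_train only;
# score() reuses those stored statistics to transform X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The fitted scaler's statistics match the training split, not the full data.
scaler = model.named_steps["standardscaler"]
print(scaler.mean_, X_train.mean(axis=0))
```

Because the scaler lives inside the pipeline, there is no separate step where full-dataset statistics could be computed by accident.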
Leakage via duplicate or near-duplicate records across splits. The same customer’s data can appear in both train and test, or near-identical images (slightly cropped variants of the same photo) can end up on both sides of the split. A model can effectively “memorize” these near-duplicates during training and then trivially recognize them again at test time, producing scores that say nothing about genuine generalization.
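When records carry a grouping key such as a customer ID, a group-aware splitter keeps every group entirely on one side of the split. The sketch below uses scikit-learn's `GroupShuffleSplit`; the data and the `customer_ids` array are hypothetical. It does not detect near-duplicates that lack a shared key (such as cropped images), which would need a separate deduplication step first.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 records belonging to 4 customers.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
customer_ids = np.array([101, 101, 101, 102, 102, 102,
                         103, 103, 103, 104, 104, 104])

# test_size here is the fraction of groups (customers), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=customer_ids))

train_customers = set(customer_ids[train_idx])
test_customers = set(customer_ids[test_idx])

# No customer appears on both sides of the split.
print(train_customers & test_customers)
```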
Leakage via time. For time-series data, a random split can leak future information into the training set. For example, training on data from March through December and testing on January and February of the same year lets the model learn from events that happened after the period it is evaluated on, information it would never have at prediction time. Time-series data needs a chronological split, not a random one.
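When several chronological evaluation rounds are wanted rather than one, scikit-learn's `TimeSeriesSplit` generates folds in which the training indices always precede the validation indices. This is a minimal sketch that assumes the rows are already sorted by time; the toy array is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series: 12 observations already sorted by time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, val_idx in folds:
    # Every training index precedes every validation index.
    print("train:", train_idx, "validate:", val_idx)
```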
    # Correct for time-series: split chronologically, not randomly
    cutoff_date = "2026-06-01"
    train_data = df[df["date"] < cutoff_date]
    test_data = df[df["date"] >= cutoff_date]

Stratified Splitting: Preserving Class Balance
For classification tasks with imbalanced classes (95% negative, 5% positive), a naive random split risks producing a test set with very few (or zero) examples of the minority class, making evaluation metrics unreliable. Stratified splitting ensures each split maintains the same class proportions as the original dataset.
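The effect can be checked on synthetic labels. The sketch below assumes a made-up label vector with 5% positives and compares the positive rate in the test split with and without the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 950 negatives, 50 positives (5% positive).
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)

# Plain random split: the minority share in the test set can drift.
_, _, _, y_test_random = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the test set keeps the original 5% positive share.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("random split positive rate:    ", y_test_random.mean())
print("stratified split positive rate:", y_test_strat.mean())
```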
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

Cross-Validation: Getting More Out of Limited Data
When a dataset is small enough that a single train/validation split feels wasteful or unstable, k-fold cross-validation trains and validates the model k separate times, each time holding out a different fold as validation, and averages the resulting performance.
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(scores.mean(), scores.std())  # average performance across 5 folds, plus its standard deviation

This is used far more commonly with classical ML models than with deep learning, since training a large neural network five separate times is often computationally expensive. The underlying principle (get a more reliable performance estimate from limited data) remains valuable whenever data is scarce.
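To make the mechanics concrete, here is a rough hand-written equivalent of the k-fold loop. It is a sketch under assumptions: the data is synthetic, `LogisticRegression` is an arbitrary stand-in model, and plain `KFold` is used for simplicity (for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so the numbers will not match it exactly).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the training portion of the data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    # A fresh, unfitted copy per fold so nothing carries over between folds.
    fold_model = clone(model)
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(fold_model.score(X_train[val_idx], y_train[val_idx]))

fold_scores = np.array(fold_scores)
print(fold_scores.mean(), fold_scores.std())
```

Each example is used for validation exactly once and for training in the other four rounds, which is what lets a small dataset yield a steadier estimate than a single split.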
How Much Data Should Go in Each Split
There’s no universally correct ratio, but a common practical range for small-to-medium datasets is 70-80% training, 10-15% validation, and 10-15% test. For very large datasets (millions of examples, common in deep learning), the validation and test sets can be a much smaller percentage while still being large enough in absolute terms. A 1% validation split on a 10-million-example dataset is still 100,000 examples, plenty for a reliable estimate, and it leaves proportionally more data for actual training, which tends to matter more at that scale than preserving a traditionally “generous” validation percentage. The right split size is ultimately about ensuring each split is large enough to give a statistically reliable estimate of performance for its purpose, not about hitting a specific textbook ratio.
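A rough way to judge whether a split is "large enough" is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below assumes independent examples and an illustrative true accuracy of 90%; the helper name `accuracy_standard_error` is hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimate from n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


for n in (150, 1_500, 100_000):
    se = accuracy_standard_error(0.90, n)
    # An approximate 95% interval half-width is 1.96 standard errors.
    print(f"n={n:>7}: half-width of 95% interval is about {1.96 * se:.3%}")
```

The uncertainty shrinks with the square root of the number of examples, so the absolute size of the evaluation set matters rather than its percentage of the whole dataset.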
Summary
| Split | Purpose | Used for |
|---|---|---|
| Training set | Model learns from this | Weight updates via gradient descent |
| Validation set | Guides your decisions during development | Hyperparameter tuning, early stopping |
| Test set | Final, honest performance estimate | Used exactly once, at the end |
The specific model architecture you choose matters far less than most beginners assume — a well-architected model evaluated on a leaky, poorly-split dataset produces numbers that are actively misleading, while even a simple model evaluated correctly gives you an honest signal to actually build on.