Train, Validation, and Test Split
One of the most common mistakes in machine learning is evaluating a model on the data it was trained on. Splitting data properly into train, validation, and test sets is the foundation of honest model evaluation. Get this wrong and your performance numbers mean nothing.
The Three Sets
```
All labeled data:
├── Training set (60–70%): Fit the model
├── Validation set (15–20%): Tune hyperparameters, select models
└── Test set (15–20%): Final evaluation (touch once, report once)
```

Why three sets? Every time you check validation performance and make a decision based on it, you’re indirectly using that data to train. After dozens of experiments, your model is implicitly tuned to the validation set. The test set is the only genuinely unseen evaluation.
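The tuning effect can be simulated without any real model. The sketch below is an illustration under stated assumptions, not a benchmark: it invents 50 hypothetical "models" that guess labels at random (true accuracy 50% by construction), scores each on a validation set and a test set of 200 random labels, and then picks the one with the best validation score. The names `n_models`, `n_samples`, and `accuracy` are made up for this example.

```python
import random

random.seed(0)

n_models = 50    # hypothetical number of experiments
n_samples = 200  # size of each evaluation set (assumption)

# Random binary labels for the validation and test sets.
y_val = [random.randint(0, 1) for _ in range(n_samples)]
y_test = [random.randint(0, 1) for _ in range(n_samples)]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Each "model" is a coin flip, so its true accuracy is 0.5.
results = []
for _ in range(n_models):
    val_pred = [random.randint(0, 1) for _ in range(n_samples)]
    test_pred = [random.randint(0, 1) for _ in range(n_samples)]
    results.append((accuracy(y_val, val_pred), accuracy(y_test, test_pred)))

# Select the model that looks best on validation, as a tuning loop would.
best_val_acc, best_test_acc = max(results, key=lambda r: r[0])
mean_val_acc = sum(r[0] for r in results) / n_models

print(f"Mean validation accuracy:    {mean_val_acc:.3f}")
print(f"Best validation accuracy:    {best_val_acc:.3f}")
print(f"Test accuracy of that model: {best_test_acc:.3f}")
```

The best validation score is the maximum of many noisy scores, so it tends to sit above the true 0.5 even though no model is actually better than chance. The test score of the selected model played no part in the selection, which is why it remains an unbiased estimate.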
Basic Split with sklearn
```python
from sklearn.model_selection import train_test_split

# Two-step split: first separate test, then split remainder into train/val
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y,
    test_size=0.15,
    random_state=42,
    stratify=y,  # stratify preserves class ratios
)

X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval,
    test_size=0.18,  # 0.18 × 0.85 ≈ 0.15 of original
    random_state=42,
    stratify=y_trainval,
)

print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
```

Stratified Splits
For classification, each split should preserve the class distribution:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Without stratify:
# Train might have 80% class A, 20% class B
# Test might have 70% class A, 30% class B, which biases the evaluation

# With stratify=y:
# Both train and test maintain the original 75%/25% ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Verify: print the class proportions in each split
print(pd.Series(y_train).value_counts(normalize=True))
print(pd.Series(y_test).value_counts(normalize=True))
```

Time Series Splits
For temporal data, random splits leak future information into the past:
```python
from sklearn.model_selection import TimeSeriesSplit

# Correct: maintain temporal ordering
# (X is assumed to be an array with rows sorted from oldest to newest)
n = len(X)
train_end = int(0.7 * n)
val_end = int(0.85 * n)

X_train = X[:train_end]
X_val = X[train_end:val_end]
X_test = X[val_end:]

# sklearn's TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
```

Preventing Data Leakage
Data leakage happens when information from the test set contaminates training — producing optimistic performance estimates that fail in production.
A common source of leakage is preprocessing that is fit before the split:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# `...` below stands in for the remaining split arguments

# BAD: Fit scaler on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Test stats leak into training
X_train, X_test = train_test_split(X_scaled, ...)

# GOOD: Fit only on training data
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on train
X_test = scaler.transform(X_test)  # Apply to test (no fitting)

# BEST: Use a Pipeline to automate this
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
# Pipeline correctly fits scaler only on training fold in cross-validation
```

How Much Data for Each Set
| Dataset size | Train | Validation | Test |
|---|---|---|---|
| Small (<1k) | 60% | 20% | 20% |
| Medium (1k–100k) | 70% | 15% | 15% |
| Large (100k–1M) | 80–90% | 5–10% | 5–10% |
| Very large (>1M) | 95–98% | 1–2% | 1–2% |
For very large datasets, even a 1% test set gives ten thousand examples or more, which is plenty for reliable evaluation.
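To put a rough number on "reliable", one common approximation treats each test prediction as an independent Bernoulli trial, so an accuracy p measured on n examples has a standard error of sqrt(p(1 − p)/n). The sketch below assumes that approximation and a hypothetical accuracy of 0.90; `accuracy_margin` is an illustrative helper, not a library function.

```python
import math


def accuracy_margin(p, n, z=1.96):
    """Approximate 95% margin of error for accuracy p measured on n examples.

    Uses the normal approximation to the binomial and assumes
    the test examples are independent.
    """
    return z * math.sqrt(p * (1 - p) / n)


p = 0.90  # hypothetical accuracy, chosen only for illustration
for n in (200, 1_500, 10_000, 20_000):
    print(f"n = {n:>6}: {p:.2f} ± {accuracy_margin(p, n):.4f}")
```

Under these assumptions, 200 test examples leave a margin of roughly ±4 percentage points, while 10,000 examples narrow it to about ±0.6 points. That is why a small percentage is enough once the dataset is very large, and why small datasets need a larger test share.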
The Test Set Rule
Never make decisions based on test set performance until final reporting.
If you run any experiment that uses test set feedback, even a single early look, your test set has become a validation set. In that case, reserve a fresh holdout that no one sees until the project is complete. This discipline is what separates results that only look good internally from results that hold up in production.