Train, Validation, and Test Split: Honest Model Evaluation

Understand train-validation-test splits — why you need three sets, typical split ratios, stratified splits, holdout sets, and avoiding data leakage in ML evaluation.

Train, Validation, and Test Split

The most common mistake in machine learning is evaluating a model on data it was trained on. Splitting data properly — into train, validation, and test sets — is the foundation of honest model evaluation. Get this wrong and your performance numbers mean nothing.


The Three Sets

All labeled data:
├── Training set (60–70%): Fit the model
├── Validation set (15–20%): Tune hyperparameters, select models
└── Test set (15–20%): Final evaluation — touch once, report once

Why three sets? Every time you check validation performance and make a decision based on it, you’re indirectly using that data to train. After dozens of experiments, your model is implicitly tuned to the validation set. The test set is the only genuinely unseen evaluation.


Basic Split with sklearn

from sklearn.model_selection import train_test_split
# Two-step split: first separate test, then split remainder into train/val
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.15, random_state=42, stratify=y # stratify preserves class ratios
)
X_train, X_val, y_train, y_val = train_test_split(
X_trainval, y_trainval, test_size=0.18, # 0.18 × 0.85 ≈ 0.15 of original
random_state=42, stratify=y_trainval
)
print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")

Stratified Splits

For classification, each split should preserve the class distribution:

import pandas as pd
from sklearn.model_selection import train_test_split
# Without stratify:
# Train might have 80% class A, 20% class B
# Test might have 70% class A, 30% class B — biased evaluation
# With stratify=y:
# Both train and test maintain the original 75%/25% ratio
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Verify
pd.Series(y_train).value_counts(normalize=True)
pd.Series(y_test).value_counts(normalize=True)

Time Series Splits

For temporal data, random splits leak future information into the past:

# Correct: maintain temporal ordering
n = len(data)
train_end = int(0.7 * n)
val_end = int(0.85 * n)
X_train = X[:train_end]
X_val = X[train_end:val_end]
X_test = X[val_end:]
# sklearn's TimeSeriesSplit for cross-validation
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
X_fold_train, X_fold_val = X[train_idx], X[val_idx]

Preventing Data Leakage

Data leakage happens when information from the test set contaminates training — producing optimistic performance estimates that fail in production.

Common sources of leakage:

# BAD: Fit scaler on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Test stats leak into training
X_train, X_test = train_test_split(X_scaled, ...)
# GOOD: Fit only on training data
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # Fit on train
X_test = scaler.transform(X_test) # Apply to test (no fitting)
# BEST: Use a Pipeline to automate this
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
# Pipeline correctly fits scaler only on training fold in cross-validation

How Much Data for Each Set

Dataset sizeTrainValidationTest
Small (<1k)60%20%20%
Medium (1k–100k)70%15%15%
Large (>100k)80–90%5–10%5–10%
Very large (>1M)95–98%1–2%1–2%

For very large datasets, even 1% test set gives tens of thousands of examples — plenty for reliable evaluation.


The Test Set Rule

Never make decisions based on test set performance until final reporting.

If you run any experiment that uses test set feedback — even a single early look — your test set has become a validation set. Reserve a fresh holdout that no one sees until the project is complete. This discipline is what separates internally valid experiments from published results that hold up in production.