Cross-Validation: Reliable Generalization Estimation in Machine Learning

Master cross-validation — k-fold, stratified k-fold, leave-one-out, nested CV, time series CV, and when cross-validation gives more reliable estimates than a single split.

Cross-Validation

A single train-validation split produces a noisy performance estimate — the number changes depending on which 20% of data ended up in the validation set. Cross-validation addresses this by training and evaluating the model multiple times on different partitions, then averaging the results.
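To make that noise concrete, the sketch below repeats an 80/20 split with 20 different seeds and compares the spread of those single-split scores with the spread of 5-fold means. The synthetic dataset from `make_classification`, the logistic-regression model, and the number of seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Assumed toy data: 300 samples, 10 features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

single_split_scores = []
cv_mean_scores = []
for seed in range(20):
    # One 80/20 split: the score depends on which samples land in validation
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    single_split_scores.append(model.fit(X_tr, y_tr).score(X_val, y_val))

    # 5-fold CV: every sample is used for validation exactly once
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    cv_mean_scores.append(cross_val_score(model, X, y, cv=kf).mean())

print(f"Single split: std across seeds = {np.std(single_split_scores):.4f}")
print(f"5-fold mean:  std across seeds = {np.std(cv_mean_scores):.4f}")
```

On data like this the 5-fold means should vary noticeably less from seed to seed than the single-split scores, which is the effect described above.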


K-Fold Cross-Validation

The most common approach: split the data into K folds of (roughly) equal size, train on K-1 folds and validate on the remaining one, then rotate so that each fold serves as the validation set exactly once:

K=5 folds:
Fold 1: [VAL][TR ][TR ][TR ][TR ] → score₁
Fold 2: [TR ][VAL][TR ][TR ][TR ] → score₂
Fold 3: [TR ][TR ][VAL][TR ][TR ] → score₃
Fold 4: [TR ][TR ][TR ][VAL][TR ] → score₄
Fold 5: [TR ][TR ][TR ][TR ][VAL] → score₅
Final estimate: mean(score₁...score₅) ± std(score₁...score₅)

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

# X, y: feature matrix and labels, assumed to be loaded already
model = RandomForestClassifier(n_estimators=100, random_state=42)
# shuffle=True randomizes fold membership; random_state makes it reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
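For readers who want to see what the rotation does under the hood, here is a minimal sketch that writes the K-fold loop by hand and checks it against `cross_val_score`. The toy dataset and the logistic-regression model are assumptions made to keep the example fast and self-contained.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed toy data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    # Accuracy on the one fold that was held out of training
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print("Manual loop:    ", manual_scores)
print("cross_val_score:", library_scores)
```

Because the splitter has a fixed integer `random_state`, both approaches see the same folds and should report the same five scores.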

Stratified K-Fold

Preserves class proportions in each fold — critical for imbalanced datasets:

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

Always use StratifiedKFold for classification unless you have a specific reason not to.
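The difference is easy to see by counting minority-class samples per validation fold. The sketch below uses an assumed label vector with 90 negatives and 10 positives; the features are placeholders because only the labels matter for splitting.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumed imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features, irrelevant for splitting


def positives_per_fold(splitter):
    """Count positive samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", stratified)
```

With plain `KFold` the count of positives per fold depends on the shuffle, while `StratifiedKFold` places two positives in each of the five folds.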


Cross-Validate Multiple Metrics

from sklearn.model_selection import cross_validate, StratifiedKFold

results = cross_validate(
    model, X, y,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro', 'roc_auc_ovr'],
    return_train_score=True  # Also report train scores to detect overfitting
)
for metric in ['accuracy', 'f1_macro', 'roc_auc_ovr']:
    test, train = results[f'test_{metric}'], results[f'train_{metric}']
    # A train score far above the test score suggests overfitting
    print(f"{metric}: test {test.mean():.4f} ± {test.std():.4f}, train {train.mean():.4f}")

Nested Cross-Validation

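When you want to both tune hyperparameters and estimate generalization performance, use nested CV. If the same folds are used to pick the best hyperparameters and to report the score, that score is optimistically biased, because the winning configuration was chosen precisely for doing well on those folds: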

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Inner CV: hyperparameter tuning
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# Outer CV: performance estimation on data the search never saw
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, None]}
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=inner_cv, scoring='roc_auc'
)
# Outer loop scores the full procedure (search + refit), not one fixed model
outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring='roc_auc')
print(f"Nested CV AUC: {outer_scores.mean():.4f} ± {outer_scores.std():.4f}")
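To compare the two estimates side by side, the sketch below reports both the non-nested `best_score_` of a grid search and the nested CV mean for the same search. The toy dataset, the decision-tree model, and the small grid are assumptions chosen so the example runs quickly; they are not the configuration used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data with few informative features, so tuning can overfit the folds
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 5, None], "min_samples_leaf": [1, 5, 10]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Non-nested: score of the winning configuration on the folds that selected it
search.fit(X, y)
non_nested = search.best_score_

# Nested: each outer fold reruns the whole search on its own training part
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"Non-nested best_score_: {non_nested:.4f}")
print(f"Nested CV mean:         {nested_scores.mean():.4f}")
```

The non-nested number tends to be the more optimistic of the two, although on a single small dataset the gap can be small or even reversed by chance.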

Leave-One-Out Cross-Validation (LOOCV)

K = N (each sample is its own validation fold):

from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print(f"LOOCV Accuracy: {scores.mean():.4f}")

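LOOCV trains on N-1 samples per fit, so it gives the lowest-bias estimate, but it is computationally expensive (N model fits) and the estimate tends to have high variance. Each fold scores a single sample, so for accuracy every per-fold score is 0 or 1 and only the mean is meaningful. Use it when the dataset is small (<200 samples) and holding out a larger fold would leave too little data for training.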
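A small sketch of these properties, assuming a 40-sample toy dataset and a logistic-regression model: LOOCV performs one fit per sample, each per-fold accuracy is either 0 or 1, and the result matches `KFold` with `n_splits` set to the number of samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Assumed toy data: 40 samples
X, y = make_classification(n_samples=40, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
# Unshuffled KFold with one sample per fold visits the same splits in the same order
kfold_n_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=len(X)), scoring="accuracy"
)

print("Number of fits:", len(loo_scores))
print("Distinct per-fold scores:", np.unique(loo_scores))
print(f"LOOCV accuracy: {loo_scores.mean():.4f}")
```

Since each fold contributes a 0 or a 1, the standard deviation across folds says little here; the mean is the quantity to report.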


Time Series Cross-Validation

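from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5, gap=0)
# Expanding window: earlier data in train, later data in validation
# (X and y are assumed to be NumPy arrays sorted by time)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train_fold = X[train_idx]
    X_val_fold = X[val_idx]
    print(f"Fold {fold}: Train {train_idx[0]}-{train_idx[-1]}, Val {val_idx[0]}-{val_idx[-1]}")

# neg_mean_squared_error is a regression metric, so a regressor is assumed here
reg = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(reg, X, y, cv=tscv, scoring='neg_mean_squared_error')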
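The effect of the `gap` parameter is easiest to see on a tiny index array. The sketch below assumes 12 time-ordered samples and prints the train and validation ranges for `gap=0` and `gap=2`; the sample count and gap values are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples (assumed toy data)

splits_by_gap = {}
for gap in (0, 2):
    tscv = TimeSeriesSplit(n_splits=3, gap=gap)
    splits_by_gap[gap] = list(tscv.split(X))
    print(f"gap={gap}")
    for fold, (train_idx, val_idx) in enumerate(splits_by_gap[gap]):
        # Training indices always come before validation indices
        print(
            f"  Fold {fold}: train {train_idx[0]}-{train_idx[-1]}, "
            f"val {val_idx[0]}-{val_idx[-1]}"
        )
```

A non-zero gap drops the samples immediately before each validation block from training, which can help when features are built from lagged values that would otherwise overlap the validation period.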

K-Fold with Pipeline (Leak-Safe)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Fitted only on train fold in each iteration
    ('model', RandomForestClassifier(n_estimators=100))
])
scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(5), scoring='accuracy')

Wrapping preprocessing in a Pipeline ensures that feature scaling (and any other fitted preprocessing step) is fit on the training folds only, so statistics from the validation fold never leak into training.
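Leakage from scaling is usually subtle, so the sketch below swaps in a different preprocessing step, supervised feature selection with `SelectKBest`, where the effect is much easier to see. The data are pure noise with random labels (an assumption made for the demonstration), so an honest estimate should sit near chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels independent of X
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are selected using labels from ALL samples,
# including those that later serve as validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv)

# Leak-safe: selection is refit inside each training fold
safe_pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_pipeline, X, y, cv=cv)

print(f"Leaky accuracy:     {leaky_scores.mean():.4f}")
print(f"Leak-safe accuracy: {safe_scores.mean():.4f}")
```

The leaky variant typically reports accuracy well above chance even though there is no signal to learn, while the pipeline version stays close to 0.5.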


Choosing K

K value   Bias       Variance   Compute
K=3       High       Low        Cheap (3 fits)
K=5       Medium     Medium     Moderate (5 fits)
K=10      Low        Medium     2× K=5 (10 fits)
LOOCV     Very low   High       N/5 × K=5 (N fits)

K=5 is a sensible default for most problems. Use K=10 when the dataset is small enough that the extra fits are affordable, and K=3 when training a single model is expensive (for example, deep learning).
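The compute and bias trade-off follows from simple arithmetic: K-fold CV needs K fits, and each fit trains on a fraction (K-1)/K of the data. The helper below, `cv_cost`, is a hypothetical function written for this illustration, and the dataset size of 1000 is an assumption.

```python
def cv_cost(n_samples: int, k: int) -> dict:
    """Return the fit count and training-set size for k-fold CV."""
    val_size = n_samples / k
    return {
        "fits": k,
        "train_size": n_samples - val_size,
        "train_fraction": 1 - 1 / k,
    }


n = 1000  # assumed dataset size
for label, k in [("K=3", 3), ("K=5", 5), ("K=10", 10), ("LOOCV", n)]:
    cost = cv_cost(n, k)
    print(
        f"{label:>5}: {cost['fits']:>4} fits, "
        f"trains on {cost['train_fraction']:.1%} of the data"
    )
```

A smaller training fraction means each fold's model sees less data than the final model will, which is where the pessimistic bias at low K comes from.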