Cross-Validation
A single train-validation split produces a noisy performance estimate — the number changes depending on which 20% of data ended up in the validation set. Cross-validation addresses this by training and evaluating the model multiple times on different partitions, then averaging the results.
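To make the noise concrete, here is a minimal sketch that repeats a single 80/20 split with ten different seeds and reports how far the validation accuracy moves. The synthetic dataset, the logistic regression model, and the names `X_demo`, `y_demo`, and `split_scores` are assumptions chosen for illustration; they do not come from the text above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute your own X, y).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

split_scores = []
for seed in range(10):
    # Same data, same model; only the random 80/20 partition changes.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_val, y_val))

split_scores = np.array(split_scores)
print(f"min={split_scores.min():.3f}  max={split_scores.max():.3f}  "
      f"spread={split_scores.max() - split_scores.min():.3f}")
```

The spread between the best and worst split is variation that comes from the partition alone, which is what averaging over folds is meant to smooth out.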
K-Fold Cross-Validation
The most common approach: split the data into K folds of roughly equal size, train on K-1 folds and validate on the remaining one, then rotate so that each fold serves as the validation set exactly once:
    K=5 folds:
    Fold 1: [VAL][TR ][TR ][TR ][TR ] → score₁
    Fold 2: [TR ][VAL][TR ][TR ][TR ] → score₂
    Fold 3: [TR ][TR ][VAL][TR ][TR ] → score₃
    Fold 4: [TR ][TR ][TR ][VAL][TR ] → score₄
    Fold 5: [TR ][TR ][TR ][TR ][VAL] → score₅

    Final estimate: mean(score₁...score₅) ± std(score₁...score₅)

The same procedure in scikit-learn:

    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    print(f"CV Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

Stratified K-Fold
StratifiedKFold preserves the class proportions of the full dataset in each fold, which is critical for imbalanced datasets:
    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
    print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

Always use StratifiedKFold for classification unless you have a specific reason not to.
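The effect of stratification is easiest to see by counting minority-class samples per validation fold. The sketch below uses a hypothetical label vector with 90 negatives and 10 positives; `y_imb`, `X_imb`, and `positives_per_fold` are illustrative names, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # feature values do not affect the split


def positives_per_fold(splitter):
    """Count positive-class samples in each validation fold."""
    return [int(y_imb[val_idx].sum()) for _, val_idx in splitter.split(X_imb, y_imb)]


plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", strat)
```

With 10 positives and 5 folds, StratifiedKFold places two positives in every validation fold, while plain KFold distributes them however the shuffle happens to fall. A fold with very few positives makes metrics such as recall or AUC unstable, and a fold with none can make them undefined.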
Cross-Validate Multiple Metrics
    from sklearn.model_selection import cross_validate

    results = cross_validate(
        model, X, y,
        cv=StratifiedKFold(5, shuffle=True, random_state=42),
        scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro', 'roc_auc_ovr'],
        return_train_score=True,  # Also report train score to detect overfitting
    )

    for metric in ['test_accuracy', 'test_f1_macro', 'test_roc_auc_ovr']:
        m = results[metric]
        print(f"{metric}: {m.mean():.4f} ± {m.std():.4f}")

Nested Cross-Validation
When you want to both tune hyperparameters and get an unbiased generalization estimate, you need nested CV. Without it, the folds used to pick the best hyperparameters are the same folds used to report the score, so the reported score is optimistically biased:
    from sklearn.model_selection import GridSearchCV, cross_val_score

    # Inner CV: hyperparameter tuning
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    # Outer CV: unbiased performance estimation
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, None]}
    inner_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid, cv=inner_cv, scoring='roc_auc'
    )

    # Outer loop gives unbiased estimate of the full pipeline (search + fit)
    outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring='roc_auc')
    print(f"Unbiased AUC: {outer_scores.mean():.4f} ± {outer_scores.std():.4f}")

Leave-One-Out Cross-Validation (LOOCV)
K = N (each sample is its own validation fold):
    from sklearn.model_selection import LeaveOneOut

    loo = LeaveOneOut()
    scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
    print(f"LOOCV Accuracy: {scores.mean():.4f}")

LOOCV gives the lowest-bias estimate, because every model trains on N-1 samples, but it is computationally expensive (N model fits) and the estimate can have high variance. Use it when the dataset is small (<200 samples) and holding out a larger fold would leave too little data for training.
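A small sketch of what LOOCV produces in practice, assuming a synthetic 40-sample dataset and a logistic regression model (both chosen only to keep the example fast; `X_small`, `y_small`, and `loo_scores` are illustrative names). It shows that the number of fits equals N and that each per-fold accuracy is either 0 or 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption; substitute your own data).
X_small, y_small = make_classification(n_samples=40, n_features=5, random_state=0)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X_small)  # one model fit per sample

loo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_small, y_small, cv=loo, scoring="accuracy"
)

print(f"Model fits: {n_fits}")
print(f"Distinct per-fold scores: {np.unique(loo_scores)}")
print(f"LOOCV accuracy: {loo_scores.mean():.4f}")
```

Because every validation fold holds a single sample, the per-fold standard deviation is fully determined by the mean and says little about the stability of the estimate, which is why only the mean is reported for LOOCV.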
Time Series Cross-Validation
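Shuffled K-fold lets a model train on observations that come after the ones it is validated on. TimeSeriesSplit keeps temporal order, assuming the rows of X are sorted by time:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import TimeSeriesSplit

    tscv = TimeSeriesSplit(n_splits=5, gap=0)

    # Always expanding window: earlier data in train, later data in validation
    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
        X_train_fold = X[train_idx]  # assumes X is a NumPy array; use X.iloc[...] for a DataFrame
        X_val_fold = X[val_idx]
        print(f"Fold {fold}: Train {train_idx[0]}–{train_idx[-1]}, Val {val_idx[0]}–{val_idx[-1]}")

    # neg_mean_squared_error is a regression metric, so this assumes y is a
    # continuous target and uses a regressor instead of the classifier above
    reg = RandomForestRegressor(n_estimators=100, random_state=42)
    scores = cross_val_score(reg, X, y, cv=tscv, scoring='neg_mean_squared_error')

K-Fold with Pipeline (Leak-Safe)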
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Fitted only on train fold in each iteration
        ('model', RandomForestClassifier(n_estimators=100)),
    ])

    scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(5), scoring='accuracy')

Wrapping preprocessing in a Pipeline ensures that feature scaling (and any other fitted preprocessing step) is fitted only on the training folds, so validation-fold statistics never leak into training.
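Leakage from scaling is usually subtle, so the sketch below swaps in supervised feature selection, where the effect is much easier to see. It is an illustration under stated assumptions, not the setup used above: the data are pure noise with labels independent of the features, so any accuracy well above 50% comes from leakage. The names `X_noise`, `y_noise`, `leaky_scores`, and `safe_scores` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise features
y_noise = rng.integers(0, 2, size=100)   # labels independent of the features

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every label, including those later used for validation.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=cv
)

# Leak-safe: the selector is refit on the training folds of each iteration.
safe_pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_pipeline, X_noise, y_noise, cv=cv)

print(f"Selection outside CV: {leaky_scores.mean():.3f}")
print(f"Selection inside Pipeline: {safe_scores.mean():.3f}")
```

The leaky variant reports an inflated accuracy on data that contain no signal, while the pipeline variant stays near chance. The same mechanism applies, more mildly, to scalers, imputers, and encoders fitted before splitting.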
Choosing K
| K value | Bias | Variance | Compute |
|---|---|---|---|
| K=3 | High | Low | Cheap |
| K=5 | Medium | Medium | Moderate |
| K=10 | Low | Medium | 2× K=5 |
| LOOCV | Very low | High | (N/5)× K=5 |
K=5 is a reasonable default for most problems. Use K=10 when the dataset is small enough and compute is available, and K=3 when training a single model is expensive (for example, deep learning).
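One way to sanity-check the choice on your own data is to run the same model with several values of K and compare the number of fits, the fraction of data each model trains on, and the fold-to-fold spread. The sketch below assumes a synthetic dataset and a logistic regression model; `X_k`, `y_k`, and `summary` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; substitute your own X, y).
X_k, y_k = make_classification(n_samples=200, n_features=10, random_state=0)

summary = {}
for k in (3, 5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_k, y_k, cv=cv)
    train_frac = (k - 1) / k  # share of the data each model is trained on
    summary[k] = (len(scores), train_frac, scores.mean(), scores.std())
    print(f"K={k:>2}: fits={len(scores):>2}  train fraction={train_frac:.2f}  "
          f"accuracy={scores.mean():.3f} ± {scores.std():.3f}")
```

Larger K trains each model on more of the data, which reduces the pessimistic bias of the estimate, at the cost of proportionally more fits. The spread across folds on a single run is only a rough indicator, since validation folds shrink as K grows.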