Cross-Validation

A single train-validation split produces a noisy performance estimate — the number changes depending on which 20% of data ended up in the validation set. Cross-validation addresses this by training and evaluating the model multiple times on different partitions, then averaging the results.

K-Fold Cross-Validation

The most common approach: split the data into K roughly equal folds, train on K-1 of them, validate on the remaining one, and rotate until every fold has served as the validation set once:

K=5 folds:
Fold 1: [VAL][TR ][TR ][TR ][TR ]  → score₁
Fold 2: [TR ][VAL][TR ][TR ][TR ]  → score₂
Fold 3: [TR ][TR ][VAL][TR ][TR ]  → score₃
Fold 4: [TR ][TR ][TR ][VAL][TR ]  → score₄
Fold 5: [TR ][TR ][TR ][TR ][VAL]  → score₅

Final estimate: mean(score₁...score₅) ± std(score₁...score₅)

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
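To make the rotation concrete, the sketch below runs by hand the loop that cross_val_score performs for you. The synthetic dataset from make_classification and the LogisticRegression model are assumptions, chosen only to keep the example self-contained and fast.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption): 200 samples, 10 features, binary target
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    fold_model = clone(model)                    # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on K-1 folds
    # score() on a classifier returns accuracy on the held-out fold
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The same splitter handed to cross_val_score, for comparison
auto_scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f"Manual loop:     {manual_scores.mean():.4f} ± {manual_scores.std():.4f}")
print(f"cross_val_score: {auto_scores.mean():.4f} ± {auto_scores.std():.4f}")
```

Because the splitter has a fixed integer random_state, both runs see identical folds and the two lines should agree.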

Stratified K-Fold

StratifiedKFold preserves the class proportions of the full dataset in each fold, which is critical for imbalanced datasets:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

Always use StratifiedKFold for classification unless you have a specific reason not to.
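A quick way to see what stratification buys you is to compare the positive-class rate in each validation fold under both splitters. The toy label vector below (90 negatives, 10 positives) is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (assumption): 10% positive class
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values are irrelevant to how folds are drawn

def positive_rates(splitter):
    """Fraction of positive samples in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]

plain = positive_rates(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = positive_rates(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold positive rate per fold:          ", plain)
print("StratifiedKFold positive rate per fold:", stratified)
```

With StratifiedKFold every fold holds 2 positives out of 20 samples, so each rate is 0.1. Plain KFold lets the rate drift from fold to fold, and on rarer classes a fold can end up with no positives at all.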

Cross-Validate Multiple Metrics

from sklearn.model_selection import cross_validate

results = cross_validate(
    model, X, y,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro', 'roc_auc_ovr'],
    return_train_score=True  # Also report train score to detect overfitting
)

for metric in ['test_accuracy', 'test_f1_macro', 'test_roc_auc_ovr']:
    m = results[metric]
    print(f"{metric}: {m.mean():.4f} ± {m.std():.4f}")
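The train scores requested with return_train_score=True are most useful when read side by side with the test scores. The sketch below prints the gap for each metric. The noisy synthetic dataset and the unpruned DecisionTreeClassifier are assumptions picked to make overfitting visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption) with label noise, so a deep tree can memorize it
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# Unpruned tree (assumption): fits the training folds almost perfectly
model = DecisionTreeClassifier(random_state=0)

results = cross_validate(
    model, X, y,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring=['accuracy', 'f1_macro'],
    return_train_score=True,
)

# Keys follow the pattern train_<metric> / test_<metric>
for metric in ['accuracy', 'f1_macro']:
    train = results[f'train_{metric}'].mean()
    test = results[f'test_{metric}'].mean()
    print(f"{metric}: train {train:.4f}, test {test:.4f}, gap {train - test:.4f}")
```

A large positive gap suggests the model is fitting noise in the training folds. A gap near zero with low scores on both sides points toward underfitting instead.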

Nested Cross-Validation

When you want to both tune hyperparameters and get an unbiased generalization estimate, you need nested CV. Without it, the validation folds that select the best hyperparameters also produce the reported score, so the estimate is optimistically biased:

from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner CV: hyperparameter tuning
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# Outer CV: unbiased performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, None]}
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=inner_cv, scoring='roc_auc'
)

# Outer loop gives unbiased estimate of the full pipeline (search + fit)
outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring='roc_auc')
print(f"Unbiased AUC: {outer_scores.mean():.4f} ± {outer_scores.std():.4f}")
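To see the two estimates side by side, and to check which hyperparameters each outer fold picked, cross_validate can return the fitted search objects via return_estimator=True. This is a sketch under assumptions: the synthetic data, the reduced grid, and the small forest size are chosen only so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic data (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    {'max_depth': [3, None]},   # deliberately tiny grid (assumption)
    cv=inner_cv, scoring='roc_auc',
)

# Non-nested: the score that selected the hyperparameters is also the one reported
search.fit(X, y)
non_nested_auc = search.best_score_

# Nested: each outer fold reruns the whole search on its own training portion
nested = cross_validate(search, X, y, cv=outer_cv, scoring='roc_auc',
                        return_estimator=True)
nested_auc = nested['test_score'].mean()

print(f"Non-nested AUC: {non_nested_auc:.4f}")
print(f"Nested AUC:     {nested_auc:.4f}")
for i, est in enumerate(nested['estimator']):
    print(f"Outer fold {i}: best params {est.best_params_}")
```

On any single dataset the difference between the two numbers can be small or even reversed. The optimism of the non-nested score is a tendency that grows with the size of the grid, not a guarantee. If the chosen parameters differ a lot between outer folds, that is a hint the search itself is unstable.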

Leave-One-Out Cross-Validation (LOOCV)

K = N (each sample is its own validation fold):

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print(f"LOOCV Accuracy: {scores.mean():.4f}")

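LOOCV gives the lowest-bias estimate because each model trains on N-1 samples, but it is computationally expensive (N model fits) and the estimate tends to have high variance. Use it when the dataset is small (<200 samples) and holding out a full fold would leave too little data for training.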
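The following sketch shows what LOOCV scores look like in practice. The 40-sample synthetic dataset and the LogisticRegression model are assumptions used to keep the N fits cheap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption)
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)   # one split, and therefore one model fit, per sample
print(f"Number of fits: {n_fits}")

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=loo, scoring='accuracy')

# Each validation fold holds a single sample, so every score is 0.0 or 1.0
print(f"Distinct fold scores: {np.unique(scores)}")
print(f"LOOCV Accuracy: {scores.mean():.4f}")
```

Because each fold score is a single hit or miss, only the mean is informative; the spread of the per-fold scores says little about stability. For the same reason, metrics that need more than one class in the validation fold, such as ROC AUC, cannot be computed fold by fold under LOOCV.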

Time Series Cross-Validation

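from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

# gap=0: validation starts right after training ends; raise it to skip samples
# between the two when features or targets overlap in time
tscv = TimeSeriesSplit(n_splits=5, gap=0)

# Always expanding window: earlier data in train, later data in validation
# (X is assumed to be a NumPy array sorted by time; use X.iloc[...] for a DataFrame)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train_fold = X[train_idx]
    X_val_fold = X[val_idx]
    print(f"Fold {fold}: Train {train_idx[0]}–{train_idx[-1]}, Val {val_idx[0]}–{val_idx[-1]}")

# neg_mean_squared_error is a regression metric, so use a regressor here
# (y is assumed continuous) rather than the classifier defined earlier
ts_model = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(ts_model, X, y, cv=tscv, scoring='neg_mean_squared_error')
print(f"CV MSE: {-scores.mean():.4f} ± {scores.std():.4f}")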
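The effect of the gap parameter is easiest to see on a tiny index range. The sketch below lists the train and validation indices for eight time-ordered samples (toy data, an assumption) with and without a gap.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered samples (toy data, assumption)
X = np.arange(8).reshape(-1, 1)

def show_splits(gap):
    """Return (train indices, validation indices) for each fold."""
    tscv = TimeSeriesSplit(n_splits=3, gap=gap)
    return [(train_idx.tolist(), val_idx.tolist())
            for train_idx, val_idx in tscv.split(X)]

no_gap = show_splits(gap=0)
with_gap = show_splits(gap=1)

for name, splits in [("gap=0", no_gap), ("gap=1", with_gap)]:
    print(name)
    for train_idx, val_idx in splits:
        print(f"  train {train_idx} -> val {val_idx}")

# gap=0
#   train [0, 1] -> val [2, 3]
#   train [0, 1, 2, 3] -> val [4, 5]
#   train [0, 1, 2, 3, 4, 5] -> val [6, 7]
# gap=1
#   train [0] -> val [2, 3]
#   train [0, 1, 2] -> val [4, 5]
#   train [0, 1, 2, 3, 4] -> val [6, 7]
```

The validation windows are identical in both cases. The gap only trims samples from the end of each training window, which helps when lagged features or multi-step targets would otherwise let training rows overlap with the validation period.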

K-Fold with Pipeline (Leak-Safe)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

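pipeline = Pipeline([
    ('scaler', StandardScaler()),    # Fitted only on train fold in each iteration
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Shuffled, seeded folds for reproducibility, matching the earlier examples
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")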

Wrapping preprocessing in a Pipeline ensures that feature scaling (and any other fitted preprocessing step) is refit on the training folds in every iteration, so validation-fold statistics never leak into training.
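What the Pipeline does inside each fold can be written out by hand: fit the scaler on the training fold only, then apply that same fitted scaler to the validation fold. The sketch below checks that the manual version matches cross_val_score on a Pipeline. The synthetic data and the LogisticRegression model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    # Mean and variance come from the training fold only
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(scaler.transform(X[train_idx]), y[train_idx])
    # The validation fold is transformed with the training-fold statistics
    manual_scores.append(clf.score(scaler.transform(X[val_idx]), y[val_idx]))

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipeline_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print("Manual:  ", np.round(manual_scores, 4))
print("Pipeline:", np.round(pipeline_scores, 4))
```

The leaky alternative is calling StandardScaler().fit_transform(X) once on the full dataset before cross-validating, which lets every validation fold contribute to the mean and variance used in training.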

Choosing K

K value	Bias	Variance	Compute
K=3	High	Low	Cheap
K=5	Medium	Medium	Moderate
K=10	Low	Medium	2× K=5
LOOCV	Very low	High	(N/5)× K=5

K=5 is a sensible default for most problems. Use K=10 when the dataset is small enough and compute is available. Use K=3 when training a single model is expensive (e.g., deep learning).
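The trade-off can be made concrete with simple arithmetic: K-fold runs K fits, each on roughly (K-1)/K of the data. The helper below, kfold_cost, is a hypothetical name introduced for illustration. It ignores that larger training sets also make each fit slower.

```python
def kfold_cost(n_samples, k):
    """Return (number of fits, approximate training-set size per fit)."""
    val_size = n_samples // k          # folds differ by at most one sample
    return k, n_samples - val_size

n = 1000  # example dataset size (assumption)
for label, k in [("K=3", 3), ("K=5", 5), ("K=10", 10), ("LOOCV", n)]:
    fits, train_size = kfold_cost(n, k)
    print(f"{label}: {fits} fits, ~{train_size} training samples per fit "
          f"({train_size / n:.1%} of the data)")
```

For n = 1000, K=5 trains on about 80% of the data in each of 5 fits, while LOOCV trains on 99.9% of it in each of 1000 fits. Smaller training sets are the source of the pessimistic bias at low K.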