Random Forests

Random forests address the main weakness of decision trees: high variance. A single decision tree is fragile: small changes in the training data can produce a very different tree. Random forests reduce that variance by training hundreds of trees on resampled data and aggregating their predictions. The result is a robust, accurate model that performs well on most tabular problems with little tuning.

The Core Idea: Bagging

Bootstrap Aggregation (Bagging): Train many models on bootstrap samples of the training data (random draws of N rows with replacement), then aggregate their predictions.

Original dataset (N samples):
  Sample 1: Bootstrap → Train Tree 1
  Sample 2: Bootstrap → Train Tree 2
  ...
  Sample k: Bootstrap → Train Tree k

Prediction: Majority vote (classification) or mean (regression)
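
The scheme above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements it: the synthetic dataset, the tree count, and all variable names are assumptions made for the example. Note that scikit-learn's forest averages class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (an assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
n = len(X_train)

trees = []
for _ in range(n_trees):
    # Bootstrap: draw n row indices with replacement
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Each row holds one tree's predicted labels for the test set
votes = np.stack([tree.predict(X_test) for tree in trees])

# Majority vote for 0/1 labels: predict 1 when more than half the trees say 1
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
print(f"Bagged accuracy: {bagged_acc:.3f}")
```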

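Each bootstrap sample draws N rows with replacement. The chance that a given row is never drawn is (1 − 1/N)^N ≈ 1/e ≈ 0.37, so roughly 63% of the original samples appear at least once in a given sample, and the remaining ~37% are “out-of-bag” (OOB) for that tree.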
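
The 63% / 37% split can be checked numerically. The sketch below draws a single bootstrap sample of row indices and compares the observed out-of-bag fraction with (1 − 1/N)^N; the sample size and variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: n row indices drawn with replacement
idx = rng.integers(0, n, size=n)

in_bag_fraction = np.unique(idx).size / n
oob_fraction = 1 - in_bag_fraction
expected_oob = (1 - 1 / n) ** n  # tends to 1/e as n grows

print(f"in-bag: {in_bag_fraction:.3f}, OOB: {oob_fraction:.3f}, theory: {expected_oob:.3f}")
```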

What Makes Random Forests Different from Plain Bagging

Random forests add a second source of randomness: feature subsampling at each split.

At each node, instead of considering all features, only a random subset (√(n_features) by default for classification) is considered when searching for the best split. This:

Decorrelates the trees (if one strong feature dominates, it can’t be chosen at every split of every tree)
Means each tree sees different data AND a different feature subset at every split
Produces diverse trees whose errors cancel out when aggregated
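
Why decorrelation matters can be made precise with a standard result for the average of correlated variables. As a simplifying assumption, suppose the B trees' predictions at a point x are identically distributed with variance σ² and pairwise correlation ρ (these symbols are introduced here for illustration and do not appear elsewhere in this article):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding trees shrinks only the second term. The first term is a floor set by the correlation between trees, and that correlation is what feature subsampling lowers.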

Training a Random Forest

from sklearn.ensemble import RandomForestClassifier
# Assumes X_train, X_test, y_train, y_test already exist (e.g. from train_test_split)

model = RandomForestClassifier(
    n_estimators=100,     # Number of trees (more = better, diminishing returns)
    max_features='sqrt',  # Features to consider at each split
    max_depth=None,       # Trees grow fully (bagging handles overfitting)
    min_samples_leaf=1,
    oob_score=True,       # Use out-of-bag samples for free validation
    n_jobs=-1,            # Parallelize across all CPU cores
    random_state=42
)

model.fit(X_train, y_train)

print(f"OOB Score: {model.oob_score_:.4f}")  # Free accuracy estimate
print(f"Test Score: {model.score(X_test, y_test):.4f}")
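
For a version that runs end to end, the sketch below builds a synthetic dataset with `make_classification` (an assumption; any tabular dataset would do) and compares the OOB score with 5-fold cross-validation on the training set. The two estimates usually land close together, though nothing guarantees an exact match.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for a real dataset (assumption)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=100, oob_score=True, n_jobs=-1, random_state=42
)
model.fit(X_train, y_train)

# Cross-validation refits a fresh clone of the model on each fold
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(f"OOB score:  {model.oob_score_:.4f}")
print(f"CV score:   {cv_scores.mean():.4f}")
print(f"Test score: {model.score(X_test, y_test):.4f}")
```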

Out-of-Bag (OOB) Evaluation

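The ~37% of samples left out of each tree’s bootstrap sample can be used to evaluate that tree for free, without a separate validation set. Each training sample is scored only by the trees that never saw it, so the OOB score is a reasonable, often slightly conservative, estimate of generalization performance.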

# OOB score is automatically calculated when oob_score=True
print(f"OOB accuracy: {model.oob_score_:.4f}")

# Per-sample OOB probabilities
oob_probs = model.oob_decision_function_  # shape: (n_samples, n_classes)
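
To see how the two attributes relate, the sketch below recomputes OOB accuracy from `oob_decision_function_` and compares it with `oob_score_`. The synthetic dataset and the names `forest` and `manual_oob_acc` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Row i averages class probabilities over only those trees
# whose bootstrap sample did not contain row i
oob_probs = forest.oob_decision_function_

# Most probable class per row, mapped back to the original labels
oob_pred = forest.classes_[np.argmax(oob_probs, axis=1)]
manual_oob_acc = (oob_pred == y).mean()

print(f"manual: {manual_oob_acc:.4f}  oob_score_: {forest.oob_score_:.4f}")
```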

Feature Importance

Random forests provide built-in feature importances: the mean decrease in impurity (MDI), averaged across all trees:

import pandas as pd
import matplotlib.pyplot as plt

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

ax = importance_df.head(15).plot(x='feature', y='importance', kind='barh', figsize=(10, 8))
ax.invert_yaxis()  # Put the most important feature at the top
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()

Caveat: MDI is computed on the training data and can be biased toward high-cardinality features. For more reliable importances, use permutation importance on held-out data:

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_importance = pd.Series(result.importances_mean, index=feature_names)
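
The difference between the two measures is easiest to see on data where the uninformative columns are known in advance. In the sketch below (synthetic data, hypothetical column names), the last two features are pure noise. MDI usually still gives them a small positive share, because fully grown trees end up splitting on them, while permutation importance on the test set tends to stay near zero.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative columns first; the last 2 are noise
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = ["inf_0", "inf_1", "inf_2", "noise_0", "noise_1"]  # hypothetical names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

# Side-by-side view: MDI sums to 1, permutation is a drop in test accuracy
comparison = pd.DataFrame(
    {"mdi": forest.feature_importances_, "permutation": result.importances_mean},
    index=feature_names,
)
print(comparison.round(3))
```

The two columns are on different scales, so compare the ranking within each column rather than the raw values across columns.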

Key Hyperparameters

Parameter	Default	Effect
`n_estimators`	100	More trees = lower variance; diminishing returns past ~200–500
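`max_features`	`sqrt`	Features tried per split; `sqrt` is the classifier default, `1/3` is a common heuristic for regression (scikit-learn’s regressor defaults to all features)
`max_depth`	None (full)	Limit to reduce memory and training time; usually a small effect on accuracy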
`min_samples_leaf`	1	Increase to smooth predictions and reduce overfitting
`class_weight`	None	Set `balanced` for imbalanced datasets
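
When the defaults are not enough, a small grid search over the parameters in the table is usually sufficient. The sketch below is one possible setup, assuming synthetic data and an arbitrary two-by-two grid; the grid values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, random_state=0
)

# Illustrative grid: a float max_features is read as a fraction of the features
param_grid = {
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```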

Random Forest vs. Gradient Boosting

	Random Forest	Gradient Boosting (XGBoost)
Training	Parallel (fast)	Sequential (slower)
Tuning effort	Low (robust defaults)	High (many hyperparameters)
Overfitting risk	Low	Higher
Tabular data performance	Very good	Slightly better on average
Interpretability	Feature importances, SHAP	Feature importances, SHAP

Random forests are typically the better starting point. Switch to gradient boosting when you need to squeeze out extra performance and are willing to tune.
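
A quick way to decide is to cross-validate both on your own data. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost to avoid an extra dependency, with default settings and synthetic data. Which model scores higher depends on the dataset, so the numbers it prints say nothing general.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(
        n_estimators=100, n_jobs=-1, random_state=0
    ),
    # Stand-in for XGBoost; both are gradient-boosted tree ensembles
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold accuracy for each model, both left at default hyperparameters
cv_means = {
    name: cross_val_score(estimator, X, y, cv=5).mean()
    for name, estimator in models.items()
}

for name, score in cv_means.items():
    print(f"{name}: {score:.4f}")
```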

Random Forests for Regression

from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(
    n_estimators=200,
    max_features=1/3,       # Common heuristic for regression (sklearn default: 1.0, all features)
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
reg.fit(X_train, y_train)

Predictions are the average of all tree outputs. The result is still piecewise constant, but averaging many trees makes the steps much finer, so the output is far smoother than that of a single tree.
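
The averaging can be verified directly through the fitted trees in `estimators_`. The sketch below assumes synthetic data from `make_regression`; the names `per_tree` and `manual_pred` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=9, noise=10.0, random_state=42)

reg = RandomForestRegressor(
    n_estimators=50, max_features=1/3, oob_score=True, random_state=42
)
reg.fit(X, y)

forest_pred = reg.predict(X[:5])

# One row of predictions per tree, then average across trees
per_tree = np.stack([tree.predict(X[:5]) for tree in reg.estimators_])
manual_pred = per_tree.mean(axis=0)

print(np.allclose(forest_pred, manual_pred))
print(f"OOB R^2: {reg.oob_score_:.4f}")  # for regressors, oob_score_ is R^2
```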