Random Forests: Ensemble Learning with Decision Tree Aggregation

Random Forests

Random forests address the main weakness of decision trees: high variance. A single decision tree is fragile: small changes in the training data can produce a very different tree. Random forests reduce that variance by training hundreds of trees on resampled data and aggregating their predictions. The result is a robust, accurate model that performs well on tabular data with little tuning.


The Core Idea: Bagging

Bootstrap Aggregation (Bagging): Train many models on different bootstrap samples of the training data (each the same size as the original, drawn with replacement), then aggregate their predictions.

Original dataset (N samples):
Sample 1: Bootstrap → Train Tree 1
Sample 2: Bootstrap → Train Tree 2
...
Sample k: Bootstrap → Train Tree k
Prediction: Majority vote (classification) or mean (regression)
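
The procedure above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements it: the synthetic dataset, the tree count of 51 (odd, to avoid tied votes), and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (labels 0/1), assumed purely for illustration.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 51  # odd, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw N row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: stack per-tree predictions, then take the most common label.
votes = np.stack([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)  # valid because labels are 0/1

bagged_accuracy = (bagged_pred == y_test).mean()
print(f"Bagged accuracy: {bagged_accuracy:.3f}")
```

Note that scikit-learn's forest classifiers average the trees' predicted class probabilities rather than counting hard votes; the hard vote here simply mirrors the diagram.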

Each bootstrap sample is drawn with replacement — roughly 63% of original samples appear at least once, and ~37% are “out-of-bag” (OOB).
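
The 63% / 37% split follows from the chance that a given row is never drawn in N draws with replacement, (1 - 1/N)^N, which approaches 1/e ≈ 0.368 as N grows. A quick numerical check, with the sample size chosen arbitrarily for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # arbitrary dataset size for the check

# One bootstrap sample: n draws with replacement from row indices 0..n-1.
sample = rng.integers(0, n, size=n)
in_bag_fraction = np.unique(sample).size / n

# Closed form: probability that a given row is never drawn.
expected_oob = (1 - 1 / n) ** n

print(f"In-bag fraction (simulated):  {in_bag_fraction:.3f}")
print(f"OOB fraction (closed form):   {expected_oob:.3f}")
print(f"Limit 1/e:                    {np.exp(-1):.3f}")
```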


What Makes Random Forests Different from Plain Bagging

Random forests add a second source of randomness: feature subsampling at each split.

At each node, instead of considering all features, the tree searches for the best split among a random subset of features, typically √(n_features) for classification. This:

  1. Decorrelates the trees (if one strong feature dominates, it cannot be chosen at every split of every tree)
  2. Means each tree sees different rows (its bootstrap sample) AND different candidate features at each split
  3. Produces diverse trees whose errors partially cancel out when aggregated
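
The per-split feature draw can be sketched as follows. `candidate_features` is a hypothetical helper written for this illustration; scikit-learn's actual splitter is implemented in Cython and differs in detail (for example, it keeps drawing features until it finds a valid split).

```python
import numpy as np

def candidate_features(n_features, rng):
    """Return the random subset of feature indices considered at one split."""
    k = max(1, int(np.sqrt(n_features)))  # sqrt rule, at least one feature
    return rng.choice(n_features, size=k, replace=False)

rng = np.random.default_rng(0)
n_features = 16
for node in range(3):
    subset = sorted(candidate_features(n_features, rng).tolist())
    print(f"node {node}: candidate features {subset}")
```

Each node gets a fresh draw, so a dominant feature is only sometimes available, which gives weaker features a chance to shape the tree.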

Training a Random Forest

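from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Example data (an assumption for illustration; substitute your own X and y)
data = load_breast_cancer()
feature_names = data.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)

model = RandomForestClassifier(
    n_estimators=100,      # Number of trees (more lowers variance, with diminishing returns)
    max_features='sqrt',   # Features to consider at each split
    max_depth=None,        # Trees grow fully (averaging across trees controls the variance)
    min_samples_leaf=1,    # Minimum samples per leaf (1 = fully grown trees)
    oob_score=True,        # Use out-of-bag samples for free validation
    n_jobs=-1,             # Parallelize across all CPU cores
    random_state=42,
)
model.fit(X_train, y_train)
print(f"OOB Score: {model.oob_score_:.4f}")  # Accuracy estimated from out-of-bag samples
print(f"Test Score: {model.score(X_test, y_test):.4f}")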
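
To see the variance reduction in practice, one option is to cross-validate a single tree against a forest on the same data. The sketch below assumes scikit-learn's bundled breast cancer dataset purely for illustration; the exact scores depend on the data and the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (an assumption, not part of the original walkthrough).
X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy for a single tree and for a forest.
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5
)

print(f"Single tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"Random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```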

Out-of-Bag (OOB) Evaluation

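Each tree leaves out the ~37% of samples that were not drawn into its bootstrap sample. Every training sample can therefore be scored using only the trees that never saw it, which gives a validation estimate for free, without a separate validation set. The OOB score is approximately unbiased as an estimate of generalization performance and, if anything, tends to be slightly pessimistic, because each sample is predicted by only a fraction of the trees.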

# OOB score is automatically calculated when oob_score=True
print(f"OOB accuracy: {model.oob_score_:.4f}")
# Per-sample OOB probabilities
oob_probs = model.oob_decision_function_ # shape: (n_samples, n_classes)
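
As a rough illustration of what these attributes contain, the OOB accuracy can be recomputed by hand from the per-sample OOB probabilities. The synthetic dataset and the name `forest` are assumptions made so the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Each row holds class probabilities averaged over the trees that did NOT train on that row.
oob_probs = forest.oob_decision_function_            # shape: (n_samples, n_classes)
oob_pred = forest.classes_[np.argmax(oob_probs, axis=1)]
manual_oob_accuracy = (oob_pred == y).mean()

print(f"Manual OOB accuracy: {manual_oob_accuracy:.4f}")
print(f"forest.oob_score_:   {forest.oob_score_:.4f}")
```

With very few trees, some rows may never be out-of-bag and scikit-learn warns that their OOB estimates are unreliable; 100 trees makes that case vanishingly unlikely.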

Feature Importance

Random forests expose built-in, impurity-based feature importances: the mean decrease in impurity (MDI), averaged across all trees:

import pandas as pd
import matplotlib.pyplot as plt

# feature_names: list of column names, in the same order as the columns of X_train
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)

# Plot the 15 most important features as a horizontal bar chart
importance_df.head(15).plot(x='feature', y='importance', kind='barh', figsize=(10, 8))
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()

Caveat: MDI is computed on the training data and can be biased toward high-cardinality features, including ones that carry no signal. For more reliable importances, use permutation importance on held-out data:

from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_importance = pd.Series(result.importances_mean, index=feature_names)
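
The MDI bias can be made visible by appending a pure-noise column and comparing the two measures. This is a sketch under assumptions: the synthetic data, the `noise` column, and the feature names are invented for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data plus one continuous column of pure noise (illustrative assumption).
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
X_aug = np.hstack([X, rng.normal(size=(X.shape[0], 1))])  # last column is noise
names = [f"x{i}" for i in range(5)] + ["noise"]

X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

mdi = forest.feature_importances_  # impurity-based, computed from training data
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

for name, m, p in zip(names, mdi, perm.importances_mean):
    print(f"{name:>6}  MDI={m:.3f}  permutation={p:+.3f}")
```

Fully grown trees still split on the noise column, so its MDI is above zero, while permuting it on held-out data should change accuracy very little.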

Key Hyperparameters

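| Parameter        | Default                            | Effect |
| ---------------- | ---------------------------------- | ------ |
| n_estimators     | 100                                | More trees = lower variance; diminishing returns past ~200–500 |
| max_features     | sqrt (classifier), 1.0 (regressor) | Lower values decorrelate trees; sqrt is typical for classification, about 1/3 for regression |
| max_depth        | None (full)                        | Limit to reduce memory and training time; often a small effect on accuracy |
| min_samples_leaf | 1                                  | Increase to smooth predictions and reduce overfitting |
| class_weight     | None                               | Set to 'balanced' for imbalanced datasets |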
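
One way to see the diminishing returns from n_estimators is to grow a single forest incrementally with warm_start=True and record the OOB score at each size. The dataset and the chosen sizes are assumptions for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, assumed for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# warm_start=True keeps the existing trees and only adds new ones on each fit.
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_by_size = {}
for n in (50, 100, 200, 400):
    forest.set_params(n_estimators=n)
    forest.fit(X, y)
    oob_by_size[n] = forest.oob_score_

for n, score in oob_by_size.items():
    print(f"n_estimators={n:>3}  OOB accuracy={score:.4f}")
```

The curve usually flattens after a few hundred trees; past that point extra trees mostly cost training time and memory.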

Random Forest vs. Gradient Boosting

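|                          | Random Forest                           | Gradient Boosting (XGBoost)       |
| ------------------------ | --------------------------------------- | --------------------------------- |
| Training                 | Parallel (fast)                         | Sequential (slower)               |
| Tuning effort            | Low (robust defaults)                   | High (many hyperparameters)       |
| Overfitting risk         | Low                                     | Higher                            |
| Tabular data performance | Very good                               | Often slightly better when tuned  |
| Interpretability         | Feature importances (SHAP also applies) | Feature importances + SHAP        |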

Random forests are typically the better starting point. Switch to gradient boosting when you need to squeeze out extra performance and are willing to tune.
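
A quick baseline comparison might look like the sketch below. It uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost to avoid an extra dependency, and the bundled breast cancer dataset is an assumption for illustration; both models run with near-default settings, so the result says nothing about tuned performance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative dataset (an assumption).
X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
}

cv_means = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name:>17}: {scores.mean():.3f} +/- {scores.std():.3f}")
```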


Random Forests for Regression

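from sklearn.ensemble import RandomForestRegressor

# X_train, y_train: regression features and continuous targets (assumed to be defined)
reg = RandomForestRegressor(
    n_estimators=200,
    max_features=1/3,   # About a third of the features per split, a common choice for regression
    oob_score=True,     # For regressors, oob_score_ is the out-of-bag R^2
    n_jobs=-1,
    random_state=42,
)
reg.fit(X_train, y_train)
print(f"OOB R^2: {reg.oob_score_:.4f}")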

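Predictions are the average of all tree outputs. The averaged output is smoother than a single tree’s step function, but it is still piecewise constant, and it cannot extrapolate beyond the range of target values seen in training.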
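
The averaging can be checked directly against the fitted trees in `estimators_`. The synthetic regression data and the name `forest` are assumptions made so the snippet is self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, assumed for illustration only.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=50, max_features=1/3, random_state=0)
forest.fit(X, y)

# Predict with every individual tree, then average across trees.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])  # (n_trees, 5)
manual_mean = per_tree.mean(axis=0)

print("Mean of tree outputs:", np.round(manual_mean, 2))
print("forest.predict:      ", np.round(forest.predict(X[:5]), 2))
print("Spread across trees: ", np.round(per_tree.std(axis=0), 2))
```

The per-tree spread is a rough indication of how much the individual trees disagree on each input, which is the variance that averaging reduces.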