Random Forests
Random forests address the fundamental problem with decision trees: high variance. A single decision tree is fragile: small changes in the training data can produce a completely different tree. Random forests fix this by training hundreds of trees and letting them vote. The result is a robust, accurate model that performs well on most tabular problems with very little tuning.
The Core Idea: Bagging
Bootstrap Aggregation (Bagging): Train many models on different bootstrap resamples of the training data (random draws with replacement), then aggregate their predictions.
Original dataset (N samples):
  Sample 1: Bootstrap → Train Tree 1
  Sample 2: Bootstrap → Train Tree 2
  ...
  Sample k: Bootstrap → Train Tree k

Prediction: Majority vote (classification) or mean (regression)

Each bootstrap sample is drawn with replacement, so roughly 63% of the original samples appear in it at least once, and the remaining ~37% are “out-of-bag” (OOB) for that tree.
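The 63% / 37% split can be checked with a few lines of standard-library Python. This is a minimal sketch: the helper name `bootstrap_unique_fraction` and the sample size are illustrative choices, not part of any library.

```python
import random


def bootstrap_unique_fraction(n_samples: int, seed: int = 0) -> float:
    """Draw one bootstrap sample of size n_samples (with replacement) and
    return the share of distinct original rows that ended up in it."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples


n = 10_000
in_bag = bootstrap_unique_fraction(n)
# Probability that a given row is picked at least once in n draws;
# this tends to 1 - 1/e (about 0.632) as n grows.
theoretical = 1 - (1 - 1 / n) ** n

print(f"in-bag fraction (simulated): {in_bag:.3f}")
print(f"in-bag fraction (formula):   {theoretical:.3f}")
print(f"out-of-bag fraction:         {1 - in_bag:.3f}")
```

The simulated value fluctuates slightly with the seed, but for datasets of more than a few hundred rows it stays close to the limiting value.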
What Makes Random Forests Different from Plain Bagging
Random forests add a second source of randomness: feature subsampling at each split.
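At each node, the tree searches for the best split among a random subset of the features rather than all of them; for classification the usual subset size is √(n_features). This:
- Decorrelates the trees (if one strong feature dominates, it is not available at every split, so the trees stop making the same splits near the root)
- Means each tree sees different data (its bootstrap sample) AND a different set of candidate features at every split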
- Produces diverse trees whose errors cancel out when aggregated
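Why decorrelation matters can be seen from a standard idealized result: if each of B trees has prediction variance σ² and every pair of trees has correlation ρ, the variance of their average is ρσ² + (1 − ρ)σ²/B. The sketch below simply evaluates that formula; the function name is hypothetical, and real forests only approximately satisfy the equal-variance, equal-correlation assumption.

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees identically distributed predictions
    with common variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


for rho in (0.0, 0.3, 0.7):
    row = ", ".join(
        f"B={b}: {ensemble_variance(1.0, rho, b):.3f}" for b in (1, 10, 100, 1000)
    )
    print(f"rho={rho}: {row}")
```

Adding trees shrinks only the second term. The first term, ρσ², is a floor that more trees cannot lower, which is why reducing the correlation between trees through feature subsampling pays off.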
Training a Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(
    n_estimators=100,       # Number of trees (more = better, diminishing returns)
    max_features='sqrt',    # Features to consider at each split
    max_depth=None,         # Trees grow fully (bagging handles overfitting)
    min_samples_leaf=1,
    oob_score=True,         # Use out-of-bag samples for free validation
    n_jobs=-1,              # Parallelize across all CPU cores
    random_state=42
)

model.fit(X_train, y_train)

print(f"OOB Score: {model.oob_score_:.4f}")  # Free accuracy estimate
print(f"Test Score: {model.score(X_test, y_test):.4f}")

Out-of-Bag (OOB) Evaluation
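The ~37% of samples not used to train a given tree can be used to evaluate it for free, without a separate validation set. Each sample is scored only by the trees that never saw it during training, so the OOB score is a nearly unbiased (if anything, slightly conservative) estimate of generalization performance.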
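To make the mechanism concrete, here is a hand-rolled sketch of OOB evaluation built from individual decision trees. The synthetic dataset, tree count, and variable names are assumptions for illustration. It also uses hard majority votes for simplicity, whereas scikit-learn aggregates the trees' class probabilities, so the number will not match `oob_score_` exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative stand-in)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
n_samples, n_trees = len(X), 100
rng = np.random.default_rng(0)

votes = np.zeros((n_samples, 2))  # per-sample vote counts for classes 0 and 1
for _ in range(n_trees):
    in_bag = rng.integers(0, n_samples, size=n_samples)  # bootstrap row indices
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[in_bag] = False                             # rows this tree never saw

    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X[in_bag], y[in_bag])

    # Each tree votes only on its own out-of-bag rows
    preds = tree.predict(X[oob_mask])
    votes[np.flatnonzero(oob_mask), preds] += 1

has_vote = votes.sum(axis=1) > 0  # rows that were OOB for at least one tree
oob_pred = votes.argmax(axis=1)
oob_accuracy = (oob_pred[has_vote] == y[has_vote]).mean()
print(f"Manual OOB accuracy: {oob_accuracy:.4f}")
```

With 100 trees, each row is out-of-bag for roughly 37 of them, so practically every row receives votes.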
# OOB score is automatically calculated when oob_score=True
print(f"OOB accuracy: {model.oob_score_:.4f}")

# Per-sample OOB probabilities
oob_probs = model.oob_decision_function_  # shape: (n_samples, n_classes)

Feature Importance
Random forests expose built-in feature importances through feature_importances_: the mean decrease in impurity (MDI) attributed to each feature, averaged across all trees:
import pandas as pd
import matplotlib.pyplot as plt

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

importance_df.head(15).plot(x='feature', y='importance', kind='barh', figsize=(10, 8))
plt.title("Random Forest Feature Importances")
plt.tight_layout()

Caveat: MDI can be biased toward high-cardinality features. For more reliable importances, use permutation importance:
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_importance = pd.Series(result.importances_mean, index=feature_names)

Key Hyperparameters
| Parameter | Default | Effect |
|---|---|---|
| n_estimators | 100 | More trees = lower variance; diminishing returns past ~200–500 |
| max_features | sqrt (classifier), 1.0 (regressor) | Fewer features per split = less correlated trees; sqrt is the usual choice for classification, about 1/3 of the features for regression |
| max_depth | None (full) | Limit to reduce memory; small effect on accuracy |
| min_samples_leaf | 1 | Increase to smooth predictions and reduce overfitting |
| class_weight | None | Set to balanced for imbalanced datasets (classifier only) |
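The diminishing returns from n_estimators can be observed by growing one forest in stages and reading the OOB score after each stage. The sketch below uses `warm_start=True` so each `fit` call adds trees to the existing forest. The synthetic dataset and the specific tree counts are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# warm_start=True keeps the trees already grown and only fits the new ones
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_by_size = {}
for n_trees in (25, 50, 100, 200, 400):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    oob_by_size[n_trees] = forest.oob_score_
    print(f"{n_trees:>4} trees -> OOB accuracy {forest.oob_score_:.4f}")
```

On most datasets the score rises quickly at first and then flattens. Where it flattens is data-dependent, so treat the ~200–500 range in the table as a rule of thumb rather than a guarantee.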
Random Forest vs. Gradient Boosting
| | Random Forest | Gradient Boosting (XGBoost) |
|---|---|---|
| Training | Parallel (fast) | Sequential (slower) |
| Tuning effort | Low (robust defaults) | High (many hyperparameters) |
| Overfitting risk | Low | Higher |
| Tabular data performance | Very good | Slightly better on average |
| Interpretability | Feature importances | Feature importances + SHAP |
Random forests are typically the better starting point. Switch to gradient boosting when you need to squeeze out extra performance and are willing to tune.
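A quick way to make this decision on your own data is to cross-validate both model families with their default settings. In the sketch below, scikit-learn’s `GradientBoostingClassifier` stands in for XGBoost so that the example has no extra dependency, and the synthetic dataset is an assumption. The ranking on real data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Stand-in for XGBoost: scikit-learn's own gradient boosting implementation
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

cv_accuracy = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)  # 5-fold accuracy
    cv_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the untuned random forest is already within noise of the boosted model, the extra tuning effort for boosting may not be worth it.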
Random Forests for Regression
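from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(
    n_estimators=200,
    max_features=1/3,   # Consider a third of the features per split (common choice for regression)
    oob_score=True,     # For regressors, oob_score_ is R² rather than accuracy
    n_jobs=-1,
    random_state=42
)
reg.fit(X_train, y_train)

Predictions are the average of all tree outputs. Because many trees are averaged, the prediction surface is much smoother than a single tree’s, although it is still piecewise constant.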
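The averaging can be verified directly by comparing the forest’s prediction with the mean of its individual trees, which are available through `estimators_`. This is a small sketch on synthetic data; the dataset and the `demo_reg` name are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
X_reg, y_reg = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=0)

demo_reg = RandomForestRegressor(n_estimators=50, max_features=1/3, random_state=0)
demo_reg.fit(X_reg, y_reg)

# Stack each tree's predictions for the first five rows: shape (n_trees, 5)
per_tree = np.stack([tree.predict(X_reg[:5]) for tree in demo_reg.estimators_])
manual_mean = per_tree.mean(axis=0)
forest_pred = demo_reg.predict(X_reg[:5])

print("Forest prediction:", np.round(forest_pred, 2))
print("Mean of trees:    ", np.round(manual_mean, 2))
print("Spread across trees (std):", np.round(per_tree.std(axis=0), 2))
```

The per-row standard deviation across trees is a rough indicator of how much the individual trees disagree, which is the variance that averaging removes.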