Gradient Boosting: Sequential Ensemble Learning for High Performance

Learn gradient boosting machines — additive training, pseudo-residuals, shrinkage, tree depth, and why gradient boosting dominates Kaggle tabular competitions.

Gradient Boosting

Gradient boosting is the algorithm behind some of the most powerful ML systems in production. Its implementations are among the most frequent winners of Kaggle tabular competitions, and it is a strong default for structured data. Understanding how it works makes you better at tuning it.


The Core Idea: Sequential Error Correction

Unlike random forests (parallel trees that average), gradient boosting builds trees sequentially, where each new tree corrects the mistakes of all previous trees:

Iteration 1: Fit tree₁ to data → predictions₁
Iteration 2: Fit tree₂ to residuals of predictions₁ → predictions₂
Iteration 3: Fit tree₃ to residuals of (predictions₁ + lr × predictions₂) → predictions₃
...
Final prediction = predictions₁ + lr × predictions₂ + lr × predictions₃ + ...

This “boosting” approach turns many weak learners (shallow trees) into one powerful model.
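The loop above can be sketched from scratch for squared error loss. This is a minimal illustration, not scikit-learn's implementation: the toy data, the variable names, and the choice to start from the mean of y (rather than from a first tree) are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

lr = 0.1       # shrinkage applied to every tree
n_trees = 50

# Start from a constant prediction, then add shrunken trees one at a time.
prediction = np.full_like(y, y.mean())
trees = []
mse_per_round = [np.mean((y - prediction) ** 2)]
for _ in range(n_trees):
    residuals = y - prediction                  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += lr * tree.predict(X)          # small corrective step
    trees.append(tree)
    mse_per_round.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_per_round[-1]:.4f}")
```

Each shallow tree on its own is a weak model of the residuals; the training error falls because the corrections accumulate.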


Gradient Descent in Function Space

The name comes from gradient descent: at each step, the algorithm fits a tree to the negative gradient of the loss function — which, for squared error loss, equals the residuals. For other loss functions (log loss, Huber), the “pseudo-residuals” are more complex but the principle is the same.

For MSE loss: pseudo-residuals = y_true - y_pred (actual residuals)
For log loss: pseudo-residuals = y_true - sigmoid(y_pred), where y_pred is the raw log-odds score
For Huber: pseudo-residuals = clipped residuals (robust to outliers)
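The three cases above can be written as small NumPy functions. This is a sketch under stated assumptions: the function names are hypothetical, labels for log loss are taken to be 0 or 1, and the Huber threshold delta is fixed at 1.0 here purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pseudo_residuals_mse(y_true, y_pred):
    # Negative gradient of 0.5 * (y_true - y_pred)**2 with respect to y_pred.
    return y_true - y_pred

def pseudo_residuals_log_loss(y_true, raw_score):
    # y_true is 0 or 1; raw_score is the model's log-odds output.
    return y_true - sigmoid(raw_score)

def pseudo_residuals_huber(y_true, y_pred, delta=1.0):
    # Plain residual for small errors, clipped to +/- delta for large ones.
    return np.clip(y_true - y_pred, -delta, delta)

y = np.array([1.0, 0.0, 5.0])
pred = np.array([0.5, 0.0, 0.0])
mse_res = pseudo_residuals_mse(y, pred)
huber_res = pseudo_residuals_huber(y, pred)
log_res = pseudo_residuals_log_loss(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
print(mse_res, huber_res, log_res)
```

Note how the outlier (true value 5.0) produces a residual of 5.0 under MSE but only 1.0 under Huber, which is why Huber loss is less sensitive to extreme targets.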

Sklearn Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier
# GradientBoostingRegressor takes the same tree and shrinkage parameters for regression tasks.

# Assumes X_train, X_test, y_train, y_test already exist (e.g. from train_test_split).
gbm = GradientBoostingClassifier(
    n_estimators=200,      # Number of trees
    learning_rate=0.1,     # Shrinkage (smaller = more trees needed, often better generalization)
    max_depth=3,           # Tree depth (shallower = more regularization)
    min_samples_leaf=20,   # Minimum samples per leaf (larger = more regularization)
    subsample=0.8,         # Stochastic gradient boosting: each tree trains on a random 80% of rows
    random_state=42,
)
gbm.fit(X_train, y_train)
print(f"Train accuracy: {gbm.score(X_train, y_train):.4f}")
print(f"Test accuracy: {gbm.score(X_test, y_test):.4f}")

The Learning Rate / N-Estimators Trade-off

These two parameters are tightly coupled:

learning_rate=0.1, n_estimators=100 (fast, less optimal)
learning_rate=0.01, n_estimators=1000 (slow training, often better accuracy)
learning_rate=0.001, n_estimators=10000 (very slow, marginal gains)

A common workflow: set a small learning rate (0.01–0.05) and use early stopping to find the right n_estimators automatically.
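The coupling between the two parameters can be seen by fixing the number of trees and varying only the learning rate. The sketch below uses a synthetic dataset from make_classification (an assumption for illustration) and reads the per-stage training loss from train_score_.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

fast = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, random_state=0).fit(X, y)
slow = GradientBoostingClassifier(learning_rate=0.01, n_estimators=100, random_state=0).fit(X, y)

# With subsample=1.0 (the default), train_score_[i] is the training loss after i + 1 trees.
for n in (10, 50, 100):
    print(f"{n:>3} trees  lr=0.1: {fast.train_score_[n - 1]:.4f}  lr=0.01: {slow.train_score_[n - 1]:.4f}")
```

With the same tree budget, the smaller learning rate has made far less progress on the training loss, so it needs many more trees to catch up. Training loss alone says nothing about generalization, which is why a validation set or early stopping decides the final tree count.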


Stochastic Gradient Boosting

Setting subsample < 1.0 introduces row subsampling at each tree (like bagging):

  • Reduces variance and overfitting
  • Speeds up training
  • Often improves final accuracy
  • subsample=0.8 is a common starting point

You can also subsample the features considered at each split (max_features) for additional regularization.
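A minimal sketch of both kinds of subsampling in scikit-learn follows. The dataset is synthetic and the specific values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

gbm_stochastic = GradientBoostingClassifier(
    n_estimators=100,
    subsample=0.8,          # each tree is fit on a random 80% of the rows
    max_features="sqrt",    # each split considers sqrt(n_features) candidate features
    random_state=0,
).fit(X, y)

# oob_improvement_ is only available when subsample < 1.0: it records, per stage,
# the loss improvement measured on the rows that tree did not see.
print(gbm_stochastic.oob_improvement_[:5])
```

The out-of-bag improvements give a rough signal of when additional trees stop helping, without setting aside a separate validation set.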


Early Stopping

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=1000,         # Upper bound; early stopping usually ends sooner
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,   # Hold out 10% of the training data for validation
    n_iter_no_change=20,       # Stop if no improvement for 20 rounds
    tol=1e-4,                  # Minimum change that counts as an improvement
    random_state=42,
)
gbm.fit(X_train, y_train)
print(f"Trees used: {gbm.n_estimators_}")  # Actual trees fitted (early stopped)

Visualizing Training Progress

import matplotlib.pyplot as plt
from sklearn.metrics import log_loss

# Track train/test log loss per stage.
# gbm.loss_ is no longer available in recent scikit-learn releases, so the
# metric is computed directly from the staged probabilities for both sets.
train_loss = [log_loss(y_train, p, labels=gbm.classes_)
              for p in gbm.staged_predict_proba(X_train)]
test_loss = [log_loss(y_test, p, labels=gbm.classes_)
             for p in gbm.staged_predict_proba(X_test)]
rounds = range(1, len(test_loss) + 1)
plt.plot(rounds, train_loss, label='Train')
plt.plot(rounds, test_loss, label='Test')
plt.xlabel('Number of Boosting Rounds')
plt.ylabel('Log loss')
plt.legend()
plt.title('Gradient Boosting: Training Progress')
plt.show()

Key Hyperparameters Reference

Parameter          Typical Range   Effect
learning_rate      0.01–0.1        Smaller = slower training, better generalization
n_estimators       100–2000        Use early stopping to tune
max_depth          2–5             Shallower = less overfitting for boosting
subsample          0.6–0.9         Row subsampling per tree
min_samples_leaf   5–50            Regularize leaf size
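One way to explore these ranges is a small randomized search. The sketch below is an illustration under assumptions: the dataset is synthetic, n_estimators is held at 100 to keep it quick, and only 5 candidates are sampled, which is far fewer than a real tuning run would use.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_distributions = {
    "learning_rate": loguniform(0.01, 0.1),   # log-uniform on [0.01, 0.1]
    "max_depth": randint(2, 6),               # integers 2..5
    "subsample": uniform(0.6, 0.3),           # uniform on [0.6, 0.9]
    "min_samples_leaf": randint(5, 51),       # integers 5..50
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Sampling learning_rate on a log scale spends as many trials between 0.01 and 0.03 as between 0.03 and 0.1, which suits a parameter whose effect is multiplicative.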

When to Use Gradient Boosting

Gradient boosting is the go-to for structured/tabular data, especially when:

  • Features are mixed types (numeric, categorical; scikit-learn's GradientBoostingClassifier needs categoricals encoded as numbers)
  • Non-linear relationships exist
  • You can afford hyperparameter tuning time
  • You want state-of-the-art performance

For faster, more scalable implementations, see XGBoost, LightGBM, and CatBoost, the workhorses of production ML on tabular data.