Gradient Boosting

Gradient boosting is the algorithm behind many production ML systems. It has been a frequent winner of Kaggle competitions on tabular data and remains one of the strongest methods for structured data. Understanding how it works makes you better at tuning it.

The Core Idea: Sequential Error Correction

Unlike random forests (parallel trees that average), gradient boosting builds trees sequentially, where each new tree corrects the mistakes of all previous trees:

Iteration 1: Fit tree₁ to data → predictions₁
Iteration 2: Fit tree₂ to residuals of predictions₁
Iteration 3: Fit tree₃ to residuals of (predictions₁ + lr × predictions₂)
...
Final prediction = predictions₁ + lr × predictions₂ + lr × predictions₃ + ...

This “boosting” approach turns many weak learners (shallow trees) into one powerful model.
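The loop above can be sketched by hand for squared error loss. This is a minimal illustration, not scikit-learn's implementation: the helper names `fit_boosted_trees` and `predict_boosted` and the toy sine data are assumptions made for the example, and the sketch starts from the mean of the targets rather than from an unshrunk first tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, lr=0.1, max_depth=2):
    """Fit shallow regression trees sequentially on the current residuals."""
    base = float(np.mean(y))                # constant starting prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                # mistakes of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + lr * tree.predict(X)  # shrink each correction by lr
        trees.append(tree)
    return base, trees


def predict_boosted(base, trees, X, lr=0.1):
    """Sum the constant start and every shrunken tree correction."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + lr * tree.predict(X)
    return pred


# Toy data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_boosted_trees(X, y)
mse_start = np.mean((y - base) ** 2)
mse_final = np.mean((y - predict_boosted(base, trees, X)) ** 2)
print(f"MSE with constant only: {mse_start:.4f}")
print(f"MSE after 50 trees:     {mse_final:.4f}")
```

Each tree only has to explain what the ensemble still gets wrong, which is why very shallow trees are enough.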

Gradient Descent in Function Space

The name comes from gradient descent: at each step, the algorithm fits a tree to the negative gradient of the loss with respect to the current predictions. For squared error loss (with the conventional factor of ½), that negative gradient is exactly the residuals. For other loss functions (log loss, Huber), the “pseudo-residuals” take a different form, but the principle is the same.

For MSE loss: pseudo-residuals = y_true - y_pred (actual residuals)
For log loss: pseudo-residuals = y_true - sigmoid(y_pred), where y_pred is the raw log-odds score
For Huber:    pseudo-residuals = residuals clipped to ±delta (robust to outliers)
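The three formulas above translate directly into NumPy. This is a sketch under stated assumptions: the function names are hypothetical, labels for log loss are taken to be 0/1, and `delta` is a fixed threshold chosen for illustration (libraries may pick it differently, for example from a quantile of the residuals).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pseudo_residuals_mse(y_true, y_pred):
    # Negative gradient of 0.5 * (y_true - y_pred)**2 with respect to y_pred.
    return y_true - y_pred


def pseudo_residuals_log_loss(y_true, raw_score):
    # y_true in {0, 1}; raw_score is the model's log-odds output.
    return y_true - sigmoid(raw_score)


def pseudo_residuals_huber(y_true, y_pred, delta=1.0):
    # Residuals clipped to [-delta, delta], so outliers have bounded influence.
    return np.clip(y_true - y_pred, -delta, delta)


y = np.array([1.0, 0.0, 1.0])
score = np.array([0.0, 0.0, 5.0])
print(pseudo_residuals_mse(y, score))       # plain residuals
print(pseudo_residuals_log_loss(y, score))  # label minus predicted probability
print(pseudo_residuals_huber(y, score))     # the residual of -4 is clipped to -1
```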

Sklearn Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

# X_train, X_test, y_train, y_test: an existing train/test split (assumed to be defined)

gbm = GradientBoostingClassifier(
    n_estimators=200,       # Number of trees
    learning_rate=0.1,      # Shrinkage (smaller = more trees needed, better generalization)
    max_depth=3,            # Tree depth (shallower = more regularization)
    min_samples_leaf=20,
    subsample=0.8,          # Fraction of rows sampled per tree (< 1.0 = stochastic gradient boosting)
    random_state=42
)

gbm.fit(X_train, y_train)
print(f"Train accuracy: {gbm.score(X_train, y_train):.4f}")
print(f"Test accuracy:  {gbm.score(X_test, y_test):.4f}")

The Learning Rate / N-Estimators Trade-off

These two parameters are tightly coupled:

learning_rate=0.1, n_estimators=100  (fast, less optimal)
learning_rate=0.01, n_estimators=1000 (slow training, often better accuracy)
learning_rate=0.001, n_estimators=10000 (very slow, marginal gains)

A common workflow: set a small learning rate (0.01–0.05) and use early stopping to find the right n_estimators automatically.
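One way to see the coupling is to track validation loss after every boosting stage for two learning rates. The sketch below is illustrative only: the synthetic dataset, the split, and the two learning rates are assumptions, and the exact numbers will differ on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_rounds = {}
for lr in (0.5, 0.05):
    model = GradientBoostingClassifier(
        n_estimators=300, learning_rate=lr, max_depth=3, random_state=0
    )
    model.fit(X_tr, y_tr)
    # Validation log loss after each stage; its minimum marks the best tree count.
    val_loss = [log_loss(y_val, p) for p in model.staged_predict_proba(X_val)]
    best_rounds[lr] = int(np.argmin(val_loss)) + 1
    print(f"learning_rate={lr}: best at {best_rounds[lr]} trees, "
          f"validation log loss {min(val_loss):.4f}")
```

A smaller learning rate typically moves the minimum to a later stage, which is why the two parameters are tuned together.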

Stochastic Gradient Boosting

Setting subsample < 1.0 fits each tree on a random fraction of the rows (similar to bagging, but sampled without replacement):

Reduces variance and overfitting
Speeds up training
Often improves final accuracy
subsample=0.8 is a common starting point

You can also subsample features (max_features) for additional regularization. In scikit-learn, the feature subset is redrawn at every split, not once per tree.
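A short sketch of both kinds of subsampling together, using synthetic data as an assumption. When `subsample` is below 1.0, scikit-learn also exposes `oob_improvement_`, the per-stage loss improvement measured on the rows left out of that stage's sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)

sgb = GradientBoostingClassifier(
    n_estimators=100,
    subsample=0.8,        # each tree is fit on a random 80% of the rows
    max_features="sqrt",  # each split considers about sqrt(n_features) features
    random_state=0,
)
sgb.fit(X, y)

# One out-of-bag improvement value per boosting stage.
print(sgb.oob_improvement_.shape)
print(sgb.oob_improvement_[:5])
```

Out-of-bag improvements that hover around zero or turn negative in later stages are a rough hint that extra trees are no longer helping.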

Early Stopping

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,
    n_iter_no_change=20,     # Stop if no improvement for 20 rounds
    tol=1e-4,
    random_state=42
)

gbm.fit(X_train, y_train)
print(f"Trees used: {gbm.n_estimators_}")  # Actual trees fitted (early stopped)

Visualizing Training Progress

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import log_loss

# Track train/test log loss per boosting stage.
# staged_predict_proba yields class probabilities after each stage; this avoids
# the gbm.loss_ attribute, which is not available in recent scikit-learn releases.
train_loss = [log_loss(y_train, p) for p in gbm.staged_predict_proba(X_train)]
test_loss = [log_loss(y_test, p) for p in gbm.staged_predict_proba(X_test)]

rounds = np.arange(1, gbm.n_estimators_ + 1)
plt.plot(rounds, train_loss, label='Train')
plt.plot(rounds, test_loss, label='Test')
plt.xlabel('Number of Boosting Rounds')
plt.ylabel('Log loss')
plt.legend()
plt.title('Gradient Boosting: Training Progress')
plt.show()

Key Hyperparameters Reference

Parameter	Typical Range	Effect
`learning_rate`	0.01–0.1	Smaller = slower training, better generalization
`n_estimators`	100–2000	Use early stopping to tune
`max_depth`	2–5	Shallower = less overfitting for boosting
`subsample`	0.6–0.9	Row subsampling per tree
`min_samples_leaf`	5–50	Regularize leaf size
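The ranges in the table can be fed to a randomized search. This is a minimal sketch, assuming synthetic data and a deliberately small budget (`n_iter=5`, `cv=3`, fixed `n_estimators=100`) so that it runs quickly; a real search would use more iterations and tune the number of trees with early stopping.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

param_distributions = {
    "learning_rate": loguniform(0.01, 0.1),  # sampled on a log scale
    "max_depth": randint(2, 6),              # integers 2..5
    "subsample": uniform(0.6, 0.3),          # floats in [0.6, 0.9]
    "min_samples_leaf": randint(5, 51),      # integers 5..50
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```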

When to Use Gradient Boosting

Gradient boosting is the go-to for structured/tabular data, especially when:

Features are mixed types (numeric, categorical)
Non-linear relationships exist
You can afford hyperparameter tuning time
You want state-of-the-art performance

For faster, more scalable implementations, see XGBoost, LightGBM, and CatBoost, the workhorses of production ML on tabular data.