Gradient Boosting
Gradient boosting is the algorithm behind some of the most powerful ML systems in production. Gradient-boosted trees are among the most common winning approaches in Kaggle competitions on tabular data and remain a very strong choice for structured data. Understanding how the algorithm works makes you better at tuning it.
The Core Idea: Sequential Error Correction
Unlike random forests (parallel trees that average), gradient boosting builds trees sequentially, where each new tree corrects the mistakes of all previous trees:

    Iteration 1: Fit tree₁ to data → predictions₁
    Iteration 2: Fit tree₂ to residuals of predictions₁ → predictions₂
    Iteration 3: Fit tree₃ to residuals of (predictions₁ + lr × predictions₂) → predictions₃
    ...
    Final prediction = predictions₁ + lr × predictions₂ + lr × predictions₃ + ...

Here lr is the learning rate, which shrinks each correction before it is added to the ensemble. This “boosting” approach turns many weak learners (shallow trees) into one powerful model.
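The sequential loop can be sketched in a few lines for squared-error regression. This is a minimal illustration, not scikit-learn's implementation: the helper names `fit_boosted_trees` and `predict_boosted` and the synthetic data are hypothetical, and the sketch assumes the ensemble starts from the mean of the targets and shrinks every tree by the learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=50, lr=0.1, max_depth=2):
    """Toy boosted ensemble for squared error (hypothetical helper)."""
    base = float(np.mean(y))              # constant starting prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred              # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)            # the new tree models the remaining error
        pred += lr * tree.predict(X)      # add a shrunken correction
        trees.append(tree)
    return base, trees


def predict_boosted(X, base, trees, lr=0.1):
    """Sum the constant start and every shrunken tree prediction."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred


# Synthetic regression data, made up for this illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

base, trees = fit_boosted_trees(X_demo, y_demo)
mse_start = np.mean((y_demo - base) ** 2)
mse_end = np.mean((y_demo - predict_boosted(X_demo, base, trees)) ** 2)
print(f"Training MSE, constant only:  {mse_start:.4f}")
print(f"Training MSE, after boosting: {mse_end:.4f}")
```

Each pass fits a shallow tree to whatever error remains, so the training error shrinks step by step rather than all at once.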
Gradient Descent in Function Space
The name comes from gradient descent: at each step, the algorithm fits a tree to the negative gradient of the loss function with respect to the current predictions, which, for squared error loss, equals the residuals (up to a constant factor). For other loss functions (log loss, Huber), the “pseudo-residuals” take a different form, but the principle is the same.
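As a rough illustration, the negative gradients for three common losses can be written directly in NumPy. The function names and the `delta` threshold are hypothetical, the squared-error loss is assumed to carry a factor of 1/2, and the log-loss version assumes binary labels in {0, 1} with the model output expressed as a raw log-odds score.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pseudo_residuals_mse(y_true, y_pred):
    # Negative gradient of 0.5 * (y - f)^2 with respect to f: the plain residual
    return y_true - y_pred


def pseudo_residuals_log_loss(y_true, raw_score):
    # y_true in {0, 1}; raw_score is the log-odds, so sigmoid gives a probability
    return y_true - sigmoid(raw_score)


def pseudo_residuals_huber(y_true, y_pred, delta=1.0):
    # Residuals beyond +/- delta are clipped, which limits the pull of outliers
    return np.clip(y_true - y_pred, -delta, delta)


y_true = np.array([3.0, -1.0, 10.0])
y_pred = np.array([2.5, 0.0, 1.0])

print(pseudo_residuals_mse(y_true, y_pred))    # residuals 0.5, -1.0, 9.0
print(pseudo_residuals_huber(y_true, y_pred))  # the outlier's 9.0 is clipped to 1.0
print(pseudo_residuals_log_loss(np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # 0.5 and -0.5
```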

    For MSE loss: pseudo-residuals = y_true - y_pred (actual residuals)
    For log loss: pseudo-residuals = y_true - sigmoid(y_pred)
    For Huber: pseudo-residuals = clipped residuals (robust to outliers)

Sklearn Gradient Boosting

    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    # X_train, y_train, X_test, y_test are assumed to be an existing train/test split
    gbm = GradientBoostingClassifier(
        n_estimators=200,      # Number of trees
        learning_rate=0.1,     # Shrinkage (smaller = more trees needed, better generalization)
        max_depth=3,           # Tree depth (shallower = more regularization)
        min_samples_leaf=20,   # Minimum samples per leaf (larger = smoother trees)
        subsample=0.8,         # Stochastic gradient boosting (each tree sees a random 80% of rows)
        random_state=42,
    )

    gbm.fit(X_train, y_train)
    print(f"Train accuracy: {gbm.score(X_train, y_train):.4f}")
    print(f"Test accuracy: {gbm.score(X_test, y_test):.4f}")

The Learning Rate / N-Estimators Trade-off
These two parameters are tightly coupled:

    learning_rate=0.1, n_estimators=100 (fast, less optimal)
    learning_rate=0.01, n_estimators=1000 (slow training, often better accuracy)
    learning_rate=0.001, n_estimators=10000 (very slow, marginal gains)

A common workflow: set a small learning rate (0.01–0.05) and use early stopping to find the right n_estimators automatically.
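The coupling between the two parameters can be checked empirically. The sketch below fits the first two configurations on a synthetic dataset (an assumption made purely for illustration; the variable names are hypothetical) and reports test accuracy for each. The numbers depend on the data, so it is a way to explore the trade-off rather than evidence for any particular setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, made up for this comparison
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# (learning_rate, n_estimators) pairs: a 10x smaller rate gets 10x more trees
configs = [(0.1, 100), (0.01, 1000)]
results = {}
for lr, n_trees in configs:
    model = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=n_trees, max_depth=3, random_state=0
    )
    model.fit(X_tr, y_tr)
    results[(lr, n_trees)] = model.score(X_te, y_te)
    print(f"lr={lr}, trees={n_trees}: test accuracy {results[(lr, n_trees)]:.4f}")
```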
Stochastic Gradient Boosting
Setting subsample < 1.0 introduces row subsampling at each tree (like bagging):
- Reduces variance and overfitting
- Speeds up training
- Often improves final accuracy
- subsample=0.8 is a common starting point
You can also subsample the features considered at each split (max_features) for additional regularization.
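A short sketch of both kinds of subsampling in scikit-learn follows, on synthetic data assumed for illustration. When subsample is below 1.0, the fitted model also exposes `oob_improvement_`, the per-stage change in loss measured on the rows left out of each tree. Taking the peak of its cumulative sum is a rough heuristic for how many trees were useful, not a replacement for a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, made up for this illustration
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

gbm_sgb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,         # each tree is fit on a random 80% of the rows
    max_features="sqrt",   # each split considers a random subset of features
    random_state=0,
)
gbm_sgb.fit(X, y)

# Out-of-bag loss improvement per stage (only available when subsample < 1.0)
cumulative_oob = np.cumsum(gbm_sgb.oob_improvement_)
best_n = int(np.argmax(cumulative_oob)) + 1
print(f"OOB estimate of useful trees: {best_n} of {gbm_sgb.n_estimators}")
```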
Early Stopping

    from sklearn.ensemble import GradientBoostingClassifier

    gbm = GradientBoostingClassifier(
        n_estimators=1000,
        learning_rate=0.05,
        max_depth=3,
        subsample=0.8,
        validation_fraction=0.1,
        n_iter_no_change=20,   # Stop if no improvement for 20 rounds
        tol=1e-4,
        random_state=42,
    )

    gbm.fit(X_train, y_train)
    print(f"Trees used: {gbm.n_estimators_}")  # Actual trees fitted (early stopped)

Visualizing Training Progress
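
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import log_loss

    # Track test deviance (log loss) per boosting stage.
    # Older scikit-learn releases exposed gbm.loss_ for this; it has since been
    # removed, so log_loss from sklearn.metrics is used here instead.
    test_score = np.zeros((gbm.n_estimators_,))
    for i, y_proba in enumerate(gbm.staged_predict_proba(X_test)):
        test_score[i] = log_loss(y_test, y_proba, labels=gbm.classes_)

    # gbm.train_score_ holds the training loss at each stage (on the sampled rows
    # when subsample < 1.0); depending on the scikit-learn version, its scale may
    # differ from log_loss by a constant factor.
    stages = range(1, gbm.n_estimators_ + 1)
    plt.plot(stages, gbm.train_score_, label='Train')
    plt.plot(stages, test_score, label='Test')
    plt.xlabel('Number of Boosting Rounds')
    plt.ylabel('Deviance')
    plt.legend()
    plt.title('Gradient Boosting: Training Progress')
    plt.show()

Key Hyperparameters Reference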
| Parameter | Typical Range | Effect |
|---|---|---|
| learning_rate | 0.01–0.1 | Smaller = slower training, better generalization |
| n_estimators | 100–2000 | Use early stopping to tune |
| max_depth | 2–5 | Shallower = less overfitting for boosting |
| subsample | 0.6–0.9 | Row subsampling per tree |
| min_samples_leaf | 5–50 | Regularize leaf size |
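One way to explore these ranges is a small randomized search. The sketch below is an illustration under assumptions, not a recommended recipe: the dataset is synthetic, the search budget (`n_iter=8`, `cv=3`) and the cap of 200 trees are kept small so it runs quickly, and n_estimators is left to early stopping rather than being searched.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, made up for this illustration
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=6, random_state=0
)

# Sampling distributions that mirror the ranges in the table
param_distributions = {
    "learning_rate": loguniform(0.01, 0.1),  # log-uniform over 0.01 to 0.1
    "max_depth": randint(2, 6),              # integers 2 to 5 (upper bound exclusive)
    "subsample": uniform(0.6, 0.3),          # floats in [0.6, 0.9] (loc, scale)
    "min_samples_leaf": randint(5, 51),      # integers 5 to 50
}

base_model = GradientBoostingClassifier(
    n_estimators=200,          # upper bound; early stopping picks the actual count
    n_iter_no_change=10,
    validation_fraction=0.1,
    random_state=0,
)

search = RandomizedSearchCV(
    base_model,
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print("Trees used by the refit model:", search.best_estimator_.n_estimators_)
```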
When to Use Gradient Boosting
Gradient boosting is the go-to for structured/tabular data, especially when:
- Features are mixed types (numeric, categorical)
- Non-linear relationships exist
- You can afford hyperparameter tuning time
- You want state-of-the-art performance
For faster, more scalable implementations, see XGBoost, LightGBM, and CatBoost, the workhorses of production ML on tabular data.