XGBoost, LightGBM, and CatBoost: Modern Gradient Boosting Frameworks

Compare XGBoost, LightGBM, and CatBoost — speed, memory efficiency, categorical handling, SHAP values, and practical guidance for choosing the right framework.

XGBoost, LightGBM, and CatBoost

Three frameworks dominate competitive machine learning on tabular data: XGBoost, LightGBM, and CatBoost. All implement gradient boosting, but they differ in speed, memory efficiency, handling of categorical variables, and tuning complexity. Knowing when to use each saves significant time.


Why They’re Faster Than sklearn GBM

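Scikit-learn's GradientBoostingClassifier evaluates candidate split points exactly: at each node it scans the sorted values of every feature, so the cost per tree grows with the number of samples n, the number of features d, and the number of split nodes, roughly O(n × d × n_splits). It also trains on a single thread. The modern frameworks use algorithmic tricks: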

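XGBoost: Approximate split finding (quantile sketches and histograms), a regularized objective (L1 and L2 penalties on leaf weights), a column-block data structure for cache efficiency and parallel split search, and sparsity-aware splits that learn a default direction for missing values.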

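LightGBM: Histogram-based split finding. Gradient-based One-Side Sampling (GOSS) keeps every large-gradient sample and randomly samples only a fraction of the small-gradient ones (the “easy” cases), up-weighting the sampled rows so the gain estimate stays approximately unbiased. Exclusive Feature Bundling (EFB) merges sparse features that are rarely nonzero at the same time. Trees grow leaf-wise rather than level-wise (the default in XGBoost and sklearn).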

CatBoost: Ordered boosting, which computes each sample's residual from a model trained only on samples that precede it in a random permutation, to avoid target leakage. Native categorical encoding via ordered target statistics, which often beats manual ordinal encoding. Symmetric (oblivious) trees for faster prediction.
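To make the GOSS idea concrete, here is a minimal pure-Python sketch of the sampling step as described in the LightGBM paper. It is an illustration, not LightGBM's implementation; the function name `goss_sample` and the parameter names `top_rate` and `other_rate` are chosen for this example.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by a GOSS-style sampler (illustrative)."""
    rng = random.Random(seed)
    n = len(gradients)
    n_top = round(n * top_rate)
    n_other = round(n * other_rate)

    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:n_top], order[n_top:]

    # Keep all large-gradient rows, plus a random subset of the remaining rows.
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled small-gradient rows so their total contribution
    # to gradient sums is roughly what the full set would have contributed.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights

grads = [0.9, -0.05, 0.02, -1.2, 0.1, 0.03, -0.01, 0.6, 0.04, -0.02]
idx, w = goss_sample(grads, top_rate=0.2, other_rate=0.3)
print(idx, w)
```

Split gains are then computed on the selected rows only, with each row's gradient multiplied by its weight.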


XGBoost

import xgboost as xgb
import shap

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=0,        # Min gain to make a split
    reg_alpha=0,    # L1 regularization
    reg_lambda=1,   # L2 regularization
    eval_metric='logloss',
    early_stopping_rounds=50,  # Constructor argument in XGBoost >= 1.6
    random_state=42,
    n_jobs=-1,
)

# Early stopping should monitor a validation split, not the final test set
eval_set = [(X_val, y_val)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=100)

# SHAP values via the separate shap package (pip install shap)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
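The gamma and reg_lambda settings above enter the split decision through the gain formula given in the XGBoost paper. The sketch below evaluates that formula for one candidate split, assuming reg_alpha = 0 (the L1 term is omitted for brevity). The function names are made up for this example, and the inputs are the summed gradients (g) and Hessians (h) on each side of the split.

```python
def leaf_score(g_sum, h_sum, reg_lambda):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return g_sum * g_sum / (h_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children (L1 term omitted)."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    # gamma is subtracted as a fixed cost per extra leaf; a split is only
    # worth making when the result is positive.
    return 0.5 * (children - parent) - gamma

gain = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=0.0)
gain_strong_l2 = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=10.0, gamma=0.0)
gain_high_gamma = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=5.0)
print(gain, gain_strong_l2, gain_high_gamma)
```

Raising reg_lambda shrinks the gain of every split, while raising gamma sets a floor that a split must clear, which is why both act as regularizers.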

LightGBM

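LightGBM often trains several times faster than XGBoost's exact tree method and uses less memory; the gap is smaller against XGBoost's histogram method (tree_method='hist'). It is a good default when training time is critical.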
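Much of the speed of histogram-based learners such as LightGBM comes from bucketing each feature into a small number of bins and scanning bin boundaries instead of every distinct value. The following pure-Python sketch shows the idea for a single feature. It uses equal-width bins and a simplified gain for clarity, whereas real libraries choose bin edges differently and handle many more details; the function name is hypothetical.

```python
def best_histogram_split(values, gradients, hessians, n_bins=4, reg_lambda=1.0):
    """Find the best bin-boundary split for one feature (illustrative only)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return None, 0.0  # Constant feature: nothing to split on.

    # One pass over the data: accumulate gradient/Hessian sums per bin.
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, gradients, hessians):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h

    def score(g, h):
        return g * g / (h + reg_lambda)

    g_total, h_total = sum(g_hist), sum(h_hist)
    best_threshold, best_gain = None, 0.0
    g_left = h_left = 0.0
    # Scan the n_bins - 1 boundaries instead of every distinct value.
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - score(g_total, h_total))
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain

values = [1, 2, 3, 4, 5, 6, 7, 8]
gradients = [-1, -1, -1, -1, 1, 1, 1, 1]
hessians = [1] * 8
threshold, gain = best_histogram_split(values, gradients, hessians)
print(threshold, gain)
```

After the single binning pass, the split search costs time proportional to the number of bins rather than the number of rows.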

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,          # Controls tree complexity (not max_depth)
    max_depth=-1,           # -1 = no limit
    min_child_samples=20,
    colsample_bytree=0.8,   # Column subsampling (alias: feature_fraction)
    subsample=0.8,          # Row subsampling (alias: bagging_fraction)
    subsample_freq=5,       # Resample rows every 5 iterations (alias: bagging_freq)
    reg_alpha=0,
    reg_lambda=0,
    n_jobs=-1,
    random_state=42,
    verbose=-1,
)

callbacks = [lgb.early_stopping(50), lgb.log_evaluation(100)]

# Early stopping should monitor a validation split, not the final test set
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=callbacks,
)

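num_leaves is the primary complexity parameter in LightGBM (rather than max_depth) because trees grow leaf-wise. If you also cap max_depth, keep num_leaves below 2^max_depth, the leaf count of a full binary tree of that depth, to limit overfitting.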
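The rule of thumb is simple arithmetic: a binary tree of depth d has at most 2^d leaves. A small helper (hypothetical, not part of LightGBM) can check a configuration before training. It relies on LightGBM's convention that max_depth <= 0 means no depth limit, in which case the rule does not apply.

```python
def num_leaves_within_depth(num_leaves, max_depth):
    """Check the rule of thumb num_leaves < 2 ** max_depth."""
    if max_depth <= 0:
        # No depth limit: num_leaves alone controls tree size.
        return True
    return num_leaves < 2 ** max_depth

print(num_leaves_within_depth(31, 6))   # 31 < 64
print(num_leaves_within_depth(31, 4))   # 31 >= 16
```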


CatBoost

CatBoost handles categorical features natively, so one-hot or target encoding is not required.

from catboost import CatBoostClassifier, Pool

# Identify categorical feature indices
cat_features = [0, 2, 5]  # Index positions of categorical columns

train_pool = Pool(X_train, y_train, cat_features=cat_features)
# Early stopping should monitor a validation split, not the final test set
val_pool = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    early_stopping_rounds=50,
    eval_metric='Accuracy',
    random_seed=42,
    verbose=100,
)
model.fit(train_pool, eval_set=val_pool)
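CatBoost's native categorical handling is built on ordered target statistics: each row's category is encoded using only the targets of rows that come earlier in a random permutation, so a row never sees its own label. The sketch below illustrates that idea in pure Python. It is a simplification (CatBoost uses several permutations and further refinements), and the function name and the `prior` and `prior_weight` parameters are assumptions made for this example.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0,
                         order=None, seed=0):
    """Encode categories using only targets of earlier rows in a permutation."""
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)

    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now does this row's own target become visible to later rows.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded

cats = ["a", "b", "a", "a", "b", "c"]
ys = [1, 0, 1, 0, 1, 1]
enc = ordered_target_stats(cats, ys, order=[0, 1, 2, 3, 4, 5])
print(enc)
```

The first row seen for each category falls back to the prior, which is why rare categories are pulled toward it instead of memorizing their own labels.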

Framework Comparison

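Feature | XGBoost | LightGBM | CatBoost
Training speed | Fast | Fastest | Medium
Memory usage | High | Low | Medium
Categorical support | Limited (enable_categorical; otherwise manual encoding) | Native (integer-coded or pandas category columns) | Native
GPU support | Yes | Yes | Yes
Missing value handling | Native | Native | Native
SHAP support | Yes | Yes | Yes
Best for | General purpose | Large datasets, speed | High-cardinality categoricals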

Tuning Strategy

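All three follow the same basic tuning workflow: fix a generous number of trees, rely on early stopping to pick the actual count, and search over the remaining hyperparameters. With Optuna and XGBoost: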

import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score

def objective(trial):
    params = {
        'n_estimators': 1000,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }
    model = xgb.XGBClassifier(**params, early_stopping_rounds=50, eval_metric='auc')
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    # Return the same metric used for early stopping (assumes binary labels);
    # model.score would report accuracy instead of AUC
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
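Reusing one workflow across the three libraries mostly means translating parameter names. The sketch below maps a few generic keys to each framework's scikit-learn-style constructor arguments. The generic keys (`n_trees`, `depth`, `l2`, `seed`) and the helper itself are hypothetical; the target names are the ones each library's classifier accepts.

```python
# Generic key -> framework-specific constructor argument.
PARAM_NAMES = {
    "xgboost": {
        "n_trees": "n_estimators", "learning_rate": "learning_rate",
        "depth": "max_depth", "l2": "reg_lambda", "seed": "random_state",
    },
    "lightgbm": {
        # For LightGBM, num_leaves usually matters more than max_depth.
        "n_trees": "n_estimators", "learning_rate": "learning_rate",
        "depth": "max_depth", "l2": "reg_lambda", "seed": "random_state",
    },
    "catboost": {
        "n_trees": "iterations", "learning_rate": "learning_rate",
        "depth": "depth", "l2": "l2_leaf_reg", "seed": "random_seed",
    },
}

def translate_params(generic, framework):
    """Rename generic hyperparameter keys for the chosen framework."""
    names = PARAM_NAMES[framework]
    return {names[key]: value for key, value in generic.items()}

generic = {"n_trees": 1000, "learning_rate": 0.05, "depth": 6, "l2": 3.0, "seed": 42}
catboost_params = translate_params(generic, "catboost")
xgboost_params = translate_params(generic, "xgboost")
print(catboost_params)
print(xgboost_params)
```

The translated dictionary can be unpacked into the matching classifier, for example CatBoostClassifier(**catboost_params). Sensible value ranges still differ between libraries, so the search space itself should be reviewed per framework.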

Practical Recommendation

Start with LightGBM for speed. If you have many high-cardinality categoricals, try CatBoost. Use XGBoost when you need maximum community support and the broadest deployment compatibility (e.g., ONNX export, Spark integration). All three typically train far faster than sklearn's GradientBoostingClassifier and usually match or beat its accuracy.