XGBoost, LightGBM, and CatBoost
Three frameworks dominate competitive machine learning on tabular data: XGBoost, LightGBM, and CatBoost. All implement gradient boosting, but they differ in speed, memory efficiency, handling of categorical variables, and tuning complexity. Knowing when to use each saves significant time.
Why They’re Faster Than sklearn GBM
Sklearn’s GradientBoostingClassifier evaluates each split point exactly across all features and samples — O(n × d × n_splits) per tree. The modern frameworks use algorithmic tricks:
XGBoost: Approximate split finding, regularized objective (L1+L2 on weights), column block data structure for cache efficiency, sparse-aware (native handling of missing values).
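LightGBM: Histogram-based split finding over binned feature values. Gradient-based One-Side Sampling (GOSS), an optional sampling mode, keeps every large-gradient sample and randomly subsamples the small-gradient ones (the “easy” cases), reweighting the sampled rows so the gain estimates stay approximately unbiased. Exclusive Feature Bundling (EFB) merges sparse, mutually exclusive features into a single feature. Trees grow leaf-wise (best-first) rather than to a fixed depth, which is what XGBoost and sklearn do by default.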
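CatBoost: Ordered boosting, which computes each sample’s residual with a model trained only on the samples that precede it in a random permutation, avoiding target leakage. Native categorical encoding through ordered target statistics, which usually works better than manual ordinal encoding. Symmetric (oblivious) trees, which apply the same split across an entire level, for faster prediction.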
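To see why symmetric (oblivious) trees predict quickly, here is a minimal pure-Python sketch. It is an illustration, not CatBoost’s implementation: the function names and the toy tree (features, thresholds, leaf_values) are hypothetical. Because every node on a level shares one test, finding the leaf reduces to building a bit pattern from one comparison per level.

```python
from typing import List, Sequence


def oblivious_leaf_index(
    x: Sequence[float], features: Sequence[int], thresholds: Sequence[float]
) -> int:
    """Return the leaf index of sample x in a symmetric (oblivious) tree.

    Level k applies the same test, x[features[k]] > thresholds[k], to every
    node on that level, so the leaf index is just the bits of the test results.
    """
    index = 0
    for feature, threshold in zip(features, thresholds):
        index = (index << 1) | int(x[feature] > threshold)
    return index


def predict_oblivious(
    x: Sequence[float],
    features: Sequence[int],
    thresholds: Sequence[float],
    leaf_values: List[float],
) -> float:
    """Look up the leaf value for x; leaf_values must hold 2**depth entries."""
    return leaf_values[oblivious_leaf_index(x, features, thresholds)]


# Hypothetical depth-2 tree: level 0 tests feature 0 > 0.5,
# level 1 tests feature 2 > 10.0.
features = [0, 2]
thresholds = [0.5, 10.0]
leaf_values = [-1.0, -0.5, 0.5, 1.0]

sample = [0.7, 3.0, 4.0]
prediction = predict_oblivious(sample, features, thresholds, leaf_values)
```

For the sample above, the first test passes and the second fails, giving bit pattern 10 (leaf 2). There is no per-node branching on tree structure, which is what makes this layout easy to vectorize.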
XGBoost
```python
import shap
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=0,                   # Min loss reduction required to make a split
    reg_alpha=0,               # L1 regularization
    reg_lambda=1,              # L2 regularization
    eval_metric='logloss',
    early_stopping_rounds=50,  # Constructor argument in XGBoost >= 1.6
    random_state=42,
    n_jobs=-1,
)

# Early stopping monitors the eval_set. Use a validation split here,
# not the final test set, so the test score stays unbiased.
eval_set = [(X_val, y_val)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=100)

# SHAP values (from the separate shap package)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
```
LightGBM
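LightGBM is often several times faster than XGBoost’s exact split finder and uses less memory; the gap narrows considerably against XGBoost’s histogram method (tree_method='hist'). It is a good default when training time is critical.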
```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,         # Controls tree complexity (not max_depth)
    max_depth=-1,          # -1 = no limit
    min_child_samples=20,
    feature_fraction=0.8,  # Column subsampling (alias of colsample_bytree)
    bagging_fraction=0.8,  # Row subsampling (alias of subsample)
    bagging_freq=5,        # Re-sample rows every 5 iterations
    reg_alpha=0,
    reg_lambda=0,
    n_jobs=-1,
    random_state=42,
    verbose=-1,
)

# Early stopping uses a validation split, not the final test set
callbacks = [lgb.early_stopping(50), lgb.log_evaluation(100)]
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=callbacks,
)
```
num_leaves is the primary complexity parameter in LightGBM (instead of max_depth). If you also cap max_depth, keep num_leaves < 2^max_depth to avoid overfitting.
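The bound comes from the fact that a binary tree of depth d has at most 2^d leaves. A small sketch of that arithmetic follows; both helper names are hypothetical and not part of LightGBM.

```python
def num_leaves_upper_bound(max_depth: int) -> int:
    """Number of leaves in a full binary tree of the given depth."""
    if max_depth <= 0:
        # LightGBM uses max_depth=-1 for "no limit", so no bound applies there.
        raise ValueError("max_depth must be a positive integer")
    return 2 ** max_depth


def num_leaves_is_safe(num_leaves: int, max_depth: int) -> bool:
    """True if num_leaves stays strictly below the full-tree leaf count."""
    return num_leaves < num_leaves_upper_bound(max_depth)


# A level-wise tree with max_depth=6 can have up to 64 leaves, so the
# num_leaves=31 used above is a much smaller tree than "depth 6" suggests.
bound_for_depth_6 = num_leaves_upper_bound(6)
```

This is also a reminder not to copy a max_depth from an XGBoost configuration by setting num_leaves to 2^max_depth: a leaf-wise tree with that many leaves can grow far deeper along a few branches.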
CatBoost
CatBoost handles categorical features natively — no need for one-hot encoding or target encoding.
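To make “natively” concrete, the sketch below shows a simplified version of ordered target statistics for a binary target. It is an assumption-laden illustration, not CatBoost’s code: the function name, the prior and prior_weight parameters, and the single permutation are all choices made for this example, whereas the library uses several permutations and further refinements. Each row is encoded using only the targets of rows that come before it in a random order, so a row never sees its own label.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def ordered_target_statistics(
    categories: Sequence[Hashable],
    targets: Sequence[float],
    prior: float = 0.5,
    prior_weight: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Encode each category value using only rows earlier in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    target_sums = defaultdict(float)  # running sum of targets per category
    counts = defaultdict(int)         # running row count per category
    encoded = [0.0] * len(categories)

    for i in order:
        c = categories[i]
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (target_sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
        # Only now does the row's own target become visible to later rows.
        target_sums[c] += targets[i]
        counts[c] += 1
    return encoded


colors = ["red", "blue", "red", "red", "blue"]
labels = [1, 0, 1, 0, 1]
encoded_colors = ordered_target_statistics(colors, labels)
```

The first row visited for each category receives exactly the prior, and later rows drift toward that category’s observed target mean. With CatBoost you never write this yourself; you only pass the categorical column indices.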
```python
from catboost import CatBoostClassifier, Pool

# Identify categorical feature indices
cat_features = [0, 2, 5]  # Index positions of categorical columns

train_pool = Pool(X_train, y_train, cat_features=cat_features)
# Early stopping uses a validation split, not the final test set
val_pool = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    early_stopping_rounds=50,
    eval_metric='Accuracy',
    random_seed=42,
    verbose=100,
)

model.fit(train_pool, eval_set=val_pool)
```
Framework Comparison
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Training speed | Fast | Fastest | Medium |
| Memory usage | High | Low | Medium |
| Categorical support | Opt-in (enable_categorical) | Native (categorical_feature) | Native (ordered target statistics) |
| GPU support | Yes | Yes | Yes |
| Missing value handling | Native | Native | Native |
| SHAP support | Yes | Yes | Yes |
| Best for | General purpose | Large datasets, speed | High cardinality categoricals |
Tuning Strategy
All three follow the same basic tuning workflow, shown here for XGBoost with Optuna:
```python
import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score

def objective(trial):
    params = {
        'n_estimators': 1000,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }
    model = xgb.XGBClassifier(**params, early_stopping_rounds=50, eval_metric='auc')
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    # Return the same metric used for early stopping (binary classification
    # assumed); model.score would return accuracy instead.
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```
Practical Recommendation
Start with LightGBM for speed. If you have many high-cardinality categoricals, try CatBoost. Use XGBoost when you need maximum community support and the broadest deployment compatibility (e.g., ONNX export, Spark integration). All three usually train much faster than sklearn’s GradientBoostingClassifier, often with equal or better accuracy.