XGBoost, LightGBM, and CatBoost

Three frameworks dominate competitive machine learning on tabular data: XGBoost, LightGBM, and CatBoost. All implement gradient boosting, but they differ in speed, memory efficiency, handling of categorical variables, and tuning complexity. Knowing when to use each saves significant time.

Why They’re Faster Than sklearn GBM

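Sklearn’s GradientBoostingClassifier uses the exact greedy method: at every node it sorts each feature and evaluates every distinct threshold, so the cost per tree level grows as roughly O(d × n log n) for n samples and d features. The modern frameworks cut that cost with algorithmic tricks: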

XGBoost: Approximate split finding (quantile sketches and histograms), a regularized objective (L1 and L2 penalties on leaf weights), a column-block data structure for cache efficiency, and sparsity-aware split finding that learns a default direction for missing values.

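LightGBM: Histogram-based split finding over binned feature values. Gradient-based One-Side Sampling (GOSS) keeps every large-gradient sample and randomly subsamples the small-gradient ones (the “easy” cases), up-weighting the sampled rows so gain estimates stay approximately unbiased. Exclusive Feature Bundling (EFB) merges sparse, mutually exclusive features into a single feature. Trees grow leaf-wise rather than level-wise (the default in XGBoost and sklearn).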

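CatBoost: Ordered boosting computes each sample’s gradient from a model fitted only on samples that precede it in a random permutation, which avoids target leakage. Categorical features are encoded natively with ordered target statistics, which often beats manual ordinal encoding. Symmetric (oblivious) trees apply the same split across an entire level, which makes prediction faster.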
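To make the GOSS idea above concrete, here is a minimal NumPy sketch of one sampling round as described in the LightGBM paper: keep the rows with the largest absolute gradients, draw a random subset of the rest, and up-weight that subset by (1 - top_rate) / other_rate. The function name goss_sample is hypothetical, and the argument names only mirror LightGBM's top_rate and other_rate parameters. This is an illustration, not LightGBM's implementation.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row indices, row weights) for one GOSS-style sampling round."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)      # rows kept because their gradient is large
    n_other = int(other_rate * n)  # rows drawn at random from the remainder

    # Rank rows by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]
    rest_idx = order[n_top:]
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows so that they stand in
    # for all the small-gradient rows that were dropped.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, other_idx]), weights

# Synthetic gradients, used here only for illustration.
grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), "of", len(grads), "rows used for split finding")
```

With these rates only 30% of the rows take part in split finding, yet every hard example (large gradient) is still present.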

XGBoost

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=0,               # Min gain to make a split
    reg_alpha=0,           # L1 regularization
    reg_lambda=1,          # L2 regularization
    eval_metric='logloss',
    early_stopping_rounds=50,  # Constructor argument since XGBoost 1.6
    random_state=42,
    n_jobs=-1
)

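# Early-stop on a validation split (X_val, y_val) held out from the training
# data; reserve X_test for the final, unbiased evaluation
eval_set = [(X_val, y_val)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=100)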

# SHAP values (shap is a separate package: pip install shap)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
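The gamma, reg_lambda, and min_child_weight arguments in the constructor above all act on one quantity: the gain of a candidate split. The sketch below follows the gain formula from the XGBoost paper, written in terms of the summed gradients and Hessians of the two children. The function name split_gain is hypothetical, and the L1 term (reg_alpha) is left out for brevity.

```python
def split_gain(g_left, h_left, g_right, h_right,
               reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Gain of a candidate split from summed gradients (g) and Hessians (h)."""
    # A child whose Hessian sum falls below min_child_weight is not allowed.
    if h_left < min_child_weight or h_right < min_child_weight:
        return float("-inf")

    def score(g, h):
        # Structure score of a leaf; reg_lambda shrinks it toward zero.
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    # gamma is a fixed cost paid for adding one more leaf.
    return 0.5 * (children - parent) - gamma

print(split_gain(-4.0, 3.0, 5.0, 4.0))              # 4.4375
print(split_gain(-4.0, 3.0, 5.0, 4.0, gamma=10.0))  # -5.5625
```

A split is only worth making when its gain is positive, so raising gamma prunes weak splits, raising reg_lambda shrinks every score, and raising min_child_weight rules out splits that isolate very few samples.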

LightGBM

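LightGBM often trains several times faster than XGBoost’s exact method and uses less memory, because it bins features into histograms and grows trees leaf-wise. The gap is much smaller against XGBoost’s own histogram method (tree_method='hist'), so benchmark on your own data. It is a good first choice when training time is critical.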

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,          # Controls tree complexity (not max_depth)
    max_depth=-1,           # -1 = no limit
    min_child_samples=20,
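    colsample_bytree=0.8,   # Column subsampling (alias: feature_fraction)
    subsample=0.8,          # Row subsampling (alias: bagging_fraction)
    subsample_freq=5,       # Resample rows every 5 iterations (alias: bagging_freq)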
    reg_alpha=0,
    reg_lambda=0,
    n_jobs=-1,
    random_state=42,
    verbose=-1
)

callbacks = [lgb.early_stopping(50), lgb.log_evaluation(100)]
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],  # Validation split; keep X_test for the final evaluation
    callbacks=callbacks
)

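num_leaves is the primary complexity parameter in LightGBM (instead of max_depth). Because trees grow leaf-wise, a tree with a given num_leaves can be far deeper than a level-wise tree of the same size. A tree of depth d holds at most 2^d leaves, so when you also cap max_depth, keep num_leaves < 2^max_depth; lower num_leaves to reduce overfitting.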
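The relationship between num_leaves and max_depth is simple arithmetic, sketched below. The helper names leaf_budget and check_num_leaves are hypothetical; the only LightGBM behavior assumed is that max_depth <= 0 means no depth limit.

```python
def leaf_budget(max_depth):
    """Maximum number of leaves in a binary tree of the given depth."""
    return 2 ** max_depth

def check_num_leaves(num_leaves, max_depth):
    """Return True when num_leaves is strictly below the depth-implied cap."""
    if max_depth <= 0:  # LightGBM convention: no depth limit
        return True
    return num_leaves < leaf_budget(max_depth)

# Compare the default num_leaves=31 against a few depth caps.
for depth in (3, 5, 7):
    print(depth, leaf_budget(depth), check_num_leaves(31, depth))
# 3 8 False
# 5 32 True
# 7 128 True
```

The default num_leaves=31 therefore corresponds to an almost full tree of depth 5, and pairing it with max_depth=3 would leave the depth cap as the only binding limit.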

CatBoost

CatBoost handles categorical features natively — no need for one-hot encoding or target encoding.

from catboost import CatBoostClassifier, Pool

# Identify categorical feature indices
cat_features = [0, 2, 5]  # Index positions of categorical columns

# Assumes a validation split (X_val, y_val) held out from the training data
train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    early_stopping_rounds=50,
    eval_metric='Accuracy',
    random_seed=42,
    verbose=100
)

# Early-stop on the validation pool; keep the test set for the final evaluation
model.fit(train_pool, eval_set=val_pool)
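CatBoost's native categorical handling rests on ordered target statistics: each row's category is encoded using the targets of only those rows that come before it in a random permutation, so a row never sees its own label. The sketch below is a simplified single-permutation version, written to show the idea rather than to reproduce CatBoost's internals. The function name ordered_target_stat and the prior_weight argument are hypothetical.

```python
import numpy as np

def ordered_target_stat(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row from the targets of earlier rows in a random order."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(categories))
    sums, counts = {}, {}  # running target sum and row count per category
    encoded = np.empty(len(categories), dtype=float)
    for i in perm:
        c = categories[i]
        s = sums.get(c, 0.0)
        k = counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now does the row's own target enter the running statistics.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded

# Toy data, used here only for illustration.
cats = ["a", "b", "a", "a", "b", "c"]
y = np.array([1, 0, 1, 0, 1, 1])
enc = ordered_target_stat(cats, y, prior=y.mean())
print(enc.round(3))
```

A category seen for the first time falls back to the prior, which is why rare and high-cardinality categories remain usable without one-hot encoding.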

Framework Comparison

Feature	XGBoost	LightGBM	CatBoost
Training speed	Fast	Fastest	Medium
Memory usage	High	Low	Medium
Categorical support	Opt-in (enable_categorical)	Native (categorical_feature)	Native (ordered target statistics)
GPU support	Yes	Yes	Yes
Missing value handling	Native	Native	Native
SHAP support	Yes	Yes	Yes
Best for	General purpose	Large datasets, speed	High cardinality categoricals

Tuning Strategy

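All three follow the same basic tuning workflow: fix a generous tree budget, let early stopping choose the number of trees, and search the remaining hyperparameters. The example uses Optuna with XGBoost: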

import optuna

def objective(trial):
    params = {
        'n_estimators': 1000,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }
    model = xgb.XGBClassifier(**params, early_stopping_rounds=50, eval_metric='auc')
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
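    # Return the metric used for early stopping (validation AUC at the best
    # iteration) rather than accuracy from model.score
    return model.best_score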

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
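The log=True arguments in the search space above ask Optuna to sample on a logarithmic scale, which matters for parameters such as learning_rate that span more than one order of magnitude. The NumPy sketch below shows what log-uniform sampling means. It does not reproduce Optuna's default sampler, which adapts to earlier trials, and the helper name sample_log_uniform is hypothetical.

```python
import numpy as np

def sample_log_uniform(low, high, size, seed=0):
    """Draw values whose logarithm is uniform on [log(low), log(high))."""
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

# Same range as the learning_rate search space above.
log_draws = sample_log_uniform(0.01, 0.3, size=10_000)
lin_draws = np.random.default_rng(0).uniform(0.01, 0.3, size=10_000)

# Share of draws that land below 0.05 under each scheme.
print("log scale:   ", round(float(np.mean(log_draws < 0.05)), 2))
print("linear scale:", round(float(np.mean(lin_draws < 0.05)), 2))
```

On the log scale close to half of the draws fall below 0.05, while on a linear scale only about one in seven does, so small learning rates and small regularization strengths get a fair share of the trials.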

Practical Recommendation

Start with LightGBM for speed. If you have many high-cardinality categoricals, try CatBoost. Use XGBoost when you need maximum community support and the broadest deployment compatibility (e.g., ONNX export, Spark integration). All three typically train far faster than sklearn’s GradientBoostingClassifier and match or exceed its accuracy on most tabular problems.