The Story That Explains XGBoost
This is Gradient Boosting, the idea behind XGBoost: the models are not independent, each one corrects the residuals of its predecessors. XGBoost (eXtreme Gradient Boosting) is the engineering masterpiece that makes this process blazingly fast, regularised to prevent overfitting, and competition-grade accurate.
XGBoost was created by Tianqi Chen in 2014 and dominated Kaggle leaderboards for years. It remains one of the most powerful algorithms for structured/tabular data, combining gradient boosting mathematics with regularisation, sparsity awareness, and highly optimised parallel computation.
XGBoost builds an ensemble of decision trees sequentially: each new tree fits the negative gradient of the loss with respect to the current ensemble's predictions, and adds a small corrective contribution to the overall prediction.
How Gradient Boosting Works – Step by Step
Before touching XGBoost's engineering, you must understand its mathematical heart: gradient boosting. Each tree is built to reduce the loss function by following its negative gradient (the direction of steepest descent).
Classic gradient boosting uses first-order gradient information only. XGBoost uses a second-order Taylor expansion (both gradient and Hessian), which produces a more precise optimal leaf weight at each split. It also adds an explicit regularisation term (Ω) to the objective, penalising tree complexity and preventing overfitting from the start.
The XGBoost Objective Function – Math Explained Simply
XGBoost minimises the following objective at each boosting round:
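In the standard notation of the XGBoost paper (g_i and h_i are the first- and second-order gradients of the loss for example i, f_t is the new tree, T its number of leaves, w_j its leaf weights):

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\Big[\, g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma\, T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2
$$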
The optimal leaf weight formula w* = −G/(H+λ) means each leaf value is solved analytically; no gradient descent is needed inside the tree. The λ (L2 regularisation) term in the denominator shrinks weights toward zero, preventing extreme predictions, and the γ (minimum split gain) threshold means XGBoost naturally prunes splits that don't improve the objective enough.
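Solving that objective leaf by leaf gives the closed-form leaf weight and split gain (G_j and H_j are the gradients and Hessians summed over the examples falling in leaf j; L and R denote the candidate left and right children of a split):

$$
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
$$

A split is kept only when this gain is positive, which is exactly where γ acts as the pruning threshold.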
What Makes XGBoost Special – Six Engineering Breakthroughs
Visual Diagram – Inside XGBoost Boosting Rounds
Each boosting round: compute residuals → fit a tree → update the ensemble. Residuals shrink with each round. The final prediction is the weighted sum of all trees.
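To make the loop concrete, here is a minimal from-scratch sketch of plain gradient boosting for squared-error regression (for that loss the negative gradient is simply the residual). It is illustrative only and omits everything XGBoost adds on top (second-order information, regularisation, sparsity handling, parallelism):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Plain gradient boosting with squared error: each tree fits the residuals."""
    base = y.mean()
    pred = np.full(len(y), base)              # round 0: constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                  # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                # fit a small tree to the residuals
        pred += learning_rate * tree.predict(X)   # shrunken corrective update
        trees.append(tree)
    return base, trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred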
Python Implementation – Classification & Regression
Installation
# Install XGBoost
pip install xgboost scikit-learn pandas numpy
Classification – Predicting Customer Churn
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# Simulated churn dataset: 8000 customers, 20 features
X, y = make_classification(
n_samples=8000, n_features=20,
n_informative=12, n_redundant=4,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# XGBoost Classifier: sensible starting defaults
model = xgb.XGBClassifier(
n_estimators=300, # number of boosting rounds
learning_rate=0.05, # eta: shrinkage applied to each round's contribution
max_depth=6, # max tree depth
subsample=0.8, # row sampling per tree
colsample_bytree=0.8, # feature sampling per tree
gamma=0, # min split gain
reg_alpha=0, # L1 regularisation
reg_lambda=1, # L2 regularisation
use_label_encoder=False,
eval_metric='logloss',
random_state=42
)
model.fit(
X_train, y_train,
eval_set=[(X_test, y_test)],
verbose=50 # print every 50 rounds
)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
Regression – California Housing Prices
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_absolute_error
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
reg = xgb.XGBRegressor(
n_estimators=500,
learning_rate=0.03,
max_depth=5,
subsample=0.75,
colsample_bytree=0.75,
reg_lambda=1.5,
random_state=42
)
reg.fit(X_train, y_train,
eval_set=[(X_test, y_test)],
verbose=False)
y_pred = reg.predict(X_test)
print(f"Rยฒ Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
Key Hyperparameters – Complete Reference
Understanding XGBoost's hyperparameters is essential before tuning. They fall into four groups:
Group 1 – Boosting Control
| Parameter | Default | Range | Effect | Tuning Priority |
|---|---|---|---|---|
| n_estimators | 100 | 50–2000 | Number of boosting rounds (trees) | High – tune with early stopping |
| learning_rate (eta) | 0.3 | 0.01–0.3 | Shrinkage per round. Lower = better generalisation but needs more trees | Critical – start at 0.05–0.1 |
| booster | 'gbtree' | gbtree/gblinear/dart | Base learner type. gbtree is almost always best for tabular data | Rarely changed |
Group 2 – Tree Structure
| Parameter | Default | Range | Effect | Tuning Priority |
|---|---|---|---|---|
| max_depth | 6 | 3–10 | Maximum tree depth. Deeper = more complex patterns but higher overfitting risk | Critical – tune first |
| min_child_weight | 1 | 1–10 | Minimum sum of Hessian in a leaf. Higher = more conservative splits, less overfitting | High |
| gamma | 0 | 0–5 | Minimum gain for a split. 0 = split freely. Higher = more aggressive pruning | High |
| max_leaves | 0 (unlimited) | 0–64 | Used with grow_policy='lossguide'. Limits leaves instead of depth | Optional |
Group 3 – Sampling (Stochastic Boosting)
| Parameter | Default | Range | Effect | Tuning Priority |
|---|---|---|---|---|
| subsample | 1.0 | 0.5–1.0 | Fraction of training rows sampled per tree. <1.0 adds randomness, like Random Forest | High |
| colsample_bytree | 1.0 | 0.3–1.0 | Fraction of features sampled per tree | High |
| colsample_bylevel | 1.0 | 0.3–1.0 | Fraction of features sampled per tree depth level | Medium |
| colsample_bynode | 1.0 | 0.3–1.0 | Fraction of features sampled per node split | Medium |
Group 4 – Regularisation
| Parameter | Default | Range | Effect | Tuning Priority |
|---|---|---|---|---|
| reg_alpha (alpha) | 0 | 0–10 | L1 regularisation on leaf weights. Drives some weights to zero (sparse solutions) | High for high-dimensional data |
| reg_lambda (lambda) | 1 | 0–10 | L2 regularisation on leaf weights. Shrinks all weights toward zero. Usually beneficial | High |
| scale_pos_weight | 1 | ratio | For imbalanced data: set to sum(neg)/sum(pos). Like class_weight='balanced' | Critical for imbalanced problems |
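For the scale_pos_weight rule in the table above, a minimal sketch (assuming an integer 0/1 target array y_train):

import numpy as np
import xgboost as xgb

neg, pos = np.bincount(y_train)          # counts of class 0 and class 1
model = xgb.XGBClassifier(
    scale_pos_weight=neg / pos,          # up-weight the positive class by the imbalance ratio
    eval_metric='aucpr',                 # PR-AUC is more informative on imbalanced data
    random_state=42
)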
A practical tuning order:
1. Fix learning_rate=0.05 and find the right n_estimators via early stopping.
2. Tune max_depth and min_child_weight together.
3. Tune subsample and colsample_bytree.
4. Tune gamma, reg_alpha, reg_lambda.
5. Lower learning_rate and increase n_estimators proportionally.
Early Stopping – Finding the Right Number of Trees
Early stopping watches validation loss at every boosting round and stops when no improvement has been seen for early_stopping_rounds consecutive rounds, keeping the best round, not the final one, as the model used for prediction.
import xgboost as xgb
from sklearn.model_selection import train_test_split
# Use a validation set for early stopping
X_train, X_val, y_train, y_val = train_test_split(
X_train_full, y_train_full,
test_size=0.15, random_state=42
)
model = xgb.XGBClassifier(
n_estimators=2000, # set high; early stopping will find the actual best
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
eval_metric='logloss',
early_stopping_rounds=50, # stop if no improvement for 50 rounds
random_state=42
)
model.fit(
X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=False
)
print(f"Best round: {model.best_iteration}")
print(f"Best val logloss: {model.best_score:.5f}")
# Use best_iteration as n_estimators in final model
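# For example, a sketch reusing the settings above and the full training split:
final_model = xgb.XGBClassifier(
    n_estimators=model.best_iteration + 1,   # trees up to and including the best round
    learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    eval_metric='logloss', random_state=42
)
final_model.fit(X_train_full, y_train_full)  # refit on all training data; no holdout needed now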
The Four Tuning Strategies – Overview
Hyperparameter tuning is the art of finding the combination of knobs that maximises model performance. There are four major strategies, each with very different trade-offs between speed, exploration, and compute cost.
Strategy 1 – Grid Search
Grid Search trains a model for every possible combination of the values you specify. With 3 parameters each having 5 values, that's 5³ = 125 models. With 6 parameters it becomes 5⁶ = 15,625 models; with 5-fold CV, that's 78,125 training runs.
Use Grid Search only when you have ≤3 hyperparameters and a small, fast-training model. For XGBoost with its 10+ parameters, Grid Search is rarely the right choice. Use it for fine-tuning a small range after Random Search or Optuna has found a good region.
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
# Keep the grid SMALL: 3 params × 3 values each = 27 combos × 5 folds = 135 fits
param_grid = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.05, 0.1],
'n_estimators': [100, 300, 500],
}
base_model = xgb.XGBClassifier(
subsample=0.8,
colsample_bytree=0.8,
use_label_encoder=False,
eval_metric='logloss',
random_state=42
)
grid_search = GridSearchCV(
estimator=base_model,
param_grid=param_grid,
cv=5,
scoring='f1_weighted',
n_jobs=-1, # use all CPU cores
verbose=2
)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print(f"Best CV F1: {grid_search.best_score_:.4f}")
print(f"Test Accuracy: {grid_search.best_estimator_.score(X_test, y_test):.4f}")
# Inspect all results as a DataFrame
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
top5 = results_df.nlargest(5, 'mean_test_score')[[
'param_max_depth', 'param_learning_rate',
'param_n_estimators', 'mean_test_score'
]]
print(top5.to_string(index=False))
Adding one more parameter with 5 values multiplies the total number of combinations by 5. If each XGBoost fit takes 10 seconds and you have 6 parameters with 5 values each, Grid Search needs 5⁶ combinations × 5 folds × 10 s ≈ 217 hours on a single machine. Random Search with 100 iterations (500 fits) would finish in under 90 minutes and typically finds a comparably good solution.
Strategy 2 – Random Search
Picture two prospectors searching a desert where the gold is concentrated in a thin 50-metre seam running diagonally across it. Prospector A digs on a neat, evenly spaced grid and misses the seam entirely, because none of his rows cross it. Prospector B digs the same number of holes at random spots and is far more likely to stumble across it.
Bergstra & Bengio (2012) showed this formally: for hyperparameter search, random sampling is almost always more efficient than grid sampling when most parameters don't matter.
Random Search samples each parameter independently from a distribution (list, range, or statistical distribution). Because it doesn't waste evaluations testing every combination of unimportant parameters, it finds good solutions much faster than Grid Search.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform
import xgboost as xgb
# Wide search space โ Random Search explores this efficiently
param_dist = {
'n_estimators': randint(100, 1000),
'learning_rate': loguniform(0.005, 0.3), # log-scale: more low values
'max_depth': randint(3, 10),
'min_child_weight': randint(1, 10),
'subsample': uniform(0.5, 0.5), # uniform(loc, scale) samples from [0.5, 1.0]
'colsample_bytree': uniform(0.4, 0.6), # samples from [0.4, 1.0]
'gamma': uniform(0, 2),
'reg_alpha': loguniform(1e-4, 10),
'reg_lambda': loguniform(1e-4, 10),
}
base = xgb.XGBClassifier(
use_label_encoder=False,
eval_metric='logloss',
random_state=42
)
random_search = RandomizedSearchCV(
estimator=base,
param_distributions=param_dist,
n_iter=100, # 100 random combos; much cheaper than a full grid
cv=5,
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=42
)
random_search.fit(X_train, y_train)
print("Best Parameters:", random_search.best_params_)
print(f"Best CV AUC: {random_search.best_score_:.5f}")
best_xgb = random_search.best_estimator_
from sklearn.metrics import roc_auc_score
y_proba = best_xgb.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, y_proba):.5f}")
| Grid Search (27-combo grid above) | Value |
|---|---|
| Evaluations | 135 (27 × 5 folds) |
| Best CV AUC | 0.9431 |
| Search space coverage | Exhaustive within the small grid |
| Missed params | subsample, alpha, lambda, gamma |
| Random Search (100 iterations) | Value |
|---|---|
| Evaluations | 500 (100 × 5 folds) |
| Best CV AUC | 0.9721 |
| Search space coverage | Wide, log-uniform distributions |
| Covered params | All 9 key XGBoost params |
Strategy 3 – Optuna (TPE + Pruning)
Optuna's Tree-structured Parzen Estimator (TPE) builds a probabilistic model of which parameter regions produce good results, then samples more from good regions. Its pruner terminates bad trials early, saving massive compute.
Optuna is a modern hyperparameter optimisation framework by Preferred Networks. It uses the TPE algorithm which models the objective function with two probability densities โ one for good parameter regions and one for bad โ and samples from the ratio.
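In rough terms, TPE splits the trials seen so far at a score threshold y*, fits a density l(x) to the parameters of the good trials and g(x) to the rest, and proposes the candidate that maximises their ratio:

$$
x_{\text{next}} = \arg\max_{x} \frac{l(x)}{g(x)},
\qquad
l(x) = p(x \mid \text{good trials}), \quad g(x) = p(x \mid \text{remaining trials})
$$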
pip install optuna
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score
import numpy as np
# Suppress Optuna's default logging for clean output
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
    # Define the search space with Optuna's suggest API
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-4, 10, log=True),
        'use_label_encoder': False,
        'eval_metric': 'logloss',
        'random_state': 42,
    }
    model = xgb.XGBClassifier(**params)
    scores = cross_val_score(model, X_train, y_train,
                             cv=5, scoring='roc_auc',
                             n_jobs=-1)
    return scores.mean()  # Optuna maximises this
# Create a study and optimise
study = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42), # Tree Parzen Estimator
pruner=optuna.pruners.MedianPruner(n_startup_trials=10)  # only takes effect when trials report intermediate values (see the pruning example below)
)
study.optimize(objective, n_trials=100, show_progress_bar=True)
print(f"Best AUC: {study.best_value:.5f}")
print(f"Best params: {study.best_params}")
# Retrain final model on full training set with best params
best_model = xgb.XGBClassifier(**study.best_params,
use_label_encoder=False,
eval_metric='logloss')
best_model.fit(X_train, y_train)
print(f"Test AUC: {roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]):.5f}")
Optuna with Early Stopping (Advanced – Real Speed-up)
import optuna
from optuna.integration import XGBoostPruningCallback
import xgboost as xgb
def objective_with_pruning(trial):
    params = {
        'n_estimators': 1000,
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 10, log=True),
        'use_label_encoder': False,
        'eval_metric': 'logloss',
        'early_stopping_rounds': 50,
        # the sklearn API names the first eval_set entry "validation_0"
        'callbacks': [XGBoostPruningCallback(trial, 'validation_0-logloss')]
    }
    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)
    return model.best_score  # best validation logloss (to be minimised)
study2 = optuna.create_study(direction='minimize') # minimise logloss
study2.optimize(objective_with_pruning, n_trials=80)
print(f"Pruned trials: {len([t for t in study2.trials if t.state == optuna.trial.TrialState.PRUNED])}")
print(f"Best logloss: {study2.best_value:.5f}")
Optuna generates rich interactive plots: optuna.visualization.plot_param_importances(study)
tells you which hyperparameters matter most, plot_optimization_history(study) shows convergence,
and plot_contour(study) reveals interactions between pairs of parameters.
These are invaluable for understanding your model's search landscape.
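For example, assuming the completed study from the TPE run above (these helpers return Plotly figures, so plotly must be installed):

import optuna.visualization as vis

vis.plot_param_importances(study).show()      # which hyperparameters mattered most
vis.plot_optimization_history(study).show()   # best score vs. trial number
vis.plot_contour(study, params=['max_depth', 'learning_rate']).show()  # pairwise interaction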
Strategy 4 – Bayesian Optimisation
Imagine a treasure hunter who updates a mental map of likely locations after every dig. He balances exploitation (digging near known-good spots) and exploration (checking uncertain new areas). After 30 digs he finds the treasure; a purely random searcher might have needed 300.
Bayesian Optimisation works the same way: it builds a surrogate model (a Gaussian Process or TPE) of the objective function and uses an acquisition function (such as Expected Improvement) to pick the next most promising point to evaluate.
This makes Bayesian Optimisation the most sample-efficient tuning method: it evaluates the fewest models to find a strong result.
Bayesian Optimisation with scikit-optimize (skopt)
pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Integer
import xgboost as xgb
# Define search space with proper types
search_space = {
'n_estimators': Integer(100, 1000),
'learning_rate': Real(0.005, 0.3, prior='log-uniform'),
'max_depth': Integer(3, 10),
'min_child_weight': Integer(1, 10),
'subsample': Real(0.5, 1.0),
'colsample_bytree': Real(0.4, 1.0),
'gamma': Real(0, 5.0),
'reg_alpha': Real(1e-4, 10, prior='log-uniform'),
'reg_lambda': Real(1e-4, 10, prior='log-uniform'),
}
base = xgb.XGBClassifier(
use_label_encoder=False,
eval_metric='logloss',
random_state=42
)
# BayesSearchCV wraps the GP-based search in sklearn's CV framework
bayes_search = BayesSearchCV(
estimator=base,
search_spaces=search_space,
n_iter=50, # 50 evaluations; often beats 200 random iterations
cv=5,
scoring='roc_auc',
n_jobs=-1,
verbose=0,
random_state=42
)
# Callback to print progress
def on_step(optim_result):
print(f" Iter {len(optim_result.x_iters):3d} | Best AUC: {-optim_result.fun:.5f}")
bayes_search.fit(X_train, y_train, callback=[on_step])
print(f"\nBest CV AUC: {bayes_search.best_score_:.5f}")
print("Best params:", bayes_search.best_params_)
Bayesian Optimisation with Optuna (GP Sampler)
import optuna
from optuna.samplers import GPSampler  # Gaussian Process sampler (Optuna ≥ 3.6)
study_gp = optuna.create_study(
direction='maximize',
sampler=GPSampler(seed=42),
study_name='xgb_gp_tuning'
)
study_gp.optimize(objective, n_trials=50)
print(f"GP-Optuna Best AUC: {study_gp.best_value:.5f}")
Tuning Strategy Comparison – Full Table
| Property | Grid Search | Random Search | Optuna (TPE) | Bayesian (GP) |
|---|---|---|---|---|
| Search strategy | Exhaustive grid | Random sampling | TPE (probabilistic) | GP surrogate + EI |
| Learns from past? | No | No | Yes (TPE model) | Yes (GP model) |
| Handles many params? | Fails (exponential) | Yes, scales well | Yes, very well | Medium (GP scales O(n³)) |
| Parallelisable? | Yes (trivial) | Yes (trivial) | Yes (async) | Partially (sequential by design) |
| Trial pruning? | No | No | Yes (MedianPruner) | Some implementations |
| Sample efficiency | Very low | Medium | High | Highest |
| Typical evaluations needed | All (pⁿ) | 50–200 | 50–150 | 25–75 |
| Best for XGBoost when... | Fine-tuning 2–3 params in a narrow range | Initial broad exploration, ≥5 params | Production tuning with a budget | Slow evaluations, tight iteration budget |
| Library | sklearn.GridSearchCV | sklearn.RandomizedSearchCV | optuna | scikit-optimize, optuna GPSampler |
Bayesian Optimisation and Optuna (TPE) converge to near-optimal solutions in far fewer evaluations than Grid or Random Search. The advantage is greatest when each evaluation is expensive (slow model training).
Feature Importance in XGBoost
XGBoost provides three types of built-in feature importance, each measuring something different: weight (how many times a feature is used in splits), gain (the average improvement in the objective from splits on that feature), and cover (the average number of samples affected by those splits).
import xgboost as xgb
import pandas as pd
import matplotlib.pyplot as plt
# Get feature importances (gain is most informative)
importances = model.get_booster().get_score(importance_type='gain')
imp_df = pd.DataFrame({
'feature': list(importances.keys()),
'importance': list(importances.values())
}).sort_values('importance', ascending=False)
print(imp_df.to_string(index=False))
# Built-in plot (requires matplotlib)
xgb.plot_importance(model, importance_type='gain', max_num_features=15)
plt.title('XGBoost Feature Importance (Gain)')
plt.tight_layout()
plt.show()
# SHAP values โ gold standard for XGBoost interpretability
# pip install shap
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)  # pass feature_names=[...] if your columns have names
Built-in feature importance (weight/gain/cover) can be misleading when features are correlated. SHAP (SHapley Additive exPlanations) values are the gold standard: they are consistent, locally accurate, and handle correlated features far more gracefully. For tree ensembles, SHAP's TreeExplainer computes exact SHAP values in polynomial time (roughly O(TLD²) for T trees with L leaves and depth D), which is fast enough for everyday use.
Complete Production Pipeline – End-to-End Example
import xgboost as xgb
import optuna
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.preprocessing import LabelEncoder
## ── STEP 1: Data ──────────────────────────────
X, y = make_classification(n_samples=10000, n_features=25,
n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
X_train, y_train, test_size=0.15, stratify=y_train, random_state=42
)
## ── STEP 2: Find n_estimators with early stopping ──
probe = xgb.XGBClassifier(
n_estimators=2000, learning_rate=0.05, max_depth=6,
subsample=0.8, colsample_bytree=0.8,
eval_metric='logloss', early_stopping_rounds=50,
use_label_encoder=False, random_state=42
)
probe.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
best_n = probe.best_iteration
print(f"Best n_estimators: {best_n}")
## ── STEP 3: Optuna for all other params ───────────
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
    # suggest names match XGBClassifier kwargs so study.best_params can be reused directly
    params = {
        'n_estimators': best_n,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 3),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 5, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-4, 5, log=True),
        'use_label_encoder': False,
        'eval_metric': 'logloss',
        'random_state': 42,
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(
        xgb.XGBClassifier(**params), X_train, y_train,
        cv=cv, scoring='roc_auc', n_jobs=-1
    )
    return scores.mean()
study = optuna.create_study(direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=80, show_progress_bar=True)
## ── STEP 4: Final model ────────────────────────────
final_params = {**study.best_params, 'n_estimators': best_n,
'use_label_encoder': False, 'eval_metric': 'logloss'}
final_model = xgb.XGBClassifier(**final_params)
final_model.fit(X_train, y_train)
y_proba = final_model.predict_proba(X_test)[:, 1]
y_pred = final_model.predict(X_test)
print(f"\nTest AUC-ROC: {roc_auc_score(y_test, y_proba):.5f}")
print(classification_report(y_test, y_pred))
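## ── STEP 5: Persist the model (a sketch; the file name is illustrative) ──
final_model.save_model('xgb_final_model.json')   # JSON format keeps the full booster state

loaded = xgb.XGBClassifier()
loaded.load_model('xgb_final_model.json')
assert np.allclose(loaded.predict_proba(X_test), final_model.predict_proba(X_test))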
Golden Rules – XGBoost & Hyperparameter Tuning
1. Set n_estimators very high (1000–2000) and let early_stopping_rounds=50 find the optimal number. Then use that number as a fixed parameter in all tuning runs.
2. Start with learning_rate=0.05. After all other tuning is done, halve it to 0.025 and double n_estimators; this usually gives another 0.5–1% accuracy boost at the cost of 2× training time.
3. max_depth and min_child_weight are the most impactful tree parameters. Tune them together first. For most tabular problems, max_depth=4–6 is optimal; higher depth rarely helps and always risks overfitting.
4. Set subsample and colsample_bytree to 0.6–0.9. This stochastic component reduces correlation between trees (like Random Forest's bagging) and is one of XGBoost's most powerful anti-overfitting mechanisms.
5. For imbalanced data, set scale_pos_weight = count(negative) / count(positive) and use eval_metric='aucpr' (area under the precision-recall curve) instead of AUC-ROC; it's more sensitive to rare-class performance.
6. For interpretation, prefer SHAP values computed with TreeExplainer over the built-in importance scores.
XGBoost vs LightGBM vs CatBoost – Quick Decision Guide
| Property | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Split strategy | Level-wise (BFS) | Leaf-wise (depth-first) | Symmetric tree (oblivious) |
| Training speed | Good | Fastest | Slower on dense data |
| Memory usage | High | Low (histogram) | Medium |
| Categorical features | Needs encoding | Basic native support | Excellent native support |
| Missing values | Native | Native | Native |
| Small datasets (<10k) | Best | OK (overfit risk) | Very good |
| Large datasets (>1M) | Slower | Fastest | Medium |
| Tune sensitivity | Medium | High (num_leaves) | Low (robust defaults) |
| Kaggle dominance era | 2014–2017 | 2017–2020 | 2019–present (tabular) |
Start with XGBoost for battle-tested reliability and the most documentation. Switch to LightGBM when training is too slow (>1M rows). Switch to CatBoost when you have many categorical columns and don't want to encode them. In competitions, ensemble all three for maximum performance.
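As a closing sketch, a simple probability-averaging blend of the three libraries, reusing the train/test split from earlier (assumes lightgbm and catboost are installed; the hyperparameters here are illustrative placeholders, not tuned values):

import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

models = [
    xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6, random_state=42),
    lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=63, random_state=42),
    CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=0, random_state=42),
]

probas = []
for m in models:
    m.fit(X_train, y_train)
    probas.append(m.predict_proba(X_test)[:, 1])

blend = sum(probas) / len(probas)   # unweighted average of predicted probabilities
print(f"Blended AUC: {roc_auc_score(y_test, blend):.5f}")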