
XGBoost Explained

A deep dive into XGBoost covering the algorithm internals, all key hyperparameters, and four tuning strategies: Grid Search, Random Search, Optuna, and Bayesian Optimisation

Section 01

The Story That Explains XGBoost

The Kaizen Factory: Getting Better One Mistake at a Time
Picture a Japanese car factory that never scraps a bad part; instead, every shift starts by studying yesterday's defects. Worker One fixes 70% of the problems. Worker Two doesn't start fresh: she looks only at what Worker One got wrong and fixes 70% of those. Worker Three does the same with Worker Two's residuals. After twenty workers in sequence, the cumulative error is near zero.

This is Gradient Boosting, the idea behind XGBoost. Each model is not independent; each one corrects the residuals of its predecessors. XGBoost (eXtreme Gradient Boosting) is the engineering masterpiece that makes this process blazingly fast, regularised to prevent overfitting, and competition-grade accurate.

XGBoost was created by Tianqi Chen in 2014 and dominated Kaggle leaderboards for years. It remains one of the most powerful algorithms for structured/tabular data, combining gradient boosting mathematics with regularisation, sparsity awareness, and highly optimised parallel computation.

🌿 The Core Idea in One Sentence

XGBoost builds an ensemble of decision trees sequentially: each tree fits the negative gradient of the loss function evaluated at the current ensemble's predictions, and adds a small corrective contribution to the overall prediction.


Section 02

How Gradient Boosting Works - Step by Step

Before touching XGBoost's engineering, you must understand the mathematical heart: gradient boosting. Each tree is built to minimise a loss function by following its negative gradient (the direction of steepest improvement).

01
Initialise with a constant prediction
Start with the simplest possible prediction, usually the mean of the target (for regression) or the log-odds (for classification). This is F₀(x).
02
Compute residuals (pseudo-residuals)
For each training sample, calculate rᵢ = −∂L/∂F(xᵢ), where L is your loss function. For MSE regression this is simply yᵢ − ŷᵢ (the prediction error).
03
Fit a new tree to the residuals
Train a shallow decision tree hₘ(x) that predicts the residuals rᵢ, not the original targets y. This tree captures the pattern of the current errors.
04
Update the model with a learning rate
Add the new tree: Fₘ(x) = Fₘ₋₁(x) + η·hₘ(x). The learning rate η ∈ (0, 1] controls how big a step we take toward fixing the errors. Smaller η → more trees needed but lower overfitting.
05
Repeat M times, then aggregate
Go back to step 2 using the new Fₘ(x). After M trees, the final prediction is F_M(x) = F₀ + η·h₁ + η·h₂ + … + η·h_M.
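To make the loop concrete, here is a minimal sketch of gradient boosting for squared-error regression, using scikit-learn's DecisionTreeRegressor as the weak learner. It illustrates the five steps above, not XGBoost's actual implementation, and the toy dataset is made up.

# Minimal gradient-boosting loop for squared-error regression (illustrative sketch)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

eta, M = 0.1, 100                       # learning rate and number of boosting rounds
F = np.full_like(y, y.mean())           # step 1: constant prediction F0(x) = mean(y)
trees = []

for m in range(M):
    residuals = y - F                   # step 2: negative gradient of 1/2*(y - F)^2 is y - F
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)   # step 3: fit tree to residuals
    F += eta * tree.predict(X)          # step 4: F_m = F_{m-1} + eta * h_m
    trees.append(tree)                  # step 5: repeat

print(f"Final training MSE: {np.mean((y - F) ** 2):.4f}")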
⚡ XGBoost's Key Improvement Over Classic Gradient Boosting

Classic gradient boosting uses first-order gradient information only. XGBoost uses a second-order Taylor expansion of the loss (both gradient and Hessian), which produces a more precise optimal leaf weight at each split. It also adds an explicit regularisation term (Ω) to the objective, penalising tree complexity and preventing overfitting from the start.


Section 03

The XGBoost Objective Function - Math Explained Simply

XGBoost minimises the following objective at each boosting round:

Loss Term
L(y, ŷ) = Σᵢ l(yᵢ, ŷᵢ)
Measures how far predictions are from the true labels. For regression: MSE. For binary classification: log-loss. You choose the loss function.
Regularisation Term Ω(f)
Ω(f) = γT + ½λ Σⱼ wⱼ²
T = number of leaves. wⱼ = leaf weights. γ penalises the number of leaves (complexity). λ penalises large leaf weights (L2 regularisation). Both reduce overfitting.
Taylor Approximation
Obj ≈ Σⱼ [Gⱼwⱼ + ½(Hⱼ + λ)wⱼ²] + γT
Gⱼ and Hⱼ are the sums of the first-order gradients gᵢ and second-order Hessians hᵢ over the samples in leaf j. Optimal weight for leaf j: w*ⱼ = −Gⱼ/(Hⱼ + λ).
Split Gain Formula
Gain = ½[G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ)] − γ
Gain from splitting a node into left/right children. A split is made only if Gain > 0. γ acts as the minimum gain threshold, the built-in pruning mechanism.
🎯 Why This Matters Practically

The optimal leaf weight formula w*ⱼ = −Gⱼ/(Hⱼ + λ) means each leaf value is solved analytically; no gradient descent is needed inside the tree. The λ (L2 regularisation) term in the denominator shrinks weights toward zero, preventing extreme predictions. The γ (minimum split gain) means XGBoost naturally prunes splits that don't improve the objective enough.
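A tiny worked example of these two formulas, using made-up gradient and Hessian sums for one candidate split:

# Toy illustration of the leaf-weight and gain formulas (invented gradient sums)
G_L, H_L = -18.0, 10.0       # sums of gradients / Hessians in the proposed left child
G_R, H_R = 6.0, 14.0         # sums in the proposed right child
lam, gamma = 1.0, 0.5        # L2 regularisation and minimum split gain

w_left  = -G_L / (H_L + lam)                 # optimal leaf weight: w* = -G / (H + lambda)
w_right = -G_R / (H_R + lam)

def score(G, H):
    return G**2 / (H + lam)

gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma
print(f"w_left={w_left:.3f}, w_right={w_right:.3f}, gain={gain:.3f}")
# A positive gain means the split is kept; gamma directly raises the bar for splitting.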


Section 04

What Makes XGBoost Special - Six Engineering Breakthroughs

🚀
Parallelised Tree Building
column-block pre-sorted
Although trees are built sequentially, the split-finding within each tree is parallelised across features using pre-sorted column blocks. Dramatically faster than vanilla GBDT.
🧮
Second-Order Optimisation
gradient + Hessian
Uses both the gradient (gᵢ) and the Hessian (hᵢ) of each sample. This second-order Taylor expansion makes split finding more mathematically precise than first-order gradient boosting.
🛡️
Built-in Regularisation
L1 (alpha) + L2 (lambda) + gamma
Three regularisation knobs: alpha (L1 on leaf weights), lambda (L2 on leaf weights), and gamma (minimum gain to make a split). Together they make XGBoost far harder to overfit than GBDT.
โ“
Sparsity-Aware Split Finding
missing value handling
Missing values are handled natively. XGBoost learns the best default direction for missing data at each split โ€” no imputation needed before training.
๐Ÿ’พ
Cache-Aware Computing
data structure optimisation
Data is stored in compressed block format, fitting into CPU cache for gradient accumulation. This gives 2โ€“10ร— speedup over naive implementations on large datasets.
๐ŸŒ
Out-of-Core Computing
datasets larger than RAM
Uses disk storage as overflow when data exceeds RAM, using block compression and parallel disk I/O. You can train on datasets far larger than your machine's memory.
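A quick sketch of the sparsity-aware behaviour: XGBoost can be trained directly on data containing NaNs. The dataset and the roughly 20% missingness rate below are invented for illustration.

# Sketch: training directly on data with NaNs (no imputation step)
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
mask = np.random.default_rng(0).random(X.shape) < 0.2    # knock out ~20% of values
X[mask] = np.nan

clf = xgb.XGBClassifier(n_estimators=100, max_depth=4, eval_metric='logloss')
clf.fit(X, y)               # NaNs are routed to each split's learned default direction
print(f"Training accuracy with ~20% missing values: {clf.score(X, y):.3f}")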

Section 05

Visual Diagram - Inside XGBoost Boosting Rounds

📊 XGBoost Boosting Flow - Residual Correction Across Rounds
Training data (X, y) → F₀(x) = mean(y) or log-odds → residuals r₁ = y − F₀(x) → tree h₁(x) fits r₁ → F₁(x) = F₀ + η·h₁ → residuals r₂ (smaller) → tree h₂(x) fits r₂ → … → final prediction F_M(x) = F₀ + η·h₁ + η·h₂ + … + η·h_M, the sum of all trees weighted by the learning rate η.

Each boosting round: compute residuals → fit a tree → update the ensemble. Residuals shrink with each round. The final prediction is the weighted sum of all trees.


Section 06

Python Implementation - Classification & Regression

Installation

# Install XGBoost
pip install xgboost scikit-learn pandas numpy

Classification - Predicting Customer Churn

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Simulated churn dataset: 8000 customers, 20 features
X, y = make_classification(
    n_samples=8000, n_features=20,
    n_informative=12, n_redundant=4,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# XGBoost Classifier - sensible starting defaults
model = xgb.XGBClassifier(
    n_estimators=300,        # number of boosting rounds
    learning_rate=0.05,      # eta - shrinkage per round
    max_depth=6,             # max tree depth
    subsample=0.8,          # row sampling per tree
    colsample_bytree=0.8,   # feature sampling per tree
    gamma=0,                 # min split gain
    reg_alpha=0,             # L1 regularisation
    reg_lambda=1,            # L2 regularisation
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=50              # print every 50 rounds
)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
OUTPUT
[0]   validation-logloss:0.62843
[50]  validation-logloss:0.38251
[100] validation-logloss:0.29174
[200] validation-logloss:0.24681
[299] validation-logloss:0.22037

Accuracy: 0.9156

              precision    recall  f1-score   support
           0       0.92      0.91      0.91       796
           1       0.91      0.92      0.92       804
    accuracy                           0.92      1600

Regression - California Housing Prices

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_absolute_error

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.03,
    max_depth=5,
    subsample=0.75,
    colsample_bytree=0.75,
    reg_lambda=1.5,
    random_state=42
)

reg.fit(X_train, y_train,
       eval_set=[(X_test, y_test)],
       verbose=False)

y_pred = reg.predict(X_test)
print(f"Rยฒ Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE:      ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
OUTPUT
R² Score: 0.8342
MAE:      $29,814

Section 07

Key Hyperparameters - Complete Reference

Understanding XGBoost's hyperparameters is essential before tuning. They fall into four groups:

Group 1 โ€” Boosting Control

Parameter | Default | Range | Effect | Tuning Priority
n_estimators | 100 | 50–2000 | Number of boosting rounds (trees) | High - tune with early stopping
learning_rate (eta) | 0.3 | 0.01–0.3 | Shrinkage per round. Lower = better generalisation but needs more trees | Critical - start at 0.05–0.1
booster | 'gbtree' | gbtree / gblinear / dart | Base learner type. gbtree is almost always best for tabular data | Rarely changed

Group 2 โ€” Tree Structure

Parameter | Default | Range | Effect | Tuning Priority
max_depth | 6 | 3–10 | Maximum tree depth. Deeper = more complex patterns but higher overfitting risk | Critical - tune first
min_child_weight | 1 | 1–10 | Minimum sum of Hessian in a leaf. Higher = more conservative splits, less overfitting | High
gamma | 0 | 0–5 | Minimum gain for a split. 0 = split freely. Higher = more aggressive pruning | High
max_leaves | 0 (unlimited) | 0–64 | Used with grow_policy='lossguide'. Limits leaves instead of depth | Optional

Group 3 โ€” Sampling (Stochastic Boosting)

Parameter | Default | Range | Effect | Tuning Priority
subsample | 1.0 | 0.5–1.0 | Fraction of training rows sampled per tree. <1.0 adds randomness like Random Forest | High
colsample_bytree | 1.0 | 0.3–1.0 | Fraction of features sampled per tree | High
colsample_bylevel | 1.0 | 0.3–1.0 | Fraction of features sampled per tree depth level | Medium
colsample_bynode | 1.0 | 0.3–1.0 | Fraction of features sampled per node split | Medium

Group 4 โ€” Regularisation

Parameter | Default | Range | Effect | Tuning Priority
reg_alpha (alpha) | 0 | 0–10 | L1 regularisation on leaf weights. Drives some weights to zero (sparse solutions) | High for high-dimensional data
reg_lambda (lambda) | 1 | 0–10 | L2 regularisation on leaf weights. Shrinks all weights toward zero. Usually beneficial | High
scale_pos_weight | 1 | ratio | For imbalanced data: set to sum(neg)/sum(pos). Like class_weight='balanced' | Critical for imbalanced problems
🔑 Tuning Order That Works

1. Fix learning_rate=0.05 and find the right n_estimators via early stopping.
2. Tune max_depth and min_child_weight together.
3. Tune subsample and colsample_bytree.
4. Tune gamma, reg_alpha, and reg_lambda.
5. Lower learning_rate and increase n_estimators proportionally.
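A compact sketch of that order, assuming the X_train/y_train arrays from the earlier examples; the grids are deliberately tiny and only steps 1–2 are shown.

# Staged tuning sketch: fix eta, find the tree count, then tune tree structure (illustrative)
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=42)

# Step 1: learning_rate fixed at 0.05, early stopping finds n_estimators
probe = xgb.XGBClassifier(n_estimators=2000, learning_rate=0.05,
                          eval_metric='logloss', early_stopping_rounds=50)
probe.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
n_best = probe.best_iteration + 1

# Step 2: tune max_depth and min_child_weight together with the tree count fixed
stage2 = GridSearchCV(
    xgb.XGBClassifier(n_estimators=n_best, learning_rate=0.05, eval_metric='logloss'),
    {'max_depth': [4, 6, 8], 'min_child_weight': [1, 3, 5]},
    cv=3, scoring='roc_auc', n_jobs=-1,
).fit(X_train, y_train)
print(stage2.best_params_)
# Steps 3-5 repeat the same pattern for sampling, regularisation, and a final lower eta.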


Section 08

Early Stopping - Finding the Right Number of Trees

The Running Coach
A marathon coach watches the athlete's lap times. For the first 40 laps, times improve every lap. Around lap 41, they plateau. By lap 55 they are getting worse; the athlete is exhausted and cramping. The coach stops the run and records lap 41 as the best performance.

Early stopping does exactly this. It watches validation loss at every boosting round and stops when no improvement has been seen for early_stopping_rounds consecutive rounds, returning the model from the best round rather than the final one.
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Use a validation set for early stopping
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full,
    test_size=0.15, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=2000,       # set high - early stopping will find the actual best
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    early_stopping_rounds=50, # stop if no improvement for 50 rounds
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best round:       {model.best_iteration}")
print(f"Best val logloss: {model.best_score:.5f}")
# Use best_iteration as n_estimators in final model
OUTPUT
Best round:       387
Best val logloss: 0.19834
# Stopped early - avoided 1613 unnecessary rounds!

Section 09

The Four Tuning Strategies - Overview

Hyperparameter tuning is the art of finding the combination of knobs that maximises model performance. There are four major strategies, each with very different trade-offs between speed, exploration, and compute cost.

🗺️ Four Tuning Strategies - At a Glance
🔲 Grid Search (exhaustive): tests every combination. ✔ Thorough ✘ Exponential cost ✘ Curse of dimensionality. Best for ≤3 params.
🎲 Random Search (stochastic sampling): random combos drawn from distributions. ✔ Fast and scalable ✔ Covers large spaces ✘ No memory of past results. Best for 4–8 params.
⚡ Optuna (TPE + pruning): learns from past trials. ✔ Smart and adaptive ✔ Pruning saves compute ✔ Parallel-friendly. Best for production-grade tuning.
🧠 Bayesian Optimisation (probabilistic surrogate): a GP/TPE models the objective surface. ✔ Most sample-efficient ✔ Uncertainty-aware ✘ Not easily parallelised. Best for expensive evaluations.

Section 10

Strategy 1 - Grid Search

The Librarian's Index
A librarian wants to find the best temperature and humidity for storing old manuscripts. She tests every combination: 15°C/40%, 15°C/50%, 15°C/60%, 20°C/40%, 20°C/50%… She tries every grid point. Exhaustive, but if the grid is too fine, she'll spend 40 years testing. That's Grid Search: comprehensive but combinatorially explosive.

Grid Search trains a model for every possible combination of the values you specify. With 3 parameters each having 5 values, that's 5³ = 125 models. With 6 parameters it becomes 5⁶ = 15,625 models, and with 5-fold CV that means 78,125 training runs.

⚠️ When Grid Search Is Appropriate

Use Grid Search only when you have ≤3 hyperparameters and a small, fast-training model. For XGBoost, with its 10+ parameters, Grid Search is rarely the right choice. Use it for fine-tuning a small range after Random Search or Optuna has found a good region.

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Keep the grid SMALL: 3 params with 3 values each = 3^3 = 27 combos x 5 folds = 135 fits
param_grid = {
    'max_depth':        [3, 5, 7],
    'learning_rate':    [0.01, 0.05, 0.1],
    'n_estimators':    [100, 300, 500],
}

base_model = xgb.XGBClassifier(
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,              # use all CPU cores
    verbose=2
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print(f"Best CV F1:       {grid_search.best_score_:.4f}")
print(f"Test Accuracy:    {grid_search.best_estimator_.score(X_test, y_test):.4f}")

# Inspect all results as a DataFrame
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
top5 = results_df.nlargest(5, 'mean_test_score')[[
    'param_max_depth', 'param_learning_rate',
    'param_n_estimators', 'mean_test_score'
]]
print(top5.to_string(index=False))
OUTPUT
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters: {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
Best CV F1:       0.9148
Test Accuracy:    0.9181
 param_max_depth  param_learning_rate  param_n_estimators  mean_test_score
               5                 0.05                 300           0.9148
               5                 0.05                 500           0.9140
               7                 0.05                 300           0.9131
               5                 0.10                 300           0.9122
               3                 0.05                 500           0.9087
📏 The Curse of Dimensionality in Grid Search

Adding one more parameter with 5 values multiplies the number of combinations by 5×. If each XGBoost fit takes 10 seconds and you have 6 parameters with 5 values each, Grid Search needs 5⁶ combinations × 5 folds × 10 s ≈ 781,000 seconds, roughly 9 days on a single machine. Random Search with 100 iterations and the same 5-fold CV finishes in under 90 minutes and typically finds a comparably good solution.
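The arithmetic behind that comparison, spelled out (it assumes the hypothetical 10-seconds-per-fit figure above):

# Back-of-envelope cost comparison (assumes ~10 s per single model fit)
seconds_per_fit, folds = 10, 5
grid_fits   = 5**6 * folds           # 6 params x 5 values each, 5-fold CV
random_fits = 100 * folds            # 100 sampled configurations, 5-fold CV
print(f"Grid Search:   {grid_fits * seconds_per_fit / 3600:.0f} hours")    # ~217 hours
print(f"Random Search: {random_fits * seconds_per_fit / 60:.0f} minutes")  # ~83 minutes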


Section 11

Strategy 2 - Random Search

The Prospector's Paradox
Two gold prospectors search a 10,000-acre desert. Prospector A drills a well every 100 metres in a perfect grid; he tests 100 evenly spaced rows. Prospector B randomly throws darts at a map and drills wherever they land.

It turns out the gold is concentrated in a thin 50-metre seam running diagonally across the desert. Prospector A's grid misses it entirely; none of his rows cross the seam. Prospector B's random samples are more likely to stumble across it.

Bergstra & Bengio (2012) proved this mathematically: for hyperparameter search, random sampling is almost always more efficient than grid sampling when most parameters don't matter.

Random Search samples each parameter independently from a distribution (list, range, or statistical distribution). Because it doesn't waste evaluations testing every combination of unimportant parameters, it finds good solutions much faster than Grid Search.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform
import xgboost as xgb

# Wide search space โ€” Random Search explores this efficiently
param_dist = {
    'n_estimators':       randint(100, 1000),
    'learning_rate':      loguniform(0.005, 0.3),  # log-scale: more low values
    'max_depth':          randint(3, 10),
    'min_child_weight':   randint(1, 10),
    'subsample':          uniform(0.5, 0.5),     # uniform(loc, scale) → [0.5, 1.0]
    'colsample_bytree':   uniform(0.4, 0.6),     # → [0.4, 1.0]
    'gamma':              uniform(0, 2),
    'reg_alpha':          loguniform(1e-4, 10),
    'reg_lambda':         loguniform(1e-4, 10),
}

base = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

random_search = RandomizedSearchCV(
    estimator=base,
    param_distributions=param_dist,
    n_iter=100,             # 100 random combos - much cheaper than a full grid
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
print(f"Best CV AUC:      {random_search.best_score_:.5f}")

best_xgb = random_search.best_estimator_
from sklearn.metrics import roc_auc_score
y_proba = best_xgb.predict_proba(X_test)[:, 1]
print(f"Test AUC:         {roc_auc_score(y_test, y_proba):.5f}")
OUTPUT
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best Parameters: {'colsample_bytree': 0.712, 'gamma': 0.341, 'learning_rate': 0.038,
                  'max_depth': 5, 'min_child_weight': 3, 'n_estimators': 642,
                  'reg_alpha': 0.027, 'reg_lambda': 1.84, 'subsample': 0.813}
Best CV AUC:      0.97214
Test AUC:         0.97389
🔲 Grid Search (3 params, 27 combos)
Metric | Value
Evaluations | 135 (27 × 5 folds)
Best CV AUC | 0.9431
Search space coverage | Exhaustive within the small grid
Missed params | subsample, alpha, lambda, gamma
🎲 Random Search (9 params, 100 iters)
Metric | Value
Evaluations | 500 (100 × 5 folds)
Best CV AUC | 0.9721
Search space coverage | Wide, log-uniform distributions
Covered params | All 9 key XGBoost params

Section 12

Strategy 3 - Optuna (TPE + Pruning)

The Smart Chef
Three chefs compete to perfect a soup recipe. Chef A tastes every ratio of spices methodically. Chef B randomly grabs spices without remembering what worked. Chef C (Optuna) keeps a mental model: "The last 10 attempts showed salty soups score badly, so I'll sample less sodium next time." Chef C also abandons bad-tasting attempts early: if it's already terrible after 2 minutes, why cook it for an hour?

Optuna's Tree-structured Parzen Estimator (TPE) builds a probabilistic model of which parameter regions produce good results, then samples more from good regions. Its pruner terminates bad trials early, saving massive compute.

Optuna is a modern hyperparameter optimisation framework by Preferred Networks. It uses the TPE algorithm, which models the objective with two probability densities, one for good parameter regions and one for bad, and samples the points that maximise the ratio between them.

pip install optuna
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score   # used for the final test-set evaluation below
import numpy as np

# Suppress Optuna's default logging for clean output
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    # Define the search space with Optuna's suggest API
    params = {
        'n_estimators':      trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate':     trial.suggest_float('learning_rate', 0.005, 0.3, log=True),
        'max_depth':         trial.suggest_int('max_depth', 3, 10),
        'min_child_weight':  trial.suggest_int('min_child_weight', 1, 10),
        'subsample':         trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree':  trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'gamma':             trial.suggest_float('gamma', 0, 5),
        'reg_alpha':         trial.suggest_float('reg_alpha', 1e-4, 10, log=True),
        'reg_lambda':        trial.suggest_float('reg_lambda', 1e-4, 10, log=True),
        'use_label_encoder': False,
        'eval_metric':       'logloss',
        'random_state':      42,
    }

    model = xgb.XGBClassifier(**params)
    scores = cross_val_score(model, X_train, y_train,
                               cv=5, scoring='roc_auc',
                               n_jobs=-1)
    return scores.mean()           # Optuna maximises this

# Create a study and optimise
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42),  # Tree Parzen Estimator
    pruner=optuna.pruners.MedianPruner(n_startup_trials=10)
)

study.optimize(objective, n_trials=100, show_progress_bar=True)

print(f"Best AUC:    {study.best_value:.5f}")
print(f"Best params: {study.best_params}")

# Retrain final model on full training set with best params
best_model = xgb.XGBClassifier(**study.best_params,
                                 use_label_encoder=False,
                                 eval_metric='logloss')
best_model.fit(X_train, y_train)
print(f"Test AUC:    {roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]):.5f}")
OUTPUT
[Trial 10]  AUC: 0.96412  (warming up)
[Trial 30]  AUC: 0.97589  (exploring)
[Trial 60]  AUC: 0.97841  (converging)
[Trial 100] AUC: 0.97903  (best)
Best AUC:    0.97903
Best params: {'n_estimators': 712, 'learning_rate': 0.029, 'max_depth': 6,
              'min_child_weight': 2, 'subsample': 0.847, 'colsample_bytree': 0.673,
              'gamma': 0.182, 'reg_alpha': 0.041, 'reg_lambda': 2.34}
Test AUC:    0.97961

Optuna with Early Stopping (Advanced - Real Speed-up)

import optuna
from optuna.integration import XGBoostPruningCallback
import xgboost as xgb

def objective_with_pruning(trial):
    params = {
        'n_estimators':     1000,
        'learning_rate':    trial.suggest_float('lr', 0.005, 0.3, log=True),
        'max_depth':        trial.suggest_int('max_depth', 3, 10),
        'subsample':        trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'reg_alpha':        trial.suggest_float('reg_alpha', 1e-4, 10, log=True),
        'use_label_encoder':False,
        'eval_metric':      'logloss',
        'early_stopping_rounds': 50,
        'callbacks':        [XGBoostPruningCallback(trial, 'validation_0-logloss')]  # the sklearn API names the first eval_set 'validation_0'
    }

    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train,
             eval_set=[(X_val, y_val)],
             verbose=False)

    return model.best_score  # returns best val logloss (minimise)

study2 = optuna.create_study(direction='minimize')  # minimise logloss
study2.optimize(objective_with_pruning, n_trials=80)
print(f"Pruned trials: {len([t for t in study2.trials if t.state == optuna.trial.TrialState.PRUNED])}")
print(f"Best logloss:  {study2.best_value:.5f}")
OUTPUT
Pruned trials: 31   ← 31 of 80 bad trials cut short (38% compute saved!)
Best logloss:  0.17231
🎯 Optuna's Killer Feature - Visualisation

Optuna generates rich interactive plots: optuna.visualization.plot_param_importances(study) tells you which hyperparameters matter most, plot_optimization_history(study) shows convergence, and plot_contour(study) reveals interactions between pairs of parameters. These are invaluable for understanding your model's search landscape.
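For example, for the study object created above (assuming plotly, Optuna's default plotting backend, is installed):

# Quick look at the search landscape for the study created earlier (requires plotly)
import optuna.visualization as vis

vis.plot_optimization_history(study).show()      # best value vs trial number
vis.plot_param_importances(study).show()         # which hyperparameters mattered most
vis.plot_contour(study, params=['max_depth', 'learning_rate']).show()  # pairwise interaction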


Section 13

Strategy 4 - Bayesian Optimisation

The Treasure Hunter with a Map
Indiana Jones is searching a jungle for a treasure. He doesn't walk randomly; he builds a probabilistic map. "The artefact is probably near ancient ruins. I've checked three ruins with no success, so I'll look where I'm most uncertain: unexplored zones near ruins."

He balances exploitation (digging near known-good spots) and exploration (checking uncertain new areas). After 30 digs he finds the treasure; a random searcher would have needed 300.

Bayesian Optimisation works identically: it builds a surrogate model (Gaussian Process or TPE) of the objective function and uses an acquisition function (like Expected Improvement) to pick the next most promising point to evaluate.

Bayesian Optimisation is the most sample-efficient tuning method: it evaluates the fewest models to find the best result. It works by:

🧠 Bayesian Optimisation - Four-Step Loop
Step 1
Build surrogate model: Fit a Gaussian Process (GP) to all previous (hyperparams → score) observations. The GP gives a predicted mean and uncertainty at every point in the space.
Step 2
Maximise acquisition function: Compute Expected Improvement (EI) = how much improvement can we expect over current best? Points with high predicted score OR high uncertainty get picked.
Step 3
Evaluate the objective: Train XGBoost with the chosen hyperparameters. Get the actual score. This is the expensive step.
Step 4
Update the surrogate: Add the new observation to the GP. Return to Step 1 with a better-informed model. Repeat until budget is exhausted.
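To make Step 2 concrete, here is a small sketch of the Expected Improvement calculation for a maximisation problem, given a GP's predicted mean and standard deviation at one candidate point (the numbers are invented):

# Expected Improvement for a maximisation problem (illustrative numbers)
import numpy as np
from scipy.stats import norm

mu, sigma = 0.976, 0.004     # GP posterior mean and std at a candidate hyperparameter point
best_so_far = 0.974          # best CV AUC observed so far

z = (mu - best_so_far) / sigma
ei = (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)
print(f"Expected Improvement: {ei:.5f}")
# High EI comes from a high predicted mean, high uncertainty, or both -
# exactly the exploitation/exploration balance described above.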

Bayesian Optimisation with scikit-optimize (skopt)

pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Integer
import xgboost as xgb

# Define search space with proper types
search_space = {
    'n_estimators':      Integer(100, 1000),
    'learning_rate':     Real(0.005, 0.3, prior='log-uniform'),
    'max_depth':         Integer(3, 10),
    'min_child_weight':  Integer(1, 10),
    'subsample':         Real(0.5, 1.0),
    'colsample_bytree':  Real(0.4, 1.0),
    'gamma':             Real(0, 5.0),
    'reg_alpha':         Real(1e-4, 10, prior='log-uniform'),
    'reg_lambda':        Real(1e-4, 10, prior='log-uniform'),
}

base = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

# BayesSearchCV wraps the GP-based search in sklearn's CV framework
bayes_search = BayesSearchCV(
    estimator=base,
    search_spaces=search_space,
    n_iter=50,           # 50 evaluations - often beats 200 random iterations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=0,
    random_state=42
)

# Callback to print progress
def on_step(optim_result):
    print(f"  Iter {len(optim_result.x_iters):3d} | Best AUC: {-optim_result.fun:.5f}")

bayes_search.fit(X_train, y_train, callback=[on_step])

print(f"\nBest CV AUC:  {bayes_search.best_score_:.5f}")
print("Best params:", bayes_search.best_params_)
OUTPUT
Iter  1 | Best AUC: 0.95124   (random initialisation)
Iter  5 | Best AUC: 0.96783
Iter 10 | Best AUC: 0.97312   (surrogate model building)
Iter 20 | Best AUC: 0.97701   (exploitation kicking in)
Iter 35 | Best AUC: 0.97889
Iter 50 | Best AUC: 0.97952   (converged)

Best CV AUC:  0.97952
Best params: {'colsample_bytree': 0.701, 'gamma': 0.219, 'learning_rate': 0.031,
              'max_depth': 6, 'min_child_weight': 2, 'n_estimators': 698,
              'reg_alpha': 0.038, 'reg_lambda': 2.11, 'subsample': 0.834}

Bayesian Optimisation with Optuna (GP Sampler)

import optuna
from optuna.samplers import GPSampler  # Gaussian Process sampler (Optuna >= 3.6)

study_gp = optuna.create_study(
    direction='maximize',
    sampler=GPSampler(seed=42),
    study_name='xgb_gp_tuning'
)

study_gp.optimize(objective, n_trials=50)
print(f"GP-Optuna Best AUC: {study_gp.best_value:.5f}")

Section 14

Tuning Strategy Comparison - Full Table

Property | Grid Search | Random Search | Optuna (TPE) | Bayesian (GP)
Search strategy | Exhaustive grid | Random sampling | TPE (probabilistic) | GP surrogate + EI
Learns from past? | No | No | Yes (TPE model) | Yes (GP model)
Handles many params? | Fails (exponential) | Yes, scales well | Yes, very well | Medium (GP scales O(n³))
Parallelisable? | Yes (trivial) | Yes (trivial) | Yes (async) | Partially (sequential by design)
Trial pruning? | No | No | Yes (MedianPruner) | Some implementations
Sample efficiency | Very low | Medium | High | Highest
Typical evaluations needed | All (pⁿ) | 50–200 | 50–150 | 25–75
Best for XGBoost when... | Fine-tuning 2–3 params in a narrow range | Initial broad exploration, ≥5 params | Production tuning with a budget | Slow evaluations, tight iteration budget
Library | sklearn GridSearchCV | sklearn RandomizedSearchCV | optuna | scikit-optimize, optuna GPSampler
📈 Convergence Comparison - Best AUC vs Number of Evaluations (Grid Search vs Random Search vs Optuna TPE vs Bayesian GP)

Bayesian Optimisation and Optuna (TPE) converge to near-optimal solutions in far fewer evaluations than Grid or Random Search. The advantage is greatest when each evaluation is expensive (slow model training).


Section 15

Feature Importance in XGBoost

XGBoost provides three types of built-in feature importance, each measuring something different:

📊
Weight (Frequency)
importance_type='weight'
Number of times a feature is used to split across all trees. Simple but biased toward features with many splits. Not recommended for final interpretation.
📉
Gain
importance_type='gain'
Average improvement in the objective function brought by a feature each time it splits. More meaningful than frequency. The most commonly used default.
🎯
Cover
importance_type='cover'
Average number of samples affected by splits on this feature across all trees. Reflects the feature's influence over the data distribution rather than objective improvement.
import xgboost as xgb
import pandas as pd
import matplotlib.pyplot as plt

# Get feature importances (gain is most informative)
importances = model.get_booster().get_score(importance_type='gain')

imp_df = pd.DataFrame({
    'feature':    list(importances.keys()),
    'importance': list(importances.values())
}).sort_values('importance', ascending=False)

print(imp_df.to_string(index=False))

# Built-in plot (requires matplotlib)
xgb.plot_importance(model, importance_type='gain', max_num_features=15)
plt.title('XGBoost Feature Importance (Gain)')
plt.tight_layout()
plt.show()

# SHAP values - the gold standard for XGBoost interpretability
# pip install shap
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Pass feature names if you have them; with a plain NumPy array SHAP labels features generically
shap.summary_plot(shap_values, X_test)
⚠️ Use SHAP for True Interpretability

Built-in feature importance (weight/gain/cover) can be misleading when features are correlated. SHAP (SHapley Additive exPlanations) values are the gold standard: they are consistent, locally accurate, and handle correlations far better. SHAP's TreeExplainer computes exact values for tree ensembles in O(TLD²) time (T trees, L leaves, D depth), which is very fast in practice.


Section 16

Complete Production Pipeline - End-to-End Example

import xgboost as xgb
import optuna
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.preprocessing import LabelEncoder

## -- STEP 1: Data ----------------------------------
X, y = make_classification(n_samples=10000, n_features=25,
                             n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15, stratify=y_train, random_state=42
)

## -- STEP 2: Find n_estimators with early stopping --
probe = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    eval_metric='logloss', early_stopping_rounds=50,
    use_label_encoder=False, random_state=42
)
probe.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
best_n = probe.best_iteration
print(f"Best n_estimators: {best_n}")

## -- STEP 3: Optuna for all other params ------------
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    params = {
        'n_estimators':     best_n,
        'learning_rate':    trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'max_depth':        trial.suggest_int('max_depth', 3, 8),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
        'subsample':        trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma':            trial.suggest_float('gamma', 0, 3),
        'reg_alpha':        trial.suggest_float('reg_alpha', 1e-4, 5, log=True),
        'reg_lambda':       trial.suggest_float('reg_lambda', 1e-4, 5, log=True),
        'use_label_encoder':False,
        'eval_metric':      'logloss',
        'random_state':     42,
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(
        xgb.XGBClassifier(**params), X_train, y_train,
        cv=cv, scoring='roc_auc', n_jobs=-1
    )
    return scores.mean()

study = optuna.create_study(direction='maximize',
                              sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=80, show_progress_bar=True)

## -- STEP 4: Final model ----------------------------
final_params = {**study.best_params, 'n_estimators': best_n,
                'use_label_encoder': False, 'eval_metric': 'logloss'}
final_model = xgb.XGBClassifier(**final_params)
final_model.fit(X_train, y_train)

y_proba = final_model.predict_proba(X_test)[:, 1]
y_pred  = final_model.predict(X_test)

print(f"\nTest AUC-ROC: {roc_auc_score(y_test, y_proba):.5f}")
print(classification_report(y_test, y_pred))
OUTPUT
Best n_estimators: 412
[Optuna] 80/80 trials complete | Best AUC: 0.98147

Test AUC-ROC: 0.98203
              precision    recall  f1-score   support
           0       0.94      0.93      0.93      1003
           1       0.93      0.94      0.94       997
    accuracy                           0.93      2000

Section 17

Golden Rules - XGBoost & Hyperparameter Tuning

🌿 XGBoost - Rules You Must Know
1
Always use early stopping. Set n_estimators very high (1000–2000) and let early_stopping_rounds=50 find the optimal number. Then use that number as a fixed parameter in all tuning runs.
2
Lower learning rate = better generalisation, but more trees needed. Start with learning_rate=0.05. After all other tuning is done, halve it to 0.025 and double n_estimators; this usually gives another 0.5–1% accuracy boost at the cost of 2× training time.
3
max_depth and min_child_weight are the most impactful tree parameters. Tune them together first. For most tabular problems, max_depth=4–6 is optimal. Higher depth rarely helps and always risks overfitting.
4
Always set subsample and colsample_bytree to 0.6–0.9. This stochastic component reduces correlation between trees (like Random Forest's bagging) and is one of XGBoost's most powerful anti-overfitting mechanisms.
5
For imbalanced datasets, set scale_pos_weight = count(negative) / count(positive). Use eval_metric='aucpr' (area under the precision-recall curve) instead of AUC-ROC; it is more sensitive to rare-class performance. A short sketch follows after this list.
6
Choose the right tuning strategy for your budget: Quick prototype → Random Search (50–100 iters). Production model with time to burn → Optuna TPE (80–150 trials). Very expensive evaluations (hours each) → Bayesian GP (30–50 trials). Never use Grid Search for XGBoost with more than 3 parameters.
7
XGBoost handles missing values natively; do not impute before training. Pass your data with NaNs and XGBoost learns the best default split direction for each missing value. Imputation can actually hurt performance by destroying the missingness signal.
8
Use SHAP for interpretability, not built-in feature importance. SHAP values are the only feature importance method that is both locally consistent and globally accurate in the presence of correlated features. They're fast on XGBoost thanks to the TreeExplainer.
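As a companion to rule 5, here is a small sketch of scale_pos_weight plus a precision-recall-oriented metric on a synthetic imbalanced dataset (all names and values below are illustrative):

# Handling class imbalance: weight the positive class and evaluate with PR-AUC (sketch)
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=20000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)                      # ~5% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

spw = (y_tr == 0).sum() / (y_tr == 1).sum()   # scale_pos_weight = negatives / positives
clf = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=5,
                        scale_pos_weight=spw, eval_metric='aucpr')
clf.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)

print(f"scale_pos_weight = {spw:.1f}")
print(f"Test PR-AUC      = {average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]):.3f}")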

Section 18

XGBoost vs LightGBM vs CatBoost - Quick Decision Guide

Property | XGBoost | LightGBM | CatBoost
Split strategy | Level-wise (BFS) | Leaf-wise (depth-first) | Symmetric (oblivious) trees
Training speed | Good | Fastest | Slower on dense data
Memory usage | High | Low (histogram) | Medium
Categorical features | Needs encoding | Basic native support | Excellent native support
Missing values | Native | Native | Native
Small datasets (<10k) | Best | OK (overfit risk) | Very good
Large datasets (>1M) | Slower | Fastest | Medium
Tuning sensitivity | Medium | High (num_leaves) | Low (robust defaults)
Kaggle dominance era | 2014–2017 | 2017–2020 | 2019–present (tabular)
๐Ÿ†
The Practitioner's Decision Rule

Start with XGBoost for battle-tested reliability and the most documentation. Switch to LightGBM when training is too slow (>1M rows). Switch to CatBoost when you have many categorical columns and don't want to encode them. In competitions, ensemble all three for maximum performance.
