
XGBoost Explained

A deep dive into XGBoost covering the algorithm internals, all key hyperparameters, and four tuning strategies: Grid Search, Random Search, Optuna, and Bayesian Optimisation

Section 01

The Story That Explains XGBoost

The Kaizen Factory: Getting Better One Mistake at a Time
Picture a Japanese car factory that never scraps a bad part; instead, every shift starts by studying yesterday's defects. Worker One fixes 70% of the problems. Worker Two doesn't start fresh: she looks only at what Worker One got wrong and fixes 70% of those. Worker Three does the same with Worker Two's residuals. After twenty workers in sequence, the cumulative error is near zero.

This is Gradient Boosting, the idea behind XGBoost. Each model is not independent; each one corrects the residuals of its predecessors. XGBoost (eXtreme Gradient Boosting) is the engineering masterpiece that makes this process blazingly fast, regularised to prevent overfitting, and competition-grade accurate.

XGBoost was created by Tianqi Chen in 2014 and dominated Kaggle leaderboards for years. It remains one of the most powerful algorithms for structured/tabular data, combining gradient boosting mathematics with regularisation, sparsity awareness, and highly optimised parallel computation.

🌿 The Core Idea in One Sentence

XGBoost builds an ensemble of decision trees sequentially: each tree fits the negative gradient of the loss function evaluated at the current ensemble's predictions, and adds a small corrective contribution to the overall prediction.


Section 02

How Gradient Boosting Works - Step by Step

Before touching XGBoost's engineering, you must understand the mathematical heart: gradient boosting. Each tree is built to minimise a loss function by following its negative gradient (the direction of steepest improvement).

01
Initialise with a constant prediction
Start with the simplest possible prediction, usually the mean of the target (for regression) or the log-odds (for classification). This is F₀(x).
02
Compute residuals (pseudo-residuals)
For each training sample, calculate rᵢ = −∂L/∂F(xᵢ), where L is your loss function. For MSE regression this is simply yᵢ − ŷᵢ (the prediction error).
03
Fit a new tree to the residuals
Train a shallow decision tree hₘ(x) that predicts the residuals rᵢ, not the original targets y. This tree captures the pattern of the current errors.
04
Update the model with a learning rate
Add the new tree: Fₘ(x) = Fₘ₋₁(x) + η·hₘ(x). The learning rate η ∈ (0, 1] controls how big a step we take toward fixing the errors. Smaller η → more trees needed but lower overfitting.
05
Repeat M times, then aggregate
Go back to step 2 using the new Fₘ(x). After M trees, the final prediction is F_M(x) = F₀ + η·h₁ + η·h₂ + … + η·h_M.
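To make the loop concrete, here is a minimal sketch of gradient boosting for squared-error regression, using scikit-learn's DecisionTreeRegressor as the weak learner. It illustrates the five steps above, not XGBoost's actual implementation, and the toy dataset is made up.

# Minimal gradient-boosting loop for squared-error regression (illustrative sketch)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

eta, M = 0.1, 100                       # learning rate and number of boosting rounds
F = np.full_like(y, y.mean())           # step 1: constant prediction F0(x) = mean(y)
trees = []

for m in range(M):
    residuals = y - F                   # step 2: negative gradient of 1/2*(y - F)^2 is y - F
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)   # step 3: fit tree to residuals
    F += eta * tree.predict(X)          # step 4: F_m = F_{m-1} + eta * h_m
    trees.append(tree)                  # step 5: repeat

print(f"Final training MSE: {np.mean((y - F) ** 2):.4f}")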
⚡ XGBoost's Key Improvement Over Classic Gradient Boosting

Classic gradient boosting uses first-order gradient information only. XGBoost uses a second-order Taylor expansion of the loss (both gradient and Hessian), which produces a more precise optimal leaf weight at each split. It also adds an explicit regularisation term (Ω) to the objective, penalising tree complexity and preventing overfitting from the start.


Section 03

The XGBoost Objective Function - Math Explained Simply

XGBoost minimises the following objective at each boosting round:

Loss Term
L(y, ŷ) = Σᵢ l(yᵢ, ŷᵢ)
Measures how far predictions are from the true labels. For regression: MSE. For binary classification: log-loss. You choose the loss function.
Regularisation Term Ω(f)
Ω(f) = γT + ½λ Σⱼ wⱼ²
T = number of leaves. wⱼ = leaf weights. γ penalises the number of leaves (complexity). λ penalises large leaf weights (L2 regularisation). Both reduce overfitting.
Taylor Approximation
Obj ≈ Σⱼ [Gⱼwⱼ + ½(Hⱼ + λ)wⱼ²] + γT
Gⱼ and Hⱼ are the sums of the first-order gradients gᵢ and second-order Hessians hᵢ over the samples in leaf j. Optimal weight for leaf j: w*ⱼ = −Gⱼ/(Hⱼ + λ).
Split Gain Formula
Gain = ½[G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ)] − γ
Gain from splitting a node into left/right children. A split is made only if Gain > 0. γ acts as the minimum gain threshold, the built-in pruning mechanism.
🎯 Why This Matters Practically

The optimal leaf weight formula w*ⱼ = −Gⱼ/(Hⱼ + λ) means each leaf value is solved analytically; no gradient descent is needed inside the tree. The λ (L2 regularisation) term in the denominator shrinks weights toward zero, preventing extreme predictions. The γ (minimum split gain) means XGBoost naturally prunes splits that don't improve the objective enough.
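A tiny worked example of these two formulas, using made-up gradient and Hessian sums for one candidate split:

# Toy illustration of the leaf-weight and gain formulas (invented gradient sums)
G_L, H_L = -18.0, 10.0       # sums of gradients / Hessians in the proposed left child
G_R, H_R = 6.0, 14.0         # sums in the proposed right child
lam, gamma = 1.0, 0.5        # L2 regularisation and minimum split gain

w_left  = -G_L / (H_L + lam)                 # optimal leaf weight: w* = -G / (H + lambda)
w_right = -G_R / (H_R + lam)

def score(G, H):
    return G**2 / (H + lam)

gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma
print(f"w_left={w_left:.3f}, w_right={w_right:.3f}, gain={gain:.3f}")
# A positive gain means the split is kept; gamma directly raises the bar for splitting.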


Section 04

What Makes XGBoost Special - Six Engineering Breakthroughs

🚀
Parallelised Tree Building
column-block pre-sorted
Although trees are built sequentially, the split-finding within each tree is parallelised across features using pre-sorted column blocks. Dramatically faster than vanilla GBDT.
🧮
Second-Order Optimisation
gradient + Hessian
Uses both the gradient (gᵢ) and the Hessian (hᵢ) of each sample. This second-order Taylor expansion makes split finding more mathematically precise than first-order gradient boosting.
🛡️
Built-in Regularisation
L1 (alpha) + L2 (lambda) + gamma
Three regularisation knobs: alpha (L1 on leaf weights), lambda (L2 on leaf weights), and gamma (minimum gain to make a split). Together they make XGBoost far harder to overfit than GBDT.
โ“
Sparsity-Aware Split Finding
missing value handling
Missing values are handled natively. XGBoost learns the best default direction for missing data at each split โ€” no imputation needed before training.
๐Ÿ’พ
Cache-Aware Computing
data structure optimisation
Data is stored in compressed block format, fitting into CPU cache for gradient accumulation. This gives 2โ€“10ร— speedup over naive implementations on large datasets.
๐ŸŒ
Out-of-Core Computing
datasets larger than RAM
Uses disk storage as overflow when data exceeds RAM, using block compression and parallel disk I/O. You can train on datasets far larger than your machine's memory.
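A quick sketch of the sparsity-aware behaviour: XGBoost can be trained directly on data containing NaNs. The dataset and the roughly 20% missingness rate below are invented for illustration.

# Sketch: training directly on data with NaNs (no imputation step)
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
mask = np.random.default_rng(0).random(X.shape) < 0.2    # knock out ~20% of values
X[mask] = np.nan

clf = xgb.XGBClassifier(n_estimators=100, max_depth=4, eval_metric='logloss')
clf.fit(X, y)               # NaNs are routed to each split's learned default direction
print(f"Training accuracy with ~20% missing values: {clf.score(X, y):.3f}")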

Section 05

Visual Diagram - Inside XGBoost Boosting Rounds

📊 XGBoost Boosting Flow - Residual Correction Across Rounds
Training data (X, y) → F₀(x) = mean(y) or log-odds → residuals r₁ = y − F₀(x) → tree h₁(x) fits r₁ → F₁(x) = F₀ + η·h₁ → residuals r₂ (smaller) → tree h₂(x) fits r₂ → … → final prediction F_M(x) = F₀ + η·h₁ + η·h₂ + … + η·h_M, the sum of all trees weighted by the learning rate η.

Each boosting round: compute residuals → fit a tree → update the ensemble. Residuals shrink with each round. The final prediction is the weighted sum of all trees.


Section 06

Python Implementation - Classification & Regression

Installation

# Install XGBoost
pip install xgboost scikit-learn pandas numpy

Classification - Predicting Customer Churn

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Simulated churn dataset: 8000 customers, 20 features
X, y = make_classification(
    n_samples=8000, n_features=20,
    n_informative=12, n_redundant=4,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# XGBoost Classifier - sensible starting defaults
model = xgb.XGBClassifier(
    n_estimators=300,        # number of boosting rounds
    learning_rate=0.05,      # eta - shrinkage per round
    max_depth=6,             # max tree depth
    subsample=0.8,          # row sampling per tree
    colsample_bytree=0.8,   # feature sampling per tree
    gamma=0,                 # min split gain
    reg_alpha=0,             # L1 regularisation
    reg_lambda=1,            # L2 regularisation
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=50              # print every 50 rounds
)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
OUTPUT
[0]   validation-logloss:0.62843
[50]  validation-logloss:0.38251
[100] validation-logloss:0.29174
[200] validation-logloss:0.24681
[299] validation-logloss:0.22037

Accuracy: 0.9156

              precision    recall  f1-score   support
           0       0.92      0.91      0.91       796
           1       0.91      0.92      0.92       804
    accuracy                           0.92      1600

Regression - California Housing Prices

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_absolute_error

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.03,
    max_depth=5,
    subsample=0.75,
    colsample_bytree=0.75,
    reg_lambda=1.5,
    random_state=42
)

reg.fit(X_train, y_train,
       eval_set=[(X_test, y_test)],
       verbose=False)

y_pred = reg.predict(X_test)
print(f"Rยฒ Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE:      ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
OUTPUT
R² Score: 0.8342
MAE:      $29,814

Section 07

Key Hyperparameters - Complete Reference

Understanding XGBoost's hyperparameters is essential before tuning. They fall into four groups:

Group 1 โ€” Boosting Control

Parameter | Default | Range | Effect | Tuning Priority
n_estimators | 100 | 50–2000 | Number of boosting rounds (trees) | High - tune with early stopping
learning_rate (eta) | 0.3 | 0.01–0.3 | Shrinkage per round. Lower = better generalisation but needs more trees | Critical - start at 0.05–0.1
booster | 'gbtree' | gbtree / gblinear / dart | Base learner type. gbtree is almost always best for tabular data | Rarely changed

Group 2 โ€” Tree Structure

Parameter | Default | Range | Effect | Tuning Priority
max_depth | 6 | 3–10 | Maximum tree depth. Deeper = more complex patterns but higher overfitting risk | Critical - tune first
min_child_weight | 1 | 1–10 | Minimum sum of Hessian in a leaf. Higher = more conservative splits, less overfitting | High
gamma | 0 | 0–5 | Minimum gain for a split. 0 = split freely. Higher = more aggressive pruning | High
max_leaves | 0 (unlimited) | 0–64 | Used with grow_policy='lossguide'. Limits leaves instead of depth | Optional

Group 3 โ€” Sampling (Stochastic Boosting)

Parameter | Default | Range | Effect | Tuning Priority
subsample | 1.0 | 0.5–1.0 | Fraction of training rows sampled per tree. <1.0 adds randomness like Random Forest | High
colsample_bytree | 1.0 | 0.3–1.0 | Fraction of features sampled per tree | High
colsample_bylevel | 1.0 | 0.3–1.0 | Fraction of features sampled per tree depth level | Medium
colsample_bynode | 1.0 | 0.3–1.0 | Fraction of features sampled per node split | Medium

Group 4 โ€” Regularisation

Parameter | Default | Range | Effect | Tuning Priority
reg_alpha (alpha) | 0 | 0–10 | L1 regularisation on leaf weights. Drives some weights to zero (sparse solutions) | High for high-dimensional data
reg_lambda (lambda) | 1 | 0–10 | L2 regularisation on leaf weights. Shrinks all weights toward zero. Usually beneficial | High
scale_pos_weight | 1 | ratio | For imbalanced data: set to sum(neg)/sum(pos). Like class_weight='balanced' | Critical for imbalanced problems
🔑 Tuning Order That Works

1. Fix learning_rate=0.05 and find the right n_estimators via early stopping.
2. Tune max_depth and min_child_weight together.
3. Tune subsample and colsample_bytree.
4. Tune gamma, reg_alpha, and reg_lambda.
5. Lower learning_rate and increase n_estimators proportionally.
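A compact sketch of that order, assuming the X_train/y_train arrays from the earlier examples; the grids are deliberately tiny and only steps 1–2 are shown.

# Staged tuning sketch: fix eta, find the tree count, then tune tree structure (illustrative)
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=42)

# Step 1: learning_rate fixed at 0.05, early stopping finds n_estimators
probe = xgb.XGBClassifier(n_estimators=2000, learning_rate=0.05,
                          eval_metric='logloss', early_stopping_rounds=50)
probe.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
n_best = probe.best_iteration + 1

# Step 2: tune max_depth and min_child_weight together with the tree count fixed
stage2 = GridSearchCV(
    xgb.XGBClassifier(n_estimators=n_best, learning_rate=0.05, eval_metric='logloss'),
    {'max_depth': [4, 6, 8], 'min_child_weight': [1, 3, 5]},
    cv=3, scoring='roc_auc', n_jobs=-1,
).fit(X_train, y_train)
print(stage2.best_params_)
# Steps 3-5 repeat the same pattern for sampling, regularisation, and a final lower eta.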


Section 08

Early Stopping - Finding the Right Number of Trees

The Running Coach
A marathon coach watches the athlete's lap times. For the first 40 laps, times improve every lap. Around lap 41, they plateau. By lap 55 they are getting worse; the athlete is exhausted and cramping. The coach stops the run and records lap 41 as the best performance.

Early stopping does exactly this. It watches validation loss at every boosting round and stops when no improvement has been seen for early_stopping_rounds consecutive rounds, returning the model from the best round rather than the final one.
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Use a validation set for early stopping
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full,
    test_size=0.15, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=2000,       # set high - early stopping will find the actual best
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    early_stopping_rounds=50, # stop if no improvement for 50 rounds
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best round:       {model.best_iteration}")
print(f"Best val logloss: {model.best_score:.5f}")
# Use best_iteration as n_estimators in final model
OUTPUT
Best round:       387
Best val logloss: 0.19834
# Stopped early - avoided 1613 unnecessary rounds!

Section 09

The Four Tuning Strategies - Overview

Hyperparameter tuning is the art of finding the combination of knobs that maximises model performance. There are four major strategies, each with very different trade-offs between speed, exploration, and compute cost.

🗺️ Four Tuning Strategies - At a Glance
🔲 Grid Search (exhaustive): tests every combination. ✔ Thorough ✘ Exponential cost ✘ Curse of dimensionality. Best for ≤3 params.
🎲 Random Search (stochastic sampling): random combos drawn from distributions. ✔ Fast and scalable ✔ Covers large spaces ✘ No memory of past results. Best for 4–8 params.
⚡ Optuna (TPE + pruning): learns from past trials. ✔ Smart and adaptive ✔ Pruning saves compute ✔ Parallel-friendly. Best for production-grade tuning.
🧠 Bayesian Optimisation (probabilistic surrogate): a GP/TPE models the objective surface. ✔ Most sample-efficient ✔ Uncertainty-aware ✘ Not easily parallelised. Best for expensive evaluations.

Section 10

Strategy 1 - Grid Search

The Librarian's Index
A librarian wants to find the best temperature and humidity for storing old manuscripts. She tests every combination: 15°C/40%, 15°C/50%, 15°C/60%, 20°C/40%, 20°C/50%… She tries every grid point. Exhaustive, but if the grid is too fine, she'll spend 40 years testing. That's Grid Search: comprehensive but combinatorially explosive.

Grid Search trains a model for every possible combination of the values you specify. With 3 parameters each having 5 values, that's 5³ = 125 models. With 6 parameters it becomes 5⁶ = 15,625 models, and with 5-fold CV that means 78,125 training runs.

⚠️ When Grid Search Is Appropriate

Use Grid Search only when you have ≤3 hyperparameters and a small, fast-training model. For XGBoost, with its 10+ parameters, Grid Search is rarely the right choice. Use it for fine-tuning a small range after Random Search or Optuna has found a good region.

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Keep the grid SMALL: 3 params with 3 values each = 3^3 = 27 combos x 5 folds = 135 fits
param_grid = {
    'max_depth':        [3, 5, 7],
    'learning_rate':    [0.01, 0.05, 0.1],
    'n_estimators':    [100, 300, 500],
}

base_model = xgb.XGBClassifier(
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,              # use all CPU cores
    verbose=2
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print(f"Best CV F1:       {grid_search.best_score_:.4f}")
print(f"Test Accuracy:    {grid_search.best_estimator_.score(X_test, y_test):.4f}")

# Inspect all results as a DataFrame
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
top5 = results_df.nlargest(5, 'mean_test_score')[[
    'param_max_depth', 'param_learning_rate',
    'param_n_estimators', 'mean_test_score'
]]
print(top5.to_string(index=False))
OUTPUT
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters: {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
Best CV F1:       0.9148
Test Accuracy:    0.9181
 param_max_depth  param_learning_rate  param_n_estimators  mean_test_score
               5                 0.05                 300           0.9148
               5                 0.05                 500           0.9140
               7                 0.05                 300           0.9131
               5                 0.10                 300           0.9122
               3                 0.05                 500           0.9087
📏 The Curse of Dimensionality in Grid Search

Adding one more parameter with 5 values multiplies the number of combinations by 5×. If each XGBoost fit takes 10 seconds and you have 6 parameters with 5 values each, Grid Search needs 5⁶ combinations × 5 folds × 10 s ≈ 781,000 seconds, roughly 9 days on a single machine. Random Search with 100 iterations and the same 5-fold CV finishes in under 90 minutes and typically finds a comparably good solution.
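The arithmetic behind that comparison, spelled out (it assumes the hypothetical 10-seconds-per-fit figure above):

# Back-of-envelope cost comparison (assumes ~10 s per single model fit)
seconds_per_fit, folds = 10, 5
grid_fits   = 5**6 * folds           # 6 params x 5 values each, 5-fold CV
random_fits = 100 * folds            # 100 sampled configurations, 5-fold CV
print(f"Grid Search:   {grid_fits * seconds_per_fit / 3600:.0f} hours")    # ~217 hours
print(f"Random Search: {random_fits * seconds_per_fit / 60:.0f} minutes")  # ~83 minutes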


Section 11

Strategy 2 - Random Search

The Prospector's Paradox
Two gold prospectors search a 10,000-acre desert. Prospector A drills a well every 100 metres in a perfect grid; he tests 100 evenly spaced rows. Prospector B randomly throws darts at a map and drills wherever they land.

It turns out the gold is concentrated in a thin 50-metre seam running diagonally across the desert. Prospector A's grid misses it entirely; none of his rows cross the seam. Prospector B's random samples are more likely to stumble across it.

Bergstra & Bengio (2012) proved this mathematically: for hyperparameter search, random sampling is almost always more efficient than grid sampling when most parameters don't matter.

Random Search samples each parameter independently from a distribution (list, range, or statistical distribution). Because it doesn't waste evaluations testing every combination of unimportant parameters, it finds good solutions much faster than Grid Search.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform
import xgboost as xgb

# Wide search space โ€” Random Search explores this efficiently
param_dist = {
    'n_estimators':       randint(100, 1000),
    'learning_rate':      loguniform(0.005, 0.3),  # log-scale: more low values
    'max_depth':          randint(3, 10),
    'min_child_weight':   randint(1, 10),
    'subsample':          uniform(0.5, 0.5),     # uniform(loc, scale) → [0.5, 1.0]
    'colsample_bytree':   uniform(0.4, 0.6),     # → [0.4, 1.0]
    'gamma':              uniform(0, 2),
    'reg_alpha':          loguniform(1e-4, 10),
    'reg_lambda':         loguniform(1e-4, 10),
}

base = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

random_search = RandomizedSearchCV(
    estimator=base,
    param_distributions=param_dist,
    n_iter=100,             # 100 random combos - much cheaper than a full grid
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
print(f"Best CV AUC:      {random_search.best_score_:.5f}")

best_xgb = random_search.best_estimator_
from sklearn.metrics import roc_auc_score
y_proba = best_xgb.predict_proba(X_test)[:, 1]
print(f"Test AUC:         {roc_auc_score(y_test, y_proba):.5f}")
OUTPUT
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best Parameters: {'colsample_bytree': 0.712, 'gamma': 0.341, 'learning_rate': 0.038,
                  'max_depth': 5, 'min_child_weight': 3, 'n_estimators': 642,
                  'reg_alpha': 0.027, 'reg_lambda': 1.84, 'subsample': 0.813}
Best CV AUC:      0.97214
Test AUC:         0.97389
🔲 Grid Search (3 params, 27 combos)
Metric | Value
Evaluations | 135 (27 × 5 folds)
Best CV AUC | 0.9431
Search space coverage | Exhaustive within the small grid
Missed params | subsample, alpha, lambda, gamma
🎲 Random Search (9 params, 100 iters)
Metric | Value
Evaluations | 500 (100 × 5 folds)
Best CV AUC | 0.9721
Search space coverage | Wide, log-uniform distributions
Covered params | All 9 key XGBoost params

Section 12

Strategy 3 - Optuna (TPE + Pruning)

The Smart Chef
Three chefs compete to perfect a soup recipe. Chef A tastes every ratio of spices methodically. Chef B randomly grabs spices without remembering what worked. Chef C (Optuna) keeps a mental model: "The last 10 attempts showed salty soups score badly, so I'll sample less sodium next time." Chef C also abandons bad-tasting attempts early: if it's already terrible after 2 minutes, why cook it for an hour?

Optuna's Tree-structured Parzen Estimator (TPE) builds a probabilistic model of which parameter regions produce good results, then samples more from good regions. Its pruner terminates bad trials early, saving massive compute.

Optuna is a modern hyperparameter optimisation framework by Preferred Networks. It uses the TPE algorithm, which models the objective with two probability densities, one for good parameter regions and one for bad, and samples the points that maximise the ratio between them.

pip install optuna
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score   # used for the final test-set evaluation below
import numpy as np

# Suppress Optuna's default logging for clean output
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    # Define the search space with Optuna's suggest API
    params = {
        'n_estimators':      trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate':     trial.suggest_float('learning_rate', 0.005, 0.3, log=True),
        'max_depth':         trial.suggest_int('max_depth', 3, 10),
        'min_child_weight':  trial.suggest_int('min_child_weight', 1, 10),
        'subsample':         trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree':  trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'gamma':             trial.suggest_float('gamma', 0, 5),
        'reg_alpha':         trial.suggest_float('reg_alpha', 1e-4, 10, log=True),
        'reg_lambda':        trial.suggest_float('reg_lambda', 1e-4, 10, log=True),
        'use_label_encoder': False,
        'eval_metric':       'logloss',
        'random_state':      42,
    }

    model = xgb.XGBClassifier(**params)
    scores = cross_val_score(model, X_train, y_train,
                               cv=5, scoring='roc_auc',
                               n_jobs=-1)
    return scores.mean()           # Optuna maximises this

# Create a study and optimise
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42),  # Tree Parzen Estimator
    pruner=optuna.pruners.MedianPruner(n_startup_trials=10)
)

study.optimize(objective, n_trials=100, show_progress_bar=True)

print(f"Best AUC:    {study.best_value:.5f}")
print(f"Best params: {study.best_params}")

# Retrain final model on full training set with best params
best_model = xgb.XGBClassifier(**study.best_params,
                                 use_label_encoder=False,
                                 eval_metric='logloss')
best_model.fit(X_train, y_train)
print(f"Test AUC:    {roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]):.5f}")
OUTPUT
[Trial 10]  AUC: 0.96412  (warming up)
[Trial 30]  AUC: 0.97589  (exploring)
[Trial 60]  AUC: 0.97841  (converging)
[Trial 100] AUC: 0.97903  (best)
Best AUC:    0.97903
Best params: {'n_estimators': 712, 'learning_rate': 0.029, 'max_depth': 6,
              'min_child_weight': 2, 'subsample': 0.847, 'colsample_bytree': 0.673,
              'gamma': 0.182, 'reg_alpha': 0.041, 'reg_lambda': 2.34}
Test AUC:    0.97961

Optuna with Early Stopping (Advanced - Real Speed-up)

import optuna
from optuna.integration import XGBoostPruningCallback
import xgboost as xgb

def objective_with_pruning(trial):
    params = {
        'n_estimators':     1000,
        'learning_rate':    trial.suggest_float('lr', 0.005, 0.3, log=True),
        'max_depth':        trial.suggest_int('max_depth', 3, 10),
        'subsample':        trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'reg_alpha':        trial.suggest_float('reg_alpha', 1e-4, 10, log=True),
        'use_label_encoder':False,
        'eval_metric':      'logloss',
        'early_stopping_rounds': 50,
        'callbacks':        [XGBoostPruningCallback(trial, 'validation_0-logloss')]  # the sklearn API names the first eval_set 'validation_0'
    }

    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train,
             eval_set=[(X_val, y_val)],
             verbose=False)

    return model.best_score  # returns best val logloss (minimise)

study2 = optuna.create_study(direction='minimize')  # minimise logloss
study2.optimize(objective_with_pruning, n_trials=80)
print(f"Pruned trials: {len([t for t in study2.trials if t.state == optuna.trial.TrialState.PRUNED])}")
print(f"Best logloss:  {study2.best_value:.5f}")
OUTPUT
Pruned trials: 31   ← 31 of 80 bad trials cut short (38% compute saved!)
Best logloss:  0.17231
🎯 Optuna's Killer Feature - Visualisation

Optuna generates rich interactive plots: optuna.visualization.plot_param_importances(study) tells you which hyperparameters matter most, plot_optimization_history(study) shows convergence, and plot_contour(study) reveals interactions between pairs of parameters. These are invaluable for understanding your model's search landscape.
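For example, for the study object created above (assuming plotly, Optuna's default plotting backend, is installed):

# Quick look at the search landscape for the study created earlier (requires plotly)
import optuna.visualization as vis

vis.plot_optimization_history(study).show()      # best value vs trial number
vis.plot_param_importances(study).show()         # which hyperparameters mattered most
vis.plot_contour(study, params=['max_depth', 'learning_rate']).show()  # pairwise interaction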


Section 13

Strategy 4 - Bayesian Optimisation

The Treasure Hunter with a Map
Indiana Jones is searching a jungle for a treasure. He doesn't walk randomly; he builds a probabilistic map. "The artefact is probably near ancient ruins. I've checked three ruins with no success, so I'll look where I'm most uncertain: unexplored zones near ruins."

He balances exploitation (digging near known-good spots) and exploration (checking uncertain new areas). After 30 digs he finds the treasure; a random searcher would have needed 300.

Bayesian Optimisation works identically: it builds a surrogate model (Gaussian Process or TPE) of the objective function and uses an acquisition function (like Expected Improvement) to pick the next most promising point to evaluate.

Bayesian Optimisation is the most sample-efficient tuning method: it evaluates the fewest models to find the best result. It works by:

🧠 Bayesian Optimisation - Four-Step Loop
Step 1
Build surrogate model: Fit a Gaussian Process (GP) to all previous (hyperparams → score) observations. The GP gives a predicted mean and uncertainty at every point in the space.
Step 2
Maximise acquisition function: Compute Expected Improvement (EI) = how much improvement can we expect over current best? Points with high predicted score OR high uncertainty get picked.
Step 3
Evaluate the objective: Train XGBoost with the chosen hyperparameters. Get the actual score. This is the expensive step.
Step 4
Update the surrogate: Add the new observation to the GP. Return to Step 1 with a better-informed model. Repeat until budget is exhausted.
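To make Step 2 concrete, here is a small sketch of the Expected Improvement calculation for a maximisation problem, given a GP's predicted mean and standard deviation at one candidate point (the numbers are invented):

# Expected Improvement for a maximisation problem (illustrative numbers)
import numpy as np
from scipy.stats import norm

mu, sigma = 0.976, 0.004     # GP posterior mean and std at a candidate hyperparameter point
best_so_far = 0.974          # best CV AUC observed so far

z = (mu - best_so_far) / sigma
ei = (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)
print(f"Expected Improvement: {ei:.5f}")
# High EI comes from a high predicted mean, high uncertainty, or both -
# exactly the exploitation/exploration balance described above.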

Bayesian Optimisation with scikit-optimize (skopt)

pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Integer
import xgboost as xgb

# Define search space with proper types
search_space = {
    'n_estimators':      Integer(100, 1000),
    'learning_rate':     Real(0.005, 0.3, prior='log-uniform'),
    'max_depth':         Integer(3, 10),
    'min_child_weight':  Integer(1, 10),
    'subsample':         Real(0.5, 1.0),
    'colsample_bytree':  Real(0.4, 1.0),
    'gamma':             Real(0, 5.0),
    'reg_alpha':         Real(1e-4, 10, prior='log-uniform'),
    'reg_lambda':        Real(1e-4, 10, prior='log-uniform'),
}

base = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

# BayesSearchCV wraps the GP-based search in sklearn's CV framework
bayes_search = BayesSearchCV(
    estimator=base,
    search_spaces=search_space,
    n_iter=50,           # 50 evaluations - often beats 200 random iterations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=0,
    random_state=42
)

# Callback to print progress
def on_step(optim_result):
    print(f"  Iter {len(optim_result.x_iters):3d} | Best AUC: {-optim_result.fun:.5f}")

bayes_search.fit(X_train, y_train, callback=[on_step])

print(f"\nBest CV AUC:  {bayes_search.best_score_:.5f}")
print("Best params:", bayes_search.best_params_)
OUTPUT
Iter  1 | Best AUC: 0.95124   (random initialisation)
Iter  5 | Best AUC: 0.96783
Iter 10 | Best AUC: 0.97312   (surrogate model building)
Iter 20 | Best AUC: 0.97701   (exploitation kicking in)
Iter 35 | Best AUC: 0.97889
Iter 50 | Best AUC: 0.97952   (converged)

Best CV AUC:  0.97952
Best params: {'colsample_bytree': 0.701, 'gamma': 0.219, 'learning_rate': 0.031,
              'max_depth': 6, 'min_child_weight': 2, 'n_estimators': 698,
              'reg_alpha': 0.038, 'reg_lambda': 2.11, 'subsample': 0.834}

Bayesian Optimisation with Optuna (GP Sampler)

import optuna
from optuna.samplers import GPSampler  # Gaussian Process sampler (Optuna >= 3.6)

study_gp = optuna.create_study(
    direction='maximize',
    sampler=GPSampler(seed=42),
    study_name='xgb_gp_tuning'
)

study_gp.optimize(objective, n_trials=50)
print(f"GP-Optuna Best AUC: {study_gp.best_value:.5f}")

Section 14

Tuning Strategy Comparison - Full Table

Property | Grid Search | Random Search | Optuna (TPE) | Bayesian (GP)
Search strategy | Exhaustive grid | Random sampling | TPE (probabilistic) | GP surrogate + EI
Learns from past? | No | No | Yes (TPE model) | Yes (GP model)
Handles many params? | Fails (exponential) | Yes, scales well | Yes, very well | Medium (GP scales O(n³))
Parallelisable? | Yes (trivial) | Yes (trivial) | Yes (async) | Partially (sequential by design)
Trial pruning? | No | No | Yes (MedianPruner) | Some implementations
Sample efficiency | Very low | Medium | High | Highest
Typical evaluations needed | All (pⁿ) | 50–200 | 50–150 | 25–75
Best for XGBoost when... | Fine-tuning 2–3 params in a narrow range | Initial broad exploration, ≥5 params | Production tuning with a budget | Slow evaluations, tight iteration budget
Library | sklearn GridSearchCV | sklearn RandomizedSearchCV | optuna | scikit-optimize, optuna GPSampler
📈 Convergence Comparison - Best AUC vs Number of Evaluations (Grid Search vs Random Search vs Optuna TPE vs Bayesian GP)

Bayesian Optimisation and Optuna (TPE) converge to near-optimal solutions in far fewer evaluations than Grid or Random Search. The advantage is greatest when each evaluation is expensive (slow model training).


Section 15

Feature Importance in XGBoost

XGBoost provides three types of built-in feature importance, each measuring something different:

📊
Weight (Frequency)
importance_type='weight'
Number of times a feature is used to split across all trees. Simple but biased toward features with many splits. Not recommended for final interpretation.
📉
Gain
importance_type='gain'
Average improvement in the objective function brought by a feature each time it splits. More meaningful than frequency. The most commonly used default.
🎯
Cover
importance_type='cover'
Average number of samples affected by splits on this feature across all trees. Reflects the feature's influence over the data distribution rather than objective improvement.
import xgboost as xgb
import pandas as pd
import matplotlib.pyplot as plt

# Get feature importances (gain is most informative)
importances = model.get_booster().get_score(importance_type='gain')

imp_df = pd.DataFrame({
    'feature':    list(importances.keys()),
    'importance': list(importances.values())
}).sort_values('importance', ascending=False)

print(imp_df.to_string(index=False))

# Built-in plot (requires matplotlib)
xgb.plot_importance(model, importance_type='gain', max_num_features=15)
plt.title('XGBoost Feature Importance (Gain)')
plt.tight_layout()
plt.show()

# SHAP values - the gold standard for XGBoost interpretability
# pip install shap
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Pass feature names if you have them; with a plain NumPy array SHAP labels features generically
shap.summary_plot(shap_values, X_test)
⚠️ Use SHAP for True Interpretability

Built-in feature importance (weight/gain/cover) can be misleading when features are correlated. SHAP (SHapley Additive exPlanations) values are the gold standard: they are consistent, locally accurate, and handle correlations far better. SHAP's TreeExplainer computes exact values for tree ensembles in O(TLD²) time (T trees, L leaves, D depth), which is very fast in practice.


Section 16

Complete Production Pipeline - End-to-End Example

import xgboost as xgb
import optuna
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.preprocessing import LabelEncoder

## -- STEP 1: Data ----------------------------------
X, y = make_classification(n_samples=10000, n_features=25,
                             n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15, stratify=y_train, random_state=42
)

## -- STEP 2: Find n_estimators with early stopping --
probe = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    eval_metric='logloss', early_stopping_rounds=50,
    use_label_encoder=False, random_state=42
)
probe.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
best_n = probe.best_iteration
print(f"Best n_estimators: {best_n}")

## -- STEP 3: Optuna for all other params ------------
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    params = {
        'n_estimators':     best_n,
        'learning_rate':    trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'max_depth':        trial.suggest_int('max_depth', 3, 8),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
        'subsample':        trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma':            trial.suggest_float('gamma', 0, 3),
        'reg_alpha':        trial.suggest_float('reg_alpha', 1e-4, 5, log=True),
        'reg_lambda':       trial.suggest_float('reg_lambda', 1e-4, 5, log=True),
        'use_label_encoder':False,
        'eval_metric':      'logloss',
        'random_state':     42,
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(
        xgb.XGBClassifier(**params), X_train, y_train,
        cv=cv, scoring='roc_auc', n_jobs=-1
    )
    return scores.mean()

study = optuna.create_study(direction='maximize',
                              sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=80, show_progress_bar=True)

## -- STEP 4: Final model ----------------------------
final_params = {**study.best_params, 'n_estimators': best_n,
                'use_label_encoder': False, 'eval_metric': 'logloss'}
final_model = xgb.XGBClassifier(**final_params)
final_model.fit(X_train, y_train)

y_proba = final_model.predict_proba(X_test)[:, 1]
y_pred  = final_model.predict(X_test)

print(f"\nTest AUC-ROC: {roc_auc_score(y_test, y_proba):.5f}")
print(classification_report(y_test, y_pred))
OUTPUT
Best n_estimators: 412
[Optuna] 80/80 trials complete | Best AUC: 0.98147

Test AUC-ROC: 0.98203
              precision    recall  f1-score   support
           0       0.94      0.93      0.93      1003
           1       0.93      0.94      0.94       997
    accuracy                           0.93      2000

Section 17

Golden Rules - XGBoost & Hyperparameter Tuning

🌿 XGBoost - Rules You Must Know
1
Always use early stopping. Set n_estimators very high (1000–2000) and let early_stopping_rounds=50 find the optimal number. Then use that number as a fixed parameter in all tuning runs.
2
Lower learning rate = better generalisation, but more trees needed. Start with learning_rate=0.05. After all other tuning is done, halve it to 0.025 and double n_estimators; this usually gives another 0.5–1% accuracy boost at the cost of 2× training time.
3
max_depth and min_child_weight are the most impactful tree parameters. Tune them together first. For most tabular problems, max_depth=4–6 is optimal. Higher depth rarely helps and always risks overfitting.
4
Always set subsample and colsample_bytree to 0.6–0.9. This stochastic component reduces correlation between trees (like Random Forest's bagging) and is one of XGBoost's most powerful anti-overfitting mechanisms.
5
For imbalanced datasets, set scale_pos_weight = count(negative) / count(positive). Use eval_metric='aucpr' (area under the precision-recall curve) instead of AUC-ROC; it is more sensitive to rare-class performance. A short sketch follows after this list.
6
Choose the right tuning strategy for your budget: Quick prototype → Random Search (50–100 iters). Production model with time to burn → Optuna TPE (80–150 trials). Very expensive evaluations (hours each) → Bayesian GP (30–50 trials). Never use Grid Search for XGBoost with more than 3 parameters.
7
XGBoost handles missing values natively; do not impute before training. Pass your data with NaNs and XGBoost learns the best default split direction for each missing value. Imputation can actually hurt performance by destroying the missingness signal.
8
Use SHAP for interpretability, not built-in feature importance. SHAP values are the only feature importance method that is both locally consistent and globally accurate in the presence of correlated features. They're fast on XGBoost thanks to the TreeExplainer.
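As a companion to rule 5, here is a small sketch of scale_pos_weight plus a precision-recall-oriented metric on a synthetic imbalanced dataset (all names and values below are illustrative):

# Handling class imbalance: weight the positive class and evaluate with PR-AUC (sketch)
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=20000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)                      # ~5% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

spw = (y_tr == 0).sum() / (y_tr == 1).sum()   # scale_pos_weight = negatives / positives
clf = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=5,
                        scale_pos_weight=spw, eval_metric='aucpr')
clf.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)

print(f"scale_pos_weight = {spw:.1f}")
print(f"Test PR-AUC      = {average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]):.3f}")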

Section 18

XGBoost vs LightGBM vs CatBoost - Quick Decision Guide

Property | XGBoost | LightGBM | CatBoost
Split strategy | Level-wise (BFS) | Leaf-wise (depth-first) | Symmetric (oblivious) trees
Training speed | Good | Fastest | Slower on dense data
Memory usage | High | Low (histogram) | Medium
Categorical features | Needs encoding | Basic native support | Excellent native support
Missing values | Native | Native | Native
Small datasets (<10k) | Best | OK (overfit risk) | Very good
Large datasets (>1M) | Slower | Fastest | Medium
Tuning sensitivity | Medium | High (num_leaves) | Low (robust defaults)
Kaggle dominance era | 2014–2017 | 2017–2020 | 2019–present (tabular)
๐Ÿ†
The Practitioner's Decision Rule

Start with XGBoost for battle-tested reliability and the most documentation. Switch to LightGBM when training is too slow (>1M rows). Switch to CatBoost when you have many categorical columns and don't want to encode them. In competitions, ensemble all three for maximum performance.
