
Boosting & XGBoost

A deep-dive tutorial on ensemble boosting: from the intuition behind AdaBoost to the math, regularisation, and Python implementation of XGBoost, with visual diagrams, real-world stories, and annotated code examples.

Section 01

The Story That Explains Boosting

The Student Who Learned From Every Mistake
Imagine a student sitting three maths exams. After the first exam, the teacher marks only the questions the student got wrong, and the next lesson focuses entirely on those hard ones. After the second exam, the same process: find the new weak spots, drill them hard. By the third exam the student is formidable, not because they were naturally gifted, but because every iteration fixed the previous iteration's failures.

That is boosting in one paragraph. Instead of training many independent models (like Random Forest), boosting trains models sequentially: each new model concentrates on the examples the previous ensemble got wrong.

Boosting is a family of ensemble learning algorithms. Its unifying idea: combine many weak learners (models that are only slightly better than random guessing) into one strong learner by letting each weak learner fix what its predecessors could not. The three most important members of this family are:

📈
AdaBoost (1996)
Adaptive Boosting
Adjusts sample weights after each round. Misclassified samples get heavier weight so the next learner pays more attention to them. The final prediction is a weighted vote of all learners.
📉
Gradient Boosting (1999)
Gradient Descent in Function Space
Each new model fits the residual errors (pseudo-residuals / gradients) of the current ensemble. Works for any differentiable loss function: classification, regression, ranking.
⚡
XGBoost (2014)
Extreme Gradient Boosting
An optimised, regularised implementation of gradient boosting. Adds L1/L2 regularisation, handles missing data natively, supports GPU training, and dominates competitive ML to this day.
💡
Bagging vs Boosting - The Core Difference

Bagging (Random Forest) builds trees in parallel and combines their predictions. Each tree is independent. The goal is to reduce variance. Boosting builds trees sequentially, each correcting the last. The goal is to reduce bias. This makes boosting more powerful but also more prone to overfitting if not regularised.
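To make the contrast concrete, here is a minimal sketch (synthetic data, untuned settings) that trains one bagging ensemble and one boosting ensemble on the same problem:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: 200 deep trees grown independently and averaged (reduces variance)
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: 200 shallow trees grown sequentially, each one fitting what the
# ensemble so far still gets wrong (reduces bias)
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                learning_rate=0.1, random_state=0)

for name, model in [('Random Forest (bagging)', rf), ('Gradient Boosting', gb)]:
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"{name:25s} CV AUC = {scores.mean():.3f} ± {scores.std():.3f}")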


Section 02

AdaBoost - Where It All Started

AdaBoost (Adaptive Boosting) was introduced by Freund & Schapire in 1996 and won the Gödel Prize. It is the simplest boosting algorithm to understand, and understanding it unlocks everything that follows.

The Panel of Distracted Judges
A talent show has five judges, each half-asleep and only barely competent. After each performer, the show producers look at which contestants the judges got most wrong and make them the opening act of the next round, forcing the judges to pay close attention to them. Over time the judges collectively build a sharp picture of every contestant, especially the ambiguous cases. Their combined verdict is weighted by how reliable each judge proved to be. This is AdaBoost.

The AdaBoost Algorithm - Step by Step

01
Initialise Sample Weights
Give every training sample equal weight: wᵢ = 1/N, where N is the number of samples. Every sample matters equally in round one.
02
Train a Weak Learner
Fit a shallow decision tree (usually a stump with depth = 1) on the weighted data. The stump tries to classify samples, paying more attention to the heavily weighted ones.
03
Compute Weighted Error
Calculate the weighted error ε = Σ wᵢ over all misclassified samples i. A perfect stump gives ε = 0; random guessing gives ε = 0.5.
04
Compute Learner Weight (α)
α = 0.5 × ln((1 − ε) / ε). A stump with ε = 0.1 gets α ≈ 1.1 (strong vote). ε = 0.4 gets α ≈ 0.2 (weak vote). ε = 0.5 gets α = 0 (ignored).
05
Update Sample Weights
Misclassified samples get their weight multiplied by e^α (increased). Correct samples get their weight multiplied by e^(−α) (decreased). Re-normalise so the weights sum to 1.
06
Repeat & Combine
Repeat steps 2–5 for T rounds. Final prediction: F(x) = sign(Σ αₜ · hₜ(x)), a weighted majority vote of all T stumps.
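These six steps map almost line-for-line onto code. Below is a from-scratch sketch, assuming binary labels encoded as −1/+1 and borrowing scikit-learn's decision stumps; it mirrors the algorithm above rather than replacing AdaBoostClassifier.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Steps 1-6 above. y must contain -1/+1 labels."""
    N = len(y)
    w = np.full(N, 1 / N)                                # Step 1: equal sample weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)                 # Step 2: weak learner on weighted data
        pred = stump.predict(X)
        miss = pred != y
        eps = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)   # Step 3: weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)            # Step 4: learner weight
        w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))  # Step 5: up-/down-weight samples
        w = w / w.sum()                                  #         re-normalise to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)                                # Step 6: weighted majority vote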
📊 DIAGRAM - AdaBoost Weight Update Across 3 Rounds
[Diagram: three boosting rounds. Weights start equal at 1/N; after each round the misclassified samples are upweighted before the next stump is trained, and each stump receives a vote weight α based on its weighted error.]

Circle size represents sample weight. Misclassified samples (red) grow larger in the next round, forcing the next stump to focus on them.


Section 03

Gradient Boosting - The Generalisation

Gradient Boosting (Friedman, 1999) took AdaBoost's core idea and framed it as gradient descent in function space. Instead of reweighting samples, each new tree directly fits the residual errors (technically, the negative gradients of the loss function) of the current ensemble.

🧮
The Key Insight - Residuals as Pseudo-Labels

If your current ensemble predicts 72 for a house priced at 100, the residual is 28. Train the next tree to predict 28 (the error), not 100. Add that tree's prediction to the ensemble and now you predict 72 + 28 × η (where η is the learning rate). Repeat. Each tree corrects what remains. This is fitting residuals.

Gradient Boosting - The Algorithm

Initialisation
F₀(x) = argminᵧ Σ L(yᵢ, γ)
Start with the simplest model, usually the mean (regression) or the log-odds (classification).
Pseudo-Residuals (Gradient)
rᵢₘ = −[∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)], evaluated at F = Fₘ₋₁
Negative gradient of the loss w.r.t. the current predictions. For MSE loss, this is simply yᵢ − F(xᵢ).
Fit Weak Learner to Residuals
hₘ(x) = tree fit to {rᵢₘ}
Train a new shallow tree to predict the pseudo-residuals, not the original labels.
Update Ensemble
Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)
Add the new tree scaled by the learning rate η (0 < η ≤ 1). Small η = more trees needed, better generalisation.
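Here is a from-scratch sketch of these four steps for squared-error regression, the simplest case, where the pseudo-residuals reduce to y − F(x); the depths and rates are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, eta=0.1, max_depth=3):
    F0 = y.mean()                          # initialisation: a constant model
    F = np.full(len(y), F0, dtype=float)
    trees = []
    for _ in range(M):
        residuals = y - F                  # pseudo-residuals = negative gradient of MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # weak learner fits the residuals, not y
        F = F + eta * tree.predict(X)      # update the ensemble, shrunk by the learning rate
        trees.append(tree)
    return F0, trees

def gradient_boost_predict(X, F0, trees, eta=0.1):
    return F0 + eta * sum(t.predict(X) for t in trees)

Swapping the loss only changes the residual line; for absolute error, for example, the pseudo-residuals become sign(y − F(x)).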
📊 DIAGRAM - Gradient Boosting: Fitting Residuals Sequentially
[Diagram: initialise F₀(x) = mean(y); compute residuals r = y − F₀(x); fit a shallow tree h₁ to the residuals; update F₁ = F₀ + η·h₁; repeat M times to give F_M(x) = F₀ + η·h₁ + η·h₂ + … + η·h_M.]

Each iteration adds one new tree whose job is to predict the current mistakes of the ensemble. The learning rate η shrinks each tree's contribution to prevent overfitting.


Section 04

XGBoost - Extreme Gradient Boosting

How XGBoost Conquered Kaggle
In 2014, Tianqi Chen (then a PhD student at the University of Washington) released XGBoost as a research project. Within two years it had won or featured in the majority of Kaggle competition solutions; the 2016 XGBoost paper reported that 17 of the 29 winning solutions published on Kaggle's blog during 2015 used it. It didn't win because of magic. It won because it did four things that standard gradient boosting did not: regularisation, speed, missing-value handling, and second-order gradients.

What Makes XGBoost Different

🔒
Regularisation
L1 (alpha) + L2 (lambda)
XGBoost adds explicit L1 and L2 penalties on leaf weights to the objective function. This directly penalises complexity and prevents overfitting, something vanilla gradient boosting lacked.
🧮
Second-Order Gradients
Newton Boosting
Standard GB uses only the first-order gradient (slope). XGBoost also uses the Hessian (second-order / curvature), giving a more accurate approximation of the loss and faster convergence.
❓
Native Missing Values
Sparsity-Aware
XGBoost learns a default direction for each split; if a value is missing, the sample follows the learned default. No imputation step required (see the sketch after this list).
⚡
Column Subsampling
Like Random Forest
colsample_bytree, colsample_bylevel, colsample_bynode introduce randomness like Random Forest, reducing correlation between trees and controlling overfitting.
🌳
Approximate Tree Split
Weighted Quantile Sketch
For large datasets, XGBoost bins continuous features into quantile buckets before searching for the best split. This drastically cuts the number of candidate splits to evaluate, so massive datasets can be handled while retaining near-optimal splits.
💾
Cache-Aware & GPU
Built for Speed
XGBoost uses block-compressed data structures designed for CPU cache access patterns. Native GPU support (device='cuda') gives 10–50× speed-ups on large datasets.
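To see the sparsity-aware handling in action, here is a small sketch (synthetic data, arbitrary settings) that trains directly on a feature matrix containing NaNs, with no imputation step:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.2] = np.nan          # knock out ~20% of the values

clf = xgb.XGBClassifier(n_estimators=100, tree_method='hist')
clf.fit(X, y)                                   # NaNs follow each split's learned default direction
print(clf.predict_proba(X[:3]))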

The XGBoost Objective Function

Full Objective
Obj = Σ L(yᵢ, ŷᵢ) + Σ Ω(fₖ)
Loss term (measures fit) plus regularisation term (measures complexity). XGBoost minimises both simultaneously.
Regularisation Term
Ω(f) = γT + ½λ Σ wⱼ² + α Σ |wⱼ|
γ penalises the number of leaves T. λ is L2 on leaf weights. α is L1 on leaf weights. All three shrink tree complexity.
Optimal Leaf Weight
wⱼ* = −Gⱼ / (Hⱼ + λ)
Gⱼ = sum of first-order gradients in leaf j. Hⱼ = sum of Hessians. λ regularises the weight. Derived analytically.
Split Gain Formula
Gain = ½[GL²/(HL+λ) + GR²/(HR+λ) − G²/(H+λ)] − γ
Gain of splitting one node into left/right children, where G = GL + GR and H = HL + HR are the parent's sums. If Gain < 0 (i.e., the improvement is worse than the γ penalty), prune the split.
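The two closed-form expressions are easy to check by hand. The sketch below plugs made-up gradient and Hessian sums into them (purely illustrative; XGBoost computes these quantities internally during training):

def leaf_weight(G, H, lam=1.0):
    # w* = -G / (H + lambda): the analytic best weight for a leaf
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 [score(left) + score(right) - score(parent)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Hypothetical candidate split: gradient/Hessian sums over each child
print(leaf_weight(G=-3.0, H=2.0))                                   # ~ +1.0
print(split_gain(G_L=-3.0, H_L=2.0, G_R=2.8, H_R=3.4, gamma=0.1))   # positive, so keep the split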
🔑
Why the Hessian Matters

Standard gradient boosting just chases the gradient (steepest descent). XGBoost uses the second derivative (curvature) to take a smarter step, like Newton's method vs. plain gradient descent. This means XGBoost typically needs fewer trees to converge to the same accuracy.
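A toy one-dimensional example makes the difference concrete: minimising f(w) = (w − 3)², a plain gradient step with a small learning rate moves only part of the way, while a Newton step (gradient divided by curvature) lands on the minimum in one move.

# f(w) = (w - 3)^2, minimised at w = 3
def grad(w):  return 2 * (w - 3)      # first derivative (slope)
def hess(w):  return 2.0              # second derivative (curvature), constant here

w0, lr = 0.0, 0.1
w_gradient_step = w0 - lr * grad(w0)            # 0.6: a small step towards the minimum
w_newton_step   = w0 - grad(w0) / hess(w0)      # 3.0: exactly the minimum, in one step
print(w_gradient_step, w_newton_step)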

📊 DIAGRAM - XGBoost Tree Structure with Leaf Weights & Regularisation
[Diagram: an example risk-scoring tree splitting on age, income, and credit score. Each internal split is kept only because its gain exceeds γ, and each leaf carries an analytic weight w* = −G/(H+λ).]

Leaf weights w* = −G/(H+λ) are computed analytically. The γ parameter prunes splits whose gain does not exceed the leaf penalty, automatically controlling tree depth.


Section 05

Python Implementation - AdaBoost

Let's start with AdaBoost on a classic binary classification problem, predicting whether a bank customer will default on a loan, to see boosting in its simplest form.

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.datasets import make_classification

# ── Generate synthetic loan default dataset ───────────────
X, y = make_classification(
    n_samples=5000,
    n_features=15,
    n_informative=10,
    n_redundant=3,
    random_state=42,
    class_sep=0.8
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── Build AdaBoost with decision stumps (depth=1) ─────────
# SAMME is the multi-class generalisation of the original AdaBoost
base_estimator = DecisionTreeClassifier(max_depth=1)

ada = AdaBoostClassifier(
    estimator=base_estimator,
    n_estimators=200,       # number of stumps / boosting rounds
    learning_rate=0.5,      # shrinks each stump's contribution
    algorithm='SAMME',      # discrete boosting (original Freund & Schapire)
    random_state=42
)

# ── Cross-validation ──────────────────────────────────────
cv_auc = cross_val_score(ada, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV ROC-AUC: {cv_auc.mean():.4f} ยฑ {cv_auc.std():.4f}")

# ── Fit and evaluate ──────────────────────────────────────
ada.fit(X_train, y_train)
y_pred  = ada.predict(X_test)
y_proba = ada.predict_proba(X_test)[:, 1]

print(f"\nTest ROC-AUC : {roc_auc_score(y_test, y_proba):.4f}")
print(classification_report(y_test, y_pred))

# ── Inspect individual stump weights (alpha values) ───────
print("\nTop 5 stump weights (α):")
top_idx = np.argsort(ada.estimator_weights_)[::-1][:5]
for i in top_idx:
    print(f"  Stump {i:3d}: α = {ada.estimator_weights_[i]:.4f}")
OUTPUT
CV ROC-AUC: 0.9023 ± 0.0091

Test ROC-AUC : 0.9147
              precision    recall  f1-score   support
           0       0.87      0.89      0.88       502
           1       0.88      0.87      0.87       498
    accuracy                           0.88      1000

Top 5 stump weights (α):
  Stump   0: α = 0.8431
  Stump   7: α = 0.7215
  Stump  23: α = 0.6894
  Stump   4: α = 0.6512
  Stump  11: α = 0.6103

Section 06

Python Implementation - XGBoost End-to-End

Now we use the real workhorse. Below is a production-grade XGBoost pipeline on the classic Titanic survival problem, with feature engineering, early stopping, and a full hyperparameter description.

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score

# ── Load data ─────────────────────────────────────────────
df = pd.read_csv('titanic.csv')

# ── Feature engineering ───────────────────────────────────
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'] = df['Title'].replace(
    {'Mlle':'Miss', 'Ms':'Miss', 'Mme':'Mrs',
     'Lady':'Rare', 'Countess':'Rare', 'Capt':'Rare',
     'Col':'Rare', 'Don':'Rare', 'Dr':'Rare',
     'Major':'Rare', 'Rev':'Rare', 'Sir':'Rare',
     'Jonkheer':'Rare', 'Dona':'Rare'}
)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone']    = (df['FamilySize'] == 1).astype(int)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Label encode categoricals - XGBoost handles numeric features only
for col in ['Sex', 'Embarked', 'Title']:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
            'Fare', 'Embarked', 'Title', 'FamilySize', 'IsAlone']
X = df[features]
y = df['Survived']

# ── XGBoost model - annotated hyperparameters ─────────────
model = xgb.XGBClassifier(

    # ── BOOSTING STRUCTURE ────────────────────────────────
    n_estimators    = 500,      # max trees; use early stopping
    learning_rate   = 0.05,     # η: shrinks each tree's contribution
    max_depth       = 4,        # tree depth: 3-6 is typical

    # ── REGULARISATION ────────────────────────────────────
    reg_alpha       = 0.1,      # L1 on leaf weights (sparsity)
    reg_lambda      = 1.0,      # L2 on leaf weights (smoothing)
    gamma           = 0.05,     # min gain to make a split
    min_child_weight= 3,        # min Hessian sum in a leaf

    # ── RANDOMISATION (like Random Forest) ────────────────
    subsample       = 0.8,      # fraction of rows per tree
    colsample_bytree= 0.7,      # fraction of cols per tree
    colsample_bylevel=0.7,      # fraction of cols per depth level

    # ── PERFORMANCE ───────────────────────────────────────
    tree_method     = 'hist',   # fast histogram-based splits
    device          = 'cpu',    # change to 'cuda' for GPU
    n_jobs          = -1,        # use all CPU cores
    random_state    = 42,
    eval_metric     = 'auc',    # metric for early stopping
    early_stopping_rounds = 30  # stop if no improvement in 30 rounds
)

# ── Fit with a validation set for early stopping ──────────
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=50        # print eval every 50 rounds
)

print(f"\nBest iteration : {model.best_iteration}")
print(f"Best val AUC   : {model.best_score:.4f}")

# ── Cross-validated AUC ───────────────────────────────────
cv_auc = cross_val_score(
    xgb.XGBClassifier(n_estimators=model.best_iteration,
                       learning_rate=0.05, max_depth=4,
                       subsample=0.8, colsample_bytree=0.7,
                       tree_method='hist', n_jobs=-1),
    X, y, cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='roc_auc'
)
print(f"\nCV ROC-AUC: {cv_auc.mean():.4f} ยฑ {cv_auc.std():.4f}")

# ── Feature importance ────────────────────────────────────
importances = pd.Series(model.feature_importances_, index=features)
print("\nFeature Importance (weight):")
print(importances.sort_values(ascending=False).to_string())
OUTPUT
[0]     validation_0-auc: 0.82351
[50]    validation_0-auc: 0.88712
[100]   validation_0-auc: 0.89804
[150]   validation_0-auc: 0.90123
[174]   validation_0-auc: 0.90445   ← best
[204]   validation_0-auc: 0.90201   (30 rounds no improvement → stop)

Best iteration : 174
Best val AUC   : 0.9044

CV ROC-AUC: 0.8973 ± 0.0142

Feature Importance (weight):
Title         0.2841   ← social status / gender proxy
Sex           0.2103
Fare          0.1652
Age           0.1287
Pclass        0.0914
FamilySize    0.0531
Embarked      0.0341
IsAlone       0.0198
SibSp         0.0083
Parch         0.0050
🎯
Early Stopping - The Most Important XGBoost Trick

Always use early_stopping_rounds with a validation set. Without it, XGBoost will train all 500 trees and may overfit. Early stopping finds the optimal number of trees automatically, so no manual tuning of n_estimators is required. This single technique often gives 2–5% better generalisation.


Section 07

XGBoost for Regression

XGBoost works equally well for regression. You only need to change the objective parameter. Here we predict house prices.

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# ── Load California housing ───────────────────────────────
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target   # y = median house value ($100k)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.15, random_state=42
)

# ── XGBoost Regressor ─────────────────────────────────────
reg = xgb.XGBRegressor(
    objective         = 'reg:squarederror',  # MSE loss
    n_estimators      = 1000,
    learning_rate     = 0.04,
    max_depth         = 5,
    min_child_weight  = 5,
    subsample         = 0.8,
    colsample_bytree  = 0.8,
    reg_alpha         = 0.05,
    reg_lambda        = 1.5,
    gamma             = 0.1,
    tree_method       = 'hist',
    early_stopping_rounds = 40,
    eval_metric       = 'rmse',
    n_jobs            = -1,
    random_state      = 42
)

reg.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=100)

# ── Evaluate ──────────────────────────────────────────────
y_pred = reg.predict(X_test)

print(f"\nTest RMSE : {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"Test MAE  : {mean_absolute_error(y_test, y_pred):.4f}")
print(f"Test Rยฒ   : {r2_score(y_test, y_pred):.4f}")
print(f"Best iter : {reg.best_iteration}")
OUTPUT
[0]     validation_0-rmse: 1.08342
[100]   validation_0-rmse: 0.52187
[200]   validation_0-rmse: 0.45931
[300]   validation_0-rmse: 0.44218
[342]   validation_0-rmse: 0.43801   ← best
[382]   validation_0-rmse: 0.44009   (40 rounds → early stop)

Test RMSE : 0.4451
Test MAE  : 0.3122
Test R²   : 0.8394
Best iter : 342

Section 08

Hyperparameter Guide - The Complete Reference

XGBoost has dozens of hyperparameters. These are the ones that matter in practice, grouped by their purpose.

Parameter          | Default | Effect                                                                                 | Tune Direction
n_estimators       | 100     | Number of boosting rounds / trees                                                      | Set high, use early stopping
learning_rate (η)  | 0.3     | Shrinks each tree's contribution. Lower = more trees needed but better generalisation  | 0.01–0.1 for final model
max_depth          | 6       | Maximum depth of each tree. Deeper = more complex, higher overfit risk                 | 3–6 is the usual range
min_child_weight   | 1       | Minimum sum of Hessians in a leaf. Higher = more conservative splits                   | 1–10; increase if overfitting
gamma (γ)          | 0       | Minimum gain required to make a split. Prunes unprofitable splits                      | 0–5; increase if overfitting
subsample          | 1.0     | Row subsampling per tree. Like bagging inside boosting                                 | 0.6–0.9 reduces overfitting
colsample_bytree   | 1.0     | Feature subsampling per tree                                                           | 0.5–0.9; try 0.7 first
reg_alpha (α)      | 0       | L1 regularisation on leaf weights. Promotes sparsity                                   | 0, 0.01, 0.1, 1
reg_lambda (λ)     | 1       | L2 regularisation on leaf weights. Smooths weights                                     | 1, 2, 5; increase if overfit
scale_pos_weight   | 1       | For imbalanced classes: set to negative/positive ratio                                 | sum(neg) / sum(pos)
tree_method        | 'auto'  | Algorithm for building trees                                                           | 'hist' for speed; 'exact' for small data
🎓
The Tuning Order That Actually Works

1. Fix learning_rate=0.1 and find the optimal n_estimators via early stopping. 2. Tune max_depth + min_child_weight together. 3. Tune gamma. 4. Tune subsample + colsample_bytree. 5. Tune reg_alpha + reg_lambda. 6. Lower learning_rate and retrain with more trees. This staged approach avoids searching a 10-dimensional space blindly.
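A sketch of stage 1, assuming X and y are defined as in the earlier sections: fix the learning rate, set a deliberately high tree ceiling, and let early stopping choose n_estimators for you.

import xgboost as xgb
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=42, stratify=y)

stage1 = xgb.XGBClassifier(
    n_estimators=2000,          # high ceiling; early stopping decides the real count
    learning_rate=0.1,          # fixed while the structural parameters are tuned
    tree_method='hist',
    eval_metric='auc',
    early_stopping_rounds=50,
)
stage1.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Trees to carry into stages 2-5:", stage1.best_iteration)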


Section 09

Hyperparameter Tuning with Optuna

Manual tuning works but Optuna's Bayesian optimisation is faster and finds better combinations. This is how competition winners tune XGBoost.

import optuna
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

optuna.logging.set_verbosity(optuna.logging.WARNING)  # suppress output

def objective(trial):
    params = {
        'n_estimators'     : trial.suggest_int('n_estimators', 100, 800),
        'learning_rate'    : trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth'        : trial.suggest_int('max_depth', 2, 8),
        'min_child_weight' : trial.suggest_int('min_child_weight', 1, 10),
        'gamma'            : trial.suggest_float('gamma', 0, 5),
        'subsample'        : trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree' : trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'reg_alpha'        : trial.suggest_float('reg_alpha', 1e-8, 10, log=True),
        'reg_lambda'       : trial.suggest_float('reg_lambda', 1e-8, 10, log=True),
        'tree_method'      : 'hist',
        'n_jobs'           : -1,
        'random_state'     : 42
    }
    model = xgb.XGBClassifier(**params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=300)  # 5 minutes max

print(f"Best AUC   : {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# ── Retrain final model with best params ──────────────────
best = xgb.XGBClassifier(**study.best_params)
best.fit(X, y)
OUTPUT
Best AUC   : 0.9198
Best params: {
  'n_estimators': 412, 'learning_rate': 0.0387, 'max_depth': 5,
  'min_child_weight': 4, 'gamma': 0.312, 'subsample': 0.821,
  'colsample_bytree': 0.714, 'reg_alpha': 0.0023, 'reg_lambda': 2.415,
  'tree_method': 'hist'
}

Section 10

Feature Importance in XGBoost - Three Types

XGBoost offers three different ways to measure feature importance. They tell different stories and you should know the difference.

📊
weight
Split Count
How many times a feature is used to split across all trees. Fast to compute. Biased towards high-cardinality features, which can overstate their importance.
📈
gain
Average Gain
Average improvement in the objective function when a feature is used to split. More reliable than weight: it favours features that actually reduce the loss, not just those used frequently.
🔍
cover
Average Coverage
Average number of samples affected by splits on this feature. Good for understanding which features impact the most data points.
import matplotlib.pyplot as plt
from xgboost import plot_importance

# ── Three importance types ────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, imp_type in zip(axes, ['weight', 'gain', 'cover']):
    plot_importance(model, importance_type=imp_type, ax=ax,
                    title=f'Importance ({imp_type})')

plt.tight_layout()
plt.savefig('xgb_importance.png', dpi=150)

# ── Or get raw values as a dict ───────────────────────────
gain_scores = model.get_booster().get_score(importance_type='gain')
for feat, score in sorted(gain_scores.items(), key=lambda x: -x[1]):
    print(f"  {feat:15s}: {score:.2f}")

# ── SHAP values: most reliable importance ─────────────────
import shap
explainer   = shap.TreeExplainer(model)        # 'model' is the fitted XGBClassifier from Section 06
shap_values = explainer.shap_values(X_test)    # X_test: a held-out feature matrix
shap.summary_plot(shap_values, X_test, plot_type='bar')
⭐
Use SHAP for Production Feature Importance

Built-in XGBoost importance scores are useful quick checks, but they can be misleading for correlated features. SHAP values (SHapley Additive exPlanations) are game-theory-grounded, consistent, and show the direction of each feature's impact, not just its magnitude. Use SHAP for any model that goes into production or a client presentation.


Section 11

The Full Boosting Landscape - Visual Comparison

📊 DIAGRAM - The Boosting Algorithm Family Tree
[Diagram: ensemble learning branches into bagging (parallel, independent, reduces variance: Random Forest, Breiman 2001), boosting (sequential, corrective, reduces bias: AdaBoost, Freund & Schapire 1996 → Gradient Boosting, Friedman 1999 → XGBoost, Chen 2014; LightGBM, Microsoft 2017; CatBoost, Yandex 2017), and stacking (a meta-learner combines base models).]

XGBoost is the most widely used member of the gradient boosting family. LightGBM is faster on very large datasets (leaf-wise growth). CatBoost handles categorical features natively.
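For orientation, roughly equivalent classifiers in the three libraries look like this (a sketch assuming the lightgbm and catboost packages are installed; parameter names differ between libraries and the values are illustrative, not tuned):

import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier

xgb_model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05,
                              max_depth=6, tree_method='hist')

lgb_model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05,
                               num_leaves=63)          # leaf-wise growth: cap leaves rather than depth

cat_model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6,
                               verbose=0)              # pass cat_features=[...] to fit() for
                                                       # native categorical handling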


Section 12

Boosting vs Random Forest - When to Use Which

Dimension                  | Random Forest                               | XGBoost / Gradient Boosting
How trees are built        | Parallel, independent                       | Sequential, each corrects prior errors
What it reduces            | Variance                                    | Bias (and variance via regularisation)
Overfitting risk           | Low - bagging protects well                 | Higher - needs careful tuning
Training speed             | Fast (easily parallelised)                  | Slower (sequential by nature)
Hyperparameter sensitivity | Low - good defaults work                    | High - needs thoughtful tuning
Peak accuracy (tabular)    | Very good                                   | Typically highest on tabular data
Missing values             | Needs imputation                            | XGBoost handles natively
Interpretability           | Moderate (feature importance)               | Moderate (SHAP values)
Best choice when…          | Fast baseline, robust defaults, noisy data  | Maximising accuracy, competition setting

Section 13

Common Mistakes & How to Avoid Them

โŒ What People Do Wrong
Use default learning_rate=0.3 and n_estimators=100 without early stopping โ€” leads to undertrained or overtrained models.
Scale features before XGBoost โ€” unnecessary, trees are scale-invariant. Wastes time and can introduce bugs.
Tune n_estimators manually via grid search โ€” massively wasteful. Early stopping does this automatically.
Evaluate on training data โ€” XGBoost can memorise training data with high max_depth and no regularisation.
Use accuracy on imbalanced classification โ€” misleading metric. Use AUC-ROC or F1 instead.
โœ… What to Do Instead
Set learning_rate=0.05โ€“0.1, n_estimators=1000, and use early_stopping_rounds=30โ€“50 with a held-out validation set.
Feed raw numeric features directly. Only encode categoricals (XGBoost needs integers, not strings).
Set n_estimators high, use early stopping. The model stops itself at the optimal round.
Always evaluate with cross-validation or a held-out test set. Use OOB scores as a quick check.
Set scale_pos_weight = sum(negatives) / sum(positives) for imbalanced data. Evaluate with AUC-ROC.
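A minimal sketch of that last point, assuming y is the 0/1 target used in the earlier sections:

import numpy as np
import xgboost as xgb

neg, pos = np.sum(y == 0), np.sum(y == 1)
clf = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    scale_pos_weight=neg / pos,     # up-weights the minority (positive) class
    eval_metric='auc',              # rank-based metric, robust to class imbalance
    early_stopping_rounds=30,
    tree_method='hist',
)
# fit with an eval_set exactly as in Section 06, then report ROC-AUC rather than accuracy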

Section 14

Golden Rules

⚡ XGBoost & Boosting - Non-Negotiable Rules
1
Always use early stopping. Set n_estimators=1000 and early_stopping_rounds=30. The model will stop at the exact right number of trees. Never manually tune n_estimators; it's a waste of search budget.
2
Start with a low learning rate. learning_rate=0.05 with more trees almost always beats learning_rate=0.3 with fewer trees. Lower learning rates generalise better; they take smaller, safer steps.
3
Always add subsampling. subsample=0.8 and colsample_bytree=0.7 introduce randomness like Random Forest does, reducing correlation between trees and controlling overfitting with almost no accuracy cost.
4
Do not impute missing values before XGBoost. XGBoost learns the optimal direction for missing values natively. Imputing first can actually reduce accuracy by removing the signal contained in missingness patterns.
5
Use tree_method='hist' always. It is as accurate as 'exact' on practically all datasets and 10–100× faster on large data. There is no reason not to use it. Set it and forget it.
6
SHAP over built-in importance. Feature importance from get_score(importance_type='weight') is biased. Use shap.TreeExplainer for any model you present to stakeholders or use in production; it is consistent, complete, and shows feature directionality.
7
Start with Random Forest as your baseline. It is fast, robust, and almost always competitive with zero tuning. Move to XGBoost when you need that extra 2โ€“5% accuracy and are willing to invest in tuning. Many production systems run Random Forest permanently.