
Boosting & XGBoost

A deep-dive tutorial on ensemble boosting: from the intuition behind AdaBoost to the math, regularisation, and Python implementation of XGBoost, with visual diagrams, real-world stories, and annotated code examples.

Section 01

The Story That Explains Boosting

The Student Who Learned From Every Mistake
Imagine a student sitting three maths exams. After the first exam, the teacher marks only the questions the student got wrong, and the next lesson focuses entirely on those hard ones. After the second exam, the same process: find the new weak spots, drill them hard. By the third exam the student is formidable, not because they were naturally gifted, but because every iteration fixed the previous iteration's failures.

That is boosting in one paragraph. Instead of training many independent models (like Random Forest), boosting trains models sequentially: each new model concentrates on the examples the previous ensemble got wrong.

Boosting is a family of ensemble learning algorithms. Its unifying idea: combine many weak learners (models that are only slightly better than random guessing) into one strong learner by letting each weak learner fix what its predecessors could not. The three most important members of this family are:

📈
AdaBoost (1996)
Adaptive Boosting
Adjusts sample weights after each round. Misclassified samples get heavier weight so the next learner pays more attention to them. The final prediction is a weighted vote of all learners.
📉
Gradient Boosting (1999)
Gradient Descent in Function Space
Each new model fits the residual errors (pseudo-residuals / gradients) of the current ensemble. Works for any differentiable loss function: classification, regression, ranking.
⚡
XGBoost (2014)
Extreme Gradient Boosting
An optimised, regularised implementation of gradient boosting. Adds L1/L2 regularisation, handles missing data natively, supports GPU training, and dominates competitive ML to this day.
💡
Bagging vs Boosting - The Core Difference

Bagging (Random Forest) builds trees in parallel and combines their predictions. Each tree is independent. The goal is to reduce variance. Boosting builds trees sequentially, each correcting the last. The goal is to reduce bias. This makes boosting more powerful but also more prone to overfitting if not regularised.
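To make the contrast concrete, here is a minimal sketch (synthetic data, untuned settings) that trains one bagging ensemble and one boosting ensemble on the same problem:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: 200 deep trees grown independently and averaged (reduces variance)
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: 200 shallow trees grown sequentially, each one fitting what the
# ensemble so far still gets wrong (reduces bias)
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                learning_rate=0.1, random_state=0)

for name, model in [('Random Forest (bagging)', rf), ('Gradient Boosting', gb)]:
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"{name:25s} CV AUC = {scores.mean():.3f} ± {scores.std():.3f}")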


Section 02

AdaBoost - Where It All Started

AdaBoost (Adaptive Boosting) was introduced by Freund & Schapire in 1996 and won the Gödel Prize. It is the simplest boosting algorithm to understand, and understanding it unlocks everything that follows.

The Panel of Distracted Judges
A talent show has five judges, each half-asleep and only barely competent. After each performer, the show producers look at which contestants the judges got most wrong and make them the opening act of the next round, forcing the judges to pay close attention to them. Over time the judges collectively build a sharp picture of every contestant, especially the ambiguous cases. Their combined verdict is weighted by how reliable each judge proved to be. This is AdaBoost.

The AdaBoost Algorithm - Step by Step

01
Initialise Sample Weights
Give every training sample equal weight: wᵢ = 1/N, where N is the number of samples. Every sample matters equally in round one.
02
Train a Weak Learner
Fit a shallow decision tree (usually a stump with depth = 1) on the weighted data. The stump tries to classify samples, paying more attention to the heavily weighted ones.
03
Compute Weighted Error
Calculate the weighted error ε = Σ wᵢ over all misclassified samples i. A perfect stump gives ε = 0; random guessing gives ε = 0.5.
04
Compute Learner Weight (α)
α = 0.5 × ln((1 − ε) / ε). A stump with ε = 0.1 gets α ≈ 1.1 (strong vote). ε = 0.4 gets α ≈ 0.2 (weak vote). ε = 0.5 gets α = 0 (ignored).
05
Update Sample Weights
Misclassified samples get their weight multiplied by e^α (increased). Correct samples get their weight multiplied by e^(−α) (decreased). Re-normalise so the weights sum to 1.
06
Repeat & Combine
Repeat steps 2–5 for T rounds. Final prediction: F(x) = sign(Σ αₜ · hₜ(x)), a weighted majority vote of all T stumps.
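These six steps map almost line-for-line onto code. Below is a from-scratch sketch, assuming binary labels encoded as −1/+1 and borrowing scikit-learn's decision stumps; it mirrors the algorithm above rather than replacing AdaBoostClassifier.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Steps 1-6 above. y must contain -1/+1 labels."""
    N = len(y)
    w = np.full(N, 1 / N)                                # Step 1: equal sample weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)                 # Step 2: weak learner on weighted data
        pred = stump.predict(X)
        miss = pred != y
        eps = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)   # Step 3: weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)            # Step 4: learner weight
        w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))  # Step 5: up-/down-weight samples
        w = w / w.sum()                                  #         re-normalise to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)                                # Step 6: weighted majority vote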
📊 DIAGRAM - AdaBoost Weight Update Across 3 Rounds
[Diagram: three boosting rounds. Weights start equal at 1/N; after each round the misclassified samples are upweighted before the next stump is trained, and each stump receives a vote weight α based on its weighted error.]

Circle size represents sample weight. Misclassified samples (red) grow larger in the next round, forcing the next stump to focus on them.


Section 03

Gradient Boosting - The Generalisation

Gradient Boosting (Friedman, 1999) took AdaBoost's core idea and framed it as gradient descent in function space. Instead of reweighting samples, each new tree directly fits the residual errors (technically, the negative gradients of the loss function) of the current ensemble.

🧮
The Key Insight - Residuals as Pseudo-Labels

If your current ensemble predicts 72 for a house priced at 100, the residual is 28. Train the next tree to predict 28 (the error), not 100. Add that tree's prediction to the ensemble and now you predict 72 + 28 × η (where η is the learning rate). Repeat. Each tree corrects what remains. This is fitting residuals.

Gradient Boosting - The Algorithm

Initialisation
F₀(x) = argminᵧ Σ L(yᵢ, γ)
Start with the simplest model, usually the mean (regression) or the log-odds (classification).
Pseudo-Residuals (Gradient)
rᵢₘ = −[∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)], evaluated at F = Fₘ₋₁
Negative gradient of the loss w.r.t. the current predictions. For MSE loss, this is simply yᵢ − F(xᵢ).
Fit Weak Learner to Residuals
hₘ(x) = tree fit to {rᵢₘ}
Train a new shallow tree to predict the pseudo-residuals, not the original labels.
Update Ensemble
Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)
Add the new tree scaled by the learning rate η (0 < η ≤ 1). Small η = more trees needed, better generalisation.
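Here is a from-scratch sketch of these four steps for squared-error regression, the simplest case, where the pseudo-residuals reduce to y − F(x); the depths and rates are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, eta=0.1, max_depth=3):
    F0 = y.mean()                          # initialisation: a constant model
    F = np.full(len(y), F0, dtype=float)
    trees = []
    for _ in range(M):
        residuals = y - F                  # pseudo-residuals = negative gradient of MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # weak learner fits the residuals, not y
        F = F + eta * tree.predict(X)      # update the ensemble, shrunk by the learning rate
        trees.append(tree)
    return F0, trees

def gradient_boost_predict(X, F0, trees, eta=0.1):
    return F0 + eta * sum(t.predict(X) for t in trees)

Swapping the loss only changes the residual line; for absolute error, for example, the pseudo-residuals become sign(y − F(x)).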
📊 DIAGRAM - Gradient Boosting: Fitting Residuals Sequentially
[Diagram: initialise F₀(x) = mean(y); compute residuals r = y − F₀(x); fit a shallow tree h₁ to the residuals; update F₁ = F₀ + η·h₁; repeat M times to give F_M(x) = F₀ + η·h₁ + η·h₂ + … + η·h_M.]

Each iteration adds one new tree whose job is to predict the current mistakes of the ensemble. The learning rate η shrinks each tree's contribution to prevent overfitting.


Section 04

XGBoost - Extreme Gradient Boosting

How XGBoost Conquered Kaggle
In 2014, Tianqi Chen (then a PhD student at the University of Washington) released XGBoost as a research project. Within two years it had won or featured in the majority of Kaggle competition solutions; the 2016 XGBoost paper reported that 17 of the 29 winning solutions published on Kaggle's blog during 2015 used it. It didn't win because of magic. It won because it did four things that standard gradient boosting did not: regularisation, speed, missing-value handling, and second-order gradients.

What Makes XGBoost Different

🔒
Regularisation
L1 (alpha) + L2 (lambda)
XGBoost adds explicit L1 and L2 penalties on leaf weights to the objective function. This directly penalises complexity and prevents overfitting, something vanilla gradient boosting lacked.
🧮
Second-Order Gradients
Newton Boosting
Standard GB uses only the first-order gradient (slope). XGBoost also uses the Hessian (second-order / curvature), giving a more accurate approximation of the loss and faster convergence.
❓
Native Missing Values
Sparsity-Aware
XGBoost learns a default direction for each split; if a value is missing, the sample follows the learned default. No imputation step required (see the sketch after this list).
⚡
Column Subsampling
Like Random Forest
colsample_bytree, colsample_bylevel, colsample_bynode introduce randomness like Random Forest, reducing correlation between trees and controlling overfitting.
🌳
Approximate Tree Split
Weighted Quantile Sketch
For large datasets, XGBoost bins continuous features into quantile buckets before searching for the best split. This drastically cuts the number of candidate splits to evaluate, so massive datasets can be handled while retaining near-optimal splits.
💾
Cache-Aware & GPU
Built for Speed
XGBoost uses block-compressed data structures designed for CPU cache access patterns. Native GPU support (device='cuda') gives 10–50× speed-ups on large datasets.
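To see the sparsity-aware handling in action, here is a small sketch (synthetic data, arbitrary settings) that trains directly on a feature matrix containing NaNs, with no imputation step:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.2] = np.nan          # knock out ~20% of the values

clf = xgb.XGBClassifier(n_estimators=100, tree_method='hist')
clf.fit(X, y)                                   # NaNs follow each split's learned default direction
print(clf.predict_proba(X[:3]))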

The XGBoost Objective Function

Full Objective
Obj = Σ L(yᵢ, ŷᵢ) + Σ Ω(fₖ)
Loss term (measures fit) plus regularisation term (measures complexity). XGBoost minimises both simultaneously.
Regularisation Term
Ω(f) = γT + ½λ Σ wⱼ² + α Σ |wⱼ|
γ penalises the number of leaves T. λ is L2 on leaf weights. α is L1 on leaf weights. All three shrink tree complexity.
Optimal Leaf Weight
wⱼ* = −Gⱼ / (Hⱼ + λ)
Gⱼ = sum of first-order gradients in leaf j. Hⱼ = sum of Hessians. λ regularises the weight. Derived analytically.
Split Gain Formula
Gain = ½[GL²/(HL+λ) + GR²/(HR+λ) − G²/(H+λ)] − γ
Gain of splitting one node into left/right children, where G = GL + GR and H = HL + HR are the parent's sums. If Gain < 0 (i.e., the improvement is worse than the γ penalty), prune the split.
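The two closed-form expressions are easy to check by hand. The sketch below plugs made-up gradient and Hessian sums into them (purely illustrative; XGBoost computes these quantities internally during training):

def leaf_weight(G, H, lam=1.0):
    # w* = -G / (H + lambda): the analytic best weight for a leaf
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 [score(left) + score(right) - score(parent)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Hypothetical candidate split: gradient/Hessian sums over each child
print(leaf_weight(G=-3.0, H=2.0))                                   # ~ +1.0
print(split_gain(G_L=-3.0, H_L=2.0, G_R=2.8, H_R=3.4, gamma=0.1))   # positive, so keep the split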
🔑
Why the Hessian Matters

Standard gradient boosting just chases the gradient (steepest descent). XGBoost uses the second derivative (curvature) to take a smarter step, like Newton's method vs. plain gradient descent. This means XGBoost typically needs fewer trees to converge to the same accuracy.
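A toy one-dimensional example makes the difference concrete: minimising f(w) = (w − 3)², a plain gradient step with a small learning rate moves only part of the way, while a Newton step (gradient divided by curvature) lands on the minimum in one move.

# f(w) = (w - 3)^2, minimised at w = 3
def grad(w):  return 2 * (w - 3)      # first derivative (slope)
def hess(w):  return 2.0              # second derivative (curvature), constant here

w0, lr = 0.0, 0.1
w_gradient_step = w0 - lr * grad(w0)            # 0.6: a small step towards the minimum
w_newton_step   = w0 - grad(w0) / hess(w0)      # 3.0: exactly the minimum, in one step
print(w_gradient_step, w_newton_step)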

📊 DIAGRAM - XGBoost Tree Structure with Leaf Weights & Regularisation
[Diagram: an example risk-scoring tree splitting on age, income, and credit score. Each internal split is kept only because its gain exceeds γ, and each leaf carries an analytic weight w* = −G/(H+λ).]

Leaf weights w* = −G/(H+λ) are computed analytically. The γ parameter prunes splits whose gain does not exceed the leaf penalty, automatically controlling tree depth.


Section 05

Python Implementation - AdaBoost

Let's start with AdaBoost on a classic binary classification problem, predicting whether a bank customer will default on a loan, to see boosting in its simplest form.

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.datasets import make_classification

# ── Generate synthetic loan default dataset ───────────────
X, y = make_classification(
    n_samples=5000,
    n_features=15,
    n_informative=10,
    n_redundant=3,
    random_state=42,
    class_sep=0.8
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── Build AdaBoost with decision stumps (depth=1) ─────────
# SAMME is the multi-class generalisation of the original AdaBoost
base_estimator = DecisionTreeClassifier(max_depth=1)

ada = AdaBoostClassifier(
    estimator=base_estimator,
    n_estimators=200,       # number of stumps / boosting rounds
    learning_rate=0.5,      # shrinks each stump's contribution
    algorithm='SAMME',      # discrete boosting (original Freund & Schapire)
    random_state=42
)

# ── Cross-validation ──────────────────────────────────────
cv_auc = cross_val_score(ada, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV ROC-AUC: {cv_auc.mean():.4f} ยฑ {cv_auc.std():.4f}")

# ── Fit and evaluate ──────────────────────────────────────
ada.fit(X_train, y_train)
y_pred  = ada.predict(X_test)
y_proba = ada.predict_proba(X_test)[:, 1]

print(f"\nTest ROC-AUC : {roc_auc_score(y_test, y_proba):.4f}")
print(classification_report(y_test, y_pred))

# ── Inspect individual stump weights (alpha values) ───────
print("\nTop 5 stump weights (α):")
top_idx = np.argsort(ada.estimator_weights_)[::-1][:5]
for i in top_idx:
    print(f"  Stump {i:3d}: α = {ada.estimator_weights_[i]:.4f}")
OUTPUT
CV ROC-AUC: 0.9023 ± 0.0091

Test ROC-AUC : 0.9147
              precision    recall  f1-score   support
           0       0.87      0.89      0.88       502
           1       0.88      0.87      0.87       498
    accuracy                           0.88      1000

Top 5 stump weights (α):
  Stump   0: α = 0.8431
  Stump   7: α = 0.7215
  Stump  23: α = 0.6894
  Stump   4: α = 0.6512
  Stump  11: α = 0.6103

Section 06

Python Implementation - XGBoost End-to-End

Now we use the real workhorse. Below is a production-grade XGBoost pipeline on the classic Titanic survival problem, with feature engineering, early stopping, and a full hyperparameter description.

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score

# ── Load data ─────────────────────────────────────────────
df = pd.read_csv('titanic.csv')

# ── Feature engineering ───────────────────────────────────
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'] = df['Title'].replace(
    {'Mlle':'Miss', 'Ms':'Miss', 'Mme':'Mrs',
     'Lady':'Rare', 'Countess':'Rare', 'Capt':'Rare',
     'Col':'Rare', 'Don':'Rare', 'Dr':'Rare',
     'Major':'Rare', 'Rev':'Rare', 'Sir':'Rare',
     'Jonkheer':'Rare', 'Dona':'Rare'}
)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone']    = (df['FamilySize'] == 1).astype(int)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Label encode categoricals - XGBoost handles numeric features only
for col in ['Sex', 'Embarked', 'Title']:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
            'Fare', 'Embarked', 'Title', 'FamilySize', 'IsAlone']
X = df[features]
y = df['Survived']

# ── XGBoost model - annotated hyperparameters ─────────────
model = xgb.XGBClassifier(

    # ── BOOSTING STRUCTURE ────────────────────────────────
    n_estimators    = 500,      # max trees; use early stopping
    learning_rate   = 0.05,     # η: shrinks each tree's contribution
    max_depth       = 4,        # tree depth: 3-6 is typical

    # ── REGULARISATION ────────────────────────────────────
    reg_alpha       = 0.1,      # L1 on leaf weights (sparsity)
    reg_lambda      = 1.0,      # L2 on leaf weights (smoothing)
    gamma           = 0.05,     # min gain to make a split
    min_child_weight= 3,        # min Hessian sum in a leaf

    # ── RANDOMISATION (like Random Forest) ────────────────
    subsample       = 0.8,      # fraction of rows per tree
    colsample_bytree= 0.7,      # fraction of cols per tree
    colsample_bylevel=0.7,      # fraction of cols per depth level

    # ── PERFORMANCE ───────────────────────────────────────
    tree_method     = 'hist',   # fast histogram-based splits
    device          = 'cpu',    # change to 'cuda' for GPU
    n_jobs          = -1,        # use all CPU cores
    random_state    = 42,
    eval_metric     = 'auc',    # metric for early stopping
    early_stopping_rounds = 30  # stop if no improvement in 30 rounds
)

# ── Fit with a validation set for early stopping ──────────
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=50        # print eval every 50 rounds
)

print(f"\nBest iteration : {model.best_iteration}")
print(f"Best val AUC   : {model.best_score:.4f}")

# ── Cross-validated AUC ───────────────────────────────────
cv_auc = cross_val_score(
    xgb.XGBClassifier(n_estimators=model.best_iteration,
                       learning_rate=0.05, max_depth=4,
                       subsample=0.8, colsample_bytree=0.7,
                       tree_method='hist', n_jobs=-1),
    X, y, cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='roc_auc'
)
print(f"\nCV ROC-AUC: {cv_auc.mean():.4f} ยฑ {cv_auc.std():.4f}")

# ── Feature importance ────────────────────────────────────
importances = pd.Series(model.feature_importances_, index=features)
print("\nFeature Importance (weight):")
print(importances.sort_values(ascending=False).to_string())
OUTPUT
[0]     validation_0-auc: 0.82351
[50]    validation_0-auc: 0.88712
[100]   validation_0-auc: 0.89804
[150]   validation_0-auc: 0.90123
[174]   validation_0-auc: 0.90445   ← best
[204]   validation_0-auc: 0.90201   (30 rounds no improvement → stop)

Best iteration : 174
Best val AUC   : 0.9044

CV ROC-AUC: 0.8973 ± 0.0142

Feature Importance (weight):
Title         0.2841   ← social status / gender proxy
Sex           0.2103
Fare          0.1652
Age           0.1287
Pclass        0.0914
FamilySize    0.0531
Embarked      0.0341
IsAlone       0.0198
SibSp         0.0083
Parch         0.0050
🎯
Early Stopping - The Most Important XGBoost Trick

Always use early_stopping_rounds with a validation set. Without it, XGBoost will train all 500 trees and may overfit. Early stopping finds the optimal number of trees automatically, so no manual tuning of n_estimators is required. This single technique often gives 2–5% better generalisation.


Section 07

XGBoost for Regression

XGBoost works equally well for regression. You only need to change the objective parameter. Here we predict house prices.

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# ── Load California housing ───────────────────────────────
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target   # y = median house value ($100k)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.15, random_state=42
)

# ── XGBoost Regressor ─────────────────────────────────────
reg = xgb.XGBRegressor(
    objective         = 'reg:squarederror',  # MSE loss
    n_estimators      = 1000,
    learning_rate     = 0.04,
    max_depth         = 5,
    min_child_weight  = 5,
    subsample         = 0.8,
    colsample_bytree  = 0.8,
    reg_alpha         = 0.05,
    reg_lambda        = 1.5,
    gamma             = 0.1,
    tree_method       = 'hist',
    early_stopping_rounds = 40,
    eval_metric       = 'rmse',
    n_jobs            = -1,
    random_state      = 42
)

reg.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=100)

# ── Evaluate ──────────────────────────────────────────────
y_pred = reg.predict(X_test)

print(f"\nTest RMSE : {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"Test MAE  : {mean_absolute_error(y_test, y_pred):.4f}")
print(f"Test Rยฒ   : {r2_score(y_test, y_pred):.4f}")
print(f"Best iter : {reg.best_iteration}")
OUTPUT
[0]     validation_0-rmse: 1.08342
[100]   validation_0-rmse: 0.52187
[200]   validation_0-rmse: 0.45931
[300]   validation_0-rmse: 0.44218
[342]   validation_0-rmse: 0.43801   ← best
[382]   validation_0-rmse: 0.44009   (40 rounds → early stop)

Test RMSE : 0.4451
Test MAE  : 0.3122
Test R²   : 0.8394
Best iter : 342

Section 08

Hyperparameter Guide - The Complete Reference

XGBoost has dozens of hyperparameters. These are the ones that matter in practice, grouped by their purpose.

Parameter          | Default | Effect                                                                                 | Tune Direction
n_estimators       | 100     | Number of boosting rounds / trees                                                      | Set high, use early stopping
learning_rate (η)  | 0.3     | Shrinks each tree's contribution. Lower = more trees needed but better generalisation  | 0.01–0.1 for final model
max_depth          | 6       | Maximum depth of each tree. Deeper = more complex, higher overfit risk                 | 3–6 is the usual range
min_child_weight   | 1       | Minimum sum of Hessians in a leaf. Higher = more conservative splits                   | 1–10; increase if overfitting
gamma (γ)          | 0       | Minimum gain required to make a split. Prunes unprofitable splits                      | 0–5; increase if overfitting
subsample          | 1.0     | Row subsampling per tree. Like bagging inside boosting                                 | 0.6–0.9 reduces overfitting
colsample_bytree   | 1.0     | Feature subsampling per tree                                                           | 0.5–0.9; try 0.7 first
reg_alpha (α)      | 0       | L1 regularisation on leaf weights. Promotes sparsity                                   | 0, 0.01, 0.1, 1
reg_lambda (λ)     | 1       | L2 regularisation on leaf weights. Smooths weights                                     | 1, 2, 5; increase if overfit
scale_pos_weight   | 1       | For imbalanced classes: set to negative/positive ratio                                 | sum(neg) / sum(pos)
tree_method        | 'auto'  | Algorithm for building trees                                                           | 'hist' for speed; 'exact' for small data
🎓
The Tuning Order That Actually Works

1. Fix learning_rate=0.1 and find the optimal n_estimators via early stopping. 2. Tune max_depth + min_child_weight together. 3. Tune gamma. 4. Tune subsample + colsample_bytree. 5. Tune reg_alpha + reg_lambda. 6. Lower learning_rate and retrain with more trees. This staged approach avoids searching a 10-dimensional space blindly.
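A sketch of stage 1, assuming X and y are defined as in the earlier sections: fix the learning rate, set a deliberately high tree ceiling, and let early stopping choose n_estimators for you.

import xgboost as xgb
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=42, stratify=y)

stage1 = xgb.XGBClassifier(
    n_estimators=2000,          # high ceiling; early stopping decides the real count
    learning_rate=0.1,          # fixed while the structural parameters are tuned
    tree_method='hist',
    eval_metric='auc',
    early_stopping_rounds=50,
)
stage1.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Trees to carry into stages 2-5:", stage1.best_iteration)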


Section 09

Hyperparameter Tuning with Optuna

Manual tuning works but Optuna's Bayesian optimisation is faster and finds better combinations. This is how competition winners tune XGBoost.

import optuna
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

optuna.logging.set_verbosity(optuna.logging.WARNING)  # suppress output

def objective(trial):
    params = {
        'n_estimators'     : trial.suggest_int('n_estimators', 100, 800),
        'learning_rate'    : trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth'        : trial.suggest_int('max_depth', 2, 8),
        'min_child_weight' : trial.suggest_int('min_child_weight', 1, 10),
        'gamma'            : trial.suggest_float('gamma', 0, 5),
        'subsample'        : trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree' : trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'reg_alpha'        : trial.suggest_float('reg_alpha', 1e-8, 10, log=True),
        'reg_lambda'       : trial.suggest_float('reg_lambda', 1e-8, 10, log=True),
        'tree_method'      : 'hist',
        'n_jobs'           : -1,
        'random_state'     : 42
    }
    model = xgb.XGBClassifier(**params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=300)  # 5 minutes max

print(f"Best AUC   : {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# ── Retrain final model with best params ──────────────────
best = xgb.XGBClassifier(**study.best_params)
best.fit(X, y)
OUTPUT
Best AUC   : 0.9198
Best params: {
  'n_estimators': 412, 'learning_rate': 0.0387, 'max_depth': 5,
  'min_child_weight': 4, 'gamma': 0.312, 'subsample': 0.821,
  'colsample_bytree': 0.714, 'reg_alpha': 0.0023, 'reg_lambda': 2.415,
  'tree_method': 'hist'
}

Section 10

Feature Importance in XGBoost - Three Types

XGBoost offers three different ways to measure feature importance. They tell different stories and you should know the difference.

📊
weight
Split Count
How many times a feature is used to split across all trees. Fast to compute. Biased towards high-cardinality features, which can overstate their importance.
📈
gain
Average Gain
Average improvement in the objective function when a feature is used to split. More reliable than weight: it favours features that actually reduce the loss, not just those used frequently.
🔍
cover
Average Coverage
Average number of samples affected by splits on this feature. Good for understanding which features impact the most data points.
import matplotlib.pyplot as plt
from xgboost import plot_importance

# ── Three importance types ────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, imp_type in zip(axes, ['weight', 'gain', 'cover']):
    plot_importance(model, importance_type=imp_type, ax=ax,
                    title=f'Importance ({imp_type})')

plt.tight_layout()
plt.savefig('xgb_importance.png', dpi=150)

# ── Or get raw values as a dict ───────────────────────────
gain_scores = model.get_booster().get_score(importance_type='gain')
for feat, score in sorted(gain_scores.items(), key=lambda x: -x[1]):
    print(f"  {feat:15s}: {score:.2f}")

# ── SHAP values: most reliable importance ─────────────────
import shap
explainer   = shap.TreeExplainer(model)        # 'model' is the fitted XGBClassifier from Section 06
shap_values = explainer.shap_values(X_test)    # X_test: a held-out feature matrix
shap.summary_plot(shap_values, X_test, plot_type='bar')
⭐
Use SHAP for Production Feature Importance

Built-in XGBoost importance scores are useful quick checks, but they can be misleading for correlated features. SHAP values (SHapley Additive exPlanations) are game-theory-grounded, consistent, and show the direction of each feature's impact, not just its magnitude. Use SHAP for any model that goes into production or a client presentation.


Section 11

The Full Boosting Landscape - Visual Comparison

📊 DIAGRAM - The Boosting Algorithm Family Tree
[Diagram: ensemble learning branches into bagging (parallel, independent, reduces variance: Random Forest, Breiman 2001), boosting (sequential, corrective, reduces bias: AdaBoost, Freund & Schapire 1996 → Gradient Boosting, Friedman 1999 → XGBoost, Chen 2014; LightGBM, Microsoft 2017; CatBoost, Yandex 2017), and stacking (a meta-learner combines base models).]

XGBoost is the most widely used member of the gradient boosting family. LightGBM is faster on very large datasets (leaf-wise growth). CatBoost handles categorical features natively.
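For orientation, roughly equivalent classifiers in the three libraries look like this (a sketch assuming the lightgbm and catboost packages are installed; parameter names differ between libraries and the values are illustrative, not tuned):

import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier

xgb_model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05,
                              max_depth=6, tree_method='hist')

lgb_model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05,
                               num_leaves=63)          # leaf-wise growth: cap leaves rather than depth

cat_model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6,
                               verbose=0)              # pass cat_features=[...] to fit() for
                                                       # native categorical handling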


Section 12

Boosting vs Random Forest - When to Use Which

Dimension                  | Random Forest                               | XGBoost / Gradient Boosting
How trees are built        | Parallel, independent                       | Sequential, each corrects prior errors
What it reduces            | Variance                                    | Bias (and variance via regularisation)
Overfitting risk           | Low - bagging protects well                 | Higher - needs careful tuning
Training speed             | Fast (easily parallelised)                  | Slower (sequential by nature)
Hyperparameter sensitivity | Low - good defaults work                    | High - needs thoughtful tuning
Peak accuracy (tabular)    | Very good                                   | Typically highest on tabular data
Missing values             | Needs imputation                            | XGBoost handles natively
Interpretability           | Moderate (feature importance)               | Moderate (SHAP values)
Best choice when…          | Fast baseline, robust defaults, noisy data  | Maximising accuracy, competition setting

Section 13

Common Mistakes & How to Avoid Them

โŒ What People Do Wrong
Use default learning_rate=0.3 and n_estimators=100 without early stopping โ€” leads to undertrained or overtrained models.
Scale features before XGBoost โ€” unnecessary, trees are scale-invariant. Wastes time and can introduce bugs.
Tune n_estimators manually via grid search โ€” massively wasteful. Early stopping does this automatically.
Evaluate on training data โ€” XGBoost can memorise training data with high max_depth and no regularisation.
Use accuracy on imbalanced classification โ€” misleading metric. Use AUC-ROC or F1 instead.
โœ… What to Do Instead
Set learning_rate=0.05โ€“0.1, n_estimators=1000, and use early_stopping_rounds=30โ€“50 with a held-out validation set.
Feed raw numeric features directly. Only encode categoricals (XGBoost needs integers, not strings).
Set n_estimators high, use early stopping. The model stops itself at the optimal round.
Always evaluate with cross-validation or a held-out test set. Use OOB scores as a quick check.
Set scale_pos_weight = sum(negatives) / sum(positives) for imbalanced data. Evaluate with AUC-ROC.
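A minimal sketch of that last point, assuming y is the 0/1 target used in the earlier sections:

import numpy as np
import xgboost as xgb

neg, pos = np.sum(y == 0), np.sum(y == 1)
clf = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    scale_pos_weight=neg / pos,     # up-weights the minority (positive) class
    eval_metric='auc',              # rank-based metric, robust to class imbalance
    early_stopping_rounds=30,
    tree_method='hist',
)
# fit with an eval_set exactly as in Section 06, then report ROC-AUC rather than accuracy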

Section 14

Golden Rules

⚡ XGBoost & Boosting - Non-Negotiable Rules
1
Always use early stopping. Set n_estimators=1000 and early_stopping_rounds=30. The model will stop at the exact right number of trees. Never manually tune n_estimators; it's a waste of search budget.
2
Start with a low learning rate. learning_rate=0.05 with more trees almost always beats learning_rate=0.3 with fewer trees. Lower learning rates generalise better; they take smaller, safer steps.
3
Always add subsampling. subsample=0.8 and colsample_bytree=0.7 introduce randomness like Random Forest does, reducing correlation between trees and controlling overfitting with almost no accuracy cost.
4
Do not impute missing values before XGBoost. XGBoost learns the optimal direction for missing values natively. Imputing first can actually reduce accuracy by removing the signal contained in missingness patterns.
5
Use tree_method='hist' always. It is as accurate as 'exact' on practically all datasets and 10–100× faster on large data. There is no reason not to use it. Set it and forget it.
6
SHAP over built-in importance. Feature importance from get_score(importance_type='weight') is biased. Use shap.TreeExplainer for any model you present to stakeholders or use in production; it is consistent, complete, and shows feature directionality.
7
Start with Random Forest as your baseline. It is fast, robust, and almost always competitive with zero tuning. Move to XGBoost when you need that extra 2โ€“5% accuracy and are willing to invest in tuning. Many production systems run Random Forest permanently.