
Decision Tree

Learn Decision Trees from scratch with intuitive visuals, real-world banking examples, step-by-step Gini and Entropy calculations, pruning techniques, hyperparameter tuning, and complete Python implementations using scikit-learn. This guide explains how trees make decisions, avoid overfitting, and power advanced models like Random Forest and XGBoost.

Section 01

The Story: Ramesh the Loan Officer

When a Human Brain Becomes an Algorithm
Ramesh is a senior loan officer at a Mumbai bank. After 20 years of experience, he can approve or reject a home loan application in under 5 minutes. When a junior analyst asks him how, he says:

"First I check the credit score. If it's above 700, I look at their income. Income over 50K? Approved. Below 50K? Rejected. If credit score is low, I check employment status. Not employed? Rejected. Employed but debt ratio over 40%? Rejected. Employed and debt ratio fine? Approved."

The analyst stares at him. "That's... a decision tree."

Ramesh had been running a Decision Tree algorithm in his head for 20 years without knowing it. Machine learning learns these questions and thresholds automatically from data.

Section 02

What Is a Decision Tree?

A Decision Tree is a supervised learning algorithm that models decisions as a flowchart-like tree structure. It learns a hierarchy of if-else questions from training data.

Interpretable
📖
  • You can read every decision
  • No black box: full transparency
  • Regulators and managers love it
  • Easy to visualise and debug
Non-Parametric
🔀
  • No distribution assumptions
  • Handles non-linear boundaries
  • Works with mixed data types
  • No feature scaling required
Foundation
🌲
  • Base of Random Forests
  • Base of XGBoost / LightGBM
  • Used in medical diagnosis
  • Core of ensemble methods

Section 03

Anatomy of a Decision Tree

🌳 Ramesh's Loan Approval Decision Tree

[Figure: flowchart of the loan-approval tree]
Depth 0 - ROOT NODE: Credit Score >= 700?
  YES -> Depth 1 - DECISION NODE: Annual Income >= 50K?
           YES -> APPROVED (leaf) | NO -> REJECTED (leaf)
  NO  -> Depth 1 - DECISION NODE: Currently Employed?
           NO  -> REJECTED (leaf)
           YES -> Depth 2 - DECISION NODE: Debt Ratio <= 40%?
                    YES -> APPROVED (leaf, depth 3) | NO -> REJECTED (leaf, depth 3)

Purple = Root node. Blue = Internal decision nodes. Green = Approved leaf nodes. Red = Rejected leaf nodes. Each path from root to leaf is one complete decision rule.

Term | Definition | Example in Ramesh's Tree
Root Node | The topmost node: the best initial split for the whole dataset | "Credit Score >= 700?"
Decision Node | An internal node that tests a feature and branches into sub-trees | "Annual Income >= 50K?", "Currently Employed?", "Debt Ratio <= 40%?"
Leaf Node | A terminal node that outputs the final prediction | APPROVED or REJECTED
Branch / Edge | The connection between nodes, labelled YES or NO | The arrows labelled YES / NO between nodes
Depth | Distance from root to a node; max depth = tree height | Root=0, Income node=1, Debt node=2, final leaves=3
Split | The threshold chosen to divide data at a node | Credit >= 700, Income >= 50K, Debt <= 40%
Pruning | Removing branches to prevent overfitting | Removing the Debt Ratio node if it doesn't help generalisation

Section 04

Real Worked Example: The "Play Tennis" Dataset

This is the canonical dataset used to teach ID3, the original decision tree algorithm by Ross Quinlan (1986). The task: given weather conditions, should we play tennis today? We have 14 days of historical data with four weather features and the outcome.

Day | Outlook  | Temp | Humidity | Wind   | Play Tennis?
D1  | Sunny    | Hot  | High     | Weak   | No
D2  | Sunny    | Hot  | High     | Strong | No
D3  | Overcast | Hot  | High     | Weak   | Yes
D4  | Rain     | Mild | High     | Weak   | Yes
D5  | Rain     | Cool | Normal   | Weak   | Yes
D6  | Rain     | Cool | Normal   | Strong | No
D7  | Overcast | Cool | Normal   | Strong | Yes
D8  | Sunny    | Mild | High     | Weak   | No
D9  | Sunny    | Cool | Normal   | Weak   | Yes
D10 | Rain     | Mild | Normal   | Weak   | Yes
D11 | Sunny    | Mild | Normal   | Strong | Yes
D12 | Overcast | Mild | High     | Strong | Yes
D13 | Overcast | Hot  | Normal   | Weak   | Yes
D14 | Rain     | Mild | High     | Strong | No

The dataset has 9 Yes and 5 No, written as S = [9+, 5-]. ID3 must figure out which of the four features creates the most useful first split by calculating Information Gain for each.

💡
Parent Entropy: The Starting Point

Before any split, S = [9+, 5-] has entropy:
H(S) = -(9/14).log2(9/14) - (5/14).log2(5/14) = 0.940 bits
Every split's Information Gain is measured as a reduction from 0.940. Maximum binary entropy is 1.0; we're close because 9:5 isn't far from equal.
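
A quick way to verify this number: a minimal sketch using only Python's standard library (the variable names are ours).

import math

# Parent node S = [9+, 5-]: 9 Yes and 5 No out of 14 days
p_yes, p_no = 9/14, 5/14
H_S = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(f"H(S) = {H_S:.3f} bits")   # 0.940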


Section 05

Entropy and Information Gain - Choosing the Root Split

Entropy of a Node
H = -Sum(pi . log2(pi))
pi = proportion of samples of class i. Range: 0 (pure) to log2(n) for n classes. Binary max = 1.0 at 50/50 split.
Information Gain
IG = H(parent) - Sum((|Sv|/|S|) . H(Sv))
The drop in entropy after splitting. Pick the attribute with the highest IG: this is the most informative split available.

Attribute: Outlook - Gain = 0.246 bits (WINNER)

Outlook creates 3 groups: Sunny (D1, D2, D8, D9, D11), Overcast (D3, D7, D12, D13), Rain (D4, D5, D6, D10, D14).

Gain Calculation - Outlook
Sunny (5 days)
[2+, 3-]: p(Yes)=2/5, p(No)=3/5
H(Sunny) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits
Overcast (4 days)
[4+, 0-]: p(Yes)=1, p(No)=0
H(Overcast) = 0.000 bits (perfectly pure: always Yes!)
Rain (5 days)
[3+, 2-]: p(Yes)=3/5, p(No)=2/5
H(Rain) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits
Weighted
H(after) = (5/14) x 0.971 + (4/14) x 0.000 + (5/14) x 0.971 = 0.694 bits
Gain
0.940 - 0.694 = 0.246 bits (HIGHEST: becomes the Root Node)

Information Gain Summary - All Four Attributes
Attribute | Groups Created                                | Information Gain | Selected?
Outlook   | Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-] | 0.246 bits       | Root Node
Humidity  | High [3+,4-], Normal [6+,1-]                  | 0.151 bits       | 2nd Best
Wind      | Strong [3+,3-], Weak [6+,2-]                  | 0.048 bits       | 3rd
Temp      | Hot [2+,2-], Mild [4+,2-], Cool [3+,1-]       | 0.029 bits       | 4th (never used!)
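
The whole summary table can be reproduced in a few lines of plain Python. A minimal sketch, assuming the 14-day dataset from Section 04; the entropy and information_gain helper names are ours, not lecture code.

from collections import Counter
import math

# (Outlook, Temp, Humidity, Wind, Play) for days D1..D14, as in Section 04
days = [
    ('Sunny','Hot','High','Weak','No'),          ('Sunny','Hot','High','Strong','No'),
    ('Overcast','Hot','High','Weak','Yes'),      ('Rain','Mild','High','Weak','Yes'),
    ('Rain','Cool','Normal','Weak','Yes'),       ('Rain','Cool','Normal','Strong','No'),
    ('Overcast','Cool','Normal','Strong','Yes'), ('Sunny','Mild','High','Weak','No'),
    ('Sunny','Cool','Normal','Weak','Yes'),      ('Rain','Mild','Normal','Weak','Yes'),
    ('Sunny','Mild','Normal','Strong','Yes'),    ('Overcast','Mild','High','Strong','Yes'),
    ('Overcast','Hot','Normal','Weak','Yes'),    ('Rain','Mild','High','Strong','No'),
]
FEATURES = ['Outlook', 'Temp', 'Humidity', 'Wind']

def entropy(labels):
    """H = -Sum(pi . log2(pi)) over the class proportions in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, feature_idx):
    """Parent entropy minus the size-weighted entropy of each child group."""
    parent = entropy([r[-1] for r in rows])
    weighted = 0.0
    for value in {r[feature_idx] for r in rows}:
        child = [r[-1] for r in rows if r[feature_idx] == value]
        weighted += (len(child) / len(rows)) * entropy(child)
    return parent - weighted

for idx, name in enumerate(FEATURES):
    print(f"Gain(S, {name:<8}) = {information_gain(days, idx):.3f} bits")
# Outlook has the highest gain and becomes the root; Temp comes last (~0.03 bits)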

Section 06

Recursion: Sunny Branch uses Humidity

The Sunny subset (D1, D2, D8, D9, D11) has [2+, 3-], H = 0.971 bits. Recursing on the 3 remaining attributes:

Gain on Sunny Subset
Humidity
High=[0+,3-] H=0.0 (pure No!) | Normal=[2+,0-] H=0.0 (pure Yes!)
Gain = 0.971 - (3/5 x 0 + 2/5 x 0) = 0.971 bits (WINNER: a perfect split)
Temp
Hot=[0+,2-], Mild=[1+,1-], Cool=[1+,0-]
Gain(Sunny, Temp) = 0.971 - 0.400 = 0.570 bits
Wind
Strong=[1+,1-], Weak=[1+,2-]
Gain(Sunny, Wind) = 0.971 - 0.951 = 0.019 bits
💧
Humidity wins, with both child nodes perfectly pure!

High Humidity: D1, D2, D8 all said No (no tennis when it's sunny and humid). Normal Humidity: D9, D11 both said Yes. Both child nodes are pure leaves, so no further splitting is needed on this branch.


Section 07

Recursion: Rain Branch uses Wind

The Rain subset (D4, D5, D6, D10, D14) has [3+, 2-], H = 0.971 bits.

Gain on Rain Subset
Wind
Strong=[0+,2-] H=0.0 (pure No!) | Weak=[3+,0-] H=0.0 (pure Yes!)
Gain(Rain, Wind) = 0.971 - 0 = 0.971 bits (WINNER: a perfect split)
Humidity
High=[1+,1-], Normal=[2+,1-]
Gain(Rain, Humidity) = 0.019 bits
Temp
Mild=[2+,1-], Cool=[1+,1-]
Gain(Rain, Temp) = 0.019 bits
💨
Wind wins: Rain + Strong Wind = No, Rain + Weak Wind = Yes

D6 and D14 never played in rain with strong wind. D4, D5, D10 always played in rain with weak wind. The tree is now complete: all 5 leaf nodes are pure. The Temp feature was never used at all.
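
The recursion is the same calculation applied to each subset. Continuing the sketch from Section 05 (it reuses days, FEATURES, and information_gain defined there), the Sunny and Rain branch winners fall out directly:

# Continuation of the Section 05 sketch: recurse on the Sunny and Rain subsets
sunny = [r for r in days if r[0] == 'Sunny']   # D1, D2, D8, D9, D11
rain  = [r for r in days if r[0] == 'Rain']    # D4, D5, D6, D10, D14

for branch_name, subset in [('Sunny', sunny), ('Rain', rain)]:
    print(f"--- {branch_name} branch ---")
    for idx, feat in enumerate(FEATURES[1:], start=1):   # Outlook is already used
        print(f"Gain({branch_name}, {feat:<8}) = {information_gain(subset, idx):.3f} bits")
# Humidity wins on the Sunny branch and Wind wins on the Rain branch (0.971 bits each)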


Section 08

The Final Decision Tree: Play Tennis

Complete ID3 Tree - Play Tennis Dataset

[Figure: the final tree]
Depth 0 - ROOT: Outlook (Gain = 0.246 bits, best of 4)
  Sunny    -> Depth 1: Humidity (Gain = 0.971 on the Sunny subset)
                High   -> NO  [0+, 3-] (D1, D2, D8)
                Normal -> YES [2+, 0-] (D9, D11)
  Overcast -> YES (always) [4+, 0-], pure!
  Rain     -> Depth 1: Wind (Gain = 0.971 on the Rain subset)
                Strong -> NO  [0+, 2-] (D6, D14)
                Weak   -> YES [3+, 0-] (D4, D5, D10)
IF Overcast: Yes | IF Sunny + High Humidity: No | IF Sunny + Normal Humidity: Yes | IF Rain + Strong Wind: No | IF Rain + Weak Wind: Yes

The complete ID3 tree has max depth 2 and 5 leaf nodes, all pure (Gini = 0). It perfectly classifies all 14 training examples. Notice that Temp is never used: it had the lowest information gain (0.029 bits) and was outcompeted at every node.
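
Because the finished model is just a hierarchy of if-else questions, the whole tree can be written out by hand. A minimal sketch of the final rules as plain Python (the play_tennis function name is ours):

def play_tennis(outlook, humidity, wind):
    """The complete ID3 tree above, written as nested if/else rules."""
    if outlook == 'Overcast':
        return 'Yes'                                     # [4+, 0-] pure leaf
    if outlook == 'Sunny':
        return 'Yes' if humidity == 'Normal' else 'No'   # Humidity split
    return 'Yes' if wind == 'Weak' else 'No'             # Rain branch: Wind split

print(play_tennis('Sunny', 'High', 'Weak'))      # No  (D1)
print(play_tennis('Overcast', 'High', 'Weak'))   # Yes (D3)
print(play_tennis('Rain', 'High', 'Strong'))     # No  (D14)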

Entropy and Gini Ranges for Multiple Classes

A key insight from the Play Tennis example: entropy maximum depends on the number of classes. This matters when comparing problems.

Number of Classes (n)    | Max Entropy = log2(n) | Max Gini = (n-1)/n
2 (Binary - Play Tennis) | 1.000                 | 0.500
3                        | 1.585                 | 0.667
4                        | 2.000                 | 0.750
5                        | 2.322                 | 0.800
10                       | 3.322                 | 0.900
📐
Three key properties of entropy (from lecture)

1. Always non-negative (minimum = 0, when a node is pure).
2. Maximum when all classes are equally likely: the system is most uncertain.
3. Increases with the number of classes: a 10-class problem can have entropy up to 3.322 bits, far higher than binary's 1.0 ceiling.

🔬
Maximum Gini for 3 classes: worked example from slides

If p1 = p2 = p3 = 1/3 (all three classes equally likely):
Gini = 1 - ((1/3)^2 + (1/3)^2 + (1/3)^2) = 1 - (3 x 1/9) = 1 - 1/3 = 2/3 = 0.667
This matches the formula (n-1)/n = 2/3 exactly. Both Gini and Entropy reach their maximum when classes are perfectly balanced, because then you have no information about the correct answer.
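
Both columns of the table above follow directly from the formulas, so the table can be regenerated in a couple of lines; a quick sketch:

import math

# Maximum entropy and Gini occur when all n classes are equally likely
for n in [2, 3, 4, 5, 10]:
    max_entropy = math.log2(n)     # H = -n * (1/n) * log2(1/n) = log2(n)
    max_gini    = (n - 1) / n      # Gini = 1 - n * (1/n)^2
    print(f"n = {n:>2}   max entropy = {max_entropy:.3f} bits   max Gini = {max_gini:.3f}")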


Section 09

Gini Impurity vs Entropy - Full Comparison

Gini Impurity (CART)
Property          | Detail
Algorithm         | CART (sklearn default)
Formula           | 1 - Sum(pi^2)
Range (binary)    | 0 to 0.5
Range (n classes) | 0 to (n-1)/n
Computation       | Faster: no logarithm
Sensitivity       | Less sensitive to class distribution

Entropy / Information Gain (ID3, C4.5)
Property          | Detail
Algorithm         | ID3, C4.5
Formula           | -Sum(pi . log2(pi))
Range (binary)    | 0 to 1.0
Range (n classes) | 0 to log2(n)
Computation       | Slower: one log operation per class
Sensitivity       | More sensitive: better for skewed distributions
🎯
The Verdict

Gini is faster and the sklearn default: use it unless you have a reason not to. Entropy is more sensitive to class probabilities, which makes it better for skewed datasets. For binary classification like Play Tennis, they produce nearly identical trees. For large multi-class problems, entropy's extra sensitivity to small probability differences can matter.
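
A simple way to see how small the difference usually is: fit the same tree twice, once per criterion, and compare structure and accuracy. A minimal sketch using sklearn's bundled iris dataset (chosen only for convenience; it is not from the lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for criterion in ['gini', 'entropy']:
    dt = DecisionTreeClassifier(criterion=criterion, random_state=42)
    dt.fit(X_train, y_train)
    print(f"{criterion:>8}: depth={dt.get_depth()}  leaves={dt.get_n_leaves()}  "
          f"test acc={dt.score(X_test, y_test):.3f}")
# The two trees typically end up near-identical in both structure and accuracy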


Section 10

How the CART Algorithm Builds the Tree

01
Start at Root β€” Try All Possible Splits
For every feature, try every possible threshold. Numeric: sort values, try midpoints. Categorical (like Outlook): try each value. Compute Gini Gain or Information Gain for every candidate split across all features (a minimal split-search sketch follows this list).
02
Choose the Best Split
Select the feature+threshold with the highest Gain. This becomes the current node. In Play Tennis: Outlook won at depth 0, Humidity won on the Sunny branch, Wind won on the Rain branch.
03
Recurse on Each Child Node
Repeat steps 1-2 independently on each child subset. Each child becomes the root of its own sub-tree. The Play Tennis tree ran three rounds of this recursion before all branches were resolved.
04
Stop When a Stopping Criterion Is Met
Stop when: (a) all samples in a node belong to one class, i.e. Gini = 0 (like the Overcast branch!), or (b) fewer than min_samples_split remain, or (c) max_depth is reached, or (d) no split improves impurity. The node becomes a leaf predicting the majority class.
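
A minimal sketch of steps 01-02 for a single numeric feature, using Gini gain; the helper names and the toy credit-score data are illustrative, not from the lecture:

import numpy as np

def gini(y):
    """Gini impurity: 1 - Sum(pi^2)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_numeric_split(x, y):
    """Try every midpoint between sorted unique values; keep the best Gini gain."""
    parent = gini(y)
    best_gain, best_thr = 0.0, None
    values = np.sort(np.unique(x))
    for lo, hi in zip(values[:-1], values[1:]):
        thr = (lo + hi) / 2                       # candidate threshold = midpoint
        left, right = y[x <= thr], y[x > thr]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - weighted > best_gain:
            best_gain, best_thr = parent - weighted, thr
    return best_thr, best_gain

# Toy data: 8 applicants' credit scores and approve(1)/reject(0) labels
scores = np.array([580, 620, 660, 690, 710, 730, 760, 800])
labels = np.array([  0,   0,   0,   0,   1,   1,   1,   1])
thr, gain = best_numeric_split(scores, labels)
print(f"best threshold = {thr}, Gini gain = {gain}")   # 700.0 and 0.5: a perfect split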

Section 11

Overfitting and Pruning

Left unconstrained, a tree grows until every leaf is pure, memorising every training sample including noise. The Play Tennis tree happened to generalise perfectly because the dataset is clean. Real datasets have noise, and deep trees overfit badly.

Tree Configuration    | Train Accuracy | Test Accuracy | Verdict
max_depth=2 (shallow) | 87%            | 86%           | Slight underfitting
max_depth=4 (tuned)   | 93%            | 90%           | Good fit
max_depth=None (full) | 100%           | 82%           | Severe overfitting
⚠️
The 100% training accuracy trap

A fully grown tree that achieves 100% training accuracy is always overfitting on real-world noisy data. An 18-point gap between train and test accuracy (as shown above) means the model memorised noise rather than learned the true pattern. Always evaluate on a held-out test set, never just on training data.
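
In sklearn, post-pruning is controlled by ccp_alpha. A minimal sketch of cost-complexity pruning on synthetic noisy data (the make_classification settings are arbitrary and purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so the full tree visibly overfits
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pruning path lists the alpha values at which subtrees get collapsed
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::5]:                 # sample every 5th alpha
    dt = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={dt.get_n_leaves():>3}  "
          f"train={dt.score(X_train, y_train):.3f}  test={dt.score(X_test, y_test):.3f}")
# Moderate alphas shrink the tree and usually close the train/test gap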


Section 12

Key Hyperparameters

Hyperparameter    | Default     | Effect                             | Too Low              | Too High
max_depth         | None (full) | Max levels from root to leaf       | Underfitting         | Overfitting
min_samples_split | 2           | Min samples needed to split a node | Splits tiny nodes    | Stops too early
min_samples_leaf  | 1           | Min samples in any leaf node       | 1-sample leaves      | Forces large leaves
criterion         | 'gini'      | Splitting criterion: 'gini' (fast) or 'entropy' (sensitive); 'squared_error' for regression | - | -
ccp_alpha         | 0.0         | Post-pruning strength              | No pruning (default) | Prunes too aggressively

Section 13

Python - Replicating the Play Tennis Example

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder

# The Play Tennis dataset, exactly as in the lecture
data = pd.DataFrame({
    'Outlook' : ['Sunny','Sunny','Overcast','Rain','Rain','Rain',
                 'Overcast','Sunny','Sunny','Rain','Sunny','Overcast',
                 'Overcast','Rain'],
    'Temp'    : ['Hot','Hot','Hot','Mild','Cool','Cool','Cool',
                 'Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
    'Humidity': ['High','High','High','High','Normal','Normal','Normal',
                 'High','Normal','Normal','Normal','High','Normal','High'],
    'Wind'    : ['Weak','Strong','Weak','Weak','Weak','Strong','Strong',
                 'Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
    'PlayTennis': ['No','No','Yes','Yes','Yes','No','Yes','No','Yes',
                   'Yes','Yes','Yes','Yes','No']
})

X_raw = data[['Outlook','Temp','Humidity','Wind']]
y     = (data['PlayTennis'] == 'Yes').astype(int)

enc = OrdinalEncoder()
X   = enc.fit_transform(X_raw)

# Use entropy to replicate ID3 behaviour
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X, y)

# Print the learned rules
rules = export_text(dt, feature_names=['Outlook','Temp','Humidity','Wind'])
print(rules)

# Feature importances: matches our hand-calculated IG table
for feat, imp in zip(['Outlook','Temp','Humidity','Wind'], dt.feature_importances_):
    print(f"  {feat:<12} importance: {imp:.4f}")
Output
|--- Outlook <= 1.50                  <-- Sunny vs rest
|   |--- Humidity <= 0.50             <-- High Humidity
|   |   |--- class: 0                 <-- No (D1, D2, D8)
|   |--- Humidity >  0.50             <-- Normal Humidity
|   |   |--- class: 1                 <-- Yes (D9, D11)
|--- Outlook >  1.50
|   |--- Outlook <= 2.50              <-- Overcast
|   |   |--- class: 1                 <-- Yes (D3, D7, D12, D13)
|   |--- Outlook >  2.50              <-- Rain
|   |   |--- Wind <= 0.50             <-- Strong Wind
|   |   |   |--- class: 0             <-- No (D6, D14)
|   |   |--- Wind >  0.50             <-- Weak Wind
|   |   |   |--- class: 1             <-- Yes (D4, D5, D10)

  Outlook      importance: 0.2457
  Temp         importance: 0.0000     <-- never used!
  Humidity     importance: 0.4081
  Wind         importance: 0.3462
Python exactly replicates our hand calculation!

The tree structure matches perfectly: Outlook at root, Humidity on the Sunny branch, Wind on the Rain branch. Overcast always Yes. Temp gets 0.000 importance; it was never used, confirming our information gain table that ranked it last with only 0.029 bits. The machine learned exactly what the algebra told us.

Classification Tree - Bank Loan Approval

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
n = 500

data = pd.DataFrame({
    'credit_score'    : np.random.randint(500, 850, n),
    'annual_income'   : np.random.randint(20, 150, n),
    'debt_ratio'      : np.random.uniform(0.1, 0.8, n),
    'employment_years': np.random.randint(0, 20, n),
    'num_accounts'    : np.random.randint(1, 10, n),
})

approved = (
    (data['credit_score']     >= 700) &
    (data['annual_income']    >= 50)  &
    (data['debt_ratio']       <= 0.4)
) | (
    (data['credit_score']     >= 650) &
    (data['employment_years'] >= 5)   &
    (data['debt_ratio']       <= 0.35)
)
data['approved'] = approved.astype(int)
noise_idx = np.random.choice(n, size=50, replace=False)
data.loc[noise_idx, 'approved'] = 1 - data.loc[noise_idx, 'approved']

X = data.drop('approved', axis=1)
y = data['approved']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Compare depths: shows the bias-variance tradeoff
for depth in [2, 5, None]:
    dt = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, dt.predict(X_train))
    test_acc  = accuracy_score(y_test,  dt.predict(X_test))
    print(f"depth={str(depth):>4}  train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc-test_acc:+.3f}")

# GridSearchCV to find optimal hyperparameters
param_grid = {
    'max_depth'        : [2, 3, 4, 5, 6],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf' : [2, 5, 10],
    'criterion'        : ['gini', 'entropy'],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print(f"
Best params : {grid.best_params_}")
print(f"Best CV acc : {grid.best_score_:.4f}")
print(f"Test acc    : {accuracy_score(y_test, grid.predict(X_test)):.4f}")
Output
depth=   2  train=0.872  test=0.864  gap=+0.008
depth=   5  train=0.938  test=0.888  gap=+0.050
depth=None  train=1.000  test=0.824  gap=+0.176

Best params : {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 10}
Best CV acc : 0.8987
Test acc    : 0.9040

Feature Importances and Prediction

best_dt = grid.best_estimator_

# Feature importances
importances = pd.Series(best_dt.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances (MDI):")
print(importances.round(4))

# Predict a new applicant
new_applicant = pd.DataFrame([{
    'credit_score': 720, 'annual_income': 65, 'debt_ratio': 0.30,
    'employment_years': 8, 'num_accounts': 3
}])
pred  = best_dt.predict(new_applicant)[0]
proba = best_dt.predict_proba(new_applicant)[0]
print(f"New applicant: {'Approved' if pred==1 else 'Rejected'}")
print(f"Confidence   : Rejected={proba[0]:.2f}  Approved={proba[1]:.2f}")
Output
Feature Importances (MDI):
credit_score        0.5821
debt_ratio          0.2344
annual_income       0.1508
employment_years    0.0218
num_accounts        0.0109
New applicant: Approved
Confidence   : Rejected=0.04  Approved=0.96
🎯
The tree learned Ramesh's intuition from data alone

Credit Score (58%) is by far the most important feature, exactly the first thing Ramesh checks. Just like Temp was irrelevant in Play Tennis, num_accounts (1%) barely contributes here. Feature importance scores let you validate machine-learned decisions against domain expertise, and when they disagree, that disagreement is worth investigating.


Section 14

Advantages and Disadvantages

Advantages
Advantage             | Why It Matters
Fully interpretable   | Rules can be printed, audited, and explained to regulators
No feature scaling    | No StandardScaler or normalisation needed
Handles mixed types   | Works with numeric and categorical features together
Non-linear boundaries | Captures complex relationships without feature engineering
Fast prediction       | Follows a single root-to-leaf path: O(tree depth), typically O(log n)
Feature importance    | Automatic ranking of which features matter most

Disadvantages
Disadvantage                      | Impact
High variance / overfitting       | A small change in the data can produce a completely different tree
Axis-aligned splits only          | Cannot capture diagonal boundaries without feature engineering
Greedy (local optima)             | Each split is locally optimal, not globally best
Biased toward many-value features | IG favours features with many unique values
Unstable                          | Sensitive to outliers and noisy rows
💡
Why Use a Decision Tree if It Overfits So Easily?

A single Decision Tree is rarely deployed alone in production. Its real power is as the building block of ensemble methods: Random Forests and Gradient Boosting (XGBoost, LightGBM) both build on Decision Trees and fix the variance problem. Understanding trees deeply means understanding why these powerful ensembles work.


Section 15

Golden Rules

Decision Tree - Key Rules
1
Always set max_depth in production. An unconstrained tree achieves 100% training accuracy on noisy data. Start with max_depth=3, evaluate on a held-out set, and increase by 1 until validation accuracy plateaus (a sketch of this loop follows these rules).
2
No feature scaling is needed. Decision Trees split on thresholds, not distances. StandardScaler changes nothing about which threshold is best.
3
Information Gain prefers attributes with more unique values. In Play Tennis, Outlook (3 values) had a structural advantage over Wind (2 values). C4.5 corrects this with Gain Ratio. sklearn's Gini is less susceptible to this bias than raw IG.
4
Zero importance means the feature was never chosen as a split. Temp had zero importance in our Play Tennis tree; it was always outcompeted. This is a powerful feature selection signal: zero-importance features are redundant given the others.
5
Gini and Entropy produce nearly identical trees in practice. The difference in accuracy is almost always under 1%. Use Gini (default) for speed. Use Entropy for information-theoretic contexts like the original ID3/C4.5 algorithm.
6
Use Decision Trees for interpretability β€” not raw performance. When a regulator asks "why was this loan rejected?", a Decision Tree gives an exact, printable rule chain. When you need maximum accuracy, use Random Forest or XGBoost built on top of these same trees.
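
Golden Rule 1 as code: a minimal sketch of the depth-tuning loop, assuming a simple hold-out validation split and stand-in data (swap make_classification for your own X, y):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in practice use your own X, y
X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.05, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_depth, best_acc = None, 0.0
for depth in range(3, 16):          # start shallow, grow one level at a time
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = dt.score(X_val, y_val)
    print(f"max_depth={depth:>2}  validation acc={acc:.3f}")
    if acc > best_acc:
        best_depth, best_acc = depth, acc

print(f"Chosen max_depth = {best_depth} (validation acc {best_acc:.3f})")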