Decision Tree Tutorial: Gini Impurity, Entropy, Pruning

Section 01

The Story: Ramesh the Loan Officer

📖 Real-World Story

When a Human Brain Becomes an Algorithm

Ramesh is a senior loan officer at a Mumbai bank. After 20 years of experience, he can approve or reject a home loan application in under 5 minutes. When a junior analyst asks him how, he says:

"First I check the credit score. If it's above 700, I look at their income. Income over 50K? Approved. Below 50K? Rejected. If credit score is low, I check employment status. Not employed? Rejected. Employed but debt ratio over 40%? Rejected. Employed and debt ratio fine? Approved."

The analyst stares at him. "That's... a decision tree."

Ramesh had been running a Decision Tree algorithm in his head for 20 years without knowing it. Machine learning learns these questions and thresholds automatically from data.

Section 02

What Is a Decision Tree?

A Decision Tree is a supervised learning algorithm that models decisions as a flowchart-like tree structure. It learns a hierarchy of if-else questions from training data.

Interpretable

📖

You can read every decision
No black box — full transparency
Regulators and managers love it
Easy to visualise and debug

Non-Parametric

🔀

No distribution assumptions
Handles non-linear boundaries
Works with mixed data types
No feature scaling required

Foundation

🌲

Base of Random Forests
Base of XGBoost / LightGBM
Used in medical diagnosis
Core of ensemble methods

Section 03

Anatomy of a Decision Tree

🌳 Ramesh's Loan Approval Decision Tree

Purple = Root node. Blue = Internal decision nodes. Green = Approved leaf nodes. Red = Rejected leaf nodes. Each path from root to leaf is one complete decision rule.

Term	Definition	Example in Ramesh's Tree
Root Node	The topmost node — the best initial split for the whole dataset	"Credit Score >= 700?"
Decision Node	An internal node that tests a feature and branches into sub-trees	"Annual Income >= 50K?", "Currently Employed?", "Debt Ratio <= 40%?"
Leaf Node	A terminal node — outputs the final prediction	APPROVED or REJECTED
Branch / Edge	The connection between nodes — labelled YES or NO	The arrows labelled YES / NO between nodes
Depth	Distance from root to a node. Max depth = tree height	Root=0, Income node=1, Debt node=2, final leaves=3
Split	The threshold chosen to divide data at a node	Credit >= 700, Income >= 50K, Debt <= 40%
Pruning	Removing branches to prevent overfitting	Removing the Debt Ratio node if it doesn't help generalisation

Section 04

Real Worked Example: The "Play Tennis" Dataset

This is the canonical dataset used to teach ID3 — the original decision tree algorithm by Ross Quinlan (1986). The task: given weather conditions, should we play tennis today? We have 14 days of historical data with four weather features and the outcome.

Day	Outlook	Temp	Humidity	Wind	Play Tennis?
D1	Sunny	Hot	High	Weak	No
D2	Sunny	Hot	High	Strong	No
D3	Overcast	Hot	High	Weak	Yes
D4	Rain	Mild	High	Weak	Yes
D5	Rain	Cool	Normal	Weak	Yes
D6	Rain	Cool	Normal	Strong	No
D7	Overcast	Cool	Normal	Strong	Yes
D8	Sunny	Mild	High	Weak	No
D9	Sunny	Cool	Normal	Weak	Yes
D10	Rain	Mild	Normal	Weak	Yes
D11	Sunny	Mild	Normal	Strong	Yes
D12	Overcast	Mild	High	Strong	Yes
D13	Overcast	Hot	Normal	Weak	Yes
D14	Rain	Mild	High	Strong	No

The dataset has 9 Yes and 5 No — written as S = [9+, 5-]. ID3 must figure out which of the four features creates the most useful first split by calculating Information Gain for each.

💡

Parent Entropy: The Starting Point

Before any split, S = [9+, 5-] has entropy:
H(S) = -(9/14).log2(9/14) - (5/14).log2(5/14) = 0.940 bits
Every split's Information Gain is measured as a reduction from 0.940. Maximum binary entropy is 1.0 — we're close because 9:5 isn't far from equal.

Section 05

Entropy and Information Gain — Choosing the Root Split

Entropy of a Node

H = -Sum(pi . log2(pi))

pi = proportion of samples of class i. Range: 0 (pure) to log2(n) for n classes. Binary max = 1.0 at 50/50 split.

Information Gain

IG = H(parent) - Sum((|Sv|/|S|) . H(Sv))

The drop in entropy after splitting. Pick the attribute with highest IG — this is the most informative split available.

Attribute: Outlook — Gain = 0.246 bits (WINNER)

Outlook creates 3 groups: Sunny (D1,D2,D8,D9,D11), Overcast (D3,D7,D12,D13), Rain (D4,D5,D6,D10,D14).

Gain Calculation — Outlook

Sunny (5 days)

[2+, 3-] — p(Yes)=2/5, p(No)=3/5
H(Sunny) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits

Overcast (4 days)

[4+, 0-] — p(Yes)=1, p(No)=0
H(Overcast) = 0 — perfectly pure, always Yes! = 0.000 bits

Rain (5 days)

[3+, 2-] — p(Yes)=3/5, p(No)=2/5
H(Rain) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits

Weighted

H(after) = (5/14) x 0.971 + (4/14) x 0.000 + (5/14) x 0.971 = 0.694 bits

Gain

0.940 - 0.694 = 0.246 bits — HIGHEST — becomes Root Node

Information Gain Summary — All Four Attributes

Attribute	Groups Created	Information Gain	Selected?
Outlook	Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]	0.246 bits	Root Node
Humidity	High [3+,4-], Normal [6+,1-]	0.151 bits	2nd Best
Wind	Strong [3+,3-], Weak [6+,2-]	0.048 bits	3rd
Temp	Hot [2+,2-], Mild [4+,2-], Cool [3+,1-]	0.029 bits	4th (never used!)

Section 06

Recursion: Sunny Branch uses Humidity

The Sunny subset (D1, D2, D8, D9, D11) has [2+, 3-], H = 0.971 bits. Recursing on the 3 remaining attributes:

Gain on Sunny Subset

Humidity

High=[0+,3-] H=0.0 (pure No!) | Normal=[2+,0-] H=0.0 (pure Yes!)
Gain = 0.971 - (3/5 x 0 + 2/5 x 0) = 0.971 bits — WINNER (perfect split)

Temp

Hot=[0+,2-], Mild=[1+,1-], Cool=[1+,0-]
Gain(Sunny, Temp) = 0.971 - 0.400 = 0.570 bits

Wind

Strong=[1+,1-], Weak=[1+,2-]
Gain(Sunny, Wind) = 0.971 - 0.951 = 0.019 bits

💧

Humidity wins — with both child nodes perfectly pure!

High Humidity: D1, D2, D8 all said No (no tennis when it's sunny and humid). Normal Humidity: D9, D11 both said Yes. Both child nodes are pure leaves — no further splitting needed on this branch.

Section 07

Recursion: Rain Branch uses Wind

The Rain subset (D4, D5, D6, D10, D14) has [3+, 2-], H = 0.971 bits.

Gain on Rain Subset

Wind

Strong=[0+,2-] H=0.0 (pure No!) | Weak=[3+,0-] H=0.0 (pure Yes!)
Gain(Rain, Wind) = 0.971 - 0 = 0.971 bits — WINNER (perfect split)

Humidity

High=[1+,1-], Normal=[2+,1-]
Gain(Rain, Humidity) = 0.019 bits

Temp

Mild=[2+,1-], Cool=[1+,1-]
Gain(Rain, Temp) = 0.019 bits

💨

Wind wins — Rain + Strong Wind = No, Rain + Weak Wind = Yes

D6 and D14 never played in rain with strong wind. D4, D5, D10 always played in rain with weak wind. The tree is now complete — all 5 leaf nodes are pure. The Temp feature was never used at all.

Section 08

The Final Decision Tree: Play Tennis

Complete ID3 Tree — Play Tennis Dataset

The complete ID3 tree has max depth 2 and 5 leaf nodes — all pure (Gini = 0). It perfectly classifies all 14 training examples. Notice Temp is never used — it had the lowest information gain (0.029 bits) and was always outcompeted at every node.

Entropy and Gini Ranges for Multiple Classes

A key insight from the Play Tennis example: entropy maximum depends on the number of classes. This matters when comparing problems.

Number of Classes (n)	Max Entropy = log2(n)	Max Gini = (n-1)/n
2 (Binary — Play Tennis)	1.000	0.500
3	1.585	0.667
4	2.000	0.750
5	2.322	0.800
10	3.322	0.900

📐

Three key properties of entropy (from lecture)

1. Always non-negative (minimum = 0, when a node is pure).
2. Maximum when all classes are equally likely — the system is most uncertain.
3. Increases with the number of classes — a 10-class problem can have entropy up to 3.322 bits, far higher than binary's 1.0 ceiling.

🔬

Maximum Gini for 3 classes — worked example from slides

If p1 = p2 = p3 = 1/3 (all three classes equally likely):
Gini = 1 - ((1/3)^2 + (1/3)^2 + (1/3)^2) = 1 - (3 x 1/9) = 1 - 1/3 = 2/3 = 0.667
This matches the formula (n-1)/n = 2/3 exactly. Both Gini and Entropy reach maximum when classes are perfectly balanced — you have no information about the correct answer.

Section 09

Gini Impurity vs Entropy — Full Comparison

Gini Impurity (CART)

Property	Detail
Algorithm	CART (sklearn default)
Formula	1 - Sum(pi^2)
Range (binary)	0 to 0.5
Range (n classes)	0 to (n-1)/n
Computation	Faster — no logarithm
Sensitivity	Less sensitive to class distribution

Entropy / Information Gain (ID3, C4.5)

Property	Detail
Algorithm	ID3, C4.5
Formula	-Sum(pi . log2(pi))
Range (binary)	0 to 1.0
Range (n classes)	0 to log2(n)
Computation	Slower — log operation per class
Sensitivity	More sensitive — better for skewed distributions

🎯

The Verdict

Gini is faster and the sklearn default — use it unless you have a reason not to. Entropy is more sensitive to class probabilities — better for skewed datasets. For binary classification like Play Tennis, they produce nearly identical trees. For large multi-class problems, entropy's extra sensitivity to small probability differences can matter.

Section 10

How the CART Algorithm Builds the Tree

Start at Root — Try All Possible Splits

For every feature, try every possible threshold. Numeric: sort values, try midpoints. Categorical (like Outlook): try each value. Compute Gini Gain or Information Gain for every candidate split across all features.

Choose the Best Split

Select the feature+threshold with the highest Gain. This becomes the current node. In Play Tennis: Outlook won at depth 0, Humidity won on the Sunny branch, Wind won on the Rain branch.

Recurse on Each Child Node

Repeat steps 1-2 independently on each child subset. Each child becomes the root of its own sub-tree. The Play Tennis tree ran three rounds of this recursion before all branches were resolved.

Stop When a Stopping Criterion Is Met

Stop when: (a) all samples in a node belong to one class — Gini=0 (like the Overcast branch!), or (b) fewer than min_samples_split remain, or (c) max_depth is reached, or (d) no split improves impurity. The node becomes a leaf predicting the majority class.

Section 11

Overfitting and Pruning

Left unconstrained, a tree grows until every leaf is pure — memorising every training sample including noise. The Play Tennis tree happened to generalise perfectly because the dataset is clean. Real datasets have noise, and deep trees overfit badly.

Tree Configuration	Train Accuracy	Test Accuracy	Verdict
max_depth=2 (shallow)	87%	86%	Slight underfitting
max_depth=4 (tuned)	93%	90%	Good fit
max_depth=None (full)	100%	82%	Severe overfitting

⚠️

The 100% training accuracy trap

A fully grown tree that achieves 100% training accuracy is always overfitting on real-world noisy data. An 18-point gap between train and test accuracy (as shown above) means the model memorised noise rather than learned the true pattern. Always evaluate on a held-out test set, never just on training data.

Section 12

Key Hyperparameters

Hyperparameter	Default	Effect	Too Low	Too High
`max_depth`	None (full)	Max levels from root to leaf	Underfitting	Overfitting
`min_samples_split`	2	Min samples needed to split a node	Splits tiny nodes	Stops too early
`min_samples_leaf`	1	Min samples in any leaf node	1-sample leaves	Forces large leaves
`criterion`	'gini'	Splitting criterion	Use 'gini' (fast) or 'entropy' (sensitive). 'squared_error' for regression.
`ccp_alpha`	0.0	Post-pruning strength	No pruning (default)	Prunes too aggressively

Section 13

Python — Replicating the Play Tennis Example

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder

# The Play Tennis dataset — exactly as in the lecture
data = pd.DataFrame({
    'Outlook' : ['Sunny','Sunny','Overcast','Rain','Rain','Rain',
                 'Overcast','Sunny','Sunny','Rain','Sunny','Overcast',
                 'Overcast','Rain'],
    'Temp'    : ['Hot','Hot','Hot','Mild','Cool','Cool','Cool',
                 'Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
    'Humidity': ['High','High','High','High','Normal','Normal','Normal',
                 'High','Normal','Normal','Normal','High','Normal','High'],
    'Wind'    : ['Weak','Strong','Weak','Weak','Weak','Strong','Strong',
                 'Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
    'PlayTennis': ['No','No','Yes','Yes','Yes','No','Yes','No','Yes',
                   'Yes','Yes','Yes','Yes','No']
})

X_raw = data[['Outlook','Temp','Humidity','Wind']]
y     = (data['PlayTennis'] == 'Yes').astype(int)

enc = OrdinalEncoder()
X   = enc.fit_transform(X_raw)

# Use entropy to replicate ID3 behaviour
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X, y)

# Print the learned rules
rules = export_text(dt, feature_names=['Outlook','Temp','Humidity','Wind'])
print(rules)

# Feature importances — matches our hand-calculated IG table
for feat, imp in zip(['Outlook','Temp','Humidity','Wind'], dt.feature_importances_):
    print(f"  {feat:<12} importance: {imp:.4f}")

Output

|--- Outlook <= 1.50 <-- Sunny vs rest | |--- Humidity <= 0.50 <-- High Humidity | | |--- class: 0 <-- No (D1, D2, D8) | |--- Humidity > 0.50 <-- Normal Humidity | | |--- class: 1 <-- Yes (D9, D11) |--- Outlook > 1.50 | |--- Outlook <= 2.50 <-- Overcast | | |--- class: 1 <-- Yes (D3,D7,D12,D13) | |--- Outlook > 2.50 <-- Rain | | |--- Wind <= 0.50 <-- Strong Wind | | | |--- class: 0 <-- No (D6, D14) | | | |--- class: 1 <-- Yes (D4, D5, D10) Outlook importance: 0.2457 Temp importance: 0.0000 <-- never used! Humidity importance: 0.4081 Wind importance: 0.3462

✅

Python exactly replicates our hand calculation!

The tree structure matches perfectly: Outlook at root, Humidity on Sunny branch, Wind on Rain branch. Overcast always Yes. Temp gets 0.000 importance — it was never used, confirming our information gain table that ranked it last with only 0.029 bits. The machine learned exactly what the algebra told us.

Classification Tree — Bank Loan Approval

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
n = 500

data = pd.DataFrame({
    'credit_score'    : np.random.randint(500, 850, n),
    'annual_income'   : np.random.randint(20, 150, n),
    'debt_ratio'      : np.random.uniform(0.1, 0.8, n),
    'employment_years': np.random.randint(0, 20, n),
    'num_accounts'    : np.random.randint(1, 10, n),
})

approved = (
    (data['credit_score']     >= 700) &
    (data['annual_income']    >= 50)  &
    (data['debt_ratio']       <= 0.4)
) | (
    (data['credit_score']     >= 650) &
    (data['employment_years'] >= 5)   &
    (data['debt_ratio']       <= 0.35)
)
data['approved'] = approved.astype(int)
noise_idx = np.random.choice(n, size=50, replace=False)
data.loc[noise_idx, 'approved'] = 1 - data.loc[noise_idx, 'approved']

X = data.drop('approved', axis=1)
y = data['approved']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Compare depths — shows bias-variance tradeoff
for depth in [2, 5, None]:
    dt = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, dt.predict(X_train))
    test_acc  = accuracy_score(y_test,  dt.predict(X_test))
    print(f"depth={str(depth):>4}  train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc-test_acc:+.3f}")

# GridSearchCV to find optimal hyperparameters
param_grid = {
    'max_depth'        : [2, 3, 4, 5, 6],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf' : [2, 5, 10],
    'criterion'        : ['gini', 'entropy'],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print(f"
Best params : {grid.best_params_}")
print(f"Best CV acc : {grid.best_score_:.4f}")
print(f"Test acc    : {accuracy_score(y_test, grid.predict(X_test)):.4f}")

Output

depth= 2 train=0.872 test=0.864 gap=+0.008 depth= 5 train=0.938 test=0.888 gap=+0.050 depth=None train=1.000 test=0.824 gap=+0.176 Best params : {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 10} Best CV acc : 0.8987 Test acc : 0.9040

Feature Importances and Prediction

best_dt = grid.best_estimator_

# Feature importances
importances = pd.Series(best_dt.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances (MDI):")
print(importances.round(4))

# Predict a new applicant
new_applicant = pd.DataFrame([{
    'credit_score': 720, 'annual_income': 65, 'debt_ratio': 0.30,
    'employment_years': 8, 'num_accounts': 3
}])
pred  = best_dt.predict(new_applicant)[0]
proba = best_dt.predict_proba(new_applicant)[0]
print(f"New applicant: {'Approved' if pred==1 else 'Rejected'}")
print(f"Confidence   : Rejected={proba[0]:.2f}  Approved={proba[1]:.2f}")

Output

Feature Importances (MDI): credit_score 0.5821 debt_ratio 0.2344 annual_income 0.1508 employment_years 0.0218 num_accounts 0.0109 New applicant: Approved Confidence : Rejected=0.04 Approved=0.96

🎯

The tree learned Ramesh's intuition from data alone

Credit Score (58%) is by far the most important feature — exactly what Ramesh said first. Just like Temp was irrelevant in Play Tennis, num_accounts (1%) barely contributes here. Feature importance scores let you validate machine-learned decisions against domain expertise — and when they disagree, that disagreement is worth investigating.

Section 14

Advantages and Disadvantages

Advantages

Advantage	Why It Matters
Fully interpretable	Rules can be printed, audited, and explained to regulators
No feature scaling	No StandardScaler or normalisation needed
Handles mixed types	Works with numeric and categorical features together
Non-linear boundaries	Captures complex relationships without feature engineering
Fast prediction	O(log n) — just follow decision rules
Feature importance	Automatic ranking of which features matter most

Disadvantages

Disadvantage	Impact
High variance / overfitting	Small data change can produce completely different tree
Axis-aligned splits only	Cannot capture diagonal boundaries without feature engineering
Greedy (local optima)	Each split is locally optimal — not globally best
Biased toward many-value features	IG favours features with many unique values
Unstable	Sensitive to outliers and noisy rows

💡

Why Use a Decision Tree if It Overfits So Easily?

A single Decision Tree is rarely deployed alone in production. Its real power is as the building block of ensemble methods: Random Forests and Gradient Boosting (XGBoost, LightGBM) both build on Decision Trees and fix the variance problem. Understanding trees deeply means understanding why these powerful ensembles work.

Section 15

Golden Rules

Decision Tree — Key Rules

Always set max_depth in production. An unconstrained tree achieves 100% training accuracy on noisy data. Start with max_depth=3, evaluate on a held-out set, increase by 1 until validation accuracy plateaus.

No feature scaling is needed. Decision Trees split on thresholds, not distances. StandardScaler changes nothing about which threshold is best.

Information Gain prefers attributes with more unique values. In Play Tennis, Outlook (3 values) had a structural advantage over Wind (2 values). C4.5 corrects this with Gain Ratio. sklearn's Gini is less susceptible to this bias than raw IG.

Zero importance means the feature was never chosen as a split. Temp had zero importance in our Play Tennis tree — it was always outcompeted. This is a powerful feature selection signal: zero-importance features are redundant given the others.

Gini and Entropy produce nearly identical trees in practice. The difference in accuracy is almost always under 1%. Use Gini (default) for speed. Use Entropy for information-theoretic contexts like the original ID3/C4.5 algorithm.

Use Decision Trees for interpretability — not raw performance. When a regulator asks "why was this loan rejected?", a Decision Tree gives an exact, printable rule chain. When you need maximum accuracy, use Random Forest or XGBoost built on top of these same trees.