The Story: Ramesh the Loan Officer
"First I check the credit score. If it's above 700, I look at their income. Income over 50K? Approved. Below 50K? Rejected. If credit score is low, I check employment status. Not employed? Rejected. Employed but debt ratio over 40%? Rejected. Employed and debt ratio fine? Approved."
The analyst stares at him. "That's... a decision tree."
Ramesh had been running a Decision Tree algorithm in his head for 20 years without knowing it. Machine learning learns these questions and thresholds automatically from data.
What Is a Decision Tree?
A Decision Tree is a supervised learning algorithm that models decisions as a flowchart-like tree structure. It learns a hierarchy of if-else questions from training data.
- You can read every decision
- No black box: full transparency
- Regulators and managers love it
- Easy to visualise and debug
- No distribution assumptions
- Handles non-linear boundaries
- Works with mixed data types
- No feature scaling required
- Base of Random Forests
- Base of XGBoost / LightGBM
- Used in medical diagnosis
- Core of ensemble methods
Anatomy of a Decision Tree
In the diagram of Ramesh's loan tree: purple = root node, blue = internal decision nodes, green = approved leaf nodes, red = rejected leaf nodes. Each path from root to leaf is one complete decision rule.
| Term | Definition | Example in Ramesh's Tree |
|---|---|---|
| Root Node | The topmost node: the best initial split for the whole dataset | "Credit Score >= 700?" |
| Decision Node | An internal node that tests a feature and branches into sub-trees | "Annual Income >= 50K?", "Currently Employed?", "Debt Ratio <= 40%?" |
| Leaf Node | A terminal node that outputs the final prediction | APPROVED or REJECTED |
| Branch / Edge | The connection between nodes, labelled YES or NO | The arrows labelled YES / NO between nodes |
| Depth | Distance from root to a node. Max depth = tree height | Root=0, Income node=1, Debt node=2, final leaves=3 |
| Split | The threshold chosen to divide data at a node | Credit >= 700, Income >= 50K, Debt <= 40% |
| Pruning | Removing branches to prevent overfitting | Removing the Debt Ratio node if it doesn't help generalisation |
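Ramesh's rules from the opening story already form a complete decision tree; writing them out as code makes the anatomy above concrete (the first test is the root node, each return is a leaf). A minimal sketch, with the thresholds taken straight from his quote:

```python
def ramesh_decides(credit_score, income_k, employed, debt_ratio):
    """Ramesh's mental flowchart from the story, written as an explicit decision tree."""
    if credit_score > 700:                       # root node
        return 'Approved' if income_k > 50 else 'Rejected'
    if not employed:                             # low-credit branch
        return 'Rejected'
    return 'Rejected' if debt_ratio > 0.40 else 'Approved'

print(ramesh_decides(credit_score=720, income_k=65, employed=True, debt_ratio=0.30))  # Approved
print(ramesh_decides(credit_score=650, income_k=80, employed=True, debt_ratio=0.55))  # Rejected
```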
Real Worked Example: The "Play Tennis" Dataset
This is the canonical dataset used to teach ID3, the original decision tree algorithm by Ross Quinlan (1986). The task: given weather conditions, should we play tennis today? We have 14 days of historical data with four weather features and the outcome.
| Day | Outlook | Temp | Humidity | Wind | Play Tennis? |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
The dataset has 9 Yes and 5 No, written as S = [9+, 5-].
ID3 must figure out which of the four features creates the most useful first split by calculating Information Gain for each.
Before any split, S = [9+, 5-] has entropy:
H(S) = -(9/14).log2(9/14) - (5/14).log2(5/14) = 0.940 bits
Every split's Information Gain is measured as a reduction from 0.940 bits. The maximum binary entropy is 1.0; we are close to it because a 9:5 split is not far from even.
Entropy and Information Gain: Choosing the Root Split
Attribute: Outlook, Gain = 0.246 bits (WINNER)
Outlook creates 3 groups: Sunny (D1,D2,D8,D9,D11), Overcast (D3,D7,D12,D13), Rain (D4,D5,D6,D10,D14).
H(Sunny) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits
H(Overcast) = 0.000 bits (perfectly pure: always Yes)
H(Rain) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits
| Attribute | Groups Created | Information Gain | Selected? |
|---|---|---|---|
| Outlook | Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-] | 0.246 bits | Root Node |
| Humidity | High [3+,4-], Normal [6+,1-] | 0.151 bits | 2nd Best |
| Wind | Strong [3+,3-], Weak [6+,2-] | 0.048 bits | 3rd |
| Temp | Hot [2+,2-], Mild [4+,2-], Cool [3+,1-] | 0.029 bits | 4th (never used!) |
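The numbers in this table are easy to check with a few lines of Python. Below is a minimal sketch using only the standard library; the day tuples are copied from the table above, and small differences in the third decimal place (e.g. 0.247 vs 0.246 for Outlook) come from rounding intermediate values in the hand calculation.

```python
from math import log2
from collections import Counter

# The 14 days from the table above: (Outlook, Temp, Humidity, Wind, PlayTennis)
days = [
    ('Sunny','Hot','High','Weak','No'),          ('Sunny','Hot','High','Strong','No'),
    ('Overcast','Hot','High','Weak','Yes'),      ('Rain','Mild','High','Weak','Yes'),
    ('Rain','Cool','Normal','Weak','Yes'),       ('Rain','Cool','Normal','Strong','No'),
    ('Overcast','Cool','Normal','Strong','Yes'), ('Sunny','Mild','High','Weak','No'),
    ('Sunny','Cool','Normal','Weak','Yes'),      ('Rain','Mild','Normal','Weak','Yes'),
    ('Sunny','Mild','Normal','Strong','Yes'),    ('Overcast','Mild','High','Strong','Yes'),
    ('Overcast','Hot','Normal','Weak','Yes'),    ('Rain','Mild','High','Strong','No'),
]
FEATURES = ['Outlook', 'Temp', 'Humidity', 'Wind']

def entropy(rows):
    """H(S) = -sum(p_i * log2(p_i)) over the class labels (last element of each row)."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def information_gain(rows, feature):
    """Parent entropy minus the weighted entropy of the children after splitting on `feature`."""
    i = FEATURES.index(feature)
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - weighted

print(f"H(S) = {entropy(days):.3f} bits")                        # 0.940
for f in FEATURES:
    print(f"Gain(S, {f}) = {information_gain(days, f):.3f}")     # Outlook wins

# The same function applied to the Sunny subset anticipates the recursion step below
sunny = [r for r in days if r[0] == 'Sunny']
print(f"Gain(Sunny, Humidity) = {information_gain(sunny, 'Humidity'):.3f}")  # 0.971
```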
Recursion: Sunny Branch uses Humidity
The Sunny subset (D1, D2, D8, D9, D11) has [2+, 3-], H = 0.971 bits. Recursing on the 3 remaining attributes:
Gain(Sunny, Humidity) = 0.971 - (3/5 x 0 + 2/5 x 0) = 0.971 bits, the WINNER (perfect split)
Gain(Sunny, Temp) = 0.971 - 0.400 = 0.571 bits
Gain(Sunny, Wind) = 0.971 - 0.951 = 0.020 bits
High Humidity: D1, D2, D8 all said No (no tennis when it's sunny and humid). Normal Humidity: D9, D11 both said Yes. Both child nodes are pure leaves, so no further splitting is needed on this branch.
Recursion: Rain Branch uses Wind
The Rain subset (D4, D5, D6, D10, D14) has [3+, 2-], H = 0.971 bits.
Gain(Rain, Wind) = 0.971 - 0 = 0.971 bits, the WINNER (perfect split)
Gain(Rain, Humidity) = 0.020 bits
Gain(Rain, Temp) = 0.020 bits
D6 and D14 never played in rain with strong wind. D4, D5, D10 always played in rain with weak wind. The tree is now complete: all 5 leaf nodes are pure. The Temp feature was never used at all.
The Final Decision Tree: Play Tennis
The complete ID3 tree has max depth 2 and 5 leaf nodes, all pure (Gini = 0). It perfectly classifies all 14 training examples. Notice that Temp is never used: it had the lowest information gain (0.029 bits) and was outcompeted at every node.
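Written out as plain if/else rules, the finished tree fits in a few lines of Python; the function below encodes exactly the five leaves derived above (Temp never appears).

```python
def play_tennis(outlook, humidity, wind):
    """The final ID3 tree as plain if/else rules; Temp is never consulted."""
    if outlook == 'Overcast':
        return 'Yes'                                    # pure leaf: D3, D7, D12, D13
    if outlook == 'Sunny':
        return 'Yes' if humidity == 'Normal' else 'No'  # D9, D11 vs D1, D2, D8
    return 'Yes' if wind == 'Weak' else 'No'            # Rain: D4, D5, D10 vs D6, D14

print(play_tennis('Sunny', 'High', 'Weak'))     # 'No'  (matches D1)
print(play_tennis('Rain', 'Normal', 'Strong'))  # 'No'  (matches D6)
```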
Entropy and Gini Ranges for Multiple Classes
A key insight from the Play Tennis example: entropy maximum depends on the number of classes. This matters when comparing problems.
| Number of Classes (n) | Max Entropy = log2(n) | Max Gini = (n-1)/n |
|---|---|---|
| 2 (Binary β Play Tennis) | 1.000 | 0.500 |
| 3 | 1.585 | 0.667 |
| 4 | 2.000 | 0.750 |
| 5 | 2.322 | 0.800 |
| 10 | 3.322 | 0.900 |
Three properties follow from the entropy formula:
1. Always non-negative (minimum = 0, when a node is pure).
2. Maximum when all classes are equally likely, i.e. when the system is most uncertain.
3. Increases with the number of classes: a 10-class problem can have entropy up to 3.322 bits, far higher than the binary ceiling of 1.0.
If p1 = p2 = p3 = 1/3 (all three classes equally likely):
Gini = 1 - ((1/3)^2 + (1/3)^2 + (1/3)^2) = 1 - (3 x 1/9) = 1 - 1/3 = 2/3 = 0.667
This matches the formula (n-1)/n = 2/3 exactly. Both Gini and Entropy reach their maximum when classes are perfectly balanced, i.e. when you have no information about the correct answer.
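The maxima in the table above are quick to verify numerically; this small sketch evaluates both formulas at the uniform distribution p_i = 1/n.

```python
from math import log2

def max_entropy(n):
    """Entropy of n equally likely classes: sums to log2(n)."""
    return -sum(1 / n * log2(1 / n) for _ in range(n))

def max_gini(n):
    """Gini of n equally likely classes: 1 - n * (1/n)^2 = (n - 1) / n."""
    return 1 - sum((1 / n) ** 2 for _ in range(n))

for n in (2, 3, 4, 5, 10):
    print(f"n={n:>2}  max entropy = {max_entropy(n):.3f} bits   max Gini = {max_gini(n):.3f}")
```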
Gini Impurity vs Entropy: Full Comparison
| Property | Gini Impurity | Entropy |
|---|---|---|
| Algorithm | CART (sklearn default) | ID3, C4.5 |
| Formula | 1 - Sum(pi^2) | -Sum(pi . log2(pi)) |
| Range (binary) | 0 to 0.5 | 0 to 1.0 |
| Range (n classes) | 0 to (n-1)/n | 0 to log2(n) |
| Computation | Faster: no logarithm | Slower: one log operation per class |
| Sensitivity | Less sensitive to class distribution | More sensitive: better for skewed distributions |
Gini is faster and the sklearn default: use it unless you have a reason not to. Entropy is more sensitive to class probabilities, which makes it a better fit for skewed datasets. For binary classification like Play Tennis, the two criteria produce nearly identical trees. For large multi-class problems, entropy's extra sensitivity to small probability differences can matter.
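One way to see why the two criteria usually agree on binary problems is to evaluate both curves over the positive-class probability p: they peak at the same place (p = 0.5) and have very similar shapes, differing mainly in scale. A quick sketch:

```python
import numpy as np

p = np.linspace(0.05, 0.95, 10)
gini = 1 - p**2 - (1 - p)**2                              # binary Gini, max 0.5 at p = 0.5
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # binary entropy, max 1.0 at p = 0.5

for pi, g, h in zip(p, gini, entropy):
    print(f"p={pi:.2f}   gini={g:.3f}   entropy={h:.3f}")
```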
How the CART Algorithm Builds the Tree
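CART grows the tree greedily, one node at a time: at each node it tries every feature and every candidate threshold, scores each candidate by the impurity decrease (Gini by default), keeps the best one, partitions the data, and recurses until a node is pure or a stopping rule (max_depth, min_samples_split, ...) fires; leaves then predict the majority class. The sketch below captures that loop for numeric features. It is a teaching toy under those assumptions, not sklearn's optimised implementation (which, for example, splits at midpoints between sorted values and supports sample weights and pruning).

```python
import numpy as np
from collections import Counter

def gini(y):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions in y."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy step: try every feature/threshold, keep the split with the largest impurity decrease."""
    best = (None, None, 0.0)                      # (feature index, threshold, gain)
    parent = gini(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            child = left.mean() * gini(y[left]) + right.mean() * gini(y[right])
            if parent - child > best[2]:
                best = (f, t, parent - child)
    return best

def build(X, y, depth=0, max_depth=3, min_samples_split=2):
    """Recurse until the node is pure or a stopping rule fires; leaves predict the majority class."""
    if depth >= max_depth or len(y) < min_samples_split or gini(y) == 0.0:
        return {'leaf': Counter(y).most_common(1)[0][0]}
    f, t, gain = best_split(X, y)
    if f is None or gain <= 0.0:
        return {'leaf': Counter(y).most_common(1)[0][0]}
    mask = X[:, f] <= t
    return {'feature': f, 'threshold': t,
            'left':  build(X[mask], y[mask], depth + 1, max_depth, min_samples_split),
            'right': build(X[~mask], y[~mask], depth + 1, max_depth, min_samples_split)}

# Tiny demo on a loan-style toy dataset (credit score, income in K):
# two greedy splits are enough to separate approvals from rejections here
X_demo = np.array([[720, 60], [710, 40], [650, 80], [760, 90], [600, 30], [705, 55]])
y_demo = np.array([1, 0, 0, 1, 0, 1])
print(build(X_demo, y_demo, max_depth=2))
```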
Overfitting and Pruning
Left unconstrained, a tree grows until every leaf is pure β memorising every training sample including noise. The Play Tennis tree happened to generalise perfectly because the dataset is clean. Real datasets have noise, and deep trees overfit badly.
| Tree Configuration | Train Accuracy | Test Accuracy | Verdict |
|---|---|---|---|
| max_depth=2 (shallow) | 87% | 86% | Slight underfitting |
| max_depth=4 (tuned) | 93% | 90% | Good fit |
| max_depth=None (full) | 100% | 82% | Severe overfitting |
A fully grown tree that achieves 100% training accuracy is always overfitting on real-world noisy data. An 18-point gap between train and test accuracy (as shown above) means the model memorised noise rather than learned the true pattern. Always evaluate on a held-out test set, never just on training data.
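sklearn implements post-pruning via minimal cost-complexity pruning, controlled by ccp_alpha (covered in the hyperparameter table below). A sketch of how it behaves in practice, on a synthetic noisy dataset (make_classification is just a stand-in here, so the exact numbers will vary): cost_complexity_pruning_path returns the sequence of alphas at which subtrees would be pruned away, and refitting with increasing ccp_alpha trades leaves (and training accuracy) for generalisation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data so that an unpruned tree visibly overfits
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           flip_y=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path: candidate alphas from "no pruning" up to "prune everything"
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::5]:              # sample every 5th alpha along the path
    dt = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(f"ccp_alpha={alpha:.4f}  leaves={dt.get_n_leaves():>3}  "
          f"train={dt.score(X_train, y_train):.3f}  test={dt.score(X_test, y_test):.3f}")
```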
Key Hyperparameters
| Hyperparameter | Default | Effect | Too Low | Too High |
|---|---|---|---|---|
| max_depth | None (grow fully) | Max levels from root to leaf | Underfitting | Overfitting |
| min_samples_split | 2 | Min samples needed to split a node | Splits tiny nodes | Stops too early |
| min_samples_leaf | 1 | Min samples in any leaf node | 1-sample leaves | Forces large leaves |
| criterion | 'gini' | Splitting criterion: 'gini' (fast) or 'entropy' (sensitive); 'squared_error' for regression | n/a | n/a |
| ccp_alpha | 0.0 | Post-pruning strength | No pruning (default) | Prunes too aggressively |
Python: Replicating the Play Tennis Example
```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder
# The Play Tennis dataset, exactly as in the lecture
data = pd.DataFrame({
'Outlook' : ['Sunny','Sunny','Overcast','Rain','Rain','Rain',
'Overcast','Sunny','Sunny','Rain','Sunny','Overcast',
'Overcast','Rain'],
'Temp' : ['Hot','Hot','Hot','Mild','Cool','Cool','Cool',
'Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
'Humidity': ['High','High','High','High','Normal','Normal','Normal',
'High','Normal','Normal','Normal','High','Normal','High'],
'Wind' : ['Weak','Strong','Weak','Weak','Weak','Strong','Strong',
'Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
'PlayTennis': ['No','No','Yes','Yes','Yes','No','Yes','No','Yes',
'Yes','Yes','Yes','Yes','No']
})
X_raw = data[['Outlook','Temp','Humidity','Wind']]
y = (data['PlayTennis'] == 'Yes').astype(int)
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)
# Use entropy to replicate ID3 behaviour
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X, y)
# Print the learned rules
rules = export_text(dt, feature_names=['Outlook','Temp','Humidity','Wind'])
print(rules)
# Feature importances: compare with the hand-calculated IG table
for feat, imp in zip(['Outlook','Temp','Humidity','Wind'], dt.feature_importances_):
    print(f" {feat:<12} importance: {imp:.4f}")
```
The learned tree tells the same story as the hand calculation: Outlook dominates (the Overcast days are separated off immediately), Humidity and Wind drive the remaining splits, and Temp contributes essentially nothing, consistent with the information gain table that ranked it last at only 0.029 bits. One caveat: sklearn's CART makes binary splits on the ordinal codes rather than ID3's three-way split on Outlook, so the printed rules are not a literal copy of the hand-built tree (one-hot encoding the categories, e.g. with pd.get_dummies, avoids imposing an artificial order). The machine learns essentially what the algebra told us.
Classification Tree: Bank Loan Approval
```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
np.random.seed(42)
n = 500
data = pd.DataFrame({
'credit_score' : np.random.randint(500, 850, n),
'annual_income' : np.random.randint(20, 150, n),
'debt_ratio' : np.random.uniform(0.1, 0.8, n),
'employment_years': np.random.randint(0, 20, n),
'num_accounts' : np.random.randint(1, 10, n),
})
approved = (
(data['credit_score'] >= 700) &
(data['annual_income'] >= 50) &
(data['debt_ratio'] <= 0.4)
) | (
(data['credit_score'] >= 650) &
(data['employment_years'] >= 5) &
(data['debt_ratio'] <= 0.35)
)
data['approved'] = approved.astype(int)
# Flip 10% of the labels to simulate real-world noise
noise_idx = np.random.choice(n, size=50, replace=False)
data.loc[noise_idx, 'approved'] = 1 - data.loc[noise_idx, 'approved']
X = data.drop('approved', axis=1)
y = data['approved']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# Compare depths to show the bias-variance tradeoff
for depth in [2, 5, None]:
    dt = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, dt.predict(X_train))
    test_acc = accuracy_score(y_test, dt.predict(X_test))
    print(f"depth={str(depth):>4} train={train_acc:.3f} test={test_acc:.3f} gap={train_acc-test_acc:+.3f}")
# GridSearchCV to find optimal hyperparameters
param_grid = {
'max_depth' : [2, 3, 4, 5, 6],
'min_samples_split': [5, 10, 20],
'min_samples_leaf' : [2, 5, 10],
'criterion' : ['gini', 'entropy'],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print(f"
Best params : {grid.best_params_}")
print(f"Best CV acc : {grid.best_score_:.4f}")
print(f"Test acc : {accuracy_score(y_test, grid.predict(X_test)):.4f}")
Feature Importances and Prediction
```python
best_dt = grid.best_estimator_
# Feature importances
importances = pd.Series(best_dt.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances (MDI):")
print(importances.round(4))
# Predict a new applicant
new_applicant = pd.DataFrame([{
'credit_score': 720, 'annual_income': 65, 'debt_ratio': 0.30,
'employment_years': 8, 'num_accounts': 3
}])
pred = best_dt.predict(new_applicant)[0]
proba = best_dt.predict_proba(new_applicant)[0]
print(f"New applicant: {'Approved' if pred==1 else 'Rejected'}")
print(f"Confidence : Rejected={proba[0]:.2f} Approved={proba[1]:.2f}")
Credit Score (58%) is by far the most important feature, exactly what Ramesh said first. Just like Temp was irrelevant in Play Tennis, num_accounts (1%) barely contributes here. Feature importance scores let you validate machine-learned decisions against domain expertise, and when they disagree, that disagreement is worth investigating.
Advantages and Disadvantages
| Advantage | Why It Matters |
|---|---|
| Fully interpretable | Rules can be printed, audited, and explained to regulators |
| No feature scaling | No StandardScaler or normalisation needed |
| Handles mixed types | Works with numeric and categorical features together |
| Non-linear boundaries | Captures complex relationships without feature engineering |
| Fast prediction | O(depth), typically O(log n) for a balanced tree: just follow the decision rules from root to leaf |
| Feature importance | Automatic ranking of which features matter most |
| Disadvantage | Impact |
|---|---|
| High variance / overfitting | Small data change can produce completely different tree |
| Axis-aligned splits only | Cannot capture diagonal boundaries without feature engineering |
| Greedy (local optima) | Each split is locally optimal β not globally best |
| Biased toward many-value features | IG favours features with many unique values |
| Unstable | Sensitive to outliers and noisy rows |
A single Decision Tree is rarely deployed alone in production. Its real power is as the building block of ensemble methods: Random Forests and Gradient Boosting (XGBoost, LightGBM) both build on Decision Trees and fix the variance problem. Understanding trees deeply means understanding why these powerful ensembles work.
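As a closing sanity check, here is a minimal comparison sketch (synthetic data via make_classification, so the exact scores are illustrative only): a Random Forest built from hundreds of the same kind of tree typically beats a single unconstrained tree on noisy data, precisely because averaging many high-variance trees cancels out much of the variance discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification task: a single tree overfits, the ensemble averages it away
X, y = make_classification(n_samples=1000, n_features=12, n_informative=5,
                           flip_y=0.1, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print(f"Single tree   5-fold CV accuracy: {cross_val_score(tree, X, y, cv=5).mean():.3f}")
print(f"Random Forest 5-fold CV accuracy: {cross_val_score(forest, X, y, cv=5).mean():.3f}")
```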