The Story: Why "Impurity" Matters
Imagine you are handed a pile of mixed fruit and a row of empty boxes to sort it into. Two strategies:
Strategy A: Grab fruit randomly, put it anywhere.
Strategy B: First ask "Is it round?" Then "Is it yellow?" Then "Is it bumpy?"
After Strategy A your boxes are still a mess: apples mixed with oranges, bananas mixed with limes. After Strategy B each box has mostly one type. The boxes are purer.
A Decision Tree is Strategy B. But it needs a number to measure how messy a box is right now, and how much cleaner each possible question would make it. That number is Entropy (or Gini Impurity). The improvement from asking a question is Information Gain.
These three concepts are the entire mathematical engine inside a decision tree. Everything else is just bookkeeping.
What Is Impurity? The Core Concept
Before we touch any formula, let's build the intuition. A node in a decision tree contains a group of training samples. Impurity measures how mixed the class labels are in that group.
A Decision Tree split is good if the child nodes are purer than the parent. The algorithm's job is to find the split that reduces impurity the most. Two different formulas measure impurity: Entropy (information theory) and Gini Impurity (probability theory). Both measure the same thing, just with different maths.
Entropy: Disorder Measured in Bits
Imagine guessing an unknown letter of a piece of text using only yes/no questions.
If the text is all A's, you need 0 questions. You already know the answer. Zero surprise. Zero entropy.
If every letter is equally likely, you need about 4.7 questions (log₂ 26 ≈ 4.7). Maximum surprise. Maximum entropy.
A node in a decision tree is like a bag of letters. If all samples belong to one class: zero surprise, entropy = 0. If classes are perfectly balanced: maximum surprise, entropy = 1.0 bit (for binary). The tree wants to ask questions that reduce surprise as fast as possible.
The formula: H(S) = −Σᵢ pᵢ · log₂(pᵢ), where pᵢ is the proportion of samples belonging to class i.
Sum over all classes. Convention: 0 · log₂(0) = 0.
Unit: bits. Range: 0 (pure) to log₂(n) for n classes.
Maximum = 1.0 bit when p = 0.5 (50/50 split).
Minimum = 0 bits when p = 0 or p = 1 (pure node).
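As a quick sanity check of those two extremes, here is a minimal sketch (the helper name binary_entropy is just for illustration) that evaluates the formula at a few values of p:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) treated as 0."""
    terms = [q * np.log2(q) for q in (p, 1 - p) if q > 0]
    return -sum(terms)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  ->  H = {binary_entropy(p):.3f} bits")
# p = 0.0 and p = 1.0 give 0 bits (pure); p = 0.5 gives the maximum, 1 bit.
```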
The Entropy Curve, Visualised
The entropy curve is a symmetric hill. It peaks at 1.0 bit when classes are 50/50 (maximum uncertainty) and falls to 0 at both extremes (complete certainty). The tree wants to move nodes leftward or rightward, away from the peak.
Step-by-Step Entropy Calculation: Weather Dataset
We have 14 days of weather data. The target is Play Tennis? The full set S contains 9 Yes and 5 No.
p(Yes) = 9/14 = 0.643 | p(No) = 5/14 = 0.357
−(0.643 · log₂ 0.643) = −(0.643 · (−0.637)) = +0.410
−(0.357 · log₂ 0.357) = −(0.357 · (−1.486)) = +0.530
H(S) = 0.410 + 0.530 = 0.940 bits
This is the baseline. Every split's quality is measured against this.
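If you want to verify the 0.940-bit baseline yourself, a minimal sketch with NumPy, using the 9 Yes / 5 No counts:

```python
import numpy as np

counts = np.array([9, 5])           # [Yes, No] over the 14 days
probs = counts / counts.sum()
H_S = -np.sum(probs * np.log2(probs))
print(f"H(S) = {H_S:.3f} bits")     # -> H(S) = 0.940 bits
```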
Entropy for Multiple Classes
Entropy works for any number of classes โ not just binary. The formula is identical; you just sum more terms. The maximum possible entropy scales with the number of classes.
| Number of Classes (n) | Maximum Entropy = log₂(n) | Example Problem | Max Entropy Value |
|---|---|---|---|
| 2 (Binary) | log₂(2) | Play Tennis: Yes / No | 1.000 bits |
| 3 | log₂(3) | Weather: Sunny / Overcast / Rain | 1.585 bits |
| 4 | log₂(4) | Seasons: Spring / Summer / Autumn / Winter | 2.000 bits |
| 5 | log₂(5) | Risk: Very Low / Low / Med / High / Very High | 2.322 bits |
| 10 | log₂(10) | Digit recognition: 0-9 | 3.322 bits |
Example: a box holds 12 fruits, 6 apples, 3 oranges and 3 bananas.
p(Apple) = 6/12 = 0.50 | p(Orange) = 3/12 = 0.25 | p(Banana) = 3/12 = 0.25
H = −(0.50 · log₂ 0.50) − (0.25 · log₂ 0.25) − (0.25 · log₂ 0.25)
= −(0.50 · (−1.0)) − (0.25 · (−2.0)) − (0.25 · (−2.0))
= 0.500 + 0.500 + 0.500 = 1.500 bits
Our box is close to maximum impurity (1.5/1.585 = 94.6% of maximum). A good split should bring this down significantly.
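The same calculation in code, here using scipy.stats.entropy (assuming SciPy is available), which normalises raw counts into probabilities for you:

```python
import numpy as np
from scipy.stats import entropy

counts = [6, 3, 3]                  # apples, oranges, bananas
H = entropy(counts, base=2)         # counts are normalised to probabilities
print(f"H = {H:.3f} bits (max for 3 classes = {np.log2(3):.3f})")
# -> H = 1.500 bits (max for 3 classes = 1.585)
```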
Key properties of entropy:
1. Always ≥ 0: entropy is zero only when the node is perfectly pure (one class only).
2. Maximum when classes are equally likely: this is when you have the least information.
3. Additive for independent events: this is what makes it theoretically elegant and why Shannon chose it over alternatives (see the sketch below).
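Property 3 in a tiny sketch: two independent fair coins carry 1 bit each, and their joint outcome (four equally likely pairs) carries exactly 1 + 1 = 2 bits.

```python
import numpy as np

def H(probs):
    """Entropy in bits of a probability distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

coin = [0.5, 0.5]
joint = [0.25, 0.25, 0.25, 0.25]    # HH, HT, TH, TT for two independent coins
print(H(coin), H(coin), H(joint))    # 1.0 1.0 2.0  ->  H(X, Y) = H(X) + H(Y)
```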
Information Gain: Measuring Split Quality
Information Gain answers the question: "How much does asking this question reduce my uncertainty about the answer?" It is simply the difference in entropy before and after the split, weighted by the size of each child group.
IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ), where Sᵥ is the subset of samples for which attribute A has value v.
The weighted entropy of all children is subtracted from the parent entropy.
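A minimal sketch of the definition on a made-up split (the labels and the grouping are invented purely for illustration):

```python
import numpy as np

def H(labels):
    """Shannon entropy (bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = ['Y', 'Y', 'Y', 'N', 'N', 'N']            # 3 Yes / 3 No -> 1.0 bit
children = [['Y', 'Y', 'Y', 'N'], ['N', 'N']]       # outcome of one toy question
weighted = sum(len(c) / len(parent) * H(c) for c in children)
print(f"IG = {H(parent) - weighted:.3f} bits")       # 1.0 - (4/6)*0.811 = 0.459
```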
Worked Example: Which Attribute Splits the Weather Data Best?
| Day | Outlook | Temp | Humidity | Wind | Play Tennis? |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
Parent entropy H(S) = 0.940 bits. Now we calculate IG for each attribute.
Outlook (Sunny: 2 Yes / 3 No, Overcast: 4 / 0, Rain: 3 / 2):
H(Sunny) = −0.4·log₂0.4 − 0.6·log₂0.6 = 0.529 + 0.442 = 0.971 bits
H(Overcast) = 0 − 0 = 0.000 bits
H(Rain) = −0.6·log₂0.6 − 0.4·log₂0.4 = 0.442 + 0.529 = 0.971 bits
Weighted = (5/14)·0.971 + (4/14)·0.000 + (5/14)·0.971 = 0.347 + 0.000 + 0.347 = 0.694 bits
IG(S, Outlook) = 0.940 − 0.694 = 0.246 bits
Humidity (High: 3 Yes / 4 No, Normal: 6 / 1):
H(High) = −(3/7)·log₂(3/7) − (4/7)·log₂(4/7) = 0.524 + 0.461 = 0.985 bits
H(Normal) = −(6/7)·log₂(6/7) − (1/7)·log₂(1/7) = 0.191 + 0.401 = 0.592 bits
Weighted = (7/14)·0.985 + (7/14)·0.592 = 0.789 bits
IG(S, Humidity) = 0.940 − 0.789 = 0.151 bits
Wind (Weak: 6 Yes / 2 No, Strong: 3 / 3):
H(Weak) = 0.811 bits | H(Strong) = 1.000 bits
Weighted = (6/14)·1.0 + (8/14)·0.811 = 0.892
IG(S, Wind) = 0.940 − 0.892 = 0.048 bits
Temp (Hot: 2 Yes / 2 No, Mild: 4 / 2, Cool: 3 / 1):
H(Hot) = 1.000 bits | H(Mild) = 0.918 bits | H(Cool) = 0.811 bits
Weighted = (4/14)·1.0 + (6/14)·0.918 + (4/14)·0.811 = 0.911
IG(S, Temp) = 0.940 − 0.911 = 0.029 bits
Outlook is selected as the root node with IG = 0.246 bits. Temp earns only 0.029 bits; it is outcompeted at every node and ends up never being used in the final tree.
Gini Impurity: The CART Algorithm's Measure
Gini Impurity has a simple interpretation: pick a random sample from the node, then label it randomly according to the node's class distribution. Gini is the probability that I mislabel it.
If the node is pure (all one class), I'll never mislabel it, so Gini = 0.
If the node is 50/50, I'll mislabel half the time, so Gini = 0.5.
No logarithms. No bits. Just a probability between 0 and 0.5 (for binary problems). This is why scikit-learn defaults to Gini: it's computationally faster while capturing essentially the same information.
The formula: Gini = 1 − Σᵢ pᵢ². Sum the squared class proportions, subtract from 1.
Range: 0 (pure) to (n−1)/n for n classes. Binary max = 0.5.
Gini Gain = Gini(parent) − Gini_split, where Gini_split is the weighted Gini of the child nodes.
Pick the split with the highest Gini Gain.
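A minimal sketch of the calculation; the counts below are the 9 Yes / 5 No root of the weather dataset plus two extreme nodes for comparison:

```python
import numpy as np

def gini(counts):
    """Gini impurity from raw class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(f"Gini(root)  = {gini([9, 5]):.3f}")    # -> 0.459
print(f"Gini(pure)  = {gini([14, 0]):.3f}")   # -> 0.000
print(f"Gini(50/50) = {gini([7, 7]):.3f}")    # -> 0.500
```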
Gini Values Across Class Distributions
| Distribution [Yes, No] | p(Yes) | Gini = 1 − (p² + q²) | Interpretation |
|---|---|---|---|
| [10, 0] → all Yes | 1.00 | 1 − (1.00² + 0.00²) = 0.000 | Pure, no impurity |
| [8, 2] → mostly Yes | 0.80 | 1 − (0.80² + 0.20²) = 1 − 0.68 = 0.320 | Low impurity |
| [6, 4] → slightly skewed | 0.60 | 1 − (0.60² + 0.40²) = 1 − 0.52 = 0.480 | Moderate impurity |
| [5, 5] → perfectly mixed | 0.50 | 1 − (0.50² + 0.50²) = 1 − 0.50 = 0.500 | Maximum impurity (binary) |
Gini Calculation: Same Weather Dataset
Gini(parent) = 1 − ((9/14)² + (5/14)²) ≈ 0.460
Outlook: Gini(Sunny) = 0.480, Gini(Overcast) = 0.000, Gini(Rain) = 0.480, so the weighted child Gini = (5/14)·0.480 + (4/14)·0.000 + (5/14)·0.480 ≈ 0.343 and Gain = 0.460 − 0.343 = 0.117.
Humidity: Gini(High) = 0.490, Gini(Normal) = 0.245, weighted child Gini = (7/14)·0.490 + (7/14)·0.245 ≈ 0.367, so Gain = 0.460 − 0.367 = 0.093.
Outlook again has the largest gain, so Gini agrees with entropy on the root split.
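These figures can be reproduced in a few lines; a sketch from the per-child class counts (expect last-digit differences, because the values above were rounded before subtracting):

```python
import numpy as np

def gini(counts):
    """Gini impurity from raw class counts."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

parent = gini([9, 5])
# Outlook children: Sunny 2Y/3N, Overcast 4Y/0N, Rain 3Y/2N
outlook = (5/14)*gini([2, 3]) + (4/14)*gini([4, 0]) + (5/14)*gini([3, 2])
# Humidity children: High 3Y/4N, Normal 6Y/1N
humidity = (7/14)*gini([3, 4]) + (7/14)*gini([6, 1])
print(f"Gini(parent) = {parent:.3f}")                                          # -> 0.459
print(f"Outlook : weighted = {outlook:.3f}, gain = {parent - outlook:.3f}")    # -> 0.343, 0.116
print(f"Humidity: weighted = {humidity:.3f}, gain = {parent - humidity:.3f}")  # -> 0.367, 0.092
```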
Gini for Multiple Classes: Maximum Values
Just like entropy, the maximum Gini Impurity increases with the number of classes. The formula for maximum Gini is (n−1)/n, occurring when all classes are equally likely.
| Classes (n) | Max Gini = (n−1)/n | Worked Example (equal probs) |
|---|---|---|
| 2 | 0.500 | p=0.5 each: 1−(0.5²+0.5²) = 1−0.5 = 0.500 |
| 3 | 0.667 | p=1/3 each: 1−3·(1/3)² = 1−3·(1/9) = 1−1/3 = 0.667 |
| 4 | 0.750 | p=0.25 each: 1−4·(0.25²) = 1−4·0.0625 = 1−0.25 = 0.750 |
| 5 | 0.800 | p=0.2 each: 1−5·(0.2²) = 1−5·0.04 = 1−0.20 = 0.800 |
| 10 | 0.900 | p=0.1 each: 1−10·(0.1²) = 1−10·0.01 = 1−0.10 = 0.900 |
Entropy vs Gini: Head-to-Head
All three curves are symmetric and peak at p = 0.5. Entropy (purple) rises more steeply away from the extremes; it is more sensitive to small probability changes near p = 0 and p = 1. Gini (blue) is a smoother quadratic approximation. Misclassification rate (dashed) is piecewise linear; it is rarely used for splitting because it lacks the mathematical properties the other two have.
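A sketch that reproduces the three curves, assuming Matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - (p**2 + (1 - p)**2)           # equivalently 2*p*(1-p)
misclass = np.minimum(p, 1 - p)          # misclassification error rate

plt.plot(p, entropy, label='Entropy (bits)')
plt.plot(p, gini, label='Gini impurity')
plt.plot(p, misclass, '--', label='Misclassification rate')
plt.xlabel('p(class = Yes)')
plt.ylabel('Impurity')
plt.legend()
plt.show()
```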
| Gini Impurity | Detail |
|---|---|
| Formula | 1 − Σ pᵢ² |
| Binary range | 0 to 0.5 |
| n-class max | (n−1)/n |
| Computation | Fast: squaring only, no log |
| Behaviour | Tends toward larger balanced partitions |
| Sensitivity | Less sensitive at extremes (near-pure nodes) |
| Used by | CART, sklearn DecisionTree (default) |
| Entropy | Detail |
|---|---|
| Formula | −Σ pᵢ · log₂(pᵢ) |
| Binary range | 0 to 1.0 |
| n-class max | log₂(n) |
| Computation | Slower: one log per class per node |
| Behaviour | More sensitive to subtle class changes |
| Sensitivity | Higher sensitivity at near-pure nodes |
| Used by | ID3, C4.5, sklearn with criterion='entropy' |
The Bias Problem and Gain Ratio (C4.5)
Imagine you add a "Day ID" column (D1, D2, …, D14) to the weather dataset. Splitting on Day ID gives 14 child nodes, each with exactly 1 sample: perfectly pure! IG = 0.940 bits, the maximum possible. But it's useless: it memorises training data and can't generalise. ID3 would always pick it. This is why Ross Quinlan upgraded ID3 to C4.5, which normalises IG by the feature's own entropy (its "split information").
For Outlook (children of size 5, 4 and 5):
SplitInfo = −(5/14)·log₂(5/14) − (4/14)·log₂(4/14) − (5/14)·log₂(5/14)
= 0.531 + 0.516 + 0.531 = 1.577 bits
GainRatio = 0.246 / 1.577 = 0.156
For Day ID (14 children of size 1):
SplitInfo = 14 × (−(1/14)·log₂(1/14)) = log₂(14) = 3.807 bits
GainRatio = 0.940 / 3.807 = 0.247
Still wins numerically here but much less dominant; and in real datasets Day ID would never appear as a feature for prediction.
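A short sketch that reproduces the SplitInfo and Gain Ratio numbers above (the information gains are taken from the earlier calculation):

```python
import numpy as np

def split_info(sizes):
    """Entropy of the partition sizes themselves: how finely the split fragments the data."""
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    return -np.sum(w * np.log2(w))

ig_outlook, ig_day_id = 0.246, 0.940       # information gains computed earlier

si_outlook = split_info([5, 4, 5])         # Outlook: Sunny / Overcast / Rain
si_day_id  = split_info([1] * 14)          # Day ID: 14 singleton children
print(f"Outlook: SplitInfo = {si_outlook:.3f}, GainRatio = {ig_outlook / si_outlook:.3f}")
print(f"Day ID : SplitInfo = {si_day_id:.3f}, GainRatio = {ig_day_id / si_day_id:.3f}")
# -> Outlook: SplitInfo = 1.577, GainRatio = 0.156
# -> Day ID : SplitInfo = 3.807, GainRatio = 0.247
```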
Python Implementation: From Scratch and with sklearn
Manual Calculation: Entropy, Gini and Information Gain
import numpy as np
# ── Helper functions ──────────────────────────────────────────

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    classes, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    # Convention: 0 * log2(0) = 0
    return -np.sum(probs * np.log2(probs + 1e-12))


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    return 1 - np.sum(probs ** 2)


def information_gain(parent_labels, child_groups):
    """IG = H(parent) - weighted sum of H(children)."""
    n_parent = len(parent_labels)
    h_parent = entropy(parent_labels)
    weighted_child = sum(
        (len(group) / n_parent) * entropy(group)
        for group in child_groups
    )
    return h_parent - weighted_child


def gini_gain(parent_labels, child_groups):
    """Gini Gain = Gini(parent) - weighted Gini(children)."""
    n_parent = len(parent_labels)
    g_parent = gini(parent_labels)
    weighted_child = sum(
        (len(group) / n_parent) * gini(group)
        for group in child_groups
    )
    return g_parent - weighted_child
# ── Play Tennis Dataset ───────────────────────────────────────
y = np.array([0,0,1,1,1,0,1,0,1,1,1,1,1,0])   # 0=No, 1=Yes (days D1..D14)

# Outlook split: pick out the label subset for each Outlook value
sunny    = y[[0, 1, 7, 8, 10]]    # D1, D2, D8, D9, D11
overcast = y[[2, 6, 11, 12]]      # D3, D7, D12, D13
rain     = y[[3, 4, 5, 9, 13]]    # D4, D5, D6, D10, D14
print("=== ENTROPY ===")
print(f"H(parent) = {entropy(y):.4f}")
print(f"H(Sunny) = {entropy(sunny):.4f}")
print(f"H(Overcast)= {entropy(overcast):.4f}")
print(f"H(Rain) = {entropy(rain):.4f}")
print(f"IG(Outlook)= {information_gain(y, [sunny, overcast, rain]):.4f}")
print("\n=== GINI ===")
print(f"Gini(parent) = {gini(y):.4f}")
print(f"Gini(Sunny) = {gini(sunny):.4f}")
print(f"Gini(Overcast)= {gini(overcast):.4f}")
print(f"Gini(Rain) = {gini(rain):.4f}")
print(f"Gini Gain(Outlook) = {gini_gain(y, [sunny, overcast, rain]):.4f}")
Compare All Attributes at Once
import pandas as pd
# Full dataset as a DataFrame
df = pd.DataFrame({
'Outlook' : ['Sunny','Sunny','Overcast','Rain','Rain','Rain',
'Overcast','Sunny','Sunny','Rain','Sunny',
'Overcast','Overcast','Rain'],
'Temp' : ['Hot','Hot','Hot','Mild','Cool','Cool','Cool',
'Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
'Humidity': ['High','High','High','High','Normal','Normal','Normal',
'High','Normal','Normal','Normal','High','Normal','High'],
'Wind' : ['Weak','Strong','Weak','Weak','Weak','Strong','Strong',
'Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
'Play' : [0,0,1,1,1,0,1,0,1,1,1,1,1,0]
})
results = []
for col in ['Outlook', 'Temp', 'Humidity', 'Wind']:
    groups = [df[df[col] == v]['Play'].values for v in df[col].unique()]
    ig = information_gain(df['Play'].values, groups)
    gg = gini_gain(df['Play'].values, groups)
    results.append({'Attribute': col, 'Info Gain': ig, 'Gini Gain': gg})

summary = pd.DataFrame(results).sort_values('Info Gain', ascending=False)
print(summary.to_string(index=False, float_format='{:.4f}'.format))
sklearn: Switching Between Gini and Entropy
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
X_raw = df[['Outlook','Temp','Humidity','Wind']].values
y = df['Play'].values
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)
for criterion in ['gini', 'entropy']:
    dt = DecisionTreeClassifier(criterion=criterion, random_state=42)
    dt.fit(X, y)
    print(f"\n=== criterion='{criterion}' ===")
    print(export_text(dt, feature_names=['Outlook','Temp','Humidity','Wind']))
    print("Feature importances:")
    for f, imp in zip(['Outlook','Temp','Humidity','Wind'], dt.feature_importances_):
        print(f"  {f:<12} {imp:.4f}")
Gini and Entropy agree on Outlook as the root, Humidity on the Sunny branch, and Wind on the Rain branch. Temp gets zero importance from both. On real-world noisy data, they may diverge slightly, but the difference in final accuracy is almost always under 1%. Use Gini for speed (the default); use Entropy when you want the information-theoretic interpretation in bits.
Side-by-Side Split Quality Diagram
The diagram below shows the same parent node being split two ways. Split A is good: children are purer. Split B is poor: children are as mixed as the parent.
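The same contrast in numbers, using hypothetical node compositions (parent 5 Yes / 5 No; Split A children 4/1 and 1/4; Split B children 3/2 and 2/3):

```python
import numpy as np

def entropy(counts):
    """Entropy in bits from raw class counts."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def weighted_child_entropy(children):
    """Size-weighted average entropy of the child nodes."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent = [5, 5]                     # hypothetical parent: 5 Yes / 5 No -> 1.0 bit
split_a = [[4, 1], [1, 4]]          # good split: children mostly pure
split_b = [[3, 2], [2, 3]]          # poor split: children still mixed
print(f"Parent entropy           = {entropy(parent):.3f} bits")
print(f"Split A weighted entropy = {weighted_child_entropy(split_a):.3f} bits")  # ~0.722
print(f"Split B weighted entropy = {weighted_child_entropy(split_b):.3f} bits")  # ~0.971
```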