
Entropy, Information Gain and Gini Impurity

Master the maths behind Decision Trees. Learn how Entropy measures disorder in bits, how Information Gain picks the best split, how Gini Impurity works without logarithms, and when to use each — with full worked examples, Python code, and visual diagrams.

Section 01

The Story: Why "Impurity" Matters

The Sorting Machine That Learned to Ask Questions
Imagine you work at a fruit-packing factory. Your job: sort a pile of mixed fruit into boxes — one box per type. You have two strategies.

Strategy A: Grab fruit randomly, put it anywhere.
Strategy B: First ask "Is it round?" Then "Is it yellow?" Then "Is it bumpy?"

After Strategy A your boxes are still a mess — apples mixed with oranges, bananas mixed with limes. After Strategy B each box has mostly one type. The boxes are purer.

A Decision Tree is Strategy B. But it needs a number to measure how messy a box is right now, and how much cleaner each possible question would make it. That number is Entropy (or Gini Impurity). The improvement from asking a question is Information Gain.

These three concepts are the entire mathematical engine inside a decision tree. Everything else is just bookkeeping.

Section 02

What Is Impurity? — The Core Concept

Before we touch any formula, let's build the intuition. A node in a decision tree contains a group of training samples. Impurity measures how mixed the class labels are in that group.

🔵 Impurity Spectrum — From Pure to Chaotic
Pure node (all same class): Entropy = 0, Gini = 0. Skewed node (4 green, 2 red): Entropy ≈ 0.92, Gini ≈ 0.44. Impure node (3 green, 3 red, 50/50): Entropy = 1.0, Gini = 0.5. Impurity increases from left to right.
💡
The Goal of Every Split

A Decision Tree split is good if the child nodes are purer than the parent. The algorithm's job is to find the split that reduces impurity the most. Two different formulas measure impurity: Entropy (information theory) and Gini Impurity (probability theory). Both measure the same thing — just with different maths.


Section 03

Entropy — Disorder Measured in Bits

Claude Shannon's Letter Guessing Game
In 1948, Claude Shannon invented Information Theory to measure how much "surprise" is in a message. He asked: if I pick a letter from a text, how many yes/no questions do I need to guess it?

If the text is all A's — you need 0 questions. You already know the answer. Zero surprise. Zero entropy.

If every letter is equally likely — you need about 4.7 questions (log₂ 26 ≈ 4.7). Maximum surprise. Maximum entropy.

A node in a decision tree is like a bag of letters. If all samples belong to one class — zero surprise, entropy = 0. If classes are perfectly balanced — maximum surprise, entropy = 1.0 (for binary). The tree wants to ask questions that reduce surprise as fast as possible.
Entropy Formula
H(S) = −Σ pᵢ · log₂(pᵢ)
pᵢ = proportion of samples belonging to class i.
Sum over all classes. Convention: 0 · log₂(0) = 0.
Unit: bits. Range: 0 (pure) to log₂(n) for n classes.
Binary Entropy (2 classes)
H = −p·log₂p − (1−p)·log₂(1−p)
Most common case: Yes/No, Spam/Ham, Default/No Default.
Maximum = 1.0 bit when p = 0.5 (50/50 split).
Minimum = 0 bits when p = 0 or p = 1 (pure node).
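To make the binary formula concrete, here is a minimal NumPy sketch (the helper name binary_entropy is mine, not a library function) that evaluates H(p) at a few probabilities:

import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), using the 0*log2(0) = 0 convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  ->  H(p) = {binary_entropy(p):.3f} bits")
# Symmetric curve: H(0.1) equals H(0.9); the maximum of 1.000 bit sits at p = 0.5.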

The Entropy Curve โ€” Visualised

📈 Binary Entropy H(p) vs Class Probability p — x-axis: p (probability of the positive class), y-axis: H(p). Maximum H = 1.0 at p = 0.5; H = 0 (pure) at p = 0 and p = 1.

The entropy curve is a symmetric hill. It peaks at 1.0 bit when classes are 50/50 (maximum uncertainty) and falls to 0 at both extremes (complete certainty). The tree wants to move nodes leftward or rightward — away from the peak.

Step-by-Step Entropy Calculation — Weather Dataset

We have 14 days of weather data. The target is Play Tennis? The full set S contains 9 Yes and 5 No.

🧮 H(S) — Parent Entropy for [9+, 5−]
Step 1
Calculate class proportions:
p(Yes) = 9/14 = 0.643  |  p(No) = 5/14 = 0.357
Step 2
Compute each term −pᵢ · log₂(pᵢ):
−(0.643 · log₂ 0.643) = −(0.643 · −0.637) = +0.410
−(0.357 · log₂ 0.357) = −(0.357 · −1.486) = +0.530
Step 3
Sum: H(S) = 0.410 + 0.530 = 0.940 bits
This is the baseline. Every split's quality is measured against this.
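A two-line check of this result, assuming nothing beyond NumPy:

import numpy as np

p_yes, p_no = 9/14, 5/14
print(f"H(S) = {-p_yes*np.log2(p_yes) - p_no*np.log2(p_no):.3f} bits")  # 0.940 bits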

Section 04

Entropy for Multiple Classes

Entropy works for any number of classes — not just binary. The formula is identical; you just sum more terms. The maximum possible entropy scales with the number of classes.

Number of Classes (n) | Maximum Entropy = log₂(n) | Example Problem | Max Entropy Value
2 (Binary) | log₂(2) | Play Tennis: Yes / No | 1.000 bits
3 | log₂(3) | Weather: Sunny / Overcast / Rain | 1.585 bits
4 | log₂(4) | Seasons: Spring / Summer / Autumn / Winter | 2.000 bits
5 | log₂(5) | Risk: Very Low / Low / Med / High / Very High | 2.322 bits
10 | log₂(10) | Digit recognition: 0–9 | 3.322 bits
🧮 3-Class Entropy Example — Fruit Sorting
Dataset
A box contains 6 Apples, 3 Oranges, 3 Bananas. Total = 12 fruit.
p(Apple) = 6/12 = 0.50  |  p(Orange) = 3/12 = 0.25  |  p(Banana) = 3/12 = 0.25
Compute
H = −(0.50 · log₂0.50) − (0.25 · log₂0.25) − (0.25 · log₂0.25)
= −(0.50 · −1.0) − (0.25 · −2.0) − (0.25 · −2.0)
= 0.500 + 0.500 + 0.500 = 1.500 bits
Interpret
Maximum for 3 classes = log₂(3) = 1.585 bits.
Our box is close to maximum impurity (1.5/1.585 = 94.6% of maximum). A good split should bring this down significantly.
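The same arithmetic in a short NumPy sketch (variable names are mine):

import numpy as np

probs = np.array([6, 3, 3]) / 12          # apples, oranges, bananas
H_box = -np.sum(probs * np.log2(probs))
H_max = np.log2(3)
print(f"H = {H_box:.3f} bits (max for 3 classes = {H_max:.3f})")  # 1.500 vs 1.585
print(f"Relative impurity: {H_box / H_max:.1%}")                  # ~94.6%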
📝
Three Essential Properties of Entropy

1. Always ≥ 0 — entropy is zero only when the node is perfectly pure (one class only).
2. Maximum when classes are equally likely — this is when you have the least information.
3. Additive for independent events — this is what makes it theoretically elegant and why Shannon chose it over alternatives.


Section 05

Information Gain โ€” Measuring Split Quality

Information Gain answers the question: "How much does asking this question reduce my uncertainty about the answer?" It is simply the difference in entropy before and after the split, weighted by the size of each child group.

Information Gain
IG(S, A) = H(S) − Σ (|Sᵥ|/|S|) · H(Sᵥ)
S = parent node samples. A = attribute being tested.
Sᵥ = subset of samples where attribute A has value v.
The weighted entropy of all children is subtracted from parent entropy.
What Good IG Looks Like
IG = 0 → useless split
IG = 0 means the split didn't reduce entropy at all — the attribute carries no information about the target. Maximum IG = H(parent) — both children are perfectly pure. The algorithm picks the attribute with highest IG.
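A small sketch of the two extremes described above: a split whose children mirror the parent's 50/50 mix (IG = 0) and a split that produces two pure children (IG = H(parent)). The helper names are mine:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent  = np.array([1, 1, 1, 0, 0, 0])                 # H(parent) = 1.0 bit
useless = [np.array([1, 0]), np.array([1, 1, 0, 0])]   # both children still 50/50
perfect = [np.array([1, 1, 1]), np.array([0, 0, 0])]   # both children pure

print(f"IG of useless split: {info_gain(parent, useless):.3f}")  # 0.000 bits
print(f"IG of perfect split: {info_gain(parent, perfect):.3f}")  # 1.000 bits = H(parent)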

Worked Example: Which Attribute Splits the Weather Data Best?

Day | Outlook | Temp | Humidity | Wind | Play Tennis?
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No

Parent entropy H(S) = 0.940 bits. Now we calculate IG for each attribute.

🧮 IG(S, Outlook)
Sunny (5)
D1, D2, D8, D9, D11 → [2+, 3−]   p+ = 2/5 = 0.4, p− = 3/5 = 0.6
H(Sunny) = −0.4·log₂0.4 − 0.6·log₂0.6 = 0.529 + 0.442 = 0.971 bits
Overcast (4)
D3, D7, D12, D13 → [4+, 0−]   perfectly pure!
H(Overcast) = 0 − 0 = 0.000 bits
Rain (5)
D4, D5, D6, D10, D14 → [3+, 2−]   p+ = 3/5 = 0.6, p− = 2/5 = 0.4
H(Rain) = −0.6·log₂0.6 − 0.4·log₂0.4 = 0.442 + 0.529 = 0.971 bits
Weighted
(5/14)·0.971 + (4/14)·0.000 + (5/14)·0.971
= 0.347 + 0.000 + 0.347 = 0.694 bits
IG
IG(S, Outlook) = 0.940 − 0.694 = 0.246 bits ✓ WINNER
🧮 IG(S, Humidity)
High (7)
D1, D2, D3, D4, D8, D12, D14 → [3+, 4−]
H(High) = −(3/7)·log₂(3/7) − (4/7)·log₂(4/7) = 0.524 + 0.461 = 0.985 bits
Normal (7)
D5, D6, D7, D9, D10, D11, D13 → [6+, 1−]
H(Normal) = −(6/7)·log₂(6/7) − (1/7)·log₂(1/7) = 0.191 + 0.401 = 0.592 bits
IG
Weighted = (7/14)·0.985 + (7/14)·0.592 = 0.493 + 0.296 = 0.789
IG(S, Humidity) = 0.940 − 0.789 = 0.151 bits
🧮 IG(S, Wind) and IG(S, Temp)
Wind
Strong [3+, 3−] H = 1.0   Weak [6+, 2−] H = 0.811
Weighted = (6/14)·1.0 + (8/14)·0.811 = 0.893
IG(S, Wind) = 0.940 − 0.893 = 0.048 bits
Temp
Hot [2+, 2−] H = 1.0   Mild [4+, 2−] H = 0.918   Cool [3+, 1−] H = 0.811
Weighted = (4/14)·1.0 + (6/14)·0.918 + (4/14)·0.811 = 0.911
IG(S, Temp) = 0.940 − 0.911 = 0.029 bits
📊 Information Gain Comparison — All Four Attributes
Outlook: 0.246 bits ← root node · Humidity: 0.151 bits · Wind: 0.048 bits · Temp: 0.029 bits (never used!)

Outlook is selected as the root node with IG = 0.246 bits. Temp earns only 0.029 bits — it is outcompeted at every node and ends up never being used in the final tree.
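The same four numbers can be reproduced straight from the class counts listed above, without touching the raw table. A minimal sketch, with a helper name of my choosing:

import numpy as np

def H(pos, neg):
    """Binary entropy from raw class counts (0*log2(0) treated as 0)."""
    total = pos + neg
    return -sum(c / total * np.log2(c / total) for c in (pos, neg) if c)

H_parent = H(9, 5)                                   # 0.940 bits
splits = {
    "Outlook":  [(2, 3), (4, 0), (3, 2)],
    "Humidity": [(3, 4), (6, 1)],
    "Wind":     [(3, 3), (6, 2)],
    "Temp":     [(2, 2), (4, 2), (3, 1)],
}
for name, children in splits.items():
    weighted = sum((p + n) / 14 * H(p, n) for p, n in children)
    print(f"IG(S, {name:<8}) = {H_parent - weighted:.3f} bits")
# Outlook ≈ 0.247, Humidity ≈ 0.152, Wind ≈ 0.048, Temp ≈ 0.029
# (matches the hand calculations up to rounding)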


Section 06

Gini Impurity โ€” The CART Algorithm's Measure

The Probability of Mislabelling a Random Sample
Gini Impurity asks a different question: "If I randomly pick a sample from this node and randomly assign it a class label based on the class distribution in this node, what is the probability I label it wrong?"

If the node is pure (all one class), I'll never mislabel it — Gini = 0.
If the node is 50/50, I'll mislabel half the time — Gini = 0.5.

No logarithms. No bits. Just a probability between 0 and 0.5 (for binary problems). This is why scikit-learn defaults to Gini — it's computationally faster while capturing the same information.
Gini Impurity
Gini(S) = 1 − Σ pᵢ²
pᵢ = proportion of samples in class i.
Sum the squared proportions, subtract from 1.
Range: 0 (pure) to (n−1)/n for n classes. Binary max = 0.5.
Weighted Gini After Split
Gini_split = Σ (|Sᵥ|/|S|) · Gini(Sᵥ)
Weighted average of child node Gini values.
Gini Gain = Gini(parent) − Gini_split.
Pick the split with the highest Gini Gain.

Gini Values Across Class Distributions

Distribution [Yes, No] | p(Yes) | Gini = 1 − (p² + q²) | Interpretation
[10, 0] — all Yes | 1.00 | 1 − (1.00² + 0.00²) = 0.000 | Pure — no impurity
[8, 2] — mostly Yes | 0.80 | 1 − (0.80² + 0.20²) = 1 − 0.68 = 0.320 | Low impurity
[6, 4] — slightly skewed | 0.60 | 1 − (0.60² + 0.40²) = 1 − 0.52 = 0.480 | Moderate impurity
[5, 5] — perfectly mixed | 0.50 | 1 − (0.50² + 0.50²) = 1 − 0.50 = 0.500 | Maximum impurity (binary)
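The table rows are easy to reproduce; a quick sketch assuming only standard Python:

for yes, no in [(10, 0), (8, 2), (6, 4), (5, 5)]:
    p = yes / (yes + no)
    gini = 1 - (p**2 + (1 - p)**2)
    print(f"[{yes}, {no}]  p(Yes) = {p:.2f}  Gini = {gini:.3f}")
# 0.000, 0.320, 0.480, 0.500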

Gini Calculation — Same Weather Dataset

🧮 Gini(S) — Parent Node [9+, 5−]
Proportions
p(Yes) = 9/14 = 0.643  |  p(No) = 5/14 = 0.357
Formula
Gini(S) = 1 − (0.643² + 0.357²) = 1 − (0.413 + 0.127) = 1 − 0.540 = 0.460
🧮 Gini Gain — Outlook Split
Sunny (5) [2+, 3−]
Gini = 1 − ((2/5)² + (3/5)²) = 1 − (0.16 + 0.36) = 0.480
Overcast (4) [4+, 0−]
Gini = 1 − (1² + 0²) = 0.000 ← Pure!
Rain (5) [3+, 2−]
Gini = 1 − ((3/5)² + (2/5)²) = 1 − (0.36 + 0.16) = 0.480
Weighted Gini
(5/14)·0.480 + (4/14)·0.000 + (5/14)·0.480 = 0.171 + 0 + 0.171 = 0.343
Gini Gain
0.460 − 0.343 = 0.117
🧮 Gini Gain — Humidity Split
High (7) [3+, 4−]
Gini = 1 − ((3/7)² + (4/7)²) = 1 − (0.184 + 0.327) = 0.490
Normal (7) [6+, 1−]
Gini = 1 − ((6/7)² + (1/7)²) = 1 − (0.735 + 0.020) = 0.245
Gini Gain
Weighted = (7/14)·0.490 + (7/14)·0.245 = 0.245 + 0.122 = 0.367
Gain = 0.460 − 0.367 = 0.093
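Both Gini Gains can be verified from the class counts alone, in the same style as the entropy check earlier (the helper name is mine):

def gini_counts(pos, neg):
    """Binary Gini impurity from raw class counts."""
    p = pos / (pos + neg)
    return 1 - (p**2 + (1 - p)**2)

g_parent = gini_counts(9, 5)                              # ≈ 0.459
splits = {"Outlook": [(2, 3), (4, 0), (3, 2)], "Humidity": [(3, 4), (6, 1)]}
for name, children in splits.items():
    weighted = sum((p + n) / 14 * gini_counts(p, n) for p, n in children)
    print(f"Gini Gain({name}) = {g_parent - weighted:.3f}")
# Outlook ≈ 0.116, Humidity ≈ 0.092 (the worked 0.117 / 0.093 differ only through rounding)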

Section 07

Gini for Multiple Classes — Maximum Values

Just like entropy, the maximum Gini Impurity increases with the number of classes. The formula for maximum Gini is (n−1)/n — occurring when all classes are equally likely.

Classes (n) | Max Gini = (n−1)/n | Worked Example (equal probs)
2 | 0.500 | p = 0.5 each: 1 − (0.5² + 0.5²) = 1 − 0.5 = 0.500
3 | 0.667 | p = 1/3 each: 1 − 3·(1/3)² = 1 − 3·(1/9) = 1 − 1/3 = 0.667
4 | 0.750 | p = 0.25 each: 1 − 4·(0.25²) = 1 − 4·0.0625 = 1 − 0.25 = 0.750
5 | 0.800 | p = 0.2 each: 1 − 5·(0.2²) = 1 − 5·0.04 = 1 − 0.20 = 0.800
10 | 0.900 | p = 0.1 each: 1 − 10·(0.1²) = 1 − 10·0.01 = 1 − 0.10 = 0.900
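A quick check that the equal-probability case really does give (n−1)/n, assuming only standard Python:

for n in (2, 3, 4, 5, 10):
    equal_prob_gini = 1 - n * (1 / n) ** 2     # n classes, each with probability 1/n
    print(f"n = {n:>2}: 1 - n*(1/n)^2 = {equal_prob_gini:.3f}   (n-1)/n = {(n - 1) / n:.3f}")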

Section 08

Entropy vs Gini — Head-to-Head

📈 Entropy vs Gini vs Misclassification Rate (binary, normalised to 0–1) — x-axis: p (positive class probability); curves: Entropy (÷ max), Gini (×2), Misclassification rate (×2).

All three curves are symmetric and peak at p = 0.5. Entropy rises more steeply away from the extremes — it is more sensitive to small probability changes near p = 0 and p = 1. Gini is a smoother quadratic approximation. The misclassification rate is linear and less smooth — rarely used because it lacks the mathematical properties the other two have.
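The shapes are easy to inspect numerically without a plot. A minimal sketch (the function name is mine) that prints the three measures, scaled as in the caption above (entropy as-is, Gini and misclassification rate ×2 so all peak at 1.0):

import numpy as np

def impurities(p):
    """Entropy, Gini and misclassification rate of a binary node with P(positive) = p."""
    h = 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    g = 2 * p * (1 - p)            # identical to 1 - p**2 - (1-p)**2
    m = min(p, 1 - p)
    return h, g, m

print("   p   entropy  gini*2  misclass*2")
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    h, g, m = impurities(p)
    print(f"{p:4.2f}   {h:.3f}   {2*g:.3f}    {2*m:.3f}")
# Near the extremes entropy is the largest of the three, i.e. it rises most steeply;
# Gini follows a smoother quadratic; the misclassification rate is piecewise linear.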

🔵 Gini Impurity (CART — sklearn default)
Property | Detail
Formula | 1 − Σ pᵢ²
Binary range | 0 to 0.5
n-class max | (n−1)/n
Computation | Fast — squaring only, no log
Behaviour | Tends toward larger balanced partitions
Sensitivity | Less sensitive at extremes (near-pure nodes)
Used by | CART, sklearn DecisionTree (default)
🟢 Entropy / Information Gain (ID3, C4.5)
Property | Detail
Formula | −Σ pᵢ · log₂(pᵢ)
Binary range | 0 to 1.0
n-class max | log₂(n)
Computation | Slower — log per class per node
Behaviour | More sensitive to subtle class changes
Sensitivity | Higher sensitivity at near-pure nodes
Used by | ID3, C4.5, sklearn with criterion='entropy'

Section 09

The Bias Problem — and Gain Ratio (C4.5)

⚠️
ID3's Fatal Flaw: Information Gain Favours High-Cardinality Features

Imagine you add a "Day ID" column (D1, D2, … D14) to the weather dataset. Splitting on Day ID gives 14 child nodes, each with exactly 1 sample — perfectly pure! IG = 0.940 bits — the maximum possible. But it's useless — it memorises the training data and can't generalise. ID3 would always pick it. This is why Ross Quinlan upgraded ID3 to C4.5, which normalises IG by the feature's own entropy (its "split information").

Split Information
SplitInfo(S, A) = −Σ (|Sᵥ|/|S|) · log₂(|Sᵥ|/|S|)
Measures the entropy of the split itself — how many branches it creates and how evenly data is distributed across them. High-cardinality features have high SplitInfo.
Gain Ratio (C4.5)
GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)
Normalises IG by the split's own entropy. A split into 14 pure nodes of 1 sample each has SplitInfo = log₂(14) ≈ 3.81, so GainRatio = 0.940/3.81 ≈ 0.247 — a heavy penalty on its raw IG compared with a modest 3-way split.
🧮 Gain Ratio — Outlook vs a Hypothetical "Day ID" Column
Outlook
IG = 0.246 bits
SplitInfo = −(5/14)·log₂(5/14) − (4/14)·log₂(4/14) − (5/14)·log₂(5/14)
= 0.531 + 0.516 + 0.531 = 1.577 bits
GainRatio = 0.246 / 1.577 = 0.156
Day ID (14 values)
IG = 0.940 bits (max — 14 pure leaf nodes)
SplitInfo = 14 × −(1/14)·log₂(1/14) = log₂(14) = 3.807 bits
GainRatio = 0.940 / 3.807 = 0.247
Still wins numerically here, but far less dominant — and in real datasets an ID-like column would never be offered as a predictive feature.
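The two Gain Ratios above follow directly from the IG values and the branch sizes; a short sketch (the split_info helper is mine):

import numpy as np

def split_info(branch_sizes):
    """Entropy of the split itself: how evenly samples spread across the branches."""
    props = np.array(branch_sizes) / sum(branch_sizes)
    return -np.sum(props * np.log2(props))

ig_outlook, ig_day_id = 0.246, 0.940          # IG values from the worked examples

print(f"GainRatio(Outlook) = {ig_outlook / split_info([5, 4, 5]):.3f}")   # ≈ 0.156
print(f"GainRatio(Day ID)  = {ig_day_id / split_info([1] * 14):.3f}")     # ≈ 0.247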

Section 10

Python Implementation — From Scratch and with sklearn

Manual Calculation — Entropy, Gini and Information Gain

import numpy as np

# ── Helper functions ────────────────────────────────────────────────
def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    classes, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    # Convention: 0 * log2(0) = 0 (probs here are strictly positive, so log2 is safe);
    # clamp so a perfectly pure node returns exactly 0.0 rather than -0.0
    return max(0.0, -np.sum(probs * np.log2(probs + 1e-12)))

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    return 1 - np.sum(probs ** 2)

def information_gain(parent_labels, child_groups):
    """IG = H(parent) - weighted sum of H(children)."""
    n_parent = len(parent_labels)
    h_parent = entropy(parent_labels)
    weighted_child = sum(
        (len(group) / n_parent) * entropy(group)
        for group in child_groups
    )
    return h_parent - weighted_child

def gini_gain(parent_labels, child_groups):
    """Gini Gain = Gini(parent) - weighted Gini(children)."""
    n_parent = len(parent_labels)
    g_parent = gini(parent_labels)
    weighted_child = sum(
        (len(group) / n_parent) * gini(group)
        for group in child_groups
    )
    return g_parent - weighted_child

# ── Play Tennis Dataset ─────────────────────────────────────────────
y = np.array([0,0,1,1,1,0,1,0,1,1,1,1,1,0])  # 0=No, 1=Yes

# Outlook split: group the labels by Outlook value (0-based row indices)
sunny    = y[[0,1,7,8,10]]   # D1,D2,D8,D9,D11
overcast = y[[2,6,11,12]]    # D3,D7,D12,D13
rain     = y[[3,4,5,9,13]]   # D4,D5,D6,D10,D14

print("=== ENTROPY ===")
print(f"H(parent)  = {entropy(y):.4f}")
print(f"H(Sunny)   = {entropy(sunny):.4f}")
print(f"H(Overcast)= {entropy(overcast):.4f}")
print(f"H(Rain)    = {entropy(rain):.4f}")
print(f"IG(Outlook)= {information_gain(y, [sunny, overcast, rain]):.4f}")

print("\n=== GINI ===")
print(f"Gini(parent)  = {gini(y):.4f}")
print(f"Gini(Sunny)   = {gini(sunny):.4f}")
print(f"Gini(Overcast)= {gini(overcast):.4f}")
print(f"Gini(Rain)    = {gini(rain):.4f}")
print(f"Gini Gain(Outlook) = {gini_gain(y, [sunny, overcast, rain]):.4f}")
Output
=== ENTROPY ===
H(parent)  = 0.9403
H(Sunny)   = 0.9710
H(Overcast)= 0.0000
H(Rain)    = 0.9710
IG(Outlook)= 0.2467

=== GINI ===
Gini(parent)  = 0.4592
Gini(Sunny)   = 0.4800
Gini(Overcast)= 0.0000
Gini(Rain)    = 0.4800
Gini Gain(Outlook) = 0.1163

Compare All Attributes at Once

import pandas as pd

# Full dataset as a DataFrame
df = pd.DataFrame({
    'Outlook' : ['Sunny','Sunny','Overcast','Rain','Rain','Rain',
                 'Overcast','Sunny','Sunny','Rain','Sunny',
                 'Overcast','Overcast','Rain'],
    'Temp'    : ['Hot','Hot','Hot','Mild','Cool','Cool','Cool',
                 'Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
    'Humidity': ['High','High','High','High','Normal','Normal','Normal',
                 'High','Normal','Normal','Normal','High','Normal','High'],
    'Wind'    : ['Weak','Strong','Weak','Weak','Weak','Strong','Strong',
                 'Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
    'Play'    : [0,0,1,1,1,0,1,0,1,1,1,1,1,0]
})

results = []
for col in ['Outlook', 'Temp', 'Humidity', 'Wind']:
    groups = [df[df[col] == v]['Play'].values for v in df[col].unique()]
    ig = information_gain(df['Play'].values, groups)
    gg = gini_gain(df['Play'].values, groups)
    results.append({'Attribute': col, 'Info Gain': ig, 'Gini Gain': gg})

summary = pd.DataFrame(results).sort_values('Info Gain', ascending=False)
print(summary.to_string(index=False, float_format='{:.4f}'.format))
Output
Attribute  Info Gain  Gini Gain
  Outlook     0.2467     0.1163
 Humidity     0.1518     0.0918
     Wind     0.0481     0.0306
     Temp     0.0292     0.0187

sklearn: Switching Between Gini and Entropy

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

X_raw = df[['Outlook','Temp','Humidity','Wind']].values
y     = df['Play'].values

enc = OrdinalEncoder()
X   = enc.fit_transform(X_raw)

for criterion in ['gini', 'entropy']:
    dt = DecisionTreeClassifier(criterion=criterion, random_state=42)
    dt.fit(X, y)
    print(f"\n=== criterion='{criterion}' ===")
    print(export_text(dt, feature_names=['Outlook','Temp','Humidity','Wind']))
    print("Feature importances:")
    for f, imp in zip(['Outlook','Temp','Humidity','Wind'], dt.feature_importances_):
        print(f"  {f:<12} {imp:.4f}")
Output
=== criterion='gini' ===
|--- Outlook <= 1.50
|   |--- Humidity <= 0.50
|   |   |--- class: 0 (No)
|   |--- Humidity > 0.50
|   |   |--- class: 1 (Yes)
|--- Outlook > 1.50
|   |--- Outlook <= 2.50
|   |   |--- class: 1 (Overcast → always Yes)
|   |--- Outlook > 2.50
|   |   |--- Wind <= 0.50
|   |   |   |--- class: 0 (Rain + Strong)
|   |   |   |--- class: 1 (Rain + Weak)

=== criterion='entropy' ===
[Identical tree structure — both criteria agree on this clean dataset]

Feature importances (gini):
  Outlook      0.2457
  Temp         0.0000   ← never used
  Humidity     0.4081
  Wind         0.3462
✅
Both criteria build the same tree on this clean dataset

Gini and Entropy agree on Outlook as the root, Humidity on the Sunny branch, and Wind on the Rain branch. Temp gets zero importance from both. On real-world noisy data, they may diverge slightly — but the difference in final accuracy is almost always under 1%. Use Gini for speed (the default), Entropy when theoretical correctness matters.


Section 11

Side-by-Side Split Quality Diagram

The diagram below shows the same parent node being split two ways. Split A is good — children are purer. Split B is poor — children are as mixed as the parent.

🌳 Good Split vs Poor Split — Entropy and Gini Before and After
GOOD SPLIT — Parent: [6+, 6−], H = 1.00, Gini = 0.50 (50/50 — maximum impurity). Child L: [6+, 0−], H = 0.00, Gini = 0.00, pure. Child R: [0+, 6−], H = 0.00, Gini = 0.00, pure. IG = 1.00 − 0.00 = 1.00 bits (max!); Gini Gain = 0.50 − 0.00 = 0.50 (max!).
POOR SPLIT — Same parent: [6+, 6−], H = 1.00, Gini = 0.50. Child L: [3+, 3−], H = 1.00, Gini = 0.50. Child R: [3+, 3−], H = 1.00, Gini = 0.50. No change at all. IG = 1.00 − 1.00 = 0.00 bits (useless!); Gini Gain = 0.50 − 0.50 = 0.00 (useless!).
Both Entropy and Gini agree: a good split has IG/Gain > 0, a poor split has IG/Gain = 0. The algorithm always picks the split with the highest gain — whether measured by entropy or Gini. A split that produces zero gain is never chosen — it wastes a node and increases tree depth for no benefit.

Section 12

Common Pitfalls and How to Avoid Them

Comparing IG across different datasets
IG = 0.246 on one dataset vs IG = 0.246 on another does NOT mean the splits are equally good. The parent entropy differs. Always compare IG as a fraction of the parent entropy (i.e. Gain Ratio) for cross-dataset comparisons.
Using IG with high-cardinality features
ID3 / raw IG will always prefer features with many unique values (zip codes, IDs, timestamps) because more branches = lower weighted entropy. Use Gain Ratio (C4.5) or Gini to mitigate. Or simply exclude ID-like columns.
Confusing Entropy with Gini ranges
Entropy can exceed 1.0 for multi-class problems (max = log₂(n)). Gini always stays below 1.0 (max = (n−1)/n). Comparing raw values across criteria is meaningless — always compare IG/Gain within the same criterion.
Forgetting the 0·log₂(0) convention
When a class has zero probability (pure node), the term 0·log₂(0) is mathematically undefined (0 × −∞). By convention it equals 0. This is why H(pure node) = 0, not undefined. When implementing, either skip zero-probability classes or add a small epsilon (e.g. 1e-10) to avoid NaN — see the sketch after this list.
Assuming Gini is always faster
Gini avoids log₂ but still iterates over all classes. For problems with very few classes (binary), the speed difference is negligible. For 100+ class problems, the logarithm cost becomes more significant — but so does the benefit of entropy's richer gradient.
Ignoring class imbalance effects
Both Entropy and Gini behave strangely when one class dominates (e.g. a 99/1 split). A node with 990 negatives and 10 positives has low entropy (≈0.08 bits) — it looks almost pure. But those 10 positives might be critical (e.g. fraud cases). Use class_weight='balanced' or custom sample weights.
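A minimal sketch illustrating the last two pitfalls: the 0·log₂(0) convention handled by masking zero counts (rather than relying on an epsilon), and how deceptively pure a 990/10 node looks. The helper name is mine:

import numpy as np

def entropy_from_counts(counts):
    """Entropy from class counts; zero-count classes are dropped, so 0*log2(0) never occurs."""
    counts = np.asarray(counts, dtype=float)
    probs = counts[counts > 0] / counts.sum()   # masking avoids log2(0) = -inf / NaN
    return float(-np.sum(probs * np.log2(probs))) + 0.0   # + 0.0 turns -0.0 into 0.0

print(entropy_from_counts([14, 0]))    # 0.0  (pure node, no NaN)
print(entropy_from_counts([990, 10]))  # ≈ 0.081 bits: looks almost pure, yet those
                                       # 10 positives may be exactly the fraud cases you care about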

Section 13

Golden Rules

🎯 Entropy, Information Gain and Gini — Key Rules
1
Entropy and Gini measure the same thing differently. Both are zero for a pure node and maximum for a balanced node. Entropy uses logarithms (sensitive to small probability changes), Gini uses squares (faster computation). For binary classification, they produce nearly identical trees — the choice rarely matters in practice.
2
Information Gain is always non-negative. A split can never increase entropy in expectation — the weighted average of the children can only be ≤ the parent entropy. If a split gives IG = 0, it provides no useful information and should not be made.
3
Raw Information Gain is biased toward high-cardinality features. Always consider Gain Ratio (C4.5) or Gini when features have widely different numbers of unique values. This is one reason sklearn defaults to Gini — it is less susceptible to this bias than raw entropy IG.
4
Maximum entropy depends on the number of classes: log₂(n). Maximum Gini depends on the number of classes: (n−1)/n. Never compare impurity values across problems with different numbers of classes without normalising first.
5
These measures are only used during training — never during prediction. Once the tree is built, prediction is simply following branches based on feature values. Entropy and Gini are training-time search criteria, not runtime components. They add zero cost to inference.
6
A feature with consistently low Information Gain ends up unused. In the Play Tennis example, Temperature scored only 0.029 bits of IG and was never selected at any node. Zero (or near-zero) feature importance in a trained tree is the direct result of that feature never having the highest IG/Gini Gain at any split — a powerful signal for feature selection.