The Story: Why "Impurity" Matters
Imagine you are handed a pile of mixed fruit and a row of empty boxes to sort it into. Two strategies:
Strategy A: Grab fruit randomly, put it anywhere.
Strategy B: First ask "Is it round?" Then "Is it yellow?" Then "Is it bumpy?"
After Strategy A your boxes are still a mess: apples mixed with oranges, bananas mixed with limes. After Strategy B each box has mostly one type. The boxes are purer.
A Decision Tree is Strategy B. But it needs a number to measure how messy a box is right now, and how much cleaner each possible question would make it. That number is Entropy (or Gini Impurity). The improvement from asking a question is Information Gain.
These three concepts are the entire mathematical engine inside a decision tree. Everything else is just bookkeeping.
What Is Impurity? The Core Concept
Before we touch any formula, let's build the intuition. A node in a decision tree contains a group of training samples. Impurity measures how mixed the class labels are in that group.
A Decision Tree split is good if the child nodes are purer than the parent. The algorithm's job is to find the split that reduces impurity the most. Two different formulas measure impurity: Entropy (information theory) and Gini Impurity (probability theory). Both measure the same thing, just with different maths.
Entropy: Disorder Measured in Bits
Imagine guessing an unknown letter of a piece of text using only yes/no questions.
If the text is all A's, you need 0 questions. You already know the answer. Zero surprise. Zero entropy.
If every letter is equally likely, you need about 4.7 questions (log₂ 26 ≈ 4.7). Maximum surprise. Maximum entropy.
A node in a decision tree is like a bag of letters. If all samples belong to one class: zero surprise, entropy = 0. If classes are perfectly balanced: maximum surprise, entropy = 1.0 bit (for binary). The tree wants to ask questions that reduce surprise as fast as possible.
The formula: H(S) = −Σᵢ pᵢ · log₂(pᵢ), where pᵢ is the proportion of samples belonging to class i.
Sum over all classes. Convention: 0 · log₂(0) = 0.
Unit: bits. Range: 0 (pure) to log₂(n) for n classes.
Maximum = 1.0 bit when p = 0.5 (50/50 split).
Minimum = 0 bits when p = 0 or p = 1 (pure node).
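As a quick sanity check of those two extremes, here is a minimal sketch (the helper name binary_entropy is just for illustration) that evaluates the formula at a few values of p:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) treated as 0."""
    terms = [q * np.log2(q) for q in (p, 1 - p) if q > 0]
    return -sum(terms)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  ->  H = {binary_entropy(p):.3f} bits")
# p = 0.0 and p = 1.0 give 0 bits (pure); p = 0.5 gives the maximum, 1 bit.
```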
The Entropy Curve, Visualised
The entropy curve is a symmetric hill. It peaks at 1.0 bit when classes are 50/50 (maximum uncertainty) and falls to 0 at both extremes (complete certainty). The tree wants to move nodes leftward or rightward, away from the peak.
Step-by-Step Entropy Calculation: Weather Dataset
We have 14 days of weather data. The target is Play Tennis? The full set S contains 9 Yes and 5 No.
p(Yes) = 9/14 = 0.643 | p(No) = 5/14 = 0.357
−(0.643 · log₂ 0.643) = −(0.643 · (−0.637)) = +0.410
−(0.357 · log₂ 0.357) = −(0.357 · (−1.486)) = +0.530
H(S) = 0.410 + 0.530 = 0.940 bits
This is the baseline. Every split's quality is measured against this.
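If you want to verify the 0.940-bit baseline yourself, a minimal sketch with NumPy, using the 9 Yes / 5 No counts:

```python
import numpy as np

counts = np.array([9, 5])           # [Yes, No] over the 14 days
probs = counts / counts.sum()
H_S = -np.sum(probs * np.log2(probs))
print(f"H(S) = {H_S:.3f} bits")     # -> H(S) = 0.940 bits
```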
Entropy for Multiple Classes
Entropy works for any number of classes โ not just binary. The formula is identical; you just sum more terms. The maximum possible entropy scales with the number of classes.
| Number of Classes (n) | Maximum Entropy = log₂(n) | Example Problem | Max Entropy Value |
|---|---|---|---|
| 2 (Binary) | log₂(2) | Play Tennis: Yes / No | 1.000 bits |
| 3 | log₂(3) | Weather: Sunny / Overcast / Rain | 1.585 bits |
| 4 | log₂(4) | Seasons: Spring / Summer / Autumn / Winter | 2.000 bits |
| 5 | log₂(5) | Risk: Very Low / Low / Med / High / Very High | 2.322 bits |
| 10 | log₂(10) | Digit recognition: 0-9 | 3.322 bits |
Example: a box holds 12 fruits, 6 apples, 3 oranges and 3 bananas.
p(Apple) = 6/12 = 0.50 | p(Orange) = 3/12 = 0.25 | p(Banana) = 3/12 = 0.25
H = −(0.50 · log₂ 0.50) − (0.25 · log₂ 0.25) − (0.25 · log₂ 0.25)
= −(0.50 · (−1.0)) − (0.25 · (−2.0)) − (0.25 · (−2.0))
= 0.500 + 0.500 + 0.500 = 1.500 bits
Our box is close to maximum impurity (1.5/1.585 = 94.6% of maximum). A good split should bring this down significantly.
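The same calculation in code, here using scipy.stats.entropy (assuming SciPy is available), which normalises raw counts into probabilities for you:

```python
import numpy as np
from scipy.stats import entropy

counts = [6, 3, 3]                  # apples, oranges, bananas
H = entropy(counts, base=2)         # counts are normalised to probabilities
print(f"H = {H:.3f} bits (max for 3 classes = {np.log2(3):.3f})")
# -> H = 1.500 bits (max for 3 classes = 1.585)
```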
Key properties of entropy:
1. Always ≥ 0: entropy is zero only when the node is perfectly pure (one class only).
2. Maximum when classes are equally likely: this is when you have the least information.
3. Additive for independent events: this is what makes it theoretically elegant and why Shannon chose it over alternatives (see the sketch below).
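Property 3 in a tiny sketch: two independent fair coins carry 1 bit each, and their joint outcome (four equally likely pairs) carries exactly 1 + 1 = 2 bits.

```python
import numpy as np

def H(probs):
    """Entropy in bits of a probability distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

coin = [0.5, 0.5]
joint = [0.25, 0.25, 0.25, 0.25]    # HH, HT, TH, TT for two independent coins
print(H(coin), H(coin), H(joint))    # 1.0 1.0 2.0  ->  H(X, Y) = H(X) + H(Y)
```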
Information Gain: Measuring Split Quality
Information Gain answers the question: "How much does asking this question reduce my uncertainty about the answer?" It is simply the difference in entropy before and after the split, weighted by the size of each child group.
IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ), where Sᵥ is the subset of samples for which attribute A has value v.
The weighted entropy of all children is subtracted from the parent entropy.
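A minimal sketch of the definition on a made-up split (the labels and the grouping are invented purely for illustration):

```python
import numpy as np

def H(labels):
    """Shannon entropy (bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = ['Y', 'Y', 'Y', 'N', 'N', 'N']            # 3 Yes / 3 No -> 1.0 bit
children = [['Y', 'Y', 'Y', 'N'], ['N', 'N']]       # outcome of one toy question
weighted = sum(len(c) / len(parent) * H(c) for c in children)
print(f"IG = {H(parent) - weighted:.3f} bits")       # 1.0 - (4/6)*0.811 = 0.459
```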
Worked Example: Which Attribute Splits the Weather Data Best?
| Day | Outlook | Temp | Humidity | Wind | Play Tennis? |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
Parent entropy H(S) = 0.940 bits. Now we calculate IG for each attribute.
Outlook (Sunny: 2 Yes / 3 No, Overcast: 4 / 0, Rain: 3 / 2):
H(Sunny) = −0.4·log₂0.4 − 0.6·log₂0.6 = 0.529 + 0.442 = 0.971 bits
H(Overcast) = 0 − 0 = 0.000 bits
H(Rain) = −0.6·log₂0.6 − 0.4·log₂0.4 = 0.442 + 0.529 = 0.971 bits
Weighted = (5/14)·0.971 + (4/14)·0.000 + (5/14)·0.971 = 0.347 + 0.000 + 0.347 = 0.694 bits
IG(S, Outlook) = 0.940 − 0.694 = 0.246 bits
Humidity (High: 3 Yes / 4 No, Normal: 6 / 1):
H(High) = −(3/7)·log₂(3/7) − (4/7)·log₂(4/7) = 0.524 + 0.461 = 0.985 bits
H(Normal) = −(6/7)·log₂(6/7) − (1/7)·log₂(1/7) = 0.191 + 0.401 = 0.592 bits
Weighted = (7/14)·0.985 + (7/14)·0.592 = 0.789 bits
IG(S, Humidity) = 0.940 − 0.789 = 0.151 bits
Wind (Weak: 6 Yes / 2 No, Strong: 3 / 3):
H(Weak) = 0.811 bits | H(Strong) = 1.000 bits
Weighted = (6/14)·1.0 + (8/14)·0.811 = 0.892
IG(S, Wind) = 0.940 − 0.892 = 0.048 bits
Temp (Hot: 2 Yes / 2 No, Mild: 4 / 2, Cool: 3 / 1):
H(Hot) = 1.000 bits | H(Mild) = 0.918 bits | H(Cool) = 0.811 bits
Weighted = (4/14)·1.0 + (6/14)·0.918 + (4/14)·0.811 = 0.911
IG(S, Temp) = 0.940 − 0.911 = 0.029 bits
Outlook is selected as the root node with IG = 0.246 bits. Temp earns only 0.029 bits; it is outcompeted at every node and ends up never being used in the final tree.
Gini Impurity: The CART Algorithm's Measure
Gini Impurity has a simple interpretation: pick a random sample from the node, then label it randomly according to the node's class distribution. Gini is the probability that I mislabel it.
If the node is pure (all one class), I'll never mislabel it, so Gini = 0.
If the node is 50/50, I'll mislabel half the time, so Gini = 0.5.
No logarithms. No bits. Just a probability between 0 and 0.5 (for binary problems). This is why scikit-learn defaults to Gini: it's computationally faster while capturing essentially the same information.
The formula: Gini = 1 − Σᵢ pᵢ². Sum the squared class proportions, subtract from 1.
Range: 0 (pure) to (n−1)/n for n classes. Binary max = 0.5.
Gini Gain = Gini(parent) − Gini_split, where Gini_split is the weighted Gini of the child nodes.
Pick the split with the highest Gini Gain.
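A minimal sketch of the calculation; the counts below are the 9 Yes / 5 No root of the weather dataset plus two extreme nodes for comparison:

```python
import numpy as np

def gini(counts):
    """Gini impurity from raw class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(f"Gini(root)  = {gini([9, 5]):.3f}")    # -> 0.459
print(f"Gini(pure)  = {gini([14, 0]):.3f}")   # -> 0.000
print(f"Gini(50/50) = {gini([7, 7]):.3f}")    # -> 0.500
```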
Gini Values Across Class Distributions
| Distribution [Yes, No] | p(Yes) | Gini = 1 − (p² + q²) | Interpretation |
|---|---|---|---|
| [10, 0] → all Yes | 1.00 | 1 − (1.00² + 0.00²) = 0.000 | Pure, no impurity |
| [8, 2] → mostly Yes | 0.80 | 1 − (0.80² + 0.20²) = 1 − 0.68 = 0.320 | Low impurity |
| [6, 4] → slightly skewed | 0.60 | 1 − (0.60² + 0.40²) = 1 − 0.52 = 0.480 | Moderate impurity |
| [5, 5] → perfectly mixed | 0.50 | 1 − (0.50² + 0.50²) = 1 − 0.50 = 0.500 | Maximum impurity (binary) |
Gini Calculation: Same Weather Dataset
Gini(parent) = 1 − ((9/14)² + (5/14)²) ≈ 0.460
Outlook: Gini(Sunny) = 0.480, Gini(Overcast) = 0.000, Gini(Rain) = 0.480, so the weighted child Gini = (5/14)·0.480 + (4/14)·0.000 + (5/14)·0.480 ≈ 0.343 and Gain = 0.460 − 0.343 = 0.117.
Humidity: Gini(High) = 0.490, Gini(Normal) = 0.245, weighted child Gini = (7/14)·0.490 + (7/14)·0.245 ≈ 0.367, so Gain = 0.460 − 0.367 = 0.093.
Outlook again has the largest gain, so Gini agrees with entropy on the root split.
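These figures can be reproduced in a few lines; a sketch from the per-child class counts (expect last-digit differences, because the values above were rounded before subtracting):

```python
import numpy as np

def gini(counts):
    """Gini impurity from raw class counts."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

parent = gini([9, 5])
# Outlook children: Sunny 2Y/3N, Overcast 4Y/0N, Rain 3Y/2N
outlook = (5/14)*gini([2, 3]) + (4/14)*gini([4, 0]) + (5/14)*gini([3, 2])
# Humidity children: High 3Y/4N, Normal 6Y/1N
humidity = (7/14)*gini([3, 4]) + (7/14)*gini([6, 1])
print(f"Gini(parent) = {parent:.3f}")                                          # -> 0.459
print(f"Outlook : weighted = {outlook:.3f}, gain = {parent - outlook:.3f}")    # -> 0.343, 0.116
print(f"Humidity: weighted = {humidity:.3f}, gain = {parent - humidity:.3f}")  # -> 0.367, 0.092
```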
Gini for Multiple Classes: Maximum Values
Just like entropy, the maximum Gini Impurity increases with the number of classes. The formula for maximum Gini is (n−1)/n, occurring when all classes are equally likely.
| Classes (n) | Max Gini = (n−1)/n | Worked Example (equal probs) |
|---|---|---|
| 2 | 0.500 | p=0.5 each: 1−(0.5²+0.5²) = 1−0.5 = 0.500 |
| 3 | 0.667 | p=1/3 each: 1−3·(1/3)² = 1−3·(1/9) = 1−1/3 = 0.667 |
| 4 | 0.750 | p=0.25 each: 1−4·(0.25²) = 1−4·0.0625 = 1−0.25 = 0.750 |
| 5 | 0.800 | p=0.2 each: 1−5·(0.2²) = 1−5·0.04 = 1−0.20 = 0.800 |
| 10 | 0.900 | p=0.1 each: 1−10·(0.1²) = 1−10·0.01 = 1−0.10 = 0.900 |
Entropy vs Gini: Head-to-Head
All three curves are symmetric and peak at p = 0.5. Entropy (purple) rises more steeply away from the extremes; it is more sensitive to small probability changes near p = 0 and p = 1. Gini (blue) is a smoother quadratic approximation. Misclassification rate (dashed) is piecewise linear; it is rarely used for splitting because it lacks the mathematical properties the other two have.
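A sketch that reproduces the three curves, assuming Matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - (p**2 + (1 - p)**2)           # equivalently 2*p*(1-p)
misclass = np.minimum(p, 1 - p)          # misclassification error rate

plt.plot(p, entropy, label='Entropy (bits)')
plt.plot(p, gini, label='Gini impurity')
plt.plot(p, misclass, '--', label='Misclassification rate')
plt.xlabel('p(class = Yes)')
plt.ylabel('Impurity')
plt.legend()
plt.show()
```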
| Gini Impurity | Detail |
|---|---|
| Formula | 1 − Σ pᵢ² |
| Binary range | 0 to 0.5 |
| n-class max | (n−1)/n |
| Computation | Fast: squaring only, no log |
| Behaviour | Tends toward larger balanced partitions |
| Sensitivity | Less sensitive at extremes (near-pure nodes) |
| Used by | CART, sklearn DecisionTree (default) |
| Entropy | Detail |
|---|---|
| Formula | −Σ pᵢ · log₂(pᵢ) |
| Binary range | 0 to 1.0 |
| n-class max | log₂(n) |
| Computation | Slower: one log per class per node |
| Behaviour | More sensitive to subtle class changes |
| Sensitivity | Higher sensitivity at near-pure nodes |
| Used by | ID3, C4.5, sklearn with criterion='entropy' |
The Bias Problem and Gain Ratio (C4.5)
Imagine you add a "Day ID" column (D1, D2, …, D14) to the weather dataset. Splitting on Day ID gives 14 child nodes, each with exactly 1 sample: perfectly pure! IG = 0.940 bits, the maximum possible. But it's useless: it memorises training data and can't generalise. ID3 would always pick it. This is why Ross Quinlan upgraded ID3 to C4.5, which normalises IG by the feature's own entropy (its "split information").
For Outlook (children of size 5, 4 and 5):
SplitInfo = −(5/14)·log₂(5/14) − (4/14)·log₂(4/14) − (5/14)·log₂(5/14)
= 0.531 + 0.516 + 0.531 = 1.577 bits
GainRatio = 0.246 / 1.577 = 0.156
For Day ID (14 children of size 1):
SplitInfo = 14 × (−(1/14)·log₂(1/14)) = log₂(14) = 3.807 bits
GainRatio = 0.940 / 3.807 = 0.247
Still wins numerically here but much less dominant; and in real datasets Day ID would never appear as a feature for prediction.
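A short sketch that reproduces the SplitInfo and Gain Ratio numbers above (the information gains are taken from the earlier calculation):

```python
import numpy as np

def split_info(sizes):
    """Entropy of the partition sizes themselves: how finely the split fragments the data."""
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    return -np.sum(w * np.log2(w))

ig_outlook, ig_day_id = 0.246, 0.940       # information gains computed earlier

si_outlook = split_info([5, 4, 5])         # Outlook: Sunny / Overcast / Rain
si_day_id  = split_info([1] * 14)          # Day ID: 14 singleton children
print(f"Outlook: SplitInfo = {si_outlook:.3f}, GainRatio = {ig_outlook / si_outlook:.3f}")
print(f"Day ID : SplitInfo = {si_day_id:.3f}, GainRatio = {ig_day_id / si_day_id:.3f}")
# -> Outlook: SplitInfo = 1.577, GainRatio = 0.156
# -> Day ID : SplitInfo = 3.807, GainRatio = 0.247
```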
Python Implementation: From Scratch and with sklearn
Manual Calculation: Entropy, Gini and Information Gain
import numpy as np
# ── Helper functions ──────────────────────────────────────────

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    classes, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    # Convention: 0 * log2(0) = 0
    return -np.sum(probs * np.log2(probs + 1e-12))


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / n
    return 1 - np.sum(probs ** 2)


def information_gain(parent_labels, child_groups):
    """IG = H(parent) - weighted sum of H(children)."""
    n_parent = len(parent_labels)
    h_parent = entropy(parent_labels)
    weighted_child = sum(
        (len(group) / n_parent) * entropy(group)
        for group in child_groups
    )
    return h_parent - weighted_child


def gini_gain(parent_labels, child_groups):
    """Gini Gain = Gini(parent) - weighted Gini(children)."""
    n_parent = len(parent_labels)
    g_parent = gini(parent_labels)
    weighted_child = sum(
        (len(group) / n_parent) * gini(group)
        for group in child_groups
    )
    return g_parent - weighted_child
# ── Play Tennis Dataset ───────────────────────────────────────
y = np.array([0,0,1,1,1,0,1,0,1,1,1,1,1,0])   # 0=No, 1=Yes (days D1..D14)

# Outlook split: pick out the label subset for each Outlook value
sunny    = y[[0, 1, 7, 8, 10]]    # D1, D2, D8, D9, D11
overcast = y[[2, 6, 11, 12]]      # D3, D7, D12, D13
rain     = y[[3, 4, 5, 9, 13]]    # D4, D5, D6, D10, D14
print("=== ENTROPY ===")
print(f"H(parent) = {entropy(y):.4f}")
print(f"H(Sunny) = {entropy(sunny):.4f}")
print(f"H(Overcast)= {entropy(overcast):.4f}")
print(f"H(Rain) = {entropy(rain):.4f}")
print(f"IG(Outlook)= {information_gain(y, [sunny, overcast, rain]):.4f}")
print("\n=== GINI ===")
print(f"Gini(parent) = {gini(y):.4f}")
print(f"Gini(Sunny) = {gini(sunny):.4f}")
print(f"Gini(Overcast)= {gini(overcast):.4f}")
print(f"Gini(Rain) = {gini(rain):.4f}")
print(f"Gini Gain(Outlook) = {gini_gain(y, [sunny, overcast, rain]):.4f}")
Compare All Attributes at Once
import pandas as pd
# Full dataset as a DataFrame
df = pd.DataFrame({
'Outlook' : ['Sunny','Sunny','Overcast','Rain','Rain','Rain',
'Overcast','Sunny','Sunny','Rain','Sunny',
'Overcast','Overcast','Rain'],
'Temp' : ['Hot','Hot','Hot','Mild','Cool','Cool','Cool',
'Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
'Humidity': ['High','High','High','High','Normal','Normal','Normal',
'High','Normal','Normal','Normal','High','Normal','High'],
'Wind' : ['Weak','Strong','Weak','Weak','Weak','Strong','Strong',
'Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
'Play' : [0,0,1,1,1,0,1,0,1,1,1,1,1,0]
})
results = []
for col in ['Outlook', 'Temp', 'Humidity', 'Wind']:
    groups = [df[df[col] == v]['Play'].values for v in df[col].unique()]
    ig = information_gain(df['Play'].values, groups)
    gg = gini_gain(df['Play'].values, groups)
    results.append({'Attribute': col, 'Info Gain': ig, 'Gini Gain': gg})

summary = pd.DataFrame(results).sort_values('Info Gain', ascending=False)
print(summary.to_string(index=False, float_format='{:.4f}'.format))
sklearn: Switching Between Gini and Entropy
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
X_raw = df[['Outlook','Temp','Humidity','Wind']].values
y = df['Play'].values
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)
for criterion in ['gini', 'entropy']:
    dt = DecisionTreeClassifier(criterion=criterion, random_state=42)
    dt.fit(X, y)
    print(f"\n=== criterion='{criterion}' ===")
    print(export_text(dt, feature_names=['Outlook','Temp','Humidity','Wind']))
    print("Feature importances:")
    for f, imp in zip(['Outlook','Temp','Humidity','Wind'], dt.feature_importances_):
        print(f"  {f:<12} {imp:.4f}")
Gini and Entropy agree on Outlook as the root, Humidity on the Sunny branch, and Wind on the Rain branch. Temp gets zero importance from both. On real-world noisy data, they may diverge slightly, but the difference in final accuracy is almost always under 1%. Use Gini for speed (the default); use Entropy when you want the information-theoretic interpretation in bits.
Side-by-Side Split Quality Diagram
The diagram below shows the same parent node being split two ways. Split A is good: children are purer. Split B is poor: children are as mixed as the parent.
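The same contrast in numbers, using hypothetical node compositions (parent 5 Yes / 5 No; Split A children 4/1 and 1/4; Split B children 3/2 and 2/3):

```python
import numpy as np

def entropy(counts):
    """Entropy in bits from raw class counts."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def weighted_child_entropy(children):
    """Size-weighted average entropy of the child nodes."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent = [5, 5]                     # hypothetical parent: 5 Yes / 5 No -> 1.0 bit
split_a = [[4, 1], [1, 4]]          # good split: children mostly pure
split_b = [[3, 2], [2, 3]]          # poor split: children still mixed
print(f"Parent entropy           = {entropy(parent):.3f} bits")
print(f"Split A weighted entropy = {weighted_child_entropy(split_a):.3f} bits")  # ~0.722
print(f"Split B weighted entropy = {weighted_child_entropy(split_b):.3f} bits")  # ~0.971
```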