The Story That Explains Random Forest
Picture a quiz show: you can phone one expert friend, or poll an audience of 500 strangers. The audience wins almost every time, not because any single person in it is smarter than your expert friend, but because the collective errors cancel out. When 500 people independently guess, their random mistakes go in all directions and average out to near zero. Your expert friend's one blind spot could cost you everything.
That is the entire idea behind Random Forest.
A Random Forest is an ensemble of many decision trees, each trained on a slightly different random slice of your data and a random subset of features. When predicting, every tree votes, and the majority wins (classification) or the average is taken (regression). No single tree needs to be perfect; together they are.
If you train many different but decent models and combine their predictions, the ensemble is almost always more accurate and more stable than any individual model. This is called ensemble learning, and Random Forest is its most battle-tested form.
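You can see the error-cancelling effect with a few lines of NumPy. This is a minimal sketch of the "wisdom of crowds" claim above; the 500 guessers, the true value of 100, and the noise level of 20 are illustrative numbers, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 100.0

# 500 independent guessers: each is noisy but unbiased
guesses = true_value + rng.normal(0, 20, size=500)

print(f"Typical individual error : {np.abs(guesses - true_value).mean():.1f}")
print(f"Error of the average guess: {abs(guesses.mean() - true_value):.1f}")
```

The individual errors stay large, but the averaged guess lands very close to the true value, which is exactly what the forest's aggregation does for its trees.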
The Foundation: Decision Trees (Quick Recap)
Before building a forest, you need to understand a single tree. A decision tree splits your data by asking questions, one feature at a time, until it reaches a prediction.
A single deep decision tree memorises the training data. It learns everything: the noise, the outliers, the one weird row at position 347. It scores 99% on training data and 61% on new data. This is overfitting, and it kills model usefulness in production.
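You can reproduce that train/test gap with a few lines of scikit-learn. The dataset here is synthetic (with label noise added via flip_y), so the exact scores will differ from the 99%/61% figures above, but the pattern is the same:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           flip_y=0.1, random_state=42)   # flip_y injects label noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unpruned tree keeps splitting until it classifies the training set perfectly
tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)
print(f"Train accuracy: {tree.score(X_train, y_train):.2f}")   # close to 1.00
print(f"Test accuracy:  {tree.score(X_test, y_test):.2f}")     # noticeably lower
```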
The Two Ingredients of Random Forest
Random Forest solves overfitting by injecting randomness in two separate ways. Both are essential; neither alone is enough.
1. Bootstrap sampling (bagging): each tree trains on a random sample of rows drawn with replacement from the training set.
2. Random feature selection: at every split, each tree considers only a random subset of the features.
Bagging alone still lets every tree see all features, so dominant features dominate every tree and the trees stay correlated. Feature randomness alone, without bagging, means every tree trains on the same data. You need both to create truly diverse, independent trees whose errors cancel rather than amplify.
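As a rough illustration of why the second ingredient matters, you can compare a "bagging only" forest (every tree sees every feature at each split, via max_features=None) with a standard Random Forest. The dataset is synthetic and the exact scores will vary; this is only a sketch of the comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=4,
                           random_state=0)

# Bagging only: all features available at every split, so trees stay similar
bagging_only = RandomForestClassifier(n_estimators=200, max_features=None,
                                      random_state=0)
# Bagging + random feature subsets: the standard Random Forest recipe
full_rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                                 random_state=0)

print("Bagging only:", cross_val_score(bagging_only, X, y, cv=5).mean())
print("Full forest :", cross_val_score(full_rf, X, y, cv=5).mean())
```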
How Bootstrap Sampling Works
| Tree | Bootstrap Sample (row indices) | Out-of-Bag Rows (unseen) |
|---|---|---|
| Tree 1 | 1, 3, 3, 5, 7, 7, 9, 2, 4, 4 | 6, 8, 10 |
| Tree 2 | 2, 4, 6, 6, 8, 1, 3, 9, 9, 5 | 7, 10 |
| Tree 3 | 10, 2, 5, 5, 1, 8, 3, 7, 6, 6 | 4, 9 |
| Tree 4 | 7, 9, 1, 4, 4, 6, 2, 10, 3, 8 | 5 |
Each tree has rows it has never seen: the out-of-bag (OOB) rows. You can test each tree on its OOB rows and aggregate these scores to get an OOB score, a free, unbiased estimate of model accuracy without ever touching a separate test set. Set oob_score=True in sklearn to get this automatically.
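A minimal sketch of how one bootstrap sample and its OOB rows can be generated; the 10 rows match the toy table above, and the specific indices depend on the random seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10

# Draw a bootstrap sample: n_rows indices sampled *with replacement*
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Out-of-bag rows are the ones the sample never picked
oob_idx = np.setdiff1d(np.arange(n_rows), bootstrap_idx)

print("Bootstrap sample:", np.sort(bootstrap_idx))
print("Out-of-bag rows :", oob_idx)
```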
[Diagram: Inside a Random Forest]
Classification vs Regression
Classification (majority vote):

| Tree | Prediction |
|---|---|
| Tree 1 | Spam |
| Tree 2 | Spam |
| Tree 3 | Not Spam |
| Tree 4 | Spam |
| Tree 5 | Spam |
| Final | Spam (4/5 votes) |
Regression (average of the trees' predictions):

| Tree | Predicted Price |
|---|---|
| Tree 1 | £312,000 |
| Tree 2 | £298,500 |
| Tree 3 | £325,000 |
| Tree 4 | £310,000 |
| Tree 5 | £304,500 |
| Final | £310,000 (mean) |
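If you want to see the individual votes yourself, a fitted forest exposes its trees through estimators_. This sketch assumes a fitted RandomForestClassifier named rf and a single sample x of shape (1, n_features); note that the sub-trees predict encoded class indices, which classes_ maps back to labels, and that scikit-learn actually aggregates by averaging the trees' class probabilities, which normally agrees with the majority vote:

```python
import numpy as np

# Each sub-tree predicts an encoded class index (0 .. n_classes-1)
votes = np.array([int(tree.predict(x)[0]) for tree in rf.estimators_])
counts = np.bincount(votes, minlength=len(rf.classes_))

print("Votes per class:", dict(zip(rf.classes_, counts)))
print("Majority vote  :", rf.classes_[counts.argmax()])
print("Forest predict :", rf.predict(x)[0])   # averages class probabilities internally
```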
Python Implementation
Classification: Predicting Customer Churn
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
import numpy as np

# Simulated dataset: 5000 customers, 15 features
X, y = make_classification(
    n_samples=5000, n_features=15,
    n_informative=10, n_redundant=3,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build the Random Forest
rf = RandomForestClassifier(
    n_estimators=300,        # number of trees
    max_features='sqrt',     # features per split -> sqrt(15) ≈ 3.9, floored to 3
    max_depth=None,          # grow trees fully
    min_samples_leaf=1,      # minimum samples at a leaf
    bootstrap=True,          # enable bagging
    oob_score=True,          # free OOB validation
    n_jobs=-1,               # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

print(f"OOB Score (free accuracy estimate): {rf.oob_score_:.4f}")
print(f"Test Accuracy: {rf.score(X_test, y_test):.4f}")

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
```
Regression: Predicting House Prices
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

housing = fetch_california_housing()
X, y = housing.data, housing.target   # target is the median house value in $100,000s

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_reg = RandomForestRegressor(
    n_estimators=300,
    max_features='sqrt',   # or a fraction such as 1/3 of the features
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
rf_reg.fit(X_train, y_train)

y_pred = rf_reg.predict(X_test)
print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")   # convert to dollars
```
Always set n_jobs=-1. Random Forest trees are built independently, so they are trivially parallelisable. On an 8-core machine this can be 6-8× faster with no change to results.
Key Hyperparameters
| Parameter | Default | What It Controls | Tuning Advice |
|---|---|---|---|
| n_estimators | 100 | Number of trees in the forest | More = better, up to a plateau. Start at 300. |
| max_features | 'sqrt' | Features considered per split | Tune this first: biggest impact on accuracy. |
| max_depth | None | Maximum depth of each tree | None = full depth. Limit if RAM is a concern. |
| min_samples_leaf | 1 | Min samples required at a leaf | Increase (2-10) to reduce overfitting on noisy data. |
| min_samples_split | 2 | Min samples to split a node | Rarely needs tuning alongside min_samples_leaf. |
| bootstrap | True | Whether to use bagging | Keep True. False removes the core diversity mechanism. |
| oob_score | False | Compute OOB validation score | Always set True: it's free cross-validation. |
| class_weight | None | Weights for imbalanced classes | Use 'balanced' for imbalanced datasets. |
Hyperparameter Tuning with RandomizedSearchCV
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(100, 600),
    'max_features': ['sqrt', 'log2', 0.3, 0.5],
    'max_depth': [None, 10, 20, 30, 50],
    'min_samples_leaf': randint(1, 10),
    'min_samples_split': randint(2, 15),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=42),
    param_distributions=param_dist,
    n_iter=50,              # try 50 random combinations
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV F1: ", search.best_score_)
```
Feature Importance
One of Random Forest's greatest practical advantages is its built-in ability to rank which features drive predictions. This is called Mean Decrease in Impurity (MDI): it measures how much each feature reduces Gini impurity (or MSE for regression) across all trees, on average.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

feature_names = housing.feature_names   # ['MedInc', 'HouseAge', 'AveRooms', ...]
importances = rf_reg.feature_importances_
std = np.std([t.feature_importances_ for t in rf_reg.estimators_], axis=0)

feat_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances,
    'std': std
}).sort_values('importance', ascending=False)
print(feat_df.to_string(index=False))

# Plot
feat_df.plot.barh(x='feature', y='importance', xerr='std', legend=False)
plt.xlabel('Mean Decrease in Impurity')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
```
MDI feature importance is biased toward high-cardinality features (numerical columns with many unique values) and can mislead when features are correlated. For more reliable rankings, use Permutation Importance (sklearn.inspection.permutation_importance), which works with any fitted model and avoids the cardinality bias at the cost of more compute, though correlated features can still share importance.
Permutation Importance (More Reliable)
```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf_reg, X_test, y_test,
    n_repeats=10,
    n_jobs=-1,
    random_state=42
)

perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean,
    'std': result.importances_std
}).sort_values('importance', ascending=False)
print(perm_df.to_string(index=False))
```
Bias-Variance Tradeoff: Single Tree vs Forest
If you have N trees, each with variance σ² and pairwise correlation ρ, the variance of the forest's average is:

Var(forest) = ρσ² + ((1 − ρ)/N)·σ²

As N → ∞ the second term vanishes. But the first term ρσ² remains: it is the irreducible correlation between trees. This is why reducing ρ (by randomising features) matters as much as increasing N.
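Plugging in illustrative numbers (σ² = 1 and ρ = 0.3, chosen only for the arithmetic, not taken from the text) shows how quickly the second term disappears and where the floor sits:

```python
def forest_variance(rho, sigma2, n_trees):
    """Variance of the average of n_trees equally correlated trees."""
    return rho * sigma2 + (1 - rho) / n_trees * sigma2

sigma2, rho = 1.0, 0.3                      # illustrative values
for n in (1, 10, 100, 1000):
    print(f"N={n:>4}: variance = {forest_variance(rho, sigma2, n):.3f}")
# Drops from 1.000 towards the floor of rho * sigma2 = 0.300
```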
| Model | Bias | Variance | Overfit Risk | Typical Use |
|---|---|---|---|---|
| Single shallow tree | High | Low | Low | Simple, interpretable rules |
| Single deep tree | Low | Very High | Severe | Rarely useful alone |
| Random Forest | Low-Med | Low | Low | General-purpose workhorse |
| Gradient Boosting | Very Low | Medium | Moderate | Maximum accuracy, needs tuning |
When to Use Random Forest
Reach for it when you need a strong tabular baseline fast: it requires no feature scaling, handles both classification and regression, is robust to noisy data with sensible defaults, and gives you feature importance and an OOB accuracy estimate essentially for free.
Handling Class Imbalance
Random Forest can struggle with heavily imbalanced datasets (e.g. fraud detection: 99% normal, 1% fraud). The forest majority-votes, so rare classes get drowned out. Here are your options:
- class_weight='balanced': sklearn automatically weights classes inversely to their frequency. Easy, no data change needed.
- class_weight='balanced_subsample': applies balanced weights independently to each bootstrap sample. Often better than global balancing.
- BalancedRandomForestClassifier from imbalanced-learn: undersamples the majority class in each bootstrap sample automatically.
- Threshold tuning: use predict_proba and lower the decision threshold from 0.5 (e.g. to 0.3) to favour recall on the minority class; a sketch follows the code below.
```python
from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(
    n_estimators=300,
    sampling_strategy='auto',   # undersample the majority class to match the minority
    replacement=True,
    n_jobs=-1,
    random_state=42
)
brf.fit(X_train, y_train)

# Or with class_weight in standard sklearn:
rf_balanced = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',
    n_jobs=-1,
    random_state=42
)
rf_balanced.fit(X_train, y_train)
```
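And the threshold option from the list above, as a minimal sketch for a binary classifier such as rf_balanced; the 0.3 cut-off is only an example, and in practice you would pick the threshold from a precision-recall curve:

```python
# Probability of the positive class (class 1) for each test sample
proba_pos = rf_balanced.predict_proba(X_test)[:, 1]

# The default predict() is equivalent to a 0.5 threshold; lowering it favours recall
y_pred_default = (proba_pos >= 0.5).astype(int)
y_pred_lowered = (proba_pos >= 0.3).astype(int)

print("Positives flagged at 0.5:", y_pred_default.sum())
print("Positives flagged at 0.3:", y_pred_lowered.sum())
```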
A Complete Real-World Example: Titanic Survival
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Load data
df = pd.read_csv('titanic.csv')

# Simple preprocessing
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('S')
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = df[features]
y = df['Survived']

# Note: Random Forest needs NO feature scaling
rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    max_depth=None,
    min_samples_leaf=2,
    class_weight='balanced',
    oob_score=True,
    n_jobs=-1,
    random_state=42
)

# 5-fold cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# Fit and check OOB
rf.fit(X, y)
print(f"OOB Score: {rf.oob_score_:.4f}")

# Feature importance
for feat, imp in sorted(zip(features, rf.feature_importances_),
                        key=lambda x: -x[1]):
    print(f"  {feat:12s}: {imp:.4f}")
```
Notice we did not scale Age, Fare, or any other feature. Decision trees split on thresholds, not distances, so the magnitude of a feature is irrelevant. This makes Random Forest significantly easier to use than SVMs, KNN, or neural networks, which all require careful normalisation.
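If you want to convince yourself, a quick check (assuming the X, y from the Titanic example above) is to standardise the features and confirm that the predictions do not change; agreement should be at or extremely close to 1.0:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

rf_raw    = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_scaled, y)

# Same split structure, same trees, same predictions: scaling changed nothing
agreement = np.mean(rf_raw.predict(X) == rf_scaled.predict(X_scaled))
print(f"Prediction agreement: {agreement:.4f}")   # expect ~1.0
```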
Random Forest vs Gradient Boosting: Quick Comparison
| Property | Random Forest | Gradient Boosting (XGBoost/LightGBM) |
|---|---|---|
| How trees are built | Parallel, independent | Sequential, each corrects prior errors |
| Overfitting risk | Low: bagging protects well | Higher: needs careful learning-rate tuning |
| Training speed | Fast (parallelisable) | Slower (sequential by nature) |
| Hyperparameter sensitivity | Low: good defaults work well | High: needs tuning to shine |
| Peak accuracy | Very good | Typically highest on tabular data |
| Missing values | Needs imputation | XGBoost handles natively |
| Best choice when... | Fast baseline, robust defaults needed | Maximising accuracy, competition setting |
Start with Random Forest as your baseline: it's fast, robust, and almost always competitive with zero tuning. Once you need that last 2-3% of accuracy, move to LightGBM or XGBoost and tune carefully. Many production systems run Random Forest permanently because the marginal gain from boosting doesn't justify the operational complexity.
Golden Rules
- Always set n_jobs=-1. Trees are independent, so parallelisation is free performance. Leaving this at the default (1 core) on a 16-core machine means you're waiting 16× longer than necessary.
- Always set oob_score=True. You get a statistically valid estimate of generalisation accuracy at zero extra cost. Use it before even touching your test set.
- Tune max_features first: it has the biggest impact on accuracy and tree diversity. Then min_samples_leaf (2-5 reduces overfitting on noisy data). Tune n_estimators last: more trees always help, but with diminishing returns after ~300-500.
- For imbalanced classes, use class_weight='balanced_subsample' as a starting point. Evaluate using F1 or AUC-ROC, not accuracy: accuracy is misleading when classes are skewed.
- For trustworthy feature rankings, prefer permutation importance (sklearn.inspection.permutation_importance). Slower, but unbiased.