
Random Forest

A comprehensive, story-driven tutorial on Random Forests β€” covering the core intuition, bias-variance theory, feature importance, OOB validation, fraud detection case study, and production-ready Python code.

Section 01

The Story That Explains Random Forest

Who Wants to Be a Millionaire β€” Audience Lifeline
Imagine you are on a quiz show. The question is hard. You have two lifelines left: Phone a Friend (one very smart person) or Ask the Audience (500 random people vote). Which do you trust more?

The audience wins almost every time β€” not because any single person in it is smarter than your expert friend, but because the collective errors cancel out. When 500 people independently guess, their random mistakes go in all directions and average to near zero. Your expert friend's one blind spot could cost you everything.

That is the entire idea behind Random Forest.

A Random Forest is an ensemble of many decision trees, each trained on a slightly different random slice of your data and a random subset of features. When predicting, every tree votes, and the majority wins (classification) or the average is taken (regression). No single tree needs to be perfect β€” together they are.

🌲
The Core Insight

If you train many different but decent models and combine their predictions, the ensemble is almost always more accurate and more stable than any individual model. This is called ensemble learning β€” and Random Forest is its most battle-tested form.


Section 02

The Foundation β€” Decision Trees (Quick Recap)

Before building a forest, you need to understand a single tree. A decision tree splits your data by asking questions β€” one feature at a time β€” until it reaches a prediction.

🌳 A Single Decision Tree β€” Predicting Loan Default
Root: Income < Β£30,000? β†’ If YES go left, if NO go right
Node: Credit Score < 600? β†’ Splits further on both branches
Node: Loan Amount > Β£50,000? β†’ Further refinement
Leaf: Final prediction: Default or No Default
⚠️
The Fatal Flaw of a Single Decision Tree

A single deep decision tree memorises the training data. It learns the noise, the outliers, the one weird row at position 347 β€” everything. It scores 99% on training data and 61% on new data. This is overfitting, and it kills model usefulness in production.
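You can reproduce this gap in a few lines. The sketch below uses synthetic data with some label noise, so the exact scores will differ from the 99%/61% illustration above, but the pattern is the point: near-perfect training accuracy, noticeably weaker test accuracy.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (flip_y adds label noise the tree will memorise)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
print(f"Train accuracy: {tree.score(X_train, y_train):.2f}")   # near-perfect: memorised
print(f"Test accuracy:  {tree.score(X_test, y_test):.2f}")     # noticeably lower: overfit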


Section 03

The Two Ingredients of Random Forest

Random Forest solves overfitting by injecting randomness in two separate ways. Both are essential β€” neither alone is enough.

🎲
Ingredient 1 β€” Bagging
Bootstrap Aggregating
Each tree is trained on a random bootstrap sample of the training data β€” drawn with replacement. If you have 1,000 rows, each tree's 1,000-draw sample contains ~632 unique rows on average; the remaining draws repeat rows already picked. Each tree gets a different dataset. Different datasets β†’ different trees β†’ different errors.
🎯
Ingredient 2 β€” Feature Randomness
Random Subspace Method
At each split, a tree only considers a random subset of features, not all of them. For classification the usual default is √p features; for regression the classic rule of thumb is p/3 (recent scikit-learn versions default the regressor to all features, so set max_features yourself if you want this behaviour). This forces trees to be different from each other even when trained on similar data.
🗣️
The Aggregation
Majority Vote / Average
After training 100–500 trees, every tree makes a prediction for new data. For classification: majority vote wins. For regression: predictions are averaged. Individual mistakes cancel out. Consensus emerges.
🔑
Why Both Ingredients Are Needed

Bagging alone still lets every tree see all features β€” so dominant features dominate every tree and they stay correlated. Feature randomness alone without bagging means every tree trains on the same data. You need both to create truly diverse, independent trees whose errors cancel rather than amplify.
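If you want to check this claim empirically, here is a quick sketch that compares bagging alone, feature randomness alone, and the standard combination on a synthetic dataset. The exact gaps depend heavily on the data, so treat the numbers as illustrative rather than a benchmark.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=25, n_informative=6,
                           flip_y=0.1, random_state=0)

configs = {
    "bagging only (all features per split)": dict(bootstrap=True,  max_features=None),
    "feature randomness only (no bagging)":  dict(bootstrap=False, max_features="sqrt"),
    "both (standard Random Forest)":         dict(bootstrap=True,  max_features="sqrt"),
}

for name, kwargs in configs.items():
    rf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1, **kwargs)
    print(f"{name:40s}: CV accuracy {cross_val_score(rf, X, y, cv=5).mean():.3f}")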


Section 04

How Bootstrap Sampling Works

The Hat Trick
You have a bag with tickets numbered 1 to 10. For each tree, you pull a ticket, write down the number, then put the ticket back before pulling again. After 10 pulls, you have a set of 10 numbers β€” but some are repeated, and some originals are missing. This is sampling with replacement. Each bootstrap sample is different. Each tree learns from a different "view" of reality.
Tree     Bootstrap Sample (row indices)     Out-of-Bag Rows (unseen)
Tree 1   1, 3, 3, 5, 7, 7, 9, 2, 4, 4       6, 8, 10
Tree 2   2, 4, 6, 6, 8, 1, 3, 9, 9, 5       7, 10
Tree 3   10, 2, 5, 5, 1, 8, 3, 7, 6, 6      4, 9
Tree 4   7, 9, 1, 4, 4, 6, 2, 10, 3, 8      5
🎁
Free Cross-Validation β€” Out-of-Bag Score

Each tree has rows it has never seen β€” the out-of-bag (OOB) rows. You can test each tree on its OOB rows and aggregate these scores to get an OOB score β€” a free, unbiased estimate of model accuracy without ever touching a separate test set. Set oob_score=True in sklearn to get this automatically.
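To make the ~632-out-of-1,000 figure concrete, here is a minimal sketch that draws a single bootstrap sample with NumPy and counts the unique and out-of-bag rows. Any seed works; the proportions hover around 63% unique and 37% out-of-bag.

import numpy as np

rng = np.random.default_rng(42)
n_rows = 1_000
sample = rng.integers(0, n_rows, size=n_rows)        # draw row indices with replacement

unique_rows = np.unique(sample)
oob_rows = np.setdiff1d(np.arange(n_rows), unique_rows)

print(f"Unique rows in bootstrap sample: {unique_rows.size} ({unique_rows.size / n_rows:.1%})")
print(f"Out-of-bag rows (never drawn):   {oob_rows.size} ({oob_rows.size / n_rows:.1%})")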


Section 05

Visual Diagram β€” Inside a Random Forest

01
Full Training Dataset
Start with your complete labelled dataset β€” e.g. 10,000 rows Γ— 20 features. This is the pool from which all bootstrap samples are drawn.
02
Bootstrap Sampling Γ— N trees
For each of the N trees (say 300), draw a random sample of 10,000 rows with replacement. Each sample is different. Each is used to train exactly one tree.
03
Random Feature Selection at Each Split
As each tree grows, every node only evaluates √20 β‰ˆ 4–5 randomly chosen features for the best split, not all 20. This forces diversity between trees.
04
Grow Each Tree to Full Depth
Unlike standalone trees, individual trees in a Random Forest are grown deep (no pruning). Depth adds variance β€” but the forest-level average reduces it.
05
Aggregate Predictions (Vote / Average)
For new data: all 300 trees predict. Classification β†’ majority class wins. Regression β†’ mean of 300 predictions. This is the final output.

Section 06

Classification vs Regression

🗣️ Classification (Voting)
Tree     Prediction
Tree 1   Spam
Tree 2   Spam
Tree 3   Not Spam
Tree 4   Spam
Tree 5   Spam
Final    Spam (4/5 votes)
📊 Regression (Averaging)
Tree     Predicted Price
Tree 1   Β£312,000
Tree 2   Β£298,500
Tree 3   Β£325,000
Tree 4   Β£310,000
Tree 5   Β£304,500
Final    Β£310,000 (mean)
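Here is a short sketch that reproduces both aggregation rules by hand from the individual trees of a fitted forest. One subtlety: scikit-learn's classifier actually averages the trees' predicted class probabilities rather than counting hard votes, but for fully grown trees the two almost always agree. The toy datasets and seeds below are illustrative.

import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: collect each tree's hard prediction for one row, then count votes
Xc, yc = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xc, yc)
votes = np.bincount([int(t.predict(Xc[:1])[0]) for t in clf.estimators_])
print("Votes per class:", votes, "| majority:", votes.argmax(),
      "| clf.predict:", clf.predict(Xc[:1])[0])

# Regression: average each tree's prediction for one row
Xr, yr = make_regression(n_samples=500, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xr, yr)
tree_preds = [t.predict(Xr[:1])[0] for t in reg.estimators_]
print(f"Mean of tree predictions: {np.mean(tree_preds):.1f} | reg.predict: {reg.predict(Xr[:1])[0]:.1f}")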

Section 07

Python Implementation

Classification β€” Predicting Customer Churn

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
import numpy as np

# Simulated dataset: 5000 customers, 15 features
X, y = make_classification(
    n_samples=5000, n_features=15,
    n_informative=10, n_redundant=3,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build the Random Forest
rf = RandomForestClassifier(
    n_estimators=300,        # number of trees
    max_features='sqrt',     # features per split -> sqrt(15) β‰ˆ 3.9, i.e. 3
    max_depth=None,          # grow trees fully
    min_samples_leaf=1,      # minimum samples at a leaf
    bootstrap=True,          # enable bagging
    oob_score=True,          # free OOB validation
    n_jobs=-1,               # use all CPU cores
    random_state=42
)

rf.fit(X_train, y_train)

print(f"OOB Score (free accuracy estimate): {rf.oob_score_:.4f}")
print(f"Test Accuracy:                      {rf.score(X_test, y_test):.4f}")

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
OUTPUT
OOB Score (free accuracy estimate): 0.9085
Test Accuracy:                      0.9120

              precision    recall  f1-score   support

           0       0.92      0.91      0.91       508
           1       0.91      0.91      0.91       492

    accuracy                           0.91      1000

Regression β€” Predicting House Prices

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_reg = RandomForestRegressor(
    n_estimators=300,
    max_features='sqrt',     # or 1/3 of features
    oob_score=True,
    n_jobs=-1,
    random_state=42
)

rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)

print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE:      ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
OUTPUT
R2 Score: 0.8056
MAE:      $32,847
Speed Tip

Always set n_jobs=-1. Random Forest trees are built independently so they are trivially parallelisable. On an 8-core machine this can be 6–8Γ— faster with no change to results.
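A rough way to see the effect on your own machine, assuming a dataset large enough to make training non-trivial (timings are indicative only):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)

for jobs in (1, -1):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=300, n_jobs=jobs, random_state=0).fit(X, y)
    print(f"n_jobs={jobs:>2}: {time.perf_counter() - start:.1f} s")  # speedup depends on core count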


Section 08

Key Hyperparameters

Parameter          | Default | What It Controls                | Tuning Advice
n_estimators       | 100     | Number of trees in the forest   | More = better, up to a plateau. Start at 300.
max_features       | 'sqrt'  | Features considered per split   | Tune this first β€” biggest impact on accuracy.
max_depth          | None    | Maximum depth of each tree      | None = full depth. Limit if RAM is a concern.
min_samples_leaf   | 1       | Min samples required at a leaf  | Increase (2–10) to reduce overfitting on noisy data.
min_samples_split  | 2       | Min samples to split a node     | Rarely needs tuning alongside min_samples_leaf.
bootstrap          | True    | Whether to use bagging          | Keep True. False removes the core diversity mechanism.
oob_score          | False   | Compute OOB validation score    | Always set True β€” it's free cross-validation.
class_weight       | None    | Weights for imbalanced classes  | Use 'balanced' for imbalanced datasets.

Hyperparameter Tuning with RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators':      randint(100, 600),
    'max_features':      ['sqrt', 'log2', 0.3, 0.5],
    'max_depth':         [None, 10, 20, 30, 50],
    'min_samples_leaf':  randint(1, 10),
    'min_samples_split': randint(2, 15),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=42),
    param_distributions=param_dist,
    n_iter=50,           # try 50 random combinations
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV F1: ", search.best_score_)

Section 09

Feature Importance

One of Random Forest's greatest practical advantages is its built-in ability to rank which features drive predictions. This is called Mean Decrease in Impurity (MDI) β€” it measures how much each feature reduces Gini impurity (or MSE for regression) across all trees, on average.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

feature_names = housing.feature_names  # ['MedInc', 'HouseAge', 'AveRooms', ...]

importances = rf_reg.feature_importances_
std         = np.std([t.feature_importances_ for t in rf_reg.estimators_], axis=0)

feat_df = pd.DataFrame({
    'feature':    feature_names,
    'importance': importances,
    'std':        std
}).sort_values('importance', ascending=False)

print(feat_df.to_string(index=False))

# Plot
feat_df.plot.barh(x='feature', y='importance', xerr='std', legend=False)
plt.xlabel('Mean Decrease in Impurity')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
OUTPUT β€” California Housing Feature Importance
Feature      Importance   Std
MedInc       0.5244       Β±0.047   <- Dominant predictor
AveOccup     0.1331       Β±0.024
Latitude     0.0914       Β±0.013
Longitude    0.0889       Β±0.014
HouseAge     0.0530       Β±0.009
AveRooms     0.0437       Β±0.008
AveBedrms    0.0384       Β±0.006
Population   0.0271       Β±0.005
⚠️
MDI Bias β€” Watch Out

MDI feature importance is biased toward high-cardinality features (numerical columns with many unique values) and can mislead when features are correlated. For more reliable rankings, use Permutation Importance (sklearn.inspection.permutation_importance), which is model-agnostic and avoids the cardinality bias at the cost of more compute. Note that strongly correlated features can still share or mask each other's importance under any method, so interpret with care.

Permutation Importance (More Reliable)

from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf_reg, X_test, y_test,
    n_repeats=10,
    n_jobs=-1,
    random_state=42
)

perm_df = pd.DataFrame({
    'feature':    feature_names,
    'importance': result.importances_mean,
    'std':        result.importances_std
}).sort_values('importance', ascending=False)

print(perm_df.to_string(index=False))

Section 10

Bias–Variance Tradeoff: Single Tree vs Forest

📐
The Ensemble Math

If you have N trees each with variance σ² and pairwise correlation ρ, the variance of the forest's average is:

Variance(Forest) = ρ·σ² + (1βˆ’Ο)/N Β· σ²

As N β†’ ∞ the second term vanishes. But the first term ρ·σ² remains β€” it's the irreducible correlation between trees. This is why reducing ρ (by randomising features) matters as much as increasing N.
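Plugging illustrative numbers into the formula makes the point concrete. The sketch below assumes σ² = 1 and a few values of ρ; the real values depend on your data and trees.

# Var(Forest) = rho * sigma2 + ((1 - rho) / N) * sigma2
sigma2 = 1.0                       # variance of a single tree (illustrative)
for rho in (0.9, 0.5, 0.3):        # correlation between trees
    for n_trees in (10, 100, 1000):
        var = rho * sigma2 + (1 - rho) / n_trees * sigma2
        print(f"rho={rho:.1f}  N={n_trees:>4}  forest variance β‰ˆ {var:.3f}")

# With rho=0.3, going from 10 to 1,000 trees only moves variance from 0.37 to ~0.30:
# the floor is rho * sigma2, which only de-correlating the trees can lower.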

Model               | Bias     | Variance  | Overfit Risk | Typical Use
Single shallow tree | High     | Low       | Low          | Simple, interpretable rules
Single deep tree    | Low      | Very High | Severe       | Rarely useful alone
Random Forest       | Low–Med  | Low       | Low          | General-purpose workhorse βœ…
Gradient Boosting   | Very Low | Medium    | Moderate     | Maximum accuracy, needs tuning

Section 11

When to Use Random Forest (and When Not To)

βœ… Tabular Data
Random Forest excels on structured table data with numerical and categorical features. It is the first algorithm most practitioners reach for.
survey, logs, transactions, sensors
βœ… Mixed Feature Types
Handles numerical, ordinal, and (encoded) categorical features natively without needing scaling or normalisation. No StandardScaler required.
no preprocessing overhead
βœ… Feature Selection
Use feature importance to drop irrelevant columns before training a more expensive model. Acts as a built-in feature selector.
dimensionality reduction
❌ Very High-Dimensional Sparse Data
NLP bag-of-words matrices with 50,000+ features β†’ Random Forest struggles. Linear models or gradient boosting with sparse support work better.
text, one-hot with many categories
❌ Extreme Real-Time Latency
Predicting with 500 trees takes more compute than a single model. For sub-millisecond inference, consider a distilled single tree or linear model.
edge devices, streaming at scale
❌ Pure Interpretability Required
A forest of 300 trees cannot be simply explained to a regulator. If you must justify every prediction in plain terms, a shallow single tree or logistic regression is better.
credit scoring regulations, medical

Section 12

Handling Class Imbalance

Random Forest can struggle with heavily imbalanced datasets (e.g. fraud detection: 99% normal, 1% fraud). The forest majority-votes, so rare classes get drowned out. Here are your options:

⚖️ Imbalance Fixes in Random Forest
Option 1
class_weight='balanced' β€” sklearn automatically weights classes inversely to their frequency. Easy, no data change needed.
Option 2
class_weight='balanced_subsample' β€” applies balanced weights independently to each bootstrap sample. Often better than global balancing.
Option 3
Use BalancedRandomForestClassifier from imbalanced-learn β€” undersamples the majority class in each bootstrap sample automatically.
Option 4
Change your prediction threshold. Instead of predicting the majority-vote class, use predict_proba and lower the threshold from 0.5 to 0.3 to favour recall on the minority class (a short sketch follows the code below).
from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(
    n_estimators=300,
    sampling_strategy='auto',  # undersample majority to match minority
    replacement=True,
    n_jobs=-1,
    random_state=42
)
brf.fit(X_train, y_train)

# Or with class_weight in standard sklearn:
rf_balanced = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',
    n_jobs=-1,
    random_state=42
)
rf_balanced.fit(X_train, y_train)
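Option 4 in code, continuing from the rf_balanced model above. A minimal sketch; the 0.3 cut-off is purely illustrative and should really be chosen from a precision-recall curve on validation data.

# Shift the decision threshold using predicted probabilities
proba_minority = rf_balanced.predict_proba(X_test)[:, 1]        # assumes class 1 is the rare class
y_pred_default = (proba_minority >= 0.5).astype(int)            # standard threshold
y_pred_lowered = (proba_minority >= 0.3).astype(int)            # favours recall on the rare class

print("Positives flagged at threshold 0.5:", y_pred_default.sum())
print("Positives flagged at threshold 0.3:", y_pred_lowered.sum())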

Section 13

A Complete Real-World Example β€” Titanic Survival

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Load data
df = pd.read_csv('titanic.csv')

# Simple preprocessing
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('S')
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = df[features]
y = df['Survived']

# Note: Random Forest needs NO feature scaling
rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    max_depth=None,
    min_samples_leaf=2,
    class_weight='balanced',
    oob_score=True,
    n_jobs=-1,
    random_state=42
)

# 5-fold cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# Fit and check OOB
rf.fit(X, y)
print(f"OOB Score:   {rf.oob_score_:.4f}")

# Feature importance
for feat, imp in sorted(zip(features, rf.feature_importances_),
                        key=lambda x: -x[1]):
    print(f"  {feat:12s}: {imp:.4f}")
OUTPUT
CV Accuracy: 0.8249 +/- 0.0183
OOB Score:   0.8305
  Sex         : 0.2851   <- Most important
  Fare        : 0.2234
  Age         : 0.2108
  Pclass      : 0.1271
  SibSp       : 0.0632
  Parch       : 0.0541
  Embarked    : 0.0363
🎯
No Scaling Required β€” Ever

Notice we did not scale Age, Fare, or any other feature. Decision trees split on thresholds, not distances β€” the magnitude of a feature is irrelevant. This makes Random Forest significantly easier to use than SVMs, KNN, or neural networks, which all require careful normalisation.
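A quick sketch that demonstrates the point on a synthetic dataset: the same forest trained on raw and on standardised features makes essentially identical predictions, because every split compares one feature against a threshold.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

rf_raw    = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_scaled, y)

# With the same seed the trees pick the same rows and features, so scaling a
# feature only rescales the split thresholds; the predictions should match.
agreement = (rf_raw.predict(X) == rf_scaled.predict(X_scaled)).mean()
print(f"Prediction agreement with vs without scaling: {agreement:.1%}")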


Section 14

Random Forest vs Gradient Boosting β€” Quick Comparison

Property                   | Random Forest                          | Gradient Boosting (XGBoost/LightGBM)
How trees are built        | Parallel β€” independent                 | Sequential β€” each corrects prior errors
Overfitting risk           | Low β€” bagging protects well            | Higher β€” needs careful learning rate tuning
Training speed             | Fast (parallelisable)                  | Slower (sequential by nature)
Hyperparameter sensitivity | Low β€” good defaults work well          | High β€” needs tuning to shine
Peak accuracy              | Very good                              | Typically highest on tabular data
Missing values             | Needs imputation                       | XGBoost handles natively
Best choice when…          | Fast baseline, robust defaults needed  | Maximising accuracy, competition setting
🏆
The Practitioner's Rule

Start with Random Forest as your baseline β€” it's fast, robust, and almost always competitive with zero tuning. Once you need that last 2–3% accuracy, move to LightGBM or XGBoost and tune carefully. Many production systems run Random Forest permanently because the marginal gain from boosting doesn't justify the operational complexity.


Section 15

Golden Rules

🌲 Random Forest β€” Non-Negotiable Rules
1
Always set n_jobs=-1. Trees are independent β€” parallelisation is free performance. Leaving this at the default (1 core) on a multi-core machine means training takes several times longer than it needs to; the speedup isn't perfectly linear, but it is large.
2
Always enable oob_score=True. You get a statistically valid estimate of generalisation accuracy at zero extra cost. Use it before even touching your test set.
3
Do not scale features. Random Forest uses decision tree splits β€” the relative order of values matters, not their magnitude. Adding StandardScaler is wasted compute and changes nothing.
4
Tune max_features first β€” it has the biggest impact on accuracy and tree diversity. Then min_samples_leaf (2–5 reduces overfitting on noisy data). Tune n_estimators last β€” more trees always help but with diminishing returns after ~300–500.
5
For imbalanced datasets always set class_weight='balanced_subsample' as a starting point. Evaluate using F1 or AUC-ROC, not accuracy β€” accuracy is misleading when classes are skewed.
6
Adding more trees never increases overfitting β€” it can only decrease it or plateau. The only cost of more trees is compute and memory. There is no "too many trees" from an accuracy standpoint.
7
If MDI feature importance rankings feel unreliable (correlated or high-cardinality features dominating), switch to Permutation Importance (sklearn.inspection.permutation_importance). Slower, but unbiased.