The Story That Explains Random Forest
Picture a quiz show: you can phone one expert friend, or poll an audience of 500 strangers. The audience wins almost every time, not because any single person in it is smarter than your expert friend, but because the collective errors cancel out. When 500 people independently guess, their random mistakes go in all directions and average out to near zero. Your expert friend's one blind spot could cost you everything.
That is the entire idea behind Random Forest.
A Random Forest is an ensemble of many decision trees, each trained on a slightly different random slice of your data and a random subset of features. When predicting, every tree votes, and the majority wins (classification) or the average is taken (regression). No single tree needs to be perfect; together they are.
If you train many different but decent models and combine their predictions, the ensemble is almost always more accurate and more stable than any individual model. This is called ensemble learning, and Random Forest is its most battle-tested form.
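You can see the error-cancelling effect with a few lines of NumPy. This is a minimal sketch of the "wisdom of crowds" claim above; the 500 guessers, the true value of 100, and the noise level of 20 are illustrative numbers, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 100.0

# 500 independent guessers: each is noisy but unbiased
guesses = true_value + rng.normal(0, 20, size=500)

print(f"Typical individual error : {np.abs(guesses - true_value).mean():.1f}")
print(f"Error of the average guess: {abs(guesses.mean() - true_value):.1f}")
```

The individual errors stay large, but the averaged guess lands very close to the true value, which is exactly what the forest's aggregation does for its trees.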
The Foundation: Decision Trees (Quick Recap)
Before building a forest, you need to understand a single tree. A decision tree splits your data by asking questions, one feature at a time, until it reaches a prediction.
A single deep decision tree memorises the training data. It learns everything: the noise, the outliers, the one weird row at position 347. It scores 99% on training data and 61% on new data. This is overfitting, and it kills model usefulness in production.
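You can reproduce that train/test gap with a few lines of scikit-learn. The dataset here is synthetic (with label noise added via flip_y), so the exact scores will differ from the 99%/61% figures above, but the pattern is the same:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           flip_y=0.1, random_state=42)   # flip_y injects label noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unpruned tree keeps splitting until it classifies the training set perfectly
tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)
print(f"Train accuracy: {tree.score(X_train, y_train):.2f}")   # close to 1.00
print(f"Test accuracy:  {tree.score(X_test, y_test):.2f}")     # noticeably lower
```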
The Two Ingredients of Random Forest
Random Forest solves overfitting by injecting randomness in two separate ways. Both are essential; neither alone is enough.
1. Bootstrap sampling (bagging): each tree trains on a random sample of rows drawn with replacement from the training set.
2. Random feature selection: at every split, each tree considers only a random subset of the features.
Bagging alone still lets every tree see all features, so dominant features dominate every tree and the trees stay correlated. Feature randomness alone, without bagging, means every tree trains on the same data. You need both to create truly diverse, independent trees whose errors cancel rather than amplify.
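As a rough illustration of why the second ingredient matters, you can compare a "bagging only" forest (every tree sees every feature at each split, via max_features=None) with a standard Random Forest. The dataset is synthetic and the exact scores will vary; this is only a sketch of the comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=4,
                           random_state=0)

# Bagging only: all features available at every split, so trees stay similar
bagging_only = RandomForestClassifier(n_estimators=200, max_features=None,
                                      random_state=0)
# Bagging + random feature subsets: the standard Random Forest recipe
full_rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                                 random_state=0)

print("Bagging only:", cross_val_score(bagging_only, X, y, cv=5).mean())
print("Full forest :", cross_val_score(full_rf, X, y, cv=5).mean())
```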
How Bootstrap Sampling Works
| Tree | Bootstrap Sample (row indices) | Out-of-Bag Rows (unseen) |
|---|---|---|
| Tree 1 | 1, 3, 3, 5, 7, 7, 9, 2, 4, 4 | 6, 8, 10 |
| Tree 2 | 2, 4, 6, 6, 8, 1, 3, 9, 9, 5 | 7, 10 |
| Tree 3 | 10, 2, 5, 5, 1, 8, 3, 7, 6, 6 | 4, 9 |
| Tree 4 | 7, 9, 1, 4, 4, 6, 2, 10, 3, 8 | 5 |
Each tree has rows it has never seen: the out-of-bag (OOB) rows. You can test each tree on its OOB rows and aggregate these scores to get an OOB score, a free, unbiased estimate of model accuracy without ever touching a separate test set. Set oob_score=True in sklearn to get this automatically.
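A minimal sketch of how one bootstrap sample and its OOB rows can be generated; the 10 rows match the toy table above, and the specific indices depend on the random seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10

# Draw a bootstrap sample: n_rows indices sampled *with replacement*
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Out-of-bag rows are the ones the sample never picked
oob_idx = np.setdiff1d(np.arange(n_rows), bootstrap_idx)

print("Bootstrap sample:", np.sort(bootstrap_idx))
print("Out-of-bag rows :", oob_idx)
```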
[Diagram: Inside a Random Forest]
Classification vs Regression
Classification (majority vote):

| Tree | Prediction |
|---|---|
| Tree 1 | Spam |
| Tree 2 | Spam |
| Tree 3 | Not Spam |
| Tree 4 | Spam |
| Tree 5 | Spam |
| Final | Spam (4/5 votes) |
Regression (average of the trees' predictions):

| Tree | Predicted Price |
|---|---|
| Tree 1 | £312,000 |
| Tree 2 | £298,500 |
| Tree 3 | £325,000 |
| Tree 4 | £310,000 |
| Tree 5 | £304,500 |
| Final | £310,000 (mean) |
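If you want to see the individual votes yourself, a fitted forest exposes its trees through estimators_. This sketch assumes a fitted RandomForestClassifier named rf and a single sample x of shape (1, n_features); note that the sub-trees predict encoded class indices, which classes_ maps back to labels, and that scikit-learn actually aggregates by averaging the trees' class probabilities, which normally agrees with the majority vote:

```python
import numpy as np

# Each sub-tree predicts an encoded class index (0 .. n_classes-1)
votes = np.array([int(tree.predict(x)[0]) for tree in rf.estimators_])
counts = np.bincount(votes, minlength=len(rf.classes_))

print("Votes per class:", dict(zip(rf.classes_, counts)))
print("Majority vote  :", rf.classes_[counts.argmax()])
print("Forest predict :", rf.predict(x)[0])   # averages class probabilities internally
```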
Python Implementation
Classification: Predicting Customer Churn
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
import numpy as np

# Simulated dataset: 5000 customers, 15 features
X, y = make_classification(
    n_samples=5000, n_features=15,
    n_informative=10, n_redundant=3,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build the Random Forest
rf = RandomForestClassifier(
    n_estimators=300,        # number of trees
    max_features='sqrt',     # features per split -> sqrt(15) ≈ 3.9, floored to 3
    max_depth=None,          # grow trees fully
    min_samples_leaf=1,      # minimum samples at a leaf
    bootstrap=True,          # enable bagging
    oob_score=True,          # free OOB validation
    n_jobs=-1,               # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

print(f"OOB Score (free accuracy estimate): {rf.oob_score_:.4f}")
print(f"Test Accuracy: {rf.score(X_test, y_test):.4f}")

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
```
Regression: Predicting House Prices
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

housing = fetch_california_housing()
X, y = housing.data, housing.target   # target is the median house value in $100,000s

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_reg = RandomForestRegressor(
    n_estimators=300,
    max_features='sqrt',   # or a fraction such as 1/3 of the features
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
rf_reg.fit(X_train, y_train)

y_pred = rf_reg.predict(X_test)
print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")   # convert to dollars
```
Always set n_jobs=-1. Random Forest trees are built independently, so they are trivially parallelisable. On an 8-core machine this can be 6-8× faster with no change to results.
Key Hyperparameters
| Parameter | Default | What It Controls | Tuning Advice |
|---|---|---|---|
| n_estimators | 100 | Number of trees in the forest | More = better, up to a plateau. Start at 300. |
| max_features | 'sqrt' | Features considered per split | Tune this first: biggest impact on accuracy. |
| max_depth | None | Maximum depth of each tree | None = full depth. Limit if RAM is a concern. |
| min_samples_leaf | 1 | Min samples required at a leaf | Increase (2-10) to reduce overfitting on noisy data. |
| min_samples_split | 2 | Min samples to split a node | Rarely needs tuning alongside min_samples_leaf. |
| bootstrap | True | Whether to use bagging | Keep True. False removes the core diversity mechanism. |
| oob_score | False | Compute OOB validation score | Always set True: it's free cross-validation. |
| class_weight | None | Weights for imbalanced classes | Use 'balanced' for imbalanced datasets. |
Hyperparameter Tuning with RandomizedSearchCV
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(100, 600),
    'max_features': ['sqrt', 'log2', 0.3, 0.5],
    'max_depth': [None, 10, 20, 30, 50],
    'min_samples_leaf': randint(1, 10),
    'min_samples_split': randint(2, 15),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=42),
    param_distributions=param_dist,
    n_iter=50,              # try 50 random combinations
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV F1: ", search.best_score_)
```
Feature Importance
One of Random Forest's greatest practical advantages is its built-in ability to rank which features drive predictions. This is called Mean Decrease in Impurity (MDI): it measures how much each feature reduces Gini impurity (or MSE for regression) across all trees, on average.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

feature_names = housing.feature_names   # ['MedInc', 'HouseAge', 'AveRooms', ...]
importances = rf_reg.feature_importances_
std = np.std([t.feature_importances_ for t in rf_reg.estimators_], axis=0)

feat_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances,
    'std': std
}).sort_values('importance', ascending=False)
print(feat_df.to_string(index=False))

# Plot
feat_df.plot.barh(x='feature', y='importance', xerr='std', legend=False)
plt.xlabel('Mean Decrease in Impurity')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
```
MDI feature importance is biased toward high-cardinality features (numerical columns with many unique values) and can mislead when features are correlated. For more reliable rankings, use Permutation Importance (sklearn.inspection.permutation_importance), which works with any fitted model and avoids the cardinality bias at the cost of more compute, though correlated features can still share importance.
Permutation Importance (More Reliable)
```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf_reg, X_test, y_test,
    n_repeats=10,
    n_jobs=-1,
    random_state=42
)

perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean,
    'std': result.importances_std
}).sort_values('importance', ascending=False)
print(perm_df.to_string(index=False))
```
Bias-Variance Tradeoff: Single Tree vs Forest
If you have N trees, each with variance σ² and pairwise correlation ρ, the variance of the forest's average is:

Var(forest) = ρσ² + ((1 − ρ)/N)·σ²

As N → ∞ the second term vanishes. But the first term ρσ² remains: it is the irreducible correlation between trees. This is why reducing ρ (by randomising features) matters as much as increasing N.
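Plugging in illustrative numbers (σ² = 1 and ρ = 0.3, chosen only for the arithmetic, not taken from the text) shows how quickly the second term disappears and where the floor sits:

```python
def forest_variance(rho, sigma2, n_trees):
    """Variance of the average of n_trees equally correlated trees."""
    return rho * sigma2 + (1 - rho) / n_trees * sigma2

sigma2, rho = 1.0, 0.3                      # illustrative values
for n in (1, 10, 100, 1000):
    print(f"N={n:>4}: variance = {forest_variance(rho, sigma2, n):.3f}")
# Drops from 1.000 towards the floor of rho * sigma2 = 0.300
```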
| Model | Bias | Variance | Overfit Risk | Typical Use |
|---|---|---|---|---|
| Single shallow tree | High | Low | Low | Simple, interpretable rules |
| Single deep tree | Low | Very High | Severe | Rarely useful alone |
| Random Forest | Low-Med | Low | Low | General-purpose workhorse |
| Gradient Boosting | Very Low | Medium | Moderate | Maximum accuracy, needs tuning |
When to Use Random Forest
Reach for it when you need a strong tabular baseline fast: it requires no feature scaling, handles both classification and regression, is robust to noisy data with sensible defaults, and gives you feature importance and an OOB accuracy estimate essentially for free.
Handling Class Imbalance
Random Forest can struggle with heavily imbalanced datasets (e.g. fraud detection: 99% normal, 1% fraud). The forest majority-votes, so rare classes get drowned out. Here are your options:
- class_weight='balanced': sklearn automatically weights classes inversely to their frequency. Easy, no data change needed.
- class_weight='balanced_subsample': applies balanced weights independently to each bootstrap sample. Often better than global balancing.
- BalancedRandomForestClassifier from imbalanced-learn: undersamples the majority class in each bootstrap sample automatically.
- Threshold tuning: use predict_proba and lower the decision threshold from 0.5 (e.g. to 0.3) to favour recall on the minority class; a sketch follows the code below.
```python
from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(
    n_estimators=300,
    sampling_strategy='auto',   # undersample the majority class to match the minority
    replacement=True,
    n_jobs=-1,
    random_state=42
)
brf.fit(X_train, y_train)

# Or with class_weight in standard sklearn:
rf_balanced = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',
    n_jobs=-1,
    random_state=42
)
rf_balanced.fit(X_train, y_train)
```
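And the threshold option from the list above, as a minimal sketch for a binary classifier such as rf_balanced; the 0.3 cut-off is only an example, and in practice you would pick the threshold from a precision-recall curve:

```python
# Probability of the positive class (class 1) for each test sample
proba_pos = rf_balanced.predict_proba(X_test)[:, 1]

# The default predict() is equivalent to a 0.5 threshold; lowering it favours recall
y_pred_default = (proba_pos >= 0.5).astype(int)
y_pred_lowered = (proba_pos >= 0.3).astype(int)

print("Positives flagged at 0.5:", y_pred_default.sum())
print("Positives flagged at 0.3:", y_pred_lowered.sum())
```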
A Complete Real-World Example: Titanic Survival
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Load data
df = pd.read_csv('titanic.csv')

# Simple preprocessing
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('S')
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = df[features]
y = df['Survived']

# Note: Random Forest needs NO feature scaling
rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    max_depth=None,
    min_samples_leaf=2,
    class_weight='balanced',
    oob_score=True,
    n_jobs=-1,
    random_state=42
)

# 5-fold cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# Fit and check OOB
rf.fit(X, y)
print(f"OOB Score: {rf.oob_score_:.4f}")

# Feature importance
for feat, imp in sorted(zip(features, rf.feature_importances_),
                        key=lambda x: -x[1]):
    print(f"  {feat:12s}: {imp:.4f}")
```
Notice we did not scale Age, Fare, or any other feature. Decision trees split on thresholds, not distances, so the magnitude of a feature is irrelevant. This makes Random Forest significantly easier to use than SVMs, KNN, or neural networks, which all require careful normalisation.
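If you want to convince yourself, a quick check (assuming the X, y from the Titanic example above) is to standardise the features and confirm that the predictions do not change; agreement should be at or extremely close to 1.0:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

rf_raw    = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_scaled, y)

# Same split structure, same trees, same predictions: scaling changed nothing
agreement = np.mean(rf_raw.predict(X) == rf_scaled.predict(X_scaled))
print(f"Prediction agreement: {agreement:.4f}")   # expect ~1.0
```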
Random Forest vs Gradient Boosting: Quick Comparison
| Property | Random Forest | Gradient Boosting (XGBoost/LightGBM) |
|---|---|---|
| How trees are built | Parallel, independent | Sequential, each corrects prior errors |
| Overfitting risk | Low: bagging protects well | Higher: needs careful learning-rate tuning |
| Training speed | Fast (parallelisable) | Slower (sequential by nature) |
| Hyperparameter sensitivity | Low: good defaults work well | High: needs tuning to shine |
| Peak accuracy | Very good | Typically highest on tabular data |
| Missing values | Needs imputation | XGBoost handles natively |
| Best choice when... | Fast baseline, robust defaults needed | Maximising accuracy, competition setting |
Start with Random Forest as your baseline: it's fast, robust, and almost always competitive with zero tuning. Once you need that last 2-3% of accuracy, move to LightGBM or XGBoost and tune carefully. Many production systems run Random Forest permanently because the marginal gain from boosting doesn't justify the operational complexity.
Golden Rules
- Always set n_jobs=-1. Trees are independent, so parallelisation is free performance. Leaving this at the default (1 core) on a 16-core machine means you're waiting 16× longer than necessary.
- Always set oob_score=True. You get a statistically valid estimate of generalisation accuracy at zero extra cost. Use it before even touching your test set.
- Tune max_features first: it has the biggest impact on accuracy and tree diversity. Then min_samples_leaf (2-5 reduces overfitting on noisy data). Tune n_estimators last: more trees always help, but with diminishing returns after ~300-500.
- For imbalanced classes, use class_weight='balanced_subsample' as a starting point. Evaluate using F1 or AUC-ROC, not accuracy: accuracy is misleading when classes are skewed.
- For trustworthy feature rankings, prefer permutation importance (sklearn.inspection.permutation_importance). Slower, but unbiased.