The Story: Coach Arjun's Two Bad Archers
Priya shoots ten arrows. Every single one lands in the bottom-left corner of the target: grouped tightly together, but far from the centre. She is consistent but systematically wrong. No matter how many times she shoots, she will never hit the bullseye. Her aim has a bias: a built-in offset she hasn't corrected for.
Ravi also shoots ten arrows. They are scattered all over the target: some near the centre, some at the edges, with no pattern. He might hit the bullseye by chance once, but you can't predict where his next arrow will go. He has high variance: he is sensitive to every tiny fluctuation in wind, grip, and mood.
Coach Arjun wants a third archer: one whose arrows cluster tightly around the bullseye. Low bias. Low variance. That is the goal of every machine learning model.
What Is Model Error? The Three-Part Decomposition
When a machine learning model makes predictions, its total expected error on unseen data can be mathematically decomposed into three independent sources. This is the Bias-Variance Decomposition.
Bias² - error from wrong assumptions about the data's shape; the model is systematically off no matter how much data it sees.
Variance - error from sensitivity to fluctuations in the training set.
Irreducible Noise - natural randomness in the data that no model can remove. This is the floor: no model can go below it.
| Component | Source | Controllable? | Archer Analogy |
|---|---|---|---|
| Bias² | Oversimplified model; wrong assumptions about the data shape | Yes - increase model complexity | Priya's arrows always landing bottom-left regardless of practice |
| Variance | Overcomplicated model; memorises training noise | Yes - simplify or regularise | Ravi's arrows scattered unpredictably all over the target |
| Irreducible Noise | Natural randomness in the real world (measurement error, missing variables) | No - cannot be reduced | A sudden gust of wind that no archer can predict or control |
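The decomposition can also be checked empirically: retrain the same model class on many freshly sampled training sets, then measure how the predictions at a single point spread around the truth. The sketch below is illustrative only; the cubic ground-truth function (the same one used in the learning-curve script later in this piece) and the evaluation point x = 0 are assumptions chosen for the demo.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def true_f(x):
    return 0.5 * x**3 - x**2          # noiseless ground truth

NOISE_SD = 1.5                        # irreducible noise level
X_TEST = 0.0                          # point where we measure bias/variance

def predict_once(degree, n=50):
    """Train one polynomial model on a fresh noisy sample; predict at X_TEST."""
    x = rng.uniform(-3, 3, n)
    y = true_f(x) + rng.normal(0, NOISE_SD, n)
    return Polynomial.fit(x, y, degree)(X_TEST)

# Retrain 500 times per complexity level, then decompose the error
for degree in (1, 3, 10):
    preds = np.array([predict_once(degree) for _ in range(500)])
    bias_sq = (preds.mean() - true_f(X_TEST)) ** 2
    print(f"degree {degree:>2}: bias^2 = {bias_sq:6.2f}, variance = {preds.var():5.2f}")
```

The straight line (degree 1) shows large bias² and small variance, Priya-style; the degree-10 model flips the pattern, Ravi-style. The noise term (NOISE_SD² = 2.25) is the floor neither can remove.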
The Four Scenarios - Archery Target Diagram
Each dot is one model trained on a different sample of training data. The bullseye is the true target value. Green = ideal, Amber = overfitting, Blue = underfitting, Red = worst possible outcome.
Bias - The Systematic Error
Variance - Sensitivity to the Training Set
The Bias-Variance Tradeoff - The Classic U-Curve
As model complexity increases, Bias² falls (the model becomes more expressive) but Variance rises (the model becomes more sensitive to training noise). Total error forms a U-shape. The minimum of that U is the sweet spot: the optimal model complexity.
The blue curve (Bias²) falls with complexity; the red curve (Variance) rises. Their sum, the green U-curve, has a minimum at the optimal complexity. Models left of the minimum underfit; models right of it overfit.
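That U-curve can be reproduced in a few lines of sklearn. This sketch sweeps polynomial degree on synthetic cubic data (both the data generator and the degree range are assumptions for the demo) and scores each fit with 5-fold cross-validation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = 0.5 * X[:, 0]**3 - X[:, 0]**2 + rng.normal(0, 1.5, 80)

val_mse = {}
for degree in range(1, 16):
    model = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                          LinearRegression())
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_mean_squared_error')
    val_mse[degree] = -scores.mean()   # flip sign: sklearn maximises scores

best = min(val_mse, key=val_mse.get)
for d in (1, best, 15):
    print(f"degree {d:>2}: CV MSE = {val_mse[d]:8.2f}")
print(f"minimum of the U sits near degree {best}")
```

Validation error is high at degree 1 (bias dominates), dips in the middle, and rises again at degree 15 (variance dominates).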
Underfitting, Normal Fit & Overfitting - Polynomial Example
The clearest illustration uses polynomial regression on the same scatter data with three different model complexities: degree 1 (line), degree 3 (curve), and degree 10 (wiggly). The underlying true relationship is a gentle cubic.
All three panels use exactly the same dataset. Only the model complexity changes. The degree-1 line misses the curvature. The degree-3 curve captures the true shape. The degree-10 curve memorises every noise fluctuation and will fail on new data.
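A quick numeric version of the three panels: fit all three degrees to one dataset and score them on a held-out half. The synthetic cubic generator and the 50/50 split are assumptions for the demo.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (60, 1))
y = 0.5 * X[:, 0]**3 - X[:, 0]**2 + rng.normal(0, 1.5, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

results = {}
for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    results[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                       mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree {degree:>2}: train MSE = {results[degree][0]:8.2f}, "
          f"test MSE = {results[degree][1]:8.2f}")
```

Training error falls monotonically with degree; held-out error does not. The degree-10 fit chases noise in the training half and pays for it on the test half.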
Underfitting vs Normal Fit vs Overfitting - Side by Side
| Property | Underfitting | Normal Fit | Overfitting |
|---|---|---|---|
| Bias | High | Low | Very Low |
| Variance | Low | Low | High |
| Training Error | High | Low | Very Low (near 0) |
| Validation Error | High | Low | High |
| Train-Val Gap | Small (both bad) | Small (both good) | Large |
| Model Complexity | Too simple | Just right | Too complex |
| More training data helps? | No | Marginally | Yes, significantly |
| Example (Polynomial) | Degree 1 (line) | Degree 3-5 | Degree 12+ |
| Example (Decision Tree) | Max depth = 1 (stump) | Max depth = 5-10 | Max depth = None |
| Example (Neural Net) | 2 neurons, no hidden layers | Appropriate architecture | Millions of params, tiny data |
Learning Curves - The Diagnostic Tool
A learning curve plots training error and validation error against either training set size or number of training epochs. The shape of these two curves diagnoses whether you have a bias or variance problem.
High Bias: both curves converge to a high error, so adding data doesn't help. Normal Fit: curves converge to a low error with a small gap. High Variance: training error is near-zero while validation error stays high; the gap is large and only closes slowly with more data.
Fixes for Underfitting (High Bias)
You have an underfitting problem if: training error is high, validation error is also high, and the gap between them is small. Collecting more data will not help. You need to give the model more capacity to learn.
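To make both halves of that diagnosis concrete, the sketch below compares a deliberately-too-simple linear model against a cubic one on synthetic cubic data (the generator and sample sizes are assumptions for the demo): ten times more data leaves the linear model stuck, while adding capacity fixes it.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)

def make_data(n):
    X = rng.uniform(-3, 3, (n, 1))
    y = 0.5 * X[:, 0]**3 - X[:, 0]**2 + rng.normal(0, 1.5, n)
    return X, y

def cv_mse(model, X, y):
    return -cross_val_score(model, X, y, cv=5,
                            scoring='neg_mean_squared_error').mean()

X_small, y_small = make_data(100)
X_big, y_big = make_data(1000)

mse_lin_small = cv_mse(LinearRegression(), X_small, y_small)
mse_lin_big   = cv_mse(LinearRegression(), X_big, y_big)
mse_cubic     = cv_mse(make_pipeline(PolynomialFeatures(3), LinearRegression()),
                       X_small, y_small)

print(f"linear,  100 points: {mse_lin_small:6.2f}")   # high bias
print(f"linear, 1000 points: {mse_lin_big:6.2f}")     # 10x data, still bad
print(f"cubic,   100 points: {mse_cubic:6.2f}")       # more capacity fixes it
```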
Fixes for Overfitting (High Variance)
You have an overfitting problem if: training error is very low, validation error is significantly higher, and the gap is large. The model has memorised training data. Fixes focus on constraining the model or exposing it to more diverse data.
Add regularisation (e.g. Ridge or Lasso): penalise large weights; λ controls the strength.
Reduce model complexity: lower tree depth (max_depth), reduce neurons / layers, or use fewer features. This directly reduces the model's capacity to memorise noise.
Collect more training data: more diverse examples mean the model cannot memorise the noise in any one sample.
Regularisation - The Most Powerful Overfitting Fix
Regularisation adds a penalty on large coefficients, shrinking the model toward simpler solutions. Tune the penalty strength λ (Ridge and Lasso expose it as alpha; sklearn classifiers use C = 1/λ) via cross-validation. Try values on a logarithmic scale: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]. Use RidgeCV, LassoCV, or GridSearchCV to find the optimal value.
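As a concrete sketch of this fix (synthetic cubic data and a deliberately over-complex degree-15 pipeline, both assumptions for the demo), compare a near-zero penalty with a moderate one:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (60, 1))
y = 0.5 * X[:, 0]**3 - X[:, 0]**2 + rng.normal(0, 1.5, 60)

def cv_mse(alpha):
    # degree-15 features give the model far more capacity than the data needs
    model = make_pipeline(PolynomialFeatures(15, include_bias=False),
                          StandardScaler(), Ridge(alpha=alpha))
    return -cross_val_score(model, X, y, cv=5,
                            scoring='neg_mean_squared_error').mean()

mse_weak  = cv_mse(1e-6)   # almost unregularised: free to memorise noise
mse_tuned = cv_mse(1.0)    # moderate penalty shrinks the wiggles
print(f"alpha = 1e-6: CV MSE = {mse_weak:8.2f}")
print(f"alpha = 1.0 : CV MSE = {mse_tuned:8.2f}")
```

Same features, same data; only λ changes, and the cross-validated error drops.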
Python - Diagnosing Bias & Variance with Learning Curves
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.datasets import make_regression
# -- Generate data with a cubic relationship + noise ----------
np.random.seed(42)
X_raw = np.random.uniform(-3, 3, (200, 1))
y = 0.5*X_raw[:,0]**3 - X_raw[:,0]**2 + np.random.randn(200)*1.5
# -- Helper: plot learning curve ------------------------------
def plot_learning_curve(model, X, y, title, ax, cv=5):
sizes, train_s, val_s = learning_curve(
model, X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=cv, scoring='neg_mean_squared_error'
)
train_err = -train_s.mean(axis=1)
val_err = -val_s.mean(axis=1)
ax.plot(sizes, train_err, 'o-', label='Train MSE', color='#f59e0b')
ax.plot(sizes, val_err, 's--', label='Validation MSE', color='#6366f1')
ax.set_title(title, fontsize=12)
ax.set_xlabel('Training set size')
ax.set_ylabel('MSE')
ax.legend()
ax.grid(True, alpha=0.3)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# -- Three model complexities ---------------------------------
models = [
('Underfit - Degree 1',
Pipeline([('poly', PolynomialFeatures(1)),
('ridge', Ridge(alpha=1.0))])),
('Normal Fit - Degree 3',
Pipeline([('poly', PolynomialFeatures(3)),
('ridge', Ridge(alpha=1.0))])),
('Overfit - Degree 15',
Pipeline([('poly', PolynomialFeatures(15)),
('ridge', Ridge(alpha=0.0001))])),
]
for ax, (title, model) in zip(axes, models):
plot_learning_curve(model, X_raw, y, title, ax)
plt.tight_layout()
plt.savefig('learning_curves.png', dpi=150)
plt.show()
Python - Validation Curve & Regularisation Tuning
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso, RidgeCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import (
validation_curve, cross_val_score,
train_test_split, KFold
)
from sklearn.preprocessing import StandardScaler
X, y = load_diabetes(return_X_y=True)
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
# -- 1. Validation curve: Decision Tree max_depth -------------
depths = np.arange(1, 16)
train_s, val_s = validation_curve(
DecisionTreeRegressor(), X_sc, y,
param_name='max_depth', param_range=depths,
scoring='neg_mean_squared_error', cv=5
)
plt.figure(figsize=(8, 4))
plt.plot(depths, -train_s.mean(1), 'o-', label='Train MSE', color='#f59e0b')
plt.plot(depths, -val_s.mean(1), 's--', label='Validation MSE', color='#6366f1')
plt.axvline(x=depths[(-val_s.mean(1)).argmin()],
color='#34d399', linestyle='--', label='Optimal depth')
plt.xlabel('max_depth')
plt.ylabel('MSE')
plt.title('Validation Curve - Bias-Variance vs Tree Depth')
plt.legend()
plt.show()
# -- 2. Ridge alpha tuning via RidgeCV ------------------------
alphas = np.logspace(-4, 4, 50)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_sc, y)
print(f"Best Ridge alpha (λ): {ridge_cv.alpha_:.4f}")
# -- 3. Cross-validation score comparison ---------------------
kf = KFold(n_splits=5, shuffle=True, random_state=42)
models_cv = {
'Ridge (best λ)' : Ridge(alpha=ridge_cv.alpha_),
'Decision Tree d=2 (underfit)': DecisionTreeRegressor(max_depth=2),
'Decision Tree d=5 (good)' : DecisionTreeRegressor(max_depth=5),
'Decision Tree d=None (overfit)': DecisionTreeRegressor(),
}
print(f"\n{'Model':<35} {'CV MSE':>10} {'Std':>8}")
print("-"*58)
for name, model in models_cv.items():
scores = cross_val_score(
model, X_sc, y,
cv=kf, scoring='neg_mean_squared_error'
)
print(f"{name:<35} {-scores.mean():>10.1f} {scores.std():>8.1f}")
The depth-2 tree underfits (high CV MSE, low std: consistent but wrong). The unlimited tree overfits (high CV MSE, high std: inconsistent across folds). The depth-5 tree and tuned Ridge both sit near the sweet spot with lower MSE and moderate variance. Notice that the overfitting model also has the highest standard deviation: it changes dramatically across folds, confirming high variance.
Golden Rules
Tune regularisation strength with RidgeCV, LassoCV, or GridSearchCV on your training data.