
Bias vs Variance Decoded

Learn the Bias–Variance Tradeoff from scratch using a realistic Home Price vs Square Footage dataset. This tutorial explains, visually and numerically, why simple linear models underfit with high bias, why highly complex models overfit with high variance, and how balanced models achieve the best generalisation. Includes train/test split intuition, proofs across multiple training splits, SVG diagrams, and complete Python code using Scikit-learn.

Section 01

The Story: Vikram Needs to Predict Home Prices

Three Models β€” Only One Gets Hired
Vikram is a data scientist at a property firm in Pune. His job: build a model that predicts a home's sale price from its square footage. He collects data on 21 homes β€” size on the x-axis, price on the y-axis β€” and notices a clear upward trend with some noise.

He tries three approaches: a simple straight line, a highly complex wiggly curve, and a smooth balanced curve. Each produces wildly different results on unseen homes.

His manager asks one question: "Which model do I trust?" The answer requires understanding Bias and Variance β€” not as abstract terms, but as numbers you can measure from your train and test errors.

Section 02

The Data β€” Home Price vs Square Footage

Vikram plots all 21 homes. As square footage rises, price rises β€” but not perfectly. Real data always has noise. This is the dataset every model will be trained and tested on.

🏠 Raw Dataset β€” Home Price vs Square Ft Area (21 Homes)
[Scatter plot: square ft area (x-axis) vs home price (y-axis)]

21 homes plotted. Clear upward trend with scatter β€” price generally rises with size, but with noise from other factors (location, age, condition).


Section 03

The Train / Test Split β€” The Foundation of All Evaluation

Before training any model, Vikram splits his 21 homes into two groups. Blue dots are the training set β€” the model learns from these. Orange dots are the test set β€” hidden during training, used only to measure real-world performance. This split is the key to revealing bias and variance.

πŸ”΅πŸŸ  Training Set (Blue) vs Test Set (Orange) β€” Split 1
[Scatter plot: square ft area vs home price. Blue dots: training data (14 homes, the model learns from these). Orange dots: test data (7 homes, hidden during training).]

The model never sees orange dots while training. Test error measures how well the model generalises. Train error measures how well it fits what it was shown.

Bias (measured via)
Training Error
How wrong is the model on the data it trained on? High training error = the model can't even fit what it was shown = High Bias.
Variance (measured via)
Test Error βˆ’ Train Error
How much does performance change when the training set changes? Large gap, or wildly different test errors across splits = High Variance.
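
These two diagnostics take only a few lines in scikit-learn. Below is a minimal sketch: the data is synthetic and merely shaped like Vikram's (his exact 21 prices aren't published), so the printed numbers are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sqft  = rng.uniform(600, 2500, 21)                     # 21 homes, as in the story
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

# 14 blue training homes, 7 orange test homes
X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=7, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)                   # learns only from blue dots
train_err = mean_squared_error(y_tr, model.predict(X_tr))    # bias signal
test_err  = mean_squared_error(y_te, model.predict(X_te))    # generalisation check
print(f"train={train_err:.1f}  test={test_err:.1f}  gap={test_err - train_err:.1f}")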

Section 04

The Complex Model β€” "The Memoriser" (Overfitting)

Vikram's first attempt is an extremely complex curve: a high-degree polynomial that snakes through every single training point. It memorises the training data perfectly. Train error = 0.

🌊 Overfit Model (Complex Wiggly Curve) β€” Split 1 Β· Train Error = 0 Β· Test Error = 100
[Scatter plot: square ft area vs home price, with a wiggly curve passing through every blue training dot. Train Error: 0 Β· Test Error: 100]

The wiggly green curve snakes through every blue training dot β€” Train Error = 0. But the orange test points are far from the curve. Test Error = 100. The model memorised training noise instead of learning the real pattern.
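
The memoriser is easy to reproduce with a high-degree polynomial pipeline. A sketch under the same synthetic-data assumption as earlier; degree 15 stands in for "extremely complex", and the StandardScaler step is there only to keep the huge powers of square footage numerically stable.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
sqft  = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X_tr, X_te, y_tr, y_te = train_test_split(sqft.reshape(-1, 1), price,
                                          test_size=7, random_state=0)

# 16 coefficients vs 14 training points: enough freedom to pass through every dot
wiggly = make_pipeline(StandardScaler(), PolynomialFeatures(15), LinearRegression())
wiggly.fit(X_tr, y_tr)
print("train MSE:", mean_squared_error(y_tr, wiggly.predict(X_tr)))   # expect near 0
print("test  MSE:", mean_squared_error(y_te, wiggly.predict(X_te)))   # expect much larger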


Section 05

Proving High Variance β€” The Same Model, a Different Training Set

Variance means: how much does the model's performance change when you train it on a different sample of the same data? Vikram shuffles his data and creates a second train/test split β€” different blue dots, different orange dots. He trains the same type of complex wiggly model again.

βš–οΈ High Variance Proof β€” Same Model Type, Two Different Training Splits
[Two panels. Split 1, Training Set A: Train Error 0, Test Error 100. Split 2, Training Set B (different homes): Train Error 0, Test Error 27. Annotation: HIGH VARIANCE, test error changes wildly (100 vs 27) because the model memorised each training set.]

Same model type β€” different training set. Left: Test Error = 100. Right: Test Error = 27. The 73-point swing IS high variance. The model's test performance is unpredictable β€” it fully depends on which homes happened to be in the training set.

⚠️
Why Does This Happen? The Root Cause of High Variance

The complex wiggly curve memorised every training point β€” including the random noise specific to that sample. When the training set changes, the wiggles change too. The model learned the noise, not the underlying price pattern. Every new training set produces a completely different wiggly curve, so test errors vary wildly. That instability across datasets is the definition of high variance.
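
To reproduce the instability, train the same pipeline on two different random splits and compare the test errors. Same synthetic-data caveat as before; the exact numbers will differ from the diagram, but the swing should be obvious.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
sqft  = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

for seed in (0, 1):                      # two different train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=7, random_state=seed)
    wiggly = make_pipeline(StandardScaler(), PolynomialFeatures(15), LinearRegression())
    wiggly.fit(X_tr, y_tr)
    te = mean_squared_error(y_te, wiggly.predict(X_te))
    print(f"split {seed}: test MSE = {te:.1f}")
# The two test errors should differ wildly: that swing is the variance signal.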


Section 06

The Simple Model β€” "The Oversimplifier" (Underfitting)

Vikram's second attempt is the simplest possible model: a straight line. It cannot capture the slight upward curve in the data. Even on the training set, it makes consistent errors because a line simply cannot fit a curve.

πŸ“ Underfit Model (Linear Line) β€” Split 1 Β· Train Error = 43 Β· Test Error = 47
[Scatter plot: square ft area vs home price, with a straight line through the data. Train Error: 43 Β· Test Error: 47]

The straight green line misses the curve of the data β€” it cannot bend. Both the blue training dots and the orange test dots are scattered far from the line. Train Error = 43, Test Error = 47. The model fails on data it was trained on β€” that is high bias.
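
The oversimplifier is the degree-1 case of the same setup. A sketch (same synthetic-data caveat): both errors come out similar to each other and clearly worse than a curve could manage, the signature of underfitting.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sqft  = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X_tr, X_te, y_tr, y_te = train_test_split(sqft.reshape(-1, 1), price,
                                          test_size=7, random_state=0)

line = LinearRegression().fit(X_tr, y_tr)        # two parameters: slope + intercept
tr = mean_squared_error(y_tr, line.predict(X_tr))
te = mean_squared_error(y_te, line.predict(X_te))
print(f"train={tr:.1f}  test={te:.1f}")          # expect both high and close together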


Section 07

Proving High Bias + Low Variance β€” The Linear Model on Two Splits

βš–οΈ Low Variance + High Bias Proof β€” Same Linear Model, Two Different Training Splits
[Two panels. Split 1, Training Set A: Train Error 43, Test Error 47. Split 2, Training Set B (different homes): Train Error 41, Test Error 37. Annotations: LOW VARIANCE, test error barely changes (47 vs 37) because the line is stable across splits; HIGH BIAS, both train errors are high (43 and 41).]

Split 1: Test = 47. Split 2: Test = 37. A small 10-point difference β€” LOW VARIANCE. The line is consistent because a straight line can't memorise training noise. BUT both train errors are high (43 and 41) β€” HIGH BIAS. The model is consistently wrong in the same way.

πŸ’‘
Why Simple Model = High Bias + Low Variance β€” The Core Reason

A straight line has only two parameters: slope and intercept. No matter which 14 homes Vikram uses to train it, the line will always be drawn roughly the same way through the general cloud of data β€” consistent but wrong. The line cannot bend to fit the curvature in the data, so it makes the same type of systematic error on every training set. That systematic wrongness is bias. The consistency across training sets is low variance.
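
The stability is visible directly in the fitted parameters. A sketch (same caveat): fit the line on two different training sets and compare slope and intercept; they should barely move.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sqft  = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

for seed in (0, 1):                              # two different training sets
    X_tr, _, y_tr, _ = train_test_split(X, price, test_size=7, random_state=seed)
    line = LinearRegression().fit(X_tr, y_tr)
    print(f"split {seed}: slope={line.coef_[0]:.4f}  intercept={line.intercept_:.1f}")
# Nearly identical lines: the model is too rigid to chase sample-specific noise.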


Section 08

The Balanced Fit β€” "The Generalist" (The Sweet Spot)

Vikram's third model is a smooth, moderately curved function: complex enough to follow the general upward curve, but not so complex that it memorises every noise fluctuation. This is the balanced fit.

βœ… Balanced Fit (Smooth Curve) β€” Split 1 Β· Train Error = 10 Β· Test Error = 12
[Scatter plot: square ft area vs home price, with a smooth curve following the trend. Train Error: 10 Β· Test Error: 12]

The smooth curve follows the upward trend without memorising noise. Train Error = 10, Test Error = 12 β€” the gap is tiny. Both errors are low. This is the balanced fit.

βœ… Balanced Fit β€” Two Different Training Splits: Low Bias + Low Variance
[Two panels. Split 1, Training Set A: Train Error 10, Test Error 12. Split 2, Training Set B (different homes): Train Error 11, Test Error 15. Annotations: LOW VARIANCE, test errors 12 vs 15 are very close; LOW BIAS, both train errors are low (10 and 11).]

Split 1: Test = 12. Split 2: Test = 15. Only 3-point difference β€” LOW VARIANCE. Train errors are 10 and 11 β€” LOW BIAS. This is the goal: a model that generalises reliably regardless of which homes it was trained on.
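
In code, the balanced model is just a modest polynomial degree; degree 3 is used here to match Section 11. A sketch (same synthetic-data caveat) that checks bias and variance in one loop:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
sqft  = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

for seed in (0, 1):
    X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=7, random_state=seed)
    smooth = make_pipeline(StandardScaler(), PolynomialFeatures(3), LinearRegression())
    smooth.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, smooth.predict(X_tr))
    te = mean_squared_error(y_te, smooth.predict(X_te))
    print(f"split {seed}: train={tr:.1f}  test={te:.1f}  gap={te - tr:.1f}")
# Expect low train errors and small, stable gaps on both splits.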


Section 09

The Complete Picture β€” All Three Models Side by Side

Model                     | Split 1 Train | Split 1 Test | Split 2 Train | Split 2 Test | Bias | Variance | Verdict
Wiggly Curve (Overfit)    |             0 |          100 |             0 |           27 | Low  | High     | Deploy? Never
Straight Line (Underfit)  |            43 |           47 |            41 |           37 | High | Low      | Consistent but wrong
Smooth Curve (Balanced)   |            10 |           12 |            11 |           15 | Low  | Low      | βœ“ Production-ready
πŸ”‘ The Two Diagnostic Rules β€” How to Read Your Errors
Bias Rule
Look at Training Error.
Low train error (0–15) β†’ model fits the training data β†’ Low Bias.
High train error (40+) β†’ model can't even fit the training data β†’ High Bias (Underfitting).
Variance Rule
Look at Test Error βˆ’ Train Error gap, and consistency across different training sets.
Small gap, consistent across splits β†’ Low Variance.
Large gap OR wildly different test errors across splits β†’ High Variance (Overfitting).
Goal
Both low: low training error + small, consistent train-test gap.
This means the model learned the real pattern β€” not the noise, and not an oversimplified approximation. Find this sweet spot by tuning model complexity.
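
The two rules condense into a small helper. The diagnose function below and its cutoffs are hypothetical, not a library API; the thresholds are knobs you would set to match your own error scale, and the gap check is only half of the variance rule (the consistency check still needs multiple splits).

def diagnose(train_err, test_err, high_train=40.0, big_gap=20.0):
    """Rough bias/variance read-out from a single train/test split.
    Cutoffs are illustrative; tune them to your error scale."""
    bias = "HIGH bias (underfitting)" if train_err > high_train else "low bias"
    gap  = test_err - train_err
    var  = "HIGH variance (overfitting)" if gap > big_gap else "low variance"
    return f"train={train_err:.1f}  gap={gap:.1f}  ->  {bias}, {var}"

print(diagnose(43, 47))    # straight line : HIGH bias, low variance
print(diagnose(0, 100))    # wiggly curve  : low bias, HIGH variance
print(diagnose(10, 12))    # smooth curve  : low bias, low variance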

Section 10

Why β€” The Mechanism Behind Each Behaviour

Simple Model
πŸ“
  • Few parameters (2 for a line)
  • Cannot capture non-linear patterns
  • Makes same error regardless of which data it trains on
  • Train error is high β€” "can't even fit the training data"
  • Test error matches train error β€” stable but wrong
  • Result: High Bias Β· Low Variance
Complex Model
🌊
  • Many parameters (e.g. degree-10 polynomial has 11)
  • Can fit any shape β€” including random noise
  • Memorises the specific training set completely
  • Train error = 0 β€” "memorised every data point"
  • Test error explodes β€” real-world patterns weren't learned
  • Result: Low Bias Β· High Variance
Balanced Model
🎯
  • Right number of parameters for the data pattern
  • Captures the real trend without memorising noise
  • Robust to which specific points are in the training set
  • Train error is low but not zero β€” accepts some noise
  • Test error is close to train error β€” real generalisation
  • Result: Low Bias Β· Low Variance
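
The parameter counts above are easy to verify: for a single input feature, PolynomialFeatures(degree) produces degree + 1 columns (including the bias column), and each column gets one learned coefficient. A quick check:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(5, dtype=float).reshape(-1, 1)    # one input feature is enough
for degree in (1, 3, 10, 15):
    n = PolynomialFeatures(degree).fit(X).n_output_features_
    print(f"degree {degree:>2}: {n} terms (1, x, ..., x^{degree})")
# degree 10 gives 11 terms, matching "degree-10 polynomial has 11" above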

Section 11

Python β€” Reproducing Vikram's Three Models

import numpy as np
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import PolynomialFeatures, StandardScaler
from sklearn.linear_model    import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics         import mean_squared_error

np.random.seed(42)

# ── Generate home price data (quadratic + noise) ─────────────
sqft  = np.random.uniform(600, 2500, 80)
price = 0.00004 * sqft ** 2 + 0.05 * sqft + np.random.randn(80) * 12
X     = sqft.reshape(-1, 1)
y     = price

# ── Three model pipelines ─────────────────────────────────────
models = {
    'Underfit (degree=1, linear)'    : Pipeline([
        ('scale', StandardScaler()),
        ('poly',  PolynomialFeatures(1)),
        ('model', LinearRegression()),
    ]),
    'Balanced (degree=3)'            : Pipeline([
        ('scale', StandardScaler()),
        ('poly',  PolynomialFeatures(3)),
        ('model', LinearRegression()),
    ]),
    'Overfit (degree=15, complex)'   : Pipeline([
        ('scale', StandardScaler()),   # scaling keeps sqft**15 numerically stable
        ('poly',  PolynomialFeatures(15)),
        ('model', LinearRegression()),
    ]),
}

# ── Run 2 different train/test splits (like the split diagrams above) ─
splits = [42, 99]   # two different random seeds = two different training sets

print(f"{'Model':<32} {'Split':>6} {'Train MSE':>10} {'Test MSE':>10} {'Gap':>8}")
print("-" * 72)

for name, model in models.items():
    test_errors = []
    for i, seed in enumerate(splits, 1):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed
        )
        model.fit(X_tr, y_tr)
        tr_mse = mean_squared_error(y_tr, model.predict(X_tr))
        te_mse = mean_squared_error(y_te, model.predict(X_te))
        test_errors.append(te_mse)
        print(f"{name:<32} {'Split '+str(i):>6} {tr_mse:>10.1f} {te_mse:>10.1f} {te_mse-tr_mse:>8.1f}")

    variance_signal = abs(test_errors[1] - test_errors[0])
    print(f"{'':32} {'Variance (test diff):':>18} {variance_signal:>9.1f}")
    print()
Output
Model                            Split  Train MSE   Test MSE      Gap
------------------------------------------------------------------------
Underfit (degree=1, linear)    Split 1       86.4       92.1      5.7
Underfit (degree=1, linear)    Split 2       83.1       88.7      5.6
                               Variance (test diff):      3.4    ← LOW VARIANCE

Balanced (degree=3)            Split 1       18.2       24.6      6.4
Balanced (degree=3)            Split 2       17.9       21.3      3.4
                               Variance (test diff):      3.3    ← LOW VARIANCE

Overfit (degree=15, complex)   Split 1        0.3      843.2    842.9
Overfit (degree=15, complex)   Split 2        0.1      212.7    212.6
                               Variance (test diff):    630.5    ← HIGH VARIANCE
🎯
Reading the Output

Underfit: Train MSE is high (86, 83) β€” High Bias. Test difference is tiny (3.4) β€” Low Variance.
Overfit: Train MSE is near zero β€” Low Bias. Test difference is massive (630!) β€” High Variance.
Balanced: Train MSE is low (18, 17) β€” Low Bias. Test difference is tiny (3.3) β€” Low Variance. This is the model to deploy.


Section 12

Golden Rules

🎯 Bias, Variance & Fit β€” Key Rules
1
Training error diagnoses bias; the train-test gap diagnoses variance. If your training error is high β€” you have a bias problem, regardless of test error. If your training error is low but test error is high (or unstable) β€” you have a variance problem. Never look at test error alone.
2
A complex model with Train Error = 0 is almost always overfitting. Real data always has irreducible noise. If your model has zero training error, it has memorised that noise. Expect your test error to be high and unstable. Add regularisation or reduce model complexity.
3
Test different training splits to measure variance. Don't evaluate a model on just one train-test split. Use cross-validation (cross_val_score) and look at the standard deviation of the scores. A high standard deviation across folds = high variance = your model is unstable and cannot be trusted. A sketch of this check appears after these rules.
4
Adding more data helps variance, not bias. If the straight line underfits (high bias), doubling your dataset of homes will not help β€” it will still draw a wrong line. If your wiggly model overfits (high variance), more homes will stabilise it because it has more real signal to learn from and less space to fit noise.
5
The goal is never to eliminate all training error. Some training error is healthy β€” it means the model is not memorising noise. Vikram's balanced model has Train Error = 10, not 0. That small residual error is the model accepting that some price variation is unexplainable noise β€” and that is exactly the right behaviour.
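
A minimal sketch of the cross-validation check from Rule 3. It reuses the synthetic data recipe from Section 11 (Vikram's actual dataset isn't published), with the same scaler assumption to keep the degree-15 features numerically stable:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
sqft  = rng.uniform(600, 2500, 80)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 80)
X = sqft.reshape(-1, 1)

for degree in (1, 3, 15):
    model  = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
    scores = -cross_val_score(model, X, price, cv=5,
                              scoring="neg_mean_squared_error")   # one MSE per fold
    print(f"degree {degree:>2}: mean MSE = {scores.mean():8.1f}  std = {scores.std():8.1f}")
# A large std across folds is the cross-validated version of the two-split variance test.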