The Story: Vikram Needs to Predict Home Prices
He tries three approaches: a simple straight line, a highly complex wiggly curve, and a smooth balanced curve. Each produces wildly different results on unseen homes.
His manager asks one question: "Which model do I trust?" The answer requires understanding Bias and Variance, not as abstract terms, but as numbers you can measure from your train and test errors.
The Data – Home Price vs Square Footage
Vikram plots all 21 homes. As square footage rises, price rises, but not perfectly. Real data always has noise. This is the dataset every model will be trained and tested on.
21 homes plotted: a clear upward trend with scatter. Price generally rises with size, but with noise from other factors (location, age, condition).
The Train / Test Split – The Foundation of All Evaluation
Before training any model, Vikram splits his 21 homes into two groups. Blue dots are the training set: the model learns from these. Orange dots are the test set: hidden during training, used only to measure real-world performance. This split is the key to revealing bias and variance.
The model never sees orange dots while training. Test error measures how well the model generalises. Train error measures how well it fits what it was shown.
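A minimal sketch of this split, assuming synthetic data in place of Vikram's real 21 homes (the quadratic-plus-noise trend mirrors the full script later in this section):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 21 synthetic homes: price rises with size, plus noise (assumed stand-in data)
sqft = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

# Hold out a third of the homes as the "orange dots" the model never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=1/3, random_state=0
)
print(len(X_train), "training homes,", len(X_test), "test homes")  # 14 and 7
```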
The Complex Model – "The Memoriser" (Overfitting)
Vikram's first attempt is an extremely complex curve: a high-degree polynomial that snakes through every single training point. It memorises the training data perfectly. Train error = 0.
The wiggly green curve snakes through every blue training dot, so Train Error = 0. But the orange test points are far from the curve: Test Error = 100. The model memorised training noise instead of learning the real pattern.
Proving High Variance – The Same Model, a Different Training Set
Variance means: how much does the model's performance change when you train it on a different sample of the same data? Vikram shuffles his data and creates a second train/test split: different blue dots, different orange dots. He trains the same type of complex wiggly model again.
Same model type, different training set. Left: Test Error = 100. Right: Test Error = 27. That 73-point swing IS high variance. The model's test performance is unpredictable: it depends entirely on which homes happened to be in the training set.
The complex wiggly curve memorised every training point β including the random noise specific to that sample. When the training set changes, the wiggles change too. The model learned the noise, not the underlying price pattern. Every new training set produces a completely different wiggly curve, so test errors vary wildly. That instability across datasets is the definition of high variance.
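A minimal sketch of this two-split experiment on the same synthetic homes. Degree 12 and the MinMaxScaler are assumptions standing in for the wiggly model (the scaler keeps the high-degree fit numerically sane); the exact numbers will differ from the story's, but the swing between splits is the point:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
sqft = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

test_mses = []
for seed in (0, 1):  # two shuffles = two different training sets
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, price, test_size=1/3, random_state=seed
    )
    wiggly = make_pipeline(MinMaxScaler(), PolynomialFeatures(12), LinearRegression())
    wiggly.fit(X_tr, y_tr)
    test_mses.append(mean_squared_error(y_te, wiggly.predict(X_te)))

print(f"Test MSE: split 1 = {test_mses[0]:.1f}, split 2 = {test_mses[1]:.1f}")
print(f"Swing across splits = {abs(test_mses[0] - test_mses[1]):.1f}  (large swing = high variance)")
```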
The Simple Model – "The Oversimplifier" (Underfitting)
Vikram's second attempt is the simplest possible model: a straight line. It cannot capture the slight upward curve in the data. Even on the training set, it makes consistent errors because a line simply cannot fit a curve.
The straight green line misses the curve of the data: it cannot bend. Both the blue training dots and the orange test dots are scattered far from the line. Train Error = 43, Test Error = 47. The model fails even on the data it was trained on; that is high bias.
Proving High Bias + Low Variance – The Linear Model on Two Splits
Split 1: Test = 47. Split 2: Test = 37. A small 10-point difference → LOW VARIANCE. The line is consistent because a straight line can't memorise training noise. BUT both train errors are high (43 and 41) → HIGH BIAS. The model is consistently wrong in the same way.
A straight line has only two parameters: slope and intercept. No matter which 14 homes Vikram uses to train it, the line will always be drawn roughly the same way through the general cloud of data β consistent but wrong. The line cannot bend to fit the curvature in the data, so it makes the same type of systematic error on every training set. That systematic wrongness is bias. The consistency across training sets is low variance.
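A sketch of that stability on the same synthetic homes; watch how little the two fitted parameters move when the training set changes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sqft = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

for seed in (0, 1):
    X_tr, _, y_tr, _ = train_test_split(X, price, test_size=1/3, random_state=seed)
    line = LinearRegression().fit(X_tr, y_tr)
    # Only two parameters, so the fitted line barely changes between splits
    print(f"split seed={seed}: slope={line.coef_[0]:.4f}, intercept={line.intercept_:.1f}")
```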
The Balanced Fit – "The Generalist" (The Sweet Spot)
Vikram's third model is a smooth, moderately curved function: flexible enough to follow the general upward curve, but not so flexible that it memorises every noise fluctuation.
The smooth curve follows the upward trend without memorising noise. Train Error = 10, Test Error = 12; the gap is tiny. Both errors are low. This is the balanced fit.
Split 1: Test = 12. Split 2: Test = 15. Only a 3-point difference → LOW VARIANCE. Train errors are 10 and 11 → LOW BIAS. This is the goal: a model that generalises reliably regardless of which homes it was trained on.
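The same two-split check for the balanced model, where degree 2 is an assumed stand-in for the smooth curve. Look for low train and test errors that stay close together on both splits:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
sqft = rng.uniform(600, 2500, 21)
price = 0.00004 * sqft**2 + 0.05 * sqft + rng.normal(0, 12, 21)
X = sqft.reshape(-1, 1)

for seed in (0, 1):
    X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=1/3, random_state=seed)
    smooth = make_pipeline(MinMaxScaler(), PolynomialFeatures(2), LinearRegression())
    smooth.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, smooth.predict(X_tr))
    te = mean_squared_error(y_te, smooth.predict(X_te))
    print(f"split seed={seed}: train MSE = {tr:.1f}, test MSE = {te:.1f}")  # low and close
```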
The Complete Picture – All Three Models Side by Side
| Model | Split 1 Train | Split 1 Test | Split 2 Train | Split 2 Test | Bias | Variance | Verdict |
|---|---|---|---|---|---|---|---|
| Wiggly Curve (Overfit) | 0 | 100 | 0 | 27 | Low | High | Deploy? Never |
| Straight Line (Underfit) | 43 | 47 | 41 | 37 | High | Low | Consistent but wrong |
| Smooth Curve (Balanced) | 10 | 12 | 11 | 15 | Low | Low | ✅ Production-ready |
Low train error (0–15) → the model fits the training data → Low Bias.
High train error (40+) → the model can't even fit the training data → High Bias (Underfitting).
Small train-test gap, consistent across splits → Low Variance.
Large gap OR wildly different test errors across splits → High Variance (Overfitting).
Low and stable errors on both sides mean the model learned the real pattern: not the noise, and not an oversimplified approximation. Find this sweet spot by tuning model complexity.
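Those rules are mechanical enough to encode. A sketch of a hypothetical `diagnose` helper, where the thresholds are the illustrative cut-offs from the rules above, not universal constants:

```python
def diagnose(train_err: float, test_err: float) -> str:
    """Rough bias/variance verdict from one split (illustrative thresholds)."""
    if train_err >= 40:
        return "High bias (underfitting): can't even fit the training data"
    if test_err - train_err > train_err:  # test error far above train error
        return "High variance (overfitting): fits training data, fails on unseen homes"
    return "Balanced: low bias, low variance"

print(diagnose(0, 100))   # wiggly curve  -> high variance
print(diagnose(43, 47))   # straight line -> high bias
print(diagnose(10, 12))   # smooth curve  -> balanced
```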
Why – The Mechanism Behind Each Behaviour
The Straight Line ("The Oversimplifier"):
- Few parameters (2 for a line: slope and intercept)
- Cannot capture non-linear patterns
- Makes the same error regardless of which data it trains on
- Train error is high → "can't even fit the training data"
- Test error matches train error → stable but wrong
- Result: High Bias · Low Variance
The Wiggly Curve ("The Memoriser"):
- Many parameters (e.g. a degree-10 polynomial has 11; see the sketch after this list)
- Can fit any shape, including random noise
- Memorises the specific training set completely
- Train error = 0 → "memorised every data point"
- Test error explodes → real-world patterns weren't learned
- Result: Low Bias · High Variance
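That parameter count is easy to verify with scikit-learn's PolynomialFeatures: for a single input feature, a degree-d expansion has d + 1 terms, including the bias term:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(5, dtype=float).reshape(-1, 1)  # one feature: square footage

for degree in (1, 3, 10):
    n_terms = PolynomialFeatures(degree).fit(X).n_output_features_
    print(f"degree {degree:>2}: {n_terms} parameters to learn")
# degree  1: 2  (slope + intercept)
# degree  3: 4
# degree 10: 11
```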
The Smooth Curve ("The Generalist"):
- The right number of parameters for the data pattern
- Captures the real trend without memorising noise
- Robust to which specific points land in the training set
- Train error is low but not zero → accepts some noise
- Test error is close to train error → real generalisation
- Result: Low Bias · Low Variance
Python – Reproducing Vikram's Three Models
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)

# -- Generate home price data (quadratic + noise) --------------
sqft = np.random.uniform(600, 2500, 80)
price = 0.00004 * sqft ** 2 + 0.05 * sqft + np.random.randn(80) * 12
X = sqft.reshape(-1, 1)
y = price

# -- Three model pipelines --------------------------------------
models = {
    'Underfit (degree=1, linear)': Pipeline([
        ('poly', PolynomialFeatures(1)),
        ('model', LinearRegression()),
    ]),
    'Balanced (degree=3)': Pipeline([
        ('poly', PolynomialFeatures(3)),
        ('model', LinearRegression()),
    ]),
    'Overfit (degree=15, complex)': Pipeline([
        ('poly', PolynomialFeatures(15)),
        ('model', LinearRegression()),
    ]),
}

# -- Run 2 different train/test splits (the two splits above) ---
splits = [42, 99]  # two random seeds = two different training sets

print(f"{'Model':<32} {'Split':>7} {'Train MSE':>10} {'Test MSE':>10} {'Gap':>8}")
print("-" * 72)
for name, model in models.items():
    test_errors = []
    for i, seed in enumerate(splits, 1):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed
        )
        model.fit(X_tr, y_tr)
        tr_mse = mean_squared_error(y_tr, model.predict(X_tr))
        te_mse = mean_squared_error(y_te, model.predict(X_te))
        test_errors.append(te_mse)
        print(f"{name:<32} {'Split ' + str(i):>7} "
              f"{tr_mse:>10.1f} {te_mse:>10.1f} {te_mse - tr_mse:>8.1f}")
    variance_signal = abs(test_errors[1] - test_errors[0])
    print(f"{'':<32} Variance (test diff): {variance_signal:.1f}")
    print()
```
Underfit: Train MSE is high (86, 83) → High Bias. Test difference is tiny (3.4) → Low Variance.
Overfit: Train MSE is near zero → Low Bias. Test difference is massive (630!) → High Variance.
Balanced: Train MSE is low (18, 17) → Low Bias. Test difference is tiny (3.3) → Low Variance.
This is the model to deploy.
Golden Rules
Never judge a model from a single train/test split: run cross-validation (cross_val_score) and look at the standard deviation of the scores. A high standard deviation across folds = high variance = your model is unstable and cannot be trusted.
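A minimal sketch of that rule applied to the same three pipelines. Negative MSE is scikit-learn's convention for error scorers; the column to watch is the standard deviation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
sqft = np.random.uniform(600, 2500, 80)
price = 0.00004 * sqft ** 2 + 0.05 * sqft + np.random.randn(80) * 12
X = sqft.reshape(-1, 1)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold CV: five different train/test carve-ups of the same data
    mses = -cross_val_score(model, X, price, cv=5,
                            scoring='neg_mean_squared_error')
    print(f"degree {degree:>2}: MSE mean = {mses.mean():>12.1f}, std = {mses.std():>12.1f}")
# A large std across folds = unstable model = high variance
```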