
Bias, Variance, Underfitting & Overfitting

Learn the complete Bias-Variance Tradeoff in Machine Learning with intuitive archery analogies, visual diagrams, learning curves, polynomial fitting examples, regularisation techniques, and practical Python implementations. Understand how underfitting and overfitting happen, how to diagnose them, and how to fix them effectively.

Section 01

The Story: Coach Arjun's Two Bad Archers

Why Being Consistently Wrong and Being Erratically Right Are Both Useless
Coach Arjun trains archers at a national academy. Two of his students frustrate him in completely different ways.

Priya shoots ten arrows. Every single one lands in the bottom-left corner of the target β€” grouped tightly together, but far from the centre. She is consistent but systematically wrong. No matter how many times she shoots, she will never hit the bullseye. Her aim has a bias β€” a built-in offset she hasn't corrected for.

Ravi also shoots ten arrows. They are scattered all over the target β€” some near the centre, some at the edges, with no pattern. He might hit the bullseye by chance once, but you can't predict where his next arrow will go. He has high variance β€” he is sensitive to every tiny fluctuation in wind, grip, and mood.

Coach Arjun wants a third archer β€” one whose arrows cluster tightly around the bullseye. Low bias. Low variance. That is the goal of every machine learning model.

Section 02

What Is Model Error? The Three-Part Decomposition

When a machine learning model makes predictions, its total expected error on unseen data can be mathematically decomposed into three independent sources. This is the Bias-Variance Decomposition.

Total Expected Prediction Error (MSE)
Error(x) = BiasΒ² + Variance + Irreducible Noise
BiasΒ² β€” systematic error from wrong assumptions in the model.
Variance β€” error from sensitivity to fluctuations in the training set.
Irreducible Noise β€” natural randomness in the data that no model can remove. The floor β€” you can never go below this.
Component | Source | Controllable? | Archer Analogy
BiasΒ² | Oversimplified model; wrong assumptions about the data shape | Yes β€” increase model complexity | Priya's arrows always landing bottom-left regardless of practice
Variance | Overcomplicated model; memorises training noise | Yes β€” simplify or regularise | Ravi's arrows scattered unpredictably all over the target
Irreducible Noise | Natural randomness in the real world (measurement error, missing variables) | No β€” cannot be reduced | A sudden gust of wind that no archer can predict or control
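
To make the decomposition concrete, here is a minimal sketch (our own illustration, using an assumed cubic ground truth rather than any dataset from this article). It trains the same model class on many freshly sampled training sets and measures, at fixed test points, how far the average prediction sits from the truth (BiasΒ²) and how much individual predictions scatter around that average (Variance).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_test  = np.linspace(-3, 3, 50).reshape(-1, 1)
y_clean = 0.5 * x_test[:, 0] ** 3              # noiseless truth at the test points

def bias_variance(degree, n_datasets=200, n_samples=40, noise_sd=1.5):
    """Train one model per resampled dataset, then decompose the error."""
    preds = np.empty((n_datasets, len(x_test)))
    for i in range(n_datasets):
        X = rng.uniform(-3, 3, (n_samples, 1))
        y = 0.5 * X[:, 0] ** 3 + rng.normal(0, noise_sd, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(X, y).predict(x_test)
    bias_sq  = ((preds.mean(axis=0) - y_clean) ** 2).mean()   # systematic offset
    variance = preds.var(axis=0).mean()                       # spread across datasets
    return bias_sq, variance

for degree in (1, 3, 12):
    b2, var = bias_variance(degree)
    print(f"degree {degree:>2}:  BiasΒ² β‰ˆ {b2:8.3f}   Variance β‰ˆ {var:8.3f}")

On a typical run, degree 1 shows large BiasΒ² with small Variance, degree 12 shows the reverse, and degree 3 keeps both low.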

Section 03

The Four Scenarios β€” Archery Target Diagram

🎯 Bias vs Variance β€” Four Archery Targets
[Figure: four archery targets arranged in a 2Γ—2 grid by bias and variance. Low Bias + Low Variance: the ideal sweet spot. Low Bias + High Variance: arrows scattered around the centre. High Bias + Low Variance: consistent but off-target. High Bias + High Variance: worst case, scattered and off-target.]

Each dot is one model trained on a different sample of training data. The bullseye is the true target value. Green = ideal, Amber = overfitting, Blue = underfitting, Red = worst possible outcome.


Section 04

Bias β€” The Systematic Error

Mathematical Definition
Bias = E[Ε·] βˆ’ y_true
The difference between the expected (average) prediction over many training sets and the true value. If a model consistently predicts too low, its bias is negative; if it consistently predicts too high, its bias is positive.
In Plain English
Bias = "Wrong Assumptions About the Data"
A linear model fitted to data that follows a curve has high bias β€” it cannot fit the curve no matter how much data you give it or how long you train it. The model is fundamentally too simple.
01
High-Bias Model Behaviour
Performs poorly on the training set itself. Train error is high. The model hasn't learned the true pattern β€” it's too constrained to fit even the data it was trained on. More training data does not help a high-bias model.
02
Common Causes of High Bias
Using a linear model for non-linear data Β· Too few features (insufficient information) Β· Excessive regularisation that penalises the weights too hard Β· Too few hidden layers / neurons in a neural network Β· Hand-crafted features that miss key patterns.
03
Signs in Practice
Training accuracy is low. Validation accuracy is also low and roughly equal to training accuracy. The gap between train and validation error is small β€” but both are bad. Adding more training data barely moves either curve.
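
A quick check of these symptoms, using toy cubic data of our own (an assumption for illustration, not data from this article): a degree-1 model on curved data shows high, roughly equal train and validation error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] ** 2 + rng.normal(0, 1.5, 200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

line = make_pipeline(PolynomialFeatures(1), LinearRegression()).fit(X_tr, y_tr)
print("train MSE:", round(mean_squared_error(y_tr,  line.predict(X_tr)),  2))
print("val   MSE:", round(mean_squared_error(y_val, line.predict(X_val)), 2))
# Both errors are high and close together: the underfitting signature.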

Section 05

Variance β€” Sensitivity to the Training Set

Mathematical Definition
Variance = E[ (Ε· βˆ’ E[Ε·])Β² ]
How much the model's predictions change when trained on different samples drawn from the same distribution. High variance = the model changes drastically with small changes in the training data.
In Plain English
Variance = "Memorised the Training Noise"
A degree-20 polynomial fitted to 25 data points has high variance β€” it passes through (or near) every training point but produces wild oscillations between them. It has learned the noise, not the signal.
01
High-Variance Model Behaviour
Performs very well on the training set but poorly on unseen data. Train error is low or near-zero. Validation / test error is significantly higher. The model has memorised β€” not learned.
02
Common Causes of High Variance
Model too complex for the dataset (degree too high, tree too deep) Β· Too few training examples Β· Too many features relative to samples Β· No regularisation Β· Training too long without early stopping (neural networks) Β· Noisy or mislabelled training data.
03
Signs in Practice
Training accuracy is very high (often near 100%). Validation accuracy is significantly lower. The gap between train and validation error is large, and adding more training data gradually narrows it.
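
The mirror-image check (the same assumed toy data, now with only 25 training points and a degree-15 fit): train error collapses towards zero while validation error stays high.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
def sample(n):
    X = rng.uniform(-3, 3, (n, 1))
    return X, 0.5 * X[:, 0] ** 3 - X[:, 0] ** 2 + rng.normal(0, 1.5, n)

X_tr, y_tr   = sample(25)      # tiny training set
X_val, y_val = sample(200)     # plenty of unseen data

wiggly = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X_tr, y_tr)
print("train MSE:", round(mean_squared_error(y_tr,  wiggly.predict(X_tr)),  3))   # near zero
print("val   MSE:", round(mean_squared_error(y_val, wiggly.predict(X_val)), 3))   # much larger
# Near-zero train error with a large train-val gap: the overfitting signature.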

Section 06

The Bias-Variance Tradeoff β€” The Classic U-Curve

As model complexity increases, BiasΒ² falls (the model becomes more expressive) but Variance rises (the model becomes more sensitive to training noise). Total error forms a U-shape. The minimum of that U is the sweet spot β€” the optimal model complexity.

πŸ“ˆ BiasΒ², Variance & Total Error vs Model Complexity
[Chart: Error versus Model Complexity, spanning the Underfitting, Sweet Spot and Overfitting regions. BiasΒ² falls with complexity, Variance rises, and their sum, Total Error (BiasΒ² + Variance + Noise), is U-shaped; the optimal complexity sits at the bottom of the U, just above the noise floor.]

The blue curve (BiasΒ²) falls with complexity; the red curve (Variance) rises. Their sum β€” the green U-curve β€” has a minimum at the optimal complexity. Models left of the minimum underfit; models right of it overfit.
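
The U-curve can be reproduced in a few lines. The sketch below (our own setup, with an assumed cubic ground truth and a deliberately small sample) scores polynomial degrees by cross-validated MSE; the minimum typically lands near the true cubic, and error climbs again as the degree grows.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, (40, 1))        # small sample so variance shows up clearly
y = 0.5 * X[:, 0] ** 3 - X[:, 0] ** 2 + rng.normal(0, 1.5, 40)

for degree in range(1, 13):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring='neg_mean_squared_error').mean()
    print(f"degree {degree:>2}:  CV MSE = {mse:10.2f}")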


Section 07

Underfitting, Normal Fit & Overfitting β€” Polynomial Example

The clearest illustration uses polynomial regression on the same scatter data with three different model complexities β€” degree 1 (line), degree 3 (curve), and degree 10 (wiggly). The underlying true relationship is a gentle cubic.

πŸ“Š Three Degrees of Polynomial Fit on the Same Dataset
[Figure: three fits on the same dataset. Underfitting, degree 1 (high bias): train error high. Normal fit, degree 3 (balanced): train and validation error both low. Overfitting, degree 10 (high variance): train error low, validation error high.]

All three panels use exactly the same dataset. Only the model complexity changes. The degree-1 line misses the curvature. The degree-3 curve captures the true shape. The degree-10 curve memorises every noise fluctuation β€” and will fail on new data.


Section 08

Underfitting vs Normal Fit vs Overfitting β€” Side by Side

Property | Underfitting | Normal Fit | Overfitting
Bias | High | Low | Very Low
Variance | Low | Low | High
Training Error | High | Low | Very Low (near 0)
Validation Error | High | Low | High
Train–Val Gap | Small (both bad) | Small (both good) | Large
Model Complexity | Too simple | Just right | Too complex
More training data helps? | No | Marginally | Yes, significantly
Example (Polynomial) | Degree 1 (line) | Degree 3–5 | Degree 12+
Example (Decision Tree) | Max depth = 1 (stump) | Max depth = 5–10 | Max depth = None
Example (Neural Net) | 2 neurons, no layers | Appropriate architecture | Millions of params, tiny data

Section 09

Learning Curves β€” The Diagnostic Tool

A learning curve plots training error and validation error against either training set size or number of training epochs. The shape of these two curves diagnoses whether you have a bias or variance problem.

πŸ“ˆ Learning Curves β€” Three Scenarios
[Figure: three learning-curve panels plotting training and validation error against training set size. High bias (underfit): both curves high with a gap near zero. Normal fit (sweet spot): both curves low with a small gap. High variance (overfit): training error near zero, validation error high, large gap.]

High Bias: both curves converge to a high error β€” adding data doesn't help. Normal Fit: curves converge to a low error with a small gap. High Variance: training error is near-zero while validation error stays high β€” the gap is large and only closes slowly with more data.


Section 10

Fixes for Underfitting (High Bias)

πŸ’‘
Diagnosis First

You have an underfitting problem if: training error is high, validation error is also high, and the gap between them is small. Collecting more data will not help. You need to give the model more capacity to learn.

πŸ“ˆ
Increase Model Complexity
Use a higher-degree polynomial, a deeper decision tree, more neurons, or more layers. Switch from a linear model to a non-linear one (e.g. SVM with RBF kernel, Random Forest).
βœ“ Directly reduces bias
βœ— Risk of introducing variance β€” monitor validation error
πŸ› οΈ
Add More Features
Provide the model with richer information. Engineer interaction terms (x₁·xβ‚‚), polynomial features (xΒ²), domain-specific ratios, or aggregations that capture the true underlying structure.
βœ“ Gives the model more signal to work with
βœ— Irrelevant features add variance without reducing bias
πŸ”§
Reduce Regularisation
If you applied L1/L2 regularisation, the penalty may be too strong, preventing the model from fitting even the training data. Lower Ξ» (or raise C in sklearn) to give the model more freedom.
βœ“ Immediately loosens constraints on weights
βœ— Too little regularisation β†’ overfitting
⏱️
Train Longer
In neural networks and gradient boosting, early stopping or too few epochs/estimators can leave a model undertrained. Allow more iterations so the optimiser can reach a better minimum.
βœ“ Simple fix β€” no architecture change needed
βœ— Watch for overfitting as epochs increase
πŸ”„
Switch Algorithm Family
A fundamentally mismatched algorithm cannot be saved by tuning. If data has complex non-linear boundaries, move from Logistic Regression β†’ Gradient Boosting or Neural Networks.
βœ“ Sometimes the only real fix
βœ— New hyperparameters to tune from scratch
πŸ“‰
Remove Excessive Feature Selection
If you removed too many features during preprocessing, you may have dropped important signal. Re-examine feature importances and restore features whose removal hurt training performance.
βœ“ Recovers lost information
βœ— May re-introduce correlated / noisy features
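
To illustrate two of these fixes on toy data (our own assumed cubic example, not code from this article), the sketch below shows that loosening regularisation on a linear model barely helps, while adding capacity (a cubic term) is what actually removes the bias.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] ** 2 + rng.normal(0, 1.5, 200)

def cv_mse(degree, alpha):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
    return -cross_val_score(model, X, y, cv=5,
                            scoring='neg_mean_squared_error').mean()

print("degree 1, alpha 100:", round(cv_mse(1, 100.0), 1))   # too simple and over-penalised
print("degree 1, alpha 1  :", round(cv_mse(1, 1.0),   1))   # weaker penalty, still too simple
print("degree 3, alpha 1  :", round(cv_mse(3, 1.0),   1))   # extra capacity removes the bias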

Section 11

Fixes for Overfitting (High Variance)

⚠️
Diagnosis First

You have an overfitting problem if: training error is very low, validation error is significantly higher, and the gap is large. The model has memorised training data. Fixes focus on constraining the model or exposing it to more diverse data.

πŸ“¦
Get More Training Data
The most reliable fix. More diverse examples force the model to learn generalised patterns rather than memorising specifics. Even synthetic data augmentation (flips, rotations, noise) can help in vision and NLP tasks.
Always try this first if possible
βš–οΈ
Regularisation (L1 / L2)
Add a penalty to the loss function to keep weights small. L2 (Ridge): shrinks all weights β€” keeps all features with smaller influence. L1 (Lasso): forces some weights to exactly zero β€” automatic feature selection. Ξ» controls the strength.
sklearn: C parameter in LogReg / alpha in Ridge, Lasso
βœ‚οΈ
Reduce Model Complexity
Lower polynomial degree, restrict tree depth (max_depth), reduce neurons / layers, or use fewer features. Directly reduces the model's capacity to memorise noise.
sklearn: max_depth, max_features, min_samples_leaf
🎲
Dropout (Neural Networks)
Randomly zero out a fraction of neurons during each training step. This forces the network to learn redundant representations and prevents co-adaptation of neurons. Typical rate: 0.2–0.5.
keras: Dropout(rate=0.3)
πŸ›‘
Early Stopping
Monitor validation error during training. Stop when validation error starts rising (even if training error continues to fall). Saves the model weights at the epoch of minimum validation loss.
sklearn: early_stopping=True (MLPClassifier) Β· n_iter_no_change (GradientBoosting)
🌲
Ensemble Methods
Bagging (e.g. Random Forests) averages many high-variance models, dramatically reducing variance. Boosting builds shallow trees sequentially, keeping each one low-variance. Stacking combines diverse models to balance bias and variance simultaneously.
RandomForestClassifier, GradientBoostingClassifier
🧹
Feature Selection / PCA
Remove irrelevant or redundant features. High-dimensional data with many irrelevant columns gives the model too many ways to "explain" the training noise. SelectKBest, feature importances, or PCA all reduce dimensionality.
sklearn: SelectKBest, PCA, VarianceThreshold
πŸ”
Cross-Validation
Use k-fold cross-validation to get a reliable estimate of generalisation error during model selection and hyperparameter tuning. Never select hyperparameters based solely on a single train-test split β€” it can mislead.
sklearn: cross_val_score(model, X, y, cv=5)
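
As one concrete example of the ensemble fix, the sketch below (an assumed comparison on the diabetes dataset used later in this article) bags many deep trees with a Random Forest; cross-validated error and its fold-to-fold spread usually both drop relative to a single unconstrained tree.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold

X, y = load_diabetes(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'Single deep tree (high variance)': DecisionTreeRegressor(random_state=0),
    'Random Forest (bagged deep trees)':
        RandomForestRegressor(n_estimators=300, random_state=0),
}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    print(f"{name:<34} CV MSE {mse.mean():7.1f}   std {mse.std():6.1f}")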

Section 12

Regularisation β€” The Most Powerful Overfitting Fix

L2 Regularisation β€” Ridge
J(β) = Loss + λ · Σβⱼ²
Penalises large weights by adding their squared sum to the loss. All weights shrink toward zero but never reach it exactly. Good when all features are useful β€” keeps them all but constrains their influence.
L1 Regularisation β€” Lasso
J(Ξ²) = Loss + Ξ» Β· Ξ£|Ξ²β±Ό|
Penalises the absolute sum of weights. Creates sparsity β€” some weights are pushed to exactly zero, effectively removing features. Natural automatic feature selection.
πŸ” How Ξ» Controls the Bias-Variance Tradeoff
Ξ» = 0
No penalty β€” pure maximum likelihood. Weights can grow as large as needed to fit training data. Risk of high variance on small datasets. Model memorises noise.
Ξ» small
Mild constraint. Weights are slightly smaller. Good generalisation on large datasets. The sweet spot for most production models when tuned via cross-validation.
Ξ» large
Strong constraint. Weights are forced near zero regardless of the data. Model predictions become near-constant β€” essentially the mean. High bias β€” underfitting territory.
Tuning
Always tune Ξ» (or C = 1/Ξ» in sklearn) via cross-validation. Try values on a logarithmic scale: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100].
Use RidgeCV, LassoCV, or GridSearchCV to find the optimal value.
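
A short sketch of the two penalties in action (our own example on the diabetes dataset; the exact coefficient counts depend on the data and on Ξ»): Ridge shrinks every weight but keeps them all non-zero, while a sufficiently strong Lasso drives some weights to exactly zero.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

models = {
    'OLS   (no penalty)': LinearRegression(),
    'Ridge (alpha=10)  ': Ridge(alpha=10.0),
    'Lasso (alpha=10)  ': Lasso(alpha=10.0),
}
for name, model in models.items():
    coef = model.fit(X, y).coef_
    # Lasso typically zeroes several of the ten features at this strength
    print(f"{name}  sum|Ξ²| = {np.abs(coef).sum():7.1f}   "
          f"weights exactly zero = {int((coef == 0).sum())}")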

Section 13

Python β€” Diagnosing Bias & Variance with Learning Curves

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline       import Pipeline
from sklearn.preprocessing  import PolynomialFeatures
from sklearn.linear_model   import Ridge
from sklearn.model_selection import learning_curve

# ── Generate data with a cubic relationship + noise ──────────
np.random.seed(42)
X_raw = np.random.uniform(-3, 3, (200, 1))
y     = 0.5*X_raw[:,0]**3 - X_raw[:,0]**2 + np.random.randn(200)*1.5

# ── Helper: plot learning curve ───────────────────────────────
def plot_learning_curve(model, X, y, title, ax, cv=5):
    sizes, train_s, val_s = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv, scoring='neg_mean_squared_error'
    )
    train_err = -train_s.mean(axis=1)
    val_err   = -val_s.mean(axis=1)
    ax.plot(sizes, train_err, 'o-', label='Train MSE',      color='#f59e0b')
    ax.plot(sizes, val_err,   's--', label='Validation MSE', color='#6366f1')
    ax.set_title(title, fontsize=12)
    ax.set_xlabel('Training set size')
    ax.set_ylabel('MSE')
    ax.legend()
    ax.grid(True, alpha=0.3)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# ── Three model complexities ──────────────────────────────────
models = [
    ('Underfit β€” Degree 1',
     Pipeline([('poly', PolynomialFeatures(1)),
                ('ridge', Ridge(alpha=1.0))])),

    ('Normal Fit β€” Degree 3',
     Pipeline([('poly', PolynomialFeatures(3)),
                ('ridge', Ridge(alpha=1.0))])),

    ('Overfit β€” Degree 15',
     Pipeline([('poly', PolynomialFeatures(15)),
                ('ridge', Ridge(alpha=0.0001))])),
]

for ax, (title, model) in zip(axes, models):
    plot_learning_curve(model, X_raw, y, title, ax)

plt.tight_layout()
plt.savefig('learning_curves.png', dpi=150)
plt.show()

Section 14

Python β€” Validation Curve & Regularisation Tuning

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets        import load_diabetes
from sklearn.linear_model    import Ridge, RidgeCV
from sklearn.tree            import DecisionTreeRegressor
from sklearn.model_selection import validation_curve, cross_val_score, KFold
from sklearn.preprocessing   import StandardScaler

X, y = load_diabetes(return_X_y=True)
scaler = StandardScaler()
X_sc   = scaler.fit_transform(X)

# ── 1. Validation curve: Decision Tree max_depth ─────────────
depths      = np.arange(1, 16)
train_s, val_s = validation_curve(
    DecisionTreeRegressor(), X_sc, y,
    param_name='max_depth', param_range=depths,
    scoring='neg_mean_squared_error', cv=5
)
plt.figure(figsize=(8, 4))
plt.plot(depths, -train_s.mean(1), 'o-', label='Train MSE',      color='#f59e0b')
plt.plot(depths, -val_s.mean(1),   's--', label='Validation MSE', color='#6366f1')
# scores are negative MSE, so the best depth maximises the mean validation score
plt.axvline(x=depths[val_s.mean(1).argmax()],
            color='#34d399', linestyle='--', label='Optimal depth')
plt.xlabel('max_depth')
plt.ylabel('MSE')
plt.title('Validation Curve β€” Bias-Variance vs Tree Depth')
plt.legend()
plt.show()

# ── 2. Ridge alpha tuning via RidgeCV ────────────────────────
alphas   = np.logspace(-4, 4, 50)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_sc, y)
print(f"Best Ridge alpha (Ξ»): {ridge_cv.alpha_:.4f}")

# ── 3. Cross-validation score comparison ─────────────────────
kf = KFold(n_splits=5, shuffle=True, random_state=42)
models_cv = {
    'Ridge (best Ξ»)' : Ridge(alpha=ridge_cv.alpha_),
    'Decision Tree d=2 (underfit)': DecisionTreeRegressor(max_depth=2),
    'Decision Tree d=5 (good)'   : DecisionTreeRegressor(max_depth=5),
    'Decision Tree d=None (overfit)': DecisionTreeRegressor(),
}
print(f"\n{'Model':<35} {'CV MSE':>10}  {'Std':>8}")
print("-"*58)
for name, model in models_cv.items():
    scores = cross_val_score(
        model, X_sc, y,
        cv=kf, scoring='neg_mean_squared_error'
    )
    print(f"{name:<35} {-scores.mean():>10.1f}  {scores.std():>8.1f}")
Output
Best Ridge alpha (Ξ»): 0.3162

Model                                   CV MSE       Std
----------------------------------------------------------
Ridge (best Ξ»)                          2906.3     225.4
Decision Tree d=2 (underfit)            4218.7     318.6
Decision Tree d=5 (good)                3105.2     289.1
Decision Tree d=None (overfit)          4889.3     632.4
🎯
What the Output Tells Us

The depth-2 tree underfits (high CV MSE, low std β€” consistent but wrong). The unlimited tree overfits (high CV MSE, high std β€” inconsistent across folds). The depth-5 tree and tuned Ridge both sit near the sweet spot with lower MSE and moderate variance. Notice that the overfitting model also has the highest standard deviation β€” it changes dramatically across folds, confirming high variance.


Section 15

Golden Rules

🎯 Bias, Variance & Model Fit β€” Key Rules
1
Always plot your learning curves before tuning hyperparameters. A learning curve immediately tells you whether you have a bias problem (both curves converge high β€” increase complexity) or a variance problem (large gap β€” reduce complexity, regularise, or get more data). Tuning blindly wastes time.
2
More data cures variance, not bias. If your model underfits, doubling the dataset size will barely move the needle. If it overfits, more data is often the best fix. Correctly diagnosing which problem you have prevents you from collecting data you don't need.
3
Regularisation moves you along the bias-variance curve β€” use cross-validation to find the sweet spot. Increasing Ξ» always increases bias and decreases variance. Never set Ξ» manually β€” always search on a log scale using RidgeCV, LassoCV, or GridSearchCV on your training data.
4
A model with near-zero training error is almost always overfitting. Real data always has noise. If your training MSE is 0.001 and validation MSE is 15.0, the model has memorised your training set perfectly and generalises to nothing. Check for data leakage, data snooping, or a model that is far too complex for your dataset size.
5
The irreducible noise floor is real β€” do not chase it. Every dataset has noise that no model can eliminate. If your validation MSE has plateaued and you keep increasing complexity, you are just adding variance without reducing the true signal error. Accept the floor and move on.
6
Ensemble methods are the most practical bias-variance management tool. Bagging (Random Forests) reduces variance without increasing bias. Boosting reduces bias gradually while keeping variance controlled. Both outperform single-model tuning in nearly every real-world benchmark β€” not because they violate the tradeoff, but because they navigate it more efficiently.