
Linear Regression

A complete guide to Linear Regression covering the line equation, Ordinary Least Squares derivation via differential calculus, RMSE, R², and Adjusted R² — with real stories, inline SVG diagrams, step-by-step calculations, and Python code.

Section 01

The Story: Priya Wants to Price Her Flat

A Real-Estate Agent in Mumbai
Priya is a data analyst at a Mumbai real-estate firm. Her boss drops a dataset on her desk — 200 flats with their size in square feet and their sale price in lakhs. A new flat just came in: 1 200 sq ft. No price yet. Her boss asks: "What should we list it for?"

Priya knows she can't just take the average price — that ignores the fact that bigger flats cost more. She needs a formula that takes size as input and spits out a predicted price. That formula is a Linear Regression model.

Linear Regression is the most fundamental supervised machine-learning algorithm. It finds the straight line (a plane, when there are several features) that best describes the relationship between one or more input variables (features) and a continuous output variable (target). Once the line is found, predicting new values is as simple as plugging a number into a formula.


Section 02

Visualising the Idea

Priya plots her data on a graph. Each dot is a flat — x-axis = size, y-axis = price. The dots form a rough upward trend. Linear Regression draws the best-fit line through those dots.

📊 Scatter Plot — Size vs. Price (Mumbai Flats)
[Figure: scatter plot of the training data, size in sq ft on the x-axis and price in ₹ lakhs on the y-axis, with the best-fit line drawn through the points and the new flat's prediction marked on the line.]

The regression line minimises the total squared vertical distance between itself and every data point.


Section 03

The Equation of a Straight Line

In school you learned y = mx + c. Linear Regression uses the same idea, just with machine-learning notation:

Simple Linear Regression (1 feature)
ŷ = β₀ + β₁ · x
ŷ = predicted value  |  β₀ = intercept (where the line crosses the y-axis)  |  β₁ = slope (how much y changes per unit of x)
Multiple Linear Regression (many features)
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Each feature xᵢ has its own coefficient βᵢ that says how strongly that feature influences the prediction.
💡
Priya's Flat Example

After fitting her model, Priya finds β₀ = −5 (intercept) and β₁ = 0.115 (slope). For a 1 200 sq ft flat:
ŷ = −5 + 0.115 × 1200 = −5 + 138 = 133 lakhs.
The model says: list it at ₹1.33 crore.
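The formula is trivial to turn into code. A minimal sketch, using the full-dataset coefficients quoted above (β₀ = −5, β₁ = 0.115):

# Coefficients Priya found on her full 200-flat dataset (from the example above)
beta0 = -5.0     # intercept
beta1 = 0.115    # slope: lakhs per extra sq ft

def predict_price(size_sqft):
    """Simple linear regression prediction: y-hat = beta0 + beta1 * x."""
    return beta0 + beta1 * size_sqft

print(predict_price(1200))   # 133.0 lakhs, i.e. list at about 1.33 crore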


Section 04

The Math: How the Best Line is Found (OLS)

"Best fit" is not guesswork. The algorithm minimises the Sum of Squared Residuals (SSR) — also called the Residual Sum of Squares (RSS) or Sum of Squared Errors (SSE). A residual is the gap between a real data point and the line's prediction.

📐 Residuals — The Gaps the Model is Minimising
[Diagram: for one data point, the actual value yᵢ, the predicted value ŷᵢ on the line, and the residual yᵢ − ŷᵢ shown as the vertical gap between them. Axes: x (feature) vs y (target).]

OLS finds β₀ and β₁ that make the sum of all squared residuals as small as possible.

The Cost Function

We want to minimise the Loss function L:

Sum of Squared Residuals (SSR)
L = Σᵢ (yᵢ − ŷᵢ)²
Sum over all n data points of (actual − predicted)²
Expanded (substituting ŷ = β₀ + β₁x)
L = Σᵢ (yᵢ − β₀ − β₁xᵢ)²
L is now a function of two unknowns: β₀ and β₁

Differential Calculus — Finding the Minimum

To minimise L, take the partial derivative with respect to each parameter and set it to zero. This is where calculus meets machine learning.

∂ Deriving β₀ and β₁ via Calculus
Step 1
Differentiate L with respect to β₀ and set to zero.
∂L/∂β₀ = −2 · Σ(yᵢ − β₀ − β₁xᵢ) = 0
Simplifying:  Σyᵢ = n·β₀ + β₁·Σxᵢ
Divide by n:  ȳ = β₀ + β₁·x̄  →  β₀ = ȳ − β₁·x̄
Step 2
Differentiate L with respect to β₁ and set to zero.
∂L/∂β₁ = −2 · Σxᵢ(yᵢ − β₀ − β₁xᵢ) = 0
Simplifying:  Σ(xᵢyᵢ) = β₀·Σxᵢ + β₁·Σxᵢ²
Step 3
Substitute β₀ = ȳ − β₁·x̄ into the Step 2 equation and solve for β₁:
β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
This is the OLS closed-form solution. No iteration needed.
β₁ = Cov(x, y) / Var(x)
Result
Once β₁ is known, find β₀:
β₀ = ȳ − β₁ · x̄
These two formulas give you the exact best-fit line — guaranteed to minimise the sum of squared residuals.
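You can sanity-check the Cov/Var form numerically. A quick sketch with NumPy, using the five-flat dataset from the next section (population covariance and variance, i.e. ddof = 0, so the 1/n factors cancel in the ratio):

import numpy as np

x = np.array([600, 800, 1000, 1200, 1400], dtype=float)
y = np.array([65, 85, 105, 120, 145], dtype=float)

# Population (ddof=0) covariance over population variance
beta1 = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()

print(beta1)   # 0.0975
print(beta0)   # 6.5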
📐
Why Squared and Not Absolute?

Squaring residuals serves two purposes: it makes all gaps positive (no cancelling), and it penalises large errors more than small ones (because squaring grows faster than linear). The squared form is also mathematically smooth everywhere, making it easy to differentiate and find a unique minimum.


Section 05

Numerical Example — Priya's 5 Flats

Let's use a small dataset so the math is visible. Priya takes 5 flats:

i | Size xᵢ (sq ft) | Price yᵢ (₹L) | xᵢ − x̄ | yᵢ − ȳ | (xᵢ−x̄)(yᵢ−ȳ) | (xᵢ−x̄)²
1 | 600   | 65  | −400 | −39 | 15 600 | 160 000
2 | 800   | 85  | −200 | −19 | 3 800  | 40 000
3 | 1 000 | 105 | 0    | 1   | 0      | 0
4 | 1 200 | 120 | 200  | 16  | 3 200  | 40 000
5 | 1 400 | 145 | 400  | 41  | 16 400 | 160 000
Σ | 5 000 | 520 | 0    | 0   | 39 000 | 400 000
🧮 OLS Calculation for Priya's Data
Means
x̄ = 5000 / 5 = 1 000 sq ft    ȳ = 520 / 5 = 104 lakhs
β₁
β₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² = 39 000 / 400 000 = 0.0975 lakhs per sq ft
β₀
β₀ = ȳ − β₁ · x̄ = 104 − 0.0975 × 1000 = 104 − 97.5 = 6.5 lakhs
Model
ŷ = 6.5 + 0.0975 · x
For the new 1 200 sq ft flat: ŷ = 6.5 + 0.0975 × 1200 = 123.5 lakhs
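As a cross-check, NumPy's polyfit recovers the same line in one call (a degree-1 polynomial fit is exactly OLS on one feature):

import numpy as np

x = [600, 800, 1000, 1200, 1400]
y = [65, 85, 105, 120, 145]

slope, intercept = np.polyfit(x, y, deg=1)   # degree-1 polynomial = straight line
print(slope, intercept)                      # ~0.0975  ~6.5
print(intercept + slope * 1200)              # ~123.5, the same prediction as above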

Section 06

RMSE — Root Mean Squared Error

Now that Priya has a model, she needs to know: how wrong is it, on average? The most common metric for this is RMSE.

Mean Squared Error (MSE)
MSE = (1/n) · Σ(yᵢ − ŷᵢ)²
Average of the squared errors. Units are squared (e.g. lakhs²) — hard to interpret directly.
Root Mean Squared Error (RMSE)
RMSE = √MSE = √[ (1/n) · Σ(yᵢ − ŷᵢ)² ]
Square root of MSE. Same unit as y. Tells you the typical prediction error in real terms.

Step-by-Step RMSE for Priya's 5 Flats

i | Actual yᵢ | Predicted ŷᵢ | Residual (yᵢ − ŷᵢ) | (yᵢ − ŷᵢ)²
1 | 65  | 6.5 + 0.0975×600 = 65.0   | 0.0  | 0.00
2 | 85  | 6.5 + 0.0975×800 = 84.5   | 0.5  | 0.25
3 | 105 | 6.5 + 0.0975×1000 = 104.0 | 1.0  | 1.00
4 | 120 | 6.5 + 0.0975×1200 = 123.5 | −3.5 | 12.25
5 | 145 | 6.5 + 0.0975×1400 = 143.0 | 2.0  | 4.00
Σ |     |                           |      | 17.50
🧮 RMSE Calculation
MSE
17.50 / 5 = 3.50 lakhs²
RMSE
√3.50 = 1.87 lakhs
Meaning
On average, Priya's model is off by ₹1.87 lakhs on each prediction. Since flats cost ~₹100 L, that's about 1.9% error — excellent.
⚠️
RMSE Punishes Big Mistakes Hard

Because errors are squared before averaging, a single large prediction error inflates RMSE far more than several small ones. This is both a feature (large errors are costly in practice) and a sensitivity to outliers. If you want a metric that treats all errors equally, use MAE (Mean Absolute Error).
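A small sketch of this sensitivity, comparing RMSE and MAE on the same set of residuals with and without one large miss (illustrative numbers, not Priya's data):

import math

def rmse(errors):
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

small_errors = [2, -2, 2, -2, 2]        # all errors modest
with_outlier = [2, -2, 2, -2, 10]       # one big miss

print(rmse(small_errors), mae(small_errors))   # 2.00  2.0
print(rmse(with_outlier), mae(with_outlier))   # 4.82  3.6  -> RMSE jumps far more than MAE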


Section 07

R² — The Coefficient of Determination

RMSE tells you the absolute error in real units. But it does not tell you how good your model is compared to having no model at all. R² answers that question.

What Does "Explained Variance" Mean?
Imagine Priya had no model at all. Her best guess for any flat's price would just be the mean: ȳ = 104 lakhs. The total variation in actual prices around this mean is called SS_tot (Total Sum of Squares).

After fitting her regression line, the variation that is still unexplained (the residuals) is called SS_res (Residual Sum of Squares).

R² measures how much of the original variation the model explained away.
Total Sum of Squares (SS_tot)
SS_tot = Σ(yᵢ − ȳ)²
Variation when using the mean as the only predictor — the "dumb baseline"
Residual Sum of Squares (SS_res)
SS_res = Σ(yᵢ − ŷᵢ)²
Variation left unexplained after fitting the regression line
R² — Coefficient of Determination
R² = 1 − (SS_res / SS_tot)
Ranges from 0 to 1 (or negative if the model is worse than the mean baseline). R² = 1 means perfect prediction. R² = 0 means the model explains nothing.
📊 Visual Breakdown: SS_tot vs SS_res
[Two panels: left, the baseline model ŷ = ȳ = 104, where each bar is (yᵢ − ȳ)² and SS_tot is large; right, the fitted regression line, where the residuals are tiny and SS_res is small.]

R² = 1 − SS_res/SS_tot. The smaller SS_res is relative to SS_tot, the closer R² is to 1.

Calculating R² for Priya's Data

🧮 R² Calculation
SS_tot
Deviations from ȳ = 104:
(65−104)²+(85−104)²+(105−104)²+(120−104)²+(145−104)²
= 1521 + 361 + 1 + 256 + 1681 = 3 820
SS_res
Squared residuals from the model (from RMSE table above):
0 + 0.25 + 1 + 12.25 + 4 = 17.50
R² = 1 − (17.50 / 3820) = 1 − 0.00458 = 0.9954
The model explains 99.54% of the variance in flat prices. Excellent fit on this small dataset.
R² Value | Interpretation | Verdict
0.90 – 1.00 | Model explains 90–100% of variance | Excellent
0.70 – 0.90 | Model explains 70–90% of variance | Good
0.50 – 0.70 | Moderate explanatory power | Acceptable
0.00 – 0.50 | Weak model; misses major patterns | Poor
Negative | Model is worse than a flat mean baseline | Terrible
⚠️
The Dark Side of R² — It Always Goes Up

Adding more features to a model never decreases R² on the training data — even if those features are pure noise. This means you can inflate R² just by throwing in useless columns: a model built on 3 meaningful features plus 50 random ones will score at least as high as the 3-feature model, even though it generalises worse. This is why Adjusted R² was invented.
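You can watch this happen directly. A sketch on synthetic data (not Priya's): appending a pure-noise column nudges training R² up even though the column carries no information.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 50
size = rng.uniform(500, 1500, size=(n, 1))                 # genuine feature
price = 6.5 + 0.0975 * size[:, 0] + rng.normal(0, 5, n)    # linear signal + noise

noise = rng.normal(size=(n, 1))                            # pure-noise "feature"
X_small = size
X_big   = np.hstack([size, noise])

for name, X in [("size only", X_small), ("size + noise", X_big)]:
    model = LinearRegression().fit(X, price)
    print(name, r2_score(price, model.predict(X)))
# Training R² for "size + noise" is >= the "size only" value,
# even though the extra column is random.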


Section 08

Adjusted R² — The Honest Version

Adjusted R² corrects for the "adding features always helps" bias in R². It penalises you for adding features that do not genuinely improve the model. It only increases if a new feature adds more than it costs.

Adjusted R² Formula
R²_adj = 1 − [ (1 − R²) · (n − 1) / (n − p − 1) ]
n = number of data points  |  p = number of features (predictors, not counting the intercept)  |  R² = ordinary R² of the model

How the Penalty Works

🔍 Understanding the Penalty Term
Term
(n − 1) / (n − p − 1) is the penalty multiplier. As p (number of features) grows, the denominator shrinks, making the multiplier larger. This makes (1 − R²) · penalty larger, which makes R²_adj smaller.
Good feature
If a new feature meaningfully reduces SS_res, R² increases by more than the penalty costs → Adjusted R² goes UP ↑
Noise feature
If a new feature adds very little to R², the penalty is larger than the gain → Adjusted R² goes DOWN ↓
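A tiny sketch of how the penalty multiplier grows with p for a fixed sample size (n = 20 here is an arbitrary choice for illustration):

n = 20   # arbitrary sample size for illustration
for p in [1, 2, 5, 10, 15]:
    penalty = (n - 1) / (n - p - 1)
    print(f"p = {p:2d}  penalty multiplier = {penalty:.3f}")
# p =  1 -> 1.056
# p =  5 -> 1.357
# p = 15 -> 4.750   (the (1 - R²) term gets scaled up nearly 5x)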

Numerical Example — Adding a Useless Feature

Priya adds a second feature: the flat owner's lucky number (random noise).

Model | Features (p) | n | R² | Adjusted R² | Verdict
Model A | Size only (p=1) | 5 | 0.9954 | 1 − (1−0.9954)·(4/3) = 1 − 0.0046·1.333 = 0.9939 | Good
Model B | Size + Lucky Number (p=2) | 5 | 0.9961 (+0.0007) | 1 − (1−0.9961)·(4/2) = 1 − 0.0039·2.0 = 0.9922 | Worse!
🎯
Adjusted R² Caught the Noise

R² went up slightly (0.9954 → 0.9961) when the lucky number was added — a misleading signal. But Adjusted R² fell (0.9939 → 0.9922), correctly signalling that the extra feature hurt the model more than it helped. Always prefer Adjusted R² when comparing models with different numbers of features.


Section 09

Metrics at a Glance

RMSE
📏
√[ Σ(yᵢ−ŷᵢ)² / n ]
  • Same unit as target
  • Penalises large errors heavily
  • Sensitive to outliers
  • Lower is better
R²
📊
1 − SS_res / SS_tot
  • Unitless (0 to 1)
  • Easy to communicate
  • Never decreases with more features
  • Higher is better
Adjusted R²
⚖️
1 − (1−R²)(n−1)/(n−p−1)
  • Penalises for extra features
  • Best for model comparison
  • Can decrease with useless features
  • Higher is better

Section 10

Python Implementation

Manual OLS from Scratch

import math

x = [600, 800, 1000, 1200, 1400]    # size (sq ft)
y = [65,  85,  105,  120,  145]     # price (lakhs)

n     = len(x)
x_bar = sum(x) / n                   # 1000.0
y_bar = sum(y) / n                   # 104.0

# OLS coefficients
num   = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 39000
denom = sum((xi - x_bar) ** 2        for xi     in x)             # 400000

beta1 = num / denom                  # slope     = 0.0975
beta0 = y_bar - beta1 * x_bar       # intercept = 6.5

print(f"Intercept (β₀): {beta0:.4f}")  # 6.5
print(f"Slope     (β₁): {beta1:.4f}")  # 0.0975

# Predictions
y_pred = [beta0 + beta1 * xi for xi in x]
print("Predictions:", [round(p, 2) for p in y_pred])
# [65.0, 84.5, 104.0, 123.5, 143.0]

Computing RMSE, R², Adjusted R²

import math

def rmse(y_true, y_pred):
    n   = len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return math.sqrt(sse / n)

def r_squared(y_true, y_pred):
    y_bar  = sum(y_true) / len(y_true)
    ss_tot = sum((yt - y_bar) ** 2 for yt     in y_true)
    ss_res = sum((yt - yp)    ** 2 for yt, yp in zip(y_true, y_pred))
    return 1 - (ss_res / ss_tot)

def adjusted_r2(y_true, y_pred, p):
    n  = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_true = [65, 85, 105, 120, 145]
y_pred = [65.0, 84.5, 104.0, 123.5, 143.0]

print(f"RMSE        : {rmse(y_true, y_pred):.4f}")             # 1.8708
print(f"R²          : {r_squared(y_true, y_pred):.4f}")         # 0.9954
print(f"Adjusted R² : {adjusted_r2(y_true, y_pred, p=1):.4f}")  # 0.9939
Output
RMSE        : 1.8708
R²          : 0.9954
Adjusted R² : 0.9939

Using scikit-learn

import numpy as np
from sklearn.linear_model  import LinearRegression
from sklearn.metrics       import mean_squared_error, r2_score

X = np.array([[600], [800], [1000], [1200], [1400]])  # 2-D for sklearn
y = np.array([65, 85, 105, 120, 145])

model = LinearRegression()
model.fit(X, y)

print(f"Intercept β₀ : {model.intercept_:.4f}")   # 6.5
print(f"Slope     β₁ : {model.coef_[0]:.4f}")      # 0.0975

y_pred = model.predict(X)

mse  = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y, y_pred)

n, p   = len(y), X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"RMSE         : {rmse:.4f}")     # 1.8708
print(f"R²           : {r2:.4f}")       # 0.9954
print(f"Adjusted R²  : {adj_r2:.4f}")  # 0.9939

# Predict the new flat
new_flat = np.array([[1200]])
print(f"Predicted price for 1200 sq ft: {model.predict(new_flat)[0]:.1f} L")
# 123.5 L
💡
scikit-learn Does Not Give Adjusted R² Directly

r2_score() only returns plain R². You always need to compute Adjusted R² manually using the formula above. Remember to track n (sample size) and p (number of features, not counting the intercept) carefully.


Section 11

Assumptions of Linear Regression

Linear Regression is not magic — it comes with six assumptions that must hold for the model and its metrics to be trustworthy.

📏
Linearity
The relationship between x and y must be approximately linear. Check with a scatter plot. If the data curves, use polynomial features or a non-linear model.
🔀
Independence of Errors
Residuals must not correlate with each other. Time-series data often violates this — today's error leaks into tomorrow's. Use the Durbin-Watson test to check (see the diagnostics sketch after this list).
🎯
Homoscedasticity
The spread of residuals should be constant across all levels of x. A "funnel" pattern in a residuals-vs-fitted plot signals heteroscedasticity — a violation. Try log-transforming the target.
🔔
Normality of Residuals
Residuals should follow a normal distribution. Important for hypothesis tests on coefficients. Check with a Q-Q plot. Minor violations are usually fine for large samples (Central Limit Theorem).
🚫
No Multicollinearity
Features should not be highly correlated with each other (Multiple LR only). Use VIF (Variance Inflation Factor) to detect it. Drop or combine correlated features.
🔬
No Influential Outliers
A single extreme point can drag the entire regression line. Check Cook's Distance or leverage scores. Consider robust regression methods if outliers cannot be removed.
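A hedged sketch of how a couple of these checks look in code, using statsmodels; the data, variable names, and thresholds here are illustrative, so adapt them to your own dataset:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative data: two features, one target (replace with your own)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)            # statsmodels needs an explicit intercept column
model = sm.OLS(y, X_const).fit()

# Independence of errors: Durbin-Watson near 2 suggests no autocorrelation
print("Durbin-Watson:", durbin_watson(model.resid))

# Multicollinearity: VIF per feature (rule of thumb: worry above roughly 5-10)
for i in range(1, X_const.shape[1]):    # skip the constant column
    print(f"VIF feature {i}:", variance_inflation_factor(X_const, i))

# Normality / homoscedasticity: inspect the residuals (Q-Q plot, residuals-vs-fitted)
print("Residual mean:", model.resid.mean())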

Section 12

Golden Rules

🎯 Linear Regression — Key Rules
1
Always plot first. A scatter plot reveals whether a linear relationship actually exists. Fitting a line to a curved or categorical relationship gives misleading coefficients and useless metrics.
2
Use RMSE for error magnitude. RMSE is in the same unit as your target variable — it tells you the typical mistake in real-world terms. An RMSE of 1.87 lakhs on flat prices is immediately meaningful to Priya's boss.
3
Use Adjusted R² for model selection. When comparing models with different numbers of features, always use Adjusted R² — not R². R² almost always improves as you add features, even random noise.
4
High R² does not mean the model is useful. You can get R² = 0.99 by memorising the training data. Always evaluate on a held-out test set. A large gap between train R² and test R² signals overfitting.
5
OLS is optimal under the Gauss-Markov theorem — it gives the Best Linear Unbiased Estimator (BLUE) when the assumptions hold. If they don't, consider Ridge, Lasso, or robust regression variants.
6
Scale your features for gradient-descent solvers. While OLS has a closed-form solution (β = (XᵀX)⁻¹Xᵀy), gradient-descent implementations (used for very large datasets) converge far faster when features are standardised to mean 0, std 1.
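For completeness, here is the closed form from rule 6 in NumPy, run on Priya's five flats; np.linalg.solve is used instead of forming the inverse explicitly, which is the numerically safer habit:

import numpy as np

x = np.array([600, 800, 1000, 1200, 1400], dtype=float)
y = np.array([65, 85, 105, 120, 145], dtype=float)

X = np.column_stack([np.ones_like(x), x])    # first column of 1s = intercept term

# Normal equations: beta = (X^T X)^-1 X^T y, solved without an explicit inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [6.5, 0.0975] -> the same beta0, beta1 as the hand calculation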