
Linear Regression

A complete guide to Linear Regression covering the line equation, Ordinary Least Squares derivation via differential calculus, RMSE, R², and Adjusted R² — with real stories, inline SVG diagrams, step-by-step calculations, and Python code.

Section 01

The Story: Priya Wants to Price Her Flat

A Real-Estate Agent in Mumbai
Priya is a data analyst at a Mumbai real-estate firm. Her boss drops a dataset on her desk — 200 flats with their size in square feet and their sale price in lakhs. A new flat just came in: 1 200 sq ft. No price yet. Her boss asks: "What should we list it for?"

Priya knows she can't just take the average price — that ignores the fact that bigger flats cost more. She needs a formula that takes size as input and spits out a predicted price. That formula is a Linear Regression model.

Linear Regression is the most fundamental supervised machine-learning algorithm. It finds the straight line (a plane, when there are several features) that best describes the relationship between one or more input variables (features) and a continuous output variable (target). Once the line is found, predicting new values is as simple as plugging a number into a formula.


Section 02

Visualising the Idea

Priya plots her data on a graph. Each dot is a flat — x-axis = size, y-axis = price. The dots form a rough upward trend. Linear Regression draws the best-fit line through those dots.

📊 Scatter Plot — Size vs. Price (Mumbai Flats)
[Figure: scatter plot of the training data, size in sq ft on the x-axis and price in ₹ lakhs on the y-axis, with the best-fit line drawn through the points and the new flat's prediction marked on the line.]

The regression line minimises the total squared vertical distance between itself and every data point.


Section 03

The Equation of a Straight Line

In school you learned y = mx + c. Linear Regression uses the same idea, just with machine-learning notation:

Simple Linear Regression (1 feature)
ŷ = β₀ + β₁ · x
ŷ = predicted value  |  β₀ = intercept (where the line crosses the y-axis)  |  β₁ = slope (how much y changes per unit of x)
Multiple Linear Regression (many features)
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Each feature xᵢ has its own coefficient βᵢ that says how strongly that feature influences the prediction.
💡
Priya's Flat Example

After fitting her model, Priya finds β₀ = −5 (intercept) and β₁ = 0.115 (slope). For a 1 200 sq ft flat:
ŷ = −5 + 0.115 × 1200 = −5 + 138 = 133 lakhs.
The model says: list it at ₹1.33 crore.
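The formula is trivial to turn into code. A minimal sketch, using the full-dataset coefficients quoted above (β₀ = −5, β₁ = 0.115):

# Coefficients Priya found on her full 200-flat dataset (from the example above)
beta0 = -5.0     # intercept
beta1 = 0.115    # slope: lakhs per extra sq ft

def predict_price(size_sqft):
    """Simple linear regression prediction: y-hat = beta0 + beta1 * x."""
    return beta0 + beta1 * size_sqft

print(predict_price(1200))   # 133.0 lakhs, i.e. list at about 1.33 crore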


Section 04

The Math: How the Best Line is Found (OLS)

"Best fit" is not guesswork. The algorithm minimises the Sum of Squared Residuals (SSR) — also called the Residual Sum of Squares (RSS) or Sum of Squared Errors (SSE). A residual is the gap between a real data point and the line's prediction.

📐 Residuals — The Gaps the Model is Minimising
[Diagram: for one data point, the actual value yᵢ, the predicted value ŷᵢ on the line, and the residual yᵢ − ŷᵢ shown as the vertical gap between them. Axes: x (feature) vs y (target).]

OLS finds β₀ and β₁ that make the sum of all squared residuals as small as possible.

The Cost Function

We want to minimise the Loss function L:

Sum of Squared Residuals (SSR)
L = Σᵢ (yᵢ − ŷᵢ)²
Sum over all n data points of (actual − predicted)²
Expanded (substituting ŷ = β₀ + β₁x)
L = Σᵢ (yᵢ − β₀ − β₁xᵢ)²
L is now a function of two unknowns: β₀ and β₁

Differential Calculus — Finding the Minimum

To minimise L, take the partial derivative with respect to each parameter and set it to zero. This is where calculus meets machine learning.

∂ Deriving β₀ and β₁ via Calculus
Step 1
Differentiate L with respect to β₀ and set to zero.
∂L/∂β₀ = −2 · Σ(yᵢ − β₀ − β₁xᵢ) = 0
Simplifying:  Σyᵢ = n·β₀ + β₁·Σxᵢ
Divide by n:  ȳ = β₀ + β₁·x̄  →  β₀ = ȳ − β₁·x̄
Step 2
Differentiate L with respect to β₁ and set to zero.
∂L/∂β₁ = −2 · Σxᵢ(yᵢ − β₀ − β₁xᵢ) = 0
Simplifying:  Σ(xᵢyᵢ) = β₀·Σxᵢ + β₁·Σxᵢ²
Step 3
Substitute β₀ = ȳ − β₁·x̄ into the Step 2 equation and solve for β₁:
β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
This is the OLS closed-form solution. No iteration needed.
β₁ = Cov(x, y) / Var(x)
Result
Once β₁ is known, find β₀:
β₀ = ȳ − β₁ · x̄
These two formulas give you the exact best-fit line — guaranteed to minimise the sum of squared residuals.
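You can sanity-check the Cov/Var form numerically. A quick sketch with NumPy, using the five-flat dataset from the next section (population covariance and variance, i.e. ddof = 0, so the 1/n factors cancel in the ratio):

import numpy as np

x = np.array([600, 800, 1000, 1200, 1400], dtype=float)
y = np.array([65, 85, 105, 120, 145], dtype=float)

# Population (ddof=0) covariance over population variance
beta1 = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()

print(beta1)   # 0.0975
print(beta0)   # 6.5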
📐
Why Squared and Not Absolute?

Squaring residuals serves two purposes: it makes all gaps positive (no cancelling), and it penalises large errors more than small ones (because squaring grows faster than linear). The squared form is also mathematically smooth everywhere, making it easy to differentiate and find a unique minimum.


Section 05

Numerical Example — Priya's 5 Flats

Let's use a small dataset so the math is visible. Priya takes 5 flats:

i | Size xᵢ (sq ft) | Price yᵢ (₹L) | xᵢ − x̄ | yᵢ − ȳ | (xᵢ−x̄)(yᵢ−ȳ) | (xᵢ−x̄)²
1 | 600   | 65  | −400 | −39 | 15 600 | 160 000
2 | 800   | 85  | −200 | −19 | 3 800  | 40 000
3 | 1 000 | 105 | 0    | 1   | 0      | 0
4 | 1 200 | 120 | 200  | 16  | 3 200  | 40 000
5 | 1 400 | 145 | 400  | 41  | 16 400 | 160 000
Σ | 5 000 | 520 | 0    | 0   | 39 000 | 400 000
🧮 OLS Calculation for Priya's Data
Means
x̄ = 5000 / 5 = 1 000 sq ft    ȳ = 520 / 5 = 104 lakhs
β₁
β₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² = 39 000 / 400 000 = 0.0975 lakhs per sq ft
β₀
β₀ = ȳ − β₁ · x̄ = 104 − 0.0975 × 1000 = 104 − 97.5 = 6.5 lakhs
Model
ŷ = 6.5 + 0.0975 · x
For the new 1 200 sq ft flat: ŷ = 6.5 + 0.0975 × 1200 = 123.5 lakhs
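As a cross-check, NumPy's polyfit recovers the same line in one call (a degree-1 polynomial fit is exactly OLS on one feature):

import numpy as np

x = [600, 800, 1000, 1200, 1400]
y = [65, 85, 105, 120, 145]

slope, intercept = np.polyfit(x, y, deg=1)   # degree-1 polynomial = straight line
print(slope, intercept)                      # ~0.0975  ~6.5
print(intercept + slope * 1200)              # ~123.5, the same prediction as above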

Section 06

RMSE — Root Mean Squared Error

Now that Priya has a model, she needs to know: how wrong is it, on average? The most common metric for this is RMSE.

Mean Squared Error (MSE)
MSE = (1/n) · Σ(yᵢ − ŷᵢ)²
Average of the squared errors. Units are squared (e.g. lakhs²) — hard to interpret directly.
Root Mean Squared Error (RMSE)
RMSE = √MSE = √[ (1/n) · Σ(yᵢ − ŷᵢ)² ]
Square root of MSE. Same unit as y. Tells you the typical prediction error in real terms.

Step-by-Step RMSE for Priya's 5 Flats

i | Actual yᵢ | Predicted ŷᵢ | Residual (yᵢ − ŷᵢ) | (yᵢ − ŷᵢ)²
1 | 65  | 6.5 + 0.0975×600 = 65.0   | 0.0  | 0.00
2 | 85  | 6.5 + 0.0975×800 = 84.5   | 0.5  | 0.25
3 | 105 | 6.5 + 0.0975×1000 = 104.0 | 1.0  | 1.00
4 | 120 | 6.5 + 0.0975×1200 = 123.5 | −3.5 | 12.25
5 | 145 | 6.5 + 0.0975×1400 = 143.0 | 2.0  | 4.00
Σ |     |                           |      | 17.50
🧮 RMSE Calculation
MSE
17.50 / 5 = 3.50 lakhs²
RMSE
√3.50 = 1.87 lakhs
Meaning
On average, Priya's model is off by ₹1.87 lakhs on each prediction. Since flats cost ~₹100 L, that's about 1.9% error — excellent.
⚠️
RMSE Punishes Big Mistakes Hard

Because errors are squared before averaging, a single large prediction error inflates RMSE far more than several small ones. This is both a feature (large errors are costly in practice) and a sensitivity to outliers. If you want a metric that treats all errors equally, use MAE (Mean Absolute Error).
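A small sketch of this sensitivity, comparing RMSE and MAE on the same set of residuals with and without one large miss (illustrative numbers, not Priya's data):

import math

def rmse(errors):
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

small_errors = [2, -2, 2, -2, 2]        # all errors modest
with_outlier = [2, -2, 2, -2, 10]       # one big miss

print(rmse(small_errors), mae(small_errors))   # 2.00  2.0
print(rmse(with_outlier), mae(with_outlier))   # 4.82  3.6  -> RMSE jumps far more than MAE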


Section 07

R² — The Coefficient of Determination

RMSE tells you the absolute error in real units. But it does not tell you how good your model is compared to having no model at all. R² answers that question.

What Does "Explained Variance" Mean?
Imagine Priya had no model at all. Her best guess for any flat's price would just be the mean: ȳ = 104 lakhs. The total variation in actual prices around this mean is called SS_tot (Total Sum of Squares).

After fitting her regression line, the variation that is still unexplained (the residuals) is called SS_res (Residual Sum of Squares).

R² measures how much of the original variation the model explained away.
Total Sum of Squares (SS_tot)
SS_tot = Σ(yᵢ − ȳ)²
Variation when using the mean as the only predictor — the "dumb baseline"
Residual Sum of Squares (SS_res)
SS_res = Σ(yᵢ − ŷᵢ)²
Variation left unexplained after fitting the regression line
R² — Coefficient of Determination
R² = 1 − (SS_res / SS_tot)
Ranges from 0 to 1 (or negative if the model is worse than the mean baseline). R² = 1 means perfect prediction. R² = 0 means the model explains nothing.
📊 Visual Breakdown: SS_tot vs SS_res
[Two panels: left, the baseline model ŷ = ȳ = 104, where each bar is (yᵢ − ȳ)² and SS_tot is large; right, the fitted regression line, where the residuals are tiny and SS_res is small.]

R² = 1 − SS_res/SS_tot. The smaller SS_res is relative to SS_tot, the closer R² is to 1.

Calculating R² for Priya's Data

🧮 R² Calculation
SS_tot
Deviations from ȳ = 104:
(65−104)²+(85−104)²+(105−104)²+(120−104)²+(145−104)²
= 1521 + 361 + 1 + 256 + 1681 = 3 820
SS_res
Squared residuals from the model (from RMSE table above):
0 + 0.25 + 1 + 12.25 + 4 = 17.50
R² = 1 − (17.50 / 3820) = 1 − 0.00458 = 0.9954
The model explains 99.54% of the variance in flat prices. Excellent fit on this small dataset.
R² Value | Interpretation | Verdict
0.90 – 1.00 | Model explains 90–100% of variance | Excellent
0.70 – 0.90 | Model explains 70–90% of variance | Good
0.50 – 0.70 | Moderate explanatory power | Acceptable
0.00 – 0.50 | Weak model; misses major patterns | Poor
Negative | Model is worse than a flat mean baseline | Terrible
⚠️
The Dark Side of R² — It Always Goes Up

Adding more features to a model never decreases R² on the training data — even if those features are pure noise. This means you can inflate R² just by throwing in useless columns: a model built on 3 meaningful features plus 50 random ones will score at least as high as the 3-feature model, even though it generalises worse. This is why Adjusted R² was invented.
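You can watch this happen directly. A sketch on synthetic data (not Priya's): appending a pure-noise column nudges training R² up even though the column carries no information.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 50
size = rng.uniform(500, 1500, size=(n, 1))                 # genuine feature
price = 6.5 + 0.0975 * size[:, 0] + rng.normal(0, 5, n)    # linear signal + noise

noise = rng.normal(size=(n, 1))                            # pure-noise "feature"
X_small = size
X_big   = np.hstack([size, noise])

for name, X in [("size only", X_small), ("size + noise", X_big)]:
    model = LinearRegression().fit(X, price)
    print(name, r2_score(price, model.predict(X)))
# Training R² for "size + noise" is >= the "size only" value,
# even though the extra column is random.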


Section 08

Adjusted R² — The Honest Version

Adjusted R² corrects for the "adding features always helps" bias in R². It penalises you for adding features that do not genuinely improve the model. It only increases if a new feature adds more than it costs.

Adjusted R² Formula
R²_adj = 1 − [ (1 − R²) · (n − 1) / (n − p − 1) ]
n = number of data points  |  p = number of features (predictors, not counting the intercept)  |  R² = ordinary R² of the model

How the Penalty Works

🔍 Understanding the Penalty Term
Term
(n − 1) / (n − p − 1) is the penalty multiplier. As p (number of features) grows, the denominator shrinks, making the multiplier larger. This makes (1 − R²) · penalty larger, which makes R²_adj smaller.
Good feature
If a new feature meaningfully reduces SS_res, R² increases by more than the penalty costs → Adjusted R² goes UP ↑
Noise feature
If a new feature adds very little to R², the penalty is larger than the gain → Adjusted R² goes DOWN ↓
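A tiny sketch of how the penalty multiplier grows with p for a fixed sample size (n = 20 here is an arbitrary choice for illustration):

n = 20   # arbitrary sample size for illustration
for p in [1, 2, 5, 10, 15]:
    penalty = (n - 1) / (n - p - 1)
    print(f"p = {p:2d}  penalty multiplier = {penalty:.3f}")
# p =  1 -> 1.056
# p =  5 -> 1.357
# p = 15 -> 4.750   (the (1 - R²) term gets scaled up nearly 5x)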

Numerical Example — Adding a Useless Feature

Priya adds a second feature: the flat owner's lucky number (random noise).

Model | Features (p) | n | R² | Adjusted R² | Verdict
Model A | Size only (p=1) | 5 | 0.9954 | 1 − (1−0.9954)·(4/3) = 1 − 0.0046·1.333 = 0.9939 | Good
Model B | Size + Lucky Number (p=2) | 5 | 0.9961 (+0.0007) | 1 − (1−0.9961)·(4/2) = 1 − 0.0039·2.0 = 0.9922 | Worse!
🎯
Adjusted R² Caught the Noise

R² went up slightly (0.9954 → 0.9961) when the lucky number was added — a misleading signal. But Adjusted R² fell (0.9939 → 0.9922), correctly signalling that the extra feature hurt the model more than it helped. Always prefer Adjusted R² when comparing models with different numbers of features.


Section 09

Metrics at a Glance

RMSE
📏
√[ Σ(yᵢ−ŷᵢ)² / n ]
  • Same unit as target
  • Penalises large errors heavily
  • Sensitive to outliers
  • Lower is better
R²
📊
1 − SS_res / SS_tot
  • Unitless (0 to 1)
  • Easy to communicate
  • Never decreases with more features
  • Higher is better
Adjusted R²
⚖️
1 − (1−R²)(n−1)/(n−p−1)
  • Penalises for extra features
  • Best for model comparison
  • Can decrease with useless features
  • Higher is better

Section 10

Python Implementation

Manual OLS from Scratch

import math

x = [600, 800, 1000, 1200, 1400]    # size (sq ft)
y = [65,  85,  105,  120,  145]     # price (lakhs)

n     = len(x)
x_bar = sum(x) / n                   # 1000.0
y_bar = sum(y) / n                   # 104.0

# OLS coefficients
num   = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 39000
denom = sum((xi - x_bar) ** 2        for xi     in x)             # 400000

beta1 = num / denom                  # slope     = 0.0975
beta0 = y_bar - beta1 * x_bar       # intercept = 6.5

print(f"Intercept (β₀): {beta0:.4f}")  # 6.5
print(f"Slope     (β₁): {beta1:.4f}")  # 0.0975

# Predictions
y_pred = [beta0 + beta1 * xi for xi in x]
print("Predictions:", [round(p, 2) for p in y_pred])
# [65.0, 84.5, 104.0, 123.5, 143.0]

Computing RMSE, R², Adjusted R²

import math

def rmse(y_true, y_pred):
    n   = len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return math.sqrt(sse / n)

def r_squared(y_true, y_pred):
    y_bar  = sum(y_true) / len(y_true)
    ss_tot = sum((yt - y_bar) ** 2 for yt     in y_true)
    ss_res = sum((yt - yp)    ** 2 for yt, yp in zip(y_true, y_pred))
    return 1 - (ss_res / ss_tot)

def adjusted_r2(y_true, y_pred, p):
    n  = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_true = [65, 85, 105, 120, 145]
y_pred = [65.0, 84.5, 104.0, 123.5, 143.0]

print(f"RMSE        : {rmse(y_true, y_pred):.4f}")             # 1.8708
print(f"R²          : {r_squared(y_true, y_pred):.4f}")         # 0.9954
print(f"Adjusted R² : {adjusted_r2(y_true, y_pred, p=1):.4f}")  # 0.9939
Output
RMSE        : 1.8708
R²          : 0.9954
Adjusted R² : 0.9939

Using scikit-learn

import numpy as np
from sklearn.linear_model  import LinearRegression
from sklearn.metrics       import mean_squared_error, r2_score

X = np.array([[600], [800], [1000], [1200], [1400]])  # 2-D for sklearn
y = np.array([65, 85, 105, 120, 145])

model = LinearRegression()
model.fit(X, y)

print(f"Intercept β₀ : {model.intercept_:.4f}")   # 6.5
print(f"Slope     β₁ : {model.coef_[0]:.4f}")      # 0.0975

y_pred = model.predict(X)

mse  = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y, y_pred)

n, p   = len(y), X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"RMSE         : {rmse:.4f}")     # 1.8708
print(f"R²           : {r2:.4f}")       # 0.9954
print(f"Adjusted R²  : {adj_r2:.4f}")  # 0.9939

# Predict the new flat
new_flat = np.array([[1200]])
print(f"Predicted price for 1200 sq ft: {model.predict(new_flat)[0]:.1f} L")
# 123.5 L
💡
scikit-learn Does Not Give Adjusted R² Directly

r2_score() only returns plain R². You always need to compute Adjusted R² manually using the formula above. Remember to track n (sample size) and p (number of features, not counting the intercept) carefully.


Section 11

Assumptions of Linear Regression

Linear Regression is not magic — it comes with six assumptions that must hold for the model and its metrics to be trustworthy.

📏
Linearity
The relationship between x and y must be approximately linear. Check with a scatter plot. If the data curves, use polynomial features or a non-linear model.
🔀
Independence of Errors
Residuals must not correlate with each other. Time-series data often violates this — today's error leaks into tomorrow's. Use the Durbin-Watson test to check (see the diagnostics sketch after this list).
🎯
Homoscedasticity
The spread of residuals should be constant across all levels of x. A "funnel" pattern in a residuals-vs-fitted plot signals heteroscedasticity — a violation. Try log-transforming the target.
🔔
Normality of Residuals
Residuals should follow a normal distribution. Important for hypothesis tests on coefficients. Check with a Q-Q plot. Minor violations are usually fine for large samples (Central Limit Theorem).
🚫
No Multicollinearity
Features should not be highly correlated with each other (Multiple LR only). Use VIF (Variance Inflation Factor) to detect it. Drop or combine correlated features.
🔬
No Influential Outliers
A single extreme point can drag the entire regression line. Check Cook's Distance or leverage scores. Consider robust regression methods if outliers cannot be removed.
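A hedged sketch of how a couple of these checks look in code, using statsmodels; the data, variable names, and thresholds here are illustrative, so adapt them to your own dataset:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative data: two features, one target (replace with your own)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)            # statsmodels needs an explicit intercept column
model = sm.OLS(y, X_const).fit()

# Independence of errors: Durbin-Watson near 2 suggests no autocorrelation
print("Durbin-Watson:", durbin_watson(model.resid))

# Multicollinearity: VIF per feature (rule of thumb: worry above roughly 5-10)
for i in range(1, X_const.shape[1]):    # skip the constant column
    print(f"VIF feature {i}:", variance_inflation_factor(X_const, i))

# Normality / homoscedasticity: inspect the residuals (Q-Q plot, residuals-vs-fitted)
print("Residual mean:", model.resid.mean())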

Section 12

Golden Rules

🎯 Linear Regression — Key Rules
1
Always plot first. A scatter plot reveals whether a linear relationship actually exists. Fitting a line to a curved or categorical relationship gives misleading coefficients and useless metrics.
2
Use RMSE for error magnitude. RMSE is in the same unit as your target variable — it tells you the typical mistake in real-world terms. An RMSE of 1.87 lakhs on flat prices is immediately meaningful to Priya's boss.
3
Use Adjusted R² for model selection. When comparing models with different numbers of features, always use Adjusted R² — not R². R² almost always improves as you add features, even random noise.
4
High R² does not mean the model is useful. You can get R² = 0.99 by memorising the training data. Always evaluate on a held-out test set. A large gap between train R² and test R² signals overfitting.
5
OLS is optimal under the Gauss-Markov theorem — it gives the Best Linear Unbiased Estimator (BLUE) when the assumptions hold. If they don't, consider Ridge, Lasso, or robust regression variants.
6
Scale your features for gradient-descent solvers. While OLS has a closed-form solution (β = (XᵀX)⁻¹Xᵀy), gradient-descent implementations (used for very large datasets) converge far faster when features are standardised to mean 0, std 1.
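For completeness, here is the closed form from rule 6 in NumPy, run on Priya's five flats; np.linalg.solve is used instead of forming the inverse explicitly, which is the numerically safer habit:

import numpy as np

x = np.array([600, 800, 1000, 1200, 1400], dtype=float)
y = np.array([65, 85, 105, 120, 145], dtype=float)

X = np.column_stack([np.ones_like(x), x])    # first column of 1s = intercept term

# Normal equations: beta = (X^T X)^-1 X^T y, solved without an explicit inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [6.5, 0.0975] -> the same beta0, beta1 as the hand calculation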