The Story: Priya Wants to Price Her Flat
Priya knows she can't just take the average price — that ignores the fact that bigger flats cost more. She needs a formula that takes size as input and spits out a predicted price. That formula is a Linear Regression model.
Linear Regression is the most fundamental supervised machine-learning algorithm. It finds the straight line that best describes the relationship between one or more input variables (features) and a continuous output variable (target). Once the line is found, predicting new values is as simple as plugging a number into a formula.
Visualising the Idea
Priya plots her data on a graph. Each dot is a flat — x-axis = size, y-axis = price. The dots form a rough upward trend. Linear Regression draws the best-fit line through those dots.
The regression line minimises the total squared distance between itself and every data point.
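If you want to reproduce this picture yourself, the sketch below does it with matplotlib; it borrows the five size/price pairs used in the worked example later in this chapter, and np.polyfit is just a quick way to draw the best-fit line before we derive it by hand.

```python
import numpy as np
import matplotlib.pyplot as plt

size = np.array([600, 800, 1000, 1200, 1400])   # sq ft (the 5 flats used later)
price = np.array([65, 85, 105, 120, 145])       # lakhs

# Degree-1 polyfit returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(size, price, deg=1)

plt.scatter(size, price, label="flats")                          # each dot is a flat
plt.plot(size, intercept + slope * size, label="best-fit line")  # the regression line
plt.xlabel("Size (sq ft)")
plt.ylabel("Price (₹ lakhs)")
plt.legend()
plt.show()
```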
The Equation of a Straight Line
In school you learned y = mx + c. Linear Regression uses the same idea, just with machine-learning notation: ŷ = β₀ + β₁x, where ŷ is the predicted price, β₀ is the intercept and β₁ is the slope.
Suppose that after fitting her model, Priya finds β₀ = −5 (intercept) and β₁ = 0.115 (slope). For a 1 200 sq ft flat:
ŷ = −5 + 0.115 × 1200 = −5 + 138 = 133 lakhs.
The model says: list it at ₹1.33 crore.
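The same arithmetic as a quick sanity check in Python (the coefficients here are the illustrative ones above, not the ones we will derive later):

```python
beta0, beta1 = -5, 0.115        # illustrative intercept and slope from the example above
size = 1200                     # sq ft
print(beta0 + beta1 * size)     # ŷ = β₀ + β₁x ≈ 133.0 lakhs, i.e. ₹1.33 crore
```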
The Math: How the Best Line is Found (OLS)
"Best fit" is not guesswork. The algorithm minimises the Sum of Squared Residuals (SSR) — also called the Residual Sum of Squares (RSS) or Sum of Squared Errors (SSE). A residual is the gap between a real data point and the line's prediction.
OLS finds the β₀ and β₁ that make the sum of squared residuals (the squared vertical gaps between each data point and the line) as small as possible.
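To see that "best fit" really is a minimisation and not guesswork, the sketch below compares the SSR of the OLS line (derived later in this chapter) against an eyeballed guess; the guessed coefficients are made up purely for illustration.

```python
x = [600, 800, 1000, 1200, 1400]   # size (sq ft)
y = [65, 85, 105, 120, 145]        # price (lakhs)

def ssr(beta0, beta1):
    """Sum of squared residuals for the line y = beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

print(ssr(6.5, 0.0975))   # OLS line from this chapter  -> ≈ 17.5
print(ssr(0.0, 0.1))      # an eyeballed guess          -> ≈ 100.0 (much worse)
```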
The Cost Function
We want to minimise the loss function L:
L(β₀, β₁) = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − β₀ − β₁xᵢ)²
Finding the Minimum with Partial Derivatives
To minimise L, take the partial derivative with respect to each parameter and set it to zero. This is where calculus meets machine learning.
∂L/∂β₀ = −2 · Σ(yᵢ − β₀ − β₁xᵢ) = 0
Simplifying:
Σyᵢ = n·β₀ + β₁·Σxᵢ
Divide by n:
ȳ = β₀ + β₁·x̄ → β₀ = ȳ − β₁·x̄
∂L/∂β₁ = −2 · Σxᵢ(yᵢ − β₀ − β₁xᵢ) = 0
Simplifying:
Σ(xᵢyᵢ) = β₀·Σxᵢ + β₁·Σxᵢ²
Substituting β₀ = ȳ − β₁·x̄ and solving for β₁:
β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
This is the OLS closed-form solution. No iteration needed.
β₁ = Cov(x, y) / Var(x)
β₀ = ȳ − β₁ · x̄
These two formulas give you the exact best-fit line — guaranteed to minimise the sum of squared residuals.
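If you would rather let NumPy do the bookkeeping, the Cov/Var form can be checked directly; the one subtlety is that np.cov and np.var default to different ddof values, so pass the same ddof to both.

```python
import numpy as np

x = np.array([600, 800, 1000, 1200, 1400])   # size (sq ft)
y = np.array([65, 85, 105, 120, 145])        # price (lakhs)

# np.cov defaults to ddof=1 while np.var defaults to ddof=0; the ratio only
# matches the OLS slope when both use the same ddof.
beta1 = np.cov(x, y, ddof=0)[0, 1] / np.var(x, ddof=0)
beta0 = y.mean() - beta1 * x.mean()

print(beta1, beta0)   # 0.0975 6.5
```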
Squaring residuals serves two purposes: it makes all gaps positive (no cancelling), and it penalises large errors more than small ones (because squaring grows faster than linear). The squared form is also mathematically smooth everywhere, making it easy to differentiate and find a unique minimum.
Numerical Example — Priya's 5 Flats
Let's use a small dataset so the math is visible. Priya takes 5 flats:
| i | Size xᵢ (sq ft) | Price yᵢ (₹L) | xᵢ − x̄ | yᵢ − ȳ | (xᵢ−x̄)(yᵢ−ȳ) | (xᵢ−x̄)² |
|---|---|---|---|---|---|---|
| 1 | 600 | 65 | −400 | −39 | 15 600 | 160 000 |
| 2 | 800 | 85 | −200 | −19 | 3 800 | 40 000 |
| 3 | 1 000 | 105 | 0 | 1 | 0 | 0 |
| 4 | 1 200 | 120 | 200 | 16 | 3 200 | 40 000 |
| 5 | 1 400 | 145 | 400 | 41 | 16 400 | 160 000 |
| Σ | 5 000 | 520 | 0 | 0 | 39 000 | 400 000 |
β₁ = 39 000 / 400 000 = 0.0975
β₀ = ȳ − β₁·x̄ = 104 − 0.0975 × 1000 = 6.5
For the new 1 200 sq ft flat: ŷ = 6.5 + 0.0975 × 1200 = 6.5 + 117 = 123.5 lakhs
RMSE — Root Mean Squared Error
Now that Priya has a model, she needs to know: how wrong is it, on average? The most common metric for this is RMSE: RMSE = √( Σ(yᵢ − ŷᵢ)² / n ), i.e. square each residual, average the squares, then take the square root.
Step-by-Step RMSE for Priya's 5 Flats
| i | Actual yᵢ | Predicted ŷᵢ | Residual (yᵢ − ŷᵢ) | (yᵢ − ŷᵢ)² |
|---|---|---|---|---|
| 1 | 65 | 6.5+0.0975×600 = 65.0 | 0.0 | 0.00 |
| 2 | 85 | 6.5+0.0975×800 = 84.5 | 0.5 | 0.25 |
| 3 | 105 | 6.5+0.0975×1000 = 104.0 | 1.0 | 1.00 |
| 4 | 120 | 6.5+0.0975×1200 = 123.5 | −3.5 | 12.25 |
| 5 | 145 | 6.5+0.0975×1400 = 143.0 | 2.0 | 4.00 |
| Σ | | | | 17.50 |

MSE = 17.50 / 5 = 3.5
RMSE = √3.5 ≈ 1.87 lakhs
Because errors are squared before averaging, a single large prediction error inflates RMSE far more than several small ones. This is both a feature (large errors are costly in practice) and a sensitivity to outliers. If you want a metric that treats all errors equally, use MAE (Mean Absolute Error).
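To make that outlier sensitivity concrete, here is a small sketch contrasting RMSE and MAE on two made-up residual sets that have the same total absolute error:

```python
import math

def rmse(residuals):
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)

steady_misses = [2, 2, 2, 2, 2]     # five modest errors
one_big_miss  = [0, 0, 0, 0, 10]    # same total absolute error, concentrated in one flat

print(rmse(steady_misses), mae(steady_misses))   # 2.0   2.0
print(rmse(one_big_miss),  mae(one_big_miss))    # ≈4.47 2.0  (RMSE inflates, MAE does not)
```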
R² — The Coefficient of Determination
RMSE tells you the absolute error in real units. But it does not tell you how good your model is compared to having no model at all. R² answers that question.
The total variation of the prices around their mean ȳ is called SS_tot (Total Sum of Squares); it is the error a naive "no model" baseline that always predicts the mean would make. After fitting her regression line, the variation that is still unexplained (the residuals) is called SS_res (Residual Sum of Squares).
R² measures how much of the original variation the model explained away.
R² = 1 − SS_res/SS_tot. The smaller SS_res is relative to SS_tot, the closer R² is to 1.
Calculating R² for Priya's Data
SS_tot = (65−104)² + (85−104)² + (105−104)² + (120−104)² + (145−104)²
       = 1521 + 361 + 1 + 256 + 1681 = 3 820
SS_res = 0 + 0.25 + 1 + 12.25 + 4 = 17.50
R² = 1 − 17.50 / 3 820 = 1 − 0.0046 = 0.9954
The model explains 99.54% of the variance in flat prices. Excellent fit on this small dataset.
| R² Value | Interpretation | Verdict |
|---|---|---|
| 0.90 – 1.00 | Model explains 90–100% of variance | Excellent |
| 0.70 – 0.90 | Model explains 70–90% of variance | Good |
| 0.50 – 0.70 | Moderate explanatory power | Acceptable |
| 0.00 – 0.50 | Weak model; misses major patterns | Poor |
| Negative | Model is worse than a flat mean baseline | Terrible |
Adding more features to a model never decreases R² on the training data, even if those features are pure noise. This means you can inflate R² just by throwing in hundreds of useless columns: a model with your 3 meaningful features plus 50 random ones will score at least as high an R² as the 3-feature model alone, even though it is a worse model. This is why Adjusted R² was invented.
Adjusted R² — The Honest Version
Adjusted R² corrects for the "adding features always helps" bias in R². It penalises you for adding features that do not genuinely improve the model. It only increases if a new feature adds more than it costs.
How the Penalty Works
R²_adj = 1 − (1 − R²) · (n − 1) / (n − p − 1)
(n − 1) / (n − p − 1) is the penalty multiplier. As p (the number of features) grows, the denominator shrinks, making the multiplier larger. This makes (1 − R²) · penalty larger, which makes R²_adj smaller.
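A quick numeric check of the penalty for Priya's tiny sample (n = 5), holding R² fixed purely for illustration (in reality R² would creep up slightly with each added feature):

```python
n, r2 = 5, 0.9954                  # sample size and R² of the size-only model

for p in (1, 2, 3):                # number of features
    penalty = (n - 1) / (n - p - 1)
    adj = 1 - (1 - r2) * penalty
    print(p, round(penalty, 3), round(adj, 4))

# p=1 -> penalty 1.333, R²_adj 0.9939
# p=2 -> penalty 2.0,   R²_adj 0.9908
# p=3 -> penalty 4.0,   R²_adj 0.9816
```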
Numerical Example — Adding a Useless Feature
Priya adds a second feature: the flat owner's lucky number (random noise).
| Model | Features (p) | n | R² | R²_adj | Verdict |
|---|---|---|---|---|---|
| Model A | Size only (p=1) | 5 | 0.9954 | 1 − (1−0.9954)·(4/3) = 1 − 0.0046·1.333 = 0.9939 | Good |
| Model B | Size + Lucky Number (p=2) | 5 | 0.9961 (+0.0007) | 1 − (1−0.9961)·(4/2) = 1 − 0.0039·2.0 = 0.9922 | Worse! |
R² went up slightly (0.9954 → 0.9961) when the lucky number was added — a misleading signal. But Adjusted R² fell (0.9939 → 0.9922), correctly signalling that the extra feature hurt the model more than it helped. Always prefer Adjusted R² when comparing models with different numbers of features.
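You can reproduce the spirit of this comparison with scikit-learn. The lucky-number column below is random noise, so your exact R² values will differ from the table above; the pattern to look for is that training R² never drops while Adjusted R² usually does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

size = np.array([[600], [800], [1000], [1200], [1400]])
price = np.array([65, 85, 105, 120, 145])
lucky = rng.integers(1, 100, size=(5, 1))          # pure-noise "lucky number" feature
size_plus_lucky = np.hstack([size, lucky])

def fit_and_score(X, y):
    model = LinearRegression().fit(X, y)
    r2 = r2_score(y, model.predict(X))
    n, p = X.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return round(r2, 4), round(adj_r2, 4)

print(fit_and_score(size, price))              # Model A: size only
print(fit_and_score(size_plus_lucky, price))   # Model B: R² no lower, Adjusted R² typically lower
```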
Metrics at a Glance
RMSE
- Same unit as target
- Penalises large errors heavily
- Sensitive to outliers
- Lower is better

R²
- Unitless (usually 0 to 1; negative means worse than the mean baseline)
- Easy to communicate
- Never decreases with more features
- Higher is better

Adjusted R²
- Penalises for extra features
- Best for model comparison
- Can decrease with useless features
- Higher is better
Python Implementation
Manual OLS from Scratch
```python
x = [600, 800, 1000, 1200, 1400]   # size (sq ft)
y = [65, 85, 105, 120, 145]        # price (lakhs)

n = len(x)
x_bar = sum(x) / n   # 1000.0
y_bar = sum(y) / n   # 104.0

# OLS coefficients
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # 39000
denom = sum((xi - x_bar) ** 2 for xi in x)                       # 400000
beta1 = num / denom              # slope = 0.0975
beta0 = y_bar - beta1 * x_bar    # intercept = 6.5

print(f"Intercept (β₀): {beta0:.4f}")   # 6.5
print(f"Slope (β₁): {beta1:.4f}")       # 0.0975

# Predictions
y_pred = [beta0 + beta1 * xi for xi in x]
print("Predictions:", [round(p, 2) for p in y_pred])
# [65.0, 84.5, 104.0, 123.5, 143.0]
```
Computing RMSE, R², Adjusted R²
```python
import math

def rmse(y_true, y_pred):
    n = len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return math.sqrt(sse / n)

def r_squared(y_true, y_pred):
    y_bar = sum(y_true) / len(y_true)
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return 1 - (ss_res / ss_tot)

def adjusted_r2(y_true, y_pred, p):
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_true = [65, 85, 105, 120, 145]
y_pred = [65.0, 84.5, 104.0, 123.5, 143.0]

print(f"RMSE        : {rmse(y_true, y_pred):.4f}")             # 1.8708
print(f"R²          : {r_squared(y_true, y_pred):.4f}")        # 0.9954
print(f"Adjusted R² : {adjusted_r2(y_true, y_pred, p=1):.4f}") # 0.9939
```
Using scikit-learn
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[600], [800], [1000], [1200], [1400]])   # 2-D for sklearn
y = np.array([65, 85, 105, 120, 145])

model = LinearRegression()
model.fit(X, y)

print(f"Intercept β₀ : {model.intercept_:.4f}")   # 6.5
print(f"Slope β₁     : {model.coef_[0]:.4f}")     # 0.0975

y_pred = model.predict(X)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
n, p = len(y), X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"RMSE        : {rmse:.4f}")     # 1.8708
print(f"R²          : {r2:.4f}")       # 0.9954
print(f"Adjusted R² : {adj_r2:.4f}")   # 0.9939

# Predict the new flat
new_flat = np.array([[1200]])
print(f"Predicted price for 1200 sq ft: {model.predict(new_flat)[0]:.1f} L")
# 123.5 L
```
r2_score() only returns plain R². You always need to compute Adjusted R² manually using the formula above. Remember to track n (sample size) and p (number of features, not counting the intercept) carefully.
Assumptions of Linear Regression
Linear Regression is not magic — it comes with four assumptions that must hold for the model and its metrics to be trustworthy:
- Linearity: the relationship between the features and the target is linear.
- Independence: the residuals are independent of one another.
- Homoscedasticity: the residuals have roughly constant variance across the range of the features.
- Normality: the residuals are approximately normally distributed.
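A rough first check on these assumptions is simply to look at the residuals of the fitted line; the sketch below is an eyeball test, not a formal one, and assumes matplotlib is available.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([600, 800, 1000, 1200, 1400])
y = np.array([65, 85, 105, 120, 145])
y_hat = 6.5 + 0.0975 * x          # the fitted line from this chapter
residuals = y - y_hat

print(residuals.mean())            # ≈ 0 by construction for OLS with an intercept

# Residuals vs fitted values: curvature hints at non-linearity,
# a funnel shape hints at non-constant variance (heteroscedasticity)
plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted price (lakhs)")
plt.ylabel("Residual (lakhs)")
plt.show()
```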
Golden Rules
Although OLS has a closed-form solution (β = (XᵀX)⁻¹Xᵀy), gradient-descent implementations (used for very large datasets) converge far faster when features are standardised to mean 0, std 1.
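A minimal sketch of that standardisation step with scikit-learn's StandardScaler; note that scaling changes the size of the coefficients (they become "per standard deviation" effects) but not the predictions, and new inputs must be transformed with the same scaler.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[600], [800], [1000], [1200], [1400]], dtype=float)
y = np.array([65, 85, 105, 120, 145], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)      # each column now has mean 0, std 1

model = LinearRegression().fit(X_scaled, y)
print(model.intercept_, model.coef_)    # intercept = ȳ = 104; slope in per-std-dev units

new_flat = scaler.transform([[1200.0]]) # scale new inputs with the SAME scaler
print(model.predict(new_flat))          # still ≈ 123.5 lakhs
```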