Loss Functions & Optimisation in ML: MSE, Cross-Entropy

Section 01

The Story That Explains Loss Functions

📖 Real World Analogy

The Archery Coach and the Scoreboard

Imagine an archery coach watching a student shoot at a target. After each arrow, the coach doesn't just say "good" or "bad" — she measures the exact distance from the bullseye. That distance is the loss.

If the arrow lands 2 cm away, she says "adjust your grip a little." If it lands 40 cm away, she says "stop, we need to completely rethink your stance." The magnitude of the error dictates the magnitude of the correction.

A machine learning model does exactly this. After every prediction it makes, the loss function measures how far off it was. The optimiser then nudges the model's parameters in whatever direction reduces that distance. Repeat millions of times. That is training.

A loss function (also called a cost function or objective function) is a mathematical formula that takes the model's prediction ŷ and the true label y, and returns a single non-negative number representing how wrong the prediction is. The model's entire goal during training is to minimise this number.

💡

Loss vs Cost vs Objective

Loss is computed on a single example. Cost (or empirical risk) is the average loss over the entire dataset. Objective is what you actually optimise — usually cost plus a regularisation penalty. The three terms are often used interchangeably in practice, but the distinction matters when reading research papers.
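The three terms can be made concrete with a small sketch. The targets, predictions, weight vector and the value of `lam` below are made up for illustration, and squared error stands in for the per-example loss.

```python
import numpy as np

# Hypothetical toy data: true targets, model predictions, and model weights.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
weights = np.array([0.8, -0.4])
lam = 0.1  # regularisation strength (assumed value)

# Loss: one number per example (squared error used here).
per_example_loss = (y_true - y_pred) ** 2

# Cost: the average loss over the dataset.
cost = per_example_loss.mean()

# Objective: cost plus a regularisation penalty (L2 here).
objective = cost + lam * np.sum(weights ** 2)

print(per_example_loss)  # one loss per example
print(cost)              # average over the dataset
print(objective)         # the quantity the optimiser minimises
```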

Section 02

Mean Squared Error — The Regression Workhorse

📖 Intuition First

The Estate Agent's Mistake

An estate agent predicts house prices. On Monday she predicts £200,000 for a house that sells for £210,000 — a £10,000 miss. On Friday she predicts £200,000 for a house that sells for £230,000 — a £30,000 miss.

If we simply averaged the raw errors, a model that over-predicts by £20,000 and under-predicts by £20,000 would look perfect — they cancel out. That's useless. So instead we square each error before averaging. Squaring does two things: eliminates negative signs, and punishes large errors disproportionately (the £30,000 miss hurts 9× more than the £10,000 miss, not just 3×).
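Both effects of squaring can be checked in a few lines. This is a minimal sketch; the two-house model with offsetting £20,000 errors is the hypothetical one from the paragraph above.

```python
import numpy as np

# Hypothetical model: over-predicts one house by £20,000, under-predicts another by £20,000.
errors = np.array([20_000.0, -20_000.0])
mean_raw_error = errors.mean()              # 0.0: the misses cancel
mean_squared_error = (errors ** 2).mean()   # 400,000,000: the misses no longer cancel

# The estate agent's two misses from the story.
small_miss, large_miss = 10_000.0, 30_000.0
penalty_ratio = large_miss ** 2 / small_miss ** 2   # 9.0, not 3.0

print(mean_raw_error, mean_squared_error, penalty_ratio)
```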

The Formula

Single-Sample Loss

L(y, ŷ) = (y − ŷ)²

Square the difference between the true value y and prediction ŷ.

MSE over n Samples

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

Average squared error over the entire training set.

📐

Probabilistic Derivation — Why MSE is "Correct"

MSE is not an arbitrary choice. It falls directly out of Maximum Likelihood Estimation (MLE) when you assume the residuals follow a Gaussian distribution: y = f(x) + ε, where ε ~ N(0, σ²). Maximising the log-likelihood of a Gaussian is exactly equivalent to minimising the sum of squared errors. This gives MSE a deep statistical justification — it's optimal when your noise really is Gaussian.
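The equivalence can be checked numerically. The sketch below assumes a fixed, known noise scale `sigma` and uses made-up targets and predictions; it shows that the Gaussian negative log-likelihood differs from SSE / (2σ²) only by a constant that does not depend on the predictions.

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma):
    """Negative log-likelihood of y under N(y_hat, sigma^2), summed over samples."""
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum((y - y_hat) ** 2) / (2 * sigma ** 2)

def sse(y, y_hat):
    """Sum of squared errors."""
    return np.sum((y - y_hat) ** 2)

# Hypothetical targets and two candidate sets of predictions.
y = np.array([1.0, 2.0, 4.0])
pred_a = np.array([1.1, 1.9, 3.5])
pred_b = np.array([0.0, 2.0, 5.0])
sigma = 2.0  # assumed, fixed noise scale

# The gap between NLL and SSE / (2 sigma^2) is the same for any predictions,
# so ranking predictions by NLL is the same as ranking them by SSE.
gap_a = gaussian_nll(y, pred_a, sigma) - sse(y, pred_a) / (2 * sigma ** 2)
gap_b = gaussian_nll(y, pred_b, sigma) - sse(y, pred_b) / (2 * sigma ** 2)
print(np.isclose(gap_a, gap_b))
```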

🔨 Numerical Example 1 — MSE Step by Step

A model predicts apartment rents (£/month) for five flats. Let's compute the MSE manually.

Flat	True Rent (y)	Predicted Rent (ŷ)	Error (y − ŷ)	Squared Error (y − ŷ)²
A	£1,200	£1,150	+50	2,500
B	£850	£900	−50	2,500
C	£2,000	£1,700	+300	90,000
D	£950	£940	+10	100
E	£1,500	£1,520	−20	400

🧮 MSE Calculation

Step 1

Sum the squared errors: 2,500 + 2,500 + 90,000 + 100 + 400 = 95,500

Step 2

Divide by n = 5: MSE = 95,500 / 5 = 19,100

Step 3

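RMSE (more interpretable): √19,100 ≈ £138.2. This is the typical size of an error in the original units, with large misses weighted more heavily. It is not the plain average miss, which here is (50 + 50 + 300 + 10 + 20) / 5 = £86.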

Note

Flat C's £300 error contributes 90,000 — over 94% of the total squared error. MSE is very sensitive to outliers.
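To see how much one flat dominates, the sketch below recomputes the table and compares each flat's share of the squared error with its share of the absolute error. The variable names are illustrative.

```python
import numpy as np

y_true = np.array([1200, 850, 2000, 950, 1500])   # true rents from the table
y_pred = np.array([1150, 900, 1700, 940, 1520])   # predicted rents from the table

errors = y_true - y_pred
squared = errors ** 2

share_of_squared = squared / squared.sum()                   # each flat's share of the MSE
share_of_absolute = np.abs(errors) / np.abs(errors).sum()    # each flat's share of the MAE

mse = squared.mean()
mae = np.abs(errors).mean()

for flat, s2, s1 in zip("ABCDE", share_of_squared, share_of_absolute):
    print(f"Flat {flat}: {s2:6.1%} of squared error, {s1:6.1%} of absolute error")
print(f"MSE = {mse:,.0f}, RMSE = {np.sqrt(mse):.1f}, MAE = {mae:.1f}")
```

Flat C still carries the largest share under absolute error, but far less than under squared error, which is the sense in which MSE is more outlier-sensitive.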

📈 MSE Parabola — Why the Loss Surface is Bowl-Shaped

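For a model that is linear in its parameters, such as linear regression, the MSE loss surface is a smooth, convex bowl (a parabola in one dimension). Gradient descent with a suitably small learning rate converges to the global minimum, because there are no local traps. With a neural network the same loss is generally non-convex in the weights, so that guarantee no longer holds.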

When to Use MSE

House Price Prediction

USE MSE

Continuous target, roughly Gaussian errors, large errors warrant heavier penalty.

Temperature Forecasting

USE MSE

Symmetric errors expected; measurement noise is often approximately Gaussian.

Median Income (Outliers)

USE MAE

MSE gets dominated by billionaire outliers. MAE or Huber loss is more robust.
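Huber loss is mentioned above without a definition. A minimal NumPy sketch follows, assuming the standard form (quadratic for small errors, linear beyond a threshold `delta`); the income figures are invented to include one extreme outlier.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return np.mean(np.where(error <= delta, quadratic, linear))

# Hypothetical incomes in £1,000s with one extreme outlier in the targets.
y_true = np.array([30.0, 32.0, 28.0, 31.0, 5000.0])
y_pred = np.array([31.0, 31.0, 29.0, 30.0, 35.0])

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
hub = huber(y_true, y_pred, delta=5.0)

print(f"MSE = {mse:,.1f} | MAE = {mae:,.1f} | Huber(delta=5) = {hub:,.1f}")
```

The outlier's contribution grows quadratically under MSE but only linearly under MAE and Huber, so it distorts the fit far less.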

Section 03

Binary Cross-Entropy — The Classification Standard

📖 Intuition First

The Overconfident Doctor

A doctor reviews an X-ray and says: "I am 99% sure this patient has no tumour." The patient actually has a tumour. How wrong is the doctor?

If we used MSE: (1 − 0.01)² ≈ 0.98, which is close to the largest squared error a probability can produce (1) and still sounds manageable. But intuitively, a doctor who is 99% confident and wrong on a cancer diagnosis is catastrophically wrong. That confidence should be punished severely.

Cross-entropy agrees: −log(0.01) ≈ 4.6 — a very large penalty. For the same patient, a doctor who said "50% chance" gets −log(0.50) ≈ 0.69 — still penalised, but fairly. Cross-entropy punishes confident wrong answers brutally while barely scolding humble uncertainty.
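The comparison is easy to reproduce. In this sketch `p` is the predicted probability that the patient has a tumour (0.01 for the overconfident doctor), and the natural logarithm is assumed.

```python
import numpy as np

y = 1.0  # the patient has a tumour

def squared_error(p):
    """Squared error between the label and the predicted probability."""
    return (y - p) ** 2

def cross_entropy(p):
    """Binary cross-entropy when y = 1: only the -log(p) term is active."""
    return -np.log(p)

for p in (0.01, 0.50, 0.90):
    print(f"P(tumour) = {p:.2f} | squared error = {squared_error(p):.3f} "
          f"| cross-entropy = {cross_entropy(p):.3f}")
```

Squared error is capped at 1 no matter how confident the wrong answer is, while cross-entropy keeps growing as `p` approaches 0.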

The Formula

Binary Cross-Entropy (per sample)

L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

y ∈ {0, 1}, ŷ ∈ (0, 1) is the predicted probability. Only one term is "active" for each label.

Categorical Cross-Entropy (K classes)

L = −Σₖ yₖ · log(ŷₖ)

yₖ is 1 for the true class, 0 otherwise (one-hot). ŷₖ is the softmax probability for class k.
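A minimal NumPy sketch of the categorical case, assuming one-hot labels and raw scores (logits) that are turned into probabilities with softmax. The logits and labels are hypothetical, and the `eps` clip is an illustrative safeguard rather than part of the formula.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Mean of -sum_k y_k * log(p_k) over the batch."""
    return -np.mean(np.sum(y_onehot * np.log(np.clip(probs, eps, 1.0)), axis=1))

# Hypothetical raw scores for 2 samples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_onehot = np.array([[1, 0, 0],    # sample 1 belongs to class 0
                     [0, 0, 1]])   # sample 2 belongs to class 2

probs = softmax(logits)
loss = categorical_cross_entropy(y_onehot, probs)
print(probs.round(3))
print(f"categorical cross-entropy = {loss:.4f}")
```

Because the labels are one-hot, only the log-probability of the true class survives the sum, just as only one term is active in the binary formula.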

🌱

Probabilistic Derivation — Cross-Entropy from MLE

Assume labels follow a Bernoulli distribution: P(y|x) = ŷʸ(1−ŷ)¹⁻ʸ. Taking the log-likelihood of n samples and flipping the sign (to minimise) gives exactly the binary cross-entropy formula. Minimising cross-entropy = maximising the likelihood that the model's predicted probabilities generated the observed labels. No arbitrary choice: it drops out of pure probability theory.
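The derivation can be verified numerically. The sketch below uses four hypothetical labels and predicted probabilities, multiplies the per-sample Bernoulli likelihoods, and compares the negative log of that product with the summed binary cross-entropy.

```python
import numpy as np

y = np.array([1, 0, 1, 0])                   # hypothetical binary labels
y_hat = np.array([0.90, 0.05, 0.10, 0.60])   # hypothetical predicted P(y = 1)

# Likelihood of the observed labels under the Bernoulli model.
per_sample_likelihood = y_hat ** y * (1 - y_hat) ** (1 - y)
likelihood = per_sample_likelihood.prod()

# Negative log-likelihood versus the summed binary cross-entropy.
nll = -np.log(likelihood)
bce_sum = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(f"likelihood = {likelihood:.4f}")
print(f"NLL = {nll:.4f}, summed BCE = {bce_sum:.4f}")
```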

🔨 Numerical Example 2 — Cross-Entropy Step by Step

A spam classifier assigns probabilities to 4 emails. We compute the binary cross-entropy for each and then the average cost.

Email	True Label (y)	P(spam) = ŷ	Active Term	Loss = −log(active prob)
E1 — Spam	SPAM (1)	0.90	−log(0.90)	0.105 ✓ Low
E2 — Not Spam	HAM (0)	0.05	−log(1 − 0.05) = −log(0.95)	0.051 ✓ Very Low
E3 — Spam	SPAM (1)	0.10	−log(0.10)	2.303 ✗ High Penalty
E4 — Not Spam	HAM (0)	0.60	−log(1 − 0.60) = −log(0.40)	0.916 ⚠ Moderate

🧮 Average Cross-Entropy (Cost)

Sum

0.105 + 0.051 + 2.303 + 0.916 = 3.375

Average

Cost = 3.375 / 4 = 0.844

Key Insight

Email E3 alone contributes 2.303 of 3.375 (68%) of the total cost — the model was 90% confident it was ham, but it was spam. Cross-entropy demands this be fixed urgently.

📈 Cross-Entropy Curve — The Asymptotic Penalty

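As the predicted probability of the true class approaches 0, the loss approaches infinity. A sigmoid or softmax output lies strictly between 0 and 1 in exact arithmetic, but floating-point rounding can saturate it to exactly 0 or 1, at which point log(0) = −∞ and the loss and its gradient are no longer finite. This is why implementations clip probabilities or compute the loss directly from logits.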
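The sketch below illustrates the numerical side of this. It assumes a sigmoid output in float64 and a hypothetical, very confident logit; `bce_from_logits` is an illustrative helper that rewrites the loss in terms of the logit so that log(0) is never formed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(y, z):
    """Binary cross-entropy computed from the logit z, without forming log(0).

    Uses the identity  loss = max(z, 0) - y*z + log(1 + exp(-|z|)).
    """
    return np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))

z = 40.0   # a very confident logit (hypothetical)
y = 0.0    # ...but the true label is 0

p = sigmoid(z)                     # rounds to exactly 1.0 in float64
with np.errstate(divide="ignore"):
    naive = -np.log(1.0 - p)       # -log(0) evaluates to inf
stable = bce_from_logits(y, z)     # finite, close to 40

print(p == 1.0, naive, stable)
```

Clipping the probability, as in the from-scratch implementation later in this article, is the simpler alternative; working from logits avoids picking an ε at all.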

Section 04

Maximum Likelihood Estimation — The Unifying View

You may wonder why MSE and cross-entropy look so different, yet both "work." The answer is that they are both instances of the same principle: Maximum Likelihood Estimation.

📖 The Core Idea

Fitting a Key to a Lock

You have a lock (the data) and need to find the best key (model parameters θ). MLE says: "Choose the key that makes the data you observed as probable as possible." Formally, you maximise P(data | θ) — the likelihood of observing your training data given a particular set of parameters. Since products of tiny probabilities underflow numerically, we take the log and maximise the sum. Flipping the sign (to minimise) gives the negative log-likelihood — which is your loss function.

Assumed Noise Distribution	Likelihood P(y|x,θ)	−log Likelihood (up to additive constants)	Resulting Loss
Gaussian (Normal)	N(ŷ, σ²)	Σ (yᵢ − ŷᵢ)² / 2σ²	MSE
Bernoulli (binary)	ŷʸ(1−ŷ)¹⁻ʸ	−Σ [y log ŷ + (1−y)log(1−ŷ)]	Binary Cross-Entropy
Categorical (K classes)	Π ŷₖʸᵏ	−Σ yₖ log ŷₖ	Categorical Cross-Entropy
Laplacian	∝ exp(−|y−ŷ|/b)	Σ |yᵢ − ŷᵢ| / b	MAE (L1 Loss)

🎉

The Beautiful Unification

Many of the most widely used loss functions are the negative log-likelihood under some probability distribution. Choosing such a loss is therefore equivalent to choosing a probabilistic model for your noise. This is why the choice matters: if your data has Gaussian noise, MSE is principled. If it has heavy-tailed noise, Huber or MAE is the better fit. The loss function encodes your assumptions about the world.
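One practical consequence of the noise model is what a constant prediction converges to. The sketch below uses made-up data with one outlier and a brute-force grid of candidate constants: squared error (Gaussian noise) picks the mean, absolute error (Laplacian noise) picks the median.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical targets with one outlier
candidates = np.arange(0.0, 101.0, 1.0)         # constant predictions to try

# Total squared error and total absolute error for every candidate constant.
sse = ((data[:, None] - candidates[None, :]) ** 2).sum(axis=0)
sae = np.abs(data[:, None] - candidates[None, :]).sum(axis=0)

best_for_mse = candidates[sse.argmin()]   # lands on the mean
best_for_mae = candidates[sae.argmin()]   # lands on the median

print(f"mean = {data.mean()}, MSE-optimal constant = {best_for_mse}")
print(f"median = {np.median(data)}, MAE-optimal constant = {best_for_mae}")
```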

Section 05

Regularised Loss — L1 and L2 Penalties

A model that perfectly minimises training loss often overfits — it memorises the training data including noise, and fails on new examples. Regularisation adds a penalty term to the objective that discourages the model from growing too complex.

L2 Regularisation (Ridge)

J(θ) = Loss + λ·Σ θᵢ²

Penalises the sum of squared weights. Shrinks all weights toward zero smoothly, but generally leaves none of them exactly zero.

L1 Regularisation (Lasso)

J(θ) = Loss + λ·Σ |θᵢ|

Penalises the sum of absolute weights. Creates sparse solutions — some weights become exactly zero (feature selection).

📖 Analogy

The Bureaucrat and the Sculptor

L2 (Ridge) is like a bureaucrat who tells every employee: "You can work, but the harder you work the more tax you pay — so everyone naturally works a bit less." All employees stay employed, but none are working at full capacity. Weights are all small but non-zero.

L1 (Lasso) is like a sculptor with a chisel. Instead of just penalising effort, the sculptor actively carves away anything unnecessary. After training, many weights are exactly zero — the model has selected a sparse subset of features. Ideal when you suspect most features are irrelevant.

Property	L1 (Lasso)	L2 (Ridge)	Elastic Net (L1+L2)
Sparsity (zero weights)	Yes, exact zeros	No, only shrinks	Yes, typically fewer zeros than L1
Feature Selection	Built-in	Not built-in	Partial
Correlated Features	Picks one, drops rest	Spreads weight across all	Handles well
Gradient at zero	Undefined (subgradient)	Smooth and differentiable	Undefined (subgradient, from the L1 term)
Best Use Case	Many irrelevant features	All features somewhat relevant	Correlated feature groups
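Why L1 produces exact zeros while L2 only shrinks can be seen in a one-dimensional toy problem. The sketch assumes a single weight `w` pulled toward an unregularised value `a` by the loss 0.5·(w − a)², for which both penalised minimisers have closed forms; `a` and `lam` are hypothetical values.

```python
import numpy as np

def l1_solution(a, lam):
    """Minimiser of 0.5*(w - a)**2 + lam*|w|: soft-thresholding."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_solution(a, lam):
    """Minimiser of 0.5*(w - a)**2 + lam*w**2: proportional shrinkage."""
    return a / (1.0 + 2.0 * lam)

a = np.array([3.0, 0.5, -0.2, -2.0])   # hypothetical unregularised weights
lam = 1.0                               # assumed penalty strength

w_l1 = l1_solution(a, lam)   # weights with |a| <= lam become exactly zero
w_l2 = l2_solution(a, lam)   # every weight shrinks, none reaches zero

print("L1:", w_l1)
print("L2:", w_l2)
```

L1 subtracts a fixed amount from every weight and clamps at zero, so small weights vanish; L2 divides every weight by the same factor, so a non-zero weight stays non-zero.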

⚠️

Tuning λ — The Regularisation Strength

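λ = 0: No regularisation. Pure loss minimisation, maximum overfitting risk.
λ → ∞: All weights crushed to zero. Maximum underfitting (with an unpenalised intercept, a regression model always predicts the mean).
The optimal λ lives between these extremes and is found via cross-validation, not guessing. In sklearn, Ridge and Lasso call it alpha, while LogisticRegression uses C, the inverse of the regularisation strength, so a smaller C means a stronger penalty.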
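A NumPy-only sketch of choosing λ by k-fold cross-validation, assuming ridge regression without an intercept, synthetic data, and an illustrative grid of candidate values. The helpers `ridge_fit` and `cv_mse` are hypothetical names written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed): 5 informative features, 15 pure-noise features.
n, d = 60, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ true_w + rng.normal(scale=1.0, size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam*I)^-1 X^T y, no intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Mean validation MSE over k contiguous folds for one value of lam."""
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        scores.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return float(np.mean(scores))

lambdas = np.logspace(-3, 3, 13)                  # candidate strengths to compare
scores = [cv_mse(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]

print(f"best lambda by 5-fold CV: {best_lam:.3g}")
```

In practice scikit-learn's RidgeCV, LassoCV or GridSearchCV run this kind of search for you; the loop is spelled out here only to show what they automate.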

Section 06

Gradient Descent — How the Optimiser Actually Learns

Knowing what to minimise is only half the story. The optimiser is the algorithm that actually adjusts the model's weights to reduce the loss. The most fundamental optimiser is gradient descent.

Forward Pass

Feed a mini-batch of training examples through the model. Compute predictions ŷ using current weights θ.

Compute Loss

Apply the chosen loss function to compare ŷ against y. Get a single scalar: the current cost.

Backward Pass (Backpropagation)

Compute ∂Loss/∂θ for every weight using the chain rule. This gives the direction of steepest ascent for each weight.

Parameter Update

Nudge every weight in the opposite direction: θ ← θ − η · ∂Loss/∂θ, where η is the learning rate.

Repeat Until Convergence

After enough iterations, the weights settle near a minimum of the loss (the global minimum for convex problems, possibly a local one otherwise). That is your trained model.
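The steps above can be sketched as a short mini-batch gradient descent loop. This assumes a linear model with MSE loss on synthetic data; the learning rate, batch size and step count are illustrative values, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = 3*x1 - 2*x2 + small Gaussian noise.
X = rng.normal(size=(200, 2))
true_theta = np.array([3.0, -2.0])
y = X @ true_theta + rng.normal(scale=0.1, size=200)

theta = np.zeros(2)   # initial weights
eta = 0.1             # learning rate (assumed value)
batch_size = 32

for step in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)   # sample a mini-batch
    X_b, y_b = X[idx], y[idx]

    y_hat = X_b @ theta                                  # 1. forward pass
    loss = np.mean((y_b - y_hat) ** 2)                   # 2. compute loss (MSE)
    grad = -2.0 * X_b.T @ (y_b - y_hat) / batch_size     # 3. gradient of MSE w.r.t. theta
    theta = theta - eta * grad                           # 4. parameter update

print(theta)  # should end up close to true_theta
```

For a linear model the gradient is written out by hand; in a neural network, backpropagation computes the same quantity layer by layer.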

🔄

Why MSE and Cross-Entropy Have Beautiful Gradients

For squared error on a single sample: ∂L/∂ŷ = −2(y − ŷ), which is linear in the error. Simple and stable.
For cross-entropy + sigmoid/softmax: the gradient with respect to the pre-activation (the logit z) simplifies to ŷ − y, the prediction error itself. This clean form is not a coincidence; it's another reason MLE-derived losses are preferred.
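The cross-entropy gradient can be checked with a finite difference. The sketch assumes a sigmoid output and differentiates with respect to the logit `z`; the values of `z`, `y` and the step `h` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_of_logit(z, y):
    """Binary cross-entropy as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y = 0.7, 1.0   # hypothetical logit and label
h = 1e-6          # step for the central finite difference

numeric_grad = (bce_of_logit(z + h, y) - bce_of_logit(z - h, y)) / (2 * h)
analytic_grad = sigmoid(z) - y   # the closed form y_hat - y

print(numeric_grad, analytic_grad)
```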

Section 07

Python Implementation — Full Working Code

Below is a complete example computing MSE, binary cross-entropy, and regularised loss from scratch in NumPy, then training a logistic regression model with both L1 and L2 regularisation using scikit-learn.

Part A — Loss Functions from Scratch

import numpy as np

# ── Mean Squared Error ──────────────────────────────────
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# ── Root Mean Squared Error ─────────────────────────────
def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

# ── Binary Cross-Entropy ────────────────────────────────
def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # prevent log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

# ── Regularised MSE ─────────────────────────────────────
def regularised_mse(y_true, y_pred, weights, lam=0.01, mode='l2'):
    loss = mse(y_true, y_pred)
    if mode == 'l2':
        penalty = lam * np.sum(weights ** 2)
    elif mode == 'l1':
        penalty = lam * np.sum(np.abs(weights))
    else:
        penalty = 0
    return loss + penalty

# ── Demo ─────────────────────────────────────────────────
y_true_reg  = np.array([1200, 850, 2000, 950, 1500])
y_pred_reg  = np.array([1150, 900, 1700, 940, 1520])
weights     = np.array([0.5, -0.3, 1.2, 0.8])

print(f"MSE:         {mse(y_true_reg, y_pred_reg):,.1f}")
print(f"RMSE:        £{rmse(y_true_reg, y_pred_reg):.1f}")

y_true_cls = np.array([1, 0, 1, 0])
y_pred_cls = np.array([0.90, 0.05, 0.10, 0.60])
print(f"BCE:         {binary_cross_entropy(y_true_cls, y_pred_cls):.4f}")
print(f"Reg MSE L2:  {regularised_mse(y_true_reg, y_pred_reg, weights, 0.01, 'l2'):,.2f}")
print(f"Reg MSE L1:  {regularised_mse(y_true_reg, y_pred_reg, weights, 0.01, 'l1'):,.2f}")

OUTPUT

MSE:         19,100.0
RMSE:        £138.2
BCE:         0.8439
Reg MSE L2:  19,100.02
Reg MSE L1:  19,100.03

Part B — Scikit-learn: L1 vs L2 on Breast Cancer Dataset

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss, accuracy_score
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# L2 Regularised Logistic Regression (default)
lr_l2 = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
lr_l2.fit(X_train, y_train)

# L1 Regularised Logistic Regression
lr_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr_l1.fit(X_train, y_train)

for name, model in [('L2 Ridge', lr_l2), ('L1 Lasso', lr_l1)]:
    probs = model.predict_proba(X_test)
    acc   = accuracy_score(y_test, model.predict(X_test))
    bce   = log_loss(y_test, probs)
    zeros = np.sum(model.coef_ == 0)
    print(f"{name} | Accuracy: {acc:.4f} | BCE Loss: {bce:.4f} | Zero weights: {zeros}/{model.coef_.size}")

OUTPUT

🔍

Reading the Output

In this run, L2 (Ridge) achieves slightly higher accuracy while keeping all 30 features; none are discarded. L1 (Lasso) zeroed out 12 of 30 features, trading a little accuracy for a far simpler, more interpretable model. Exact figures can shift with the scikit-learn version and the train/test split. In a clinical setting where explaining which features drive the diagnosis matters, L1's 18-feature model might be the better choice despite the lower accuracy.

Section 08

Quick Reference — Choosing Your Loss Function

📈

Regression, Gaussian Noise

→ Use MSE / RMSE
sklearn: LinearRegression
keras: loss='mse'

👑

Regression, Outlier-Robust

→ Use MAE or Huber
sklearn: HuberRegressor
keras: loss='huber'

🚨

Binary Classification

→ Use Binary Cross-Entropy
sklearn: LogisticRegression
keras: loss='binary_crossentropy'

🌎

Multi-class (mutually exclusive)

→ Categorical Cross-Entropy
sklearn: LogisticRegression (multinomial)
keras: loss='categorical_crossentropy' (one-hot labels) or 'sparse_categorical_crossentropy' (integer labels)

📋

Overfitting — Many Features

→ Add L1 penalty (Lasso)
sklearn: penalty='l1'
keras: regularizers.l1(λ)

🔧

Overfitting — Correlated Features

→ Add L2 penalty (Ridge)
sklearn: penalty='l2'
keras: regularizers.l2(λ)

Section 09

Golden Rules

⚡ Loss Functions & Optimisation — Non-Negotiable Rules

Never use MSE for classification. MSE corresponds to a Gaussian noise model on a continuous target. Applied to a sigmoid output, its gradient picks up a factor of ŷ(1 − ŷ), which vanishes when the model is confidently wrong, so the loss surface develops flat regions that stall training.

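Clip predictions before computing log-based losses by hand. log(0) evaluates to −∞. Clip predicted probabilities to [ε, 1−ε] before taking logs. ε = 1e-15 works in float64, but in float32 the value 1 − 1e-15 rounds to exactly 1, so use a larger ε (for example 1e-7) or compute the loss directly from logits.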

MSE is sensitive to outliers — check your target distribution first. If your target has heavy tails or occasional extreme values, a single outlier can dominate the entire loss. Use Huber loss or log-transform the target.

Tune λ with cross-validation — never by intuition. The regularisation strength is a hyperparameter. Use GridSearchCV or RidgeCV/LassoCV which have efficient built-in cross-validation over a range of λ values.

The loss function and the evaluation metric need not be the same. You might train with cross-entropy but evaluate with F1 score or AUC-ROC. The loss must be differentiable (for gradient descent); the metric just needs to measure what you care about in production.

Cross-entropy rewards calibrated probabilities. It is a proper scoring rule, so a poorly calibrated model (confident and wrong) pays a very high cross-entropy. Consider using CalibratedClassifierCV in sklearn to post-hoc calibrate your model's output probabilities.
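The first rule in this list can be made concrete by comparing gradients. The sketch assumes a sigmoid output with true label 1 and a handful of hypothetical logits, and evaluates the gradient of each loss with respect to the logit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                       # true label
logits = np.array([-10.0, -2.0, 0.0, 2.0])    # hypothetical, from confidently wrong to roughly right
p = sigmoid(logits)

# Gradient of each loss with respect to the logit z.
grad_mse = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (y - sigmoid(z))**2
grad_ce = p - y                            # d/dz of binary cross-entropy

for z, gm, gc in zip(logits, grad_mse, grad_ce):
    print(f"z = {z:6.1f} | MSE grad = {gm: .6f} | CE grad = {gc: .6f}")
```

At z = −10 the model is confidently wrong, yet the MSE gradient is nearly zero, so the weights barely move. The cross-entropy gradient stays close to −1 and pushes the prediction back hard.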