The Story That Explains Loss Functions
If the arrow lands 2 cm away, she says "adjust your grip a little." If it lands 40 cm away, she says "stop, we need to completely rethink your stance." The magnitude of the error dictates the magnitude of the correction.
A machine learning model does exactly this. After every prediction it makes, the loss function measures how far off it was. The optimiser then nudges the model's parameters in whatever direction reduces that distance. Repeat millions of times. That is training.
A loss function (also called a cost function or objective function) is a mathematical formula that takes the model's prediction ŷ and the true label y, and returns a single non-negative number representing how wrong the prediction is. The model's entire goal during training is to minimise this number.
Loss is computed on a single example. Cost (or empirical risk) is the average loss over the entire dataset. Objective is what you actually optimise, usually cost plus a regularisation penalty. The three terms are often used interchangeably in practice, but the distinction matters when reading research papers.
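The three terms can be told apart in a few lines of NumPy. This is a minimal sketch: the toy data, the squared-error loss, the weight values and the regularisation strength `lam` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: three examples and a two-weight model.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
weights = np.array([0.8, -0.4])   # illustrative values only
lam = 0.1                         # illustrative regularisation strength

per_example_loss = (y_true - y_pred) ** 2        # loss: one number per example
cost = per_example_loss.mean()                   # cost: average loss over the dataset
objective = cost + lam * np.sum(weights ** 2)    # objective: cost plus an L2 penalty

print("loss per example:", per_example_loss)
print("cost:", cost)
print("objective:", objective)
```

Only the objective sees the weights directly; the loss and the cost depend on them solely through the predictions.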
Mean Squared Error: The Regression Workhorse
If we simply averaged the raw errors, a model that over-predicts by £20,000 and under-predicts by £20,000 would look perfect, because the two errors cancel out. That's useless. So instead we square each error before averaging. Squaring does two things: it eliminates negative signs, and it punishes large errors disproportionately (a £30,000 miss hurts 9× more than a £10,000 miss, not just 3×).
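Both effects of squaring can be checked numerically. The sketch below uses the figures from the paragraph above; the variable names are illustrative.

```python
import numpy as np

# Two predictions: one £20,000 too high, one £20,000 too low.
errors = np.array([20_000.0, -20_000.0])

mean_raw_error = errors.mean()               # signs cancel, so the model looks perfect
mean_squared_error = (errors ** 2).mean()    # squaring removes the signs

# The penalty grows quadratically: a miss 3x larger costs 9x more.
penalty_ratio = 30_000.0 ** 2 / 10_000.0 ** 2

print(mean_raw_error)       # 0.0
print(mean_squared_error)   # 400000000.0
print(penalty_ratio)        # 9.0
```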
The Formula
MSE is not an arbitrary choice. It falls directly out of Maximum Likelihood Estimation (MLE) when you assume the residuals follow a Gaussian distribution: y = f(x) + ε, where ε ~ N(0, σ²). Maximising the log-likelihood of a Gaussian is exactly equivalent to minimising the sum of squared errors. This gives MSE a deep statistical justification: it's optimal when your noise really is Gaussian.
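For reference, the standard definition of MSE and the Gaussian argument can be written out as below. The notation is an assumption made here: n examples, ŷᵢ = f(xᵢ), and a fixed noise variance σ².

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2

p(y_i \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)

-\log \prod_{i=1}^{n} p(y_i \mid x_i)
  = \frac{n}{2}\log(2\pi\sigma^2)
  + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

The first term and the factor 1/(2σ²) do not depend on the model, so minimising the negative log-likelihood is the same as minimising the sum of squared errors, and hence the MSE.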
🔨 Numerical Example 1: MSE Step by Step
A model predicts apartment rents (£/month) for five flats. Let's compute the MSE manually.
| Flat | True Rent (y) | Predicted Rent (ŷ) | Error (y − ŷ) | Squared Error (y − ŷ)² |
|---|---|---|---|---|
| A | £1,200 | £1,150 | +50 | 2,500 |
| B | £850 | £900 | −50 | 2,500 |
| C | £2,000 | £1,700 | +300 | 90,000 |
| D | £950 | £940 | +10 | 100 |
| E | £1,500 | £1,520 | −20 | 400 |
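The arithmetic for the table can be finished in a few lines. This is a small sketch that uses only the five rows above.

```python
import numpy as np

y_true = np.array([1200, 850, 2000, 950, 1500], dtype=float)  # true rents from the table
y_pred = np.array([1150, 900, 1700, 940, 1520], dtype=float)  # predicted rents

squared_errors = (y_true - y_pred) ** 2
mse = squared_errors.mean()
rmse = np.sqrt(mse)

print(squared_errors.sum())  # 95500.0
print(mse)                   # 19100.0
print(round(rmse, 1))        # 138.2
```

Flat C alone contributes 90,000 of the 95,500 total (about 94%), which shows how strongly a single large miss dominates the MSE.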
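For a linear model, the MSE loss surface is a smooth, convex bowl (a paraboloid) in the weights. Gradient descent with a suitably small learning rate therefore converges to the global minimum; there are no local traps. This guarantee does not carry over to non-linear models such as neural networks, where the same loss is generally non-convex in the weights.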
When to Use MSE
Binary Cross-Entropy: The Classification Standard
If we used MSE: (1 − 0.01)² ≈ 0.98, which sounds manageable. But intuitively, a doctor who is 99% confident and wrong on a cancer diagnosis is catastrophically wrong. That confidence should be punished severely.
Cross-entropy agrees: −log(0.01) ≈ 4.6 (natural log), a very large penalty. For the same patient, a doctor who said "50% chance" gets −log(0.50) ≈ 0.69: still penalised, but fairly. Cross-entropy punishes confident wrong answers brutally while barely scolding humble uncertainty.
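The comparison above can be reproduced directly. In this sketch the helper name `penalties` is hypothetical, and the true label is assumed to be 1 (the patient has the disease).

```python
import numpy as np

def penalties(p_true_class):
    """Return (squared error, cross-entropy) when the true label is 1."""
    squared_error = (1.0 - p_true_class) ** 2
    cross_entropy = -np.log(p_true_class)   # natural log
    return squared_error, cross_entropy

for p in (0.01, 0.50):
    se, ce = penalties(p)
    print(f"p = {p:.2f}   squared error = {se:.4f}   cross-entropy = {ce:.4f}")
```

The squared error can never exceed 1 for a probability, whereas the cross-entropy grows without bound as the probability given to the true class shrinks.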
The Formula
Assume labels follow a Bernoulli distribution: P(y|x) = ŷ^y (1 − ŷ)^(1 − y). Taking the log-likelihood of n samples, flipping the sign (to minimise) and averaging gives exactly the binary cross-entropy formula. Minimising cross-entropy = maximising the likelihood that the model's predicted probabilities generated the observed labels. No arbitrary choice: it drops out of pure probability theory.
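Written out, the steps look as follows. The index i over the n samples and the 1/n averaging convention are assumptions made here for consistency with the MSE definition.

```latex
P(y_i \mid x_i) = \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{1 - y_i}

\log \prod_{i=1}^{n} P(y_i \mid x_i)
  = \sum_{i=1}^{n} \left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]

\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}
  \left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]
```

For each example only one of the two terms is active: the first when y = 1, the second when y = 0.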
🔨 Numerical Example 2: Cross-Entropy Step by Step
A spam classifier assigns probabilities to 4 emails. We compute the binary cross-entropy for each and then the average cost.
| Email | True Label (y) | P(spam) = ŷ | Active Term | Loss = −log(active prob) |
|---|---|---|---|---|
| E1 (Spam) | SPAM (1) | 0.90 | −log(0.90) | 0.105 (Low) |
| E2 (Not Spam) | HAM (0) | 0.05 | −log(1 − 0.05) = −log(0.95) | 0.051 (Very Low) |
| E3 (Spam) | SPAM (1) | 0.10 | −log(0.10) | 2.303 (High Penalty) |
| E4 (Not Spam) | HAM (0) | 0.60 | −log(1 − 0.60) = −log(0.40) | 0.916 (Moderate) |
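The per-email losses and the average cost for the table can be checked with a short NumPy sketch. Variable names are illustrative.

```python
import numpy as np

y_true = np.array([1, 0, 1, 0])              # E1..E4: spam = 1, ham = 0
y_prob = np.array([0.90, 0.05, 0.10, 0.60])  # predicted P(spam)

# Probability the model assigned to the class that actually occurred.
active_prob = np.where(y_true == 1, y_prob, 1 - y_prob)
per_email_loss = -np.log(active_prob)
average_cost = per_email_loss.mean()

print(np.round(per_email_loss, 3))  # [0.105 0.051 2.303 0.916]
print(round(average_cost, 3))       # 0.844
```

E3, the single confident miss, accounts for roughly two thirds of the total loss.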
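As the predicted probability of the true class approaches 0, the loss approaches infinity. In exact arithmetic a sigmoid output lies strictly between 0 and 1, but in floating point it can round to exactly 0 or 1, at which point log(0) is −∞ and the loss and its gradient are no longer usable. This is why implementations clip predicted probabilities away from 0 and 1, or compute the loss directly from the logits.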
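A small sketch of the clipping idea: the helper name `true_class_loss` is hypothetical, and the `eps` value is an assumption that matches the 1e-15 used in the from-scratch code later in this article.

```python
import numpy as np

def true_class_loss(p_true_class, eps=1e-15):
    """-log of the probability given to the true class, clipped away from 0 and 1."""
    p = np.clip(p_true_class, eps, 1 - eps)
    return -np.log(p)

# The loss grows as the probability shrinks, but stays finite even at exactly 0.
for p in (1e-1, 1e-3, 1e-6, 0.0):
    print(p, true_class_loss(p))
```

Clipping caps the worst-case loss at −log(eps), about 34.5 for eps = 1e-15, instead of letting it become infinite.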
Maximum Likelihood Estimation: The Unifying View
You may wonder why MSE and cross-entropy look so different, yet both "work." The answer is that they are both instances of the same principle: Maximum Likelihood Estimation.
| Assumed Noise Distribution | Likelihood P(y \| x, θ) | −log Likelihood (up to constants) | Resulting Loss |
|---|---|---|---|
| Gaussian (Normal) | N(ŷ, σ²) | Σ (yᵢ − ŷᵢ)² / 2σ² | MSE |
| Bernoulli (binary) | ŷ^y (1 − ŷ)^(1 − y) | −Σ [y log ŷ + (1 − y) log(1 − ŷ)] | Binary Cross-Entropy |
| Categorical (K classes) | Πₖ ŷₖ^(yₖ) | −Σₖ yₖ log ŷₖ | Categorical Cross-Entropy |
| Laplacian | exp(−\|y − ŷ\| / b) | Σ \|yᵢ − ŷᵢ\| / b | MAE (L1 Loss) |
Each loss function in this table is the negative log-likelihood under some probability distribution. Choosing one of them is therefore equivalent to choosing a probabilistic model for your noise. This is why the choice matters: if your data has Gaussian noise, MSE is principled. If it has heavy-tailed noise, Huber or MAE is the more robust choice. The loss function encodes your assumptions about the world.
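One way to see what the noise assumption changes is to ask which single constant prediction each loss prefers. Squared error is minimised by the mean, absolute error by the median. The sketch below checks this by brute force on hypothetical data containing one extreme outlier.

```python
import numpy as np

# Hypothetical rents with one extreme outlier.
y = np.array([900.0, 950.0, 1000.0, 1050.0, 5000.0])

candidates = np.linspace(0, 6000, 6001)   # constant predictions to try, in steps of 1
diffs = y[None, :] - candidates[:, None]  # shape (6001, 5)
mse_per_candidate = (diffs ** 2).mean(axis=1)
mae_per_candidate = np.abs(diffs).mean(axis=1)

best_for_mse = candidates[mse_per_candidate.argmin()]  # lands on the mean
best_for_mae = candidates[mae_per_candidate.argmin()]  # lands on the median

print(best_for_mse, y.mean())      # 1780.0 1780.0
print(best_for_mae, np.median(y))  # 1000.0 1000.0
```

The outlier drags the MSE-optimal prediction far above four of the five data points, while the MAE-optimal prediction stays with the bulk of the data.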
Regularised Loss: L1 and L2 Penalties
A model that perfectly minimises training loss often overfits: it memorises the training data including noise, and fails on new examples. Regularisation adds a penalty term to the objective that discourages the model from growing too complex.
L1 (Lasso) is like a sculptor with a chisel. Instead of just penalising effort, the sculptor actively carves away anything unnecessary. After training, many weights are exactly zero, so the model has selected a sparse subset of features. Ideal when you suspect most features are irrelevant.
| Property | L1 (Lasso) | L2 (Ridge) | Elastic Net (L1+L2) |
|---|---|---|---|
| Sparsity (zero weights) | Yes (exact zeros) | No (only shrinks) | Partial |
| Feature Selection | Built-in | Not built-in | Partial |
| Correlated Features | Picks one, drops rest | Spreads weight across all | Handles well |
| Gradient at zero | Undefined (subgradient) | Smooth and differentiable | Mixed |
| Best Use Case | Many irrelevant features | All features somewhat relevant | Correlated feature groups |
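The sparsity row of the table can be observed on synthetic data. This is a sketch under stated assumptions: the data, the seed and the `alpha` values are illustrative, and scikit-learn's `Lasso` and `Ridge` scale `alpha` differently, so the two penalties are not equally strong.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))

true_w = np.zeros(n_features)
true_w[:3] = [3.0, -2.0, 1.5]   # only the first three features matter
y = X @ true_w + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.3).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.3).fit(X, y)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zero weights:", n_zero_lasso)
print("Ridge zero weights:", n_zero_ridge)
```

Lasso should set the irrelevant weights to exactly zero while keeping the three informative ones, whereas Ridge shrinks weights without zeroing any of them.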
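λ = 0: No regularisation. Pure loss minimisation, maximum overfitting risk.
λ → ∞: All weights crushed to zero. Maximum underfitting (with an unpenalised intercept, a regression model always predicts the mean).
The optimal λ lives between these extremes and is found via cross-validation, not guessing. In scikit-learn, λ is the alpha parameter of Ridge, Lasso and ElasticNet. LogisticRegression instead takes C, the inverse of the regularisation strength, so a smaller C means a stronger penalty.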
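Choosing λ by cross-validation could look like the sketch below, which uses scikit-learn's `RidgeCV`. The synthetic data and the candidate grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=120)

alphas = np.logspace(-3, 3, 13)                  # candidate lambda values
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)   # 5-fold cross-validation
best_alpha = model.alpha_                        # value with the best validation score

print("selected alpha:", best_alpha)
```

The selected value depends on the data and the folds, so it is best treated as an estimate rather than a constant.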
Gradient Descent: How the Optimiser Actually Learns
Knowing what to minimise is only half the story. The optimiser is the algorithm that actually adjusts the model's weights to reduce the loss. The most fundamental optimiser is gradient descent.
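A minimal sketch of gradient descent on a one-feature linear regression with MSE follows. The data, the learning rate and the step count are assumptions. Each step moves the parameters a small distance against the gradient of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)  # true slope 3, intercept 2

w, b = 0.0, 0.0        # start from an uninformed model
learning_rate = 0.1    # assumed step size

for step in range(500):
    y_pred = w * x + b
    error = y_pred - y
    grad_w = 2.0 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)       # d(MSE)/db
    w -= learning_rate * grad_w         # step against the gradient
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

With a learning rate that is too large the updates overshoot and diverge, and with one that is too small convergence is needlessly slow.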
For MSE on a single example, L = (y − ŷ)²: ∂L/∂ŷ = −2(y − ŷ), which is linear in the error. Simple and stable.
For cross-entropy + sigmoid/softmax: the gradient with respect to the pre-activation (the logit z) simplifies to ŷ − y, the prediction error itself. This clean form is not a coincidence; it's another reason MLE-derived losses are preferred. The math works out perfectly.
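Both gradient formulas can be checked against a central finite difference. In this sketch the helper names and the test point (y = 1, z = 0.3, ŷ = 0.8) are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy of one example as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def squared_error(y_hat, y):
    return (y - y_hat) ** 2

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

y, z, y_hat = 1.0, 0.3, 0.8

bce_numeric = numerical_derivative(lambda t: bce_from_logit(t, y), z)
bce_analytic = sigmoid(z) - y          # y_hat - y, with y_hat = sigmoid(z)

mse_numeric = numerical_derivative(lambda t: squared_error(t, y), y_hat)
mse_analytic = -2.0 * (y - y_hat)      # -2 (y - y_hat)

print(bce_numeric, bce_analytic)
print(mse_numeric, mse_analytic)
```

The numeric and analytic values should agree to several decimal places, and the cross-entropy check confirms that the derivative is taken with respect to the logit rather than the probability.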
Python Implementation: Full Working Code
Below is a complete example computing MSE, binary cross-entropy, and regularised loss from scratch in NumPy, then training a logistic regression model with both L1 and L2 regularisation using scikit-learn.
Part A: Loss Functions from Scratch
import numpy as np

# --- Mean Squared Error ---
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# --- Root Mean Squared Error ---
def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

# --- Binary Cross-Entropy ---
def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # prevent log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

# --- Regularised MSE ---
def regularised_mse(y_true, y_pred, weights, lam=0.01, mode='l2'):
    loss = mse(y_true, y_pred)
    if mode == 'l2':
        penalty = lam * np.sum(weights ** 2)
    elif mode == 'l1':
        penalty = lam * np.sum(np.abs(weights))
    else:
        penalty = 0  # any other mode: no regularisation
    return loss + penalty

# --- Demo ---
y_true_reg = np.array([1200, 850, 2000, 950, 1500])
y_pred_reg = np.array([1150, 900, 1700, 940, 1520])
weights = np.array([0.5, -0.3, 1.2, 0.8])

print(f"MSE:  {mse(y_true_reg, y_pred_reg):,.1f}")
print(f"RMSE: £{rmse(y_true_reg, y_pred_reg):.1f}")

y_true_cls = np.array([1, 0, 1, 0])
y_pred_cls = np.array([0.90, 0.05, 0.10, 0.60])
print(f"BCE:  {binary_cross_entropy(y_true_cls, y_pred_cls):.4f}")

print(f"Reg MSE L2: {regularised_mse(y_true_reg, y_pred_reg, weights, 0.01, 'l2'):,.2f}")
print(f"Reg MSE L1: {regularised_mse(y_true_reg, y_pred_reg, weights, 0.01, 'l1'):,.2f}")
Part B: Scikit-learn, L1 vs L2 on the Breast Cancer Dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss, accuracy_score
import numpy as np

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardise features: fit the scaler on the training split only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# L2 Regularised Logistic Regression (default)
lr_l2 = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
lr_l2.fit(X_train, y_train)

# L1 Regularised Logistic Regression
lr_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr_l1.fit(X_train, y_train)

# Compare accuracy, log loss and the number of weights that are exactly zero
for name, model in [('L2 Ridge', lr_l2), ('L1 Lasso', lr_l1)]:
    probs = model.predict_proba(X_test)
    acc = accuracy_score(y_test, model.predict(X_test))
    bce = log_loss(y_test, probs)
    zeros = np.sum(model.coef_ == 0)
    print(f"{name} | Accuracy: {acc:.4f} | BCE Loss: {bce:.4f} | Zero weights: {zeros}/30")
In this run, L2 (Ridge) achieves slightly higher accuracy while using all 30 features; none are discarded. L1 (Lasso) zeroed out 12 of 30 features, trading a little accuracy for a far simpler, more interpretable model. In a clinical setting where explaining which features drive the diagnosis matters, L1's 18-feature model might be the better choice despite the lower accuracy.
Quick Reference: Choosing Your Loss Function
| Loss / penalty | sklearn | keras |
|---|---|---|
| MSE | LinearRegression | loss='mse' |
| Huber | HuberRegressor | loss='huber' |
| Binary cross-entropy | LogisticRegression | loss='binary_crossentropy' |
| Sparse categorical cross-entropy | softmax output | loss='sparse_categorical_crossentropy' |
| L1 penalty | penalty='l1' | regularizers.l1(λ) |
| L2 penalty | penalty='l2' | regularizers.l2(λ) |