The Story: Dr. Meera's TB Screening Crisis
Rohan builds a model, tests it, and comes back beaming: "Doctor, the model is 89.5% accurate!"
Dr. Meera is unconvinced. She asks Rohan to tell her specifically how many real TB patients the model missed. Rohan checks: the model missed 5 out of 50 TB patients. Five people with active, infectious TB walked out undetected.
At the same time, it sent 100 healthy people for needlessly expensive follow-up tests.
"That is not 89.5% good," Dr. Meera says. "That is a public health failure. You need better metrics." Rohan reaches for Precision, Recall, and F1 Score.
Why Accuracy Fails on Imbalanced Data
| Metric | Value | What It Hides |
|---|---|---|
| Accuracy | 89.5% | Looks great, but misleading |
| Missed TB cases (FN) | 5 patients | Walking out undetected |
| Wasted tests (FP) | 100 patients | Unnecessary cost & stress |
A model that predicts everyone is healthy gets 950/1000 = 95% accuracy, yet catches zero TB cases. Accuracy is useless here.
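A quick sanity check makes the trap concrete. This is a minimal sketch, assuming scikit-learn is available; the label arrays simply encode the story's 950-healthy / 50-TB population, and the "model" clears everyone:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Dr. Meera's population: 950 healthy (0), 50 TB (1)
y_true = np.array([0] * 950 + [1] * 50)
# A do-nothing "model" that predicts "healthy" for every patient
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")   # 0.950
print(f"Recall  : {recall_score(y_true, y_pred):.3f}")     # 0.000 -- catches no TB at all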
| Metric | Value | What It Reveals |
|---|---|---|
| Precision | 31% | Most positive flags are wrong |
| Recall | 90% | Most real TB cases are caught |
| F1 Score | 46% | Overall weak performance |
These metrics tell the true story: the model floods the lab with false alarms but does catch 90% of actual TB. Whether that trade-off is acceptable is a medical decision, not a model decision.
The Confusion Matrix: The Foundation of All Metrics
Every classification metric is derived from a single 2×2 table called the Confusion Matrix. Understanding it is the first step.
| Cell | Full Name | Count | Plain-English Meaning |
|---|---|---|---|
| TN | True Negative | 850 | Healthy patients correctly cleared; no action needed |
| FP | False Positive | 100 | Healthy patients wrongly flagged; wasted tests, anxiety |
| FN | False Negative | 5 | TB patients missed; most dangerous, they walk away untreated |
| TP | True Positive | 45 | TB patients correctly detected; the goal |
In disease detection, fraud, and safety systems, a False Negative (missed threat) is usually far more costly than a False Positive (false alarm). A missed TB patient spreads infection. A missed fraud claim costs thousands. A missed safety defect causes accidents. Always ask: "Which error is worse in my domain?" Then choose and tune your metrics accordingly.
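The sketch below rebuilds Dr. Meera's matrix with scikit-learn from label arrays that reproduce the story's counts. It mainly shows the cell ordering scikit-learn uses, which trips up many newcomers:

import numpy as np
from sklearn.metrics import confusion_matrix

# Labels reconstructed from the table above: TN=850, FP=100, FN=5, TP=45
y_true = np.array([0] * 950 + [1] * 50)                        # 0 = healthy, 1 = TB
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)

cm = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
print(cm)                               # [[850 100]
                                        #  [  5  45]]
tn, fp, fn, tp = cm.ravel()             # unpacks as TN, FP, FN, TP
print(tn, fp, fn, tp)                   # 850 100 5 45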
Precision: When You Raise the Alarm, How Often Are You Right?
High precision is critical when false alarms are costly. Examples: a spam filter that moves legitimate emails to spam (FP = lost business email); a legal document retrieval system that returns irrelevant results (FP = wasted lawyer hours); a credit card block that stops a legitimate transaction (FP = angry customer). In all these cases, you want every positive prediction to be correct.
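Plugging in Dr. Meera's counts (45 true positives, 100 false positives):
Precision = TP / (TP + FP) = 45 / (45 + 100) ≈ 0.31, i.e. only about 3 in 10 TB alerts are real.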
Recall: Of All Real Positives, How Many Did You Find?
High recall is critical when missing a real positive is catastrophic. Examples: cancer screening (missing cancer = delayed treatment, death); fraud detection (missing fraud = financial loss); airport security (missing a threat = public safety risk); earthquake early warning (missing an alert = disaster). In these domains, you would rather have many false alarms than miss one real event.
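Again with Dr. Meera's counts (45 true positives, 5 false negatives):
Recall = TP / (TP + FN) = 45 / (45 + 5) = 0.90, i.e. 9 of every 10 real TB cases are caught.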
The Precision-Recall Trade-off
Precision and Recall pull in opposite directions. As you lower the classification threshold (flag more things as positive), you catch more real positives (recall rises) but also flag more false alarms (precision falls). As you raise the threshold, you become more selective (precision rises) but miss more real positives (recall falls).
On a plot of Precision, Recall, and F1 against the threshold, the point where the F1 Score peaks marks the balanced operating point. Moving left (lower threshold) boosts Recall at the cost of Precision; moving right (higher threshold) boosts Precision at the cost of Recall. The sketch below makes the sweep concrete.
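A minimal sketch of that threshold sweep, using scikit-learn's breast-cancer dataset purely as a stand-in (the same dataset as the full pipeline further below); the exact numbers will differ from Dr. Meera's, but the direction of the trade-off is the point:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
proba = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]   # probability of the positive class

# Sweep the decision threshold: low thresholds favour Recall, high ones favour Precision
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    pred = (proba >= t).astype(int)
    p = precision_score(y_te, pred)
    r = recall_score(y_te, pred)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")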
F1 Score: The Harmonic Mean of Precision and Recall
Neither Precision nor Recall alone is sufficient. We need a single metric that balances both. The F1 Score does this by taking their harmonic mean, not the arithmetic mean.
F1 = 2·TP / (2·TP + FP + FN). It ranges from 0 (worst) to 1 (perfect) and penalises extreme imbalances between Precision and Recall.
The arithmetic mean would paint an optimistic picture: it lets a very high Recall paper over a terrible Precision. Take a model with Recall = 1.0 and Precision = 0.05:
Arithmetic mean = (1.0 + 0.05) / 2 = 0.525 = 52.5%, which looks decent.
Harmonic mean (F1) = 2·(1.0·0.05) / (1.0 + 0.05) ≈ 0.095 = 9.5%, which is correctly terrible.
F-Beta Score: Tuning the Precision-Recall Balance
The F1 score weights Precision and Recall equally. But in the real world, one is often more important than the other. The F-Beta score lets you control that balance with a single parameter β.
β = 1: equal weight, i.e. the standard F1 score.
β > 1: weights Recall more heavily (e.g. β = 2 treats Recall as twice as important as Precision).
β < 1: weights Precision more heavily (e.g. β = 0.5 treats Precision as twice as important).
| β Value | Name | Recall Weight vs Precision | Use Case | Dr. Meera's Score |
|---|---|---|---|---|
| β = 0.5 | F0.5 | Precision 2× more important | Spam filter, legal search, product recommendation: false alarms are costly | F0.5 = (1+0.25)·(0.310·0.900) / (0.25·0.310 + 0.900) ≈ 0.357 |
| β = 1 | F1 | Equal weight | General-purpose; balanced domains | 0.461 |
| β = 2 | F2 | Recall 2× more important | TB / cancer screening, fraud, safety: missed positives are catastrophic | F2 = (1+4)·(0.310·0.900) / (4·0.310 + 0.900) ≈ 0.652 |
Dr. Meera consults her epidemiologist and decides that for TB screening a missed TB case is far more costly than a false alarm, so Recall should count twice as much as Precision. She adopts F2 as her primary metric. Under F2 the model scores 65.2%: still modest, but the score now rewards the model's high recall while still accounting for the flood of false alarms. She then lowers the threshold from 0.5 to 0.3 to push Recall above 95%, accepting lower Precision, and the F2 score tells her whether the trade-off was worth it.
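To cross-check the table, this short sketch recomputes the three F-scores with scikit-learn's fbeta_score on label arrays rebuilt from Dr. Meera's confusion matrix (the same arrays as in the confusion-matrix snippet above):

import numpy as np
from sklearn.metrics import fbeta_score

# TN=850, FP=100, FN=5, TP=45, encoded as label arrays
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)

for beta in (0.5, 1, 2):
    print(f"F{beta}: {fbeta_score(y_true, y_pred, beta=beta):.4f}")
# F0.5: 0.3571   F1: 0.4615   F2: 0.6522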
Other Essential Metrics
Specificity: Recall for the Negative Class
Specificity = TN / (TN + FP) asks: of all truly healthy patients, how many did the model correctly clear? For Dr. Meera's model this is 850 / (850 + 100) ≈ 89.5%. It is the mirror image of Recall, computed on the negative class.
All Metrics at a Glance: Using Dr. Meera's Numbers
| Metric | Formula | Value | Meaning in TB Context |
|---|---|---|---|
| Accuracy | (TP+TN) / N | 89.5% | Misleading; ignores class imbalance |
| Precision (PPV) | TP / (TP+FP) | 31.0% | Only 3 in 10 positive flags are real TB |
| Recall (Sensitivity) | TP / (TP+FN) | 90.0% | Catches 9 of 10 real TB cases |
| Specificity (TNR) | TN / (TN+FP) | 89.5% | Clears 89.5% of healthy patients correctly |
| NPV | TN / (TN+FN) | 99.4% | If cleared, 99.4% chance actually healthy |
| F1 Score | 2·P·R / (P+R) | 46.1% | Balanced score; reveals the precision gap |
| F2 Score | (1+4)·P·R / (4P+R) | 65.2% | Recall-weighted; more appropriate here |
| False Discovery Rate | FP / (FP+TP) | 69.0% | 69% of positive alerts are false alarms |
| Miss Rate | FN / (FN+TP) | 10.0% | 10% of real TB cases are missed |
| Fall-out (FPR) | FP / (FP+TN) | 10.5% | 10.5% of healthy patients wrongly flagged |
Advanced Metrics: MCC, Cohen's Kappa & Balanced Accuracy
When class imbalance is severe and you need a single metric that is genuinely robust, three metrics stand out above accuracy and F1.
Matthews Correlation Coefficient (MCC)
Formula:
MCC = (TP·TN − FP·FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
Dr. Meera's model:
MCC = (45·850 − 100·5) / √[(145)(50)(950)(855)]
    = (38,250 − 500) / √5,888,812,500
    = 37,750 / 76,739 ≈ 0.492
MCC ranges from −1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), so 0.492 signals real but moderate predictive power.
Cohen's Kappa
Formula:
κ = (p_observed − p_expected) / (1 − p_expected)
where p_expected is the agreement you would get by chance given the class distributions.
Dr. Meera's model: p_observed = 0.895 and p_expected = 0.145·0.05 + 0.855·0.95 ≈ 0.8195, so κ = (0.895 − 0.8195) / (1 − 0.8195) ≈ 0.418.
κ < 0: worse than chance | κ = 0: chance level | κ = 1: perfect agreement
General guide: κ < 0.20 = poor, 0.20–0.40 = fair, 0.40–0.60 = moderate, 0.60–0.80 = substantial, 0.80–1.00 = almost perfect.
Balanced Accuracy
Formula:
Balanced Accuracy = (Recall + Specificity) / 2
Dr. Meera's model: (0.900 + 0.895) / 2 ≈ 0.897 = 89.7%
The number happens to land close to the raw 89.5% accuracy here, but it is the more honest summary: it weights both classes equally, so it would fall sharply if the model started missing the minority TB class.
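All three values can be verified with scikit-learn on the same label arrays rebuilt from Dr. Meera's confusion matrix; small rounding differences against the hand calculations above are expected:

import numpy as np
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, balanced_accuracy_score

y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)

print(f"MCC              : {matthews_corrcoef(y_true, y_pred):.3f}")         # ~0.492
print(f"Cohen's Kappa    : {cohen_kappa_score(y_true, y_pred):.3f}")         # ~0.418
print(f"Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")   # ~0.897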
| Metric | Range | Handles Imbalance? | Uses all 4 CM cells? | Best for |
|---|---|---|---|---|
| Accuracy | 0โ1 | No | Yes | Balanced datasets only |
| Precision | 0โ1 | Partial | No (TN ignored) | FP cost is high |
| Recall | 0โ1 | Partial | No (TN ignored) | FN cost is high |
| F1 | 0โ1 | Partial | No (TN ignored) | General-purpose imbalance |
| MCC | −1 to 1 | Yes | Yes | Imbalanced binary; most robust |
| Cohen's Kappa | −1 to 1 | Yes | Yes | Agreement quality; multi-class |
| Balanced Accuracy | 0โ1 | Yes | Yes | Interpretable imbalance metric |
Multi-Class Metrics: Macro, Micro & Weighted Averaging
For problems with more than two classes, Precision, Recall, and F1 are computed per class and then averaged. Three averaging strategies exist, each with a different philosophy: macro averages the per-class scores with equal weight (minority classes count as much as the majority), weighted averages them in proportion to each class's support, and micro pools the TP/FP/FN counts across all classes before computing the metric (for single-label problems, micro-F1 equals overall accuracy). The per-class scores below are averaged after the table.
| Class | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| Benign | 600 | 0.94 | 0.96 | 0.95 |
| Type-A | 300 | 0.82 | 0.78 | 0.80 |
| Type-B | 100 | 0.61 | 0.55 | 0.58 |
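Averaging the F1 column each way makes the philosophies visible (micro-averaging needs the raw TP/FP/FN counts, which this table does not show):
Macro F1 = (0.95 + 0.80 + 0.58) / 3 ≈ 0.777 (every class counts equally, so the weak Type-B drags the score down)
Weighted F1 = (600·0.95 + 300·0.80 + 100·0.58) / 1000 = 0.868 (dominated by the large Benign class)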
Metric Selection Guide: When to Use What
When in doubt: report Precision, Recall, F1, and MCC together; they cover different angles and provide a complete picture for stakeholders and engineering teams alike.
Python: All Metrics from Scratch
import math
# --- Confusion matrix values -----------------------------------
TP, FP, FN, TN = 45, 100, 5, 850
N_total = TP + FP + FN + TN # 1000
P = TP + FN # all actual positives = 50
# --- Core metrics ----------------------------------------------
accuracy = (TP + TN) / N_total
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
npv = TN / (TN + FN) # Negative Predictive Value
fpr = FP / (FP + TN) # False Positive Rate
miss_rate = FN / (FN + TP) # False Negative Rate
fdr = FP / (FP + TP) # False Discovery Rate
# --- F-scores --------------------------------------------------
def f_beta(precision, recall, beta=1):
    b2 = beta ** 2
    if precision + recall == 0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)
f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
f_half = f_beta(precision, recall, beta=0.5)
# --- Matthews Correlation Coefficient ---------------------------
mcc_num = (TP * TN) - (FP * FN)
mcc_den = math.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))
mcc = mcc_num / mcc_den if mcc_den > 0 else 0.0
# --- Balanced Accuracy ------------------------------------------
balanced_acc = (recall + specificity) / 2
# --- Cohen's Kappa ----------------------------------------------
p_obs = (TP + TN) / N_total
p_pos = ((TP + FP) / N_total) * ((TP + FN) / N_total)
p_neg = ((TN + FN) / N_total) * ((TN + FP) / N_total)
p_exp = p_pos + p_neg
kappa = (p_obs - p_exp) / (1 - p_exp)
# --- Print results ----------------------------------------------
print(f"{'Metric':<25} {'Value':>8}")
print("-" * 35)
for name, val in [
    ("Accuracy", accuracy),
    ("Precision (PPV)", precision),
    ("Recall (Sensitivity)", recall),
    ("Specificity (TNR)", specificity),
    ("NPV", npv),
    ("False Positive Rate", fpr),
    ("Miss Rate (FNR)", miss_rate),
    ("False Discovery Rate", fdr),
    ("F0.5 Score", f_half),
    ("F1 Score", f1),
    ("F2 Score", f2),
    ("MCC", mcc),
    ("Balanced Accuracy", balanced_acc),
    ("Cohen's Kappa", kappa),
]:
    print(f"{name:<25} {val:>8.4f}")
Python: Full Pipeline with scikit-learn
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
confusion_matrix,
classification_report,
precision_score, recall_score,
f1_score, fbeta_score,
matthews_corrcoef, cohen_kappa_score,
balanced_accuracy_score,
precision_recall_fscore_support
)
# --- 1. Load & split --------------------------------------------
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# --- 2. Train ----------------------------------------------------
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# --- 3. Full classification report --------------------------------
print("=== Classification Report ===")
print(classification_report(y_test, y_pred,
target_names=['Malignant', 'Benign']))
# --- 4. Individual metrics -----------------------------------------
print("=== Individual Metrics (positive class = Benign) ===")
print(f"Precision : {precision_score(y_test, y_pred):.4f}")
print(f"Recall : {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score : {f1_score(y_test, y_pred):.4f}")
print(f"F2 Score : {fbeta_score(y_test, y_pred, beta=2):.4f}")
print(f"F0.5 Score : {fbeta_score(y_test, y_pred, beta=0.5):.4f}")
print(f"MCC : {matthews_corrcoef(y_test, y_pred):.4f}")
print(f"Cohen's Kappa : {cohen_kappa_score(y_test, y_pred):.4f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")
# --- 5. Threshold tuning for high Recall ---------------------------
y_proba = model.predict_proba(X_test)[:, 1]
threshold = 0.3
y_pred_low = (y_proba >= threshold).astype(int)
print(f"\n=== Metrics at threshold = {threshold} ===")
print(f"Precision : {precision_score(y_test, y_pred_low):.4f}")
print(f"Recall : {recall_score(y_test, y_pred_low):.4f}")
print(f"F1 : {f1_score(y_test, y_pred_low):.4f}")
print(f"F2 : {fbeta_score(y_test, y_pred_low, beta=2):.4f}")
# --- 6. Multi-class macro / weighted / micro -----------------------
from sklearn.datasets import load_iris
Xi, yi = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(Xi, yi, test_size=0.25, random_state=0)
mc_model = LogisticRegression(max_iter=500).fit(Xtr, ytr)
yp = mc_model.predict(Xte)
print("\n=== Multi-class Averaging (Iris, 3 classes) ===")
for avg in ['macro', 'weighted', 'micro']:
    p, r, f, _ = precision_recall_fscore_support(yte, yp, average=avg)
    print(f"{avg:>10} -> P={p:.3f} R={r:.3f} F1={f:.3f}")
print("\n=== Per-Class Breakdown ===")
print(classification_report(yte, yp,
target_names=['setosa', 'versicolor', 'virginica']))
Golden Rules
Always run classification_report() and inspect each class's Precision, Recall, and F1 individually, especially for minority classes. As noted above, decide which error (FP or FN) is costlier in your domain before choosing a metric, and when in doubt report Precision, Recall, F1, and MCC together.