The Story: Dr. Meera's TB Screening Crisis
Rohan builds a model, tests it, and comes back beaming: "Doctor, the model is 89.5% accurate!"
Dr. Meera is unconvinced. She asks Rohan to tell her specifically how many real TB patients the model missed. Rohan checks: the model missed 5 out of 50 TB patients. Five people with active, infectious TB walked out undetected.
At the same time, it sent 100 healthy people for needlessly expensive follow-up tests.
"That is not 89.5% good," Dr. Meera says. "That is a public health failure. You need better metrics." Rohan reaches for Precision, Recall, and F1 Score.
Why Accuracy Fails on Imbalanced Data
| Metric | Value | What It Hides |
|---|---|---|
| Accuracy | 89.5% | Looks great, but misleading |
| Missed TB cases (FN) | 5 patients | Walking out undetected |
| Wasted tests (FP) | 100 patients | Unnecessary cost & stress |
A model that predicts everyone is healthy gets 950/1000 = 95% accuracy, yet catches zero TB cases. Accuracy is useless here.
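A quick sanity check makes the trap concrete. This is a minimal sketch, assuming scikit-learn is available; the label arrays simply encode the story's 950-healthy / 50-TB population, and the "model" clears everyone:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Dr. Meera's population: 950 healthy (0), 50 TB (1)
y_true = np.array([0] * 950 + [1] * 50)
# A do-nothing "model" that predicts "healthy" for every patient
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")   # 0.950
print(f"Recall  : {recall_score(y_true, y_pred):.3f}")     # 0.000 -- catches no TB at all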
| Metric | Value | What It Reveals |
|---|---|---|
| Precision | 31% | Most positive flags are wrong |
| Recall | 90% | Most real TB cases are caught |
| F1 Score | 46% | Overall weak performance |
These metrics tell the true story: the model floods the lab with false alarms but does catch 90% of actual TB. Whether that trade-off is acceptable is a medical decision, not a model decision.
The Confusion Matrix: The Foundation of All Metrics
Every classification metric is derived from a single 2×2 table called the Confusion Matrix. Understanding it is the first step.
| Cell | Full Name | Count | Plain-English Meaning |
|---|---|---|---|
| TN | True Negative | 850 | Healthy patients correctly cleared; no action needed |
| FP | False Positive | 100 | Healthy patients wrongly flagged; wasted tests, anxiety |
| FN | False Negative | 5 | TB patients missed; most dangerous, they walk away untreated |
| TP | True Positive | 45 | TB patients correctly detected; the goal |
In disease detection, fraud, and safety systems, a False Negative (missed threat) is usually far more costly than a False Positive (false alarm). A missed TB patient spreads infection. A missed fraud claim costs thousands. A missed safety defect causes accidents. Always ask: "Which error is worse in my domain?" Then choose and tune your metrics accordingly.
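The sketch below rebuilds Dr. Meera's matrix with scikit-learn from label arrays that reproduce the story's counts. It mainly shows the cell ordering scikit-learn uses, which trips up many newcomers:

import numpy as np
from sklearn.metrics import confusion_matrix

# Labels reconstructed from the table above: TN=850, FP=100, FN=5, TP=45
y_true = np.array([0] * 950 + [1] * 50)                        # 0 = healthy, 1 = TB
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)

cm = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
print(cm)                               # [[850 100]
                                        #  [  5  45]]
tn, fp, fn, tp = cm.ravel()             # unpacks as TN, FP, FN, TP
print(tn, fp, fn, tp)                   # 850 100 5 45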
Precision: When You Raise the Alarm, How Often Are You Right?
High precision is critical when false alarms are costly. Examples: a spam filter that moves legitimate emails to spam (FP = lost business email); a legal document retrieval system that returns irrelevant results (FP = wasted lawyer hours); a credit card block that stops a legitimate transaction (FP = angry customer). In all these cases, you want every positive prediction to be correct.
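Plugging in Dr. Meera's counts (45 true positives, 100 false positives):
Precision = TP / (TP + FP) = 45 / (45 + 100) ≈ 0.31, i.e. only about 3 in 10 TB alerts are real.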
Recall: Of All Real Positives, How Many Did You Find?
High recall is critical when missing a real positive is catastrophic. Examples: cancer screening (missing cancer = delayed treatment, death); fraud detection (missing fraud = financial loss); airport security (missing a threat = public safety risk); earthquake early warning (missing an alert = disaster). In these domains, you would rather have many false alarms than miss one real event.
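Again with Dr. Meera's counts (45 true positives, 5 false negatives):
Recall = TP / (TP + FN) = 45 / (45 + 5) = 0.90, i.e. 9 of every 10 real TB cases are caught.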
The Precision-Recall Trade-off
Precision and Recall pull in opposite directions. As you lower the classification threshold (flag more things as positive), you catch more real positives (recall rises) but also flag more false alarms (precision falls). As you raise the threshold, you become more selective (precision rises) but miss more real positives (recall falls).
On a plot of Precision, Recall, and F1 against the threshold, the point where the F1 Score peaks marks the balanced operating point. Moving left (lower threshold) boosts Recall at the cost of Precision; moving right (higher threshold) boosts Precision at the cost of Recall. The sketch below makes the sweep concrete.
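A minimal sketch of that threshold sweep, using scikit-learn's breast-cancer dataset purely as a stand-in (the same dataset as the full pipeline further below); the exact numbers will differ from Dr. Meera's, but the direction of the trade-off is the point:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
proba = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]   # probability of the positive class

# Sweep the decision threshold: low thresholds favour Recall, high ones favour Precision
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    pred = (proba >= t).astype(int)
    p = precision_score(y_te, pred)
    r = recall_score(y_te, pred)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")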
F1 Score: The Harmonic Mean of Precision and Recall
Neither Precision nor Recall alone is sufficient. We need a single metric that balances both. The F1 Score does this by taking their harmonic mean, not the arithmetic mean.
F1 = 2·TP / (2·TP + FP + FN). It ranges from 0 (worst) to 1 (perfect) and penalises extreme imbalances between Precision and Recall.
The arithmetic mean would paint an optimistic picture: it lets a very high Recall paper over a terrible Precision. Take a model with Recall = 1.0 and Precision = 0.05:
Arithmetic mean = (1.0 + 0.05) / 2 = 0.525 = 52.5%, which looks decent.
Harmonic mean (F1) = 2·(1.0·0.05) / (1.0 + 0.05) ≈ 0.095 = 9.5%, which is correctly terrible.
F-Beta Score: Tuning the Precision-Recall Balance
The F1 score weights Precision and Recall equally. But in the real world, one is often more important than the other. The F-Beta score lets you control that balance with a single parameter β.
β = 1: equal weight, i.e. the standard F1 score.
β > 1: weights Recall more heavily (e.g. β = 2 treats Recall as twice as important as Precision).
β < 1: weights Precision more heavily (e.g. β = 0.5 treats Precision as twice as important).
| β Value | Name | Recall Weight vs Precision | Use Case | Dr. Meera's Score |
|---|---|---|---|---|
| β = 0.5 | F0.5 | Precision 2× more important | Spam filter, legal search, product recommendation: false alarms are costly | F0.5 = (1+0.25)·(0.310·0.900) / (0.25·0.310 + 0.900) ≈ 0.357 |
| β = 1 | F1 | Equal weight | General-purpose; balanced domains | 0.461 |
| β = 2 | F2 | Recall 2× more important | TB / cancer screening, fraud, safety: missed positives are catastrophic | F2 = (1+4)·(0.310·0.900) / (4·0.310 + 0.900) ≈ 0.652 |
Dr. Meera consults her epidemiologist and decides that for TB screening a missed TB case is far more costly than a false alarm, so Recall should count twice as much as Precision. She adopts F2 as her primary metric. Under F2 the model scores 65.2%: still modest, but the score now rewards the model's high recall while still accounting for the flood of false alarms. She then lowers the threshold from 0.5 to 0.3 to push Recall above 95%, accepting lower Precision, and the F2 score tells her whether the trade-off was worth it.
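To cross-check the table, this short sketch recomputes the three F-scores with scikit-learn's fbeta_score on label arrays rebuilt from Dr. Meera's confusion matrix (the same arrays as in the confusion-matrix snippet above):

import numpy as np
from sklearn.metrics import fbeta_score

# TN=850, FP=100, FN=5, TP=45, encoded as label arrays
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)

for beta in (0.5, 1, 2):
    print(f"F{beta}: {fbeta_score(y_true, y_pred, beta=beta):.4f}")
# F0.5: 0.3571   F1: 0.4615   F2: 0.6522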
Other Essential Metrics
Specificity: Recall for the Negative Class
Specificity = TN / (TN + FP) asks: of all truly healthy patients, how many did the model correctly clear? For Dr. Meera's model this is 850 / (850 + 100) ≈ 89.5%. It is the mirror image of Recall, computed on the negative class.
All Metrics at a Glance: Using Dr. Meera's Numbers
| Metric | Formula | Value | Meaning in TB Context |
|---|---|---|---|
| Accuracy | (TP+TN) / N | 89.5% | Misleading; ignores class imbalance |
| Precision (PPV) | TP / (TP+FP) | 31.0% | Only 3 in 10 positive flags are real TB |
| Recall (Sensitivity) | TP / (TP+FN) | 90.0% | Catches 9 of 10 real TB cases |
| Specificity (TNR) | TN / (TN+FP) | 89.5% | Clears 89.5% of healthy patients correctly |
| NPV | TN / (TN+FN) | 99.4% | If cleared, 99.4% chance actually healthy |
| F1 Score | 2·P·R / (P+R) | 46.1% | Balanced score; reveals the precision gap |
| F2 Score | (1+4)·P·R / (4P+R) | 65.2% | Recall-weighted; more appropriate here |
| False Discovery Rate | FP / (FP+TP) | 69.0% | 69% of positive alerts are false alarms |
| Miss Rate | FN / (FN+TP) | 10.0% | 10% of real TB cases are missed |
| Fall-out (FPR) | FP / (FP+TN) | 10.5% | 10.5% of healthy patients wrongly flagged |
Advanced Metrics: MCC, Cohen's Kappa & Balanced Accuracy
When class imbalance is severe and you need a single metric that is genuinely robust, three metrics stand out above accuracy and F1.
Matthews Correlation Coefficient (MCC)
Formula:
MCC = (TP·TN − FP·FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
Dr. Meera's model:
MCC = (45·850 − 100·5) / √[(145)(50)(950)(855)]
    = (38,250 − 500) / √5,888,812,500
    = 37,750 / 76,739 ≈ 0.492
MCC ranges from −1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), so 0.492 signals real but moderate predictive power.
Cohen's Kappa
Formula:
κ = (p_observed − p_expected) / (1 − p_expected)
where p_expected is the agreement you would get by chance given the class distributions.
Dr. Meera's model: p_observed = 0.895 and p_expected = 0.145·0.05 + 0.855·0.95 ≈ 0.8195, so κ = (0.895 − 0.8195) / (1 − 0.8195) ≈ 0.418.
κ < 0: worse than chance | κ = 0: chance level | κ = 1: perfect agreement
General guide: κ < 0.20 = poor, 0.20–0.40 = fair, 0.40–0.60 = moderate, 0.60–0.80 = substantial, 0.80–1.00 = almost perfect.
Balanced Accuracy
Formula:
Balanced Accuracy = (Recall + Specificity) / 2
Dr. Meera's model: (0.900 + 0.895) / 2 ≈ 0.897 = 89.7%
The number happens to land close to the raw 89.5% accuracy here, but it is the more honest summary: it weights both classes equally, so it would fall sharply if the model started missing the minority TB class.
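All three values can be verified with scikit-learn on the same label arrays rebuilt from Dr. Meera's confusion matrix; small rounding differences against the hand calculations above are expected:

import numpy as np
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, balanced_accuracy_score

y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)

print(f"MCC              : {matthews_corrcoef(y_true, y_pred):.3f}")         # ~0.492
print(f"Cohen's Kappa    : {cohen_kappa_score(y_true, y_pred):.3f}")         # ~0.418
print(f"Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")   # ~0.897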
| Metric | Range | Handles Imbalance? | Uses all 4 CM cells? | Best for |
|---|---|---|---|---|
| Accuracy | 0โ1 | No | Yes | Balanced datasets only |
| Precision | 0โ1 | Partial | No (TN ignored) | FP cost is high |
| Recall | 0โ1 | Partial | No (TN ignored) | FN cost is high |
| F1 | 0โ1 | Partial | No (TN ignored) | General-purpose imbalance |
| MCC | −1 to 1 | Yes | Yes | Imbalanced binary; most robust |
| Cohen's Kappa | −1 to 1 | Yes | Yes | Agreement quality; multi-class |
| Balanced Accuracy | 0โ1 | Yes | Yes | Interpretable imbalance metric |
Multi-Class Metrics: Macro, Micro & Weighted Averaging
For problems with more than two classes, Precision, Recall, and F1 are computed per class and then averaged. Three averaging strategies exist, each with a different philosophy: macro averages the per-class scores with equal weight (minority classes count as much as the majority), weighted averages them in proportion to each class's support, and micro pools the TP/FP/FN counts across all classes before computing the metric (for single-label problems, micro-F1 equals overall accuracy). The per-class scores below are averaged after the table.
| Class | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| Benign | 600 | 0.94 | 0.96 | 0.95 |
| Type-A | 300 | 0.82 | 0.78 | 0.80 |
| Type-B | 100 | 0.61 | 0.55 | 0.58 |
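Averaging the F1 column each way makes the philosophies visible (micro-averaging needs the raw TP/FP/FN counts, which this table does not show):
Macro F1 = (0.95 + 0.80 + 0.58) / 3 ≈ 0.777 (every class counts equally, so the weak Type-B drags the score down)
Weighted F1 = (600·0.95 + 300·0.80 + 100·0.58) / 1000 = 0.868 (dominated by the large Benign class)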
Metric Selection Guide: When to Use What
When in doubt: report Precision, Recall, F1, and MCC together; they cover different angles and provide a complete picture for stakeholders and engineering teams alike.
Python: All Metrics from Scratch
import math
# --- Confusion matrix values -----------------------------------
TP, FP, FN, TN = 45, 100, 5, 850
N_total = TP + FP + FN + TN # 1000
P = TP + FN # all actual positives = 50
# --- Core metrics ----------------------------------------------
accuracy = (TP + TN) / N_total
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
npv = TN / (TN + FN) # Negative Predictive Value
fpr = FP / (FP + TN) # False Positive Rate
miss_rate = FN / (FN + TP) # False Negative Rate
fdr = FP / (FP + TP) # False Discovery Rate
# --- F-scores --------------------------------------------------
def f_beta(precision, recall, beta=1):
    b2 = beta ** 2
    if precision + recall == 0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)
f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
f_half = f_beta(precision, recall, beta=0.5)
# --- Matthews Correlation Coefficient ---------------------------
mcc_num = (TP * TN) - (FP * FN)
mcc_den = math.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))
mcc = mcc_num / mcc_den if mcc_den > 0 else 0.0
# --- Balanced Accuracy ------------------------------------------
balanced_acc = (recall + specificity) / 2
# --- Cohen's Kappa ----------------------------------------------
p_obs = (TP + TN) / N_total
p_pos = ((TP + FP) / N_total) * ((TP + FN) / N_total)
p_neg = ((TN + FN) / N_total) * ((TN + FP) / N_total)
p_exp = p_pos + p_neg
kappa = (p_obs - p_exp) / (1 - p_exp)
# --- Print results ----------------------------------------------
print(f"{'Metric':<25} {'Value':>8}")
print("-" * 35)
for name, val in [
    ("Accuracy", accuracy),
    ("Precision (PPV)", precision),
    ("Recall (Sensitivity)", recall),
    ("Specificity (TNR)", specificity),
    ("NPV", npv),
    ("False Positive Rate", fpr),
    ("Miss Rate (FNR)", miss_rate),
    ("False Discovery Rate", fdr),
    ("F0.5 Score", f_half),
    ("F1 Score", f1),
    ("F2 Score", f2),
    ("MCC", mcc),
    ("Balanced Accuracy", balanced_acc),
    ("Cohen's Kappa", kappa),
]:
    print(f"{name:<25} {val:>8.4f}")
Python: Full Pipeline with scikit-learn
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
confusion_matrix,
classification_report,
precision_score, recall_score,
f1_score, fbeta_score,
matthews_corrcoef, cohen_kappa_score,
balanced_accuracy_score,
precision_recall_fscore_support
)
# --- 1. Load & split --------------------------------------------
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# --- 2. Train ----------------------------------------------------
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# --- 3. Full classification report --------------------------------
print("=== Classification Report ===")
print(classification_report(y_test, y_pred,
target_names=['Malignant', 'Benign']))
# --- 4. Individual metrics -----------------------------------------
print("=== Individual Metrics (positive class = Benign) ===")
print(f"Precision : {precision_score(y_test, y_pred):.4f}")
print(f"Recall : {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score : {f1_score(y_test, y_pred):.4f}")
print(f"F2 Score : {fbeta_score(y_test, y_pred, beta=2):.4f}")
print(f"F0.5 Score : {fbeta_score(y_test, y_pred, beta=0.5):.4f}")
print(f"MCC : {matthews_corrcoef(y_test, y_pred):.4f}")
print(f"Cohen's Kappa : {cohen_kappa_score(y_test, y_pred):.4f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")
# --- 5. Threshold tuning for high Recall ---------------------------
y_proba = model.predict_proba(X_test)[:, 1]
threshold = 0.3
y_pred_low = (y_proba >= threshold).astype(int)
print(f"\n=== Metrics at threshold = {threshold} ===")
print(f"Precision : {precision_score(y_test, y_pred_low):.4f}")
print(f"Recall : {recall_score(y_test, y_pred_low):.4f}")
print(f"F1 : {f1_score(y_test, y_pred_low):.4f}")
print(f"F2 : {fbeta_score(y_test, y_pred_low, beta=2):.4f}")
# --- 6. Multi-class macro / weighted / micro -----------------------
from sklearn.datasets import load_iris
Xi, yi = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(Xi, yi, test_size=0.25, random_state=0)
mc_model = LogisticRegression(max_iter=500).fit(Xtr, ytr)
yp = mc_model.predict(Xte)
print("\n=== Multi-class Averaging (Iris, 3 classes) ===")
for avg in ['macro', 'weighted', 'micro']:
    p, r, f, _ = precision_recall_fscore_support(yte, yp, average=avg)
    print(f"{avg:>10} -> P={p:.3f} R={r:.3f} F1={f:.3f}")
print("\n=== Per-Class Breakdown ===")
print(classification_report(yte, yp,
target_names=['setosa', 'versicolor', 'virginica']))
Golden Rules
Always run classification_report() and inspect each class's Precision, Recall, and F1 individually, especially for minority classes. As noted above, decide which error (FP or FN) is costlier in your domain before choosing a metric, and when in doubt report Precision, Recall, F1, and MCC together.