
Precision, Recall & F1 Score

A full guide to every essential classification evaluation metric. Covers the confusion matrix, precision, recall, specificity, NPV, F1, F-beta score, Matthews Correlation Coefficient, Cohen's Kappa, balanced accuracy, and multi-class macro / micro / weighted averaging, anchored by a clinical TB-screening story, diagrams, step-by-step worked calculations, a metric decision map, and complete Python implementations from scratch and with scikit-learn.

Section 01

The Story: Dr. Meera's TB Screening Crisis

When 89% Accuracy Hides a Disaster
Dr. Meera runs a tuberculosis (TB) screening programme in rural Maharashtra. Her clinic tests 1 000 patients per month. Only 50 of them actually have TB โ€” a 5% prevalence rate. She asks her data science intern, Rohan, to build a model that flags high-risk patients for expensive confirmatory testing.

Rohan builds a model, tests it, and comes back beaming: "Doctor, the model is 89.5% accurate!"

Dr. Meera is unconvinced. She asks Rohan to tell her specifically how many real TB patients the model missed. Rohan checks โ€” the model missed 5 out of 50 TB patients. Five people with active, infectious TB walked out undetected.

At the same time, it sent 100 healthy people for needlessly expensive follow-up tests.

"That is not 89.5% good," Dr. Meera says. "That is a public health failure. You need better metrics." Rohan reaches for Precision, Recall, and F1 Score.

Section 02

Why Accuracy Fails on Imbalanced Data

โŒ Rohan's "89.5% Accurate" Model
MetricValueWhat It Hides
Accuracy89.5%Looks great โ€” is misleading
Missed TB cases (FN)5 patientsWalking out undetected
Wasted tests (FP)100 patientsUnnecessary cost & stress

A model that predicts everyone is healthy gets (950/1000) = 95% accuracy, yet catches zero TB cases. Accuracy is useless here.

✅ What Honest Metrics Show
Metric | Value | What It Reveals
Precision | 31% | Most positive flags are wrong
Recall | 90% | Most real TB cases are caught
F1 Score | 46% | Overall weak performance

These metrics tell the true story: the model floods the lab with false alarms but does catch 90% of actual TB. Whether that trade-off is acceptable is a medical decision, not a model decision.


Section 03

The Confusion Matrix: The Foundation of All Metrics

Every classification metric is derived from a single 2×2 table called the Confusion Matrix. Understanding it is the first step.

🟥 Confusion Matrix: Dr. Meera's TB Screening Model (n = 1 000)

                    | Predicted: Healthy (0)    | Predicted: TB (1)         | Row Total
Actual: Healthy (0) | 850  True Negative (TN)   | 100  False Positive (FP)  | 950 (all Healthy)
Actual: TB (1)      | 5    False Negative (FN)  | 45   True Positive (TP)   | 50 (all TB)
Column Total        | 855 predicted negative    | 145 predicted positive    | 1 000

From the margins: Specificity = TN / 950 = 89.5% of healthy patients correctly cleared; Recall = TP / 50 = 90% of TB patients caught.

Cell | Full Name | Count | Plain-English Meaning
TN | True Negative | 850 | Healthy patients correctly cleared; no action needed
FP | False Positive | 100 | Healthy patients wrongly flagged (Type I error, false alarm); wasted tests, anxiety
FN | False Negative | 5 | TB patients missed (Type II error, most dangerous); they walk away untreated
TP | True Positive | 45 | TB patients correctly detected; the goal
⚠️
False Negative Is Often the Costliest Error

In disease detection, fraud, and safety systems, a False Negative (missed threat) is usually far more costly than a False Positive (false alarm). A missed TB patient spreads infection. A missed fraud claim costs thousands. A missed safety defect causes accidents. Always ask: "Which error is worse in my domain?", then choose and tune your metrics accordingly.
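Every metric in the rest of this guide can be read straight off these four cells. A minimal sketch, assuming scikit-learn is available and using hypothetical label arrays reconstructed from Dr. Meera's counts (not her real patient data), shows how the cells are typically extracted in code:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical reconstruction of the 1 000 screening results from the counts above:
# 850 TN, 100 FP, 5 FN, 45 TP (the ordering is arbitrary; only the counts matter).
y_true = np.array([0] * 850 + [0] * 100 + [1] * 5 + [1] * 45)   # 0 = healthy, 1 = TB
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)   # model's predictions

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 850 100 5 45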


Section 04

Precision: When You Raise the Alarm, How Often Are You Right?

Precision (Positive Predictive Value)
Precision = TP / (TP + FP)
Of all the patients the model flagged as TB-positive, what fraction actually had TB? Precision is about the quality of the predictions you made.
Dr. Meera's Model
Precision = 45 / (45 + 100) = 45 / 145 = 0.310 = 31%
For every 10 patients the model flags as TB, only ~3 actually have TB. 7 out of 10 alerts are false alarms. Low precision = crying wolf.
📊 Precision: Focus on the Predicted Positive Pool
Diagram: of the 145 patients flagged as TB (the predicted-positive pool), 100 are FP (healthy, wrongly flagged) and 45 are TP (real TB, correctly found). Precision = 45 / 145 = 31%. Precision asks: "Of everything I flagged, how much was real?" It focuses on FP.
💡
When Precision Matters Most

High precision is critical when false alarms are costly. Examples: a spam filter that moves legitimate emails to spam (FP = lost business email); a legal document retrieval system that returns irrelevant results (FP = wasted lawyer hours); a credit card block that stops a legitimate transaction (FP = angry customer). In all these cases, you want every positive prediction to be correct.


Section 05

Recall: Of All Real Positives, How Many Did You Find?

Recall (Sensitivity / True Positive Rate)
Recall = TP / (TP + FN)
Of all patients who actually had TB, what fraction did the model correctly detect? Recall is about how well you cover the real positive cases. Also called Sensitivity or Hit Rate.
Dr. Meera's Model
Recall = 45 / (45 + 5) = 45 / 50 = 0.90 = 90%
The model catches 9 out of 10 TB patients. It misses 1 in 10. For TB screening, this may still be unacceptably low: even 5 missed patients represent undetected active infections spreading in the community.
📊 Recall: Focus on the Actual Positive Pool
Diagram: of the 50 patients who actually have TB (the ground-truth positive pool), 45 are TP (correctly caught and sent for treatment) and 5 are FN (missed). Recall = 45 / 50 = 90%. Recall asks: "Of everything that was truly positive, how much did I find?" It focuses on FN.
💡
When Recall Matters Most

High recall is critical when missing a real positive is catastrophic. Examples: cancer screening (missing cancer = delayed treatment, death); fraud detection (missing fraud = financial loss); airport security (missing a threat = public safety risk); earthquake early warning (missing an alert = disaster). In these domains, you would rather have many false alarms than miss one real event.


Section 06

The Precision-Recall Trade-off

Precision and Recall pull in opposite directions. As you lower the classification threshold (flag more things as positive), you catch more real positives (recall rises) but also flag more false alarms (precision falls). As you raise the threshold, you become more selective (precision rises) but miss more real positives (recall falls).

📈 Precision & Recall vs. Classification Threshold
Chart: Precision, Recall, and F1 plotted against the classification threshold (0.0 to 1.0, scores 0% to 100%). Recall rises as the threshold falls, Precision rises as the threshold rises, and F1 peaks near a threshold of roughly 0.6, the sweet spot.

The F1 peak marks the balanced operating point. Moving left (lowering the threshold) boosts Recall at the cost of Precision. Moving right (raising the threshold) boosts Precision at the cost of Recall.
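A small sketch of this sweep, using a synthetic imbalanced dataset and a logistic regression purely for illustration (this is not the article's TB model); scikit-learn's precision_recall_curve evaluates every candidate threshold at once:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# Synthetic data with ~5% positives, mimicking a screening-style imbalance
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# One precision/recall pair per candidate threshold, sorted by threshold
prec, rec, thr = precision_recall_curve(y_te, proba)
for t in (0.2, 0.4, 0.6, 0.8):
    i = np.searchsorted(thr, t)          # index of the first threshold >= t
    print(f"threshold {t:.1f} -> precision {prec[i]:.2f}, recall {rec[i]:.2f}")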


Section 07

F1 Score: The Harmonic Mean of Precision and Recall

Neither Precision nor Recall alone is sufficient. We need a single metric that balances both. The F1 Score does this by taking their harmonic mean, not the arithmetic mean.

F1 Score (Harmonic Mean)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Equivalent to: F1 = 2·TP / (2·TP + FP + FN)
Ranges from 0 (worst) to 1 (perfect). Penalises extreme imbalances between P and R.
Dr. Meera's Model
F1 = 2 × (0.310 × 0.900) / (0.310 + 0.900) = 0.558 / 1.210 = 0.461
Despite 90% Recall, the F1 is only 46.1%, dragged down by the poor 31% Precision. A single high score cannot "average away" a very low partner score in F1.
🧮 Why Harmonic Mean and Not Arithmetic Mean?
Problem
Arithmetic mean = (Precision + Recall) / 2 = (0.310 + 0.900) / 2 = 0.605 = 60.5%
This gives an optimistic picture: it lets a very high Recall paper over a terrible Precision.
Extreme case
A model that predicts everyone is positive: Recall = 1.0 (catches every real positive), Precision = 50/1000 = 0.05 (5% positive rate).
Arithmetic mean = (1.0 + 0.05) / 2 = 0.525 = 52.5%, which looks decent!
Harmonic mean (F1) = 2·(1.0·0.05) / (1.0 + 0.05) = 0.095 = 9.5%, which is correctly terrible.
Why harmonic
The harmonic mean is dominated by the smaller of the two values. If either Precision or Recall is near zero, F1 is near zero, no matter how high the other is. It forces the model to be reasonably good at both.
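A quick numeric check of the extreme case above, in plain Python (the values are exactly the ones from the example, nothing else is assumed):

# "Predict everyone positive" extreme case: perfect recall, terrible precision
precision, recall = 0.05, 1.0

arithmetic = (precision + recall) / 2
harmonic = 2 * precision * recall / (precision + recall)   # this is exactly F1

print(f"arithmetic mean: {arithmetic:.3f}")   # 0.525, looks deceptively OK
print(f"harmonic mean  : {harmonic:.3f}")     # 0.095, correctly terrible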

Section 08

F-Beta Score: Tuning the Precision-Recall Balance

The F1 score weights Precision and Recall equally. But in the real world, one is often more important than the other. The F-Beta score lets you control that balance with a single parameter β.

F-Beta Score
F_β = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
β < 1: weights Precision more heavily (e.g. β = 0.5 → Precision twice as important).
β = 1: equal weight → standard F1 score.
β > 1: weights Recall more heavily (e.g. β = 2 → Recall twice as important).
β Value | Name | Recall Weight vs Precision | Use Case | Dr. Meera's Score
β = 0.5 | F0.5 | Precision 2× more important | Spam filter, legal search, product recommendation; false alarms are costly | F0.5 = (1+0.25)×(0.310×0.900) / (0.25×0.310+0.900) ≈ 0.357
β = 1 | F1 | Equal weight | General-purpose; balanced domains | 0.461
β = 2 | F2 | Recall 2× more important | TB / cancer screening, fraud, safety; missed positives are catastrophic | F2 = (1+4)×(0.310×0.900) / (4×0.310+0.900) ≈ 0.652
🎯
Dr. Meera's Decision

Dr. Meera consults with her epidemiologist and decides that for TB screening, a missed TB case is four times more costly than a false alarm. She adopts F2 as her primary metric. Under F2, the model scores 65.2%: still modest, but it now rewards the model's high recall while still accounting for the flood of false alarms. She lowers the threshold from 0.5 to 0.3 to push Recall above 95%, accepting lower Precision, and the F2 score guides whether the trade-off was worth it.
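The three scores in the table can be reproduced with scikit-learn's fbeta_score; a minimal sketch, again assuming label arrays reconstructed from the confusion-matrix counts rather than real patient data:

import numpy as np
from sklearn.metrics import fbeta_score

# Labels rebuilt from the counts: 850 TN, 100 FP, 5 FN, 45 TP (illustrative only)
y_true = np.array([0] * 850 + [0] * 100 + [1] * 5 + [1] * 45)
y_pred = np.array([0] * 850 + [1] * 100 + [0] * 5 + [1] * 45)

for beta in (0.5, 1, 2):
    score = fbeta_score(y_true, y_pred, beta=beta)
    print(f"F{beta}: {score:.3f}")   # F0.5: 0.357, F1: 0.462, F2: 0.652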


Section 09

Other Essential Metrics

Specificity: Recall for the Negative Class

Specificity (True Negative Rate / Selectivity)
Specificity = TN / (TN + FP)
Of all actually healthy patients, how many did the model correctly clear? Dr. Meera: 850 / (850+100) = 89.5%, so 9 in 10 healthy patients are correctly cleared.
Relationship to FPR
Specificity = 1 − FPR
The ROC curve's x-axis is the FPR, so Specificity is simply one minus it. A high Specificity means few false alarms. The trade-off: high Specificity often comes with lower Sensitivity (Recall).

All Metrics at a Glance: Using Dr. Meera's Numbers

Metric | Formula | Value | Meaning in TB Context
Accuracy | (TP+TN) / Total | 89.5% | Misleading; ignores class imbalance
Precision (PPV) | TP / (TP+FP) | 31.0% | Only ~3 in 10 positive flags are real TB
Recall (Sensitivity) | TP / (TP+FN) | 90.0% | Catches 9 of 10 real TB cases
Specificity (TNR) | TN / (TN+FP) | 89.5% | Clears 89.5% of healthy patients correctly
NPV | TN / (TN+FN) | 99.4% | If cleared, 99.4% chance actually healthy
F1 Score | 2·P·R / (P+R) | 46.1% | Balanced score; reveals the precision gap
F2 Score | 5·P·R / (4·P+R) | 65.2% | Recall-weighted; more appropriate here
False Discovery Rate | FP / (FP+TP) | 69.0% | 69% of positive alerts are false alarms
Miss Rate (FNR) | FN / (FN+TP) | 10.0% | 10% of real TB cases are missed
Fall-out (FPR) | FP / (FP+TN) | 10.5% | 10.5% of healthy patients wrongly flagged

Section 10

Advanced Metrics: MCC and Cohen's Kappa

When class imbalance is severe and you need a single metric that's genuinely robust, two metrics stand out above accuracy and F1.

🔬
Matthews Correlation Coefficient (MCC)
Considers all four cells of the confusion matrix simultaneously. Equivalent to the Pearson correlation between predicted and actual labels. Ranges from −1 (perfect inverse) to +1 (perfect prediction). 0 = random.

Formula: MCC = (TP·TN − FP·FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]

Dr. Meera's model:
MCC = (45·850 − 100·5) / √[(145)(50)(950)(855)]
    = (38 250 − 500) / √[5 888 812 500]
    = 37 750 / 76 739 ≈ 0.492
sklearn: matthews_corrcoef(y_true, y_pred)
📊
Cohen's Kappa (κ)
Measures agreement between the model and the truth, corrected for chance agreement. A model that randomly predicts the right class distribution gets κ = 0, not κ = accuracy.

Formula: κ = (p_observed − p_expected) / (1 − p_expected)

Where p_expected is what you'd get by chance given the class distributions.
κ < 0: worse than chance  |  κ = 0: chance level  |  κ = 1: perfect agreement

General guide: κ < 0.20 = poor, 0.20 to 0.40 = fair, 0.40 to 0.60 = moderate, 0.60 to 0.80 = substantial, 0.80 to 1.0 = almost perfect.
sklearn: cohen_kappa_score(y_true, y_pred)
⚖️
Balanced Accuracy
The arithmetic mean of Sensitivity (Recall) and Specificity. Unlike regular accuracy, it gives equal weight to both classes regardless of their frequency โ€” making it ideal for imbalanced problems.

Formula: Balanced Accuracy = (Recall + Specificity) / 2

Dr. Meera's model: (0.900 + 0.895) / 2 ≈ 0.897 = 89.7%

Much more honest than the raw 89.5% accuracy, and it reflects the model's genuine ability on both classes equally.
sklearn: balanced_accuracy_score(y_true, y_pred)
Metric | Range | Handles Imbalance? | Uses All 4 CM Cells? | Best For
Accuracy | 0 to 1 | No | Yes | Balanced datasets only
Precision | 0 to 1 | Partial | No (TN ignored) | FP cost is high
Recall | 0 to 1 | Partial | No (TN ignored) | FN cost is high
F1 | 0 to 1 | Partial | No (TN ignored) | General-purpose imbalance
MCC | −1 to 1 | Yes | Yes | Imbalanced binary; most robust
Cohen's Kappa | −1 to 1 | Yes | Yes | Agreement quality; multi-class
Balanced Accuracy | 0 to 1 | Yes | Yes | Interpretable imbalance metric

Section 11

Multi-Class Metrics: Macro, Micro & Weighted Averaging

For problems with more than 2 classes, Precision, Recall, and F1 are computed per class and then averaged. Three averaging strategies exist, each with a different philosophy.

📏
Macro Average
average='macro'
Compute the metric separately for each class, then take the unweighted mean. Treats every class as equally important regardless of how many samples it has.
✓ Fair to small / rare classes
✗ Can be dominated by poor performance on rare classes
🔢
Micro Average
average='micro'
Pool all TP, FP, FN across classes first, then compute the metric globally. Weighted by number of samples, so majority classes dominate.
✓ Overall system performance across all predictions
✗ Hides poor performance on minority classes
⚖️
Weighted Average
average='weighted'
Compute the metric per class, then take the weighted mean by each class's support (number of true samples). Balances micro and macro.
✓ Accounts for class imbalance, familiar to stakeholders
✗ Rare class failures are still diluted by frequent classes
🧮 Worked Example: 3-Class Problem (Tumour Type: Benign / Type-A / Type-B)
Per class
Class | Support | Precision | Recall | F1
Benign | 600 | 0.94 | 0.96 | 0.95
Type-A | 300 | 0.82 | 0.78 | 0.80
Type-B | 100 | 0.61 | 0.55 | 0.58
Macro F1
(0.95 + 0.80 + 0.58) / 3 = 0.777. Type-B's poor performance drags the average down significantly, even though it has few samples. Good for spotting minority-class failures.
Weighted F1
(600×0.95 + 300×0.80 + 100×0.58) / 1000 = (570 + 240 + 58) / 1000 = 0.868. Looks much better because Benign (600 samples) dominates. Use when majority-class performance matters most.
Conclusion
Always report all three in multi-class settings, plus the per-class breakdown. If Macro F1 and Weighted F1 diverge strongly, your minority class is suffering.
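The same averaging arithmetic in a few lines of plain Python, using only the per-class values from the table above:

# Macro vs weighted F1 from the per-class scores in the worked example
f1 = {'Benign': 0.95, 'Type-A': 0.80, 'Type-B': 0.58}
support = {'Benign': 600, 'Type-A': 300, 'Type-B': 100}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"Macro F1   : {macro_f1:.3f}")      # 0.777, the rare-class failure is visible
print(f"Weighted F1: {weighted_f1:.3f}")   # 0.868, diluted by the majority class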

Section 12

Metric Selection Guide: When to Use What

๐Ÿ—บ๏ธ Decision Map โ€” Choosing the Right Metric
Binary Classification? YES NO (Multi-class) Classes balanced? (ratio โ‰ค 4:1 approx) Accuracy + F1 Both reliable here YES Imbalanced Which error costs more? NO FN worse Recall, F2, MCC (TB, Fraud, Safety) FN costly FP worse Precision, F0.5 (Spam, Recommend) FP costly Classes balanced? Weighted F1 or Accuracy YES Macro F1 + MCC Per-class breakdown NO MCC & Cohen's Kappa work well in all cases above

When in doubt: report Precision, Recall, F1, and MCC together. They cover different angles and provide a complete picture for stakeholders and engineering teams alike.


Section 13

Python: All Metrics from Scratch

import math

# ── Confusion matrix values ──────────────────────────────────
TP, FP, FN, TN = 45, 100, 5, 850
N_total = TP + FP + FN + TN           # 1000
P = TP + FN                           # all actual positives = 50

# ── Core metrics ─────────────────────────────────────────────
accuracy         = (TP + TN) / N_total
precision        = TP / (TP + FP)
recall           = TP / (TP + FN)
specificity      = TN / (TN + FP)
npv              = TN / (TN + FN)     # Negative Predictive Value
fpr              = FP / (FP + TN)     # False Positive Rate
miss_rate        = FN / (FN + TP)     # False Negative Rate
fdr              = FP / (FP + TP)     # False Discovery Rate

# ── F-scores ─────────────────────────────────────────────────
def f_beta(precision, recall, beta=1):
    b2 = beta ** 2
    if precision + recall == 0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
f_half = f_beta(precision, recall, beta=0.5)

# ── Matthews Correlation Coefficient ─────────────────────────
mcc_num  = (TP * TN) - (FP * FN)
mcc_den  = math.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))
mcc      = mcc_num / mcc_den if mcc_den > 0 else 0.0

# ── Balanced Accuracy ────────────────────────────────────────
balanced_acc = (recall + specificity) / 2

# ── Cohen's Kappa ────────────────────────────────────────────
p_obs = (TP + TN) / N_total
p_pos = ((TP + FP) / N_total) * ((TP + FN) / N_total)
p_neg = ((TN + FN) / N_total) * ((TN + FP) / N_total)
p_exp = p_pos + p_neg
kappa = (p_obs - p_exp) / (1 - p_exp)

# ── Print results ────────────────────────────────────────────
print(f"{'Metric':<25} {'Value':>8}")
print("-" * 35)
for name, val in [
    ("Accuracy",         accuracy),
    ("Precision (PPV)",   precision),
    ("Recall (Sensitivity)", recall),
    ("Specificity (TNR)", specificity),
    ("NPV",              npv),
    ("False Positive Rate", fpr),
    ("Miss Rate (FNR)",  miss_rate),
    ("F0.5 Score",       f_half),
    ("F1  Score",        f1),
    ("F2  Score",        f2),
    ("MCC",              mcc),
    ("Balanced Accuracy", balanced_acc),
    ("Cohen's Kappa",    kappa),
]:
    print(f"{name:<25} {val:>8.4f}")
Output
Metric                       Value
-----------------------------------
Accuracy                    0.8950
Precision (PPV)             0.3103
Recall (Sensitivity)        0.9000
Specificity (TNR)           0.8947
NPV                         0.9942
False Positive Rate         0.1053
Miss Rate (FNR)             0.1000
F0.5 Score                  0.3571
F1  Score                   0.4615
F2  Score                   0.6522
MCC                         0.4919
Balanced Accuracy           0.8974
Cohen's Kappa               0.4183

Section 14

Python: Full Pipeline with scikit-learn

import numpy as np
from sklearn.datasets        import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing   import StandardScaler
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import (
    confusion_matrix,
    classification_report,
    precision_score, recall_score,
    f1_score, fbeta_score,
    matthews_corrcoef, cohen_kappa_score,
    balanced_accuracy_score,
    precision_recall_fscore_support
)

# ── 1. Load & split ──────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# ── 2. Train ─────────────────────────────────────────────────
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# ── 3. Full classification report ────────────────────────────
print("=== Classification Report ===")
print(classification_report(y_test, y_pred,
      target_names=['Malignant', 'Benign']))

# ── 4. Individual metrics ────────────────────────────────────
print("=== Individual Metrics (positive class = Benign) ===")
print(f"Precision        : {precision_score(y_test, y_pred):.4f}")
print(f"Recall           : {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score         : {f1_score(y_test, y_pred):.4f}")
print(f"F2 Score         : {fbeta_score(y_test, y_pred, beta=2):.4f}")
print(f"F0.5 Score       : {fbeta_score(y_test, y_pred, beta=0.5):.4f}")
print(f"MCC              : {matthews_corrcoef(y_test, y_pred):.4f}")
print(f"Cohen's Kappa    : {cohen_kappa_score(y_test, y_pred):.4f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.4f}")

# ── 5. Threshold tuning for high Recall ──────────────────────
y_proba    = model.predict_proba(X_test)[:, 1]
threshold  = 0.3
y_pred_low = (y_proba >= threshold).astype(int)

print(f"\n=== Metrics at threshold = {threshold} ===")
print(f"Precision : {precision_score(y_test, y_pred_low):.4f}")
print(f"Recall    : {recall_score(y_test, y_pred_low):.4f}")
print(f"F1        : {f1_score(y_test, y_pred_low):.4f}")
print(f"F2        : {fbeta_score(y_test, y_pred_low, beta=2):.4f}")

# ── 6. Multi-class macro / weighted / micro ──────────────────
from sklearn.datasets import load_iris
Xi, yi = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(Xi, yi, test_size=0.25, random_state=0)
mc_model = LogisticRegression(max_iter=500).fit(Xtr, ytr)
yp = mc_model.predict(Xte)

print("\n=== Multi-class Averaging (Iris, 3 classes) ===")
for avg in ['macro', 'weighted', 'micro']:
    p, r, f, _ = precision_recall_fscore_support(yte, yp, average=avg)
    print(f"{avg:>10} โ†’ P={p:.3f}  R={r:.3f}  F1={f:.3f}")

print("\n=== Per-Class Breakdown ===")
print(classification_report(yte, yp,
      target_names=['setosa', 'versicolor', 'virginica']))
Output
=== Classification Report ===
              precision    recall  f1-score   support

   Malignant       0.98      0.95      0.96        43
      Benign       0.97      0.99      0.98        71

    accuracy                           0.97       114

=== Individual Metrics (positive class = Benign) ===
Precision        : 0.9722
Recall           : 0.9859
F1 Score         : 0.9790
F2 Score         : 0.9831
F0.5 Score       : 0.9749
MCC              : 0.9451
Cohen's Kappa    : 0.9403
Balanced Accuracy: 0.9680

=== Metrics at threshold = 0.3 ===
Precision : 0.9400
Recall    : 1.0000
F1        : 0.9691
F2        : 0.9901

=== Multi-class Averaging (Iris, 3 classes) ===
     macro → P=0.974  R=0.974  F1=0.974
  weighted → P=0.974  R=0.974  F1=0.974
     micro → P=0.974  R=0.974  F1=0.974

=== Per-Class Breakdown ===
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       0.94      1.00      0.97        16
   virginica       1.00      0.94      0.97         9

    accuracy                           0.97        38

Section 15

Golden Rules

🎯 Precision, Recall & F1: Key Rules
1
Never report accuracy alone on imbalanced data. A 95% accurate model on a 95:5 dataset can be completely useless. Always accompany accuracy with at least Precision, Recall, and F1, or switch to MCC and Balanced Accuracy as your primary metrics.
2
Define your error costs before choosing a metric. In fraud detection, a missed fraud (FN) may cost 10× more than a false alarm (FP). In spam filtering, a mislabelled legitimate email (FP) costs more than missed spam (FN). This cost ratio should directly determine whether you optimise for Recall or Precision, and which β to use in F-beta.
3
Tune the decision threshold; do not just accept the default 0.5. The default threshold maximises F1 only by coincidence. For high-Recall requirements, lower it to 0.3 or 0.2 and watch what happens to Precision. Make this decision on a validation set and evaluate once on the test set.
4
Use MCC as a single-number summary on binary problems. Unlike F1, MCC uses all four cells of the confusion matrix. Unlike Accuracy, it is not fooled by class imbalance. And it has a direct interpretation as the correlation between predicted and actual labels. It is arguably the most information-dense single metric for binary classification.
5
In multi-class problems, always show the per-class breakdown. Macro and Weighted averages hide whether the model fails on a specific class. Print the full classification_report() and inspect each class's Precision, Recall, and F1 individually โ€” especially for minority classes.
6
Precision and Recall are inversely coupled; only the business can break the tie. Data science can compute every trade-off curve. But the decision of whether it is better to miss 5 TB patients or send 100 healthy people for unnecessary tests is a medical ethics decision, not a data science decision. Always present the trade-off curve to domain experts and let them pick the operating point.