
ROC Curve & AUC

A full guide to the ROC Curve and AUC metric. Covers TPR, FPR, the threshold-performance trade-off, step-by-step curve construction with a 10-transaction worked example, AUC calculation via the trapezoidal rule, multi-model visual comparison, operating point selection strategies (Youden's J, cost-sensitive, F-beta), ROC vs Precision-Recall curve with the imbalanced-data trap, multi-class One-vs-Rest ROC, Python from scratch, and a complete scikit-learn pipeline

Section 01

The Story: Aisha's Three Fraud Models

A Data Scientist with Three Models and One Decision
Aisha is a senior data scientist at a fintech startup in Lagos. Her team has just built three different models to detect credit card fraud: a Logistic Regression, a Random Forest, and a Gradient Boosting model. Each model gives every transaction a fraud probability score between 0 and 1.

Her manager asks: "Which model should we deploy? And what threshold should we use to flag a transaction as fraud?"

Aisha knows accuracy won't answer this: only 0.5% of transactions are fraudulent, so a model that always says "Not Fraud" gets 99.5% accuracy while being completely useless. She needs a metric that evaluates model quality across all possible thresholds at once. That metric is the ROC Curve and its summary statistic, the AUC.

Section 02

What Is the ROC Curve?

The ROC Curve (Receiver Operating Characteristic Curve) is a graph that shows how a classifier performs at every possible classification threshold, from 0.0 (classify everything as fraud) to 1.0 (classify nothing as fraud).

At each threshold, the model produces a different confusion matrix. The ROC curve plots two numbers extracted from that matrix:

X-Axis: False Positive Rate (FPR)
FPR = FP / (FP + TN) = FP / N
Of all actual legitimate transactions, what fraction did the model wrongly flag as fraud? Also called Fall-out or 1 − Specificity. Lower is better.
Y-Axis: True Positive Rate (TPR)
TPR = TP / (TP + FN) = TP / P
Of all actual fraudulent transactions, what fraction did the model correctly catch? Also called Recall or Sensitivity. Higher is better.
💡
The Threshold-Performance Trade-off

As you lower the fraud threshold (e.g. from 0.7 → 0.3), the model flags more transactions. This catches more real fraud (TPR rises) but also incorrectly flags more legitimate transactions (FPR also rises). The ROC curve traces this exact trade-off as the threshold sweeps from 1 down to 0. A perfect model reaches TPR = 1 while keeping FPR = 0.
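
To see the trade-off in numbers, here is a minimal sketch on a handful of made-up labels and scores (illustrative values, not Aisha's data): lowering the threshold raises TPR and FPR together.

import numpy as np

# Made-up labels (1 = fraud) and scores -- for illustration only
y_true   = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_scores = np.array([0.9, 0.8, 0.65, 0.55, 0.4, 0.35, 0.2, 0.1])

for threshold in (0.7, 0.3):                      # lower the threshold...
    y_pred = (y_scores >= threshold).astype(int)
    tp  = np.sum((y_pred == 1) & (y_true == 1))
    fp  = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)                # fraud caught
    fpr = fp / np.sum(y_true == 0)                # legit wrongly flagged
    print(f"threshold={threshold}: TPR={tpr:.2f}  FPR={fpr:.2f}")
# threshold=0.7: TPR=0.33  FPR=0.20
# threshold=0.3: TPR=1.00  FPR=0.60  -> both rates rise as the threshold drops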


Section 03

Building the ROC Curve Step by Step

Let's use a tiny dataset: 10 transactions, 5 fraudulent (F) and 5 legitimate (L). A model has assigned each a fraud probability. We sort by probability descending and sweep the threshold:

Rank | Transaction | Actual Label | Fraud Probability (p̂) | Action at this threshold
1  | T-09 | Fraud (F) | 0.95 | ← Lower threshold to here → predict T-09 as Fraud
2  | T-03 | Legit (L) | 0.88 | ← Predict T-09, T-03 as Fraud
3  | T-07 | Fraud (F) | 0.82 | ← Predict top-3 as Fraud
4  | T-01 | Fraud (F) | 0.71 | ← Predict top-4 as Fraud
5  | T-06 | Legit (L) | 0.60 | ← Predict top-5 as Fraud
6  | T-10 | Legit (L) | 0.45 | ← Predict top-6 as Fraud
7  | T-04 | Fraud (F) | 0.38 | ← Predict top-7 as Fraud
8  | T-02 | Legit (L) | 0.25 | ← Predict top-8 as Fraud
9  | T-05 | Fraud (F) | 0.15 | ← Predict top-9 as Fraud
10 | T-08 | Legit (L) | 0.05 | ← Predict all as Fraud

Total Positives (Fraud): P = 5  |  Total Negatives (Legit): N = 5
At each rank boundary we record the cumulative TP, FP, and compute TPR = TP/5, FPR = FP/5:

Threshold ≥ | Predicted as Fraud | TP | FP | FPR = FP/5 | TPR = TP/5 | ROC Point
1.00 (nothing) | -          | 0 | 0 | 0.00 | 0.00 | (0.00, 0.00) ← Start
0.95           | T-09 (F)   | 1 | 0 | 0.00 | 0.20 | (0.00, 0.20)
0.88           | + T-03 (L) | 1 | 1 | 0.20 | 0.20 | (0.20, 0.20)
0.82           | + T-07 (F) | 2 | 1 | 0.20 | 0.40 | (0.20, 0.40)
0.71           | + T-01 (F) | 3 | 1 | 0.20 | 0.60 | (0.20, 0.60)
0.60           | + T-06 (L) | 3 | 2 | 0.40 | 0.60 | (0.40, 0.60)
0.45           | + T-10 (L) | 3 | 3 | 0.60 | 0.60 | (0.60, 0.60)
0.38           | + T-04 (F) | 4 | 3 | 0.60 | 0.80 | (0.60, 0.80)
0.25           | + T-02 (L) | 4 | 4 | 0.80 | 0.80 | (0.80, 0.80)
0.15           | + T-05 (F) | 5 | 4 | 0.80 | 1.00 | (0.80, 1.00)
0.00 (all)     | + T-08 (L) | 5 | 5 | 1.00 | 1.00 | (1.00, 1.00) ← End
🎯
Key Observation

When a Fraud case is added, the point moves straight up (TPR increases, FPR stays the same: good news). When a Legit case is added, the point moves straight right (FPR increases, TPR stays the same: bad news). A model that perfectly ranks all frauds above all legitimate transactions would reach (0, 1.0) before moving right at all, giving an AUC of 1.0.
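
The same staircase falls straight out of a cumulative sum over the score-sorted labels. A short sketch that reproduces the ROC points in the table above (labels listed in the table's order, Fraud = 1, Legit = 0):

import numpy as np

# Labels in score order from the table: T-09, T-03, T-07, T-01, T-06, T-10, T-04, T-02, T-05, T-08
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])   # 1 = Fraud (up-move), 0 = Legit (right-move)

tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))
fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / (1 - labels).sum()))

for x, y in zip(fpr, tpr):
    print(f"({x:.2f}, {y:.2f})")    # matches the ROC Point column above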


Section 04

The ROC Curve - Visualised

📈 ROC Curve - Aisha's 10-Transaction Example (AUC = 0.64)
[Figure: step plot of TPR (y-axis) against FPR (x-axis) with the random-guess diagonal (AUC = 0.5). Vertical steps mark frauds caught (p̂ = 0.95, 0.82, 0.71, 0.38, 0.15); horizontal steps mark legitimate transactions wrongly flagged. Moving up = fraud caught (good); moving right = legit wrongly flagged (bad).]

Each vertical jump corresponds to catching a real fraud. Each horizontal jump is a false alarm on a legitimate transaction. AUC = 0.64 means this model ranks a random fraud above a random legitimate transaction 64% of the time.


Section 05

Calculating AUC - The Trapezoidal Rule

AUC is the area under the ROC curve, computed using the trapezoidal rule: for each consecutive pair of points (FPR₁, TPR₁) and (FPR₂, TPR₂), the area of the trapezoid beneath them is:

Trapezoidal Area Between Two ROC Points
A = (FPR₂ − FPR₁) × (TPR₁ + TPR₂) / 2
Sum this over all consecutive point pairs. Only horizontal segments contribute area (vertical jumps have zero width).
Equivalent Probabilistic Interpretation
AUC = P(score(fraud) > score(legit))
Probability that the model ranks a random fraudulent transaction higher than a random legitimate one. AUC = 0.64 → correct ranking 64% of the time.
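
You can verify the probabilistic reading directly by counting (fraud, legit) score pairs. A small sketch on the 10-transaction example (ties, if any, count as half a win):

import numpy as np

# 10-transaction example: 1 = fraud, 0 = legit
y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])

fraud = y_scores[y_true == 1]
legit = y_scores[y_true == 0]

# Fraction of (fraud, legit) pairs ranked correctly; ties count as half
wins = sum(float(f > l) + 0.5 * float(f == l) for f in fraud for l in legit)
print(wins / (len(fraud) * len(legit)))    # 0.64 -- identical to the trapezoidal area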
🧮 AUC Calculation - 10-Transaction Example (Trapezoidal Rule)
Segments
Only segments where FPR changes contribute area. Listing all horizontal moves:

Seg 1: (0.00, 0.20) → (0.20, 0.20): width = 0.20, avg TPR = 0.20
Area = 0.20 × 0.20 = 0.040

Seg 2: (0.20, 0.60) → (0.40, 0.60): width = 0.20, avg TPR = 0.60
Area = 0.20 × 0.60 = 0.120

Seg 3: (0.40, 0.60) → (0.60, 0.60): width = 0.20, avg TPR = 0.60
Area = 0.20 × 0.60 = 0.120

Seg 4: (0.60, 0.80) → (0.80, 0.80): width = 0.20, avg TPR = 0.80
Area = 0.20 × 0.80 = 0.160

Seg 5: (0.80, 1.00) → (1.00, 1.00): width = 0.20, avg TPR = 1.00
Area = 0.20 × 1.00 = 0.200
Total AUC
0.040 + 0.120 + 0.120 + 0.160 + 0.200 = 0.640
The model has an AUC of 0.64: better than random (0.5) but far from a production-ready model (0.90+).

Section 06

Multi-Model Comparison - The Real Power of ROC

The true strength of ROC curves is comparing multiple models simultaneously, across all thresholds at once. Aisha plots all three of her models on the same axes:

📊 ROC Comparison - Three Fraud Detection Models
[Figure: three ROC curves on shared axes, Gradient Boosting (AUC = 0.93), Random Forest (AUC = 0.79), and Logistic Regression (AUC = 0.64), together with the random-guess diagonal (AUC = 0.5) and the perfect classifier corner (AUC = 1.0).]

The Gradient Boosting model dominates at every threshold: its curve is highest everywhere. Aisha should deploy this model. The Random Forest is a solid backup. Logistic Regression lags at higher FPR values, suggesting it misses more fraud at typical operating thresholds.

🎯
How to Read a Multi-Model ROC Plot

A model whose ROC curve is higher and to the left at your operating FPR is better. If Aisha's fraud team can investigate 10% of all transactions (FPR ≤ 0.10), she reads the y-value at FPR = 0.10: Gradient Boost ≈ 78% recall, Random Forest ≈ 45% recall, Logistic Regression ≈ 20% recall. Clear winner.
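
Reading a value off the curve in code is a one-liner once you have the curve arrays. A sketch using scikit-learn's roc_curve on the 10-transaction example in place of a real test set (the recall_at_fpr helper is illustrative):

import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_scores, fpr_budget):
    """Highest TPR achievable while keeping FPR within the given budget."""
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    return tpr[fpr <= fpr_budget].max()

# 10-transaction example; in practice pass your test-set labels and model scores
y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])
print(recall_at_fpr(y_true, y_scores, fpr_budget=0.20))   # 0.6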


Section 07

Interpreting the AUC Score

AUC = 0.50
🎲
  • Equivalent to random guessing
  • The diagonal line on the ROC plot
  • Model has learned nothing useful
  • Your floor: always beat this
AUC 0.50–0.70
⚠️
  • Poor discrimination
  • Better than random but barely
  • Revisit features & data quality
  • Rarely production-worthy
AUC 0.70–0.90
👍
  • Good to very good discrimination
  • Acceptable for many real problems
  • Tune threshold to fit your domain
  • Most production models live here
AUC 0.90–1.00
🏆
  • Excellent discrimination
  • Always check for data leakage
  • Future data can always degrade this
  • AUC = 1.0 → perfect (too perfect?)
⚠️
AUC Below 0.5 - A Surprising Finding

An AUC below 0.5 is not automatically bad: it means the model is consistently wrong. Flip its scores (use 1 − score, so it predicts 1 where it used to say 0) and you get a model with AUC = 1 − original AUC. This usually means a bug in your label encoding (0 and 1 swapped) or in your feature pipeline.
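
A two-line check of the flip on the 10-transaction scores (roc_auc_score is from scikit-learn):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])

print(roc_auc_score(y_true, y_scores))        # 0.64
print(roc_auc_score(y_true, 1 - y_scores))    # 0.36 = 1 - 0.64, a consistently wrong ranker
# If you see AUC < 0.5 in practice, check the label encoding before celebrating the flip.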


Section 08

ROC-AUC vs Precision-Recall AUC - When Each Wins

The ROC curve is not always the right tool. On severely imbalanced datasets, like fraud (0.1% positive rate), the ROC curve can be misleadingly optimistic. The Precision-Recall (PR) curve is more honest in those cases.

📈 ROC Curve - Axes
Axis | Formula | Uses
X (FPR) | FP / (FP + TN) | TN, the large negatives pool
Y (TPR / Recall) | TP / (TP + FN) | TP and FN only

On imbalanced data, TN is huge. Even many FPs produce a tiny FPR, making the curve look excellent while Precision is terrible.

📊 PR Curve - Axes
Axis | Formula | Uses
X (Recall) | TP / (TP + FN) | No TN involved
Y (Precision) | TP / (TP + FP) | No TN involved

Because TN never appears, PR-AUC stays sensitive to the flood of false positives that imbalance produces. A bad model that generates many FPs will show up clearly here even when ROC-AUC looks fine.

Situation | Use ROC-AUC | Use PR-AUC
Class ratio ~50:50 | ✓ Best choice | Also works
Mild imbalance (10:1) | ✓ Good choice | Also informative
Severe imbalance (100:1 or worse) | Can be misleading | ✓ Preferred
Comparing models across teams | ✓ Standard choice | Less universal
Care more about Precision | Less relevant | ✓ Shows precision clearly
Fraud / medical rare-event detection | Use with caution | ✓ Much more informative
When ROC Lies to You
Aisha's bank has 100 000 transactions per day. Only 50 are fraudulent (0.05% rate). Her Gradient Boosting model flags 500 transactions as fraud. Of those 500, only 40 are real fraud.

Confusion matrix: TP = 40, FP = 460, FN = 10, TN = 99,490
ROC-AUC: excellent, ~0.93, because FPR = 460 / 99,950 = 0.0046 (tiny denominator).
Precision: 40/500 = 8%: for roughly every 12 alerts sent to the fraud team, 11 are false alarms.

ROC says "great model." PR curve says "this model is drowning your fraud team in noise." In banking operations, PR-AUC is the decision metric.
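
The same divergence in a few lines of arithmetic, straight from the counts above:

# Aisha's daily counts: TP=40, FP=460, FN=10, TN=99,490
TP, FP, FN, TN = 40, 460, 10, 99_490

fpr       = FP / (FP + TN)    # 460 / 99,950 ≈ 0.0046 -- looks harmless on the ROC x-axis
recall    = TP / (TP + FN)    # 40 / 50  = 0.80
precision = TP / (TP + FP)    # 40 / 500 = 0.08 -- 92% of alerts are false alarms

print(f"FPR = {fpr:.4f}   Recall = {recall:.2f}   Precision = {precision:.2%}")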

Section 09

Choosing the Right Operating Point

The ROC curve shows all possible thresholds. Choosing which point to operate at requires a business decision. Four common strategies (a threshold-selection sketch follows the list):

📏
Youden's J Statistic
Maximises TPR − FPR. The point farthest from the diagonal. Balanced: treats FP and FN equally. Good default when costs are unknown.
J = TPR + TNR − 1 = Sensitivity + Specificity − 1
💰
Cost-Sensitive Threshold
Minimise cost(FP)·FPR + cost(FN)·(1 − TPR). Define the cost of a false alarm vs a missed fraud. In fraud detection, missing a £10,000 fraud costs far more than one wasted analyst hour.
Threshold = cost(FP) / (cost(FP) + cost(FN))
🎯
Operational Constraint
Set a hard constraint based on capacity. "Our team can review 200 alerts/day" → select the threshold that produces exactly 200 positives. Or "We must catch at least 90% of fraud" → fix TPR ≥ 0.90 and minimise FPR.
Read TPR at your max tolerable FPR from the curve
⚖️
F-Beta Maximisation
Choose the threshold that maximises the F-beta score. β < 1 weights Precision more; β > 1 weights Recall more. F2 (β = 2) is common in fraud and medical screening where missing a positive is twice as costly as a false alarm.
F_β = (1 + β²) · P·R / (β²·P + R)
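
A minimal sketch of two of these strategies, the cost-sensitive search over ROC thresholds and F2 maximisation over PR thresholds. The validation labels and scores below are synthetic stand-ins; in practice use a held-out validation set, never the test set.

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

# Synthetic validation labels/scores -- stand-ins for a real validation split
rng    = np.random.default_rng(0)
y_val  = rng.integers(0, 2, size=1000)
scores = np.clip(0.3 * y_val + rng.normal(0.4, 0.2, size=1000), 0, 1)

# Cost-sensitive: minimise cost(FP)*FPR + cost(FN)*(1 - TPR) over ROC thresholds
fpr, tpr, thr = roc_curve(y_val, scores)
cost_fp, cost_fn = 1.0, 10.0                     # a missed fraud costs 10x a false alarm
expected_cost = cost_fp * fpr + cost_fn * (1 - tpr)
print("Cost-optimal threshold:", thr[np.argmin(expected_cost)])

# F2 maximisation: recall weighted twice as heavily as precision
prec, rec, thr_pr = precision_recall_curve(y_val, scores)
beta = 2
f_beta = (1 + beta**2) * prec * rec / np.maximum(beta**2 * prec + rec, 1e-12)
print("F2-optimal threshold  :", thr_pr[np.argmax(f_beta[:-1])])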

Section 10

Multi-Class ROC - One-vs-Rest (OvR)

For problems with more than two classes (e.g. classifying a tumour as Benign, Malignant Type A, or Malignant Type B), the standard binary ROC cannot be directly applied. The most common extension is One-vs-Rest (OvR):

01
Train K binary classifiers
For each class k (k = 1 … K), treat class k as the "positive" and all other classes as the "negative." Get a probability score for each class. For 3 classes → 3 ROC curves.
02
Plot a ROC curve per class
Compute TPR and FPR for each class separately: "Is this Malignant Type A vs all others?" This gives you K ROC curves, each with its own AUC โ€” so you can see which class your model struggles with.
03
Average the AUC scores
Macro-average AUC: unweighted mean, treats all classes equally.
Weighted-average AUC: weighted by class support, better for imbalanced multi-class.
scikit-learn: roc_auc_score(y, proba, multi_class='ovr', average='macro')
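
A short OvR sketch on Iris as a stand-in 3-class problem: binarise the labels, score one AUC per class, then macro-average (label_binarize and roc_auc_score are from scikit-learn; Section 12 shows the same macro AUC inside the full pipeline).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

clf   = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                  # one column of scores per class

y_bin = label_binarize(y_te, classes=[0, 1, 2])  # One-vs-Rest targets
for k in range(3):
    print(f"Class {k} vs rest: AUC = {roc_auc_score(y_bin[:, k], proba[:, k]):.3f}")

print("Macro AUC:", roc_auc_score(y_te, proba, multi_class='ovr', average='macro'))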

Section 11

Python - ROC Curve & AUC from Scratch

import numpy as np

# ── Build ROC curve manually ─────────────────────────────────────
def roc_curve_manual(y_true, y_scores):
    """Compute TPR and FPR at every unique threshold."""
    thresholds = np.sort(np.unique(y_scores))[::-1]  # descending
    P = np.sum(y_true == 1)                       # total positives
    N = np.sum(y_true == 0)                       # total negatives

    tpr_list = [0.0]
    fpr_list = [0.0]

    for thresh in thresholds:
        y_pred = (y_scores >= thresh).astype(int)
        TP = np.sum((y_pred == 1) & (y_true == 1))
        FP = np.sum((y_pred == 1) & (y_true == 0))
        tpr_list.append(TP / P)
        fpr_list.append(FP / N)

    # the lowest threshold already labels every transaction positive, so the curve ends at (1, 1)
    return np.array(fpr_list), np.array(tpr_list)

# ── AUC via trapezoidal rule ─────────────────────────────────────
def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using np.trapz."""
    return np.trapz(tpr, fpr)   # integrates TPR over FPR

# ── 10-transaction example ───────────────────────────────────────
y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60,
                      0.45, 0.38, 0.25, 0.15, 0.05])

fpr, tpr = roc_curve_manual(y_true, y_scores)
auc      = auc_trapezoid(fpr, tpr)

print("FPR points:", fpr.round(2))
print("TPR points:", tpr.round(2))
print(f"AUC         : {auc:.4f}")      # 0.6400

# ── Youden's J: find optimal threshold ───────────────────────────
thresholds = np.sort(np.unique(y_scores))[::-1]
j_scores   = tpr[1:] - fpr[1:]         # TPR - FPR at each threshold
best_idx   = np.argmax(j_scores)
best_thresh = thresholds[best_idx]

print(f"Best threshold (Youden's J): {best_thresh:.2f}")
print(f"  TPR at best: {tpr[best_idx+1]:.2f}  "
      f"FPR at best: {fpr[best_idx+1]:.2f}")
Output
FPR points: [0.  0.  0.2 0.2 0.2 0.4 0.6 0.6 0.8 0.8 1. ]
TPR points: [0.  0.2 0.2 0.4 0.6 0.6 0.6 0.8 0.8 1.  1. ]
AUC         : 0.6400
Best threshold (Youden's J): 0.71
  TPR at best: 0.60  FPR at best: 0.20

Section 12

Python - Full Pipeline with scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets        import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing   import StandardScaler
from sklearn.linear_model    import LogisticRegression
from sklearn.ensemble        import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics         import (
    roc_curve, roc_auc_score,
    precision_recall_curve, average_precision_score,
    RocCurveDisplay
)

# ── 1. Create imbalanced dataset ─────────────────────────────────
X, y = make_classification(
    n_samples=5000, n_features=20,
    weights=[0.95, 0.05],         # 95% legit, 5% fraud
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# ── 2. Train three models ────────────────────────────────────────
models = {
    'Logistic Regression'    : LogisticRegression(C=1.0, max_iter=500),
    'Random Forest'          : RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting'      : GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)

# ── 3. Compute & plot ROC curves ─────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

colors = ['#6366f1', '#f59e0b', '#34d399']

for (name, model), color in zip(models.items(), colors):
    y_proba = model.predict_proba(X_test)[:, 1]

    # ROC
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc          = roc_auc_score(y_test, y_proba)
    axes[0].plot(fpr, tpr, color=color, lw=2,
                 label=f"{name} (AUC={auc:.3f})")

    # Precision-Recall
    prec, rec, _ = precision_recall_curve(y_test, y_proba)
    ap            = average_precision_score(y_test, y_proba)
    axes[1].plot(rec, prec, color=color, lw=2,
                 label=f"{name} (AP={ap:.3f})")

# ROC: diagonal baseline
axes[0].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.50)')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()

# PR: baseline (always-positive classifier)
baseline = y_test.sum() / len(y_test)
axes[1].axhline(y=baseline, color='k', linestyle='--',
                label=f'Random (AP={baseline:.3f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend()

plt.tight_layout()
plt.show()

# ── 4. Optimal threshold via Youden's J ──────────────────────────
gb_proba     = models['Gradient Boosting'].predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, gb_proba)

j_scores   = tpr - fpr
best_idx   = np.argmax(j_scores)
best_thresh = thresholds[best_idx]
print(f"Optimal threshold (Youden): {best_thresh:.4f}")
print(f"  TPR = {tpr[best_idx]:.4f}   FPR = {fpr[best_idx]:.4f}")

# ── 5. Multi-class AUC (if needed) ───────────────────────────────
from sklearn.datasets   import load_iris
from sklearn.multiclass import OneVsRestClassifier

X_mc, y_mc = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_mc, y_mc, test_size=0.3)

ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=500))
ovr_model.fit(X_tr, y_tr)
y_proba_mc = ovr_model.predict_proba(X_te)

macro_auc = roc_auc_score(y_te, y_proba_mc,
                            multi_class='ovr', average='macro')
print(f"Multi-class Macro AUC (Iris): {macro_auc:.4f}")
Output
Optimal threshold (Youden): 0.3847
  TPR = 0.8571   FPR = 0.0526
Multi-class Macro AUC (Iris): 0.9971

Section 13

Common Mistakes with ROC & AUC

⚠️
Using ROC-AUC on Severe Imbalance
When positives are <1%, a high ROC-AUC can hide a useless model. Always complement with PR-AUC or check Precision at your operating threshold.
Fix: use PR-AUC, F-beta, or Cohen's Kappa alongside
🔀
Comparing AUC Without Confidence Intervals
AUC 0.872 vs 0.868 on a 200-sample test set is not a meaningful difference. Always report confidence intervals using bootstrap resampling before declaring a winner.
Fix: bootstrap 1000× and report 95% CI on AUC difference
📉
Fitting the Scaler on the Full Dataset
Fitting StandardScaler before the train-test split leaks test-set statistics into training, inflating all metrics including AUC. Always fit on train only, transform both.
Fix: place scaler inside a Pipeline with cross-validation
🎭
Choosing Threshold After Seeing the Test Set
Selecting the "best" threshold on the test set is data snooping: it inflates recall/precision. Choose the threshold on a validation set or via cross-validation, then evaluate once on the untouched test set.
Fix: 3-way split (train / val / test) or nested CV
🤖
Ignoring Class Probability Calibration
AUC measures ranking, not calibration. A model with AUC=0.95 whose probabilities are wildly off (0.9 when the true rate is 0.3) is dangerous for downstream decisions. Use CalibratedClassifierCV for well-calibrated probabilities.
Fix: check calibration curves and apply isotonic / Platt scaling
๐Ÿท๏ธ
Wrong Class as "Positive"
roc_auc_score assumes label=1 is the positive class. If your rare class is encoded as 0 and the majority as 1, AUC will be inverted (often <0.5). Always verify your positive class label.
Fix: confirm with confusion_matrix which label is which

Section 14

Golden Rules

🎯 ROC Curve & AUC - Key Rules
1
AUC is a ranking metric, not a calibration metric. A model with AUC = 0.97 may still assign probabilities of 0.85 to events that happen only 30% of the time. Always inspect a calibration plot before using raw probabilities for business decisions.
2
ROC-AUC is threshold-independent: that is both its power and its weakness. It summarises all thresholds, so it does not tell you which threshold to use. Always select an operating threshold separately using a business-driven criterion (Youden's J, F-beta, or cost matrix) on the validation set.
3
On imbalanced data (<10% positive rate), always compute PR-AUC alongside ROC-AUC. ROC-AUC can stay high while Precision collapses. Report both. If they diverge, PR-AUC is usually the more honest signal for rare-event problems.
4
Never compare AUC values from different test sets. A model with AUC = 0.91 on a balanced dataset is not better than one with AUC = 0.87 on a 1:100 imbalanced dataset; the numbers are not comparable. Always compare models on the same held-out test set under identical conditions.
5
Validate AUC with bootstrap confidence intervals before shipping. A single point estimate of AUC hides uncertainty. Bootstrap resample the test set 1 000 times and report the 95% CI. Differences within the CI are noise, not model improvement.
6
Cross-validate AUC, not just the final holdout. Use cross_val_score(model, X, y, scoring='roc_auc', cv=5). A model with mean AUC = 0.89 ± 0.02 across folds is far more trustworthy than one with a single holdout AUC of 0.92 but high variance.
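
A closing sketch for Rules 5 and 6: bootstrap the test set for a 95% CI on AUC, then cross-validate AUC across folds. The dataset is synthetic and the model is a plain LogisticRegression, purely for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model  = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# Rule 5: bootstrap the test set for a 95% CI on AUC
rng, aucs = np.random.default_rng(42), []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), len(y_te))      # resample with replacement
    if y_te[idx].min() == y_te[idx].max():           # skip single-class resamples
        continue
    aucs.append(roc_auc_score(y_te[idx], scores[idx]))
lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"Test AUC = {roc_auc_score(y_te, scores):.3f}  (95% CI: {lo:.3f} to {hi:.3f})")

# Rule 6: cross-validated AUC, mean and spread across folds
cv_auc = cross_val_score(LogisticRegression(max_iter=500), X, y,
                         scoring='roc_auc', cv=5)
print(f"CV AUC   = {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")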