The Story: Aisha's Three Fraud Models
Her manager asks: "Which model should we deploy? And what threshold should we use to flag a transaction as fraud?"
Aisha knows accuracy won't answer this: only 0.5% of transactions are fraudulent, so a model that always says "Not Fraud" gets 99.5% accuracy while being completely useless. She needs a metric that evaluates model quality across all possible thresholds at once. That metric is the ROC Curve and its summary statistic, the AUC.
What Is the ROC Curve?
The ROC Curve (Receiver Operating Characteristic Curve) is a graph that shows how a classifier performs at every possible classification threshold, from 0.0 (classify everything as fraud) to 1.0 (classify nothing as fraud).
At each threshold, the model produces a different confusion matrix. The ROC curve plots two numbers extracted from that matrix:
- True Positive Rate (TPR, also called recall) on the y-axis: TP / (TP + FN).
- False Positive Rate (FPR) on the x-axis: FP / (FP + TN).
As you lower the fraud threshold (e.g. from 0.7 to 0.3), the model flags more transactions. This catches more real fraud (TPR rises) but also incorrectly flags more legitimate transactions (FPR rises too). The ROC curve traces this exact trade-off as the threshold sweeps from 1 down to 0. A perfect model reaches TPR = 1 while keeping FPR = 0.
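To make the trade-off concrete, here is a minimal sketch that computes TPR and FPR at a single threshold, using the 10-transaction scores from the worked example below. The `tpr_fpr_at_threshold` helper is illustrative, not part of any library:

```python
import numpy as np

def tpr_fpr_at_threshold(y_true, y_scores, threshold):
    """Return (TPR, FPR) for one classification threshold."""
    y_pred = (y_scores >= threshold).astype(int)   # flag everything at or above the threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# The 10-transaction example used throughout this section
y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])

for t in (0.7, 0.3):
    tpr, fpr = tpr_fpr_at_threshold(y_true, y_scores, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# threshold=0.7: TPR=0.60, FPR=0.20
# threshold=0.3: TPR=0.80, FPR=0.60
```

Lowering the threshold from 0.7 to 0.3 raises recall from 0.60 to 0.80 but triples the false-positive rate, exactly the movement the ROC curve captures.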
Building the ROC Curve Step by Step
Let's use a tiny dataset of 10 transactions: 5 fraudulent (F) and 5 legitimate (L). A model has assigned each a fraud probability. We sort by probability descending and sweep the threshold:
| Rank | Transaction | Actual Label | Fraud Probability (p̂) | Action at this threshold |
|---|---|---|---|---|
| 1 | T-09 | Fraud (F) | 0.95 | Lower threshold to here: predict T-09 as Fraud |
| 2 | T-03 | Legit (L) | 0.88 | Predict T-09, T-03 as Fraud |
| 3 | T-07 | Fraud (F) | 0.82 | Predict top-3 as Fraud |
| 4 | T-01 | Fraud (F) | 0.71 | Predict top-4 as Fraud |
| 5 | T-06 | Legit (L) | 0.60 | Predict top-5 as Fraud |
| 6 | T-10 | Legit (L) | 0.45 | Predict top-6 as Fraud |
| 7 | T-04 | Fraud (F) | 0.38 | Predict top-7 as Fraud |
| 8 | T-02 | Legit (L) | 0.25 | Predict top-8 as Fraud |
| 9 | T-05 | Fraud (F) | 0.15 | Predict top-9 as Fraud |
| 10 | T-08 | Legit (L) | 0.05 | Predict all 10 as Fraud |
Total Positives (Fraud): P = 5
Total Negatives (Legit): N = 5
At each rank boundary we record the cumulative TP, FP, and compute TPR = TP/5, FPR = FP/5:
| Threshold ≥ | Predicted as Fraud | TP | FP | FPR = FP/5 | TPR = TP/5 | ROC Point |
|---|---|---|---|---|---|---|
| 1.00 (nothing) | none | 0 | 0 | 0.00 | 0.00 | (0.00, 0.00) (start) |
| 0.95 | T-09 F | 1 | 0 | 0.00 | 0.20 | (0.00, 0.20) |
| 0.88 | + T-03 L | 1 | 1 | 0.20 | 0.20 | (0.20, 0.20) |
| 0.82 | + T-07 F | 2 | 1 | 0.20 | 0.40 | (0.20, 0.40) |
| 0.71 | + T-01 F | 3 | 1 | 0.20 | 0.60 | (0.20, 0.60) |
| 0.60 | + T-06 L | 3 | 2 | 0.40 | 0.60 | (0.40, 0.60) |
| 0.45 | + T-10 L | 3 | 3 | 0.60 | 0.60 | (0.60, 0.60) |
| 0.38 | + T-04 F | 4 | 3 | 0.60 | 0.80 | (0.60, 0.80) |
| 0.25 | + T-02 L | 4 | 4 | 0.80 | 0.80 | (0.80, 0.80) |
| 0.15 | + T-05 F | 5 | 4 | 0.80 | 1.00 | (0.80, 1.00) |
| 0.00 (all) | + T-08 L | 5 | 5 | 1.00 | 1.00 | (1.00, 1.00) (end) |
When a Fraud case is added, the point moves straight up (TPR increases, FPR stays the same: good news). When a Legit case is added, the point moves straight right (FPR increases, TPR stays the same: bad news). A model that perfectly ranks all frauds above all legitimate transactions would reach (0, 1.0) before moving right at all, giving an AUC of 1.0.
The ROC Curve: Visualised
Each vertical jump corresponds to catching a real fraud. Each horizontal jump is a false alarm on a legitimate transaction. AUC = 0.64 means this model ranks a random fraud above a random legitimate transaction 64% of the time.
Calculating AUC: The Trapezoidal Rule
AUC is the area under the ROC curve, computed using the trapezoidal rule: for each consecutive pair of points (FPR₁, TPR₁) and (FPR₂, TPR₂), the area of the trapezoid beneath them is (FPR₂ − FPR₁) × (TPR₁ + TPR₂) / 2. Vertical segments have zero width and contribute nothing, so only the five horizontal segments of this curve matter:
- Seg 1: (0.00, 0.20) → (0.20, 0.20): width = 0.20, avg TPR = 0.20, area = 0.20 × 0.20 = 0.040
- Seg 2: (0.20, 0.60) → (0.40, 0.60): width = 0.20, avg TPR = 0.60, area = 0.20 × 0.60 = 0.120
- Seg 3: (0.40, 0.60) → (0.60, 0.60): width = 0.20, avg TPR = 0.60, area = 0.20 × 0.60 = 0.120
- Seg 4: (0.60, 0.80) → (0.80, 0.80): width = 0.20, avg TPR = 0.80, area = 0.20 × 0.80 = 0.160
- Seg 5: (0.80, 1.00) → (1.00, 1.00): width = 0.20, avg TPR = 1.00, area = 0.20 × 1.00 = 0.200

Total AUC = 0.040 + 0.120 + 0.120 + 0.160 + 0.200 = 0.640
The model has an AUC of 0.64: better than random (0.5) but far from a production-ready model (0.90+).
Multi-Model Comparison: The Real Power of ROC
The true strength of ROC curves is comparing multiple models simultaneously, across all thresholds at once. Aisha plots all three of her models on the same axes:
The Gradient Boosting model dominates at every threshold: its curve is highest everywhere. Aisha should deploy this model. The Random Forest is a solid backup. Logistic Regression lags at higher FPR values, suggesting it misses more fraud at typical operating thresholds.
A model whose ROC curve is higher and to the left at your operating FPR is better. If Aisha's fraud team can investigate 10% of all transactions (FPR ≤ 0.10), she reads the y-value at FPR = 0.10: Gradient Boosting ≈ 78% recall, Random Forest ≈ 45% recall, Logistic Regression ≈ 20% recall. Clear winner.
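That capacity-based reading is easy to automate. Below is a minimal sketch; the `recall_at_fpr` helper and the synthetic validation data are illustrative stand-ins, not part of Aisha's actual pipeline:

```python
import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_scores, fpr_budget=0.10):
    """Best TPR (recall) achievable while keeping FPR within the budget."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    within = fpr <= fpr_budget          # operating points the team can afford
    idx = np.argmax(tpr[within])        # highest recall among affordable points
    return tpr[within][idx], thresholds[within][idx]

# Synthetic stand-in for a model's validation scores
rng = np.random.default_rng(42)
y_val = rng.integers(0, 2, size=2000)
scores = np.clip(0.35 * y_val + rng.normal(0.4, 0.2, size=2000), 0, 1)

recall, thresh = recall_at_fpr(y_val, scores, fpr_budget=0.10)
print(f"Recall at FPR <= 0.10: {recall:.2f} (threshold {thresh:.2f})")
```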
Interpreting the AUC Score
| AUC | What it means |
|---|---|
| ≈ 0.5 | Equivalent to random guessing; this is the diagonal line on the ROC plot. The model has learned nothing useful. Your floor: always beat this. |
| 0.5–0.7 | Poor discrimination. Better than random but barely; revisit features & data quality. Rarely production-worthy. |
| 0.7–0.9 | Good to very good discrimination. Acceptable for many real problems; tune the threshold to fit your domain. Most production models live here. |
| 0.9–1.0 | Excellent discrimination. Always check for data leakage, and remember future data can always degrade this. AUC = 1.0 is perfect (too perfect?). |
An AUC below 0.5 is not automatically bad: it means the model is consistently wrong. Flip its predictions (predict 1 when it says 0) and you get a model with AUC = 1 − original AUC. This usually means a bug in your label encoding (0 and 1 swapped) or in your feature pipeline.
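A quick way to see this, as a sketch reusing the 10-transaction example with the scores deliberately inverted to simulate a "consistently wrong" model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
good_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])
bad_scores = 1 - good_scores                   # simulate swapped labels / inverted scores

print(roc_auc_score(y_true, bad_scores))       # 0.36 -> "worse than random"
print(roc_auc_score(y_true, 1 - bad_scores))   # 0.64 -> flipping gives 1 - 0.36
```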
ROC-AUC vs Precision-Recall AUC: When Each Wins
The ROC curve is not always the right tool. On severely imbalanced datasets, like fraud with a 0.1% positive rate, the ROC curve can be misleadingly optimistic. The Precision-Recall (PR) curve is more honest in those cases.
ROC curve axes:

| Axis | Formula | Uses |
|---|---|---|
| X (FPR) | FP / (FP + TN) | TN, the large pool of negatives |
| Y (TPR / Recall) | TP / (TP + FN) | TP and FN only |

On imbalanced data, TN is huge. Even many FPs produce a tiny FPR, making the curve look excellent while Precision is terrible.
PR curve axes:

| Axis | Formula | Uses |
|---|---|---|
| X (Recall) | TP / (TP + FN) | No TN involved |
| Y (Precision) | TP / (TP + FP) | No TN involved |

Because TN never appears, the PR curve is not inflated by the huge pool of true negatives. A bad model that generates many FPs shows up clearly here even when ROC-AUC looks fine.
| Situation | Use ROC-AUC | Use PR-AUC |
|---|---|---|
| Class ratio ~50:50 | ✓ Best choice | Also works |
| Mild imbalance (10:1) | ✓ Good choice | Also informative |
| Severe imbalance (100:1 or worse) | Can be misleading | ✓ Preferred |
| Comparing models across teams | ✓ Standard choice | Less universal |
| Care more about Precision | Less relevant | ✓ Shows precision clearly |
| Fraud / medical rare-event detection | Use with caution | ✓ Much more informative |
A concrete example: 100,000 transactions, only 50 of them fraudulent, scored at one operating threshold.
Confusion matrix: TP = 40, FP = 460, FN = 10, TN = 99,490
ROC-AUC: excellent, around 0.93, because FPR = 460 / 99,950 = 0.0046 (tiny denominator) while TPR = 40 / 50 = 0.80.
Precision: 40 / 500 = 8%, so for every 12 alerts sent to the fraud team, roughly 11 are false alarms.
ROC says "great model." PR curve says "this model is drowning your fraud team in noise." In banking operations, PR-AUC is the decision metric.
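The arithmetic behind those numbers, as a quick sketch:

```python
# Confusion-matrix counts from the fraud example above
TP, FP, FN, TN = 40, 460, 10, 99_490

fpr       = FP / (FP + TN)   # 460 / 99,950 ~ 0.0046 -> looks great on a ROC plot
recall    = TP / (TP + FN)   # 40 / 50      = 0.80
precision = TP / (TP + FP)   # 40 / 500     = 0.08   -> the fraud team's actual experience
print(f"FPR = {fpr:.4f}, Recall = {recall:.2f}, Precision = {precision:.0%}")
```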
Choosing the Right Operating Point
The ROC curve shows all possible thresholds. Choosing which point to operate at requires a business decision. Three common strategies:
- Youden's J statistic: maximise TPR − FPR, the point farthest from the diagonal. It treats FP and FN equally, so it is a good default when costs are unknown.
- Cost-based threshold: minimise cost(FP) · FPR + cost(FN) · (1 − TPR). Define the cost of a false alarm vs a missed fraud; in fraud detection, missing a £10,000 fraud costs far more than one wasted analyst hour (see the sketch after this list).
- Capacity constraint: fix the false-positive rate your team can actually handle (like the "investigate 10% of transactions" budget above) and pick the threshold with the highest TPR inside that budget.
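A minimal sketch of the cost-based strategy, applied to the 10-transaction example. The £50-per-false-alarm figure is an assumed analyst cost, not from the text; the £10,000 missed-fraud cost is the one mentioned above:

```python
import numpy as np
from sklearn.metrics import roc_curve

def min_cost_threshold(y_true, y_scores, cost_fp, cost_fn):
    """Threshold minimising cost(FP)*FPR + cost(FN)*(1 - TPR)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    expected_cost = cost_fp * fpr + cost_fn * (1 - tpr)
    best = np.argmin(expected_cost)
    return thresholds[best], expected_cost[best]

y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])

# Assumed costs: £50 per false alarm (analyst time), £10,000 per missed fraud
thresh, cost = min_cost_threshold(y_true, y_scores, cost_fp=50, cost_fn=10_000)
print(f"Cost-minimising threshold: {thresh:.2f} (cost term {cost:.0f})")
```

With a missed fraud valued 200 times higher than a false alarm, the minimum lands at a low threshold that catches every fraud, which is exactly the asymmetry the strategy is meant to encode.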
Multi-Class ROC: One-vs-Rest (OvR)
For problems with more than two classes (e.g. classifying a tumour as Benign, Malignant Type A, or Malignant Type B), the standard binary ROC cannot be directly applied. The most common extension is One-vs-Rest (OvR): for each class, treat that class as the positive label and every other class as negative, compute a binary ROC/AUC, then average the per-class AUCs.
Macro-average AUC: the unweighted mean of the per-class AUCs; every class counts equally.
Weighted-average AUC: weighted by class support; better for imbalanced multi-class problems.
In scikit-learn:
`roc_auc_score(y, proba, multi_class='ovr', average='macro')`
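A minimal sketch of what the OvR averaging does under the hood, on the Iris data also used later in this section. The per-class decomposition is illustrative; `roc_auc_score(..., multi_class='ovr')` performs the equivalent computation internally:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
proba = LogisticRegression(max_iter=500).fit(X_tr, y_tr).predict_proba(X_te)

# One binary AUC per class (that class vs the rest), then average
per_class = [roc_auc_score((y_te == c).astype(int), proba[:, c]) for c in range(3)]
support = np.bincount(y_te)

print("Per-class AUC:", np.round(per_class, 3))
print("Macro AUC    :", round(float(np.mean(per_class)), 3))                      # equal class weight
print("Weighted AUC :", round(float(np.average(per_class, weights=support)), 3))  # weighted by support
```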
Python: ROC Curve & AUC from Scratch
```python
import numpy as np

# ── Build ROC curve manually ─────────────────────────────────
def roc_curve_manual(y_true, y_scores):
    """Compute TPR and FPR at every unique threshold."""
    thresholds = np.sort(np.unique(y_scores))[::-1]   # descending
    P = np.sum(y_true == 1)                           # total positives
    N = np.sum(y_true == 0)                           # total negatives
    tpr_list = [0.0]
    fpr_list = [0.0]
    for thresh in thresholds:
        y_pred = (y_scores >= thresh).astype(int)
        TP = np.sum((y_pred == 1) & (y_true == 1))
        FP = np.sum((y_pred == 1) & (y_true == 0))
        tpr_list.append(TP / P)
        fpr_list.append(FP / N)
    tpr_list.append(1.0)
    fpr_list.append(1.0)
    return np.array(fpr_list), np.array(tpr_list)

# ── AUC via trapezoidal rule ─────────────────────────────────
def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using np.trapz."""
    return np.trapz(tpr, fpr)   # integrates TPR over FPR

# ── 10-transaction example ───────────────────────────────────
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60,
                     0.45, 0.38, 0.25, 0.15, 0.05])

fpr, tpr = roc_curve_manual(y_true, y_scores)
auc = auc_trapezoid(fpr, tpr)
print("FPR points:", fpr.round(2))
print("TPR points:", tpr.round(2))
print(f"AUC : {auc:.4f}")   # 0.6400

# ── Youden's J: find optimal threshold ───────────────────────
thresholds = np.sort(np.unique(y_scores))[::-1]
j_scores = tpr[1:-1] - fpr[1:-1]        # TPR - FPR at each threshold
best_idx = np.argmax(j_scores)
best_thresh = thresholds[best_idx]
print(f"Best threshold (Youden's J): {best_thresh:.2f}")
print(f"  TPR at best: {tpr[best_idx+1]:.2f}  "
      f"FPR at best: {fpr[best_idx+1]:.2f}")
```
Python: Full Pipeline with scikit-learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    roc_curve, roc_auc_score,
    precision_recall_curve, average_precision_score,
    RocCurveDisplay
)

# ── 1. Create imbalanced dataset ─────────────────────────────
X, y = make_classification(
    n_samples=5000, n_features=20,
    weights=[0.95, 0.05],            # 95% legit, 5% fraud
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# ── 2. Train three models ────────────────────────────────────
models = {
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=500),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)

# ── 3. Compute & plot ROC curves ─────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#6366f1', '#f59e0b', '#34d399']

for (name, model), color in zip(models.items(), colors):
    y_proba = model.predict_proba(X_test)[:, 1]
    # ROC
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    axes[0].plot(fpr, tpr, color=color, lw=2,
                 label=f"{name} (AUC={auc:.3f})")
    # Precision-Recall
    prec, rec, _ = precision_recall_curve(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)
    axes[1].plot(rec, prec, color=color, lw=2,
                 label=f"{name} (AP={ap:.3f})")

# ROC: diagonal baseline
axes[0].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.50)')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()

# PR: baseline (always-positive classifier)
baseline = y_test.sum() / len(y_test)
axes[1].axhline(y=baseline, color='k', linestyle='--',
                label=f'Random (AP={baseline:.3f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend()

plt.tight_layout()
plt.show()

# ── 4. Optimal threshold via Youden's J ──────────────────────
gb_proba = models['Gradient Boosting'].predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, gb_proba)
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_thresh = thresholds[best_idx]
print(f"Optimal threshold (Youden): {best_thresh:.4f}")
print(f"  TPR = {tpr[best_idx]:.4f}  FPR = {fpr[best_idx]:.4f}")

# ── 5. Multi-class AUC (if needed) ───────────────────────────
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier

X_mc, y_mc = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_mc, y_mc, test_size=0.3)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=500))
ovr_model.fit(X_tr, y_tr)
y_proba_mc = ovr_model.predict_proba(X_te)
macro_auc = roc_auc_score(y_te, y_proba_mc,
                          multi_class='ovr', average='macro')
print(f"Multi-class Macro AUC (Iris): {macro_auc:.4f}")
```
Common Mistakes with ROC & AUC
- Fitting the scaler before the split: applying StandardScaler before the train-test split leaks test-set statistics into training, inflating all metrics including AUC. Always fit on train only, transform both.
- Treating raw scores as probabilities: AUC only measures ranking, not calibration. If downstream decisions need real probabilities, wrap the model in CalibratedClassifierCV for well-calibrated probabilities (see the sketch after this list).
- Getting the positive class wrong: roc_auc_score assumes label=1 is the positive class. If your rare class is encoded as 0 and the majority as 1, AUC will be inverted (often < 0.5). Always verify your positive class label.
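A minimal sketch of the calibration fix, reusing the same synthetic dataset as the pipeline above; `method='sigmoid'` (Platt scaling) is one of the two methods CalibratedClassifierCV supports, the other being `'isotonic'`:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42, stratify=y)

# Calibration rescales the scores so they behave like real probabilities;
# the goal is trustworthy predict_proba output, not a higher AUC.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=500),
                                    method='sigmoid', cv=5)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)[:, 1]
print("Calibrated fraud probability of first test case:", round(float(proba[0]), 3))
```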
Golden Rules
- Never judge a model on a single train/test split: estimate AUC with cross-validation, e.g. `cross_val_score(model, X, y, scoring='roc_auc', cv=5)`. A model with mean AUC = 0.89 ± 0.02 across folds is far more trustworthy than one with a single holdout AUC of 0.92 but high variance (see the sketch below).
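Putting the scaler-leakage fix and the cross-validation rule together, one possible pattern (a sketch using the same synthetic dataset as the pipeline above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)

# The pipeline re-fits the scaler inside every CV fold, so no test-fold
# statistics ever leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
cv_auc = cross_val_score(pipe, X, y, scoring='roc_auc', cv=5)

print(f"Cross-validated AUC: {cv_auc.mean():.3f} ± {cv_auc.std():.3f}")
```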