The Story: Aisha's Three Fraud Models
Her manager asks: "Which model should we deploy? And what threshold should we use to flag a transaction as fraud?"
Aisha knows accuracy won't answer this: only 0.5% of transactions are fraudulent, so a model that always says "Not Fraud" gets 99.5% accuracy while being completely useless. She needs a metric that evaluates model quality across all possible thresholds at once. That metric is the ROC Curve and its summary statistic, the AUC.
What Is the ROC Curve?
The ROC Curve (Receiver Operating Characteristic Curve) is a graph that shows how a classifier performs at every possible classification threshold, from 0.0 (classify everything as fraud) to 1.0 (classify nothing as fraud).
At each threshold, the model produces a different confusion matrix. The ROC curve plots two numbers extracted from that matrix:
- True Positive Rate (TPR, also called recall) on the y-axis: TP / (TP + FN).
- False Positive Rate (FPR) on the x-axis: FP / (FP + TN).
As you lower the fraud threshold (e.g. from 0.7 to 0.3), the model flags more transactions. This catches more real fraud (TPR rises) but also incorrectly flags more legitimate transactions (FPR rises too). The ROC curve traces this exact trade-off as the threshold sweeps from 1 down to 0. A perfect model reaches TPR = 1 while keeping FPR = 0.
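To make the trade-off concrete, here is a minimal sketch that computes TPR and FPR at a single threshold, using the 10-transaction scores from the worked example below. The `tpr_fpr_at_threshold` helper is illustrative, not part of any library:

```python
import numpy as np

def tpr_fpr_at_threshold(y_true, y_scores, threshold):
    """Return (TPR, FPR) for one classification threshold."""
    y_pred = (y_scores >= threshold).astype(int)   # flag everything at or above the threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# The 10-transaction example used throughout this section
y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])

for t in (0.7, 0.3):
    tpr, fpr = tpr_fpr_at_threshold(y_true, y_scores, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# threshold=0.7: TPR=0.60, FPR=0.20
# threshold=0.3: TPR=0.80, FPR=0.60
```

Lowering the threshold from 0.7 to 0.3 raises recall from 0.60 to 0.80 but triples the false-positive rate, exactly the movement the ROC curve captures.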
Building the ROC Curve Step by Step
Let's use a tiny dataset of 10 transactions: 5 fraudulent (F) and 5 legitimate (L). A model has assigned each a fraud probability. We sort by probability descending and sweep the threshold:
| Rank | Transaction | Actual Label | Fraud Probability (p̂) | Action at this threshold |
|---|---|---|---|---|
| 1 | T-09 | Fraud (F) | 0.95 | Lower threshold to here: predict T-09 as Fraud |
| 2 | T-03 | Legit (L) | 0.88 | Predict T-09, T-03 as Fraud |
| 3 | T-07 | Fraud (F) | 0.82 | Predict top-3 as Fraud |
| 4 | T-01 | Fraud (F) | 0.71 | Predict top-4 as Fraud |
| 5 | T-06 | Legit (L) | 0.60 | Predict top-5 as Fraud |
| 6 | T-10 | Legit (L) | 0.45 | Predict top-6 as Fraud |
| 7 | T-04 | Fraud (F) | 0.38 | Predict top-7 as Fraud |
| 8 | T-02 | Legit (L) | 0.25 | Predict top-8 as Fraud |
| 9 | T-05 | Fraud (F) | 0.15 | Predict top-9 as Fraud |
| 10 | T-08 | Legit (L) | 0.05 | Predict all 10 as Fraud |
Total Positives (Fraud): P = 5
Total Negatives (Legit): N = 5
At each rank boundary we record the cumulative TP, FP, and compute TPR = TP/5, FPR = FP/5:
| Threshold ≥ | Predicted as Fraud | TP | FP | FPR = FP/5 | TPR = TP/5 | ROC Point |
|---|---|---|---|---|---|---|
| 1.00 (nothing) | none | 0 | 0 | 0.00 | 0.00 | (0.00, 0.00) (start) |
| 0.95 | T-09 F | 1 | 0 | 0.00 | 0.20 | (0.00, 0.20) |
| 0.88 | + T-03 L | 1 | 1 | 0.20 | 0.20 | (0.20, 0.20) |
| 0.82 | + T-07 F | 2 | 1 | 0.20 | 0.40 | (0.20, 0.40) |
| 0.71 | + T-01 F | 3 | 1 | 0.20 | 0.60 | (0.20, 0.60) |
| 0.60 | + T-06 L | 3 | 2 | 0.40 | 0.60 | (0.40, 0.60) |
| 0.45 | + T-10 L | 3 | 3 | 0.60 | 0.60 | (0.60, 0.60) |
| 0.38 | + T-04 F | 4 | 3 | 0.60 | 0.80 | (0.60, 0.80) |
| 0.25 | + T-02 L | 4 | 4 | 0.80 | 0.80 | (0.80, 0.80) |
| 0.15 | + T-05 F | 5 | 4 | 0.80 | 1.00 | (0.80, 1.00) |
| 0.00 (all) | + T-08 L | 5 | 5 | 1.00 | 1.00 | (1.00, 1.00) (end) |
When a Fraud case is added, the point moves straight up (TPR increases, FPR stays the same: good news). When a Legit case is added, the point moves straight right (FPR increases, TPR stays the same: bad news). A model that perfectly ranks all frauds above all legitimate transactions would reach (0, 1.0) before moving right at all, giving an AUC of 1.0.
The ROC Curve: Visualised
Each vertical jump corresponds to catching a real fraud. Each horizontal jump is a false alarm on a legitimate transaction. AUC = 0.64 means this model ranks a random fraud above a random legitimate transaction 64% of the time.
Calculating AUC: The Trapezoidal Rule
AUC is the area under the ROC curve, computed using the trapezoidal rule: for each consecutive pair of points (FPR₁, TPR₁) and (FPR₂, TPR₂), the area of the trapezoid beneath them is (FPR₂ − FPR₁) × (TPR₁ + TPR₂) / 2. Vertical segments have zero width and contribute nothing, so only the five horizontal segments of this curve matter:
- Seg 1: (0.00, 0.20) → (0.20, 0.20): width = 0.20, avg TPR = 0.20, area = 0.20 × 0.20 = 0.040
- Seg 2: (0.20, 0.60) → (0.40, 0.60): width = 0.20, avg TPR = 0.60, area = 0.20 × 0.60 = 0.120
- Seg 3: (0.40, 0.60) → (0.60, 0.60): width = 0.20, avg TPR = 0.60, area = 0.20 × 0.60 = 0.120
- Seg 4: (0.60, 0.80) → (0.80, 0.80): width = 0.20, avg TPR = 0.80, area = 0.20 × 0.80 = 0.160
- Seg 5: (0.80, 1.00) → (1.00, 1.00): width = 0.20, avg TPR = 1.00, area = 0.20 × 1.00 = 0.200

Total AUC = 0.040 + 0.120 + 0.120 + 0.160 + 0.200 = 0.640
The model has an AUC of 0.64: better than random (0.5) but far from a production-ready model (0.90+).
Multi-Model Comparison: The Real Power of ROC
The true strength of ROC curves is comparing multiple models simultaneously, across all thresholds at once. Aisha plots all three of her models on the same axes:
The Gradient Boosting model dominates at every threshold: its curve is highest everywhere. Aisha should deploy this model. The Random Forest is a solid backup. Logistic Regression lags at higher FPR values, suggesting it misses more fraud at typical operating thresholds.
A model whose ROC curve is higher and to the left at your operating FPR is better. If Aisha's fraud team can investigate 10% of all transactions (FPR ≤ 0.10), she reads the y-value at FPR = 0.10: Gradient Boosting ≈ 78% recall, Random Forest ≈ 45% recall, Logistic Regression ≈ 20% recall. Clear winner.
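That capacity-based reading is easy to automate. Below is a minimal sketch; the `recall_at_fpr` helper and the synthetic validation data are illustrative stand-ins, not part of Aisha's actual pipeline:

```python
import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_scores, fpr_budget=0.10):
    """Best TPR (recall) achievable while keeping FPR within the budget."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    within = fpr <= fpr_budget          # operating points the team can afford
    idx = np.argmax(tpr[within])        # highest recall among affordable points
    return tpr[within][idx], thresholds[within][idx]

# Synthetic stand-in for a model's validation scores
rng = np.random.default_rng(42)
y_val = rng.integers(0, 2, size=2000)
scores = np.clip(0.35 * y_val + rng.normal(0.4, 0.2, size=2000), 0, 1)

recall, thresh = recall_at_fpr(y_val, scores, fpr_budget=0.10)
print(f"Recall at FPR <= 0.10: {recall:.2f} (threshold {thresh:.2f})")
```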
Interpreting the AUC Score
| AUC | What it means |
|---|---|
| ≈ 0.5 | Equivalent to random guessing; this is the diagonal line on the ROC plot. The model has learned nothing useful. Your floor: always beat this. |
| 0.5–0.7 | Poor discrimination. Better than random but barely; revisit features & data quality. Rarely production-worthy. |
| 0.7–0.9 | Good to very good discrimination. Acceptable for many real problems; tune the threshold to fit your domain. Most production models live here. |
| 0.9–1.0 | Excellent discrimination. Always check for data leakage, and remember future data can always degrade this. AUC = 1.0 is perfect (too perfect?). |
An AUC below 0.5 is not automatically bad: it means the model is consistently wrong. Flip its predictions (predict 1 when it says 0) and you get a model with AUC = 1 − original AUC. This usually means a bug in your label encoding (0 and 1 swapped) or in your feature pipeline.
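A quick way to see this, as a sketch reusing the 10-transaction example with the scores deliberately inverted to simulate a "consistently wrong" model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
good_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])
bad_scores = 1 - good_scores                   # simulate swapped labels / inverted scores

print(roc_auc_score(y_true, bad_scores))       # 0.36 -> "worse than random"
print(roc_auc_score(y_true, 1 - bad_scores))   # 0.64 -> flipping gives 1 - 0.36
```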
ROC-AUC vs Precision-Recall AUC: When Each Wins
The ROC curve is not always the right tool. On severely imbalanced datasets, like fraud with a 0.1% positive rate, the ROC curve can be misleadingly optimistic. The Precision-Recall (PR) curve is more honest in those cases.
ROC curve axes:

| Axis | Formula | Uses |
|---|---|---|
| X (FPR) | FP / (FP + TN) | TN, the large pool of negatives |
| Y (TPR / Recall) | TP / (TP + FN) | TP and FN only |

On imbalanced data, TN is huge. Even many FPs produce a tiny FPR, making the curve look excellent while Precision is terrible.
PR curve axes:

| Axis | Formula | Uses |
|---|---|---|
| X (Recall) | TP / (TP + FN) | No TN involved |
| Y (Precision) | TP / (TP + FP) | No TN involved |

Because TN never appears, the PR curve is not inflated by the huge pool of true negatives. A bad model that generates many FPs shows up clearly here even when ROC-AUC looks fine.
| Situation | Use ROC-AUC | Use PR-AUC |
|---|---|---|
| Class ratio ~50:50 | ✓ Best choice | Also works |
| Mild imbalance (10:1) | ✓ Good choice | Also informative |
| Severe imbalance (100:1 or worse) | Can be misleading | ✓ Preferred |
| Comparing models across teams | ✓ Standard choice | Less universal |
| Care more about Precision | Less relevant | ✓ Shows precision clearly |
| Fraud / medical rare-event detection | Use with caution | ✓ Much more informative |
A concrete example: 100,000 transactions, only 50 of them fraudulent, scored at one operating threshold.
Confusion matrix: TP = 40, FP = 460, FN = 10, TN = 99,490
ROC-AUC: excellent, around 0.93, because FPR = 460 / 99,950 = 0.0046 (tiny denominator) while TPR = 40 / 50 = 0.80.
Precision: 40 / 500 = 8%, so for every 12 alerts sent to the fraud team, roughly 11 are false alarms.
ROC says "great model." PR curve says "this model is drowning your fraud team in noise." In banking operations, PR-AUC is the decision metric.
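The arithmetic behind those numbers, as a quick sketch:

```python
# Confusion-matrix counts from the fraud example above
TP, FP, FN, TN = 40, 460, 10, 99_490

fpr       = FP / (FP + TN)   # 460 / 99,950 ~ 0.0046 -> looks great on a ROC plot
recall    = TP / (TP + FN)   # 40 / 50      = 0.80
precision = TP / (TP + FP)   # 40 / 500     = 0.08   -> the fraud team's actual experience
print(f"FPR = {fpr:.4f}, Recall = {recall:.2f}, Precision = {precision:.0%}")
```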
Choosing the Right Operating Point
The ROC curve shows all possible thresholds. Choosing which point to operate at requires a business decision. Three common strategies:
- Youden's J statistic: maximise TPR − FPR, the point farthest from the diagonal. It treats FP and FN equally, so it is a good default when costs are unknown.
- Cost-based threshold: minimise cost(FP) · FPR + cost(FN) · (1 − TPR). Define the cost of a false alarm vs a missed fraud; in fraud detection, missing a £10,000 fraud costs far more than one wasted analyst hour (see the sketch after this list).
- Capacity constraint: fix the false-positive rate your team can actually handle (like the "investigate 10% of transactions" budget above) and pick the threshold with the highest TPR inside that budget.
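A minimal sketch of the cost-based strategy, applied to the 10-transaction example. The £50-per-false-alarm figure is an assumed analyst cost, not from the text; the £10,000 missed-fraud cost is the one mentioned above:

```python
import numpy as np
from sklearn.metrics import roc_curve

def min_cost_threshold(y_true, y_scores, cost_fp, cost_fn):
    """Threshold minimising cost(FP)*FPR + cost(FN)*(1 - TPR)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    expected_cost = cost_fp * fpr + cost_fn * (1 - tpr)
    best = np.argmin(expected_cost)
    return thresholds[best], expected_cost[best]

y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60, 0.45, 0.38, 0.25, 0.15, 0.05])

# Assumed costs: £50 per false alarm (analyst time), £10,000 per missed fraud
thresh, cost = min_cost_threshold(y_true, y_scores, cost_fp=50, cost_fn=10_000)
print(f"Cost-minimising threshold: {thresh:.2f} (cost term {cost:.0f})")
```

With a missed fraud valued 200 times higher than a false alarm, the minimum lands at a low threshold that catches every fraud, which is exactly the asymmetry the strategy is meant to encode.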
Multi-Class ROC: One-vs-Rest (OvR)
For problems with more than two classes (e.g. classifying a tumour as Benign, Malignant Type A, or Malignant Type B), the standard binary ROC cannot be directly applied. The most common extension is One-vs-Rest (OvR): for each class, treat that class as the positive label and every other class as negative, compute a binary ROC/AUC, then average the per-class AUCs.
Macro-average AUC: the unweighted mean of the per-class AUCs; every class counts equally.
Weighted-average AUC: weighted by class support; better for imbalanced multi-class problems.
In scikit-learn:
`roc_auc_score(y, proba, multi_class='ovr', average='macro')`
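A minimal sketch of what the OvR averaging does under the hood, on the Iris data also used later in this section. The per-class decomposition is illustrative; `roc_auc_score(..., multi_class='ovr')` performs the equivalent computation internally:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
proba = LogisticRegression(max_iter=500).fit(X_tr, y_tr).predict_proba(X_te)

# One binary AUC per class (that class vs the rest), then average
per_class = [roc_auc_score((y_te == c).astype(int), proba[:, c]) for c in range(3)]
support = np.bincount(y_te)

print("Per-class AUC:", np.round(per_class, 3))
print("Macro AUC    :", round(float(np.mean(per_class)), 3))                      # equal class weight
print("Weighted AUC :", round(float(np.average(per_class, weights=support)), 3))  # weighted by support
```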
Python: ROC Curve & AUC from Scratch
```python
import numpy as np

# ── Build ROC curve manually ─────────────────────────────────
def roc_curve_manual(y_true, y_scores):
    """Compute TPR and FPR at every unique threshold."""
    thresholds = np.sort(np.unique(y_scores))[::-1]   # descending
    P = np.sum(y_true == 1)                           # total positives
    N = np.sum(y_true == 0)                           # total negatives
    tpr_list = [0.0]
    fpr_list = [0.0]
    for thresh in thresholds:
        y_pred = (y_scores >= thresh).astype(int)
        TP = np.sum((y_pred == 1) & (y_true == 1))
        FP = np.sum((y_pred == 1) & (y_true == 0))
        tpr_list.append(TP / P)
        fpr_list.append(FP / N)
    tpr_list.append(1.0)
    fpr_list.append(1.0)
    return np.array(fpr_list), np.array(tpr_list)

# ── AUC via trapezoidal rule ─────────────────────────────────
def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using np.trapz."""
    return np.trapz(tpr, fpr)   # integrates TPR over FPR

# ── 10-transaction example ───────────────────────────────────
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_scores = np.array([0.95, 0.88, 0.82, 0.71, 0.60,
                     0.45, 0.38, 0.25, 0.15, 0.05])

fpr, tpr = roc_curve_manual(y_true, y_scores)
auc = auc_trapezoid(fpr, tpr)
print("FPR points:", fpr.round(2))
print("TPR points:", tpr.round(2))
print(f"AUC : {auc:.4f}")   # 0.6400

# ── Youden's J: find optimal threshold ───────────────────────
thresholds = np.sort(np.unique(y_scores))[::-1]
j_scores = tpr[1:-1] - fpr[1:-1]        # TPR - FPR at each threshold
best_idx = np.argmax(j_scores)
best_thresh = thresholds[best_idx]
print(f"Best threshold (Youden's J): {best_thresh:.2f}")
print(f"  TPR at best: {tpr[best_idx+1]:.2f}  "
      f"FPR at best: {fpr[best_idx+1]:.2f}")
```
Python: Full Pipeline with scikit-learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    roc_curve, roc_auc_score,
    precision_recall_curve, average_precision_score,
    RocCurveDisplay
)

# ── 1. Create imbalanced dataset ─────────────────────────────
X, y = make_classification(
    n_samples=5000, n_features=20,
    weights=[0.95, 0.05],            # 95% legit, 5% fraud
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# ── 2. Train three models ────────────────────────────────────
models = {
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=500),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)

# ── 3. Compute & plot ROC curves ─────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#6366f1', '#f59e0b', '#34d399']

for (name, model), color in zip(models.items(), colors):
    y_proba = model.predict_proba(X_test)[:, 1]
    # ROC
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    axes[0].plot(fpr, tpr, color=color, lw=2,
                 label=f"{name} (AUC={auc:.3f})")
    # Precision-Recall
    prec, rec, _ = precision_recall_curve(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)
    axes[1].plot(rec, prec, color=color, lw=2,
                 label=f"{name} (AP={ap:.3f})")

# ROC: diagonal baseline
axes[0].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.50)')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()

# PR: baseline (always-positive classifier)
baseline = y_test.sum() / len(y_test)
axes[1].axhline(y=baseline, color='k', linestyle='--',
                label=f'Random (AP={baseline:.3f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend()

plt.tight_layout()
plt.show()

# ── 4. Optimal threshold via Youden's J ──────────────────────
gb_proba = models['Gradient Boosting'].predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, gb_proba)
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_thresh = thresholds[best_idx]
print(f"Optimal threshold (Youden): {best_thresh:.4f}")
print(f"  TPR = {tpr[best_idx]:.4f}  FPR = {fpr[best_idx]:.4f}")

# ── 5. Multi-class AUC (if needed) ───────────────────────────
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier

X_mc, y_mc = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_mc, y_mc, test_size=0.3)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=500))
ovr_model.fit(X_tr, y_tr)
y_proba_mc = ovr_model.predict_proba(X_te)
macro_auc = roc_auc_score(y_te, y_proba_mc,
                          multi_class='ovr', average='macro')
print(f"Multi-class Macro AUC (Iris): {macro_auc:.4f}")
```
Common Mistakes with ROC & AUC
- Fitting the scaler before the split: applying StandardScaler before the train-test split leaks test-set statistics into training, inflating all metrics including AUC. Always fit on train only, transform both.
- Treating raw scores as probabilities: AUC only measures ranking, not calibration. If downstream decisions need real probabilities, wrap the model in CalibratedClassifierCV for well-calibrated probabilities (see the sketch after this list).
- Getting the positive class wrong: roc_auc_score assumes label=1 is the positive class. If your rare class is encoded as 0 and the majority as 1, AUC will be inverted (often < 0.5). Always verify your positive class label.
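A minimal sketch of the calibration fix, reusing the same synthetic dataset as the pipeline above; `method='sigmoid'` (Platt scaling) is one of the two methods CalibratedClassifierCV supports, the other being `'isotonic'`:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42, stratify=y)

# Calibration rescales the scores so they behave like real probabilities;
# the goal is trustworthy predict_proba output, not a higher AUC.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=500),
                                    method='sigmoid', cv=5)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)[:, 1]
print("Calibrated fraud probability of first test case:", round(float(proba[0]), 3))
```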
Golden Rules
- Never judge a model on a single train/test split: estimate AUC with cross-validation, e.g. `cross_val_score(model, X, y, scoring='roc_auc', cv=5)`. A model with mean AUC = 0.89 ± 0.02 across folds is far more trustworthy than one with a single holdout AUC of 0.92 but high variance (see the sketch below).
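Putting the scaler-leakage fix and the cross-validation rule together, one possible pattern (a sketch using the same synthetic dataset as the pipeline above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)

# The pipeline re-fits the scaler inside every CV fold, so no test-fold
# statistics ever leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
cv_auc = cross_val_score(pipe, X, y, scoring='roc_auc', cv=5)

print(f"Cross-validated AUC: {cv_auc.mean():.3f} ± {cv_auc.std():.3f}")
```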