
Logistic Regression - Binary Classification from Intuition to Math

A complete guide to Logistic Regression covering the sigmoid function, log-odds derivation, binary cross-entropy loss, and gradient descent β€” followed by a deep dive into evaluation metrics: confusion matrix, precision, recall, F1-score, ROC-AUC, and threshold tuning. Includes real clinical stories, inline SVG diagrams, step-by-step calculations, regularisation techniques, multiclass extension with Softmax, and full Python implementations from scratch and with scikit-learn.

Section 01

The Story: Dr. Sharma's Diagnosis Machine

A Doctor Who Needed a Yes or No Answer
Dr. Sharma is an oncologist in Bengaluru. Every week she reviews dozens of biopsy reports, each containing measurements: tumour size (mm), cell uniformity score, clump thickness, and more. Her task is binary β€” Benign or Malignant?

She calls her data-science intern, Arjun, and says: "Build me a model that looks at these numbers and tells me the probability that a tumour is malignant. Not a continuous value β€” a probability between 0 and 1."

Arjun immediately knows that Linear Regression won't work here. A straight line can predict values below 0 and above 1 β€” meaningless for a probability. He reaches for Logistic Regression.

Logistic Regression is the go-to algorithm for binary classification β€” problems where the output is one of two classes (Yes/No, Spam/Not-Spam, Fraud/Legit, Malignant/Benign). Despite its name, it is a classification algorithm, not regression. It predicts the probability that an observation belongs to a class, then applies a threshold to make the final class decision.


Section 02

Why Linear Regression Fails for Classification

❌ Linear Regression on a Binary Problem
Problem | Example Output
Predictions go below 0 | Ε· = βˆ’0.34 for a small tumour
Predictions go above 1 | Ε· = 1.72 for a huge tumour
No natural threshold | What does 0.73 kg of "malignant" mean?
Sensitive to outliers | One extreme point tilts the whole line
βœ… Logistic Regression on a Binary Problem
Property | Result
Output always in [0, 1] | Ε· = 0.82 β†’ 82% chance of malignant
Interpretable threshold | Ε· β‰₯ 0.5 β†’ predict Malignant
Probabilistic output | "82% confident it is malignant"
Well-defined loss function | Binary Cross-Entropy (Log Loss)
πŸ’‘
The Core Trick

Logistic Regression takes the linear equation z = Ξ²β‚€ + β₁x₁ + … + Ξ²β‚™xβ‚™ and passes it through a Sigmoid function that squashes any real number into the range (0, 1). This output is then interpreted as a probability.


Section 03

The Sigmoid Function β€” The Heart of Logistic Regression

Sigmoid Function Οƒ(z)
Οƒ(z) = 1 / (1 + e⁻ᢻ)
Takes any real number z (βˆ’βˆž to +∞) and maps it smoothly to the range (0, 1). Also called the logistic function.
Key Properties
Οƒ(0) = 0.5  |  Οƒ(+∞) β†’ 1  |  Οƒ(βˆ’βˆž) β†’ 0
Symmetric around 0. When z is large positive β†’ probability β‰ˆ 1. When z is large negative β†’ probability β‰ˆ 0.
πŸ“ˆ The Sigmoid Curve β€” Οƒ(z) = 1 / (1 + e⁻ᢻ)
[Figure: the sigmoid curve Οƒ(z) for z (the linear combination Ξ²β‚€ + β₁x) from βˆ’6 to +6, with Οƒ(z) on the vertical axis from 0 to 1. Οƒ(0) = 0.5 marks the split: left of z = 0 β†’ Predict Class 0, right of z = 0 β†’ Predict Class 1.]

The sigmoid squashes any linear score z into (0,1). Below z=0 the model leans toward Class 0; above z=0 it leans toward Class 1. The exact threshold is tunable.

Sigmoid Key Values at a Glance

z (linear score) | Οƒ(z) (probability) | Interpretation
βˆ’6 | 0.002 | Very confident β€” Class 0
βˆ’2 | 0.119 | Likely Class 0
0 | 0.500 | Completely uncertain
+2 | 0.881 | Likely Class 1
+6 | 0.998 | Very confident β€” Class 1
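
These values are quick to verify in NumPy. A minimal sketch of the sigmoid (only the function name is ours; nothing else is assumed):

import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Reproduce the key values from the table above
for z in [-6, -2, 0, 2, 6]:
    print(f"Οƒ({z:+d}) = {sigmoid(z):.3f}")
# Οƒ(-6) = 0.002, Οƒ(-2) = 0.119, Οƒ(+0) = 0.500, Οƒ(+2) = 0.881, Οƒ(+6) = 0.998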

Section 04

The Full Logistic Regression Model

01
Compute the Linear Score (Logit) z
Exactly like Linear Regression: multiply each feature by its weight and sum them up.
z = Ξ²β‚€ + β₁x₁ + Ξ²β‚‚xβ‚‚ + … + Ξ²β‚™xβ‚™
For Dr. Sharma: z = Ξ²β‚€ + β₁·(tumour_size) + Ξ²β‚‚Β·(cell_uniformity)
02
Apply the Sigmoid to Get a Probability
pΜ‚ = Οƒ(z) = 1 / (1 + e⁻ᢻ)
This is the model's estimated probability that the sample belongs to Class 1 (Malignant). E.g. pΜ‚ = 0.87 β†’ 87% chance of malignancy.
03
Apply a Decision Threshold
Convert the probability to a hard class label:
Ε· = 1 if pΜ‚ β‰₯ 0.5, else Ε· = 0
The default threshold is 0.5 but it is tunable. In cancer screening you might lower it to 0.3 to minimise false negatives (missed cancers).
04
Interpret the Result
Ε· = 0 β†’ Predicted Benign  |  Ε· = 1 β†’ Predicted Malignant.
Report both the hard label and the probability to the doctor. "87% confident this is malignant" is far more actionable than just "Malignant."
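
To make the four steps concrete, here is a small sketch with made-up coefficients; Ξ²β‚€ = βˆ’4.0, β₁ = 0.35 per mm and Ξ²β‚‚ = 0.5 per uniformity point are purely illustrative, not fitted values:

import numpy as np

# Hypothetical coefficients, for illustration only
beta0, beta1, beta2 = -4.0, 0.35, 0.5

tumour_size, cell_uniformity = 14.0, 3.0                       # one biopsy report

z     = beta0 + beta1 * tumour_size + beta2 * cell_uniformity  # Step 1: linear score
p_hat = 1 / (1 + np.exp(-z))                                   # Step 2: sigmoid
y_hat = int(p_hat >= 0.5)                                      # Step 3: threshold

# Step 4: report both the probability and the hard label
print(f"z = {z:.2f}, pΜ‚ = {p_hat:.2f}, prediction = {'Malignant' if y_hat else 'Benign'}")
# z = 2.40, pΜ‚ = 0.92, prediction = Malignant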

Section 05

The Math Behind It β€” Log-Odds and the Logit

Why is it called logistic regression? Because it models the log-odds of the outcome as a linear function of the features.

πŸ“ Deriving the Logit from Probability
Step 1
Start with the probability: pΜ‚ = 1 / (1 + e⁻ᢻ)
Rearrange to isolate z: multiply both sides by (1 + eβˆ’z)
pΜ‚ Β· (1 + e⁻ᢻ) = 1  β†’  e⁻ᢻ = (1 βˆ’ pΜ‚) / pΜ‚
Step 2
Take the natural logarithm of both sides:
βˆ’z = ln((1 βˆ’ pΜ‚) / pΜ‚)  β†’  z = ln(pΜ‚ / (1 βˆ’ pΜ‚))
The term pΜ‚ / (1 βˆ’ pΜ‚) is called the odds.
Result
ln( pΜ‚ / (1βˆ’pΜ‚) ) = Ξ²β‚€ + β₁x₁ + Ξ²β‚‚xβ‚‚ + … + Ξ²β‚™xβ‚™
The log-odds (left side) is modelled as a linear combination of features. This is why the algorithm is called logistic regression: logistic transformation of a regression equation.
Probability pΜ‚ | Odds pΜ‚/(1βˆ’pΜ‚) | Log-Odds (z)
0.10 | 0.111 | βˆ’2.20
0.25 | 0.333 | βˆ’1.10
0.50 | 1.000 | 0.00
0.75 | 3.000 | +1.10
0.90 | 9.000 | +2.20
πŸ“
Interpreting Coefficients

Each Ξ² coefficient represents the change in log-odds per unit increase in the feature. Exponentiate it to get the odds ratio: e^β₁. If β₁ = 0.8 for tumour size, then e^0.8 β‰ˆ 2.23 β€” each extra mm of tumour size multiplies the odds of malignancy by 2.23Γ—. This is the medical interpretation Dr. Sharma cares about.
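
This interpretation is easy to check numerically. A sketch assuming the illustrative β₁ = 0.8 from the paragraph above:

import numpy as np

beta1 = 0.8                        # illustrative coefficient for tumour size
print(f"Odds ratio per extra mm: {np.exp(beta1):.2f}")               # e^0.8 β‰ˆ 2.23

# Equivalent check: adding β₁ to the logit multiplies the odds pΜ‚/(1βˆ’pΜ‚) by e^β₁
def odds(z):
    p = 1 / (1 + np.exp(-z))
    return p / (1 - p)

z = 0.3                            # any starting logit
print(f"odds(z + β₁) / odds(z) = {odds(z + beta1) / odds(z):.2f}")   # also β‰ˆ 2.23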


Section 06

The Cost Function β€” Binary Cross-Entropy (Log Loss)

Logistic Regression cannot use the same Mean Squared Error cost as Linear Regression β€” that gives a non-convex surface with many local minima. Instead it uses Binary Cross-Entropy, also called Log Loss.

Log Loss for a Single Sample
L = βˆ’[ yΒ·log(pΜ‚) + (1βˆ’y)Β·log(1βˆ’pΜ‚) ]
When y=1 (actual positive): only βˆ’log(pΜ‚) matters β†’ penalises low confidence.
When y=0 (actual negative): only βˆ’log(1βˆ’pΜ‚) matters β†’ penalises high confidence.
Average Log Loss Over All n Samples
J(Ξ²) = βˆ’(1/n) Β· Ξ£[ yα΅’Β·log(pΜ‚α΅’) + (1βˆ’yα΅’)Β·log(1βˆ’pΜ‚α΅’) ]
This is the function we minimise with gradient descent. It is convex β€” guaranteed to have a single global minimum.
πŸ“Š How Log Loss Punishes Wrong Confident Predictions
[Figure: two panels of loss vs pΜ‚ (predicted probability). Left, loss when y = 1 (Actual: Malignant): loss is high near pΜ‚ = 0 (e.g. predicted 0.05 but actual = 1) and low near pΜ‚ = 1. Right, loss when y = 0 (Actual: Benign): loss is low near pΜ‚ = 0 and high near pΜ‚ = 1 (e.g. predicted 0.95 but actual = 0).]

Log Loss is asymmetric and explodes toward infinity when the model is confidently wrong. This harsh penalty pushes the model toward well-calibrated probabilities.
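
A few evaluations of the single-sample loss show how sharply the penalty grows; a minimal sketch:

import numpy as np

def sample_loss(y, p_hat):
    """Binary cross-entropy for one prediction."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

for p in [0.95, 0.70, 0.50, 0.30, 0.05]:
    print(f"y=1, pΜ‚={p:.2f}  β†’  loss = {sample_loss(1, p):.3f}")
# pΜ‚=0.95 gives 0.051, pΜ‚=0.50 gives 0.693, pΜ‚=0.05 gives 2.996:
# a confidently wrong prediction costs roughly 60x more than a confident correct one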


Section 07

Gradient Descent β€” Finding the Best Weights

Unlike Linear Regression, Logistic Regression has no closed-form solution. We minimise J(Ξ²) iteratively using Gradient Descent.

Gradient of Log Loss w.r.t. Ξ²β±Ό
βˆ‚J/βˆ‚Ξ²β±Ό = (1/n) Β· Ξ£ (pΜ‚α΅’ βˆ’ yα΅’) Β· xα΅’β±Ό
Beautifully simple β€” same form as Linear Regression's gradient. The difference (pΜ‚α΅’ βˆ’ yα΅’) is the prediction error for sample i.
Weight Update Rule (Gradient Descent)
Ξ²β±Ό ← Ξ²β±Ό βˆ’ Ξ± Β· βˆ‚J/βˆ‚Ξ²β±Ό
Ξ± is the learning rate. Repeat this update for every coefficient Ξ²β±Ό on every iteration until the loss converges to its minimum.
πŸ” One Full Gradient Descent Iteration (1 Feature Example)
Init
Start with Ξ²β‚€ = 0, β₁ = 0, learning rate Ξ± = 0.1.
Training data: 4 tumours β€” sizes [2, 3, 5, 7], labels [0, 0, 1, 1].
Forward
Compute z = Ξ²β‚€ + β₁·x for each sample: [0, 0, 0, 0] (all zero at init).
Apply sigmoid: pΜ‚ = Οƒ(z) = [0.5, 0.5, 0.5, 0.5].
Errors (pΜ‚ βˆ’ y): [0.5βˆ’0, 0.5βˆ’0, 0.5βˆ’1, 0.5βˆ’1] = [0.5, 0.5, βˆ’0.5, βˆ’0.5].
Gradient
βˆ‚J/βˆ‚Ξ²β‚€ = (1/4)Β·(0.5+0.5βˆ’0.5βˆ’0.5) = 0.0
βˆ‚J/βˆ‚Ξ²β‚ = (1/4)Β·(0.5Β·2 + 0.5Β·3 + (βˆ’0.5)Β·5 + (βˆ’0.5)Β·7) = (1/4)Β·(1+1.5βˆ’2.5βˆ’3.5) = (1/4)Β·(βˆ’3.5) = βˆ’0.875
Update
Ξ²β‚€ ← 0 βˆ’ 0.1 Β· 0.0 = 0.0
β₁ ← 0 βˆ’ 0.1 Β· (βˆ’0.875) = +0.0875
The weight for tumour size has gone positive β€” larger tumours now get higher probability. βœ“
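
The same single iteration can be reproduced in a few lines of NumPy (this sketch covers only the one step shown above, not a full training loop):

import numpy as np

X = np.array([2.0, 3.0, 5.0, 7.0])     # tumour sizes
y = np.array([0.0, 0.0, 1.0, 1.0])     # labels
beta0, beta1, lr = 0.0, 0.0, 0.1

p_hat = 1 / (1 + np.exp(-(beta0 + beta1 * X)))   # forward pass: all 0.5 at init
error = p_hat - y                                # [0.5, 0.5, -0.5, -0.5]
grad0 = error.mean()                             # 0.0
grad1 = (error * X).mean()                       # -0.875

beta0 -= lr * grad0
beta1 -= lr * grad1
print(beta0, beta1)                              # 0.0  0.0875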
πŸ’‘
Convexity Guarantee

The Binary Cross-Entropy loss surface is convex with respect to Ξ². This means gradient descent is guaranteed to find the global minimum β€” there are no local minima to get trapped in. This is why Logistic Regression is so reliable compared to non-convex models like neural networks.


Section 08

The Decision Boundary

Once trained, Logistic Regression draws a linear decision boundary in feature space. Points on one side get Class 0, points on the other get Class 1.

πŸ“Š Decision Boundary β€” Tumour Classification (2 Features)
[Figure: scatter plot of Tumour Size (mm) vs Cell Uniformity Score. Benign (Class 0) and Malignant (Class 1) points fall on opposite sides of a straight decision boundary.]

Logistic Regression creates a linear boundary: Ξ²β‚€ + β₁·size + Ξ²β‚‚Β·uniformity = 0. Points with z < 0 are predicted Benign; points with z > 0 are predicted Malignant.

⚠️
Logistic Regression is Inherently Linear

The decision boundary is always a straight line (2D), plane (3D), or hyperplane (nD). If your classes are not linearly separable β€” for example, one class forms a ring around the other β€” Logistic Regression will struggle. Solutions: add polynomial features, use kernel methods, or switch to tree-based models.
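
The polynomial-feature fix is easy to try. The sketch below uses scikit-learn's make_circles to build a ring-shaped dataset that no straight line can separate; the degree-2 pipeline adds squared terms, so the learned boundary can become a circle (the accuracies in the comments are what you would typically see, not exact guarantees):

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# One class forms a ring around the other
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=42)

linear = make_pipeline(StandardScaler(), LogisticRegression())
poly   = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LogisticRegression())

print("Linear features  :", linear.fit(X, y).score(X, y))   # close to 0.5 (chance level)
print("Degree-2 features:", poly.fit(X, y).score(X, y))     # close to 1.0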


Section 09

Evaluating the Model β€” The Confusion Matrix

Accuracy alone is dangerous for classification. If 95% of tumours are benign, a model that always predicts "Benign" gets 95% accuracy β€” but misses every malignant case. The Confusion Matrix gives the full picture.

πŸŸ₯ Confusion Matrix β€” Dr. Sharma's Model on 100 Test Samples
                         Predicted Benign (0)                          Predicted Malignant (1)
Actual Benign (0)        TN = 57 (correctly said Benign)               FP = 5 (wrongly said Malignant, Type I Error)
Actual Malignant (1)     FN = 3 (missed a Malignant! Type II Error)    TP = 35 (correctly said Malignant)
Term | Abbr | Count | What Happened
True Negative | TN | 57 | Actual Benign, predicted Benign βœ“
False Positive | FP | 5 | Actual Benign, predicted Malignant (unnecessary biopsy)
False Negative | FN | 3 | Actual Malignant, predicted Benign β€” most dangerous!
True Positive | TP | 35 | Actual Malignant, predicted Malignant βœ“

Section 10

Precision, Recall, and F1-Score

Accuracy
(TP + TN) / (TP + TN + FP + FN)
(35+57)/100 = 92% β€” misleading on imbalanced data
Precision (Positive Predictive Value)
TP / (TP + FP)
35/(35+5) = 87.5% β€” of all predicted Malignant, 87.5% were correct
Recall (Sensitivity / True Positive Rate)
TP / (TP + FN)
35/(35+3) = 92.1% β€” of all actual Malignant, model found 92.1%
F1-Score (Harmonic Mean of P and R)
2 Β· (Precision Β· Recall) / (Precision + Recall)
2Β·(0.875Β·0.921)/(0.875+0.921) = 89.7% β€” balanced single metric
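
All four metrics can be computed directly from the confusion-matrix counts. A small sketch using the TN/FP/FN/TP values from Dr. Sharma's model in Section 09:

tn, fp, fn, tp = 57, 5, 3, 35            # counts from the confusion matrix above

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.3f}")      # 0.920
print(f"Precision: {precision:.3f}")     # 0.875
print(f"Recall   : {recall:.3f}")        # 0.921
print(f"F1-score : {f1:.3f}")            # 0.897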
Why Dr. Sharma Cares More About Recall Than Precision
In cancer screening, a False Negative (missed cancer) can be fatal β€” the patient goes home thinking they are fine. A False Positive (unnecessary biopsy) is costly and stressful, but survivable.

So Dr. Sharma sets the decision threshold at 0.3 instead of 0.5 β€” more tumours are flagged as suspicious, reducing FN at the cost of more FP. Recall goes up from 92% to 97%; Precision drops to 74%. The F1 score drops slightly β€” but in medicine, high recall is worth the trade.

This is why you should never optimise for accuracy alone. Always ask: "What is the cost of a False Negative vs a False Positive in this domain?"
Metric | Use When | Avoid When
Accuracy | Classes are balanced | Imbalanced datasets (95% Class 0)
Precision | Cost of FP is high (spam filter β€” annoying if legitimate email goes to spam) | Cost of FN is high
Recall | Cost of FN is high (disease detection, fraud) | Cost of FP is high
F1-Score | Need a single metric balancing P and R | When P and R should not be equally weighted
ROC-AUC | Comparing models at all thresholds | Severely imbalanced data (use PR-AUC instead)
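
Threshold tuning like Dr. Sharma's can be automated with scikit-learn's precision_recall_curve, which evaluates precision and recall at every candidate threshold. A sketch on the Wisconsin Breast Cancer data; the recall target of 0.97 mirrors the story above, and since label 0 is malignant in this dataset we threshold P(malignant):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)          # 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           random_state=42, stratify=y)
scaler = StandardScaler()
model  = LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_tr), y_tr)

# Precision and recall for the malignant class at every possible threshold
p_malignant = model.predict_proba(scaler.transform(X_te))[:, 0]
precision, recall, thresholds = precision_recall_curve(y_te, p_malignant, pos_label=0)

# Highest threshold on P(malignant) that still reaches the recall we need
target_recall = 0.97
ok = recall[:-1] >= target_recall                    # recall has one extra trailing entry
print(f"Use threshold {thresholds[ok].max():.2f} for recall β‰₯ {target_recall}")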

Section 11

ROC Curve and AUC

The ROC Curve (Receiver Operating Characteristic) plots the True Positive Rate (Recall) against the False Positive Rate at every possible threshold from 0 to 1. The area under this curve β€” AUC β€” summarises model quality in a single number.

πŸ“ˆ ROC Curve β€” Dr. Sharma's Model (AUC β‰ˆ 0.97)
[Figure: ROC curve, True Positive Rate (Recall) vs False Positive Rate (1 βˆ’ Specificity), both from 0.0 to 1.0. The diagonal is random guessing (AUC = 0.5); the model's curve bows toward the top-left corner (AUC = 0.97); a perfect classifier hugs the corner (AUC = 1.0).]

AUC = 0.97 means the model ranks a random positive sample above a random negative sample 97% of the time. AUC is threshold-independent and works for imbalanced datasets.
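
The curve itself comes from scikit-learn's roc_curve, which returns one (FPR, TPR) pair per threshold. A sketch on a tiny set of made-up labels and scores (illustrative only, not Dr. Sharma's data):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.80, 0.40, 0.70, 0.85, 0.95])   # predicted P(class 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold β‰₯ {th:.2f}  β†’  FPR = {f:.2f}, TPR = {t:.2f}")

print("AUC =", roc_auc_score(y_true, y_score))   # 0.875 for these scores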

AUC = 0.5
🎲
Random guessing. Model has learned nothing. Your baseline to beat.
AUC = 0.7–0.9
πŸ‘
Acceptable to good discrimination. Useful in many real-world problems.
AUC = 0.9–1.0
πŸ†
Excellent. Check for data leakage β€” often too good to be true on real data.

Section 12

Python β€” Logistic Regression from Scratch

import numpy as np

# ── Sigmoid function ─────────────────────────────────────────
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ── Binary Cross-Entropy Loss ────────────────────────────────
def log_loss(y, p_hat):
    n = len(y)
    eps = 1e-15                           # clip to avoid log(0)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(1 / n) * np.sum(
        y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)
    )

# ── Training with Gradient Descent ──────────────────────────
def train_logistic(X, y, lr=0.1, epochs=1000):
    n, p  = X.shape
    beta  = np.zeros(p + 1)               # Ξ²β‚€ + p weights
    X_b   = np.column_stack([np.ones(n), X])  # add bias column

    for epoch in range(epochs):
        z      = X_b @ beta                # linear score
        p_hat  = sigmoid(z)              # predicted probability
        error  = p_hat - y                 # residual (pΜ‚α΅’ βˆ’ yα΅’)
        grad   = (1 / n) * (X_b.T @ error) # gradient vector
        beta  -= lr * grad                 # update weights

        if epoch % 100 == 0:
            loss = log_loss(y, p_hat)
            print(f"Epoch {epoch:4d}  |  Loss: {loss:.4f}")

    return beta

# ── Prediction ───────────────────────────────────────────────
def predict_proba(X, beta):
    X_b = np.column_stack([np.ones(len(X)), X])
    return sigmoid(X_b @ beta)

def predict(X, beta, threshold=0.5):
    return (predict_proba(X, beta) >= threshold).astype(int)

# ── Toy dataset: 2 synthetic features, linearly separable ───
np.random.seed(42)
X_train = np.random.randn(100, 2)        # 100 samples, 2 features
y_train = ((X_train[:, 0] + X_train[:, 1]) > 0).astype(float)

beta = train_logistic(X_train, y_train, lr=0.5, epochs=500)
print(f"\nLearned weights: Ξ²β‚€={beta[0]:.3f}  β₁={beta[1]:.3f}  Ξ²β‚‚={beta[2]:.3f}")

proba = predict_proba(X_train[:3], beta)
print(f"Predicted probabilities (first 3): {proba.round(3)}")
Output
Epoch    0  |  Loss: 0.6931
Epoch  100  |  Loss: 0.2014
Epoch  200  |  Loss: 0.1687
Epoch  300  |  Loss: 0.1563
Epoch  400  |  Loss: 0.1493

Learned weights: Ξ²β‚€=βˆ’0.012  β₁=2.184  Ξ²β‚‚=2.173
Predicted probabilities (first 3): [0.089 0.965 0.213]

Section 13

Python β€” Full Pipeline with scikit-learn

import numpy as np
from sklearn.datasets         import load_breast_cancer
from sklearn.model_selection  import train_test_split
from sklearn.preprocessing    import StandardScaler
from sklearn.linear_model     import LogisticRegression
from sklearn.metrics          import (
    classification_report, confusion_matrix,
    roc_auc_score, RocCurveDisplay
)

# ── 1. Load Data ─────────────────────────────────────────────
data    = load_breast_cancer()        # Wisconsin Breast Cancer dataset
X, y    = data.data, data.target       # 569 samples, 30 features

# ── 2. Train / Test Split ────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── 3. Feature Scaling (critical for Logistic Regression) ───
scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)     # transform only β€” no fit!

# ── 4. Train Model ───────────────────────────────────────────
model = LogisticRegression(
    C=1.0,             # inverse of regularisation strength (1/Ξ»)
    solver='lbfgs',   # efficient solver for small-medium datasets
    max_iter=1000,
    random_state=42
)
model.fit(X_train, y_train)

# ── 5. Evaluate ──────────────────────────────────────────────
y_pred       = model.predict(X_test)
y_proba      = model.predict_proba(X_test)[:, 1]  # P(class 1 = benign)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {auc:.4f}")

# ── 6. Tune threshold to catch more malignant tumours ───────
# In this dataset target 0 = malignant and 1 = benign, so we
# threshold P(malignant) = predict_proba(...)[:, 0] directly.
p_malignant = model.predict_proba(X_test)[:, 0]
threshold   = 0.3
y_pred_low  = (p_malignant < threshold).astype(int)   # 1 (benign) only when P(malignant) < 0.3
print(f"\nAt threshold={threshold}:")
print(classification_report(y_test, y_pred_low, target_names=data.target_names))
Output
Confusion Matrix:
[[41  2]
 [ 1 70]]

Classification Report:
              precision    recall  f1-score   support
   malignant       0.98      0.95      0.96        43
      benign       0.97      0.99      0.98        71
    accuracy                           0.97       114

ROC-AUC Score: 0.9971

At threshold=0.3:
              precision    recall  f1-score   support
   malignant       0.93      1.00      0.96        43
      benign       1.00      0.96      0.98        71
⚠️
Always Scale Before Logistic Regression

Logistic Regression uses gradient descent or solver optimisation, both of which converge far slower (or incorrectly) when features have very different scales. A feature in thousands (income) will dominate one in decimals (age in decades). Always apply StandardScaler or MinMaxScaler first β€” and fit the scaler only on training data, then transform both train and test.


Section 14

Regularisation β€” Preventing Overfitting

When there are many features (especially more features than samples), Logistic Regression can overfit β€” it memorises training noise. Regularisation adds a penalty term to the cost function to keep weights small.

πŸ”΅
L2 Regularisation (Ridge)
Adds λ · Σβⱼ² to the cost. Shrinks all weights toward zero but never to exactly zero. Keeps all features in the model with smaller coefficients. Best when most features are useful.
sklearn: penalty='l2' C=1/Ξ» (default)
🟒
L1 Regularisation (Lasso)
Adds Ξ» Β· Ξ£|Ξ²β±Ό| to the cost. Can shrink weights to exactly zero, performing automatic feature selection. Best when most features are irrelevant.
sklearn: penalty='l1' solver='saga'
🟑
ElasticNet (L1 + L2)
Combines both penalties. Sparsity from L1, stability from L2. Best of both worlds when you have many features, some of which are correlated.
sklearn: penalty='elasticnet' l1_ratio=0.5
C Value (sklearn) | Regularisation Strength | Effect
C = 0.001 | Very Strong (Ξ» = 1000) | Heavily penalised weights; high bias, low variance
C = 0.1 | Strong | Simpler model, good generalisation
C = 1.0 | Moderate (default) | Balanced β€” good starting point
C = 10 | Weak | Near-unregularised; may overfit
C = 1000 | None (Ξ» β‰ˆ 0) | Pure maximum likelihood; overfits on small datasets
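
In practice C is chosen by cross-validation rather than by hand. A sketch using LogisticRegressionCV; the 13-point grid, cv=5 and ROC-AUC scoring are arbitrary illustrative choices:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=np.logspace(-3, 3, 13),   # candidate C values from 0.001 to 1000
                         cv=5, penalty='l2',
                         scoring='roc_auc', max_iter=5000)
)
model.fit(X, y)
print("Best C:", model.named_steps['logisticregressioncv'].C_[0])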

Section 15

Multiclass Logistic Regression

Standard Logistic Regression is binary. Two strategies extend it to multiple classes:

βš”οΈ
One-vs-Rest (OvR)
multi_class='ovr'
Train K separate binary classifiers β€” one per class. Each model asks: "Is this sample Class k or not?" Assign the class with highest probability. Fast and interpretable.
βœ“ Simple Β· Fast Β· Works with any binary classifier
βœ— Probabilities don't sum to 1 exactly
🌐
Softmax (Multinomial)
multi_class='multinomial'
Train one model with K output nodes. Uses the Softmax function to convert all K linear scores into a probability distribution that sums to 1. More principled and usually better calibrated.
βœ“ True probability distribution Β· Better calibrated
βœ— Slower on very large K
πŸ”’
Softmax Formula
P(Class k) = e^zβ‚– / Ξ£β±Ό e^zβ±Ό
Each class k gets a linear score zβ‚– = Ξ²β‚–α΅€x. Softmax normalises these into probabilities. The class with the highest probability wins. This is the foundation of deep learning classification heads.
βœ“ Generalisable to neural networks
βœ— Requires more training data per class
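
Before handing this to scikit-learn, note that the Softmax formula itself is only a few lines of NumPy. A sketch with three made-up class scores (the max-subtraction is a standard trick for numerical stability):

import numpy as np

def softmax(z):
    """Turn K linear scores into a probability distribution that sums to 1."""
    z = z - z.max()                       # stability: avoids overflow in exp
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, -1.0])      # illustrative zβ‚– for K = 3 classes
print(softmax(scores))                    # [0.705 0.259 0.035], sums to 1

scikit-learn's multinomial mode below applies the same Softmax on top of fitted coefficients: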
from sklearn.linear_model import LogisticRegression
from sklearn.datasets     import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# ── Iris: 3 classes ──────────────────────────────────────────
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# ── Multinomial (Softmax) Logistic Regression ────────────────
model = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    C=1.0,
    max_iter=500
)
model.fit(X_train, y_train)

print(f"Test Accuracy : {model.score(X_test, y_test):.4f}")  # ~0.967

# Probability distribution for first test sample
print("Class probabilities (sample 1):", model.predict_proba(X_test[:1]).round(3))
# [[0.001  0.072  0.927]]  β†’ Class 2 (Virginica) with 92.7% confidence

Section 16

Assumptions of Logistic Regression

πŸ“
Binary (or Ordinal) Outcome
The dependent variable must be categorical. Standard binary LR requires exactly two classes. For multiple classes use multinomial or ordinal variants.
πŸ“‰
Linear Relationship with Log-Odds
Each feature must have a linear relationship with the log-odds of the outcome. Highly non-linear relationships require feature engineering or a different algorithm.
🚫
No Multicollinearity
Highly correlated features destabilise coefficients. Remove or combine correlated features, or use L2 regularisation which handles mild multicollinearity well.
πŸ“¦
Large Sample Size
Unlike Linear Regression, LR is estimated via Maximum Likelihood which requires larger samples for stable estimates. Rule of thumb: at least 10–20 events per predictor variable.
πŸ”€
Independence of Observations
Observations must be independent. Repeated measures, time series, or clustered data violate this. Use mixed-effects or GEE models for those cases.
πŸ”¬
No Extreme Outliers
Outliers in feature space can strongly influence the decision boundary and inflate coefficients. Check leverage and influence scores. Apply robust scaling if needed.

Section 17

Golden Rules

🎯 Logistic Regression β€” Key Rules
1
Always scale your features. Logistic Regression is sensitive to feature magnitude. Use StandardScaler before training. Fit the scaler on the training set only β€” never leak test-set statistics into preprocessing.
2
Never use accuracy alone on imbalanced data. If 97% of transactions are legitimate, a model that always predicts "Legit" achieves 97% accuracy but catches zero fraud. Always report Precision, Recall, and F1 alongside accuracy.
3
Tune the decision threshold for your domain. The default 0.5 threshold treats false positives and false negatives as equally costly, which rarely matches business value. In medicine, lower it to catch more true positives. In spam filtering, raise it to reduce false positives. Plot the Precision-Recall curve to choose the best operating point.
4
Use ROC-AUC for model comparison (Adjusted RΒ² belongs to regression, not classification). AUC is threshold-independent and works well for comparing model versions. For heavily imbalanced data (fraud < 0.1%), prefer PR-AUC (Area Under the Precision-Recall Curve) over ROC-AUC.
5
Use regularisation by default. The sklearn default of C=1.0 (L2) is a safe starting point. If you have many irrelevant features, try L1 (penalty='l1') for automatic feature selection. Always tune C with cross-validation.
6
Check for perfect separation. If one feature perfectly separates the classes, the MLE algorithm diverges (coefficients β†’ ∞). This is called the complete separation problem. Signs: extremely large coefficients, wide confidence intervals, and convergence warnings. Fix: add regularisation or remove the separating feature.
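
Perfect separation is easy to reproduce on purpose. A sketch with one made-up feature that cleanly splits the classes; without a penalty the coefficient keeps growing with more iterations, while the default L2 penalty keeps it finite:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature that perfectly separates the two classes
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# penalty=None needs scikit-learn β‰₯ 1.2; older versions use penalty='none'
unregularised = LogisticRegression(penalty=None, max_iter=10_000).fit(X, y)
regularised   = LogisticRegression(C=1.0, max_iter=10_000).fit(X, y)    # default L2

print("No penalty:  β₁ =", unregularised.coef_[0][0])   # very large, still growing
print("L2 penalty:  β₁ =", regularised.coef_[0][0])     # modest and stable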