
Logistic Regression - Binary Classification from Intuition to Math

A complete guide to Logistic Regression covering the sigmoid function, log-odds derivation, binary cross-entropy loss, and gradient descent β€” followed by a deep dive into evaluation metrics: confusion matrix, precision, recall, F1-score, ROC-AUC, and threshold tuning. Includes real clinical stories, inline SVG diagrams, step-by-step calculations, regularisation techniques, multiclass extension with Softmax, and full Python implementations from scratch and with scikit-learn.

Section 01

The Story: Dr. Sharma's Diagnosis Machine

A Doctor Who Needed a Yes or No Answer
Dr. Sharma is an oncologist in Bengaluru. Every week she reviews dozens of biopsy reports, each containing measurements: tumour size (mm), cell uniformity score, clump thickness, and more. Her task is binary β€” Benign or Malignant?

She calls her data-science intern, Arjun, and says: "Build me a model that looks at these numbers and tells me the probability that a tumour is malignant. Not a continuous value β€” a probability between 0 and 1."

Arjun immediately knows that Linear Regression won't work here. A straight line can predict values below 0 and above 1 β€” meaningless for a probability. He reaches for Logistic Regression.

Logistic Regression is the go-to algorithm for binary classification β€” problems where the output is one of two classes (Yes/No, Spam/Not-Spam, Fraud/Legit, Malignant/Benign). Despite its name, it is a classification algorithm, not regression. It predicts the probability that an observation belongs to a class, then applies a threshold to make the final class decision.


Section 02

Why Linear Regression Fails for Classification

❌ Linear Regression on a Binary Problem
Problem | Example Output
Predictions go below 0 | Ε· = βˆ’0.34 for a small tumour
Predictions go above 1 | Ε· = 1.72 for a huge tumour
No natural threshold | What does 0.73 kg of "malignant" mean?
Sensitive to outliers | One extreme point tilts the whole line
βœ… Logistic Regression on a Binary Problem
Property | Result
Output always in [0, 1] | Ε· = 0.82 β†’ 82% chance of malignant
Interpretable threshold | Ε· β‰₯ 0.5 β†’ predict Malignant
Probabilistic output | "82% confident it is malignant"
Well-defined loss function | Binary Cross-Entropy (Log Loss)
πŸ’‘
The Core Trick

Logistic Regression takes the linear equation z = Ξ²β‚€ + β₁x₁ + … + Ξ²β‚™xβ‚™ and passes it through a Sigmoid function that squashes any real number into the range (0, 1). This output is then interpreted as a probability.


Section 03

The Sigmoid Function β€” The Heart of Logistic Regression

Sigmoid Function Οƒ(z)
Οƒ(z) = 1 / (1 + e⁻ᢻ)
Takes any real number z (βˆ’βˆž to +∞) and maps it smoothly to the range (0, 1). Also called the logistic function.
Key Properties
Οƒ(0) = 0.5  |  Οƒ(+∞) β†’ 1  |  Οƒ(βˆ’βˆž) β†’ 0
Symmetric around 0. When z is large positive β†’ probability β‰ˆ 1. When z is large negative β†’ probability β‰ˆ 0.
πŸ“ˆ The Sigmoid Curve β€” Οƒ(z) = 1 / (1 + e⁻ᢻ)
[Figure: the sigmoid curve Οƒ(z) for z (the linear combination Ξ²β‚€ + β₁x) from βˆ’6 to +6, with Οƒ(z) on the vertical axis from 0 to 1. Οƒ(0) = 0.5 marks the split: left of z = 0 β†’ Predict Class 0, right of z = 0 β†’ Predict Class 1.]

The sigmoid squashes any linear score z into (0,1). Below z=0 the model leans toward Class 0; above z=0 it leans toward Class 1. The exact threshold is tunable.

Sigmoid Key Values at a Glance

z (linear score) | Οƒ(z) (probability) | Interpretation
βˆ’6 | 0.002 | Very confident β€” Class 0
βˆ’2 | 0.119 | Likely Class 0
0 | 0.500 | Completely uncertain
+2 | 0.881 | Likely Class 1
+6 | 0.998 | Very confident β€” Class 1
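
These values are quick to verify in NumPy. A minimal sketch of the sigmoid (only the function name is ours; nothing else is assumed):

import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Reproduce the key values from the table above
for z in [-6, -2, 0, 2, 6]:
    print(f"Οƒ({z:+d}) = {sigmoid(z):.3f}")
# Οƒ(-6) = 0.002, Οƒ(-2) = 0.119, Οƒ(+0) = 0.500, Οƒ(+2) = 0.881, Οƒ(+6) = 0.998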

Section 04

The Full Logistic Regression Model

01
Compute the Linear Score (Logit) z
Exactly like Linear Regression: multiply each feature by its weight and sum them up.
z = Ξ²β‚€ + β₁x₁ + Ξ²β‚‚xβ‚‚ + … + Ξ²β‚™xβ‚™
For Dr. Sharma: z = Ξ²β‚€ + β₁·(tumour_size) + Ξ²β‚‚Β·(cell_uniformity)
02
Apply the Sigmoid to Get a Probability
pΜ‚ = Οƒ(z) = 1 / (1 + e⁻ᢻ)
This is the model's estimated probability that the sample belongs to Class 1 (Malignant). E.g. pΜ‚ = 0.87 β†’ 87% chance of malignancy.
03
Apply a Decision Threshold
Convert the probability to a hard class label:
Ε· = 1 if pΜ‚ β‰₯ 0.5, else Ε· = 0
The default threshold is 0.5 but it is tunable. In cancer screening you might lower it to 0.3 to minimise false negatives (missed cancers).
04
Interpret the Result
Ε· = 0 β†’ Predicted Benign  |  Ε· = 1 β†’ Predicted Malignant.
Report both the hard label and the probability to the doctor. "87% confident this is malignant" is far more actionable than just "Malignant."
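
To make the four steps concrete, here is a small sketch with made-up coefficients; Ξ²β‚€ = βˆ’4.0, β₁ = 0.35 per mm and Ξ²β‚‚ = 0.5 per uniformity point are purely illustrative, not fitted values:

import numpy as np

# Hypothetical coefficients, for illustration only
beta0, beta1, beta2 = -4.0, 0.35, 0.5

tumour_size, cell_uniformity = 14.0, 3.0                       # one biopsy report

z     = beta0 + beta1 * tumour_size + beta2 * cell_uniformity  # Step 1: linear score
p_hat = 1 / (1 + np.exp(-z))                                   # Step 2: sigmoid
y_hat = int(p_hat >= 0.5)                                      # Step 3: threshold

# Step 4: report both the probability and the hard label
print(f"z = {z:.2f}, pΜ‚ = {p_hat:.2f}, prediction = {'Malignant' if y_hat else 'Benign'}")
# z = 2.40, pΜ‚ = 0.92, prediction = Malignant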

Section 05

The Math Behind It β€” Log-Odds and the Logit

Why is it called logistic regression? Because it models the log-odds of the outcome as a linear function of the features.

πŸ“ Deriving the Logit from Probability
Step 1
Start with the probability: pΜ‚ = 1 / (1 + e⁻ᢻ)
Rearrange to isolate z: multiply both sides by (1 + eβˆ’z)
pΜ‚ Β· (1 + e⁻ᢻ) = 1  β†’  e⁻ᢻ = (1 βˆ’ pΜ‚) / pΜ‚
Step 2
Take the natural logarithm of both sides:
βˆ’z = ln((1 βˆ’ pΜ‚) / pΜ‚)  β†’  z = ln(pΜ‚ / (1 βˆ’ pΜ‚))
The term pΜ‚ / (1 βˆ’ pΜ‚) is called the odds.
Result
ln( pΜ‚ / (1βˆ’pΜ‚) ) = Ξ²β‚€ + β₁x₁ + Ξ²β‚‚xβ‚‚ + … + Ξ²β‚™xβ‚™
The log-odds (left side) is modelled as a linear combination of features. This is why the algorithm is called logistic regression: logistic transformation of a regression equation.
Probability pΜ‚ | Odds pΜ‚/(1βˆ’pΜ‚) | Log-Odds (z)
0.10 | 0.111 | βˆ’2.20
0.25 | 0.333 | βˆ’1.10
0.50 | 1.000 | 0.00
0.75 | 3.000 | +1.10
0.90 | 9.000 | +2.20
πŸ“
Interpreting Coefficients

Each Ξ² coefficient represents the change in log-odds per unit increase in the feature. Exponentiate it to get the odds ratio: e^β₁. If β₁ = 0.8 for tumour size, then e^0.8 β‰ˆ 2.23 β€” each extra mm of tumour size multiplies the odds of malignancy by 2.23Γ—. This is the medical interpretation Dr. Sharma cares about.
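
This interpretation is easy to check numerically. A sketch assuming the illustrative β₁ = 0.8 from the paragraph above:

import numpy as np

beta1 = 0.8                        # illustrative coefficient for tumour size
print(f"Odds ratio per extra mm: {np.exp(beta1):.2f}")               # e^0.8 β‰ˆ 2.23

# Equivalent check: adding β₁ to the logit multiplies the odds pΜ‚/(1βˆ’pΜ‚) by e^β₁
def odds(z):
    p = 1 / (1 + np.exp(-z))
    return p / (1 - p)

z = 0.3                            # any starting logit
print(f"odds(z + β₁) / odds(z) = {odds(z + beta1) / odds(z):.2f}")   # also β‰ˆ 2.23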


Section 06

The Cost Function β€” Binary Cross-Entropy (Log Loss)

Logistic Regression cannot use the same Mean Squared Error cost as Linear Regression β€” that gives a non-convex surface with many local minima. Instead it uses Binary Cross-Entropy, also called Log Loss.

Log Loss for a Single Sample
L = βˆ’[ yΒ·log(pΜ‚) + (1βˆ’y)Β·log(1βˆ’pΜ‚) ]
When y=1 (actual positive): only βˆ’log(pΜ‚) matters β†’ penalises low confidence.
When y=0 (actual negative): only βˆ’log(1βˆ’pΜ‚) matters β†’ penalises high confidence.
Average Log Loss Over All n Samples
J(Ξ²) = βˆ’(1/n) Β· Ξ£[ yα΅’Β·log(pΜ‚α΅’) + (1βˆ’yα΅’)Β·log(1βˆ’pΜ‚α΅’) ]
This is the function we minimise with gradient descent. It is convex β€” guaranteed to have a single global minimum.
πŸ“Š How Log Loss Punishes Wrong Confident Predictions
[Figure: two panels of loss vs pΜ‚ (predicted probability). Left, loss when y = 1 (Actual: Malignant): loss is high near pΜ‚ = 0 (e.g. predicted 0.05 but actual = 1) and low near pΜ‚ = 1. Right, loss when y = 0 (Actual: Benign): loss is low near pΜ‚ = 0 and high near pΜ‚ = 1 (e.g. predicted 0.95 but actual = 0).]

Log Loss is asymmetric and explodes toward infinity when the model is confidently wrong. This harsh penalty pushes the model toward well-calibrated probabilities.
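
A few evaluations of the single-sample loss show how sharply the penalty grows; a minimal sketch:

import numpy as np

def sample_loss(y, p_hat):
    """Binary cross-entropy for one prediction."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

for p in [0.95, 0.70, 0.50, 0.30, 0.05]:
    print(f"y=1, pΜ‚={p:.2f}  β†’  loss = {sample_loss(1, p):.3f}")
# pΜ‚=0.95 gives 0.051, pΜ‚=0.50 gives 0.693, pΜ‚=0.05 gives 2.996:
# a confidently wrong prediction costs roughly 60x more than a confident correct one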


Section 07

Gradient Descent β€” Finding the Best Weights

Unlike Linear Regression, Logistic Regression has no closed-form solution. We minimise J(Ξ²) iteratively using Gradient Descent.

Gradient of Log Loss w.r.t. Ξ²β±Ό
βˆ‚J/βˆ‚Ξ²β±Ό = (1/n) Β· Ξ£ (pΜ‚α΅’ βˆ’ yα΅’) Β· xα΅’β±Ό
Beautifully simple β€” same form as Linear Regression's gradient. The difference (pΜ‚α΅’ βˆ’ yα΅’) is the prediction error for sample i.
Weight Update Rule (Gradient Descent)
Ξ²β±Ό ← Ξ²β±Ό βˆ’ Ξ± Β· βˆ‚J/βˆ‚Ξ²β±Ό
Ξ± is the learning rate. Repeat this update for every coefficient Ξ²β±Ό on every iteration until the loss converges to its minimum.
πŸ” One Full Gradient Descent Iteration (1 Feature Example)
Init
Start with Ξ²β‚€ = 0, β₁ = 0, learning rate Ξ± = 0.1.
Training data: 4 tumours β€” sizes [2, 3, 5, 7], labels [0, 0, 1, 1].
Forward
Compute z = Ξ²β‚€ + β₁·x for each sample: [0, 0, 0, 0] (all zero at init).
Apply sigmoid: pΜ‚ = Οƒ(z) = [0.5, 0.5, 0.5, 0.5].
Errors (pΜ‚ βˆ’ y): [0.5βˆ’0, 0.5βˆ’0, 0.5βˆ’1, 0.5βˆ’1] = [0.5, 0.5, βˆ’0.5, βˆ’0.5].
Gradient
βˆ‚J/βˆ‚Ξ²β‚€ = (1/4)Β·(0.5+0.5βˆ’0.5βˆ’0.5) = 0.0
βˆ‚J/βˆ‚Ξ²β‚ = (1/4)Β·(0.5Β·2 + 0.5Β·3 + (βˆ’0.5)Β·5 + (βˆ’0.5)Β·7) = (1/4)Β·(1+1.5βˆ’2.5βˆ’3.5) = (1/4)Β·(βˆ’3.5) = βˆ’0.875
Update
Ξ²β‚€ ← 0 βˆ’ 0.1 Β· 0.0 = 0.0
β₁ ← 0 βˆ’ 0.1 Β· (βˆ’0.875) = +0.0875
The weight for tumour size has gone positive β€” larger tumours now get higher probability. βœ“
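
The same single iteration can be reproduced in a few lines of NumPy (this sketch covers only the one step shown above, not a full training loop):

import numpy as np

X = np.array([2.0, 3.0, 5.0, 7.0])     # tumour sizes
y = np.array([0.0, 0.0, 1.0, 1.0])     # labels
beta0, beta1, lr = 0.0, 0.0, 0.1

p_hat = 1 / (1 + np.exp(-(beta0 + beta1 * X)))   # forward pass: all 0.5 at init
error = p_hat - y                                # [0.5, 0.5, -0.5, -0.5]
grad0 = error.mean()                             # 0.0
grad1 = (error * X).mean()                       # -0.875

beta0 -= lr * grad0
beta1 -= lr * grad1
print(beta0, beta1)                              # 0.0  0.0875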
πŸ’‘
Convexity Guarantee

The Binary Cross-Entropy loss surface is convex with respect to Ξ². This means gradient descent is guaranteed to find the global minimum β€” there are no local minima to get trapped in. This is why Logistic Regression is so reliable compared to non-convex models like neural networks.


Section 08

The Decision Boundary

Once trained, Logistic Regression draws a linear decision boundary in feature space. Points on one side get Class 0, points on the other get Class 1.

πŸ“Š Decision Boundary β€” Tumour Classification (2 Features)
[Figure: scatter plot of Tumour Size (mm) vs Cell Uniformity Score. Benign (Class 0) and Malignant (Class 1) points fall on opposite sides of a straight decision boundary.]

Logistic Regression creates a linear boundary: Ξ²β‚€ + β₁·size + Ξ²β‚‚Β·uniformity = 0. Points with z < 0 are predicted Benign; points with z > 0 are predicted Malignant.

⚠️
Logistic Regression is Inherently Linear

The decision boundary is always a straight line (2D), plane (3D), or hyperplane (nD). If your classes are not linearly separable β€” for example, one class forms a ring around the other β€” Logistic Regression will struggle. Solutions: add polynomial features, use kernel methods, or switch to tree-based models.
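
The polynomial-feature fix is easy to try. The sketch below uses scikit-learn's make_circles to build a ring-shaped dataset that no straight line can separate; the degree-2 pipeline adds squared terms, so the learned boundary can become a circle (the accuracies in the comments are what you would typically see, not exact guarantees):

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# One class forms a ring around the other
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=42)

linear = make_pipeline(StandardScaler(), LogisticRegression())
poly   = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LogisticRegression())

print("Linear features  :", linear.fit(X, y).score(X, y))   # close to 0.5 (chance level)
print("Degree-2 features:", poly.fit(X, y).score(X, y))     # close to 1.0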


Section 09

Evaluating the Model β€” The Confusion Matrix

Accuracy alone is dangerous for classification. If 95% of tumours are benign, a model that always predicts "Benign" gets 95% accuracy β€” but misses every malignant case. The Confusion Matrix gives the full picture.

πŸŸ₯ Confusion Matrix β€” Dr. Sharma's Model on 100 Test Samples
                         Predicted Benign (0)                          Predicted Malignant (1)
Actual Benign (0)        TN = 57 (correctly said Benign)               FP = 5 (wrongly said Malignant, Type I Error)
Actual Malignant (1)     FN = 3 (missed a Malignant! Type II Error)    TP = 35 (correctly said Malignant)
Term | Abbr | Count | What Happened
True Negative | TN | 57 | Actual Benign, predicted Benign βœ“
False Positive | FP | 5 | Actual Benign, predicted Malignant (unnecessary biopsy)
False Negative | FN | 3 | Actual Malignant, predicted Benign β€” most dangerous!
True Positive | TP | 35 | Actual Malignant, predicted Malignant βœ“

Section 10

Precision, Recall, and F1-Score

Accuracy
(TP + TN) / (TP + TN + FP + FN)
(35+57)/100 = 92% β€” misleading on imbalanced data
Precision (Positive Predictive Value)
TP / (TP + FP)
35/(35+5) = 87.5% β€” of all predicted Malignant, 87.5% were correct
Recall (Sensitivity / True Positive Rate)
TP / (TP + FN)
35/(35+3) = 92.1% β€” of all actual Malignant, model found 92.1%
F1-Score (Harmonic Mean of P and R)
2 Β· (Precision Β· Recall) / (Precision + Recall)
2Β·(0.875Β·0.921)/(0.875+0.921) = 89.7% β€” balanced single metric
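
All four metrics can be computed directly from the confusion-matrix counts. A small sketch using the TN/FP/FN/TP values from Dr. Sharma's model in Section 09:

tn, fp, fn, tp = 57, 5, 3, 35            # counts from the confusion matrix above

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.3f}")      # 0.920
print(f"Precision: {precision:.3f}")     # 0.875
print(f"Recall   : {recall:.3f}")        # 0.921
print(f"F1-score : {f1:.3f}")            # 0.897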
Why Dr. Sharma Cares More About Recall Than Precision
In cancer screening, a False Negative (missed cancer) can be fatal β€” the patient goes home thinking they are fine. A False Positive (unnecessary biopsy) is costly and stressful, but survivable.

So Dr. Sharma sets the decision threshold at 0.3 instead of 0.5 β€” more tumours are flagged as suspicious, reducing FN at the cost of more FP. Recall goes up from 92% to 97%; Precision drops to 74%. The F1 score drops slightly β€” but in medicine, high recall is worth the trade.

This is why you should never optimise for accuracy alone. Always ask: "What is the cost of a False Negative vs a False Positive in this domain?"
Metric | Use When | Avoid When
Accuracy | Classes are balanced | Imbalanced datasets (95% Class 0)
Precision | Cost of FP is high (spam filter β€” annoying if legitimate email goes to spam) | Cost of FN is high
Recall | Cost of FN is high (disease detection, fraud) | Cost of FP is high
F1-Score | Need a single metric balancing P and R | When P and R should not be equally weighted
ROC-AUC | Comparing models at all thresholds | Severely imbalanced data (use PR-AUC instead)
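
Threshold tuning like Dr. Sharma's can be automated with scikit-learn's precision_recall_curve, which evaluates precision and recall at every candidate threshold. A sketch on the Wisconsin Breast Cancer data; the recall target of 0.97 mirrors the story above, and since label 0 is malignant in this dataset we threshold P(malignant):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)          # 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           random_state=42, stratify=y)
scaler = StandardScaler()
model  = LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_tr), y_tr)

# Precision and recall for the malignant class at every possible threshold
p_malignant = model.predict_proba(scaler.transform(X_te))[:, 0]
precision, recall, thresholds = precision_recall_curve(y_te, p_malignant, pos_label=0)

# Highest threshold on P(malignant) that still reaches the recall we need
target_recall = 0.97
ok = recall[:-1] >= target_recall                    # recall has one extra trailing entry
print(f"Use threshold {thresholds[ok].max():.2f} for recall β‰₯ {target_recall}")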

Section 11

ROC Curve and AUC

The ROC Curve (Receiver Operating Characteristic) plots the True Positive Rate (Recall) against the False Positive Rate at every possible threshold from 0 to 1. The area under this curve β€” AUC β€” summarises model quality in a single number.

πŸ“ˆ ROC Curve β€” Dr. Sharma's Model (AUC β‰ˆ 0.97)
[Figure: ROC curve, True Positive Rate (Recall) vs False Positive Rate (1 βˆ’ Specificity), both from 0.0 to 1.0. The diagonal is random guessing (AUC = 0.5); the model's curve bows toward the top-left corner (AUC = 0.97); a perfect classifier hugs the corner (AUC = 1.0).]

AUC = 0.97 means the model ranks a random positive sample above a random negative sample 97% of the time. AUC is threshold-independent and works for imbalanced datasets.
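
The curve itself comes from scikit-learn's roc_curve, which returns one (FPR, TPR) pair per threshold. A sketch on a tiny set of made-up labels and scores (illustrative only, not Dr. Sharma's data):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.80, 0.40, 0.70, 0.85, 0.95])   # predicted P(class 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold β‰₯ {th:.2f}  β†’  FPR = {f:.2f}, TPR = {t:.2f}")

print("AUC =", roc_auc_score(y_true, y_score))   # 0.875 for these scores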

AUC = 0.5
🎲
Random guessing. Model has learned nothing. Your baseline to beat.
AUC = 0.7–0.9
πŸ‘
Acceptable to good discrimination. Useful in many real-world problems.
AUC = 0.9–1.0
πŸ†
Excellent. Check for data leakage β€” often too good to be true on real data.

Section 12

Python β€” Logistic Regression from Scratch

import numpy as np

# ── Sigmoid function ─────────────────────────────────────────
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ── Binary Cross-Entropy Loss ────────────────────────────────
def log_loss(y, p_hat):
    n = len(y)
    eps = 1e-15                           # clip to avoid log(0)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(1 / n) * np.sum(
        y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)
    )

# ── Training with Gradient Descent ──────────────────────────
def train_logistic(X, y, lr=0.1, epochs=1000):
    n, p  = X.shape
    beta  = np.zeros(p + 1)               # Ξ²β‚€ + p weights
    X_b   = np.column_stack([np.ones(n), X])  # add bias column

    for epoch in range(epochs):
        z      = X_b @ beta                # linear score
        p_hat  = sigmoid(z)              # predicted probability
        error  = p_hat - y                 # residual (pΜ‚α΅’ βˆ’ yα΅’)
        grad   = (1 / n) * (X_b.T @ error) # gradient vector
        beta  -= lr * grad                 # update weights

        if epoch % 100 == 0:
            loss = log_loss(y, p_hat)
            print(f"Epoch {epoch:4d}  |  Loss: {loss:.4f}")

    return beta

# ── Prediction ───────────────────────────────────────────────
def predict_proba(X, beta):
    X_b = np.column_stack([np.ones(len(X)), X])
    return sigmoid(X_b @ beta)

def predict(X, beta, threshold=0.5):
    return (predict_proba(X, beta) >= threshold).astype(int)

# ── Toy dataset: 2 synthetic features, linearly separable ───
np.random.seed(42)
X_train = np.random.randn(100, 2)        # 100 samples, 2 features
y_train = ((X_train[:, 0] + X_train[:, 1]) > 0).astype(float)

beta = train_logistic(X_train, y_train, lr=0.5, epochs=500)
print(f"\nLearned weights: Ξ²β‚€={beta[0]:.3f}  β₁={beta[1]:.3f}  Ξ²β‚‚={beta[2]:.3f}")

proba = predict_proba(X_train[:3], beta)
print(f"Predicted probabilities (first 3): {proba.round(3)}")
Output
Epoch    0  |  Loss: 0.6931
Epoch  100  |  Loss: 0.2014
Epoch  200  |  Loss: 0.1687
Epoch  300  |  Loss: 0.1563
Epoch  400  |  Loss: 0.1493

Learned weights: Ξ²β‚€=βˆ’0.012  β₁=2.184  Ξ²β‚‚=2.173
Predicted probabilities (first 3): [0.089 0.965 0.213]

Section 13

Python β€” Full Pipeline with scikit-learn

import numpy as np
from sklearn.datasets         import load_breast_cancer
from sklearn.model_selection  import train_test_split
from sklearn.preprocessing    import StandardScaler
from sklearn.linear_model     import LogisticRegression
from sklearn.metrics          import (
    classification_report, confusion_matrix,
    roc_auc_score, RocCurveDisplay
)

# ── 1. Load Data ─────────────────────────────────────────────
data    = load_breast_cancer()        # Wisconsin Breast Cancer dataset
X, y    = data.data, data.target       # 569 samples, 30 features

# ── 2. Train / Test Split ────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── 3. Feature Scaling (critical for Logistic Regression) ───
scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)     # transform only β€” no fit!

# ── 4. Train Model ───────────────────────────────────────────
model = LogisticRegression(
    C=1.0,             # inverse of regularisation strength (1/Ξ»)
    solver='lbfgs',   # efficient solver for small-medium datasets
    max_iter=1000,
    random_state=42
)
model.fit(X_train, y_train)

# ── 5. Evaluate ──────────────────────────────────────────────
y_pred       = model.predict(X_test)
y_proba      = model.predict_proba(X_test)[:, 1]  # P(class 1 = benign)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {auc:.4f}")

# ── 6. Tune threshold to catch more malignant tumours ───────
# In this dataset target 0 = malignant and 1 = benign, so we
# threshold P(malignant) = predict_proba(...)[:, 0] directly.
p_malignant = model.predict_proba(X_test)[:, 0]
threshold   = 0.3
y_pred_low  = (p_malignant < threshold).astype(int)   # 1 (benign) only when P(malignant) < 0.3
print(f"\nAt threshold={threshold}:")
print(classification_report(y_test, y_pred_low, target_names=data.target_names))
Output
Confusion Matrix:
[[41  2]
 [ 1 70]]

Classification Report:
              precision    recall  f1-score   support
   malignant       0.98      0.95      0.96        43
      benign       0.97      0.99      0.98        71
    accuracy                           0.97       114

ROC-AUC Score: 0.9971

At threshold=0.3:
              precision    recall  f1-score   support
   malignant       0.93      1.00      0.96        43
      benign       1.00      0.96      0.98        71
⚠️
Always Scale Before Logistic Regression

Logistic Regression uses gradient descent or solver optimisation, both of which converge far slower (or incorrectly) when features have very different scales. A feature in thousands (income) will dominate one in decimals (age in decades). Always apply StandardScaler or MinMaxScaler first β€” and fit the scaler only on training data, then transform both train and test.


Section 14

Regularisation β€” Preventing Overfitting

When there are many features (especially more features than samples), Logistic Regression can overfit β€” it memorises training noise. Regularisation adds a penalty term to the cost function to keep weights small.

πŸ”΅
L2 Regularisation (Ridge)
Adds λ · Σβⱼ² to the cost. Shrinks all weights toward zero but never to exactly zero. Keeps all features in the model with smaller coefficients. Best when most features are useful.
sklearn: penalty='l2' C=1/Ξ» (default)
🟒
L1 Regularisation (Lasso)
Adds Ξ» Β· Ξ£|Ξ²β±Ό| to the cost. Can shrink weights to exactly zero, performing automatic feature selection. Best when most features are irrelevant.
sklearn: penalty='l1' solver='saga'
🟑
ElasticNet (L1 + L2)
Combines both penalties. Sparsity from L1, stability from L2. Best of both worlds when you have many features, some of which are correlated.
sklearn: penalty='elasticnet' l1_ratio=0.5
C Value (sklearn) | Regularisation Strength | Effect
C = 0.001 | Very Strong (Ξ» = 1000) | Heavily penalised weights; high bias, low variance
C = 0.1 | Strong | Simpler model, good generalisation
C = 1.0 | Moderate (default) | Balanced β€” good starting point
C = 10 | Weak | Near-unregularised; may overfit
C = 1000 | None (Ξ» β‰ˆ 0) | Pure maximum likelihood; overfits on small datasets
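
In practice C is chosen by cross-validation rather than by hand. A sketch using LogisticRegressionCV; the 13-point grid, cv=5 and ROC-AUC scoring are arbitrary illustrative choices:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=np.logspace(-3, 3, 13),   # candidate C values from 0.001 to 1000
                         cv=5, penalty='l2',
                         scoring='roc_auc', max_iter=5000)
)
model.fit(X, y)
print("Best C:", model.named_steps['logisticregressioncv'].C_[0])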

Section 15

Multiclass Logistic Regression

Standard Logistic Regression is binary. Two strategies extend it to multiple classes:

βš”οΈ
One-vs-Rest (OvR)
multi_class='ovr'
Train K separate binary classifiers β€” one per class. Each model asks: "Is this sample Class k or not?" Assign the class with highest probability. Fast and interpretable.
βœ“ Simple Β· Fast Β· Works with any binary classifier
βœ— Probabilities don't sum to 1 exactly
🌐
Softmax (Multinomial)
multi_class='multinomial'
Train one model with K output nodes. Uses the Softmax function to convert all K linear scores into a probability distribution that sums to 1. More principled and usually better calibrated.
βœ“ True probability distribution Β· Better calibrated
βœ— Slower on very large K
πŸ”’
Softmax Formula
P(Class k) = e^zβ‚– / Ξ£β±Ό e^zβ±Ό
Each class k gets a linear score zβ‚– = Ξ²β‚–α΅€x. Softmax normalises these into probabilities. The class with the highest probability wins. This is the foundation of deep learning classification heads.
βœ“ Generalisable to neural networks
βœ— Requires more training data per class
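
Before handing this to scikit-learn, note that the Softmax formula itself is only a few lines of NumPy. A sketch with three made-up class scores (the max-subtraction is a standard trick for numerical stability):

import numpy as np

def softmax(z):
    """Turn K linear scores into a probability distribution that sums to 1."""
    z = z - z.max()                       # stability: avoids overflow in exp
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, -1.0])      # illustrative zβ‚– for K = 3 classes
print(softmax(scores))                    # [0.705 0.259 0.035], sums to 1

scikit-learn's multinomial mode below applies the same Softmax on top of fitted coefficients: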
from sklearn.linear_model import LogisticRegression
from sklearn.datasets     import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# ── Iris: 3 classes ──────────────────────────────────────────
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# ── Multinomial (Softmax) Logistic Regression ────────────────
model = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    C=1.0,
    max_iter=500
)
model.fit(X_train, y_train)

print(f"Test Accuracy : {model.score(X_test, y_test):.4f}")  # ~0.967

# Probability distribution for first test sample
print("Class probabilities (sample 1):", model.predict_proba(X_test[:1]).round(3))
# [[0.001  0.072  0.927]]  β†’ Class 2 (Virginica) with 92.7% confidence

Section 16

Assumptions of Logistic Regression

πŸ“
Binary (or Ordinal) Outcome
The dependent variable must be categorical. Standard binary LR requires exactly two classes. For multiple classes use multinomial or ordinal variants.
πŸ“‰
Linear Relationship with Log-Odds
Each feature must have a linear relationship with the log-odds of the outcome. Highly non-linear relationships require feature engineering or a different algorithm.
🚫
No Multicollinearity
Highly correlated features destabilise coefficients. Remove or combine correlated features, or use L2 regularisation which handles mild multicollinearity well.
πŸ“¦
Large Sample Size
Unlike Linear Regression, LR is estimated via Maximum Likelihood which requires larger samples for stable estimates. Rule of thumb: at least 10–20 events per predictor variable.
πŸ”€
Independence of Observations
Observations must be independent. Repeated measures, time series, or clustered data violate this. Use mixed-effects or GEE models for those cases.
πŸ”¬
No Extreme Outliers
Outliers in feature space can strongly influence the decision boundary and inflate coefficients. Check leverage and influence scores. Apply robust scaling if needed.

Section 17

Golden Rules

🎯 Logistic Regression β€” Key Rules
1
Always scale your features. Logistic Regression is sensitive to feature magnitude. Use StandardScaler before training. Fit the scaler on the training set only β€” never leak test-set statistics into preprocessing.
2
Never use accuracy alone on imbalanced data. If 97% of transactions are legitimate, a model that always predicts "Legit" achieves 97% accuracy but catches zero fraud. Always report Precision, Recall, and F1 alongside accuracy.
3
Tune the decision threshold for your domain. The default 0.5 threshold treats false positives and false negatives as equally costly, which rarely matches business value. In medicine, lower it to catch more true positives. In spam filtering, raise it to reduce false positives. Plot the Precision-Recall curve to choose the best operating point.
4
Use ROC-AUC for model comparison (Adjusted RΒ² belongs to regression, not classification). AUC is threshold-independent and works well for comparing model versions. For heavily imbalanced data (fraud < 0.1%), prefer PR-AUC (Area Under the Precision-Recall Curve) over ROC-AUC.
5
Use regularisation by default. The sklearn default of C=1.0 (L2) is a safe starting point. If you have many irrelevant features, try L1 (penalty='l1') for automatic feature selection. Always tune C with cross-validation.
6
Check for perfect separation. If one feature perfectly separates the classes, the MLE algorithm diverges (coefficients β†’ ∞). This is called the complete separation problem. Signs: extremely large coefficients, wide confidence intervals, and convergence warnings. Fix: add regularisation or remove the separating feature.
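
Perfect separation is easy to reproduce on purpose. A sketch with one made-up feature that cleanly splits the classes; without a penalty the coefficient keeps growing with more iterations, while the default L2 penalty keeps it finite:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature that perfectly separates the two classes
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# penalty=None needs scikit-learn β‰₯ 1.2; older versions use penalty='none'
unregularised = LogisticRegression(penalty=None, max_iter=10_000).fit(X, y)
regularised   = LogisticRegression(C=1.0, max_iter=10_000).fit(X, y)    # default L2

print("No penalty:  β₁ =", unregularised.coef_[0][0])   # very large, still growing
print("L2 penalty:  β₁ =", regularised.coef_[0][0])     # modest and stable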