The Story: Dr. Sharma's Diagnosis Machine
She calls her data-science intern, Arjun, and says: "Build me a model that looks at these numbers and tells me the probability that a tumour is malignant. Not a continuous value, but a probability between 0 and 1."
Arjun immediately knows that Linear Regression won't work here. A straight line can predict values below 0 and above 1, which is meaningless for a probability. He reaches for Logistic Regression.
Logistic Regression is the go-to algorithm for binary classification: problems where the output is one of two classes (Yes/No, Spam/Not-Spam, Fraud/Legit, Malignant/Benign). Despite its name, it is a classification algorithm, not regression. It predicts the probability that an observation belongs to a class, then applies a threshold to make the final class decision.
Why Linear Regression Fails for Classification
| Problem | Example Output |
|---|---|
| Predictions go below 0 | ŷ = −0.34 for a small tumour |
| Predictions go above 1 | ŷ = 1.72 for a huge tumour |
| No natural threshold | What does 0.73 kg of "malignant" mean? |
| Sensitive to outliers | One extreme point tilts the whole line |
Why Logistic Regression Works
| Property | Result |
|---|---|
| Output always in [0, 1] | p̂ = 0.82 → 82% chance of malignancy |
| Interpretable threshold | p̂ ≥ 0.5 → predict Malignant |
| Probabilistic output | "82% confident it is malignant" |
| Well-defined loss function | Binary Cross-Entropy (Log Loss) |
Logistic Regression takes the linear equation z = β₀ + β₁x₁ + … + βₙxₙ
and passes it through a Sigmoid function that squashes any real number
into the range (0, 1). This output is then interpreted as a probability.
The Sigmoid Function: The Heart of Logistic Regression
The sigmoid squashes any linear score z into (0,1). Below z=0 the model leans toward Class 0; above z=0 it leans toward Class 1. The exact threshold is tunable.
Sigmoid Key Values at a Glance
| z (linear score) | Ο(z) (probability) | Interpretation |
|---|---|---|
| −6 | 0.002 | Very confident → Class 0 |
| −2 | 0.119 | Likely Class 0 |
| 0 | 0.500 | Completely uncertain |
| +2 | 0.881 | Likely Class 1 |
| +6 | 0.998 | Very confident → Class 1 |
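The values above are easy to verify yourself; a minimal NumPy sketch (the `sigmoid` helper is just a local function, not a library call):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Reproduce the table above
for z in [-6, -2, 0, 2, 6]:
    print(f"z = {z:+d}  ->  sigma(z) = {sigmoid(z):.3f}")
```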
The Full Logistic Regression Model
Step 1. Compute the linear score: z = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ. For Dr. Sharma:
z = β₀ + β₁·(tumour_size) + β₂·(cell_uniformity)
Step 2. Squash it with the sigmoid: p̂ = σ(z) = 1 / (1 + e^(−z)). This is the model's estimated probability that the sample belongs to Class 1 (Malignant), e.g. p̂ = 0.87 → 87% chance of malignancy.
Step 3. Apply a threshold: ŷ = 1 if p̂ ≥ 0.5, else ŷ = 0. The default threshold is 0.5, but it is tunable: in cancer screening you might lower it to 0.3 to minimise false negatives (missed cancers).
Report both the hard label and the probability to the doctor. "87% confident this is malignant" is far more actionable than just "Malignant."
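To make the three steps concrete, here is a minimal sketch. The coefficients β₀ = −4.0, β₁ = 0.8, β₂ = 0.5 and the patient values are made up purely for illustration, not fitted values:

```python
import numpy as np

# Hypothetical coefficients, chosen only to illustrate the three steps (not fitted values)
beta0, beta1, beta2 = -4.0, 0.8, 0.5
tumour_size, cell_uniformity = 3.2, 4.0                     # one example patient

z = beta0 + beta1 * tumour_size + beta2 * cell_uniformity   # Step 1: linear score
p_hat = 1 / (1 + np.exp(-z))                                # Step 2: sigmoid -> probability
y_hat = int(p_hat >= 0.5)                                   # Step 3: threshold -> class label

print(f"z = {z:.2f}, p_hat = {p_hat:.3f}, predicted class = {y_hat}")
```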
The Math Behind It: Log-Odds and the Logit
Why is it called logistic regression? Because it models the log-odds of the outcome as a linear function of the features.
Start from the sigmoid p̂ = 1 / (1 + e^(−z)) and rearrange to isolate z. Multiply both sides by (1 + e^(−z)):
p̂ · (1 + e^(−z)) = 1
e^(−z) = (1 − p̂) / p̂
−z = ln((1 − p̂) / p̂)
z = ln(p̂ / (1 − p̂))
The term p̂ / (1 − p̂) is called the odds (the odds ratio is what you get by exponentiating a coefficient, as shown below).
ln(p̂ / (1 − p̂)) = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
The log-odds (left side) is modelled as a linear combination of the features. This is why the algorithm is called logistic regression: a logistic transformation applied to a regression equation.
| Probability p̂ | Odds p̂/(1−p̂) | Log-Odds (z) |
|---|---|---|
| 0.10 | 0.111 | −2.20 |
| 0.25 | 0.333 | −1.10 |
| 0.50 | 1.000 | 0.00 |
| 0.75 | 3.000 | +1.10 |
| 0.90 | 9.000 | +2.20 |
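A quick sketch to reproduce this table (odds = p/(1−p), log-odds = ln(odds)):

```python
import numpy as np

# Verify the probability -> odds -> log-odds table
for p in [0.10, 0.25, 0.50, 0.75, 0.90]:
    odds = p / (1 - p)
    print(f"p = {p:.2f} | odds = {odds:.3f} | log-odds = {np.log(odds):+.2f}")
```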
Each β coefficient is the change in log-odds per unit increase in its feature. Exponentiate it to get the odds ratio e^β₁. If β₁ = 0.8 for tumour size, then e^0.8 ≈ 2.23, so each extra mm of tumour size multiplies the odds of malignancy by about 2.23×. This is the medical interpretation Dr. Sharma cares about.
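A small sketch of the odds-ratio interpretation, reusing the β₁ = 0.8 from the text; the starting score z is arbitrary and only there to show the multiplier is the same wherever you start:

```python
import numpy as np

beta1 = 0.8                                  # coefficient for tumour size, from the example above
print(f"odds ratio e^beta1 = {np.exp(beta1):.2f}")

# A one-unit increase in the feature multiplies the odds by e^beta1,
# regardless of the starting point (z_before is an arbitrary illustrative value)
z_before = -1.0
z_after = z_before + beta1
print(f"odds after / odds before = {np.exp(z_after) / np.exp(z_before):.2f}")   # same ~2.23
```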
The Cost Function: Binary Cross-Entropy (Log Loss)
Logistic Regression cannot reuse the Mean Squared Error cost from Linear Regression: pushed through the sigmoid, MSE produces a non-convex surface with many local minima. Instead it uses Binary Cross-Entropy, also called Log Loss.
J(β) = −(1/n) Σᵢ [ yᵢ·log(p̂ᵢ) + (1 − yᵢ)·log(1 − p̂ᵢ) ]
When y = 1 (actual positive): only −log(p̂) matters, penalising a low predicted probability for the true class. When y = 0 (actual negative): only −log(1 − p̂) matters, penalising a high predicted probability for the wrong class.
The penalty is steep and asymmetric: near zero when the model is confidently right, but exploding toward infinity when the model is confidently wrong. This harsh penalty pushes the model toward well-calibrated probabilities.
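A minimal sketch of that asymmetry, evaluating the per-sample loss at three predicted probabilities for an actual positive (y = 1):

```python
import numpy as np

def bce(y, p_hat):
    # Binary cross-entropy for a single prediction
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(f"{bce(1, 0.99):.3f}")   # confidently right -> ~0.01
print(f"{bce(1, 0.50):.3f}")   # unsure            -> ~0.69
print(f"{bce(1, 0.01):.3f}")   # confidently wrong -> ~4.61
```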
Gradient Descent: Finding the Best Weights
Unlike Linear Regression, Logistic Regression has no closed-form solution. We minimise J(Ξ²) iteratively using Gradient Descent.
Training data: 4 tumours with sizes [2, 3, 5, 7] and labels [0, 0, 1, 1]. Start from β₀ = β₁ = 0, so z = 0 for every sample.
Apply the sigmoid: p̂ = σ(z) = [0.5, 0.5, 0.5, 0.5].
Errors (p̂ − y): [0.5−0, 0.5−0, 0.5−1, 0.5−1] = [0.5, 0.5, −0.5, −0.5].
∂J/∂β₁ = (1/4)·(0.5·2 + 0.5·3 + (−0.5)·5 + (−0.5)·7) = (1/4)·(1 + 1.5 − 2.5 − 3.5) = (1/4)·(−3.5) = −0.875
Update with learning rate 0.1: β₁ ← 0 − 0.1·(−0.875) = +0.0875
The weight for tumour size has gone positive: larger tumours now get higher probability. ✓
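The same first gradient-descent step, reproduced in a few lines of NumPy (the bias gradient is omitted, as in the hand calculation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

sizes  = np.array([2.0, 3.0, 5.0, 7.0])
labels = np.array([0.0, 0.0, 1.0, 1.0])

beta0, beta1, lr = 0.0, 0.0, 0.1                # start from zero weights
p_hat = sigmoid(beta0 + beta1 * sizes)          # [0.5, 0.5, 0.5, 0.5]
error = p_hat - labels                          # [0.5, 0.5, -0.5, -0.5]

grad_b1 = np.mean(error * sizes)                # -0.875
beta1  -= lr * grad_b1                          # +0.0875
print(f"gradient = {grad_b1:.4f}, updated beta1 = {beta1:+.4f}")
```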
The Binary Cross-Entropy loss surface is convex with respect to β. This means gradient descent is guaranteed to find the global minimum; there are no local minima to get trapped in. This is why Logistic Regression is so reliable compared to non-convex models like neural networks.
The Decision Boundary
Once trained, Logistic Regression draws a linear decision boundary in feature space. Points on one side get Class 0, points on the other get Class 1.
Logistic Regression creates a linear boundary: β₀ + β₁·size + β₂·uniformity = 0.
Points on the side where z > 0 (p̂ > 0.5) are predicted Malignant; points on the side where z < 0 are predicted Benign.
The decision boundary is always a straight line (2D), plane (3D), or hyperplane (nD). If your classes are not linearly separable (for example, one class forms a ring around the other), Logistic Regression will struggle. Solutions: add polynomial features, use kernel methods, or switch to tree-based models.
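For intuition, here is a sketch of where that boundary sits for one set of made-up coefficients (β₀ = −6.0, β₁ = 0.9, β₂ = 0.7 are illustrative, not fitted values):

```python
import numpy as np

# Hypothetical coefficients for the two-feature tumour model (illustration only)
beta0, beta1, beta2 = -6.0, 0.9, 0.7

# The boundary is the set of points where z = 0:
#   beta0 + beta1*size + beta2*uniformity = 0
#   =>  uniformity = -(beta0 + beta1*size) / beta2
for size in np.linspace(0, 10, 5):
    uniformity = -(beta0 + beta1 * size) / beta2
    print(f"size = {size:4.1f}  ->  boundary at uniformity = {uniformity:5.2f}")
```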
Evaluating the Model: The Confusion Matrix
Accuracy alone is dangerous for classification. If 95% of tumours are benign, a model that always predicts "Benign" gets 95% accuracy yet misses every malignant case. The Confusion Matrix gives the full picture.
| Term | Abbr | Count | What Happened |
|---|---|---|---|
| True Negative | TN | 57 | Actual Benign, predicted Benign ✓ |
| False Positive | FP | 5 | Actual Benign, predicted Malignant (unnecessary biopsy) |
| False Negative | FN | 3 | Actual Malignant, predicted Benign (the most dangerous!) |
| True Positive | TP | 35 | Actual Malignant, predicted Malignant ✓ |
Precision, Recall, and F1-Score
From the matrix above: Precision = TP/(TP+FP) = 35/40 = 87.5% (of tumours flagged malignant, how many really are), Recall = TP/(TP+FN) = 35/38 ≈ 92% (of malignant tumours, how many were caught), and F1 is their harmonic mean, about 90%. So Dr. Sharma sets the decision threshold at 0.3 instead of 0.5: more tumours are flagged as suspicious, reducing FN at the cost of more FP. Recall goes up from 92% to 97%; Precision drops to 74%. The F1 score drops slightly, but in medicine high recall is worth the trade.
This is why you should never optimise for accuracy alone. Always ask: "What is the cost of a False Negative vs a False Positive in this domain?"
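These numbers follow directly from the confusion-matrix counts above; a minimal sketch:

```python
TP, FP, FN, TN = 35, 5, 3, 57          # counts from the confusion matrix above

precision = TP / (TP + FP)             # of tumours flagged malignant, how many really are
recall    = TP / (TP + FN)             # of malignant tumours, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```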
| Metric | Use When | Avoid When |
|---|---|---|
| Accuracy | Classes are balanced | Imbalanced datasets (95% Class 0) |
| Precision | Cost of FP is high (spam filter: annoying if legitimate email goes to spam) | Cost of FN is high |
| Recall | Cost of FN is high (disease detection, fraud) | Cost of FP is high |
| F1-Score | Need a single metric balancing P and R | When P and R should not be equally weighted |
| ROC-AUC | Comparing models at all thresholds | Severely imbalanced data (use PR-AUC instead) |
ROC Curve and AUC
The ROC Curve (Receiver Operating Characteristic) plots the True Positive Rate (Recall) against the False Positive Rate at every possible threshold from 0 to 1. The area under this curve (AUC) summarises model quality in a single number.
AUC = 0.97 means the model ranks a random positive sample above a random negative sample 97% of the time. AUC is threshold-independent and works for imbalanced datasets.
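A tiny illustration with made-up scores (not the tumour model), using scikit-learn's roc_auc_score and roc_curve:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up example: 4 negatives and 4 positives with hand-picked scores
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.20, 0.35, 0.60, 0.40, 0.70, 0.85, 0.95])

print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")   # fraction of (pos, neg) pairs ranked correctly

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # one point per candidate threshold
for t, f, r in zip(thresholds, fpr, tpr):
    print(f"threshold {t:.2f} -> FPR {f:.2f}, TPR {r:.2f}")
```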
Python: Logistic Regression from Scratch
import numpy as np
# ── Sigmoid function ─────────────────────────────────────────
def sigmoid(z):
return 1 / (1 + np.exp(-z))
# ── Binary Cross-Entropy Loss ────────────────────────────────
def log_loss(y, p_hat):
n = len(y)
eps = 1e-15 # clip to avoid log(0)
p_hat = np.clip(p_hat, eps, 1 - eps)
return -(1 / n) * np.sum(
y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)
)
# ── Training with Gradient Descent ───────────────────────────
def train_logistic(X, y, lr=0.1, epochs=1000):
n, p = X.shape
    beta = np.zeros(p + 1)                  # β₀ plus p feature weights
X_b = np.column_stack([np.ones(n), X]) # add bias column
for epoch in range(epochs):
z = X_b @ beta # linear score
p_hat = sigmoid(z) # predicted probability
        error = p_hat - y                   # residual (p̂ᵢ − yᵢ)
grad = (1 / n) * (X_b.T @ error) # gradient vector
beta -= lr * grad # update weights
if epoch % 100 == 0:
loss = log_loss(y, p_hat)
print(f"Epoch {epoch:4d} | Loss: {loss:.4f}")
return beta
# ── Prediction ───────────────────────────────────────────────
def predict_proba(X, beta):
X_b = np.column_stack([np.ones(len(X)), X])
return sigmoid(X_b @ beta)
def predict(X, beta, threshold=0.5):
return (predict_proba(X, beta) >= threshold).astype(int)
# ── Toy dataset: 100 synthetic samples, 2 features ───────────
np.random.seed(42)
X_train = np.random.randn(100, 2) # 100 samples, 2 features
y_train = ((X_train[:, 0] + X_train[:, 1]) > 0).astype(float)
beta = train_logistic(X_train, y_train, lr=0.5, epochs=500)
print(f"\nLearned weights: Ξ²β={beta[0]:.3f} Ξ²β={beta[1]:.3f} Ξ²β={beta[2]:.3f}")
proba = predict_proba(X_train[:3], beta)
print(f"Predicted probabilities (first 3): {proba.round(3)}")
Python: Full Pipeline with scikit-learn
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
classification_report, confusion_matrix,
roc_auc_score, RocCurveDisplay
)
# ── 1. Load Data ─────────────────────────────────────────────
data = load_breast_cancer()     # Wisconsin Breast Cancer dataset
X, y = data.data, data.target   # 569 samples, 30 features (target: 0 = malignant, 1 = benign)
# ── 2. Train / Test Split ────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# ── 3. Feature Scaling (critical for Logistic Regression) ────
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)   # transform only, no fit!
# ── 4. Train Model ───────────────────────────────────────────
model = LogisticRegression(
    C=1.0,               # inverse of regularisation strength (1/λ)
solver='lbfgs', # efficient solver for small-medium datasets
max_iter=1000,
random_state=42
)
model.fit(X_train, y_train)
# ── 5. Evaluate ──────────────────────────────────────────────
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] # P(Class=1)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {auc:.4f}")
# ── 6. Tune threshold for high Recall ────────────────────────
threshold = 0.3
y_pred_low = (y_proba >= threshold).astype(int)
print(f"\nAt threshold={threshold}:")
print(classification_report(y_test, y_pred_low, target_names=data.target_names))
Logistic Regression is fit iteratively (gradient descent or a solver such as lbfgs), and these methods converge far more slowly when features sit on very different scales; the regularisation penalty also treats large-scale and small-scale features unequally. A feature in the thousands (income) will dominate one in the decimals (age in decades). Always apply StandardScaler or MinMaxScaler first, and fit the scaler only on the training data, then transform both train and test.
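One way to make the "fit on train only" rule automatic is to wrap the scaler and model in a scikit-learn Pipeline. A sketch using the same breast-cancer data; cross_val_score re-fits the scaler inside each training fold, so no test statistics leak in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline re-fits the scaler inside every training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```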
Regularisation: Preventing Overfitting
When there are many features (especially more features than samples), Logistic Regression can overfit: it memorises training noise. Regularisation adds a penalty term to the cost function to keep weights small.
L2 (Ridge) adds λ · Σβⱼ² to the cost. It shrinks all weights toward zero but never to exactly zero, keeping every feature in the model with smaller coefficients. Best when most features are useful.
L1 (Lasso) adds λ · Σ|βⱼ| to the cost. It can shrink weights to exactly zero, performing automatic feature selection. Best when most features are irrelevant.
| C Value (sklearn) | Regularisation Strength | Effect |
|---|---|---|
| C = 0.001 | Very strong (λ = 1000) | Heavily penalised weights; high bias, low variance |
| C = 0.1 | Strong | Simpler model, good generalisation |
| C = 1.0 | Moderate (default) | Balanced; a good starting point |
| C = 10 | Weak | Near-unregularised; may overfit |
| C = 1000 | Almost none (λ ≈ 0) | Essentially pure maximum likelihood; overfits on small datasets |
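A quick sketch of the practical difference between the two penalties on the breast-cancer features: at the same C, L1 zeroes out many coefficients while L2 only shrinks them (the C = 0.1 value here is arbitrary):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)      # quick scaling for a coefficient comparison

for penalty, solver in [("l2", "lbfgs"), ("l1", "liblinear")]:
    model = LogisticRegression(penalty=penalty, C=0.1, solver=solver, max_iter=5000)
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{penalty}: {n_zero} of {model.coef_.size} coefficients are exactly zero")
```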
Multiclass Logistic Regression
Standard Logistic Regression is binary. Two strategies extend it to multiple classes: One-vs-Rest (OvR), which trains one binary classifier per class and picks the most confident one, and Multinomial (Softmax) regression, which replaces the sigmoid with a softmax over all classes. The example below uses the multinomial approach.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# ── Iris: 3 classes ──────────────────────────────────────────
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# ── Multinomial (Softmax) Logistic Regression ────────────────
model = LogisticRegression(
multi_class='multinomial',
solver='lbfgs',
C=1.0,
max_iter=500
)
model.fit(X_train, y_train)
print(f"Test Accuracy : {model.score(X_test, y_test):.4f}") # ~0.967
# Probability distribution for first test sample
print("Class probabilities (sample 1):", model.predict_proba(X_test[:1]).round(3))
# [[0.001 0.072 0.927]] → Class 2 (Virginica) with 92.7% confidence
Assumptions of Logistic Regression
- The outcome is binary (or handled via the multiclass extensions above).
- The log-odds of the outcome is a linear function of the features.
- Observations are independent of each other.
- Features are not strongly multicollinear.
- The sample is large enough for the coefficient estimates to be stable.
Golden Rules
Always scale features with StandardScaler before training, and fit the scaler on the training set only: never leak test-set statistics into preprocessing.
C=1.0 with the default L2 penalty is a safe starting point. If you have many irrelevant features, try L1 (penalty='l1') for automatic feature selection. Always tune C with cross-validation, as sketched below.
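A sketch of that last rule using LogisticRegressionCV, which runs the cross-validated search over C internally (the grid size Cs=10 and cv=5 are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# LogisticRegressionCV searches a grid of C values with internal cross-validation
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, max_iter=5000)
)
model.fit(X_train, y_train)
print(f"Best C       : {model[-1].C_[0]:.4f}")
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```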