
Naïve Bayes Classifier

Learn how Naïve Bayes uses Bayes' Theorem and the conditional independence assumption to classify text, detect spam, and diagnose disease — fast, interpretable, and surprisingly accurate even on tiny datasets.

Section 01

The Story That Explains Naïve Bayes

The New Email Intern Who Learned to Spot Spam
Imagine you hire a new intern named Bayes to sort your inbox. On his first day, you hand him 1,000 old emails — 400 spam, 600 legitimate — and tell him to take notes. He reads every single one and builds a tally:

"The word FREE appears in 380 spam emails but only 12 legitimate ones."
"The word meeting appears in 420 legitimate emails but only 3 spam ones."
"The word winner appears in 350 spam emails but only 2 legitimate ones."

A new email arrives: "You are a FREE WINNER — claim your prize now!"

Bayes does not read it like a human. He asks one cold, mathematical question: "Given each of these words, what is the probability this is spam?" He multiplies the individual word probabilities, compares Spam vs Legitimate, and announces: "99.2% spam." Correct — deleted in milliseconds.

That is Naïve Bayes. "Naïve" because it assumes each word votes independently — a simplification that is statistically wrong but works remarkably well in practice.
🔍
Why "Naïve"?

In reality the words "FREE" and "WINNER" are not independent — seeing one makes the other more likely. Naïve Bayes ignores all correlations and treats every feature as if it contributes independently to the outcome. Statistically wrong. Practically powerful.


Section 02

The Foundation — Bayes' Theorem

Everything in Naïve Bayes flows from one equation discovered by the Reverend Thomas Bayes in the 18th century. It tells you how to update a belief when you receive new evidence.

Bayes' Theorem
P(A | B) = P(B | A) × P(A) / P(B)
The probability of A given B equals the likelihood of B given A, times the prior belief in A, divided by the total probability of B
In Classification Terms
P(Class | Features) = P(Features | Class) × P(Class) / P(Features)
What is the probability of this class label, given the feature values we observed in this sample?
Term Name Plain English Spam Example
P(Class | Features) Posterior What we want — probability of the class given the evidence we observed P(Spam | "FREE", "WIN")
P(Features | Class) Likelihood How probable are these features if we already know the class? P("FREE", "WIN" | Spam)
P(Class) Prior How common is this class in the training data — before seeing any features P(Spam) = 400/1000 = 0.40
P(Features) Evidence How common are these features overall? Same for every class — safely cancelled Constant divisor — ignored in practice
💡
The Denominator Trick — MAP Decision Rule

When comparing classes (Spam vs Ham), P(Features) is identical for both — it cancels out. We only need to compare P(Features | Class) × P(Class) for each class and pick the largest. No division required. This is the Maximum A Posteriori (MAP) rule — and it is all Naïve Bayes uses.
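
As a quick sanity check, here is the MAP comparison in a few lines of Python, using the intern's tallies from Section 01 (380 of 400 spam emails and 12 of 600 legitimate emails contain "FREE") and a single word as evidence:

p_spam, p_ham = 400 / 1000, 600 / 1000          # priors from the 1,000 training emails
p_free_spam, p_free_ham = 380 / 400, 12 / 600   # likelihood of the word "FREE" per class

# MAP rule: compare P(Features | Class) × P(Class); the evidence term is skipped
score_spam = p_free_spam * p_spam   # 0.95 * 0.40 = 0.380
score_ham  = p_free_ham  * p_ham    # 0.02 * 0.60 = 0.012

print("Spam" if score_spam > score_ham else "Ham")   # Spam, no division by P(Features) needed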


Section 03

The Naïve Independence Assumption

With many features, computing P(word₁, word₂, word₃, ... | Class) directly is impossible — you would need to have seen every exact combination of words in training. Naïve Bayes solves this with a bold simplification.

🔑 The Independence Assumption — Features Vote Separately
Without It
P("free", "win", "prize" | Spam) — must observe this exact 3-word combo in training. Rare or zero → model collapses completely.
With It
P("free" | Spam) × P("win" | Spam) × P("prize" | Spam) — multiply individual word probabilities. Each word estimated independently. Always computable.
Result
The joint probability of all features becomes a simple product of marginals. Fast, tractable, and surprisingly accurate across thousands of real applications.
General MAP Formula
ŷ = argmax P(Cₖ) ∏ P(xᵢ | Cₖ)
Choose the class Cₖ that maximises the product of its prior and the likelihood of each individual feature given that class
Log-Space Version (Numerically Stable)
ŷ = argmax [log P(Cₖ) + Σ log P(xᵢ | Cₖ)]
Logarithm converts multiplication into addition — prevents floating-point underflow when many small probabilities are multiplied together
⚠️
Why Log-Space Is Not Optional

Multiplying hundreds of small probabilities — 0.001 × 0.003 × 0.0005 × ... — produces numbers so tiny that computers silently round them to zero (floating-point underflow). Taking logarithms converts products into sums: log(a × b) = log(a) + log(b). Since log is monotone, argmax of the log-probabilities gives the identical answer as argmax of the raw probabilities. sklearn handles this internally — but you must handle it yourself when implementing from scratch.
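
A tiny experiment makes the underflow concrete; the 500 likelihoods of 0.001 used below are illustrative values, not outputs of a real model:

import numpy as np

probs = np.full(500, 0.001)           # 500 small per-word likelihoods (illustrative)

raw_product = np.prod(probs)          # 0.001**500 = 1e-1500, silently underflows to 0.0
log_sum     = np.sum(np.log(probs))   # 500 * log(0.001) ≈ -3453.9, perfectly representable

print(raw_product)   # 0.0           -- all information lost, every class would score zero
print(log_sum)       # -3453.877...  -- still comparable across classes via argmax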


Section 04

Step-by-Step Manual Calculation

Let us classify one new email by hand using a tiny training set of 10 emails with three binary features.

#    Contains "free"   Contains "winner"   Contains "meeting"   Label
1    ✅                 ✅                   ❌                    Spam
2    ✅                 ✅                   ❌                    Spam
3    ✅                 ✅                   ❌                    Spam
4    ✅                 ❌                   ❌                    Spam
5    ❌                 ❌                   ❌                    Spam
6    ✅                 ❌                   ✅                    Ham
7    ❌                 ❌                   ✅                    Ham
8    ❌                 ❌                   ✅                    Ham
9    ❌                 ❌                   ❌                    Ham
10   ❌                 ❌                   ❌                    Ham

New email to classify: contains "free" ✅, "winner" ✅, "meeting" ❌. Spam or Ham?

🧮 Step 1 — Compute Prior Probabilities
P(Spam)
5 spam emails out of 10 total → P(Spam) = 5/10 = 0.500
P(Ham)
5 ham emails out of 10 total → P(Ham) = 5/10 = 0.500
🧮 Step 2 — Compute Likelihoods Per Class
P(free|Spam)
4 of 5 spam emails contain "free" → 4/5 = 0.800
P(winner|Spam)
3 of 5 spam emails contain "winner" → 3/5 = 0.600
P(¬meeting|Spam)
5 of 5 spam emails lack "meeting" → 5/5 = 1.000
P(free|Ham)
1 of 5 ham emails contains "free" → 1/5 = 0.200
P(winner|Ham)
0 of 5 ham emails contain "winner" → 0/5 = 0.000 ⚠️ Zero Problem!
P(¬meeting|Ham)
2 of 5 ham emails lack "meeting" → 2/5 = 0.400
🧮 Step 3 — Multiply to Get MAP Scores
Spam
P(Spam) × P(free|Spam) × P(winner|Spam) × P(¬meeting|Spam)
= 0.5 × 0.800 × 0.600 × 1.000 = 0.240
Ham
P(Ham) × P(free|Ham) × P(winner|Ham) × P(¬meeting|Ham)
= 0.5 × 0.200 × 0.000 × 0.400 = 0.000 ← One zero kills it!
Decision
0.240 > 0.000 → Predicted: SPAM ✅  — correct, but the zero is a critical flaw (see Section 06)
🎯
Correct — But One Fatal Issue

The model predicted spam correctly. But Ham score is exactly zero because "winner" never appeared in any ham email during training. One unseen word silenced all other evidence for ham — permanently. This is the Zero Frequency Problem, solved by Laplace Smoothing in Section 06.
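
The whole walkthrough fits in a few lines of Python; this sketch simply re-derives Steps 1–3 from the counts in the table above (no smoothing yet, so the Ham zero survives):

# Priors from 5 spam / 5 ham training emails (Step 1)
p_spam, p_ham = 5 / 10, 5 / 10

# Likelihoods counted directly from the 10-email table (Step 2)
spam = {"free": 4 / 5, "winner": 3 / 5, "no_meeting": 5 / 5}
ham  = {"free": 1 / 5, "winner": 0 / 5, "no_meeting": 2 / 5}

# Step 3: MAP scores for the new email (free=yes, winner=yes, meeting=no)
score_spam = p_spam * spam["free"] * spam["winner"] * spam["no_meeting"]
score_ham  = p_ham  * ham["free"]  * ham["winner"]  * ham["no_meeting"]

print(f"Spam: {score_spam:.3f}   Ham: {score_ham:.3f}")              # Spam: 0.240   Ham: 0.000
print("Prediction:", "SPAM" if score_spam > score_ham else "HAM")    # SPAM -- ham killed by the zero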


Section 05

The Three Variants of Naïve Bayes

Naïve Bayes is not a single algorithm — it is a family. The variant you choose depends entirely on the type of features in your data.

📊
Gaussian NB
Continuous (Real-Valued) Features
Assumes each feature follows a normal distribution within each class. Learns μ (mean) and σ² (variance) per feature per class from training data, then evaluates new samples using the Gaussian PDF. Used for sensor readings, medical measurements, physical observations.
✅ Handles continuous data naturally
❌ Fails when features are not Gaussian
📝
Multinomial NB
Count / Frequency Features
Designed for word counts or TF-IDF vectors. Models how often each word appears in each class. Industry gold standard for spam detection, news topic classification, and document categorisation. Requires non-negative integer (or frequency) features.
✅ Gold standard for NLP text tasks
❌ Cannot use negative feature values
🔘
Bernoulli NB
Binary (0 / 1) Features
Uses binary word presence/absence (1 = word in email, 0 = not). Crucially, it explicitly penalises absent words — Multinomial ignores them. Works better than Multinomial on very short texts where frequency carries less signal than presence.
✅ Best for short documents and binary features
❌ Throws away frequency information entirely
Variant        | Feature Type                 | Likelihood Formula                         | Typical Use Cases
Gaussian NB    | Real-valued continuous       | Gaussian PDF — uses μ, σ²                  | Iris classification, medical data, sensors
Multinomial NB | Non-negative integer counts  | Multinomial distribution over word counts  | Spam, news topics, document tagging
Bernoulli NB   | Binary 0 / 1                 | Product of Bernoulli distributions         | Sentiment (short), binary feature sets
Complement NB  | Non-negative integer counts  | Trains on complement of each class         | Imbalanced text classification

Section 06

The Zero Frequency Problem — Laplace Smoothing

One Missing Word Destroys Everything
Our intern Bayes has a catastrophic flaw. If a word appears in a new email but was never seen in any training email of a given class, the probability for that entire class becomes exactly zero — regardless of how strongly all other words point toward it. It is like a jury acquitting an obviously guilty defendant because one minor witness was unavailable, ignoring the ten who already testified.
Without Smoothing (Broken)
P(word | Class) = count(word, Class) / N(Class)
If count = 0 → probability = 0 → entire product = 0. One unseen word silences every other feature permanently.
Laplace Smoothing (Fixed)
P(word | Class) = (count(word, Class) + α) / (N(Class) + α × V)
α is the smoothing parameter (α=1 is standard). V is vocabulary size. No probability can ever be zero again.
🧮 Laplace Smoothing Applied — Fixing P(winner | Ham)
Raw (Broken)
count("winner", Ham) = 0  |  N(Ham) = 5 → P = 0/5 = 0.000 💥 Kills the class
α = 1, V = 3
Vocabulary has 3 features: "free", "winner", "meeting"
Smoothed
P("winner" | Ham) = (0 + 1) / (5 + 1×3) = 1/8 = 0.125 ✅ No longer zero
Ham Score
0.5 × 0.200 × 0.125 × 0.400 = 0.005 — non-zero, fair comparison now possible
Final
Spam (0.240) vs Ham (0.005) → Still SPAM ✅ — but now with a meaningful competing score. (In a full implementation every likelihood, spam ones included, is re-estimated with the same α; only the broken zero is patched here to keep the arithmetic short.)
🎛️
Tuning Alpha (α)

α = 1 → Laplace smoothing (add-one). α < 1 → Lidstone smoothing (less aggressive). α = 0 → no smoothing (dangerous in production — never use). In sklearn: MultinomialNB(alpha=1.0) is the default. Tune α between 0.01 and 2.0 using cross-validation — smaller values suit large dense datasets, larger values suit small or sparse ones.
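
Wrapping the smoothed formula in a small helper makes the effect of α easy to see; the calls below reproduce the P("winner" | Ham) fix from the worked example (V = 3):

def smoothed_likelihood(word_count, class_count, alpha=1.0, vocab_size=3):
    """Laplace/Lidstone-smoothed P(word | class) = (count + α) / (N + α·V)."""
    return (word_count + alpha) / (class_count + alpha * vocab_size)

print(smoothed_likelihood(0, 5))              # 0.125  -- the broken 0/5 becomes 1/8
print(smoothed_likelihood(0, 5, alpha=0.5))   # 0.0769 -- Lidstone: lighter smoothing
print(smoothed_likelihood(4, 5))              # 0.625  -- non-zero counts barely change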


Section 07

Gaussian NB — How Continuous Features Are Modelled

When features are continuous (like age, glucose level, temperature), counting occurrences makes no sense. Gaussian NB instead models each feature as a normal distribution per class — estimating μ and σ² from training data.

Patient   Age   Glucose (mg/dL)   BMI    Diabetes?
1         50    180               33.6   Yes
2         58    200               35.0   Yes
3         45    165               30.5   Yes
4         25    90                22.1   No
5         30    85                23.5   No
6         28    100               24.0   No
📐 Training Phase — Estimate μ and σ Per Class
Diabetic
Age: μ=51.0, σ=6.6  |  Glucose: μ=181.7, σ=17.6  |  BMI: μ=33.0, σ=2.3
Healthy
Age: μ=27.7, σ=2.5  |  Glucose: μ=91.7, σ=7.6  |  BMI: μ=23.2, σ=1.0
New patient
Age=40, Glucose=150, BMI=29. Which class?
Prediction
Evaluate the Gaussian PDF at each feature value for both classes. Multiply the three PDF values by the class prior. Highest product wins → Diabetic (values sit clearly in diabetic distribution)
Gaussian PDF
P(x | Class) = (1 / (σ√(2π))) × e^(−(x−μ)² / (2σ²))
Probability density of observing value x given the class's mean μ and standard deviation σ
Full Gaussian NB Prediction
ŷ = argmax [log P(Cₖ) + Σ log PDF(xᵢ ; μₖᵢ, σₖᵢ)]
Sum log-probabilities from the Gaussian PDF for each feature and each class. Pick the class with the highest total.
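
A minimal from-scratch sketch of that prediction, plugging the new patient's values and the μ/σ estimates above into the log-space MAP rule (equal priors, since each class has three patients):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density, evaluated elementwise over the features."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.array([40.0, 150.0, 29.0])                                         # Age, Glucose, BMI
mu_d, sd_d = np.array([51.0, 181.7, 33.0]), np.array([6.6, 17.6, 2.3])    # Diabetic class
mu_h, sd_h = np.array([27.7, 91.7, 23.2]), np.array([2.5, 7.6, 1.0])      # Healthy class

log_diabetic = np.log(0.5) + np.log(gaussian_pdf(x, mu_d, sd_d)).sum()
log_healthy  = np.log(0.5) + np.log(gaussian_pdf(x, mu_h, sd_h)).sum()

print(f"log-score Diabetic: {log_diabetic:.1f}   Healthy: {log_healthy:.1f}")    # ≈ -13.6 vs -64.7
print("Prediction:", "Diabetic" if log_diabetic > log_healthy else "Healthy")    # Diabetic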

Section 08

Python Implementation

Gaussian Naïve Bayes — Iris Classification

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
import numpy as np

# Load Iris — continuous features → Gaussian NB is ideal
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB(
    var_smoothing=1e-9   # small constant added to variance for stability
)
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
print(f"Accuracy: {gnb.score(X_test, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Inspect learned μ and σ² per class per feature
for i, cls in enumerate(iris.target_names):
    print(f"\n{cls}:")
    for j, feat in enumerate(iris.feature_names):
        print(f"  {feat:30s}  μ={gnb.theta_[i,j]:.3f}  σ²={gnb.var_[i,j]:.4f}")
OUTPUT
Accuracy: 0.9667
              precision    recall  f1-score   support
      setosa       1.00      1.00      1.00        10
  versicolor       0.91      1.00      0.95        10
   virginica       1.00      0.90      0.95        10
    accuracy                           0.97        30

setosa:
  sepal length (cm)    μ=5.006  σ²=0.1242
  petal length (cm)    μ=1.464  σ²=0.0302
versicolor:
  sepal length (cm)    μ=5.936  σ²=0.2664
  petal length (cm)    μ=4.260  σ²=0.2208

Multinomial NB — Spam Detection Pipeline

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

emails = [
    "Get free money now win prizes",
    "Congratulations you are a winner claim your reward",
    "Free entry in our lucky draw text WIN to 88888",
    "Hi how are you doing today everything okay",
    "Meeting scheduled for 3pm please confirm attendance",
    "Project update attached please review the report",
    "WINNER you have been selected for a cash prize",
    "Lunch tomorrow at the usual place let me know",
    "Urgent your account will be suspended click here now",
    "Can we reschedule the team standup to Friday afternoon",
]
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # 1=spam 0=ham

X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.3, random_state=42
)

# TF-IDF → Multinomial NB pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),    # unigrams + bigrams
        min_df=1
    )),
    ('clf', MultinomialNB(alpha=1.0))   # Laplace smoothing
])

pipeline.fit(X_train, y_train)

# Predict with class probabilities
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)

for email, pred, prob in zip(X_test, y_pred, y_prob):
    label = "SPAM" if pred == 1 else "HAM "
    print(f"[{label}] ({prob[1]*100:5.1f}% spam)  {email[:50]}")
OUTPUT
[HAM ] (  8.3% spam)  Hi how are you doing today everything okay
[SPAM] ( 94.1% spam)  Urgent your account will be suspended click here now
[HAM ] ( 12.5% spam)  Can we reschedule the team standup to Friday afternoon

Bernoulli NB — Binary Bag-of-Words

from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# binary=True: CountVectorizer caps all counts at 1 (word present/absent)
pipeline_bnb = Pipeline([
    ('bow', CountVectorizer(binary=True, stop_words='english')),
    ('clf', BernoulliNB(alpha=1.0, binarize=None))
    # binarize=None: features already binarised by CountVectorizer
])

pipeline_bnb.fit(X_train, y_train)
print("BernoulliNB accuracy:", f"{pipeline_bnb.score(X_test, y_test):.4f}")

Comparing All Variants with Cross-Validation

from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# ComplementNB — best for imbalanced class distributions
pipeline_cnb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf',   ComplementNB(alpha=1.0, norm=True))
])

# 3-fold CV on the full emails dataset
for name, pipe in [
    ('MultinomialNB', pipeline),
    ('BernoulliNB  ', pipeline_bnb),
    ('ComplementNB ', pipeline_cnb)
]:
    scores = cross_val_score(pipe, emails, labels, cv=3, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")

Section 09

Full Real-World NLP Pipeline — 20 Newsgroups

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups
import matplotlib.pyplot as plt

# 4-topic multi-class classification
categories = ['sci.space', 'comp.graphics',
              'rec.sport.hockey', 'talk.politics.guns']

train = fetch_20newsgroups(subset='train', categories=categories)
test  = fetch_20newsgroups(subset='test',  categories=categories)

# Pipeline with hyperparameter search
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf',   MultinomialNB())
])

param_grid = {
    'tfidf__max_features': [5000, 20000, None],
    'tfidf__ngram_range':  [(1, 1), (1, 2)],
    'tfidf__sublinear_tf': [True, False],
    'clf__alpha':          [0.01, 0.1, 0.5, 1.0],
}

grid = GridSearchCV(
    pipeline, param_grid,
    cv=5, scoring='f1_macro',
    n_jobs=-1, verbose=0
)
grid.fit(train.data, train.target)

print("Best params:", grid.best_params_)
print(f"Best CV F1:   {grid.best_score_:.4f}")

y_pred = grid.predict(test.data)
print(classification_report(
    test.target, y_pred,
    target_names=train.target_names
))

# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
    test.target, y_pred,
    display_labels=train.target_names,
    xticks_rotation='vertical'
)
plt.tight_layout()
plt.show()
OUTPUT
Best params: {'clf__alpha': 0.1, 'tfidf__max_features': None, 'tfidf__ngram_range': (1, 2), 'tfidf__sublinear_tf': True}
Best CV F1:   0.9312
                    precision    recall  f1-score   support
         sci.space       0.97      0.96      0.96       394
     comp.graphics       0.91      0.93      0.92       389
  rec.sport.hockey       0.99      0.98      0.98       399
talk.politics.guns       0.93      0.93      0.93       364
          accuracy                           0.95      1546

Section 10

When to Use Naïve Bayes

Text Classification
Spam detection, news topic classification, document tagging, sentiment analysis. MultinomialNB with TF-IDF remains competitive with far more complex models across many NLP benchmarks.
spam · topics · sentiment · intent
Very Small Datasets
When you have little training data, Naïve Bayes is robust. It only needs enough samples to estimate P(word | class) reliably — far fewer than deep learning or Random Forest require.
low-data regime · cold start
Real-Time Inference
Naïve Bayes is one of the fastest classifiers at prediction time — O(features × classes). Excellent for streaming data, on-device email filtering, or any latency-critical system.
edge devices · streaming · IoT
Many Classes
Handles hundreds of classes natively — each gets its own prior and likelihoods. No one-vs-rest tricks required. Scales cleanly to product catalogues, news feeds, language detection.
multi-class · hundreds of labels
Correlated Features
When features are strongly correlated, the independence assumption fails badly. Probabilities become miscalibrated. Logistic Regression or tree-based models handle correlations far better.
use Logistic Regression instead
Structured Tabular Data
For mixed-type tabular data with numeric features, Gaussian NB's normality assumption often breaks severely. Random Forest and Gradient Boosting almost always win here.
use Random Forest instead

Section 11

Naïve Bayes vs Other Classifiers

Property                | Naïve Bayes          | Logistic Regression | Random Forest      | SVM
Training speed          | ⚡ Fastest            | Fast                | Moderate           | Slow on large N
Inference speed         | ⚡ Fastest            | Fast                | Moderate           | Moderate
Text / NLP              | Excellent            | Excellent           | Poor (sparse data) | Excellent
Probability calibration | Overconfident        | Well calibrated     | Moderate           | Needs Platt scaling
Handles missing values  | Yes (skip features)  | Needs imputation    | Needs imputation   | Needs imputation
Feature scaling needed  | No                   | Yes                 | No                 | Yes
Correlated features     | Assumption violated  | Handles well        | Handles well       | Handles well
Data requirement        | Very low             | Moderate            | Moderate–High      | Moderate

Section 12

Strengths vs Weaknesses

✅ Strengths
Extremely fast to train — linear in data size
Fastest inference of any classifier
Excellent on text and NLP tasks
Scales to millions of sparse features
Works well with very small training sets
Multi-class native — no extra tricks needed
Interpretable — inspect learned probabilities
No feature scaling or normalisation required
Handles missing features gracefully
❌ Weaknesses
Independence assumption almost always violated
Poorly calibrated — probabilities too extreme
Zero frequency problem (requires smoothing)
Cannot model feature interactions at all
Gaussian NB breaks on non-normal distributions
Outperformed by boosting and LR on most tabular tasks
Word order completely ignored in text (bag-of-words)
Sensitive to irrelevant / redundant features
📐
Fix Overconfident Probabilities with Calibration

Naïve Bayes probabilities tend to cluster near 0 and 1 — overconfident. If you need reliable probability estimates (e.g. for risk scoring or decision thresholds), wrap the model with sklearn's CalibratedClassifierCV using method='isotonic' for large datasets or 'sigmoid' for small ones. For the classification decision itself (not the probability), raw outputs are fine.
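
A minimal sketch of the wrapping step, reusing the Iris data from Section 08; the choice of GaussianNB and the sigmoid method here is illustrative, picked because the dataset is small:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# method='sigmoid' (Platt scaling) is the safer choice on small data;
# switch to method='isotonic' once you have thousands of samples
calibrated = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=5)
calibrated.fit(X_tr, y_tr)

print(f"Calibrated accuracy: {calibrated.score(X_te, y_te):.3f}")
print(f"Largest predicted probability: {calibrated.predict_proba(X_te).max():.3f}")  # pulled back from ~1.0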


Section 13

Naïve Bayes in the Real World

Email Spam Filters
📧
  • Used by SpamAssassin since 2001
  • Still in production in mail servers
  • Processes millions of emails per second
  • Updates continuously with new examples
Medical Diagnosis
🏥
  • Symptom → disease classification
  • Probabilistic output aids clinicians
  • Works reliably with limited patient data
  • Used in early diagnostic support tools
Sentiment & NLP
💬
  • Product review classification
  • Social media real-time monitoring
  • News category tagging at scale
  • Language detection in multi-lingual systems

Section 14

Golden Rules

🎯 Naïve Bayes — Non-Negotiable Rules
1
Always use Laplace smoothing — never set alpha=0 in production. Zero probabilities fail silently: no error is thrown, yet any class that never saw one of the incoming features is scored zero and ruled out, regardless of all other evidence. The default alpha=1.0 is a safe start.
2
Match the variant to your feature type precisely. Continuous numeric features → GaussianNB. Word counts or TF-IDF → MultinomialNB. Binary presence/absence → BernoulliNB. Imbalanced text → ComplementNB. Using the wrong variant is the single most common Naïve Bayes mistake.
3
Use Naïve Bayes as your mandatory baseline. It trains in milliseconds and gives you a strong performance floor. If your complex model cannot beat it, something is wrong with the complex model — not with Naïve Bayes.
4
Tune alpha with cross-validation — it is the single most impactful hyperparameter. Try values on a log scale: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0 (a minimal sweep is sketched after these rules). Smaller alpha values work better on large, dense, high-quality data. Larger values help on sparse or tiny datasets.
5
Do not trust raw probabilities for high-stakes decisions. Apply CalibratedClassifierCV(method='isotonic') to post-process the outputs if your application requires reliable probability estimates — risk scoring, fraud thresholds, medical triage. For a simple classification decision, raw outputs are perfectly fine.
6
Always work in log-space when implementing from scratch. Use log P(Class) + Σ log P(xᵢ | Class) — never multiply raw probabilities directly. Underflow is silent and produces wrong answers with no warning.
7
The independence assumption being violated is normal — and acceptable. It affects the calibration of the probabilities, not necessarily the final class decision. Naïve Bayes remains competitive with far more sophisticated models on text tasks despite every word in English being correlated with every other word in its context.
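
The alpha sweep mentioned in Rule 4, sketched on the same 20 Newsgroups data used in Section 09; the two-category subset, grid values, and scoring metric are illustrative choices:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset='train',
                           categories=['sci.space', 'comp.graphics'])

pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                 ('clf',   MultinomialNB())])

# Log-scale grid from Rule 4 -- refine around the winner on a second pass
grid = GridSearchCV(pipe,
                    {'clf__alpha': [0.001, 0.01, 0.1, 0.5, 1.0, 2.0]},
                    cv=5, scoring='f1_macro', n_jobs=-1)
grid.fit(train.data, train.target)

print("Best alpha:", grid.best_params_['clf__alpha'])
print(f"Best CV F1: {grid.best_score_:.3f}")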