
Naïve Bayes Classifier

Learn how Naïve Bayes uses Bayes' Theorem and the conditional independence assumption to classify text, detect spam, and diagnose disease — fast, interpretable, and surprisingly accurate even on tiny datasets.

Section 01

The Story That Explains Naïve Bayes

The New Email Intern Who Learned to Spot Spam
Imagine you hire a new intern named Bayes to sort your inbox. On his first day, you hand him 1,000 old emails — 400 spam, 600 legitimate — and tell him to take notes. He reads every single one and builds a tally:

"The word FREE appears in 380 spam emails but only 12 legitimate ones."
"The word meeting appears in 420 legitimate emails but only 3 spam ones."
"The word winner appears in 350 spam emails but only 2 legitimate ones."

A new email arrives: "You are a FREE WINNER — claim your prize now!"

Bayes does not read it like a human. He asks one cold, mathematical question: "Given each of these words, what is the probability this is spam?" He multiplies the individual word probabilities, compares Spam vs Legitimate, and announces: "99.2% spam." Correct — deleted in milliseconds.

That is Naïve Bayes. "Naïve" because it assumes each word votes independently — a simplification that is statistically wrong but works remarkably well in practice.
🔍
Why "Naïve"?

In reality the words "FREE" and "WINNER" are not independent — seeing one makes the other more likely. Naïve Bayes ignores all correlations and treats every feature as if it contributes independently to the outcome. Statistically wrong. Practically powerful.


Section 02

The Foundation — Bayes' Theorem

Everything in Naïve Bayes flows from one equation discovered by the Reverend Thomas Bayes in the 18th century. It tells you how to update a belief when you receive new evidence.

Bayes' Theorem
P(A | B) = P(B | A) × P(A) / P(B)
The probability of A given B equals the likelihood of B given A, times the prior belief in A, divided by the total probability of B
In Classification Terms
P(Class | Features) = P(Features | Class) × P(Class) / P(Features)
What is the probability of this class label, given the feature values we observed in this sample?
Term Name Plain English Spam Example
P(Class | Features) Posterior What we want — probability of the class given the evidence we observed P(Spam | "FREE", "WIN")
P(Features | Class) Likelihood How probable are these features if we already know the class? P("FREE", "WIN" | Spam)
P(Class) Prior How common is this class in the training data — before seeing any features P(Spam) = 400/1000 = 0.40
P(Features) Evidence How common are these features overall? Same for every class — safely cancelled Constant divisor — ignored in practice
💡
The Denominator Trick — MAP Decision Rule

When comparing classes (Spam vs Ham), P(Features) is identical for both — it cancels out. We only need to compare P(Features | Class) × P(Class) for each class and pick the largest. No division required. This is the Maximum A Posteriori (MAP) rule — and it is all Naïve Bayes uses.
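
As a quick sanity check, here is the MAP comparison in a few lines of Python, using the intern's tallies from Section 01 (380 of 400 spam emails and 12 of 600 legitimate emails contain "FREE") and a single word as evidence:

p_spam, p_ham = 400 / 1000, 600 / 1000          # priors from the 1,000 training emails
p_free_spam, p_free_ham = 380 / 400, 12 / 600   # likelihood of the word "FREE" per class

# MAP rule: compare P(Features | Class) × P(Class); the evidence term is skipped
score_spam = p_free_spam * p_spam   # 0.95 * 0.40 = 0.380
score_ham  = p_free_ham  * p_ham    # 0.02 * 0.60 = 0.012

print("Spam" if score_spam > score_ham else "Ham")   # Spam, no division by P(Features) needed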


Section 03

The Naïve Independence Assumption

With many features, computing P(word₁, word₂, word₃, ... | Class) directly is impossible — you would need to have seen every exact combination of words in training. Naïve Bayes solves this with a bold simplification.

🔑 The Independence Assumption — Features Vote Separately
Without It
P("free", "win", "prize" | Spam) — must observe this exact 3-word combo in training. Rare or zero → model collapses completely.
With It
P("free" | Spam) × P("win" | Spam) × P("prize" | Spam) — multiply individual word probabilities. Each word estimated independently. Always computable.
Result
The joint probability of all features becomes a simple product of marginals. Fast, tractable, and surprisingly accurate across thousands of real applications.
General MAP Formula
ŷ = argmax P(Cₖ) ∏ P(xᵢ | Cₖ)
Choose the class Cₖ that maximises the product of its prior and the likelihood of each individual feature given that class
Log-Space Version (Numerically Stable)
ŷ = argmax [log P(Cₖ) + Σ log P(xᵢ | Cₖ)]
Logarithm converts multiplication into addition — prevents floating-point underflow when many small probabilities are multiplied together
⚠️
Why Log-Space Is Not Optional

Multiplying hundreds of small probabilities — 0.001 × 0.003 × 0.0005 × ... — produces numbers so tiny that computers silently round them to zero (floating-point underflow). Taking logarithms converts products into sums: log(a × b) = log(a) + log(b). Since log is monotone, argmax of the log-probabilities gives the identical answer as argmax of the raw probabilities. sklearn handles this internally — but you must handle it yourself when implementing from scratch.
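
A tiny experiment makes the underflow concrete; the 500 likelihoods of 0.001 used below are illustrative values, not outputs of a real model:

import numpy as np

probs = np.full(500, 0.001)           # 500 small per-word likelihoods (illustrative)

raw_product = np.prod(probs)          # 0.001**500 = 1e-1500, silently underflows to 0.0
log_sum     = np.sum(np.log(probs))   # 500 * log(0.001) ≈ -3453.9, perfectly representable

print(raw_product)   # 0.0           -- all information lost, every class would score zero
print(log_sum)       # -3453.877...  -- still comparable across classes via argmax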


Section 04

Step-by-Step Manual Calculation

Let us classify one new email by hand using a tiny training set of 10 emails with three binary features.

#    Contains "free"   Contains "winner"   Contains "meeting"   Label
1    ✅                 ✅                   ❌                    Spam
2    ✅                 ✅                   ❌                    Spam
3    ✅                 ✅                   ❌                    Spam
4    ✅                 ❌                   ❌                    Spam
5    ❌                 ❌                   ❌                    Spam
6    ✅                 ❌                   ✅                    Ham
7    ❌                 ❌                   ✅                    Ham
8    ❌                 ❌                   ✅                    Ham
9    ❌                 ❌                   ❌                    Ham
10   ❌                 ❌                   ❌                    Ham

New email to classify: contains "free" ✅, "winner" ✅, "meeting" ❌. Spam or Ham?

🧮 Step 1 — Compute Prior Probabilities
P(Spam)
5 spam emails out of 10 total → P(Spam) = 5/10 = 0.500
P(Ham)
5 ham emails out of 10 total → P(Ham) = 5/10 = 0.500
🧮 Step 2 — Compute Likelihoods Per Class
P(free|Spam)
4 of 5 spam emails contain "free" → 4/5 = 0.800
P(winner|Spam)
3 of 5 spam emails contain "winner" → 3/5 = 0.600
P(¬meeting|Spam)
5 of 5 spam emails lack "meeting" → 5/5 = 1.000
P(free|Ham)
1 of 5 ham emails contains "free" → 1/5 = 0.200
P(winner|Ham)
0 of 5 ham emails contain "winner" → 0/5 = 0.000 ⚠️ Zero Problem!
P(¬meeting|Ham)
2 of 5 ham emails lack "meeting" → 2/5 = 0.400
🧮 Step 3 — Multiply to Get MAP Scores
Spam
P(Spam) × P(free|Spam) × P(winner|Spam) × P(¬meeting|Spam)
= 0.5 × 0.800 × 0.600 × 1.000 = 0.240
Ham
P(Ham) × P(free|Ham) × P(winner|Ham) × P(¬meeting|Ham)
= 0.5 × 0.200 × 0.000 × 0.400 = 0.000 ← One zero kills it!
Decision
0.240 > 0.000 → Predicted: SPAM ✅  — correct, but the zero is a critical flaw (see Section 06)
🎯
Correct — But One Fatal Issue

The model predicted spam correctly. But Ham score is exactly zero because "winner" never appeared in any ham email during training. One unseen word silenced all other evidence for ham — permanently. This is the Zero Frequency Problem, solved by Laplace Smoothing in Section 06.
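
The whole walkthrough fits in a few lines of Python; this sketch simply re-derives Steps 1–3 from the counts in the table above (no smoothing yet, so the Ham zero survives):

# Priors from 5 spam / 5 ham training emails (Step 1)
p_spam, p_ham = 5 / 10, 5 / 10

# Likelihoods counted directly from the 10-email table (Step 2)
spam = {"free": 4 / 5, "winner": 3 / 5, "no_meeting": 5 / 5}
ham  = {"free": 1 / 5, "winner": 0 / 5, "no_meeting": 2 / 5}

# Step 3: MAP scores for the new email (free=yes, winner=yes, meeting=no)
score_spam = p_spam * spam["free"] * spam["winner"] * spam["no_meeting"]
score_ham  = p_ham  * ham["free"]  * ham["winner"]  * ham["no_meeting"]

print(f"Spam: {score_spam:.3f}   Ham: {score_ham:.3f}")              # Spam: 0.240   Ham: 0.000
print("Prediction:", "SPAM" if score_spam > score_ham else "HAM")    # SPAM -- ham killed by the zero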


Section 05

The Three Variants of Naïve Bayes

Naïve Bayes is not a single algorithm — it is a family. The variant you choose depends entirely on the type of features in your data.

📊
Gaussian NB
Continuous (Real-Valued) Features
Assumes each feature follows a normal distribution within each class. Learns μ (mean) and σ² (variance) per feature per class from training data, then evaluates new samples using the Gaussian PDF. Used for sensor readings, medical measurements, physical observations.
✅ Handles continuous data naturally
❌ Fails when features are not Gaussian
📝
Multinomial NB
Count / Frequency Features
Designed for word counts or TF-IDF vectors. Models how often each word appears in each class. Industry gold standard for spam detection, news topic classification, and document categorisation. Requires non-negative integer (or frequency) features.
✅ Gold standard for NLP text tasks
❌ Cannot use negative feature values
🔘
Bernoulli NB
Binary (0 / 1) Features
Uses binary word presence/absence (1 = word in email, 0 = not). Crucially, it explicitly penalises absent words — Multinomial ignores them. Works better than Multinomial on very short texts where frequency carries less signal than presence.
✅ Best for short documents and binary features
❌ Throws away frequency information entirely
Variant        | Feature Type                 | Likelihood Formula                         | Typical Use Cases
Gaussian NB    | Real-valued continuous       | Gaussian PDF — uses μ, σ²                  | Iris classification, medical data, sensors
Multinomial NB | Non-negative integer counts  | Multinomial distribution over word counts  | Spam, news topics, document tagging
Bernoulli NB   | Binary 0 / 1                 | Product of Bernoulli distributions         | Sentiment (short), binary feature sets
Complement NB  | Non-negative integer counts  | Trains on complement of each class         | Imbalanced text classification

Section 06

The Zero Frequency Problem — Laplace Smoothing

One Missing Word Destroys Everything
Our intern Bayes has a catastrophic flaw. If a word appears in a new email but was never seen in any training email of a given class, the probability for that entire class becomes exactly zero — regardless of how strongly all other words point toward it. It is like a jury acquitting an obviously guilty defendant because one minor witness was unavailable, ignoring the ten who already testified.
Without Smoothing (Broken)
P(word | Class) = count(word, Class) / N(Class)
If count = 0 → probability = 0 → entire product = 0. One unseen word silences every other feature permanently.
Laplace Smoothing (Fixed)
P(word | Class) = (count(word, Class) + α) / (N(Class) + α × V)
α is the smoothing parameter (α=1 is standard). V is vocabulary size. No probability can ever be zero again.
🧮 Laplace Smoothing Applied — Fixing P(winner | Ham)
Raw (Broken)
count("winner", Ham) = 0  |  N(Ham) = 5 → P = 0/5 = 0.000 💥 Kills the class
α = 1, V = 3
Vocabulary has 3 features: "free", "winner", "meeting"
Smoothed
P("winner" | Ham) = (0 + 1) / (5 + 1×3) = 1/8 = 0.125 ✅ No longer zero
Ham Score
0.5 × 0.200 × 0.125 × 0.400 = 0.005 — non-zero, fair comparison now possible
Final
Spam (0.240) vs Ham (0.005) → Still SPAM ✅ — but now with a meaningful competing score. (In a full implementation every likelihood, spam ones included, is re-estimated with the same α; only the broken zero is patched here to keep the arithmetic short.)
🎛️
Tuning Alpha (α)

α = 1 → Laplace smoothing (add-one). α < 1 → Lidstone smoothing (less aggressive). α = 0 → no smoothing (dangerous in production — never use). In sklearn: MultinomialNB(alpha=1.0) is the default. Tune α between 0.01 and 2.0 using cross-validation — smaller values suit large dense datasets, larger values suit small or sparse ones.
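
Wrapping the smoothed formula in a small helper makes the effect of α easy to see; the calls below reproduce the P("winner" | Ham) fix from the worked example (V = 3):

def smoothed_likelihood(word_count, class_count, alpha=1.0, vocab_size=3):
    """Laplace/Lidstone-smoothed P(word | class) = (count + α) / (N + α·V)."""
    return (word_count + alpha) / (class_count + alpha * vocab_size)

print(smoothed_likelihood(0, 5))              # 0.125  -- the broken 0/5 becomes 1/8
print(smoothed_likelihood(0, 5, alpha=0.5))   # 0.0769 -- Lidstone: lighter smoothing
print(smoothed_likelihood(4, 5))              # 0.625  -- non-zero counts barely change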


Section 07

Gaussian NB — How Continuous Features Are Modelled

When features are continuous (like age, glucose level, temperature), counting occurrences makes no sense. Gaussian NB instead models each feature as a normal distribution per class — estimating μ and σ² from training data.

Patient   Age   Glucose (mg/dL)   BMI    Diabetes?
1         50    180               33.6   Yes
2         58    200               35.0   Yes
3         45    165               30.5   Yes
4         25    90                22.1   No
5         30    85                23.5   No
6         28    100               24.0   No
📐 Training Phase — Estimate μ and σ Per Class
Diabetic
Age: μ=51.0, σ=6.6  |  Glucose: μ=181.7, σ=17.6  |  BMI: μ=33.0, σ=2.3
Healthy
Age: μ=27.7, σ=2.5  |  Glucose: μ=91.7, σ=7.6  |  BMI: μ=23.2, σ=1.0
New patient
Age=40, Glucose=150, BMI=29. Which class?
Prediction
Evaluate the Gaussian PDF at each feature value for both classes. Multiply the three PDF values by the class prior. Highest product wins → Diabetic (values sit clearly in diabetic distribution)
Gaussian PDF
P(x | Class) = (1 / (σ√(2π))) × e^(−(x−μ)² / (2σ²))
Probability density of observing value x given the class's mean μ and standard deviation σ
Full Gaussian NB Prediction
ŷ = argmax [log P(Cₖ) + Σ log PDF(xᵢ ; μₖᵢ, σₖᵢ)]
Sum log-probabilities from the Gaussian PDF for each feature and each class. Pick the class with the highest total.
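
A minimal from-scratch sketch of that prediction, plugging the new patient's values and the μ/σ estimates above into the log-space MAP rule (equal priors, since each class has three patients):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density, evaluated elementwise over the features."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.array([40.0, 150.0, 29.0])                                         # Age, Glucose, BMI
mu_d, sd_d = np.array([51.0, 181.7, 33.0]), np.array([6.6, 17.6, 2.3])    # Diabetic class
mu_h, sd_h = np.array([27.7, 91.7, 23.2]), np.array([2.5, 7.6, 1.0])      # Healthy class

log_diabetic = np.log(0.5) + np.log(gaussian_pdf(x, mu_d, sd_d)).sum()
log_healthy  = np.log(0.5) + np.log(gaussian_pdf(x, mu_h, sd_h)).sum()

print(f"log-score Diabetic: {log_diabetic:.1f}   Healthy: {log_healthy:.1f}")    # ≈ -13.6 vs -64.7
print("Prediction:", "Diabetic" if log_diabetic > log_healthy else "Healthy")    # Diabetic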

Section 08

Python Implementation

Gaussian Naïve Bayes — Iris Classification

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
import numpy as np

# Load Iris — continuous features → Gaussian NB is ideal
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB(
    var_smoothing=1e-9   # small constant added to variance for stability
)
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
print(f"Accuracy: {gnb.score(X_test, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Inspect learned μ and σ² per class per feature
for i, cls in enumerate(iris.target_names):
    print(f"\n{cls}:")
    for j, feat in enumerate(iris.feature_names):
        print(f"  {feat:30s}  μ={gnb.theta_[i,j]:.3f}  σ²={gnb.var_[i,j]:.4f}")
OUTPUT
Accuracy: 0.9667
              precision    recall  f1-score   support
      setosa       1.00      1.00      1.00        10
  versicolor       0.91      1.00      0.95        10
   virginica       1.00      0.90      0.95        10
    accuracy                           0.97        30

setosa:
  sepal length (cm)    μ=5.006  σ²=0.1242
  petal length (cm)    μ=1.464  σ²=0.0302
versicolor:
  sepal length (cm)    μ=5.936  σ²=0.2664
  petal length (cm)    μ=4.260  σ²=0.2208

Multinomial NB — Spam Detection Pipeline

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

emails = [
    "Get free money now win prizes",
    "Congratulations you are a winner claim your reward",
    "Free entry in our lucky draw text WIN to 88888",
    "Hi how are you doing today everything okay",
    "Meeting scheduled for 3pm please confirm attendance",
    "Project update attached please review the report",
    "WINNER you have been selected for a cash prize",
    "Lunch tomorrow at the usual place let me know",
    "Urgent your account will be suspended click here now",
    "Can we reschedule the team standup to Friday afternoon",
]
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # 1=spam 0=ham

X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.3, random_state=42
)

# TF-IDF → Multinomial NB pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),    # unigrams + bigrams
        min_df=1
    )),
    ('clf', MultinomialNB(alpha=1.0))   # Laplace smoothing
])

pipeline.fit(X_train, y_train)

# Predict with class probabilities
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)

for email, pred, prob in zip(X_test, y_pred, y_prob):
    label = "SPAM" if pred == 1 else "HAM "
    print(f"[{label}] ({prob[1]*100:5.1f}% spam)  {email[:50]}")
OUTPUT
[HAM ] (  8.3% spam)  Hi how are you doing today everything okay
[SPAM] ( 94.1% spam)  Urgent your account will be suspended click here now
[HAM ] ( 12.5% spam)  Can we reschedule the team standup to Friday afternoon

Bernoulli NB — Binary Bag-of-Words

from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# binary=True: CountVectorizer caps all counts at 1 (word present/absent)
pipeline_bnb = Pipeline([
    ('bow', CountVectorizer(binary=True, stop_words='english')),
    ('clf', BernoulliNB(alpha=1.0, binarize=None))
    # binarize=None: features already binarised by CountVectorizer
])

pipeline_bnb.fit(X_train, y_train)
print("BernoulliNB accuracy:", f"{pipeline_bnb.score(X_test, y_test):.4f}")

Comparing All Variants with Cross-Validation

from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# ComplementNB — best for imbalanced class distributions
pipeline_cnb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf',   ComplementNB(alpha=1.0, norm=True))
])

# 3-fold CV on the full emails dataset
for name, pipe in [
    ('MultinomialNB', pipeline),
    ('BernoulliNB  ', pipeline_bnb),
    ('ComplementNB ', pipeline_cnb)
]:
    scores = cross_val_score(pipe, emails, labels, cv=3, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")

Section 09

Full Real-World NLP Pipeline — 20 Newsgroups

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups
import matplotlib.pyplot as plt

# 4-topic multi-class classification
categories = ['sci.space', 'comp.graphics',
              'rec.sport.hockey', 'talk.politics.guns']

train = fetch_20newsgroups(subset='train', categories=categories)
test  = fetch_20newsgroups(subset='test',  categories=categories)

# Pipeline with hyperparameter search
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf',   MultinomialNB())
])

param_grid = {
    'tfidf__max_features': [5000, 20000, None],
    'tfidf__ngram_range':  [(1, 1), (1, 2)],
    'tfidf__sublinear_tf': [True, False],
    'clf__alpha':          [0.01, 0.1, 0.5, 1.0],
}

grid = GridSearchCV(
    pipeline, param_grid,
    cv=5, scoring='f1_macro',
    n_jobs=-1, verbose=0
)
grid.fit(train.data, train.target)

print("Best params:", grid.best_params_)
print(f"Best CV F1:   {grid.best_score_:.4f}")

y_pred = grid.predict(test.data)
print(classification_report(
    test.target, y_pred,
    target_names=train.target_names
))

# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
    test.target, y_pred,
    display_labels=train.target_names,
    xticks_rotation='vertical'
)
plt.tight_layout()
plt.show()
OUTPUT
Best params: {'clf__alpha': 0.1, 'tfidf__max_features': None, 'tfidf__ngram_range': (1, 2), 'tfidf__sublinear_tf': True}
Best CV F1:   0.9312
                    precision    recall  f1-score   support
         sci.space       0.97      0.96      0.96       394
     comp.graphics       0.91      0.93      0.92       389
  rec.sport.hockey       0.99      0.98      0.98       399
talk.politics.guns       0.93      0.93      0.93       364
          accuracy                           0.95      1546

Section 10

When to Use Naïve Bayes

Text Classification
Spam detection, news topic classification, document tagging, sentiment analysis. MultinomialNB with TF-IDF remains competitive with far more complex models across many NLP benchmarks.
spam · topics · sentiment · intent
Very Small Datasets
When you have little training data, Naïve Bayes is robust. It only needs enough samples to estimate P(word | class) reliably — far fewer than deep learning or Random Forest require.
low-data regime · cold start
Real-Time Inference
Naïve Bayes is one of the fastest classifiers at prediction time — O(features × classes). Excellent for streaming data, on-device email filtering, or any latency-critical system.
edge devices · streaming · IoT
Many Classes
Handles hundreds of classes natively — each gets its own prior and likelihoods. No one-vs-rest tricks required. Scales cleanly to product catalogues, news feeds, language detection.
multi-class · hundreds of labels
Correlated Features
When features are strongly correlated, the independence assumption fails badly. Probabilities become miscalibrated. Logistic Regression or tree-based models handle correlations far better.
use Logistic Regression instead
Structured Tabular Data
For mixed-type tabular data with numeric features, Gaussian NB's normality assumption often breaks severely. Random Forest and Gradient Boosting almost always win here.
use Random Forest instead

Section 11

Naïve Bayes vs Other Classifiers

Property                | Naïve Bayes          | Logistic Regression | Random Forest      | SVM
Training speed          | ⚡ Fastest            | Fast                | Moderate           | Slow on large N
Inference speed         | ⚡ Fastest            | Fast                | Moderate           | Moderate
Text / NLP              | Excellent            | Excellent           | Poor (sparse data) | Excellent
Probability calibration | Overconfident        | Well calibrated     | Moderate           | Needs Platt scaling
Handles missing values  | Yes (skip features)  | Needs imputation    | Needs imputation   | Needs imputation
Feature scaling needed  | No                   | Yes                 | No                 | Yes
Correlated features     | Assumption violated  | Handles well        | Handles well       | Handles well
Data requirement        | Very low             | Moderate            | Moderate–High      | Moderate

Section 12

Strengths vs Weaknesses

✅ Strengths
Extremely fast to train — linear in data size
Fastest inference of any classifier
Excellent on text and NLP tasks
Scales to millions of sparse features
Works well with very small training sets
Multi-class native — no extra tricks needed
Interpretable — inspect learned probabilities
No feature scaling or normalisation required
Handles missing features gracefully
❌ Weaknesses
Independence assumption almost always violated
Poorly calibrated — probabilities too extreme
Zero frequency problem (requires smoothing)
Cannot model feature interactions at all
Gaussian NB breaks on non-normal distributions
Outperformed by boosting and LR on most tabular tasks
Word order completely ignored in text (bag-of-words)
Sensitive to irrelevant / redundant features
📐
Fix Overconfident Probabilities with Calibration

Naïve Bayes probabilities tend to cluster near 0 and 1 — overconfident. If you need reliable probability estimates (e.g. for risk scoring or decision thresholds), wrap the model with sklearn's CalibratedClassifierCV using method='isotonic' for large datasets or 'sigmoid' for small ones. For the classification decision itself (not the probability), raw outputs are fine.
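
A minimal sketch of the wrapping step, reusing the Iris data from Section 08; the choice of GaussianNB and the sigmoid method here is illustrative, picked because the dataset is small:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# method='sigmoid' (Platt scaling) is the safer choice on small data;
# switch to method='isotonic' once you have thousands of samples
calibrated = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=5)
calibrated.fit(X_tr, y_tr)

print(f"Calibrated accuracy: {calibrated.score(X_te, y_te):.3f}")
print(f"Largest predicted probability: {calibrated.predict_proba(X_te).max():.3f}")  # pulled back from ~1.0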


Section 13

Naïve Bayes in the Real World

Email Spam Filters
📧
  • Used by SpamAssassin since 2001
  • Still in production in mail servers
  • Processes millions of emails per second
  • Updates continuously with new examples
Medical Diagnosis
🏥
  • Symptom → disease classification
  • Probabilistic output aids clinicians
  • Works reliably with limited patient data
  • Used in early diagnostic support tools
Sentiment & NLP
💬
  • Product review classification
  • Social media real-time monitoring
  • News category tagging at scale
  • Language detection in multi-lingual systems

Section 14

Golden Rules

🎯 Naïve Bayes — Non-Negotiable Rules
1
Always use Laplace smoothing — never set alpha=0 in production. Zero probabilities fail silently: no error is thrown, yet any class that never saw one of the incoming features is scored zero and ruled out, regardless of all other evidence. The default alpha=1.0 is a safe start.
2
Match the variant to your feature type precisely. Continuous numeric features → GaussianNB. Word counts or TF-IDF → MultinomialNB. Binary presence/absence → BernoulliNB. Imbalanced text → ComplementNB. Using the wrong variant is the single most common Naïve Bayes mistake.
3
Use Naïve Bayes as your mandatory baseline. It trains in milliseconds and gives you a strong performance floor. If your complex model cannot beat it, something is wrong with the complex model — not with Naïve Bayes.
4
Tune alpha with cross-validation — it is the single most impactful hyperparameter. Try values on a log scale: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0 (a minimal sweep is sketched after these rules). Smaller alpha values work better on large, dense, high-quality data. Larger values help on sparse or tiny datasets.
5
Do not trust raw probabilities for high-stakes decisions. Apply CalibratedClassifierCV(method='isotonic') to post-process the outputs if your application requires reliable probability estimates — risk scoring, fraud thresholds, medical triage. For a simple classification decision, raw outputs are perfectly fine.
6
Always work in log-space when implementing from scratch. Use log P(Class) + Σ log P(xᵢ | Class) — never multiply raw probabilities directly. Underflow is silent and produces wrong answers with no warning.
7
The independence assumption being violated is normal — and acceptable. It affects the calibration of the probabilities, not necessarily the final class decision. Naïve Bayes remains competitive with far more sophisticated models on text tasks despite every word in English being correlated with every other word in its context.
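
The alpha sweep mentioned in Rule 4, sketched on the same 20 Newsgroups data used in Section 09; the two-category subset, grid values, and scoring metric are illustrative choices:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset='train',
                           categories=['sci.space', 'comp.graphics'])

pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                 ('clf',   MultinomialNB())])

# Log-scale grid from Rule 4 -- refine around the winner on a second pass
grid = GridSearchCV(pipe,
                    {'clf__alpha': [0.001, 0.01, 0.1, 0.5, 1.0, 2.0]},
                    cv=5, scoring='f1_macro', n_jobs=-1)
grid.fit(train.data, train.target)

print("Best alpha:", grid.best_params_['clf__alpha'])
print(f"Best CV F1: {grid.best_score_:.3f}")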