The Story That Explains Naïve Bayes
"The word FREE appears in 380 spam emails but only 12 legitimate ones."
"The word meeting appears in 420 legitimate emails but only 3 spam ones."
"The word winner appears in 350 spam emails but only 2 legitimate ones."
A new email arrives: "You are a FREE WINNER — claim your prize now!"
Bayes does not read it like a human. He asks one cold, mathematical question: "Given each of these words, what is the probability this is spam?" He multiplies the individual word probabilities, compares Spam vs Legitimate, and announces: "99.2% spam." Correct — deleted in milliseconds.
That is Naïve Bayes. "Naïve" because it assumes each word votes independently — a simplification that is statistically wrong but works remarkably well in practice.
In reality the words "FREE" and "WINNER" are not independent — seeing one makes the other more likely. Naïve Bayes ignores all correlations and treats every feature as if it contributes independently to the outcome. Statistically wrong. Practically powerful.
The Foundation — Bayes' Theorem
Everything in Naïve Bayes flows from one equation discovered by the Reverend Thomas Bayes in the 18th century. It tells you how to update a belief when you receive new evidence.
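In symbols, with the terms named in the table below:

P(Class | Features) = [ P(Features | Class) × P(Class) ] / P(Features)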
| Term | Name | Plain English | Spam Example |
|---|---|---|---|
| P(Class \| Features) | Posterior | What we want — probability of the class given the evidence we observed | P(Spam \| "FREE", "WIN") |
| P(Features \| Class) | Likelihood | How probable are these features if we already know the class? | P("FREE", "WIN" \| Spam) |
| P(Class) | Prior | How common is this class in the training data — before seeing any features | P(Spam) = 400/1000 = 0.40 |
| P(Features) | Evidence | How common are these features overall? Same for every class — safely cancelled | Constant divisor — ignored in practice |
When comparing classes (Spam vs Ham), P(Features) is identical
for both — it cancels out. We only need to compare
P(Features | Class) × P(Class) for each class and pick the
largest. No division required. This is the
Maximum A Posteriori (MAP) rule — and it is all Naïve Bayes uses.
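In symbols:

predicted class = argmax over classes c of [ P(Features | c) × P(c) ]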
The Naïve Independence Assumption
With many features, computing P(word₁, word₂, word₃, ... | Class) directly
is impossible — you would need to have seen every exact combination of words in training.
Naïve Bayes solves this with a bold simplification.
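The simplification: assume every feature is conditionally independent given the class, so the joint likelihood factorises into per-word terms:

P(word₁, word₂, …, wordₙ | Class) ≈ P(word₁ | Class) × P(word₂ | Class) × … × P(wordₙ | Class)

Each factor is a simple count from the training data — no combination of words ever needs to have been observed together.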
Multiplying hundreds of small probabilities — 0.001 × 0.003 × 0.0005 × ... —
produces numbers so tiny that computers silently round them to
zero (floating-point underflow). Taking logarithms converts
products into sums: log(a × b) = log(a) + log(b).
Since log is monotone, argmax of the log-probabilities
gives the identical answer as argmax of the raw probabilities.
sklearn handles this internally — but you must handle it yourself
when implementing from scratch.
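A minimal sketch of the failure and the fix (the 400 identical likelihoods are made up for illustration):

import numpy as np

p = np.full(400, 1e-3)     # 400 tiny per-word likelihoods
print(p.prod())            # 0.0 — silent floating-point underflow
print(np.log(p).sum())     # ≈ -2763.1 — same information, no underflow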
Step-by-Step Manual Calculation
Let us classify one new email by hand using a tiny training set of 10 emails with three binary features.
| # | Contains "free" | Contains "winner" | Contains "meeting" | Label |
|---|---|---|---|---|
| 1 | ✅ | ✅ | ❌ | Spam |
| 2 | ✅ | ❌ | ❌ | Spam |
| 3 | ✅ | ✅ | ❌ | Spam |
| 4 | ❌ | ✅ | ❌ | Spam |
| 5 | ✅ | ❌ | ❌ | Spam |
| 6 | ❌ | ❌ | ✅ | Ham |
| 7 | ❌ | ❌ | ✅ | Ham |
| 8 | ❌ | ❌ | ✅ | Ham |
| 9 | ✅ | ❌ | ✅ | Ham |
| 10 | ❌ | ❌ | ❌ | Ham |
New email to classify: contains "free" ✅, "winner" ✅, "meeting" ❌. Spam or Ham?
From the table: P(Spam) = 5/10 = 0.5; P(free | Spam) = 4/5 = 0.800; P(winner | Spam) = 3/5 = 0.600; P(no "meeting" | Spam) = 5/5 = 1.000. For Ham: P(free | Ham) = 1/5 = 0.200; P(winner | Ham) = 0/5 = 0.000; P(no "meeting" | Ham) = 1/5 = 0.200.

Score(Spam) = 0.5 × 0.800 × 0.600 × 1.000 = 0.240
Score(Ham) = 0.5 × 0.200 × 0.000 × 0.200 = 0.000 ← One zero kills it!
The model predicted spam correctly. But the Ham score is exactly zero because "winner" never appeared in any ham email during training. One unseen word silenced all other evidence for ham — permanently. This is the Zero Frequency Problem, solved by Laplace smoothing below.
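The same numbers fall out of a few lines of NumPy (a sanity check on the table above, deliberately without smoothing):

import numpy as np

# Rows 1-5 are spam, 6-10 ham; columns: [free, winner, meeting]
X = np.array([[1,1,0],[1,0,0],[1,1,0],[0,1,0],[1,0,0],
              [0,0,1],[0,0,1],[0,0,1],[1,0,1],[0,0,0]])
y = np.array([1,1,1,1,1,0,0,0,0,0])   # 1 = spam, 0 = ham
x_new = np.array([1,1,0])             # free ✅, winner ✅, meeting ❌

for cls, name in [(1, "Spam"), (0, "Ham")]:
    p_present = X[y == cls].mean(axis=0)                 # P(word=1 | class)
    p_match = np.where(x_new == 1, p_present, 1 - p_present)
    print(f"Score({name}) = {(y == cls).mean() * p_match.prod():.3f}")
# Score(Spam) = 0.240, Score(Ham) = 0.000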
The Four Variants of Naïve Bayes
Naïve Bayes is not a single algorithm — it is a family. The variant you choose depends entirely on the type of features in your data.
| Variant | Feature Type | Likelihood Formula | Typical Use Cases |
|---|---|---|---|
| Gaussian NB | Real-valued continuous | Gaussian PDF — uses μ, σ² | Iris classification, medical data, sensors |
| Multinomial NB | Non-negative integer counts | Multinomial distribution over word counts | Spam, news topics, document tagging |
| Bernoulli NB | Binary 0 / 1 | Product of Bernoulli distributions | Sentiment (short), binary feature sets |
| Complement NB | Non-negative integer counts | Trains on complement of each class | Imbalanced text classification |
The Zero Frequency Problem — Laplace Smoothing
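Laplace smoothing adds a pseudo-count α to every count, so no probability can ever be exactly zero. For Multinomial NB:

P(wordᵢ | Class) = (count(wordᵢ, Class) + α) / (count(Class) + α × V)

where V is the vocabulary size. In the manual example above, α = 1 turns P(winner | Ham) = 0/5 into (0 + 1)/(5 + 2) ≈ 0.14 (binary features add 2α to the denominator — one α each for present and absent). Small, but no longer fatal.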
α = 1 → Laplace smoothing (add-one).
α < 1 → Lidstone smoothing (less aggressive).
α = 0 → no smoothing (dangerous in production — never use).
In sklearn: MultinomialNB(alpha=1.0) is the default.
Tune α between 0.01 and 2.0 using cross-validation —
smaller values suit large dense datasets, larger values suit
small or sparse ones.
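A sketch of that search (docs and labels are hypothetical placeholders for your own corpus):

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

search = GridSearchCV(
    Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())]),
    param_grid={'clf__alpha': [0.01, 0.1, 0.5, 1.0, 2.0]},
    cv=5, scoring='f1_macro'
)
# search.fit(docs, labels)    # docs: list of str, labels: list of int
# best_alpha = search.best_params_['clf__alpha']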
Gaussian NB — How Continuous Features Are Modelled
When features are continuous (like age, glucose level, temperature), counting occurrences makes no sense. Gaussian NB instead models each feature as a normal distribution per class — estimating μ and σ² from training data.
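For each class c and feature value x, the likelihood is the normal density

P(x | c) = (1 / √(2π σ²c)) × exp( −(x − μc)² / (2σ²c) )

with μc and σ²c estimated from the training rows of class c.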
| Patient | Age | Glucose (mg/dL) | BMI | Diabetes? |
|---|---|---|---|---|
| 1 | 50 | 180 | 33.6 | Yes |
| 2 | 58 | 200 | 35.0 | Yes |
| 3 | 45 | 165 | 30.5 | Yes |
| 4 | 25 | 90 | 22.1 | No |
| 5 | 30 | 85 | 23.5 | No |
| 6 | 28 | 100 | 24.0 | No |
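A minimal sketch of what Gaussian NB does with the glucose column alone (the new patient's value of 150 is hypothetical):

import numpy as np

glucose = np.array([180., 200., 165., 90., 85., 100.])
diabetes = np.array([1, 1, 1, 0, 0, 0])   # 1 = Yes, 0 = No
x_new = 150.0                             # hypothetical new patient

for cls, name in [(1, "Yes"), (0, "No")]:
    g = glucose[diabetes == cls]
    mu, var = g.mean(), g.var()           # sklearn uses the biased (ML) variance
    pdf = np.exp(-(x_new - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    print(f"Diabetes={name}: μ={mu:.1f}, σ²={var:.1f}, "
          f"score={0.5 * pdf:.2e}")       # prior = 3/6 for each class
# "Yes" wins by many orders of magnitude → classified as diabetic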
Python Implementation
Gaussian Naïve Bayes — Iris Classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
import numpy as np
# Load Iris — continuous features → Gaussian NB is ideal
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
gnb = GaussianNB(
var_smoothing=1e-9 # small constant added to variance for stability
)
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy: {gnb.score(X_test, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Inspect learned μ and σ² per class per feature
for i, cls in enumerate(iris.target_names):
print(f"\n{cls}:")
for j, feat in enumerate(iris.feature_names):
print(f" {feat:30s} μ={gnb.theta_[i,j]:.3f} σ²={gnb.var_[i,j]:.4f}")
Multinomial NB — Spam Detection Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
emails = [
"Get free money now win prizes",
"Congratulations you are a winner claim your reward",
"Free entry in our lucky draw text WIN to 88888",
"Hi how are you doing today everything okay",
"Meeting scheduled for 3pm please confirm attendance",
"Project update attached please review the report",
"WINNER you have been selected for a cash prize",
"Lunch tomorrow at the usual place let me know",
"Urgent your account will be suspended click here now",
"Can we reschedule the team standup to Friday afternoon",
]
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0] # 1=spam 0=ham
X_train, X_test, y_train, y_test = train_test_split(
emails, labels, test_size=0.3, random_state=42
)
# TF-IDF → Multinomial NB pipeline
pipeline = Pipeline([
('tfidf', TfidfVectorizer(
stop_words='english',
ngram_range=(1, 2), # unigrams + bigrams
min_df=1
)),
('clf', MultinomialNB(alpha=1.0)) # Laplace smoothing
])
pipeline.fit(X_train, y_train)
# Predict with class probabilities
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)
for email, pred, prob in zip(X_test, y_pred, y_prob):
label = "SPAM" if pred == 1 else "HAM "
print(f"[{label}] ({prob[1]*100:5.1f}% spam) {email[:50]}")
Bernoulli NB — Binary Bag-of-Words
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# binary=True: CountVectorizer caps all counts at 1 (word present/absent)
pipeline_bnb = Pipeline([
('bow', CountVectorizer(binary=True, stop_words='english')),
('clf', BernoulliNB(alpha=1.0, binarize=None))
# binarize=None: features already binarised by CountVectorizer
])
pipeline_bnb.fit(X_train, y_train)
print("BernoulliNB accuracy:", f"{pipeline_bnb.score(X_test, y_test):.4f}")
Comparing All Variants with Cross-Validation
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# ComplementNB — best for imbalanced class distributions
pipeline_cnb = Pipeline([
('tfidf', TfidfVectorizer(stop_words='english')),
('clf', ComplementNB(alpha=1.0, norm=True))
])
# 3-fold CV on the full emails dataset
for name, pipe in [
('MultinomialNB', pipeline),
('BernoulliNB ', pipeline_bnb),
('ComplementNB ', pipeline_cnb)
]:
scores = cross_val_score(pipe, emails, labels, cv=3, scoring='f1')
print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")
Full Real-World NLP Pipeline — 20 Newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.datasets import fetch_20newsgroups
import matplotlib.pyplot as plt
# 4-topic multi-class classification
categories = ['sci.space', 'comp.graphics',
'rec.sport.hockey', 'talk.politics.guns']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# Pipeline with hyperparameter search
pipeline = Pipeline([
('tfidf', TfidfVectorizer()),
('clf', MultinomialNB())
])
param_grid = {
'tfidf__max_features': [5000, 20000, None],
'tfidf__ngram_range': [(1, 1), (1, 2)],
'tfidf__sublinear_tf': [True, False],
'clf__alpha': [0.01, 0.1, 0.5, 1.0],
}
grid = GridSearchCV(
pipeline, param_grid,
cv=5, scoring='f1_macro',
n_jobs=-1, verbose=0
)
grid.fit(train.data, train.target)
print("Best params:", grid.best_params_)
print(f"Best CV F1: {grid.best_score_:.4f}")
y_pred = grid.predict(test.data)
print(classification_report(
test.target, y_pred,
target_names=train.target_names
))
# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
test.target, y_pred,
display_labels=train.target_names,
xticks_rotation='vertical'
)
plt.tight_layout()
plt.show()
When to Use Naïve Bayes
Naïve Bayes vs Other Classifiers
| Property | Naïve Bayes | Logistic Regression | Random Forest | SVM |
|---|---|---|---|---|
| Training speed | ⚡ Fastest | Fast | Moderate | Slow on large N |
| Inference speed | ⚡ Fastest | Fast | Moderate | Moderate |
| Text / NLP | Excellent | Excellent | Poor (sparse data) | Excellent |
| Probability calibration | Overconfident | Well calibrated | Moderate | Needs Platt scaling |
| Handles missing values | Yes (skip features) | Needs imputation | Needs imputation | Needs imputation |
| Feature scaling needed | No | Yes | No | Yes |
| Correlated features | Assumption violated | Handles well | Handles well | Handles well |
| Data requirement | Very low | Moderate | Moderate–High | Moderate |
Strengths vs Weaknesses

| Strengths | Weaknesses |
|---|---|
| Extremely fast to train — linear in data size | Independence assumption almost always violated |
| Fastest inference of any classifier | Poorly calibrated — probabilities too extreme |
| Excellent on text and NLP tasks | Zero frequency problem (requires smoothing) |
| Scales to millions of sparse features | Cannot model feature interactions at all |
| Works well with very small training sets | Gaussian NB breaks on non-normal distributions |
| Multi-class native — no extra tricks needed | Outperformed by boosting and LR on most tabular tasks |
| Interpretable — inspect learned probabilities | Word order completely ignored in text (bag-of-words) |
| No feature scaling or normalisation required | Sensitive to irrelevant / redundant features |
| Handles missing features gracefully | |
Naïve Bayes probabilities tend to cluster near 0 and 1 — overconfident.
If you need reliable probability estimates (e.g. for risk scoring or decision thresholds),
wrap the model with sklearn's
CalibratedClassifierCV using method='isotonic' for
large datasets or 'sigmoid' for small ones.
For the classification decision itself (not the probability), raw outputs are fine.
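A minimal sketch, reusing the spam pipeline and toy data from above ('sigmoid' because the dataset is tiny):

from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(pipeline, method='sigmoid', cv=3)
calibrated.fit(emails, labels)   # 5 spam / 5 ham → stratified 3-fold is safe
print(calibrated.predict_proba(emails[:2])[:, 1])   # calibrated P(spam)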
Naïve Bayes in the Real World

Spam filtering
- Used by SpamAssassin since 2001
- Still in production in mail servers
- Processes millions of emails per second
- Updates continuously with new examples

Medical diagnosis
- Symptom → disease classification
- Probabilistic output aids clinicians
- Works reliably with limited patient data
- Used in early diagnostic support tools

Text and language analytics
- Product review classification
- Social media real-time monitoring
- News category tagging at scale
- Language detection in multi-lingual systems
Golden Rules
- Never set alpha=0 in production. Zero probabilities destroy the model silently: no error is thrown, the output is simply always wrong for any class containing an unseen feature. The default alpha=1.0 is a safe start.
- Tune alpha with cross-validation — it is the single most impactful hyperparameter. Try values on a log scale: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0. Smaller alpha values work better on large, dense, high-quality data. Larger values help on sparse or tiny datasets.
- Use CalibratedClassifierCV(method='isotonic') to post-process the outputs if your application requires reliable probability estimates — risk scoring, fraud thresholds, medical triage. For a simple classification decision, raw outputs are perfectly fine.
- Always compute log P(Class) + Σ log P(xᵢ | Class) — never multiply raw probabilities directly. Underflow is silent and produces wrong answers with no warning.