Machine Learning: Naive Bayes
Naive Bayes is a fast, probabilistic classification algorithm based on Bayes' Theorem. It calculates the probability of each class given the input features and predicts the most likely class. The "naive" part is its assumption that all features are independent of each other given the class — a simplification that rarely holds in reality, but works surprisingly well in practice.
Bayes' Theorem: The Foundation
Bayes' Theorem:
P(Class | Features) = P(Features | Class) × P(Class) / P(Features)
In plain words:
"Probability of class given these features"
= "How likely are these features in this class"
× "How common is this class overall"
÷ "How common are these features overall"
We compare this probability for each class.
The class with the highest probability wins.
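This decision rule can be sketched in a few lines of Python (the function names are mine; P(Features) is dropped because it is the same for every class and does not change the argmax):

```python
def posterior_scores(priors, likelihoods):
    """Unnormalized P(class | features) for each class.

    priors:      {class: P(class)}
    likelihoods: {class: [P(feature_i | class), ...]}
    The shared denominator P(Features) is omitted, since we only
    need to compare classes, not get calibrated probabilities.
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for p in likelihoods[cls]:
            score *= p
        scores[cls] = score
    return scores

def predict(priors, likelihoods):
    """Return the class with the highest unnormalized posterior."""
    scores = posterior_scores(priors, likelihoods)
    return max(scores, key=scores.get)
```

With equal priors and feature likelihoods of 0.9/0.8 for class A versus 0.1/0.2 for class B, class A wins by a wide margin.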
Example: Spam Detection
Training Data: 100 emails
70 Not Spam, 30 Spam
Prior Probabilities:
P(Spam) = 30/100 = 0.30
P(Not Spam) = 70/100 = 0.70
Word frequencies (in spam vs not spam):
Word "FREE":
Appears in 25 of 30 spam emails → P(FREE | Spam) = 0.83
Appears in 5 of 70 not spam → P(FREE | Not Spam) = 0.07
Word "CLICK":
Appears in 20 of 30 spam emails → P(CLICK | Spam) = 0.67
Appears in 7 of 70 not spam → P(CLICK | Not Spam) = 0.10
New Email: contains "FREE" and "CLICK"
Naive Bayes Calculation:
P(Spam | FREE, CLICK) ∝ P(FREE|Spam) × P(CLICK|Spam) × P(Spam)
= 0.83 × 0.67 × 0.30 = 0.167
P(Not Spam | FREE, CLICK) ∝ P(FREE|Not Spam) × P(CLICK|Not Spam) × P(Not Spam)
= 0.07 × 0.10 × 0.70 = 0.0049
Compare: 0.167 >> 0.0049
Prediction: SPAM ✓
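The arithmetic above can be reproduced directly (variable names are mine; every probability comes straight from the training counts):

```python
# Priors and word likelihoods taken from the 100-email training set above.
priors = {"Spam": 30 / 100, "Not Spam": 70 / 100}
likelihoods = {
    "Spam":     {"FREE": 25 / 30, "CLICK": 20 / 30},
    "Not Spam": {"FREE": 5 / 70,  "CLICK": 7 / 70},
}

def score(cls, words):
    """Unnormalized posterior: P(class) * product of P(word | class)."""
    s = priors[cls]
    for w in words:
        s *= likelihoods[cls][w]
    return s

email = ["FREE", "CLICK"]
spam_score = score("Spam", email)      # 1/6 ≈ 0.167
ham_score = score("Not Spam", email)   # 0.005 with exact fractions
                                       # (0.0049 with the rounded values above)
prediction = max(priors, key=lambda c: score(c, email))  # "Spam"
```

Note that with exact fractions the Not Spam score is 0.005 rather than 0.0049; the hand calculation used the rounded likelihoods 0.07 and 0.10. The conclusion is unchanged.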
Types of Naive Bayes
┌──────────────────────┬──────────────────────────────────────────┐
│ Type                 │ Best For                                 │
├──────────────────────┼──────────────────────────────────────────┤
│ Gaussian Naive Bayes │ Continuous numerical features            │
│                      │ Assumes features follow a normal         │
│                      │ distribution (bell curve)                │
│ Multinomial NB       │ Text data — word counts or frequencies   │
│                      │ Most common for document classification  │
│ Bernoulli NB         │ Binary features (word present or not)    │
│                      │ Good for short documents or sentiment    │
└──────────────────────┴──────────────────────────────────────────┘
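As a rough sketch of how the three variants are chosen in practice, assuming scikit-learn is installed (the tiny count matrix below is invented purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data: 4 "documents" as counts of 3 vocabulary words;
# label 1 = spam, 0 = not spam.
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 2, 3],
                     [0, 3, 1]])
y = np.array([1, 1, 0, 0])

# Multinomial NB works on the raw word counts.
clf_multi = MultinomialNB().fit(X_counts, y)

# Bernoulli NB only cares whether each word is present.
clf_bern = BernoulliNB().fit((X_counts > 0).astype(int), y)

# Gaussian NB treats each column as a continuous feature.
clf_gauss = GaussianNB().fit(X_counts.astype(float), y)

clf_multi.predict([[3, 0, 0]])  # a count vector heavy in word 0 -> spam
```

The only real difference is how X is prepared: counts for Multinomial, 0/1 presence for Bernoulli, floats for Gaussian.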
Why "Naive"?
The algorithm assumes features are INDEPENDENT of each other. Real world: "FREE" and "PRIZE" often appear together in spam. They are correlated. Naive Bayes ignores this correlation and treats each word as if it appeared independently. Despite this incorrect assumption, Naive Bayes often performs very well, especially on text classification.
Advantages and Limitations
Advantages:
✓ Extremely fast to train and predict
✓ Works very well with small datasets
✓ Handles high-dimensional data well (text)
✓ Relatively robust to irrelevant features
✓ Great baseline for NLP tasks
Limitations:
✗ Independence assumption is almost never true
✗ Probability estimates can be poorly calibrated
✗ If a word never appears in training, it gets zero probability
(solved by Laplace smoothing — add 1 to all counts)
✗ Not ideal for complex numerical relationships
Laplace Smoothing (Handling Zero Probabilities)
Problem:
A new email contains the word "BITCOIN".
"BITCOIN" never appeared in any training spam email, so:
P(BITCOIN | Spam) = 0/30 = 0
Multiplying any product of probabilities by 0 gives 0
→ Model always predicts NOT SPAM for this email.
Solution: add 1 to all counts (Laplace smoothing):
P(BITCOIN | Spam) = (0 + 1) / (30 + total_unique_words) > 0
→ Problem solved. The zero probability disappears.
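The fix above can be written as one small function (the function name and the vocabulary size of 1000 are hypothetical choices for illustration):

```python
def smoothed_prob(word_count, class_count, vocab_size, alpha=1):
    """Laplace (add-alpha) estimate of P(word | class).

    word_count:  emails of this class containing the word
    class_count: total emails of this class
    vocab_size:  number of distinct words in the training vocabulary
    """
    return (word_count + alpha) / (class_count + alpha * vocab_size)

# "BITCOIN" appears in 0 of the 30 training spam emails.
unsmoothed = 0 / 30                               # 0.0 -> kills the whole product
smoothed = smoothed_prob(0, 30, vocab_size=1000)  # small, but strictly > 0
```

Setting alpha between 0 and 1 gives the more general add-alpha (Lidstone) smoothing; alpha = 1 is the classic Laplace variant described above.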
