Machine Learning: Naive Bayes
Naive Bayes is a fast, probabilistic classification algorithm based on Bayes' Theorem. It calculates the probability of each class given the input features and predicts the most likely class. The "naive" part is its assumption that all features are independent of each other given the class — a simplification that rarely holds in reality, but works surprisingly well in practice.
Bayes' Theorem: The Foundation
Bayes' Theorem:
P(Class | Features) = P(Features | Class) × P(Class) / P(Features)
In plain words:
"Probability of class given these features"
= "How likely are these features in this class"
× "How common is this class overall"
÷ "How common are these features overall"
We compare this probability for each class.
The class with the highest probability wins.
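This decision rule can be sketched in a few lines of Python (the function names are mine; P(Features) is dropped because it is the same for every class and does not change the argmax):

```python
def posterior_scores(priors, likelihoods):
    """Unnormalized P(class | features) for each class.

    priors:      {class: P(class)}
    likelihoods: {class: [P(feature_i | class), ...]}
    The shared denominator P(Features) is omitted, since we only
    need to compare classes, not get calibrated probabilities.
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for p in likelihoods[cls]:
            score *= p
        scores[cls] = score
    return scores

def predict(priors, likelihoods):
    """Return the class with the highest unnormalized posterior."""
    scores = posterior_scores(priors, likelihoods)
    return max(scores, key=scores.get)
```

With equal priors and feature likelihoods of 0.9/0.8 for class A versus 0.1/0.2 for class B, class A wins by a wide margin.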
Example: Spam Detection
Training Data: 100 emails
70 Not Spam, 30 Spam
Prior Probabilities:
P(Spam) = 30/100 = 0.30
P(Not Spam) = 70/100 = 0.70
Word frequencies (in spam vs not spam):
Word "FREE":
Appears in 25 of 30 spam emails → P(FREE | Spam) = 0.83
Appears in 5 of 70 not spam → P(FREE | Not Spam) = 0.07
Word "CLICK":
Appears in 20 of 30 spam emails → P(CLICK | Spam) = 0.67
Appears in 7 of 70 not spam → P(CLICK | Not Spam) = 0.10
New Email: contains "FREE" and "CLICK"
Naive Bayes Calculation:
P(Spam | FREE, CLICK) ∝ P(FREE|Spam) × P(CLICK|Spam) × P(Spam)
= 0.83 × 0.67 × 0.30 = 0.167
P(Not Spam | FREE, CLICK) ∝ P(FREE|Not Spam) × P(CLICK|Not Spam) × P(Not Spam)
= 0.07 × 0.10 × 0.70 = 0.0049
Compare: 0.167 >> 0.0049
Prediction: SPAM ✓
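The arithmetic above can be reproduced directly (variable names are mine; every probability comes straight from the training counts):

```python
# Priors and word likelihoods taken from the 100-email training set above.
priors = {"Spam": 30 / 100, "Not Spam": 70 / 100}
likelihoods = {
    "Spam":     {"FREE": 25 / 30, "CLICK": 20 / 30},
    "Not Spam": {"FREE": 5 / 70,  "CLICK": 7 / 70},
}

def score(cls, words):
    """Unnormalized posterior: P(class) * product of P(word | class)."""
    s = priors[cls]
    for w in words:
        s *= likelihoods[cls][w]
    return s

email = ["FREE", "CLICK"]
spam_score = score("Spam", email)      # 1/6 ≈ 0.167
ham_score = score("Not Spam", email)   # 0.005 with exact fractions
                                       # (0.0049 with the rounded values above)
prediction = max(priors, key=lambda c: score(c, email))  # "Spam"
```

Note that with exact fractions the Not Spam score is 0.005 rather than 0.0049; the hand calculation used the rounded likelihoods 0.07 and 0.10. The conclusion is unchanged.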
Types of Naive Bayes
┌──────────────────────┬──────────────────────────────────────────┐
│ Type                 │ Best For                                 │
├──────────────────────┼──────────────────────────────────────────┤
│ Gaussian Naive Bayes │ Continuous numerical features            │
│                      │ Assumes features follow a normal         │
│                      │ distribution (bell curve)                │
│ Multinomial NB       │ Text data — word counts or frequencies   │
│                      │ Most common for document classification  │
│ Bernoulli NB         │ Binary features (word present or not)    │
│                      │ Good for short documents or sentiment    │
└──────────────────────┴──────────────────────────────────────────┘
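As a rough sketch of how the three variants are chosen in practice, assuming scikit-learn is installed (the tiny count matrix below is invented purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data: 4 "documents" as counts of 3 vocabulary words;
# label 1 = spam, 0 = not spam.
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 2, 3],
                     [0, 3, 1]])
y = np.array([1, 1, 0, 0])

# Multinomial NB works on the raw word counts.
clf_multi = MultinomialNB().fit(X_counts, y)

# Bernoulli NB only cares whether each word is present.
clf_bern = BernoulliNB().fit((X_counts > 0).astype(int), y)

# Gaussian NB treats each column as a continuous feature.
clf_gauss = GaussianNB().fit(X_counts.astype(float), y)

clf_multi.predict([[3, 0, 0]])  # a count vector heavy in word 0 -> spam
```

The only real difference is how X is prepared: counts for Multinomial, 0/1 presence for Bernoulli, floats for Gaussian.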
Why "Naive"?
The algorithm assumes features are INDEPENDENT of each other. Real world: "FREE" and "PRIZE" often appear together in spam. They are correlated. Naive Bayes ignores this correlation and treats each word as if it appeared independently. Despite this incorrect assumption, Naive Bayes often performs very well, especially on text classification.
Advantages and Limitations
Advantages:
✓ Extremely fast to train and predict
✓ Works very well with small datasets
✓ Handles high-dimensional data well (text)
✓ Relatively robust to irrelevant features
✓ Great baseline for NLP tasks
Limitations:
✗ Independence assumption is almost never true
✗ Probability estimates can be poorly calibrated
✗ If a word never appears in training, it gets zero probability
(solved by Laplace smoothing — add 1 to all counts)
✗ Not ideal for complex numerical relationships
Laplace Smoothing (Handling Zero Probabilities)
Problem:
A new email contains the word "BITCOIN".
"BITCOIN" never appeared in any training spam email, so:
P(BITCOIN | Spam) = 0/30 = 0
Multiplying any product of probabilities by 0 gives 0
→ Model always predicts NOT SPAM for this email.
Solution: add 1 to all counts (Laplace smoothing):
P(BITCOIN | Spam) = (0 + 1) / (30 + total_unique_words) > 0
→ Problem solved. The zero probability disappears.
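The fix above can be written as one small function (the function name and the vocabulary size of 1000 are hypothetical choices for illustration):

```python
def smoothed_prob(word_count, class_count, vocab_size, alpha=1):
    """Laplace (add-alpha) estimate of P(word | class).

    word_count:  emails of this class containing the word
    class_count: total emails of this class
    vocab_size:  number of distinct words in the training vocabulary
    """
    return (word_count + alpha) / (class_count + alpha * vocab_size)

# "BITCOIN" appears in 0 of the 30 training spam emails.
unsmoothed = 0 / 30                               # 0.0 -> kills the whole product
smoothed = smoothed_prob(0, 30, vocab_size=1000)  # small, but strictly > 0
```

Setting alpha between 0 and 1 gives the more general add-alpha (Lidstone) smoothing; alpha = 1 is the classic Laplace variant described above.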
