Machine Learning Overfitting and Underfitting

Overfitting and underfitting are the two classic failure modes of machine learning models. Understanding them is essential for building models that perform reliably on data they have never seen before.

Underfitting

Underfitting happens when a model is too simple for the complexity of the data.
It misses real patterns and performs poorly on both training and test data.

Example: Predicting house price using only ONE feature (number of rooms)
  when the actual price depends on location, size, age, amenities, etc.

Training accuracy: 58%
Test accuracy:     55%
→ Both are low. The model is too simple to capture the underlying pattern.

Visual (fit to wavy data):

  Data:   ●      ●       ●     ●   ●     ●
                                ●       ●
  ●              ●
  
  Underfit line: ─────────────────────── (flat, misses the wave)
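
This picture can be reproduced numerically. A minimal sketch, assuming sine-shaped data and a degree-1 polynomial as the too-simple model (both are illustrative choices, not from a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Wavy ground truth with a little measurement noise
x_train = np.linspace(0, 6, 40)
x_test = np.linspace(0.07, 6.07, 40)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=40)
y_test = np.sin(x_test) + rng.normal(0, 0.1, size=40)

# Degree-1 polynomial: a straight line cannot follow the wave
line = np.poly1d(np.polyfit(x_train, y_train, deg=1))

def mse(y, y_hat):
    """Mean squared error."""
    return float(np.mean((y - y_hat) ** 2))

train_mse = mse(y_train, line(x_train))
test_mse = mse(y_test, line(x_test))
print(f"train MSE: {train_mse:.3f}")
print(f"test  MSE: {test_mse:.3f}")
# Both errors are far above the noise floor (0.1**2 = 0.01):
# the line misses the wave on training and test data alike.
```

The telltale sign of underfitting is exactly this: the errors are bad everywhere, not just on unseen data.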

Overfitting

Overfitting happens when a model is too complex and memorizes the training data,
including its noise and random quirks. It performs very well on training data
but fails on new data.

Example: A Decision Tree with no depth limit trained on 50 records.
  It can keep splitting until every training record gets its own leaf node.
  Perfect on training. Useless on test data.

Training accuracy: 100%
Test accuracy:     63%
→ Large gap. Model memorized instead of learning.

Visual:

  Actual data points: ●  (each is a real measurement)
  Overfit curve: ∿∿∿∿∿  (follows every tiny wiggle including noise)
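
The no-depth-limit Decision Tree example can be reproduced with scikit-learn. A minimal sketch, assuming a synthetic two-feature dataset with 20% label noise standing in for the 50 records (all illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def make_data(n):
    """Two features; the true label depends only on x0, with 20% label noise."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] > 0).astype(int)
    flip = rng.random(n) < 0.2      # random label noise the model should ignore
    y[flip] = 1 - y[flip]
    return X, y

X_train, y_train = make_data(50)    # only 50 records, as in the example
X_test, y_test = make_data(500)

# No max_depth: the tree keeps splitting until every leaf is pure,
# memorizing the flipped labels along with the real pattern
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")   # 1.0: every record memorized
print(f"test  accuracy: {test_acc:.2f}")    # much lower on unseen data
```

The exact test number varies with the random seed, but the shape of the result is the point: a perfect training score paired with a large drop on held-out data.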

The Ideal Fit

A well-fitting model captures the real underlying pattern
without chasing noise.

Training accuracy: 88%
Test accuracy:     85%
→ Small gap. Model generalizes well.

Visual:

  Data:   ●    ●        ●     ●      ●
             ●        ●          ●
  Good fit: ────────smooth curve────────

The curve follows the overall trend, not every bump.
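
Numbers with this shape can be produced by capping model complexity. A hedged sketch (the synthetic two-feature dataset and its 20% label noise are illustrative assumptions): a depth-limited decision tree keeps the broad split, ignores the noisy records, and lands with training and test accuracy close together.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def make_data(n):
    """Two features; the true label depends only on x0, with 20% label noise."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] > 0).astype(int)
    flip = rng.random(n) < 0.2
    y[flip] = 1 - y[flip]
    return X, y

X_train, y_train = make_data(50)
X_test, y_test = make_data(500)

# max_depth=2 keeps the broad split; noisy records no longer earn their own leaves
capped = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

train_acc = capped.score(X_train, y_train)
test_acc = capped.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test  accuracy: {test_acc:.2f}")
# The two numbers land close together: the capped model generalizes.
```

Note that the training score drops compared to the unconstrained tree, and that is fine: the goal is the test score, not the training score.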

Causes of Overfitting

┌─────────────────────────────┬─────────────────────────────────────┐
│ Cause                       │ Example                             │
├─────────────────────────────┼─────────────────────────────────────┤
│ Too many features           │ 500 features, only 200 records      │
│ Model too complex           │ Decision Tree with no depth limit   │
│ Too little training data    │ 50 records for a complex problem    │
│ Training too long           │ Neural network trained for 1000     │
│                             │ epochs on a small dataset           │
│ No regularization applied   │ Coefficients grow unrestricted      │
└─────────────────────────────┴─────────────────────────────────────┘

How to Fix Overfitting

Fix 1: Get more training data
  More diverse examples → less memorization

Fix 2: Reduce model complexity
  Decision Tree: set max_depth
  Neural Network: use fewer layers/neurons

Fix 3: Apply regularization (L1 / L2 — next topic)

Fix 4: Feature selection
  Remove irrelevant features that only add noise

Fix 5: Cross-validation
  Gives a reliable estimate of true performance and prevents overconfidence

Fix 6: Dropout (Neural Networks)
  Randomly disable neurons during training so they cannot co-adapt
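
Fixes 2 and 5 can be seen working together in a short sketch (the two-feature synthetic dataset and its 20% label noise are assumptions for illustration): cross-validation exposes the unconstrained tree's inflated training score, and a max_depth cap closes the gap.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)

# Illustrative synthetic data: the label depends only on x0, plus 20% label noise
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(200) < 0.2
y[flip] = 1 - y[flip]

results = {}
for name, model in [
    ("no depth limit", DecisionTreeClassifier(random_state=0)),
    ("max_depth=2", DecisionTreeClassifier(max_depth=2, random_state=0)),
]:
    train_score = model.fit(X, y).score(X, y)             # score on its own training data
    cv_score = cross_val_score(model, X, y, cv=5).mean()  # score on held-out folds
    results[name] = (train_score, cv_score)
    print(f"{name:15s} train={train_score:.2f}  cv={cv_score:.2f}")

# The unconstrained tree hits 1.0 on its own training data but drops sharply
# on held-out folds; the capped tree's two scores stay close together.
```

Judging a model only by its training score would pick the wrong one here; cross-validation makes the right choice obvious.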
