Deep Learning: Overfitting and Regularization

A model that performs perfectly on training data but fails on new data has learned the wrong thing. It memorized the training examples instead of learning general patterns. This problem is called overfitting, and regularization is the set of techniques used to reduce it.

Understanding Overfitting

The Exam Preparation Analogy

Student A: Studies concepts deeply → Understands the subject
Student B: Memorizes exact past exam questions → Only knows those questions

On a new exam:
Student A → Passes with 85%
Student B → Fails because the questions are new

Student B = an overfitted model

Overfitting on a Graph

Loss
 ^
 |  Training loss    ────────────── (keeps going down)
 |                 ╲
 |                  ╲ (gap grows)
 |  Validation loss  ╲____/‾‾‾‾‾‾‾ (starts going UP)
 |                        ↑
 |                   Overfitting begins here
 └──────────────────────────────────→ Epochs

When the training loss keeps dropping but the validation loss starts rising, the model has begun memorizing the training data.

Three Signs of Overfitting

Training accuracy is very high (for example, 95%+) but validation accuracy is much lower (for example, 70%)
The gap between training loss and validation loss keeps widening
The model performs poorly on completely new data
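
The second sign can be checked directly from the numbers a training run already logs. The sketch below is a minimal illustration, not a standard diagnostic: the loss values are made up, and the function names and the 3-epoch window are assumptions chosen for the example.

```python
def generalization_gaps(train_losses, val_losses):
    """Validation loss minus training loss, one value per epoch."""
    return [val - train for train, val in zip(train_losses, val_losses)]


def gap_is_widening(gaps, window=3):
    """True if the gap grew at every step across the last `window` epochs."""
    recent = gaps[-window:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical losses, one entry per epoch (illustrative values only).
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.92, 0.65, 0.48, 0.45, 0.52, 0.61]

gaps = generalization_gaps(train_losses, val_losses)
print("Gap per epoch:", [round(g, 2) for g in gaps])
print("Gap widening:", gap_is_widening(gaps))
```

A widening gap on its own is a warning sign rather than proof. It is most convincing when validation loss is also rising.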

Underfitting: The Opposite Problem

Underfitting happens when the model is too simple to learn the patterns in the data. Both training and validation accuracy stay low.

Underfit: Model draws a straight line through curved data → misses everything
Overfit:  Model traces every single data point → memorizes noise
Just right: Model finds the general pattern → generalizes well

DATA POINTS:  *   *   *
                *     *
                  *

Underfit:  ─────────────── (flat line, ignores curve)
Overfit:   /\/\/\/\/\/\/\  (wiggles through every point)
Good fit:  ────────────    (smooth curve through the middle)
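
The three fits above can be reproduced with polynomials of different degree. This is a rough sketch under assumed settings: the sine-shaped data, the noise level, and the degrees 1, 4, and 10 are all choices made for illustration, with NumPy's polyfit standing in for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples from a curved function (an assumed toy dataset)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(15)   # small training set
x_val, y_val = make_data(200)      # held-out data the model never sees

results = {}
for degree in (1, 4, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.4f}, "
          f"val MSE {results[degree][1]:.4f}")
```

Training error always falls as the degree grows, because a higher-degree polynomial can bend toward every point. Validation error is the number to watch: it is typically worst for the straight line and for the most flexible fit.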

Regularization Techniques

1. Dropout

Dropout randomly switches off a percentage of neurons during each training step. The network cannot rely on any single neuron to do all the work — it must learn robust, spread-out patterns.

Normal training step:
  [N1] [N2] [N3] [N4] [N5] ← all neurons active

Dropout (50% rate):
  [N1] [  ] [N3] [  ] [N5] ← N2 and N4 randomly switched off

Next training step:
  [  ] [N2] [N3] [N4] [  ] ← different neurons switched off

The network learns to work without any individual neuron.
Result: more robust, general representations.

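Dropout only applies during training. At inference (prediction) time, all neurons are active. To keep the expected output the same in both modes, a scaling step compensates for the missing neurons: the original formulation scales outputs down at inference, while most modern frameworks scale the surviving activations up during training instead (inverted dropout).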
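
A minimal NumPy sketch of the idea follows. It uses the inverted dropout variant, where surviving activations are scaled up during training so that inference needs no adjustment. The function name and signature are assumptions for illustration, not a framework API.

```python
import numpy as np


def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of activations
    during training and scale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return activations  # inference: every neuron stays active
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = neuron kept
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((4, 5))  # 4 examples, 5 neurons

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
print(infer_out)  # unchanged: all ones
```

Each call during training draws a fresh mask, which is why a different set of neurons is switched off at every step.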

2. L2 Regularization (Weight Decay)

L2 regularization adds a penalty to the loss function for large weight values. Very large weights let the model react sharply to small quirks of individual training examples, so they often accompany memorization. Penalizing them pushes the model to keep weights small, so that no single connection dominates the prediction.

Without L2:
  Loss = prediction error only
  Weights can grow very large → model overfits

With L2:
  Loss = prediction error + penalty for large weights
  The model balances being accurate AND keeping weights small
  → Simpler model → better generalization

Weight penalty term: λ × sum of all (weight²)
  λ (lambda) controls how strong the penalty is
  Larger λ = stronger regularization = simpler model
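
The penalty term is simple enough to compute by hand. The sketch below applies the formula to two small, made-up weight arrays. The function names, the weight values, and the prediction error of 0.30 are assumptions chosen for the example.

```python
import numpy as np


def l2_penalty(weights, lam):
    """lam * sum of squared weights, summed over every weight array."""
    return lam * sum(float(np.sum(w ** 2)) for w in weights)


def regularized_loss(prediction_error, weights, lam):
    """Total loss = prediction error + L2 penalty."""
    return prediction_error + l2_penalty(weights, lam)


# Hypothetical weights for a tiny two-layer network.
weights = [
    np.array([[1.0, -2.0], [0.5, 0.0]]),  # squares sum to 5.25
    np.array([3.0, -1.0]),                # squares sum to 10.0
]

loss = regularized_loss(prediction_error=0.30, weights=weights, lam=0.01)
print(round(loss, 4))  # 0.30 + 0.01 * 15.25 = 0.4525
```

Because the penalty grows with the square of each weight, one weight of 3.0 costs as much as nine weights of 1.0. That is what pushes the model toward many small weights rather than a few large ones.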

3. Early Stopping

Training continues while validation performance keeps improving and halts automatically once it stops, usually after a few epochs of tolerance (often called patience) in case the dip is temporary. This prevents the model from spending too many epochs memorizing the training data.

Epoch  Train Loss  Val Loss
  10     0.45       0.46
  20     0.30       0.31
  30     0.20       0.22
  40     0.12       0.25  ← Val loss starts increasing
  50     0.08       0.31
  60     0.05       0.38

STOP: validation loss was lowest at Epoch 30, so keep the model saved at that point
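
The stopping rule can be sketched in a few lines using the validation losses from the table above. The function name and the patience value of 2 checks are assumptions for illustration. Real training code would also save a model checkpoint each time the best loss improves.

```python
def early_stopping_epoch(history, patience=2):
    """Scan (epoch, val_loss) pairs and stop once validation loss has
    failed to improve for `patience` consecutive checks.

    Returns (best_epoch, stopped_at)."""
    best_epoch, best_loss = None, float("inf")
    stopped_at = None
    waited = 0
    for epoch, val_loss in history:
        stopped_at = epoch
        if val_loss < best_loss:
            best_epoch, best_loss = epoch, val_loss  # new best: checkpoint here
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` checks in a row
    return best_epoch, stopped_at


# Validation losses from the table above.
history = [(10, 0.46), (20, 0.31), (30, 0.22), (40, 0.25), (50, 0.31), (60, 0.38)]

best_epoch, stopped_at = early_stopping_epoch(history, patience=2)
print(f"Stopped at epoch {stopped_at}, restoring model from epoch {best_epoch}")
```

Training only learns that epoch 30 was the best after seeing later, worse epochs. That is why the model from epoch 30 has to be saved at the time and restored afterward.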

4. Data Augmentation

More training data almost always reduces overfitting. When you cannot collect more real data, augmentation creates new variations of existing examples.

1 real photo of a cat
→ Flipped horizontally    = new training example
→ Rotated 10°             = new training example
→ Brightness adjusted     = new training example
→ Cropped slightly        = new training example

1 training example becomes 5 without collecting a single new photo.
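
Here is a rough NumPy-only sketch of turning one photo into five training examples. A random array stands in for the cat photo. A 90° rotation is swapped in for the 10° rotation, because arbitrary angles need an image library with interpolation. The function name, the brightness range, and the crop margin are assumptions.

```python
import numpy as np


def augment(image, rng):
    """Return four variations of an H x W x C float image with values in [0, 1]."""
    height, width, _ = image.shape
    flipped = image[:, ::-1, :]                       # horizontal flip
    rotated = np.rot90(image, k=1, axes=(0, 1))       # 90° stand-in for a small rotation
    factor = rng.uniform(0.8, 1.2)                    # random brightness factor
    brightened = np.clip(image * factor, 0.0, 1.0)
    margin = 2                                        # trim 2 pixels from each edge
    cropped = image[margin:height - margin, margin:width - margin, :]
    return {
        "flipped": flipped,
        "rotated": rotated,
        "brightness": brightened,
        "cropped": cropped,
    }


rng = np.random.default_rng(0)
photo = rng.random((32, 32, 3))  # placeholder for one real 32x32 RGB photo

variants = augment(photo, rng)
dataset = [photo] + list(variants.values())
print(len(dataset), "training examples from 1 photo")
```

In a real pipeline the cropped image would be resized back to the original dimensions, and the transformations would be applied randomly on each epoch rather than once up front.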

5. Batch Normalization

Batch normalization standardizes the outputs of a layer across each mini-batch during training. This keeps the values flowing through the network in a consistent range, which speeds up training and also acts as a mild regularizer.

Without Batch Norm:
  Layer outputs vary wildly → training is unstable

With Batch Norm:
  Each layer's outputs are rescaled to have mean=0, std=1 within the batch,
  then adjusted by a learned scale and shift
  → Stable training → faster convergence → often less overfitting
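
The training-time computation is short enough to write out. This is a minimal sketch for a fully-connected layer, with an assumed function name. gamma and beta are the learnable scale and shift, fixed here at 1 and 0. At inference, frameworks use running averages of the mean and variance instead of batch statistics, which is not shown.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift.

    x has shape (batch_size, num_features)."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # mean 0, std about 1
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
# Hypothetical layer outputs on a very different scale: mean 50, std 10.
layer_outputs = rng.normal(loc=50.0, scale=10.0, size=(64, 3))

normalized = batch_norm_train(layer_outputs, gamma=np.ones(3), beta=np.zeros(3))
print(normalized.mean(axis=0).round(3))
print(normalized.std(axis=0).round(3))
```

The small constant eps guards against division by zero when a feature has almost no variance within a batch.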

Choosing the Right Technique

Technique	Best Used When	How Hard to Apply
Dropout	Large fully-connected networks	Easy — one line of code
L2 Regularization	Any network, especially dense layers	Easy — one parameter to tune
Early Stopping	Any training run	Easy — set a callback
Data Augmentation	Small image datasets	Moderate — requires preprocessing
Batch Normalization	Deep networks, convolutional networks	Easy — one layer to insert

The Bias-Variance Trade-off

Every model balances two competing errors:

Bias — error from being too simple (underfitting)
Variance — error from being too complex (overfitting)

High Bias, Low Variance  = Underfit (model too simple)
Low Bias, High Variance  = Overfit  (model too complex)
Low Bias, Low Variance   = Just right (goal)

Regularization techniques pull the model away from high variance, toward the balanced middle ground.
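
Bias and variance can be estimated empirically by retraining the same kind of model on many noisy copies of a dataset. The sketch below does this for a straight line and a degree-9 polynomial. The sine-shaped target, the noise level, the trial count, and the function names are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 20)   # fixed training inputs
x_test = np.linspace(-0.9, 0.9, 50)    # points where predictions are compared


def true_fn(x):
    """The pattern the model should learn (assumed for this toy example)."""
    return np.sin(3.0 * x)


def bias_variance(degree, trials=200, noise=0.3):
    """Fit a polynomial to `trials` noisy datasets; return (bias^2, variance)."""
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        y_train = true_fn(x_train) + rng.normal(0.0, noise, size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_test)) ** 2))  # systematic miss
    variance = float(np.mean(preds.var(axis=0)))                   # spread across reruns
    return bias_sq, variance


simple_bias, simple_var = bias_variance(degree=1)
complex_bias, complex_var = bias_variance(degree=9)
print(f"degree 1: bias^2 {simple_bias:.4f}, variance {simple_var:.4f}")
print(f"degree 9: bias^2 {complex_bias:.4f}, variance {complex_var:.4f}")
```

The straight line misses the curve in the same way on every run (high bias, low variance), while the degree-9 fit tracks the curve on average but changes noticeably from one noisy dataset to the next (low bias, higher variance).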

Key Terms

Overfitting — model memorizes training data, fails on new data
Underfitting — model is too simple to learn any useful pattern
Dropout — randomly disables neurons during training
L2 Regularization — penalizes large weight values
Early Stopping — halts training when validation performance stops improving
Bias — error from a model that is too simple
Variance — error from a model that is too sensitive to training data
