Deep Learning Loss Functions and Optimization

Every time a neural network makes a prediction, it needs feedback on how wrong it was. Loss functions provide that feedback in the form of a number. Optimizers use that number to improve the model's weights. These two components work together to make a model learn.

What Is a Loss Function?

A loss function measures the distance between the model's prediction and the correct answer. A low loss means the predictions are close to the correct answers. A high loss means they are far off and the model needs to improve.

The Archery Analogy

Target center = correct answer
Arrow landing = model's prediction

Arrow hits center        → Loss = 0   (perfect)
Arrow 2 cm from center   → Loss = 4   (small error)
Arrow 10 cm from center  → Loss = 100 (large error)

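The loss function turns the miss into a single number. In this example it is the squared distance from the center, which is why a 2 cm miss scores 4 and a 10 cm miss scores 100. The optimizer then adjusts the bow (the weights) so the next arrow lands closer.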

Common Loss Functions

1. Mean Squared Error (MSE)

Use MSE when your model predicts a continuous number — such as a house price, temperature, or stock value.

Formula: MSE = average of (prediction − actual)²

Example:
  Actual price:     $300,000
  Model predicted:  $350,000
  Error:            $50,000
  Squared error:    2,500,000,000

  Squaring the error punishes large mistakes more than small ones.
  A model that is off by $100,000 is penalized far more than one off by $1,000.
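The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name mean_squared_error and the two comparison models are illustrative assumptions, not part of any library.

```python
def mean_squared_error(predictions, actuals):
    """Average of the squared differences between predictions and actual values."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared_errors) / len(squared_errors)


# The single house from the example: predicted $350,000, actual $300,000.
single = mean_squared_error([350_000], [300_000])

# Two hypothetical models: one off by $1,000, one off by $100,000.
small_miss = mean_squared_error([301_000], [300_000])
large_miss = mean_squared_error([400_000], [300_000])

print(single)                   # 2500000000.0
print(large_miss / small_miss)  # 10000.0
```

An error that is 100 times larger produces a loss that is 10,000 times larger, which is the effect of squaring.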

2. Binary Cross-Entropy

Use this for problems with two classes: yes/no, spam/not spam, sick/healthy.

Correct answer: 1 (spam)
Model output:   0.9 → Loss is low  (model was right)
Model output:   0.1 → Loss is high (model was very wrong)
Model output:   0.5 → Loss is moderate (model was uncertain)

The loss increases dramatically when the model is confident but wrong.
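To put numbers on "low", "moderate", and "high", here is a small sketch of binary cross-entropy for a single example. The function name and the eps clamp value are assumptions made for illustration; real libraries handle numerical safety in their own ways.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Loss for one example. y_true is 0 or 1; p_pred is the predicted probability of class 1."""
    # Clamp the probability so log(0) can never occur (eps is an illustrative choice).
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Correct answer is 1 (spam) in every case.
for p in (0.9, 0.5, 0.1, 0.01):
    print(p, round(binary_cross_entropy(1, p), 3))
# Losses: 0.105, 0.693, 2.303, 4.605
```

Moving from 0.1 to 0.01 barely changes the prediction, yet the loss doubles. That steep growth is the penalty for being confident and wrong.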

3. Categorical Cross-Entropy

Use this for problems with three or more classes, combined with a Softmax output layer.

Correct class: "cat" → represented as [1, 0, 0]
Model output (Softmax):
  cat = 0.70, dog = 0.20, bird = 0.10 → Loss is low
  cat = 0.10, dog = 0.60, bird = 0.30 → Loss is high
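The same two predictions can be scored in code. The sketch below assumes a one-hot target and already-normalized probabilities; the softmax helper and the raw scores passed to it are illustrative values, not taken from a real model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log of the probability assigned to the correct class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


target = [1, 0, 0]  # "cat"
good = categorical_cross_entropy(target, [0.70, 0.20, 0.10])
bad = categorical_cross_entropy(target, [0.10, 0.60, 0.30])
print(round(good, 3))  # 0.357
print(round(bad, 3))   # 2.303

# Softmax applied to hypothetical raw scores for cat, dog, bird.
probs = softmax([2.0, 1.0, 0.1])
```

Only the probability given to the correct class affects the loss, because every other entry in the one-hot target is zero.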

Loss Function Selection Guide

Task                             Output Type         Loss Function
Predict a price or temperature   Continuous number   Mean Squared Error
Classify spam or not spam        Two classes         Binary Cross-Entropy
Classify cat, dog, or bird       3+ classes          Categorical Cross-Entropy

What Is an Optimizer?

The optimizer uses the loss, and in particular how the loss changes as each weight changes, to decide how to adjust the model's weights. The best-known technique is called Gradient Descent.

The Hill-Walking Analogy

Imagine you are blindfolded on a hilly landscape.
Your goal: reach the lowest valley (minimum loss).

Your strategy:
  1. Feel the slope under your feet.
  2. Take a small step downhill.
  3. Stop. Feel the slope again.
  4. Take another small step downhill.
  5. Repeat until the ground feels flat.

Gradient Descent does exactly this — but with numbers and calculus.
The "slope" is called the gradient.
The "step size" is called the learning rate.
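The hill-walking strategy translates almost line for line into code. The sketch below uses a made-up one-weight loss, (w - 3)², whose lowest point is at w = 3; the function names and the starting point are assumptions chosen for illustration.

```python
def loss(w):
    """A toy valley with its lowest point at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Slope of the toy loss at w (the derivative of (w - 3)^2)."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate, steps):
    w = w_start
    for _ in range(steps):
        slope = gradient(w)          # feel the slope
        w -= learning_rate * slope   # take a small step downhill
    return w


w_final = gradient_descent(w_start=0.0, learning_rate=0.1, steps=50)
print(round(w_final, 3))  # 3.0
```

Subtracting the gradient is what makes the step go downhill: a positive slope pushes w down, a negative slope pushes it up.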

Gradient Descent Diagram

Loss
 ^
 |  *
 |    \
 |     \
 |      \      *
 |       \    / \
 |        \  /   \
 |         \/     \
 |       minimum   *
 └─────────────────────→ Weights
         ↑
   Goal: reach here

The Learning Rate

The learning rate controls how large each weight update step is.

Learning rate too HIGH:
→ Takes huge steps
→ Jumps over the valley and never settles
→ Loss bounces around, or even grows, instead of steadily decreasing

Learning rate too LOW:
→ Takes tiny steps
→ Takes forever to reach the valley
→ Training is extremely slow

Learning rate just right:
→ Steady, reliable descent to the minimum
→ Fast enough to finish training, careful enough to converge
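The three cases can be reproduced on a toy problem. The sketch below runs 20 gradient descent steps on the made-up loss (w - 3)² and reports how far the weight ends up from the minimum. The specific learning rates are illustrative; what counts as "too high" depends entirely on the problem.

```python
def final_distance(learning_rate, steps=20, w_start=0.0, target=3.0):
    """Run gradient descent on (w - target)^2 and return the remaining distance to the minimum."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - target)
    return abs(w - target)


too_high = final_distance(1.1)    # overshoots further on every step
too_low = final_distance(0.001)   # has barely moved after 20 steps
good = final_distance(0.3)        # lands almost exactly on the minimum

print(too_high > 3.0)   # True: further away than where it started
print(too_low > 2.5)    # True: still close to the starting distance of 3
print(good < 1e-6)      # True
```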

Popular Optimizers

SGD (Stochastic Gradient Descent)

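The classic optimizer. In its strict form it updates the weights using one training example at a time; in practice it is usually applied to small batches of examples (mini-batches). It is simple, but its updates are noisy and it can be slow to converge.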
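A minimal sketch of the one-example-at-a-time update, assuming a one-weight model y = w * x and a tiny noise-free dataset invented for this example. The function name sgd_fit and its default settings are illustrative.

```python
import random


def sgd_fit(xs, ys, learning_rate=0.01, epochs=50, seed=0):
    """Fit y = w * x by updating w after every single training example."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)               # visit the examples in a random order
        for i in indices:
            error = w * xs[i] - ys[i]      # prediction minus correct answer
            grad = 2.0 * error * xs[i]     # gradient of the squared error for this one example
            w -= learning_rate * grad
    return w


# Hypothetical data generated by the rule y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_sgd = sgd_fit(xs, ys)
print(round(w_sgd, 4))  # 2.0
```

With real, noisy data each single-example gradient points in a slightly different direction, which is where the noisy behavior of SGD comes from.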

Adam (Adaptive Moment Estimation)

Adam is one of the most widely used optimizers in modern Deep Learning. It adapts the step size for each weight individually, using running averages of that weight's recent gradients and squared gradients. It often converges faster than plain SGD and usually needs less learning-rate tuning.
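The sketch below shows the Adam update rule for a single weight, applied to the toy loss (w - 3)². The beta1, beta2, and eps defaults are the commonly cited ones; the learning rate of 0.1 and the step count are assumptions picked so the toy problem converges quickly, not recommended settings.

```python
import math


def adam_minimize(grad_fn, w_start, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-weight loss with the Adam update rule."""
    w = w_start
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Both averages start at zero, so correct their early bias toward zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Dividing by sqrt(v_hat) scales the step to this weight's own gradient history.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w_start=0.0)
print(abs(w_adam - 3.0) < 0.01)  # True
```

In a real network every weight keeps its own m and v, which is what gives each weight its own effective step size.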

RMSprop

RMSprop is commonly used in recurrent networks and reinforcement learning. It divides each update by a running average of recent squared gradients, which keeps step sizes stable on tasks where gradient magnitudes change dramatically.
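A minimal sketch of the RMSprop update for one weight. The decay of 0.9 is a commonly used value; the learning rate, step count, and the two toy losses are assumptions for illustration.

```python
import math


def rmsprop_minimize(grad_fn, w_start, learning_rate=0.01, decay=0.9,
                     eps=1e-8, steps=1000):
    """Minimize a one-weight loss with the RMSprop update rule."""
    w = w_start
    avg_sq = 0.0  # running average of recent squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        # Normalizing by the recent gradient size keeps steps similar in scale.
        w -= learning_rate * g / (math.sqrt(avg_sq) + eps)
    return w


# Same valley, but the second loss is 100 times steeper.
w_gentle = rmsprop_minimize(lambda w: 2.0 * (w - 3.0), w_start=0.0)
w_steep = rmsprop_minimize(lambda w: 200.0 * (w - 3.0), w_start=0.0)
print(abs(w_gentle - 3.0) < 0.05, abs(w_steep - 3.0) < 0.05)  # True True
```

Plain gradient descent with the same learning rate would diverge on the steeper loss. Here both runs behave almost identically because the gradient's scale is divided out.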

How One Training Step Works

Step 1: Forward Pass
  Input data → travels through all layers → produces a prediction

Step 2: Calculate Loss
  Compare prediction to correct answer → compute loss number

Step 3: Backward Pass (Backpropagation)
  Calculate how much each weight contributed to the loss

Step 4: Update Weights
  Optimizer adjusts each weight slightly to reduce the loss

Step 5: Repeat for next batch of examples
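The five steps can be sketched as a complete training loop. This example assumes the simplest possible model, one weight and one bias (prediction = w * x + b), with gradients written out by hand; in a real network, backpropagation computes them automatically. The dataset and hyperparameters are made up for illustration.

```python
def train(xs, ys, learning_rate=0.1, epochs=500, batch_size=4):
    """Train prediction = w * x + b with mini-batch gradient descent on MSE."""
    w, b = 0.0, 0.0
    history = []  # average loss per epoch
    for _ in range(epochs):
        epoch_loss = 0.0
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            n = len(batch_x)
            # Step 1: forward pass
            preds = [w * x + b for x in batch_x]
            # Step 2: calculate loss (mean squared error)
            errors = [p - y for p, y in zip(preds, batch_y)]
            batch_loss = sum(e * e for e in errors) / n
            # Step 3: backward pass (gradient of the loss for each parameter)
            grad_w = sum(2.0 * e * x for e, x in zip(errors, batch_x)) / n
            grad_b = sum(2.0 * e for e in errors) / n
            # Step 4: update weights
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            epoch_loss += batch_loss * n
            # Step 5: the loop moves on to the next batch
        history.append(epoch_loss / len(xs))
    return w, b, history


# Hypothetical data generated by the rule y = 2x + 1.
xs = [i / 4 for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]
w, b, history = train(xs, ys)
print(round(w, 2), round(b, 2))  # 2.0 1.0
```

The history list holds one loss value per epoch, which is the data you would plot to track training.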

Tracking Loss During Training

A model that is training well shows a loss that decreases over time. Plotting the loss after each epoch reveals whether training is on track.

Loss
1.0 |*
0.8 | **
0.6 |   ***
0.4 |      ****
0.2 |          ******
0.1 |                ──────── (loss flattens = model converged)
    └────────────────────────→ Epochs

If the loss flattens early and stays high, the model has stopped improving — a sign of a poorly tuned learning rate or insufficient network capacity.
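The same check can be done without a plot. The sketch below uses two hypothetical helpers, has_plateaued and diagnose. The window size, improvement threshold, and "acceptable" loss are arbitrary illustrative values that would need tuning for a real task.

```python
def has_plateaued(losses, window=3, min_improvement=0.01):
    """True if the loss improved by less than min_improvement over the last `window` epochs."""
    if len(losses) <= window:
        return False  # not enough history to judge
    return losses[-window - 1] - losses[-1] < min_improvement


def diagnose(losses, acceptable_loss=0.2):
    """Label a loss curve using the plateau check and an assumed target loss."""
    if not has_plateaued(losses):
        return "still improving"
    if losses[-1] <= acceptable_loss:
        return "converged"
    return "stuck at a high loss"


print(diagnose([1.0, 0.8, 0.6, 0.4]))                           # still improving
print(diagnose([1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.1, 0.1, 0.1]))  # converged
print(diagnose([1.0, 0.9, 0.9, 0.9, 0.9]))                      # stuck at a high loss
```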

Key Terms

Loss Function — measures how wrong the model's predictions are
MSE — loss function for predicting numbers
Cross-Entropy — loss function for classification tasks
Optimizer — adjusts weights to reduce loss
Gradient Descent — the core strategy for weight adjustment
Learning Rate — the size of each weight-update step
Adam — the most popular adaptive optimizer
