Deep Learning Backpropagation

Backpropagation is the algorithm that makes neural networks learn. After the network makes a prediction and the loss is calculated, backpropagation figures out exactly which weights were responsible for the error — and by how much. The optimizer then uses this information to improve every weight in the network.

The Problem Backpropagation Solves

A network can have millions of weights. When the prediction is wrong, which weights caused the error? You cannot adjust them all equally — some weights contributed a lot to the mistake, others barely at all.

Backpropagation uses calculus to assign a blame score to every weight in the network. Weights that caused more of the error receive a larger adjustment.

The Report Card Analogy

A student fails an exam.
Who is responsible?

← The student didn't study (big blame)
← The book had one confusing chapter (small blame)
← The classroom was noisy (tiny blame)

Backpropagation gives each "cause" its correct share of blame.
Then the teacher adjusts each thing proportionally to fix the problem.

The Two Passes of Training

Forward Pass

Input → Layer 1 → Layer 2 → Layer 3 → Prediction
                                           ↓
                                     Compare to Answer
                                           ↓
                                      Calculate Loss

Data moves forward through the network from input to output. The network produces a prediction. The loss function computes how wrong the prediction is.
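As a minimal sketch of the forward pass, the snippet below pushes one number through three single-neuron layers and then measures the loss. The weights, the tanh activation, the squared-error loss, and the names forward and squared_error are illustrative assumptions, not something fixed by the description above.

```python
import math

def forward(x, weights):
    """Pass one number through a chain of single-neuron layers."""
    activation = x
    for w in weights:
        # Each layer multiplies by its weight, then applies tanh
        # (an assumed activation function).
        activation = math.tanh(w * activation)
    return activation

def squared_error(prediction, target):
    """Assumed loss function: squared distance from the correct answer."""
    return (prediction - target) ** 2

weights = [0.5, -1.2, 0.8]  # hypothetical weights for Layer 1, 2, 3
prediction = forward(1.0, weights)
loss = squared_error(prediction, target=1.0)
print(f"prediction={prediction:.4f}  loss={loss:.4f}")
```

Nothing is learned yet at this point. The forward pass only produces the prediction and the loss that the backward pass will start from.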

Backward Pass (Backpropagation)

Loss → Layer 3 → Layer 2 → Layer 1
  (gradients flow backwards, layer by layer)

The error signal travels backwards through the network. At each layer, calculus computes how sensitive the loss is to each weight, that is, how much the loss would change if that weight changed slightly. This sensitivity is called the gradient.

What Is a Gradient?

A gradient is the slope of the loss with respect to one weight. Its sign and size tell you two things about that weight:

  • Direction — should the weight increase or decrease?
  • Magnitude — by how much?

Gradient on a Slope

Loss
 ^
 |  ╲  ← steep slope = large gradient = large weight adjustment needed
 |   ╲
 |    ╲_____
 |         ╲__ ← gentle slope = small gradient = tiny adjustment needed
 |             ─── (minimum)
 └──────────────────────────→ Weight value

At the steep part of the curve, the gradient is large, so the weight gets a big adjustment. Near the minimum, the curve flattens and the gradient is small, so the weight is nearly correct and needs only a small nudge.
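One way to see this is to measure the slope numerically at two points on a loss curve. The sketch below assumes a simple bowl-shaped loss with its minimum at a weight of 3. The loss function and the helper numerical_gradient are made up for illustration.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # Central difference: approximate slope of f at the point w.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

far = numerical_gradient(loss, 0.0)   # far from the minimum (steep)
near = numerical_gradient(loss, 2.9)  # close to the minimum (gentle)
print(f"gradient at w=0.0: {far:.3f}")
print(f"gradient at w=2.9: {near:.3f}")
```

Both gradients are negative here, which means the loss falls as the weight rises, so the update moves the weight up. The one measured far from the minimum is much larger in size than the one measured close to it.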

The Chain Rule: How Gradients Flow Backward

Networks have many layers. To find how much a weight in Layer 1 contributed to the final loss, you must trace its effect all the way through Layer 2, Layer 3, and out to the prediction. The chain rule from calculus makes this tracing possible.

Layered Responsibility Diagram

Final Loss
    ↑
Layer 3 weight error  →  how much did Layer 3 cause this loss?
    ↑
Layer 2 weight error  →  how much did Layer 2 feed into Layer 3's error?
    ↑
Layer 1 weight error  →  how much did Layer 1 feed into Layer 2's error?

Each layer "passes the blame" to the layer before it.
The chain rule multiplies the partial effects together.
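The blame-passing above can be sketched by hand for a tiny three-layer network. To keep the arithmetic visible, this sketch assumes each layer is a single weight with no activation function and that the loss is the squared error. The names backprop, g1, g2, and g3 are illustrative.

```python
def backprop(w1, w2, w3, x, target):
    # Forward pass, keeping each layer's output for later.
    a1 = w1 * x    # Layer 1 output
    a2 = w2 * a1   # Layer 2 output
    a3 = w3 * a2   # Layer 3 output (the prediction)

    # Backward pass: start at the loss and multiply partial effects.
    d_a3 = 2.0 * (a3 - target)  # how the loss changes with the prediction
    g3 = d_a3 * a2              # gradient for the Layer 3 weight
    d_a2 = d_a3 * w3            # blame passed back to Layer 2's output
    g2 = d_a2 * a1              # gradient for the Layer 2 weight
    d_a1 = d_a2 * w2            # blame passed back to Layer 1's output
    g1 = d_a1 * x               # gradient for the Layer 1 weight
    return g1, g2, g3

g1, g2, g3 = backprop(w1=0.5, w2=-1.0, w3=2.0, x=1.5, target=2.0)
print(g1, g2, g3)
```

Notice that the gradient for the Layer 1 weight is built from every factor that sits between it and the loss. That product of partial effects is the chain rule at work.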

The Full Learning Cycle

┌──────────────────────────────────────────────┐
│                                              │
│  1. Forward Pass                             │
│     Input → Layers → Prediction              │
│                                              │
│  2. Loss Calculation                         │
│     Prediction vs. Correct Answer → Loss     │
│                                              │
│  3. Backward Pass (Backpropagation)          │
│     Loss → Gradient for each weight          │
│                                              │
│  4. Weight Update (Optimizer)                │
│  weight = weight − (learning rate × gradient)│
│                                              │
│  5. Repeat for next batch                    │
│                                              │
└──────────────────────────────────────────────┘
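The five steps can be sketched as a short training loop. This example assumes the simplest possible model, a single weight that multiplies the input, trained with mean squared error on made-up data where the correct answer is always 3 times the input. The data, batch size, and learning rate are illustrative choices.

```python
# Hypothetical (input, correct answer) pairs following answer = 3 * input.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
weight = 0.0
learning_rate = 0.01
batch_size = 2

for epoch in range(100):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]

        # 1. Forward pass
        predictions = [weight * x for x, _ in batch]

        # 2. Loss calculation (mean squared error)
        loss = sum((p - y) ** 2 for p, (_, y) in zip(predictions, batch)) / len(batch)

        # 3. Backward pass: slope of the loss with respect to the weight
        gradient = sum(2 * (p - y) * x for p, (x, y) in zip(predictions, batch)) / len(batch)

        # 4. Weight update
        weight = weight - learning_rate * gradient

        # 5. The loop repeats with the next batch

print(f"learned weight: {weight:.4f}")
```

With these settings the weight settles very close to 3, the value that generated the data. A real network runs the same cycle, only with many more weights and a framework computing step 3.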

A Concrete Numerical Example

Setup

Single weight W = 2.0
Gradient computed by backpropagation = 0.5
Learning rate = 0.1

Weight update:
W_new = W − (learning rate × gradient)
W_new = 2.0 − (0.1 × 0.5)
W_new = 2.0 − 0.05
W_new = 1.95

The weight moved from 2.0 to 1.95 — a tiny step in the direction that reduces the loss. This happens for every weight in the network, all at once, on every training step.
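The same update is easy to check in code. The sketch below applies the rule to a short list of weights in one go. The first weight and gradient come from the example above, while the other two pairs and the function name sgd_step are made up for illustration.

```python
def sgd_step(weights, gradients, learning_rate):
    """Apply weight = weight - learning_rate * gradient to every weight."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [2.0, -1.0, 0.3]     # first entry is W from the example
gradients = [0.5, -0.2, 0.0]   # first entry is the example's gradient
new_weights = sgd_step(weights, gradients, learning_rate=0.1)
print([round(w, 4) for w in new_weights])
```

The first weight comes out as 1.95, matching the hand calculation. A negative gradient pushes its weight up instead of down, and a zero gradient leaves its weight unchanged.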

The Vanishing Gradient Problem

In very deep networks, gradients can shrink to nearly zero as they travel backward through many layers. When gradients become too small, the weights in the early layers barely update — and those layers stop learning.

Vanishing Gradient Diagram

Layer 8 (output end)   → Gradient = 0.500  (strong signal)
Layer 7                → Gradient = 0.250
Layer 6                → Gradient = 0.125
Layer 5                → Gradient = 0.062
Layer 4                → Gradient = 0.031
Layer 3                → Gradient = 0.016
Layer 2                → Gradient = 0.008
Layer 1 (input end)    → Gradient = 0.004  (almost no learning)

In this illustration, each layer halves the gradient. By the time the signal reaches the early layers, it is too small to cause any meaningful weight update. Modern architectures use techniques like ReLU activation, batch normalization, and residual connections to reduce this problem.
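The shrinking pattern in the diagram can be reproduced with a few lines of code. This sketch assumes, as the diagram does, that the gradient starts at 0.5 at Layer 8 and is multiplied by 0.5 at every layer on the way back. Real networks do not shrink the signal by one fixed factor, so treat the numbers as an illustration only.

```python
num_layers = 8
per_layer_factor = 0.5  # assumption: each layer halves the signal
gradient = 0.5          # gradient at Layer 8 (output end)

gradients = {}
for layer in range(num_layers, 0, -1):
    gradients[layer] = gradient
    gradient *= per_layer_factor  # the signal shrinks before the next layer

for layer, g in gradients.items():
    print(f"Layer {layer}: gradient = {g:.3f}")
```

Try changing num_layers to 50 and the first layer's gradient becomes vanishingly small, which is why very deep networks are hit hardest.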

The Exploding Gradient Problem

The opposite problem also occurs. If gradients grow too large as they travel back, weight updates become massive — the model overshoots and becomes unstable.

Exploding gradient symptom:
  Loss: 0.5 → 1.2 → 4.7 → 23.1 → NaN (Not a Number)

Fix: Gradient Clipping — cap gradients above a set value
  if |gradient| > 1.0:  set gradient = +1.0 or −1.0 (keep its sign)
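Here is a minimal sketch of clipping in code. The first function caps a single gradient in both directions, so large negative gradients are limited too. The second shows a common variant that rescales a whole list of gradients when their combined length is too big. Both function names and the limit of 1.0 are illustrative.

```python
import math

def clip_by_value(gradient, limit=1.0):
    # Keep the gradient inside the range [-limit, +limit].
    return max(-limit, min(limit, gradient))

def clip_by_norm(gradients, max_norm=1.0):
    # Shrink all gradients together if their overall length is too large,
    # which keeps their direction the same.
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

print(clip_by_value(23.1), clip_by_value(-4.7), clip_by_value(0.3))
print(clip_by_norm([3.0, 4.0]))
```

Small gradients pass through untouched. Only the oversized ones are cut back, so normal training steps are not affected.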

Why You Do Not Need to Code Backpropagation

Frameworks like TensorFlow and PyTorch compute gradients automatically. In Keras (the high-level API of TensorFlow), calling model.fit() runs backpropagation behind the scenes. In PyTorch, calling loss.backward() computes the gradients for you. Understanding how it works helps you diagnose training problems, but you do not write the calculus yourself.
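To get a feel for what "automatically" means, here is a toy sketch of the idea: every number remembers how it was computed, so the gradients can be filled in by walking the calculation backward. The Value class is hypothetical and written only for this illustration. It is not how TensorFlow or PyTorch are implemented, and it supports only addition, subtraction, and multiplication.

```python
class Value:
    """A number that remembers how it was computed (toy example)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was built from
        self._local_grads = local_grads  # slope of self w.r.t. each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the nodes so each one comes after everything it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk backward, passing blame to parents via the chain rule.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

w = Value(2.0)
x = Value(3.0)
target = Value(5.0)
error = w * x - target
loss = error * error   # squared error
loss.backward()
print(w.grad)
```

The user writes only the forward calculation. The call to backward() produces the gradient for w without any hand-written calculus, which is the convenience the real frameworks provide at scale.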

Key Terms

  • Backpropagation — algorithm that computes gradients by working backward from the loss
  • Forward Pass — data moving from input to output to generate a prediction
  • Gradient — a number showing the direction and size of a weight adjustment
  • Chain Rule — the calculus rule that links gradients across layers
  • Vanishing Gradient — gradients shrinking to near-zero in early layers
  • Exploding Gradient — gradients growing too large and making training unstable
  • Gradient Clipping — capping gradient size to prevent explosion
