Deep Learning Activation Functions

Activation functions are the decision-makers inside a neural network. Without them, every layer would only compute a weighted sum of its inputs, and the whole network would behave like a single linear function (a straight line), no matter how many layers you stack. Activation functions introduce the non-linearity that lets the network learn curved, complex patterns.

The Core Problem They Solve

Imagine stacking 10 identical translucent sheets of glass. You see through all 10 the same way you see through 1 — they just pile up into one combined layer. Neural network layers without activation functions behave the same way. Adding more layers adds nothing new.

An activation function breaks this pattern. It bends and reshapes the output at each layer, giving the network the ability to model complex, non-linear patterns.

Without vs With Activation

WITHOUT ACTIVATION:
Input → Layer 1 → Layer 2 → Layer 3 → Output
(All three layers collapse into one straight line — useless depth)

WITH ACTIVATION:
Input → Layer 1 → [Bend] → Layer 2 → [Bend] → Layer 3 → [Bend] → Output
(Each bend adds expressive power — the network learns complex shapes)
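
The collapse described above can be checked with a little arithmetic. The sketch below is only an illustration: the weights and biases are made-up numbers, and each "layer" is a single weight and bias rather than a full matrix. It compares two stacked linear layers with the single line they reduce to, then inserts a ReLU between them.

```python
# Two toy "layers", each computing weight * x + bias (no activation).
# The weights and biases are hypothetical values chosen for illustration.
def layer1(x):
    return 3.0 * x + 1.0

def layer2(x):
    return -2.0 * x + 4.0

def stacked_linear(x):
    # Layer 2 applied directly to the output of layer 1.
    return layer2(layer1(x))

def single_line(x):
    # Algebra: -2 * (3x + 1) + 4 = -6x + 2, which is one straight line.
    return -6.0 * x + 2.0

def stacked_with_relu(x):
    # Same two layers, but with max(0, .) applied between them.
    return layer2(max(0.0, layer1(x)))

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([stacked_linear(x) for x in inputs])     # matches single_line exactly
print([single_line(x) for x in inputs])
print([stacked_with_relu(x) for x in inputs])  # no longer a straight line
```

With the activation in place, the output is flat for the most negative inputs and then falls, a shape that no single straight line can produce.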

The Most Important Activation Functions

1. ReLU (Rectified Linear Unit)

ReLU is the most widely used activation function in hidden layers today. Its rule is simple: if the input is negative, output 0. If the input is positive, output the input as-is.

ReLU rule: output = max(0, input)

Input: -5  → Output: 0
Input: -1  → Output: 0
Input:  0  → Output: 0
Input:  3  → Output: 3
Input:  7  → Output: 7

Shape on a graph:
         /
        /
───────/
     0

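ReLU is cheap to compute, and because it does not flatten out for large positive inputs, it reduces the vanishing-gradient problem that slowed training with older functions such as Sigmoid and Tanh. Most hidden layers in modern networks use it by default.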
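
As a minimal sketch, the ReLU rule fits in one line of plain Python. The function name relu is chosen here for illustration and is not taken from any particular library.

```python
def relu(x):
    # Negative inputs become 0; positive inputs pass through unchanged.
    return max(0.0, x)

# Same inputs as the table above.
for x in [-5, -1, 0, 3, 7]:
    print(x, "->", relu(x))
```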

2. Sigmoid

Sigmoid squishes any number — no matter how large or small — into a value between 0 and 1. This makes it ideal for output layers in binary classification tasks, where you want a probability.

Sigmoid output range: (0, 1)

Input: -10  → Output ≈ 0.00
Input:  -2  → Output ≈ 0.12
Input:   0  → Output = 0.50
Input:   2  → Output ≈ 0.88
Input:  10  → Output ≈ 1.00

Shape on a graph:
          ___
         /
        /
   ____/

Think of Sigmoid as a volume knob that turns any number into a percentage. A score of 0.85 means the model is 85% confident in the positive class.
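
The formula behind Sigmoid is 1 / (1 + e^(-x)). The sketch below is one possible plain-Python version. It splits the calculation into two branches, which is an implementation choice made here so that very large negative inputs do not overflow. It is not the only way to write it.

```python
import math

def sigmoid(x):
    # Standard form for non-negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form for negative inputs; avoids computing exp of a
    # large positive number, which would raise OverflowError.
    z = math.exp(x)
    return z / (1.0 + z)

# Same inputs as the table above.
for x in [-10, -2, 0, 2, 10]:
    print(x, "->", round(sigmoid(x), 2))
```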

3. Softmax

Softmax extends Sigmoid to handle more than two categories. It takes a group of numbers and converts them into probabilities that all add up to 100%.

Example: Classify an animal as cat, dog, or bird

Raw scores from the network:
  cat  = 2.0
  dog  = 1.0
  bird = 0.5

After Softmax:
  cat  = 63%
  dog  = 23%
  bird = 14%
  ─────────
  Total = 100%

The model predicts: cat (highest probability)

Use Softmax in the output layer when each input belongs to exactly one of 3 or more categories.
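
Softmax raises e to the power of each score and divides by the sum of those values. The sketch below is a plain-Python illustration using the raw scores from the example above. Subtracting the largest score before exponentiating is a common stability trick added here; it does not change the result.

```python
import math

def softmax(scores):
    # Shift by the max score so exp() never sees a large positive number.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    # Each probability is that score's share of the total.
    return [e / total for e in exps]

labels = ["cat", "dog", "bird"]
probs = softmax([2.0, 1.0, 0.5])

for label, p in zip(labels, probs):
    print(label, round(p * 100), "%")

# The predicted class is the one with the highest probability.
prediction = labels[probs.index(max(probs))]
print("Prediction:", prediction)
```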

4. Tanh (Hyperbolic Tangent)

Tanh is similar to Sigmoid but outputs values between -1 and 1 instead of 0 and 1. Because its output is centered on zero, it often works better than Sigmoid in the hidden layers of certain models.

Tanh output range: (-1, 1)

Input: -10  → Output ≈ -1.00
Input:  -1  → Output ≈ -0.76
Input:   0  → Output =  0.00
Input:   1  → Output ≈  0.76
Input:  10  → Output ≈  1.00

Shape on a graph:
     ___
    /
   /
__/
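
Python's standard library already provides math.tanh, so no custom formula is needed. The sketch below also shows how closely Tanh is related to Sigmoid: tanh(x) equals 2 × sigmoid(2x) − 1, which is a Sigmoid curve stretched to the range -1 to 1. The helper names are chosen here for illustration.

```python
import math

def sigmoid(x):
    # Simple form; fine for the small inputs used below.
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # Tanh written as a rescaled, recentered Sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Same inputs as the table above.
for x in [-10, -1, 0, 1, 10]:
    print(x, "->", round(math.tanh(x), 2), round(tanh_via_sigmoid(x), 2))
```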

5. Leaky ReLU

Standard ReLU outputs zero for all negative inputs, and its gradient there is also zero. A neuron that always receives negative input therefore stops updating its weights and no longer contributes to learning, a problem called "dying ReLU." Leaky ReLU fixes this by allowing a tiny slope for negative values.

Standard ReLU:  Input -5 → Output 0
Leaky ReLU:     Input -5 → Output -0.05  (0.01 × -5)

Shape on a graph:
         /
        /
  ─────/ ← tiny slope for negatives (the line dips slightly below zero instead of staying flat)
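
The sketch below illustrates Leaky ReLU next to the slope (gradient) of both functions. It assumes the 0.01 slope used in the example above. The parameter name negative_slope and the choice to treat an input of exactly 0 as part of the negative side are conventions picked here for illustration.

```python
def leaky_relu(x, negative_slope=0.01):
    # Positive inputs pass through; negative inputs are scaled down.
    return x if x > 0 else negative_slope * x

def relu_grad(x):
    # Slope of standard ReLU: 0 for negatives, so no learning signal.
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, negative_slope=0.01):
    # Slope of Leaky ReLU: small but non-zero for negatives.
    return 1.0 if x > 0 else negative_slope

for x in [-5, -1, 3]:
    print(x, "->", leaky_relu(x),
          "| ReLU slope:", relu_grad(x),
          "| Leaky ReLU slope:", leaky_relu_grad(x))
```

The non-zero slope on the negative side is what keeps a small learning signal flowing, so the neuron can still recover.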

When to Use Which Activation Function

Function	Use In	Task
ReLU	Hidden layers	Almost everything — default choice
Leaky ReLU	Hidden layers	When dying ReLU is a problem
Sigmoid	Output layer	Binary classification (yes/no, spam/not spam)
Softmax	Output layer	Multi-class classification (cat/dog/bird)
Tanh	Hidden layers (older models)	When centered output (-1 to 1) helps

A Visual Summary

ReLU:       ___/   (flat at 0, then rises)
Sigmoid:   ─S─    (smooth S-curve, 0 to 1)
Softmax:   [a, b, c] → [%a, %b, %c] sums to 100%
Tanh:      ─S─    (smooth S-curve, -1 to 1)
Leaky ReLU: _-/   (gentle slope left of 0, steep rise right)

Key Terms

Activation Function — a mathematical rule applied to a neuron's output
Non-linear — a curve or bend, as opposed to a straight line
ReLU — outputs zero for negatives, passes positives unchanged
Sigmoid — compresses output to 0–1 for binary probability
Softmax — converts a list of scores into probabilities summing to 100%
Dying ReLU — when a neuron always outputs zero and stops learning
