Machine Learning Neural Networks Basics
A Neural Network is a Machine Learning model loosely inspired by the structure of the human brain. It consists of layers of interconnected units called neurons. Each neuron receives inputs, processes them, and passes an output to the next layer. Neural Networks can learn complex patterns that many traditional algorithms struggle to capture, making them the foundation of modern Artificial Intelligence.
The Biological Inspiration
Human Brain Neuron:
Dendrites receive signals from other neurons
Cell body processes those signals
Axon sends the result to the next neuron

Artificial Neuron (Perceptron):
Inputs receive feature values
Weights determine how important each input is
Activation function processes the weighted sum
Output passes to the next layer

Biological Neuron        Artificial Neuron
──────────────────       ─────────────────
Dendrites            →   Input values (X1, X2, X3)
Synapse strength     →   Weights (W1, W2, W3)
Cell body            →   Weighted sum + bias
Firing threshold     →   Activation function
Axon output          →   Output (0 or 1, or any value)
A Single Neuron (Perceptron)
Inputs and Weights:
X1=0.5, W1=0.8 → 0.5×0.8 = 0.40
X2=0.3, W2=0.6 → 0.3×0.6 = 0.18
X3=0.9, W3=0.4 → 0.9×0.4 = 0.36
Bias (b) = 0.1
Weighted Sum (z):
z = 0.40 + 0.18 + 0.36 + 0.10 = 1.04
Activation Function:
Apply ReLU: Output = max(0, 1.04) = 1.04
The neuron outputs 1.04 to the next layer.
Diagram of One Neuron:
X1 ──(W1)──┐
│
X2 ──(W2)──┤ → Σ(Xi×Wi) + b → Activation → Output
│
X3 ──(W3)──┘
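The arithmetic above can be checked in a few lines of plain Python, using the same inputs, weights, and bias as the worked example:

```python
def relu(z):
    """ReLU activation: positive values pass through, negatives clamp to 0."""
    return max(0.0, z)

# Inputs, weights, and bias from the worked example
x = [0.5, 0.3, 0.9]
w = [0.8, 0.6, 0.4]
b = 0.1

# Weighted sum: z = sum(Xi * Wi) + b
z = sum(xi * wi for xi, wi in zip(x, w)) + b   # ≈ 1.04
output = relu(z)                               # ≈ 1.04 (positive, so unchanged)
```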
Neural Network Architecture: Layers
A Neural Network stacks neurons into layers:
Input Layer     Hidden Layer(s)     Output Layer
     │                │                  │
   [X1]          [N1]   [N2]            [Y]
   [X2]          [N3]   [N4]
   [X3]          [N5]   [N6]
Each circle = one neuron
Each arrow = a weighted connection
Layer Types:
Input Layer: One neuron per input feature. No computation.
Hidden Layers: Where learning happens. Can have many layers.
Output Layer: Produces the final prediction.
Example — Predicting loan default:
Input Layer: 5 neurons (Age, Income, Debt, CreditScore, LoanAmt)
Hidden Layer 1: 8 neurons
Hidden Layer 2: 4 neurons
Output Layer: 1 neuron (probability of default: 0.0 to 1.0)
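This architecture can be traced shape-by-shape in plain Python. The sketch below uses random weights and a made-up applicant record purely to show how the layer sizes connect; it performs no activation or training:

```python
import random

random.seed(0)

def dense(inputs, n_out):
    """One fully connected layer with random weights (illustration only)."""
    outputs = []
    for _ in range(n_out):
        w = [random.uniform(-1, 1) for _ in inputs]  # one weight per input
        b = random.uniform(-1, 1)                    # one bias per neuron
        outputs.append(sum(wi * xi for wi, xi in zip(w, inputs)) + b)
    return outputs

# Hypothetical applicant: Age, Income, Debt, CreditScore, LoanAmt
x = [35, 52000, 0.3, 710, 15000]

h1 = dense(x, 8)   # Hidden Layer 1: 8 values out
h2 = dense(h1, 4)  # Hidden Layer 2: 4 values out
y = dense(h2, 1)   # Output Layer: a single value
```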
Activation Functions
Activation functions decide whether a neuron should "fire" and how strongly. Without activation functions, a neural network would just be a series of linear transformations — no more powerful than a single linear equation. Activation functions introduce non-linearity, which allows the network to learn complex patterns.
Sigmoid
Output range: 0 to 1
Formula: σ(z) = 1 / (1 + e^(-z))
Best for: Output layer in binary classification.
Problem: Vanishing gradient — gradients shrink to near zero
for large inputs, making deep networks slow to learn.
Shape:
1.0│ ───────────────
0.5│ ╱
0.0│──────────╱
────────────────────────────►
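The formula and the vanishing-gradient behavior are easy to see numerically. A minimal sketch using only the standard library:

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

sigmoid(0)    # 0.5 — the midpoint of the curve
sigmoid(10)   # ≈ 0.99995 — saturates near 1; the slope here is nearly zero
sigmoid(-10)  # ≈ 0.000045 — saturates near 0, same vanishing-gradient issue
```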
ReLU (Rectified Linear Unit)
Output range: 0 to ∞
Formula: f(z) = max(0, z)
Best for: Hidden layers in most modern networks.
Advantage: Simple, fast, and prevents vanishing gradient.
Problem: "Dying ReLU" — neurons can get stuck outputting 0.
Shape:
│ ╱
0.0│───────╱
───────────────────────────►
0
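ReLU is simple enough to write in one line, which is part of why it is fast:

```python
def relu(z):
    """Rectified Linear Unit: f(z) = max(0, z)."""
    return max(0.0, z)

relu(1.04)   # 1.04 — positive values pass through unchanged
relu(-2.5)   # 0.0  — negative values are clamped to zero
```

A neuron whose weighted sum stays negative for every input always outputs 0 here, which is exactly the "dying ReLU" problem described above.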
Softmax
Output range: 0 to 1 for each class, all sum to 1.0
Best for: Output layer in multi-class classification.
Example (3-class: Cat, Dog, Rabbit):
Raw outputs from last hidden layer: [2.1, 0.8, 0.3]
Softmax converts to probabilities:
Cat = e^2.1 / (e^2.1 + e^0.8 + e^0.3)
= 8.16 / (8.16 + 2.23 + 1.35) = 0.70 (70%)
Dog = 2.23 / 11.74 = 0.19 (19%)
Rabbit= 1.35 / 11.74 = 0.11 (11%)
All probabilities sum to 1.0 ✓
Prediction: Cat (highest probability)
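The Cat/Dog/Rabbit example can be reproduced directly from the Softmax definition:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, 0.8, 0.3])  # raw outputs for Cat, Dog, Rabbit
# probs ≈ [0.70, 0.19, 0.11], and they sum to 1 (up to rounding)
```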
Other Common Activations
Tanh:
Output range: -1 to 1
Similar shape to Sigmoid but centered at 0
Better than Sigmoid for hidden layers (stronger gradients)

Leaky ReLU:
Like ReLU but allows small negative outputs
f(z) = max(0.01×z, z)
Fixes the "dying ReLU" problem
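Both variants can be sketched in a few lines (Tanh is built into Python's math module; the 0.01 slope below is the conventional Leaky ReLU default):

```python
import math

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of dying."""
    return z if z > 0 else alpha * z

math.tanh(0.0)    # 0.0 — centered at zero, unlike Sigmoid's 0.5
leaky_relu(3.0)   # 3.0 — positive values pass through, as in ReLU
leaky_relu(-5.0)  # -0.05 — a small gradient survives for negative inputs
```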
How a Neural Network Learns: Forward and Backward Pass
Forward Pass:
Data flows LEFT to RIGHT through the network.
Each layer computes outputs from the previous layer.
The final layer produces a prediction.

Input → Hidden Layer 1 → Hidden Layer 2 → Output (Prediction)

Backward Pass (Backpropagation):
The prediction is compared to the actual label → loss calculated.
The error signal flows RIGHT to LEFT through the network.
Each weight is adjusted to reduce the loss.

Output Error → Hidden Layer 2 (adjust weights) → Hidden Layer 1 (adjust weights) → Input

Combined:
One forward pass + one backward pass = one training iteration.
This repeats for all batches of data in each epoch.
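The whole forward/backward cycle fits in a few lines for a single sigmoid neuron. This is a minimal sketch with one input, one weight, and a made-up target, trained by plain gradient descent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron: input 2.0, target label 1.0; small starting weight
x, y_true = 2.0, 1.0
w, b, lr = 0.1, 0.0, 0.5

for step in range(100):
    # Forward pass: input -> weighted sum -> activation -> prediction
    z = w * x + b
    y_pred = sigmoid(z)
    # Loss: binary cross-entropy for this single example
    loss = -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))
    # Backward pass: for sigmoid + cross-entropy, dLoss/dz = y_pred - y_true
    dz = y_pred - y_true
    # Update each parameter in the direction that reduces the loss
    w -= lr * dz * x
    b -= lr * dz

# After 100 iterations the prediction is close to the target of 1.0
```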
Loss Functions
Loss functions measure prediction error.

┌─────────────────────────────┬──────────────────────────────────────┐
│ Problem Type                │ Loss Function                        │
├─────────────────────────────┼──────────────────────────────────────┤
│ Binary Classification       │ Binary Cross-Entropy (Log Loss)      │
│ Multi-Class Classification  │ Categorical Cross-Entropy            │
│ Regression                  │ Mean Squared Error (MSE)             │
└─────────────────────────────┴──────────────────────────────────────┘

The optimizer uses the gradient of the loss to update the weights.
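Two of these losses can be sketched directly from their definitions, with made-up targets and predictions to show how confident-but-wrong answers are punished:

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Log loss for a single binary prediction (y_pred strictly in (0, 1))."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

def mse(y_true, y_pred):
    """Mean squared error over paired lists of targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

binary_cross_entropy(1.0, 0.9)  # ≈ 0.105 — confident and correct: low loss
binary_cross_entropy(1.0, 0.1)  # ≈ 2.303 — confident and wrong: high loss
mse([3.0, 5.0], [2.5, 5.5])     # 0.25
```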
Training Terminology
┌──────────────────────┬────────────────────────────────────────────┐
│ Term                 │ Meaning                                    │
├──────────────────────┼────────────────────────────────────────────┤
│ Epoch                │ One complete pass through all training data│
│ Batch Size           │ Number of records processed before one     │
│                      │ weight update. Common: 32, 64, 128         │
│ Iteration            │ One forward+backward pass on one batch     │
│ Learning Rate        │ Size of each weight update step            │
│ Optimizer            │ Algorithm for updating weights             │
│                      │ (SGD, Adam, RMSprop)                       │
└──────────────────────┴────────────────────────────────────────────┘

Example:
Dataset: 1000 records
Batch size: 100
1 epoch = 10 iterations (1000/100 = 10 batches)
Training for 50 epochs = 500 iterations total
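The epoch/iteration arithmetic from the example above, written out:

```python
records, batch_size, epochs = 1000, 100, 50

iterations_per_epoch = records // batch_size      # 10 batches per epoch
total_iterations = iterations_per_epoch * epochs  # 500 weight updates in total
```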
The Adam Optimizer
Adam (Adaptive Moment Estimation) is the most commonly used optimizer.

Standard Gradient Descent: same learning rate for all weights.
Adam: adjusts the learning rate individually for each weight.

Benefits:
✓ Faster convergence than standard gradient descent
✓ Works well with sparse gradients (common in NLP)
✓ Good default for most problems
✓ Requires little tuning

Default settings:
learning_rate = 0.001
beta1 = 0.9 (decay rate for the first moment — momentum)
beta2 = 0.999 (decay rate for the second moment)
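The per-weight adaptation can be seen in the update rule itself. Below is a minimal sketch of one Adam step for a single weight, applied to the toy problem of minimizing f(w) = w² (gradient 2w):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight w at timestep t (starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: running mean of grads
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: running mean of grad^2
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-weight scaled step
    return w, m, v

# Toy run: drive w toward the minimum of f(w) = w^2 at w = 0
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
# w ends up close to the minimum at 0
```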
A Complete Neural Network Example
Problem: Classify handwritten digits (0–9)
Dataset: Images of 28×28 pixels = 784 input features

Architecture:
Input Layer: 784 neurons (one per pixel)
Hidden Layer 1: 256 neurons, ReLU activation
Hidden Layer 2: 128 neurons, ReLU activation
Output Layer: 10 neurons, Softmax (one per digit 0–9)

Training:
Loss function: Categorical Cross-Entropy
Optimizer: Adam, learning_rate = 0.001
Batch size: 64
Epochs: 20

Results:
After Epoch 1:  Training Accuracy = 91.2%
After Epoch 5:  Training Accuracy = 97.4%
After Epoch 10: Training Accuracy = 98.8%
After Epoch 20: Training Accuracy = 99.1%
Test Accuracy: 98.6%
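One quick sanity check on an architecture like this is to count its trainable parameters: each layer contributes one weight per input-output pair plus one bias per neuron.

```python
layer_sizes = [784, 256, 128, 10]  # input, hidden 1, hidden 2, output

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_in * n_out + n_out   # weight matrix plus one bias per neuron
    print(f"{n_in:>4} -> {n_out:<4}: {params:,} parameters")
    total += params

print(f"Total: {total:,} trainable parameters")  # 235,146
```

Most of the 235,146 parameters sit in the first hidden layer (784×256 + 256 = 200,960), which is typical when the input is high-dimensional.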
Common Hyperparameters in Neural Networks
┌──────────────────────┬────────────────────────────────────────────┐
│ Hyperparameter       │ Typical Values to Try                      │
├──────────────────────┼────────────────────────────────────────────┤
│ Number of layers     │ 1–5 for simple problems                    │
│ Neurons per layer    │ 32, 64, 128, 256, 512                      │
│ Activation (hidden)  │ ReLU (default), Leaky ReLU, Tanh           │
│ Activation (output)  │ Sigmoid (binary), Softmax (multi-class),   │
│                      │ Linear (regression)                        │
│ Learning rate        │ 0.1, 0.01, 0.001, 0.0001                   │
│ Batch size           │ 16, 32, 64, 128                            │
│ Epochs               │ 10–1000 (with early stopping)              │
│ Dropout rate         │ 0.2–0.5 (to prevent overfitting)           │
└──────────────────────┴────────────────────────────────────────────┘
