Machine Learning Neural Networks Basics
A Neural Network is a Machine Learning model loosely inspired by the structure of the human brain. It consists of layers of interconnected units called neurons. Each neuron receives inputs, processes them, and passes an output to the next layer. Neural Networks can learn complex patterns that many traditional algorithms struggle to capture, making them the foundation of modern Artificial Intelligence.
The Biological Inspiration
Human Brain Neuron:
Dendrites receive signals from other neurons
Cell body processes those signals
Axon sends the result to the next neuron

Artificial Neuron (Perceptron):
Inputs receive feature values
Weights determine how important each input is
Activation function processes the weighted sum
Output passes to the next layer

Biological Neuron        Artificial Neuron
──────────────────       ─────────────────
Dendrites            →   Input values (X1, X2, X3)
Synapse strength     →   Weights (W1, W2, W3)
Cell body            →   Weighted sum + bias
Firing threshold     →   Activation function
Axon output          →   Output (0 or 1, or any value)
A Single Neuron (Perceptron)
Inputs and Weights:
X1=0.5, W1=0.8 → 0.5×0.8 = 0.40
X2=0.3, W2=0.6 → 0.3×0.6 = 0.18
X3=0.9, W3=0.4 → 0.9×0.4 = 0.36
Bias (b) = 0.1
Weighted Sum (z):
z = 0.40 + 0.18 + 0.36 + 0.10 = 1.04
Activation Function:
Apply ReLU: Output = max(0, 1.04) = 1.04
The neuron outputs 1.04 to the next layer.
Diagram of One Neuron:
X1 ──(W1)──┐
│
X2 ──(W2)──┤ → Σ(Xi×Wi) + b → Activation → Output
│
X3 ──(W3)──┘
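The arithmetic above can be checked in a few lines of plain Python, using the same inputs, weights, and bias as the worked example:

```python
def relu(z):
    """ReLU activation: positive values pass through, negatives clamp to 0."""
    return max(0.0, z)

# Inputs, weights, and bias from the worked example
x = [0.5, 0.3, 0.9]
w = [0.8, 0.6, 0.4]
b = 0.1

# Weighted sum: z = sum(Xi * Wi) + b
z = sum(xi * wi for xi, wi in zip(x, w)) + b   # ≈ 1.04
output = relu(z)                               # ≈ 1.04 (positive, so unchanged)
```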
Neural Network Architecture: Layers
A Neural Network stacks neurons into layers:
Input Layer     Hidden Layer(s)     Output Layer
     │                │                  │
   [X1]          [N1]   [N2]            [Y]
   [X2]          [N3]   [N4]
   [X3]          [N5]   [N6]
Each circle = one neuron
Each arrow = a weighted connection
Layer Types:
Input Layer: One neuron per input feature. No computation.
Hidden Layers: Where learning happens. Can have many layers.
Output Layer: Produces the final prediction.
Example — Predicting loan default:
Input Layer: 5 neurons (Age, Income, Debt, CreditScore, LoanAmt)
Hidden Layer 1: 8 neurons
Hidden Layer 2: 4 neurons
Output Layer: 1 neuron (probability of default: 0.0 to 1.0)
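This architecture can be traced shape-by-shape in plain Python. The sketch below uses random weights and a made-up applicant record purely to show how the layer sizes connect; it performs no activation or training:

```python
import random

random.seed(0)

def dense(inputs, n_out):
    """One fully connected layer with random weights (illustration only)."""
    outputs = []
    for _ in range(n_out):
        w = [random.uniform(-1, 1) for _ in inputs]  # one weight per input
        b = random.uniform(-1, 1)                    # one bias per neuron
        outputs.append(sum(wi * xi for wi, xi in zip(w, inputs)) + b)
    return outputs

# Hypothetical applicant: Age, Income, Debt, CreditScore, LoanAmt
x = [35, 52000, 0.3, 710, 15000]

h1 = dense(x, 8)   # Hidden Layer 1: 8 values out
h2 = dense(h1, 4)  # Hidden Layer 2: 4 values out
y = dense(h2, 1)   # Output Layer: a single value
```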
Activation Functions
Activation functions decide whether a neuron should "fire" and how strongly. Without activation functions, a neural network would just be a series of linear transformations — no more powerful than a single linear equation. Activation functions introduce non-linearity, which allows the network to learn complex patterns.
Sigmoid
Output range: 0 to 1
Formula: σ(z) = 1 / (1 + e^(-z))
Best for: Output layer in binary classification.
Problem: Vanishing gradient — gradients shrink to near zero
for large inputs, making deep networks slow to learn.
Shape:
1.0│ ───────────────
0.5│ ╱
0.0│──────────╱
────────────────────────────►
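The formula and the vanishing-gradient behavior are easy to see numerically. A minimal sketch using only the standard library:

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

sigmoid(0)    # 0.5 — the midpoint of the curve
sigmoid(10)   # ≈ 0.99995 — saturates near 1; the slope here is nearly zero
sigmoid(-10)  # ≈ 0.000045 — saturates near 0, same vanishing-gradient issue
```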
ReLU (Rectified Linear Unit)
Output range: 0 to ∞
Formula: f(z) = max(0, z)
Best for: Hidden layers in most modern networks.
Advantage: Simple, fast, and prevents vanishing gradient.
Problem: "Dying ReLU" — neurons can get stuck outputting 0.
Shape:
│ ╱
0.0│───────╱
───────────────────────────►
0
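ReLU is simple enough to write in one line, which is part of why it is fast:

```python
def relu(z):
    """Rectified Linear Unit: f(z) = max(0, z)."""
    return max(0.0, z)

relu(1.04)   # 1.04 — positive values pass through unchanged
relu(-2.5)   # 0.0  — negative values are clamped to zero
```

A neuron whose weighted sum stays negative for every input always outputs 0 here, which is exactly the "dying ReLU" problem described above.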
Softmax
Output range: 0 to 1 for each class, all sum to 1.0
Best for: Output layer in multi-class classification.
Example (3-class: Cat, Dog, Rabbit):
Raw outputs from last hidden layer: [2.1, 0.8, 0.3]
Softmax converts to probabilities:
Cat = e^2.1 / (e^2.1 + e^0.8 + e^0.3)
= 8.16 / (8.16 + 2.23 + 1.35) = 0.70 (70%)
Dog = 2.23 / 11.74 = 0.19 (19%)
Rabbit= 1.35 / 11.74 = 0.11 (11%)
All probabilities sum to 1.0 ✓
Prediction: Cat (highest probability)
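The Cat/Dog/Rabbit example can be reproduced directly from the Softmax definition:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, 0.8, 0.3])  # raw outputs for Cat, Dog, Rabbit
# probs ≈ [0.70, 0.19, 0.11], and they sum to 1 (up to rounding)
```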
Other Common Activations
Tanh:
Output range: -1 to 1
Similar shape to Sigmoid but centered at 0
Better than Sigmoid for hidden layers (stronger gradients)

Leaky ReLU:
Like ReLU but allows small negative outputs
f(z) = max(0.01×z, z)
Fixes the "dying ReLU" problem
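Both variants can be sketched in a few lines (Tanh is built into Python's math module; the 0.01 slope below is the conventional Leaky ReLU default):

```python
import math

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of dying."""
    return z if z > 0 else alpha * z

math.tanh(0.0)    # 0.0 — centered at zero, unlike Sigmoid's 0.5
leaky_relu(3.0)   # 3.0 — positive values pass through, as in ReLU
leaky_relu(-5.0)  # -0.05 — a small gradient survives for negative inputs
```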
How a Neural Network Learns: Forward and Backward Pass
Forward Pass:
Data flows LEFT to RIGHT through the network.
Each layer computes outputs from the previous layer.
The final layer produces a prediction.

Input → Hidden Layer 1 → Hidden Layer 2 → Output (Prediction)

Backward Pass (Backpropagation):
The prediction is compared to the actual label → loss calculated.
The error signal flows RIGHT to LEFT through the network.
Each weight is adjusted to reduce the loss.

Output Error → Hidden Layer 2 (adjust weights) → Hidden Layer 1 (adjust weights) → Input

Combined:
One forward pass + one backward pass = one training iteration.
This repeats for all batches of data in each epoch.
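The whole forward/backward cycle fits in a few lines for a single sigmoid neuron. This is a minimal sketch with one input, one weight, and a made-up target, trained by plain gradient descent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron: input 2.0, target label 1.0; small starting weight
x, y_true = 2.0, 1.0
w, b, lr = 0.1, 0.0, 0.5

for step in range(100):
    # Forward pass: input -> weighted sum -> activation -> prediction
    z = w * x + b
    y_pred = sigmoid(z)
    # Loss: binary cross-entropy for this single example
    loss = -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))
    # Backward pass: for sigmoid + cross-entropy, dLoss/dz = y_pred - y_true
    dz = y_pred - y_true
    # Update each parameter in the direction that reduces the loss
    w -= lr * dz * x
    b -= lr * dz

# After 100 iterations the prediction is close to the target of 1.0
```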
Loss Functions
Loss functions measure prediction error.

┌─────────────────────────────┬──────────────────────────────────────┐
│ Problem Type                │ Loss Function                        │
├─────────────────────────────┼──────────────────────────────────────┤
│ Binary Classification       │ Binary Cross-Entropy (Log Loss)      │
│ Multi-Class Classification  │ Categorical Cross-Entropy            │
│ Regression                  │ Mean Squared Error (MSE)             │
└─────────────────────────────┴──────────────────────────────────────┘

The optimizer uses the gradient of the loss to update the weights.
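Two of these losses can be sketched directly from their definitions, with made-up targets and predictions to show how confident-but-wrong answers are punished:

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Log loss for a single binary prediction (y_pred strictly in (0, 1))."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

def mse(y_true, y_pred):
    """Mean squared error over paired lists of targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

binary_cross_entropy(1.0, 0.9)  # ≈ 0.105 — confident and correct: low loss
binary_cross_entropy(1.0, 0.1)  # ≈ 2.303 — confident and wrong: high loss
mse([3.0, 5.0], [2.5, 5.5])     # 0.25
```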
Training Terminology
┌──────────────────────┬────────────────────────────────────────────┐
│ Term                 │ Meaning                                    │
├──────────────────────┼────────────────────────────────────────────┤
│ Epoch                │ One complete pass through all training data│
│ Batch Size           │ Number of records processed before one     │
│                      │ weight update. Common: 32, 64, 128         │
│ Iteration            │ One forward+backward pass on one batch     │
│ Learning Rate        │ Size of each weight update step            │
│ Optimizer            │ Algorithm for updating weights             │
│                      │ (SGD, Adam, RMSprop)                       │
└──────────────────────┴────────────────────────────────────────────┘

Example:
Dataset: 1000 records
Batch size: 100
1 epoch = 10 iterations (1000/100 = 10 batches)
Training for 50 epochs = 500 iterations total
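The epoch/iteration arithmetic from the example above, written out:

```python
records, batch_size, epochs = 1000, 100, 50

iterations_per_epoch = records // batch_size      # 10 batches per epoch
total_iterations = iterations_per_epoch * epochs  # 500 weight updates in total
```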
The Adam Optimizer
Adam (Adaptive Moment Estimation) is the most commonly used optimizer.

Standard Gradient Descent: same learning rate for all weights.
Adam: adjusts the learning rate individually for each weight.

Benefits:
✓ Faster convergence than standard gradient descent
✓ Works well with sparse gradients (common in NLP)
✓ Good default for most problems
✓ Requires little tuning

Default settings:
learning_rate = 0.001
beta1 = 0.9 (decay rate for the first moment — momentum)
beta2 = 0.999 (decay rate for the second moment)
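The per-weight adaptation can be seen in the update rule itself. Below is a minimal sketch of one Adam step for a single weight, applied to the toy problem of minimizing f(w) = w² (gradient 2w):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight w at timestep t (starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: running mean of grads
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: running mean of grad^2
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-weight scaled step
    return w, m, v

# Toy run: drive w toward the minimum of f(w) = w^2 at w = 0
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
# w ends up close to the minimum at 0
```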
A Complete Neural Network Example
Problem: Classify handwritten digits (0–9)
Dataset: Images of 28×28 pixels = 784 input features

Architecture:
Input Layer: 784 neurons (one per pixel)
Hidden Layer 1: 256 neurons, ReLU activation
Hidden Layer 2: 128 neurons, ReLU activation
Output Layer: 10 neurons, Softmax (one per digit 0–9)

Training:
Loss function: Categorical Cross-Entropy
Optimizer: Adam, learning_rate = 0.001
Batch size: 64
Epochs: 20

Results:
After Epoch 1:  Training Accuracy = 91.2%
After Epoch 5:  Training Accuracy = 97.4%
After Epoch 10: Training Accuracy = 98.8%
After Epoch 20: Training Accuracy = 99.1%
Test Accuracy: 98.6%
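One quick sanity check on an architecture like this is to count its trainable parameters: each layer contributes one weight per input-output pair plus one bias per neuron.

```python
layer_sizes = [784, 256, 128, 10]  # input, hidden 1, hidden 2, output

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_in * n_out + n_out   # weight matrix plus one bias per neuron
    print(f"{n_in:>4} -> {n_out:<4}: {params:,} parameters")
    total += params

print(f"Total: {total:,} trainable parameters")  # 235,146
```

Most of the 235,146 parameters sit in the first hidden layer (784×256 + 256 = 200,960), which is typical when the input is high-dimensional.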
Common Hyperparameters in Neural Networks
┌──────────────────────┬────────────────────────────────────────────┐
│ Hyperparameter       │ Typical Values to Try                      │
├──────────────────────┼────────────────────────────────────────────┤
│ Number of layers     │ 1–5 for simple problems                    │
│ Neurons per layer    │ 32, 64, 128, 256, 512                      │
│ Activation (hidden)  │ ReLU (default), Leaky ReLU, Tanh           │
│ Activation (output)  │ Sigmoid (binary), Softmax (multi-class),   │
│                      │ Linear (regression)                        │
│ Learning rate        │ 0.1, 0.01, 0.001, 0.0001                   │
│ Batch size           │ 16, 32, 64, 128                            │
│ Epochs               │ 10–1000 (with early stopping)              │
│ Dropout rate         │ 0.2–0.5 (to prevent overfitting)           │
└──────────────────────┴────────────────────────────────────────────┘
