Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a deep learning architecture designed specifically for processing data that has a spatial or grid-like structure — most commonly images. CNNs automatically learn to detect visual patterns such as edges, shapes, textures, and objects from raw pixel values, without any manual feature extraction.

Why Standard Neural Networks Fail on Images

Problem with a regular Fully Connected Network on images:

Image size: 224×224×3 (RGB) = 150,528 pixel values
First hidden layer: 512 neurons

Connections needed = 150,528 × 512 = 77,070,336 (77 million parameters)
→ Massive computation
→ Easily overfits
→ Destroys spatial relationships between nearby pixels

Key issue: A fully connected network treats each pixel as an independent input feature.
It has no built-in notion that nearby pixels are spatially related.
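The parameter blow-up above is easy to verify directly. A minimal sketch (the 224×224×3 input and 512-neuron layer are the figures from this example):

```python
# Parameters for one fully connected layer on a flattened 224x224x3 image.
# Weight count only; each of the 512 neurons also adds one bias term.
pixels = 224 * 224 * 3          # 150,528 input values
neurons = 512
weights = pixels * neurons      # one weight per (input, neuron) pair
print(weights)                  # 77070336

# Compare with a single 3x3 filter on a 3-channel image, reused at every
# position of the image: just 27 shared weights.
conv_weights = 3 * 3 * 3
print(conv_weights)             # 27
```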

CNN solution:
  Reuse the same small filter across the entire image.
  Capture local patterns (edges, curves) without massive connections.
  Preserve the 2D spatial structure of the image.

The Convolution Operation

A filter (kernel) is a small grid of weights — typically 3×3 or 5×5.
It slides across the entire image and produces a feature map.

Example — A 3×3 edge-detecting filter on a 5×5 image:

Input Image (5×5):
  1  1  1  0  0
  1  1  1  0  0
  1  1  1  0  0
  0  0  0  0  0
  0  0  0  0  0

Edge-detecting filter (3×3):
  -1 -1 -1
   0  0  0
   1  1  1

Operation: Place the filter at the top-left corner of the image.
Multiply each filter weight by the corresponding image pixel.
Sum all 9 products → one value in the output feature map.
Slide the filter one step → compute the next value.
Repeat until the filter has covered the entire image.

After convolution → the feature map shows WHERE edges were detected.
Large-magnitude values = strong edge present at that location
(the sign depends on the edge's orientation).
Values near zero = no edge at that location.
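The sliding-window steps above can be sketched in NumPy with the exact image and filter from this example (note that what deep-learning libraries call "convolution" is technically cross-correlation, i.e. the filter is not flipped):

```python
import numpy as np

# The 5x5 image and 3x3 horizontal-edge filter from the example above.
image = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
kernel = np.array([
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1],
])

def convolve2d(img, k):
    """Valid cross-correlation: slide the filter over every position."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the 3x3 window by the filter and sum the 9 products.
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[ 0.  0.  0.]
#  [-3. -2. -1.]
#  [-3. -2. -1.]]
```

The large-magnitude entries in the lower rows mark where the bright block meets the dark region; the negative sign just reflects this filter's bright-above/dark-below orientation.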

CNN Architecture: Layer by Layer

Full CNN Architecture:

Input Image
    │
    ▼
Convolutional Layer (apply filters → feature maps)
    │
    ▼
Activation Function (ReLU applied to each feature map)
    │
    ▼
Pooling Layer (downsample each feature map)
    │
    ▼
(Repeat Convolution → ReLU → Pooling as needed)
    │
    ▼
Flatten Layer (convert 2D feature maps to 1D vector)
    │
    ▼
Fully Connected Layer(s) (standard neural network layers)
    │
    ▼
Output Layer (Softmax → class probabilities)

Pooling Layer

Pooling reduces the spatial size of feature maps.
This decreases computation and makes the model less sensitive
to small shifts in the image (translation invariance).

Max Pooling (most common) — 2×2 with stride 2:

Feature Map (4×4):
  4  7  2  1
  3  8  5  0
  1  2  6  3
  0  4  1  9

After 2×2 Max Pooling:
  Takes the maximum value in each 2×2 region:

  Region top-left:    4,7,3,8  → max = 8
  Region top-right:   2,1,5,0  → max = 5
  Region bottom-left: 1,2,0,4  → max = 4
  Region bottom-right:6,3,1,9  → max = 9

Result (2×2):
  8  5
  4  9

The map shrinks from 4×4 to 2×2 — half the size in each dimension,
so the number of values (and downstream computation) drops by 75%.
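The same 2×2/stride-2 pooling can be sketched in NumPy on the feature map from the example:

```python
import numpy as np

# The 4x4 feature map from the example above.
fmap = np.array([
    [4, 7, 2, 1],
    [3, 8, 5, 0],
    [1, 2, 6, 3],
    [0, 4, 1, 9],
])

def max_pool(x, size=2, stride=2):
    """Keep the maximum of each non-overlapping size x size window."""
    oh, ow = x.shape[0] // stride, x.shape[1] // stride
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

pooled = max_pool(fmap)
print(pooled)
# [[8 5]
#  [4 9]]
```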

Feature Map Depth and Multiple Filters

A single convolutional layer applies MANY filters, not just one.
Each filter detects a different pattern.

Example: 32 filters applied to a 224×224 image:
  Output: 32 feature maps, each 222×222 (assuming no padding)
  The 32 maps are "stacked" into a 3D volume: 222×222×32

As we go deeper:
  Early layers:  Detect simple patterns (edges, gradients)
  Middle layers: Detect complex shapes (curves, circles, textures)
  Deep layers:   Detect high-level objects (eyes, wheels, text)

Depth grows as we stack more convolutional layers:
  Conv Layer 1: 32 filters  → 32 feature maps
  Conv Layer 2: 64 filters  → 64 feature maps
  Conv Layer 3: 128 filters → 128 feature maps

More filters = More patterns detected = Richer representation
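How multiple filters stack into a 3D volume can be shown with a toy example (the 8×8 image, 4 random filters, and resulting 6×6×4 volume are illustrative choices, not figures from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))          # toy single-channel 8x8 image
filters = rng.random((4, 3, 3))     # 4 filters, each 3x3

# Each filter yields one (8-3+1) x (8-3+1) = 6x6 feature map;
# stacking the 4 maps along a new last axis gives a 6x6x4 volume.
maps = []
for k in filters:
    out = np.zeros((6, 6))
    for i in range(6):
        for j in range(6):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * k)
    maps.append(out)

volume = np.stack(maps, axis=-1)
print(volume.shape)                 # (6, 6, 4)
```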

Padding and Stride

Padding:
  Adding zeros around the image border before convolution.
  Preserves the original spatial dimensions in the output.

  Without padding: 5×5 image + 3×3 filter → 3×3 output (shrinks)
  With "same" padding: 5×5 image + 3×3 filter → 5×5 output (preserved)

Stride:
  How many pixels the filter moves at each step.
  Stride=1: Filter moves one pixel at a time (detailed output)
  Stride=2: Filter moves two pixels at a time (smaller output, faster)

  224×224 image + 3×3 filter + stride=2 → 111×111 output
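The output sizes quoted above all follow one standard formula, floor((n + 2·padding − k) / stride) + 1, which can be checked in a few lines:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size: floor((n + 2*padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))               # 3   (5x5 image, no padding)
print(conv_output_size(5, 3, padding=1))    # 5   ("same" padding for a 3x3 filter)
print(conv_output_size(224, 3, stride=2))   # 111
```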

Complete CNN Example: Image Classification

Task: Classify images of cats, dogs, and rabbits (3 classes)
Input: 64×64 RGB images (64×64×3)

Architecture:
  Conv Layer 1: 32 filters, 3×3, stride=1, padding=same, ReLU
    → Output: 64×64×32
  Max Pooling: 2×2
    → Output: 32×32×32

  Conv Layer 2: 64 filters, 3×3, stride=1, padding=same, ReLU
    → Output: 32×32×64
  Max Pooling: 2×2
    → Output: 16×16×64

  Conv Layer 3: 128 filters, 3×3, stride=1, padding=same, ReLU
    → Output: 16×16×128
  Max Pooling: 2×2
    → Output: 8×8×128

  Flatten: 8×8×128 = 8192 values in a 1D vector

  Fully Connected Layer 1: 256 neurons, ReLU, Dropout(0.5)
  Fully Connected Layer 2: 64 neurons, ReLU

  Output Layer: 3 neurons, Softmax
    → Cat: 0.82, Dog: 0.13, Rabbit: 0.05
    → Prediction: Cat ✓
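As a sanity check on this architecture, the trainable parameters of each layer can be counted with the usual formulas (k·k·in_channels + 1 bias, per filter; n_in + 1 bias, per dense neuron). This is a sketch of the count, not a training script:

```python
def conv_params(k, in_ch, out_ch):
    # Each filter has k*k*in_ch weights plus one bias.
    return (k * k * in_ch + 1) * out_ch

def dense_params(n_in, n_out):
    # Each neuron has n_in weights plus one bias.
    return (n_in + 1) * n_out

layers = {
    "conv1": conv_params(3, 3, 32),           # 896
    "conv2": conv_params(3, 32, 64),          # 18,496
    "conv3": conv_params(3, 64, 128),         # 73,856
    "fc1":   dense_params(8 * 8 * 128, 256),  # 2,097,408
    "fc2":   dense_params(256, 64),           # 16,448
    "out":   dense_params(64, 3),             # 195
}
print("total:", sum(layers.values()))         # total: 2207299
```

Note where the parameters live: the three convolutional layers together hold under 100k parameters, while the first fully connected layer alone holds over 2 million. This is the fully-connected explosion from the opening section in miniature.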

Famous CNN Architectures

┌──────────────────┬──────┬──────────────────────────────────────────┐
│ Architecture     │ Year │ Key Innovation                           │
├──────────────────┼──────┼──────────────────────────────────────────┤
│ LeNet-5          │ 1998 │ First successful CNN for digit reading   │
│ AlexNet          │ 2012 │ First deep CNN to win ImageNet           │
│                  │      │ Used ReLU and Dropout                    │
│ VGG16/19         │ 2014 │ Very deep (16–19 layers), simple design  │
│                  │      │ All 3×3 convolutions                     │
│ ResNet           │ 2015 │ Residual connections (skip connections)  │
│                  │      │ Solved vanishing gradient for 100+ layers│
│ Inception/       │ 2014 │ Parallel convolutions at different sizes │
│ GoogLeNet        │      │ Very efficient (fewer parameters)        │
│ EfficientNet     │ 2019 │ Scales width, depth, resolution together │
│                  │      │ State-of-the-art accuracy/efficiency     │
└──────────────────┴──────┴──────────────────────────────────────────┘

Residual Connections (ResNet)

Problem: Very deep networks (50+ layers) suffer from vanishing gradients.
Even with ReLU, gradients can fade before reaching early layers.

ResNet Solution: Add "skip connections" that bypass 2–3 layers.

Standard Block:
  Input → Conv → ReLU → Conv → ReLU → Output

Residual Block:
  Input → Conv → ReLU → Conv → (+) → ReLU → Output
    │                           ↑
    └───────────────────────────┘ (skip connection)

The (+) adds the original input directly to the layer output.
Even if the convolution learns nothing, the input passes through.
Gradients flow directly back through the skip connection.

Result: Networks with 152 layers trained successfully.
        Performance improved with depth instead of degrading.
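The "input passes through even if the convolution learns nothing" property can be demonstrated with a toy residual block. Here the two convolutions are stood in for by simple per-element weights so shapes stay identical and the skip connection is a plain add; this is a sketch of the idea, not real ResNet code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """Toy residual block: weighted path + identity skip connection."""
    out = relu(x * w1)        # stand-in for first conv + ReLU
    out = out * w2            # stand-in for second conv
    return relu(out + x)      # (+) adds the original input, then ReLU

x = np.array([1.0, -2.0, 3.0])

# If the weighted path learns nothing (all-zero weights), the input
# still flows straight through the skip connection:
print(residual_block(x, np.zeros(3), np.zeros(3)))   # [1. 0. 3.]
```

Because the skip path is an identity, the gradient of the output with respect to the input always contains a direct term — this is why gradients survive 100+ layers.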

Applications of CNNs

┌────────────────────────────┬───────────────────────────────────────┐
│ Application                │ Example                               │
├────────────────────────────┼───────────────────────────────────────┤
│ Image Classification       │ Cat vs Dog, medical image diagnosis   │
│ Object Detection           │ Finding and locating objects in photos│
│ Face Recognition           │ Unlock phone, tag people in photos    │
│ Medical Imaging            │ Detecting tumors in X-rays and MRI    │
│ Document OCR               │ Reading text from scanned documents   │
│ Self-Driving Cars          │ Detecting lanes, signs, pedestrians   │
│ Satellite Image Analysis   │ Farmland mapping, deforestation       │
└────────────────────────────┴───────────────────────────────────────┘
