Deep Learning Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are the architecture behind much of modern image recognition, video analysis, and medical imaging. They are designed to process grid-like data, especially images, with far fewer parameters than a standard fully connected network. Every time your phone identifies a face or an app reads a handwritten digit, a CNN is likely doing the work.

The Problem with Using Standard Networks for Images

A 224×224 color image contains 224 × 224 × 3 = 150,528 numbers. If you connect all of these directly to 1,000 hidden neurons, you need 150,528 × 1,000 = 150,528,000 weights (about 150 million) in the first layer alone. That is slow to train, memory-hungry, and prone to overfitting.

CNNs solve this by reusing the same small set of weights at every position in the image, a technique called convolution.
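The saving from weight sharing can be checked with a little arithmetic. The sketch below compares the dense layer described above with a convolutional layer. The choice of 32 filters of size 3×3 is an assumption borrowed from the architecture shown later in this lesson, biases are ignored, and the variable names are illustrative.

```python
# Weights needed to connect a 224x224 RGB image to 1,000 hidden neurons.
height, width, channels = 224, 224, 3
inputs = height * width * channels      # 150,528 input values
dense_weights = inputs * 1000           # one weight per input per neuron

# Assumed conv layer: 32 filters, each 3x3 and spanning all 3 color channels.
# The same filter weights are reused at every position in the image.
num_filters, kernel = 32, 3
conv_weights = num_filters * kernel * kernel * channels

print(f"dense layer weights: {dense_weights:,}")
print(f"conv layer weights:  {conv_weights:,}")
```

Under these assumptions the convolutional layer needs 864 weights, against roughly 150 million for the dense layer.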

The Core Idea: A Sliding Window

Instead of connecting every pixel to every neuron, a CNN slides a small filter (also called a kernel) across the image. The filter looks at a tiny patch of pixels at a time and detects one specific pattern — like an edge, a curve, or a color gradient.

Convolution Diagram

Image (6×6 pixels):                Filter (3×3):
┌───────────────────┐             ┌──────────┐
│ 1  1  1  0  0  0  │             │ 1  0  -1 │
│ 1  1  1  0  0  0  │     ×       │ 1  0  -1 │
│ 1  1  1  0  0  0  │             │ 1  0  -1 │
│ 0  0  0  1  1  1  │             └──────────┘
│ 0  0  0  1  1  1  │         (detects vertical edges)
│ 0  0  0  1  1  1  │
└───────────────────┘

The filter slides across every 3×3 patch of the image, one pixel at a time.
At each position: multiply element-wise, sum → one output number.
A 6×6 image has 4 × 4 = 16 such positions, so the result is a 4×4 Feature Map showing where vertical edges appear.
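The multiply-and-sum step can be sketched in plain Python using the image and filter from the diagram. This is a minimal illustration that assumes a square filter, a step of one pixel, and no padding. The function name `convolve2d` is made up for this example. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the 3x3 patch by the kernel element-wise, then sum.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [0, 3, 3, 0]
# [0, 1, 1, 0]
# [0, -1, -1, 0]
# [0, -3, -3, 0]
```

Positive values mark places where the image goes from bright on the left to dark on the right, negative values mark the opposite transition, and zeros mark patches with no vertical edge.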

What Filters Learn to Detect

Early layers (shallow):      Middle layers:         Deep layers:
  Horizontal edges             Curves                Ears
  Vertical edges               Textures              Eyes
  Diagonal edges               Simple shapes         Faces
  Color gradients              Patterns              Objects

Each filter specializes in detecting one feature. A network trained on faces might have thousands of filters, each finding a different facial element at different scales.

Key CNN Components

1. Convolutional Layer

This is the core layer. It applies multiple filters to its input, and each filter produces a separate feature map. A layer with 32 filters produces 32 feature maps, each highlighting a different pattern in the image.
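The one-map-per-filter idea can be sketched as follows. This is a simplified illustration for a single-channel image, with no bias or activation. The helper names `convolve2d` and `conv_layer` are hypothetical, and the horizontal-edge filter is an assumed example that does not appear in the lesson.

```python
def convolve2d(image, kernel):
    """Stride-1, no-padding sliding-window sum of products."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def conv_layer(image, filters):
    """Apply every filter to the same image: one feature map per filter."""
    return [convolve2d(image, f) for f in filters]

# The 6x6 image from the convolution diagram.
image = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3

filters = [
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],   # vertical-edge filter from the diagram
    [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],   # horizontal-edge filter (assumed)
]

feature_maps = conv_layer(image, filters)
print(len(feature_maps))  # 2 filters -> 2 feature maps
for fmap in feature_maps:
    print(fmap)
```

In a trained network the filter values are learned rather than written by hand, but the structure is the same: the number of feature maps equals the number of filters.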

2. Pooling Layer

Pooling reduces the size of each feature map while keeping the most important information. The most common type is Max Pooling.

Max Pooling (2×2):

Input feature map:        After 2×2 max pooling:
┌────────────┐            ┌──────┐
│  1   3     │            │      │
│  2   4     │   →        │  4   │
├────────────┤            ├──────┤
│  5   6     │            │      │
│  7   9     │            │  9   │
└────────────┘            └──────┘

Each 2×2 block is replaced by its largest value: the block 1, 3, 2, 4 becomes 4, and the block 5, 6, 7, 9 becomes 9.
Applied to a full 4×4 map, 2×2 pooling shrinks it to 2×2.

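Pooling makes the network less sensitive to exactly where a pattern appears: if an edge shifts by a pixel within a pooling window, the pooled output is unchanged. Stacked across several layers, this tolerance helps a CNN recognize a cat whether it sits a little to the left or a little to the right in the photo.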
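A minimal sketch of 2×2 max pooling in plain Python follows. The function name `max_pool2d` is illustrative, and the 4×4 input is made up for this example (its left two columns reuse the numbers from the diagram above).

```python
def max_pool2d(feature_map, size=2):
    """Replace each non-overlapping size x size block with its largest value."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Illustrative 4x4 feature map (values are assumptions, not from a real network).
feature_map = [
    [1, 3, 2, 0],
    [2, 4, 1, 5],
    [5, 6, 0, 2],
    [7, 9, 3, 1],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 5], [9, 3]]
```

The 16 input values become 4 output values, so the layers that follow have a quarter as many numbers to process.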

3. Flattening

After several convolutional and pooling layers, the 2D feature maps are flattened into a 1D list of numbers. This vector feeds into standard dense layers for classification.

Feature Map (4×4×32):
  ↓
Flatten
  ↓
Vector of 512 numbers → Dense Layer → Output
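Flattening changes only the shape of the data, not the values. The sketch below builds a placeholder 4×4×32 stack of feature maps (all zeros, since only the shape matters here) and unrolls it into one list. The variable names are illustrative.

```python
height, width, channels = 4, 4, 32

# Stand-in feature maps: a height x width grid with `channels` values per cell.
feature_maps = [
    [[0.0 for _ in range(channels)] for _ in range(width)]
    for _ in range(height)
]

# Flatten: walk the nested structure and collect every number into one list.
vector = [value for row in feature_maps for cell in row for value in cell]

print(len(vector))  # 512 = 4 * 4 * 32
```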

4. Fully Connected Layer

The final dense layers combine all the detected features to make a classification decision. This is identical to the dense layers from a standard neural network.
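As a rough sketch of that final step, the code below passes a tiny flattened feature vector through one dense layer and then applies softmax, the function used by the output layer in the architecture that follows. The feature values, weights, and biases are hand-picked for illustration; in a real network they would be learned. The helper names `dense` and `softmax` are assumptions for this example.

```python
import math

def dense(inputs, weights, biases):
    """One output per neuron: weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical flattened features and weights for 3 classes.
features = [0.5, 1.0, -0.5, 2.0]
weights = [
    [0.2, 0.8, -0.5, 1.0],
    [0.5, -0.3, 0.1, 0.2],
    [-0.4, 0.1, 0.6, -0.2],
]
biases = [0.1, 0.0, -0.1]

scores = dense(features, weights, biases)
probabilities = softmax(scores)
print([round(p, 3) for p in probabilities])
```

The class with the largest raw score also receives the largest probability; softmax only rescales the scores so that they are positive and sum to 1.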

Full CNN Architecture for Image Classification

Input Image (224×224×3)
      ↓
Conv Layer (32 filters, 3×3) → 32 Feature Maps
      ↓
Max Pooling (2×2) → Height and width of each feature map are halved
      ↓
Conv Layer (64 filters, 3×3) → 64 Feature Maps (more complex patterns)
      ↓
Max Pooling (2×2) → Shrink again
      ↓
Flatten → Long 1D vector
      ↓
Dense Layer (128 neurons, ReLU)
      ↓
Output Layer (Softmax) → Probabilities for each class
      ↓
"Dog: 91%, Cat: 7%, Bird: 2%"

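To see how the data shrinks through this pipeline, the sketch below traces the shape after each layer. The lesson does not say whether the convolutions use padding, so this assumes none (each 3×3 convolution trims one pixel from every border) and a step of one pixel. The helper names are illustrative.

```python
def conv_out(size, kernel=3):
    """Output width/height of a stride-1 convolution with no padding."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Output width/height of non-overlapping pooling (rounds down)."""
    return size // window

size, channels = 224, 3
print(f"input:   {size}x{size}x{channels}")

for filters in (32, 64):
    size, channels = conv_out(size), filters
    print(f"conv:    {size}x{size}x{channels}")
    size = pool_out(size)
    print(f"pool:    {size}x{size}x{channels}")

flat = size * size * channels
print(f"flatten: {flat}")
```

Under these assumptions the spatial size goes 224 → 222 → 111 → 109 → 54, and the flattened vector holds 54 × 54 × 64 = 186,624 numbers. If the convolutions were padded to preserve size, the sequence would instead be 224 → 112 → 56.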
Famous CNN Architectures

Architecture	Year	Known For
LeNet-5	1998	Early practical CNN for handwritten digit recognition
AlexNet	2012	Won the ImageNet challenge (ILSVRC 2012), showed that deep CNNs work at scale
VGGNet	2014	Simple and deep: 16 to 19 layers
ResNet	2015	Skip connections that allow training networks with 100+ layers
EfficientNet	2019	High accuracy with fewer parameters

Real-World CNN Applications

Face Unlock — iPhone and Android unlock using CNN-based face recognition
Medical Imaging — CNNs detect tumors in X-rays, CT scans, and MRIs
Self-Driving Cars — cameras feed CNN models that detect road signs, pedestrians, and lanes
Quality Control — factory cameras use CNNs to spot defective products on assembly lines
Satellite Analysis — CNNs identify roads, buildings, and crops from satellite images

Key Terms

Convolutional Layer — slides filters across input to produce feature maps
Filter / Kernel — a small matrix that detects one specific pattern
Feature Map — the output of applying a filter to an image
Max Pooling — reduces feature map size by keeping only the maximum value in each region
Flattening — converts 2D feature maps into a 1D vector for dense layers
Stride — how many pixels the filter moves each step (controls output size)
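
The link between stride and output size can be written as a one-line formula. The sketch below is a common form of it; the `padding` parameter (extra border pixels added around the input) is not covered in this lesson and is included only as an assumption, defaulting to zero. The function name is illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of positions a filter can take along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# The 6x6 image and 3x3 filter from the convolution diagram.
print(conv_output_size(6, 3, stride=1))  # 4
print(conv_output_size(6, 3, stride=2))  # 2

# With one pixel of padding, a 3x3 filter keeps a 224-pixel side at 224.
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
```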
