Deep Learning Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are the architecture behind image recognition, video analysis, and medical imaging. They are designed to process grid-like data — especially images — far more efficiently than a standard neural network. Every time your phone identifies a face or an app reads a handwritten digit, a CNN is likely doing the work.
The Problem with Using Standard Networks for Images
A 224×224 color image contains 224 × 224 × 3 = 150,528 numbers. If you connect all of these directly to 1,000 hidden neurons, you need 150,528 × 1,000 = 150,528,000 weights (about 150 million) in just the first layer. That is slow, expensive, and prone to overfitting.
CNNs solve this by sharing weights across the image using a technique called convolution.
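The arithmetic above can be checked in a few lines. The convolutional layer used for comparison (32 filters of size 3×3 over 3 color channels) is an assumption chosen for illustration, and biases are left out on both sides.

```python
# Weights needed when every input value connects to every hidden neuron.
inputs = 224 * 224 * 3          # 150,528 numbers in the image
hidden_neurons = 1_000
dense_weights = inputs * hidden_neurons

# Assumed for illustration: 32 filters, each 3×3 and spanning 3 color channels.
# The same filter weights are reused at every position in the image.
num_filters = 32
conv_weights = num_filters * (3 * 3 * 3)

print(f"Dense layer weights: {dense_weights:,}")
print(f"Conv layer weights:  {conv_weights:,}")
```

Because the filter weights are shared across positions, the count for the convolutional layer does not depend on the image size at all.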
The Core Idea: A Sliding Window
Instead of connecting every pixel to every neuron, a CNN slides a small filter (also called a kernel) across the image. The filter looks at a tiny patch of pixels at a time and detects one specific pattern — like an edge, a curve, or a color gradient.
Convolution Diagram
Image (6×6 pixels):        Filter (3×3):
┌───────────────────┐      ┌──────────┐
│ 1  1  1  0  0  0  │      │ 1  0 -1  │
│ 1  1  1  0  0  0  │  ×   │ 1  0 -1  │
│ 1  1  1  0  0  0  │      │ 1  0 -1  │
│ 0  0  0  1  1  1  │      └──────────┘
│ 0  0  0  1  1  1  │      (detects vertical edges)
│ 0  0  0  1  1  1  │
└───────────────────┘

The filter slides across every 3×3 patch of the image.
At each position: multiply, sum → one output number.
Result: a Feature Map showing where vertical edges appear.
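Here is a minimal sketch of that sliding-window computation in plain Python, using the 6×6 image and the 3×3 vertical-edge filter from the diagram. The function name `convolve2d` is made up for this example, and the sketch assumes a stride of 1, no padding, and a square filter. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the patch by the kernel element-wise, then sum.
            total = sum(image[i + r][j + c] * kernel[r][c]
                        for r in range(k) for c in range(k))
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
vertical_edge = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The 6×6 input yields a 4×4 feature map. The nonzero values sit in the middle columns, where the image switches between 1s and 0s, and the sign shows the direction of the change.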
What Filters Learn to Detect
| Early layers (shallow) | Middle layers | Deep layers |
|---|---|---|
| Horizontal edges | Curves | Ears |
| Vertical edges | Textures | Eyes |
| Diagonal edges | Simple shapes | Faces |
| Color gradients | Patterns | Objects |
Each filter specializes in detecting one feature. A network trained on faces might have thousands of filters, each finding a different facial element at different scales.
Key CNN Components
1. Convolutional Layer
This is the core layer: it applies multiple filters to its input, and each filter produces a separate feature map. For example, 32 filters produce 32 feature maps, each highlighting a different pattern in the image.
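To illustrate "one feature map per filter", the sketch below applies two hand-written filters to the same 6×6 image. In a real network the filter values are learned during training; the two edge filters and the helper `convolve2d` are assumptions made for this example, with stride 1 and no padding.

```python
def convolve2d(image, kernel):
    """Stride-1, no-padding sliding window with a square kernel."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[i + r][j + c] * kernel[r][c]
                 for r in range(k) for c in range(k))
             for j in range(out_w)]
            for i in range(out_h)]

image = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-written filters for illustration; a trained layer learns its own.
filters = {
    "vertical_edge":   [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "horizontal_edge": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
}

# One feature map per filter, all computed from the same input.
feature_maps = [convolve2d(image, kernel) for kernel in filters.values()]
print(len(feature_maps), "feature maps of size",
      len(feature_maps[0]), "x", len(feature_maps[0][0]))
```

A layer with 32 filters works the same way and simply produces a stack of 32 maps.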
2. Pooling Layer
Pooling reduces the size of each feature map while keeping the most important information. The most common type is Max Pooling.
Max Pooling (2×2):

Input feature map            After 2×2 max pooling:
(two 2×2 blocks shown):
┌────────────┐               ┌──────┐
│   1   3    │               │  4   │
│   2   4    │       →       │      │
│            │               │      │
│   5   6    │               │  9   │
│   7   9    │               └──────┘
└────────────┘

Each 2×2 block is replaced by its largest value, so a 4×4 map shrinks to 2×2.
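Pooling makes the network less sensitive to exactly where a pattern appears: shifting a feature by a pixel or two often leaves the pooled output unchanged. Repeated over several layers, this helps the network treat a cat as a cat whether it's on the left or right side of the photo.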
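A minimal sketch of 2×2 max pooling in plain Python follows. The function name `max_pool_2x2` and the 4×4 values are made up for illustration, and the windows are assumed not to overlap (stride 2).

```python
def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2×2 block with its largest value."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            block = (feature_map[i][j], feature_map[i][j + 1],
                     feature_map[i + 1][j], feature_map[i + 1][j + 1])
            row.append(max(block))
        pooled.append(row)
    return pooled

# Illustrative 4×4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [2, 4, 1, 5],
    [5, 6, 3, 2],
    [7, 9, 1, 8],
]

pooled = max_pool_2x2(feature_map)
for row in pooled:
    print(row)
```

The 4×4 input becomes a 2×2 output, and each output value is the maximum of one block, so moving a large value around inside its block does not change the result.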
3. Flattening
After several convolutional and pooling layers, the stack of 2D feature maps is flattened into a 1D list of numbers. This vector feeds into standard dense layers for classification.
Feature Map (4×4×32):
↓
Flatten
↓
Vector of 512 numbers → Dense Layer → Output
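As a quick check of the 4 × 4 × 32 = 512 arithmetic, the sketch below builds a nested list with that shape and flattens it. The values are placeholder zeros, since only the shape matters here.

```python
height, width, channels = 4, 4, 32

# Placeholder feature maps: a height × width × channels nested list of zeros.
feature_maps = [[[0.0 for _ in range(channels)]
                 for _ in range(width)]
                for _ in range(height)]

# Flatten: walk every row, every position, every channel, in order.
vector = [value
          for row in feature_maps
          for position in row
          for value in position]

print(len(vector))
```

Flattening only rearranges the numbers into one long list. No values are added, dropped, or changed.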
4. Fully Connected Layer
The final dense layers combine all the detected features to make a classification decision. This is identical to the dense layers from a standard neural network.
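The following is a small sketch of that final step: one dense layer followed by softmax. The feature values, weights, and biases are hypothetical numbers picked for illustration (4 flattened features, 3 classes); a trained network would have learned its own.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical flattened features and layer parameters.
features = [0.5, 1.0, 0.0, 2.0]
weights = [
    [1.0, 0.5, 0.0, 1.0],    # weights for class 0
    [0.0, 1.0, 1.0, 0.0],    # weights for class 1
    [0.5, 0.0, 1.0, -0.5],   # weights for class 2
]
biases = [0.0, 0.1, 0.0]

scores = dense(features, weights, biases)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
print(scores, probabilities, predicted_class)
```

The class with the highest score also gets the highest probability, so softmax changes how the scores are expressed without changing which class wins.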
Full CNN Architecture for Image Classification
Input Image (224×224×3)
↓
Conv Layer (32 filters, 3×3) → 32 Feature Maps
↓
Max Pooling (2×2) → Width and height of each feature map halve
↓
Conv Layer (64 filters, 3×3) → 64 Feature Maps (more complex patterns)
↓
Max Pooling (2×2) → Shrink again
↓
Flatten → Long 1D vector
↓
Dense Layer (128 neurons, ReLU)
↓
Output Layer (Softmax) → Probabilities for each class
↓
"Dog: 91%, Cat: 7%, Bird: 2%"
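To make the sizes in this pipeline concrete, the sketch below traces the shape of the data through the layers. It assumes stride 1 and no padding for the convolutions and non-overlapping 2×2 pooling; the diagram does not specify these settings. Many frameworks also offer "same" padding, which would keep the convolution output at the input size. The helper names are hypothetical.

```python
def conv_size(size, kernel=3, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_size(size, window=2):
    """Output width/height of non-overlapping pooling along one dimension."""
    return size // window

side, channels = 224, 3
trace = [("Input", side, side, channels)]

for filters in (32, 64):
    # Convolution: spatial size shrinks slightly, depth becomes the filter count.
    side, channels = conv_size(side), filters
    trace.append((f"Conv {filters}", side, side, channels))
    # Pooling: width and height halve, depth is unchanged.
    side = pool_size(side)
    trace.append(("Max Pool", side, side, channels))

flattened = side * side * channels

for name, h, w, c in trace:
    print(f"{name:<10} {h}×{w}×{c}")
print(f"{'Flatten':<10} {flattened}")
```

Under these assumptions the flattened vector has 186,624 values, so the first dense layer holds most of the network's weights.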
Famous CNN Architectures
| Architecture | Year | Known For |
|---|---|---|
| LeNet-5 | 1998 | First practical CNN, handwritten digit recognition |
| AlexNet | 2012 | Won ImageNet, proved deep CNNs work |
| VGGNet | 2014 | Simple and deep — 16 to 19 layers |
| ResNet | 2015 | Skip connections, which make 100+ layer networks trainable |
| EfficientNet | 2019 | High accuracy with fewer parameters |
Real-World CNN Applications
- Face Unlock — iPhone and Android unlock using CNN-based face recognition
- Medical Imaging — CNNs detect tumors in X-rays, CT scans, and MRIs
- Self-Driving Cars — cameras feed CNN models that detect road signs, pedestrians, and lanes
- Quality Control — factory cameras use CNNs to spot defective products on assembly lines
- Satellite Analysis — CNNs identify roads, buildings, and crops from satellite images
Key Terms
- Convolutional Layer — slides filters across input to produce feature maps
- Filter / Kernel — a small matrix that detects one specific pattern
- Feature Map — the output of applying a filter to an image
- Max Pooling — reduces feature map size by keeping only the maximum value in each region
- Flattening — converts 2D feature maps into a 1D vector for dense layers
- Stride — how many pixels the filter moves each step (controls output size)
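To see how stride controls output size, the sketch below lists where the filter window starts along one dimension. The function name `window_starts` is made up for this example, and no padding is assumed.

```python
def window_starts(input_size, filter_size, stride):
    """Left edge of each filter position along one dimension (no padding)."""
    return list(range(0, input_size - filter_size + 1, stride))

# A 3-wide filter on a 6-wide input.
stride_1 = window_starts(6, 3, 1)
stride_2 = window_starts(6, 3, 2)

print("stride 1:", stride_1, "→ output width", len(stride_1))
print("stride 2:", stride_2, "→ output width", len(stride_2))
```

Each start position yields one output value, so a larger stride means fewer positions and a smaller feature map.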
