CV Convolutional Neural Networks

A Convolutional Neural Network (CNN) is the core deep learning architecture for computer vision tasks. Trained on large datasets, it learns on its own to extract features from images: edges, textures, shapes, and objects. CNNs power modern image classification, object detection, and segmentation.

The Problem with Regular Neural Networks for Images

A standard (fully connected) neural network connects every neuron to every input pixel. For a 224×224 color image, that is 150,528 input values. The first layer alone would need millions of parameters — computationally impractical and prone to overfitting. CNNs solve this by sharing parameters across the image using convolution.

Fully Connected vs. CNN

FULLY CONNECTED (bad for images):
  224×224×3 = 150,528 pixels
  First layer has 1000 neurons
  Parameters = 150,528 × 1000 = 150,528,000 (about 150 MILLION), just in layer 1!

CNN (efficient):
  Same image
  First layer: 32 filters, each 3×3×3 = 27 values
  Parameters = 32 × 27 = 864 weights, ignoring biases (over 99.999% fewer!)

CNN achieves this by:
  1. Using small filters shared across all positions (parameter sharing)
  2. Connecting each neuron only to a small local region (local connectivity)
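
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that counts weights only and leaves out bias terms, as the figures above do; the variable names are illustrative.

```python
# Weight counts for the first layer of each design (biases ignored).
height, width, channels = 224, 224, 3
inputs = height * width * channels             # one value per pixel per color channel

fc_neurons = 1000
fc_params = inputs * fc_neurons                # every neuron connects to every input value

num_filters, kernel_size = 32, 3
conv_params = num_filters * kernel_size * kernel_size * channels  # one small filter reused everywhere

print(inputs)                                  # 150528
print(fc_params)                               # 150528000
print(conv_params)                             # 864
print(f"{1 - conv_params / fc_params:.4%}")    # 99.9994%
```

Adding one bias per filter would bring the convolutional layer to 896 parameters, which does not change the comparison.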

The Convolutional Layer

The core operation in a CNN is convolution. A small filter (kernel) slides across the image, computing a weighted sum at each position. Each filter learns to detect a specific pattern — one filter for horizontal edges, another for vertical edges, another for corners, and so on.

Convolutional Layer Diagram

INPUT (4×5 image):        FILTER (3×3):        OUTPUT (2×3 feature map):
+--+--+--+--+--+          +--+--+--+            
| 1| 2| 3| 0| 1|          | 1| 0|-1|            
+--+--+--+--+--+          +--+--+--+           
| 4| 5| 6| 1| 2|          | 2| 0|-2|           
+--+--+--+--+--+     →    +--+--+--+    →    [Feature Map]
| 7| 8| 9| 2| 3|          | 1| 0|-1|           
+--+--+--+--+--+          +--+--+--+           
| 0| 1| 2| 3| 4|
+--+--+--+--+--+

Position (0,0):
  (1×1 + 2×0 + 3×(-1)) +
  (4×2 + 5×0 + 6×(-2)) +
  (7×1 + 8×0 + 9×(-1))
  = (1+0−3) + (8+0−12) + (7+0−9) = −8 → feature map[0,0] = −8

A negative value means the right side of the window is brighter than the left; the larger its magnitude, the stronger the vertical edge.
Repeat for all positions (stride 1, no padding) → builds the full feature map.
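
The sliding-window computation can be sketched in plain Python. The function name conv2d_valid is made up for this example, and it assumes stride 1 and no padding. Like most deep learning libraries, it applies the filter without flipping it. The input is the grid of numbers from the diagram above.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    return the weighted sum at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 2, 3, 0, 1],
    [4, 5, 6, 1, 2],
    [7, 8, 9, 2, 3],
    [0, 1, 2, 3, 4],
]
kernel = [
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map[0][0])  # -8
print(feature_map)        # [[-8, 16, 16], [-8, 14, 14]]
```

The positive values appear where the window covers the drop from 3, 6, 9 to 0, 1, 2, so the left side of the window is brighter than the right.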

Multiple Filters = Multiple Feature Maps

32 different filters applied to the same input:
  Filter 1 → Feature map 1 (detects horizontal edges)
  Filter 2 → Feature map 2 (detects vertical edges)
  Filter 3 → Feature map 3 (detects diagonal edges)
  ...
  Filter 32 → Feature map 32

Stacked together: output has depth 32.
Input (H × W × 3) → Conv layer → Output (H × W × 32), assuming the input is padded so that height and width are preserved
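
Whether the output keeps the input's height and width depends on padding and stride. The sketch below uses the standard output-size formula, (n + 2·padding − kernel) / stride + 1 rounded down; the helper name conv_output_size is an assumption made for this example.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one dimension for a sliding window."""
    return (n + 2 * padding - kernel) // stride + 1

# 3x3 filter with no padding: each side shrinks by 2.
print(conv_output_size(224, kernel=3))             # 222

# 3x3 filter with padding 1: height and width are preserved.
print(conv_output_size(224, kernel=3, padding=1))  # 224

# A 3x3 filter on a 5-wide input with no padding gives 3 positions.
print(conv_output_size(5, kernel=3))               # 3
```

The depth of the output does not come from this formula: it equals the number of filters, 32 in the example above.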

Activation Function: ReLU

After each convolution, the output passes through an activation function. The most common is ReLU (Rectified Linear Unit). ReLU replaces all negative values with zero and keeps positive values unchanged. This introduces non-linearity, allowing the network to learn complex patterns.

ReLU Function

Input → ReLU → Output:
  −8   →    0  (negative → zero)
  −3   →    0
   0   →    0
   2   →    2  (positive → unchanged)
   7   →    7
  15   →   15

Graph:
  Output │      ╱
         │     ╱
       0 │────╱
         │
         └──────── Input
               0

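ReLU is fast to compute and helps mitigate the vanishing gradient problem that plagued earlier activation functions like sigmoid: its slope is exactly 1 for every positive input, whereas the sigmoid's slope never exceeds 0.25.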
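
A minimal sketch of ReLU in plain Python, applied to the values from the table above. The helpers relu_grad and sigmoid_grad are illustrative additions that compare the slopes of the two activations; the slope of ReLU at exactly 0 is set to 0 here by convention.

```python
import math

def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    """Slope of the sigmoid function at x."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

values = [-8, -3, 0, 2, 7, 15]
print([relu(v) for v in values])  # [0, 0, 0, 2, 7, 15]

print(relu_grad(7))               # 1.0
print(sigmoid_grad(0))            # 0.25, the largest slope the sigmoid ever has
print(sigmoid_grad(7) < 0.001)    # True: the slope is nearly flat for large inputs
```

Multiplying many slopes of at most 0.25 across layers shrinks the gradient quickly, while slopes of 1 leave it unchanged along active paths.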

Pooling Layer

Pooling reduces the spatial size of feature maps — making the network smaller and faster while keeping the most important information. Max pooling keeps the largest value in each window. Average pooling keeps the average. Both discard fine spatial details but retain the strongest activations.

Max Pooling (2×2, stride 2)

Input (4×4 feature map):
+----+----+----+----+
|  1 |  3 |  2 |  4 |
+----+----+----+----+
|  5 |  6 |  7 |  8 |
+----+----+----+----+
|  9 | 11 | 10 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+

Each 2×2 window keeps its maximum:
  Top-left:     Max(1, 3, 5, 6)     = 6
  Top-right:    Max(2, 4, 7, 8)     = 8
  Bottom-left:  Max(9, 11, 13, 14)  = 14
  Bottom-right: Max(10, 12, 15, 16) = 16

Output (2×2):
+----+----+
|  6 |  8 |
+----+----+
| 14 | 16 |
+----+----+

Size reduced by 4×. The strongest responses (largest values) are kept.
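
Max pooling can be sketched in plain Python as follows. The function name max_pool2d and its defaults are assumptions for this example; the input is the 4×4 grid of numbers shown above, with each 2×2 window covering two adjacent rows and two adjacent columns.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    output = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        output.append(row)
    return output

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [9, 11, 10, 12],
    [13, 14, 15, 16],
]

print(max_pool2d(feature_map))  # [[6, 8], [14, 16]]
```

Replacing max(window) with sum(window) / len(window) would give average pooling instead.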

A Complete CNN Architecture

A typical CNN stacks convolutional, activation, and pooling layers. Early layers detect simple patterns (edges, colors). Deeper layers combine these to detect complex structures (eyes, wheels, text). The final fully connected layers combine all features to make the class prediction. The stack below assumes every 3×3 convolution uses stride 1 and padding 1, so only the pooling layers shrink height and width.

Classic CNN Stack

INPUT IMAGE: 224 × 224 × 3

Layer 1: Conv (64 filters, 3×3) + ReLU → 224 × 224 × 64
Layer 2: Max Pool (2×2)                → 112 × 112 × 64
Layer 3: Conv (128 filters, 3×3) + ReLU → 112 × 112 × 128
Layer 4: Max Pool (2×2)                → 56  × 56  × 128
Layer 5: Conv (256 filters, 3×3) + ReLU → 56  × 56  × 256
Layer 6: Max Pool (2×2)                → 28  × 28  × 256
Layer 7: Conv (512 filters, 3×3) + ReLU → 28  × 28  × 512
Layer 8: Max Pool (2×2)                → 14  × 14  × 512
Layer 9: Flatten                       → 100,352 values
Layer 10: Fully Connected (4096)       → 4096 values
Layer 11: Fully Connected (1000)       → 1000 class scores
Layer 12: Softmax                      → 1000 probabilities

Predicted class: highest probability.
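
The shapes in the stack above can be traced with a short script. This is a sketch under two assumptions: every 3×3 convolution preserves height and width (stride 1, padding 1), and every pooling layer halves them. The softmax at the end runs on three made-up scores rather than the 1000 class scores of the real network.

```python
import math

# (layer kind, number of filters) for layers 1-8 of the stack.
layers = [
    ("conv", 64), ("pool", None),
    ("conv", 128), ("pool", None),
    ("conv", 256), ("pool", None),
    ("conv", 512), ("pool", None),
]

h, w, c = 224, 224, 3
shapes = []
for kind, filters in layers:
    if kind == "conv":
        c = filters              # depth becomes the number of filters
    else:
        h, w = h // 2, w // 2    # 2x2 pooling with stride 2 halves each side
    shapes.append((h, w, c))

flattened = h * w * c
print(shapes[-1])   # (14, 14, 512)
print(flattened)    # 100352

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                         # hypothetical class scores
probs = softmax(scores)
predicted = probs.index(max(probs))
print(predicted)    # 0
```

Softmax preserves the ordering of the scores, so the predicted class is always the one with the highest raw score.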

ResNet: Solving the Vanishing Gradient with Skip Connections

Very deep networks (50+ layers) suffer from vanishing gradients — error signals become too small to update early layers effectively. ResNet (Residual Network) introduces skip connections (also called residual connections) that let gradients flow directly past layers.

Residual Block Diagram

WITHOUT skip connection (standard):
  Input → [Conv + ReLU] → [Conv] → Output

  If gradient becomes tiny here ↑, the network stops learning.

WITH skip connection (residual):
  Input ──────────────────────────────┐
     ↓                                │
  [Conv + ReLU] → [Conv] → + (add) → Output
                             ↑
                   Skip connection adds input directly.

The network learns the RESIDUAL (difference) between output and input.
If nothing useful is learned → just pass through identity (skip).
→ Enables training of 50, 100, even 152+ layer networks.
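
The idea of a residual block can be sketched without any deep learning library. In this simplified example the input is a plain list of numbers instead of a feature map, and the function passed as transform is a hypothetical stand-in for the [Conv + ReLU] → [Conv] path in the diagram above.

```python
def residual_block(x, transform):
    """Return transform(x) + x, element by element (the skip connection)."""
    fx = transform(x)
    return [fi + xi for fi, xi in zip(fx, x)]

def zero_transform(x):
    # Stand-in for layers that have learned nothing useful: output is all zeros.
    return [0.0 for _ in x]

def toy_transform(x):
    # Hypothetical stand-in for the learned layers: scale, then drop negatives.
    return [max(0.0, 0.5 * xi) for xi in x]

x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_transform))  # [1.0, -2.0, 3.0]
print(residual_block(x, toy_transform))   # [1.5, -2.0, 4.5]
```

With zero_transform the block reduces to the identity, which is the pass-through case described above. Because the output is x + F(x), its derivative with respect to x is 1 + F'(x), so a direct gradient path remains even when F'(x) is very small.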

Notable CNN Architectures

Architecture	Year	Depth	Key Innovation
LeNet-5	1998	5 layers	First practical CNN for handwriting
AlexNet	2012	8 layers	GPU training, ReLU, Dropout
VGGNet	2014	16–19 layers	Only 3×3 convolutions, very deep
GoogLeNet	2014	22 layers	Inception modules, multiple filter sizes
ResNet	2015	50–152 layers	Skip connections; ImageNet top-5 error below the commonly cited human estimate
EfficientNet	2019	Scaled	Balances depth, width, and resolution

Key Takeaways

CNNs use small shared filters — far fewer parameters than fully connected networks.
Convolutional layers detect patterns using learned filters slid across the image.
ReLU activation sets negative values to zero; this non-linearity is what lets the network learn complex patterns.
Pooling layers reduce spatial size while retaining strong activations.
Early CNN layers detect edges and textures; deeper layers detect object parts and objects.
ResNet's skip connections give gradients a direct path past layers, making very deep networks trainable.
