Computer Vision Object Detection

Object detection locates one or more objects in an image and identifies what each object is. It answers two questions at once: "What is it?" and "Where is it?" The answers are returned as a class label and a bounding box for each object.

Classification vs. Detection

Image classification assigns one label to the entire image. Object detection finds each object and draws a box around it. Classification might give an image the single label "cat", whereas detection might return both "cat at position (x1, y1, x2, y2)" and "dog at position (x1, y1, x2, y2)" for the same image, each with its own coordinates.

Output Comparison

INPUT IMAGE: A living room photo with a cat, a dog, and a sofa.

CLASSIFICATION OUTPUT:
  "Living room"  (only one label for the whole image)

DETECTION OUTPUT:
  Cat    → box: (50, 80, 200, 250)
  Dog    → box: (300, 100, 500, 350)
  Sofa   → box: (20, 200, 600, 400)

  ┌──────────────────────────────┐
  │  ┌─────┐        ┌─────┐      │
  │  │ Cat │        │ Dog │      │
  │  └─────┘        └─────┘      │
  │  ┌──────────────────────┐    │
  │  │         Sofa         │    │
  │  └──────────────────────┘    │
  └──────────────────────────────┘
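
As a rough illustration of the difference in output shape, the comparison above could be represented in Python as follows. The Detection class and the variable names are hypothetical, chosen only for this sketch; the labels and boxes are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


# Classification: a single label for the whole image.
classification_output = "Living room"

# Detection: one (label, box) pair per object found.
detection_output = [
    Detection("Cat", (50, 80, 200, 250)),
    Detection("Dog", (300, 100, 500, 350)),
    Detection("Sofa", (20, 200, 600, 400)),
]

for det in detection_output:
    x1, y1, x2, y2 = det.box
    print(f"{det.label}: {x2 - x1} x {y2 - y1} pixels")
```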

The Bounding Box

A bounding box is a rectangle that surrounds a detected object. It is defined by four numbers: either the x and y coordinates of the top-left corner plus the width and height of the rectangle, or the x and y coordinates of the top-left and bottom-right corners. Coordinates are in pixels, measured from the top-left corner of the image, with y increasing downward.

Bounding Box Coordinates

Image (600 × 400 pixels)

(0,0) ─────────────────────── (600,0)
  │                                 │
  │    (80, 50)                     │
  │       ┌──────────────┐          │
  │       │    Object    │          │
  │       │              │height=150│
  │       └──────────────┘          │
  │              (280, 200)         │
  │       width = 200               │
  │                                 │
(0,400)───────────────────── (600,400)

Box = [x=80, y=50, width=200, height=150]
  or  [x1=80, y1=50, x2=280, y2=200]
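
Converting between the two formats is simple arithmetic. The helper names below are made up for this sketch; the numbers are the ones from the diagram above.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


print(xywh_to_xyxy((80, 50, 200, 150)))  # (80, 50, 280, 200)
print(xyxy_to_xywh((80, 50, 280, 200)))  # (80, 50, 200, 150)
```

Mixing up the two formats is an easy mistake to make, so it helps to note which one a given dataset or library expects.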

Sliding Window: The Classic Approach

Before deep learning, object detection commonly used a sliding window. A fixed-size window slides across the image at every position, and the image is rescaled repeatedly so that objects of different sizes fit the window. At each position, a classifier decides: "Does this patch contain the object?" Windows whose confidence score exceeds a threshold are reported as detections.

Sliding Window Process

Step 1: Set window size (e.g., 64×128 for a pedestrian detector).

Step 2: Slide window across image:
  ┌────────────────────────────────┐
  │ [W]→→→→→→→→→→→→→→→→→→→→→→→  │
  │  ↓                             │
  │ [W]→→→→→→→→→→→→→→→→→→→→→→→  │
  │  ↓                             │
  │  ... (thousands of positions)  │
  └────────────────────────────────┘

Step 3: At each position, extract a HOG (Histogram of Oriented Gradients) descriptor from [W].

Step 4: Pass the HOG descriptor through an SVM (support vector machine) classifier.
  → "Person" confidence: 0.87  (above threshold 0.5 → detect!)
  → "Person" confidence: 0.12  (below threshold → skip)

Step 5: Scale image down (e.g., by 0.8×) and repeat.
  → Detects people of different sizes.

Problem: Very slow. Thousands of positions × multiple scales means tens of thousands to millions of classifier checks per image, depending on image size and stride.
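
To get a feel for the cost, the sketch below only enumerates window positions and counts them; it does not compute HOG features or run an SVM. The 8-pixel stride and the function names are assumptions made for this example; the 64×128 window and the 0.8× scale step come from the steps above.

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128, stride=8):
    """Yield (x1, y1, x2, y2) for every window that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


def count_checks(img_w, img_h, win_w=64, win_h=128, stride=8, scale=0.8):
    """Count classifier calls over an image pyramid (Step 5)."""
    total = 0
    w, h = img_w, img_h
    # Keep shrinking the image until the window no longer fits.
    while w >= win_w and h >= win_h:
        total += sum(1 for _ in sliding_windows(w, h, win_w, win_h, stride))
        w, h = int(w * scale), int(h * scale)
    return total


# A 600 x 400 image at one scale, stride 8: 68 x 35 = 2380 window positions.
print(len(list(sliding_windows(600, 400))))
print(count_checks(600, 400))             # all pyramid levels, stride 8
print(count_checks(600, 400, stride=1))   # a dense scan is far more expensive
```

Each counted position stands for one feature extraction plus one classifier call, which is why the approach scales poorly.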

IoU: Intersection over Union

IoU measures how well a predicted bounding box matches the true (ground truth) bounding box. It equals the area where the two boxes overlap divided by the total area covered by both boxes together (their union). An IoU of 1.0 is a perfect match, and an IoU of 0 means the boxes do not overlap at all. An IoU above 0.5 is commonly treated as a correct detection.

IoU Calculation

Ground truth box:  ┌────────────────┐
                   │                │
                   │   True box     │
                   │       ┌────────┼────┐
                   │       │Overlap │    │
                   └───────┼────────┘    │
                           │Predicted box│
                           └─────────────┘

IoU = Area of Overlap / Area of Union

Example:
  Overlap area = 600 sq pixels
  Union area   = 1400 sq pixels
  IoU = 600 / 1400 ≈ 0.43   → Moderate match (below the common 0.5 threshold)
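
A minimal sketch of the calculation in Python, assuming boxes in (x1, y1, x2, y2) format. The two example boxes are made up so that the overlap is 600 square pixels and the union is 1400, matching the example above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so non-overlapping boxes give zero overlap.
    overlap = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap

    return overlap / union if union > 0 else 0.0


true_box = (0, 0, 40, 25)        # area 1000
predicted_box = (10, 5, 50, 30)  # area 1000, overlap 30 x 20 = 600
print(round(iou(true_box, predicted_box), 2))  # 0.43
```

The union is computed as the sum of both areas minus the overlap, so the shared region is not counted twice.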

Non-Maximum Suppression (NMS)

Object detectors often produce multiple overlapping boxes for the same object. For each object, NMS keeps the highest-scoring box and removes the overlapping duplicates.

NMS Step by Step

Detected boxes in a scene with two cars (with confidence scores):
  Box A: confidence 0.92  ← highest
  Box B: confidence 0.85  ← overlaps heavily with A
  Box C: confidence 0.78  ← overlaps heavily with A
  Box D: confidence 0.91  ← far from A (different car)

Step 1: Pick box with highest score → Box A (0.92). Keep it.

Step 2: Remove all boxes with IoU > 0.5 with Box A:
  → Box B (IoU=0.82 with A) → REMOVE
  → Box C (IoU=0.71 with A) → REMOVE
  → Box D (IoU=0.08 with A) → KEEP (different object)

Step 3: From remaining boxes, pick highest → Box D (0.91). Keep.

Result: Two final boxes, one per car.
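
The steps above can be sketched as a simple greedy loop. This is an illustrative version, not an optimized one; the box coordinates are invented so that the IoU values with Box A come out close to those in the walkthrough (about 0.82, 0.71, and 0.08).

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    overlap = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (name, box, score) tuples."""
    # Highest confidence first.
    remaining = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # Step 1 / Step 3: pick the best box
        kept.append(best)
        # Step 2: drop every box that overlaps the kept box too much.
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    ("A", (100, 100, 200, 200), 0.92),
    ("B", (105, 105, 205, 205), 0.85),
    ("C", (110, 108, 210, 208), 0.78),
    ("D", (160, 163, 260, 263), 0.91),
]
print([name for name, _, _ in nms(detections)])  # ['A', 'D']
```

In a multi-class detector, NMS is usually applied separately per class so that a car box does not suppress a nearby person box.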

YOLO: You Only Look Once

YOLO reframed object detection as a single regression problem. Instead of scanning the image multiple times, YOLO divides the image into a grid and predicts boxes and class probabilities for all cells simultaneously, in one forward pass through a neural network.

YOLO Grid Prediction

Image divided into S×S grid (e.g., 7×7 = 49 cells):

+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+
|     |     |  ★  |     |     |     |     |  ← ★ = car center in this cell
+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |  ◆  |     |     |  ← ◆ = person center
+-----+-----+-----+-----+-----+-----+-----+
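|     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+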

Each cell predicts:
  • B bounding boxes (with x, y, w, h, confidence)
  • C class probabilities (car=0.95, person=0.02, ...)

Total output: 7 × 7 × (B×5 + C) numbers — all in one pass!

Speed: 45+ frames per second on a GPU (the figure reported for the original YOLO).
Use: Real-time video detection.
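
As a small worked example, the sketch below computes the size of the output and finds which grid cell an object's center falls into. The values B = 2, C = 20, and the 448×448 input size are the settings from the original YOLO paper's PASCAL VOC setup and are not stated in the text above; the function names and the example center point are made up for illustration.

```python
def yolo_output_size(S, B, C):
    """Number of values predicted for an S x S grid."""
    # Each cell: B boxes x (x, y, w, h, confidence) + C class probabilities.
    return S * S * (B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the grid cell that contains the center (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col


print(yolo_output_size(S=7, B=2, C=20))           # 7 * 7 * 30 = 1470
print(responsible_cell(180, 100, 448, 448, S=7))  # (1, 2)
```

With rows and columns counted from zero, cell (1, 2) is the second row and third column, the same position as the ★ in the grid above.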

YOLO Version Improvements

Version	Key Improvement	Speed / Notes
YOLOv1 (2016)	First one-pass detection	45 fps
YOLOv3 (2018)	Multi-scale detection	30 fps, better on small objects
YOLOv5 (2020)	PyTorch implementation, easy to use	Very fast, popular
YOLOv8 (2023)	Unified framework for multiple tasks (detection, segmentation, classification)	Strong speed and accuracy at release

Two-Stage Detectors: Faster R-CNN

Faster R-CNN uses two stages. The first stage, a Region Proposal Network (RPN), suggests regions likely to contain objects. The second stage classifies each proposed region and refines its bounding box. This approach is slower than YOLO but has typically been more accurate, especially for small objects.

Two-Stage Pipeline

Stage 1 – Region Proposal Network (RPN):
  [Full image] → CNN → Feature Map
  Feature Map → RPN → ~2000 candidate boxes
  (Boxes where the network thinks objects might be)

Stage 2 – Detection Head:
  For each candidate box:
    Extract its features from the feature map.
    Classify: Car, Person, Dog, Background?
    Refine: Adjust box coordinates precisely.

Result: Accurate boxes with class labels.
Trade-off: Slower than YOLO (5–7 fps vs. 45+ fps).
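
The "Refine" step in Stage 2 can be illustrated with a small sketch. The text above does not say how the adjustment is computed; the code below assumes the offset parameterization commonly used in the R-CNN family, where the network predicts four numbers (dx, dy, dw, dh) relative to the proposal's center and size. The function name and example values are hypothetical.

```python
import math


def apply_deltas(proposal, deltas):
    """Adjust an (x1, y1, x2, y2) proposal by predicted (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2

    dx, dy, dw, dh = deltas
    # Shift the center by a fraction of the proposal's size.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # Scale width and height; exp() keeps them positive.
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)

    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)


proposal = (100, 100, 200, 200)
# Zero offsets leave the proposal unchanged.
print(apply_deltas(proposal, (0.0, 0.0, 0.0, 0.0)))
# Shift right by 10% of the width and widen the box slightly.
print(apply_deltas(proposal, (0.1, 0.0, 0.2, 0.0)))
```

Predicting offsets relative to the proposal, rather than absolute pixel coordinates, keeps the regression targets on a similar scale for small and large objects.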

Real-World Applications

Retail analytics – Count customers, detect empty shelves.
Traffic monitoring – Count and classify vehicles by type.
Sports broadcasting – Automatically track ball and players.
Wildlife conservation – Detect and count animals from aerial drone footage.
Industrial inspection – Spot defective parts on a production line.

Key Takeaways

Object detection outputs both a class label and a bounding box for each detected object.
IoU measures the accuracy of a predicted bounding box — higher is better.
Non-Maximum Suppression removes duplicate overlapping boxes, keeping only the best.
YOLO detects all objects in one pass — fast enough for real-time video.
Faster R-CNN uses two stages (propose, then classify): slower, but typically more accurate.
