Computer Vision Object Detection
Object detection locates one or more objects in an image and identifies what each object is. It answers two questions at once: "What is it?" and "Where is it?" The answers are represented as a class label and a bounding box.
Classification vs. Detection
Image classification says what is in the entire image. Object detection finds every object and draws a box around each one. An image might have one label ("cat") in classification, but detection might return "cat at position (x1, y1, x2, y2)" and "dog at position (x1, y1, x2, y2)" simultaneously.
Output Comparison
INPUT IMAGE: A living room photo with a cat, a dog, and a sofa.
CLASSIFICATION OUTPUT: "Living room" (only one label for the whole image)
DETECTION OUTPUT:
  Cat → box: (50, 80, 200, 250)
  Dog → box: (300, 100, 500, 350)
  Sofa → box: (20, 200, 600, 400)
┌──────────────────────────────┐
│  ┌─────┐      ┌─────┐        │
│  │ Cat │      │ Dog │        │
│  └─────┘      └─────┘        │
│  ┌──────────────────────┐    │
│  │         Sofa         │    │
│  └──────────────────────┘    │
└──────────────────────────────┘
The Bounding Box
A bounding box is a rectangle that surrounds a detected object. It is defined by four numbers: the x and y coordinates of the top-left corner, and the width and height of the rectangle (or alternatively the x, y of top-left and bottom-right corners).
Bounding Box Coordinates
Image (600 × 400 pixels)
(0,0) ─────────────────────── (600,0)
│                                   │
│   (80, 50)                        │
│     ┌──────────────┐              │
│     │    Object    │ height=150   │
│     │              │              │
│     └──────────────┘              │
│               (280, 200)          │
│       width = 200                 │
│                                   │
(0,400) ───────────────────── (600,400)
Box = [x=80, y=50, width=200, height=150]
or [x1=80, y1=50, x2=280, y2=200]
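The two conventions carry the same information, so converting between them is simple arithmetic. Here is a minimal Python sketch using the numbers from the diagram. The function names `xywh_to_xyxy` and `xyxy_to_xywh` are illustrative choices, not taken from any particular library.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


# The box from the diagram: top-left (80, 50), width 200, height 150.
print(xywh_to_xyxy((80, 50, 200, 150)))   # (80, 50, 280, 200)
print(xyxy_to_xywh((80, 50, 280, 200)))   # (80, 50, 200, 150)
```

Datasets and libraries differ in which convention they expect, so it is worth checking the format before passing boxes to a function.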
Sliding Window: The Classic Approach
Before deep learning, object detection commonly used a sliding window. A fixed-size window slides across the image at every position and scale. At each position, a classifier decides: "Does this patch contain the object?" Every window whose confidence score exceeds a threshold is reported as a detection.
Sliding Window Process
Step 1: Set window size (e.g., 64×128 for a pedestrian detector).
Step 2: Slide window across image:
┌────────────────────────────────┐
│ [W]→→→→→→→→→→→→→→→→→→→→→→→     │
│  ↓                             │
│ [W]→→→→→→→→→→→→→→→→→→→→→→→     │
│  ↓                             │
│ ... (thousands of positions)   │
└────────────────────────────────┘
Step 3: At each position, extract a HOG (Histogram of Oriented Gradients) descriptor from [W].
Step 4: Pass the HOG descriptor through an SVM (support vector machine) classifier.
  → "Person" confidence: 0.87 (above threshold 0.5 → detect!)
  → "Person" confidence: 0.12 (below threshold → skip)
Step 5: Scale image down (e.g., by 0.8×) and repeat.
  → Detects people of different sizes.
Problem: Very slow. Every position at every scale needs its own classifier call, so one image can require many thousands of checks, and far more with a small step size or a large image.
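To get a feel for the cost, the window positions from Steps 2 and 5 can be enumerated directly. This is a rough sketch only. The 8-pixel stride and the helper names `sliding_windows` and `count_windows` are assumptions made for illustration, and the HOG and SVM steps are left out.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x1, y1, x2, y2) for every window position at one scale."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


def count_windows(img_w, img_h, win_w=64, win_h=128, stride=8, scale=0.8):
    """Count window positions over an image pyramid.

    The image is shrunk by `scale` repeatedly until the window no
    longer fits, mirroring Step 5 above.
    """
    total = 0
    w, h = img_w, img_h
    while w >= win_w and h >= win_h:
        total += sum(1 for _ in sliding_windows(w, h, win_w, win_h, stride))
        w, h = int(w * scale), int(h * scale)
    return total


# A 600x400 image at the original scale only:
single_scale = sum(1 for _ in sliding_windows(600, 400, 64, 128, 8))
print(single_scale)             # 2380 classifier calls
# Adding the smaller pyramid levels increases the count further:
print(count_windows(600, 400))
```

Even with a fairly coarse stride, one modest image already needs thousands of classifier calls. A stride of 1 pixel would multiply the count by roughly 64.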
IoU: Intersection over Union
IoU measures how well a predicted bounding box matches the true (ground truth) bounding box. It equals the area where the two boxes overlap divided by the area of their union, that is, the total area covered by either box. An IoU of 1.0 is a perfect match and an IoU of 0 means no overlap. A common convention counts a prediction as correct when its IoU with the ground truth is at least 0.5.
IoU Calculation
Ground truth box:
┌────────────────┐
│                │
│    True box    │
│       ┌────────┼────┐
│       │Overlap │    │
└───────┼────────┘    │
        │Predicted box│
        └─────────────┘
IoU = Area of Overlap / Area of Union
Example:
Overlap area = 600 sq pixels
Union area = 1400 sq pixels
IoU = 600 / 1400 ≈ 0.43 → Moderate overlap, below the common 0.5 threshold
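The calculation can be sketched in a few lines of Python. The function name `iou` and the two example boxes are assumptions chosen so that the overlap is 600 square pixels and the union is 1400, matching the worked example. Boxes use the (x1, y1, x2, y2) corner format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (0 if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 40, 25)        # area 1000
predicted = (10, 5, 50, 30)   # area 1000, overlap 30 x 20 = 600
print(round(iou(truth, predicted), 2))   # 0.43
```

Note that the union is not the sum of the two areas: the overlap must be subtracted once, otherwise it is counted twice.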
Non-Maximum Suppression (NMS)
Object detectors often produce multiple overlapping boxes for the same object. NMS keeps the highest-scoring box for each object and removes the lower-scoring boxes that overlap it heavily.
NMS Step by Step
Detected boxes in a scene with two cars (with confidence scores):
  Box A: confidence 0.92 ← highest
  Box B: confidence 0.85 ← overlaps heavily with A
  Box C: confidence 0.78 ← overlaps heavily with A
  Box D: confidence 0.91 ← far from A (different car)
Step 1: Pick box with highest score → Box A (0.92). Keep it.
Step 2: Remove all boxes with IoU > 0.5 with Box A:
  → Box B (IoU=0.82 with A) → REMOVE
  → Box C (IoU=0.71 with A) → REMOVE
  → Box D (IoU=0.08 with A) → KEEP (different object)
Step 3: From remaining boxes, pick highest → Box D (0.91). Keep. No boxes remain, so stop.
Result: Two final boxes, one per car.
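The steps above can be sketched as a short greedy loop in Python. The box coordinates below are invented for illustration and chosen so that the IoU values with Box A come out close to those in the walkthrough. The function names `iou` and `nms` are likewise assumptions, not a library API.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns the indices of the boxes to keep."""
    # Candidate indices, highest confidence first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # Step 1 / Step 3: take the top box
        keep.append(best)
        # Step 2: drop every remaining box that overlaps it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [
    (100, 100, 200, 200),   # A
    (105, 105, 205, 205),   # B, IoU with A about 0.82
    (110, 108, 210, 208),   # C, IoU with A about 0.71
    (185, 100, 285, 200),   # D, IoU with A about 0.08
]
scores = [0.92, 0.85, 0.78, 0.91]
print(nms(boxes, scores))   # [0, 3] -> boxes A and D survive
```

In practice NMS is usually applied separately per class, so that a person standing in front of a car does not suppress the car.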
YOLO: You Only Look Once
YOLO changed object detection by treating it as a single regression problem. Instead of running a classifier over many windows or region proposals, YOLO divides the image into a grid and predicts boxes and class labels for all cells simultaneously, in one forward pass through a neural network.
YOLO Grid Prediction
Image divided into S×S grid (e.g., 7×7 = 49 cells; only some rows are drawn here):
+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+
|     |     |  ★  |     |     |     |     |  ← ★ = car center in this cell
+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |  ◆  |     |     |  ← ◆ = person center
+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+
Each cell predicts:
  • B bounding boxes (each with x, y, w, h, confidence)
  • C class probabilities (car=0.95, person=0.02, ...)
Total output: 7 × 7 × (B×5 + C) numbers, all in one pass!
Speed: 45+ frames per second on a GPU.
Use: Real-time video detection.
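The output-size formula and the idea of a cell "owning" an object center can be checked with a little arithmetic. The sketch below assumes B = 2 boxes per cell, C = 20 classes, and a 448×448 input, the settings of the original YOLO on PASCAL VOC. The function names and the example center point are made up for illustration.

```python
def yolo_output_size(S, B, C):
    """Number of values predicted for an S x S grid.

    Each cell predicts B boxes of 5 numbers (x, y, w, h, confidence)
    plus C class probabilities.
    """
    return S * S * (B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S):
    """Return the (row, col) of the grid cell containing an object center."""
    col = min(int(cx / img_w * S), S - 1)   # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)   # clamp centers on the bottom edge
    return row, col


print(yolo_output_size(S=7, B=2, C=20))          # 1470
print(responsible_cell(200, 100, 448, 448, 7))   # (1, 3)
```

The whole prediction is a fixed-size block of numbers, which is why it can be produced in a single forward pass regardless of how many objects the image contains.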
YOLO Version Improvements
| Version | Key Improvement | Speed and Notes |
|---|---|---|
| YOLOv1 (2016) | First one-pass detection | About 45 fps |
| YOLOv3 (2018) | Multi-scale detection | About 30 fps; better on small objects |
| YOLOv5 (2020) | PyTorch implementation, easy to use | Very fast; widely adopted |
| YOLOv8 (2023) | Unified architecture covering multiple tasks (detection, segmentation, classification) | Strong speed/accuracy trade-off at release |
Two-Stage Detectors: Faster R-CNN
Faster R-CNN uses two stages. The first stage (Region Proposal Network) suggests regions likely to contain objects. The second stage classifies each proposed region and refines its bounding box. This approach is slower than YOLO but typically more accurate, especially for small objects.
Two-Stage Pipeline
Stage 1 – Region Proposal Network (RPN):
[Full image] → CNN → Feature Map
Feature Map → RPN → ~2000 candidate boxes
(Boxes where the network thinks objects might be)
Stage 2 – Detection Head:
For each candidate box:
Extract its features from the feature map.
Classify: Car, Person, Dog, Background?
Refine: Adjust box coordinates precisely.
Result: Accurate boxes with class labels.
Trade-off: Slower than YOLO (5–7 fps vs. 45+ fps).
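The control flow of the two stages can be sketched with stand-in functions. This is a toy illustration, not Faster R-CNN itself. In the real model both stages are learned networks sharing one feature map, and box refinement uses parameterized offsets rather than the simple additive corner deltas used here. All names (`propose_regions`, `detect`, `toy_head`) and all numbers are hypothetical.

```python
def propose_regions(scored_boxes, top_k):
    """Stage 1 stand-in: keep the top_k boxes by objectness score."""
    ranked = sorted(scored_boxes, key=lambda item: item[1], reverse=True)
    return [box for box, _ in ranked[:top_k]]


def detect(proposals, head):
    """Stage 2 stand-in: classify each proposal and refine its box."""
    results = []
    for box in proposals:
        label, score, deltas = head(box)
        if label == "background":
            continue                      # proposal contained no object
        refined = tuple(c + d for c, d in zip(box, deltas))
        results.append((label, score, refined))
    return results


def toy_head(box):
    """Hypothetical detection head with hard-coded answers."""
    if box[0] == 0:
        return "car", 0.95, (1, 1, -1, -1)     # tighten the box slightly
    return "background", 0.60, (0, 0, 0, 0)


# Made-up (box, objectness) pairs standing in for the RPN output.
candidates = [((0, 0, 10, 10), 0.9), ((5, 5, 20, 20), 0.2), ((30, 30, 60, 60), 0.8)]
proposals = propose_regions(candidates, top_k=2)
print(detect(proposals, toy_head))   # [('car', 0.95, (1, 1, 9, 9))]
```

The extra per-proposal work in the second stage is where the speed cost comes from: every candidate box is processed individually.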
Real-World Applications
- Retail analytics – Count customers, detect empty shelves.
- Traffic monitoring – Count and classify vehicles by type.
- Sports broadcasting – Automatically track ball and players.
- Wildlife conservation – Detect and count animals from aerial drone footage.
- Industrial inspection – Spot defective parts on a production line.
Key Takeaways
- Object detection outputs both a class label and a bounding box for each detected object.
- IoU measures the accuracy of a predicted bounding box — higher is better.
- Non-Maximum Suppression removes duplicate overlapping boxes, keeping only the best.
- YOLO detects all objects in one pass — fast enough for real-time video.
- Faster R-CNN uses two stages (propose, then classify and refine): slower, but typically more accurate.
