Computer Vision Image Segmentation

Image segmentation assigns a label to every single pixel in an image. Instead of drawing a rectangle around an object, segmentation draws its exact outline — pixel by pixel. This gives a precise understanding of exactly where each object is and how much space it occupies.

Segmentation vs. Detection vs. Classification

Three Levels of Understanding

INPUT: Photo of two people on a park bench

CLASSIFICATION:
  Output: "People"   (one label, no location)

OBJECT DETECTION:
  Output:
    Person 1 → [box: x=50,  y=30, w=120, h=280]
    Person 2 → [box: x=200, y=40, w=110, h=270]
    Bench    → [box: x=30,  y=200, w=320, h=100]
  (rectangles, not exact shapes)

SEGMENTATION:
  Output:
    Each pixel labeled:
    ██████░░░░████████░░░░░░░░░   ← Row 50
    ██████░░░░████████░░░░░░░░░   ← Row 51
    [Person 1] [background] [Person 2] [background]
    (█ = person pixel, ░ = background pixel; the bench only appears in lower rows)

  Exact pixel-level outline of every object.
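
To make the difference between a box and a mask concrete, here is a minimal Python sketch. The helper name mask_to_box and the tiny 5×5 mask are made up for illustration. It derives the tightest (x, y, w, h) box from a binary mask and compares the two areas.

```python
def mask_to_box(mask):
    """Return (x, y, w, h) of the tightest box around the 1-pixels of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None  # empty mask: there is no object to box
    return (cols[0], rows[0], cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1)


# Hypothetical diagonal object: 1 = object pixel, 0 = background.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

box = mask_to_box(mask)                      # (1, 1, 3, 3)
mask_area = sum(sum(row) for row in mask)    # 5 object pixels
box_area = box[2] * box[3]                   # 9 pixels inside the box
print(box, mask_area, box_area)
```

The box covers 9 pixels but only 5 of them belong to the object. That gap is the extra information a segmentation mask carries.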

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel but does not distinguish between different instances of the same class. All cars get the same "car" label, all people get the same "person" label. It answers: "What class does this pixel belong to?"

Semantic Segmentation Color Map

Street scene:

Original Photo:  [cars, road, sidewalk, buildings, sky, trees, people]

Semantic Labels (each color = one class):
  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  Sky      (light blue)
  ██████████████████████████████    Buildings (gray)
  ▓▓▓▓▓▓▓  ▓▓▓▓▓▓▓  ▓▓▓▓▓▓▓▓▓▓▓   Trees    (green)
  ○○○○○    ○○○○○                    People   (red) — all same color
     ▲▲▲▲▲▲▲   ▲▲▲▲▲▲              Cars     (blue) — all same color
  ════════════════════════════════  Road     (dark gray)

ALL people = red. ALL cars = blue.
Cannot tell Person 1 from Person 2.
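
A common way to produce such a label map is to have the model output one score per class for every pixel and keep the highest-scoring class. The sketch below assumes that setup. The class list, the scores, and the function name label_map are illustrative, not taken from any particular model.

```python
CLASSES = ["sky", "road", "car"]  # hypothetical class list


def label_map(scores):
    """scores[row][col] holds one score per class; return the winning class index per pixel."""
    return [
        [max(range(len(pixel)), key=lambda k: pixel[k]) for pixel in row]
        for row in scores
    ]


# A 2x2 "image": the top row looks like sky, the bottom row like car.
scores = [
    [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
    [[0.1, 0.2, 0.7], [0.1, 0.3, 0.6]],
]

labels = label_map(scores)                           # [[0, 0], [2, 2]]
names = [[CLASSES[k] for k in row] for row in labels]
print(names)
```

Both bottom pixels come out as "car" whether they belong to one car or two, which is exactly the limitation described above.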

FCN: Fully Convolutional Network

The Fully Convolutional Network (FCN) was one of the first widely adopted deep learning approaches to semantic segmentation. It replaces the dense (fully connected) layers at the end of a classification CNN with convolutional layers and then upsamples the result, so the network outputs a spatial map of class labels instead of a single label.

Standard CNN:
  [Image] → [Conv layers] → [Flatten] → [Dense layers] → [One label]

FCN:
  [Image] → [Conv layers] → [Upsampling layers] → [Label map same size as input]
                                       ↑
                 "Upsampling" grows the small feature map back to image size.

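The upsampling step can be illustrated with the simplest possible method, nearest-neighbor repetition. This is a stand-in chosen for clarity: FCN itself uses learned upsampling layers (transposed convolutions), and the function name upsample_nearest is hypothetical.

```python
def upsample_nearest(grid, factor):
    """Enlarge a 2D grid by repeating every value `factor` times in both directions."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# A coarse 2x2 map grown back to 4x4.
coarse = [
    [0, 1],
    [2, 2],
]
full = upsample_nearest(coarse, 2)
for row in full:
    print(row)
# [0, 0, 1, 1]
# [0, 0, 1, 1]
# [2, 2, 2, 2]
# [2, 2, 2, 2]
```

The enlarged map has the right size but blocky edges, which is one reason later architectures pass fine detail from earlier layers to the upsampling path.
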
U-Net: Skip Connections for Precise Edges

U-Net is the dominant architecture for medical image segmentation. It has a U-shaped design: the left side (encoder) shrinks the image to understand context, and the right side (decoder) expands it back to full resolution. Skip connections pass high-resolution details from the encoder directly to the decoder, producing sharp, accurate segment boundaries.

U-Net Architecture Diagram

ENCODER (left side — shrinks)     DECODER (right side — expands)
[224×224×3 input]                 [224×224×classes output]
       ↓                                   ↑
[Conv → 112×112×64]  ─────────→  [Conv → 112×112×64]
       ↓                                   ↑
[Conv → 56×56×128]   ─────────→  [Conv → 56×56×128]
       ↓                                   ↑
[Conv → 28×28×256]   ─────────→  [Conv → 28×28×256]
       ↓                                   ↑
[Conv → 14×14×512] → [Bottleneck 7×7×1024] → (up into the decoder)

Long arrows (─────────→) = skip connections carrying spatial detail
Short arrows (↓, ↑, →) = main data path
Bottleneck = deepest point, most abstract features

Skip connections let the decoder know exactly where edges are, even though the encoder compressed that information. This precision is why U-Net excels at segmenting thin structures like blood vessels or nerve fibers in medical scans.
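
The bookkeeping behind the diagram can be traced with a few lines of Python. This is a shape-only sketch under simplifying assumptions: every encoder stage halves the resolution and doubles the channels, every decoder step halves the channels before concatenating the matching skip features, and the function name unet_shapes is hypothetical. Real U-Net variants differ in these details.

```python
def unet_shapes(size=224, base_channels=64, depth=4):
    """Trace feature-map shapes through a simplified U-Net (no real convolutions)."""
    encoder = []
    channels = base_channels
    for _ in range(depth):
        size //= 2                              # downsample
        encoder.append((size, size, channels))
        channels *= 2
    bottleneck = (size // 2, size // 2, channels)

    decoder = []
    for height, width, skip_channels in reversed(encoder):
        up_channels = channels // 2             # upsampling step halves the channels
        concat = up_channels + skip_channels    # skip features are stacked channel-wise
        channels = skip_channels                # convolutions reduce back to the skip's width
        decoder.append({"size": (height, width), "concat": concat, "out": channels})
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes()
print(encoder)      # [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512)]
print(bottleneck)   # (7, 7, 1024)
for stage in decoder:
    print(stage)
```

The "concat" value shows what a skip connection does in practice: at each decoder stage, half of the channels come from the upsampled path and half come directly from the encoder at the same resolution.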

Instance Segmentation

Instance segmentation goes further than semantic segmentation. It gives every individual object its own unique label — not just "car" but "Car 1," "Car 2," and "Car 3." Each instance gets its own pixel mask.

Instance Segmentation Example

Three overlapping apples on a plate:

Semantic:
  Apple Apple Apple Plate Background
  █████ █████ █████ ░░░░░ · · · · ·
  (All apples = same color)

Instance:
  Apple_1  Apple_2  Apple_3  Plate  Background
  ███████  ▓▓▓▓▓▓▓  ▒▒▒▒▒▒▒  ░░░░░  · · · · ·
  (Each apple = unique color/ID)

Instance segmentation can count objects and
tell them apart even when they touch.
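
Counting then becomes a matter of collecting distinct instance IDs. The sketch below assumes the output is stored as two grids of equal size, one with class names and one with instance IDs, where ID 0 means "no instance". That storage layout and the name count_instances are assumptions made for illustration.

```python
from collections import Counter


def count_instances(class_map, instance_map):
    """Count distinct instance IDs per class; ID 0 marks pixels without an instance."""
    seen = set()
    for class_row, instance_row in zip(class_map, instance_map):
        for cls, instance_id in zip(class_row, instance_row):
            if instance_id != 0:
                seen.add((cls, instance_id))
    return Counter(cls for cls, _ in seen)


# Three apples that touch each other, next to a plate region.
class_map = [
    ["apple", "apple", "apple", "plate"],
    ["apple", "apple", "apple", "plate"],
]
instance_map = [
    [1, 1, 2, 0],
    [1, 3, 2, 0],
]

counts = count_instances(class_map, instance_map)
print(counts["apple"])  # 3
```

A semantic map alone would show one connected "apple" blob here; the instance IDs are what make the count of three recoverable.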

Mask R-CNN

Mask R-CNN extends Faster R-CNN by adding a third output: a pixel mask for each detected object. It runs in three parallel streams — bounding box, class label, and pixel mask — for each proposed region.

For each detected object region:

  Stream 1: [Box refinement]   → precise bounding box
  Stream 2: [Classification]   → class label + confidence
  Stream 3: [Mask prediction]  → 28×28 binary mask
                                 (which pixels inside the box belong to the object?)

Final output:
  Object 1: Label=Car, Box=(100,80,300,220), Mask=[28×28 binary grid]
  Object 2: Label=Person, Box=(350,50,420,300), Mask=[28×28 binary grid]
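
To use a small fixed-size mask on the full image, it has to be stretched to the size of its box and turned into hard 0/1 values. The sketch below shows that step under stated assumptions: boxes are (x1, y1, x2, y2) with exclusive right and bottom edges, the mask holds soft scores between 0 and 1, and resizing uses nearest-neighbor lookup for simplicity, where real implementations typically interpolate. The function name paste_mask is hypothetical.

```python
def paste_mask(mask, box, image_h, image_w, threshold=0.5):
    """Stretch a small soft mask over its box and return a full-size binary mask."""
    x1, y1, x2, y2 = box
    box_w, box_h = x2 - x1, y2 - y1
    mask_h, mask_w = len(mask), len(mask[0])
    full = [[0] * image_w for _ in range(image_h)]
    for y in range(max(y1, 0), min(y2, image_h)):
        for x in range(max(x1, 0), min(x2, image_w)):
            # Nearest-neighbor lookup: which mask cell covers this image pixel?
            mask_y = (y - y1) * mask_h // box_h
            mask_x = (x - x1) * mask_w // box_w
            if mask[mask_y][mask_x] >= threshold:
                full[y][x] = 1
    return full


# A tiny 2x2 mask stands in for the 28x28 grid; the box is 4x4 in a 6x6 image.
soft_mask = [
    [0.9, 0.2],
    [0.8, 0.7],
]
full = paste_mask(soft_mask, box=(1, 1, 5, 5), image_h=6, image_w=6)
for row in full:
    print(row)
# [0, 0, 0, 0, 0, 0]
# [0, 1, 1, 0, 0, 0]
# [0, 1, 1, 0, 0, 0]
# [0, 1, 1, 1, 1, 0]
# [0, 1, 1, 1, 1, 0]
# [0, 0, 0, 0, 0, 0]
```

Because the mask is predicted at a fixed low resolution, large objects get coarser outlines than small ones after stretching.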

Panoptic Segmentation

Panoptic segmentation unifies semantic and instance segmentation. Countable objects (people, cars, animals) get individual instance IDs. Uncountable regions (sky, road, grass) get semantic labels only — there is no meaningful way to count individual patches of sky.

Panoptic Output Map

Pixel Region     Semantic Label   Instance ID   Type
Sky pixels       Sky              none          Stuff (uncountable)
Road pixels      Road             none          Stuff (uncountable)
First person     Person           1             Thing (countable)
Second person    Person           2             Thing (countable)
Car              Car              3             Thing (countable)
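
One way to store such a map is a single integer per pixel that packs the class and the instance together. The encoding below (class_id × 1000 + instance_id) is one convention assumed for illustration. The class IDs, the stuff/thing split, and the name to_panoptic are likewise hypothetical.

```python
STUFF = {"sky", "road"}                                     # uncountable classes (assumed)
CLASS_IDS = {"sky": 0, "road": 1, "person": 2, "car": 3}    # hypothetical numbering


def to_panoptic(class_map, instance_map, divisor=1000):
    """Merge a class map and an instance map into one integer map."""
    panoptic = []
    for class_row, instance_row in zip(class_map, instance_map):
        row = []
        for cls, instance_id in zip(class_row, instance_row):
            if cls in STUFF:
                instance_id = 0          # stuff never carries an instance ID
            row.append(CLASS_IDS[cls] * divisor + instance_id)
        panoptic.append(row)
    return panoptic


class_map = [
    ["sky", "sky", "sky"],
    ["person", "person", "car"],
    ["road", "road", "road"],
]
instance_map = [
    [0, 0, 0],
    [1, 2, 3],
    [0, 0, 0],
]

panoptic = to_panoptic(class_map, instance_map)
for row in panoptic:
    print(row)
# [0, 0, 0]
# [2001, 2002, 3003]
# [1000, 1000, 1000]
```

Integer division by the divisor recovers the class and the remainder recovers the instance, so 2002 decodes to class 2 (person), instance 2.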

Evaluation Metric: Mean IoU (mIoU)

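The primary metric for semantic segmentation quality is Mean Intersection over Union (mIoU). For each class, it computes the IoU between the predicted pixels and the true pixels of that class, then averages across all classes. Instance and panoptic segmentation are usually scored with other metrics, such as mask average precision and panoptic quality.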

For class "Car":
  Predicted car pixels:   800 pixels
  True car pixels:        900 pixels
  Overlap:                700 pixels
  Union:                  800 + 900 - 700 = 1000 pixels
  IoU(car) = 700 / 1000 = 0.70

Compute IoU for every class → Average them = mIoU.
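Scores depend heavily on the dataset: strong models reach roughly 80–85% mIoU on urban-scene benchmarks such as Cityscapes, while benchmarks with many more classes yield lower numbers.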
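
The IoU arithmetic above can be sketched in plain Python. The function names and the tiny 2×4 label maps are made up for illustration. Classes that appear in neither map are skipped when averaging, which is one common choice rather than the only one.

```python
def iou_from_counts(predicted, true, overlap):
    """IoU from pixel counts: overlap / (predicted + true - overlap)."""
    return overlap / (predicted + true - overlap)


def iou_per_class(pred, truth, num_classes):
    """Per-class IoU for two label maps of equal size (None if a class never appears)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                union[p] += 1   # pixel wrongly predicted as class p
                union[t] += 1   # pixel of class t that was missed
    return [inter[c] / union[c] if union[c] else None for c in range(num_classes)]


def mean_iou(pred, truth, num_classes):
    """Average the IoU over all classes that appear in either map."""
    ious = [v for v in iou_per_class(pred, truth, num_classes) if v is not None]
    return sum(ious) / len(ious)


print(iou_from_counts(800, 900, 700))   # the "Car" example: 0.7

pred = [[0, 0, 1, 1],
        [0, 1, 1, 1]]
truth = [[0, 0, 0, 1],
         [0, 1, 1, 1]]
ious = iou_per_class(pred, truth, num_classes=2)   # class 0: 3/4, class 1: 4/5
print(ious, mean_iou(pred, truth, num_classes=2))
```

A single mislabeled pixel lowers the IoU of two classes at once: the class it was wrongly assigned to and the class it actually belongs to.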

Real-World Applications

Autonomous driving – Label every pixel of a video frame as road, pedestrian, vehicle, or obstacle.
Medical imaging – Precisely outline tumors, organs, or surgical targets in scans.
Satellite mapping – Color-code land cover (forest, farmland, urban, water) at the pixel level.
Virtual try-on – Separate a person's body from the background to overlay virtual clothing.
Robotics – Help robots understand and interact with the exact shape of objects in a scene.

Key Takeaways

Segmentation labels every pixel — more precise than detection's bounding boxes.
Semantic segmentation labels pixel classes but cannot tell two cars apart.
Instance segmentation labels each individual object separately.
Panoptic segmentation combines both: instance IDs for countable objects, semantic labels only for uncountable regions such as sky and road.
U-Net uses skip connections to preserve spatial detail — the standard for medical segmentation.
Mask R-CNN extends object detection to produce per-object pixel masks.
