Computer Vision: Image Segmentation
Image segmentation assigns a label to every single pixel in an image. Instead of drawing a rectangle around an object, segmentation draws its exact outline — pixel by pixel. This gives a precise understanding of exactly where each object is and how much space it occupies.
Segmentation vs. Detection vs. Classification
Three Levels of Understanding
INPUT: Photo of two people on a park bench
CLASSIFICATION:
Output: "People" (one label, no location)
OBJECT DETECTION:
Output:
Person 1 → [box: x=50, y=30, w=120, h=280]
Person 2 → [box: x=200, y=40, w=110, h=270]
Bench → [box: x=30, y=200, w=320, h=100]
(rectangles, not exact shapes)
SEGMENTATION:
Output:
Each pixel labeled:
██████░░░░████████░░░░░░░░░ ← Row 50
██████░░░░████████░░░░░░░░░ ← Row 51
[Person1] [gap] [Person2] [background] [bench]
Exact pixel-level outline of every object.
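To make the gap between a box and a mask concrete, the sketch below builds a tiny binary mask for one object and derives the bounding box a detector would report for it. The 6×8 grid and the `mask_to_box` helper are made up for illustration; the box uses the same (x, y, w, h) layout as the detection example above.

```python
import numpy as np

def mask_to_box(mask):
    """Return (x, y, w, h) of the tightest box around the True pixels.

    Assumes the mask contains at least one True pixel.
    """
    rows = np.any(mask, axis=1)          # rows that contain the object
    cols = np.any(mask, axis=0)          # columns that contain the object
    y_idx = np.where(rows)[0]
    x_idx = np.where(cols)[0]
    x, y = int(x_idx[0]), int(y_idx[0])
    w = int(x_idx[-1]) - x + 1
    h = int(y_idx[-1]) - y + 1
    return x, y, w, h

# Hypothetical 6x8 mask holding an L-shaped object
mask = np.zeros((6, 8), dtype=bool)
mask[1:5, 2] = True     # vertical stroke: rows 1-4, column 2
mask[4, 2:6] = True     # horizontal stroke: row 4, columns 2-5

box = mask_to_box(mask)              # (2, 1, 4, 4)
object_pixels = int(mask.sum())      # 7 pixels actually belong to the object
box_pixels = box[2] * box[3]         # 16 pixels fall inside the rectangle
```

In this toy case the box covers 16 pixels while the object occupies only 7, which is the kind of slack a segmentation mask removes.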
Semantic Segmentation
Semantic segmentation assigns a class label to every pixel but does not distinguish between different instances of the same class. All cars get the same "car" label, all people get the same "person" label. It answers: "What class does this pixel belong to?"
Semantic Segmentation Color Map
Street scene:
Original Photo: [cars, road, sidewalk, buildings, sky, trees, people]
Semantic Labels (each color = one class):
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ Sky (light blue)
██████████████████████████████ Buildings (gray)
▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓ Trees (green)
○○○○○ ○○○○○ People (red) — all same color
▲▲▲▲▲▲▲ ▲▲▲▲▲▲ Cars (blue) — all same color
════════════════════════════════ Road (dark gray)
ALL people = red. ALL cars = blue.
Cannot tell Person 1 from Person 2.
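A color map like this is usually just a lookup table applied to a 2D array of class IDs. The following is a minimal sketch of that idea; the class IDs, RGB values, and the `colorize` helper are assumptions chosen to mirror the colors named above, not a standard palette.

```python
import numpy as np

# Hypothetical class IDs and colors, chosen to match the color map above
CLASS_NAMES = ["sky", "building", "tree", "person", "car", "road"]
PALETTE = np.array([
    [135, 206, 235],  # 0 sky: light blue
    [128, 128, 128],  # 1 building: gray
    [0, 128, 0],      # 2 tree: green
    [255, 0, 0],      # 3 person: red
    [0, 0, 255],      # 4 car: blue
    [64, 64, 64],     # 5 road: dark gray
], dtype=np.uint8)

def colorize(label_map):
    """Map an (H, W) array of class IDs to an (H, W, 3) RGB image."""
    return PALETTE[label_map]   # integer-array indexing does the lookup

# Tiny label map: two separate people (class 3) in the third row
label_map = np.array([
    [0, 0, 0, 0, 0],
    [1, 1, 2, 2, 0],
    [3, 5, 3, 4, 4],
    [5, 5, 5, 5, 5],
])
rgb = colorize(label_map)

# Both people receive the identical color, so they cannot be told apart
same_color = bool((rgb[2, 0] == rgb[2, 2]).all())
```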
FCN: Fully Convolutional Network
The Fully Convolutional Network (FCN) was one of the first major deep learning approaches to semantic segmentation. It replaces the dense (fully connected) layers at the end of a classification CNN with convolutional layers, so the network produces a coarse spatial map of class scores instead of a single label. Upsampling layers then enlarge that map to the input resolution, giving one class label per pixel.
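As a rough illustration of the final step, the sketch below takes a coarse per-class score map, enlarges it, and picks the highest-scoring class at each pixel. It uses nearest-neighbor repetition as a stand-in for the learned or bilinear upsampling a real FCN would use, and the scores and the `upsample_nearest` helper are invented for the example.

```python
import numpy as np

def upsample_nearest(scores, factor):
    """Repeat each cell `factor` times along height and width."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

# Hypothetical coarse score map: 2 classes on a 2x2 grid, shape (C, h, w)
coarse = np.array([
    [[0.9, 0.2],
     [0.8, 0.1]],   # class 0 scores
    [[0.1, 0.8],
     [0.2, 0.9]],   # class 1 scores
])

full = upsample_nearest(coarse, factor=4)   # shape (2, 8, 8)
label_map = full.argmax(axis=0)             # shape (8, 8), one class ID per pixel
```

Here the left half of `label_map` comes out as class 0 and the right half as class 1. The blocky result also hints at why coarse upsampling alone gives imprecise boundaries.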
Standard CNN:
[Image] → [Conv layers] → [Flatten] → [Dense layers] → [One label]
FCN:
[Image] → [Conv layers] → [Upsampling layers] → [Label map same size as input]
↑
"Upsampling" grows the small feature map back to image size.
U-Net: Skip Connections for Precise Edges
U-Net is one of the most widely used architectures for medical image segmentation. It has a U-shaped design: the left side (encoder) progressively downsamples the image to capture context, and the right side (decoder) expands the features back to full resolution. Skip connections pass high-resolution details from the encoder directly to the decoder, producing sharp, accurate segment boundaries.
U-Net Architecture Diagram
ENCODER (left side — shrinks) DECODER (right side — expands)
[224×224×3 input] [224×224×classes output]
↓ ↑
[Conv → 112×112×64] ─────────→ [Conv → 112×112×64]
↓ ↑
[Conv → 56×56×128] ─────────→ [Conv → 56×56×128]
↓ ↑
[Conv → 28×28×256] ─────────→ [Conv → 28×28×256]
↓ ↑
[Conv → 14×14×512] → [Bottleneck 7×7×1024] →
Long horizontal arrows (─────────→) = Skip connections carrying spatial detail
Bottleneck = deepest point, most abstract features
Skip connections let the decoder know exactly where edges are, even though the encoder compressed that information. This precision is why U-Net excels at segmenting thin structures like blood vessels or nerve fibers in medical scans.
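In code, a skip connection usually amounts to concatenating an encoder feature map with the upsampled decoder feature map along the channel axis. The sketch below shows only the shape bookkeeping for one decoder stage, using the 56×56 and 28×28 sizes from the diagram. The random arrays, the nearest-neighbor upsampling, and the 1×1 projection are simplifying assumptions standing in for trained layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes follow the diagram above, in (height, width, channels) order
encoder_feat = rng.standard_normal((56, 56, 128))   # saved on the way down
decoder_feat = rng.standard_normal((28, 28, 256))   # coming up from deeper layers

# Step 1: upsample the decoder feature to the encoder's resolution
# (nearest-neighbor here; real models often use a learned transposed convolution)
upsampled = decoder_feat.repeat(2, axis=0).repeat(2, axis=1)   # (56, 56, 256)

# Step 2: the skip connection concatenates both along the channel axis
merged = np.concatenate([encoder_feat, upsampled], axis=-1)    # (56, 56, 384)

# Step 3 (stand-in for the following conv): a 1x1 projection back to 128 channels
weights = rng.standard_normal((384, 128)) * 0.01    # hypothetical, untrained
decoder_out = merged @ weights                      # (56, 56, 128)
```

The first 128 channels of `merged` are the untouched encoder features, which is how fine spatial detail reaches the decoder without passing through the bottleneck.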
Instance Segmentation
Instance segmentation goes further than semantic segmentation. It gives every individual object its own unique label — not just "car" but "Car 1," "Car 2," and "Car 3." Each instance gets its own pixel mask.
Instance Segmentation Example
Three overlapping apples on a plate:
Semantic:
Apple Apple Apple Plate Background
█████ █████ █████ ░░░░░ · · · · ·
(All apples = same color)
Instance:
Apple_1 Apple_2 Apple_3 Plate Background
███████ ▓▓▓▓▓▓▓ ▒▒▒▒▒▒▒ ░░░░░ · · · · ·
(Each apple = unique color/ID)
Instance segmentation can count objects and tell them apart even when they touch.
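Once every pixel carries an instance ID, counting and measuring objects becomes a simple array operation. The sketch below assumes a small hand-written instance map in which 0 means "not an apple" and 1 to 3 are apple IDs; the layout is invented for illustration.

```python
import numpy as np

# Hypothetical instance map: 0 = plate/background, 1..3 = individual apples
instance_map = np.array([
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 3, 3],
    [0, 0, 0, 0, 3, 3],
])

# Semantic view: all apple pixels collapse into one "apple" mask
semantic_apple = instance_map > 0

# Instance view: count the distinct IDs and measure each apple's area
ids, areas = np.unique(instance_map[instance_map > 0], return_counts=True)
num_apples = len(ids)                                        # 3
area_by_id = {int(i): int(a) for i, a in zip(ids, areas)}    # {1: 4, 2: 3, 3: 4}
```

The semantic mask alone reports 11 apple pixels, but only the instance map can say that those pixels belong to three separate apples, even though apples 2 and 3 touch.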
Mask R-CNN
Mask R-CNN extends Faster R-CNN by adding a third output: a pixel mask for each detected object. It runs in three parallel streams — bounding box, class label, and pixel mask — for each proposed region.
For each detected object region:
Stream 1: [Box refinement] → precise bounding box
Stream 2: [Classification] → class label + confidence
Stream 3: [Mask prediction] → 28×28 binary mask
(which pixels inside the box belong to the object?)
Final output:
Object 1: Label=Car, Box=(100,80,300,220), Mask=[28×28 binary grid]
Object 2: Label=Person, Box=(350,50,420,300), Mask=[28×28 binary grid]
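The small per-object mask only becomes useful once it is resized to its box and placed in the full image. The following is a minimal sketch of that post-processing step. It assumes the boxes above are (x1, y1, x2, y2) pixel corners, that the mask holds probabilities thresholded at 0.5, and a 480×640 image; it also uses nearest-neighbor resizing where real implementations typically interpolate. `paste_mask` is a hypothetical helper, not a library function.

```python
import numpy as np

def paste_mask(mask_probs, box, image_shape, threshold=0.5):
    """Resize a small soft mask to its box and place it in a full-size mask.

    Assumes box = (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box
    box_h, box_w = y2 - y1, x2 - x1
    m_h, m_w = mask_probs.shape
    # Nearest-neighbor resize: pick a source row/column for each target pixel
    src_rows = np.arange(box_h) * m_h // box_h
    src_cols = np.arange(box_w) * m_w // box_w
    resized = mask_probs[src_rows[:, None], src_cols[None, :]]
    full = np.zeros(image_shape, dtype=bool)
    full[y1:y2, x1:x2] = resized >= threshold
    return full

# Hypothetical 28x28 mask: the object fills the left half of its box
mask_probs = np.zeros((28, 28))
mask_probs[:, :14] = 0.9

full_mask = paste_mask(mask_probs, box=(100, 80, 300, 220), image_shape=(480, 640))
```

With these inputs the 200×140 box is half filled, so `full_mask` marks a 100×140 pixel region starting at column 100, row 80.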
Panoptic Segmentation
Panoptic segmentation unifies semantic and instance segmentation. Countable objects (people, cars, animals) get individual instance IDs. Uncountable regions (sky, road, grass) get semantic labels only — there is no meaningful way to count individual patches of sky.
Panoptic Output Map
| Pixel Region | Semantic Label | Instance ID | Type |
|---|---|---|---|
| Sky pixels | Sky | — (no instance) | Stuff (uncountable) |
| Road pixels | Road | — (no instance) | Stuff (uncountable) |
| First person | Person | ID = 1 | Thing (countable) |
| Second person | Person | ID = 2 | Thing (countable) |
| Car | Car | ID = 3 | Thing (countable) |
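One way to store such a map is to fold the class and the instance ID into a single integer per pixel. The sketch below assumes an encoding of `class_id * 1000 + instance_id`, with instance 0 reserved for stuff. The class IDs, the offset, and the tiny maps are illustrative assumptions, and datasets differ in the exact encoding they use.

```python
import numpy as np

# Hypothetical class IDs; "things" are countable, "stuff" is not
SKY, ROAD, PERSON, CAR = 0, 1, 2, 3
THING_CLASSES = {PERSON, CAR}
OFFSET = 1000  # assumed encoding: panoptic_id = class_id * OFFSET + instance_id

semantic = np.array([
    [SKY,    SKY,    SKY,    SKY],
    [PERSON, PERSON, CAR,    ROAD],
    [ROAD,   ROAD,   ROAD,   ROAD],
])
instance = np.array([      # 0 means "no instance"
    [0, 0, 0, 0],
    [1, 2, 3, 0],
    [0, 0, 0, 0],
])

# Instance IDs are kept only for thing classes; stuff pixels get instance 0
is_thing = np.isin(semantic, list(THING_CLASSES))
panoptic = semantic * OFFSET + np.where(is_thing, instance, 0)

def decode(panoptic_id):
    """Split a panoptic ID back into (class_id, instance_id)."""
    return divmod(int(panoptic_id), OFFSET)
```

For example, `decode(panoptic[1, 1])` gives `(2, 2)`, the second person, while every road pixel decodes to `(1, 0)`.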
Evaluation Metric: Mean IoU (mIoU)
The primary metric for segmentation quality is Mean Intersection over Union (mIoU). For each class, it computes the IoU between predicted pixels and true pixels, then averages across all classes.
For class "Car":
Predicted car pixels: 800 pixels
True car pixels: 900 pixels
Overlap: 700 pixels
Union: 800 + 900 - 700 = 1000 pixels
IoU(car) = 700 / 1000 = 0.70
Compute IoU for every class → Average them = mIoU.
Strong models reach roughly 80–85% mIoU on some standard benchmarks, though scores vary considerably by dataset.
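The arithmetic above can be checked with a few lines of NumPy. The sketch below rebuilds the "Car" numbers on a flat array of 1,100 pixels with two classes (car and everything else). The helper names and the choice to skip classes absent from both maps are assumptions, since benchmarks differ in how they handle that case.

```python
import numpy as np

def iou_per_class(pred, target, num_classes):
    """Return a list with one IoU per class (NaN if the class never appears)."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        inter = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious

def mean_iou(pred, target, num_classes):
    """Average the per-class IoUs, skipping classes absent from both maps."""
    return float(np.nanmean(iou_per_class(pred, target, num_classes)))

# Class 1 = car, class 0 = everything else
target = np.zeros(1100, dtype=int)
pred = np.zeros(1100, dtype=int)
target[:900] = 1          # 900 true car pixels
pred[200:1000] = 1        # 800 predicted car pixels, 700 of them overlapping

ious = iou_per_class(pred, target, num_classes=2)   # car IoU is 700 / 1000 = 0.70
miou = mean_iou(pred, target, num_classes=2)
```

In this toy setup the non-car class scores an IoU of 100 / 400 = 0.25, so the mIoU is 0.475. That shows how one weak class pulls the average down even when another class looks good.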
Real-World Applications
- Autonomous driving – Label every pixel of a video frame as road, pedestrian, vehicle, or obstacle.
- Medical imaging – Precisely outline tumors, organs, or surgical targets in scans.
- Satellite mapping – Color-code land cover (forest, farmland, urban, water) at the pixel level.
- Virtual try-on – Separate a person's body from the background to overlay virtual clothing.
- Robotics – Help robots understand and interact with the exact shape of objects in a scene.
Key Takeaways
- Segmentation labels every pixel — more precise than detection's bounding boxes.
- Semantic segmentation labels pixel classes but cannot tell two cars apart.
- Instance segmentation labels each individual object separately.
- Panoptic segmentation combines both: instances for countable objects, semantic labels for background.
- U-Net uses skip connections to preserve spatial detail and is a standard choice for medical segmentation.
- Mask R-CNN extends object detection to produce per-object pixel masks.
