Computer Vision Image Descriptors

Feature detection finds interesting locations in an image. Descriptors describe what those locations look like by turning the visual appearance of a patch into a compact vector of numbers. Matching patches in different images produce similar descriptors, which lets the computer recognize the same feature across photos.

Detection vs. Description

Detection answers: "Where is an interesting point?" Description answers: "What does that point look like, in numbers?" Both steps together allow image matching.

Detection + Description Pipeline

Photo A                         Photo B
  (same building, different angle)

Step 1 – DETECT:
  Found corner at (120, 85)       Found corner at (210, 60)

Step 2 – DESCRIBE:
  Descriptor = [0.3, 0.8, 0.1, 0.6, ...]   [0.3, 0.8, 0.1, 0.6, ...]
                       ↑                              ↑
               Very similar numbers → very likely the SAME corner

Step 3 – MATCH:
  Connect matching descriptors across both photos.

HOG: Histogram of Oriented Gradients

HOG is one of the most widely used image descriptors. It captures the direction and strength of edges in small image patches, then summarizes them as a histogram of edge directions. The collection of these histograms across the image forms the descriptor.

HOG Step by Step

Step 1: Compute gradient (edge direction + strength) at every pixel.

  Gradient direction: any angle (0°, 45°, 90°, 135°, or anything in between), usually folded into 0°–180°.
  Gradient magnitude: how strong the edge is.

Step 2: Divide image into cells (e.g., 8×8 pixels each).

  +-------+-------+-------+
  | cell  | cell  | cell  |
  | (8×8) | (8×8) | (8×8) |
  +-------+-------+-------+
  | cell  | cell  | cell  |
  +-------+-------+-------+

Step 3: In each cell, build a histogram of gradient directions.
  Each pixel votes for the bin of its direction, weighted by its gradient magnitude.
  Bin 1 (0°–20°):   votes = 5   ██████
  Bin 2 (20°–40°):  votes = 12  █████████████
  Bin 3 (40°–60°):  votes = 3   ████
  ...
  Bin 9 (160°–180°): votes = 7  ████████

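Step 4: Group neighboring cells into overlapping blocks (e.g., 2×2 cells) and normalize the histograms in each block.
  → Reduces the effect of lighting and contrast changes.

Step 5: Concatenate all normalized block histograms into one long vector.
  → That vector is the HOG descriptor.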
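As a rough illustration of Steps 1 and 3, the sketch below computes a magnitude-weighted direction histogram for a single cell. It is a simplification under stated assumptions: the function name `cell_histogram` and the test cell are made up for this example, gradients are plain central differences taken only on the interior pixels of the cell, and each pixel votes into exactly one bin. Production HOG implementations typically compute gradients over the whole image and interpolate votes between neighboring bins.

```python
import math

def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient directions (0-180 degrees)."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Simplification: skip the border pixels so every pixel has both neighbors.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat area, no edge to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # fold into 0-180
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# Hypothetical 8x8 cell with a vertical edge: dark left half, bright right half.
cell = [[0, 0, 0, 0, 100, 100, 100, 100] for _ in range(8)]
hist = cell_histogram(cell)
print(hist)  # all the edge strength lands in bin 0 (0-20 degrees)
```

A vertical edge has a horizontal gradient, so every vote falls into the first bin. A horizontal edge would instead fill the bin that contains 90°.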

Why HOG Works for Pedestrian Detection

A human body silhouette produces consistent edge patterns:
  Head: edges curving around a rough circle
  Shoulders: near-horizontal edges
  Torso: mainly vertical edges
  Legs: two vertical columns

A HOG descriptor captures this edge layout while tolerating small changes in pose.
A non-person (tree, car) produces a different pattern.
Train a classifier (classically a linear SVM) on HOG vectors → it learns to separate pedestrians from background.

SIFT Descriptor

SIFT (Scale-Invariant Feature Transform) creates descriptors that remain useful even when the image is rotated, scaled, or photographed under different lighting. It achieves this by normalizing the patch orientation and size before computing the descriptor.

SIFT Descriptor Construction

Around a detected keypoint:

Step 1: Find dominant orientation.
  Compute gradient directions in a patch around the keypoint (16×16 samples, taken at the keypoint's detected scale).
  The most common direction becomes the "main orientation."
  Rotate the patch so main orientation always points "up."
  → Now the descriptor is rotation-invariant.

Step 2: Divide the 16×16 patch into a 4×4 grid of sub-regions.
  16 sub-regions total, each covering 4×4 samples.

Step 3: In each sub-region, compute 8-bin gradient histogram.
  16 sub-regions × 8 bins = 128 numbers.

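Step 4: Normalize the 128-number vector to unit length.
  → Reduces effect of lighting changes: a contrast change scales every gradient by the same factor, and normalization cancels that factor.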

Final descriptor: 128-dimensional vector.
Similar patches → similar 128-number vectors → matched!
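The normalization in Step 4 can be sketched as follows. This is a minimal illustration, not a full SIFT implementation: the function name `normalize_descriptor` and the input values are invented for the example. The clip-at-0.2-then-renormalize step follows the original SIFT paper, which uses it to limit the influence of a few very strong gradients.

```python
import math

def normalize_descriptor(vec, clip=0.2):
    """Scale to unit length, clip large entries, then scale to unit length again."""
    def unit_length(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    clipped = [min(x, clip) for x in unit_length(vec)]
    return unit_length(clipped)

# Made-up 128-value histogram vector with one very strong gradient entry.
raw = [float((i * 7) % 13) for i in range(128)]
raw[5] = 400.0

# A contrast change multiplies every gradient magnitude by the same factor.
higher_contrast = [2.5 * x for x in raw]

a = normalize_descriptor(raw)
b = normalize_descriptor(higher_contrast)
max_diff = max(abs(x - y) for x, y in zip(a, b))
print(max_diff)  # practically zero: both patches get the same descriptor
```

A pure brightness offset (adding a constant to every pixel) needs no correction at all, because gradients are differences between pixels and the constant cancels out.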

SIFT: Scale and Rotation Invariance

Change in Image	SIFT Response	HOG Response
Rotated 90°	Nearly the same descriptor (rotation-normalized)	Different descriptor
Scaled 2×	Nearly the same descriptor (scale-normalized)	Different descriptor (fixed cell size)
Brighter lighting	Similar descriptor (normalized)	Similar descriptor (if block-normalized)

ORB: Oriented FAST and Rotated BRIEF

ORB combines the FAST detector with a modified version of the BRIEF descriptor. BRIEF stores each feature as a binary string (a sequence of 0s and 1s), making it extremely compact and fast to compare. ORB adds an orientation estimate and rotates the BRIEF sampling pattern to match it, which makes the descriptor robust to rotation. It is a common choice for real-time applications on phones and embedded systems because it is fast and free of patent restrictions (SIFT was patented until 2020, and SURF is still treated as patent-encumbered).

Binary Descriptor Concept

BRIEF comparison test:
  Take a pixel pair (p, q) from a fixed sampling pattern in the patch.
  (The pairs are chosen at random once, then reused for every patch.)
  If brightness(p) > brightness(q) → record 1
  If brightness(p) ≤ brightness(q) → record 0

Repeat for all 256 (p, q) pairs in the pattern:
  Result: 256-bit binary string
  Example: 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1...

Matching: Count different bits (Hamming distance).
  Similar patches → few different bits → MATCH
  Different patches → many different bits → NO MATCH

Speed: Comparing binary strings with XOR is very fast.
Size: 256 bits = 32 bytes (vs. SIFT's 512 bytes for 128 32-bit floats)
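The binary test and the Hamming comparison can be sketched in a few lines. This is a BRIEF-style illustration only, under assumptions labeled here: the helper names (`make_pairs`, `brief_descriptor`, `hamming`), the 31×31 patch size, the uniform random sampling pattern, and the random test patches are all choices made for this example. It also skips the smoothing and orientation handling that ORB applies.

```python
import random

def make_pairs(patch_size, n_bits=256, seed=0):
    """Choose the pixel pairs once; the same pairs are reused for every patch."""
    rng = random.Random(seed)

    def point():
        return rng.randrange(patch_size), rng.randrange(patch_size)

    return [(point(), point()) for _ in range(n_bits)]

def brief_descriptor(patch, pairs):
    """Pack one brightness comparison per pair into an integer used as a bit string."""
    bits = 0
    for (py, px), (qy, qx) in pairs:
        bits = (bits << 1) | (1 if patch[py][px] > patch[qy][qx] else 0)
    return bits

def hamming(a, b):
    """Number of differing bits: XOR the two strings, then count the 1s."""
    return bin(a ^ b).count("1")

# Hypothetical patches: one random patch, a brighter copy, and an unrelated patch.
rng = random.Random(1)
patch = [[rng.randrange(256) for _ in range(31)] for _ in range(31)]
brighter = [[value + 20 for value in row] for row in patch]
other = [[rng.randrange(256) for _ in range(31)] for _ in range(31)]

pairs = make_pairs(31)
d_patch = brief_descriptor(patch, pairs)
d_brighter = brief_descriptor(brighter, pairs)
d_other = brief_descriptor(other, pairs)

print(hamming(d_patch, d_brighter))  # 0: a uniform brightness shift keeps every comparison
print(hamming(d_patch, d_other))     # roughly half of the 256 bits differ
```

Reusing the same pairs for every patch is what makes two descriptors comparable bit by bit. If each patch drew its own random pairs, the Hamming distance would be meaningless.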

Comparing Descriptors for Matching

Once you have descriptors from two images, you match them by measuring how similar they are. For floating-point descriptors (SIFT, HOG), use Euclidean distance. For binary descriptors (ORB, BRIEF), use Hamming distance.

Nearest Neighbour Matching

Image A descriptors: [dA1, dA2, dA3, dA4, ...]
Image B descriptors: [dB1, dB2, dB3, dB4, ...]

For each descriptor in A:
  Find the closest descriptor in B by distance.
  If distance < threshold → declare a match.

Ratio test (Lowe's test):
  Best match distance / Second best distance < 0.75 (a typical threshold)
  → Accept as match (best is clearly closer than the runner-up = distinctive match)
  → Otherwise reject: the match is ambiguous.

Good matches → used to estimate transformation between images.
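The matching procedure above can be sketched as a brute-force loop. The function name `match_with_ratio_test` and the tiny 4-value descriptors are made up for illustration, and the sketch assumes floating-point descriptors compared by Euclidean distance. For binary descriptors, the distance call would be swapped for a Hamming distance.

```python
import math

def match_with_ratio_test(desc_a, desc_b, ratio=0.75):
    """Return (index_in_a, index_in_b) pairs that pass the ratio test."""
    matches = []
    for i, da in enumerate(desc_a):
        # Distance from this descriptor to every descriptor in B, smallest first.
        ranked = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(ranked) < 2:
            continue  # the ratio test needs a second-best candidate
        (best, j), (second, _) = ranked[0], ranked[1]
        # Same as best / second < ratio, but safe when second is 0.
        if best < ratio * second:
            matches.append((i, j))
    return matches

# Hypothetical descriptors, shortened to 4 values for readability.
desc_a = [[0.30, 0.80, 0.10, 0.60],   # has one clear partner in B
          [0.50, 0.50, 0.50, 0.50]]   # has two equally good partners in B
desc_b = [[0.31, 0.79, 0.10, 0.61],
          [0.90, 0.10, 0.70, 0.20],
          [0.45, 0.55, 0.50, 0.50],
          [0.55, 0.45, 0.50, 0.50]]

matches = match_with_ratio_test(desc_a, desc_b)
print(matches)  # [(0, 0)]
```

The second descriptor in A is rejected even though its nearest neighbor is close, because another candidate in B is just as close. That is exactly the ambiguity the ratio test is meant to filter out.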

LBP: Local Binary Pattern

LBP describes texture by comparing each pixel to its 8 neighbors. Each neighbor is marked 1 if it is at least as bright as the center and 0 if it is darker. Reading those 8 bits clockwise gives an 8-bit binary number (0–255) for each pixel. The histogram of these numbers across a region forms a texture descriptor.

LBP Calculation

Pixel neighborhood (center = 128):
  130  100  80
   90  128  200
  110  140  160

Compare each neighbor to center (128):
  130>128 → 1    100<128 → 0    80<128  → 0
   90<128 → 0    (center)      200>128 → 1
  110<128 → 0    140>128 → 1   160>128 → 1

Reading clockwise from top-left:
  1  0  0  1  1  1  0  0  =  binary 10011100 = 156

Center pixel gets LBP code = 156.
Repeat for every pixel. Build histogram of codes.
→ Histogram = texture descriptor for that region.

Use: Face recognition and texture classification (captures fine local texture patterns).
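The worked example above translates almost directly into code. The function names `lbp_code` and `lbp_histogram` are chosen for this sketch, which assumes the basic 3×3 LBP variant and treats a neighbor equal to the center as a 1 (one common convention for ties).

```python
def lbp_code(window):
    """8-bit LBP code for the center pixel of a 3x3 window."""
    center = window[1][1]
    # Neighbor positions (row, column), clockwise starting at the top-left.
    clockwise = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for y, x in clockwise:
        code = (code << 1) | (1 if window[y][x] >= center else 0)
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of an image."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            window = [row[x - 1:x + 2] for row in image[y - 1:y + 2]]
            hist[lbp_code(window)] += 1
    return hist

window = [[130, 100, 80],
          [90, 128, 200],
          [110, 140, 160]]
print(lbp_code(window))  # 156, matching the worked example
```

Because only the ordering of brightness values matters, adding a constant to every pixel, or applying any other change that preserves that ordering, leaves the codes unchanged.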

Descriptor Summary Table

Descriptor	Type	Size	Speed	Invariant to
HOG	Float vector	~1764 values (64×64 window)	Medium	Small shifts and rotations, lighting (block-normalized)
SIFT	Float vector	128 values	Slow	Scale, rotation, lighting
ORB	Binary string	256 bits	Very fast	Rotation
LBP	Histogram	256 bins	Fast	Monotonic lighting changes
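The HOG size in the table depends on the window and cell layout. The short calculation below shows where a figure like 1764 comes from, assuming a common configuration: 8×8-pixel cells, 2×2-cell blocks that slide one cell at a time, and 9 bins per cell. The function name `hog_length` is made up for this sketch.

```python
def hog_length(width, height, cell=8, block=2, bins=9):
    """Descriptor length with overlapping blocks that slide one cell at a time."""
    cells_x, cells_y = width // cell, height // cell
    blocks_x = cells_x - block + 1
    blocks_y = cells_y - block + 1
    # Each block contributes block*block cell histograms of `bins` values.
    return blocks_x * blocks_y * block * block * bins

print(hog_length(64, 64))    # 1764
print(hog_length(64, 128))   # 3780
```

The second line corresponds to the 64×128 window often used for pedestrian detection, which gives a longer descriptor than the 64×64 case listed in the table.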

Key Takeaways

A descriptor converts the appearance of an image patch into a compact number vector.
HOG captures edge directions in small cells, which makes it well suited to detecting human shapes.
SIFT produces 128-number descriptors that are robust to scale, rotation, and lighting changes.
ORB uses a binary descriptor that is compared with Hamming distance, which makes it fast enough for real-time apps.
LBP describes texture by comparing each pixel to its neighbors and is widely used in face recognition.
Matching descriptors from two images enables image stitching, object recognition, and 3D reconstruction.
