Computer Vision Image Descriptors
Feature detection finds interesting locations in an image. Descriptors describe what those locations look like — turning the visual appearance of a patch into a compact string of numbers. Two matching patches in different images produce similar descriptors, allowing the computer to recognize the same feature across photos.
Detection vs. Description
Detection answers: "Where is an interesting point?" Description answers: "What does that point look like, in numbers?" Both steps together allow image matching.
Detection + Description Pipeline
Photo A Photo B
(same building, different angle)
Step 1 – DETECT:
Found corner at (120, 85) Found corner at (210, 60)
Step 2 – DESCRIBE:
Descriptor = [0.3, 0.8, 0.1, 0.6, ...] [0.3, 0.8, 0.1, 0.6, ...]
↑ ↑
Very similar numbers → These are the SAME corner!
Step 3 – MATCH:
Connect matching descriptors across both photos.
HOG: Histogram of Oriented Gradients
HOG is one of the most widely used image descriptors. It captures the direction and strength of edges in small image patches, then summarizes them as a histogram of edge directions. The collection of these histograms across the image forms the descriptor.
HOG Step by Step
Step 1: Compute gradient (edge direction + strength) at every pixel. Gradient direction can be: 0°, 45°, 90°, 135°... Gradient magnitude: how strong the edge is. Step 2: Divide image into cells (e.g., 8×8 pixels each). +-------+-------+-------+ | cell | cell | cell | | (8×8) | (8×8) | (8×8) | +-------+-------+-------+ | cell | cell | cell | +-------+-------+-------+ Step 3: In each cell, build a histogram of gradient directions. Bin 1 (0°–20°): count = 5 ██████ Bin 2 (20°–40°): count = 12 █████████████ Bin 3 (40°–60°): count = 3 ████ ... Bin 9 (160°–180°): count = 7 ████████ Step 4: Concatenate all cell histograms into one long vector. → That vector IS the HOG descriptor.
Why HOG Works for Pedestrian Detection
Human body silhouette produces consistent edge patterns: Head: circular gradient directions Shoulders: near-horizontal edges Torso: mainly vertical edges Legs: two vertical columns A HOG descriptor captures this edge pattern precisely. A non-person (tree, car) produces a different pattern. Train a classifier on HOG → detects pedestrians reliably.
SIFT Descriptor
SIFT (Scale-Invariant Feature Transform) creates descriptors that remain useful even when the image is rotated, scaled, or photographed under different lighting. It achieves this by normalizing the patch orientation and size before computing the descriptor.
SIFT Descriptor Construction
Around a detected keypoint: Step 1: Find dominant orientation. Compute gradient directions in a 16×16 patch. The most common direction becomes the "main orientation." Rotate the patch so main orientation always points "up." → Now the descriptor is rotation-invariant. Step 2: Divide the 16×16 patch into 4×4 sub-regions. 16 sub-regions total. Step 3: In each sub-region, compute 8-bin gradient histogram. 16 sub-regions × 8 bins = 128 numbers. Step 4: Normalize the 128-number vector. → Reduces effect of lighting changes. Final descriptor: 128-dimensional vector. Similar patches → similar 128-number vectors → matched!
SIFT: Scale and Rotation Invariance
| Change in Image | SIFT Response | HOG Response |
|---|---|---|
| Rotated 90° | Same descriptor (rotation-normalized) | Different descriptor |
| Scaled 2× | Same descriptor (scale-normalized) | Different descriptor |
| Brighter lighting | Similar descriptor (normalized) | Different descriptor |
ORB: Oriented FAST and Rotated BRIEF
ORB combines the FAST detector with a modified version of the BRIEF descriptor. BRIEF stores each feature as a binary string — a sequence of 0s and 1s — making it extremely compact and fast to compare. ORB adds rotation handling to make BRIEF robust. It is the go-to descriptor for real-time applications on phones and embedded systems because it requires no patent license (SIFT and SURF are patented).
Binary Descriptor Concept
BRIEF comparison test: Pick two random pixels (p, q) in the patch. If brightness(p) > brightness(q) → record 1 If brightness(p) ≤ brightness(q) → record 0 Repeat 256 times with different (p, q) pairs: Result: 256-bit binary string Example: 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1... Matching: Count different bits (Hamming distance). Similar patches → few different bits → MATCH Different patches → many different bits → NO MATCH Speed: Comparing binary strings with XOR is very fast. Size: 256 bits = 32 bytes (vs. SIFT's 512 bytes for 128 floats)
Comparing Descriptors for Matching
Once you have descriptors from two images, you match them by measuring how similar they are. For floating-point descriptors (SIFT, HOG), use Euclidean distance. For binary descriptors (ORB, BRIEF), use Hamming distance.
Nearest Neighbour Matching
Image A descriptors: [dA1, dA2, dA3, dA4, ...] Image B descriptors: [dB1, dB2, dB3, dB4, ...] For each descriptor in A: Find the closest descriptor in B by distance. If distance < threshold → declare a match. Ratio test (Lowe's test): Best match distance / Second best distance < 0.75 → Accept as match (far from second best = distinctive match) → Rejects ambiguous matches automatically. Good matches → used to estimate transformation between images.
LBP: Local Binary Pattern
LBP describes texture by comparing each pixel to its 8 neighbors. Each neighbor either is brighter (1) or darker (0) than the center. Reading those 8 bits clockwise gives an 8-bit binary number (0–255) for each pixel. The histogram of these numbers across a region forms a texture descriptor.
LBP Calculation
Pixel neighborhood (center = 128): 130 100 80 90 128 200 110 140 160 Compare each neighbor to center (128): 130>128 → 1 100<128 → 0 80<128 → 0 90<128 → 0 (center) 200>128 → 1 110<128 → 0 140>128 → 1 160>128 → 1 Reading clockwise from top-left: 1 0 0 1 1 1 0 0 = binary 10011100 = 156 Center pixel gets LBP code = 156. Repeat for every pixel. Build histogram of codes. → Histogram = texture descriptor for that region. Use: Face recognition (captures skin texture patterns).
Descriptor Summary Table
| Descriptor | Type | Size | Speed | Invariant to |
|---|---|---|---|---|
| HOG | Float vector | ~1764 values | Medium | Small rotations |
| SIFT | Float vector | 128 values | Slow | Scale, rotation, lighting |
| ORB | Binary string | 256 bits | Very fast | Rotation |
| LBP | Histogram | 256 bins | Fast | Monotonic lighting changes |
Key Takeaways
- A descriptor converts the appearance of an image patch into a compact number vector.
- HOG captures edge directions in small cells — excellent for detecting human shapes.
- SIFT produces 128-number descriptors that are robust to scale, rotation, and lighting changes.
- ORB uses a binary descriptor — very fast comparison using Hamming distance — best for real-time apps.
- LBP describes texture by comparing each pixel to its neighbors — widely used in face recognition.
- Matching descriptors from two images enables image stitching, object recognition, and 3D reconstruction.
