CV Face Detection and Recognition

Face detection locates faces in an image. Face recognition identifies whose face it is. These are two distinct tasks — detection comes first, and recognition follows. Together, they power phone unlocking, attendance systems, security cameras, and photo organization apps.

Detection vs. Recognition

Two-Step Face Pipeline

INPUT: Group photo of 5 people

STEP 1 — FACE DETECTION:
  Output: 5 bounding boxes (location of each face)
  ┌──────────────────────────────────────────────────┐
  │  ┌────┐  ┌────┐  ┌────┐  ┌────┐  ┌────┐          │
  │  │Face│  │Face│  │Face│  │Face│  │Face│          │
  │  │ 1  │  │ 2  │  │ 3  │  │ 4  │  │ 5  │          │
  │  └────┘  └────┘  └────┘  └────┘  └────┘          │
  └──────────────────────────────────────────────────┘

STEP 2 — FACE RECOGNITION:
  Compare each detected face to a database of known faces.
  Output: [Alice, Unknown, Bob, Carol, David]
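
The two steps can be sketched as a small function that takes a detector and a recognizer as separate, swappable parts. This is only an illustration of the data flow: `run_pipeline`, `fake_detect`, `fake_recognize`, and the file name are hypothetical placeholders, not a real library API.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def run_pipeline(
    image: object,
    detect: Callable[[object], List[Box]],
    recognize: Callable[[object, Box], str],
) -> List[Tuple[Box, str]]:
    """Step 1: find every face box. Step 2: name each box independently."""
    boxes = detect(image)
    return [(box, recognize(image, box)) for box in boxes]


# Hypothetical stand-ins so the sketch runs without a real model.
def fake_detect(image: object) -> List[Box]:
    return [(10, 20, 60, 80), (100, 20, 150, 80)]


def fake_recognize(image: object, box: Box) -> str:
    known = {(10, 20, 60, 80): "Alice"}
    return known.get(box, "Unknown")


results = run_pipeline("group_photo.jpg", fake_detect, fake_recognize)
for box, name in results:
    print(box, name)
```

Keeping the two steps behind separate functions means a detector can be replaced without touching the recognition code.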

Viola-Jones: The Classical Face Detector

Viola-Jones (2001) was the first face detector to combine real-time speed with competitive accuracy, and it remained a standard choice for over a decade. It uses Haar-like features (simple rectangular patterns) evaluated with an integral image for speed, combined with a cascade of classifiers that rejects most non-face windows early.

Haar-like Features

Haar features are simple patterns of adjacent black and white rectangles.
A face has consistent regions: dark eye sockets above bright cheekbones.

Feature types (white area − black area = feature value):

TYPE 1 (two-rectangle, horizontal):
  +──────────────+
  │ WHITE (+)    │  ← forehead (bright)
  │──────────────│
  │ BLACK (−)    │  ← eye region (dark)
  +──────────────+
  Value = sum(white pixels) − sum(black pixels)
  Large positive value → likely upper face region.

TYPE 2 (three-rectangle):
  +────+────+────+
  │ B  │ W  │ B  │
  +────+────+────+
  Detects nose bridge (bright center, dark sides).

Integral image trick:
  Pre-computes cumulative pixel sums so any rectangle
  sum needs just 4 table lookups, regardless of size.
  → Thousands of features can be evaluated in milliseconds.
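
A minimal pure-Python sketch of the integral image and a TYPE 1 (two-rectangle) feature may make the trick concrete. The function names and the toy 4×4 patch are illustrative assumptions. Real detectors work on larger windows and use optimized array code.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] = sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), using 4 lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """White top half minus black bottom half (the TYPE 1 layout above)."""
    half = h // 2
    white = rect_sum(ii, x, y, w, half)
    black = rect_sum(ii, x, y + half, w, half)
    return white - black


# Toy 4x4 patch: two bright rows (200) over two dark rows (50).
patch = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]
ii = integral_image(patch)
print(rect_sum(ii, 0, 0, 4, 4))          # 2000: total of all 16 pixels
print(two_rect_feature(ii, 0, 0, 4, 4))  # 1200: bright region over dark region
```

The table is built once per image. After that, `rect_sum` costs the same four lookups whether the rectangle covers 4 pixels or 4,000.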

Cascade of Classifiers

Instead of evaluating every feature on every window:

Stage 1 (2 features):   Rejects 50% of non-faces immediately.
       ↓
Stage 2 (10 features):  Rejects 80% of remaining non-faces.
       ↓
Stage 3 (25 features):  Rejects more.
       ↓
...
Stage 20 (200+ features): Final decision.

Only ~0.01% of windows reach the final stage.
Average cost per window: far less than evaluating all features.
Result: Real-time face detection on 2001 hardware.
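
The early-rejection idea can be sketched in a few lines. This is a simplified illustration, not the Viola-Jones training procedure: the stage format, the toy `brightness` and `contrast` features, and the thresholds are all made up for the example, and a real stage combines weighted weak classifiers rather than summing raw scores.

```python
def run_cascade(window, stages):
    """Return (is_face, features_evaluated) for one candidate window.

    Each stage is (features, threshold): the stage passes only if the sum
    of its feature scores reaches the threshold. The first failing stage
    rejects the window, so later stages are never evaluated.
    """
    evaluated = 0
    for features, threshold in stages:
        score = 0.0
        for feature in features:
            score += feature(window)
            evaluated += 1
        if score < threshold:
            return False, evaluated
    return True, evaluated


# Hypothetical toy features on a window given as a flat list of intensities.
def brightness(window):
    return sum(window) / len(window)


def contrast(window):
    return max(window) - min(window)


stages = [
    ([contrast], 50.0),                           # cheap stage: 1 feature
    ([contrast, brightness, brightness], 300.0),  # costlier stage: 3 features
]

flat_window = [100] * 16          # uniform patch, no structure
textured_window = [40, 220] * 8   # strong light/dark variation

print(run_cascade(flat_window, stages))      # rejected by the first stage
print(run_cascade(textured_window, stages))  # passes both stages
```

The uniform window costs one feature evaluation and the textured one costs four. Since most windows in a typical image are background, the average cost stays close to that of the first stage.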

Deep Learning Face Detectors

Modern face detectors use CNNs for higher accuracy across varying poses, lighting, and occlusion. MTCNN (Multi-Task Cascaded Convolutional Networks) is one of the most widely used deep face detectors: it runs three small CNNs in sequence to detect faces and locate the landmarks needed to align them.

MTCNN Three-Stage Pipeline

Stage 1 — P-Net (Proposal Network):
  Tiny CNN, fast.
  Scans image at multiple scales.
  Proposes candidate face windows (many, rough).

Stage 2 — R-Net (Refine Network):
  Medium CNN.
  Takes candidate windows from P-Net.
  Rejects most false alarms. Refines box position.

Stage 3 — O-Net (Output Network):
  Larger CNN.
  Accurate face box + 5 facial landmark positions:
    Left eye, right eye, nose tip, left mouth corner, right mouth corner.

Output:
  [Box: (x1,y1,x2,y2), Landmarks: (x_eye_L,y_eye_L, x_eye_R,...)]
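
The "multiple scales" step in P-Net can be sketched as an image pyramid. The defaults below (a 12-pixel network input, a minimum face size of 20 pixels, and a shrink factor of 0.709) are values commonly seen in MTCNN implementations. Treat them as assumptions here, and `pyramid_scales` as an illustrative helper rather than a library function.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, net_input=12):
    """Scales at which to resize the image before running the proposal net.

    The first scale maps a face of min_face_size pixels onto the network's
    input size; each later scale shrinks the image further by `factor`
    until the shorter side would be smaller than the network input.
    """
    scales = []
    scale = net_input / min_face_size
    shorter_side = min(height, width) * scale
    while shorter_side >= net_input:
        scales.append(scale)
        scale *= factor
        shorter_side *= factor
    return scales


scales = pyramid_scales(480, 640)
for s in scales:
    print(f"scale {s:.3f} -> image {round(640 * s)}x{round(480 * s)}")
```

Running the same small network on each resized copy lets a fixed-size window find both large and small faces. Raising `min_face_size` shortens the pyramid and speeds up detection, at the cost of missing smaller faces.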

Face Alignment

Face recognition is most accurate when the face is aligned — eyes at the same height, face centered and upright. Alignment uses the detected facial landmarks to compute and apply a geometric transformation.

Face Alignment Process

Detected landmarks:
  Left eye:   (82, 95)
  Right eye:  (142, 90)
  Nose:       (112, 130)

Target template (standard aligned face):
  Left eye:   (50, 50)
  Right eye:  (100, 50)
  Nose:       (75, 80)

Compute transformation (rotation + scale + translation)
  that maps detected landmarks to template positions.

Apply transformation to entire face crop.
Result: Standardized, aligned 112×112 face image.

Why align?
  The recognition network was trained on aligned faces.
  Misaligned input → significantly lower accuracy.
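
The transformation can be worked out by hand for the landmark values above. The sketch below assumes only the two eye positions are used: two point pairs determine a rotation + scale + translation exactly, and representing each point as a complex number keeps the algebra short. Production code typically fits all landmarks by least squares and then warps the image. The helper names are illustrative.

```python
import cmath
import math


def similarity_from_eyes(src_left, src_right, dst_left, dst_right):
    """Solve z' = a*z + b (complex a, b) from two point correspondences.

    Treating (x, y) as x + yj, multiplying by `a` rotates and scales,
    and adding `b` translates.
    """
    s1, s2 = complex(*src_left), complex(*src_right)
    t1, t2 = complex(*dst_left), complex(*dst_right)
    a = (t2 - t1) / (s2 - s1)
    b = t1 - a * s1
    return a, b


def transform_point(a, b, point):
    """Apply the transform to one (x, y) point."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Detected eyes -> template eyes, using the numbers from the example above.
a, b = similarity_from_eyes((82, 95), (142, 90), (50, 50), (100, 50))

print("scale:", round(abs(a), 3))
print("rotation (degrees):", round(math.degrees(cmath.phase(a)), 2))
print("nose maps to:", tuple(round(v, 1) for v in transform_point(a, b, (112, 130))))
```

The eyes land exactly on their template positions by construction. The nose lands near, but not exactly on, the template nose at (75, 80), which is why fitting all landmarks at once is usually preferred.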

Face Recognition: From Detection to Identity

Face recognition compares a detected face to a database of known faces. Modern systems use embedding networks: CNNs that convert a face image into a compact vector of numbers (an embedding). Photos of the same person produce nearby embeddings, while photos of different people produce distant ones. The identity is the closest embedding in the database, provided its distance falls below a threshold; otherwise the face is reported as unknown.

Face Embedding Concept

Face Recognition CNN:
  [Aligned face image (112×112)]
           ↓
  [Deep CNN: many convolutional layers]
           ↓
  [512-dimensional embedding vector]
       [0.23, −0.15, 0.87, 0.42, ...]

Two photos of Alice:
  Photo 1 embedding: [0.23, −0.15, 0.87, ...]
  Photo 2 embedding: [0.24, −0.14, 0.86, ...]
  Distance = 0.05  ← Very similar → SAME PERSON

Alice and Bob:
  Alice embedding:  [0.23, −0.15, 0.87, ...]
  Bob embedding:    [0.71,  0.50, −0.30, ...]
  Distance = 1.42  ← Very different → DIFFERENT PEOPLE

Threshold (e.g., 0.6):
  Distance < 0.6 → Same person
  Distance ≥ 0.6 → Different person
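
A minimal sketch of the lookup, assuming Euclidean distance and toy 3-dimensional embeddings in place of real 512-dimensional ones. The `identify` function and the tiny `database` are illustrative, and a suitable threshold depends on the model that produced the embeddings.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(query, database, threshold=0.6):
    """Return (name, distance) of the nearest enrolled face, or "Unknown"."""
    best_name, best_dist = "Unknown", float("inf")
    for name, embedding in database.items():
        d = euclidean(query, embedding)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_dist >= threshold:
        return "Unknown", best_dist
    return best_name, best_dist


# Toy enrolled embeddings (truncated to 3 numbers for readability).
database = {
    "Alice": [0.23, -0.15, 0.87],
    "Bob": [0.71, 0.50, -0.30],
}

print(identify([0.24, -0.14, 0.86], database))  # close to Alice's entry
print(identify([-0.50, 0.90, 0.10], database))  # far from everyone -> Unknown
```

The threshold trades off two kinds of error. Lowering it reduces false matches but rejects more genuine ones, so it is usually tuned on held-out face pairs.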

Triplet Loss: Training the Embedding Network

A common way to train embedding networks is Triplet Loss. Each training example has three images: an Anchor (reference face), a Positive (a different photo of the same person), and a Negative (a photo of a different person). The loss pulls the Anchor and Positive embeddings closer together and pushes the Anchor and Negative embeddings farther apart.

Triplet Loss Diagram

       Anchor (Alice, photo 1)
                 ●
         close  / \  far
               /   \
    Positive  ●     ●  Negative
(Alice, photo 2)   (Bob, photo 1)

Training goal:
  dist(Anchor, Positive) + margin < dist(Anchor, Negative)

where margin is a small constant such as 0.2 (the minimum required separation).

If goal not met → loss > 0 → backpropagate → adjust weights.
If goal met → loss = 0 → no update needed.
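
Written as a formula, the training goal becomes loss = max(0, dist(Anchor, Positive) − dist(Anchor, Negative) + margin). The sketch below computes it for a single triplet using plain Euclidean distance, as in the goal above. Some formulations use squared distances instead, and real training averages the loss over many triplets inside a deep learning framework.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is at least `margin` farther than the positive."""
    d_ap = math.dist(anchor, positive)  # anchor-to-positive distance
    d_an = math.dist(anchor, negative)  # anchor-to-negative distance
    return max(0.0, d_ap - d_an + margin)


# Goal already met: the positive is very close, the negative is far away.
easy = triplet_loss([0.23, -0.15, 0.87], [0.24, -0.14, 0.86], [0.71, 0.50, -0.30])

# Goal violated: the negative (0.5 away) is closer than the positive (1.0 away).
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.5, 0.0])

print(easy)            # 0.0, so no weight update is needed
print(round(hard, 3))  # 0.7, a positive loss that drives an update
```

Only triplets with a positive loss contribute gradients, which is why training pipelines often search for "hard" triplets like the second one.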

Privacy and Ethics in Face Recognition

Face recognition is a powerful technology used in law enforcement, commercial security, and consumer apps. Its use raises important privacy considerations. In many regions, collecting and storing facial data requires user consent and is regulated by data protection laws. Responsible deployment includes transparent user notification, data minimization, and regular bias audits to ensure the system performs equally across different demographic groups.

Real-World Applications

Smartphone unlock – Matches your face in real time to unlock the device.
Attendance systems – Schools and offices record attendance by recognizing enrolled faces.
Photo organization – Apps automatically group photos by person.
Airport boarding – Biometric gates match passengers to their passport photo.
Security surveillance – Alerts when a known person of interest appears in camera view.

Key Takeaways

Face detection locates faces in an image. Face recognition identifies whose face it is.
Viola-Jones uses Haar features + cascade classifiers for real-time classical detection.
MTCNN uses three small CNNs in sequence for accurate deep learning face detection with landmarks.
Face alignment standardizes face position before recognition to maximize accuracy.
Embedding networks convert faces into compact vectors — similar faces have close vectors.
Triplet Loss trains the embedding network to pull same-person faces together and push different-person faces apart.
