Computer Vision Image Classification

Image classification assigns a single label to an entire image. Given an image, the system outputs a class name ("cat," "car," "forest") along with a confidence score. It is one of the most fundamental tasks in computer vision, and many other vision tasks build on it.

The Classification Pipeline

Every image classification system takes an image as input, processes it through layers of computation, and produces a probability for each possible class. The class with the highest probability becomes the prediction.

Classification Flow

[Input Image]        [Processing]             [Output]
  ┌──────┐     →  Feature extraction  →   Cat:   0.89
  │ 🐱   │     →  Pattern matching    →   Dog:   0.07
  │Photo │     →  Probability calc   →   Bird:  0.02
  └──────┘                               Other: 0.02
                                              ↑
                                     Predicted class: CAT
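
The last step of the flow above, picking the class with the highest probability, can be sketched in a few lines of Python. The function name `predict_label` and the dictionary layout are illustrative assumptions, not part of any particular library.

```python
def predict_label(probabilities):
    """Return (label, confidence) for the highest-probability class.

    `probabilities` is a hypothetical dict mapping class name -> probability.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# Scores taken from the diagram above.
scores = {"Cat": 0.89, "Dog": 0.07, "Bird": 0.02, "Other": 0.02}
label, confidence = predict_label(scores)
print(f"Predicted class: {label} ({confidence:.2f})")  # Predicted class: Cat (0.89)
```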

Traditional Classification: Bag of Visual Words

Before deep learning, a widely used approach was "Bag of Visual Words" (BoVW), inspired by text classification. It extracts local features from each image, clusters the features from the training set into a visual vocabulary, and represents each image as a histogram of how often each "visual word" appears in it.

Bag of Visual Words Analogy

TEXT CLASSIFICATION (reference):
  Document: "The cat sat on the mat"
  Word histogram: {cat:1, sat:1, mat:1, the:2, on:1}
  → This histogram describes the document.

VISUAL WORDS:
  Image features (SIFT keypoints) = "words"
  Cluster all features into K groups = "vocabulary"
  Each cluster center = one "visual word"

Image histogram: {visual_word_5: 12, visual_word_18: 3, ...}
  → This histogram describes the image's content.

Classifier (SVM) trained on histograms:
  Beach images   → lots of visual_word_3 (horizontal edges) + visual_word_7 (blue blobs)
  Forest images  → lots of visual_word_12 (leaf textures) + visual_word_20 (vertical edges)
  → SVM learns these patterns and classifies new images.
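
The histogram step can be sketched as follows. This is a minimal illustration that assumes the vocabulary (the K cluster centers) has already been computed, for example with k-means. The function names and the toy 2-D descriptors are made up for the example; real SIFT descriptors have 128 dimensions.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda i: math.dist(descriptor, vocabulary[i]),
    )


def bovw_histogram(descriptors, vocabulary):
    """Count how many descriptors are assigned to each visual word."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(vocabulary))]


# Toy data (assumed for illustration): K = 3 cluster centers, 5 descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.1), (0.0, 0.1)]

print(bovw_histogram(descriptors, vocabulary))  # [2, 2, 1]
```

The resulting fixed-length list is what a classifier such as an SVM would receive as its input vector, regardless of how many keypoints the image produced.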

Deep Learning for Classification: CNNs

Convolutional Neural Networks (CNNs) learn to classify images automatically from labeled training data. Given enough labeled data, they generally outperform traditional methods because they learn features directly from the data rather than relying on manually engineered descriptors. (CNNs are covered in detail in Topic 12.)

CNN Classification Process (Overview)

[Image: 224 × 224 × 3]
       ↓
[Conv Layer 1] → Detects edges and colors
       ↓
[Conv Layer 2] → Detects textures and shapes
       ↓
[Conv Layer 3] → Detects object parts (ear, wheel, wing)
       ↓
[Conv Layer 4] → Detects complete objects
       ↓
[Fully Connected] → Combine all features
       ↓
[Softmax Output]
  Cat:    0.89
  Dog:    0.07
  Rabbit: 0.04
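
To give a feel for what a first convolutional layer computes, here is a small pure-Python sketch of a single convolution with a hand-written vertical-edge filter. It is only an illustration: the image, the filter values, and the function name `conv2d` are assumptions, and a trained CNN learns its filter weights from data rather than having them written by hand.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# 4x6 grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written vertical-edge filter (illustrative values).
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

for row in conv2d(image, vertical_edge):
    print(row)  # each row prints [0, 3, 3, 0]
```

The output is zero over the flat regions and large where the window straddles the dark-to-bright boundary, which is the sense in which an early layer "detects edges."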

Softmax: Turning Scores into Probabilities

The final layer of a classification network produces raw scores (called logits) for each class. The softmax function converts these scores into probabilities that sum to 1.0, making them easy to interpret.

Softmax Calculation

Raw scores (logits) from network:
  Cat:  3.5
  Dog:  1.2
  Bird: 0.1

Softmax converts each:
  e^3.5 = 33.1
  e^1.2 =  3.3
  e^0.1 =  1.1
  Sum   = 37.5

Probabilities:
  Cat:  33.1 / 37.5 = 0.88 (88%)
  Dog:   3.3 / 37.5 = 0.09 ( 9%)
  Bird:  1.1 / 37.5 = 0.03 ( 3%)
  Total:              1.00 (100%)

Prediction: Cat (highest probability).
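
The same calculation can be written as a short Python function. This sketch uses only the standard library. Subtracting the largest logit before exponentiating is a common numerical-stability trick that is not part of the worked example above, and it does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids
    # overflow in exp() when logits are large.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["Cat", "Dog", "Bird"]
logits = [3.5, 1.2, 0.1]
probs = softmax(logits)

for name, p in zip(classes, probs):
    print(f"{name}: {p:.2f}")
# Cat: 0.88
# Dog: 0.09
# Bird: 0.03
```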

Loss Function: Measuring Mistakes

During training, the model compares its prediction to the correct label using a loss function. Cross-entropy loss is the standard choice for classification. A high loss means the model assigned a low probability to the correct class; a loss near zero means it assigned a probability near 1. The training process adjusts the model's parameters to reduce the loss.

Cross-Entropy Loss Example

True label: Cat (position 0 in the list)
Predicted probabilities: [Cat: 0.88, Dog: 0.09, Bird: 0.03]

Cross-entropy loss = -log(probability of correct class)    (natural log)
                   = -log(0.88)
                   = 0.128    ← Low loss = mostly correct

If the model predicted: [Cat: 0.10, Dog: 0.80, Bird: 0.10]
Loss = -log(0.10) = 2.30     ← High loss = very wrong

Training goal: Minimize loss → correct predictions get higher probability.
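
The two loss values above can be reproduced with a few lines of Python. This is a minimal single-example sketch. The function name `cross_entropy` and the `eps` safeguard against log(0) are assumptions added for illustration.

```python
import math


def cross_entropy(probabilities, true_index, eps=1e-12):
    """Negative natural log of the probability given to the correct class."""
    # eps (an assumed safeguard) prevents log(0) if a probability is exactly 0.
    return -math.log(max(probabilities[true_index], eps))


good = [0.88, 0.09, 0.03]  # Cat, Dog, Bird: confident and correct
bad = [0.10, 0.80, 0.10]   # most probability on the wrong class

print(f"{cross_entropy(good, 0):.3f}")  # 0.128
print(f"{cross_entropy(bad, 0):.2f}")   # 2.30
```

In practice the loss is averaged over a batch of training examples before the parameters are updated.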

Famous Classification Benchmarks

Benchmarks are standard datasets used to compare classification models. The most famous is ImageNet: the subset used for the challenge has about 1.2 million training images spread across 1000 categories. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), held from 2010 to 2017, drove many of the major advances in deep learning for computer vision.

ImageNet Challenge Results Timeline

Year  Model                  Top-5 Error  Key Innovation
2011  Traditional (non-CNN)  ~26%         Hand-crafted features
2012  AlexNet                15.3%        First deep CNN to win ILSVRC
2014  VGGNet                 7.3%         Deep, simple 3×3 convolutions
2015  ResNet                 3.6%         Skip connections (residual blocks)
2017  SENet                  2.3%         Channel attention

A trained human annotator scores roughly 5% top-5 error on ImageNet. ResNet's 3.6% in 2015 is the first result in this table to fall below that estimate.

Multi-Label Classification

Standard classification assigns exactly one label per image. Multi-label classification allows multiple labels. A beach photo might receive labels: "outdoor," "water," "sand," "people," and "sunset" — all at once.

Single-Label vs. Multi-Label

Photo: sunset at a beach with people swimming

SINGLE-LABEL:
  Predicted: Beach  (only the dominant category)

MULTI-LABEL:
  Predicted: [Beach=0.96, Sunset=0.91, People=0.84,
              Water=0.98, Outdoor=0.99]

  Each class gets its own independent probability (0 to 1), typically from a sigmoid.
  Classes with probability > 0.5 are flagged as present.
  No requirement that probabilities sum to 1.
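
A minimal sketch of multi-label prediction, assuming each class score is passed through its own sigmoid and then compared against the 0.5 threshold. The raw scores and the extra "Indoor" class are hypothetical values chosen to show one label being rejected.

```python
import math


def sigmoid(x):
    """Squash a raw score into the range (0, 1), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def multi_label_predict(logits, threshold=0.5):
    """Return {label: probability} for every class above the threshold."""
    probs = {label: sigmoid(score) for label, score in logits.items()}
    return {label: p for label, p in probs.items() if p > threshold}


# Hypothetical raw scores for one photo.
logits = {"Beach": 3.2, "Sunset": 2.3, "People": 1.7, "Indoor": -4.0}
present = multi_label_predict(logits)

print(sorted(present))  # ['Beach', 'People', 'Sunset']
```

Because each class is scored independently, the kept probabilities here add up to well over 1, unlike the softmax output of a single-label classifier.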

Top-1 and Top-5 Accuracy

Top-1 accuracy counts a prediction as correct only if the highest-scoring class matches the true label. Top-5 accuracy counts it as correct if the true label appears anywhere in the five highest-scoring predictions. Top-5 is more lenient and is often used when classes are ambiguous (for example, a dog breed that closely resembles another breed).

Top-1 vs. Top-5 Example

True label: Siberian Husky

Model predictions (ranked):
  1. Alaskan Malamute  ← Top-1 prediction (wrong → Top-1 error)
  2. Siberian Husky    ← Correct label here (Top-5 correct!)
  3. German Shepherd
  4. Border Collie
  5. Samoyed

Top-1: WRONG (predicted Malamute, not Husky)
Top-5: CORRECT (Husky is in top 5)
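
The check above generalizes to any k. The sketch below assumes predictions arrive as a list of labels already sorted from most to least likely. The function names are illustrative.

```python
def top_k_correct(ranked_labels, true_label, k):
    """True if the true label is among the first k ranked predictions."""
    return true_label in ranked_labels[:k]


def top_k_accuracy(all_rankings, true_labels, k):
    """Fraction of samples whose true label appears in the top k."""
    hits = sum(
        top_k_correct(ranking, truth, k)
        for ranking, truth in zip(all_rankings, true_labels)
    )
    return hits / len(true_labels)


ranking = [
    "Alaskan Malamute", "Siberian Husky", "German Shepherd",
    "Border Collie", "Samoyed",
]

print(top_k_correct(ranking, "Siberian Husky", k=1))  # False
print(top_k_correct(ranking, "Siberian Husky", k=5))  # True
```

Top-k error, as reported in benchmark tables, is simply one minus the top-k accuracy.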

Real-World Applications

Plant disease diagnosis – Photo of a leaf classified as "healthy," "powdery mildew," or "rust."
Food calorie tracking apps – Classify a meal photo to estimate its nutritional content.
Satellite land use analysis – Classify each satellite tile as urban, forest, water, or farmland.
Skin condition screening – Flag skin photos that show conditions requiring further examination.
Manufacturing defect detection – Classify product images as "pass" or "fail."

Key Takeaways

Image classification assigns one (or more) labels to an entire image.
Softmax converts raw network scores into probabilities that sum to 1.
Cross-entropy loss penalizes low probability on the correct class; training minimizes it.
The ImageNet benchmark drove major CNN advances between 2012 and 2017.
Top-5 accuracy is more lenient than top-1: the true answer only needs to appear among the top 5 predictions.
Multi-label classification assigns multiple labels per image and is used for rich content tagging.
