Computer Vision Data Augmentation

Data augmentation artificially increases the size and diversity of a training dataset by applying label-preserving transformations to existing images. Training on these varied versions helps the model recognize objects regardless of angle, lighting, size, or position, which makes it more robust in real-world conditions.

Why Augmentation Is Necessary

Neural networks learn from examples, and more varied examples generally produce a model that generalizes better. Without augmentation, a model trained only on photos of dogs in sunlit parks may fail to recognize a dog photographed indoors at night. Augmentation exposes the model to a wider range of conditions without the cost of collecting new real images.

Without vs. With Augmentation

Training data: 500 photos of apples (all on a white background, front view).

WITHOUT AUGMENTATION:
  Model learns: "An apple is a red/green circle on white background."
  Fails on:    Apple in a basket, apple cut in half, apple in hand.

WITH AUGMENTATION:
  Each training photo → 20+ variants:
    Rotated, flipped, zoomed, darkened, brightened, blurred...
  Effective dataset: 500 × 20 = 10,000+ varied examples.
  Model learns: "An apple is recognizable in many positions and conditions."

Geometric Transformations

Geometric augmentations change the spatial arrangement of pixels. They simulate different camera angles, distances, and orientations.

Common Geometric Augmentations

ORIGINAL:
  ┌──────────────┐
  │              │
  │    🍎        │
  │              │
  └──────────────┘

HORIZONTAL FLIP:         VERTICAL FLIP:
  ┌──────────────┐         ┌──────────────┐
  │        🍎    │         │              │
  │              │         │    🍎        │ ← (less common,
  └──────────────┘         └──────────────┘   avoid if unnatural)

ROTATION (+15°):          ROTATION (−15°):
  ┌──────────────┐         ┌──────────────┐
  │   🍎         │         │      🍎      │
  │              │         │              │
  └──────────────┘         └──────────────┘

CROP + RESIZE:            TRANSLATION:
  ┌──────┐                 ┌──────────────┐
  │  🍎  │  → resize →     │         🍎   │
  └──────┘  original size  └──────────────┘

PERSPECTIVE WARP:   (simulates viewing from different angles)
  ┌─────────────┐
  │  🍎         │   ← as if camera tilted left
  │             /
  └────────────╱
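
The simpler transformations above can be sketched with NumPy array slicing. This is a minimal illustration, assuming the image is an array of shape (H, W) or (H, W, C); the function names are made up for this example. Arbitrary-angle rotation and perspective warps need interpolation and are left out.

```python
import numpy as np

def hflip(img):
    """Mirror left-right by reversing the column axis."""
    return img[:, ::-1]

def vflip(img):
    """Mirror top-bottom by reversing the row axis."""
    return img[::-1, :]

def crop(img, x, y, w, h):
    """Cut out a w-by-h window whose top-left corner is (x, y)."""
    return img[y:y + h, x:x + w]

def translate(img, dx, dy, fill=0):
    """Shift content by (dx, dy) pixels; uncovered pixels get `fill`."""
    h, w = img.shape[:2]
    out = np.full_like(img, fill)
    # Source window that is still inside the frame after the shift.
    x0, x1 = max(0, -dx), min(w, w - dx)
    y0, y1 = max(0, -dy), min(h, h - dy)
    if x0 < x1 and y0 < y1:
        out[y0 + dy:y1 + dy, x0 + dx:x1 + dx] = img[y0:y1, x0:x1]
    return out

img = np.arange(12).reshape(3, 4)
print(hflip(img))
print(translate(img, dx=1, dy=0))
```

A crop is normally followed by a resize back to the network input size, which requires an image library and is not shown here.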

When NOT to Flip

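Task: Digit recognition (0–9)
  "6" rotated 180° (flipped both horizontally and vertically) → looks like "9"  ← WRONG LABEL!
  A single flip produces a mirrored shape that is not a valid digit.

Task: Medical scan analysis
  The liver is on the right side of the body.
  A horizontal flip moves it to the left → misleads the model.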

Rule: Only use augmentations that do NOT change the label meaning.
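
One way to apply this rule is to test a candidate transform on class templates before using it. The sketch below assumes toy 5×3 bitmaps for "6" and "9" (invented for this example, not taken from a real dataset) and reports which transforms turn one class into the other.

```python
import numpy as np

# Toy seven-segment-style bitmaps (hypothetical templates).
SIX = np.array([[1, 1, 1],
                [1, 0, 0],
                [1, 1, 1],
                [1, 0, 1],
                [1, 1, 1]])
NINE = np.array([[1, 1, 1],
                 [1, 0, 1],
                 [1, 1, 1],
                 [0, 0, 1],
                 [1, 1, 1]])
GLYPHS = {"6": SIX, "9": NINE}

def label_collisions(transform):
    """Return (src, dst) pairs where transforming class src yields class dst."""
    return [(s, d)
            for s, a in GLYPHS.items()
            for d, b in GLYPHS.items()
            if s != d and np.array_equal(transform(a), b)]

def rot180(a):
    """Rotate by 180 degrees (same as flipping both axes)."""
    return np.rot90(a, 2)

print("rot180:", label_collisions(rot180))
print("fliplr:", label_collisions(np.fliplr))
print("flipud:", label_collisions(np.flipud))
```

With these templates, only the 180° rotation maps one digit onto the other. The single flips still produce shapes that are not valid digits, so they would be excluded as well.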

Photometric Transformations

Photometric augmentations change the color and brightness of pixels without moving them. They simulate different lighting conditions, camera exposures, and color environments.

Common Photometric Augmentations

Augmentation          What Changes                         Why Useful
Brightness jitter     All pixels lighter/darker            Different room lighting
Contrast adjustment   Range between dark/bright stretched  Overexposed or underexposed photos
Saturation jitter     Colors more vivid or washed out      Old cameras, faded images
Hue shift             Subtle color tint added              Warm vs. cool lighting
Grayscale conversion  Color removed                        Black-and-white cameras
Gaussian noise        Random pixel variation added         Noisy camera sensors
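
Several rows of the table can be sketched in a few lines of NumPy. This is an illustration only, assuming RGB images stored as float arrays in [0, 1] with shape (H, W, 3); the function names and the example factors are assumptions. Saturation and hue changes need a color-space conversion and are omitted.

```python
import numpy as np

def adjust_brightness(img, factor):
    """Scale all pixel values; factor > 1 brightens, factor < 1 darkens."""
    return np.clip(img * factor, 0.0, 1.0)

def adjust_contrast(img, factor):
    """Stretch (factor > 1) or compress (factor < 1) values around the mean."""
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 1.0)

def to_grayscale(img):
    """Weighted RGB sum (BT.601 luma weights), returned with 3 equal channels."""
    gray = img @ np.array([0.299, 0.587, 0.114])
    return np.repeat(gray[..., None], 3, axis=-1)

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))
brighter = adjust_brightness(img, 1.2)
noisy = add_gaussian_noise(img, sigma=0.05, rng=rng)
```

None of these functions move pixels, so labels and bounding boxes stay valid without any adjustment.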

Advanced Augmentation Techniques

Cutout / Random Erasing

Cutout randomly masks a rectangular patch of the image, replacing it with zeros (black) or, in Random Erasing, with random noise. This forces the model to classify from partial information, which makes it more robust when parts of an object are occluded in the real world.

Original image:         After Cutout:
  ┌──────────────┐        ┌──────────────┐
  │              │        │    ████      │
  │    Dog       │   →    │   ████Dog    │  ← region masked with a black rectangle
  │              │        │              │
  └──────────────┘        └──────────────┘

The model learns to classify "Dog" even with part of it missing.
Real-world scenario: A dog partially behind a fence or car.
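
A minimal sketch of the masking step, assuming a NumPy image of shape (H, W, C). The function name, the square patch, and the choice to let the patch be clipped at the image border are assumptions made for this example.

```python
import numpy as np

def cutout(img, size, rng, fill=0.0):
    """Return a copy of img with one randomly placed square patch set to `fill`."""
    h, w = img.shape[:2]
    # Pick a random patch centre; the patch may be clipped at the borders.
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.copy()
    out[y0:y1, x0:x1] = fill
    return out

rng = np.random.default_rng(0)
img = np.ones((32, 32, 3))
masked = cutout(img, size=8, rng=rng)
print("masked pixels:", int((masked[..., 0] == 0).sum()))
```

The label is left unchanged: the image is still a "Dog", only with less evidence visible.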

Mixup

Mixup blends two training images and their labels using the same mixing ratio, which is typically sampled at random (the original paper draws it from a Beta distribution). If 70% of the blend comes from a cat image and 30% from a dog image, the label becomes: cat=0.7, dog=0.3. This encourages the network to produce smooth probability distributions rather than overconfident predictions.

Image A (cat) × 0.7   +   Image B (dog) × 0.3   =   Mixed image
Label: [cat=0.7, dog=0.3]

Network must predict proportional probabilities.
Effect: Reduces overconfidence, improves calibration.
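
The blend above can be written as a short NumPy sketch, assuming one-hot label vectors and images of equal shape. The function names are illustrative, and alpha=0.2 is an example value rather than a recommendation.

```python
import numpy as np

def mixup(img_a, label_a, img_b, label_b, lam):
    """Blend two images and their one-hot labels with the same ratio lam."""
    img = lam * img_a + (1.0 - lam) * img_b
    label = lam * label_a + (1.0 - lam) * label_b
    return img, label

def sample_lambda(rng, alpha=0.2):
    """Draw the mixing ratio from Beta(alpha, alpha)."""
    return rng.beta(alpha, alpha)

cat, dog = np.array([1.0, 0.0]), np.array([0.0, 1.0])
img_cat, img_dog = np.ones((4, 4, 3)), np.zeros((4, 4, 3))

mixed, label = mixup(img_cat, cat, img_dog, dog, lam=0.7)
print(label)  # cat weight 0.7, dog weight 0.3
```

Training then uses the soft label directly as the target in the cross-entropy loss.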

CutMix

CutMix cuts a patch from one image and pastes it into another. The labels are mixed in proportion to the visible area of each image. It combines the benefits of Cutout and Mixup.

Image A (cat, 75% area):   Image B (dog, 25% area):
  ┌──────────────┐           ┌──────────────┐
  │              │           │   ████████   │
  │    Cat       │    +      │   ████████   │ ← patch from B
  │              │           │              │
  └──────────────┘           └──────────────┘

Result:
  ┌──────────────┐
  │   ████████   │  ← dog patch (25% of image)
  │   █dog████   │
  │    Cat       │  ← remaining cat area (75%)
  └──────────────┘
  Label: [cat=0.75, dog=0.25]
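
A minimal sketch of the paste-and-relabel step, assuming two images of equal shape and one-hot labels. Here the patch position and size are passed in explicitly to keep the arithmetic visible, whereas in practice they are sampled at random. The function name and argument order are assumptions.

```python
import numpy as np

def cutmix(img_a, label_a, img_b, label_b, x0, y0, pw, ph):
    """Paste the pw-by-ph patch at (x0, y0) from img_b into a copy of img_a."""
    h, w = img_a.shape[:2]
    # Clip the patch to the image so the area used for the label is exact.
    x1, y1 = min(w, x0 + pw), min(h, y0 + ph)
    out = img_a.copy()
    out[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    frac_b = ((y1 - y0) * (x1 - x0)) / (h * w)
    label = (1.0 - frac_b) * label_a + frac_b * label_b
    return out, label

cat, dog = np.array([1.0, 0.0]), np.array([0.0, 1.0])
img_cat, img_dog = np.zeros((4, 4)), np.ones((4, 4))

# A 2x2 patch covers 4 of 16 pixels, i.e. 25% of the image.
mixed, label = cutmix(img_cat, cat, img_dog, dog, x0=1, y0=0, pw=2, ph=2)
print(label)  # cat weight 0.75, dog weight 0.25
```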

AutoAugment and RandAugment

AutoAugment uses reinforcement learning to search for an effective combination of augmentations and magnitudes for a given dataset, replacing manual trial and error. RandAugment simplifies this by uniformly sampling N operations from a fixed set and applying them at a single shared magnitude M, which achieves similar results without the expensive search.

RandAugment parameters:
  N = number of augmentations to apply per image (e.g., 2)
  M = magnitude of each augmentation (e.g., 9 on a 0–30 scale)

For each image, randomly pick N augmentations from:
  [Rotate, ShearX, ShearY, TranslateX, TranslateY,
   Brightness, Color, Contrast, Sharpness, Posterize,
   Solarize, Equalize, AutoContrast, Invert, Cutout...]

Apply them at magnitude M → automatically diverse augmentation.
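
The sampling step can be sketched in plain Python. This only picks the operations and their strength; applying them to pixels is left to an image library. The operation names are copied from the list above, while the function name and the normalization of M to a 0-1 strength are assumptions.

```python
import random

OPS = ["Rotate", "ShearX", "ShearY", "TranslateX", "TranslateY",
       "Brightness", "Color", "Contrast", "Sharpness", "Posterize",
       "Solarize", "Equalize", "AutoContrast", "Invert", "Cutout"]

def rand_augment_policy(n, m, rng, max_m=30):
    """Pick n operations uniformly (with replacement), all at magnitude m."""
    if not 0 <= m <= max_m:
        raise ValueError(f"magnitude must be in [0, {max_m}]")
    strength = m / max_m  # shared strength on a 0-1 scale
    return [(op, strength) for op in rng.choices(OPS, k=n)]

rng = random.Random(0)
for _ in range(3):
    # A fresh policy is drawn for every image.
    print(rand_augment_policy(n=2, m=9, rng=rng))
```

Because only N and M are tuned, a small grid search over two integers replaces the learned policy search.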

Augmentation for Object Detection

When augmenting images for object detection, bounding box annotations must transform along with the image. Flipping the image also flips the bounding box. Cropping shifts the remaining boxes into the crop's coordinate frame, clips boxes that extend past its edges, and removes boxes that fall outside the crop region.

Bounding Box Transform During Augmentation

Original:
  Image (400×300) with bounding box: [x=100, y=50, w=80, h=120]

After horizontal flip:
  Image flipped → bounding box also flipped:
  New x = image_width - (x + w) = 400 - (100 + 80) = 220
  New box: [x=220, y=50, w=80, h=120]

After random crop to 320×240 (starting at x=50, y=20):
  New box position = [x = 100−50 = 50, y = 50−20 = 30, w=80, h=120]
  Check if box is still mostly inside crop region → keep or discard.
  Here the box spans x 50–130 and y 30–150, fully inside 320×240 → keep.
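
The two box updates above can be checked with a short sketch, assuming boxes in (x, y, w, h) format with the origin at the top-left corner. The function names and the 50% visibility threshold used for "mostly inside" are assumptions for this example.

```python
def hflip_box(box, image_width):
    """Mirror an (x, y, w, h) box for a horizontally flipped image."""
    x, y, w, h = box
    return (image_width - (x + w), y, w, h)

def crop_box(box, crop_x, crop_y, crop_w, crop_h, min_visible=0.5):
    """Move a box into crop coordinates, clip it, or drop it (return None)."""
    x, y, w, h = box
    # Shift into the crop's coordinate frame.
    x0, y0 = x - crop_x, y - crop_y
    x1, y1 = x0 + w, y0 + h
    # Clip to the crop boundaries.
    cx0, cy0 = max(0, x0), max(0, y0)
    cx1, cy1 = min(crop_w, x1), min(crop_h, y1)
    vis_w, vis_h = max(0, cx1 - cx0), max(0, cy1 - cy0)
    # Discard boxes with too little of their area left.
    if vis_w * vis_h < min_visible * w * h:
        return None
    return (cx0, cy0, vis_w, vis_h)

box = (100, 50, 80, 120)
print(hflip_box(box, image_width=400))   # (220, 50, 80, 120)
print(crop_box(box, 50, 20, 320, 240))   # (50, 30, 80, 120)
```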

Augmentation Pipelines in Practice

Typical augmentation pipeline for image classification:

  [Raw Image]
       ↓
  RandomHorizontalFlip (50% chance)
       ↓
  Resize (to 256) + RandomCrop (to 224)
       ↓
  ColorJitter (brightness ±0.4, contrast ±0.4, saturation ±0.4)
       ↓
  RandomGrayscale (10% chance)
       ↓
  Normalize (subtract dataset mean, divide by std)
       ↓
  [Augmented Image ready for training]

Random augmentations are applied only during TRAINING, not during testing.
Testing uses only: Resize → CenterCrop → Normalize.
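
The split between training and testing can be sketched with NumPy. This is not a library API: it assumes the image has already been resized to 256×256 and stored as floats in [0, 1], uses the common ImageNet mean and standard deviation as example statistics, and leaves out contrast and saturation jitter for brevity.

```python
import numpy as np

# Example per-channel statistics (ImageNet values); use your dataset's own.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(img):
    return (img - MEAN) / STD

def train_transform(img, rng, size=224):
    """Random flip, random crop, brightness jitter, random grayscale, normalize."""
    h, w = img.shape[:2]
    if rng.random() < 0.5:                        # RandomHorizontalFlip
        img = img[:, ::-1]
    y = int(rng.integers(0, h - size + 1))        # RandomCrop
    x = int(rng.integers(0, w - size + 1))
    img = img[y:y + size, x:x + size]
    img = np.clip(img * rng.uniform(0.6, 1.4), 0.0, 1.0)  # brightness ±0.4
    if rng.random() < 0.1:                        # RandomGrayscale
        gray = img @ np.array([0.299, 0.587, 0.114])
        img = np.repeat(gray[..., None], 3, axis=-1)
    return normalize(img)

def eval_transform(img, size=224):
    """Deterministic centre crop and normalize; no randomness at test time."""
    h, w = img.shape[:2]
    y, x = (h - size) // 2, (w - size) // 2
    return normalize(img[y:y + size, x:x + size])

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
print(train_transform(img, rng).shape, eval_transform(img).shape)
```

Calling train_transform twice on the same image generally gives different results, while eval_transform always returns the same output.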

Key Takeaways

Data augmentation creates new training examples by transforming existing images.
Geometric augmentations (flip, rotate, crop) simulate different camera angles.
Photometric augmentations (brightness, contrast, noise) simulate different lighting conditions.
Cutout forces the model to work with partially hidden objects.
Mixup and CutMix blend images and labels, which reduces overconfidence and improves generalization.
Augmentation is applied during training only; test images use deterministic, unaugmented preprocessing.
