Generative Models in Vision

Generative models learn the underlying distribution of images and use that knowledge to synthesize new images, edit existing ones, or transform visual content. They power AI image generation, image restoration, style transfer, and synthetic data creation for training other models.

Discriminative vs. Generative Models

Most models covered so far are discriminative — they learn to classify or detect by mapping an input image to a label. Generative models learn to produce images. Both types rely on the same neural network building blocks but pursue completely different objectives.

Side-by-Side Comparison

DISCRIMINATIVE MODEL:
  Input:  [Photo of a dog]
  Output: "Dog" (a label)
  Learns: P(label | image)
  Purpose: Classify what already exists.

GENERATIVE MODEL:
  Input:  "A golden retriever sitting on grass" (text or noise)
  Output: [A new photo of a golden retriever on grass]
  Learns: P(image), the distribution of real images (or P(image | text) when conditioned on a prompt)
  Purpose: Create something new that looks realistic.
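
The contrast above can be sketched on toy data. In this minimal example each "image" is reduced to a single brightness value, the classifier is a fixed threshold, and the generative model is a fitted Gaussian. All data, names, and the threshold are hypothetical and chosen only for illustration.

```python
import random
import statistics

# Toy 1-D "images": one brightness value per sample (hypothetical data).
dark = [0.10, 0.15, 0.20, 0.25]    # label "night"
bright = [0.75, 0.80, 0.85, 0.90]  # label "day"

def classify(x, threshold=0.5):
    """Discriminative view: map an input straight to a label."""
    return "day" if x > threshold else "night"

def fit_gaussian(samples):
    """Generative view: estimate the distribution the samples came from."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng):
    """Draw brand-new samples from the fitted distribution."""
    return [rng.gauss(mean, std) for _ in range(n)]

rng = random.Random(0)
mean, std = fit_gaussian(bright)
new_samples = generate(mean, std, 3, rng)

print("label for 0.82:", classify(0.82))
print("new 'day' samples:", new_samples)
```

The classifier can only label inputs it is given, while the fitted distribution can be sampled to produce values that were never in the training set.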

Generative Adversarial Networks (GANs)

A GAN trains two neural networks against each other. The Generator creates fake images from random noise. The Discriminator judges whether images are real or fake. Both networks improve together in a competitive loop — the Generator gets better at fooling the Discriminator, and the Discriminator gets better at detecting fakes.

GAN Training Loop

GENERATOR (G):                DISCRIMINATOR (D):
  Input: random noise z         Input: real OR fake image
  Output: fake image G(z)       Output: probability (real=1, fake=0)

TRAINING STEP:
  ┌───────────────────────────────────────────────────┐
  │  1. Sample noise z → Generator → Fake image       │
  │  2. Feed Fake image to Discriminator              │
  │     D says: 0.12 (mostly thinks it's fake)        │
  │  3. Feed Real image to Discriminator              │
  │     D says: 0.91 (correctly identifies real)      │
  │                                                   │
  │  UPDATE D: Maximize gap between real/fake scores. │
  │  UPDATE G: Fool D → push fake score toward 1.0.   │
  └───────────────────────────────────────────────────┘

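After many training steps (the idealized outcome):
  G produces images so realistic D cannot tell them apart.
  D output ≈ 0.5 for both real and fake (random guessing).
  → The GAN has reached equilibrium. In practice, training can be
    unstable and rarely settles this cleanly.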
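
The discriminator scores in the training step above can be turned into loss values. This sketch assumes binary cross-entropy and the non-saturating generator loss, which is one common formulation rather than the only one. Function names are illustrative.

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for one probability and one 0/1 target."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def discriminator_loss(d_real, d_fake):
    """D wants d_real -> 1 and d_fake -> 0."""
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_loss(d_fake):
    """G wants D to score its fake as real (non-saturating form)."""
    return bce(d_fake, 1)

# Scores taken from the training step in the diagram.
d_loss = discriminator_loss(d_real=0.91, d_fake=0.12)
g_loss = generator_loss(d_fake=0.12)

# Idealized equilibrium: D outputs 0.5 for every image.
d_loss_eq = discriminator_loss(0.5, 0.5)

print(f"D loss: {d_loss:.3f}  G loss: {g_loss:.3f}  D loss at equilibrium: {d_loss_eq:.3f}")
```

With these scores the discriminator is winning, so its loss is small and the generator's loss is large. At the idealized equilibrium the discriminator loss equals 2·ln 2.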

GAN Applications

GAN Variant	Task	Example
StyleGAN	Face synthesis	Generate photorealistic faces of non-existent people
Pix2Pix	Image-to-image translation	Convert a sketch into a photo, or a day scene to night
CycleGAN	Unpaired image translation	Turn a horse photo into a zebra without paired training data
SRGAN	Super-resolution	Upscale a low-resolution photo to high resolution
DeepFill	Image inpainting	Remove an object and fill the gap convincingly

CycleGAN — No Paired Training Data Needed

Goal: Convert horse photos to zebra photos.
Problem: No paired dataset (photo of same horse as a zebra).

The CycleGAN solution uses two generators and two discriminators:

Generator G_A: Horse → Zebra
Generator G_B: Zebra → Horse
One discriminator per domain judges whether a horse (or zebra) image is real or generated.

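Cycle consistency loss:
  Horse → G_A → Zebra → G_B → Reconstructed Horse
  The reconstructed horse must match the original.
  (The same check runs in the other direction, starting from a zebra.)
  The discriminators push outputs to look realistic; the cycle constraint
  keeps each translation faithful to its input, so no pairs are needed.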

  Horse ──G_A──→ Zebra ──G_B──→ Reconstructed Horse
    ↑                                    ↓
    └──── loss: match these two ─────────┘
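
The cycle check in the diagram can be sketched with toy stand-ins for the generators. Here an "image" is a short list of pixel values, the generators are simple hand-written functions (real ones are neural networks), and the distance is a mean absolute difference. All names are hypothetical.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical stand-ins for the two generators.
def g_a(horse):
    """Horse -> Zebra (toy: invert pixel values)."""
    return [1.0 - p for p in horse]

def g_b(zebra):
    """Zebra -> Horse (toy: exact inverse of g_a)."""
    return [1.0 - p for p in zebra]

def g_b_lossy(zebra):
    """A generator that ignores its input and discards all content."""
    return [0.5 for _ in zebra]

def cycle_loss(x, forward, backward):
    """Translate x to the other domain and back, then compare with x."""
    return l1_distance(x, backward(forward(x)))

horse = [0.2, 0.4, 0.6, 0.8]
good = cycle_loss(horse, g_a, g_b)        # content preserved
bad = cycle_loss(horse, g_a, g_b_lossy)   # content lost

print(f"faithful cycle: {good:.3f}, lossy cycle: {bad:.3f}")
```

A generator pair that preserves content gets a cycle loss near zero, while a pair that throws content away is penalized even if its outputs looked realistic.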

Variational Autoencoders (VAEs)

A Variational Autoencoder learns to compress images into a compact latent space (encoding) and then reconstruct them (decoding). Unlike a regular autoencoder, the encoder outputs the parameters of a probability distribution rather than a single point, and training pulls those distributions toward a standard normal. The resulting latent space is smooth, which allows interpolation and controlled image generation.

VAE Architecture

ENCODER:
  [Input image] → [CNN layers] → [μ (mean), σ (std dev)]
                                        ↓
                               Sample z ~ N(μ, σ²)
                               (random point in latent space)

DECODER:
  [z] → [Upsampling CNN layers] → [Reconstructed image]

LOSS = Reconstruction loss + KL divergence
       (How similar to input?)   (How close is the latent
                                   distribution to standard normal?)

GENERATION:
  Sample z ~ N(0, 1)   ← random point from standard normal
  Feed z to Decoder
  → New image generated
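
The sampling step and the two loss terms above can be written out directly. This is a minimal sketch that assumes a diagonal Gaussian latent, mean squared error as the reconstruction term (one common choice), and made-up numbers in place of real encoder and decoder outputs.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def mse(x, x_hat):
    """Reconstruction term: mean squared error between input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def vae_loss(x, x_hat, mu, sigma):
    return mse(x, x_hat) + kl_to_standard_normal(mu, sigma)

rng = random.Random(0)
mu, sigma = [0.0, 1.0], [1.0, 1.0]   # hypothetical encoder outputs
z = reparameterize(mu, sigma, rng)   # point passed to the decoder

x = [0.2, 0.9, 0.4]                  # hypothetical input pixels
x_hat = [0.25, 0.8, 0.4]             # hypothetical decoder output
print("z:", z)
print("loss:", vae_loss(x, x_hat, mu, sigma))
```

Writing the sample as mu + sigma * eps keeps the randomness in eps, so gradients can flow through mu and sigma during training.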

Latent Space Interpolation

Image A (smiling face) → Encoder → z_A = [0.2, 0.8, −0.3, ...]
Image B (frowning face) → Encoder → z_B = [−0.5, 0.1, 0.9, ...]

Interpolate: z_t = (1−t) × z_A + t × z_B  for t = 0 to 1

t=0.0: z_A → Decoder → Smiling face
t=0.25: z_mid1 → Decoder → Slightly smiling face
t=0.5:  z_mid → Decoder → Neutral face
t=0.75: z_mid2 → Decoder → Slightly frowning face
t=1.0:  z_B → Decoder → Frowning face

→ Smooth, realistic transition between two expressions.
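
The interpolation formula above is a one-liner in code. The sketch below uses only the first three latent values shown for each image (real latent vectors are much longer) and stops before the decoder, which is not modeled here.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation: z_t = (1 - t) * z_a + t * z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Truncated to three dimensions for illustration.
z_a = [0.2, 0.8, -0.3]   # smiling face
z_b = [-0.5, 0.1, 0.9]   # frowning face

steps = (0.0, 0.25, 0.5, 0.75, 1.0)
path = [lerp(z_a, z_b, t) for t in steps]

for t, z_t in zip(steps, path):
    print(f"t={t:.2f}:", [round(v, 3) for v in z_t])
```

Each point on the path would be passed to the decoder to render the intermediate face.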

Diffusion Models

Diffusion models are the technology behind Stable Diffusion, recent versions of DALL-E, and Midjourney. During training, noise is gradually added to an image over many steps until it becomes pure random noise. The model learns to reverse this process, recovering a clean image from noise step by step.

Forward and Reverse Diffusion

FORWARD PROCESS (adding noise — fixed, no learning):
  [Real image] → Add noise → Add more noise → ... → [Pure noise]
  Step 0           Step 100      Step 500           Step 1000
  Clean image    Slightly noisy  Very noisy      Random noise

REVERSE PROCESS (removing noise — learned by neural network):
  [Random noise] → Denoise → Denoise → ... → [Generated image]
  Step 1000         Step 900   Step 500        Step 0
  Noise only      Faint shape  Clearer image   Clean output

The network learns: "Given a noisy image at step t,
predict the noise that was added." The sampler then removes
part of that predicted noise.
Repeat for every step → a realistic image emerges.
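
The forward process has a closed form that jumps from the clean image to any step directly. The sketch below assumes a linear noise schedule with endpoints commonly used in DDPM-style setups and a 4-pixel toy "image". It uses the true noise as an oracle in place of a trained network. Steps are 0-indexed, and all names are illustrative.

```python
import math
import random

T = 1000
# Linear beta schedule; the endpoints are an assumption, not fixed by the method.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t]: fraction of the original signal variance left at step t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def add_noise(x0, t, rng):
    """Forward process: noise the clean image directly to step t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    return [a * x + b * e for x, e in zip(x0, eps)], eps

def estimate_clean(x_t, t, predicted_eps):
    """Invert the forward formula given a prediction of the added noise."""
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    return [(x - b * e) / a for x, e in zip(x_t, predicted_eps)]

rng = random.Random(0)
x0 = [0.1, -0.4, 0.7, 0.3]
x_t, eps = add_noise(x0, 500, rng)
x0_hat = estimate_clean(x_t, 500, eps)  # oracle: the true noise stands in for the network

print("signal left at step 500:", alpha_bar[500])
print("recovered:", [round(v, 6) for v in x0_hat])
```

A trained network only approximates the noise, which is one reason practical samplers take many small denoising steps instead of a single jump.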

Text-Guided Diffusion (Text-to-Image)

Input text: "A red barn in a snowy field at sunset"

Step 1: A text encoder (e.g., CLIP) converts the text to embedding vectors.
         "red barn snowy field sunset" → [0.3, 0.8, −0.2, ...]

Step 2: Start with pure random noise image.

Step 3: At each denoising step, the U-Net receives:
  - Current noisy image
  - Text embedding (as conditioning signal via cross-attention)
  → Predicts the noise to remove, guided by the text.

Step 4: After 50–1000 steps of guided denoising:
  → Final image matching the text description.

Cross-attention:
  Image features act as queries against the text token embeddings in the U-Net's attention layers.
  "Where is 'red'? → highlight warm regions."
  "Where is 'barn'? → focus on rectangular structures."

Image Super-Resolution

Super-resolution takes a low-resolution image and generates a high-resolution version. Early methods relied on interpolation (bilinear, bicubic), which produces blurry results. Modern methods use deep learning, and GANs in particular, to hallucinate plausible fine texture detail that was not in the original.

Super-Resolution Comparison

Low-resolution input (64×64):
  ░░░░░░░░░░░░░░░░░░░░
  ░░ blurry cat ░░░░░
  ░░░░░░░░░░░░░░░░░░░░

Bicubic upsampling (256×256):
  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  ▒▒ slightly smoother cat ▒▒
  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  (Still blurry — just larger)

SRGAN output (256×256):
  ████████████████████
  ██ sharp fur texture cat ██
  ████████████████████
  (Individual hairs visible, crisp edges)
  → GAN hallucinated realistic fur texture: plausible detail, not detail recovered from the scene.
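
To see why interpolation alone stays blurry, here is a minimal bilinear upsampler in plain Python. It assumes corner-aligned sampling and an output of at least 2×2. The function name and the tiny 2×2 input are illustrative.

```python
def bilinear_upsample(image, out_h, out_w):
    """Resize a 2-D grid of floats with corner-aligned bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    result = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1)   # source row coordinate
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1)   # source column coordinate
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
            bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
            row.append((1 - fy) * top + fy * bottom)
        result.append(row)
    return result

low_res = [[0.0, 1.0],
           [1.0, 0.0]]
up = bilinear_upsample(low_res, 4, 4)

for row in up:
    print([round(v, 2) for v in row])
```

Every output pixel is a weighted average of its neighbors in the input, so the result can only be smoother than the source. New texture has to come from a learned model.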

Image Inpainting

Inpainting fills in missing or masked regions of an image convincingly. A user marks the region to remove (e.g., a person or a watermark), and the model fills the gap with realistic content inferred from the surrounding pixels.

Inpainting Pipeline

ORIGINAL:
  ┌────────────────────────────────┐
  │  Mountain landscape with a     │
  │  power line crossing the sky   │
  └────────────────────────────────┘

USER MASK:
  ┌────────────────────────────────┐
  │  Mountain landscape ████████   │ ← masked power line region
  └────────────────────────────────┘

INPAINTED OUTPUT:
  ┌────────────────────────────────┐
  │  Mountain landscape with       │
  │  clear blue sky (no power line)│ ← gap filled with realistic sky
  └────────────────────────────────┘

The model uses surrounding pixels to infer what the sky
should look like and generates the missing content.
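
The pipeline above ends with a compositing step: generated pixels are used only inside the mask, and original pixels are kept everywhere else. In this sketch a simple mean fill stands in for the generative model, as a hypothetical placeholder, so the example stays self-contained.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask is 0; take generated pixels where mask is 1."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]

def mean_fill(original, mask):
    """Placeholder 'model': paints every pixel with the mean of the known pixels."""
    known = [o for o_row, m_row in zip(original, mask)
             for o, m in zip(o_row, m_row) if not m]
    fill = sum(known) / len(known)
    return [[fill for _ in row] for row in original]

sky = [[0.8, 0.8, 0.8],
       [0.8, 0.1, 0.8],   # 0.1 = dark power-line pixel
       [0.8, 0.8, 0.8]]
mask = [[0, 0, 0],
        [0, 1, 0],        # 1 = region the user marked for removal
        [0, 0, 0]]

result = composite(sky, mean_fill(sky, mask), mask)
for row in result:
    print([round(v, 2) for v in row])
```

A real inpainting model replaces mean_fill with a network that generates texture and structure, but the masking logic around it follows the same pattern.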

Responsible Use of Generative Vision Models

Generative models can create realistic images of people, places, and events that never existed. Responsible developers and users apply these technologies for productive purposes such as creative content, data augmentation, research, and accessibility tools. Most reputable platforms implement usage policies, content filters, and digital watermarking to help distinguish AI-generated content from authentic photography and video.

Key Takeaways

Generative models learn the distribution of real images and synthesize new ones.
GANs pit a Generator (creates fakes) against a Discriminator (detects fakes) — both improve through competition.
CycleGAN translates images between domains without needing paired training examples.
VAEs learn a structured latent space — enabling smooth interpolation between images.
Diffusion models iteratively remove noise from a random input — guided by text or other conditions.
Super-resolution and inpainting are key practical applications of generative vision models.
