Generative Models in Vision
Generative models learn the underlying distribution of images and use that knowledge to synthesize new images, edit existing ones, or transform visual content. They power AI image generation, image restoration, style transfer, and synthetic data creation for training other models.
Discriminative vs. Generative Models
Most models covered so far are discriminative: they map an input image to a prediction such as a class label or a set of bounding boxes. Generative models instead learn to produce images. Both types rely on the same neural network building blocks but optimize different objectives.
Side-by-Side Comparison
DISCRIMINATIVE MODEL:
Input: [Photo of a dog]
Output: "Dog" (a label)
Learns: P(label | image)
Purpose: Classify what already exists.
GENERATIVE MODEL:
Input: "A golden retriever sitting on grass" (text or noise)
Output: [A new photo of a golden retriever on grass]
Learns: P(image), the distribution of real images (or P(image | text) when conditioned on a prompt)
Purpose: Create something new that looks realistic.
Generative Adversarial Networks (GANs)
A GAN trains two neural networks against each other. The Generator creates fake images from random noise. The Discriminator judges whether images are real or fake. Both networks improve together in a competitive loop — the Generator gets better at fooling the Discriminator, and the Discriminator gets better at detecting fakes.
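The competition described above is usually written as a two-player minimax game. The block below gives the standard formulation; the symbols p_data (distribution of real images) and p_z (noise prior) are notation added here and do not appear elsewhere in this article.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The Discriminator raises V by pushing D(x) toward 1 on real images and D(G(z)) toward 0 on fakes, while the Generator lowers V by pushing D(G(z)) toward 1. In practice the Generator is often trained to maximize log D(G(z)) instead, because that form gives stronger gradients early in training.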
GAN Training Loop
GENERATOR (G):
Input: random noise z
Output: fake image G(z)
DISCRIMINATOR (D):
Input: real OR fake image
Output: probability (real=1, fake=0)
TRAINING STEP:
1. Sample noise z → Generator → Fake image
2. Feed Fake image to Discriminator
   D says: 0.12 (mostly thinks it's fake)
3. Feed Real image to Discriminator
   D says: 0.91 (correctly identifies real)
UPDATE D: Maximize gap between real/fake scores.
UPDATE G: Fool D → push fake score toward 1.0.
In the ideal case, after many training steps:
G produces images so realistic D cannot tell them apart.
D output ≈ 0.5 for both real and fake (random guessing).
→ GAN has converged.
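To make the two update rules concrete, here is a minimal sketch that turns the Discriminator scores from the training step above (0.91 on a real image, 0.12 on a fake) into loss values. It assumes binary cross-entropy for the Discriminator and the common non-saturating loss for the Generator; the function names are illustrative, and a real implementation would average over batches and backpropagate through both networks.

```python
import math


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Binary cross-entropy for D: real images are labelled 1, fakes 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake: float) -> float:
    """Non-saturating generator loss: G wants D to output 1 on fakes."""
    return -math.log(d_fake)


# Scores from the training step above: D is currently winning.
early_d = discriminator_loss(d_real=0.91, d_fake=0.12)
early_g = generator_loss(d_fake=0.12)

# Idealised equilibrium: D outputs 0.5 for every image.
eq_d = discriminator_loss(d_real=0.5, d_fake=0.5)  # equals 2 * ln 2
eq_g = generator_loss(d_fake=0.5)                  # equals ln 2

print(f"early: D loss {early_d:.3f}, G loss {early_g:.3f}")
print(f"equilibrium: D loss {eq_d:.3f}, G loss {eq_g:.3f}")
```

Early in training the Discriminator's loss is low and the Generator's is high. As the Generator improves, the two move toward the equilibrium values.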
GAN Applications
| GAN Variant | Task | Example |
|---|---|---|
| StyleGAN | Face synthesis | Generate photorealistic faces of non-existent people |
| Pix2Pix | Image-to-image translation | Convert a sketch into a photo, or a day scene to night |
| CycleGAN | Unpaired image translation | Turn a horse photo into a zebra without paired training data |
| SRGAN | Super-resolution | Upscale a low-resolution photo to high resolution |
| DeepFill | Image inpainting | Remove an object and fill the gap convincingly |
CycleGAN — No Paired Training Data Needed
Goal: Convert horse photos to zebra photos.
Problem: No paired dataset (photo of same horse as a zebra).
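CycleGAN solution uses two generators and two discriminators:
Generator G_A: Horse → Zebra
Generator G_B: Zebra → Horse
Discriminator D_A: real zebra vs. output of G_A
Discriminator D_B: real horse vs. output of G_B
Cycle consistency loss:
Horse → G_A → Zebra → G_B → Reconstructed Horse
The reconstructed horse must match the original (and likewise for Zebra → Horse → Zebra).
The discriminators make each translation look realistic; the cycle constraint makes it keep the content of the input, so no pairs are needed.
Horse ──G_A──→ Zebra ──G_B──→ Reconstructed Horse
  ↑                                    ↓
  └──── loss: match these two ─────────┘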
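The cycle consistency term can be sketched in a few lines. The "generators" below are hypothetical hand-written functions on tiny pixel lists, not neural networks, and the L1 distance is an assumed choice of reconstruction metric. The point is only to show what the loss rewards and what it penalizes.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 distance between x and its round trip g_ba(g_ab(x))."""
    return l1_distance(x, g_ba(g_ab(x)))


# Toy stand-ins for the two generators (hypothetical, for illustration only):
# "horse -> zebra" inverts pixel intensity and "zebra -> horse" inverts it back.
def g_a(image):
    return [1.0 - p for p in image]


def g_b(image):
    return [1.0 - p for p in image]


# A degenerate generator that ignores its input and always emits the same image.
def g_b_collapsed(image):
    return [0.5 for _ in image]


horse = [0.0, 0.25, 0.5, 1.0]
good_cycle = cycle_consistency_loss(horse, g_a, g_b)            # 0.0
bad_cycle = cycle_consistency_loss(horse, g_a, g_b_collapsed)   # 0.3125
```

A generator pair that preserves the content of the input gets zero cycle loss. A generator that discards the input is penalized even if its output looks realistic.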
Variational Autoencoders (VAEs)
A Variational Autoencoder learns to compress images into a compact latent space (encoding) and then reconstruct them (decoding). Unlike a regular autoencoder, the latent space is structured as a probability distribution — allowing smooth interpolation and controlled image generation.
VAE Architecture
ENCODER:
[Input image] → [CNN layers] → [μ (mean), σ (std dev)]
↓
Sample z ~ N(μ, σ²)
(random point in latent space)
DECODER:
[z] → [Upsampling CNN layers] → [Reconstructed image]
LOSS = Reconstruction loss + KL divergence
Reconstruction loss: how similar is the output to the input?
KL divergence: how close is the latent distribution to a standard normal?
GENERATION:
Sample z ~ N(0, 1) ← random point from standard normal
Feed z to Decoder
→ New image generated
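The sampling step and the two loss terms above can be sketched without a deep learning framework. The sketch assumes a diagonal Gaussian encoder output, uses the closed-form KL divergence to a standard normal, and takes squared error as one possible reconstruction loss. All function names and the numbers are illustrative.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def squared_error(x, x_hat):
    """A simple reconstruction loss: sum of squared pixel differences."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


rng = random.Random(0)                 # seeded for reproducibility
mu, sigma = [1.0, 0.0], [1.0, 1.0]     # pretend encoder output (made up)
z = reparameterize(mu, sigma, rng)     # point in latent space fed to the decoder

kl = kl_to_standard_normal(mu, sigma)  # 0.5: only the shifted mean is penalised
x, x_hat = [0.2, 0.8], [0.1, 0.9]      # pretend input and reconstruction
loss = squared_error(x, x_hat) + kl
```

Writing the sample as mu + sigma * eps keeps the randomness in eps, so gradients can flow back through mu and sigma to the encoder during training.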
Latent Space Interpolation
Image A (smiling face) → Encoder → z_A = [0.2, 0.8, −0.3, ...]
Image B (frowning face) → Encoder → z_B = [−0.5, 0.1, 0.9, ...]
Interpolate: z_t = (1−t) × z_A + t × z_B for t = 0 to 1
t=0.0: z_A → Decoder → Smiling face
t=0.25: z_mid1 → Decoder → Slightly smiling face
t=0.5: z_mid → Decoder → Neutral face
t=0.75: z_mid2 → Decoder → Slightly frowning face
t=1.0: z_B → Decoder → Frowning face
→ Smooth, realistic transition between two expressions.
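The interpolation formula above is a plain element-wise blend. The sketch below applies it to the first three latent coordinates shown in the example (the remaining coordinates are elided in the text and are left out here). The decoder call is omitted because it depends on the trained model.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation: (1 - t) * z_a + t * z_b, element-wise."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


# First three latent coordinates from the example above.
z_a = [0.2, 0.8, -0.3]
z_b = [-0.5, 0.1, 0.9]

# One latent vector per frame; each would be passed to the decoder.
path = [lerp(z_a, z_b, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]

for t, z_t in zip((0.0, 0.25, 0.5, 0.75, 1.0), path):
    print(t, [round(v, 3) for v in z_t])
```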
Diffusion Models
Diffusion models are the technology behind systems such as Stable Diffusion, DALL-E 2 and later, and Midjourney (the original DALL-E was an autoregressive transformer). During training, noise is gradually added to an image over many steps until it becomes pure random noise, and a network learns to reverse this process, recovering a clean image from noise step by step.
Forward and Reverse Diffusion
FORWARD PROCESS (adding noise; fixed, no learning):
[Real image] → Add noise → Add more noise → ... → [Pure noise]
Step 0: Clean image
Step 100: Slightly noisy
Step 500: Very noisy
Step 1000: Random noise
REVERSE PROCESS (removing noise; learned by neural network):
[Random noise] → Denoise → Denoise → ... → [Generated image]
Step 1000: Noise only
Step 900: Faint shape
Step 500: Clearer image
Step 0: Clean output
The network learns: "Given a noisy image at step t, predict the noise that was added, then subtract it."
Repeating this over all 1000 steps turns random noise into a realistic image.
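The forward process has a convenient closed form: a noisy image at any step t can be produced directly from the clean image, without running the earlier steps. The sketch below assumes a linear noise schedule with endpoints 1e-4 and 0.02 over 1000 steps, which is a commonly used choice but not something this article specifies. It also shows how a noise prediction is turned back into an image estimate. The true noise stands in for the network's prediction, since no trained model is available here.

```python
import math
import random

T = 1000
# Linear noise schedule; the endpoints are an assumed, commonly used choice.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t]: fraction of the original signal variance remaining at step t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def add_noise(x0, t, noise):
    """Jump straight to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a = alpha_bar[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]


def estimate_x0(x_t, t, predicted_noise):
    """Invert add_noise given a noise prediction for step t."""
    a = alpha_bar[t]
    return [(x - math.sqrt(1.0 - a) * n) / math.sqrt(a)
            for x, n in zip(x_t, predicted_noise)]


rng = random.Random(0)
x0 = [0.9, -0.4, 0.1, 0.7]                   # a tiny 4-pixel "image"
noise = [rng.gauss(0.0, 1.0) for _ in x0]
x_500 = add_noise(x0, 500, noise)

# A perfect noise prediction recovers the clean image up to rounding.
recovered = estimate_x0(x_500, 500, noise)
```

By the last step almost none of the original signal remains, which is why generation can start from pure Gaussian noise.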
Text-Guided Diffusion (Text-to-Image)
Input text: "A red barn in a snowy field at sunset"
Step 1: A text encoder (e.g., CLIP) converts the text to an embedding.
"red barn snowy field sunset" → [0.3, 0.8, −0.2, ...]
Step 2: Start with pure random noise image.
Step 3: At each denoising step, the U-Net receives:
- Current noisy image
- Text embedding (as conditioning signal via cross-attention)
→ Predicts the noise to remove, guided by the text.
Step 4: After 50–1000 steps of guided denoising:
→ Final image matching the text description.
Cross-attention:
Image features query the text embedding at several layers of the U-Net.
"Where is 'red'? → highlight warm regions."
"Where is 'barn'? → focus on rectangular structures."
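The cross-attention step can be illustrated with scaled dot-product attention on hand-picked vectors. In a real model the queries come from learned projections of image features and the keys and values from learned projections of the text embedding. The 2-D vectors below are hypothetical and chosen only so the result is easy to read.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes values by key similarity."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Hypothetical features: two image locations (queries) and two text tokens.
image_queries = [[4.0, 0.0], [0.0, 4.0]]
token_keys = [[1.0, 0.0], [0.0, 1.0]]      # e.g. "red", "barn"
token_values = [[1.0, 0.0], [0.0, 1.0]]

attended, attn = cross_attention(image_queries, token_keys, token_values)
# attn[0] puts most of its weight on token 0, attn[1] on token 1.
```

Each image location receives a weighted mix of token values, so different regions of the image can be steered by different words in the prompt.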
Image Super-Resolution
Super-resolution takes a low-resolution image and generates a high-resolution version. Early methods used interpolation (bilinear, bicubic), which produced blurry results. Modern methods use deep learning, and GANs in particular, to hallucinate plausible fine texture detail that was not in the original.
Super-Resolution Comparison
Low-resolution input (64×64):
░░░░░░░░░░░░░░░░░░░░
░░ blurry cat ░░░░░
░░░░░░░░░░░░░░░░░░░░
Bicubic upsampling (256×256):
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
▒▒ slightly smoother cat ▒▒
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
(Still blurry, just larger)
SRGAN output (256×256):
████████████████████
██ sharp fur texture cat ██
████████████████████
(Individual hairs visible, crisp edges, vivid colors)
→ GAN hallucinated realistic fur texture.
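To see why interpolation cannot add detail, consider a minimal bilinear upsampler written from scratch. It is a simplified sketch: grayscale only, an integer scale factor, and corner pixels of input and output aligned, which is one of several conventions. Every output pixel is a weighted average of nearby input pixels, so the result is smoother but contains no information that was not already in the input.

```python
def bilinear_upsample(image, factor):
    """Upscale a 2-D grayscale image (list of rows) by an integer factor."""
    h, w = len(image), len(image[0])
    out_h, out_w = h * factor, w * factor
    out = []
    for i in range(out_h):
        # Map the output row back to a fractional source row.
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            # Blend the four surrounding source pixels.
            top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
            bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out


low_res = [[0.0, 1.0],
           [1.0, 0.0]]
high_res = bilinear_upsample(low_res, 2)   # 4x4 output
```

A learned super-resolution model replaces this fixed averaging with a network that predicts plausible high-frequency detail from patterns seen in training data.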
Image Inpainting
Inpainting fills in missing or masked regions of an image convincingly. A user marks the object to remove (e.g., a person, a watermark), and the model fills the gap with realistic background content by learning the context from the surrounding pixels.
Inpainting Pipeline
ORIGINAL:
┌────────────────────────────────┐
│ Mountain landscape with a      │
│ power line crossing the sky    │
└────────────────────────────────┘
USER MASK:
┌────────────────────────────────┐
│ Mountain landscape ████████    │ ← masked power line region
└────────────────────────────────┘
INPAINTED OUTPUT:
┌────────────────────────────────┐
│ Mountain landscape with        │
│ clear blue sky (no power line) │ ← gap filled with realistic sky
└────────────────────────────────┘
The model uses surrounding pixels to infer what the sky should look like and generates the missing content.
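The mask-and-fill structure of the pipeline can be shown with a deliberately simple stand-in. The sketch below is a classical, non-learned baseline, not a generative model: it repeatedly replaces each masked pixel with the average of its four neighbours and never touches known pixels. The function name and the tiny image are illustrative.

```python
def inpaint_by_neighbor_average(image, mask, iterations=50):
    """Fill masked pixels (mask == 1) with the average of their 4-neighbours.

    A classical, non-learned stand-in for a generative inpainting model.
    """
    h, w = len(image), len(image[0])
    # Zero out masked pixels so the removed content is never used.
    current = [[0.0 if mask[i][j] else image[i][j] for j in range(w)]
               for i in range(h)]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for i in range(h):
            for j in range(w):
                if not mask[i][j]:
                    continue  # known pixels are never modified
                neighbours = [current[a][b]
                              for a, b in ((i - 1, j), (i + 1, j),
                                           (i, j - 1), (i, j + 1))
                              if 0 <= a < h and 0 <= b < w]
                updated[i][j] = sum(neighbours) / len(neighbours)
        current = updated
    return current


# 3x3 "sky" of brightness 0.8 with one dark "power line" pixel in the middle.
sky = [[0.8, 0.8, 0.8],
       [0.8, 0.1, 0.8],
       [0.8, 0.8, 0.8]]
mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
result = inpaint_by_neighbor_average(sky, mask)
```

This kind of averaging works for flat regions such as clear sky but blurs textures and structure. A learned model is needed for those cases because it can synthesize content rather than only smooth across the gap.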
Responsible Use of Generative Vision Models
Generative models can create realistic images of people, places, and events that never existed. Responsible developers and users apply these technologies for productive purposes such as creative content, data augmentation, research, and accessibility tools. Many platforms implement usage policies, content filters, and digital watermarking to help distinguish AI-generated content from authentic photography and video.
Key Takeaways
- Generative models learn the distribution of real images and synthesize new ones.
- GANs pit a Generator (creates fakes) against a Discriminator (detects fakes) — both improve through competition.
- CycleGAN translates images between domains without needing paired training examples.
- VAEs learn a structured latent space — enabling smooth interpolation between images.
- Diffusion models iteratively remove noise from a random input — guided by text or other conditions.
- Super-resolution and inpainting are key practical applications of generative vision models.
