Generative AI Image Generation
Image generation is one of the most visible capabilities of generative AI. A text prompt describing a scene, style, or concept produces an entirely new image. Tools like Stable Diffusion, DALL·E, and Midjourney have made this technology accessible to anyone, no drawing skills required.
How Image Generation Models Work
The dominant approach for modern image generation is the diffusion model. It learns to gradually remove noise, starting from pure random noise, until a clear image matching the prompt emerges.
Training Phase (what the model learns):
────────────────────────────────────────────────────────
Clean image of a cat
│ (add random noise gradually)
▼
[■▓▒░ noisy cat ░▒▓■]
│ (more noise)
▼
[■■■■ pure noise ■■■■]
Model learns: how to REVERSE this noise process step by step
────────────────────────────────────────────────────────
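The training-phase diagram above can be sketched numerically. This is a toy illustration, not real training: "image" is a 1-D array standing in for pixels, and one noise sample is scaled to several levels (real training draws fresh noise at each level, but scaling keeps the demo deterministic).

```python
import numpy as np

# Toy sketch of the forward (noising) process from the diagram above.
# Assumptions: a 16-element array stands in for a clean image, and a
# single noise sample is scaled rather than redrawn at each level.
rng = np.random.default_rng(0)
image = np.ones(16)                    # stand-in for "clean image of a cat"
base_noise = rng.normal(size=image.shape)
noise_levels = [0.1, 0.5, 1.0, 2.0]    # gradually increasing noise

noisy_versions = [image + sigma * base_noise for sigma in noise_levels]

# Each level drifts further from the clean image; the model is trained
# to predict (and thus reverse) the noise added at each level.
distances = [float(np.abs(v - image).mean()) for v in noisy_versions]
print(distances)  # strictly increasing
```

At the highest noise level the original image is effectively unrecoverable by eye, which is exactly the "pure noise" endpoint of the diagram.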
Generation Phase (creating a new image):
────────────────────────────────────────────────────────
Start with pure random noise
│ (model removes noise, guided by prompt)
▼
[partial shapes appear]
│ (more denoising steps)
▼
[rough image of a cat]
│ (final denoising)
▼
[sharp photorealistic cat image]
────────────────────────────────────────────────────────
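The generation phase is a loop: start from noise and repeatedly apply a denoising update. In a real sampler each update comes from the trained model; in this minimal sketch a hypothetical `denoise_step` just nudges the sample toward a known target, to show the shape of the loop rather than real denoising.

```python
import numpy as np

# Toy sketch of the generation (denoising) loop diagrammed above.
# Assumption: denoise_step is a stand-in for one model-predicted update.
rng = np.random.default_rng(42)
target = np.full(16, 0.5)              # stands in for "image matching the prompt"
x = rng.normal(size=16)                # start from pure random noise

def denoise_step(x, target, strength=0.3):
    # hypothetical update: move partway toward the target each step
    return x + strength * (target - x)

history = [float(np.abs(x - target).mean())]
for _ in range(20):                    # the sampling steps
    x = denoise_step(x, target)
    history.append(float(np.abs(x - target).mean()))

print(history[0], history[-1])  # error shrinks as steps accumulate
```

The error shrinks monotonically here, mirroring the diagram's progression from "partial shapes" to a sharp final image.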
The Role of Text in Image Generation
A text encoder (usually a CLIP model) converts the text prompt into a vector — a list of numbers that represents the meaning of the description. The diffusion model uses this vector to guide the denoising process toward an image that matches the prompt.
Prompt: "A golden retriever puppy sitting in autumn leaves, soft light"
│
▼
Text Encoder (CLIP)
│
▼
[0.8, 0.3, 0.9, 0.1 ... ] ← semantic vector
│
▼
Diffusion Model uses vector to guide image creation
│
▼
Final image: golden retriever puppy in autumn leaves
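The interface in the pipeline above can be sketched in code. A real system uses a learned encoder such as CLIP; this toy encoder just hashes the prompt into a deterministic pseudo-random vector, so it captures no meaning at all, only the shape of the contract: text in, fixed-size vector out, same prompt always yielding the same vector.

```python
import hashlib
import numpy as np

# Toy stand-in for a text encoder. Assumption: hashing the prompt to seed
# a random vector imitates only the interface of CLIP, not its semantics.
def toy_text_encoder(prompt: str, dim: int = 8) -> np.ndarray:
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.normal(size=dim)

vec = toy_text_encoder("A golden retriever puppy sitting in autumn leaves")
same = toy_text_encoder("A golden retriever puppy sitting in autumn leaves")
other = toy_text_encoder("A tabby cat on a windowsill")

# Same prompt -> same vector; different prompts -> different vectors.
print(np.allclose(vec, same), np.allclose(vec, other))
```

In a real encoder, prompts with similar meanings also land near each other in vector space, which is what lets the diffusion model be "guided" by them.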
Key Image Generation Models
| Model | Creator | Key Feature |
|---|---|---|
| DALL·E 3 | OpenAI | Strong prompt adherence, integrated with ChatGPT |
| Stable Diffusion | Stability AI | Open-source, runs locally, highly customizable |
| Midjourney | Midjourney Inc. | Exceptional artistic quality and style control |
| Adobe Firefly | Adobe | Trained on licensed content, safe for commercial use |
| Imagen 3 | Google DeepMind | High photorealism and text rendering in images |
| Flux | Black Forest Labs | State-of-the-art open-weights alternative to DALL·E |
Anatomy of an Effective Image Prompt
Image prompts follow a different pattern than prompts for text models: more descriptive and specific prompts generally produce better results.
WEAK PROMPT:
"A cat"
STRONG PROMPT:
"A tabby cat sitting on a wooden windowsill, warm golden afternoon light,
bokeh background of green garden, photorealistic, 85mm lens, high detail"
Elements of a strong image prompt:
─────────────────────────────────────────────────────────────────
1. Subject → What is in the image? ("tabby cat")
2. Setting → Where and when? ("wooden windowsill, afternoon")
3. Lighting → Type and direction of light ("warm golden light")
4. Style → Photorealistic, cartoon, painting, sketch, etc.
5. Camera/Lens → For realistic photos ("85mm lens, bokeh")
6. Quality tags → ("high detail", "4K", "sharp focus")
─────────────────────────────────────────────────────────────────
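The six elements above can be assembled mechanically. This small helper is illustrative only; the field names and comma-joining style are assumptions, not a standard required by any particular tool.

```python
# Hypothetical helper that assembles a prompt from the six elements
# listed above. Empty fields are simply skipped.
def build_image_prompt(subject, setting="", lighting="", style="",
                       camera="", quality=""):
    parts = [subject, setting, lighting, style, camera, quality]
    return ", ".join(p for p in parts if p)

prompt = build_image_prompt(
    subject="A tabby cat sitting on a wooden windowsill",
    setting="warm golden afternoon",
    lighting="soft window light",
    style="photorealistic",
    camera="85mm lens, bokeh background",
    quality="high detail, sharp focus",
)
print(prompt)
```

Structuring prompts this way makes it easy to vary one element (say, lighting) while holding the rest fixed, which is how practitioners typically iterate.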
Image Generation Parameters
Guidance Scale (CFG Scale)
Controls how strictly the model follows the prompt:
- Low (1–4): Model is creative and less literal — may diverge from the prompt
- Medium (7–10): Good balance between prompt adherence and quality
- High (15–20): Very close to the prompt but may look over-processed
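Under the hood, the guidance scale enters through classifier-free guidance: the final noise prediction is the unconditional prediction pushed along the direction of the prompt-conditioned one. The arrays below are toy stand-ins for model outputs.

```python
import numpy as np

# Classifier-free guidance in one line: push the unconditional prediction
# toward the prompt-conditioned prediction, scaled by the guidance value.
def apply_cfg(uncond_pred, cond_pred, guidance_scale):
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

uncond = np.array([0.0, 0.0, 0.0])    # toy "no prompt" prediction
cond = np.array([1.0, -1.0, 0.5])     # toy prompt-conditioned prediction

low = apply_cfg(uncond, cond, 1.0)    # scale 1 reproduces the conditional pred
high = apply_cfg(uncond, cond, 7.5)   # higher scale exaggerates the prompt direction
print(low, high)
```

This is why very high scales look over-processed: the prompt direction is amplified well past what the model predicted on its own.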
Steps (Sampling Steps)
The number of denoising steps the model takes. More steps generally mean a higher-quality image at the cost of longer generation time, with diminishing returns at high step counts.
- 20 steps: Fast, rough quality
- 50 steps: Good quality, standard use
- 100+ steps: Marginal quality gains, noticeably slower
Seed
A seed is a random number that initializes the noise the model starts from. Using the same seed with the same prompt produces the same image every time — useful for reproducibility and iterative refinement.
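The reproducibility claim rests on the seed fully determining the starting noise; with the same prompt and settings, the denoising trajectory from that noise is deterministic. The starting-noise half of that claim can be demonstrated directly:

```python
import numpy as np

# Same seed -> identical starting noise -> (with fixed prompt and
# settings) identical final image. Shape (4, 4) is an arbitrary stand-in
# for the latent noise an image model would actually start from.
def starting_noise(seed, shape=(4, 4)):
    return np.random.default_rng(seed).normal(size=shape)

a = starting_noise(123)
b = starting_noise(123)   # same seed -> identical noise
c = starting_noise(456)   # different seed -> different noise
print(np.array_equal(a, b), np.array_equal(a, c))
```

In practice this is how you iterate: fix the seed, tweak one word of the prompt, and compare otherwise-identical generations.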
Negative Prompts
Most image generation systems accept a negative prompt — a list of things to avoid in the generated image.
Prompt: "Portrait of a woman, professional headshot, studio lighting"
Negative Prompt: "blurry, low resolution, distorted hands, extra fingers,
watermark, overexposed, cartoonish"
Effect: The model actively avoids these qualities during generation.
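In many diffusion implementations this works by substituting the negative prompt's embedding for the unconditional branch of classifier-free guidance, so every step is steered toward the prompt and away from the negative prompt. The vectors below are toy stand-ins.

```python
import numpy as np

# Negative prompt as the "away from" branch of guidance (a common
# implementation choice; toy vectors stand in for model predictions).
def guided_pred(cond_pred, neg_pred, scale):
    return neg_pred + scale * (cond_pred - neg_pred)

cond = np.array([1.0, 0.0])     # toy direction of "professional headshot"
neg = np.array([0.0, 1.0])      # toy direction of "blurry, watermark"

pred = guided_pred(cond, neg, scale=7.5)
print(pred)  # pushed toward cond, away from neg
```

The result moves positively along the prompt direction and negatively along the negative-prompt direction, which is the "actively avoids" effect described above.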
ControlNet — Guiding Image Structure
ControlNet is an extension that lets users control the structure of generated images using reference inputs like edge maps, depth maps, or human pose skeletons. This allows precise control over composition without fully specifying every detail in text.
Reference Input (stick figure pose sketch)
│
▼
ControlNet + Diffusion Model
│
▼
Photorealistic image of a person in exactly that pose
Image-to-Image Generation
Beyond text-to-image, models also support image-to-image tasks:
- Inpainting: Fill in a masked area of an existing image with new content
- Outpainting: Extend an image beyond its original borders
- Style Transfer: Apply the artistic style of one image to another
- Upscaling: Increase image resolution while adding fine detail
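The masking idea behind inpainting can be shown with plain arrays: outside the mask the original pixels are kept, inside the mask newly generated content is written. Real pipelines repeat this blend at every denoising step; this sketch shows only the final composite.

```python
import numpy as np

# Inpainting composite: keep original pixels where mask == 0, use
# generated pixels where mask == 1. Arrays stand in for real images.
original = np.zeros((4, 4))            # existing image
generated = np.ones((4, 4))            # stand-in for model output
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                   # region the user asked to repaint

result = original * (1 - mask) + generated * mask
print(result)
```

Outpainting uses the same blend with the mask covering the newly added border region instead of an interior patch.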
Ethical Considerations in Image Generation
Image generation carries specific ethical responsibilities:
- Deepfakes: Realistic fake images of real people can be misused
- Copyright: Models trained on artists' work without consent raise IP questions
- Misinformation: Fake but realistic images can spread false narratives
- Bias: Models trained on imbalanced data may stereotype certain groups
Responsible platforms add safety filters to prevent generation of harmful, illegal, or misleading content.
Image generation demonstrates how generative AI extends far beyond text. The next topic explores another sensory domain — audio and music generation.
