Multimodal Prompting

AI models have evolved beyond text. Modern AI tools like GPT-4o, Claude, and Gemini can process and reason about images alongside text in the same conversation. This is called multimodal capability — the ability to understand and respond to more than one type of input (or "mode") at a time.

Multimodal prompting is the skill of writing instructions that combine text and images effectively to get meaningful, accurate, and useful responses from AI.

What is Multimodal Prompting?

Multimodal Prompting refers to the practice of providing the AI with both an image and a text instruction — and crafting that instruction in a way that gets the most accurate and useful response based on what the image contains.

Instead of only describing something in words, an image is shared directly. The AI can then see, analyze, interpret, and respond to visual content — just as it responds to text.

What Can AI Do With an Image?

Modern multimodal AI models can:

Describe what is in an image
Read and extract text visible in an image (receipts, signs, documents, screenshots)
Analyze charts, graphs, and tables and explain their data
Identify objects, scenes, or products in a photo
Understand diagrams, flowcharts, and wireframes
Compare two images and explain the differences
Answer questions about the content of an image
Review UI/UX screenshots and provide feedback
Interpret handwritten notes or drawings

Multimodal Prompt Structure

A multimodal prompt has two parts:

The image — uploaded or shared alongside the message
The text instruction — the prompt that tells the AI what to do with the image

The text instruction is just as important as the image itself. Saying "what is this?" gives the AI unlimited latitude. A specific instruction like "identify the three data trends visible in this bar chart and explain what they indicate" gets a structured, actionable response.

Multimodal Prompt Examples by Use Case

Use Case 1 — Reading a Receipt or Invoice

Image: A photo of a restaurant receipt

Prompt: "Read this receipt and list each item ordered, its quantity, and its price. Then calculate the subtotal, tax amount, and total separately. Present the results in a table."

Result: A clean, organized breakdown of every line on the receipt without manually reading the photo.

Use Case 2 — Analyzing a Chart or Graph

Image: A bar chart showing monthly website traffic over 12 months

Prompt: "Analyze this bar chart. Identify the three months with the highest traffic, the three with the lowest, and describe any visible seasonal trend. Write your analysis in three short paragraphs."

Result: A data-driven narrative analysis of the chart, ready to include in a report.

Use Case 3 — UI / UX Review

Image: A screenshot of a mobile app's home screen

Prompt: "Review this mobile app home screen as a UX designer. Identify three usability issues and suggest a specific improvement for each. Consider readability, navigation clarity, and visual hierarchy."

Result: Practical, specific UX feedback grounded in the actual screenshot — not a generic response.

Use Case 4 — Product Identification

Image: A photo of a plant in a garden

Prompt: "Identify the plant in this image. Include its common name, scientific name, whether it is edible or toxic, and what growing conditions it prefers."

Result: A concise plant profile based on the visual features in the image.

Use Case 5 — Document Understanding

Image: A scanned page from a contract or legal document

Prompt: "Read the text in this scanned document page. Summarize the key obligations mentioned for both parties in plain language. Highlight any deadline or deadline-related language."

Result: A plain-language summary that saves significant reading time.

Use Case 6 — Comparing Two Images

Images: Two versions of a product packaging design (Version A and Version B)

Prompt: "Compare these two product packaging designs. List the visual differences between them. Based on clarity of the product name, visual appeal, and readability of key information, which version is stronger and why?"

Use Case 7 — Handwritten Notes

Image: A photo of handwritten meeting notes

Prompt: "Transcribe the handwritten text in this image as accurately as possible. Then organize the content into three sections: Action Items, Decisions Made, and Questions to Follow Up. Present each section as a bullet list."

Use Case 8 — Error Screenshot in Code

Image: A screenshot of a terminal showing an error message

Prompt: "Read the error message in this screenshot. Identify what type of error it is, explain in plain language what caused it, and suggest how to fix it."

Tips for Writing Strong Multimodal Prompts

Be Specific About What to Focus On

Images contain many elements. The prompt should direct the AI's attention to what matters most.

Vague: "Tell me about this image."
Specific: "Focus on the text content in the top-left section of this image and transcribe it exactly."

Specify the Output Format

Just like text-only prompts, multimodal prompts benefit from format instructions — especially when the image contains structured data like tables, charts, or lists.

"Extract all the numerical data from this chart and present it in a table with columns: Month, Value, Change from Previous Month."

Provide Context About the Image

If the image has context that the AI might not infer from looking at it, include that in the prompt.

"This is a screenshot from our internal sales dashboard from Q3 2024. The data shows regional performance..."

Ask Follow-Up Questions in the Same Message

Combining analysis and action in one multimodal prompt saves time.

"Describe what this flowchart shows. Then list the three decision points and explain what happens in each path."

Limitations of Multimodal AI

Low-resolution images: Blurry or very small text in images may not be readable accurately
Complex diagrams: Highly technical diagrams (circuit boards, medical scans) may be interpreted with errors — always verify critical information
Privacy: Never share images containing sensitive personal data — names, ID numbers, medical records, financial details — with a public AI tool
Not all AI tools support images: Confirm the tool supports image input before attempting multimodal prompting

Real-World Applications of Multimodal Prompting

Field	Multimodal Use Case
E-commerce	Analyze product photos and generate SEO-optimized descriptions
Education	Upload a textbook diagram and ask for an explanation
Healthcare Admin	Read printed medical forms and extract structured data
Engineering	Upload a schematic and ask for an explanation of a specific component
Marketing	Review a competitor's advertisement and analyze messaging and design choices
Finance	Upload a scanned invoice and extract all line items into a spreadsheet format
Software Development	Share a UI wireframe and ask the AI to write the corresponding HTML/CSS

Key Takeaway

Multimodal prompting combines text instructions with image input to unlock a wide range of AI capabilities — from reading receipts and analyzing charts to reviewing designs and debugging code from screenshots. The same principles of specificity, context, and format instructions that apply to text prompts apply here too. Clear, specific multimodal prompts that direct the AI's attention to relevant parts of the image produce far better results than vague requests to simply "describe" or "analyze" an image.

In the next topic, we will explore Prompt Injection and Security — understanding how prompts can be exploited and how to protect AI systems from manipulation.

Previous lessons

Back to courses

Next lessons