Testing Image Generation Models with Explainable Human Evaluation


ImagenWorld

Samin Mahdizadeh Sani*, Max Ku*, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Cheng, Ping Nie, Wenhu Chen

TL;DR

We build a 6-task x 6-domain benchmark with 3.6K condition sets and 20K explainable human annotations that stress-tests image generation and editing, shows where models break (notably on local edits and text-heavy content), benchmarks VLM-as-judge baselines against human judgments, and identifies key failure modes.

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially on local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human rankings, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
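As a concrete illustration of the ranking-agreement number above, the sketch below computes a pairwise Kendall accuracy, i.e., the fraction of output pairs that a VLM judge orders the same way as human annotators. This is a minimal sketch with toy data and a simplified tie-handling rule; it is not necessarily the exact protocol used in the paper.

from itertools import combinations

def kendall_accuracy(human_scores, judge_scores):
    # Fraction of item pairs ordered the same way by humans and the VLM judge.
    # Pairs that humans rate as tied are skipped (a simplification).
    agree, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        m = judge_scores[i] - judge_scores[j]
        if h == 0:
            continue
        total += 1
        if h * m > 0:
            agree += 1
    return agree / total if total else float("nan")

# Toy example: the judge agrees with the human ordering on 5 of 6 pairs.
human = [4.5, 3.0, 2.0, 4.0]
judge = [0.9, 0.5, 0.6, 0.8]
print(kendall_accuracy(human, judge))  # 0.833...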

Overview

We introduce ImagenWorld, a large-scale, human-centric benchmark designed to stress-test image generation models in real-world scenarios. Unlike prior evaluations that focus on isolated tasks or narrow domains, ImagenWorld is organized into six domains: Artworks, Photorealistic Images, Information Graphics, Textual Graphics, Computer Graphics, and Screenshots, and six tasks: Text-to-Image Generation (TIG), Single-Reference Image Generation (SRIG), Multi-Reference Image Generation (MRIG), Text-to-Image Editing (TIE), Single-Reference Image Editing (SRIE), and Multi-Reference Image Editing (MRIE). The benchmark includes 3.6K condition sets and 20K fine-grained human annotations, providing a comprehensive testbed for generative models. To support explainable evaluation, ImagenWorld applies object- and segment-level extraction to generated outputs, identifying entities such as objects and fine-grained regions. This structured decomposition enables human annotators to provide not only scalar ratings but also detailed tags of object-level and segment-level failures.
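To make the explainable annotation format concrete, here is a minimal, hypothetical sketch of what one annotation record could contain; the field names are illustrative and are not the released schema. Each record pairs a scalar rating with localized object-level and segment-level error tags.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectError:
    label: str   # e.g. "yellow crewmate"
    issue: str   # e.g. "missing", "wrong color", "distorted"

@dataclass
class SegmentError:
    polygon: List[Tuple[int, int]]  # region outline in pixel coordinates
    issue: str                      # e.g. "unreadable text", "broken axis"

@dataclass
class AnnotationRecord:
    task: str      # one of TIG / SRIG / MRIG / TIE / SRIE / MRIE
    domain: str    # e.g. "Information Graphics"
    rating: float  # scalar human score
    object_errors: List[ObjectError] = field(default_factory=list)
    segment_errors: List[SegmentError] = field(default_factory=list)

record = AnnotationRecord(
    task="MRIE",
    domain="Computer Graphics",
    rating=2.0,
    object_errors=[ObjectError("pink crewmate", "wrong color")],
    segment_errors=[SegmentError([(10, 10), (80, 10), (80, 60), (10, 60)],
                                 "sign placed in wrong location")],
)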


Dataset Preview

We build a diverse benchmark covering 6 topics x 6 tasks = 36 areas and annotate over 20K images in total, each with object-level and segment-level annotations that localize errors.


Figure 1: Illustrative samples from our dataset across six tasks: Text-to-Image Generation (TIG), Single-Reference Image Generation (SRIG), Multi-Reference Image Generation (MRIG), Text-to-Image Editing (TIE), Single-Reference Image Editing (SRIE), and Multi-Reference Image Editing (MRIE). For each task, we show both a successful generation (green) and a failure case (red).


Figure 2: Examples include object-level issues, where expected objects are missing or distorted, and segment-level issues, where annotations highlight specific regions with visual inconsistencies that affect evaluation scores.

Overall Results

  • Across ImagenWorld, models struggle more with editing tasks (TIE/SRIE/MRIE) than with generation tasks (TIG/SRIG/MRIG). Performance peaks on Artworks and Photorealistic Images, whereas Information Graphics and Screenshots are the most challenging due to their symbolic content and text-heavy, structured layouts.


Figure 3: Mean human evaluation scores across our four metrics by topic (left) and task (right).


Figure 4: Overall human rating by task and topic for the four unified models that support all six tasks.

  • Models struggle to execute localized edits reliably: AR–diffusion hybrids often overwrite the input with an entirely new image, while diffusion editors err in the opposite direction, frequently doing nothing and leaving the input unchanged. A simple detection heuristic is sketched after Figure 5.


Figure 5: Percentage of cases in image editing tasks where the model generates a completely new image or simply returns the input unchanged.
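One simple way to flag these two degenerate behaviors is to compare the edited output against the input image: a near-identical output suggests a no-op edit, while a very dissimilar output suggests the model regenerated the scene from scratch. The sketch below uses SSIM from scikit-image with illustrative thresholds; the paper's exact criterion may differ.

import numpy as np
from skimage.metrics import structural_similarity as ssim

def classify_edit_behavior(src: np.ndarray, out: np.ndarray,
                           copy_thresh: float = 0.98,
                           rewrite_thresh: float = 0.30) -> str:
    # src and out are RGB uint8 arrays of the same shape.
    # Thresholds are illustrative, not tuned values from the paper.
    score = ssim(src, out, channel_axis=-1)
    if score >= copy_thresh:
        return "returned input unchanged"
    if score <= rewrite_thresh:
        return "generated a completely new image"
    return "performed an actual edit"

# Usage sketch (images loaded elsewhere, e.g. with PIL):
# verdict = classify_edit_behavior(np.asarray(src_img), np.asarray(out_img))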

Common Failure Modes

1. Failing to Precisely Follow Instructions

Prompt:

Edit image 1. Replace the top-left crate with the yellow warning sign from image 3. Place the pink crewmate (from the center of image 2) and the yellow crewmate (from the bottom right of image 2) standing side-by-side on the central doorway in image 1. Ensure all new elements are integrated with correct perspective, lighting, and scale.


Figure 6: Instruction-following problem: The model placed red and green crewmates instead of pink and yellow, and the yellow sign’s position does not match the request.

2. Numerical Inconsistencies


Figure 7: Examples of numerical inconsistencies.

3. Segments and Labeling Issues


Figure 8: Examples of labeling issues.

4. Generating a New Image Instead of Editing


Figure 9: Examples of generating a new image when the task is editing.

5. Plot and Chart Errors


Figure 10: Examples of plot and diagram issues.

6. Unreadable Text


Figure 11: Examples of text issues.

Citation

Please cite our paper if you use our code, data, models, or results:

@misc{imagenworld2025,
  title       = {ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks},
  author      = {Samin Mahdizadeh Sani and Max Ku and Nima Jamali and Matina Mahdizadeh Sani and Paria Khoshtab and Wei-Chieh Sun and Parnian Fazel and Zhi Rui Tam and Thomas Chong and Edisy Kin Wai Chan and Donald Wai Tong Tsang and Chiao-Wei Hsu and Ting Wai Lam and Ho Yin Sam Ng and Chiafeng Chu and Chak-Wing Mak and Keming Wu and Hiu Tung Wong and Yik Chun Ho and Chi Ruan and Zhuofeng Li and I-Sheng Fang and Shih-Ying Yeh and Ho Kei Cheng and Ping Nie and Wenhu Chen},
  year        = {2025},
  doi         = {10.5281/zenodo.17344183},
  url         = {https://zenodo.org/records/17344183},
  projectpage = {https://tiger-ai-lab.github.io/ImagenWorld/},
  blogpost    = {https://blog.comfy.org/p/introducing-imagenworld},
  note        = {Community-driven dataset and benchmark release, temporarily archived on Zenodo while the arXiv submission is under moderation review.},
}