ICME25 Grand Challenge Inova: Interleaved Image-Text Comprehension and Generation Challenge

Program

Coming soon

Demonstrations and task taxonomy of the proposed Inova challenge for two tracks.
Demonstrations of Inova Challenge

Leaderboard-Track A

Rank Team Score

Leaderboard-Track B

Rank Team Score

Overview

The Interleaved Image-Text Comprehension and Generation Challenge aims to advance the interleaved vision-language instruction-following capability of AI systems by addressing crucial tasks in two sub-tracks: i) interleaved image-text comprehension and ii) interleaved image-text generation. These tasks involve multiple images as visual context and call for comprehensive, unified multi-modal comprehension and generation from generative multi-modal AI.

  • Track A, Interleaved Image-Text Comprehension, focuses on understanding, analysis, and inference across multiple images. Participants should understand the relationships within interleaved image-text sequences and produce the corresponding answers to predefined open-ended questions or multiple-choice options.
  • Track B, Interleaved Image-Text Generation, emphasizes generating subsequent images that follow sequential interleaved image-text contexts. Participants should generate the desired images based on the given references or contexts, following the instruction semantics while maintaining content consistency.

By addressing these tasks, the challenge seeks to promote the mutual enhancement of the understanding and generative capabilities of multi-modal AI systems when facing intertwined visual and textual information.

Task Details

Track A: Interleaved Image-Text Comprehension:

  • Multi-Image Reasoning. This category focuses on the comparative analysis between images and texts.
  • Document and Knowledge-Based Understanding. This category requires extraction and comprehension of information from structured formats.
  • Interactive Multi-Modal Communication. This category focuses on the dynamic interaction between visual and textual modalities in a conversational context.
  • Multi-Image Discrimination. This task involves analyzing two given photos to determine their similarities and differences across multiple dimensions.

Track B: Interleaved Image-Text Generation:

  • Sequential Visual Generation. This task category encompasses the subsequent image generation of coherent narratives based on sequential visual inputs.
  • Material-based Image Coloring. This task involves coloring a specific object in the target image according to a given material image, so that the object takes on the corresponding material's appearance.
  • Visual Reference Customization. This task requires the model to extract and apply pixel-level visual details from reference images, such as cartoon characters or clothing sketches.
Inova Challenge Task Details
Task | Scenario | Dataset | Metric

Track A: Interleaved Image-Text Comprehension

Multi-Image Reasoning
Visual Change Captioning | Surveillance | Spot-the-Diff | ROUGE-L
Visual Change Captioning | Synthetic | CLEVR-Change | ROUGE-L
Visual Relationship Expressing | General | IEdit | ROUGE-L
Subtle Difference Expressing | Fine-Grained | Birds-to-Words | ROUGE-L
Image-Set QA | Driving Recording | nuScenes | Accuracy
Industrial Inspection | Industrial | VISION | Accuracy
Fashion QA | Fashion | Fashion200K | Accuracy
Property Coherence | General | MIT-States-PropertyCoherence | Accuracy
State Transformation Coherence | General | MIT-States-StateCoherence | Accuracy
Visual Step Matching | Recipe | RecipeQA-ImageCoherence | Accuracy
Multi-Image Visual Entailment | General | NLVR2 | Accuracy
Ambiguity Analysis | Mobile Photo | VizWiz | Accuracy

Document and Knowledge-Based Understanding
Slide QA | Slide | SlideVQA | Accuracy
OCR QA | Book Cover | OCR-VQA | Accuracy
Document QA | Document Image | DocVQA | Accuracy
Webpage QA | Webpage | WebQA | Accuracy
Textbook QA | Textbook | TQA | Accuracy
Complex Multimodal QA | Wikipedia | MMQA | Accuracy

Interactive Multi-Modal Communication
Conversational Embodied Dialogue | Embodied | ALFRED | ROUGE-L
Multi-Modal Dialogue | Conversation | MMCoQA | ROUGE-L

Multi-Image Discrimination
Facial Comparison | General | LFW | Accuracy
Similarity Dimension Selection | General | Totally-Looks-Like | Accuracy

Track B: Interleaved Image-Text Generation

Sequential Visual Generation
Animated Story Completion | Cartoon | AESOP | ROUGE-L & Similarity
Animated Story Completion | Cartoon | PororoSV | ROUGE-L & Similarity
Animated Story Completion | Cartoon | FlintstonesSV | ROUGE-L & Similarity
Sequential Photo Storytelling | Album | VIST | ROUGE-L & Similarity
Sequential Photo Storytelling | Cartoon | DiDeMoSV | ROUGE-L & Similarity
Comic Dialogue Identification | Cartoon | COMICS-Dialogue | ROUGE-L & Similarity
Comic Panel Identification | Cartoon | COMICS-Panel | ROUGE-L & Similarity
Recipe Completion | Recipe | RecipeQA-TextCloze | ROUGE-L & Similarity
Visual Step Cloze | Recipe | RecipeQA-VisualCloze | ROUGE-L & Similarity

Material-based Image Coloring
Material Transfer | Industrial | FMD | Similarity
Texture Transfer | Physical | KTH-TIPS2 | Similarity

Visual Reference Customization
Virtual Try-on | Fashion | VTON-HD | Similarity
Visual Reference | General | MSCOCO | Similarity

Dataset-Track A

For each task, we will provide a dataset with a training set and a test set. The annotations are in the form of a JSON file. An example of the task metadata is shown below.

"metadata": {
        "dataset": "Totally-Looks-Like",
        "split": "test",
        "num_sample": "50",
        "task_instruction": [
            "You are provided with a dataset comprising two images and text; accurately distinguish two images in multiple dimensions and accurately answer the following question. You must only answer with 'yes' or 'no'.",
            "Analyze the given data containing two images and text, please distinguish two images in multiple dimensions and answer the subsequent question accurately. You must only answer with 'yes' or 'no'.",
            "Based on the data provided, distinguish two images in multiple dimensions, and give a precise answer to the question. You must only answer with 'yes' or 'no'.",
            "Given a dataset consisting of two images and text, distinguish two images in multiple dimensions and respond to questions correctly. You must only answer with 'yes' or 'no'.",
            "Your objective is to distinguish two images in multiple dimensions, please answer the question accurately. You must only answer with 'yes' or 'no'.",
            "Working with a dataset that has two images and text, provides an accurate answer to the question. You must only answer with 'yes' or 'no'.",
            "Reviewing the provided data, distinguish two images in multiple dimensions and answer the question precisely. You must only answer with 'yes' or 'no'.",
            "Based on the dataset featuring two images and accompanying text, distinguish two images in multiple dimensions and determine the correct answer. You must only answer with 'yes' or 'no'.",
            "Assess the two images and text in the dataset, then distinguish two images in multiple dimensions and answer the subsequent question. You must only answer with 'yes' or 'no'.",
            "Interpret the given dataset to distinguish two images in multiple dimensions and formulate an accurate response. You must only answer with 'yes' or 'no'."
        ],
        "question_type": "yes-no"
    }

An example of an instance annotation is shown below.

"annotations": [
        {
            "sample_id": "0",
            "meta_instruction_id": "1",
            "instance": {
                "context": "Image_1: {image#1} Image_2: {image#2} Question: Compare the two images in terms of facial features, global shape, near-duplicates, facial resemblance, textural patterns, and color composition. Return 'yes' if they exhibit similarities in any of these aspects, otherwise return 'no'.",
                "images_path": [
                    "images/01905_0.jpg",
                    "images/01905_1.jpg"
                ]
            },
            "response": ""
        },
        ...
    ]
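The metadata and annotations above can be combined into a complete model prompt. The sketch below assumes, as the excerpts suggest, that each task file is a JSON object with "metadata" and "annotations" keys, and that meta_instruction_id is a 0-based index into task_instruction; both the prompt layout and these indexing details are assumptions, not part of the official specification.

```python
import json

def load_task(path):
    """Read one Inova task file (assumed: a JSON object with
    "metadata" and "annotations" keys, per the excerpts above)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def build_prompt(metadata, annotation):
    """Assemble a text prompt and the list of interleaved image paths.
    meta_instruction_id is assumed to be a 0-based index into
    task_instruction; the context string keeps its {image#N} placeholders,
    which mark where each image is interleaved."""
    instruction = metadata["task_instruction"][int(annotation["meta_instruction_id"])]
    context = annotation["instance"]["context"]
    images = annotation["instance"]["images_path"]
    return instruction + "\n" + context, images
```
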

Dataset-Track B

For each task, we will provide a dataset with a training set and a test set. The annotations are in the form of a JSON file. An example of the task metadata is shown below.

"metadata": {
        "dataset": "KTH-TIPS2",
        "split": "test",
        "num_sample": "500",
        "task_instruction": [
            "Using the provided dataset containing images, text, and texture, your objective is to precisely apply the texture to the designated object in the image as outlined in the question. You must generate the corresponding image instead of text.",
            "Given a set of relevant data comprising images, text, and texture, your task is to accurately transfer the texture to the specified object in the image as described in the question. You must generate the corresponding image instead of text.",
            "With the supplied collection of images, text, and texture, your goal is to meticulously apply the texture to the indicated object in the image as per the question's instructions. You must generate the corresponding image instead of text.",
            "Based on the provided data, which includes images, text, and texture, your responsibility is to precisely map the texture onto the specified object in the image as detailed in the question. You must generate the corresponding image instead of text.",
            "Using the available dataset of images, text, and texture, your task is to carefully transfer the texture to the designated object in the image exactly as instructed in the question. You must generate the corresponding image instead of text.",
            "Given the relevant data, including images, text, and texture, your objective is to accurately apply the texture to the specified object in the image as described in the question. You must generate the corresponding image instead of text.",
            "With the provided images, text, and texture, your task is to precisely overlay the texture onto the designated object in the image as outlined in the question. You must generate the corresponding image instead of text.",
            "Using the collection of images, text, and texture provided, your goal is to meticulously transfer the texture to the specified object in the image as per the question's instructions. You must generate the corresponding image instead of text.",
            "Given the dataset containing images, text, and texture, your responsibility is to accurately map the texture onto the designated object in the image as detailed in the question. You must generate the corresponding image instead of text.",
            "With the supplied data, which includes images, text, and texture, your task is to carefully apply the texture to the specified object in the image exactly as instructed in the question. You must generate the corresponding image instead of text."
        ],
        "question_type": "image-generation"
    }

An example of an instance annotation is shown below.

"annotations": [
        {
            "sample_id": "0",
            "meta_instruction_id": "4",
            "instance": {
                "context": "Question: What does it look like to replace the cat's texture in the image {image#1} with the texture {image#2} given",
                "images_path": [
                    "images/0.jpg",
                    "images/1.jpg"
                ]
            },
            "response": ""
        },
        {
            "sample_id": "1",
            "meta_instruction_id": "2",
            "instance": {
                "context": "Question: What does it look like to replace the horse's texture in the image {image#1} with the texture {image#2} given",
                "images_path": [
                    "images/2.jpg",
                    "images/3.jpg"
                ]
            },
            "response": ""
        },
        ...
    ]

The dataset can be downloaded from Google Drive.

Inova Dataset Statistics
Track | Split | Tasks | Scenarios | Images | Instructions | Avg. Images / Instruction
Track A: Interleaved Image-Text Comprehension | Inova-Test | 22 | 17 | 52.4K | 15.0K | 3.5
Track A: Interleaved Image-Text Comprehension | Inova-Train | 22 | 17 | 485.1K | 127.7K | 3.8
Track B: Interleaved Image-Text Generation | Inova-Test | 13 | 7 | 25.7K | 8.9K | 2.9
Track B: Interleaved Image-Text Generation | Inova-Train | 13 | 7 | 231.6K | 68.1K | 3.4

Evaluation

(I) For Track A, interleaved image-text comprehension, the evaluation uses ROUGE-L (F1) to assess the semantic and structural alignment of generated text with the reference texts of open-ended comprehension tasks, and Accuracy to measure the correctness of the selected options in multiple-choice comprehension tasks.

(II) For Track B, interleaved image-text generation, ROUGE-L & Similarity serve as the evaluation metrics. We use semantic similarity to evaluate how well the generated image follows the instruction, and visual similarity to evaluate its content consistency.

The overall score for each team will be defined as the mean of these scores across all tasks for each track, reflecting a comprehensive measure of performance akin to the scoring of human examinations.
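For reference, ROUGE-L (F1) can be sketched as a longest-common-subsequence score over tokens. The minimal implementation below assumes simple whitespace tokenization, which may differ from the challenge's official evaluation script.

```python
def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L F1: LCS length over tokens, combining precision and recall.
    Whitespace tokenization is an assumption; the official scorer may differ."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(hyp)]
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The per-track overall score is then simply the mean of such per-task scores.
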

Submission

To participate in the Inova challenge, please first register by submitting the form.

Participants can send the submission files to inova_2025@outlook.com with the subject "Inova Challenge Submission-Track A/B". We will evaluate the submissions and announce the results on the website later.

The submission file should keep the same structure as the Inova-Test dataset, with the response field filled with the predicted answer. Image folders are not required for submission.

[Important] Challenge overview papers: You may submit a 6-page paper (same template as the main paper track) summarizing the data, results, and main takeaways from the challenge. This will go to the ICMEW (workshop) proceedings together with the challenge papers. Participants can send the challenge overview paper to inova_2025@outlook.com with the subject "Inova Challenge Paper Submission-Track A/B".

Important Dates

Registration Open: 2025-2-15

Training Data Release: 2025-3-10

Challenge Result Submission Deadline: 2025-4-20

Challenge Technical Paper Submission Deadline: 2025-4-30

Final Decisions: 2025-5-15

Camera Ready Submission Deadline: 2025-5-25

Organizers

Dong Chen, Zhengzhou University, China

Fei Gao, Zhengzhou University, China

Zhengqing Hu, Zhengzhou University, China

Xiaojun Chang, University of Science and Technology of China, China

Contact

For any questions, please contact us at inova_2025@outlook.com.