Coming soon
| Rank | Team | Score |
|---|---|---|

| Rank | Team | Score |
|---|---|---|
The Interleaved Image-Text Comprehension and Generation Challenge aims to advance the interleaved vision-language instruction-following capabilities of AI systems by addressing crucial tasks in two sub-tracks: i) interleaved image-text comprehension and ii) interleaved image-text generation. These tasks involve multiple images as visual context and call for comprehensive, unified multi-modal comprehension and generation abilities in generative multi-modal AI.
By addressing these tasks, the challenge seeks to promote the mutual enhancement of the understanding and generative capabilities of multi-modal AI systems when facing intertwined visual and textual information.
Track A: Interleaved Image-Text Comprehension
Track B: Interleaved Image-Text Generation
| Task | Scenario | Dataset | Metric |
|---|---|---|---|
| **Track A: Interleaved Image-Text Comprehension** | | | |
| *Multi-Image Reasoning* | | | |
| Visual Change Captioning | Surveillance | Spot-the-Diff | ROUGE-L |
| Visual Change Captioning | Synthetic | CLEVR-Change | ROUGE-L |
| Visual Relationship Expressing | General | IEdit | ROUGE-L |
| Subtle Difference Expressing | Fine-Grained | Birds-to-Words | ROUGE-L |
| Image-Set QA | Driving Recording | nuScenes | Accuracy |
| Industrial Inspection | Industrial | VISION | Accuracy |
| Fashion QA | Fashion | Fashion200K | Accuracy |
| Property Coherence | General | MIT-States-PropertyCoherence | Accuracy |
| State Transformation Coherence | General | MIT-States-StateCoherence | Accuracy |
| Visual Step Matching | Recipe | RecipeQA-ImageCoherence | Accuracy |
| Multi-Image Visual Entailment | General | NLVR2 | Accuracy |
| Ambiguity Analysis | Mobile Photo | VizWiz | Accuracy |
| *Document and Knowledge-Based Understanding* | | | |
| Slide QA | Slide | SlideVQA | Accuracy |
| OCR QA | Book Cover | OCR-VQA | Accuracy |
| Document QA | Document Image | DocVQA | Accuracy |
| Webpage QA | Webpage | WebQA | Accuracy |
| Textbook QA | Textbook | TQA | Accuracy |
| Complex Multimodal QA | Wikipedia | MMQA | Accuracy |
| *Interactive Multi-Modal Communication* | | | |
| Conversational Embodied Dialogue | Embodied | ALFRED | ROUGE-L |
| Multi-Modal Dialogue | Conversation | MMCoQA | ROUGE-L |
| *Multi-Image Discrimination* | | | |
| Facial Comparison | General | LFW | Accuracy |
| Similarity Dimension Selection | General | Totally-Looks-Like | Accuracy |
| **Track B: Interleaved Image-Text Generation** | | | |
| *Sequential Visual Generation* | | | |
| Animated Story Completion | Cartoon | AESOP | ROUGE-L & Similarity |
| Animated Story Completion | Cartoon | PororoSV | ROUGE-L & Similarity |
| Animated Story Completion | Cartoon | FlintstonesSV | ROUGE-L & Similarity |
| Sequential Photo Storytelling | Album | VIST | ROUGE-L & Similarity |
| Sequential Photo Storytelling | Cartoon | DiDeMoSV | ROUGE-L & Similarity |
| Comic Dialogue Identification | Cartoon | COMICS-Dialogue | ROUGE-L & Similarity |
| Comic Panel Identification | Cartoon | COMICS-Panel | ROUGE-L & Similarity |
| Recipe Completion | Recipe | RecipeQA-TextCloze | ROUGE-L & Similarity |
| Visual Step Cloze | Recipe | RecipeQA-VisualCloze | ROUGE-L & Similarity |
| *Material-based Image Coloring* | | | |
| Material Transfer | Industrial | FMD | Similarity |
| Texture Transfer | Physical | KTH-TIPS2 | Similarity |
| *Visual Reference Customization* | | | |
| Virtual Try-on | Fashion | VTON-HD | Similarity |
| Visual Reference | General | MSCOCO | Similarity |
For each task, we will provide a dataset with a training set and a test set. The annotations are stored in a JSON file. An example of the task metadata for Track A (Similarity Dimension Selection on Totally-Looks-Like) is shown below.
"metadata": {
"dataset": "Totally-Looks-Like",
"split": "test",
"num_sample": "50",
"task_instruction": [
"You are provided with a dataset comprising two images and text; accurately distinguish two images in multiple dimensions and accurately answer the following question. You must only answer with 'yes' or 'no'.",
"Analyze the given data containing two images and text, please distinguish two images in multiple dimensions and answer the subsequent question accurately. You must only answer with 'yes' or 'no'.",
"Based on the data provided, distinguish two images in multiple dimensions, and give a precise answer to the question. You must only answer with 'yes' or 'no'.",
"Given a dataset consisting of two images and text, distinguish two images in multiple dimensions and respond to questions correctly. You must only answer with 'yes' or 'no'.",
"Your objective is to distinguish two images in multiple dimensions, please answer the question accurately. You must only answer with 'yes' or 'no'.",
"Working with a dataset that has two images and text, provides an accurate answer to the question. You must only answer with 'yes' or 'no'.",
"Reviewing the provided data, distinguish two images in multiple dimensions and answer the question precisely. You must only answer with 'yes' or 'no'.",
"Based on the dataset featuring two images and accompanying text, distinguish two images in multiple dimensions and determine the correct answer. You must only answer with 'yes' or 'no'.",
"Assess the two images and text in the dataset, then distinguish two images in multiple dimensions and answer the subsequent question. You must only answer with 'yes' or 'no'.",
"Interpret the given dataset to distinguish two images in multiple dimensions and formulate an accurate response. You must only answer with 'yes' or 'no'."
],
"question_type": "yes-no"
}
An example of the instance annotation is shown below.
"annotations": [
{
"sample_id": "0",
"meta_instruction_id": "1",
"instance": {
"context": "Image_1: {image#1} Image_2: {image#2} Question: Compare the two images in terms of facial features, global shape, near-duplicates, facial resemblance, textural patterns, and color composition. Return 'yes' if they exhibit similarities in any of these aspects, otherwise return 'no'.",
"images_path": [
"images/01905_0.jpg",
"images/01905_1.jpg"
]
},
"response": ""
},
...
]
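To make the format concrete, below is a minimal Python sketch of how a participant might assemble a model prompt from one of these annotation files. The file name `annotations.json`, the assumption that `meta_instruction_id` is a zero-based index into `task_instruction`, and the `<path>` placeholder convention are illustrative, not part of an official toolkit.

```python
import json

# A minimal sketch of turning one Track A sample into a model prompt.
# "annotations.json" and the zero-based meta_instruction_id are assumptions.
with open("annotations.json", "r", encoding="utf-8") as f:
    task = json.load(f)

instructions = task["metadata"]["task_instruction"]

for sample in task["annotations"]:
    # Pick the paraphrased instruction this sample was annotated with.
    instruction = instructions[int(sample["meta_instruction_id"])]
    context = sample["instance"]["context"]

    # Align each {image#k} placeholder with the k-th entry of images_path.
    for k, path in enumerate(sample["instance"]["images_path"], start=1):
        context = context.replace(f"{{image#{k}}}", f"<{path}>")

    prompt = f"{instruction}\n{context}"
    print(prompt)  # feed the prompt plus the images to the model here
```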
Track B tasks follow the same annotation format. An example of the task metadata for Track B (Texture Transfer on KTH-TIPS2) is shown below.
"metadata": {
"dataset": "KTH-TIPS2",
"split": "test",
"num_sample": "500",
"task_instruction": [
"Using the provided dataset containing images, text, and texture, your objective is to precisely apply the texture to the designated object in the image as outlined in the question. You must generate the corresponding image instead of text.",
"Given a set of relevant data comprising images, text, and texture, your task is to accurately transfer the texture to the specified object in the image as described in the question. You must generate the corresponding image instead of text.",
"With the supplied collection of images, text, and texture, your goal is to meticulously apply the texture to the indicated object in the image as per the question's instructions. You must generate the corresponding image instead of text.",
"Based on the provided data, which includes images, text, and texture, your responsibility is to precisely map the texture onto the specified object in the image as detailed in the question. You must generate the corresponding image instead of text.",
"Using the available dataset of images, text, and texture, your task is to carefully transfer the texture to the designated object in the image exactly as instructed in the question. You must generate the corresponding image instead of text.",
"Given the relevant data, including images, text, and texture, your objective is to accurately apply the texture to the specified object in the image as described in the question. You must generate the corresponding image instead of text.",
"With the provided images, text, and texture, your task is to precisely overlay the texture onto the designated object in the image as outlined in the question. You must generate the corresponding image instead of text.",
"Using the collection of images, text, and texture provided, your goal is to meticulously transfer the texture to the specified object in the image as per the question's instructions. You must generate the corresponding image instead of text.",
"Given the dataset containing images, text, and texture, your responsibility is to accurately map the texture onto the designated object in the image as detailed in the question. You must generate the corresponding image instead of text.",
"With the supplied data, which includes images, text, and texture, your task is to carefully apply the texture to the specified object in the image exactly as instructed in the question. You must generate the corresponding image instead of text."
],
"question_type": "image-generation"
}
An example of the instance annotation is shown below.
"annotations": [
{
"sample_id": "0",
"meta_instruction_id": "4",
"instance": {
"context": "Question: What does it look like to replace the cat's texture in the image {image#1} with the texture {image#2} given",
"images_path": [
"images/0.jpg",
"images/1.jpg"
]
},
"response": ""
},
{
"sample_id": "1",
"meta_instruction_id": "2",
"instance": {
"context": "Question: What does it look like to replace the horse's texture in the image {image#1} with the texture {image#2} given",
"images_path": [
"images/2.jpg",
"images/3.jpg"
]
},
"response": ""
},
...
]
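For Track B, the same annotation structure drives image generation rather than text. The sketch below shows one way to iterate over samples and save one generated image per `sample_id`; `generate_image` is a hypothetical stand-in for a participant's own interleaved generation model, and all file names are illustrative.

```python
import json
from pathlib import Path

from PIL import Image

def generate_image(instruction: str, context: str, images: list) -> Image.Image:
    """Hypothetical hook for a participant's interleaved generation model."""
    raise NotImplementedError

task = json.loads(Path("annotations.json").read_text(encoding="utf-8"))
instructions = task["metadata"]["task_instruction"]

out_dir = Path("generated")
out_dir.mkdir(exist_ok=True)

for sample in task["annotations"]:
    instruction = instructions[int(sample["meta_instruction_id"])]
    # Source image first, texture/material reference second, as in the examples.
    images = [Image.open(p) for p in sample["instance"]["images_path"]]
    result = generate_image(instruction, sample["instance"]["context"], images)
    result.save(out_dir / f"{sample['sample_id']}.jpg")
```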
The dataset can be downloaded from Google Drive.
| Track | Split | Tasks | Scenarios | Images | Instructions | Avg. Images / Instruction |
|---|---|---|---|---|---|---|
| Track A: Interleaved Image-Text Comprehension | Inova-Test | 22 | 17 | 52.4K | 15.0K | 3.5 |
| Track A: Interleaved Image-Text Comprehension | Inova-Train | 22 | 17 | 485.1K | 127.7K | 3.8 |
| Track B: Interleaved Image-Text Generation | Inova-Test | 13 | 7 | 25.7K | 8.9K | 2.9 |
| Track B: Interleaved Image-Text Generation | Inova-Train | 13 | 7 | 231.6K | 68.1K | 3.4 |
(I) For Track A (Interleaved Image-Text Comprehension), the evaluation will use ROUGE-L (F1) to assess the semantic and structural alignment of the generated text with the reference texts for open-ended comprehension tasks, and Accuracy to measure the correctness of the selected option for multiple-choice comprehension tasks.
(II) For Track B (Interleaved Image-Text Generation), ROUGE-L & Similarity will serve as the evaluation metrics. We use semantic similarity to evaluate how faithfully the generated image follows the instruction, and visual similarity to evaluate the content consistency of the generated image.
The overall score for each team will be the mean of these scores across all tasks in a track, giving a comprehensive measure of performance akin to the scoring of human examinations.
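As a reference, the snippet below sketches how the Track A metrics and the per-track mean could be reproduced with the open-source `rouge-score` package; the official evaluation script may differ in tokenization, normalization, and the exact similarity model used for Track B.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(reference: str, prediction: str) -> float:
    # ROUGE-L F1 between one reference text and one prediction.
    return scorer.score(reference, prediction)["rougeL"].fmeasure

def accuracy(references: list[str], predictions: list[str]) -> float:
    # Exact-match accuracy for multiple-choice / yes-no tasks.
    hits = sum(r.strip().lower() == p.strip().lower()
               for r, p in zip(references, predictions))
    return hits / len(references)

def overall_score(task_scores: list[float]) -> float:
    # A track's overall score is the mean over its per-task scores.
    return sum(task_scores) / len(task_scores)
```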
To participate in the Inova challenge, please first register by submitting the form.
Participants can send their submission files to inova_2025@outlook.com with the subject "Inova Challenge Submission-Track A/B". We will evaluate the submissions and announce the results on this website.
The submission file should keep the same structure as the Inova-Test dataset, with the response field filled with the predicted answer. Image folders are not required for submission.
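Concretely, preparing a submission can be as simple as loading each Inova-Test annotation file, filling in the `response` fields, and writing the file back without the image folders; the snippet below is an illustrative sketch with made-up file names and predictions.

```python
import json

# Fill the empty "response" fields of an Inova-Test annotation file.
# File names and the predictions mapping are illustrative.
with open("annotations.json", "r", encoding="utf-8") as f:
    task = json.load(f)

predictions = {"0": "yes", "1": "no"}  # sample_id -> predicted answer

for sample in task["annotations"]:
    sample["response"] = predictions.get(sample["sample_id"], "")

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(task, f, ensure_ascii=False, indent=2)
```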
[Important] Challenge overview papers: You may submit a 6-page paper (using the same template as the main paper track) summarizing the data, results, and main takeaways from the challenge. It will appear in the ICMEW (workshop) proceedings together with the challenge papers. Participants can send the challenge overview paper to inova_2025@outlook.com with the subject "Inova Challenge Paper Submission-Track A/B".
Registration Open: 2025-02-15
Training Data Release: 2025-03-10
Challenge Result Submission Deadline: 2025-04-20
Challenge Technical Paper Submission Deadline: 2025-04-30
Final Decisions: 2025-05-15
Camera Ready Submission Deadline: 2025-05-25
Dong Chen, Zhengzhou University, China
Fei Gao, Zhengzhou University, China
Zhengqing Hu, Zhengzhou University, China
Xiaojun Chang, University of Science and Technology of China, China
For any questions, please contact us at inova_2025@outlook.com.