Coming soon
| Rank | Team | Score |
|---|---|---|

| Rank | Team | Score |
|---|---|---|
The Interleaved Image-Text Comprehension and Generation Challenge aims to advance the interleaved vision-language instruction-following capabilities of AI systems by addressing crucial tasks in two sub-tracks: i) interleaved image-text comprehension and ii) interleaved image-text generation. These tasks involve multiple images as visual context and call for comprehensive, unified multi-modal comprehension and generation abilities in generative multi-modal AI.
By addressing these tasks, the challenge seeks to promote the mutual enhancement of the understanding and generative capabilities of multi-modal AI systems when facing intertwined visual and textual information.
Track A: Interleaved Image-Text Comprehension
Track B: Interleaved Image-Text Generation
| Task | Scenario | Dataset | Metric |
|---|---|---|---|
| **Track A: Interleaved Image-Text Comprehension** | | | |
| *Multi-Image Reasoning* | | | |
| Visual Change Captioning | Surveillance | Spot-the-Diff | ROUGE-L |
| Visual Change Captioning | Synthetic | CLEVR-Change | ROUGE-L |
| Visual Relationship Expressing | General | IEdit | ROUGE-L |
| Subtle Difference Expressing | Fine-Grained | Birds-to-Words | ROUGE-L |
| Image-Set QA | Driving Recording | nuScenes | Accuracy |
| Industrial Inspection | Industrial | VISION | Accuracy |
| Fashion QA | Fashion | Fashion200K | Accuracy |
| Property Coherence | General | MIT-States-PropertyCoherence | Accuracy |
| State Transformation Coherence | General | MIT-States-StateCoherence | Accuracy |
| Visual Step Matching | Recipe | RecipeQA-ImageCoherence | Accuracy |
| Multi-Image Visual Entailment | General | NLVR2 | Accuracy |
| Ambiguity Analysis | Mobile Photo | VizWiz | Accuracy |
| *Document and Knowledge-Based Understanding* | | | |
| Slide QA | Slide | SlideVQA | Accuracy |
| OCR QA | Book Cover | OCR-VQA | Accuracy |
| Document QA | Document Image | DocVQA | Accuracy |
| Webpage QA | Webpage | WebQA | Accuracy |
| Textbook QA | Textbook | TQA | Accuracy |
| Complex Multimodal QA | Wikipedia | MMQA | Accuracy |
| *Interactive Multi-Modal Communication* | | | |
| Conversational Embodied Dialogue | Embodied | ALFRED | ROUGE-L |
| Multi-Modal Dialogue | Conversation | MMCoQA | ROUGE-L |
| *Multi-Image Discrimination* | | | |
| Facial Comparison | General | LFW | Accuracy |
| Similarity Dimension Selection | General | Totally-Looks-Like | Accuracy |
| **Track B: Interleaved Image-Text Generation** | | | |
| *Sequential Visual Generation* | | | |
| Animated Story Completion | Cartoon | AESOP | ROUGE-L & Similarity |
| Animated Story Completion | Cartoon | PororoSV | ROUGE-L & Similarity |
| Animated Story Completion | Cartoon | FlintstonesSV | ROUGE-L & Similarity |
| Sequential Photo Storytelling | Album | VIST | ROUGE-L & Similarity |
| Sequential Photo Storytelling | Cartoon | DiDeMoSV | ROUGE-L & Similarity |
| Comic Dialogue Identification | Cartoon | COMICS-Dialogue | ROUGE-L & Similarity |
| Comic Panel Identification | Cartoon | COMICS-Panel | ROUGE-L & Similarity |
| Recipe Completion | Recipe | RecipeQA-TextCloze | ROUGE-L & Similarity |
| Visual Step Cloze | Recipe | RecipeQA-VisualCloze | ROUGE-L & Similarity |
| *Material-based Image Coloring* | | | |
| Material Transfer | Industrial | FMD | Similarity |
| Texture Transfer | Physical | KTH-TIPS2 | Similarity |
| *Visual Reference Customization* | | | |
| Virtual Try-on | Fashion | VTON-HD | Similarity |
| Visual Reference | General | MSCOCO | Similarity |
For each task, we will provide a dataset with a training set and a test set. The annotations are stored in a JSON file. An example of the task metadata for Track A (Similarity Dimension Selection on Totally-Looks-Like) is shown below.
"metadata": {
"dataset": "Totally-Looks-Like",
"split": "test",
"num_sample": "50",
"task_instruction": [
"You are provided with a dataset comprising two images and text; accurately distinguish two images in multiple dimensions and accurately answer the following question. You must only answer with 'yes' or 'no'.",
"Analyze the given data containing two images and text, please distinguish two images in multiple dimensions and answer the subsequent question accurately. You must only answer with 'yes' or 'no'.",
"Based on the data provided, distinguish two images in multiple dimensions, and give a precise answer to the question. You must only answer with 'yes' or 'no'.",
"Given a dataset consisting of two images and text, distinguish two images in multiple dimensions and respond to questions correctly. You must only answer with 'yes' or 'no'.",
"Your objective is to distinguish two images in multiple dimensions, please answer the question accurately. You must only answer with 'yes' or 'no'.",
"Working with a dataset that has two images and text, provides an accurate answer to the question. You must only answer with 'yes' or 'no'.",
"Reviewing the provided data, distinguish two images in multiple dimensions and answer the question precisely. You must only answer with 'yes' or 'no'.",
"Based on the dataset featuring two images and accompanying text, distinguish two images in multiple dimensions and determine the correct answer. You must only answer with 'yes' or 'no'.",
"Assess the two images and text in the dataset, then distinguish two images in multiple dimensions and answer the subsequent question. You must only answer with 'yes' or 'no'.",
"Interpret the given dataset to distinguish two images in multiple dimensions and formulate an accurate response. You must only answer with 'yes' or 'no'."
],
"question_type": "yes-no"
}
An example of the instance annotation is shown below.
"annotations": [
{
"sample_id": "0",
"meta_instruction_id": "1",
"instance": {
"context": "Image_1: {image#1} Image_2: {image#2} Question: Compare the two images in terms of facial features, global shape, near-duplicates, facial resemblance, textural patterns, and color composition. Return 'yes' if they exhibit similarities in any of these aspects, otherwise return 'no'.",
"images_path": [
"images/01905_0.jpg",
"images/01905_1.jpg"
]
},
"response": ""
},
...
]
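To make the format concrete, below is a minimal Python sketch of how a participant might assemble a model prompt from one of these annotation files. The file name `annotations.json`, the assumption that `meta_instruction_id` is a zero-based index into `task_instruction`, and the `<path>` placeholder convention are illustrative, not part of an official toolkit.

```python
import json

# A minimal sketch of turning one Track A sample into a model prompt.
# "annotations.json" and the zero-based meta_instruction_id are assumptions.
with open("annotations.json", "r", encoding="utf-8") as f:
    task = json.load(f)

instructions = task["metadata"]["task_instruction"]

for sample in task["annotations"]:
    # Pick the paraphrased instruction this sample was annotated with.
    instruction = instructions[int(sample["meta_instruction_id"])]
    context = sample["instance"]["context"]

    # Align each {image#k} placeholder with the k-th entry of images_path.
    for k, path in enumerate(sample["instance"]["images_path"], start=1):
        context = context.replace(f"{{image#{k}}}", f"<{path}>")

    prompt = f"{instruction}\n{context}"
    print(prompt)  # feed the prompt plus the images to the model here
```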
Track B tasks follow the same annotation format. An example of the task metadata for Track B (Texture Transfer on KTH-TIPS2) is shown below.
"metadata": {
"dataset": "KTH-TIPS2",
"split": "test",
"num_sample": "500",
"task_instruction": [
"Using the provided dataset containing images, text, and texture, your objective is to precisely apply the texture to the designated object in the image as outlined in the question. You must generate the corresponding image instead of text.",
"Given a set of relevant data comprising images, text, and texture, your task is to accurately transfer the texture to the specified object in the image as described in the question. You must generate the corresponding image instead of text.",
"With the supplied collection of images, text, and texture, your goal is to meticulously apply the texture to the indicated object in the image as per the question's instructions. You must generate the corresponding image instead of text.",
"Based on the provided data, which includes images, text, and texture, your responsibility is to precisely map the texture onto the specified object in the image as detailed in the question. You must generate the corresponding image instead of text.",
"Using the available dataset of images, text, and texture, your task is to carefully transfer the texture to the designated object in the image exactly as instructed in the question. You must generate the corresponding image instead of text.",
"Given the relevant data, including images, text, and texture, your objective is to accurately apply the texture to the specified object in the image as described in the question. You must generate the corresponding image instead of text.",
"With the provided images, text, and texture, your task is to precisely overlay the texture onto the designated object in the image as outlined in the question. You must generate the corresponding image instead of text.",
"Using the collection of images, text, and texture provided, your goal is to meticulously transfer the texture to the specified object in the image as per the question's instructions. You must generate the corresponding image instead of text.",
"Given the dataset containing images, text, and texture, your responsibility is to accurately map the texture onto the designated object in the image as detailed in the question. You must generate the corresponding image instead of text.",
"With the supplied data, which includes images, text, and texture, your task is to carefully apply the texture to the specified object in the image exactly as instructed in the question. You must generate the corresponding image instead of text."
],
"question_type": "image-generation"
}
An example of the instance annotation is shown below.
"annotations": [
{
"sample_id": "0",
"meta_instruction_id": "4",
"instance": {
"context": "Question: What does it look like to replace the cat's texture in the image {image#1} with the texture {image#2} given",
"images_path": [
"images/0.jpg",
"images/1.jpg"
]
},
"response": ""
},
{
"sample_id": "1",
"meta_instruction_id": "2",
"instance": {
"context": "Question: What does it look like to replace the horse's texture in the image {image#1} with the texture {image#2} given",
"images_path": [
"images/2.jpg",
"images/3.jpg"
]
},
"response": ""
},
...
]
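For Track B, the same annotation structure drives image generation rather than text. The sketch below shows one way to iterate over samples and save one generated image per `sample_id`; `generate_image` is a hypothetical stand-in for a participant's own interleaved generation model, and all file names are illustrative.

```python
import json
from pathlib import Path

from PIL import Image

def generate_image(instruction: str, context: str, images: list) -> Image.Image:
    """Hypothetical hook for a participant's interleaved generation model."""
    raise NotImplementedError

task = json.loads(Path("annotations.json").read_text(encoding="utf-8"))
instructions = task["metadata"]["task_instruction"]

out_dir = Path("generated")
out_dir.mkdir(exist_ok=True)

for sample in task["annotations"]:
    instruction = instructions[int(sample["meta_instruction_id"])]
    # Source image first, texture/material reference second, as in the examples.
    images = [Image.open(p) for p in sample["instance"]["images_path"]]
    result = generate_image(instruction, sample["instance"]["context"], images)
    result.save(out_dir / f"{sample['sample_id']}.jpg")
```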
The dataset can be downloaded from Google Drive.
| Track | Split | Tasks | Scenarios | Images | Instructions | Avg. Images / Instruction |
|---|---|---|---|---|---|---|
| Track A: Interleaved Image-Text Comprehension | Inova-Test | 22 | 17 | 52.4K | 15.0K | 3.5 |
| Track A: Interleaved Image-Text Comprehension | Inova-Train | 22 | 17 | 485.1K | 127.7K | 3.8 |
| Track B: Interleaved Image-Text Generation | Inova-Test | 13 | 7 | 25.7K | 8.9K | 2.9 |
| Track B: Interleaved Image-Text Generation | Inova-Train | 13 | 7 | 231.6K | 68.1K | 3.4 |
(I) For Track A (Interleaved Image-Text Comprehension), the evaluation will use ROUGE-L (F1) to assess the semantic and structural alignment of the generated text with the reference texts for open-ended comprehension tasks, and Accuracy to measure the correctness of the selected option for multiple-choice comprehension tasks.
(II) For Track B (Interleaved Image-Text Generation), ROUGE-L & Similarity will serve as the evaluation metrics. We use semantic similarity to evaluate how faithfully the generated image follows the instruction, and visual similarity to evaluate the content consistency of the generated image.
The overall score for each team will be the mean of these scores across all tasks in a track, giving a comprehensive measure of performance akin to the scoring of human examinations.
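As a reference, the snippet below sketches how the Track A metrics and the per-track mean could be reproduced with the open-source `rouge-score` package; the official evaluation script may differ in tokenization, normalization, and the exact similarity model used for Track B.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(reference: str, prediction: str) -> float:
    # ROUGE-L F1 between one reference text and one prediction.
    return scorer.score(reference, prediction)["rougeL"].fmeasure

def accuracy(references: list[str], predictions: list[str]) -> float:
    # Exact-match accuracy for multiple-choice / yes-no tasks.
    hits = sum(r.strip().lower() == p.strip().lower()
               for r, p in zip(references, predictions))
    return hits / len(references)

def overall_score(task_scores: list[float]) -> float:
    # A track's overall score is the mean over its per-task scores.
    return sum(task_scores) / len(task_scores)
```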
To participate in the Inova challenge, please first register by submitting the form.
Participants can send their submission files to inova_2025@outlook.com with the subject "Inova Challenge Submission-Track A/B". We will evaluate the submissions and announce the results on this website.
The submission file should keep the same structure as the Inova-Test dataset, with the response field filled with the predicted answer. Image folders are not required for submission.
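Concretely, preparing a submission can be as simple as loading each Inova-Test annotation file, filling in the `response` fields, and writing the file back without the image folders; the snippet below is an illustrative sketch with made-up file names and predictions.

```python
import json

# Fill the empty "response" fields of an Inova-Test annotation file.
# File names and the predictions mapping are illustrative.
with open("annotations.json", "r", encoding="utf-8") as f:
    task = json.load(f)

predictions = {"0": "yes", "1": "no"}  # sample_id -> predicted answer

for sample in task["annotations"]:
    sample["response"] = predictions.get(sample["sample_id"], "")

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(task, f, ensure_ascii=False, indent=2)
```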
[Important] Challenge overview papers: You may submit a 6-page paper (using the same template as the main paper track) summarizing the data, results, and main takeaways from the challenge. It will appear in the ICMEW (workshop) proceedings together with the challenge papers. Participants can send the challenge overview paper to inova_2025@outlook.com with the subject "Inova Challenge Paper Submission-Track A/B".
Registration Open: 2025-02-15
Training Data Release: 2025-03-10
Challenge Result Submission Deadline: 2025-04-20
Challenge Technical Paper Submission Deadline: 2025-04-30
Final Decisions: 2025-05-15
Camera Ready Submission Deadline: 2025-05-25
Dong Chen, Zhengzhou University, China
Fei Gao, Zhengzhou University, China
Zhengqing Hu, Zhengzhou University, China
Xiaojun Chang, University of Science and Technology of China, China
For any questions, please contact us at inova_2025@outlook.com.