The growing demand for analyzing complex, interrelated scenes makes multi-image understanding an increasingly important research topic.
Compared with understanding individual images, multi-image inference requires reasoning over both the alignments and the differences between images.
However, existing benchmarks struggle to address both aspects simultaneously, which hinders modeling of inter-image relationships across different granularities and domains.
In this paper, we introduce a benchmark called $M^4$Bench for multi-image understanding across various granularities and domains.
- [2025.8.11] 🔥 $M^4$Bench is supported by open-compass/VLMEvalKit in PR #1163.
- [2025.4.29] 🔥 $M^4$Bench has been accepted to IJCAI 2025!
- [2025.4.9] 🔥 We release the evaluation code and outputs of $M^4$Bench.
- [2025.2.10] 🔥 We release $M^4$Bench on Hugging Face.
We provide two versions of the dataset on Hugging Face:
- the current path-based naming scheme;
- a version with neutral image filenames (e.g., "img1_1", "img1_2") stored in a single folder.
We utilize our automated dataset construction pipeline to expand the scale of $M^4$Bench.
> [!TIP]
> Since different MLLMs may require different versions of `transformers` and other dependencies, we recommend creating a separate virtual environment for each model series (e.g., the Qwen series) to avoid dependency conflicts.
```bash
conda create -n m4bench python=3.10 -y
conda activate m4bench

# git clone this repo
cd M4Bench
pip install -r requirements.txt
```

Our evaluation code supports API models (such as the GPT series, or any other model accessible via the OpenAI API) as a judge. The only thing you need to do is set the environment variable `OPENAI_API_KEY` as shown below.
> [!IMPORTANT]
> If you don't set this variable, the evaluation will default to using exact match as the assessment method.

```bash
export OPENAI_API_KEY=your_api_key
```
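For intuition, an API-model judge of this kind simply asks the model whether a prediction matches the ground truth. The snippet below is a minimal sketch of that idea using the official `openai` Python client; it is an illustration only, not the exact prompt or logic in our evaluation code, and the model name `gpt-4o-mini` is an arbitrary choice.

```python
# Minimal sketch of an API-model judge (illustrative only; not the exact
# implementation used by M4Bench). Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, prediction: str, ground_truth: str) -> bool:
    """Ask an API model whether the prediction matches the ground truth."""
    prompt = (
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Model prediction: {prediction}\n"
        "Does the prediction match the ground truth? Answer 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any OpenAI-compatible chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```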
You can download the models and datasets from Hugging Face and put them in the `llm` and `dataset` folders, respectively.

```bash
mkdir llm      # for example, llm/Qwen/Qwen2-VL-2B-Instruct
mkdir dataset
```
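If you prefer to script the downloads, the sketch below uses `huggingface_hub.snapshot_download`. The model repository id is a real one from the table below, but `ORG/M4Bench` is a placeholder — substitute the actual dataset repository id from our Hugging Face page.

```python
# Sketch: fetch a model and the benchmark data with huggingface_hub.
from huggingface_hub import snapshot_download

# Example model, stored under llm/
snapshot_download(
    repo_id="Qwen/Qwen2-VL-2B-Instruct",
    local_dir="llm/Qwen/Qwen2-VL-2B-Instruct",
)

# Benchmark data, stored under dataset/
snapshot_download(
    repo_id="ORG/M4Bench",        # placeholder: use the actual dataset repo id
    repo_type="dataset",
    local_dir="dataset/M4Bench",
)
```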
We have implemented 10 open-source models in our benchmark, as shown in the table below.

| Model Name | Supports Multiple Images | Supports Grounding | Model Name | Supports Multiple Images | Supports Grounding |
|---|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | ✅ | ✅ | Qwen2-VL-7B-Instruct | ✅ | ✅ |
| InternVL2-4B | ✅ | ✅ | InternVL2-8B | ✅ | ✅ |
| InternVL2.5-4B | ✅ | ✅ | InternVL2.5-8B | ✅ | ✅ |
| deepseek-vl2-tiny | ✅ | ✅ | deepseek-vl2-small | ✅ | ✅ |
| MiniCPM-V-2_6 | ✅ | ❌ | llava-onevision | ✅ | ❌ |
> [!IMPORTANT]
> If you have downloaded the models locally, you need to modify the `model_mapping` variable in `main.py` so that it points to the correct paths.
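As a rough illustration (not the shipped contents of `main.py`, and assuming `model_mapping` maps model names to local checkpoint paths), an edited mapping might look like this:

```python
# Sketch of model_mapping entries in main.py (illustrative paths only;
# point them to wherever you placed the checkpoints under llm/).
model_mapping = {
    "Qwen2-VL-2B-Instruct": "llm/Qwen/Qwen2-VL-2B-Instruct",
    "Qwen2-VL-7B-Instruct": "llm/Qwen/Qwen2-VL-7B-Instruct",
    "InternVL2-8B": "llm/OpenGVLab/InternVL2-8B",
    # ...one entry per supported model
}
```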
Now you can start the evaluation process by running the command below, where:

- `model_name`: chosen from the `model_mapping` variable in `main.py`
- `task_list`: chosen from the `task_mapping` variable in `main.py`, separated by commas
```bash
# Run a model with multiple tasks in parallel
python main.py \
    --model_name Qwen2-VL-7B-Instruct \
    --task_list Object_States,State_Invariance
```

If you want to add a new model to be evaluated on our benchmark, you can follow these steps:
- Please refer to `utils.models.deepseekvl2.py` and implement your model, with the key functions `__init__`, `parse_input`, and `infer`, in a new Python file in the `utils.models` directory (a minimal skeleton is sketched after this list).
- Please modify `utils.models.automodel.py` and add your model to the `from_pretrained` function according to the model type in its config.
- Please modify `main.py` and add your model to the `model_mapping` variable.
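As referenced in the first step, below is a hedged skeleton of such a model wrapper. The method names `__init__`, `parse_input`, and `infer` come from the instructions above, but the exact signatures, return formats, and loading code are assumptions; use `utils.models.deepseekvl2.py` as the authoritative reference.

```python
# utils/models/your_model.py -- hypothetical file name.
# Skeleton only: signatures and return formats are assumptions, not the
# repository's actual interface; mirror utils.models.deepseekvl2.py instead.
from transformers import AutoModelForCausalLM, AutoProcessor


class YourModel:
    def __init__(self, model_path: str, device: str = "cuda"):
        # Load the checkpoint and its processor from the local path.
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, trust_remote_code=True
        ).to(device).eval()
        self.device = device

    def parse_input(self, question: str, image_paths: list[str]):
        # Convert a benchmark sample (text plus multiple images) into the
        # model-specific input format expected by self.model.
        ...

    def infer(self, question: str, image_paths: list[str]) -> str:
        # Run generation on the parsed inputs and return the answer string.
        inputs = self.parse_input(question, image_paths)
        ...
```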
Now you can enjoy the evaluation simply by running the following command:

```bash
python main.py \
    --model_name Your_Model_Name \
    --task_list TaskName
```

$M^4$Bench is also supported by [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), so you can alternatively run the evaluation there:

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
# Example
python3 run.py --data Instance_Comparison --model QwenVLPlus --verbose
python3 run.py --data M4Bench --model QwenVLPlus --verbose
```

Please refer to our `outputs` folder for more details.
Overall results on the test set are shown below.

| Model | Test Score |
|---|---|
| 🏅 DeepSeek-VL2-small | 51.3 |
| 🥈 Qwen2VL-7B | 46.7 |
| 🥉 Qwen-VL-Max | 46.3 |
| GPT-4o | 39.5 |
| Qwen2VL-2B | 38.4 |
| InternVL2-8B | 37.8 |
| InternVL2.5-8B | 37.2 |
| DeepSeek-VL2-tiny | 36.9 |
| Gemini 1.5 Pro | 36.8 |
| InternVL2.5-4B | 34.2 |
| InternVL2-4B | 28.7 |
| LLaVA-OneVision | 18.6 |
| MiniCPM-V2.6-8B | 17.0 |




