DeepEyesV2: Toward Agentic Multimodal Model


Paper | SFT Dataset | RL Dataset | Checkpoints | Homepage

* Logo inspired by the oracle bone character for "eye".

🔥 Updates

  • [2025/11/14] Since cold-start training takes a relatively long time, we have uploaded the cold-start checkpoint. You can download it from here.

DeepEyesV2

Quote from https://openai.com/index/thinking-with-images/

They don’t just see an image, they can integrate visual information directly into the reasoning chain.

What's new?

  • We introduce DeepEyesV2, an agentic multimodal model that unifies code execution and web search within a single reasoning loop, enabling reliable and complex reasoning.

  • We construct a carefully curated training corpus through rigorous data filtering and cleaning to build both cold-start SFT data and RL data that complement each other.

  • Extensive experiments across real-world understanding, mathematical reasoning, and search-intensive benchmarks demonstrate the strong reasoning and tool-usage ability of DeepEyesV2.

  • We analyze the dynamics of tool-use behavior in DeepEyesV2, revealing task-adaptive patterns. In addition, we find that reinforcement learning enables more complex tool combinations and adaptive, context-aware tool invocation.

Quick Start

Data

Please refer to data for data preparation.

Cold Start

We use LLaMA-Factory for cold-start training.

Environment Setup

Please refer to LLaMA-Factory for installation details.

Cold Start Training

We use Qwen-2.5-VL-7B-Instruct as our foundation model. Qwen-2.5-VL-32B-Instruct is also supported.

bash ./cold_start/run_cold_start.sh
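
After cold-start training finishes (or after downloading the released cold-start checkpoint), a quick way to sanity-check the resulting weights is to load them with transformers and run a short generation. The snippet below is a minimal sketch that assumes the checkpoint is saved in the standard Hugging Face format; the local path and prompt are placeholders.

# Minimal smoke test for a cold-start checkpoint (assumes Hugging Face format).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "/path/to/cold_start_checkpoint"  # placeholder path
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(ckpt, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(ckpt)

messages = [{"role": "user", "content": [{"type": "text", "text": "Briefly describe the tools you can call."}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])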

Reinforcement Training

We use the same codebase as DeepEyes.

Environment Setup

cd reinforcement_learning
# Follow the VeRL official installation procedure
pip install -e .

# Additional dependencies required by DeepEyes
bash scripts/install_deepeyes.sh

Code Sandbox

DeepEyesV2 can write code, execute it in a sandbox, and perform subsequent reasoning based on the execution results. Similar to o3, DeepEyesV2's code follows the Jupyter style. We receive DeepEyesV2's code via a server and return the execution results.

You can follow this GitHub repo to deploy the code server. To guarantee code execution safety, we recommend running the code sandbox inside Docker. The Docker image for the code sandbox can be found in this Docker Hub repo.

During RL training, transmitting a large number of high-resolution images to a single node in a short period may saturate the bandwidth, leading to retransmissions and timeouts. Therefore, we recommend deploying multiple code servers to distribute the network load and running them on high-performance machines to ensure fast responses.

In our setup, we deployed multiple code server instances on each GPU node, and each training process only sends requests to the code server on its own machine (localhost). With this approach, we observed no timeouts caused by network, CPU, or other hardware issues during training.
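
To verify a deployed sandbox before starting RL training, you can send it a small request from the local node. The snippet below is only illustrative: the endpoint path, port, and payload fields are assumptions, so adapt them to the actual API of the code server you deploy.

# Illustrative sanity check for a locally deployed code server.
# The endpoint path ("/run_code") and payload fields are assumptions;
# check the sandbox repo's documentation for the real API.
import requests

SANDBOX_URL = "http://127.0.0.1:8080/run_code"  # hypothetical localhost server

payload = {
    "code": "import numpy as np\nprint(np.arange(4).sum())",  # Jupyter-style cell
    "timeout": 30,
}

resp = requests.post(SANDBOX_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # execution result that would be fed back into the reasoning loop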

Online Search API

DeepEyesV2 can acquire external knowledge through search. In our implementation, we use online search instead of RAG. For image search, we use the cache of MMSearch-R1; you can download the corresponding cache data from here. For text search, we use an online search service; you can plug in your own search API by modifying reinforcement_learning/verl/workers/agent/envs/deepeyesv2/search_utils.py.
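
If you swap in your own text search backend, the adapter typically reduces to a single function that takes a query string and returns a small list of snippets. The sketch below is hypothetical: the function name, environment variables, and response fields are assumptions, meant only to show the shape of the wrapper you would adapt inside search_utils.py.

# Hypothetical text-search adapter. The function name, env vars, and response
# schema are assumptions -- align them with the interface expected by
# reinforcement_learning/verl/workers/agent/envs/deepeyesv2/search_utils.py.
import os
import requests

def text_search(query: str, top_k: int = 5) -> list[dict]:
    """Query an external search API and return title/snippet/url dicts."""
    resp = requests.post(
        os.environ["SEARCH_API_URL"],  # your own search service endpoint
        json={"query": query, "num_results": top_k},
        headers={"Authorization": f"Bearer {os.environ.get('SEARCH_API_KEY', '')}"},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return [
        {"title": r.get("title", ""), "snippet": r.get("snippet", ""), "url": r.get("url", "")}
        for r in results[:top_k]
    ]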

Judge Server

We use Qwen for LLM-as-a-judge verification. Please follow DeepEyes for instructions on deploying the judge server.

# download the Qwen-2.5-72B-Instruct model
# Other models, such as Qwen3-8B, also work.
huggingface-cli download --resume-download Qwen/Qwen2.5-72B-Instruct --local-dir /path/to/your/local/filedir --local-dir-use-symlinks False

# start vllm serving
vllm serve /path/to/your/local/filedir \
    --port 18901 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --served-model-name "judge" \
    --trust-remote-code \
    --disable-log-requests
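
Once the judge server is running, you can verify it through vLLM's OpenAI-compatible API, for example with the Python client below. The prompt is only a placeholder; the actual judging prompts are issued by the training code.

# Quick check that the vLLM judge server answers requests via the
# OpenAI-compatible endpoint exposed by `vllm serve` above.
from openai import OpenAI

client = OpenAI(base_url="http://your.vllm.machine.ip:18901/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="judge",  # matches --served-model-name
    messages=[{"role": "user", "content": "Reference answer: 42. Model answer: forty-two. Equivalent? Answer yes or no."}],
    max_tokens=16,
    temperature=0.0,
)
print(resp.choices[0].message.content)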

Reinforcement Learning Optimization

We recommend using at least 32 GPUs (4 nodes x 8 GPUs) for 7B training, and at least 64 GPUs (8 nodes x 8 GPUs) for 32B training. For each node, we recommend at least 1200 GB of CPU RAM, as high-resolution images can consume a large amount of memory.

Build a Ray cluster across all training nodes and prepare the data before starting training. Then use the following script to start training.

cd reinforcement_learning
# your wandb access key here...
wandb login

# the IP and port for your Qwen-2.5-72B-Instruct vllm serving
export LLM_AS_A_JUDGE_BASE="http://your.vllm.machine.ip:18901/v1"

# config for 7B
bash examples/deepeyesv2/run_qwen2_5_vl-7b_final_allin.sh

The training scripts use both wandb and RL Logging Board (great work) to visualize the training dynamics.

Evaluation

Please see evaluation for more details.

Star Chart

Star History Chart

Licence

This project is released under the Apache licence.

Citation

# DeepEyesV2
@article{hong2025deepeyesv2,
  title={DeepEyesV2: Toward Agentic Multimodal Model},
  author={Hong, Jack and Zhao, Chenxiao and Zhu, ChengLin and Lu, Weiheng and Xu, Guohai and Yu, Xing},
  journal={arXiv preprint arXiv:2511.05271},
  year={2025}
}
# DeepEyes
@article{zheng2025deepeyes,
  title={DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning},
  author={Zheng, Ziwei and Yang, Michael and Hong, Jack and Zhao, Chenxiao and Xu, Guohai and Yang, Le and Shen, Chao and Yu, Xing},
  journal={arXiv preprint arXiv:2505.14362},
  year={2025}
}