Zhen Yang1 · Mingyang Zhang5 · Feng Chen3 · Ganggui Ding4 · Liang Hou2 · Xin Tao2 · Pengfei Wan2 · Ying-Cong Chen1,6

1HKUST(GZ) · 2Kuaishou Technology · 3AIML · 4ZJU · 5Ant Group · 6HKUST
- Create the environment and install the dependencies by running:
conda create -n MTI python=3.10
conda activate MTI
pip install vllm==0.10.2
pip install accelerate==1.10.1
pip install transformers==4.56.1
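To verify that the pinned versions were installed correctly, a quick optional check:

```python
# Optional sanity check: confirm the pinned package versions are installed.
import vllm
import transformers
import accelerate

print("vllm:", vllm.__version__)                  # expect 0.10.2
print("transformers:", transformers.__version__)  # expect 4.56.1
print("accelerate:", accelerate.__version__)      # expect 1.10.1
```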
- Run offline with vLLM (a minimal usage sketch follows the command):
python run_vllm_offline.py
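For orientation, the sketch below shows what a plain offline vLLM run looks like. It is an assumed shape, not the actual contents of run_vllm_offline.py (which additionally applies the MTI patches); the prompt and sampling settings are illustrative:

```python
# Minimal offline-inference sketch following the standard vLLM API.
# NOT run_vllm_offline.py itself; the real script also applies the MTI patches.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen3/Qwen3-8B")  # same model path as the online command below
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["Solve step by step: 12 * 7 + 5 = ?"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```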
- Run online with vLLM (an example client request follows the command):
python run_vllm_online.py \
--model Qwen3/Qwen3-8B \
--tokenizer Qwen3/Qwen3-8B \
--host 0.0.0.0 \
--port 6666 \
--api-key yzisallyouneed
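The flags above match vLLM's OpenAI-compatible server, so (assuming run_vllm_online.py wraps that server, as the flags suggest) the endpoint can be queried with the standard openai client. The host, port, API key, and model name below mirror the launch command:

```python
# Query the OpenAI-compatible endpoint started by the command above.
# Host, port, api_key, and model name mirror the launch flags.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:6666/v1", api_key="yzisallyouneed")
response = client.chat.completions.create(
    model="Qwen3/Qwen3-8B",
    messages=[{"role": "user", "content": "Briefly explain test-time scaling."}],
)
print(response.choices[0].message.content)
```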
- Run with Hugging Face (a minimal generation sketch follows the note below):
python run_hf.py
The Hugging Face version is intended for learning purposes only, since it does not provide inference acceleration; we recommend using the vLLM version for evaluation.
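For reference, a plain Hugging Face generation loop looks like the sketch below. It omits all MTI-specific logic in run_hf.py and is only meant to show the baseline pattern:

```python
# Plain Hugging Face generation sketch (no MTI logic; baseline pattern only).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen3/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen3/Qwen3-8B", torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("Solve step by step: 12 * 7 + 5 = ?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```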
Since the monkey patch may introduce unknown bugs, we recommend that, for evaluation, you directly replace vLLM's GPUModelRunner.execute_model and FlashAttentionImpl.forward in the conda environment's installed vLLM package with our execute_model and forward implementations.
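To find the files that define these two methods inside the environment, you can print the module paths. The module locations below are our assumption for vLLM 0.10.x with the v1 engine and may move between releases:

```python
# Locate the source files that define the two methods to replace.
# Module paths assume vLLM 0.10.x (v1 engine); they may differ by version.
import vllm.v1.worker.gpu_model_runner as gpu_model_runner
import vllm.v1.attention.backends.flash_attn as flash_attn

print(gpu_model_runner.__file__)  # defines GPUModelRunner.execute_model
print(flash_attn.__file__)        # defines FlashAttentionImpl.forward
```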
- SGLang version
- Support for more models
- Integration with OpenCompass
- MTI on VLM / VLA / dLLM
Many thanks to Lequan Lin for the generous help.
@article{yang2025less,
  title={Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention},
  author={Yang, Zhen and Zhang, Mingyang and Chen, Feng and Ding, Ganggui and Hou, Liang and Tao, Xin and Wan, Pengfei and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2510.13940},
  year={2025}
}

