Wenzheng Zeng¹, Difei Gao¹, Mike Zheng Shou¹, Hwee Tou Ng¹
¹National University of Singapore
This repository contains the official implementation of the ICCV 2025 paper "Factorized Learning for Temporally Grounded Video-Language Models".
- Model: We propose a new framework $D^2\mathrm{VLM}$, which decomposes the generation objective into a "grounding then answering with evidence referencing" paradigm and introduces evidence tokens to emphasize explicit event-level visual semantic capture.
- Training Algorithm: We introduce Factorized Preference Optimization (FPO), which explicitly addresses both temporal grounding and textual response; a minimal illustrative sketch is given after this list. A factorized data synthesis approach is also designed to support FPO.
- Performance: Our method consistently outperforms SOTA methods across various tasks.
- Open Source: We release the source code and model weights to the community.
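As a rough illustration of the FPO idea, the sketch below writes a DPO-style preference objective factorized into a temporal grounding term and a textual response term. This is a minimal sketch based only on the description above: the function name `fpo_loss`, its argument layout, and the weighting hyperparameters are illustrative assumptions, not the repository's actual training API.

```python
# Illustrative sketch of a factorized, DPO-style preference objective
# (NOT the repository's actual implementation). Each *_logps_* tensor is
# assumed to hold per-sample summed token log-probabilities of the preferred
# (_w) or dispreferred (_l) response under the policy or the frozen reference
# model, already split into a grounding part and a textual answer part.
import torch.nn.functional as F


def fpo_loss(policy_ground_logps_w, policy_ground_logps_l,
             policy_text_logps_w, policy_text_logps_l,
             ref_ground_logps_w, ref_ground_logps_l,
             ref_text_logps_w, ref_text_logps_l,
             beta=0.1, lambda_ground=1.0, lambda_text=1.0):
    # Preference margin on the temporal grounding part.
    ground_logits = (policy_ground_logps_w - ref_ground_logps_w) \
                  - (policy_ground_logps_l - ref_ground_logps_l)
    # Preference margin on the textual response part.
    text_logits = (policy_text_logps_w - ref_text_logps_w) \
                - (policy_text_logps_l - ref_text_logps_l)
    # Combine the two factorized preference terms.
    ground_term = -F.logsigmoid(beta * ground_logits).mean()
    text_term = -F.logsigmoid(beta * text_logits).mean()
    return lambda_ground * ground_term + lambda_text * text_term
```

Under such a factorization, preference pairs that differ only in temporal grounding and pairs that differ only in the textual answer can supervise the two terms separately, which is presumably what the factorized data synthesis approach is designed to provide.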
- [2025-10] Code and model weights are released!
- [2025-06] Our work is accepted to ICCV 2025!
We use the following environment settings. If you run into problems during automatic installation, you may install these packages manually.
- CUDA 11.8
- Python 3.12.2
- PyTorch 2.4.0
- Transformers 4.44.2
- DeepSpeed 0.14.5
- NNCore 0.4.5
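After completing the installation steps below, you can verify that the core packages match these versions with a quick check like the following (the version comments simply restate the list above):

```python
# Quick version check for the core packages listed above.
import torch
import transformers
import deepspeed

print("PyTorch:", torch.__version__)               # expected 2.4.0
print("CUDA (build):", torch.version.cuda)         # expected 11.8
print("CUDA available:", torch.cuda.is_available())
print("Transformers:", transformers.__version__)   # expected 4.44.2
print("DeepSpeed:", deepspeed.__version__)          # expected 0.14.5
```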
- Clone the repository from GitHub.
git clone https://github.com/nusnlp/d2vlm.git
cd d2vlm
- Initialize the conda environment.
conda create -n d2vlm python=3.12 -y
conda activate d2vlm
- Install dependencies.
pip install -r requirements.txt
Please refer to the Dataset page.
- You can download the pre-trained model here.
- Run the following commands. Remember to change the absolute paths within each .sh file.
bash scripts/inference.sh
# or refer to the Inference and Evaluation part of scripts/train_inference_eval.sh
bash all_benchmark_eval/charades/inference.sh
bash all_benchmark_eval/youcook2/inference.sh
- Download the pretrained model from here (stage-2 model of E.T. Chat).
- Check and run the following command (modify the relevant paths).
bash scripts/train_inference_eval.sh
If you find our work useful in your research, please consider citing our paper:
@inproceedings{d2vlm,
title={Factorized Learning for Temporally Grounded Video-Language Models},
author={Zeng, Wenzheng and Gao, Difei and Shou, Mike Zheng and Ng, Hwee Tou},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025},
pages={20683-20693}
}
This project was built upon E.T. Bench, TimeChat, and AMP. We thank the authors for their solid contributions to the community!
