- ✅ Release source training & inference code
- ✅ Provide training data and processing scripts
- ✅ Provide example inference for I2V and V2V
- ✅ V2V customized inference pipeline
- ⏳ I2V customized inference pipeline (Coming Soon!)
git clone --recursive https://github.com/wz0919/EPiC.git
cd EPiC
conda create -n epic python=3.10
conda activate epic
pip install -r requirements.txt
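Before moving on, you can optionally confirm that PyTorch was installed with CUDA support (a generic sanity check, not specific to this repo):

# Optional: print the PyTorch build and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"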
Download CogVideoX-5B-I2V (the base model), RAFT (to extract dense optical flow for masking source videos), Depth-Crafter (for video depth estimation), and Qwen2.5-VL-7B-Instruct (for generating detailed video captions) with the script
bash download/download_models.sh
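If you would rather fetch the checkpoints manually, they roughly correspond to the public Hugging Face repos below. This is a hedged sketch: the exact repo IDs and the ckpts/ directory layout are assumptions, so check download/download_models.sh for the authoritative paths.

# Illustrative manual downloads; verify repo IDs and target paths against download/download_models.sh
huggingface-cli download THUDM/CogVideoX-5b-I2V --local-dir ckpts/CogVideoX-5b-I2V
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ckpts/Qwen2.5-VL-7B-Instruct
huggingface-cli download tencent/DepthCrafter --local-dir ckpts/DepthCrafter
# RAFT weights are usually obtained from the princeton-vl/RAFT release rather than Hugging Face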
We provide processed sample test sets in data/test_v2v and data/test_i2v. You can try our pretrained model (in out/EPiC_pretrained) with
bash scripts/inference.sh test_v2v
and
bash scripts/inference.sh test_i2v
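If you need to pin the sample runs to particular GPUs, the standard CUDA environment variable works as usual (this is generic, not a repo-specific option, and only takes effect if the script does not override device selection internally):

# Run the V2V sample on GPU 0 only
CUDA_VISIBLE_DEVICES=0 bash scripts/inference.sh test_v2v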
Download the ~5K training videos from EPiC_Data with
cd data/train
wget https://huggingface.co/datasets/ZunWang/EPiC_Data/resolve/main/train.zip
unzip train.zip
(Optional) Download the pre-extracted VAE latents (you can also extract the latents yourself, which may take several hours):
wget https://huggingface.co/datasets/ZunWang/EPiC_Data/resolve/main/train_joint_latents.zip
unzip train_joint_latents.zip
Extract caption embeddings (please specify the GPU list in preprocess.sh first; see the note after these commands):
cd preprocess
bash preprocess.sh caption
(Optional) Extract VAE latents:
bash preprocess.sh latent
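If you prefer not to edit the script, a generic alternative is to restrict the visible GPUs from the shell. Note that preprocess.sh may manage its own GPU list internally, so check the script before relying on this.

# Illustrative only: expose four GPUs to the preprocessing steps
CUDA_VISIBLE_DEVICES=0,1,2,3 bash preprocess.sh caption
CUDA_VISIBLE_DEVICES=0,1,2,3 bash preprocess.sh latent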
After preprocessing, your data folder should look like:
data/
├── test_i2v/
├── test_v2v/
└── train/
├── caption_embs/
├── captions/
├── joint_latents/
├── masked_videos/
├── masks/
└── videos/
You can also prepare your own videos and captions. To do so, first organize them like train/videos and train/captions (an illustrative layout is sketched after these commands). Then run
bash preprocess.sh masking
to get the corresponding masked anchor videos from the estimated dense optical flow, and
bash preprocess.sh caption
bash preprocess.sh latent
to get the extracted textual embeddings and visual latents.
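As an illustrative example of the expected layout, each video is assumed to be paired with a caption file sharing the same stem; the file names and per-clip .txt format below are assumptions, so mirror whatever naming pattern the provided train/ data uses:

data/train/
├── videos/
│   ├── clip_0001.mp4
│   └── clip_0002.mp4
└── captions/
    ├── clip_0001.txt
    └── clip_0002.txt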
Edit the GPU configuration in scripts/train_with_latent.sh and set num_processes in training/accelerate_config_machine.yaml (see the example below), then run
bash scripts/train_with_latent.sh
You can stop training after 500 iterations, which takes less than 2 hours on 8x H100 GPUs.
(Alternatively, run bash scripts/train.sh for online latent encoding, which is much slower.)
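For example, on a single node with 8 GPUs, num_processes should equal the GPU count. Assuming the key sits at the top level of the YAML (as in a standard accelerate config), one quick way to set it is:

# Set one training process per GPU on an 8-GPU node (GNU sed syntax)
sed -i 's/^num_processes:.*/num_processes: 8/' training/accelerate_config_machine.yaml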
Run the example inference data processing script:
cd inference/v2v_data
bash get_anchor_videos.sh v2v_try
The processed data will be saved to data/v2v_try.
You can modify the camera pose type, operation mode, and other parameters to get anchor videos that follow your own trajectory; please refer to the configuration document for setup.
Then run inference with
bash scripts/inference.sh v2v_try
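Putting the pieces together: the positional argument is the dataset tag shared by anchor-video generation (which writes to data/&lt;tag&gt;) and inference, so a custom run with your own edited parameters looks like the following; the tag name v2v_mytraj is arbitrary.

cd inference/v2v_data
# edit camera/trajectory parameters in get_anchor_videos.sh first, then:
bash get_anchor_videos.sh v2v_mytraj   # processed data is written to data/v2v_mytraj
cd ../..
bash scripts/inference.sh v2v_mytraj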
Coming soon!
- This code mainly builds upon CogVideoX-ControlNet and AC3D
- This code uses the original CogVideoX model
- The v2v data processing pipeline largely builds upon TrajectoryCrafter
A non-exhaustive list of related works includes: CogVideoX, ViewCrafter, GCD, NVS-Solver, DimensionX, ReCapture, TrajAttention, GS-DiT, DaS, RecamMaster, TrajectoryCrafter, GEN3C, CAT4D, Uni3C, AC3D, RealCam-I2V, CamCtrl3D...
