EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Project Website arXiv

✅ To-Do Checklist

  • ✅ Release source training & inference code
  • ✅ Provide training data and processing scripts
  • ✅ Provide example inference for I2V and V2V
  • ✅ V2V customized inference pipeline
  • ⏳ I2V customized inference pipeline (Coming Soon!)

🚀 Setup

1. Clone EPiC

git clone --recursive https://github.com/wz0919/EPiC.git
cd EPiC

2. Set up the environment

conda create -n epic python=3.10
conda activate epic
pip install -r requirements.txt
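
As a quick optional sanity check (a generic check, not part of the repo's scripts), verify that PyTorch was installed with CUDA support:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"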

3. Downloading Pretrained Models

Download CogVideoX-5B-I2V (the base model), RAFT (to extract dense optical flow for masking source videos), DepthCrafter (for video depth estimation), and Qwen2.5-VL-7B-Instruct (to generate detailed video captions) with the script

bash download/download_models.sh
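
If the download script has trouble (e.g. network restrictions), the Hugging Face checkpoints can also be fetched manually with huggingface-cli. The Hub IDs below are the models' public repositories; the target directories are illustrative and should be adjusted to match the paths download/download_models.sh expects (the RAFT weights are obtained by the script as described above and are not covered by this sketch).

# Illustrative manual download; adjust --local-dir to the layout used by download/download_models.sh
huggingface-cli download THUDM/CogVideoX-5b-I2V --local-dir ckpts/CogVideoX-5b-I2V
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ckpts/Qwen2.5-VL-7B-Instruct
huggingface-cli download tencent/DepthCrafter --local-dir ckpts/DepthCrafter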

🎬 Demo Inference

We provide processed sample test sets in data/test_v2v and data/test_i2v. You can try our pretrained model (in out/EPiC_pretrained) by

bash scripts/inference.sh test_v2v

and

bash scripts/inference.sh test_i2v

🧠 Training

1. Downloading Training Data

Download the ~5K training videos from EPiC_Data by

cd data/train
wget https://huggingface.co/datasets/ZunWang/EPiC_Data/resolve/main/train.zip
unzip train.zip

(Optional) Download the pre-extracted VAE latents (you can also extract the latents yourself, which may take several hours) by

wget https://huggingface.co/datasets/ZunWang/EPiC_Data/resolve/main/train_joint_latents.zip
unzip train_joint_latents.zip

2. Preprocessing

Extract caption embeddings (please specify the GPU list in preprocess.sh)

cd preprocess
bash preprocess.sh caption
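
For example, to limit caption preprocessing to the first four GPUs you could restrict the visible devices (a generic PyTorch/CUDA mechanism; the GPU-list variable inside preprocess.sh itself may be named differently, so check the script):

CUDA_VISIBLE_DEVICES=0,1,2,3 bash preprocess.sh caption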

(Optional) Extract VAE latents

bash preprocess.sh latent

After preprocessing, your data folder should look like:

data/
├── test_i2v/
├── test_v2v/
└── train/
    ├── caption_embs/
    ├── captions/
    ├── joint_latents/
    ├── masked_videos/
    ├── masks/
    └── videos/

Custom Training Data (Optional)

You can prepare your own videos and captions. To do so, first arrange them like train/videos and train/captions (an example layout is sketched at the end of this subsection). Then run

bash preprocess.sh masking

to get the corresponding masked anchor videos from the estimated dense optical flow, and then

bash preprocess.sh caption
bash preprocess.sh latent

to extract the textual embeddings and visual latents.
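
For reference, a minimal custom-data layout might look like the sketch below. The filenames are illustrative; the only assumption is that each video in train/videos has a caption file with a matching basename in train/captions (check the provided train/captions for the exact caption format).

data/train/
├── videos/
│   ├── clip_0001.mp4
│   └── clip_0002.mp4
└── captions/
    ├── clip_0001.txt
    └── clip_0002.txt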

3. Training

Edit the GPU configs in scripts/train_with_latent.sh, set num_processes in training/accelerate_config_machine.yaml, then run

bash scripts/train_with_latent.sh
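
For reference, num_processes in the accelerate config should match the number of GPUs you train on; for an 8-GPU node the relevant field would be (an excerpt only, assuming the rest of the provided training/accelerate_config_machine.yaml stays unchanged):

# training/accelerate_config_machine.yaml (excerpt)
num_processes: 8  # one process per GPU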

You can stop training after 500 iterations, which takes less than 2 hours on 8x H100 GPUs.

(Alternatively, run bash scripts/train.sh for online latent encoding, which is much slower.)

🧪 Inference

1. V2V Inference

Run the example inference data processing script:

cd inference/v2v_data
bash get_anchor_videos.sh v2v_try

The processed data will be saved to data/v2v_try. You can modify the camera pose type, operation mode, and other parameters to get anchor videos that follow your own trajectory; please refer to the configuration document for setup. Then run inference with

bash scripts/inference.sh v2v_try

2. I2V Inference

Coming soon!

📚 Acknowledgements

🔗 Related Works

A non-exhaustive list of related works includes: CogVideoX, ViewCrafter, GCD, NVS-Solver, DimensionX, ReCapture, TrajAttention, GS-DiT, DaS, RecamMaster, TrajectoryCrafter, GEN3C, CAT4D, Uni3C, AC3D, RealCam-I2V, CamCtrl3D...
