- ✅ Release source training & inference code
- ✅ Provide training data and processing scripts
- ✅ Provide example inference for I2V and V2V
- ✅ V2V customized inference pipeline
- ⏳ I2V customized inference pipeline (Coming Soon!)
git clone --recursive https://github.com/wz0919/EPiC.git
cd EPiC
conda create -n epic python=3.10
conda activate epic
pip install -r requirements.txt
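Before moving on, you can optionally confirm that PyTorch was installed with CUDA support (a generic sanity check, not specific to this repo):

# Optional: print the PyTorch build and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"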
Download CogVideoX-5B-I2V (the base model), RAFT (to extract dense optical flow for masking source videos), Depth-Crafter (for video depth estimation), and Qwen2.5-VL-7B-Instruct (for generating detailed video captions) with the script
bash download/download_models.sh
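If you would rather fetch the checkpoints manually, they roughly correspond to the public Hugging Face repos below. This is a hedged sketch: the exact repo IDs and the ckpts/ directory layout are assumptions, so check download/download_models.sh for the authoritative paths.

# Illustrative manual downloads; verify repo IDs and target paths against download/download_models.sh
huggingface-cli download THUDM/CogVideoX-5b-I2V --local-dir ckpts/CogVideoX-5b-I2V
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ckpts/Qwen2.5-VL-7B-Instruct
huggingface-cli download tencent/DepthCrafter --local-dir ckpts/DepthCrafter
# RAFT weights are usually obtained from the princeton-vl/RAFT release rather than Hugging Face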
We provide processed sample test sets in data/test_v2v and data/test_i2v. You can try our pretrained model (in out/EPiC_pretrained) with
bash scripts/inference.sh test_v2v
and
bash scripts/inference.sh test_i2v
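If you need to pin the sample runs to particular GPUs, the standard CUDA environment variable works as usual (this is generic, not a repo-specific option, and only takes effect if the script does not override device selection internally):

# Run the V2V sample on GPU 0 only
CUDA_VISIBLE_DEVICES=0 bash scripts/inference.sh test_v2v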
Download the ~5K training videos from EPiC_Data with
cd data/train
wget https://huggingface.co/datasets/ZunWang/EPiC_Data/resolve/main/train.zip
unzip train.zip
(Optional) Download the pre-extracted VAE latents (you can also extract the latents yourself, which may take several hours):
wget https://huggingface.co/datasets/ZunWang/EPiC_Data/resolve/main/train_joint_latents.zip
unzip train_joint_latents.zip
Extract caption embeddings (please specify the GPU list in preprocess.sh first; see the note after these commands):
cd preprocess
bash preprocess.sh caption
(Optional) Extract VAE latents:
bash preprocess.sh latent
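If you prefer not to edit the script, a generic alternative is to restrict the visible GPUs from the shell. Note that preprocess.sh may manage its own GPU list internally, so check the script before relying on this.

# Illustrative only: expose four GPUs to the preprocessing steps
CUDA_VISIBLE_DEVICES=0,1,2,3 bash preprocess.sh caption
CUDA_VISIBLE_DEVICES=0,1,2,3 bash preprocess.sh latent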
After preprocessing, your data folder should look like:
data/
├── test_i2v/
├── test_v2v/
└── train/
├── caption_embs/
├── captions/
├── joint_latents/
├── masked_videos/
├── masks/
└── videos/
You can also prepare your own videos and captions. To do so, first organize them like train/videos and train/captions (an illustrative layout is sketched after these commands). Then run
bash preprocess.sh masking
to get the corresponding masked anchor videos from the estimated dense optical flow, and
bash preprocess.sh caption
bash preprocess.sh latent
to get the extracted textual embeddings and visual latents.
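As an illustrative example of the expected layout, each video is assumed to be paired with a caption file sharing the same stem; the file names and per-clip .txt format below are assumptions, so mirror whatever naming pattern the provided train/ data uses:

data/train/
├── videos/
│   ├── clip_0001.mp4
│   └── clip_0002.mp4
└── captions/
    ├── clip_0001.txt
    └── clip_0002.txt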
Edit the GPU configuration in scripts/train_with_latent.sh and set num_processes in training/accelerate_config_machine.yaml (see the example below), then run
bash scripts/train_with_latent.sh
You can stop training after 500 iterations, which takes less than 2 hours on 8x H100 GPUs.
(Alternatively, run bash scripts/train.sh for online latent encoding, which is much slower.)
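For example, on a single node with 8 GPUs, num_processes should equal the GPU count. Assuming the key sits at the top level of the YAML (as in a standard accelerate config), one quick way to set it is:

# Set one training process per GPU on an 8-GPU node (GNU sed syntax)
sed -i 's/^num_processes:.*/num_processes: 8/' training/accelerate_config_machine.yaml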
Run the example inference data processing script:
cd inference/v2v_data
bash get_anchor_videos.sh v2v_try
The processed data will be saved to data/v2v_try.
You can modify the camera pose type, operation mode, and other parameters to get anchor videos that follow your own trajectory; please refer to the configuration document for setup.
Then run inference with
bash scripts/inference.sh v2v_try
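Putting the pieces together: the positional argument is the dataset tag shared by anchor-video generation (which writes to data/&lt;tag&gt;) and inference, so a custom run with your own edited parameters looks like the following; the tag name v2v_mytraj is arbitrary.

cd inference/v2v_data
# edit camera/trajectory parameters in get_anchor_videos.sh first, then:
bash get_anchor_videos.sh v2v_mytraj   # processed data is written to data/v2v_mytraj
cd ../..
bash scripts/inference.sh v2v_mytraj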
Coming soon!
- This code mainly builds upon CogVideoX-ControlNet and AC3D
- This code uses the original CogVideoX model
- The v2v data processing pipeline largely builds upon TrajectoryCrafter
A non-exhaustive list of related works includes: CogVideoX, ViewCrafter, GCD, NVS-Solver, DimensionX, ReCapture, TrajAttention, GS-DiT, DaS, RecamMaster, TrajectoryCrafter, GEN3C, CAT4D, Uni3C, AC3D, RealCam-I2V, CamCtrl3D...
