A multi-view stereo depth estimation model which works anywhere, in any scene, with any range of depths
MVSAnywhere: Zero Shot Multi-View Stereo
Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow and Jamie Watson.
dron_mtb_both.mp4
This code is for non-commercial use; please see the license file for terms. If you do find any part of this codebase helpful, please cite our paper using the BibTeX below and link this repo. Thanks!
- MVSAnywhere: Zero Shot Multi-View Stereo
- Table of Contents
- Setup
- Pretrained Models
- Running out of the box!
- Running on recordings from your own device!
- Running Gaussian splatting with MVSAnywhere regularisation!
- Testing and Evaluation
- Training
- Notation for Transformation Matrices
- World Coordinate System
- Acknowledgements
- BibTeX
- License
We are going to create a new Mamba environment called mvsanywhere. If you don't have Mamba, you can install it with:
make install-mamba
Then create and activate the environment:
make create-mamba-env
mamba activate mvsanywhere
In the code directory, install the repo as a pip package:
pip install -e .
To use our Gaussian splatting regularisation, also install that module:
pip install -e src/regsplatfacto/
We provide two variants of our model: mvsanywhere_hero.ckpt and mvsanywhere_dot.ckpt. mvsanywhere_hero is "Ours" from the main paper, and mvsanywhere_dot is our model without the metadata MLP.
We've now included two scans for people to try out immediately with the code. You can download these scans from here.
Steps:
- Download weights for the hero_model into the weights directory.
- Download the scans and unzip them to a directory of your choosing.
- You should be able to run it! Something like this will work (set --scan_name to house or living_room):
CUDA_VISIBLE_DEVICES=0 python src/mvsanywhere/run_demo.py \
--name mvsanywhere \
--output_base_path OUTPUT_PATH \
--config_file configs/models/mvsanywhere_model.yaml \
--load_weights_from_checkpoint weights/mvsanywhere_hero.ckpt \
--data_config_file configs/data/vdr/vdr_dense.yaml \
--scan_parent_directory /path/to/vdr/ \
--scan_name house \
--num_workers 8 \
--batch_size 2 \
--fast_cost_volume \
--run_fusion \
--depth_fuser custom_open3d \
--fuse_color \
--fusion_max_depth 3.5 \
--fusion_resolution 0.02 \
--extended_neg_truncation \
--dump_depth_visualization
This will output meshes, quick depth visualizations, and scores benchmarked against LiDAR depth, all under OUTPUT_PATH.
If you run out of GPU memory, you can try removing the --fast_cost_volume flag.
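If you want a quick look at the fused mesh, you can open it with Open3D. This is just a sketch: the exact mesh filename and directory layout under OUTPUT_PATH depend on your run settings, so treat the path below as a placeholder.

```python
# Minimal sketch: view a fused mesh produced by the demo with Open3D.
# The path is a placeholder; check your OUTPUT_PATH for the actual mesh file.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("OUTPUT_PATH/meshes/house.ply")  # placeholder path
mesh.compute_vertex_normals()  # enables shaded rendering
o3d.visualization.draw_geometries([mesh])
```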
How to use NeRF Capture to record videos
-
Download the NeRF Capture app from the App Store. Capture a recording of your favourite environment and save it.
-
Place your recordings in a directory with the following structure:
/path/to/recordings/
├── recording_0/
│   ├── images/
│   │   ├── image_0.png
│   │   ├── image_1.png
│   │   └── ...
│   └── transforms.json
└── recording_1/
    └── ...
- And run the model!
python src/mvsanywhere/run_demo.py \
--name mvsanywhere \
--output_base_path OUTPUT_PATH \
--config_file configs/models/mvsanywhere_model.yaml \
--load_weights_from_checkpoint weights/mvsanywhere_hero.ckpt \
--data_config_file configs/data/nerfstudio/nerfstudio_empty.yaml \
--scan_parent_directory /path/to/recordings/ \
--scan_name recording_0 \
--fast_cost_volume \
--num_workers 8 \
--batch_size 2 \
--image_height 480 \
--image_width 640 \
--dump_depth_visualization \
--rotate_images # Only if you recorded in portrait

Use ARCorder to get a video on Android with camera poses
-
Download the ARCorder app from releases. This very simple app relies on Android's ARCore system, so the accuracy of the computed poses might be limited. Capture a recording of your favourite environment and save it.
-
Place your recordings in a directory with the following structure:
/path/to/recordings/
├── recording_0/
│   ├── images/
│   │   ├── image_0.png
│   │   ├── image_1.png
│   │   └── ...
│   └── transforms.json
└── recording_1/
    └── ...
- And run the model!
python src/mvsanywhere/run_demo.py \
--name mvsanywhere \
--output_base_path OUTPUT_PATH \
--config_file configs/models/mvsanywhere_model.yaml \
--load_weights_from_checkpoint weights/mvsanywhere_hero.ckpt \
--data_config_file configs/data/nerfstudio/nerfstudio_empty.yaml \
--scan_parent_directory /path/to/recordings/ \
--scan_name recording_0 \
--fast_cost_volume \
--num_workers 8 \
--batch_size 2 \
--image_height 480 \
--image_width 640 \
--dump_depth_visualization \
--rotate_images # Only if you recorded in portrait

Use COLMAP to obtain a sparse reconstruction
If you already have a COLMAP reconstruction, skip to step 4.
- Install nerfstudio
- Install COLMAP using conda install -c conda-forge colmap.
- Process your video/sequence using ns-process-data {images, video} --data {DATA_PATH} --output-dir {PROCESSED_DATA_DIR}
- Your reconstructions should have the following structure:
/path/to/reconstruction/
├── reconstruction_0/
│   ├── images/
│   │   ├── image_0.png
│   │   ├── image_1.png
│   │   └── ...
│   └── colmap/
│       ├── database.db
│       └── sparse/
│           ├── 0/
│           │   ├── cameras.bin
│           │   ├── images.bin
│           │   └── ...
│           └── 1/
│               └── ...
└── reconstruction_1/
    └── ...
- And run the model! Set --scan_name to reconstruction_name:n, where n is the index of the COLMAP sparse model.
python src/mvsanywhere/run_demo.py \
--name mvsanywhere \
--output_base_path OUTPUT_PATH \
--config_file configs/models/mvsanywhere_model.yaml \
--load_weights_from_checkpoint weights/mvsanywhere_hero.ckpt \
--data_config_file configs/data/colmap/colmap_empty.yaml \
--scan_parent_directory /path/to/reconstruction \
--scan_name reconstruction_0:0 \
--fast_cost_volume \
--num_workers 8 \
--batch_size 2 \
--image_height 480 \
--image_width 640 \
--dump_depth_visualization
splats_reg.mp4
We release regsplatfacto, code for running Gaussian splatting with MVSAnywhere depths as regularisation. This is heavily inspired by techniques such as DN-Splatter and VCR-Gauss.
You can use any data in the nerfstudio format - e.g. existing nerfstudio data, or data from the 3 sources listed above.
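For reference, a minimal transforms.json looks roughly like the sketch below. The field names follow the usual nerfstudio conventions, but the values here are placeholders; your capture tool will fill in the real intrinsics and poses.

```python
# Sketch of a minimal nerfstudio-style transforms.json (placeholder values).
import json

transforms = {
    "camera_model": "OPENCV",
    "fl_x": 500.0, "fl_y": 500.0,  # focal lengths in pixels
    "cx": 320.0, "cy": 240.0,      # principal point
    "w": 640, "h": 480,            # image resolution
    "frames": [
        {
            "file_path": "images/image_0.png",
            # 4x4 camera-to-world pose matrix
            "transform_matrix": [
                [1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0],
            ],
        }
    ],
}

with open("transforms.json", "w") as f:
    json.dump(transforms, f, indent=2)
```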
If you are using data which has camera distortion, you will need to run our script scripts/data_scripts/undistort_nerfstudio_data.py:
python3 scripts/data_scripts/undistort_nerfstudio_data.py \
--data-dir /path/to/input/scene \
--output-dir /path/to/output/scene
Additionally, the NeRF Capture app saves frame metadata without a file extension. To run splatting you will need to run our script scripts/data_scripts/fix_nerfcapture_filenames.py.
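The repo script handles this for you; as a rough illustration of the idea only (not the actual script), the fix amounts to something like the sketch below, which assumes the missing extensions are in the file_path entries of transforms.json and that the images are PNGs.

```python
# Illustration only (not the repo script): add a missing .png extension to
# file_path entries in a nerfstudio-style transforms.json.
import json
from pathlib import Path

scene = Path("/path/to/recordings/recording_0")  # placeholder path
meta_path = scene / "transforms.json"

with open(meta_path) as f:
    meta = json.load(f)

for frame in meta.get("frames", []):
    if not Path(frame["file_path"]).suffix:  # no extension recorded
        frame["file_path"] += ".png"         # assumes PNG images

with open(meta_path, "w") as f:
    json.dump(meta, f, indent=2)
```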
To train a splat, you can use
ns-train regsplatfacto \
--data path/to/data \
--experiment-name mvsanywhere-splatting \
--pipeline.datamanager.load_weights_from_checkpoint path/to/model \
--pipeline.model.use-skybox False
This will first run mvsanywhere inference and save outputs to disk, and then start training your splat.
Tips:
- If your data was captured with a phone in portrait mode, you can append the flag --pipeline.datamanager.rotate_images True.
- If your data contains a lot of sky, you can try adding a background skybox using --pipeline.model.use-skybox True.
Once you have a splat, you can extract a mesh using TSDF fusion:
ns-render-for-meshing \
--load-config /path/to/splat/config \
--rescale_to_world True \
--output_path /path/to/render/outputs
ns-meshing \
--renders-path /path/to/render/outputs \
--max_depth 20.0 \
--save-name mvsanywhere_mesh \
--voxel_size 0.04
If you are running on a scene reconstructed without metric scale (e.g. COLMAP), you will need to adjust max_depth and voxel_size to values sensible for your scene's scale.
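If you are unsure what scale your scene is at, one rough way to pick values is to look at the spread of camera positions in your transforms.json. The snippet below is an illustrative heuristic only; the path and the scaling factors are assumptions, not part of the repo.

```python
# Heuristic sketch: estimate scene scale from camera positions in transforms.json
# to pick rough max_depth / voxel_size values (illustrative assumptions only).
import json
import numpy as np

with open("/path/to/data/transforms.json") as f:  # placeholder path
    meta = json.load(f)

# Camera centres are the translation column of each camera-to-world matrix.
centres = np.array([np.array(fr["transform_matrix"])[:3, 3] for fr in meta["frames"]])
extent = (centres.max(axis=0) - centres.min(axis=0)).max()
print("camera extent:", extent)
print("e.g. try max_depth ~", 2 * extent, "and voxel_size ~", extent / 250)
```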
Congratulations - you now have a splat and a mesh!
We used the Robust Multi-View Depth Benchmark to evaluate MVSAnywhere's depth estimation in a zero-shot setting across multiple datasets.
To evaluate MVSAnywhere on this benchmark, first download the benchmark code onto your system:
git clone https://github.com/lmb-freiburg/robustmvd.git
Now, download and preprocess the evaluation datasets following this guide. You should download:
- KITTI
- Scannet
- ETH3D
- DTU
- Tanks and Temples
Don't forget to set the path to these datasets in rmvd/data/paths.toml. Now you are ready to evaluate MVSAnywhere by just running:
export PYTHONPATH="/path/to/robustmvd/:$PYTHONPATH"
python src/mvsanywhere/test_rmvd.py \
--name mvsanywhere \
--output_base_path OUTPUT_PATH \
--config_file configs/models/mvsanywhere_model.yaml \
--load_weights_from_checkpoint weights/mvsanywhere_hero.ckpt
To train MVSAnywhere:
-
Download all the required synthetic datasets (and val dataset):
Hypersim
- Download following the instructions from here:
python code/python/tools/dataset_download_images.py \
--downloads_dir path/to/download \
--decompress_dir /path/to/hypersim/raw
- Update configs/data/hypersim/hypersim_default_train.yaml to point to the correct location.
- Convert distances into planar depth using the provided script in this repo:
python ./data_scripts/generate_hypersim_planar_depths.py \
--data_config configs/data/hypersim_default_train.yaml \
--num_workers 8
TartanAir
- Download following the instructions from here:
python download_training.py \
--output-dir /path/to/tartan \
--rgb \
--depth \
--seg \
--only-left \
--unzip
- Update configs/data/tartanair/tartanair_default_train.yaml to point to the correct location.
BlendedMVG
- Download following the instructions from here.
- You should download BlendedMVS, BlendedMVS+ and BlendedMVS++, all low-res. Place them all in the same folder.
- Update configs/data/blendedmvg/blendedmvg_default_train.yaml to point to the correct location.
MatrixCity
- Download following the instructions from here.
- You should download big_city, big_city_depth and big_city_depth_float32.
- Update configs/data/matrix_city/matrix_city_default_train.yaml to point to the correct location.
VKITTI2
- Download following the instructions from here.
- You should download rgb, depth, classSegmentation and textgt.
- Update configs/data/vkitti/vkitti_default_train.yaml to point to the correct location.
Dynamic Replica
- Download following the instructions from here.
- After downloading, you can remove unused data to save disk space (segmentation, optical flow and pixel trajectories).
- Update configs/data/dynamic_replica/dynamic_replica_default_train.yaml to point to the correct location.
MVSSynth
- Download following the instructions from here.
- You should download the 960x540 version.
- Update configs/data/mvssynth/mvssynth_default_train.yaml to point to the correct location.
SAIL-VOS 3D
- Download following the instructions from here.
- You will need to contact the authors to download the data.
- Buy Grand Theft Auto V.
- (optional, recommended) Play Grand Theft Auto V and relax a little bit.
- Update configs/data/sailvos3d/sailvos3d_default_train.yaml to point to the correct location.
ScanNet v2 (Optional, val only)
- Follow the instructions from here.
-
Download Depth Anything v2 base weights from here.
-
Now you can train the model using:
python src/mvsanywhere/train.py \
--log_dir logs/ \
--name mvsanywhere_training \
--config_file configs/models/mvsanywhere_model.yaml \
--data_config configs/data/hypersim/hypersim_default_train.yaml:configs/data/tartanair/tartanair_default_train.yaml:configs/data/blendedmvg/blendedmvg_default_train.yaml:configs/data/matrix_city/matrix_city_default_train.yaml:configs/data/vkitti/vkitti_default_train.yaml:configs/data/dynamic_replica/dynamic_replica_default_train.yaml:configs/data/mvssynth/mvssynth_default_train.yaml:configs/data/sailvos3d/sailvos3d_default_train.yaml \
--val_data_config configs/data/scannet/scannet_default_val.yaml \
--batch_size 6 \
--val_batch_size 6 \
--da_weights_path /path/to/depth_anything_v2_vitb.pth \
--gpus 2
TL;DR: world_T_cam == world_from_cam
This repo uses the notation "cam_T_world" to denote a transformation from world to camera points (extrinsics). The intention is that the coordinate frame names match on either side of the variable when multiplying from right to left:
cam_points = cam_T_world @ world_points
world_T_cam denotes camera pose (from cam to world coords). ref_T_src denotes a transformation from a source to a reference view.
Finally, this notation also allows rotations and translations to be represented on their own, e.g. world_R_cam and world_t_cam.
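As a quick illustration of the convention, here is a standalone sketch with placeholder values (not code from this repo):

```python
# Illustration of the naming convention with 4x4 homogeneous transforms.
import numpy as np

world_T_cam = np.eye(4)                    # camera pose: camera -> world
world_T_cam[:3, 3] = [1.0, 0.0, 2.0]       # placeholder translation
cam_T_world = np.linalg.inv(world_T_cam)   # extrinsics: world -> camera

world_points = np.array([[0.0, 1.0, 3.0, 1.0]]).T   # homogeneous world point (4x1)
cam_points = cam_T_world @ world_points             # frame names match right to left

# Composition reads the same way, e.g. mapping source-view points into a reference view:
world_T_src = np.eye(4)                    # placeholder source camera pose
ref_T_src = cam_T_world @ world_T_src      # here the reference view is the camera above

# The same pattern covers re-expressing poses in another world convention:
# scannet_T_world @ world_T_cam would give the pose in a ScanNet-style world frame.
```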
This repo is geared towards ScanNet, so while its functionality should allow for any coordinate system (signaled via input flags), the model weights we provide assume a ScanNet coordinate system. This is important since we include ray information as part of metadata. Other datasets used with these weights should be transformed to the ScanNet system. The dataset classes we include will perform the appropriate transforms.
The tuple generation scripts make heavy use of a modified version of DeepVideoMVS's Keyframe buffer (thanks Arda and co!).
We'd like to thank the Niantic Raptor R&D infrastructure team - Saki Shinoda, Jakub Powierza, and Stanimir Vichev - for their valuable infrastructure support.
If you find our work useful in your research please consider citing our paper:
@inproceedings{izquierdo2025mvsanywhere,
title={{MVSAnywhere}: Zero Shot Multi-View Stereo},
author={Izquierdo, Sergio and Sayed, Mohamed and Firman, Michael and Garcia-Hernando, Guillermo and Turmukhambetov, Daniyar and Civera, Javier and Mac Aodha, Oisin and Brostow, Gabriel J. and Watson, Jamie},
booktitle={CVPR},
year={2025}
}
Copyright © Niantic, Inc. 2024. Patent Pending. All rights reserved. Please see the license file for terms.