Commit ac8fc98

Merge pull request #599 from artemisp/main
X-InstructBLIP Code
2 parents 7f00a08 + 018b106 commit ac8fc98

560 files changed: +73960 -182 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -152,5 +152,8 @@ debug*/
 *.dat
 *.tsv
 *.gz
+*.csv
+*.p
+*.pdf
 
 cache/

README.md

Lines changed: 3 additions & 0 deletions
@@ -27,6 +27,9 @@
 # LAVIS - A Library for Language-Vision Intelligence
 
 ## What's New: 🎉
+* [Model Release] November 2023, released implementation of **X-InstructBLIP** <br>
+[Paper](https://arxiv.org/pdf/2311.18799.pdf), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/xinstructblip), [Website](https://artemisp.github.io/X-InstructBLIP-page/), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/xinstructblip/demo/run_demo.ipynb)
+> A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.
 * [Model Release] July 2023, released implementation of **BLIP-Diffusion** <br>
 [Paper](https://arxiv.org/abs/2305.06500), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion), [Website](https://dxli94.github.io/BLIP-Diffusion-website/)
 > A text-to-image generation model that trains 20x than DreamBooth. Also facilitates zero-shot subject-driven generation and editing.

assets/LAVIS_technical_report.pdf

-1.55 MB
Binary file not shown.

lavis/common/utils.py

Lines changed: 14 additions & 1 deletion
@@ -1,5 +1,5 @@
 """
-Copyright (c) 2022, salesforce.com, inc.
+Copyright (c) 2023, salesforce.com, inc.
 All rights reserved.
 SPDX-License-Identifier: BSD-3-Clause
 For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
@@ -440,3 +440,16 @@ def get_file_size(filename):
     """
     size_in_mb = os.path.getsize(filename) / float(1024**2)
     return size_in_mb
+
+def is_serializable(value):
+    """
+    This function checks if the provided value can be serialized into a JSON string.
+    """
+    try:
+        json.dumps(value)
+        return True
+    except (TypeError, OverflowError):
+        return False
+
+def is_convertible_to_int(value):
+    return bool(re.match(r'^-?\d+$', str(value)))
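
The two helpers added to `lavis/common/utils.py` can be exercised in isolation. The sketch below copies them as shown in the diff and adds the `json` and `re` imports they rely on; `keep_serializable` and the `sample` dict are hypothetical additions for illustration only and are not part of the commit.

```python
import json
import re


def is_serializable(value):
    """Return True if `value` can be serialized into a JSON string."""
    try:
        json.dumps(value)
        return True
    except (TypeError, OverflowError):
        return False


def is_convertible_to_int(value):
    """Return True if str(value) is an optional minus sign followed by digits."""
    return bool(re.match(r'^-?\d+$', str(value)))


def keep_serializable(sample):
    # Hypothetical helper (not in the commit): drop entries json.dumps rejects.
    return {k: v for k, v in sample.items() if is_serializable(v)}


# Made-up sample; a Python set is not JSON serializable, so "audio" is dropped.
sample = {"id": "42", "caption": "a dog barks", "audio": {1, 2, 3}}
clean = keep_serializable(sample)
```

Note that the regex check is stricter than `int()`: strings such as `"+5"` or `" 5 "` are rejected by `is_convertible_to_int` even though `int()` would accept them.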
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+# Copyright (c) 2023, salesforce.com, inc.
+# All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+datasets:
+  aok_vqa_instruct:
+    # data_dir: ${env.data_dir}/datasets
+    data_type: images # [images|videos|features]
+
+    vis_processor:
+      train:
+        name: "clip_image_train"
+        image_size: 224
+      eval:
+        name: "clip_image_eval"
+        image_size: 224
+
+    text_processor:
+      train:
+        name: blip_instruction
+        modality: image
+        task: qa
+      eval:
+        name: blip_question
+
+    build_info:
+      # Be careful not to append minus sign (-) before split to avoid itemizing
+      annotations:
+        train:
+          url:
+            - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/aokvqa/aokvqa_v1p0_train.json
+          storage:
+            - aokvqa/annotations/aokvqa_v1p0_train.json
+        # val:
+        #   url:
+        #     - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/aokvqa/aokvqa_v1p0_val.json
+        #     - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/aokvqa/specialized_vocab_train.json
+        #   storage:
+        #     - aokvqa/annotations/aokvqa_v1p0_val.json
+        #     - aokvqa/annotations/specialized_vocab_train_lavis.json
+        #     # - aokvqa/annotations/large_vocab_train_lavis.json
+        # test:
+        #   url:
+        #     - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/aokvqa/aokvqa_v1p0_test.json
+        #     - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/aokvqa/specialized_vocab_train.json
+        #   storage:
+        #     - aokvqa/annotations/aokvqa_v1p0_test.json
+        #     - aokvqa/annotations/specialized_vocab_train_lavis.json
+      images:
+        # storage: /coco/images
+        storage: /export/share/datasets/vision/coco/images
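
The `url` and `storage` lists under `build_info.annotations` look like parallel lists, one storage path per download URL. The following is a minimal sketch of how such a pair could be consumed, assuming the lists align by position and that relative storage paths are resolved against a cache root. `plan_downloads` and `cache_root` are hypothetical names, not part of the commit.

```python
import os

# Mirrors the `build_info.annotations.train` entry of the config as a plain dict.
annotations = {
    "train": {
        "url": [
            "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/aokvqa/aokvqa_v1p0_train.json",
        ],
        "storage": ["aokvqa/annotations/aokvqa_v1p0_train.json"],
    }
}


def plan_downloads(annotations, cache_root):
    """Pair each URL with its storage path, assuming the lists align by position."""
    plan = []
    for split, info in annotations.items():
        urls, paths = info["url"], info["storage"]
        if len(urls) != len(paths):
            raise ValueError(f"{split}: {len(urls)} urls but {len(paths)} storage paths")
        for url, path in zip(urls, paths):
            # Assumption: relative storage paths are resolved against cache_root.
            dest = path if os.path.isabs(path) else os.path.join(cache_root, path)
            plan.append((split, url, dest))
    return plan


plan = plan_downloads(annotations, cache_root="/tmp/lavis_cache")
```

The length check matters for configs like the commented-out `val` and `test` splits, where two URLs must line up with two storage entries.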
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# Copyright (c) 2023, salesforce.com, inc.
+# All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+datasets:
+  audiocaps_mm_caption: # name of the dataset builder
+    audio_processor:
+      train:
+        name: beats_audio
+        sampling_rate: 16000
+      eval:
+        name: beats_audio
+        sampling_rate: 16000
+
+    text_processor:
+      train:
+        name: "blip_instruction"
+        modality: audio
+        task: caption
+      eval:
+        name: "blip_caption"
+
+    data_type: [audio]
+
+    build_info:
+      kwargs:
+        missing_ids: [2sh7ZkazyO8, 966jA2-z0mQ, 52RlolYyjAE, HVAc9hm4jjk, 8lPjqvYWNyM, eXgPnnE3TuQ]
+      annotations:
+        train:
+          url:
+            - https://raw.githubusercontent.com/cdjkim/audiocaps/master/dataset/train.csv
+          storage:
+            - audiocaps/annotations/train.csv
+
+        val:
+          url:
+            - https://raw.githubusercontent.com/cdjkim/audiocaps/master/dataset/val.csv
+          storage:
+            - audiocaps/annotations/val.csv
+
+        test:
+          url:
+            - https://raw.githubusercontent.com/cdjkim/audiocaps/master/dataset/test.csv
+          storage:
+            - audiocaps/annotations/test.csv
+
+      audio:
+        storage: /export/einstein-vision/audio_datasets/audiocaps/AUDIOCAPS_32000Hz/audio
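
The `missing_ids` list presumably names clips that could not be obtained and should be skipped when the annotations are loaded. A minimal sketch of that filtering step follows, assuming the annotation CSV has the columns `audiocap_id,youtube_id,start_time,caption`. `load_annotations` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io

# IDs copied from the `missing_ids` entry of the config.
MISSING_IDS = {
    "2sh7ZkazyO8", "966jA2-z0mQ", "52RlolYyjAE",
    "HVAc9hm4jjk", "8lPjqvYWNyM", "eXgPnnE3TuQ",
}

# Tiny stand-in for train.csv; the rows are placeholders, not real annotations.
SAMPLE_CSV = """audiocap_id,youtube_id,start_time,caption
1,2sh7ZkazyO8,30,placeholder caption one
2,aaaaaaaaaaa,10,placeholder caption two
"""


def load_annotations(csv_text, missing_ids):
    """Read annotation rows and skip clips whose ID is listed as missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["youtube_id"] not in missing_ids]


rows = load_annotations(SAMPLE_CSV, MISSING_IDS)
```

Filtering at load time keeps the dataset length consistent with the files that actually exist on disk, so a missing clip does not surface later as a read error in the middle of training.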
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+# Copyright (c) 2023, salesforce.com, inc.
+# All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+datasets:
+  audiocaps_mm_caption_instruct: # name of the dataset builder
+    audio_processor:
+      train:
+        name: beats_audio
+        sampling_rate: 16000
+      eval:
+        name: beats_audio
+        sampling_rate: 16000
+
+    text_processor:
+      train:
+        name: "blip_instruction"
+        modality: audio
+        task: caption
+      eval:
+        name: "blip_caption"
+
+    data_type: [audio]
+
+    missing_ids: [2sh7ZkazyO8, 966jA2-z0mQ, 52RlolYyjAE, HVAc9hm4jjk, 8lPjqvYWNyM, eXgPnnE3TuQ]
+
+    build_info:
+      kwargs:
+        cached: False
+        cached_dir: /export/einstein-vision/audio_datasets/audiocaps/beats_features
+      annotations:
+        train:
+          url:
+            - https://raw.githubusercontent.com/cdjkim/audiocaps/master/dataset/train.csv
+          storage:
+            - audiocaps/annotations/train.csv
+
+        # val:
+        #   url:
+        #     - https://raw.githubusercontent.com/cdjkim/audiocaps/master/dataset/val.csv
+        #   storage:
+        #     - audiocaps/annotation/val.csv
+
+        # test:
+        #   url:
+        #     - https://raw.githubusercontent.com/cdjkim/audiocaps/master/dataset/test.csv
+        #   storage:
+        #     - /export/einstein-vision/audio_datasets/audiocaps/dataset/test.csv
+
+      audio:
+        storage: /export/einstein-vision/audio_datasets/audiocaps/AUDIOCAPS_32000Hz/audio
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# Copyright (c) 2023, salesforce.com, inc.
+# All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+datasets:
+  audiocaps_mm_qa: # name of the dataset builder
+    audio_processor:
+      train:
+        name: beats_audio
+        sampling_rate: 16000
+      eval:
+        name: beats_audio
+        sampling_rate: 16000
+        is_eval: True
+
+    text_processor:
+      train:
+        name: "blip_instruction"
+        modality: audio
+        task: qa
+      eval:
+        name: "blip_question"
+
+    data_type: [audio]
+
+    build_info:
+      kwargs:
+        cached: False
+        # add_binary: True
+        cached_dir: /export/einstein-vision/audio_datasets/audiocaps/beats_features
+        missing_ids: [2sh7ZkazyO8, 966jA2-z0mQ, 52RlolYyjAE, HVAc9hm4jjk, 8lPjqvYWNyM, eXgPnnE3TuQ]
+      annotations:
+        train:
+          url:
+            - https://storage.googleapis.com/sfr-xinstructblip-data-research/data/audiocaps/audio_qa_final_train.csv
+            # - /export/home/LAVIS-xgen_mm/projects/xinstructblip/data_aug/audio_qa_data/audio_qa_final_train.csv
+          storage:
+            - audiocaps_qa/annotations/train.csv
+            # - /export/home/LAVIS-xgen_mm/projects/xinstructblip/data_aug/audio_qa_data/audio_qa_final_train.csv
+
+        # val:
+        #   url:
+        #     # - https://storage.googleapis.com/sfr-xinstructblip-data-research/data/audiocaps/audio_qa_final_val.csv
+        #     - /export/home/LAVIS-xgen_mm/projects/xinstructblip/data_aug/audio_qa_data/audio_qa_final_val.csv
+        #   storage:
+        #     # - audiocaps_qa/annotations/val.csv
+        #     - /export/home/LAVIS-xgen_mm/projects/xinstructblip/data_aug/audio_qa_data/audio_qa_final_val.csv
+
+      audio:
+        storage: /export/einstein-vision/audio_datasets/audiocaps/AUDIOCAPS_32000Hz/audio
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+# Copyright (c) 2023, salesforce.com, inc.
+# All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+datasets:
+  audioset_mm_caption: # 14141
+    audio_processor:
+      train:
+        name: beats_audio
+        sampling_rate: 16000
+      eval:
+        name: beats_audio
+        sampling_rate: 16000
+        is_eval: False
+
+    text_processor:
+      train:
+        name: blip_instruction
+        modality: audio
+        task: classification
+      eval:
+        name: blip_caption
+
+    data_type: [audio]
+
+    build_info:
+      annotations:
+        train:
+          url:
+            - https://storage.googleapis.com/sfr-xinstructblip-data-research/data//audioset/balanced_train_clean.csv
+            # - /export/home/LAVIS-xgen_mm/lavis/configs/datasets/audioset/balanced_train_clean.csv
+            - http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv
+          storage:
+            - audioset/balanced_train_clean.csv
+            # - /export/home/LAVIS-xgen_mm/lavis/configs/datasets/audioset/balanced_train_clean.csv
+            - audioset/annotations/class_labels_indices.csv
+
+        # val:
+        #   url:
+        #     - http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/eval_segments.csv
+        #     - http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv
+        #   storage:
+        #     - audioset/annotations/eval_segments.csv
+        #     - audioset/annotations/class_labels_indices.csv
+      audio:
+        storage: /export/einstein-vision/audio_datasets/AudioSet/all_audio
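
The train split pairs a segment list with `class_labels_indices.csv`, which suggests that label IDs are translated into readable class names for the `classification` task. The sketch below shows one way to do that, assuming the class file uses an `index,mid,display_name` header and that segment labels arrive as a comma-separated string of machine IDs. The layout of `balanced_train_clean.csv` is not shown in the diff, so `decode_labels` and its input format are hypothetical, and the two rows are only illustrative.

```python
import csv
import io

# Illustrative rows in the assumed `index,mid,display_name` layout.
CLASS_LABELS_CSV = '''index,mid,display_name
0,/m/09x0r,"Speech"
1,/m/05zppz,"Male speech, man speaking"
'''


def build_label_map(csv_text):
    """Map each machine ID (mid) to its human-readable display name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["mid"]: row["display_name"] for row in reader}


def decode_labels(positive_labels, label_map):
    """Turn a comma-separated mid string into display names, skipping unknown mids."""
    mids = [m.strip() for m in positive_labels.split(",") if m.strip()]
    return [label_map[m] for m in mids if m in label_map]


label_map = build_label_map(CLASS_LABELS_CSV)
names = decode_labels("/m/09x0r,/m/05zppz", label_map)
```

Using `csv.DictReader` rather than splitting on commas matters here because display names can themselves contain commas inside quotes.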
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# Copyright (c) 2023, salesforce.com, inc.
+# All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+
+datasets:
+  audioset_mm_caption_instruct: # 14141
+    audio_processor:
+      train:
+        name: beats_audio
+        sampling_rate: 16000
+      eval:
+        name: beats_audio
+        sampling_rate: 16000
+        is_eval: False
+
+    text_processor:
+      train:
+        name: blip_instruction
+        modality: audio
+        task: classification
+      eval:
+        name: blip_caption
+
+    data_type: [audio]
+
+    build_info:
+      annotations:
+        train:
+          url:
+            # - https://storage.googleapis.com/sfr-xinstructblip-data-research/data//audioset/balanced_train_clean.csv
+            - /export/home/LAVIS-xgen_mm/lavis/configs/datasets/audioset/balanced_train_clean.csv
+            - http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv
+          storage:
+            - audioset/annotations/balanced_train_clean.csv
+            # - /export/home/LAVIS-xgen_mm/lavis/configs/datasets/audioset/balanced_train_clean.csv
+            - audioset/annotations/class_labels_indices.csv
+
+        # val:
+        #   url:
+        #     - http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/eval_segments.csv
+        #     - http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv
+        #   storage:
+        #     - audioset/annotations/eval_segments.csv
+        #     - audioset/annotations/class_labels_indices.csv
+
+      audio:
+        storage: /export/einstein-vision/audio_datasets/AudioSet/all_audio
