New MLPerf Storage benchmarking tool mlpstorage v2.0.0b1 has issues with dataset creation: it does not create the expected-size dataset #113
rajib-genAI started this conversation in General
I am trying to use mlpstorage 2.0.0b1, downloaded from "https://github.com/mlcommons/storage.git" (ref: "https://github.com/mlcommons/storage").
Following the instructions in README.md, I was able to install and execute the new mlpstorage.
I observed an issue while running the U-Net3D workload: the generated dataset is far smaller than the size reported by the datasize command, which results in an implausibly high simulated GPU accelerator count passing from a single client/node.
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 64
[METRIC] Training Accelerator Utilization [AU] (%): 94.2091 (0.2926)
[METRIC] Training Throughput (samples/second): 1303.7871 (3.7117)
[METRIC] Training I/O Throughput (MB/second): 182281.5040 (518.9258)
[METRIC] train_au_meet_expectation: success
[METRIC] ==========================================================
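For context on why that accelerator count looks implausible: the reported I/O throughput appears to correspond to the nominal unet3d sample size rather than the tiny files actually on disk. A rough check, assuming DLIO's nominal unet3d record length of 146600628 bytes per sample (an assumption from the workload config, not printed in this log):

samples_per_second = 1303.7871                   # from the [METRIC] block above
nominal_sample_mib = 146600628 / 2**20           # ≈ 139.81 MiB per sample (assumed nominal size)
print(samples_per_second * nominal_sample_mib)   # ≈ 182282, close to the reported 182281.5 MB/second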
Datasize and datagen for the unet3d workload:
root@dgx77:/mlperf/storage# mlpstorage training datasize --model unet3d --client-host-memory-in-gb 2048 --num-client-hosts 1 --max-accelerators 32 --accelerator-type h100
Setting attr from max_accelerators to 32
Hosts is: ['127.0.0.1']
Hosts is: ['127.0.0.1']
2025-06-01 06:10:35|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/unet3d/datasize/20250601_061035
2025-06-01 06:10:35|RESULT: Minimum file count dictated by 500 step requirement of given accelerator count and batch size.
2025-06-01 06:10:35|RESULT: Number of training files: 112000
2025-06-01 06:10:35|RESULT: Number of training subfolders: 0
2025-06-01 06:10:35|RESULT: Total disk space required for training: 15291.64 GB
2025-06-01 06:10:35|WARNING: The number of files required may be excessive for some filesystems. You can use the num_subfolders_train parameter to shard the dataset. To keep near 10,000 files per folder use "11x" subfolders by adding "--param dataset.num_subfolders_train=11"
2025-06-01 06:10:35|RESULT: Run the following command to generate data:
mlpstorage training datagen --hosts=127.0.0.1 --model=unet3d --exec-type=mpi --param dataset.num_files_train=112000 --num-processes=32 --results-dir=/tmp/mlperf_storage_results --data-dir=<INSERT_DATA_DIR>
2025-06-01 06:10:35|WARNING: The parameter for --num-processes is the same as --max-accelerators. Adjust the value according to your system.
2025-06-01 06:10:35|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/training/unet3d/datasize/20250601_061035/training_20250601_061035_metadata.json
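For reference, the datasize numbers above are internally consistent. A rough sanity check, assuming unet3d's per-accelerator batch size of 7 and a nominal record length of 146600628 bytes per sample (both are assumptions from the workload config, not printed in this log):

steps, accelerators, batch_size = 500, 32, 7     # batch size 7 is an assumption
num_files = steps * accelerators * batch_size    # = 112000, matching the RESULT line above
record_length = 146600628                        # assumed nominal bytes per unet3d sample
print(num_files * record_length / 2**30)         # ≈ 15291.64 GiB, matching "15291.64 GB"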
root@dgx77:/mlperf/storage# time mlpstorage training datagen --hosts=127.0.0.1 --model=unet3d --exec-type=mpi --allow-run-as-root --oversubscribe --param dataset.num_files_train=112000 --num-processes=32 --results-dir=./results/results.dgx77.unet3d.datagen.github.new.code.$(date -d "today" +"%Y%m%d%H%M") --data-dir=/mnt/testfs/uned3d_data_mlpstorage
Hosts is: ['127.0.0.1']
Hosts is: ['127.0.0.1']
2025-06-01 06:12:35|STATUS: Benchmark results directory: ./results/results.dgx77.unet3d.datagen.github.new.code.202506010612/training/unet3d/datagen/20250601_061235
2025-06-01 06:12:35|INFO: Creating data directory: /mnt/testfs/uned3d_data_mlpstorage/unet3d...
2025-06-01 06:12:35|INFO: Creating directory: /mnt/testfs/uned3d_data_mlpstorage/unet3d/train...
2025-06-01 06:12:35|INFO: Creating directory: /mnt/testfs/uned3d_data_mlpstorage/unet3d/valid...
2025-06-01 06:12:35|INFO: Creating directory: /mnt/testfs/uned3d_data_mlpstorage/unet3d/test...
2025-06-01 06:12:35|STATUS: Running benchmark command:: mpirun -n 32 -host 127.0.0.1:32 --oversubscribe --allow-run-as-root /usr/local/bin/dlio_benchmark workload=unet3d_datagen ++hydra.run.dir=./results/results.dgx77.unet3d.datagen.github.new.code.202506010612/training/unet3d/datagen/20250601_061235 ++hydra.output_subdir=dlio_config ++workload.workflow.generate_data=True ++workload.workflow.train=False ++workload.workflow.checkpoint=False ++workload.dataset.num_files_train=112000 ++workload.dataset.data_folder=/mnt/testfs/uned3d_data_mlpstorage/unet3d --config-dir=/mlperf/storage/configs/dlio
...
...
[INFO] 2025-06-01T06:14:45.659470 Generating NPZ Data: [============================================================>] 99.7% 111681 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.665204 Generating NPZ Data: [============================================================>] 99.7% 111713 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.670244 Generating NPZ Data: [============================================================>] 99.8% 111745 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.675559 Generating NPZ Data: [============================================================>] 99.8% 111777 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.680224 Generating NPZ Data: [============================================================>] 99.8% 111809 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.683631 Generating NPZ Data: [============================================================>] 99.9% 111841 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.686706 Generating NPZ Data: [============================================================>] 99.9% 111873 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.688980 Generating NPZ Data: [============================================================>] 99.9% 111905 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.691353 Generating NPZ Data: [============================================================>] 99.9% 111937 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO]
[INFO] 2025-06-01T06:14:45.693855 Generating NPZ Data: [============================================================>] 100.0% 111969 of 112000 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:235]
[INFO] 2025-06-01T06:14:45.708871 Generation done [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:162]
2025-06-01 06:15:05|STATUS: Writing metadata for benchmark to: ./results/results.dgx77.unet3d.datagen.github.new.code.202506010612/training/unet3d/datagen/20250601_061235/training_20250601_061235_metadata.json
real 2m29.817s
user 7m21.370s
sys 2m42.396s
root@dgx77:/mlperf/storage#
Looking at the dataset file details and total size, it is clear that the 112000 files add up to only 7.3 GB, whereas the datasize command reported 15291.64 GB.
root@dgx77:~# ls -ltrh /mnt/testfs/unet3d_data_mlpstorage/unet3d/train/ | grep .npz -c
112000
root@dgx77:~# ls -ltrh /mnt/testfs/unet3d_data_mlpstorage/unet3d/train/img_000000_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 /mnt/testfs/unet3d_data_mlpstorage/unet3d/train/img_000000_of_112000.npz
root@dgx77:~# ls -ltrh /mnt/testfs/unet3d_data_mlpstorage/unet3d/train/img_111999_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:14 /mnt/testfs/unet3d_data_mlpstorage/unet3d/train/img_111999_of_112000.npz
root@dgx77:~# du -h --max-depth=1 /mnt/testfs/unet3d_data_mlpstorage/unet3d/
0 /mnt/testfs/unet3d_data_mlpstorage/unet3d/valid
0 /mnt/testfs/unet3d_data_mlpstorage/unet3d/test
7.3G /mnt/testfs/unet3d_data_mlpstorage/unet3d/train
7.3G /mnt/testfs/unet3d_data_mlpstorage/unet3d/
root@dgx77:~#
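The per-file numbers make the gap concrete (simple arithmetic on the values shown above; the expected per-file size is just the reported total divided by the file count):

num_files = 112000
print(7.3 * 2**30 / num_files / 1024)            # ≈ 68 KiB per file actually written (matches the 65K listings)
print(15291.64 * 2**30 / num_files / 2**20)      # ≈ 139.8 MiB per file expected from the datasize estimate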
Now, looking at the individual dataset files, they are all the same size (65K each):
root@dgx77:~# ls -ltrh /mnt/testfs/unet3d_data_mlpstorage/unet3d/train | more
total 7.3G
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000027_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000024_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000059_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000019_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000001_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000091_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000002_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000056_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000051_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000033_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000123_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000034_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000083_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000088_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000065_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000155_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000066_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000115_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000120_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000097_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000187_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000098_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000147_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000152_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000129_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000219_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000130_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000179_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000184_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000161_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000251_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000162_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000211_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000216_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000193_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000283_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000194_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000315_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000243_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000248_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000226_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000347_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000225_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000275_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000258_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000379_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000280_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000257_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000307_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000290_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000411_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000312_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000289_of_112000.npz
-rw-r--r-- 1 root root 65K Jun 1 06:13 img_000339_of_112000.npz
root@dgx77:~#
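To help narrow down where the data volume is being lost, it may be worth inspecting one generated file directly. A minimal sketch, assuming numpy is installed and using a file name from the listing above:

import numpy as np

# Open one of the generated samples and report what it actually contains
path = "/mnt/testfs/unet3d_data_mlpstorage/unet3d/train/img_000000_of_112000.npz"
with np.load(path) as npz:
    for name in npz.files:
        arr = npz[name]
        # A full-size unet3d sample should hold on the order of 10^8 bytes of array data
        # (assuming the nominal ~140 MB record length); here nbytes will show what was written
        print(name, arr.shape, arr.dtype, arr.nbytes)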
The same issue occurs on a different server: the dataset is far smaller than what the datasize command reported. Instead of 30583.27 GB of data, there is only 15 GB.
root@dgx76:~# ls -ltrh /mnt/testfs/unet3d_data_30TB/unet3d/train/ | grep .npz -c
224000
root@dgx76:~# ls -ltr /mnt/testfs/unet3d_data_30TB/unet3d/train/img_000000_of_224000.npz
-rw-r--r-- 1 root root 66034 May 30 04:59 /mnt/testfs/unet3d_data_30TB/unet3d/train/img_000000_of_224000.npz
root@dgx76:~# ls -ltr /mnt/testfs/unet3d_data_30TB/unet3d/train/img_223999_of_224000.npz
-rw-r--r-- 1 root root 66034 May 30 05:01 /mnt/testfs/unet3d_data_30TB/unet3d/train/img_223999_of_224000.npz
root@dgx76:~# du -h --max-depth=1 /mnt/testfs/unet3d_data_30TB/unet3d/
15G /mnt/testfs/unet3d_data_30TB/unet3d/train
0 /mnt/testfs/unet3d_data_30TB/unet3d/valid
0 /mnt/testfs/unet3d_data_30TB/unet3d/test
15G /mnt/testfs/unet3d_data_30TB/unet3d/
root@dgx76:~#
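The same arithmetic applies here, assuming this run was sized for 64 accelerators with the same batch size 7 and nominal 146600628-byte sample size (assumptions; the datasize command for this run is not shown):

num_files = 500 * 64 * 7                 # = 224000 files, matching the count above
print(num_files * 146600628 / 2**30)     # ≈ 30583.3 GiB expected, versus 15G actually on disk
print(66034 / 1024)                      # ≈ 64.5 KiB actual per-file size from the listing above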
Can you please look into fixing this inconsistent dataset size issue?