105 changes: 56 additions & 49 deletions README.md
@@ -6,16 +6,16 @@ MLPerf® Storage is a benchmark suite to characterize the performance of storage
- [Installation](#installation)
- [Configuration](#configuration)
- [Workloads](#workloads)
- [U-Net3D](#u-net3d)
- [ResNet-50](#resnet-50)
- [CosmoFlow](#cosmoflow)
- [Parameters](#parameters)
- [CLOSED](#closed)
- [OPEN](#open)
- [Submission Rules](#submission-rules)
## Overview
For an overview of how this benchmark suite is used by submitters to compare the performance of storage systems supporting an AI cluster, see the MLPerf® Storage Benchmark submission rules here: [doc](https://github.com/mlcommons/storage/blob/main/Submission_guidelines.md).

## Prerequisite

@@ -26,38 +26,32 @@ The following prerequisites must be satisfied:
1. Pick one host to act as the launcher client host. Passwordless SSH must be set up from the launcher client host to all other participating client hosts; `ssh-copy-id` is a useful tool.
2. The code and data locations (discussed in later sections) must be exactly the same on every client host, including the launcher host. This is because the same benchmark command is automatically triggered on every participating client host during the distributed training process.

## Installation
**The following installation steps must be run on every client host that will participate in running the benchmarks.**

### Dependencies
DLIO requires an MPI package.
For example, when running on Ubuntu 24.04, install the OpenMPI tools and libraries.

```bash
sudo apt install pipx libopenmpi-dev openmpi-common
```

Install PDM and make it available on your shell's PATH.


```bash
pipx install pdm
pipx ensurepath
```

### Prepare test environment

Clone the latest release from the [MLCommons Storage](https://github.com/mlcommons/storage) repository and install the Python dependencies.

```bash
git clone -b v2.0 https://github.com/mlcommons/storage.git
cd storage
pdm install --frozen-lockfile
```

The working directory structure is as follows:
|---(folder contains configs for all checkpoint and training workloads)
|---vectordbbench (These configurations are PREVIEW only and not available for submission)
|---(folder contains configs for all vectordb workloads)
|---.venv (default location of the virtual environment managed by pdm)
```


You can invoke the `mlpstorage` script either through pdm:
```bash
$ pdm run mlpstorage -h
```

or directly once the Python virtual environment is activated:
```bash
source .venv/bin/activate
(mlperf-storage-3.12) [...]$ mlpstorage -h
```

Benchmark simulations are performed through [dlio_benchmark](https://github.com/argonne-lcf/dlio_benchmark), a benchmark suite for emulating I/O patterns of deep learning workloads. [dlio_benchmark](https://github.com/argonne-lcf/dlio_benchmark) is currently pinned as a prerequisite to a specific git branch; a future release will update the installer to pull DLIO from PyPI. The DLIO configuration of each workload is specified through a YAML file; the configs for all MLPerf Storage workloads are in the `configs` folder.

## Operation
The benchmark uses nested commands to select the workload category, the workload, and the workload parameters.
@@ -378,10 +385,10 @@ View Only:

Example:

To run the benchmark on the `unet3d` workload, with data located in the `unet3d_data` directory, using 2 H100 accelerators spread across 2 client hosts (with IPs 10.117.61.121 and 10.117.61.165), and with results written to the `unet3d_results` directory:

```bash
mlpstorage training run --hosts 10.117.61.121,10.117.61.165 --num-client-hosts 2 --client-host-memory-in-gb 64 --num-accelerators 2 --accelerator-type h100 --model unet3d --data-dir unet3d_data --results-dir unet3d_results --param dataset.num_files_train=400
```
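The `--param key=value` flags override individual fields of the nested DLIO YAML configuration using dotted keys such as `dataset.num_files_train`. The following is a hedged sketch of that idea only; the helper name `apply_param` is hypothetical, and the real parsing lives inside mlpstorage/DLIO.

```python
def apply_param(cfg: dict, dotted_key: str, value) -> dict:
    """Apply a dotted-key override (e.g. dataset.num_files_train=400)
    onto a nested config dict. Illustrative only."""
    keys = dotted_key.split(".")
    node = cfg
    for k in keys[:-1]:
        node = node.setdefault(k, {})  # descend, creating levels as needed
    node[keys[-1]] = value
    return cfg

# Example: override the training file count in a nested config
cfg = {"dataset": {"num_files_train": 168}}
apply_param(cfg, "dataset.num_files_train", 400)
# cfg["dataset"]["num_files_train"] is now 400
```

Multiple `--param` flags compose naturally under this model, since each override touches only one leaf of the nested config.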

4. The benchmark submission report is generated by aggregating the individual run results. The reporting command generates a report for a given results directory.
@@ -449,11 +456,11 @@ View Only:
--what-if View the configuration that would execute and the associated command.
```
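Conceptually, report generation walks the results directory and combines per-run results into one summary. The sketch below is an assumption-laden illustration of that pattern: the file name `summary.json` and its fields are invented for this example and are not the actual mlpstorage output layout.

```python
import json
from pathlib import Path

def aggregate_results(results_dir: str) -> dict:
    """Collect per-run JSON summaries under a results directory and
    average a throughput metric. File name and fields are hypothetical."""
    runs = [json.loads(p.read_text())
            for p in Path(results_dir).rglob("summary.json")]
    if not runs:
        return {"num_runs": 0}
    mean_tp = sum(r["train_throughput_samples_per_second"]
                  for r in runs) / len(runs)
    return {"num_runs": len(runs),
            "mean_train_throughput_samples_per_second": mean_tp}
```

Because the aggregation recurses over the whole tree, one results directory can hold many runs (e.g. repeated runs of the same workload) and still produce a single report.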

Note: The `reportgen` script must be run on the launcher client host.

## Training Models
Currently, the storage benchmark suite supports benchmarking of three deep learning workloads:
- Image segmentation using the U-Net3D model
- Image classification using the ResNet-50 model
- Cosmology parameter prediction using the CosmoFlow model

@@ -470,16 +477,16 @@ Generate data for the benchmark run based on the minimum files
```bash
mlpstorage training datagen --hosts 127.0.0.1 --num-processes 8 --model unet3d --data-dir unet3d_data --results-dir unet3d_results --param dataset.num_files_train=42000
```

Run the benchmark.

```bash
mlpstorage training run --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory-in-gb 64 --num-accelerators 4 --accelerator-type h100 --model unet3d --data-dir unet3d_data --results-dir unet3d_results --param dataset.num_files_train=42000
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
mlpstorage reports reportgen --results-dir unet3d_results
```

@@ -496,16 +503,16 @@ Generate data for the benchmark run
```bash
mlpstorage training datagen --hosts 127.0.0.1 --num-processes 8 --model resnet50 --data-dir resnet50_data --results-dir resnet50_results --param dataset.num_files_train=2557
```

Run the benchmark.

```bash
mlpstorage training run --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory-in-gb 64 --num-accelerators 16 --accelerator-type h100 --model resnet50 --data-dir resnet50_data --results-dir resnet50_results --param dataset.num_files_train=2557
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
mlpstorage reports reportgen --results-dir resnet50_results
```

@@ -514,49 +521,49 @@ mlpstorage reports reportgen --results-dir resnet50_results
Calculate minimum dataset size required for the benchmark run based on your client configuration

```bash
mlpstorage training datasize --model cosmoflow --client-host-memory-in-gb 64 --num-client-hosts 1 --max-accelerators 16 --accelerator-type h100
```
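The `datasize` step exists because the training dataset must comfortably exceed the aggregate memory of the client hosts, so that reads hit the storage system rather than the page cache. A hedged sketch of that arithmetic is below; the memory multiplier and per-file size are illustrative assumptions, so always use the `datasize` command for the authoritative number.

```python
import math

def min_files_train(client_host_memory_gb: int, num_client_hosts: int,
                    file_size_mb: float,
                    memory_multiplier: float = 5.0) -> int:
    """Estimate the minimum number of training files so the dataset is
    memory_multiplier times the aggregate client host memory.
    Illustrative only; not the actual mlpstorage formula."""
    total_memory_mb = client_host_memory_gb * 1024 * num_client_hosts
    required_dataset_mb = total_memory_mb * memory_multiplier
    return math.ceil(required_dataset_mb / file_size_mb)

# Example: one 64 GB client host, hypothetical ~146 MB files
print(min_files_train(64, 1, 146))
```

Doubling the client host count (or memory) roughly doubles the required file count, which is why `datasize` takes the client configuration as input.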

Generate data for the benchmark run

```bash
mlpstorage training datagen --hosts 127.0.0.1 --num-processes 8 --model cosmoflow --data-dir cosmoflow_data --results-dir=cosmoflow_results --param dataset.num_files_train=121477
```

Run the benchmark.

```bash
mlpstorage training run --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory-in-gb 64 --num-accelerators 16 --accelerator-type h100 --model cosmoflow --data-dir cosmoflow_data --results-dir cosmoflow_results --param dataset.num_files_train=121477
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
mlpstorage reports reportgen --results-dir cosmoflow_results
```

## Parameters

### CLOSED
The table below lists the configurable parameters for the benchmark in the CLOSED category.

| Parameter | Description |Default|
| ------------------------------ | ------------------------------------------------------------ |-------|
| **Dataset params** | | |
| dataset.num_files_train | Number of files for the training set | --|
| dataset.num_subfolders_train | Number of subfolders in which the training set is stored |0|
| dataset.data_folder | The path where the dataset is stored | --|
| **Reader params** | | |
| reader.read_threads | Number of threads to load the data | --|
| reader.computation_threads | Number of threads to preprocess the data (for TensorFlow) |1|
| reader.prefetch_size | Number of batches to prefetch |2|
| reader.transfer_size | Number of bytes in the read buffer (only for TensorFlow) | |
| reader.odirect | Whether to use direct I/O for the reader (currently applicable to U-Net3D) | False |
| **Checkpoint params** | | |
| checkpoint.checkpoint_folder | The folder to save the checkpoints | --|
| **Storage params** | | |
| storage.storage_root | The storage root directory | ./|
| storage.storage_type | The storage type |local_fs|


### OPEN
@@ -566,10 +573,10 @@ In addition to what can be changed in the CLOSED category, the following parameters
| ------------------------------ | ------------------------------------------------------------ |-------|
| framework | The machine learning framework |PyTorch for 3D U-Net |
| **Dataset params** | | |
| dataset.format | Format of the dataset | .npz for 3D U-Net |
| dataset.num_samples_per_file | Number of samples per file (only for TensorFlow using tfrecord datasets) | 1 for 3D U-Net |
| **Reader params** | | |
| reader.data_loader | Data loader type (TensorFlow, PyTorch, or custom) | PyTorch for 3D U-Net |


## Submission Rules
## Submission Rules
4 changes: 2 additions & 2 deletions mlpstorage/__init__.py
@@ -1,3 +1,3 @@
# VERSION
VERSION = "2.1.0"
__version__ = VERSION