Commit afb3e52 (2 parents: 8c02606 + fe3663c)

Merge pull request #948 from rohanbabbar04/final-blog-rohanbabbar04

Add Final Project Blog

File tree: 5 files changed (+195, −0 lines changed)
Binary file (126 KB) not shown.
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
---
title: "Final Update (Mid-Term -> Final): MPI Appliance for HPC Research on Chameleon"
subtitle: ""
summary:
authors:
  - rohanbabbar04
tags: ["osre25", "reproducibility", "MPI", "cloud computing"]
categories: ["osre25", "reproducibility", "HPC", "MPI"]
date: 2025-08-31
lastmod: 2025-08-31
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: "Chameleon Cloud"
  focal_point: Top
  preview_only: false
---

Hi everyone! This is my final update, covering the progress made every two weeks from the midterm to the end of the
project [MPI Appliance for HPC Research on Chameleon](https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/), developed
in collaboration with Argonne National Laboratory and the Chameleon Cloud community.
This blog follows up on my earlier post, which you can find [here](https://ucsc-ospo.github.io/report/osre25/uchicago/mpi/20250803-rohan-babbar/).

### 🔧 July 29 – August 11, 2025

With the CUDA- and MPI-Spack–based appliances published, we considered releasing another image variant (ROCm-based) for AMD GPUs.
It will be used primarily at CHI@TACC, which provides AMD GPUs. We have successfully published a new image on Chameleon titled [MPI and Spack for HPC (Ubuntu 22.04 - ROCm)](https://chameleoncloud.org/appliances/131/),
and we also added an example to demonstrate its usage.

### 🔧 August 12 – August 25, 2025

With the examples now available on Trovi for creating an MPI cluster using Ansible and Python-CHI, my next step was to experiment with stack orchestration using Heat Orchestration Templates (HOT) on the OpenStack-based Chameleon Cloud.
This turned out to be more challenging due to a few restrictions:

1) **OS::Nova::KeyPair (new version)**: In the latest OpenStack version, the stack fails to launch if the `public_key` parameter is not provided for the keypair, as auto-generation is no longer supported (see the fragment below).
2) **OS::Heat::SoftwareConfig**: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.
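
To illustrate the first restriction, here is a minimal, hypothetical HOT fragment (not the published template; the parameter and resource names are placeholders) showing the `public_key` being supplied explicitly instead of relying on auto-generation:

```yaml
heat_template_version: 2018-08-31

parameters:
  key_name:
    type: string
    description: Name under which the keypair is registered with Nova
  public_key:
    type: string
    description: Contents of an existing public key (e.g. ~/.ssh/id_rsa.pub)

resources:
  cluster_keypair:
    type: OS::Nova::KeyPair
    properties:
      name: { get_param: key_name }
      # Newer releases no longer auto-generate a key, so the public key
      # must be passed in explicitly or the stack fails to launch.
      public_key: { get_param: public_key }
```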

To address these issues, we adopted a new strategy for configuring and creating the MPI cluster: using a temporary bootstrap node.

In simple terms, the workflow of the Heat template is:

1) Provision master and worker nodes via the HOT template on OpenStack.
2) Launch a bootstrap node, install Git and Ansible on it, and then run an Ansible playbook from the bootstrap node to configure the master and worker nodes. This includes setting up SSH, host communication, and the MPI environment.

This provides an alternative method for creating an MPI cluster.

We presented this work on August 26, 2025, to the Chameleon Team and the Argonne MPICH Team. The project was very well received.

Stay tuned for my final report on this work, which I’ll be sharing in my next blog post.
Binary file (126 KB) not shown.

Binary file (38.6 KB) not shown.
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: "Final Report: MPI Appliance for HPC Research on Chameleon"
subtitle: ""
summary:
authors:
  - rohanbabbar04
tags: ["osre25", "reproducibility", "MPI", "cloud computing"]
categories: ["osre25", "reproducibility", "HPC", "MPI"]
date: 2025-09-01
lastmod: 2025-09-01
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: "Chameleon Cloud"
  focal_point: Top
  preview_only: false
---

Hi everyone! This is my final report on the project I completed this summer as a [Summer of Reproducibility (SOR)](https://ucsc-ospo.github.io/sor/) student.
The project, titled "[MPI Appliance for HPC Research on Chameleon](https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/)," was undertaken in collaboration with Argonne National Laboratory
and the Chameleon Cloud community, and was mentored by {{% mention raffenettik %}}.
This blog details the work and outcomes of the project.

## Background

Message Passing Interface (MPI) is the backbone of high-performance computing (HPC), enabling efficient scaling across thousands of
processing cores. However, reproducing MPI-based experiments remains challenging due to dependencies on specific library versions,
network configurations, and multi-node setups.

To address this, we introduced a reproducibility initiative that provides standardized MPI environments on the Chameleon testbed.
The environment is set up as a master–worker MPI cluster: the master node manages tasks and communication, while the worker nodes perform the computations.
All nodes share the same MPI libraries, software, and network settings, making experiments easier to scale and reproduce.

## Objectives

The aim of this project was to create an MPI cluster that is reproducible, easily deployable, and easy to configure.

The key objectives were:

1) Pre-built MPI images: Create ready-to-use images with MPI and all dependencies installed.

2) Automated cluster configuration: Develop Ansible playbooks to configure master–worker communication, including host setup, SSH key distribution, and MPI configuration across nodes.

3) Cluster orchestration: Develop orchestration templates to provision resources and invoke the Ansible playbooks for automated cluster setup.

## Implementation Strategy and Deliverables

### OpenStack Image Creation

The first step was to create a standardized pre-built image, which serves as the base image for all nodes in the cluster.

Some important features of the image include:

1) Built on Ubuntu 22.04 for a stable base environment.
2) [Spack](https://spack.io/) + Lmod integration (see the sketch after this list):
   - Spack handles reproducible, version-controlled installations of software packages.
   - Lmod (Lua Modules) provides a user-friendly way to load/unload software environments dynamically.
   - Together, they allow users to easily switch between MPI versions, libraries, and GPU toolkits.
3) [MPICH](https://github.com/pmodels/mpich) and [OpenMPI](https://github.com/open-mpi/ompi) pre-installed for standard MPI support; both can be loaded and unloaded as modules.
4) Three image variants for various HPC workloads: CPU-only, NVIDIA GPU (CUDA 12.8), and AMD GPU (ROCm 6.4.2).
66+
These images have been published and are available in the Chameleon Cloud Appliance Catalog:
67+
68+
- [MPI and Spack for HPC (Ubuntu 22.04)](https://chameleoncloud.org/appliances/127/) - CPU Only
69+
- [MPI and Spack for HPC (Ubuntu 22.04 - CUDA)](https://chameleoncloud.org/appliances/130/) - NVIDIA GPU (CUDA 12.8)
70+
- [MPI and Spack for HPC (Ubuntu 22.04 - ROCm)](https://chameleoncloud.org/appliances/131/) - AMD GPU (ROCm 6.4.2)
71+
72+
### Cluster Configuration using Ansible
73+
74+
The next step is to create scripts/playbooks to configure these nodes and set up an HPC cluster.
75+
We assigned specific roles to different nodes in the cluster and combined them into a single playbook to configure the entire cluster automatically.
76+
77+
Some key steps the playbook performs:
78+
79+
1) Configure /etc/hosts entries for all nodes.
80+
2) Mount Manila NFS shares on each node.
81+
3) Generate an SSH key pair on the master node and add the master’s public key to the workers’ authorized_keys.
82+
4) Scan worker node keys and update known_hosts on the master.
83+
5) (Optional) Manage software:
84+
- Install new compilers with Spack
85+
- Add new Spack packages
86+
- Update environment modules to recognize them
87+
6) Create a hostfile at /etc/mpi/hostfile.
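
For illustration, here is a minimal sketch of what tasks for steps 1 and 6 could look like. It is not an excerpt from the actual playbook; the group names, inventory variables, and file layout are assumptions made only for this example:

```yaml
# illustrative-tasks.yml - a hypothetical excerpt, not the published playbook
- name: Configure cluster hosts and MPI hostfile
  hosts: all
  become: true
  tasks:
    - name: Add every cluster node to /etc/hosts
      ansible.builtin.lineinfile:
        path: /etc/hosts
        line: "{{ hostvars[item].ansible_host }} {{ item }}"
        state: present
      loop: "{{ groups['all'] }}"

    - name: Ensure the MPI config directory exists
      ansible.builtin.file:
        path: /etc/mpi
        state: directory
        mode: "0755"

    - name: Create the MPI hostfile listing the worker nodes
      ansible.builtin.copy:
        dest: /etc/mpi/hostfile
        content: |
          {% for host in groups['workers'] %}
          {{ hostvars[host].ansible_host }}
          {% endfor %}
        mode: "0644"
```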

The code is publicly available in the GitHub repository: https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact

### Orchestration

With the image created and deployed, and the Ansible playbooks ready for cluster configuration, we put everything
together to orchestrate the cluster deployment.

This can be done in two primary ways:

#### Python-CHI (Jupyter) + Ansible

[Python-CHI](https://github.com/ChameleonCloud/python-chi) is a Python library designed to facilitate interaction with the Chameleon testbed, often used within environments such as Jupyter notebooks.

The setup works as follows:

1) Create leases, launch instances, and set up shared storage using python-chi commands.
2) Automatically generate `inventory.ini` for Ansible based on the launched instances (sketched below).
3) Run the Ansible playbook programmatically using `ansible_runner`.
4) Outcome: a fully configured, ready-to-use HPC cluster; SSH into the master node to run examples.
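
The generated inventory is simply a mapping of the launched instances to the master and worker groups. A hypothetical example of what it conceptually contains is shown below, written in Ansible's YAML inventory format for readability (the artifact itself emits an `inventory.ini`; the group names and addresses here are made up):

```yaml
# inventory.yml - hypothetical equivalent of the auto-generated inventory.ini
all:
  children:
    master:
      hosts:
        mpi-master:
          ansible_host: 10.52.0.10
    workers:
      hosts:
        mpi-worker-1:
          ansible_host: 10.52.0.11
        mpi-worker-2:
          ansible_host: 10.52.0.12
  vars:
    ansible_user: cc          # default user on Chameleon Ubuntu images
    ansible_ssh_private_key_file: ~/.ssh/id_rsa
```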

If you would like to see a working example, you can view it in the [Trovi example](https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17).

#### Heat Orchestration Template

A Heat Orchestration Template (HOT) is a YAML-based configuration file that defines a stack, automating
the deployment and configuration of OpenStack cloud resources.

**Challenges**

We faced some challenges while working with Heat templates and stacks, in particular on Chameleon Cloud:

1) `OS::Nova::KeyPair` (new version): In the latest OpenStack version, the stack fails to launch if the `public_key` parameter is not provided for the keypair,
as auto-generation is no longer supported.
2) `OS::Heat::SoftwareConfig`: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.

![Heat Approach](heatapproach.png)

To tackle these challenges, we designed an approach that is both easy to implement and reproducible. First, we provision the master and worker nodes
using the HOT template on OpenStack. Next, we set up a bootstrap node, install Git and Ansible on it,
and run an Ansible playbook from the bootstrap node to configure the master and worker nodes, including SSH, host communication, and
the MPI setup. The outcome is a fully configured, ready-to-use HPC cluster, where users can simply SSH into the master node to run examples.
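
As a rough sketch of how the bootstrap step can be wired into a template, the bootstrap node's `user_data` installs Git and Ansible and then runs the playbook against the other nodes. This is illustrative only, not the published appliance template: the resource names, flavor, network, playbook filename, and the assumption of single `master_node`/`worker_node` server resources defined elsewhere in the template are all placeholders for this example.

```yaml
# Illustrative fragment only - not the published appliance template.
resources:
  bootstrap_node:
    type: OS::Nova::Server
    properties:
      name: mpi-bootstrap
      image: { get_param: image_name }
      flavor: { get_param: flavor }        # e.g. "baremetal" on Chameleon
      key_name: { get_param: key_name }
      networks:
        - network: { get_param: network_name }
      user_data_format: RAW
      user_data:
        str_replace:
          template: |
            #!/bin/bash
            # Install Git and Ansible, fetch the playbooks, and configure the cluster.
            apt-get update && apt-get install -y git ansible
            git clone https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact.git /opt/mpi-setup
            cd /opt/mpi-setup
            # Build a minimal inventory from the addresses Heat injects below
            # (inventory and playbook names are placeholders for this sketch).
            printf '[master]\n%s\n\n[workers]\n%s\n' "$master_ip" "$worker_ips" > inventory.ini
            ansible-playbook -i inventory.ini cluster.yml
          params:
            $master_ip: { get_attr: [master_node, first_address] }
            $worker_ips: { get_attr: [worker_node, first_address] }
```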

Users can view and use the template published in the Appliance Catalog: [MPI+Spack Bare Metal Cluster](https://chameleoncloud.org/appliances/132/).
For example, a demonstration of how to pass parameters is available on [Trovi](https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17).

## Conclusion

In conclusion, this work demonstrates a reproducible approach to building and configuring MPI clusters on the Chameleon testbed. By using standardized images,
Ansible automation, and orchestration templates, we ensure that every node is set up consistently, reducing manual effort and errors. The artifact, published on Trovi,
makes the entire process transparent, reusable, and easy to implement, enabling researchers to reliably recreate and extend the cluster environment for their own
experiments.

## Future Work

Future work includes maintaining these images and, possibly, creating a script that reproduces the MPI and Spack setup on a different base image.
