Question
My program controls multiple GPUs on a single node from a single process with multiple threads (for example, 1 process with 8 threads controlling 8 GPUs on a node). How can I implement pure RDMA transfers between GPUs, either within a single node or across nodes, using NVSHMEM (for example, gpu0-on-node0 -> gpu1-on-node0, or gpu2-on-node0 -> gpu3-on-node1)?
I've read the documentation, and it seems that the NVSHMEM host API doesn't support concurrent use from multiple threads. I can't find a way to create 2 x 8 PEs that map to 2 x 8 GPUs (two nodes, each with eight GPUs) using a single process with multiple threads on each node.
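
To make the layout I'm after concrete, here is a minimal sketch of the thread-per-GPU arrangement and the PE numbering I would like to end up with. It makes no NVSHMEM or CUDA calls; the names `kNodes`, `kGpusPerNode`, `global_pe`, and `pe_ids_for_node` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical constants for the layout in the question: 2 nodes x 8 GPUs.
constexpr int kNodes = 2;
constexpr int kGpusPerNode = 8;

// Hypothetical helper: the flat PE index I would like each GPU to have.
int global_pe(int node, int local_gpu) {
    assert(node >= 0 && node < kNodes);
    assert(local_gpu >= 0 && local_gpu < kGpusPerNode);
    return node * kGpusPerNode + local_gpu;
}

// One process per node, one host thread per GPU. Each thread only records
// the PE index it would want to own; a real worker would select its GPU
// and issue transfers at this point instead.
std::vector<int> pe_ids_for_node(int node) {
    std::vector<int> ids(kGpusPerNode, -1);
    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < kGpusPerNode; ++gpu) {
        // Each thread writes a distinct element, so no locking is needed.
        workers.emplace_back([&ids, node, gpu] { ids[gpu] = global_pe(node, gpu); });
    }
    for (auto& w : workers) w.join();
    return ids;
}
```

With this numbering, gpu2-on-node0 would be PE 2 and gpu3-on-node1 would be PE 11, so the cross-node example above would be a transfer from PE 2 to PE 11.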