@Binyang2014 Binyang2014 commented Oct 17, 2025

New algorithm for message sizes < 32KB. Benchmark command:
mpirun --allow-run-as-root -tag-output -np 8 -x LD_PRELOAD=/root/mscclpp/build/apps/nccl/libmscclpp_nccl.so -x MSCCLPP_DISABLE_CHANNEL_CACHE=1 ./build/all_reduce_perf -b 1K -e 32K -f 2 -c 1 -G 1 -n 100 -d half
Tested on H100.
Perf (new algorithm):

[1,0]<stdout>:#                                                              out-of-place                       in-place          
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:        1024           512      half     sum      -1     4.65    0.22    0.39      0     4.60    0.22    0.39      0
[1,0]<stdout>:        2048          1024      half     sum      -1     4.96    0.41    0.72      0     4.93    0.42    0.73      0
[1,0]<stdout>:        4096          2048      half     sum      -1     5.12    0.80    1.40      0     5.12    0.80    1.40      0
[1,0]<stdout>:        8192          4096      half     sum      -1     5.11    1.60    2.81      0     5.08    1.61    2.82      0
[1,0]<stdout>:       16384          8192      half     sum      -1     5.47    3.00    5.24      0     5.44    3.01    5.27      0
[1,0]<stdout>:       32768         16384      half     sum      -1     6.24    5.25    9.19      0     6.28    5.22    9.14      0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 3.29145
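
The algbw and busbw columns can be re-derived from the size and time columns. The sketch below assumes the nccl-tests convention for all-reduce, where algbw = size / time and busbw = algbw × 2(n−1)/n, with n = 8 ranks taken from `-np 8` in the command. The function name is illustrative and not part of any tool.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all_reduce_perf row."""
    # bytes per microsecond -> GB/s: 1 GB = 1e9 B and 1 s = 1e6 us, so divide by 1e3
    algbw = size_bytes / time_us / 1e3
    # All-reduce bus bandwidth factor assumed from the nccl-tests convention
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Out-of-place rows from the table above: (size in bytes, time in us)
rows = [
    (1024, 4.65),
    (2048, 4.96),
    (4096, 5.12),
    (8192, 5.11),
    (16384, 5.47),
    (32768, 6.24),
]

for size, time_us in rows:
    algbw, busbw = allreduce_bandwidth(size, time_us, n_ranks=8)
    print(f"{size:>6} B  algbw={algbw:.2f} GB/s  busbw={busbw:.2f} GB/s")
```

Because the reported times are rounded to two decimals, values recomputed this way may differ from the logged columns in the last digit.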

Old algorithm:

[1,0]<stdout>:#                                                              out-of-place                       in-place          
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:        1024           512      half     sum      -1     5.02    0.20    0.36      0     5.12    0.20    0.35      0
[1,0]<stdout>:        2048          1024      half     sum      -1     5.28    0.39    0.68      0     5.29    0.39    0.68      0
[1,0]<stdout>:        4096          2048      half     sum      -1     5.45    0.75    1.32      0     5.46    0.75    1.31      0
[1,0]<stdout>:        8192          4096      half     sum      -1     5.50    1.49    2.61      0     5.51    1.49    2.60      0
[1,0]<stdout>:       16384          8192      half     sum      -1     5.79    2.83    4.95      0     5.80    2.82    4.94      0
[1,0]<stdout>:       32768         16384      half     sum      -1     7.36    4.45    7.79      0     7.36    4.46    7.80      0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 2.94887 
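
To make the comparison between the two tables above explicit, the following sketch computes the per-size latency reduction from the out-of-place time columns and the relative gain in the reported average bus bandwidth. The numbers are copied from the logs; the helper name is illustrative.

```python
sizes = [1024, 2048, 4096, 8192, 16384, 32768]
old_us = [5.02, 5.28, 5.45, 5.50, 5.79, 7.36]  # out-of-place time, old algorithm
new_us = [4.65, 4.96, 5.12, 5.11, 5.47, 6.24]  # out-of-place time, new algorithm


def latency_reduction_pct(old, new):
    """Percentage by which the new time is lower than the old time."""
    return (old - new) / old * 100.0


for size, old, new in zip(sizes, old_us, new_us):
    print(f"{size:>6} B  {old:.2f} us -> {new:.2f} us  "
          f"({latency_reduction_pct(old, new):.1f}% lower)")

# "Avg bus bandwidth" lines from the two logs
avg_busbw_old = 2.94887
avg_busbw_new = 3.29145
avg_gain_pct = (avg_busbw_new / avg_busbw_old - 1.0) * 100.0
print(f"Average bus bandwidth gain: {avg_gain_pct:.1f}%")
```

On these numbers the latency reduction ranges from about 5.5% (16 KB) to about 15% (32 KB), and the average bus bandwidth gain is about 11.6%.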

Copilot AI left a comment

Pull Request Overview

This PR introduces a new allreduce algorithm optimized for small message sizes (< 32KB) using NVLS packet-based communication. In the H100 benchmark reported in the PR description, average bus bandwidth rises from 2.95 GB/s to 3.29 GB/s (approximately 12%) compared to the previous implementation.

Key changes:

  • New AllreduceNvlsPacket algorithm specifically for messages ≤ 32KB
  • Enhanced SwitchChannelDeviceHandle with packet-based broadcast functionality
  • Updated algorithm selection logic to prioritize the new packet algorithm for small messages
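
The size-based selection described in the last bullet can be pictured as a simple dispatch on message size. This is only a sketch under stated assumptions: the function name, the `nvls_supported` flag, the fallback label, and the exact threshold comparison are all hypothetical (the conversation states both "< 32KB" and "≤ 32KB"). The actual logic lives in apps/nccl/src/nccl.cu and is not reproduced here.

```python
SMALL_MESSAGE_LIMIT_BYTES = 32 * 1024  # assumed from the "32KB" mentioned in the PR


def select_allreduce_algorithm(message_bytes, nvls_supported, inclusive=True):
    """Return a label for the algorithm a size-based selector might pick.

    Hypothetical helper for illustration only. "other" is a placeholder
    for whatever algorithm the real selector falls back to.
    """
    if inclusive:
        is_small = message_bytes <= SMALL_MESSAGE_LIMIT_BYTES
    else:
        is_small = message_bytes < SMALL_MESSAGE_LIMIT_BYTES

    # Assumption: the packet algorithm is only eligible when NVLS is available.
    if nvls_supported and is_small:
        return "AllreduceNvlsPacket"
    return "other"


for size in (1024, 32 * 1024, 64 * 1024):
    print(size, select_allreduce_algorithm(size, nvls_supported=True))
```

The `inclusive` parameter exists only to show how the two threshold readings differ at exactly 32 KB.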

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| include/mscclpp/switch_channel_device.hpp | Added packet broadcast specialization and multimem store for LLPacket |
| apps/nccl/src/nccl.cu | Integrated new algorithm, optimized selection logic with static variables, fixed control flow |
| apps/nccl/src/allreduce.hpp | Declared AllreduceNvlsPacket class and allreduceNvlsPacket kernel |
| apps/nccl/src/allreduce.cu | Implemented AllreduceNvlsPacket algorithm with adapter and kernel launch logic |

@Binyang2014 Binyang2014 marked this pull request as ready for review October 17, 2025 21:50

@caiomcbr caiomcbr left a comment

LGTM

@Binyang2014 Binyang2014 merged commit cbf448b into main Oct 23, 2025
14 checks passed
@Binyang2014 Binyang2014 deleted the binyli/algo branch October 23, 2025 00:28