New allreduce algo for small message size #647

Binyang2014 · 2025-10-17T21:26:59Z

New algorithm for message sizes < 32KB. Command:
mpirun --allow-run-as-root -tag-output -np 8 -x LD_PRELOAD=/root/mscclpp/build/apps/nccl/libmscclpp_nccl.so -x MSCCLPP_DISABLE_CHANNEL_CACHE=1 ./build/all_reduce_perf -b 1K -e 32K -f 2 -c 1 -G 1 -n 100 -d half
Tested on H100.
Perf (new algorithm):

[1,0]<stdout>:#                                                              out-of-place                       in-place          
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:        1024           512      half     sum      -1     4.65    0.22    0.39      0     4.60    0.22    0.39      0
[1,0]<stdout>:        2048          1024      half     sum      -1     4.96    0.41    0.72      0     4.93    0.42    0.73      0
[1,0]<stdout>:        4096          2048      half     sum      -1     5.12    0.80    1.40      0     5.12    0.80    1.40      0
[1,0]<stdout>:        8192          4096      half     sum      -1     5.11    1.60    2.81      0     5.08    1.61    2.82      0
[1,0]<stdout>:       16384          8192      half     sum      -1     5.47    3.00    5.24      0     5.44    3.01    5.27      0
[1,0]<stdout>:       32768         16384      half     sum      -1     6.24    5.25    9.19      0     6.28    5.22    9.14      0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 3.29145

Old (previous algorithm):

[1,0]<stdout>:#                                                              out-of-place                       in-place          
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:        1024           512      half     sum      -1     5.02    0.20    0.36      0     5.12    0.20    0.35      0
[1,0]<stdout>:        2048          1024      half     sum      -1     5.28    0.39    0.68      0     5.29    0.39    0.68      0
[1,0]<stdout>:        4096          2048      half     sum      -1     5.45    0.75    1.32      0     5.46    0.75    1.31      0
[1,0]<stdout>:        8192          4096      half     sum      -1     5.50    1.49    2.61      0     5.51    1.49    2.60      0
[1,0]<stdout>:       16384          8192      half     sum      -1     5.79    2.83    4.95      0     5.80    2.82    4.94      0
[1,0]<stdout>:       32768         16384      half     sum      -1     7.36    4.45    7.79      0     7.36    4.46    7.80      0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 2.94887

Copilot

Pull Request Overview

This PR introduces a new allreduce algorithm optimized for small message sizes (< 32KB) using NVLS packet-based communication. In the author's H100 benchmark above, average bus bandwidth rises from about 2.95 GB/s to about 3.29 GB/s, roughly 12% higher than the previous implementation.

Key changes:

New AllreduceNvlsPacket algorithm specifically for messages ≤ 32KB
Enhanced SwitchChannelDeviceHandle with packet-based broadcast functionality
Updated algorithm selection logic to prioritize the new packet algorithm for small messages
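
The size-based selection described in the last item can be sketched roughly as follows. This is only an illustration: the enum, the constant, and the function name are hypothetical and are not the identifiers used in nccl.cu. The boundary is also an assumption, since the PR description says "< 32KB" while this summary says "≤ 32KB"; the sketch uses the inclusive form.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not the identifiers used in nccl.cu.
enum class AllreduceAlgo { NvlsPacket, Other };

// Small-message threshold taken from the PR summary (inclusive bound is an assumption).
constexpr std::size_t kSmallMessageBytes = 32 * 1024;

// Prefer the packet algorithm for small messages when NVLS is available;
// otherwise fall back to whichever algorithm would have been chosen before.
inline AllreduceAlgo selectAllreduceAlgo(std::size_t bytes, bool nvlsSupported) {
  if (nvlsSupported && bytes <= kSmallMessageBytes) {
    return AllreduceAlgo::NvlsPacket;
  }
  return AllreduceAlgo::Other;
}
```

Under these assumptions, a 1 KB message on an NVLS-capable system maps to `AllreduceAlgo::NvlsPacket`, while a 64 KB message or a system without NVLS maps to `AllreduceAlgo::Other`.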

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File	Description
include/mscclpp/switch_channel_device.hpp	Added packet broadcast specialization and multimem store for LLPacket
apps/nccl/src/nccl.cu	Integrated new algorithm, optimized selection logic with static variables, fixed control flow
apps/nccl/src/allreduce.hpp	Declared AllreduceNvlsPacket class and allreduceNvlsPacket kernel
apps/nccl/src/allreduce.cu	Implemented AllreduceNvlsPacket algorithm with adapter and kernel launch logic
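
For readers unfamiliar with packet-based (low-latency) communication, the idea is that each packet carries a flag next to its payload, so the receiver can tell the data has arrived by checking the flag instead of waiting on a separate synchronization step. The host-side sketch below illustrates only that idea. The struct layout and function names are assumptions made for this example, not the LLPacket definition from mscclpp, and a real GPU kernel would write the packet with a single wide store and poll on the receiving side.

```cpp
#include <cstdint>

// Illustrative 16-byte packet: two 32-bit data words, each paired with a flag.
// The layout is an assumption for this sketch, not copied from mscclpp.
struct Packet {
  std::uint32_t data1;
  std::uint32_t flag1;
  std::uint32_t data2;
  std::uint32_t flag2;
};

// Sender side: store both payload words together with the current flag value.
inline void writePacket(Packet& p, std::uint32_t lo, std::uint32_t hi, std::uint32_t flag) {
  p = Packet{lo, flag, hi, flag};
}

// Receiver side: the payload is treated as valid only when both flags
// match the expected value. Returns false if the packet is not ready.
inline bool tryReadPacket(const Packet& p, std::uint32_t expectedFlag,
                          std::uint32_t& lo, std::uint32_t& hi) {
  if (p.flag1 != expectedFlag || p.flag2 != expectedFlag) {
    return false;
  }
  lo = p.data1;
  hi = p.data2;
  return true;
}
```

Changing the flag value on each round lets the same buffer be reused: a packet left over from an earlier round carries a stale flag and is rejected by `tryReadPacket`.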

caiomcbr

LGTM

Binyang2014 added 3 commits October 17, 2025 03:38

Fix

455fd35

new allreduce algo

566b69d

WIP

370929c

Binyang2014 requested review from caiomcbr, chhwang and Copilot October 17, 2025 21:27

Copilot AI reviewed Oct 17, 2025

fix

e12c1b0

Binyang2014 marked this pull request as ready for review October 17, 2025 21:50

fix

32f6704

caiomcbr approved these changes Oct 20, 2025

chhwang approved these changes Oct 20, 2025

Binyang2014 added 4 commits October 21, 2025 22:28

address comments

7eaa2fd

WIP

a147bd9

Merge branch 'main' into binyli/algo

0d6db9b

Merge branch 'main' into binyli/algo

8f1ee8b

chhwang approved these changes Oct 23, 2025

Binyang2014 merged commit cbf448b into main Oct 23, 2025
14 checks passed

Binyang2014 deleted the binyli/algo branch October 23, 2025 00:28
