feat: add StatelessProcessGroup to extend collective library #66
base: main
1b27b3f to
f989a80
Compare
|
@weixiao-huang @HubertZhang please review this PR. It has been tested on both NPU and CUDA.
The same model was also tested using the default torch.distributed module.
|
|
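It seems this PR has to depend on vLLM, which is heavy and not an elegant approach. I think |
The default communication path is still torch.distributed; StatelessProcessGroup is only needed when communicating across resources. Without this support, the change cannot be merged into verl 😆, because an architecture that separates training from inference would not be supported. |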
|
Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into ps? The current import-based approach does not feel isolated enough. |
Added a new |
In most cases the logic stays the same as before. vLLM is only needed as a dependency when the custom distributed module is required, so checkpoint-engine remains a lightweight component. |
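The suggestion above can be sketched roughly as follows. Only the names `DistributedLib` and `dist` come from the comment; the method set, the `LocalDist` stand-in, and the `ParameterServer` constructor shown here are assumptions made for illustration, not the project's actual API.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DistributedLib(Protocol):
    """Hypothetical interface; method names loosely mirror torch.distributed."""

    def init_process_group(self, rank: int, world_size: int) -> None: ...

    def all_reduce(self, values: list[float]) -> list[float]: ...


class LocalDist:
    """Single-process stand-in, used only to exercise the interface."""

    def __init__(self) -> None:
        self.initialized = False

    def init_process_group(self, rank: int, world_size: int) -> None:
        self.rank, self.world_size = rank, world_size
        self.initialized = True

    def all_reduce(self, values: list[float]) -> list[float]:
        # With a single process, a sum reduction returns the input unchanged.
        return list(values)


class ParameterServer:
    """Hypothetical consumer that receives a backend instead of importing one."""

    def __init__(self, dist: DistributedLib) -> None:
        self.dist = dist
        self.dist.init_process_group(rank=0, world_size=1)

    def sync(self, values: list[float]) -> list[float]:
        return self.dist.all_reduce(values)
```

With this shape, a vLLM-backed implementation would only need to satisfy the same protocol, and the consumer would never import it directly.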
I tried it locally using:
# import torch.distributed as dist
import checkpoint_engine.distributed as dist
dist.init_process_group()
dist.all_reduce()
dist.xxxx()
If you need to use |
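One way a drop-in module like this could work is a thin facade whose module-level functions delegate to a swappable backend. The sketch below is an assumption about the general pattern, not the contents of `checkpoint_engine.distributed`; `use_backend`, `backend_name`, and the list-based `all_reduce` are hypothetical names chosen so the example runs without torch.

```python
from types import SimpleNamespace


def _default_all_reduce(values):
    # Placeholder for the default path (torch.distributed in the discussion).
    return list(values)


# Active backend; callers never touch this directly.
_backend = SimpleNamespace(name="default", all_reduce=_default_all_reduce)


def use_backend(backend):
    """Swap the active backend (hypothetical hook)."""
    global _backend
    _backend = backend


def backend_name():
    return _backend.name


def all_reduce(values):
    # Module-level function with the same call shape regardless of backend.
    return _backend.all_reduce(values)
```

Call sites keep writing `dist.all_reduce(...)` and only the code that selects the backend needs to know about the heavier dependency.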
|
If there are no further review comments, can this be merged? |
|
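By the way, should we abstract StatelessProcessGroup rather than dist? Using well-encapsulated high-level methods directly in ps looks much more convenient. I imagine the sub group part may get a bit more complex, but everything else should become much simpler. |
I don't quite follow this point. The collective communication is implemented by calling NCCLLibrary and HCCLLibrary inside PyNcclCommunicator (PyHcclCommunicator); StatelessProcessGroup is only used by the Communicator at init time. |
I took a closer look. What about
class Distributed(ABC):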
    ...
    @abstractmethod
    def sub_group(self, ranks: list[int]) -> "AbstractProcessGroup":
        ...
The communication-related parts of these two functions would become much simpler, directly using the passed-in |
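A runnable version of that idea might look like the sketch below. Only `Distributed`, `sub_group`, and `AbstractProcessGroup` appear in the comment; `world_size`, `ranks`, and the `Fake*` classes are hypothetical additions used to make the example self-contained, and no real communication takes place.

```python
from abc import ABC, abstractmethod


class AbstractProcessGroup(ABC):
    @abstractmethod
    def ranks(self) -> list[int]:
        """Global ranks that belong to this group."""


class Distributed(ABC):
    @abstractmethod
    def world_size(self) -> int: ...

    @abstractmethod
    def sub_group(self, ranks: list[int]) -> AbstractProcessGroup: ...


class FakeGroup(AbstractProcessGroup):
    def __init__(self, ranks: list[int]) -> None:
        self._ranks = sorted(ranks)

    def ranks(self) -> list[int]:
        return list(self._ranks)


class FakeDistributed(Distributed):
    """In-memory stand-in; a real backend would create a communicator here."""

    def __init__(self, world_size: int) -> None:
        self._world_size = world_size

    def world_size(self) -> int:
        return self._world_size

    def sub_group(self, ranks: list[int]) -> AbstractProcessGroup:
        bad = [r for r in ranks if not 0 <= r < self._world_size]
        if bad:
            raise ValueError(f"ranks out of range: {bad}")
        return FakeGroup(ranks)
```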
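As I understand it, the current abstraction matches the conventional way collective communication is used; changing it this way would go against that usage. |
In vllm, StatelessProcessGroup directly uses |
In vllm, StatelessProcessGroup is only used to transfer metadata: https://github.com/vllm-project/vllm/blob/main/vllm/distributed/utils.py#L146 . Data-plane transfer still uses pynccl(pynccl): https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/rlhf_utils.py#L15 |
Oh, I'm not saying StatelessProcessGroup should be used for data-plane transfer; I mean the data-plane transfer interface. In other words, what I'd like is |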
b0c6ca0 to
47a2561
Compare
6ad7671 to
0901e9f
Compare
|
I tested the rest and vllm_nccl looks fine. Could you rebase and then squash the commits roughly by feature? |
75268a4 to
d455a21
Compare
I have also tested the hccl part on my side and squashed everything into a single commit. Could you check whether it can be merged? |
Motivation
Add a stateless communication group. This enables more flexible creation of communication groups and resolves compatibility issues with other programs that also use
torch.distributed. vllm is currently supported, while sglang does not yet support
pyhccl, so sglang support depends on adding pyhccl to sglang. If the current approach is acceptable, we will provide an sglang version soon.