Conversation

@praveingk

What?

Added UCCL backend support to NIXL.

Why?

UCCL is an efficient communication library for GPUs that supports collectives, P2P transfers, and GPU-driven communication for expert parallelism (EP). UCCL focuses on flexibility for fast-evolving ML workloads and on portability across heterogeneous GPU and NIC vendors. It provides a software transport stack that runs on the CPU and is easily extensible with communication optimization techniques such as congestion control, multipathing, and efficient loss recovery.

How?

  1. Added a basic UCCL plugin for inter-node transfers over RDMA, with further enhancements on the roadmap.
  2. Added a test in gtest/plugins to exercise basic transfers.
  3. Provided references for using the UCCL backend (see the usage sketch below).
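
For reference, a minimal sketch of bringing up the UCCL backend, assuming the standard `nixlAgent` C++ API; the backend name string (see the capitalization question below) and parameter keys are illustrative, not taken from this PR:

```cpp
// Minimal sketch (illustrative): create a NIXL agent and instantiate the
// UCCL backend. Backend name and parameters are assumptions; check the
// plugin's registration and the referenced UCCL docs for exact values.
#include <nixl.h>

int main() {
    nixlAgentConfig cfg(true);  // progress-thread flag
    nixlAgent agent("initiator", cfg);

    nixl_b_params_t params;     // backend-specific key/value options
    nixlBackendH *backend = nullptr;
    if (agent.createBackend("UCCL", params, backend) != NIXL_SUCCESS)
        return 1;

    // From here, register memory, exchange metadata, and post transfers
    // exactly as with any other NIXL backend.
    return 0;
}
```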

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@praveingk praveingk requested a review from a team as a code owner October 13, 2025 08:47

copy-pr-bot bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi praveingk! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@brminich
Contributor

General question: What is the added value of the UCCL plugin? I notice quite a few missing features (intra-node support, progress thread, multi-GPU).

@praveingk
Author

> General question: What is the added value of the UCCL plugin? I notice quite a few missing features (intra-node support, progress thread, multi-GPU).

@brminich Thanks a lot for your review. UCCL's added value is its extensibility. It provides a way to program the control logic of the GPU networking stack, enabling newer congestion-control protocols, loss-recovery features, and multi-path transport, which are usually not easily programmable because they require NIC hardware or configuration changes. This is possible because UCCL runs the control logic on the CPU. We have seen significant performance improvements running UCCL compared to NCCL, and I am currently seeing a 10% TTFT improvement with the UCCL backend in my preliminary evaluation between two cross-rack nodes.

Additionally, UCCL provides a unified stack for collectives, P2P, and EP, which makes it possible to customize the control logic per communication type. UCCL also brings heterogeneous GPU and NIC vendor support. Additional transport types such as TCP and TCP-X are being added to UCCL P2P.

I will add intra-node support (already available in UCCL) and progress-thread support to the UCCL backend in a separate PR in the coming days.

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@brminich
Contributor

> > General question: What is the added value of the UCCL plugin? […]
>
> @brminich Thanks a lot for your review. UCCL's added value is its extensibility. […] I will add intra-node support (already available in UCCL) and progress-thread support to the UCCL backend in a separate PR in the coming days.

Collectives are not utilized by NIXL. Could you please elaborate a bit more on the aspects of extensibility and congestion control? How would these be leveraged in NIXL use cases, particularly in terms of performance?

@praveingk
Author

> Collectives are not utilized by NIXL. Could you please elaborate a bit more on the aspects of extensibility and congestion control? How would these be leveraged in NIXL use cases, particularly in terms of performance?

UCCL currently provides various congestion-control algorithms, both sender-driven (TIMELY, SWIFT) and receiver-driven (EQDS), which can be chosen based on workload characteristics. Consider a NIXL-specific use case, PD (prefill/decode) disaggregation: receiver-driven congestion control could help when specific decode pods see bursty incast traffic from different prefill pods. In addition, UCCL emulates packet spraying to leverage the available network paths and avoid a "single path of congestion", while running congestion control on each path separately. At scale, this could spread traffic evenly through the core of the network, alleviating the congestion, packet loss, and tail latency that directly contribute to TTFT in PD disaggregation. In my preliminary experiments, the UCCL backend achieved 10% lower TTFT than UCX when prefill/decode pods were allocated cross-rack, since it splits a single 3 GB KV-cache message into smaller chunks spread across the available paths.

Finally, UCCL separates the optimization logic from the heterogeneous GPU/NIC/transport hardware logic. Hence, the same optimizations can be applied to different NICs (currently tested with NVIDIA and Broadcom NICs) and different transport types (AF_XDP-based user-space TCP, RDMA, EFA, GPU Direct TCP-X, etc.). In this first version, the UCCL backend supports RDMA; adding more transport types is on the UCCL P2P agenda (AF_XDP-based user-space TCP and Amazon's EFA are currently available in UCCL collectives). Since the optimizations run on the CPU, they can evolve as workloads and transports evolve.

@ovidiusm ovidiusm self-requested a review November 18, 2025 12:43
@brminich
Contributor

Can we please also add a nixlbench run with the UCCL plugin to the CI in https://github.com/ai-dynamo/nixl/blob/main/.gitlab/test_nixlbench.sh?

@ovidiusm ovidiusm (Contributor) left a comment

Thank you for your contribution! I left a few comments/questions; feel free to address them as you think best. I am not sure I understood some of the functionality correctly.

init["num_threads"] = str(nixl_conf.num_threads)
elif bknd == "GDS_MT":
init["thread_count"] = str(nixl_conf.num_threads)
elif bknd == "Uccl":

Contributor

Should this be capitalized as UCCL?


Refer to [README](https://github.com/uccl-project/uccl/tree/main/collective/rdma#environment-variables-in-uccl) for the complete list of environment variables that can be set to customize UCCL.

**Important**: For `NIXL_READ` operations, set `UCCL_RCMODE=1`. By default, UCCL uses RDMA UC (Unreliable Connected); `READ` operations, however, require RDMA RC (Reliable Connected).

Contributor

Given that, should we reject read operations if the RCMODE is not set correctly?
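
For illustration, a guard along these lines could fail fast; the helper name and its placement are assumptions, not the plugin's actual code:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when UCCL is configured for RDMA RC, which
// READ operations require.
static bool ucclRcModeEnabled() {
    const char *rc = std::getenv("UCCL_RCMODE");
    return rc != nullptr && std::strcmp(rc, "1") == 0;
}

// Sketch of the check inside the backend (illustrative placement):
//   if (operation == NIXL_READ && !ucclRcModeEnabled())
//       return NIXL_ERR_NOT_SUPPORTED; // reject early, not mid-transfer
```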

nixlUcclEngine::registerMem(const nixlBlobDesc &mem,
const nixl_mem_t &nixl_mem,
nixlBackendMD *&out) {
std::lock_guard<std::mutex> lock(mutex_);

Contributor

It seems that the mutex is protecting mem_reg_info_, but there is no protection in prepXfer/postXfer/genNotif. Is that intentional?
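
As a generic sketch of the concern (simplified stand-in types, not the plugin's actual code): every reader and writer of the shared map needs the same lock:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Simplified stand-in for the engine's shared registration state.
struct Engine {
    std::mutex mutex_;
    std::unordered_map<uintptr_t, int> mem_reg_info_;

    void registerMem(uintptr_t addr, int md) {
        std::lock_guard<std::mutex> lock(mutex_);  // writer holds the lock
        mem_reg_info_[addr] = md;
    }

    // A prepXfer-style lookup must take the same lock, or it can race with
    // concurrent registrations/deregistrations of the map.
    bool lookup(uintptr_t addr, int &md) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = mem_reg_info_.find(addr);
        if (it == mem_reg_info_.end()) return false;
        md = it->second;
        return true;
    }
};
```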

engine_ = nullptr;
}

if (listener_thread_.joinable()) {

Contributor

Shouldn't the listener thread be destroyed first?
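
A generic sketch of the suggested teardown order (the stop flag and member names are illustrative):

```cpp
#include <atomic>
#include <thread>

struct ListenerOwner {
    std::atomic<bool> stop_{false};
    std::thread listener_thread_;

    ~ListenerOwner() {
        // Signal and join the listener first, while the engine state it
        // may still be touching is valid ...
        stop_ = true;
        if (listener_thread_.joinable())
            listener_thread_.join();
        // ... and only then release engine_ and other shared resources.
    }
};
```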

if (retry_count < max_retries) {
NIXL_DEBUG << "Failed to get FIFO item, retry " << retry_count << "/"
<< max_retries << " for item " << i;
std::this_thread::sleep_for(std::chrono::milliseconds(1));

Contributor

I do not understand why this is a blocking operation in prepXfer.


bool all_done = true;
for (uint64_t transfer_id : uccl_handle->transfer_ids) {
if (std::find(uccl_handle->completed_transfer_ids.begin(),

Contributor

completed_transfer_ids is a vector, so this find is slow. I suggest using an unordered set (`std::unordered_set`).

Contributor

Note that we have seen batches of up to 64K transfers in practice, so this is important.
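
A sketch of the suggested change, assuming `transfer_ids` stays a vector and completions move to an `std::unordered_set`:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// With completed ids in a hash set, checking a batch of N transfers is
// O(N) on average, instead of O(N^2) with std::find over a vector.
static bool allDone(const std::vector<uint64_t> &transfer_ids,
                    const std::unordered_set<uint64_t> &completed) {
    for (uint64_t id : transfer_ids)
        if (completed.count(id) == 0)
            return false;
    return true;
}
```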

}

bool all_done = true;
for (uint64_t transfer_id : uccl_handle->transfer_ids) {

Contributor

Would it be worth using a set for the pending ones, so that we do not recheck the completed ones every iteration?
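
A sketch of that idea (names illustrative): track only the pending ids and erase them as completions arrive, so completed transfers are never rechecked:

```cpp
#include <cstdint>
#include <unordered_set>

struct XferState {
    std::unordered_set<uint64_t> pending_;  // ids still in flight

    void onComplete(uint64_t id) { pending_.erase(id); }

    // Each status poll only touches what is still outstanding.
    bool allDone() const { return pending_.empty(); }
};
```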

conn = uccl_engine_connect(engine_, ip_addr, gpu_index, port);
if (!conn) {
NIXL_ERROR << "Failed to connect to remote agent " << remote_agent;
delete[] ip_addr;

Contributor

Minor style comment: it might be less error-prone to use a smart pointer such as `std::unique_ptr`.
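
A self-contained sketch of the suggestion (the helper and address are made up): own the string in a `std::unique_ptr<char[]>` so every early-return path frees it without a manual `delete[]`:

```cpp
#include <cstring>
#include <memory>

// Copy a C string into RAII-managed storage (illustrative helper).
static std::unique_ptr<char[]> copyString(const char *s) {
    auto buf = std::make_unique<char[]>(std::strlen(s) + 1);
    std::strcpy(buf.get(), s);
    return buf;
}

int main() {
    std::unique_ptr<char[]> ip_addr = copyString("10.0.0.1");
    // Pass ip_addr.get() to uccl_engine_connect(); on any error path the
    // buffer is released automatically when ip_addr goes out of scope.
    return 0;
}
```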

@praveingk
Author

> Thank you for your contribution! I left a few comments/questions; feel free to address them as you think best. I am not sure I understood some of the functionality correctly.

Thanks for your comments, @ovidiusm and @brminich. I will address them.
