Conversation

@praveingk

What?

Added UCCL backend support to NIXL.

Why?

UCCL is an efficient communication library for GPUs that supports collectives, P2P transfers, and GPU-driven communication for expert parallelism (EP). UCCL focuses on flexibility for fast-evolving ML workloads and on portability across heterogeneous GPU and NIC vendors. It provides a software transport stack that runs on the CPU and is easily extensible with communication optimization techniques such as congestion control, multipathing, and efficient loss recovery.

How?

  1. Added a basic UCCL plugin for inter-node transfers over RDMA, with further enhancements on the roadmap.
  2. Added a test in gtest/plugins to exercise basic transfers.
  3. Provided references for using the UCCL backend (see the usage sketch below).
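
For reference, a minimal sketch of bringing up the UCCL backend, assuming the standard `nixlAgent` C++ API; the backend name string (see the capitalization question below) and parameter keys are illustrative, not taken from this PR:

```cpp
// Minimal sketch (illustrative): create a NIXL agent and instantiate the
// UCCL backend. Backend name and parameters are assumptions; check the
// plugin's registration and the referenced UCCL docs for exact values.
#include <nixl.h>

int main() {
    nixlAgentConfig cfg(true);  // progress-thread flag
    nixlAgent agent("initiator", cfg);

    nixl_b_params_t params;     // backend-specific key/value options
    nixlBackendH *backend = nullptr;
    if (agent.createBackend("UCCL", params, backend) != NIXL_SUCCESS)
        return 1;

    // From here, register memory, exchange metadata, and post transfers
    // exactly as with any other NIXL backend.
    return 0;
}
```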

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@praveingk praveingk requested a review from a team as a code owner October 13, 2025 08:47

copy-pr-bot bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi praveingk! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@brminich
Contributor

General question: What is the added value of the UCCL plugin? I notice quite a few missing features (intra-node support, progress thread, multi-GPU).

@praveingk
Author

> General question: What is the added value of the UCCL plugin? I notice quite a few missing features (intra-node support, progress thread, multi-GPU).

@brminich Thanks a lot for your review. UCCL's added value is its extensibility. It provides a way to program the control logic of the GPU networking stack, enabling newer congestion-control protocols, loss-recovery features, and multi-path transport, which are usually not easily programmable because they require NIC hardware or configuration changes. This is possible because UCCL runs the control logic on the CPU. We have seen significant performance improvements running UCCL compared to NCCL, and I am currently seeing a 10% TTFT improvement with the UCCL backend in my preliminary evaluation between two cross-rack nodes.

Additionally, UCCL provides a unified stack for collectives, P2P, and EP, which makes it possible to customize the control logic per communication type. UCCL also brings heterogeneous GPU and NIC vendor support. Additional transport types such as TCP and TCP-X are being added to UCCL P2P.

I will add intra-node support (already available in UCCL) and progress-thread support to the UCCL backend in a separate PR in the coming days.

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@brminich
Contributor

> > General question: What is the added value of the UCCL plugin? […]
>
> @brminich Thanks a lot for your review. UCCL's added value is its extensibility. […] I will add intra-node support (already available in UCCL) and progress-thread support to the UCCL backend in a separate PR in the coming days.

Collectives are not utilized by NIXL. Could you please elaborate a bit more on the aspects of extensibility and congestion control? How would these be leveraged in NIXL use cases, particularly in terms of performance?

@praveingk
Author

> Collectives are not utilized by NIXL. Could you please elaborate a bit more on the aspects of extensibility and congestion control? How would these be leveraged in NIXL use cases, particularly in terms of performance?

UCCL currently provides various congestion-control algorithms, both sender-driven (TIMELY, SWIFT) and receiver-driven (EQDS), which can be chosen based on workload characteristics. Consider a NIXL-specific use case, PD (prefill/decode) disaggregation: receiver-driven congestion control could help when specific decode pods see bursty incast traffic from different prefill pods. In addition, UCCL emulates packet spraying to leverage the available network paths and avoid a "single path of congestion", while running congestion control on each path separately. At scale, this could spread traffic evenly through the core of the network, alleviating the congestion, packet loss, and tail latency that directly contribute to TTFT in PD disaggregation. In my preliminary experiments, the UCCL backend achieved 10% lower TTFT than UCX when prefill/decode pods were allocated cross-rack, since it splits a single 3 GB KV-cache message into smaller chunks spread across the available paths.

Finally, UCCL separates the optimization logic from the heterogeneous GPU/NIC/transport hardware logic. Hence, the same optimizations can be applied to different NICs (currently tested with NVIDIA and Broadcom NICs) and different transport types (AF_XDP-based user-space TCP, RDMA, EFA, GPU Direct TCP-X, etc.). In this first version, the UCCL backend supports RDMA; adding more transport types is on the UCCL P2P agenda (AF_XDP-based user-space TCP and Amazon's EFA are currently available in UCCL collectives). Since the optimizations run on the CPU, they can evolve as workloads and transports evolve.

@ovidiusm ovidiusm self-requested a review November 18, 2025 12:43
@brminich
Contributor

Can we please also add a nixlbench run with the UCCL plugin to the CI in https://github.com/ai-dynamo/nixl/blob/main/.gitlab/test_nixlbench.sh?

@ovidiusm ovidiusm (Contributor) left a comment

Thank you for your contribution! I left a few comments/questions; feel free to address them as you think best. I am not sure I understood some of the functionality correctly.

init["num_threads"] = str(nixl_conf.num_threads)
elif bknd == "GDS_MT":
init["thread_count"] = str(nixl_conf.num_threads)
elif bknd == "Uccl":

Contributor

Should this be capitalized as UCCL?


Refer to [README](https://github.com/uccl-project/uccl/tree/main/collective/rdma#environment-variables-in-uccl) for the complete list of environment variables that can be set to customize UCCL.

**Important**: For `NIXL_READ` operations, set `UCCL_RCMODE=1`. By default, UCCL uses RDMA UC (Unreliable Connected); `READ` operations, however, require RDMA RC (Reliable Connected).

Contributor

Given that, should we reject read operations if the RCMODE is not set correctly?
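
For illustration, a guard along these lines could fail fast; the helper name and its placement are assumptions, not the plugin's actual code:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when UCCL is configured for RDMA RC, which
// READ operations require.
static bool ucclRcModeEnabled() {
    const char *rc = std::getenv("UCCL_RCMODE");
    return rc != nullptr && std::strcmp(rc, "1") == 0;
}

// Sketch of the check inside the backend (illustrative placement):
//   if (operation == NIXL_READ && !ucclRcModeEnabled())
//       return NIXL_ERR_NOT_SUPPORTED; // reject early, not mid-transfer
```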

nixlUcclEngine::registerMem(const nixlBlobDesc &mem,
const nixl_mem_t &nixl_mem,
nixlBackendMD *&out) {
std::lock_guard<std::mutex> lock(mutex_);

Contributor

It seems that the mutex is protecting mem_reg_info_, but there is no protection in prepXfer/postXfer/genNotif. Is that intentional?
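
As a generic sketch of the concern (simplified stand-in types, not the plugin's actual code): every reader and writer of the shared map needs the same lock:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Simplified stand-in for the engine's shared registration state.
struct Engine {
    std::mutex mutex_;
    std::unordered_map<uintptr_t, int> mem_reg_info_;

    void registerMem(uintptr_t addr, int md) {
        std::lock_guard<std::mutex> lock(mutex_);  // writer holds the lock
        mem_reg_info_[addr] = md;
    }

    // A prepXfer-style lookup must take the same lock, or it can race with
    // concurrent registrations/deregistrations of the map.
    bool lookup(uintptr_t addr, int &md) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = mem_reg_info_.find(addr);
        if (it == mem_reg_info_.end()) return false;
        md = it->second;
        return true;
    }
};
```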

engine_ = nullptr;
}

if (listener_thread_.joinable()) {

Contributor

Shouldn't the listener thread be destroyed first?
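
A generic sketch of the suggested teardown order (the stop flag and member names are illustrative):

```cpp
#include <atomic>
#include <thread>

struct ListenerOwner {
    std::atomic<bool> stop_{false};
    std::thread listener_thread_;

    ~ListenerOwner() {
        // Signal and join the listener first, while the engine state it
        // may still be touching is valid ...
        stop_ = true;
        if (listener_thread_.joinable())
            listener_thread_.join();
        // ... and only then release engine_ and other shared resources.
    }
};
```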

if (retry_count < max_retries) {
NIXL_DEBUG << "Failed to get FIFO item, retry " << retry_count << "/"
<< max_retries << " for item " << i;
std::this_thread::sleep_for(std::chrono::milliseconds(1));

Contributor

I do not understand why this is a blocking operation in prepXfer.


bool all_done = true;
for (uint64_t transfer_id : uccl_handle->transfer_ids) {
if (std::find(uccl_handle->completed_transfer_ids.begin(),

Contributor

completed_transfer_ids is a vector, so this find is slow. I suggest using an unordered set (`std::unordered_set`).

Contributor

Note that we have seen batches of up to 64K transfers in practice, so this is important.
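
A sketch of the suggested change, assuming `transfer_ids` stays a vector and completions move to an `std::unordered_set`:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// With completed ids in a hash set, checking a batch of N transfers is
// O(N) on average, instead of O(N^2) with std::find over a vector.
static bool allDone(const std::vector<uint64_t> &transfer_ids,
                    const std::unordered_set<uint64_t> &completed) {
    for (uint64_t id : transfer_ids)
        if (completed.count(id) == 0)
            return false;
    return true;
}
```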

}

bool all_done = true;
for (uint64_t transfer_id : uccl_handle->transfer_ids) {

Contributor

Would it be worth using a set for the pending ones, so that we do not recheck the completed ones every iteration?
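
A sketch of that idea (names illustrative): track only the pending ids and erase them as completions arrive, so completed transfers are never rechecked:

```cpp
#include <cstdint>
#include <unordered_set>

struct XferState {
    std::unordered_set<uint64_t> pending_;  // ids still in flight

    void onComplete(uint64_t id) { pending_.erase(id); }

    // Each status poll only touches what is still outstanding.
    bool allDone() const { return pending_.empty(); }
};
```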

conn = uccl_engine_connect(engine_, ip_addr, gpu_index, port);
if (!conn) {
NIXL_ERROR << "Failed to connect to remote agent " << remote_agent;
delete[] ip_addr;

Contributor

Minor style comment: it might be less error-prone to use a smart pointer such as `std::unique_ptr`.
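
A self-contained sketch of the suggestion (the helper and address are made up): own the string in a `std::unique_ptr<char[]>` so every early-return path frees it without a manual `delete[]`:

```cpp
#include <cstring>
#include <memory>

// Copy a C string into RAII-managed storage (illustrative helper).
static std::unique_ptr<char[]> copyString(const char *s) {
    auto buf = std::make_unique<char[]>(std::strlen(s) + 1);
    std::strcpy(buf.get(), s);
    return buf;
}

int main() {
    std::unique_ptr<char[]> ip_addr = copyString("10.0.0.1");
    // Pass ip_addr.get() to uccl_engine_connect(); on any error path the
    // buffer is released automatically when ip_addr goes out of scope.
    return 0;
}
```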

@praveingk
Author

> Thank you for your contribution! I left a few comments/questions; feel free to address them as you think best. I am not sure I understood some of the functionality correctly.

Thanks for your comments, @ovidiusm and @brminich. I will address them.
