CLC (cluster launch control) backend #8740
base: main
Conversation
Otherwise, it hits the following errors:

******************** TEST 'TRITON :: Conversion/tritonnvidiagpu_to_llvm.mlir' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 1: /data/users/daohang/triton/build/cmake.linux-x86_64-cpython-3.13/bin/triton-opt /data/users/daohang/triton/test/Conversion/tritonnvidiagpu_to_llvm.mlir -split-input-file --convert-triton-gpu-to-llvm=compute-capability=90 -reconcile-unrealized-casts | FileCheck /data/users/daohang/triton/test/Conversion/tritonnvidiagpu_to_llvm.mlir
+ /data/users/daohang/triton/build/cmake.linux-x86_64-cpython-3.13/bin/triton-opt /data/users/daohang/triton/test/Conversion/tritonnvidiagpu_to_llvm.mlir -split-input-file --convert-triton-gpu-to-llvm=compute-capability=90 -reconcile-unrealized-casts
+ FileCheck /data/users/daohang/triton/test/Conversion/tritonnvidiagpu_to_llvm.mlir
triton-opt: /data/users/daohang/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/PTXAsmFormat.cpp:157: PTXInstrExecution &mlir::triton::PTXInstrCommon::call(ArrayRef<Operand *>, bool): Assertion `builder->executions.empty() && "builder can only hold a single execution when onlyAttachMIIRArgs " "is true."' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace and instructions to reproduce the bug.
Stack dump:
0. Program arguments: /data/users/daohang/triton/build/cmake.linux-x86_64-cpython-3.13/bin/triton-opt /data/users/daohang/triton/test/Conversion/tritonnvidiagpu_to_llvm.mlir -split-input-file --convert-triton-gpu-to-llvm=compute-capability=90 -reconcile-unrealized-casts
#0 0x0000000007210ea8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /data/users/daohang/triton/llvm-project/llvm/lib/Support/Unix/Signals.inc:834:13
#1 0x000000000720ea83 llvm::sys::RunSignalHandlers() /data/users/daohang/triton/llvm-project/llvm/lib/Support/Signals.cpp:105:18
#2 0x0000000007211c41 SignalHandler(int, siginfo_t*, void*) /data/users/daohang/triton/llvm-project/llvm/lib/Support/Unix/Signals.inc:426:38
#3 0x00007fd6b583fc30 __restore_rt (/lib64/libc.so.6+0x3fc30)
#4 0x00007fd6b588d03c __pthread_kill_implementation (/lib64/libc.so.6+0x8d03c)
#5 0x00007fd6b583fb86 gsignal (/lib64/libc.so.6+0x3fb86)
#6 0x00007fd6b5829873 abort (/lib64/libc.so.6+0x29873)
#7 0x00007fd6b582979b _nl_load_domain.cold (/lib64/libc.so.6+0x2979b)
#8 0x00007fd6b58388c6 (/lib64/libc.so.6+0x388c6)
#9 0x00000000031dad20 mlir::triton::PTXInstrCommon::operator()(llvm::ArrayRef<mlir::triton::PTXBuilder::Operand*>, bool) /data/users/daohang/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/PTXAsmFormat.cpp:169:10
/data/users/daohang/triton/test/Conversion/tritonnvidiagpu_to_llvm.mlir:91:18: error: CHECK-LABEL: expected string not found in input
// CHECK-LABEL: async_clc_query_cancel
^
<stdin>:56:33: note: scanning from here
llvm.func @async_clc_try_cancel(%arg0: !llvm.struct<(ptr<3>, i32)>, %arg1: !llvm.struct<(ptr<3>, i32)>, %arg2: !llvm.ptr<1>, %arg3: !llvm.ptr<1>) attributes {nvvm.kernel = 1 : ui1, nvvm.reqntid = array<i32: 128>} {
^
<stdin>:62:384: note: possible intended match here
%5 = llvm.inline_asm has_side_effects asm_dialect = att operand_attrs = [] "\0A {\0A .reg .u32 first_cta_in_cluster;\0A .reg .pred pred_first_cta_in_cluster;\0A .reg .pred pred_issue;\0A mov.u32 first_cta_in_cluster, %cluster_ctaid.x;\0A setp.u32.eq pred_first_cta_in_cluster, first_cta_in_cluster, 0x0;\0A and.pred pred_issue, $2, pred_first_cta_in_cluster;\0A @pred_issue clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.multicast::cluster::all.b128 [$0], [$1];\0A }\0A ", "r,r,b" %arg1, %arg0, %4 : (!llvm.struct<(ptr<3>, i32)>, !llvm.struct<(ptr<3>, i32)>, i1) -> !llvm.void
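For context on the assertion: it enforces that a PTXBuilder holds at most one execution when an instruction is invoked with onlyAttachMLIRArgs set. The following is a minimal, non-authoritative sketch of the failing pattern, based only on the signatures visible in the stack trace above; it is not code from this PR, and the operand setup and include path are assumptions.

```cpp
// Illustrative sketch only (not this PR's code). Signatures taken from the
// stack trace: PTXInstrCommon::call(ArrayRef<Operand *>, bool) asserts that
// builder->executions is empty when the bool (onlyAttachMLIRArgs) is true.
#include "PTXAsmFormat.h" // include path assumed

using namespace mlir::triton;

void sketchFailingPattern(PTXBuilder &ptxBuilder,
                          PTXBuilder::Operand *resultOpr,
                          PTXBuilder::Operand *barrierOpr,
                          PTXBuilder::Operand *predOpr) {
  // First instruction: a normal call appends an execution to the builder.
  auto &movCtaId =
      *ptxBuilder.create<>("mov.u32 first_cta_in_cluster, %cluster_ctaid.x");
  movCtaId({}, /*onlyAttachMLIRArgs=*/false);

  // Second instruction: calling with onlyAttachMLIRArgs=true while the builder
  // already holds an execution trips the assertion at PTXAsmFormat.cpp:157.
  auto &clc = *ptxBuilder.create<>(
      "@pred_issue clusterlaunchcontrol.try_cancel.async.shared::cta"
      ".mbarrier::complete_tx::bytes.multicast::cluster::all.b128 [$0], [$1]");
  clc({resultOpr, barrierOpr, predOpr}, /*onlyAttachMLIRArgs=*/true); // asserts
}
```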
Force-pushed from 8b99a8a to cb7bf7f.
Broken CI complains about HIP OOM; this should be a false positive.
Would it be possible to stack the PRs needed until the point where we can have a Gluon execution test? That would help us understand the scope of the feature.
mov.u32 first_cta_in_cluster, %cluster_ctaid.x;
setp.u32.eq pred_first_cta_in_cluster, first_cta_in_cluster, 0x0;
and.pred pred_issue, $2, pred_first_cta_in_cluster;
Separate this out of the inline PTX; that will allow the code sequence to be optimized.
Sure, happy to do that. Would you elaborate on to what extent I should separate this out? I'm asking because I was basically following the same style as ArriveBarrierOpConversion; more context or an existing example would be even better.
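One possible shape for that split, offered here only as a non-authoritative sketch (not this PR's code): compute %cluster_ctaid.x and the issue predicate with ordinary MLIR ops so LLVM can fold and schedule them, and keep only the try_cancel instruction inside the asm. The function signature, variable names, the NVVM special-register op, and the PTXBuilder usage below are assumptions.

```cpp
// Sketch under stated assumptions; not the lowering implemented in this PR.
// Assumes MLIR's LLVM/NVVM dialects and Triton's PTXBuilder are in scope in
// the conversion pattern; the exact NVVM op for %cluster_ctaid.x may differ.
static void sketchTryCancelLowering(ConversionPatternRewriter &rewriter,
                                    Location loc, Value pred,
                                    Value resultSmemAddr,
                                    Value barrierSmemAddr) {
  auto i32Ty = rewriter.getI32Type();
  Value zero = rewriter.create<LLVM::ConstantOp>(loc, i32Ty,
                                                 rewriter.getI32IntegerAttr(0));
  // Read %cluster_ctaid.x via a regular op instead of inline asm
  // (op name is an assumption).
  Value ctaIdInCluster = rewriter.create<NVVM::BlockInClusterIdXOp>(loc, i32Ty);

  // pred_first_cta_in_cluster and pred_issue become plain LLVM ops that the
  // optimizer can see and combine with surrounding code.
  Value isFirstCta = rewriter.create<LLVM::ICmpOp>(
      loc, LLVM::ICmpPredicate::eq, ctaIdInCluster, zero);
  Value issuePred = rewriter.create<LLVM::AndOp>(loc, pred, isFirstCta);

  // Only the try_cancel itself stays as verbatim PTX; $0/$1/$2 map to the
  // result address, barrier address, and predicate operands ("r,r,b"), as in
  // the asm string shown in the log above.
  PTXBuilder ptxBuilder;
  auto *resultOpr = ptxBuilder.newOperand(resultSmemAddr, "r");
  auto *barrierOpr = ptxBuilder.newOperand(barrierSmemAddr, "r");
  auto *predOpr = ptxBuilder.newOperand(issuePred, "b");
  auto &clc = *ptxBuilder.create<>(
      "@$2 clusterlaunchcontrol.try_cancel.async.shared::cta"
      ".mbarrier::complete_tx::bytes.multicast::cluster::all.b128 [$0], [$1];");
  clc({resultOpr, barrierOpr, predOpr}, /*onlyAttachMLIRArgs=*/true);
  ptxBuilder.launch(rewriter, loc,
                    LLVM::LLVMVoidType::get(rewriter.getContext()),
                    /*hasSideEffect=*/true);
}
```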
.reg .b128 clc_result;
.reg .pred p1;
mov.s32 $0, -1;
ld.shared.b128 clc_result, [$1];
Same here; can we separate this out?
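Along the same lines, a minimal sketch of what splitting the query/decode path might look like. Again this is illustrative, not this PR's code: the function name, the vector<2xi64> modeling of the 128-bit result, and the decision to keep only the clusterlaunchcontrol query in asm are assumptions.

```cpp
// Sketch only: pull the ld.shared.b128 and the "-1 default" out of the asm so
// just the clusterlaunchcontrol query remains opaque to the optimizer.
static Value sketchLoadClcResult(ConversionPatternRewriter &rewriter,
                                 Location loc,
                                 Value clcResultSmemPtr /* !llvm.ptr<3> */) {
  // ld.shared.b128 clc_result, [$1] -> an ordinary 16-byte load (modeled here
  // as vector<2xi64>) that LLVM can schedule and reuse.
  auto i64x2Ty = VectorType::get({2}, rewriter.getI64Type());
  Value packed = rewriter.create<LLVM::LoadOp>(loc, i64x2Ty, clcResultSmemPtr,
                                               /*alignment=*/16);
  // The clusterlaunchcontrol.query_cancel instruction itself would stay in a
  // small PTXBuilder-emitted asm block consuming `packed`, and the
  // "mov.s32 $0, -1" default would become an LLVM::SelectOp on the query's
  // is-canceled result, emitted after the asm rather than inside it.
  return packed;
}
```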
Upstream CLC backend changes from the following PRs. This works well with the TLX frontend (see the unit test and tutorial kernel). This PR contains the backend part only.
TTGIR
PTX lowering
MLIR
Test
lit test/Conversion/tritonnvidiagpu_to_llvm.mlir
make test-cpp