Use integer promotion for `warp_reduce` #6819

miscco · 2025-12-01T10:12:34Z

We can leverage integer promotion to use the `__reduce_meow_sync` instructions.

With that, a lot of shuffle instructions get turned into reduce instructions.
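
A rough host-side sketch of the idea, not the code from this PR: a narrow integral `T` is widened to a 32-bit type, reduced there, and cast back. `fake_reduce_add` is a hypothetical stand-in that only models the arithmetic of a full-warp add reduction, and the `promoted_t` rule (signed types to `int`, unsigned types to `unsigned`) is an assumption about how the promoted type might be chosen.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical host stand-in for a warp-wide add over 32 lanes.
// It models only the arithmetic, not the real intrinsic or its member mask.
template <class P>
P fake_reduce_add(const std::array<P, 32>& lanes)
{
  static_assert(std::is_same_v<P, int> || std::is_same_v<P, unsigned>);
  unsigned sum = 0; // accumulate in unsigned so overflow wraps instead of being UB
  for (P v : lanes)
  {
    sum += static_cast<unsigned>(v);
  }
  return static_cast<P>(sum);
}

// Assumed promotion rule: keep the signedness, widen to 32 bits.
template <class T>
using promoted_t = std::conditional_t<std::is_signed_v<T>, int, unsigned>;

template <class T>
T reduce_add_promoted(const std::array<T, 32>& input)
{
  static_assert(std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned));
  using P = promoted_t<T>;
  std::array<P, 32> wide{};
  for (std::size_t i = 0; i < wide.size(); ++i)
  {
    wide[i] = static_cast<P>(input[i]); // widen each lane's value
  }
  // Reduce in 32 bits, then truncate back to the original width.
  return static_cast<T>(fake_reduce_add(wide));
}
```

Truncating the 32-bit sum gives the same result as wrapping arithmetic in the narrow type, which is why the cast back to `T` is enough for the add case.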

- [ ] Merge after #6814

davebayer · 2025-12-01T12:40:45Z

cub/cub/warp/specializations/warp_reduce_shfl.cuh

+inline constexpr bool
+  can_use_reduce_add_sync<T, ::cuda::std::plus<>, ::cuda::std::void_t<decltype(__reduce_add_sync(0xFFFFFFFF, T{}))>> =
+    ::cuda::std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned);


Q: What is the `decltype(__reduce_add_sync(0xFFFFFFFF, T{}))` actually good for? We already know that `T` can only be an integral type of at most 32 bits, so we needn't test invocability.

I believe that is meant for compiler/toolkit combinations where we cannot rely solely on `SM_PROVIDES_SM_80`.

Or better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support.

`decltype(__reduce_add_sync)` is a historical way to handle this function. The common `NV_IF_TARGET` works perfectly fine with all compilers.

SFINAE here is very verbose and adds compilation complexity.

> Or better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support.

But I don't think SFINAE would help with this: the function is forward declared in `${CTK_INCLUDE}/crt/sm_80_rt.h`, which is always included when compiling for arch 80+.
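
For readers unfamiliar with the pattern under discussion, here is a minimal host-only sketch of the `void_t` detection idiom. `fake_reduce_add_sync` is a hypothetical overload set declared only for illustration (an `int` and an `unsigned` overload, never defined or called), and `can_call_fake_reduce` is likewise an invented name; neither is the trait from this PR.

```cpp
#include <type_traits>

// Hypothetical declarations standing in for the intrinsic's overload set.
// They are only used inside decltype, so no definition is needed.
unsigned fake_reduce_add_sync(unsigned mask, unsigned value);
int fake_reduce_add_sync(unsigned mask, int value);

// Primary template: by default the call is assumed not to be well-formed.
template <class T, class = void>
inline constexpr bool can_call_fake_reduce = false;

// Partial specialization: chosen only if the call expression compiles for T.
template <class T>
inline constexpr bool
  can_call_fake_reduce<T, std::void_t<decltype(fake_reduce_add_sync(0xFFFFFFFFu, T{}))>> = true;

struct not_a_number
{};

static_assert(can_call_fake_reduce<int>);
static_assert(can_call_fake_reduce<short>); // short promotes to int
static_assert(!can_call_fake_reduce<not_a_number>);
```

The check only asks whether a matching declaration is visible, so it stays `true` whenever the declaration is in scope, which is the concern raised above about the forward declaration always being included.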

github-actions · 2025-12-02T06:48:44Z

😬 CI Workflow Results

🟥 Finished in 6h 01m: Pass: 98%/93 | Total: 4d 14h | Max: 6h 00m | Hits: 69%/84654

See results here.

bernhardmgruber · 2025-12-02T07:34:02Z

cub/benchmarks/bench/reduce/warp_reduce_base.cuh

 }

-NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name("warp_reduce").set_type_axes_names({"T{ct}"});
+NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name(bench_name).set_type_axes_names({"T{ct}"});


Suggestion: benchmarks are typically called `base` to distinguish them from the tuning variants.

Suggested change

-NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name(bench_name).set_type_axes_names({"T{ct}"});
+NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name("base").set_type_axes_names({"T{ct}"});

bernhardmgruber · 2025-12-02T07:34:46Z

cub/benchmarks/bench/reduce/warp_reduce_min.cu

+constexpr auto bench_name = "warp_reduce_min";
 using op_t = ::cuda::minimum<>;
 #include "warp_reduce_base.cuh"


Remark: There is usually no need to specify a bench name, since the file name is used for the binary.

davebayer · 2025-12-02T09:34:53Z

cub/cub/warp/specializations/warp_reduce_shfl.cuh

+                       return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input)));
                     }
                     else if constexpr (detail::can_use_reduce_or_sync<T, ReductionOp>)
                     {
-                       return __reduce_or_sync(member_mask, input);
+                       return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input)));
                     }
                     else if constexpr (detail::can_use_reduce_xor_sync<T, ReductionOp>)
                     {
-                       return __reduce_xor_sync(member_mask, input);
+                       return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input)));


We could use the builtins for bitwise operations even for 64- and 128-bit types. Maybe it could also work for min/max.
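
A host-side sketch of how the 64-bit bitwise case might look, under the assumption that the value is split into two 32-bit halves that are reduced separately. `fake_reduce_or` is a hypothetical stand-in that models only the arithmetic of a full-warp OR reduction; nothing here is taken from the PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical host stand-in for a warp-wide bitwise OR over 32 lanes.
std::uint32_t fake_reduce_or(const std::array<std::uint32_t, 32>& lanes)
{
  std::uint32_t result = 0;
  for (std::uint32_t v : lanes)
  {
    result |= v;
  }
  return result;
}

// Reduce 64-bit values by reducing the low and high 32-bit halves independently.
std::uint64_t reduce_or_64(const std::array<std::uint64_t, 32>& input)
{
  std::array<std::uint32_t, 32> lo{};
  std::array<std::uint32_t, 32> hi{};
  for (std::size_t i = 0; i < input.size(); ++i)
  {
    lo[i] = static_cast<std::uint32_t>(input[i]);       // low 32 bits
    hi[i] = static_cast<std::uint32_t>(input[i] >> 32); // high 32 bits
  }
  // Recombine the two partial results into one 64-bit value.
  return (static_cast<std::uint64_t>(fake_reduce_or(hi)) << 32) | fake_reduce_or(lo);
}
```

This works for AND, OR and XOR because each result bit depends only on the same bit position of the inputs. Min/max compare whole values, so the halves cannot be reduced independently in the same way.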

miscco added 7 commits December 1, 2025 10:38

Use inline variables to detect builtins

24d2dc6

Drop unused dispatch

77f5739

Rework the dispatch to more efficient intrinsics

4412106

Add support for __reduce_and_sync and __reduce_or_sync

9075131

Use better name

e40d923

Use bitwise operations and also support __reduce_xor_sync

9479fc0

Use integer promotion for warp_reduce

e7b01a6

miscco requested a review from a team as a code owner December 1, 2025 10:12

miscco requested a review from gevtushenko December 1, 2025 10:12

github-project-automation bot added this to CCCL Dec 1, 2025

github-project-automation bot moved this to Todo in CCCL Dec 1, 2025

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 1, 2025

davebayer reviewed Dec 1, 2025

miscco mentioned this pull request Dec 1, 2025

Improve our WarpReduce implementation #6814

Merged

miscco assigned fbusato Dec 1, 2025

fbusato and others added 4 commits December 1, 2025 15:47

Merge branch 'main' into warp_reduce_promotion

1f8b297

remove unused headers

ef02576

add more bench types

2cb09d0

add bench op name

0b24534

fbusato requested a review from a team as a code owner December 2, 2025 00:35

bernhardmgruber reviewed Dec 2, 2025

davebayer reviewed Dec 2, 2025
