Contributor

@miscco miscco commented Dec 1, 2025

We can leverage integer promotion to use the __reduce_meow_sync instructions.

With that, a lot of shuffle instructions get turned into reduce instructions.
Reduce MAX

- [ ] Merge after #6814
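
To illustrate the promotion idea from the description, here is a minimal host-only sketch. It is not the PR's implementation: `promoted_t`, `promote_lanes`, `fake_reduce_add`, `fake_reduce_min` and the two wrappers are hypothetical names, and the `fake_*` functions only stand in for the 32-bit warp intrinsics by looping over an array of 32 lane values. It assumes the narrow value is widened to `int` or `unsigned` (keeping signedness), reduced at 32 bits, and cast back to `T`.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: the 32-bit type a narrow integer is widened to.
// Signedness is kept so that min/max still order negative values correctly.
template <class T>
using promoted_t = std::conditional_t<std::is_signed_v<T>, int, unsigned>;

// Widen every lane value to its promoted 32-bit type.
template <class T>
std::array<promoted_t<T>, 32> promote_lanes(const std::array<T, 32>& lanes) {
  static_assert(std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned));
  std::array<promoted_t<T>, 32> wide{};
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    wide[i] = static_cast<promoted_t<T>>(lanes[i]);
  }
  return wide;
}

// Host-side stand-in for a 32-bit add reduction across 32 lanes.
// Accumulates in unsigned so that overflow wraps instead of being undefined.
template <class P>
P fake_reduce_add(const std::array<P, 32>& lanes) {
  unsigned acc = 0;
  for (P v : lanes) {
    acc += static_cast<unsigned>(v);
  }
  return static_cast<P>(acc);
}

// Host-side stand-in for a 32-bit min reduction across 32 lanes.
template <class P>
P fake_reduce_min(const std::array<P, 32>& lanes) {
  return *std::min_element(lanes.begin(), lanes.end());
}

// Promote, reduce at 32 bits, then cast the result back to the narrow type.
template <class T>
T warp_sum_via_promotion(const std::array<T, 32>& lanes) {
  return static_cast<T>(fake_reduce_add(promote_lanes(lanes)));
}

template <class T>
T warp_min_via_promotion(const std::array<T, 32>& lanes) {
  return static_cast<T>(fake_reduce_min(promote_lanes(lanes)));
}
```

For example, 32 lanes of `std::uint8_t` holding 10 each sum to 320 at 32 bits, and the cast back to `std::uint8_t` gives 64, the same value that 8-bit wrapping arithmetic would produce.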

@miscco miscco requested a review from a team as a code owner December 1, 2025 10:12
@miscco miscco requested a review from gevtushenko December 1, 2025 10:12
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 1, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 1, 2025
Comment on lines 52 to 54
inline constexpr bool
can_use_reduce_add_sync<T, ::cuda::std::plus<>, ::cuda::std::void_t<decltype(__reduce_add_sync(0xFFFFFFFF, T{}))>> =
::cuda::std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned);
Contributor

@davebayer davebayer Dec 1, 2025

Q: What is the decltype(__reduce_add_sync(0xFFFFFFFF, T{})) actually good for? We know that T can only be an integral type of at most 32 bits, so we needn't test invocability.

Contributor Author

I believe that is meant for compiler / toolkit combinations where we cannot rely solely on SM_PROVIDES_SM_80

Contributor Author

Or better said, there are compilers that have partial SM80 support but where __reduce_min_sync and friends might not be implemented.

Contributor

Using decltype(__reduce_add_sync) is a historical way to handle this function. The common NV_IF_TARGET works perfectly fine with all compilers.

Contributor

@fbusato fbusato Dec 1, 2025

SFINAE here is very verbose and adds compilation complexity

Contributor

> Or better said, there are compilers where __reduce_min_sync and friends might not be implemented but that have partial SM80 support

But I don't think SFINAE would help with this: the function is forward declared in ${CTK_INCLUDE}/crt/sm_80_rt.h, which is always included when compiling for arch 80+.
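
The point about forward declarations can be shown with a small host-only sketch. The names `hypothetical_reduce_add_sync`, `not_an_integer` and `can_call_reduce_add` are made up for illustration and are not the CCCL or CUDA toolkit declarations. The sketch assumes a single declared overload with no definition anywhere, and shows that a `void_t<decltype(...)>` check only asks whether the call expression is well-formed, not whether the function is actually implemented.

```cpp
#include <type_traits>

// Hypothetical stand-in for an intrinsic that a toolkit header forward declares.
// It is declared here but deliberately never defined.
unsigned hypothetical_reduce_add_sync(unsigned mask, unsigned value);

// A type with no conversion to unsigned, used as the negative case.
struct not_an_integer {};

// Primary template: the call expression is not well-formed.
template <class T, class = void>
inline constexpr bool can_call_reduce_add = false;

// Specialization: chosen whenever the call compiles in an unevaluated context.
template <class T>
inline constexpr bool can_call_reduce_add<
  T, std::void_t<decltype(hypothetical_reduce_add_sync(0xFFFFFFFFu, T{}))>> = true;

// The declaration alone is enough for detection to succeed.
static_assert(can_call_reduce_add<unsigned>);
static_assert(can_call_reduce_add<short>);
static_assert(!can_call_reduce_add<not_an_integer>);
```

Because `decltype` never evaluates its operand, the trait reports `true` as soon as a matching declaration is visible, so under these assumptions it cannot tell a compiler that implements the intrinsic from one that merely sees the header.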

@fbusato fbusato requested a review from a team as a code owner December 2, 2025 00:35
Contributor

github-actions bot commented Dec 2, 2025

😬 CI Workflow Results

🟥 Finished in 6h 01m: Pass: 98%/93 | Total: 4d 14h | Max: 6h 00m | Hits: 69%/84654

See results here.

}

- NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name("warp_reduce").set_type_axes_names({"T{ct}"});
+ NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name(bench_name).set_type_axes_names({"T{ct}"});
Contributor

Suggestion: benchmarks are typically called base to distinguish them from the tuning variants.

Suggested change
- NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name(bench_name).set_type_axes_names({"T{ct}"});
+ NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name("base").set_type_axes_names({"T{ct}"});

Comment on lines +29 to 31
constexpr auto bench_name = "warp_reduce_min";
using op_t = ::cuda::minimum<>;
#include "warp_reduce_base.cuh"
Contributor

Remark: There is usually no need to specify a bench name, since the file name is used for the binary.

Comment on lines +563 to +571
+ return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input)));
  }
  else if constexpr (detail::can_use_reduce_or_sync<T, ReductionOp>)
  {
- return __reduce_or_sync(member_mask, input);
+ return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input)));
  }
  else if constexpr (detail::can_use_reduce_xor_sync<T, ReductionOp>)
  {
- return __reduce_xor_sync(member_mask, input);
+ return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input)));
Contributor

We could use the builtins for bitwise operations even for 64-bit and 128-bit types; maybe that could also work for min/max.
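
One way the 64-bit bitwise case might look is sketched below, as host-only code. This is an assumption about how the suggestion could be realized, not code from the PR: `fake_reduce_or` stands in for a 32-bit OR reduction across 32 lanes, and `reduce_or_64` is a hypothetical wrapper. It relies on bitwise operations treating every bit independently, so the low and high halves can be reduced separately and recombined.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Host-side stand-in for a 32-bit OR reduction across 32 lanes.
inline std::uint32_t fake_reduce_or(const std::array<std::uint32_t, 32>& lanes) {
  std::uint32_t acc = 0;
  for (std::uint32_t v : lanes) {
    acc |= v;
  }
  return acc;
}

// Hypothetical 64-bit OR reduction built from two 32-bit reductions.
inline std::uint64_t reduce_or_64(const std::array<std::uint64_t, 32>& lanes) {
  std::array<std::uint32_t, 32> lo{};
  std::array<std::uint32_t, 32> hi{};
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    lo[i] = static_cast<std::uint32_t>(lanes[i]);        // low 32 bits
    hi[i] = static_cast<std::uint32_t>(lanes[i] >> 32);  // high 32 bits
  }
  // Reduce each half independently, then stitch the halves back together.
  return (static_cast<std::uint64_t>(fake_reduce_or(hi)) << 32) | fake_reduce_or(lo);
}
```

The sketch covers only the bitwise case. Min and max compare whole values rather than independent bits, so they would need a different decomposition than this plain split into halves.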
