Use integer promotion for warp_reduce
#6819
base: main
Conversation
We can leverage integer promotion to use the `__reduce_meow_sync` instructions
```cpp
inline constexpr bool
  can_use_reduce_add_sync<T, ::cuda::std::plus<>, ::cuda::std::void_t<decltype(__reduce_add_sync(0xFFFFFFFF, T{}))>> =
    ::cuda::std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned);
```
Q: what is the `decltype(__reduce_add_sync(0xFFFFFFFF, T{}))` actually good for? We know that it can only be a max 32-bit integral; we needn't test invocability.
I believe that is meant for compiler / toolkit combinations where we cannot rely solely on `SM_PROVIDES_SM_80`.
Or better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support.
`decltype(__reduce_add_sync)` is a historical way to handle this function. The common `NV_IF_TARGET` works perfectly fine with all compilers.
SFINAE here is very verbose and adds compilation complexity
> Or better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support

But I don't think SFINAE would help with this; the function is forward-declared in ${CTK_INCLUDE}/crt/sm_80_rt.h, which is always included when compiling for arch 80+.
🟥 CI Workflow Results: Finished in 6h 01m: Pass: 98%/93 | Total: 4d 14h | Max: 6h 00m | Hits: 69%/84654
```diff
 }
-NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name("warp_reduce").set_type_axes_names({"T{ct}"});
+NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name(bench_name).set_type_axes_names({"T{ct}"});
```
Suggestion: benchmarks are typically called `base` to distinguish them from the tuning variants.
```diff
-NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name(bench_name).set_type_axes_names({"T{ct}"});
+NVBENCH_BENCH_TYPES(warp_reduce, NVBENCH_TYPE_AXES(value_types)).set_name("base").set_type_axes_names({"T{ct}"});
```
```cpp
constexpr auto bench_name = "warp_reduce_min";
using op_t = ::cuda::minimum<>;
#include "warp_reduce_base.cuh"
```
Remark: There is usually no need to specify a bench name, since the file name is used for the binary.
```diff
   return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input)));
 }
 else if constexpr (detail::can_use_reduce_or_sync<T, ReductionOp>)
 {
-  return __reduce_or_sync(member_mask, input);
+  return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input)));
 }
 else if constexpr (detail::can_use_reduce_xor_sync<T, ReductionOp>)
 {
-  return __reduce_xor_sync(member_mask, input);
+  return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input)));
```
We could use the builtins for bitwise operations even for 64- and 128-bit types; maybe it could also work for min/max.
We can leverage integer promotion to use the `__reduce_meow_sync` instructions. With that we get a lot of shuffle instructions turned into reduce instructions.

- [ ] Merge after #6814