-
Notifications
You must be signed in to change notification settings - Fork 650
proposed more optimized versions of next_pow2 and prev_pow2 in util.h #6083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
• The current implementation of functions <next_pow2> and <prev_pow2> has linear complexity O(n). A more optimal version of these functions for CPUs and NVIDIA GPUs has been proposed. Fully bitwise versions run in O(log(sizeof(T))) instead of O(n). • Added description to <next_pow2>, <prev_pow2> and <is_pow2> functions.
Greptile OverviewGreptile SummaryThis PR optimizes the These utility functions are fundamental building blocks used throughout DALI's data processing pipelines for memory alignment calculations, buffer sizing operations, and tensor dimension management. The performance improvement is particularly significant for GPU workloads where these functions may be called frequently in device kernels. The change maintains API compatibility while providing comprehensive documentation for edge case behavior. Important Files Changed
Confidence score: 2/5
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
include/dali/core/util.h, line 208 (link)logic: this returns true for n=0, but documentation says it should return false for n <= 0
1 file reviewed, 3 comments
| #else | ||
| // CPU fallback (portable bitwise version with loop) | ||
| x--; | ||
| for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: this loop breaks constexpr evaluation since it's not a constant expression - the function is marked constexpr but won't work at compile time for CPU builds
include/dali/core/util.h
Outdated
| if constexpr (sizeof(U) == 4) { | ||
| // __clz() — count leading zeros (32-bit) | ||
| int lz = __clz(x - 1); | ||
| int pos = 31 - lz; | ||
| return static_cast<T>(U(1) << (pos + 1)); | ||
| } else { | ||
| // 64-bit version uses __clzll() | ||
| int lz = __clzll(x - 1); | ||
| int pos = 63 - lz; | ||
| return static_cast<T>(U(1) << (pos + 1)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: no handling for non-32/64-bit integer types - what happens with 8-bit or 16-bit types? Should there be explicit handling for 8-bit and 16-bit integer types, or is the assumption that they'll be promoted to 32-bit acceptable?
| } | ||
| #else | ||
| // CPU fallback (portable bitwise version with loop) | ||
| x--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this CPU fallback provides any difference over:
T pow2 = 1;
while (n > pow2) {
pow2 += pow2;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 This code is way more complex than it used to be, has ths same theoretical complexity but a much larger constant factor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description corrected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that this loop is replaced with CPU clz build in?
Because while (n > pow2) has the same complexity as for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1) because we double pow2 every iteration same way as we would shift it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic assumption is the loop while (n > pow2) depends on n and number of iterations is increasing with number n (not fixed).
While the loop for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1) has fixed number of iterations corresponding to the type of U (32bit, 64bit etc.). And as a consequence, the compiler will unloop for on fixed number of operations.
Which cannot be said about the while cycle, since the number n is an input parameter, constantly changing, and there is probability that the compiler will not unloop it in some cases.
One way or another, the cycle for may be unloop manually in the code for 64 bit, 32 bit etc. cases separately like (as pseudocode example):
# if define 32 bit case
n |= n >> 1;
n |= n >> 2;
...
# if define 64 bit case
n |= n >> 1;
n |= n >> 2;
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see - the code is indeed log(log(n)), which is faster than log(n) which we had before. Still, I'd recommend using gcc/clang builtins if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added 8bit and 16bit cases for GPU part as discussed before in review (commit f0ce8f42)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also one thought about architecture.
I agree, that calling <next_pow2> inside <prev_pow2> is not good idea. But also I would like to avoid the repeat of the code in the sense of refactoring. So I can propose next tiny architecture for this part of code (pseudo C++ code is using next):
template<Parameter, typename T>
base2pow2(T n) {
....
here is the main part of the code parametrized by <Parameter>
....
}
template<typename T>
next_pow2 (T n) {
base2pow2<parameter_1>(n);
}
template<typename T>
prev_pow2 (T n) {
base2pow2<parameter_2>(n);
}
extra parametrized base2pow2 will be added which should be substituted to next_pow2 and prev_pow2 by compiler. I see two profits in it:
- to avoid repeated code;
- functionality expansion - it will be possible to add new functions like
prev_prev_pow2(find a power of two that is one less thanprev_pow2) ornext_next_pow2(find a power of two that is one more thannext_pow2) and so on (if the need arises). And it will be functions which callbase2pow2with different parameters.
include/dali/core/util.h
Outdated
|
|
||
| #if defined(__CUDA_ARCH__) | ||
| // CUDA DEVICE PATH | ||
| if constexpr (sizeof(U) == 4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please swap the condition. Now the 64-bit variant would be used for 8 and 16-bit inputs, which is a big waste of resources. Use 64-bit __clzll only for 64-bit inputs and 32-bit otherwise.
GCC and Clang support constexpr int next_pow2(int x)
{
if (x <= 1)
return 1;
int y = 1 << (31 - __builtin_clz(x));
return x > y ? y << 1 : y;
}Regarding constexpr int prev_pow2(int x)
{
if (x < 1)
return 0;
return 1 << (31 - __builtin_clz(x));
}I'd recommend keeping separate implementations for those two functions. |
I can also add the support of |
I wouldn't do it, since our code won't compile with MSVC anyway and there's no way we could test it. |
JanuszL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please apply changes discussed in review.
…art in <next_pow2> function in <util.h> module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
include/dali/core/util.h, line 231 (link)logic: returns true for n=0 which contradicts mathematical definition of powers of 2 and the documentation stating n <= 0 should return false
1 file reviewed, 1 comment
• The current implementation of functions <next_pow2> and <prev_pow2> has complexity O(log(n)).
A more optimal version of these functions for CPUs and NVIDIA GPUs has been proposed.
Fully bitwise versions run in O(1) for CPU and NVIDIA GPU instead of O(log(n)).
The efficiency of the bitwise version is that it doesn't use a loop, but performs only a fixed number of operations, propagating the most significant set bit to the right using bitwise shifts.
This turns O(log n) into O(1) (for a fixed type size, such as 32 or 64 bits).
• Added description to <next_pow2>, <prev_pow2> and <is_pow2> functions.
Category:
Refactoring (Redesign of existing code that doesn't affect functionality)
Description:
Additional information:
Affected modules and functionalities:
updated include/dali/core/utils.h
Key points relevant for the review:
focus on code realization.
Tests:
Existing tests apply
If you select
Existing tests applyoption, please list which test cases cover the introducedfunctionality. For example:
--->
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A