Conversation

@ivHeisser ivHeisser commented Nov 15, 2025

• The current implementations of <next_pow2> and <prev_pow2> have O(log(n)) complexity.
This PR proposes a faster version of both functions for CPUs and NVIDIA GPUs.
The fully bitwise versions run in O(1) on CPUs and NVIDIA GPUs instead of O(log(n)).
The bitwise version is faster because it uses no data-dependent loop: it performs a fixed number of operations, propagating the most significant set bit to the right with bitwise shifts.
This turns O(log n) into O(1) for a fixed type size, such as 32 or 64 bits.

• Added documentation comments to the <next_pow2>, <prev_pow2> and <is_pow2> functions.
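
The bit propagation described in the first bullet can be sketched as follows. This is a minimal illustration for a 32-bit unsigned value, not the code from the PR; the names next_pow2_u32 and check_next_pow2_u32 and the choice to map 0 to 1 are assumptions made for the example, and C++14 or later is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the PR's function.
// Rounds x up to the nearest power of two with a fixed sequence of shifts.
// Assumes the result fits in 32 bits (x <= 2^31); 0 is mapped to 1 by choice.
constexpr std::uint32_t next_pow2_u32(std::uint32_t x) {
  if (x <= 1) return 1;
  x--;           // exact powers of two must map to themselves
  x |= x >> 1;   // copy the top set bit one position down
  x |= x >> 2;   // then two more positions
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;  // every bit below the top set bit is now 1
  return x + 1;  // all-ones plus one is the next power of two
}

static_assert(next_pow2_u32(5) == 8, "rounds up");
static_assert(next_pow2_u32(64) == 64, "powers of two are kept");

// Runtime spot checks for the sketch above.
inline void check_next_pow2_u32() {
  assert(next_pow2_u32(1) == 1);
  assert(next_pow2_u32(65) == 128);
}
```

A 64-bit variant would need one more step, x |= x >> 32.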

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

  • Refactoring to improve performance

Additional information:

Affected modules and functionalities:

Updated include/dali/core/util.h

Key points relevant for the review:

Focus on the implementation.

Tests:

Existing tests apply

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • [ *] Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

• The current implementation of functions <next_pow2> and <prev_pow2> has linear complexity O(n). 
  A more optimal version of these functions for CPUs and NVIDIA GPUs has been proposed. 
  Fully bitwise versions run in O(log(sizeof(T))) instead of O(n).

• Added description to <next_pow2>, <prev_pow2> and <is_pow2> functions.

greptile-apps bot commented Nov 15, 2025

Greptile Overview

Greptile Summary

This PR optimizes the next_pow2 and prev_pow2 utility functions in include/dali/core/util.h by replacing O(log n) loop-based implementations with O(1) bitwise operations. The optimization introduces hardware intrinsics (__clz/__clzll) for CUDA device code to efficiently count leading zeros, while providing a portable CPU fallback using bit propagation algorithms. Additionally, a new is_pow2 function is added using the standard (n & (n-1)) == 0 bit manipulation technique.

These utility functions are fundamental building blocks used throughout DALI's data processing pipelines for memory alignment calculations, buffer sizing operations, and tensor dimension management. The performance improvement is particularly significant for GPU workloads where these functions may be called frequently in device kernels. The change maintains API compatibility while providing comprehensive documentation for edge case behavior.

Important Files Changed

Filename | Score | Overview
include/dali/core/util.h | 2/5 | Critical optimization to power-of-2 utility functions with performance improvements but contains constexpr violations and incorrect edge case handling

Confidence score: 2/5

  • This PR has significant implementation issues that could break existing functionality and compilation
  • Score lowered due to constexpr violations in CPU fallback code, incorrect is_pow2 behavior for n=0, and potential edge case failures in type promotion logic
  • Pay close attention to the constexpr compatibility and mathematical correctness of the is_pow2 function

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. include/dali/core/util.h, line 208 (link)

    logic: this returns true for n=0, but documentation says it should return false for n <= 0
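
The n=0 case can be handled by adding a sign check in front of the bit test. The following is a minimal sketch of that idea, not the code under review; is_pow2_sketch and check_is_pow2_sketch are hypothetical names, and the documented contract (false for n <= 0) is taken from the comment above.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper illustrating the guarded bit test.
template <typename T>
constexpr bool is_pow2_sketch(T n) {
  static_assert(std::is_integral<T>::value, "integral types only");
  // n > 0 rejects zero and negative values before the bit test runs.
  // n & (n - 1) clears the lowest set bit, so the result is zero
  // only when exactly one bit was set.
  return n > 0 && (n & (n - 1)) == 0;
}

static_assert(!is_pow2_sketch(0), "zero is not a power of two");
static_assert(is_pow2_sketch(1), "2^0");
static_assert(!is_pow2_sketch(12), "two bits set");

// Runtime spot checks for the sketch above.
inline void check_is_pow2_sketch() {
  assert(is_pow2_sketch(64u));
  assert(!is_pow2_sketch(-8));
}
```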

1 file reviewed, 3 comments

#else
// CPU fallback (portable bitwise version with loop)
x--;
for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1) {

logic: this loop breaks constexpr evaluation since it's not a constant expression - the function is marked constexpr but won't work at compile time for CPU builds
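
Whether the loop prevents compile-time evaluation depends on the language standard: C++11 does not allow loops in constexpr functions, while C++14 and later do. The thread does not say which standard the header is built with, so the sketch below assumes C++14 or newer. It is an illustration with hypothetical names (next_pow2_loop, check_next_pow2_loop), not the PR's code.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: loop-based bit propagation inside a constexpr function.
// Assumes the result is representable in T; values <= 1 map to 1 by choice.
template <typename T>
constexpr T next_pow2_loop(T n) {
  static_assert(std::is_integral<T>::value, "integral types only");
  using U = typename std::make_unsigned<T>::type;
  if (n <= 1) return 1;
  U x = static_cast<U>(n);
  x--;
  // The trip count depends only on the width of U, not on the value of n.
  for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1) {
    x |= x >> i;
  }
  return static_cast<T>(x + 1);
}

// These checks are evaluated by the compiler, loop included.
static_assert(next_pow2_loop(17) == 32, "int");
static_assert(next_pow2_loop<std::uint8_t>(100) == 128, "8-bit");
static_assert(next_pow2_loop<std::uint64_t>((std::uint64_t{1} << 40) + 1) ==
                  (std::uint64_t{1} << 41),
              "64-bit");

// Runtime spot checks for the sketch above.
inline void check_next_pow2_loop() {
  assert(next_pow2_loop(1) == 1);
  assert(next_pow2_loop(1024) == 1024);
}
```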

Comment on lines 147 to 157
if constexpr (sizeof(U) == 4) {
// __clz() — count leading zeros (32-bit)
int lz = __clz(x - 1);
int pos = 31 - lz;
return static_cast<T>(U(1) << (pos + 1));
} else {
// 64-bit version uses __clzll()
int lz = __clzll(x - 1);
int pos = 63 - lz;
return static_cast<T>(U(1) << (pos + 1));
}

logic: no handling for non-32/64-bit integer types - what happens with 8-bit or 16-bit types? Should there be explicit handling for 8-bit and 16-bit integer types, or is the assumption that they'll be promoted to 32-bit acceptable?

}
#else
// CPU fallback (portable bitwise version with loop)
x--;

Contributor

@JanuszL JanuszL Nov 17, 2025

I'm not sure if this CPU fallback provides any difference over:

  T pow2 = 1;
  while (n > pow2) {
    pow2 += pow2;

Contributor

👍 This code is way more complex than it used to be, has the same theoretical complexity but a much larger constant factor.

Author

Description corrected.

Contributor

Do you mean that this loop is replaced with a CPU clz built-in?
I ask because while (n > pow2) has the same complexity as for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1): we double pow2 every iteration, the same way as we would shift it.

Author

@ivHeisser ivHeisser Nov 17, 2025

The basic assumption is that the loop while (n > pow2) depends on n: its number of iterations grows with n and is not fixed.
The loop for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1), by contrast, has a fixed number of iterations determined by the type U (32-bit, 64-bit, etc.). As a consequence, the compiler can unroll the for loop into a fixed number of operations.
The same cannot be said of the while loop: n is an input parameter that changes from call to call, so the compiler may not unroll it in some cases.
Either way, the for loop can be unrolled manually in the code, separately for the 32-bit case, the 64-bit case and so on, for example (pseudocode):
#if /* 32-bit case */
n |= n >> 1;
n |= n >> 2;
...
#elif /* 64-bit case */
n |= n >> 1;
n |= n >> 2;
...
#endif

Contributor

@mzient mzient Nov 18, 2025

OK, I see - the code is indeed log(log(n)), which is faster than log(n) which we had before. Still, I'd recommend using gcc/clang builtins if possible.
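
The difference between the two loops can be made concrete by counting iterations. The sketch below is only an illustration under stated assumptions: doubling_steps mirrors the while (n > pow2) loop quoted earlier in this thread, smear_steps mirrors the shift loop, and all names are hypothetical. If N is the largest value the type can hold, the first loop runs up to log2(N) times and the second about log2(log2(N)) times.

```cpp
#include <cassert>
#include <cstdint>

// Iterations of the doubling loop: grows with log2(n).
// Assumes n <= 2^63 so that pow2 does not overflow.
constexpr unsigned doubling_steps(std::uint64_t n) {
  unsigned steps = 0;
  std::uint64_t pow2 = 1;
  while (n > pow2) {
    pow2 += pow2;
    ++steps;
  }
  return steps;
}

// Iterations of the shift loop: depends only on the type width in bits.
constexpr unsigned smear_steps(unsigned bits) {
  unsigned steps = 0;
  for (unsigned i = 1; i < bits; i <<= 1) {
    ++steps;
  }
  return steps;
}

static_assert(doubling_steps(1000) == 10, "2^10 is the first power >= 1000");
static_assert(doubling_steps(std::uint64_t{1} << 40) == 40, "grows with n");
static_assert(smear_steps(32) == 5, "shifts by 1, 2, 4, 8, 16");
static_assert(smear_steps(64) == 6, "one extra shift by 32");

// Runtime spot checks for the sketch above.
inline void check_step_counts() {
  assert(doubling_steps(1) == 0);
  assert(smear_steps(8) == 3);
}
```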

Author

Added 8-bit and 16-bit cases to the GPU path, as discussed earlier in the review (commit f0ce8f42).

Author

@ivHeisser ivHeisser Dec 2, 2025

One more thought, about the architecture.
I agree that calling <next_pow2> inside <prev_pow2> is not a good idea, but I would also like to avoid duplicating code. So I can propose the following small design for this part of the code (pseudo-C++):

template<Parameter, typename T>
base2pow2(T n) {
  ....
  the main part of the code, parametrized by <Parameter>
  ....
}

template<typename T>
next_pow2(T n) {
  return base2pow2<parameter_1>(n);
}

template<typename T>
prev_pow2(T n) {
  return base2pow2<parameter_2>(n);
}

An extra parametrized function, base2pow2, would be added, and the compiler should inline it into next_pow2 and prev_pow2. I see two benefits:

  1. It avoids duplicated code.
  2. It allows the functionality to be extended: new functions such as prev_prev_pow2 (the power of two one step below prev_pow2) or next_next_pow2 (the power of two one step above next_pow2) could be added if the need arises. They would simply call base2pow2 with different parameters.
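
One way the proposed base2pow2 design could look is sketched below. This is a guess at the shape of the idea, not the author's implementation: the enum Pow2Dir, the wrapper names and the restriction to 32-bit unsigned values are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical template parameter selecting the rounding direction.
enum class Pow2Dir { Prev, Next };

// Shared core: bit propagation, specialised at compile time by Dir.
// Assumes the Next result fits in 32 bits (n <= 2^31).
template <Pow2Dir Dir>
constexpr std::uint32_t base2pow2(std::uint32_t n) {
  if (n == 0) return Dir == Pow2Dir::Next ? 1u : 0u;
  // For Next, start from n - 1 so exact powers of two map to themselves.
  std::uint32_t x = (Dir == Pow2Dir::Next) ? n - 1 : n;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  // x is now all-ones from the top set bit downwards.
  // Next: add one. Prev: keep only the top bit.
  return (Dir == Pow2Dir::Next) ? x + 1 : x - (x >> 1);
}

constexpr std::uint32_t next_pow2_sketch(std::uint32_t n) {
  return base2pow2<Pow2Dir::Next>(n);
}

constexpr std::uint32_t prev_pow2_sketch(std::uint32_t n) {
  return base2pow2<Pow2Dir::Prev>(n);
}

static_assert(next_pow2_sketch(5) == 8, "round up");
static_assert(prev_pow2_sketch(5) == 4, "round down");
static_assert(next_pow2_sketch(8) == 8 && prev_pow2_sketch(8) == 8, "fixed point");

// Runtime spot checks for the sketch above.
inline void check_base2pow2() {
  assert(next_pow2_sketch(0) == 1);
  assert(prev_pow2_sketch(0) == 0);
}
```

The shared core costs a compile-time branch at both ends of the function, which is the trade-off against keeping two separate implementations.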


#if defined(__CUDA_ARCH__)
// CUDA DEVICE PATH
if constexpr (sizeof(U) == 4) {

Contributor

Please swap the condition. Now the 64-bit variant would be used for 8 and 16-bit inputs, which is a big waste of resources. Use 64-bit __clzll only for 64-bit inputs and 32-bit otherwise.
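
The swapped condition could be sketched as below. To keep the example compilable without CUDA, it uses the GCC/Clang host builtins __builtin_clz and __builtin_clzll as stand-ins for the device intrinsics __clz and __clzll; the names next_pow2_clz and check_next_pow2_clz are hypothetical, and the result is assumed to be representable in T.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: the 64-bit count-leading-zeros call is used only for
// 64-bit types; 8-, 16- and 32-bit inputs all take the 32-bit path.
// Values <= 1 map to 1 by choice.
template <typename T>
constexpr T next_pow2_clz(T n) {
  static_assert(std::is_integral<T>::value, "integral types only");
  if (n <= 1) return 1;
  if (sizeof(T) == 8) {
    // n >= 2 here, so x is nonzero and the builtin is well defined.
    unsigned long long x = static_cast<unsigned long long>(n) - 1;
    return static_cast<T>(1ull << (64 - __builtin_clzll(x)));
  } else {
    // Narrow types are widened to 32 bits before counting.
    unsigned x = static_cast<unsigned>(n) - 1;
    return static_cast<T>(1u << (32 - __builtin_clz(x)));
  }
}

static_assert(next_pow2_clz<std::uint8_t>(100) == 128, "8-bit via 32-bit path");
static_assert(next_pow2_clz<std::uint16_t>(1000) == 1024, "16-bit via 32-bit path");
static_assert(next_pow2_clz<std::uint64_t>((1ull << 40) + 1) == (1ull << 41),
              "64-bit path");

// Runtime spot checks for the sketch above.
inline void check_next_pow2_clz() {
  assert(next_pow2_clz(5) == 8);
  assert(next_pow2_clz(8) == 8);
}
```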

Contributor

mzient commented Nov 17, 2025

  1. The current implementation has, in fact, O(log2(N)) complexity, not O(N) as claimed in the description.
  2. The proposed log2(N) solution has no advantage over existing code - in fact, it's much more complex and very likely slower.

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

constexpr int next_pow2(int x)
{
    if (x <= 1)
        return 1;
    int y = 1 << (31 - __builtin_clz(x));
    return x > y ? y << 1 : y;
}

Regarding prev_pow2 - it makes no sense to implement it in terms of next_pow2, because it's actually much simpler when using __clz:

constexpr int prev_pow2(int x)
{
    if (x < 1)
        return 0;
    return 1 << (31 - __builtin_clz(x));
}

I'd recommend keeping separate implementations for those two functions.
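
The two snippets above can be checked at compile time. The sketch below copies them unchanged into a namespace (review_sketch, a name made up for this example) and adds static_assert checks; it assumes GCC or Clang, a 32-bit int and C++14 or later. Inputs above 2^30 are left out because the next power of two is not representable in int.

```cpp
#include <cassert>

namespace review_sketch {  // hypothetical namespace, used only here

constexpr int next_pow2(int x) {
  if (x <= 1)
    return 1;
  int y = 1 << (31 - __builtin_clz(x));  // largest power of two <= x
  return x > y ? y << 1 : y;             // round up unless x is exact
}

constexpr int prev_pow2(int x) {
  if (x < 1)
    return 0;
  return 1 << (31 - __builtin_clz(x));   // keep only the top set bit
}

// Evaluated by the compiler, which shows the builtin works in constexpr code.
static_assert(next_pow2(5) == 8, "rounds up");
static_assert(next_pow2(8) == 8, "exact powers are kept");
static_assert(prev_pow2(5) == 4, "rounds down");
static_assert(prev_pow2(0) == 0, "no power of two below 1");

// Runtime spot checks for the functions above.
inline void check_builtin_versions() {
  assert(next_pow2(1000) == 1024);
  assert(prev_pow2(1000) == 512);
}

}  // namespace review_sketch
```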

Author

ivHeisser commented Nov 17, 2025

> GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

I can also add the support of MSVC with _lzcnt_u32 / _lzcnt_u64, if it is needed.

Contributor

mzient commented Nov 18, 2025

> > GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).
>
> I can also add the support of MSVC with _lzcnt_u32 / _lzcnt_u64, if it is needed.

I wouldn't do it, since our code won't compile with MSVC anyway and there's no way we could test it.

Contributor

@JanuszL JanuszL left a comment

Please apply changes discussed in review.

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. include/dali/core/util.h, line 231 (link)

    logic: returns true for n=0 which contradicts mathematical definition of powers of 2 and the documentation stating n <= 0 should return false

1 file reviewed, 1 comment
