proposed more optimized versions of next_pow2 and prev_pow2 in util.h #6083

ivHeisser · 2025-11-15T16:32:35Z

• The current implementations of <next_pow2> and <prev_pow2> have O(log(n)) complexity.
This PR proposes faster versions of these functions for CPUs and NVIDIA GPUs.
The fully bitwise versions run in O(1) on both CPU and NVIDIA GPU instead of O(log(n)).
The bitwise version is efficient because it does not loop over the value: it performs a fixed number of operations, propagating the most significant set bit to the right with bitwise shifts.
This turns O(log n) into O(1) for a fixed type size, such as 32 or 64 bits.

• Added descriptions to the <next_pow2>, <prev_pow2> and <is_pow2> functions.
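
The bit-propagation idea described above can be sketched as follows for a 32-bit prev_pow2. This is a minimal illustration under stated assumptions, not the code from this PR: the name prev_pow2_smear is hypothetical, the type is fixed to std::uint32_t, and a C++14 or newer compiler is assumed so that a constexpr function may contain several statements.

```cpp
#include <cstdint>

// Hypothetical helper, not the PR's implementation.
// Returns the largest power of two <= x, or 0 when x == 0.
constexpr std::uint32_t prev_pow2_smear(std::uint32_t x) {
  // Copy the most significant set bit into every lower bit position.
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  // x now has the form 2^(k+1) - 1 (or is 0); keep only the top bit.
  return x - (x >> 1);
}

static_assert(prev_pow2_smear(0) == 0, "zero maps to zero");
static_assert(prev_pow2_smear(5) == 4, "rounds down");
static_assert(prev_pow2_smear(64) == 64, "powers of two are unchanged");
static_assert(prev_pow2_smear(0xFFFFFFFFu) == 0x80000000u, "top bit");
```

Applying the same five shifts to x - 1 and then adding 1 gives the next_pow2 counterpart, provided the result still fits in 32 bits.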

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

Refactoring to improve performance

Additional information:

Affected modules and functionalities:

Updated include/dali/core/util.h

Key points relevant for the review:

Focus on the implementation.

Tests:

Existing tests apply

Existing tests apply
New tests added
- Python tests
- GTests
- [ *] Benchmark
- Other
N/A

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

• The current implementation of functions <next_pow2> and <prev_pow2> has linear complexity O(n). A more optimal version of these functions for CPUs and NVIDIA GPUs has been proposed. Fully bitwise versions run in O(log(sizeof(T))) instead of O(n). • Added description to <next_pow2>, <prev_pow2> and <is_pow2> functions.

greptile-apps · 2025-11-15T16:34:07Z

Greptile Overview

Greptile Summary

This PR optimizes the next_pow2 and prev_pow2 utility functions in include/dali/core/util.h by replacing O(log n) loop-based implementations with O(1) bitwise operations. The optimization introduces hardware intrinsics (__clz/__clzll) for CUDA device code to efficiently count leading zeros, while providing a portable CPU fallback using bit propagation algorithms. Additionally, a new is_pow2 function is added using the standard (n & (n-1)) == 0 bit manipulation technique.

These utility functions are fundamental building blocks used throughout DALI's data processing pipelines for memory alignment calculations, buffer sizing operations, and tensor dimension management. The performance improvement is particularly significant for GPU workloads where these functions may be called frequently in device kernels. The change maintains API compatibility while providing comprehensive documentation for edge case behavior.

Important Files Changed

Filename	Score	Overview
`include/dali/core/util.h`	2/5	Critical optimization to power-of-2 utility functions with performance improvements but contains constexpr violations and incorrect edge case handling

Confidence score: 2/5

This PR has significant implementation issues that could break existing functionality and compilation
Score lowered due to constexpr violations in CPU fallback code, incorrect is_pow2 behavior for n=0, and potential edge case failures in type promotion logic
Pay close attention to the constexpr compatibility and mathematical correctness of the is_pow2 function

greptile-apps

Additional Comments (1)

include/dali/core/util.h, line 208 (link)

logic: this returns true for n=0, but documentation says it should return false for n <= 0
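
One possible way to make the behavior match the documented contract is to test the sign before applying the bit trick. The sketch below assumes the function is meant to accept any integer type; the name is_pow2_checked is hypothetical and is not the function from the PR.

```cpp
#include <type_traits>

// Hypothetical variant of is_pow2 that rejects n <= 0 explicitly.
template <typename T>
constexpr bool is_pow2_checked(T n) {
  static_assert(std::is_integral<T>::value, "integer types only");
  // (n & (n - 1)) == 0 alone is also true for n == 0, hence the n > 0 guard.
  return n > 0 && (n & (n - 1)) == 0;
}

static_assert(!is_pow2_checked(0), "zero is not a power of two");
static_assert(!is_pow2_checked(-4), "negative values are rejected");
static_assert(is_pow2_checked(1), "2^0");
static_assert(is_pow2_checked(64u), "2^6");
static_assert(!is_pow2_checked(12), "two bits set");
```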

1 file reviewed, 3 comments

greptile-apps · 2025-11-15T16:34:05Z

include/dali/core/util.h

+#else
+  // CPU fallback (portable bitwise version with loop)
+  x--;
+  for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1) {


logic: this loop breaks constexpr evaluation since it's not a constant expression - the function is marked constexpr but won't work at compile time for CPU builds
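
Whether such a loop prevents compile-time evaluation depends on the language standard: C++11 constexpr functions may not contain loops, while C++14 and later allow them. Assuming a C++14 or newer build (an assumption about the project configuration, not something stated in this thread), a static_assert is one way to check. The function name next_pow2_fallback and the fixed 32-bit type are illustrative only.

```cpp
#include <cstdint>

// Illustrative stand-in for the CPU fallback, fixed to 32 bits.
// Precondition: x <= 2^31, otherwise x + 1 wraps around to 0.
constexpr std::uint32_t next_pow2_fallback(std::uint32_t x) {
  if (x <= 1) return 1;
  x--;
  // Fixed trip count: i takes the values 1, 2, 4, 8, 16.
  for (unsigned i = 1; i < sizeof(x) * 8; i <<= 1)
    x |= x >> i;
  return x + 1;
}

// These only compile if the function is usable in a constant expression.
static_assert(next_pow2_fallback(5) == 8, "rounds up");
static_assert(next_pow2_fallback(8) == 8, "powers of two are unchanged");
static_assert(next_pow2_fallback(1000) == 1024, "rounds up");
```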

greptile-apps · 2025-11-15T16:34:06Z

include/dali/core/util.h

+  if constexpr (sizeof(U) == 4) {
+    // __clz() — count leading zeros (32-bit)
+    int lz = __clz(x - 1);
+    int pos = 31 - lz;
+    return static_cast<T>(U(1) << (pos + 1));
+  } else {
+    // 64-bit version uses __clzll()
+    int lz = __clzll(x - 1);
+    int pos = 63 - lz;
+    return static_cast<T>(U(1) << (pos + 1));
+  }


logic: no handling for non-32/64-bit integer types - what happens with 8-bit or 16-bit types? Should there be explicit handling for 8-bit and 16-bit integer types, or is the assumption that they'll be promoted to 32-bit acceptable?

JanuszL · 2025-11-17T08:22:41Z

include/dali/core/util.h

+  }
+#else
+  // CPU fallback (portable bitwise version with loop)
+  x--;


I'm not sure if this CPU fallback provides any difference over:

T pow2 = 1; while (n > pow2) { pow2 += pow2; }

👍 This code is way more complex than it used to be and has the same theoretical complexity, but with a much larger constant factor.

Description corrected.

Do you mean that this loop is replaced with the CPU clz builtin?
Because while (n > pow2) has the same complexity as for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1): we double pow2 every iteration, the same way we would shift it.

The basic assumption is that the loop while (n > pow2) depends on n, so the number of iterations grows with n (it is not fixed).
The loop for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1), on the other hand, has a fixed number of iterations determined by the type U (32-bit, 64-bit, etc.). As a consequence, the compiler can unroll it into a fixed number of operations.
The same cannot be said of the while loop: n is an input parameter that changes from call to call, so the compiler may not be able to unroll it in some cases.
Either way, the for loop can be unrolled manually in the code, separately for the 64-bit, 32-bit, etc. cases (pseudocode example):
# if define 32 bit case
n |= n >> 1;
n |= n >> 2;
...
# if define 64 bit case
n |= n >> 1;
n |= n >> 2;
...
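
To make the difference between the two loops concrete, the sketch below only counts iterations; it does not compute a power of two. Both helper names are hypothetical. The doubling loop runs about log2(n) times, while the shift loop runs log2(number of bits in U) times regardless of n.

```cpp
#include <cstdint>

// Iterations of the original-style loop: while (n > pow2) pow2 += pow2;
// Precondition: n <= 2^63, otherwise pow2 overflows and the loop never ends.
constexpr unsigned doubling_steps(std::uint64_t n) {
  unsigned steps = 0;
  for (std::uint64_t pow2 = 1; n > pow2; pow2 += pow2) ++steps;
  return steps;
}

// Iterations of: for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1)
template <typename U>
constexpr unsigned smear_steps() {
  unsigned steps = 0;
  for (unsigned i = 1; i < sizeof(U) * 8; i <<= 1) ++steps;
  return steps;
}

static_assert(doubling_steps(5) == 3, "1 -> 2 -> 4 -> 8");
static_assert(doubling_steps(1000) == 10, "reaches 1024 after 10 doublings");
static_assert(doubling_steps(1u << 31) == 31, "worst case for 32-bit input");
static_assert(smear_steps<std::uint32_t>() == 5, "shifts by 1, 2, 4, 8, 16");
static_assert(smear_steps<std::uint64_t>() == 6, "one more shift, by 32");
```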

OK, I see - the code is indeed log(log(n)), which is faster than log(n) which we had before. Still, I'd recommend using gcc/clang builtins if possible.

added 8bit and 16bit cases for GPU part as discussed before in review (commit f0ce8f42)

One more thought, about the architecture.
I agree that calling <next_pow2> inside <prev_pow2> is not a good idea. But I would also like to avoid duplicating code. So I can propose the following small design for this part of the code (in pseudo C++):

template<Parameter, typename T> base2pow2(T n) { .... here is the main part of the code parametrized by <Parameter> .... }
template<typename T> next_pow2 (T n) { base2pow2<parameter_1>(n); }
template<typename T> prev_pow2 (T n) { base2pow2<parameter_2>(n); }

An extra parametrized base2pow2 would be added, which the compiler should inline into next_pow2 and prev_pow2. I see two benefits in it:

it avoids repeated code;

it allows the functionality to be extended: new functions such as prev_prev_pow2 (the power of two one step below prev_pow2) or next_next_pow2 (the power of two one step above next_pow2) could be added if the need arises. They would be functions that call base2pow2 with different parameters.
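
The shape of this proposal might look like the sketch below. Every name in it is hypothetical (Pow2Round, the _u32 wrappers, and the body of base2pow2 are not taken from the PR), the type is fixed to 32 bits, and C++14 or newer is assumed. It only illustrates how one parametrized core could serve both directions.

```cpp
#include <cstdint>

// Hypothetical rounding selector used as the template parameter.
enum class Pow2Round { Down, Up };

// Shared core. Precondition for Up: n <= 2^31, otherwise x + 1 wraps to 0.
template <Pow2Round R>
constexpr std::uint32_t base2pow2(std::uint32_t n) {
  if (n <= 1) return R == Pow2Round::Up ? 1u : n;
  // Rounding up starts from n - 1 so that exact powers of two are preserved.
  std::uint32_t x = (R == Pow2Round::Up) ? n - 1 : n;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  // x is now 2^(k+1) - 1: step up to 2^(k+1) or keep only the top bit 2^k.
  return R == Pow2Round::Up ? x + 1 : x - (x >> 1);
}

constexpr std::uint32_t next_pow2_u32(std::uint32_t n) {
  return base2pow2<Pow2Round::Up>(n);
}
constexpr std::uint32_t prev_pow2_u32(std::uint32_t n) {
  return base2pow2<Pow2Round::Down>(n);
}

static_assert(next_pow2_u32(5) == 8, "rounds up");
static_assert(prev_pow2_u32(5) == 4, "rounds down");
static_assert(next_pow2_u32(8) == 8 && prev_pow2_u32(8) == 8, "exact power");
```

The trade-off is that both wrappers now share the branches inside base2pow2, whereas separate implementations keep each function as short as it can be.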

mzient · 2025-11-17T09:34:28Z

include/dali/core/util.h

+
+#if defined(__CUDA_ARCH__)
+  // CUDA DEVICE PATH
+  if constexpr (sizeof(U) == 4) {


Please swap the condition. Now the 64-bit variant would be used for 8 and 16-bit inputs, which is a big waste of resources. Use 64-bit __clzll only for 64-bit inputs and 32-bit otherwise.
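
A sketch of the width dispatch requested here, written as host code so that it compiles without nvcc. The names highest_bit and next_pow2_clz are hypothetical, and the GCC/Clang builtins __builtin_clz and __builtin_clzll stand in for the device intrinsics __clz and __clzll. It assumes the result is representable in T.

```cpp
#include <cstdint>
#include <type_traits>

// Index of the highest set bit. Precondition: x > 0.
// Only 64-bit inputs take the 64-bit path; 8-, 16- and 32-bit inputs are
// widened to 32 bits. sizeof is a compile-time constant, so the untaken
// branch is folded away.
template <typename T>
constexpr int highest_bit(T x) {
  using U = typename std::make_unsigned<T>::type;
  return sizeof(U) == 8
      ? 63 - __builtin_clzll(static_cast<unsigned long long>(static_cast<U>(x)))
      : 31 - __builtin_clz(static_cast<unsigned>(static_cast<U>(x)));
}

// Smallest power of two >= x; returns 1 for x <= 1.
template <typename T>
constexpr T next_pow2_clz(T x) {
  using U = typename std::make_unsigned<T>::type;
  if (x <= 1) return 1;
  return static_cast<T>(U(1) << (highest_bit(static_cast<U>(x - 1)) + 1));
}

static_assert(next_pow2_clz<std::uint8_t>(5) == 8, "8-bit input");
static_assert(next_pow2_clz<std::uint16_t>(300) == 512, "16-bit input");
static_assert(next_pow2_clz(1000) == 1024, "32-bit input");
static_assert(next_pow2_clz<std::int64_t>((1LL << 40) + 1) == (1LL << 41),
              "64-bit input");
```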

mzient · 2025-11-17T09:50:51Z

The current implementation has, in fact O(log2(N)) complexity, not O(N) as claimed in the description.
The proposed log2(N) solution has no advantage over existing code - in fact, it's much more complex and very likely slower.

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

constexpr int next_pow2(int x)
{
    if (x <= 1)
        return 1;
    int y = 1 << (31 - __builtin_clz(x));
    return x > y ? y << 1 : y;
}

Regarding prev_pow2 - it makes no sense to implement it in terms of next_pow2, because it's actually much simpler when using __clz:

constexpr int prev_pow2(int x)
{
    if (x < 1)
        return 0;
    return 1 << (31 - __builtin_clz(x));
}

I'd recommend keeping separate implementations for those two functions.
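
As a quick check of the edge cases, the two functions above can be exercised at compile time. This sketch copies them unchanged and assumes GCC or Clang, where __builtin_clz is usable in constant expressions; the static_assert lines are additions for illustration only.

```cpp
// Copied from the review comment above (int only).
constexpr int next_pow2(int x)
{
    if (x <= 1)
        return 1;
    int y = 1 << (31 - __builtin_clz(x));
    return x > y ? y << 1 : y;
}

constexpr int prev_pow2(int x)
{
    if (x < 1)
        return 0;
    return 1 << (31 - __builtin_clz(x));
}

// Added checks. Inputs above 2^30 are left out for next_pow2 because
// y << 1 would not fit in int.
static_assert(next_pow2(0) == 1 && next_pow2(1) == 1, "small inputs");
static_assert(next_pow2(3) == 4 && next_pow2(1024) == 1024, "round up / exact");
static_assert(next_pow2(1025) == 2048, "just above a power of two");
static_assert(prev_pow2(-5) == 0 && prev_pow2(0) == 0, "non-positive inputs");
static_assert(prev_pow2(3) == 2 && prev_pow2(1024) == 1024, "round down / exact");
static_assert(prev_pow2(1025) == 1024, "just above a power of two");
```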

ivHeisser · 2025-11-17T21:34:14Z

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

I can also add the support of MSVC with _lzcnt_u32 / _lzcnt_u64, if it is needed.

mzient · 2025-11-18T10:10:13Z

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

I can also add the support of MSVC with _lzcnt_u32 / _lzcnt_u64, if it is needed.

I wouldn't do it, since our code won't compile with MSVC anyway and there's no way we could test it.

JanuszL

Please apply changes discussed in review.

greptile-apps

Additional Comments (1)

include/dali/core/util.h, line 231 (link)

logic: returns true for n=0 which contradicts mathematical definition of powers of 2 and the documentation stating n <= 0 should return false

1 file reviewed, 1 comment

greptile-apps bot reviewed Nov 15, 2025

JanuszL reviewed Nov 17, 2025

mzient reviewed Nov 17, 2025

JanuszL assigned mzient and JanuszL Nov 17, 2025

JanuszL requested changes Dec 1, 2025

ivHeisser added 2 commits December 2, 2025 01:30

f0ce8f4: added 8bit and 16bit support for GPU part and pragma unroll for CPU part in <next_pow2> function in <util.h> module

e1dadb6: Merge branch 'NVIDIA:main' into main

greptile-apps bot reviewed Dec 2, 2025
