[cuda.coop]: add device-side `coop.warp.sum` benchmark with pynvbench #6846

NaderAlAwar · 2025-12-02T20:37:16Z

Description

This adds a Python benchmark for `coop.warp.sum` using pynvbench. The same benchmark already exists in C++ (#6431); this PR reimplements it in Python.

Comparing the two, we get these results from the pre-existing C++ benchmark (I deleted the results for types we don't support in cuda.coop right now):

| T{ct} | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|-------|---------|------------|-------|------------|-------|
|    I8 |    438x |  41.403 us | 1.11% |  33.060 us | 1.21% |
|   I16 |    308x |  41.496 us | 1.34% |  32.183 us | 1.32% |
|   I32 |    472x |  17.428 us | 1.44% |   9.120 us | 3.03% |
|   I64 |    652x |  61.092 us | 3.58% |  52.017 us | 3.93% |
|   F16 |    562x |  37.591 us | 0.80% |  29.853 us | 1.12% |
|   F32 |    462x |  38.499 us | 0.46% |  29.664 us | 1.23% |
|   F64 |    416x | 226.925 us | 0.14% | 218.994 us | 0.14% |

and these results from the new Python benchmark:

| T{ct} | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-------|---------|------------|--------|------------|-------|
|    I8 |    436x |  53.733 us |  4.15% |  33.017 us | 1.26% |
|   I16 |    326x |  53.525 us |  5.77% |  32.181 us | 1.40% |
|   I32 |    664x |  28.928 us |  4.75% |   9.101 us | 3.42% |
|   I64 |    486x |  77.020 us |  2.57% |  55.843 us | 0.81% |
|   F16 |    624x |  54.038 us | 16.46% |  32.318 us | 1.85% |
|   F32 |    428x |  50.040 us |  2.32% |  29.422 us | 1.36% |
|   F64 |    474x | 238.258 us |  0.78% | 218.305 us | 0.21% |

The GPU times closely match (within about 1% for most types, with I64 and F16 roughly 7-8% slower in Python), but the Python implementation's CPU times are roughly 11-16 us higher, likely due to overhead introduced by the numba-cuda kernel call.
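For reference, the comparison can be recomputed from the two tables with a short script. This is only a sketch: the numbers are copied from the tables above, and the helper name `compare` is made up for illustration.

```python
# Times in microseconds, copied from the two tables above: (CPU time, GPU time).
cpp = {
    "I8": (41.403, 33.060), "I16": (41.496, 32.183), "I32": (17.428, 9.120),
    "I64": (61.092, 52.017), "F16": (37.591, 29.853), "F32": (38.499, 29.664),
    "F64": (226.925, 218.994),
}
py = {
    "I8": (53.733, 33.017), "I16": (53.525, 32.181), "I32": (28.928, 9.101),
    "I64": (77.020, 55.843), "F16": (54.038, 32.318), "F32": (50.040, 29.422),
    "F64": (238.258, 218.305),
}


def compare(cpp_times, py_times):
    """Return {type: (extra CPU time in us, relative GPU time difference in %)}."""
    rows = {}
    for name, (cpp_cpu, cpp_gpu) in cpp_times.items():
        py_cpu, py_gpu = py_times[name]
        cpu_delta_us = py_cpu - cpp_cpu                 # absolute CPU-side overhead
        gpu_rel_pct = (py_gpu - cpp_gpu) / cpp_gpu * 100  # relative GPU difference
        rows[name] = (cpu_delta_us, gpu_rel_pct)
    return rows


rows = compare(cpp, py)
for name, (cpu_delta_us, gpu_rel_pct) in rows.items():
    print(f"{name:>4}: CPU {cpu_delta_us:+7.3f} us, GPU {gpu_rel_pct:+6.2f}%")
```

The extra CPU time is reported as an absolute difference rather than a ratio because it stays in a similar range across types even as the kernel time varies, which is consistent with a fixed per-launch cost.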

One other thing of note is that the SASS for the two versions is identical, except for an extra NOP in the Python version that appears after the redux instruction:

REDUX.SUM.S32 UR4, R0 ;
NOP ; <- not present in the C++ version

I spent some time investigating this and have a minimal reproducer showing that the issue comes from numba-cuda. I will open an issue to track this.

This PR will also introduce pynvbench as an optional dependency (haven't implemented this yet).
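Since the optional dependency is not implemented yet, the following is only a sketch of one common way to guard an optional import in Python. The module name `pynvbench` is a placeholder assumption (the actual import name may differ), and `optional_import_available` is a hypothetical helper, not part of cuda.coop.

```python
import importlib.util


def optional_import_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Placeholder module name (assumption); substitute the real import name.
BENCH_MODULE = "pynvbench"

if optional_import_available(BENCH_MODULE):
    print("benchmark dependency found; benchmarks can run")
else:
    print("benchmark dependency missing; skipping benchmarks")
```

Checking with `find_spec` avoids importing the package just to test for its presence, so the rest of the package is unaffected when the benchmark dependency is not installed.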

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.


copy-pr-bot · 2025-12-02T20:37:20Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

NaderAlAwar added 3 commits December 2, 2025 12:21

Add initial version of warp reduce benchmark in python

30fa617

Follow C++ benchmark more closely by a) creating an intrinsic that mimics the memcpy used to generate the random input data b) calculating the grid size instead of using grid 1 c) not passing the input stream

394e1a0

Use an axis for types more consistent with C++

99ffdbb

github-project-automation bot added this to CCCL Dec 2, 2025

github-project-automation bot moved this to Todo in CCCL Dec 2, 2025

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 2, 2025
