@NaderAlAwar
Contributor

Description

closes #6606

This adds a Python benchmark for coop.warp.sum using pynvbench. The same benchmark was already implemented in C++ in #6431; this PR reimplements it in Python.

Comparing the two, we get these results from the pre-existing C++ benchmark (I deleted the results for types we don't support in cuda.coop right now):

| T{ct} | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|-------|---------|------------|-------|------------|-------|
|    I8 |    438x |  41.403 us | 1.11% |  33.060 us | 1.21% |
|   I16 |    308x |  41.496 us | 1.34% |  32.183 us | 1.32% |
|   I32 |    472x |  17.428 us | 1.44% |   9.120 us | 3.03% |
|   I64 |    652x |  61.092 us | 3.58% |  52.017 us | 3.93% |
|   F16 |    562x |  37.591 us | 0.80% |  29.853 us | 1.12% |
|   F32 |    462x |  38.499 us | 0.46% |  29.664 us | 1.23% |
|   F64 |    416x | 226.925 us | 0.14% | 218.994 us | 0.14% |

and these results from the new Python benchmark:

| T{ct} | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-------|---------|------------|--------|------------|-------|
|    I8 |    436x |  53.733 us |  4.15% |  33.017 us | 1.26% |
|   I16 |    326x |  53.525 us |  5.77% |  32.181 us | 1.40% |
|   I32 |    664x |  28.928 us |  4.75% |   9.101 us | 3.42% |
|   I64 |    486x |  77.020 us |  2.57% |  55.843 us | 0.81% |
|   F16 |    624x |  54.038 us | 16.46% |  32.318 us | 1.85% |
|   F32 |    428x |  50.040 us |  2.32% |  29.422 us | 1.36% |
|   F64 |    474x | 238.258 us |  0.78% | 218.305 us | 0.21% |

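The GPU times are nearly identical for most types (I64 and F16 are about 7-8% slower in Python), but the Python implementation has consistently larger CPU times, by roughly 11-16 us per type, likely due to overhead introduced by the numba-cuda kernel call.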
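For reviewers who want to check the comparison themselves, here is a minimal sketch that tabulates the per-type differences. The numbers are copied from the two tables above; the names `RESULTS` and `compare` are made up for this illustration and are not part of the benchmark or of pynvbench.

```python
# Each row: (type, C++ CPU us, C++ GPU us, Python CPU us, Python GPU us),
# copied by hand from the two result tables in this description.
RESULTS = [
    ("I8", 41.403, 33.060, 53.733, 33.017),
    ("I16", 41.496, 32.183, 53.525, 32.181),
    ("I32", 17.428, 9.120, 28.928, 9.101),
    ("I64", 61.092, 52.017, 77.020, 55.843),
    ("F16", 37.591, 29.853, 54.038, 32.318),
    ("F32", 38.499, 29.664, 50.040, 29.422),
    ("F64", 226.925, 218.994, 238.258, 218.305),
]


def compare(results):
    """Return (type, CPU delta in us, GPU delta in percent) per row.

    Deltas are Python minus C++, so positive means Python is slower.
    """
    rows = []
    for name, cpp_cpu, cpp_gpu, py_cpu, py_gpu in results:
        cpu_delta_us = py_cpu - cpp_cpu
        gpu_delta_pct = 100.0 * (py_gpu - cpp_gpu) / cpp_gpu
        rows.append((name, cpu_delta_us, gpu_delta_pct))
    return rows


if __name__ == "__main__":
    for name, cpu_delta_us, gpu_delta_pct in compare(RESULTS):
        print(f"{name:>4}: CPU +{cpu_delta_us:6.3f} us, GPU {gpu_delta_pct:+5.2f}%")
```

The CPU delta is reported in absolute microseconds rather than as a percentage because a fixed per-launch cost would show up as a roughly constant offset regardless of kernel duration.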

One other thing of note is that the SASS for the two versions is identical, except for an extra NOP in the Python version that appears after the redux instruction:

REDUX.SUM.S32 UR4, R0 ;
NOP ; <- not present in the C++ version

I spent some time investigating this and have a minimal reproducer that shows the issue comes from numba-cuda. I will open an issue to track this.

This PR will also introduce pynvbench as an optional dependency (not implemented yet).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

eating an intrinsic that mimics the memcpy used to generate the random input data b) calculating the grid size instead of using grid 1 c) not passing the input stream
@copy-pr-bot
Contributor

copy-pr-bot bot commented Dec 2, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 2, 2025
Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Add WarpReduce Device-Side Benchmarks for cuda.coop
