Why Are Some Operators Slower with CUDA Graph Enabled on Qwen3-235B-A22B-FP8 Using FP8 GEMM Kernels? #7436

yhyang201 · 2025-06-22T09:28:34Z

May I ask why some operators appear to run slower with CUDA Graph enabled than with it disabled?

One thing we've observed is that the fp8_gemm_kernel template instantiations in deepgemm appear to differ slightly between the two configurations.
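To make this kind of comparison concrete, here is a minimal sketch of how per-kernel timings from the two runs could be lined up. It assumes each run's profile has already been exported as a list of (kernel name, duration in microseconds) pairs. The helper names `summarize` and `compare` are hypothetical, and the kernel names and durations below are illustrative placeholders, not measurements from our runs.

```python
from collections import defaultdict


def summarize(events):
    """Return the average duration (microseconds) per kernel name."""
    totals = defaultdict(lambda: [0.0, 0])  # name -> [sum of durations, count]
    for name, dur_us in events:
        totals[name][0] += dur_us
        totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


def compare(eager_events, graph_events):
    """Compare a run without CUDA Graph (eager) against a run with it."""
    eager = summarize(eager_events)
    graph = summarize(graph_events)
    # Kernels present in both runs: a ratio above 1 means slower with CUDA Graph.
    slowdown = {
        name: graph[name] / eager[name]
        for name in eager.keys() & graph.keys()
    }
    # Kernel names seen in only one run, e.g. different template instantiations.
    only_eager = sorted(eager.keys() - graph.keys())
    only_graph = sorted(graph.keys() - eager.keys())
    return slowdown, only_eager, only_graph


# Placeholder data (hypothetical names and timings, for illustration only).
eager_events = [
    ("fp8_gemm_kernel<A>", 10.0),
    ("fp8_gemm_kernel<A>", 12.0),
    ("other_kernel", 4.0),
]
graph_events = [
    ("fp8_gemm_kernel<B>", 15.0),
    ("other_kernel", 5.0),
]

slowdown, only_eager, only_graph = compare(eager_events, graph_events)
print(slowdown)    # per-kernel ratio for kernels found in both runs
print(only_eager)  # instantiations seen only without CUDA Graph
print(only_graph)  # instantiations seen only with CUDA Graph
```

Matching on the full kernel name, template arguments included, means a kernel that is instantiated differently in the two runs shows up in the "only" lists rather than in the ratio table, which makes it easier to separate a genuine slowdown of the same kernel from a change in which kernel was selected.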

The model we're using is Qwen/Qwen3-235B-A22B-FP8.

The test script is sglang.bench_one_batch, modified to be compatible with DP Attention. It can be found here:
https://gist.github.com/yhyang201/613624cd2f757bd6e3bca6d297ecfe1b