Description
While benchmarking NVIDIA/cccl#7449, I was running batch+cold benchmarks for the baseline and comparison. The results looked like this:
## [0] NVIDIA B200
| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I8 | I32 | 2^16 | 6.216 us | 2.94% | 6.248 us | 4.14% | 0.032 us | 0.51% | SAME |
| I8 | I32 | 2^20 | 7.596 us | 8.27% | 7.765 us | 8.13% | 0.170 us | 2.23% | SAME |
| I8 | I32 | 2^24 | 14.967 us | 2.65% | 14.818 us | 3.28% | -0.149 us | -0.99% | SAME |
| I8 | I32 | 2^28 | 116.804 us | 0.58% | 115.438 us | 0.62% | -1.367 us | -1.17% | FAST |
| I8 | I64 | 2^16 | 6.672 us | 5.87% | 7.145 us | 1.34% | 0.473 us | 7.09% | SLOW |
| I8 | I64 | 2^20 | 7.696 us | 8.09% | 7.453 us | 7.45% | -0.244 us | -3.16% | SAME |
| I8 | I64 | 2^24 | 14.952 us | 3.16% | 14.796 us | 2.99% | -0.156 us | -1.05% | SAME |
| I8 | I64 | 2^28 | 117.424 us | 0.82% | 115.719 us | 0.22% | -1.705 us | -1.45% | FAST |
| I8 | I64 | 2^32 | 1.744 ms | 0.04% | 1.709 ms | 0.02% | -35.097 us | -2.01% | FAST |
| I16 | I32 | 2^16 | 6.674 us | 6.17% | 7.153 us | 1.96% | 0.479 us | 7.18% | SLOW |
| I16 | I32 | 2^20 | 7.894 us | 11.14% | 7.987 us | 9.14% | 0.093 us | 1.18% | SAME |
| I16 | I32 | 2^24 | 18.247 us | 3.33% | 19.027 us | 2.78% | 0.780 us | 4.27% | SLOW |
| I16 | I32 | 2^28 | 161.580 us | 0.55% | 183.287 us | 0.13% | 21.707 us | 13.43% | SLOW |
| I16 | I64 | 2^16 | 6.772 us | 6.40% | 7.127 us | 2.37% | 0.355 us | 5.25% | SLOW |
| I16 | I64 | 2^20 | 7.842 us | 10.96% | 7.939 us | 9.54% | 0.097 us | 1.24% | SAME |
| I16 | I64 | 2^24 | 18.425 us | 1.48% | 18.542 us | 2.33% | 0.117 us | 0.64% | SAME |
| I16 | I64 | 2^28 | 162.397 us | 0.25% | 173.081 us | 0.18% | 10.683 us | 6.58% | SLOW |
| I16 | I64 | 2^32 | 2.467 ms | 0.23% | 2.644 ms | 0.04% | 176.730 us | 7.16% | SLOW |
| F32 | I32 | 2^16 | 6.732 us | 5.64% | 7.149 us | 1.92% | 0.417 us | 6.19% | SLOW |
| F32 | I32 | 2^20 | 8.228 us | 2.03% | 8.529 us | 5.47% | 0.301 us | 3.66% | SLOW |
| F32 | I32 | 2^24 | 27.246 us | 1.63% | 27.051 us | 1.76% | -0.196 us | -0.72% | SAME |
| F32 | I32 | 2^28 | 314.382 us | 0.32% | 314.495 us | 0.33% | 0.113 us | 0.04% | SAME |
| F32 | I64 | 2^16 | 6.736 us | 6.45% | 7.153 us | 5.38% | 0.417 us | 6.18% | SLOW |
| F32 | I64 | 2^20 | 8.238 us | 2.37% | 8.281 us | 2.99% | 0.043 us | 0.52% | SAME |
| F32 | I64 | 2^24 | 27.238 us | 1.47% | 27.594 us | 0.72% | 0.356 us | 1.31% | SLOW |
| F32 | I64 | 2^28 | 314.291 us | 0.33% | 314.429 us | 0.29% | 0.139 us | 0.04% | SAME |
| F32 | I64 | 2^32 | 4.913 ms | 0.06% | 4.914 ms | 0.07% | 0.760 us | 0.02% | SAME |
| F64 | I32 | 2^16 | 6.245 us | 5.34% | 7.152 us | 1.86% | 0.907 us | 14.52% | SLOW |
| F64 | I32 | 2^20 | 9.499 us | 7.34% | 9.661 us | 7.26% | 0.162 us | 1.70% | SAME |
| F64 | I32 | 2^24 | 47.058 us | 2.05% | 47.119 us | 1.87% | 0.061 us | 0.13% | SAME |
| F64 | I32 | 2^28 | 621.279 us | 0.22% | 621.277 us | 0.22% | -0.002 us | -0.00% | SAME |
| F64 | I64 | 2^16 | 6.389 us | 8.82% | 7.159 us | 1.29% | 0.770 us | 12.06% | SLOW |
| F64 | I64 | 2^20 | 9.392 us | 6.80% | 9.453 us | 6.44% | 0.061 us | 0.65% | SAME |
| F64 | I64 | 2^24 | 47.015 us | 1.99% | 47.163 us | 1.89% | 0.148 us | 0.31% | SAME |
| F64 | I64 | 2^28 | 621.189 us | 0.22% | 621.299 us | 0.22% | 0.110 us | 0.02% | SAME |
| F64 | I64 | 2^32 | 9.824 ms | 0.14% | 9.824 ms | 0.14% | -0.320 us | -0.00% | SAME |
| I128 | I32 | 2^16 | 6.941 us | 8.10% | 7.194 us | 4.19% | 0.253 us | 3.64% | SAME |
| I128 | I32 | 2^20 | 12.165 us | 7.30% | 12.186 us | 6.00% | 0.021 us | 0.17% | SAME |
| I128 | I32 | 2^24 | 85.657 us | 1.06% | 85.617 us | 1.11% | -0.039 us | -0.05% | SAME |
| I128 | I32 | 2^28 | 1.235 ms | 0.15% | 1.235 ms | 0.14% | -0.029 us | -0.00% | SAME |
| I128 | I64 | 2^16 | 6.934 us | 8.14% | 7.148 us | 1.73% | 0.214 us | 3.09% | SLOW |
| I128 | I64 | 2^20 | 12.187 us | 7.10% | 12.230 us | 5.74% | 0.043 us | 0.36% | SAME |
| I128 | I64 | 2^24 | 85.653 us | 1.04% | 85.774 us | 1.08% | 0.121 us | 0.14% | SAME |
| I128 | I64 | 2^28 | 1.235 ms | 0.15% | 1.236 ms | 0.16% | 0.496 us | 0.04% | SAME |
| I128 | I64 | 2^32 | 19.657 ms | 0.09% | 19.657 ms | 0.10% | 0.629 us | 0.00% | SAME |
After I added the no_batch exec tag and reran the baseline and comparison benchmarks, I got the following diff:
## [0] NVIDIA B200
| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I8 | I32 | 2^16 | 6.278 us | 2.67% | 6.281 us | 2.73% | 0.003 us | 0.04% | SAME |
| I8 | I32 | 2^20 | 7.603 us | 7.82% | 7.657 us | 7.86% | 0.055 us | 0.72% | SAME |
| I8 | I32 | 2^24 | 14.925 us | 3.10% | 14.916 us | 3.27% | -0.009 us | -0.06% | SAME |
| I8 | I32 | 2^28 | 116.646 us | 0.66% | 116.922 us | 0.54% | 0.276 us | 0.24% | SAME |
| I8 | I64 | 2^16 | 6.732 us | 6.25% | 6.594 us | 7.07% | -0.138 us | -2.04% | SAME |
| I8 | I64 | 2^20 | 7.685 us | 7.74% | 7.799 us | 7.05% | 0.114 us | 1.48% | SAME |
| I8 | I64 | 2^24 | 14.973 us | 3.42% | 14.865 us | 3.93% | -0.108 us | -0.72% | SAME |
| I8 | I64 | 2^28 | 117.184 us | 0.64% | 117.746 us | 0.53% | 0.562 us | 0.48% | SAME |
| I8 | I64 | 2^32 | 1.744 ms | 0.04% | 1.746 ms | 0.02% | 1.703 us | 0.10% | SLOW |
| I16 | I32 | 2^16 | 6.704 us | 6.87% | 7.117 us | 2.83% | 0.413 us | 6.16% | SLOW |
| I16 | I32 | 2^20 | 7.741 us | 9.77% | 7.606 us | 10.99% | -0.135 us | -1.74% | SAME |
| I16 | I32 | 2^24 | 18.177 us | 4.25% | 17.687 us | 4.00% | -0.490 us | -2.69% | SAME |
| I16 | I32 | 2^28 | 161.497 us | 0.51% | 161.063 us | 0.38% | -0.434 us | -0.27% | SAME |
| I16 | I64 | 2^16 | 6.758 us | 6.65% | 7.101 us | 2.96% | 0.343 us | 5.08% | SLOW |
| I16 | I64 | 2^20 | 7.823 us | 10.24% | 7.672 us | 11.39% | -0.151 us | -1.93% | SAME |
| I16 | I64 | 2^24 | 18.536 us | 1.20% | 19.375 us | 1.58% | 0.839 us | 4.52% | SLOW |
| I16 | I64 | 2^28 | 162.395 us | 0.29% | 177.118 us | 0.12% | 14.723 us | 9.07% | SLOW |
| I16 | I64 | 2^32 | 2.465 ms | 0.21% | 2.706 ms | 0.02% | 241.561 us | 9.80% | SLOW |
| F32 | I32 | 2^16 | 6.740 us | 6.52% | 7.110 us | 3.12% | 0.370 us | 5.49% | SLOW |
| F32 | I32 | 2^20 | 8.316 us | 2.28% | 9.127 us | 2.93% | 0.811 us | 9.75% | SLOW |
| F32 | I32 | 2^24 | 27.209 us | 1.90% | 27.597 us | 0.95% | 0.388 us | 1.42% | SLOW |
| F32 | I32 | 2^28 | 314.389 us | 0.32% | 314.550 us | 0.32% | 0.160 us | 0.05% | SAME |
| F32 | I64 | 2^16 | 6.726 us | 7.81% | 7.121 us | 2.67% | 0.396 us | 5.89% | SLOW |
| F32 | I64 | 2^20 | 8.314 us | 2.32% | 8.792 us | 5.85% | 0.478 us | 5.75% | SLOW |
| F32 | I64 | 2^24 | 27.223 us | 1.82% | 27.165 us | 2.02% | -0.057 us | -0.21% | SAME |
| F32 | I64 | 2^28 | 314.410 us | 0.32% | 314.415 us | 0.34% | 0.004 us | 0.00% | SAME |
| F32 | I64 | 2^32 | 4.913 ms | 0.07% | 4.913 ms | 0.06% | -0.076 us | -0.00% | SAME |
| F64 | I32 | 2^16 | 6.305 us | 4.45% | 6.313 us | 5.20% | 0.007 us | 0.11% | SAME |
| F64 | I32 | 2^20 | 9.564 us | 6.67% | 9.574 us | 6.77% | 0.009 us | 0.10% | SAME |
| F64 | I32 | 2^24 | 47.068 us | 2.01% | 47.027 us | 1.87% | -0.041 us | -0.09% | SAME |
| F64 | I32 | 2^28 | 621.249 us | 0.22% | 621.143 us | 0.22% | -0.107 us | -0.02% | SAME |
| F64 | I64 | 2^16 | 6.389 us | 8.14% | 6.357 us | 7.21% | -0.032 us | -0.50% | SAME |
| F64 | I64 | 2^20 | 9.537 us | 6.60% | 9.557 us | 6.64% | 0.020 us | 0.21% | SAME |
| F64 | I64 | 2^24 | 47.035 us | 1.96% | 47.048 us | 1.84% | 0.014 us | 0.03% | SAME |
| F64 | I64 | 2^28 | 621.240 us | 0.20% | 621.140 us | 0.21% | -0.100 us | -0.02% | SAME |
| F64 | I64 | 2^32 | 9.824 ms | 0.14% | 9.823 ms | 0.13% | -0.903 us | -0.01% | SAME |
| I128 | I32 | 2^16 | 7.102 us | 9.65% | 7.126 us | 8.99% | 0.025 us | 0.35% | SAME |
| I128 | I32 | 2^20 | 12.169 us | 6.68% | 12.194 us | 6.66% | 0.025 us | 0.21% | SAME |
| I128 | I32 | 2^24 | 85.718 us | 1.03% | 85.601 us | 1.04% | -0.118 us | -0.14% | SAME |
| I128 | I32 | 2^28 | 1.235 ms | 0.14% | 1.235 ms | 0.14% | -0.358 us | -0.03% | SAME |
| I128 | I64 | 2^16 | 7.015 us | 8.63% | 6.942 us | 8.74% | -0.072 us | -1.03% | SAME |
| I128 | I64 | 2^20 | 12.189 us | 6.96% | 12.221 us | 6.71% | 0.033 us | 0.27% | SAME |
| I128 | I64 | 2^24 | 85.678 us | 1.07% | 85.693 us | 1.03% | 0.015 us | 0.02% | SAME |
| I128 | I64 | 2^28 | 1.235 ms | 0.15% | 1.235 ms | 0.15% | -0.163 us | -0.01% | SAME |
| I128 | I64 | 2^32 | 19.655 ms | 0.09% | 19.656 ms | 0.09% | 0.529 us | 0.00% | SAME |
The diffs were produced with nvbench_compare.py, which AFAIK only compares cold measurements. Thus, the presence of batch benchmarks seems to significantly impact the cold benchmarks, since the first diff (batch+cold benchmark run, only showing cold) is a LOT more shaky than the second diff (cold-only benchmark run, showing cold).
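For readers unfamiliar with the Status column: the rows above look consistent with a simple rule where a row is SAME when the relative difference is within the smaller of the two noise values, and FAST/SLOW otherwise. The sketch below is an assumption inferred from the table rows, not copied from nvbench_compare.py, and the function name `classify` is hypothetical.

```python
def classify(ref_time, ref_noise, cmp_time, cmp_noise):
    """Classify a row as SAME, FAST, or SLOW.

    Assumed rule (inferred from the tables, not from the script source):
    the relative difference is compared against the smaller noise value.
    Times share one unit; noise values are fractions (2.94% -> 0.0294).
    """
    diff = cmp_time - ref_time
    frac_diff = diff / ref_time
    min_noise = min(ref_noise, cmp_noise)
    if abs(frac_diff) <= min_noise:
        return "SAME"
    return "FAST" if diff < 0 else "SLOW"


# Rows taken from the first (batch+cold) table above.
rows = [
    (6.216, 0.0294, 6.248, 0.0414),      # I8,   I32, 2^16
    (116.804, 0.0058, 115.438, 0.0062),  # I8,   I32, 2^28
    (6.672, 0.0587, 7.145, 0.0134),      # I8,   I64, 2^16
    (6.941, 0.0810, 7.194, 0.0419),      # I128, I32, 2^16
]
statuses = [classify(*row) for row in rows]
print(statuses)
```

If this rule is right, a row with very low noise gets flagged on tiny differences, so a shift in the measured noise between runs can change the Status even when the times barely move.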
This makes me question whether there is a bug in nvbench, like a missing L2 flush before a cold benchmark (or after a batch benchmark), or whether batch and cold benchmarks should never run back to back.
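To compare the two runs more objectively than by eyeballing the tables, the Status column could be tallied per table. This is a minimal sketch that assumes the markdown layout shown above (Status as the last column); `tally_status` is a hypothetical helper, not part of nvbench.

```python
from collections import Counter


def tally_status(table_text):
    """Count SAME/FAST/SLOW entries in the last column of a markdown table."""
    counts = Counter()
    for line in table_text.splitlines():
        # Drop the outer pipes, then split the row into cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Header and separator rows are skipped because their last cell
        # is not one of the three status values.
        if cells and cells[-1] in {"SAME", "FAST", "SLOW"}:
            counts[cells[-1]] += 1
    return counts


# Three rows copied from the first table, plus its header and separator.
sample = """\
| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I8 | I32 | 2^16 | 6.216 us | 2.94% | 6.248 us | 4.14% | 0.032 us | 0.51% | SAME |
| I8 | I32 | 2^28 | 116.804 us | 0.58% | 115.438 us | 0.62% | -1.367 us | -1.17% | FAST |
| I8 | I64 | 2^16 | 6.672 us | 5.87% | 7.145 us | 1.34% | 0.473 us | 7.09% | SLOW |
"""
counts = tally_status(sample)
print(dict(counts))
```

Running this over the full output of both runs would give a per-run count of non-SAME rows to put next to each other.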