Running batch benchmarks affects cold benchmarks #310

@bernhardmgruber

While benchmarking NVIDIA/cccl#7449, I was running batch+cold benchmarks for the baseline and comparison. The results looked like this:

## [0] NVIDIA B200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |      I32      |      2^16      |   6.216 us |       2.94% |   6.248 us |       4.14% |   0.032 us |   0.51% |   SAME   |
|   I8    |      I32      |      2^20      |   7.596 us |       8.27% |   7.765 us |       8.13% |   0.170 us |   2.23% |   SAME   |
|   I8    |      I32      |      2^24      |  14.967 us |       2.65% |  14.818 us |       3.28% |  -0.149 us |  -0.99% |   SAME   |
|   I8    |      I32      |      2^28      | 116.804 us |       0.58% | 115.438 us |       0.62% |  -1.367 us |  -1.17% |   FAST   |
|   I8    |      I64      |      2^16      |   6.672 us |       5.87% |   7.145 us |       1.34% |   0.473 us |   7.09% |   SLOW   |
|   I8    |      I64      |      2^20      |   7.696 us |       8.09% |   7.453 us |       7.45% |  -0.244 us |  -3.16% |   SAME   |
|   I8    |      I64      |      2^24      |  14.952 us |       3.16% |  14.796 us |       2.99% |  -0.156 us |  -1.05% |   SAME   |
|   I8    |      I64      |      2^28      | 117.424 us |       0.82% | 115.719 us |       0.22% |  -1.705 us |  -1.45% |   FAST   |
|   I8    |      I64      |      2^32      |   1.744 ms |       0.04% |   1.709 ms |       0.02% | -35.097 us |  -2.01% |   FAST   |
|   I16   |      I32      |      2^16      |   6.674 us |       6.17% |   7.153 us |       1.96% |   0.479 us |   7.18% |   SLOW   |
|   I16   |      I32      |      2^20      |   7.894 us |      11.14% |   7.987 us |       9.14% |   0.093 us |   1.18% |   SAME   |
|   I16   |      I32      |      2^24      |  18.247 us |       3.33% |  19.027 us |       2.78% |   0.780 us |   4.27% |   SLOW   |
|   I16   |      I32      |      2^28      | 161.580 us |       0.55% | 183.287 us |       0.13% |  21.707 us |  13.43% |   SLOW   |
|   I16   |      I64      |      2^16      |   6.772 us |       6.40% |   7.127 us |       2.37% |   0.355 us |   5.25% |   SLOW   |
|   I16   |      I64      |      2^20      |   7.842 us |      10.96% |   7.939 us |       9.54% |   0.097 us |   1.24% |   SAME   |
|   I16   |      I64      |      2^24      |  18.425 us |       1.48% |  18.542 us |       2.33% |   0.117 us |   0.64% |   SAME   |
|   I16   |      I64      |      2^28      | 162.397 us |       0.25% | 173.081 us |       0.18% |  10.683 us |   6.58% |   SLOW   |
|   I16   |      I64      |      2^32      |   2.467 ms |       0.23% |   2.644 ms |       0.04% | 176.730 us |   7.16% |   SLOW   |
|   F32   |      I32      |      2^16      |   6.732 us |       5.64% |   7.149 us |       1.92% |   0.417 us |   6.19% |   SLOW   |
|   F32   |      I32      |      2^20      |   8.228 us |       2.03% |   8.529 us |       5.47% |   0.301 us |   3.66% |   SLOW   |
|   F32   |      I32      |      2^24      |  27.246 us |       1.63% |  27.051 us |       1.76% |  -0.196 us |  -0.72% |   SAME   |
|   F32   |      I32      |      2^28      | 314.382 us |       0.32% | 314.495 us |       0.33% |   0.113 us |   0.04% |   SAME   |
|   F32   |      I64      |      2^16      |   6.736 us |       6.45% |   7.153 us |       5.38% |   0.417 us |   6.18% |   SLOW   |
|   F32   |      I64      |      2^20      |   8.238 us |       2.37% |   8.281 us |       2.99% |   0.043 us |   0.52% |   SAME   |
|   F32   |      I64      |      2^24      |  27.238 us |       1.47% |  27.594 us |       0.72% |   0.356 us |   1.31% |   SLOW   |
|   F32   |      I64      |      2^28      | 314.291 us |       0.33% | 314.429 us |       0.29% |   0.139 us |   0.04% |   SAME   |
|   F32   |      I64      |      2^32      |   4.913 ms |       0.06% |   4.914 ms |       0.07% |   0.760 us |   0.02% |   SAME   |
|   F64   |      I32      |      2^16      |   6.245 us |       5.34% |   7.152 us |       1.86% |   0.907 us |  14.52% |   SLOW   |
|   F64   |      I32      |      2^20      |   9.499 us |       7.34% |   9.661 us |       7.26% |   0.162 us |   1.70% |   SAME   |
|   F64   |      I32      |      2^24      |  47.058 us |       2.05% |  47.119 us |       1.87% |   0.061 us |   0.13% |   SAME   |
|   F64   |      I32      |      2^28      | 621.279 us |       0.22% | 621.277 us |       0.22% |  -0.002 us |  -0.00% |   SAME   |
|   F64   |      I64      |      2^16      |   6.389 us |       8.82% |   7.159 us |       1.29% |   0.770 us |  12.06% |   SLOW   |
|   F64   |      I64      |      2^20      |   9.392 us |       6.80% |   9.453 us |       6.44% |   0.061 us |   0.65% |   SAME   |
|   F64   |      I64      |      2^24      |  47.015 us |       1.99% |  47.163 us |       1.89% |   0.148 us |   0.31% |   SAME   |
|   F64   |      I64      |      2^28      | 621.189 us |       0.22% | 621.299 us |       0.22% |   0.110 us |   0.02% |   SAME   |
|   F64   |      I64      |      2^32      |   9.824 ms |       0.14% |   9.824 ms |       0.14% |  -0.320 us |  -0.00% |   SAME   |
|  I128   |      I32      |      2^16      |   6.941 us |       8.10% |   7.194 us |       4.19% |   0.253 us |   3.64% |   SAME   |
|  I128   |      I32      |      2^20      |  12.165 us |       7.30% |  12.186 us |       6.00% |   0.021 us |   0.17% |   SAME   |
|  I128   |      I32      |      2^24      |  85.657 us |       1.06% |  85.617 us |       1.11% |  -0.039 us |  -0.05% |   SAME   |
|  I128   |      I32      |      2^28      |   1.235 ms |       0.15% |   1.235 ms |       0.14% |  -0.029 us |  -0.00% |   SAME   |
|  I128   |      I64      |      2^16      |   6.934 us |       8.14% |   7.148 us |       1.73% |   0.214 us |   3.09% |   SLOW   |
|  I128   |      I64      |      2^20      |  12.187 us |       7.10% |  12.230 us |       5.74% |   0.043 us |   0.36% |   SAME   |
|  I128   |      I64      |      2^24      |  85.653 us |       1.04% |  85.774 us |       1.08% |   0.121 us |   0.14% |   SAME   |
|  I128   |      I64      |      2^28      |   1.235 ms |       0.15% |   1.236 ms |       0.16% |   0.496 us |   0.04% |   SAME   |
|  I128   |      I64      |      2^32      |  19.657 ms |       0.09% |  19.657 ms |       0.10% |   0.629 us |   0.00% |   SAME   |

After I added the no_batch exec tag and reran the baseline and comparison benchmarks, I got the following diff:

## [0] NVIDIA B200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |      I32      |      2^16      |   6.278 us |       2.67% |   6.281 us |       2.73% |   0.003 us |   0.04% |   SAME   |
|   I8    |      I32      |      2^20      |   7.603 us |       7.82% |   7.657 us |       7.86% |   0.055 us |   0.72% |   SAME   |
|   I8    |      I32      |      2^24      |  14.925 us |       3.10% |  14.916 us |       3.27% |  -0.009 us |  -0.06% |   SAME   |
|   I8    |      I32      |      2^28      | 116.646 us |       0.66% | 116.922 us |       0.54% |   0.276 us |   0.24% |   SAME   |
|   I8    |      I64      |      2^16      |   6.732 us |       6.25% |   6.594 us |       7.07% |  -0.138 us |  -2.04% |   SAME   |
|   I8    |      I64      |      2^20      |   7.685 us |       7.74% |   7.799 us |       7.05% |   0.114 us |   1.48% |   SAME   |
|   I8    |      I64      |      2^24      |  14.973 us |       3.42% |  14.865 us |       3.93% |  -0.108 us |  -0.72% |   SAME   |
|   I8    |      I64      |      2^28      | 117.184 us |       0.64% | 117.746 us |       0.53% |   0.562 us |   0.48% |   SAME   |
|   I8    |      I64      |      2^32      |   1.744 ms |       0.04% |   1.746 ms |       0.02% |   1.703 us |   0.10% |   SLOW   |
|   I16   |      I32      |      2^16      |   6.704 us |       6.87% |   7.117 us |       2.83% |   0.413 us |   6.16% |   SLOW   |
|   I16   |      I32      |      2^20      |   7.741 us |       9.77% |   7.606 us |      10.99% |  -0.135 us |  -1.74% |   SAME   |
|   I16   |      I32      |      2^24      |  18.177 us |       4.25% |  17.687 us |       4.00% |  -0.490 us |  -2.69% |   SAME   |
|   I16   |      I32      |      2^28      | 161.497 us |       0.51% | 161.063 us |       0.38% |  -0.434 us |  -0.27% |   SAME   |
|   I16   |      I64      |      2^16      |   6.758 us |       6.65% |   7.101 us |       2.96% |   0.343 us |   5.08% |   SLOW   |
|   I16   |      I64      |      2^20      |   7.823 us |      10.24% |   7.672 us |      11.39% |  -0.151 us |  -1.93% |   SAME   |
|   I16   |      I64      |      2^24      |  18.536 us |       1.20% |  19.375 us |       1.58% |   0.839 us |   4.52% |   SLOW   |
|   I16   |      I64      |      2^28      | 162.395 us |       0.29% | 177.118 us |       0.12% |  14.723 us |   9.07% |   SLOW   |
|   I16   |      I64      |      2^32      |   2.465 ms |       0.21% |   2.706 ms |       0.02% | 241.561 us |   9.80% |   SLOW   |
|   F32   |      I32      |      2^16      |   6.740 us |       6.52% |   7.110 us |       3.12% |   0.370 us |   5.49% |   SLOW   |
|   F32   |      I32      |      2^20      |   8.316 us |       2.28% |   9.127 us |       2.93% |   0.811 us |   9.75% |   SLOW   |
|   F32   |      I32      |      2^24      |  27.209 us |       1.90% |  27.597 us |       0.95% |   0.388 us |   1.42% |   SLOW   |
|   F32   |      I32      |      2^28      | 314.389 us |       0.32% | 314.550 us |       0.32% |   0.160 us |   0.05% |   SAME   |
|   F32   |      I64      |      2^16      |   6.726 us |       7.81% |   7.121 us |       2.67% |   0.396 us |   5.89% |   SLOW   |
|   F32   |      I64      |      2^20      |   8.314 us |       2.32% |   8.792 us |       5.85% |   0.478 us |   5.75% |   SLOW   |
|   F32   |      I64      |      2^24      |  27.223 us |       1.82% |  27.165 us |       2.02% |  -0.057 us |  -0.21% |   SAME   |
|   F32   |      I64      |      2^28      | 314.410 us |       0.32% | 314.415 us |       0.34% |   0.004 us |   0.00% |   SAME   |
|   F32   |      I64      |      2^32      |   4.913 ms |       0.07% |   4.913 ms |       0.06% |  -0.076 us |  -0.00% |   SAME   |
|   F64   |      I32      |      2^16      |   6.305 us |       4.45% |   6.313 us |       5.20% |   0.007 us |   0.11% |   SAME   |
|   F64   |      I32      |      2^20      |   9.564 us |       6.67% |   9.574 us |       6.77% |   0.009 us |   0.10% |   SAME   |
|   F64   |      I32      |      2^24      |  47.068 us |       2.01% |  47.027 us |       1.87% |  -0.041 us |  -0.09% |   SAME   |
|   F64   |      I32      |      2^28      | 621.249 us |       0.22% | 621.143 us |       0.22% |  -0.107 us |  -0.02% |   SAME   |
|   F64   |      I64      |      2^16      |   6.389 us |       8.14% |   6.357 us |       7.21% |  -0.032 us |  -0.50% |   SAME   |
|   F64   |      I64      |      2^20      |   9.537 us |       6.60% |   9.557 us |       6.64% |   0.020 us |   0.21% |   SAME   |
|   F64   |      I64      |      2^24      |  47.035 us |       1.96% |  47.048 us |       1.84% |   0.014 us |   0.03% |   SAME   |
|   F64   |      I64      |      2^28      | 621.240 us |       0.20% | 621.140 us |       0.21% |  -0.100 us |  -0.02% |   SAME   |
|   F64   |      I64      |      2^32      |   9.824 ms |       0.14% |   9.823 ms |       0.13% |  -0.903 us |  -0.01% |   SAME   |
|  I128   |      I32      |      2^16      |   7.102 us |       9.65% |   7.126 us |       8.99% |   0.025 us |   0.35% |   SAME   |
|  I128   |      I32      |      2^20      |  12.169 us |       6.68% |  12.194 us |       6.66% |   0.025 us |   0.21% |   SAME   |
|  I128   |      I32      |      2^24      |  85.718 us |       1.03% |  85.601 us |       1.04% |  -0.118 us |  -0.14% |   SAME   |
|  I128   |      I32      |      2^28      |   1.235 ms |       0.14% |   1.235 ms |       0.14% |  -0.358 us |  -0.03% |   SAME   |
|  I128   |      I64      |      2^16      |   7.015 us |       8.63% |   6.942 us |       8.74% |  -0.072 us |  -1.03% |   SAME   |
|  I128   |      I64      |      2^20      |  12.189 us |       6.96% |  12.221 us |       6.71% |   0.033 us |   0.27% |   SAME   |
|  I128   |      I64      |      2^24      |  85.678 us |       1.07% |  85.693 us |       1.03% |   0.015 us |   0.02% |   SAME   |
|  I128   |      I64      |      2^28      |   1.235 ms |       0.15% |   1.235 ms |       0.15% |  -0.163 us |  -0.01% |   SAME   |
|  I128   |      I64      |      2^32      |  19.655 ms |       0.09% |  19.656 ms |       0.09% |   0.529 us |   0.00% |   SAME   |

The diffs were produced with nvbench_compare.py, which AFAIK only compares cold measurements. Thus, the presence of batch benchmarks seems to significantly impact the cold benchmarks: the first diff (batch+cold benchmark run, only showing cold) is a LOT shakier than the second diff (only cold benchmark run, showing cold).
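To put a number on "shaky", one option is to tally the Status column and look at the largest absolute %Diff in each diff. The following is only a rough sketch and not part of nvbench: the helper names `parse_rows` and `summarize` are made up for illustration, and the two inputs are three hand-picked rows copied from each table above (same three configurations in both), not the full tables. Pasting the complete tables into the strings should give the full tallies.

```python
from collections import Counter

STATUSES = {"SAME", "FAST", "SLOW"}

# Excerpt of the batch+cold diff (three rows copied from the first table).
BATCH_AND_COLD = """
|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |      I32      |      2^28      | 116.804 us |       0.58% | 115.438 us |       0.62% |  -1.367 us |  -1.17% |   FAST   |
|   I16   |      I32      |      2^28      | 161.580 us |       0.55% | 183.287 us |       0.13% |  21.707 us |  13.43% |   SLOW   |
|   F64   |      I32      |      2^16      |   6.245 us |       5.34% |   7.152 us |       1.86% |   0.907 us |  14.52% |   SLOW   |
"""

# The same three configurations from the cold-only (no_batch) diff.
COLD_ONLY = """
|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |      I32      |      2^28      | 116.646 us |       0.66% | 116.922 us |       0.54% |   0.276 us |   0.24% |   SAME   |
|   I16   |      I32      |      2^28      | 161.497 us |       0.51% | 161.063 us |       0.38% |  -0.434 us |  -0.27% |   SAME   |
|   F64   |      I32      |      2^16      |   6.305 us |       4.45% |   6.313 us |       5.20% |   0.007 us |   0.11% |   SAME   |
"""


def parse_rows(markdown_table):
    """Return (status, pct_diff) for each data row of a comparison table.

    Assumes the last column is Status and the second-to-last is %Diff,
    as in the tables shown in this issue.
    """
    rows = []
    for line in markdown_table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip blank lines, the header row, and the separator row.
        if len(cells) < 2 or cells[-1] not in STATUSES:
            continue
        rows.append((cells[-1], float(cells[-2].rstrip("%"))))
    return rows


def summarize(markdown_table):
    """Return (status tally, largest absolute %Diff) for one table."""
    rows = parse_rows(markdown_table)
    counts = Counter(status for status, _ in rows)
    worst = max(abs(pct) for _, pct in rows)
    return counts, worst


for name, table in [("batch+cold", BATCH_AND_COLD), ("cold only", COLD_ONLY)]:
    counts, worst = summarize(table)
    print(f"{name}: {dict(counts)}, max |%Diff| = {worst:.2f}%")
```

On this excerpt, the batch+cold rows are all flagged FAST or SLOW while the cold-only rows are all SAME. Since the rows were picked by hand, this only illustrates the method; it is not a summary of the full results.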

This makes me wonder whether there is a bug in nvbench, such as a missing L2 flush before a cold benchmark (or after a batch benchmark), or whether batch and cold benchmarks should simply never run back to back.
