
Conversation


@krillillin krillillin commented Nov 24, 2025

Add bailout_in_estimation/benchmarking as optional params to prune_configs_by

A common pattern for me is to autotune over a large search space, then dial in on a
subset of that space for subsequent runs. Usually something like:

@triton.autotune(
    configs=[
        triton.Config(
            {
                "BLOCK_SIZE_M": block_size_m,
                ...
            },
            ...,
        )
        for block_size_m in [16, 32, 64, 128, 256]
        ...
    ],
    ...
)
@triton.jit
def kernel(...):
    ...
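
For concreteness, a sweep like this can be generated with itertools.product. The key names,
block-size lists, and num_warps/num_stages values below are only illustrative and are not the
configs used for the numbers later in this description:

import itertools

import triton

# Illustrative sweep; adjust the meta-parameter names and ranges to the kernel.
configs = [
    triton.Config(
        {"BLOCK_SIZE_M": bm, "BLOCK_SIZE_N": bn, "BLOCK_SIZE_K": bk},
        num_warps=warps,
        num_stages=stages,
    )
    for bm, bn, bk, warps, stages in itertools.product(
        [16, 32, 64, 128, 256],  # BLOCK_SIZE_M
        [16, 32, 64, 128, 256],  # BLOCK_SIZE_N
        [32, 64, 128],           # BLOCK_SIZE_K
        [4, 8],                  # num_warps
        [2, 3, 4],               # num_stages
    )
]  # 5 * 5 * 3 * 2 * 3 = 450 candidate configs in this sketch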

This large search space usually contains a few bad and a few very bad choices, sometimes
10x, 100x, or even 1000x slower than the best one. I think it would be useful to be able
to tell the autotuner: "if a run of config X is Y times slower than the fastest config seen
so far, then stop benchmarking it".

I think this is most valuable for folks who rent hardware from cloud providers, since it reduces
the time that hardware spends benchmarking choices that are clearly bad, so autotuning costs less overall.

I followed the format of the existing prune_configs_by, so usage would look like:

@triton.autotune(
    configs=[...],
    prune_configs_by={
        "bailout_in_estimation": lambda best_timing: lambda num_iters, this_iter, timings: ...,
        "bailout_in_benchmarking": lambda best_timing: lambda num_iters, this_iter, timings: ...,
    },
    ...
)
@triton.jit
def kernel(...):
    ...

This way the user can define the criteria under which they would like to bail out, based on
the best timing seen so far across all configs, as well as on where in do_bench the bailout
happens (I split it this way because we may want to be more forgiving early on and less
forgiving as benchmarking progresses).

This gave a nice overall reduction in autotuning cost: for the tutorial matmul on two shapes,
~50% less time was spent benchmarking a search space of ~750 configs. The resulting
performance looked equivalent within noise:

with prune_configs_by={
    "bailout_in_estimation": lambda best: lambda num_iters, this_iter, timings:
        min(timings) > max(num_iters - this_iter, 1.5) * best,
    "bailout_in_benchmarking": lambda best: lambda num_iters, this_iter, timings:
        min(timings) > max(1.5 - (this_iter / num_iters) * 0.5, 1.1) * best,
},

with bailout:

matmul-performance-fp16:
        M       N       K      cuBLAS      Triton
0  1024.0  1024.0  1024.0  171.196087  197.379013
1  1152.0  1152.0  1152.0  201.161024  223.773981

no bailout:

matmul-performance-fp16:
        M       N       K      cuBLAS      Triton
0  1024.0  1024.0  1024.0  170.760470  199.136092
1  1152.0  1152.0  1152.0  195.401817  229.140257

There is some additional time saved as well: if we bail out, we don't synchronize on the
remaining events, so their execution overlaps with compilation of the next config. I didn't
find a good way to measure this effect directly, though.

I also think it would be valuable to let users skip configs that are guaranteed to fail with
an out-of-resources error. For example, if the user knows that increasing block sizes strictly
increases shared memory usage for their kernel, then once a config fails with out-of-resources,
any config whose block sizes are all at least as large as the failing one can also be skipped.
If this pull request is useful, I can follow up with that option in a next change.
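
A minimal sketch of the dominance check that follow-up could use, under the monotonic
shared-memory assumption described above. The key names and helper functions here are
illustrative, not an existing Triton API:

# Sketch only: once a config fails with out-of-resources, any config whose
# resource-relevant block sizes are all >= the failing one can be skipped,
# assuming resource usage grows monotonically with these block sizes.
RESOURCE_KEYS = ("BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K")

def dominates(kwargs, failed_kwargs, keys=RESOURCE_KEYS):
    """True if `kwargs` is at least as large as `failed_kwargs` on every key."""
    return all(kwargs[k] >= failed_kwargs[k] for k in keys)

failed_configs = []  # kwargs dicts of configs that already hit out-of-resources

def should_skip(kwargs):
    return any(dominates(kwargs, f) for f in failed_configs)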

Thank you for the consideration!

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

mypy fails, with an unrelated error in python/triton/runtime/build.py:34

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because the bailout parameters are user-defined.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@krillillin krillillin marked this pull request as ready for review November 24, 2025 05:21
@krillillin krillillin requested a review from ptillet as a code owner November 24, 2025 05:21
@krillillin (Author)

@peterbell10 @ThomasRaoux would you be able to review this change? I see you have reviewed changes to the affected files before; thank you for your time.



- def do_bench(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, return_mode="mean"):
+ def do_bench(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, return_mode="mean", bailout_in_estimation=None, bailout_in_benchmarking=None):
Contributor

Adding such an option to do_bench may confuse users IMO. The purpose of this function is just to measure the time of a given fn; it is not related to bailing out.

If this is all you need, maybe you can just write your own custom do_bench function and supply it to the autotuner.

Author

Understood. May I add a custom do_bench_with_bailout (or similar) so that other users can also benefit from the change? If it were added to autotuner.py and explained with a docstring, that might prevent confusion. Thank you!
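
For illustration, a minimal sketch of what such a helper could look like, timing with PyTorch
CUDA events. This is not the PR's actual implementation; the signature and bailout threshold
are assumptions:

import torch

def do_bench_with_bailout(fn, rep=100, best_so_far=None, slowdown=1.5):
    """Sketch: time `fn` with CUDA events, but stop early once its best time
    is already `slowdown` times worse than the fastest config seen so far."""
    fn()  # warmup / trigger compilation
    torch.cuda.synchronize()
    times = []
    for _ in range(rep):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        end.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
        if best_so_far is not None and min(times) > slowdown * best_so_far:
            break  # bail out: this config cannot plausibly win
    return min(times)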

Author

Would it make sense instead to have a more general do_bench_for_autotuning? I think there are a couple of differences between benchmarking for performance and benchmarking for autotuning that we can take advantage of to decrease the cost of autotuning.

Bailout is one, but we may also want to benchmark multiple fns at once so that noise is distributed across them. So it could take 1 or N fns to benchmark at a time.
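
As a rough illustration of the second idea, a round-robin variant could look something like
this (again a sketch with PyTorch CUDA events, not an existing Triton function):

import torch

def do_bench_interleaved(fns, rep=100):
    """Sketch: measure several candidate configs round-robin so that noise
    (clock drift, thermal throttling) is spread across all of them instead of
    hitting whichever candidate happened to run last."""
    for fn in fns:  # warmup / trigger compilation for every candidate
        fn()
    torch.cuda.synchronize()
    times = [[] for _ in fns]
    for _ in range(rep):
        for i, fn in enumerate(fns):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            fn()
            end.record()
            end.synchronize()
            times[i].append(start.elapsed_time(end))  # milliseconds
    return [min(ts) for ts in times]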

