
Conversation


@krillillin krillillin commented Nov 24, 2025

Add bailout_in_estimation/benchmarking as optional params to prune_configs_by

A common pattern for me is to autotune over a large search space, then dial in on a
subset of that space for subsequent runs. Usually something like:

@triton.autotune(
    configs=[
        triton.Config(
            {
                "BLOCK_SIZE_M": block_size_m,
                ...
            },
            ...,
        )
        for block_size_m in [16, 32, 64, 128, 256]
        ...
    ],
    ...
)
@triton.jit
def kernel(...):
    ...
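
For concreteness, a sweep like this can be generated with itertools.product. The key names,
block-size lists, and num_warps/num_stages values below are only illustrative and are not the
configs used for the numbers later in this description:

import itertools

import triton

# Illustrative sweep; adjust the meta-parameter names and ranges to the kernel.
configs = [
    triton.Config(
        {"BLOCK_SIZE_M": bm, "BLOCK_SIZE_N": bn, "BLOCK_SIZE_K": bk},
        num_warps=warps,
        num_stages=stages,
    )
    for bm, bn, bk, warps, stages in itertools.product(
        [16, 32, 64, 128, 256],  # BLOCK_SIZE_M
        [16, 32, 64, 128, 256],  # BLOCK_SIZE_N
        [32, 64, 128],           # BLOCK_SIZE_K
        [4, 8],                  # num_warps
        [2, 3, 4],               # num_stages
    )
]  # 5 * 5 * 3 * 2 * 3 = 450 candidate configs in this sketch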

This large search space usually contains a few bad and a few very bad choices, sometimes
10x, 100x, or even 1000x slower than the best one. I think it would be useful to be able
to tell the autotuner: "if a run of config X is Y times slower than the fastest config seen
so far, then stop benchmarking it".

I think this is most valuable for folks who rent hardware from cloud providers, since it reduces
the time that hardware spends benchmarking choices that are clearly bad, so autotuning costs less overall.

I followed the format of the existing prune_configs_by, so usage would look like:

@triton.autotune(
    configs=[...],
    prune_configs_by={
        "bailout_in_estimation": lambda best_timing: lambda num_iters, this_iter, timings: ...,
        "bailout_in_benchmarking": lambda best_timing: lambda num_iters, this_iter, timings: ...,
    },
    ...
)
@triton.jit
def kernel(...):
    ...

This way the user can define the criteria under which they would like to bail out, based on
the best timing seen so far across all configs, as well as on where in do_bench the bailout
happens (I split it this way because we may want to be more forgiving early on and less
forgiving as benchmarking progresses).

This gave a nice overall reduction in autotuning cost: for the tutorial matmul on two shapes,
~50% less time was spent benchmarking a search space of ~750 configs. The resulting
performance looked equivalent within noise:

with prune_configs_by={
    "bailout_in_estimation": lambda best: lambda num_iters, this_iter, timings:
        min(timings) > max(num_iters - this_iter, 1.5) * best,
    "bailout_in_benchmarking": lambda best: lambda num_iters, this_iter, timings:
        min(timings) > max(1.5 - (this_iter / num_iters) * 0.5, 1.1) * best,
},

with bailout:

matmul-performance-fp16:
        M       N       K      cuBLAS      Triton
0  1024.0  1024.0  1024.0  171.196087  197.379013
1  1152.0  1152.0  1152.0  201.161024  223.773981

no bailout:

matmul-performance-fp16:
        M       N       K      cuBLAS      Triton
0  1024.0  1024.0  1024.0  170.760470  199.136092
1  1152.0  1152.0  1152.0  195.401817  229.140257

There is some additional time saved as well: if we bail out, we don't synchronize on the
remaining events, so their execution overlaps with compilation of the next config. I didn't
find a good way to measure this effect directly, though.

I also think it would be valuable to let users skip configs that are guaranteed to fail with
an out-of-resources error. For example, if the user knows that increasing block sizes strictly
increases shared memory usage for their kernel, then once a config fails with out-of-resources,
any config whose block sizes are all at least as large as the failing one can also be skipped.
If this pull request is useful, I can follow up with that option in a next change.
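
A minimal sketch of the dominance check that follow-up could use, under the monotonic
shared-memory assumption described above. The key names and helper functions here are
illustrative, not an existing Triton API:

# Sketch only: once a config fails with out-of-resources, any config whose
# resource-relevant block sizes are all >= the failing one can be skipped,
# assuming resource usage grows monotonically with these block sizes.
RESOURCE_KEYS = ("BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K")

def dominates(kwargs, failed_kwargs, keys=RESOURCE_KEYS):
    """True if `kwargs` is at least as large as `failed_kwargs` on every key."""
    return all(kwargs[k] >= failed_kwargs[k] for k in keys)

failed_configs = []  # kwargs dicts of configs that already hit out-of-resources

def should_skip(kwargs):
    return any(dominates(kwargs, f) for f in failed_configs)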

Thank you for the consideration!

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

mypy fails, with an unrelated error in python/triton/runtime/build.py:34

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because the bailout parameters are user-defined.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@krillillin krillillin marked this pull request as ready for review November 24, 2025 05:21
@krillillin krillillin requested a review from ptillet as a code owner November 24, 2025 05:21
@krillillin (Author)

@peterbell10 @ThomasRaoux would you be able to review this change? I see you have reviewed changes to the affected files before; thank you for your time.



- def do_bench(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, return_mode="mean"):
+ def do_bench(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, return_mode="mean", bailout_in_estimation=None, bailout_in_benchmarking=None):
Contributor

Adding such an option to do_bench may confuse users IMO. The purpose of this function is just to measure the time of a given fn; it is not related to bailing out.

If this is all you need, maybe you can just write your own custom do_bench function and supply it to the autotuner.

Author

Understood. May I add a custom do_bench_with_bailout (or similar) so that other users can also benefit from the change? If it were added to autotuner.py and explained with a docstring, that might prevent confusion. Thank you!
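
For illustration, a minimal sketch of what such a helper could look like, timing with PyTorch
CUDA events. This is not the PR's actual implementation; the signature and bailout threshold
are assumptions:

import torch

def do_bench_with_bailout(fn, rep=100, best_so_far=None, slowdown=1.5):
    """Sketch: time `fn` with CUDA events, but stop early once its best time
    is already `slowdown` times worse than the fastest config seen so far."""
    fn()  # warmup / trigger compilation
    torch.cuda.synchronize()
    times = []
    for _ in range(rep):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        end.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
        if best_so_far is not None and min(times) > slowdown * best_so_far:
            break  # bail out: this config cannot plausibly win
    return min(times)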

Author

Would it make sense instead to have a more general do_bench_for_autotuning? I think there are a couple of differences between benchmarking for performance and benchmarking for autotuning that we can take advantage of to decrease the cost of autotuning.

Bailout is one, but we may also want to benchmark multiple fns at once so that noise is distributed across them. So it could take 1 or N fns to benchmark at a time.
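
As a rough illustration of the second idea, a round-robin variant could look something like
this (again a sketch with PyTorch CUDA events, not an existing Triton function):

import torch

def do_bench_interleaved(fns, rep=100):
    """Sketch: measure several candidate configs round-robin so that noise
    (clock drift, thermal throttling) is spread across all of them instead of
    hitting whichever candidate happened to run last."""
    for fn in fns:  # warmup / trigger compilation for every candidate
        fn()
    torch.cuda.synchronize()
    times = [[] for _ in fns]
    for _ in range(rep):
        for i, fn in enumerate(fns):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            fn()
            end.record()
            end.synchronize()
            times[i].append(start.elapsed_time(end))  # milliseconds
    return [min(ts) for ts in times]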

