Skip to content

Conversation

@scxiao
Copy link
Contributor

@scxiao scxiao commented Nov 24, 2025

This PR is related to a lds access scenario in the HSTU attn backward kernel. When lds is used for a layout conversion and input of the convert_layout op is a TransOp, the order of the lds should be the same as the input rather than the output of transOp. This can make the lds write to be vectorized. Then for lds read, we can either use ds_read or ds_read_tr, so both lds write and read are vectorized. The PR can help uplift the hstu bwd kernel by 5%.

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@scxiao scxiao requested a review from ptillet as a code owner November 24, 2025 20:23
@scxiao scxiao marked this pull request as draft November 24, 2025 20:23
Copy link
Contributor

@lezcano lezcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the way.
The issue here (and anywhere where we use getOrderForMemory is that getOrderForMemory is a bad approximation of what we really want. It would be much better to use getOrder(srcType) and bail out when there is not enough vectorisation along the loading axis, or something like that.

@lezcano
Copy link
Contributor

lezcano commented Nov 25, 2025

Note that what's happening here is that this pass wanted to do is to split a convert_layout in a local_load and a local_store in case they can be merged with other local_loads / stores. You'd only want to do that if you know that the intermediate layout you are choosing is optimal, so that's basically what I was hinting at in the previous message.

There is also a world where we just kill this pass as it's quite legacy and I don't think it's that useful nowadays. @antiagainst do you know if this one is still useful on the AMD side?

@scxiao
Copy link
Contributor Author

scxiao commented Nov 26, 2025

Note that what's happening here is that this pass wanted to do is to split a convert_layout in a local_load and a local_store in case they can be merged with other local_loads / stores. You'd only want to do that if you know that the intermediate layout you are choosing is optimal, so that's basically what I was hinting at in the previous message.

There is also a world where we just kill this pass as it's quite legacy and I don't think it's that useful nowadays. @antiagainst do you know if this one is still useful on the AMD side?

Thanks for the comments. Maybe for now, I will change the function getOrder(srcType).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants