refine lds layout order for transposed input #8827

scxiao · 2025-11-24T20:23:10Z

This PR is related to a lds access scenario in the HSTU attn backward kernel. When lds is used for a layout conversion and input of the convert_layout op is a TransOp, the order of the lds should be the same as the input rather than the output of transOp. This can make the lds write to be vectorized. Then for lds read, we can either use ds_read or ds_read_tr, so both lds write and read are vectorized. The PR can help uplift the hstu bwd kernel by 5%.

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these
rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because FILL THIS IN.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

lezcano

This is not the way.
The issue here (and anywhere where we use getOrderForMemory is that getOrderForMemory is a bad approximation of what we really want. It would be much better to use getOrder(srcType) and bail out when there is not enough vectorisation along the loading axis, or something like that.

lezcano · 2025-11-25T09:19:42Z

Note that what's happening here is that this pass wanted to do is to split a convert_layout in a local_load and a local_store in case they can be merged with other local_loads / stores. You'd only want to do that if you know that the intermediate layout you are choosing is optimal, so that's basically what I was hinting at in the previous message.

There is also a world where we just kill this pass as it's quite legacy and I don't think it's that useful nowadays. @antiagainst do you know if this one is still useful on the AMD side?

scxiao · 2025-11-26T17:37:58Z

Note that what's happening here is that this pass wanted to do is to split a convert_layout in a local_load and a local_store in case they can be merged with other local_loads / stores. You'd only want to do that if you know that the intermediate layout you are choosing is optimal, so that's basically what I was hinting at in the previous message.

There is also a world where we just kill this pass as it's quite legacy and I don't think it's that useful nowadays. @antiagainst do you know if this one is still useful on the AMD side?

Thanks for the comments. Maybe for now, I will change the function getOrder(srcType).

scxiao added 3 commits November 24, 2025 11:30

refine the lds order for transposed input tensor

26cde91

clang format

5ff2dd0

fix lit tests and clang format

526b9d7

scxiao requested a review from ptillet as a code owner November 24, 2025 20:23

scxiao marked this pull request as draft November 24, 2025 20:23

lezcano reviewed Nov 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refine lds layout order for transposed input #8827

refine lds layout order for transposed input #8827

Uh oh!

scxiao commented Nov 24, 2025

Uh oh!

lezcano left a comment

Uh oh!

lezcano commented Nov 25, 2025

Uh oh!

scxiao commented Nov 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

refine lds layout order for transposed input #8827

Are you sure you want to change the base?

refine lds layout order for transposed input #8827

Uh oh!

Conversation

scxiao commented Nov 24, 2025

New contributor declaration

Uh oh!

lezcano left a comment

Choose a reason for hiding this comment

Uh oh!

lezcano commented Nov 25, 2025

Uh oh!

scxiao commented Nov 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants