
Conversation

@alexsamardzic (Contributor) commented Nov 20, 2025

This PR adds an update_tensor_descriptor operation to simplify writing kernels like grouped MM (such as pytorch/pytorch#166063), in particular to avoid handling special cases like this. It also matches the update_tensormap op in the CuTe DSL, as used here.

The operation reads the existing descriptor from GMEM into SMEM, performs the updates in SMEM, and writes the updated descriptor back to GMEM. The rationale for updating in place instead of creating a new descriptor is to save a GMEM allocation (more precisely, to trade it for a read from GMEM) and to emit tensormap.replace.tile.* PTX instructions only for the descriptor fields that actually change. Otherwise, the implementation closely follows that of make_tensor_descriptor. The end-to-end performance improvement is minor; the main advantage is that the code for cases like the kernel referenced above is cleaner. This PR makes it possible to change the base pointer, shape, and strides fields of the descriptor; support for changing other fields could be added in the future if the need arises.
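For illustration, a minimal sketch of the kind of grouped-MM loop this targets, assuming the tl.update_tensor_descriptor signature proposed in this PR (before the later move to the Gluon tma namespace); the per-group metadata arguments (group_offsets, group_sizes) and the kernel itself are hypothetical, not the actual kernel from pytorch/pytorch#166063:

import triton
import triton.language as tl

@triton.jit
def grouped_loads(a_ptr, group_offsets, group_sizes, N, stride_am, num_groups,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Create a single descriptor inside the kernel and reuse it for every group,
    # instead of allocating a fresh descriptor per group.
    desc = tl.make_tensor_descriptor(
        a_ptr + tl.load(group_offsets),
        shape=[tl.load(group_sizes), N],
        strides=[stride_am, 1],
        block_shape=[BLOCK_M, BLOCK_N],
    )
    for g in range(num_groups):
        a = desc.load([0, 0])  # first tile of the current group
        # ... compute with a ...
        if g + 1 < num_groups:
            # Patch only the fields that differ between groups; per this PR, this
            # emits tensormap.replace.tile.* instructions only for those fields.
            tl.update_tensor_descriptor(
                desc,
                base=a_ptr + tl.load(group_offsets + g + 1),
                shape=[tl.load(group_sizes + g + 1), N],
            )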

//
// Update Tensor Descriptor Op
//
def TT_UpdateTensorDescOp : TT_Op<"update_tensor_descriptor", [
@peterbell10 (Contributor) commented Nov 20, 2025

The in-place update kind of forces the underlying implementation to be memory-backed, which may not be the case for all hardware, including pre-Hopper where we translate tensor descriptors to normal pointer indexing.

@alexsamardzic (Contributor, Author) commented Nov 20, 2025

There are several options to fix this, for example lowering essentially to what make_tensor_descriptor does for older hardware. But overall, I'm actually not sure having this operation in Triton is appropriate, so how about removing the Triton version and moving the Gluon version under the tma or, better, the hopper namespace?

@alexsamardzic (Contributor, Author) commented

Added a commit that removes the Triton version of the operation from the PR and moves the Gluon version under the tma namespace.


a = desc.load([moffset, noffset])

tl.update_tensor_descriptor(desc, base=b_ptr)
@peterbell10 (Contributor) commented

I think this is illegal. We're passing the descriptor in param space, which should be constant.

I also think this will break if your launch grid is larger than the number of SMs: I'd expect the second program to be scheduled on a given SM to see the already-updated descriptor.
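To make the failure mode concrete, a minimal sketch assuming the pre-fix semantics, where desc is created on the host and passed in as a kernel argument (param space); the kernel and argument names are hypothetical, modeled on the snippet above:

import triton
import triton.language as tl

@triton.jit
def kernel(desc, b_ptr, moffset, noffset):
    a = desc.load([moffset, noffset])              # reads a tile through the original base
    tl.update_tensor_descriptor(desc, base=b_ptr)  # mutates the single, shared descriptor in place
    # With a launch grid larger than the number of SMs, a program scheduled onto
    # an SM after this point sees the already-updated descriptor, so its
    # desc.load([...]) would read through b_ptr rather than the original base.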

@alexsamardzic (Contributor, Author) commented

Uhm, indeed. Would it be acceptable to limit the operation to descriptors created from within the kernel, and thus avoid both problems you pointed out?

@alexsamardzic (Contributor, Author) commented

Added another commit implementing the proposed change.

@alexsamardzic alexsamardzic force-pushed the add-update-tensor-descriptor branch 2 times, most recently from 0ff51d0 to f4f5db2 Compare November 23, 2025 16:43
@alexsamardzic alexsamardzic force-pushed the add-update-tensor-descriptor branch from f4f5db2 to 00f04d7 Compare November 24, 2025 13:06