@Egor-Krivov

Add tutorial for Tensor Descriptors to the README.md

Closes #5492

@Egor-Krivov Egor-Krivov changed the title [doc] Add readme tutorial [doc] Add tutorial on TDs to readme Nov 24, 2025

Basic rules:

1. **Use Tensor Descriptors:** For inputs and outputs of matmul operations (`tl.dot`), use Tensor Descriptors. This utilizes the hardware-optimized DPAS operation and asynchronous loading. You can often expect more than a 2x performance improvement compared to the basic tensor of pointers approach.

asynchronous loading -> 2D block IO HW operations to load the operand of a `tt.dot` operation


Also clarify that tensor descriptors declared in the kernel (device side) should be used (not tensor descriptors declared on the host).


1. **Use Tensor Descriptors:** For inputs and outputs of matmul operations (`tl.dot`), use Tensor Descriptors. This utilizes the hardware-optimized DPAS operation and asynchronous loading. You can often expect more than a 2x performance improvement compared to the basic tensor of pointers approach.
2. **Benchmark:** Experiment with the performance of your kernel. You can use `triton.testing.do_bench` for basic benchmarking, as demonstrated in the [tutorials](../python/tutorials/02-fused-softmax.py).
3. **Type Annotations:** Use proper type annotations for your kernels. Good type annotations allow for better optimization, but be careful to avoid excessive recompilation.
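As a sketch of rule 1, a minimal GEMM kernel built around device-side tensor descriptors (declared inside the kernel, as the review recommends, not on the host) might look like the following. The API shown (`tl.make_tensor_descriptor` with `shape`/`strides`/`block_shape`, plus descriptor `load`/`store` taking block offsets) follows recent Triton releases; the kernel name and block sizes are illustrative assumptions, and this cannot run without an XPU/GPU device:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak,   # strides left unannotated (see the type-annotation rules)
                  stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Device-side tensor descriptors for both tl.dot operands and the output;
    # 2D shapes keep the Intel XPU backend on its optimized 2D block-IO path.
    a_desc = tl.make_tensor_descriptor(a_ptr, shape=[M, K],
                                       strides=[stride_am, stride_ak],
                                       block_shape=[BLOCK_M, BLOCK_K])
    b_desc = tl.make_tensor_descriptor(b_ptr, shape=[K, N],
                                       strides=[stride_bk, stride_bn],
                                       block_shape=[BLOCK_K, BLOCK_N])
    c_desc = tl.make_tensor_descriptor(c_ptr, shape=[M, N],
                                       strides=[stride_cm, stride_cn],
                                       block_shape=[BLOCK_M, BLOCK_N])
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = a_desc.load([pid_m * BLOCK_M, k])
        b = b_desc.load([k, pid_n * BLOCK_N])
        acc += tl.dot(a, b)   # lowered to DPAS on Intel hardware
    c_desc.store([pid_m * BLOCK_M, pid_n * BLOCK_N], acc.to(tl.float16))
```

Backends typically also constrain descriptors (for example, requiring the innermost stride to be 1, i.e. a contiguous last dimension); check the Triton tensor-descriptor documentation for the exact requirements.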

hmmm, what does this mean. It may be misleading because adding an explicit type (e.g. tl.int64) to the strides used to create the tensor descriptor is going to cause us problems.

---


Tensor Descriptors support shapes up to 5 dimensions, but for performance, it is best to use 2 dimensions whenever possible.

Let's make this statement even more clear, change:

"but for performance, it is best to use 2 dimensions whenever possible."

to:

"however, the Intel XPU backend currently optimizes 2-dimensional tensor descriptors well, so avoid using higher-dimensionality tensor descriptors whenever possible."

Summary:
1. Use Tensor Descriptors to load the memory required for `tl.dot` and to store results.
2. Strive to use 2D tensor descriptors for better performance.
3. The last tensor stride should be `tl.constexpr` or have no type annotation. Annotating it with `tl.int64` will result in poor performance.

Not just `tl.int64`; using `tl.int32`, for example, would also limit performance. Let's suggest using `tl.constexpr` whenever possible, and if not, then avoiding an explicit data type annotation.

## Use proper type annotations
1. Set `tl.constexpr` type annotation for block sizes and boolean flags to let the compiler optimize. Each combination of arguments with this annotation is compiled separately. Avoid setting it for values that vary widely at runtime (like the number of tokens) to prevent excessive recompilation.
2. No annotation: You can leave the type annotation off and let the compiler infer the type. This is a good fit for parameters that change often (like strides), since it avoids recompilation.
3. Avoid writing `tl.int64` type annotation for the last stride of a tensor. It is often important for the compiler to know that the tensor is contiguous.
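To make the recompilation tradeoff in rule 1 concrete, here is a pure-Python toy model (an illustration only, not Triton's actual caching mechanism) of how `tl.constexpr` arguments key the compilation cache, while unannotated runtime values do not:

```python
# Toy specialization cache: constexpr-like arguments become part of the
# compilation key, so every new value compiles a fresh kernel binary,
# while unannotated runtime values (e.g. strides, token counts) do not.
compiled = {}

def launch(BLOCK_SIZE, n_tokens):
    key = ("my_kernel", BLOCK_SIZE)      # only constexpr-like args enter the key
    if key not in compiled:
        compiled[key] = f"binary(BLOCK_SIZE={BLOCK_SIZE})"  # stands in for a compile
    return compiled[key]

launch(128, 1_000)
launch(128, 50_000)   # n_tokens changed: the same specialization is reused
launch(256, 1_000)    # BLOCK_SIZE changed: a second specialization is compiled
print(len(compiled))  # → 2
```

This is why a widely-varying value like the number of tokens should not be `tl.constexpr`: every distinct value would trigger a separate compilation.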

Exactly, but generalize to avoid annotating the type explicitly (not just tl.int64)

## Tune kernel configuration

### GRF Mode
Setting it higher can be good for kernels that use many registers, but it will decrease hardware utilization.
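A hypothetical launch showing how the mode might be requested per kernel call. The option name `grf_mode` and the value `256` come from this thread (the review notes `grf_mode = large` is deprecated in favor of `grf_mode = 256`); the exact launch syntax and kernel arguments are assumptions:

```python
# Assumption: grf_mode is forwarded as a launch option on the Intel XPU
# backend; per the review, prefer grf_mode=256 over the deprecated 'large'.
matmul_kernel[grid](a, b, c, M, N, K,
                    BLOCK_M=128, BLOCK_N=128, BLOCK_K=32,
                    grf_mode=256)
```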

setting it higher --> using grf_mode = large


grf_mode = large is deprecated, please use grf_mode = 256 instead.

### Example 3 : GEMM operations
The Intel backend for Triton requires

Something else was supposed to go here ?

@Egor-Krivov

Egor-Krivov commented Nov 25, 2025

@etiotto I'll process all of your comments and fix the issues. In the meantime, maybe you have some other ideas about key things for performance that I missed?

Successfully merging this pull request may close these issues.

[Doc] Write readme section on Tensor Descriptor usage
