[doc] Add tutorial on TDs to readme #5541
base: main
Conversation
Basic rules:

1. **Use Tensor Descriptors:** For inputs and outputs of matmul operations (`tl.dot`), use Tensor Descriptors. This utilizes the hardware-optimized DPAS operation and asynchronous loading. You can often expect more than a 2x performance improvement compared to the basic tensor-of-pointers approach.
"asynchronous loading" -> 2D block IO HW operations to load the operands of a `tt.dot` operation
Also clarify that tensor descriptors declared in the kernel (device side) should be used (not tensor descriptors declared on the host).
2. **Benchmark:** Experiment with the performance of your kernel. You can use `triton.testing.do_bench` for basic benchmarking, as demonstrated in the [tutorials](../python/tutorials/02-fused-softmax.py).
3. **Type Annotations:** Use proper type annotations for your kernels. Good type annotations allow for better optimization, but be careful to avoid excessive recompilation.
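To make the first rule concrete, here is a minimal sketch of a GEMM tile loop using device-side tensor descriptors, modeled on the public Triton tensor-descriptor tutorials. The kernel name, block sizes, and the exact `tl.make_tensor_descriptor` usage are illustrative assumptions, not code taken from this PR.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_td_kernel(a_ptr, b_ptr, c_ptr,
                     M, N, K,
                     stride_am,  # leading strides: no annotation (see rules below)
                     stride_bk,
                     stride_cm,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Declare descriptors *inside* the kernel (device side), as the review
    # suggests, so the backend can lower loads to 2D block IO operations.
    # The last stride is the literal 1, i.e. the rows are contiguous.
    a_desc = tl.make_tensor_descriptor(a_ptr, shape=[M, K],
                                       strides=[stride_am, 1],
                                       block_shape=[BLOCK_M, BLOCK_K])
    b_desc = tl.make_tensor_descriptor(b_ptr, shape=[K, N],
                                       strides=[stride_bk, 1],
                                       block_shape=[BLOCK_K, BLOCK_N])
    c_desc = tl.make_tensor_descriptor(c_ptr, shape=[M, N],
                                       strides=[stride_cm, 1],
                                       block_shape=[BLOCK_M, BLOCK_N])
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = a_desc.load([pid_m * BLOCK_M, k])
        b = b_desc.load([k, pid_n * BLOCK_N])
        acc = tl.dot(a, b, acc)  # lowered to DPAS on Intel hardware
    c_desc.store([pid_m * BLOCK_M, pid_n * BLOCK_N], acc.to(tl.float16))
```

Both loads and the store go through descriptors, covering the "inputs and outputs" part of the rule; a host-side launch would supply the tensors, strides, and a 2D grid.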
Hmmm, what does this mean? It may be misleading, because adding an explicit type (e.g. `tl.int64`) to the strides used to create the tensor descriptor is going to cause us problems.
---
Tensor Descriptors support shapes up to 5 dimensions, but for performance, it is best to use 2 dimensions whenever possible.
Let's make this statement even more clear; change:
"but for performance, it is best to use 2 dimensions whenever possible."
to:
"however, the Intel XPU backend currently optimizes 2-dimensional tensor descriptors well, so avoid using higher-dimensionality tensor descriptors whenever possible."
Summary:
1. Use Tensor Descriptors to load memory required for `tl.dot` and to save results.
2. Strive to use 2D tensor descriptors for better performance.
3. The last tensor stride should be `tl.constexpr` or have no type annotation. Annotating it with `tl.int64` will result in poor performance.
Not just `tl.int64`; using `tl.int32`, for example, would also limit performance. Let's suggest using `tl.constexpr` whenever possible, and otherwise avoiding an explicit data-type annotation.
## Use proper type annotations
1. Set the `tl.constexpr` type annotation for block sizes and boolean flags to let the compiler optimize. Each combination of arguments with this annotation is compiled separately. Avoid setting it for values that vary widely at runtime (like the number of tokens) to prevent excessive recompilation.
2. **No annotation:** You can leave type annotations empty and let the compiler guess. This is good for parameters that change often (like strides) to avoid recompilation.
3. Avoid writing a `tl.int64` type annotation for the last stride of a tensor. It is often important for the compiler to know that the tensor is contiguous.
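The three rules above can be shown on one kernel signature. This is a hypothetical sketch, not code from the PR; the kernel name and logic are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def copy_rows_kernel(x_ptr, out_ptr,
                     n_elements,           # rule 2: no annotation, changes every
                                           # call, so leaving it untyped avoids
                                           # per-value recompilation
                     row_stride,           # rule 3: untyped stride, so the
                                           # compiler can still specialize it
                                           # (e.g. detect a contiguous stride 1)
                     BLOCK: tl.constexpr,  # rule 1: each block size compiles once
                     RELU: tl.constexpr):  # rule 1: boolean flag as constexpr
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs * row_stride, mask=mask)
    if RELU:  # resolved at compile time because RELU is constexpr
        x = tl.maximum(x, 0.0)
    tl.store(out_ptr + offs * row_stride, x, mask=mask)
```

Launching with `RELU=True` and `RELU=False` produces two specialized binaries, while varying `n_elements` or `row_stride` reuses the same one.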
Exactly, but generalize to avoid annotating the type explicitly (not just tl.int64)
## Tune kernel configuration

### GRF Mode
Setting it higher can be good for a kernel that uses many registers, but will decrease hardware utilization.
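Assuming the spelling discussed later in this thread (a `grf_mode` launch option taking a numeric value such as 256, with `"large"` deprecated), selecting the GRF mode might look like the fragment below. The kernel and its arguments are hypothetical; only the option name and values come from the review comments.

```python
# Hypothetical launch with a per-launch GRF mode override on the Intel XPU
# backend; 256 gives each thread more registers at the cost of fewer
# concurrent threads per execution unit.
my_kernel[(grid,)](x_ptr, out_ptr, n, BLOCK=128, grf_mode=256)
```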
setting it higher --> using grf_mode = large
grf_mode = large is deprecated, please use grf_mode = 256 instead.
```
### Example 3: GEMM operations
Intel backend for triton requires
Something else was supposed to go here?
@etiotto I'll process all of your comments and fix the issues. In the meantime, maybe you have some other ideas about key things for performance that I missed?
Add tutorial for Tensor Descriptors to the README.md.

Closes #5492