[OPS] support triton causal_conv1d_fn ops #4119
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a Triton-based implementation for causal_conv1d_fn to optimize its performance on Ascend hardware. The changes include a new Triton kernel, an update to the function wrapper, and the addition of a comprehensive test suite with a PyTorch reference implementation for validation. The filename casual_conv1d.py has also been corrected. The overall implementation is good, but I've identified a performance issue in the Triton kernel related to a redundant memory load.
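For context, a causal depthwise conv1d (the operation the new Triton kernel accelerates) can be expressed as a left-padded per-channel convolution, where each output position depends only on current and past inputs. The sketch below is a NumPy reference of my own, not the PR's actual test code; the name `causal_conv1d_ref` and the `(dim, seqlen)` layout are assumptions:

```python
import numpy as np

def causal_conv1d_ref(x, weight, bias=None):
    """Depthwise causal conv1d.

    x:      (dim, seqlen) input
    weight: (dim, width)  per-channel filter taps
    Output at time t depends only on x[:, t-width+1 .. t] (causal).
    """
    dim, seqlen = x.shape
    _, width = weight.shape
    # Left-pad with zeros so each output sees only past/current inputs.
    x_pad = np.concatenate([np.zeros((dim, width - 1), dtype=x.dtype), x], axis=1)
    out = np.zeros_like(x)
    for t in range(seqlen):
        out[:, t] = (x_pad[:, t:t + width] * weight).sum(axis=1)
    if bias is not None:
        out += bias[:, None]
    return out
```

A reference like this is what a test suite would compare the Triton kernel's output against.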
Some performance tests need to be conducted.
```python
BLOCK_N = 256
grid = (total_seq_blocks, triton.cdiv(dim, BLOCK_N))
```
This parameter may need to be tuned to achieve better performance.
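`triton.cdiv` is ceiling division, so the grid above launches one kernel instance per sequence block and per tile of `BLOCK_N` channels. A pure-Python illustration (no Triton needed; `launch_grid` and the example sizes are mine, not from the PR):

```python
def cdiv(a, b):
    # Ceiling division, equivalent to triton.cdiv(a, b).
    return -(-a // b)

def launch_grid(total_seq_blocks, dim, block_n=256):
    # One kernel instance per (sequence block, BLOCK_N-wide channel tile).
    return (total_seq_blocks, cdiv(dim, block_n))
```

For example, `dim=1000` channels with `BLOCK_N=256` needs 4 channel tiles, since the last partial tile still requires a full block.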
Force-pushed from 01a550f to 9380e82.
@weijinqian0 @wxsIcey Regarding parameter tuning: for an input length of 64, the optimal combination is (32, 256); for a length of 256, it is (64, 256); and for a length of 8192, it is (128, 512). Taking both performance and stability into account, I decided not to use branch statements to dynamically select parameters; instead, the parameters are directly set to (64, 256). This rationale is consistent with the implementation of this operator on NVIDIA GPUs, which also hardcodes its parameters to (8, 256). We can revisit this for further optimization in the future when Triton Ascend is more stable.
Regarding performance testing: since the operator executes during the prefill stage, I set the concurrency to 40 with an input size of 64 and an output size of 10 to better demonstrate the performance gains. The test script used is as follows. The results are summarized below:
Force-pushed from 9380e82 to 0358477.
Signed-off-by: QilaiZhang <[email protected]>
Force-pushed from 0358477 to f1404e0.
@wxsIcey The current tests have passed. We are ready for the next test when convenient.



What this PR does / why we need it?
Support triton causal_conv1d_fn ops.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed with newly added and existing tests.