Conversation

@QilaiZhang QilaiZhang commented Nov 11, 2025

What this PR does / why we need it?

Support triton causal_conv1d_fn ops.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI passed with new added/existing test.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message to fulfill the PR description, helping reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a Triton-based implementation for causal_conv1d_fn to optimize its performance on Ascend hardware. The changes include a new Triton kernel, an update to the function wrapper, and the addition of a comprehensive test suite with a PyTorch reference implementation for validation. The filename casual_conv1d.py has also been corrected. The overall implementation is good, but I've identified a performance issue in the Triton kernel related to a redundant memory load.
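For context on what such a PyTorch-style reference implementation validates, here is a minimal NumPy sketch of a depthwise causal 1-D convolution with optional bias and SiLU activation. This is an illustrative assumption, not the PR's actual reference code: the function name, argument layout (`dim, seqlen` input, `dim, width` weights), and activation handling are all hypothetical.

```python
import numpy as np

def causal_conv1d_ref(x, weight, bias=None, activation=None):
    """Reference depthwise causal 1-D convolution (illustrative sketch).

    x:      (dim, seqlen) input
    weight: (dim, width) per-channel filter taps
    Output: y[d, t] = sum_k weight[d, k] * x[d, t - width + 1 + k],
    zero-padded on the left so no future tokens are read.
    """
    dim, seqlen = x.shape
    _, width = weight.shape
    # Left-pad with width-1 zeros to make the convolution causal.
    x_pad = np.concatenate([np.zeros((dim, width - 1), dtype=x.dtype), x], axis=1)
    y = np.zeros_like(x)
    for k in range(width):
        # Tap k multiplies the input shifted so the last tap sees the current token.
        y += weight[:, k:k + 1] * x_pad[:, k:k + seqlen]
    if bias is not None:
        y += bias[:, None]
    if activation == "silu":
        y = y / (1.0 + np.exp(-y))  # SiLU / swish
    return y

# E.g. x = [1, 2, 3] with taps [1, 1] yields [1, 3, 5]: each output is
# the current token plus its left neighbor (zero at the boundary).
print(causal_conv1d_ref(np.array([[1.0, 2.0, 3.0]]), np.array([[1.0, 1.0]])))
```

A test suite like the one in this PR would compare the Triton kernel's output against such a reference over random shapes and dtypes.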

@weijinqian0
Collaborator

Some performance tests need to be conducted.

    1)[mask].to(torch.int32)

    BLOCK_N = 256
    grid = (total_seq_blocks, triton.cdiv(dim, BLOCK_N))

This parameter may need to be tuned to achieve better performance.
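To make the tuning knob concrete: the second grid axis is `triton.cdiv(dim, BLOCK_N)`, so BLOCK_N trades per-program work against the number of launched programs. A pure-Python sketch (`cdiv` mirrors `triton.cdiv`, i.e. ceiling division; the numbers below are illustrative, not measured on Ascend):

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, equivalent to triton.cdiv."""
    return -(a // -b)

def launch_grid(total_seq_blocks: int, dim: int, block_n: int):
    """Grid shape from the quoted snippet:
    one program per (sequence block, channel block)."""
    return (total_seq_blocks, cdiv(dim, block_n))

# For dim=2048: BLOCK_N=256 yields 8 channel blocks per sequence block,
# while BLOCK_N=512 halves the program count but doubles per-program work.
print(launch_grid(4, 2048, 256))  # (4, 8)
print(launch_grid(4, 2048, 512))  # (4, 4)
```

Smaller BLOCK_N means more, lighter programs; larger BLOCK_N means fewer, heavier ones, so the sweet spot depends on the hardware's occupancy and memory bandwidth.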

@QilaiZhang QilaiZhang force-pushed the causal_conv1d_fwd branch 3 times, most recently from 01a550f to 9380e82 on November 13, 2025 at 02:40

QilaiZhang commented Nov 13, 2025

@weijinqian0 @wxsIcey Regarding parameter tuning: Since the batch_ptr and token_chunk_offset_ptr of this operator need to be calculated based on BLOCK_M and BLOCK_N, it is not suitable for triton.autotune. I manually tested the performance under different inputs and parameters using the msprof op tool, and the results are shown in the figure below.
[figure: msprof op performance results across BLOCK_M/BLOCK_N combinations]

For an input length of 64, the optimal combination is (32, 256); for a length of 256, the optimal combination is (64, 256); and for a length of 8192, the optimal combination is (128, 512). However, when BLOCK_M is greater than or equal to 128, the compilation time is long and the kernel often hangs or crashes during testing, so this option is not considered.

Taking both performance and stability into account, I decided not to use branch statements to dynamically select parameters; instead, the parameters are hardcoded to (64, 256). This rationale is consistent with the implementation of this operator on NVIDIA GPUs, which likewise hardcodes its parameters to (8, 256). We can revisit further optimization once Triton Ascend is more stable.
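Why `batch_ptr` and `token_chunk_offset_ptr` rule out `triton.autotune` can be sketched with a hypothetical host-side helper (names and layout are assumptions for illustration, not the PR's actual code): both arrays are built by splitting each sequence into BLOCK_M-token chunks, so their length and contents change whenever BLOCK_M does, and autotune varies block sizes per launch without re-running this host-side preparation.

```python
def build_chunk_metadata(seq_lens, block_m):
    """Hypothetical host-side prep: split each sequence into block_m-token
    chunks and record, per chunk, its batch index and starting token offset.
    Both arrays depend on block_m, so they cannot be shared across
    autotune configs that vary the block size."""
    batch_ptr, token_chunk_offset_ptr = [], []
    for batch_idx, seq_len in enumerate(seq_lens):
        for offset in range(0, seq_len, block_m):
            batch_ptr.append(batch_idx)
            token_chunk_offset_ptr.append(offset)
    return batch_ptr, token_chunk_offset_ptr

# Two sequences of lengths 100 and 70 with BLOCK_M=64 give four chunks:
bp, off = build_chunk_metadata([100, 70], 64)
print(bp)   # [0, 0, 1, 1]
print(off)  # [0, 64, 0, 64]
```

Changing BLOCK_M to 32 would produce seven chunks with different offsets, which is exactly the coupling that makes manual tuning (via msprof) the practical route here.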


QilaiZhang commented Nov 13, 2025

Regarding performance testing: Since the operator executes during the prefill stage, I set the concurrency to 40 with an input size of 64 and output size of 10 to better demonstrate the performance gains. The test script used is as follows.

python3 benchmarks/benchmark_serving.py --backend openai-chat --base-url ${BASE_URL} \
--endpoint=/v1/chat/completions --model /model/Qwen3-next-80B-A3B-Thinking/ --served-model-name Qwen3-next-80B-A3B-Thinking \
--dataset-name sonnet --dataset-path benchmarks/sonnet.txt --max-concurrency 40 \
--sonnet-input-len 64 --sonnet-output-len 10 --sonnet-prefix-len 0 --num-prompts 400 \
--ignore-eos --percentile-metrics "ttft,tpot,itl,e2el"

The results are summarized below:

  • PyTorch implementation
    [screenshot: benchmark results]

  • Triton kernel (before tuning) with (8, 256)
    [screenshot: benchmark results]

  • Triton kernel (after tuning) with (64, 256)
    [screenshot: benchmark results]

@QilaiZhang

@wxsIcey The current tests have passed. We are ready for the next test when convenient.

@wxsIcey wxsIcey added the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 14, 2025
@QilaiZhang QilaiZhang closed this Nov 17, 2025

Labels

module:ops, module:tests, ready (read for review), ready-for-test (start test by label for PR)


3 participants