
Conversation

@Angazenn
Collaborator

@Angazenn Angazenn commented Nov 10, 2025

What this PR does / why we need it?

Currently, we set seq_lens in the dummy attn_metadata to max_model_len so that capturing reserves the maximum attention workspace.
However, keeping it at max_model_len causes dummy_run to execute a long attention during actual inference as well. For example, if there is a single request with seq_lens of [8] but max_model_len is 131072, the whole process is slowed down by dummy_run because it executes a fake long-sequence attention. Therefore, we instead set it to max_query_len, which is also consistent with the vLLM GPU implementation.
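
A minimal sketch of the idea (illustrative only; the helper name and arguments below are not the actual vllm-ascend code): the dummy attention metadata advertises max_query_len instead of max_model_len, so a dummy run during normal inference no longer pays for a fake 131072-token attention.

```python
import torch

# Illustrative sketch, not the real vllm-ascend model runner: only the choice
# of the dummy seq_lens fill value is shown here.
def build_dummy_seq_lens(num_reqs: int, max_query_len: int) -> torch.Tensor:
    # Before this PR the fill value was max_model_len (e.g. 131072), which made
    # every dummy run compute a long fake attention; using max_query_len matches
    # the vLLM GPU model runner and keeps the dummy attention short.
    return torch.full((num_reqs,), max_query_len, dtype=torch.int32)

print(build_dummy_seq_lens(num_reqs=1, max_query_len=8))  # tensor([8], dtype=torch.int32)
```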

Does this PR introduce any user-facing change?

No.

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a performance issue in _dummy_run by reducing seq_lens to 1 when not capturing a graph. This avoids executing a long-running attention operation during initialization, which is a good improvement. The implementation looks correct. I've added one high-severity comment to suggest clarifying a confusing comment in the code that contradicts the implementation logic, which will improve maintainability.

Signed-off-by: Angazenn <[email protected]>
# However, when force_attention == False, the model might be running
# normal inference. If dp_size > 1, we only need dummy_run
# to execute a short attention with seq_lens as 1.
seq_lens = self.model_config.max_model_len if force_attention else 1
Collaborator


When force_attention == False, the dummy attn metadata is None, meaning the attention part is skipped, so I don't quite understand why we need this.

Collaborator Author


The condition is force_attention or aclgraph_runtime_mode == CUDAGraphMode.FULL. When capturing, force_attention = (aclgraph_runtime_mode == CUDAGraphMode.FULL). However, when not capturing, force_attention in _dummy_run is always False. So _dummy_run in actual inference executes dummy attention when using full graph mode.
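
A self-contained sketch of the branch described above, assuming vLLM's CUDAGraphMode semantics; the stand-in enum and helper name are illustrative, not the actual _dummy_run code:

```python
from enum import Enum


class CUDAGraphMode(Enum):  # stand-in for vLLM's CUDAGraphMode, for illustration only
    NONE = 0
    PIECEWISE = 1
    FULL = 2


def needs_dummy_attention(force_attention: bool, aclgraph_runtime_mode: CUDAGraphMode) -> bool:
    # During capture, force_attention is already (aclgraph_runtime_mode == CUDAGraphMode.FULL).
    # During normal inference force_attention is always False, so the second clause alone
    # is what makes full-graph dummy runs build and execute dummy attention.
    return force_attention or aclgraph_runtime_mode == CUDAGraphMode.FULL


# Normal inference with full graph mode still builds dummy attention metadata,
# which is why the dummy seq_lens value matters outside of capturing.
assert needs_dummy_attention(False, CUDAGraphMode.FULL)
assert not needs_dummy_attention(False, CUDAGraphMode.PIECEWISE)
```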

Collaborator

@yiz-liu yiz-liu left a comment


Please leave the error message when we remove the max_model_len completely. @Angazenn

Signed-off-by: Angazenn <[email protected]>
@Angazenn
Collaborator Author

Please leave the error message when we remove the max_model_len completely. @Angazenn

The earlier bug should be resolved. I think we can change it to max_query_len here.

@Angazenn Angazenn changed the title [main][bugfix] Reduce seq_lens in dummy attn_metadata to be 1 when not capturing [main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len Nov 11, 2025
@Angazenn Angazenn added and then removed the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 12, 2025
@yiz-liu yiz-liu merged commit fc7e5cd into vllm-project:main Nov 12, 2025
59 checks passed
yiz-liu pushed a commit that referenced this pull request Nov 12, 2025
…data to max_query_len (#4099)

### What this PR does / why we need it?
This is a cherry-pick of #4097.
Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len`
so that capturing reserves the maximum attention workspace.
However, keeping it at `max_model_len` causes dummy_run to execute a long
attention during actual inference as well. For example, if there is a single
request with `seq_lens` of [8] but `max_model_len` is 131072, the whole
process is slowed down by dummy_run because it executes a fake long-sequence
attention. Therefore, we instead set it to max_query_len, which is also
consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

---------

Signed-off-by: Angazenn <[email protected]>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
vllm-project#4097)

### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len`
so that capturing reserves the maximum attention workspace.
However, keeping it at `max_model_len` causes dummy_run to execute a long
attention during actual inference as well. For example, if there is a single
request with `seq_lens` of [8] but `max_model_len` is 131072, the whole
process is slowed down by dummy_run because it executes a fake long-sequence
attention. Therefore, we instead set it to max_query_len, which is also
consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b

---------

Signed-off-by: Angazenn <[email protected]>
Signed-off-by: luolun <[email protected]>
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
vllm-project#4097)

### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len`
so that capturing reserves the maximum attention workspace.
However, keeping it at `max_model_len` causes dummy_run to execute a long
attention during actual inference as well. For example, if there is a single
request with `seq_lens` of [8] but `max_model_len` is 131072, the whole
process is slowed down by dummy_run because it executes a fake long-sequence
attention. Therefore, we instead set it to max_query_len, which is also
consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b

---------

Signed-off-by: Angazenn <[email protected]>
Signed-off-by: hwhaokun <[email protected]>