[main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len #4097
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request addresses a performance issue in _dummy_run by reducing seq_lens to 1 when not capturing a graph. This avoids executing a long-running attention operation during initialization, which is a good improvement. The implementation looks correct. I've added one high-severity comment to suggest clarifying a confusing comment in the code that contradicts the implementation logic, which will improve maintainability.
Signed-off-by: Angazenn <[email protected]>
# However, when force_attention == False, the model might be running
# normal inference. If dp_size > 1, we only need dummy_run
# to execute a short attention with seq_lens as 1.
seq_lens = self.model_config.max_model_len if force_attention else 1
When force_attention == False, the dummy attn metadata is None, meaning the attention part is skipped, so I don't quite understand why we need this.
The condition is `force_attention or aclgraph_runtime_mode == CUDAGraphMode.FULL`. When capturing, `force_attention = (aclgraph_runtime_mode == CUDAGraphMode.FULL)`. However, when not capturing, `force_attention` in `_dummy_run` is always False. So `_dummy_run` in actual inference executes dummy attention when using full graph mode.
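To make that condition concrete, here is a minimal sketch of the control flow described above. The enum values and the helper name `needs_dummy_attention` are illustrative stand-ins, not the actual vllm-ascend model-runner code:

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Illustrative stand-in for the runtime graph-mode enum referenced above.
    NONE = 0
    PIECEWISE = 1
    FULL = 2


def needs_dummy_attention(force_attention: bool,
                          aclgraph_runtime_mode: CUDAGraphMode) -> bool:
    # During capture, force_attention is set to
    # (aclgraph_runtime_mode == CUDAGraphMode.FULL). During actual inference,
    # force_attention is always False, so the second operand is what makes
    # _dummy_run execute dummy attention in full graph mode.
    return force_attention or aclgraph_runtime_mode == CUDAGraphMode.FULL
```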
Please leave the error message when we remove the max_model_len completely. @Angazenn
Signed-off-by: Angazenn <[email protected]>
The earlier bug should be resolved. I think we can change to
…data to max_query_len (#4099): a cherry-pick of #4097, with the same description as below.
What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len` to get the maximum workspace for attention during capturing. However, setting it to `max_model_len` unconditionally causes dummy_run to execute a long attention when running actual inference. For example, if there is a single request with `seq_lens` of [8] but `max_model_len` is 131072, the whole process is slowed down by dummy_run because it executes a fake long-sequence attention. Therefore, we instead set it to `max_query_len`, which is also consistent with the vLLM GPU implementation.
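A minimal sketch of the resulting behavior, assuming a small helper that fills the dummy metadata's seq_lens; the function name and signature are illustrative, not the actual model-runner code:

```python
import torch


def dummy_seq_lens(num_reqs: int, max_query_len: int) -> torch.Tensor:
    # Previously this tensor was filled with max_model_len (e.g. 131072), so
    # the dummy attention in _dummy_run behaved like a full-length sequence
    # even during normal inference. Filling it with max_query_len keeps the
    # dummy attention cheap and matches the vLLM GPU model runner.
    return torch.full((num_reqs,), max_query_len, dtype=torch.int32)
```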
Does this PR introduce any user-facing change?
No.
How was this patch tested?
- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b