[main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len #4097
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request addresses a performance issue in _dummy_run by reducing seq_lens to 1 when not capturing a graph. This avoids executing a long-running attention operation during initialization, which is a good improvement. The implementation looks correct. I've added one high-severity comment to suggest clarifying a confusing comment in the code that contradicts the implementation logic, which will improve maintainability.
Signed-off-by: Angazenn <[email protected]>
# However, when force_attention == False, the model might be running
# normal inference. If dp_size > 1, we only need dummy_run
# to execute a short attention with seq_lens as 1.
seq_lens = self.model_config.max_model_len if force_attention else 1
When force_attention == False, the dummy attn metadata is None, meaning the attention part is skipped, so I don't quite understand why we need this.
The condition is `force_attention or aclgraph_runtime_mode == CUDAGraphMode.FULL`. When capturing, `force_attention = (aclgraph_runtime_mode == CUDAGraphMode.FULL)`. However, when not capturing, `force_attention` in `_dummy_run` is always False. So `_dummy_run` in actual inference executes dummy attention when using full graph mode.
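To make that condition concrete, here is a minimal sketch of the control flow described above. The enum values and the helper name `needs_dummy_attention` are illustrative stand-ins, not the actual vllm-ascend model-runner code:

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Illustrative stand-in for the runtime graph-mode enum referenced above.
    NONE = 0
    PIECEWISE = 1
    FULL = 2


def needs_dummy_attention(force_attention: bool,
                          aclgraph_runtime_mode: CUDAGraphMode) -> bool:
    # During capture, force_attention is set to
    # (aclgraph_runtime_mode == CUDAGraphMode.FULL). During actual inference,
    # force_attention is always False, so the second operand is what makes
    # _dummy_run execute dummy attention in full graph mode.
    return force_attention or aclgraph_runtime_mode == CUDAGraphMode.FULL
```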
Please leave the error message when we remove the max_model_len completely. @Angazenn
Signed-off-by: Angazenn <[email protected]>
The earlier bug should be resolved. I think we can change to
…data to max_query_len (#4099): a cherry-pick of #4097, with the same description as below.
What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to `max_model_len` to get the maximum workspace for attention during capturing. However, setting it to `max_model_len` unconditionally causes dummy_run to execute a long attention when running actual inference. For example, if there is a single request with `seq_lens` of [8] but `max_model_len` is 131072, the whole process is slowed down by dummy_run because it executes a fake long-sequence attention. Therefore, we instead set it to `max_query_len`, which is also consistent with the vLLM GPU implementation.
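A minimal sketch of the resulting behavior, assuming a small helper that fills the dummy metadata's seq_lens; the function name and signature are illustrative, not the actual model-runner code:

```python
import torch


def dummy_seq_lens(num_reqs: int, max_query_len: int) -> torch.Tensor:
    # Previously this tensor was filled with max_model_len (e.g. 131072), so
    # the dummy attention in _dummy_run behaved like a full-length sequence
    # even during normal inference. Filling it with max_query_len keeps the
    # dummy attention cheap and matches the vLLM GPU model runner.
    return torch.full((num_reqs,), max_query_len, dtype=torch.int32)
```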
Does this PR introduce any user-facing change?
No.
How was this patch tested?
- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b