Commit d408343 authored
[main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (vllm-project#4097)
### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to
`max_model_len` to obtain the maximum workspace for attention during capturing.
However, consistently setting it to `max_model_len` causes dummy_run
to execute a long attention during actual inference. For example,
if there is a single request with `seq_lens` of [8] but `max_model_len` is
131072, the whole process is slowed down by dummy_run as it executes a
fake long-sequence attention. Therefore, we instead set it to `max_query_len`,
which is also consistent with the vLLM GPU implementation.
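The change can be sketched as follows. This is a minimal, hypothetical illustration of the idea described above; the names `DummyAttnMetadata` and `build_dummy_attn_metadata` are illustrative and not the actual vllm-ascend symbols.

```python
from dataclasses import dataclass


@dataclass
class DummyAttnMetadata:
    # Per-request sequence lengths used by the dummy attention pass.
    seq_lens: list[int]


def build_dummy_attn_metadata(num_reqs: int, max_query_len: int) -> DummyAttnMetadata:
    # Before this PR (conceptually):
    #   seq_lens = [max_model_len] * num_reqs
    # which makes dummy_run execute a fake long-sequence attention
    # (e.g. 131072 tokens) even when the real requests are short.
    #
    # After: bound the dummy sequence lengths by max_query_len instead,
    # matching the vLLM GPU implementation.
    return DummyAttnMetadata(seq_lens=[max_query_len] * num_reqs)
```

With a single short request (`max_query_len=8`), the dummy metadata now carries `seq_lens=[8]` rather than `[131072]`, so the dummy attention pass stays cheap.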
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b
---------
Signed-off-by: Angazenn <[email protected]>
Signed-off-by: luolun <[email protected]>
1 file changed: +1, −1 lines changed
(Diff view not captured; the change replaces a single line at line 2827 of the modified file.)