Commit d408343 authored
[main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (vllm-project#4097)
### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to
`max_model_len` to obtain the maximum workspace for attention during capturing.
However, consistently setting it to `max_model_len` causes dummy_run
to execute a long attention during actual inference. For example,
if there is a single request with `seq_lens` of [8] but `max_model_len` is
131072, the whole process is slowed down by dummy_run as it executes a
fake long-sequence attention. Therefore, we instead set it to `max_query_len`,
which is also consistent with the vLLM GPU implementation.
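The change can be sketched as follows. This is a minimal, hypothetical illustration of the idea described above; the names `DummyAttnMetadata` and `build_dummy_attn_metadata` are illustrative and not the actual vllm-ascend symbols.

```python
from dataclasses import dataclass


@dataclass
class DummyAttnMetadata:
    # Per-request sequence lengths used by the dummy attention pass.
    seq_lens: list[int]


def build_dummy_attn_metadata(num_reqs: int, max_query_len: int) -> DummyAttnMetadata:
    # Before this PR (conceptually):
    #   seq_lens = [max_model_len] * num_reqs
    # which makes dummy_run execute a fake long-sequence attention
    # (e.g. 131072 tokens) even when the real requests are short.
    #
    # After: bound the dummy sequence lengths by max_query_len instead,
    # matching the vLLM GPU implementation.
    return DummyAttnMetadata(seq_lens=[max_query_len] * num_reqs)
```

With a single short request (`max_query_len=8`), the dummy metadata now carries `seq_lens=[8]` rather than `[131072]`, so the dummy attention pass stays cheap.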
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b
---------
Signed-off-by: Angazenn <[email protected]>
Signed-off-by: luolun <[email protected]>
1 file changed: +1, −1 lines changed
(Diff view not captured; the change replaces a single line at line 2827 of the modified file.)