Merged · Changes from 1 commit
7 changes: 6 additions & 1 deletion vllm_ascend/worker/model_runner_v1.py
@@ -2665,7 +2665,12 @@ def _build_dummy_attn_metadata(

attn_metadata = {}

-        seq_lens = self.model_config.max_model_len
+        # When force_attention == True, the model is being captured, so
+        # seq_lens must be max_model_len to allocate the maximum workspace
+        # for the attention op. When force_attention == False, the model may
+        # be running normal inference; if dp_size > 1, dummy_run only needs
+        # to execute a short attention pass with seq_lens == 1.
+        seq_lens = self.model_config.max_model_len if force_attention else 1
Collaborator:
When force_attention == False, the dummy attn metadata is None meaning skipping attention part, so I don't quite understand why we need this.

Collaborator (Author):
The condition is force_attention or aclgraph_runtime_mode == CUDAGraphMode.FULL. When capturing, force_attention = (aclgraph_runtime_mode == CUDAGraphMode.FULL). However, when not capturing, force_attention in _dummy_run is always False. So during actual inference, _dummy_run still executes dummy attention when full graph mode is used, which is why seq_lens needs a value here.

self.seq_lens_np[:num_reqs] = seq_lens
self.seq_lens_np[num_reqs:] = 0
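The selection logic the diff and comment thread discuss can be sketched in isolation. This is a minimal illustration, not the actual vllm-ascend code; the helper name `choose_dummy_seq_lens` is hypothetical, and only the `force_attention` branch from the diff is modeled:

```python
def choose_dummy_seq_lens(force_attention: bool, max_model_len: int) -> int:
    """Pick the per-request sequence length for a dummy attention run.

    Capturing a full graph needs the worst-case attention workspace, so the
    capture path uses the maximum model length. Outside capture (e.g. a
    dummy_run issued to keep dp ranks in sync when dp_size > 1), a length
    of 1 is enough for a short attention pass.
    """
    return max_model_len if force_attention else 1


# Capture path: worst-case workspace.
assert choose_dummy_seq_lens(True, 4096) == 4096
# Non-capture dummy run: minimal work.
assert choose_dummy_seq_lens(False, 4096) == 1
```

The value returned here is what gets broadcast into `self.seq_lens_np[:num_reqs]` in the diff above.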
