[Graph Optimization] Support deepseekV3 SOT Dy2St && CUDAGraph #4785
base: develop
```diff
@@ -166,6 +166,16 @@ def __init__(
                 "The current platform does not support Flash Attention V3, so Flash Attention V2 will be used instead."
             )
 
+        if fd_config.graph_opt_config.use_cudagraph:
+            if fd_config.graph_opt_config.full_cuda_graph:
+                print(
+                    "[Warning] Full graph capture with CUDAGraph is not supported in the presence of control flow; "
+                    "`full_cuda_graph` has been automatically set to False."
+                )
+
+            flag = "FLAGS_cuda_graph_blacklist"
+            paddle.set_flags({flag: ",".join(list(set(paddle.get_flags(flag)[flag].split(",") + ["pd_op.if"])))})
+
```
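As an aside for readers, here is an unrolled sketch of what the blacklist one-liner above does; `paddle.get_flags` and `paddle.set_flags` are real Paddle APIs, and the merge semantics mirror the line in the diff.

```python
import paddle

# Unrolled equivalent of the one-liner above: append "pd_op.if" to the
# comma-separated FLAGS_cuda_graph_blacklist, de-duplicating entries, so
# control-flow if ops are never captured into a CUDAGraph.
flag = "FLAGS_cuda_graph_blacklist"
current = paddle.get_flags(flag)[flag].split(",")  # existing blacklist entries
merged = set(current) | {"pd_op.if"}               # set union removes duplicates
paddle.set_flags({flag: ",".join(merged)})
```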
**Member:** Here we still need to set …

**Author (Collaborator):** As it stands now, this only requires setting …

**Member:** That doesn't seem right for other models, does it? It shouldn't be …

**Author (Collaborator):** This is …

**Member:** If other models enable this directly at the moment …

**Author (Collaborator):** They run. For most models this parameter has no effect at all; only the append attention backend (ERNIE4.5Turbo) is affected.
```diff
     def init_attention_metadata(self, forward_meta: ForwardMeta):
         """Initialize attention metadata so that all layers in the forward pass can reuse it."""
         metadata = MLAAttentionMetadata()
```
```diff
@@ -205,6 +215,8 @@ def init_attention_metadata(self, forward_meta: ForwardMeta):
             self.group_size,
             self.block_size,
         )
+        forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
```
**Collaborator:** Could these two parameters just reuse the previous logic?

**Author (Collaborator):** If CUDAGraph is off and only SOT dynamic-to-static conversion is on, the previous logic can be reused. The previous code was:

```python
if forward_meta.max_len_tensor_cpu[0]:
    ...
```

This hides two implicit operations: a getitem on the tensor and a cast of the result to bool. Both run as CPU kernels, and because they are operators that come before the if op, they become sub-ops of the CUDAGraph op, which means the branch condition can be captured incorrectly. After changing it to:

```python
forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
if forward_meta.needs_prefill:
    ...
```

the if op is guaranteed to select the correct branch.
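To make that pattern concrete, here is a minimal standalone sketch; the tensor name and example values are hypothetical stand-ins, not the actual FastDeploy code.

```python
import paddle

# Hypothetical stand-in for forward_meta.max_len_tensor_cpu: a small
# CPU-side tensor of per-step maximum lengths.
max_len_tensor_cpu = paddle.to_tensor([8, 4, 0], place=paddle.CPUPlace())

# Before: `if max_len_tensor_cpu[1]:` hides a getitem and a bool cast.
# Under CUDAGraph capture those run as sub-ops of the captured graph,
# so the branch decision made at capture time is frozen in.

# After: evaluate the condition eagerly into a plain Python bool outside
# the captured region, then branch on it.
needs_prefill = bool(max_len_tensor_cpu[1] > 0)
if needs_prefill:
    print("run the prefill path")
else:
    print("skip prefill")
```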
```diff
+        forward_meta.needs_decode = forward_meta.max_len_tensor_cpu[2] > 0
 
         # MLA
         metadata.max_enc_len_this_time = forward_meta.max_len_tensor_cpu[1]
```
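Read together, the assignments in this hunk suggest the index convention of `max_len_tensor_cpu`: index 1 feeds both `needs_prefill` and `max_enc_len_this_time`, and index 2 drives `needs_decode`. A tiny runnable sketch of that convention, with hypothetical values:

```python
import paddle

# Hypothetical values; the layout below is inferred from this diff alone:
#   [1] -> max encoder (prefill) length this step
#   [2] -> max decoder length this step
max_len_tensor_cpu = paddle.to_tensor([16, 16, 0], place=paddle.CPUPlace())

needs_prefill = bool(max_len_tensor_cpu[1] > 0)  # True: prefill work exists
needs_decode = bool(max_len_tensor_cpu[2] > 0)   # False: no decode work yet
print(needs_prefill, needs_decode)
```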