[Graph Optimization] Support deepseekV3 SOT Dy2St && CUDAGraph #4785
Conversation
Thanks for your contribution!
def forward(
@paddle.jit.marker.capture_control_flow
def prefill_or_decode(
Are you sure you want to call it this?
@chang-wenbin what would be a better name for this?
Between prefill and decode, you chose... or? 🤪
I suggest renaming it to mla_attention.
Also, what is the difference between this change and the previous code? Won't there still be control flow?
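For readers following along, here is a minimal sketch of the pattern under discussion, assuming a Paddle build that includes the prerequisite PRs listed in the description; run_prefill/run_decode and the argument list are hypothetical placeholders, not the actual FastDeploy signatures:

```python
import paddle


class MLAAttentionBackend:
    """Illustrative stand-in; not the actual FastDeploy class."""

    @paddle.jit.marker.capture_control_flow
    def prefill_or_decode(self, query, key, value, forward_meta):
        # The data-dependent branch lives in this decorated helper, so only
        # this region is converted to static IR via local AST dy2static and
        # the branch becomes an if op instead of breaking the SOT trace.
        if forward_meta.needs_prefill:
            return self.run_prefill(query, key, value, forward_meta)
        return self.run_decode(query, key, value, forward_meta)

    def forward(self, query, key, value, forward_meta):
        # forward itself stays on the SOT fast path; the control flow is
        # confined to the decorated helper.
        return self.prefill_or_decode(query, key, value, forward_meta)
```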
    self.block_size,
    self.speculate_max_draft_token_num + 1,
)
forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
Can these two parameters just reuse the previous logic directly?
If CUDAGraph is disabled and only SOT dynamic-to-static conversion is enabled, the previous logic can be reused.
With CUDAGraph + SOT dynamic-to-static enabled, the previous logic cannot be reused, for the following reason.
The previous code was:
if forward_meta.max_len_tensor_cpu[0]:
    ...
This contains two implicit operations:
- indexing the CPU int tensor max_len_tensor_cpu, i.e. max_len_tensor_cpu[0]
- casting the resulting CPU int scalar to a bool scalar
Both run as CPU kernels. Because they are operators placed before the if op, they become sub-ops of the CUDAGraph OP, and since they are CPU kernels they cannot be captured by CUDAGraph. As a result, even when max_len_tensor_cpu[0] == 0, execution still enters the prefill branch.
After changing it to:
forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
if forward_meta.needs_prefill:
    ...
the if op is guaranteed to select the correct branch.
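A minimal before/after sketch of the fix described above; ForwardMeta here is a hypothetical stand-in carrying only the fields involved:

```python
import paddle


class ForwardMeta:
    """Hypothetical stand-in with only the fields used in this example."""

    def __init__(self, max_len_tensor_cpu):
        self.max_len_tensor_cpu = max_len_tensor_cpu  # int tensor on CPU
        self.needs_prefill = None


forward_meta = ForwardMeta(paddle.to_tensor([0, 3], place=paddle.CPUPlace()))

# Before: the indexing and int->bool cast ran as CPU kernels inside the
# region wrapped by the CUDAGraph OP, so they were not re-executed on
# replay and the branch condition went stale:
#     if forward_meta.max_len_tensor_cpu[0]:
#         ...

# After: the comparison is materialized on a named attribute before the
# if op, so the branch condition is computed outside the captured region.
forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
if forward_meta.needs_prefill:
    pass  # prefill branch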
    forward_meta=forward_meta,
)

fmha_out_prefill = fmha_out_prefill.reshape([-1, self.num_attention_heads_tp, self.qk_head_dim])
Have you checked that the output shape here is already [-1, self.num_attention_heads_tp, self.qk_head_dim]?
Yes, query, key, and fmha_out_prefill all have the same shape here:
[bs, self.num_attention_heads_tp, self.qk_head_dim]
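To make the shape claim concrete, a tiny self-contained check (the head count and head dim values are made up):

```python
import paddle

bs, num_attention_heads_tp, qk_head_dim = 4, 8, 192  # illustrative values
fmha_out_prefill = paddle.randn([bs, num_attention_heads_tp, qk_head_dim])

# On output that is already [bs, H, D], the reshape changes nothing at
# runtime; it pins the trailing dims so the static graph sees [-1, H, D].
out = fmha_out_prefill.reshape([-1, num_attention_heads_tp, qk_head_dim])
assert out.shape[1:] == [num_attention_heads_tp, qk_head_dim]
```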
fastdeploy/model_executor/layers/attention/mla_attention_backend.py (outdated thread, resolved)
    raise RuntimeError("CUDAGraph full graph capture is not supported due to the presence of control flow.")
else:
    flag = "FLAGS_cuda_graph_blacklist"
    paddle.set_flags({flag: ",".join(list(set(paddle.get_flags(flag)[flag].split(",") + ["pd_op.if"])))})
Does the user still need to manually set FLAGS_cuda_graph_blacklist externally when full_cuda_graph=false is set? Can it run with only full_cuda_graph=false?
Now setting full_cuda_graph=false alone is enough; pd_op.if is included by default.
That doesn't seem right for other models; for them it shouldn't be pd_op.if.
This is the MLA backend, which contains if statements and is currently only used by DeepSeek V3. Other attention backends do not necessarily have if statements.
So other models currently cannot run directly just by enabling full_cuda_graph=false, right?
They can run. For most models this parameter has no effect; only the append attention backend (ERNIE4.5Turbo) is affected.
A follow-up PR will add this paddle.set/get_flags handling to the append attention backend.
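As a standalone sketch, the blacklist update above amounts to the following; add_to_cuda_graph_blacklist is a hypothetical helper name, and FLAGS_cuda_graph_blacklist is assumed to exist in the Paddle build in use:

```python
import paddle


def add_to_cuda_graph_blacklist(op_name: str) -> None:
    # Read the current comma-separated blacklist, union in the new op name
    # to avoid duplicates, and write the merged list back.
    flag = "FLAGS_cuda_graph_blacklist"
    current = paddle.get_flags(flag)[flag].split(",")
    paddle.set_flags({flag: ",".join(set(current + [op_name]))})


# With full_cuda_graph=False, pd_op.if is excluded from capture so the
# control-flow op executes outside the CUDA graph.
add_to_cuda_graph_blacklist("pd_op.if")
```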
gongshaotian left a comment:
LGTM

Motivation
Support DeepSeek V3 SOT dynamic-to-static conversion with CUDAGraph enabled.
Modifications
Modify the DeepSeek V3 network: move the two if branches out into a new function and decorate it with @paddle.jit.marker.capture_control_flow for local AST dynamic-to-static conversion.
Prerequisite PRs:
- graph_tracing_guard (Paddle#76198)
- builtin.parameter to top in IR when ast dy2static (Paddle#76190)
Usage or Command
None
Accuracy Tests
None
Checklist
- Add at least one tag in the PR title from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If submitting to the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.