[Graph Optimization] Support deepseekV3 SOT Dy2St && CUDAGraph #4785
Conversation
Thanks for your contribution!
def forward(
@paddle.jit.marker.capture_control_flow
def prefill_or_decode(
Are you sure you want to call it this?
@chang-wenbin what would be a better name for this?
Between prefill and decode, you chose... or? 🤪
I suggest renaming it to mla_attention.
Also, what is the difference between this change and the previous code? Won't there still be control flow?
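For readers following along, here is a minimal sketch of the pattern under discussion, assuming a Paddle build that includes the prerequisite PRs listed in the description; run_prefill/run_decode and the argument list are hypothetical placeholders, not the actual FastDeploy signatures:

```python
import paddle


class MLAAttentionBackend:
    """Illustrative stand-in; not the actual FastDeploy class."""

    @paddle.jit.marker.capture_control_flow
    def prefill_or_decode(self, query, key, value, forward_meta):
        # The data-dependent branch lives in this decorated helper, so only
        # this region is converted to static IR via local AST dy2static and
        # the branch becomes an if op instead of breaking the SOT trace.
        if forward_meta.needs_prefill:
            return self.run_prefill(query, key, value, forward_meta)
        return self.run_decode(query, key, value, forward_meta)

    def forward(self, query, key, value, forward_meta):
        # forward itself stays on the SOT fast path; the control flow is
        # confined to the decorated helper.
        return self.prefill_or_decode(query, key, value, forward_meta)
```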
    self.block_size,
    self.speculate_max_draft_token_num + 1,
)
forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
Can these two parameters just reuse the previous logic directly?
If CUDAGraph is disabled and only SOT dynamic-to-static conversion is enabled, the previous logic can be reused.
With CUDAGraph + SOT dynamic-to-static enabled, the previous logic cannot be reused, for the following reason.
The previous code was:
if forward_meta.max_len_tensor_cpu[0]:
    ...
This contains two implicit operations:
- indexing the CPU int tensor max_len_tensor_cpu, i.e. max_len_tensor_cpu[0]
- casting the resulting CPU int scalar to a bool scalar
Both run as CPU kernels. Because they are operators placed before the if op, they become sub-ops of the CUDAGraph OP, and since they are CPU kernels they cannot be captured by CUDAGraph. As a result, even when max_len_tensor_cpu[0] == 0, execution still enters the prefill branch.
After changing it to:
forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
if forward_meta.needs_prefill:
    ...
the if op is guaranteed to select the correct branch.
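A minimal before/after sketch of the fix described above; ForwardMeta here is a hypothetical stand-in carrying only the fields involved:

```python
import paddle


class ForwardMeta:
    """Hypothetical stand-in with only the fields used in this example."""

    def __init__(self, max_len_tensor_cpu):
        self.max_len_tensor_cpu = max_len_tensor_cpu  # int tensor on CPU
        self.needs_prefill = None


forward_meta = ForwardMeta(paddle.to_tensor([0, 3], place=paddle.CPUPlace()))

# Before: the indexing and int->bool cast ran as CPU kernels inside the
# region wrapped by the CUDAGraph OP, so they were not re-executed on
# replay and the branch condition went stale:
#     if forward_meta.max_len_tensor_cpu[0]:
#         ...

# After: the comparison is materialized on a named attribute before the
# if op, so the branch condition is computed outside the captured region.
forward_meta.needs_prefill = forward_meta.max_len_tensor_cpu[1] > 0
if forward_meta.needs_prefill:
    pass  # prefill branch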
    forward_meta=forward_meta,
)

fmha_out_prefill = fmha_out_prefill.reshape([-1, self.num_attention_heads_tp, self.qk_head_dim])
Have you checked that the output shape here is already [-1, self.num_attention_heads_tp, self.qk_head_dim]?
Yes, query, key, and fmha_out_prefill all have the same shape here:
[bs, self.num_attention_heads_tp, self.qk_head_dim]
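To make the shape claim concrete, a tiny self-contained check (the head count and head dim values are made up):

```python
import paddle

bs, num_attention_heads_tp, qk_head_dim = 4, 8, 192  # illustrative values
fmha_out_prefill = paddle.randn([bs, num_attention_heads_tp, qk_head_dim])

# On output that is already [bs, H, D], the reshape changes nothing at
# runtime; it pins the trailing dims so the static graph sees [-1, H, D].
out = fmha_out_prefill.reshape([-1, num_attention_heads_tp, qk_head_dim])
assert out.shape[1:] == [num_attention_heads_tp, qk_head_dim]
```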
fastdeploy/model_executor/layers/attention/mla_attention_backend.py (outdated thread, resolved)
    raise RuntimeError("CUDAGraph full graph capture is not supported due to the presence of control flow.")
else:
    flag = "FLAGS_cuda_graph_blacklist"
    paddle.set_flags({flag: ",".join(list(set(paddle.get_flags(flag)[flag].split(",") + ["pd_op.if"])))})
Does the user still need to manually set FLAGS_cuda_graph_blacklist externally when full_cuda_graph=false is set? Can it run with only full_cuda_graph=false?
Now setting full_cuda_graph=false alone is enough; pd_op.if is included by default.
That doesn't seem right for other models; for them it shouldn't be pd_op.if.
This is the MLA backend, which contains if statements and is currently only used by DeepSeek V3. Other attention backends do not necessarily have if statements.
So other models currently cannot run directly just by enabling full_cuda_graph=false, right?
They can run. For most models this parameter has no effect; only the append attention backend (ERNIE4.5Turbo) is affected.
A follow-up PR will add this paddle.set/get_flags handling to the append attention backend.
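As a standalone sketch, the blacklist update above amounts to the following; add_to_cuda_graph_blacklist is a hypothetical helper name, and FLAGS_cuda_graph_blacklist is assumed to exist in the Paddle build in use:

```python
import paddle


def add_to_cuda_graph_blacklist(op_name: str) -> None:
    # Read the current comma-separated blacklist, union in the new op name
    # to avoid duplicates, and write the merged list back.
    flag = "FLAGS_cuda_graph_blacklist"
    current = paddle.get_flags(flag)[flag].split(",")
    paddle.set_flags({flag: ",".join(set(current + [op_name]))})


# With full_cuda_graph=False, pd_op.if is excluded from capture so the
# control-flow op executes outside the CUDA graph.
add_to_cuda_graph_blacklist("pd_op.if")
```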
gongshaotian left a comment:
LGTM

Motivation
Support DeepSeek V3 SOT dynamic-to-static conversion with CUDAGraph enabled.
Modifications
Modify the DeepSeek V3 network: move the two if branches out into a new function and decorate it with @paddle.jit.marker.capture_control_flow for local AST dynamic-to-static conversion.
Prerequisite PRs:
- graph_tracing_guard (Paddle#76198)
- builtin.parameter to top in IR when ast dy2static (Paddle#76190)
Usage or Command
None
Accuracy Tests
None
Checklist
- Add at least one tag in the PR title from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If submitting to the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.