[Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. #12275
Summary of Changes

Hello @ykcombat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a comprehensive PD-Multiplexing system designed to enhance the performance of the SGLang runtime. It integrates a new scheduler mixin, a dedicated event loop for multiplexing, and a context management system that dynamically allocates GPU resources (SMs) between prefill and decode operations using CUDA streams. These changes are foundational for future optimizations such as CUDA graph and communicator switching, aiming to improve overall throughput and efficiency.

Highlights
Code Review
This pull request introduces PD-Multiplexing, a feature designed to improve performance by running prefill and decode operations concurrently. It adds a new scheduler mixin, a context manager for multiplexing, and modifies the model runner to support multiple attention backends for different streams. The changes are extensive and introduce complex scheduling logic. My review focuses on the correctness and robustness of this new logic. I've identified a potential race condition in the attention backend initialization that could be critical, a bug in the stream group selection logic, and a minor issue in a validation error message. Overall, the feature is a significant addition, but these issues should be addressed to ensure its stability and correctness.
```python
if manual_divisions:
    for i in range(len(manual_divisions)):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
else:
    stream_idx = max(
        1,
        min(
            self.real_sm_group_num - 2,
            decode_bs
            * (self.real_sm_group_num - 2)
            // self.pdmux_config.decode_bs_divisor,
        ),
    )
```
The logic for selecting `stream_idx` when `manual_divisions` is used has a couple of issues:

- **Potential `UnboundLocalError`:** If `decode_bs` is smaller than all defined thresholds, `stream_idx` will not be initialized, causing an `UnboundLocalError` when `set_current_stream_idx` is called.
- **Incorrect selection with unsorted thresholds:** The current loop iterates through `manual_divisions` and overwrites `stream_idx`. If the thresholds are not sorted in ascending order, this can lead to selecting a suboptimal stream group. For example, with thresholds `[10, 5]` and `decode_bs=12`, it would incorrectly select the group for threshold 5.

To fix this, I suggest initializing `stream_idx` and iterating through the divisions in reverse to find the correct group. This ensures a fallback value and correct selection even with unsorted thresholds.
Suggested change:

```python
if manual_divisions:
    # Default to the first mixed group (index 1)
    stream_idx = 1
    # Iterate in reverse to find the largest threshold that is met.
    for i in range(len(manual_divisions) - 1, -1, -1):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
            break
else:
    stream_idx = max(
        1,
        min(
            self.real_sm_group_num - 2,
            decode_bs
            * (self.real_sm_group_num - 2)
            // self.pdmux_config.decode_bs_divisor,
        ),
    )
```
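As a standalone illustration of the suggested fix, the selection logic can be sketched as a pure function. The function name, the division tuples, and the batch sizes below are hypothetical, and the sketch assumes thresholds are sorted in ascending order:

```python
def select_stream_idx(manual_divisions, decode_bs):
    """Pick the SM stream-group index for a given decode batch size.

    Each entry of manual_divisions is assumed to be a
    (prefill_sms, decode_sms, threshold) tuple with thresholds in
    ascending order; group i + 1 applies once decode_bs reaches
    manual_divisions[i]'s threshold.
    """
    # Default to the first mixed group so stream_idx is always bound,
    # even when decode_bs is below every threshold.
    stream_idx = 1
    # Scan in reverse so the largest satisfied threshold wins.
    for i in range(len(manual_divisions) - 1, -1, -1):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
            break
    return stream_idx

# Hypothetical divisions with ascending thresholds 4, 8, 16.
divisions = [(108, 24, 4), (96, 36, 8), (84, 48, 16)]
print(select_stream_idx(divisions, 12))  # meets 8 but not 16 -> 2
print(select_stream_idx(divisions, 2))   # below all thresholds -> 1
```

With the original forward loop and no default, `select_stream_idx(divisions, 2)` would instead raise an `UnboundLocalError`, which is the first issue flagged above.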
```python
if raw["sm_group_num"] < 3:
    raise ValueError("sm_group_num must greater than 3")
```
The validation message for `sm_group_num` is slightly incorrect. The check is `sm_group_num < 3`, but the error message says it must be "greater than 3". It should be "at least 3" or "greater than or equal to 3" to be consistent with the check.
Suggested change:

```python
if raw["sm_group_num"] < 3:
    raise ValueError("sm_group_num must be at least 3")
```
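For completeness, here is a minimal sketch of the corrected validation as a standalone function. The function name and the raw-dict shape are hypothetical; the diff above only shows the check itself:

```python
def validate_sm_group_num(raw: dict) -> None:
    # The scheduler addresses mixed groups at indices 1 .. sm_group_num - 2,
    # so fewer than 3 groups would leave no mixed group at all.
    if raw["sm_group_num"] < 3:
        raise ValueError("sm_group_num must be at least 3")

validate_sm_group_num({"sm_group_num": 3})  # passes silently
try:
    validate_sm_group_num({"sm_group_num": 2})
except ValueError as err:
    print(err)  # sm_group_num must be at least 3
```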
Motivation

This PR provides the PD-Multiplexing scheduler and context manager. After that, we will enable CUDA graph and communicator switching for PD-Multiplexing to achieve better performance. Subsequent PRs of PD-Multiplexing can be found here.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist