Conversation

@ykcombat
Collaborator

Motivation

This PR provides the PD-Multiplexing scheduler and context manager. In follow-up work, we will enable CUDA graph and communicator switching for PD-Multiplexing to get better performance. Subsequent PRs of PD-Multiplexing can be found here.

Modifications

  • Add mixin class to provide PD-Multiplexing scheduler.
  • Add PD-Multiplexing context manager.
  • Disable KV transferring in alt stream for PD-Multiplexing.
  • Add split prefill information to ScheduleBatch.
  • Delay importing sgl_kernel.spatial until pdmux initialization.

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @ykcombat, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive PD-Multiplexing system designed to enhance the performance of the SGLang runtime. It integrates a new scheduler mixin, a dedicated event loop for multiplexing, and a context management system that dynamically allocates GPU resources (SMs) between prefill and decode operations using CUDA streams. These changes are foundational for future optimizations like CUDA graph and communicator switching, aiming to improve overall throughput and efficiency.

Highlights

  • PD-Multiplexing Scheduler and Context: Introduced a new PD-Multiplexing scheduler and context manager, laying the groundwork for improved performance through CUDA graph and communicator switching.
  • Split Prefill Logic: Added specific logic for 'split prefill' operations, including a new split_prefill_batch in the scheduler and a dedicated forward_batch_split_prefill method in the TPWorker.
  • Dynamic Stream Management: Implemented dynamic adjustment of CUDA stream groups based on decode batch size, allowing for flexible allocation of SMs between prefill and decode operations.
  • KV Cache Transfer Optimization: Disabled KV transferring in the alternate stream when PD-Multiplexing is enabled, optimizing memory operations for this new mode.
  • Lazy Import of sgl_kernel.spatial: The sgl_kernel.spatial module, which is crucial for green context stream creation, is now imported lazily during PD-Multiplexing initialization.
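As a rough illustration of the dynamic stream management bullet, here is a toy sketch. All names (`PDMuxContext`, `select_idx_for_decode_bs`, `stream_group`) are hypothetical, and plain Python state stands in for real green-context CUDA stream groups; only the divisor-based batch-size-to-group formula mirrors the scheduler snippet quoted in the review below.

```python
from contextlib import contextmanager


class PDMuxContext:
    """Toy model: stream groups 0..n-1 trade SMs between prefill and decode.

    Group 0 is prefill-only, group n-1 is decode-only, and the middle
    groups split SMs in increasing favor of decode.
    """

    def __init__(self, sm_group_num: int):
        self.sm_group_num = sm_group_num
        self.current_idx = 0

    def select_idx_for_decode_bs(self, decode_bs: int, divisor: int = 8) -> int:
        # Larger decode batches get a group that grants decode more SMs,
        # clamped to the mixed groups 1..n-2.
        return max(
            1,
            min(
                self.sm_group_num - 2,
                decode_bs * (self.sm_group_num - 2) // divisor,
            ),
        )

    @contextmanager
    def stream_group(self, idx: int):
        # Switch to the chosen group for the duration of the block,
        # then restore the previous group.
        prev = self.current_idx
        self.current_idx = idx
        try:
            yield idx
        finally:
            self.current_idx = prev
```

Usage would look like `with ctx.stream_group(ctx.select_idx_for_decode_bs(bs)): ...`, with prefill and decode kernels launched on the streams of the selected group.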
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces PD-Multiplexing, a feature designed to improve performance by running prefill and decode operations concurrently. It adds a new scheduler mixin, a context manager for multiplexing, and modifies the model runner to support multiple attention backends for different streams. The changes are extensive and introduce complex scheduling logic. My review focuses on the correctness and robustness of this new logic. I've identified a potential race condition in the attention backend initialization that could be critical, a bug in the stream group selection logic, and a minor issue in a validation error message. Overall, the feature is a significant addition, but these issues should be addressed to ensure its stability and correctness.

Comment on lines +43 to +57
if manual_divisions:
    for i in range(len(manual_divisions)):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
else:
    stream_idx = max(
        1,
        min(
            self.real_sm_group_num - 2,
            decode_bs
            * (self.real_sm_group_num - 2)
            // self.pdmux_config.decode_bs_divisor,
        ),
    )

high

The logic for selecting stream_idx when manual_divisions is used has a couple of issues:

  1. Potential UnboundLocalError: If decode_bs is smaller than all defined thresholds, stream_idx is never assigned, causing an UnboundLocalError when set_current_stream_idx is called.
  2. Incorrect selection with unsorted thresholds: The loop iterates through manual_divisions and overwrites stream_idx on every satisfied threshold, so the last match wins regardless of its magnitude. If the thresholds are not sorted in ascending order, this can select a suboptimal stream group. For example, with thresholds [10, 5] and decode_bs=12, it would incorrectly select the group for threshold 5 instead of the group for threshold 10.

To fix this, I suggest initializing stream_idx with a fallback value and selecting the division with the largest threshold that decode_bs meets. This avoids the UnboundLocalError and picks the correct group even when the thresholds are unsorted.

Suggested change

if manual_divisions:
    for i in range(len(manual_divisions)):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
else:
    stream_idx = max(
        1,
        min(
            self.real_sm_group_num - 2,
            decode_bs
            * (self.real_sm_group_num - 2)
            // self.pdmux_config.decode_bs_divisor,
        ),
    )
if manual_divisions:
    # Default to the first mixed group (index 1).
    stream_idx = 1
    best_threshold = None
    # Pick the division with the largest threshold that decode_bs meets;
    # this is correct even if the thresholds are unsorted.
    for i, (_, _, threshold) in enumerate(manual_divisions):
        if decode_bs >= threshold and (
            best_threshold is None or threshold > best_threshold
        ):
            best_threshold = threshold
            stream_idx = i + 1
else:
    stream_idx = max(
        1,
        min(
            self.real_sm_group_num - 2,
            decode_bs
            * (self.real_sm_group_num - 2)
            // self.pdmux_config.decode_bs_divisor,
        ),
    )
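For reference, the selection semantics under discussion can be exercised in isolation. This is a standalone sketch, not the PR's code: `select_stream_idx` is a hypothetical free function, the SM counts in the divisions are made up, and it implements "fall back to the first mixed group, otherwise prefer the largest satisfied threshold".

```python
def select_stream_idx(manual_divisions, decode_bs):
    """Return the stream-group index for a given decode batch size.

    Each division is (prefill_sms, decode_sms, threshold); only the
    threshold matters here. Falls back to index 1 when no threshold is
    met (avoiding the UnboundLocalError), and otherwise selects the
    division with the largest threshold that decode_bs satisfies, so
    unsorted threshold lists are handled correctly.
    """
    stream_idx = 1
    best_threshold = None
    for i, (_, _, threshold) in enumerate(manual_divisions):
        if decode_bs >= threshold and (
            best_threshold is None or threshold > best_threshold
        ):
            best_threshold = threshold
            stream_idx = i + 1
    return stream_idx
```

With the unsorted thresholds [10, 5] from the example above, decode_bs=12 now maps to the group for threshold 10 (index 1), and a batch below every threshold maps to the fallback group instead of raising.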

Comment on lines +35 to +36
if raw["sm_group_num"] < 3:
    raise ValueError("sm_group_num must greater than 3")

medium

The validation error message for sm_group_num is inconsistent with the check: the condition is sm_group_num < 3, but the error message says it must be "greater than 3". It should say "at least 3" (or "greater than or equal to 3") to match the check.

Suggested change
if raw["sm_group_num"] < 3:
    raise ValueError("sm_group_num must greater than 3")
if raw["sm_group_num"] < 3:
    raise ValueError("sm_group_num must be at least 3")
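A minimal standalone sketch of the corrected validation. The function name and the rationale in the comment are assumptions for illustration, not the PR's actual code:

```python
def validate_pdmux_config(raw: dict) -> None:
    # Plausible rationale (assumption): PD-Multiplexing presumably needs at
    # least a prefill-only group, a decode-only group, and one mixed group,
    # hence the lower bound of 3.
    if raw["sm_group_num"] < 3:
        raise ValueError("sm_group_num must be at least 3")
```

The message now matches the condition: a value of exactly 3 passes, and only values below 3 are rejected.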

@hnyls2002 hnyls2002 merged commit 41efcae into sgl-project:main Oct 31, 2025
59 of 75 checks passed
