[Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. #12275
Summary of Changes

Hello @ykcombat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a comprehensive PD-Multiplexing system designed to enhance the performance of the SGLang runtime. It integrates a new scheduler mixin, a dedicated event loop for multiplexing, and a context management system that dynamically allocates GPU resources (SMs) between prefill and decode operations using CUDA streams. These changes are foundational for future optimizations such as CUDA graph and communicator switching, aiming to improve overall throughput and efficiency.

Highlights
Code Review
This pull request introduces PD-Multiplexing, a feature designed to improve performance by running prefill and decode operations concurrently. It adds a new scheduler mixin, a context manager for multiplexing, and modifies the model runner to support multiple attention backends for different streams. The changes are extensive and introduce complex scheduling logic. My review focuses on the correctness and robustness of this new logic. I've identified a potential race condition in the attention backend initialization that could be critical, a bug in the stream group selection logic, and a minor issue in a validation error message. Overall, the feature is a significant addition, but these issues should be addressed to ensure its stability and correctness.
```python
if manual_divisions:
    for i in range(len(manual_divisions)):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
else:
    stream_idx = max(
        1,
        min(
            self.real_sm_group_num - 2,
            decode_bs
            * (self.real_sm_group_num - 2)
            // self.pdmux_config.decode_bs_divisor,
        ),
    )
```
The logic for selecting `stream_idx` when `manual_divisions` is used has a couple of issues:

- **Potential `UnboundLocalError`:** If `decode_bs` is smaller than all defined thresholds, `stream_idx` will not be initialized, causing an `UnboundLocalError` when `set_current_stream_idx` is called.
- **Incorrect selection with unsorted thresholds:** The current loop iterates through `manual_divisions` and overwrites `stream_idx`. If the thresholds are not sorted in ascending order, this can lead to selecting a suboptimal stream group. For example, with thresholds `[10, 5]` and `decode_bs=12`, it would incorrectly select the group for threshold 5.

To fix this, I suggest initializing `stream_idx` and iterating through the divisions in reverse to find the correct group. This ensures a fallback value and correct selection even with unsorted thresholds.
Suggested change:

```python
if manual_divisions:
    # Default to the first mixed group (index 1)
    stream_idx = 1
    # Iterate in reverse to find the largest threshold that is met.
    for i in range(len(manual_divisions) - 1, -1, -1):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
            break
else:
    stream_idx = max(
        1,
        min(
            self.real_sm_group_num - 2,
            decode_bs
            * (self.real_sm_group_num - 2)
            // self.pdmux_config.decode_bs_divisor,
        ),
    )
```
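As a standalone illustration of the suggested fix, the selection logic can be sketched as a pure function. The function name, the division tuples, and the batch sizes below are hypothetical, and the sketch assumes thresholds are sorted in ascending order:

```python
def select_stream_idx(manual_divisions, decode_bs):
    """Pick the SM stream-group index for a given decode batch size.

    Each entry of manual_divisions is assumed to be a
    (prefill_sms, decode_sms, threshold) tuple with thresholds in
    ascending order; group i + 1 applies once decode_bs reaches
    manual_divisions[i]'s threshold.
    """
    # Default to the first mixed group so stream_idx is always bound,
    # even when decode_bs is below every threshold.
    stream_idx = 1
    # Scan in reverse so the largest satisfied threshold wins.
    for i in range(len(manual_divisions) - 1, -1, -1):
        _, _, threshold = manual_divisions[i]
        if decode_bs >= threshold:
            stream_idx = i + 1
            break
    return stream_idx

# Hypothetical divisions with ascending thresholds 4, 8, 16.
divisions = [(108, 24, 4), (96, 36, 8), (84, 48, 16)]
print(select_stream_idx(divisions, 12))  # meets 8 but not 16 -> 2
print(select_stream_idx(divisions, 2))   # below all thresholds -> 1
```

With the original forward loop and no default, `select_stream_idx(divisions, 2)` would instead raise an `UnboundLocalError`, which is the first issue flagged above.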
```python
if raw["sm_group_num"] < 3:
    raise ValueError("sm_group_num must greater than 3")
```
The validation message for `sm_group_num` is slightly incorrect. The check is `sm_group_num < 3`, but the error message says it must be "greater than 3". It should be "at least 3" or "greater than or equal to 3" to be consistent with the check.
Suggested change:

```python
if raw["sm_group_num"] < 3:
    raise ValueError("sm_group_num must be at least 3")
```
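For completeness, here is a minimal sketch of the corrected validation as a standalone function. The function name and the raw-dict shape are hypothetical; the diff above only shows the check itself:

```python
def validate_sm_group_num(raw: dict) -> None:
    # The scheduler addresses mixed groups at indices 1 .. sm_group_num - 2,
    # so fewer than 3 groups would leave no mixed group at all.
    if raw["sm_group_num"] < 3:
        raise ValueError("sm_group_num must be at least 3")

validate_sm_group_num({"sm_group_num": 3})  # passes silently
try:
    validate_sm_group_num({"sm_group_num": 2})
except ValueError as err:
    print(err)  # sm_group_num must be at least 3
```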
Motivation

This PR provides the PD-Multiplexing scheduler and context manager. After that, we will enable CUDA graph and communicator switching for PD-Multiplexing to achieve better performance. Subsequent PRs of PD-Multiplexing can be found here.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist