## Try PD-Multiplexing Now

You can try PD-Multiplexing on the PoC branch.
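For a quick local trial, here is a minimal sketch. It assumes the PoC branch exposes an `--enable-pdmux` server flag (per the server-args PR in the checklist below) and the standard OpenAI-compatible `/v1/completions` endpoint; check the branch itself for the exact flag names.

```python
# Minimal sketch for trying PD-Multiplexing on the PoC branch.
# Assumption: the branch adds an `--enable-pdmux` flag (see PR #11427 below).
import subprocess
import time

import requests

server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "codellama/CodeLlama-34b-Instruct-hf",
    "--port", "30000",
    "--enable-pdmux",  # assumed flag name; verify against the PoC branch
])

# Poll the readiness endpoint until the server is up.
for _ in range(180):
    try:
        if requests.get("http://127.0.0.1:30000/health", timeout=2).ok:
            break
    except requests.RequestException:
        time.sleep(5)

resp = requests.post(
    "http://127.0.0.1:30000/v1/completions",
    json={
        "model": "codellama/CodeLlama-34b-Instruct-hf",
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
    },
)
print(resp.json())
server.terminate()
```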
## Benchmark Results (Preview)
The figures above show benchmark results on an NVIDIA H200 with CodeLlama-34b-Instruct-hf on the PoC branch above. We compare ChunkPrefill against PD-Multiplexing (pdmux).
These are preliminary results — we will provide more detailed benchmark data in follow-up updates.
Results for ShareGPT and LooGLE on a single H200 with CodeLlama-34b-hf, with the ITL SLO target set to 60 ms.
We set the ITL SLO target to 60 ms and do not impose an SLO constraint on TTFT; instead, we report the P99 TTFT to demonstrate the efficiency of PD-Multiplexing. In the figure, solid points indicate that the corresponding baseline meets the ITL SLO, while empty points indicate that it violates it.
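For context on how such a figure is read, the sketch below computes the quantities mentioned here from per-request latency traces. The trace layout and the attainment criterion (P99 ITL under the 60 ms target) are assumptions for illustration, not taken from the benchmark scripts.

```python
# Sketch of the SLO accounting described above (illustrative assumptions:
# per-request TTFTs and per-token ITLs are available as plain lists, and a
# baseline "meets" the ITL SLO when its P99 ITL is under the target).
import numpy as np

ITL_SLO_S = 0.060  # 60 ms ITL target; no SLO is imposed on TTFT


def slo_report(ttft_s: list[float], itls_s: list[list[float]]) -> dict:
    all_itls = np.concatenate([np.asarray(x, dtype=float) for x in itls_s])
    p99_itl = float(np.percentile(all_itls, 99))
    p99_ttft = float(np.percentile(np.asarray(ttft_s, dtype=float), 99))
    return {
        "p99_itl_s": p99_itl,
        "p99_ttft_s": p99_ttft,                  # reported only, not constrained
        "meets_itl_slo": p99_itl <= ITL_SLO_S,   # solid vs. empty point in the figure
    }
```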
## Checklist
- Layer-wise Prefill: [Feature] Layer-wise Prefill #7634
- Add a test for Layer-wise Prefill: [Feature] Add a test for Layer-wise Prefill #8231
- CUDA Green Context Support: [Feature] CUDA Green Context Support #7649
- TP Group Switching for PD-Multiplexing: [Feature]TP Group Switching for PD-Multiplexing #7653
- Server args and documentation of PD-Multiplexing: [Documentation][Configuration] Server args and documentation of PD-Multiplexing. #11427
- Fix split prefill with FA3: [Fix] Fix split prefill with fa3. #11428
- Reuse flashinfer workspace for PD-Multiplexing: [Feature] Reuse flashinfer workspace for PD-Multiplexing. #11540
- PD-Multiplexing scheduler (see the conceptual sketch after this checklist): [Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. #12275
- CUDA Graph for PD-Multiplexing: [Feature] Enable CUDA graph for PD-Multiplexing. #11595
- Add unit tests
- Scheduling tutorial and deployment guide.
- Compatibility with PyTorch > 2.6.
- Supporting MoE models.
- Supporting speculative models.
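To make the scheduler item above slightly more concrete, here is a toy sketch of the core idea: each scheduling step launches the decode batch first (to protect ITL) and a bounded amount of prefill work concurrently, with the two intended to run on disjoint SM partitions (e.g., CUDA green contexts). All names and the structure are hypothetical and do not mirror the actual SGLang implementation.

```python
# Illustrative-only sketch of spatial PD-Multiplexing: prefill and decode run
# concurrently on disjoint SM partitions (e.g., two green-context partitions),
# instead of time-sharing the GPU as chunked prefill does.
# All names below are hypothetical; they do not mirror the SGLang code base.
from dataclasses import dataclass, field


@dataclass
class PDMuxScheduler:
    waiting_prefill: list = field(default_factory=list)  # newly arrived requests
    running_decode: list = field(default_factory=list)   # requests generating tokens

    def step(self):
        # 1) The decode batch is always formed so ITL stays within its SLO.
        decode_batch = list(self.running_decode)

        # 2) A bounded prefill batch is launched concurrently on the other SM
        #    partition so it does not starve decode of SMs or memory.
        prefill_batch = self.waiting_prefill[:4]
        del self.waiting_prefill[:4]

        # In the real system the two batches would be dispatched to the two
        # partitions/streams here; this toy just returns them.
        self.running_decode.extend(prefill_batch)  # prefilled requests start decoding
        return decode_batch, prefill_batch
```

The intended contrast with chunked prefill is spatial rather than temporal sharing of the GPU, which is what allows decode to keep meeting the ITL target while prefill proceeds.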
## Related resources
- arXiv: Optimizing SLO-oriented LLM Serving with PD-Multiplexing