## Try PD-Multiplexing Now

You can try PD-Multiplexing on the PoC branch.
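For a quick local trial, here is a minimal sketch. It assumes the PoC branch exposes an `--enable-pdmux` server flag (per the server-args PR in the checklist below) and the standard OpenAI-compatible `/v1/completions` endpoint; check the branch itself for the exact flag names.

```python
# Minimal sketch for trying PD-Multiplexing on the PoC branch.
# Assumption: the branch adds an `--enable-pdmux` flag (see PR #11427 below).
import subprocess
import time

import requests

server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "codellama/CodeLlama-34b-Instruct-hf",
    "--port", "30000",
    "--enable-pdmux",  # assumed flag name; verify against the PoC branch
])

# Poll the readiness endpoint until the server is up.
for _ in range(180):
    try:
        if requests.get("http://127.0.0.1:30000/health", timeout=2).ok:
            break
    except requests.RequestException:
        time.sleep(5)

resp = requests.post(
    "http://127.0.0.1:30000/v1/completions",
    json={
        "model": "codellama/CodeLlama-34b-Instruct-hf",
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
    },
)
print(resp.json())
server.terminate()
```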
## Benchmark Results (Preview)
The figures above show benchmark results on an NVIDIA H200 with CodeLlama-34b-Instruct-hf on the PoC branch above. We compare ChunkPrefill against PD-Multiplexing (pdmux).
These are preliminary results — we will provide more detailed benchmark data in follow-up updates.
Results for ShareGPT and LooGLE on a single H200 with CodeLlama-34b-hf, with the ITL SLO target set to 60 ms.
We set the ITL SLO target to 60 ms and do not impose an SLO constraint on TTFT; instead, we report the P99 TTFT to demonstrate the efficiency of PD-Multiplexing. In the figure, solid points indicate that the corresponding baseline meets the ITL SLO, while empty points indicate that it violates it.
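For context on how such a figure is read, the sketch below computes the quantities mentioned here from per-request latency traces. The trace layout and the attainment criterion (P99 ITL under the 60 ms target) are assumptions for illustration, not taken from the benchmark scripts.

```python
# Sketch of the SLO accounting described above (illustrative assumptions:
# per-request TTFTs and per-token ITLs are available as plain lists, and a
# baseline "meets" the ITL SLO when its P99 ITL is under the target).
import numpy as np

ITL_SLO_S = 0.060  # 60 ms ITL target; no SLO is imposed on TTFT


def slo_report(ttft_s: list[float], itls_s: list[list[float]]) -> dict:
    all_itls = np.concatenate([np.asarray(x, dtype=float) for x in itls_s])
    p99_itl = float(np.percentile(all_itls, 99))
    p99_ttft = float(np.percentile(np.asarray(ttft_s, dtype=float), 99))
    return {
        "p99_itl_s": p99_itl,
        "p99_ttft_s": p99_ttft,                  # reported only, not constrained
        "meets_itl_slo": p99_itl <= ITL_SLO_S,   # solid vs. empty point in the figure
    }
```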
## Checklist
- Layer-wise Prefill: [Feature] Layer-wise Prefill #7634
- Add a test for Layer-wise Prefill: [Feature] Add a test for Layer-wise Prefill #8231
- CUDA Green Context Support: [Feature] CUDA Green Context Support #7649
- TP Group Switching for PD-Multiplexing: [Feature]TP Group Switching for PD-Multiplexing #7653
- Server args and documentation of PD-Multiplexing: [Documentation][Configuration] Server args and documentation of PD-Multiplexing. #11427
- Fix split prefill with FA3: [Fix] Fix split prefill with fa3. #11428
- Reuse flashinfer workspace for PD-Multiplexing: [Feature] Reuse flashinfer workspace for PD-Multiplexing. #11540
- PD-Multiplexing scheduler (see the conceptual sketch after this checklist): [Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. #12275
- CUDA Graph for PD-Multiplexing: [Feature] Enable CUDA graph for PD-Multiplexing. #11595
- Add unit tests
- Scheduling tutorial and deployment guide.
- Compatibility with PyTorch > 2.6.
- Supporting MoE models.
- Supporting speculative models.
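To make the scheduler item above slightly more concrete, here is a toy sketch of the core idea: each scheduling step launches the decode batch first (to protect ITL) and a bounded amount of prefill work concurrently, with the two intended to run on disjoint SM partitions (e.g., CUDA green contexts). All names and the structure are hypothetical and do not mirror the actual SGLang implementation.

```python
# Illustrative-only sketch of spatial PD-Multiplexing: prefill and decode run
# concurrently on disjoint SM partitions (e.g., two green-context partitions),
# instead of time-sharing the GPU as chunked prefill does.
# All names below are hypothetical; they do not mirror the SGLang code base.
from dataclasses import dataclass, field


@dataclass
class PDMuxScheduler:
    waiting_prefill: list = field(default_factory=list)  # newly arrived requests
    running_decode: list = field(default_factory=list)   # requests generating tokens

    def step(self):
        # 1) The decode batch is always formed so ITL stays within its SLO.
        decode_batch = list(self.running_decode)

        # 2) A bounded prefill batch is launched concurrently on the other SM
        #    partition so it does not starve decode of SMs or memory.
        prefill_batch = self.waiting_prefill[:4]
        del self.waiting_prefill[:4]

        # In the real system the two batches would be dispatched to the two
        # partitions/streams here; this toy just returns them.
        self.running_decode.extend(prefill_batch)  # prefilled requests start decoding
        return decode_batch, prefill_batch
```

The intended contrast with chunked prefill is spatial rather than temporal sharing of the GPU, which is what allows decode to keep meeting the ITL target while prefill proceeds.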
## Related resources
- arXiv: Optimizing SLO-oriented LLM Serving with PD-Multiplexing