Replies: 2 comments
- @zhaochenyang20 WDYT?
- I've asked our team. Thanks!
Hey Community!
I am exploring collaboration with different inference engines to create reusable orchestration flows on Kubernetes. I have started a proposal at vLLM Production Stack.
The goal is to create a generic API so that different inference engines (vLLM or SGLang) can be deployed on Kubernetes for different performance, SLA, and resource usage goals.
Such an API should support the following use cases:
Currently, there are quite a few efforts toward this goal. However, they lack reusable support for the above use cases.
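To make the engine-agnostic API idea a bit more concrete, here is a minimal Python sketch of what such an abstraction could look like. Everything in it is hypothetical: `InferenceDeploymentSpec` and `render_container_args` are not part of vLLM Production Stack, SGLang, or any existing project, and the engine flags are illustrative and should be checked against each engine's current documentation.

```python
# Hypothetical sketch of an engine-agnostic deployment spec.
# None of these names exist in vLLM Production Stack or SGLang today;
# they only illustrate one API covering multiple inference engines.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InferenceDeploymentSpec:
    """Engine-agnostic description of one model deployment."""
    model: str                          # e.g. "meta-llama/Llama-3.1-8B-Instruct"
    engine: str                         # "vllm" or "sglang"
    replicas: int = 1
    gpus_per_replica: int = 1
    max_latency_ms: Optional[int] = None  # SLA hint for a router/autoscaler
    extra_args: dict = field(default_factory=dict)


def render_container_args(spec: InferenceDeploymentSpec) -> list:
    """Translate the generic spec into engine-specific container args
    (flags shown here are illustrative; verify against engine docs)."""
    if spec.engine == "vllm":
        args = ["vllm", "serve", spec.model,
                "--tensor-parallel-size", str(spec.gpus_per_replica)]
    elif spec.engine == "sglang":
        args = ["python", "-m", "sglang.launch_server",
                "--model-path", spec.model,
                "--tp", str(spec.gpus_per_replica)]
    else:
        raise ValueError(f"unsupported engine: {spec.engine}")
    for key, value in spec.extra_args.items():
        args += [key, str(value)]
    return args


if __name__ == "__main__":
    spec = InferenceDeploymentSpec(
        model="meta-llama/Llama-3.1-8B-Instruct", engine="sglang", replicas=2)
    print(render_container_args(spec))
```

The point of the sketch is that performance, SLA, and resource goals live in one spec, while engine-specific details are confined to a single translation step that Kubernetes controllers could reuse.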
I am looking forward to hearing from the community and starting a productive collaboration to explore this direction.
Cheers!
Huamin Chen, Red Hat