Add prefix cache aware benchmarking config #1822
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
Hi @rlakhtakia. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 1e31046 to 85375d7.
```
helm install my-release ../inference-perf -f high-cache-values.yaml --set config.data.shared_prefix.num_groups=512
```

## Deployment
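For context, the `--set` override in the command above can also live in the values file itself. The following is a minimal sketch of what the relevant section of `high-cache-values.yaml` might look like; only the `config.data.shared_prefix.num_groups` path is taken from the helm command, and any sibling keys shown are illustrative assumptions, not the confirmed chart schema.

```yaml
# Hypothetical excerpt of high-cache-values.yaml.
# Only config.data.shared_prefix.num_groups is confirmed by the
# helm command in this thread; the surrounding nesting is assumed.
config:
  data:
    shared_prefix:
      num_groups: 512   # more prompt groups sharing a prefix -> higher cache reuse
```

Either form works; when both are given, a `--set` flag takes precedence over the values-file entry.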
IIUC this guide covers the inference-perf config/deployment only. Can you document the vLLM and InferencePool deployment?
I've moved the guide under benchmarking as a new section called advanced configs, and added reference links to the original guide for setup instructions.
Sorry, where is this referenced in the original guide? I don't see anything that references this new prefix-cache-aware file.
Force-pushed from 8d4c7ef to 91bd33a.
### Storage Parameters

> Note: Currently inference-perf outputs benchmark results to standard output only, and results will be deleted once the pod finishes running the job.
This is only true if you do not specify an output path/GCS bucket, right?
Yes, this is true if a GCS or S3 bucket is not specified. Updated to reflect this.
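To make results persist beyond the pod's lifetime, the benchmark would need to write its report to object storage. A hedged sketch of what such a values override might look like, assuming a `storage` block with a GCS bucket field (the key names here are assumptions; check the inference-perf documentation for the exact schema):

```yaml
# Hypothetical sketch -- the storage key and its field names are
# assumptions, not confirmed against the inference-perf config schema.
config:
  storage:
    google_cloud_storage:
      bucket_name: my-benchmark-results   # results survive pod completion
```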
Looks good! Left a small comment.

/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kfswain, rlakhtakia. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/ok-to-test
Force-pushed from 30596e3 to 0265f47.

Force-pushed from 3e790df to 9391b35.
/lgtm

/hold if there are any nits you want to address. Feel free to unhold.
All changes have been addressed.

/unhold
/lgtm
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Adds a standard benchmarking config for the precise prefix cache scenario (high and low cache).
Does this PR introduce a user-facing change?: