
Commit 91bd33a

Add prefix cache aware benchmarking config
1 parent 95fd944 commit 91bd33a

4 files changed: +326 -54 lines changed
Lines changed: 81 additions & 0 deletions
```yaml
# High-Cache Configuration
job:
  image:
    repository: quay.io/inference-perf/inference-perf
    tag: "0.2.0" # Defaults to .Chart.AppVersion
  serviceAccountName: ""
  nodeSelector: {}
  # Example resources:
  # resources:
  #   requests:
  #     cpu: "1"
  #     memory: "4Gi"
  #   limits:
  #     cpu: "2"
  #     memory: "8Gi"
  resources: {}

logLevel: INFO

# A GCS bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
gcsPath: ""

# hfToken optionally creates a secret with the specified token.
# Can be set using helm install --set hfToken=<token>
hfToken: ""

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 100
        duration: 30
      - rate: 200
        duration: 30
      - rate: 300
        duration: 30
      - rate: 400
        duration: 30
      - rate: 500
        duration: 30
      - rate: 600
        duration: 30
      - rate: 700
        duration: 30
      - rate: 800
        duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 2048
      question_len: 256
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
```
Lines changed: 81 additions & 0 deletions
```yaml
# Low-Cache Configuration
job:
  image:
    repository: quay.io/inference-perf/inference-perf
    tag: "0.2.0" # Defaults to .Chart.AppVersion
  serviceAccountName: ""
  nodeSelector: {}
  # Example resources:
  # resources:
  #   requests:
  #     cpu: "1"
  #     memory: "4Gi"
  #   limits:
  #     cpu: "2"
  #     memory: "8Gi"
  resources: {}

logLevel: INFO

# A GCS bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
gcsPath: ""

# hfToken optionally creates a secret with the specified token.
# Can be set using helm install --set hfToken=<token>
hfToken: ""

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 100
        duration: 30
      - rate: 200
        duration: 30
      - rate: 300
        duration: 30
      - rate: 400
        duration: 30
      - rate: 500
        duration: 30
      - rate: 600
        duration: 30
      - rate: 700
        duration: 30
      - rate: 800
        duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 256 # Low-cache setting
      question_len: 2048 # Low-cache setting
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
```
Lines changed: 108 additions & 0 deletions
# Prefix Cache Aware Benchmarking

This guide shows how to deploy a prefix-cache-aware benchmarking config using inference-perf. Follow the [benchmarking guide](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#benchmark) for more information on how to run and validate benchmark results.

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.
* **Hugging Face Token Secret**: A Hugging Face token to pull models.
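
Before installing anything, it can help to confirm the tooling and cluster access are in place. A quick check, assuming `kubectl` is already pointed at the target cluster:

```bash
# Verify client tooling and cluster access before installing the chart.
helm version --short
kubectl version --output=yaml
kubectl get nodes
```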

## Shared Prefix Dataset Configuration

The chart uses the `shared_prefix` dataset type, which is designed to test caching efficiency. These parameters are located under `config.data.shared_prefix`:

* `num_groups`: The number of shared prefix groups.
* `num_prompts_per_group`: The number of prompts within each shared prefix group.
* `system_prompt_len`: The length of the system prompt.
* `question_len`: The length of the question part of the prompt.
* `output_len`: The desired length of the model's output.

The default values for the dataset are defined in the chart, but you can override them using `--set config.data.shared_prefix.<parameter>` flags.

Example:

```bash
helm install my-release ../inference-perf -f high-cache-values.yaml --set config.data.shared_prefix.num_groups=512
```
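
The two bundled values files keep the same total input length but split it differently: `high-cache-values.yaml` sets `system_prompt_len: 2048` with `question_len: 256`, so roughly 89% of each prompt's input tokens are shared within a group and can potentially hit a warm prefix cache, while `low-cache-values.yaml` flips the split (`system_prompt_len: 256`, `question_len: 2048`), leaving only about 11% shared. A quick sketch of that arithmetic:

```bash
# Shared-prefix share of input tokens = system_prompt_len / (system_prompt_len + question_len)
awk 'BEGIN { printf "high-cache: %.0f%%\n", 100 * 2048 / (2048 + 256) }'
awk 'BEGIN { printf "low-cache:  %.0f%%\n", 100 * 256 / (256 + 2048) }'
```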

## Deployment

This chart supports two main configurations, defined in `high-cache-values.yaml` and `low-cache-values.yaml`.

### 1. Check out the repo.

```bash
git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension/benchmarking/prefix-cache-aware
```

### 2. Get the target IP.

The examples below show how to get the IP of a gateway or a k8s service.

```bash
# Get gateway IP
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
# Get LoadBalancer k8s service IP
SVC_IP=$(kubectl get service/vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo $GW_IP
echo $SVC_IP
```
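
Before launching a long benchmark run, it can be worth confirming that the chosen IP and port actually serve completions. A minimal smoke test, assuming the target exposes an OpenAI-compatible `/v1/completions` endpoint (as vLLM does) for the model configured in the values files; the port value here is a placeholder:

```bash
IP=$GW_IP   # or $SVC_IP
PORT=80     # assumption: adjust to the port your gateway/service exposes

curl -s "http://${IP}:${PORT}/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Hello",
        "max_tokens": 8
      }'
```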

### 3. Deploying the High-Cache Configuration

This configuration is optimized for scenarios where a high cache hit rate is expected. It uses the `high-cache-values.yaml` file.

```bash
cd gateway-api-inference-extension/benchmarking/prefix-cache-aware
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
helm install high-cache ../inference-perf -f high-cache-values.yaml \
  --set hfToken=${HF_TOKEN} \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `high-cache`: A unique name for this deployment.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the high-cache scenario.
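
Once the release is installed, the benchmark runs as a Kubernetes Job. A sketch for watching it, assuming the Job and its pod carry the standard Helm instance label (`app.kubernetes.io/instance=high-cache`); adjust the selector if the chart labels its resources differently:

```bash
# Check release status, then follow the benchmark job's progress.
helm status high-cache
kubectl get jobs,pods -l app.kubernetes.io/instance=high-cache
kubectl logs -f -l app.kubernetes.io/instance=high-cache
```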

### 4. Deploying the Low-Cache Configuration

This configuration is designed for scenarios with a lower cache hit rate. It uses the `low-cache-values.yaml` file.

```bash
cd gateway-api-inference-extension/benchmarking/prefix-cache-aware
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
helm install low-cache ../inference-perf -f low-cache-values.yaml \
  --set hfToken=${HF_TOKEN} \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `low-cache`: A unique name for this deployment.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the low-cache scenario.

## Clean Up

To uninstall the deployed charts:

```bash
helm uninstall high-cache
helm uninstall low-cache
```

## Post Benchmark Analysis

Follow the benchmarking guide instructions to [compare benchmark results](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#analyze-the-results).
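
If you want to keep a local copy of each run's raw output before comparing results, one option is to save the job pods' logs, assuming the report summary is written to stdout and the pods carry the Helm instance label used above:

```bash
# Capture each benchmark run's full log output for offline comparison.
kubectl logs -l app.kubernetes.io/instance=high-cache --tail=-1 > high-cache-report.log
kubectl logs -l app.kubernetes.io/instance=low-cache --tail=-1 > low-cache-report.log
```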
