# Prefix Cache Aware Benchmarking

This guide shows how to deploy a prefix-cache-aware benchmarking configuration using inference-perf. Follow the [benchmarking guide](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#benchmark) for more information on how to run and validate benchmark results.

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster.
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.
* **Hugging Face Token Secret**: A Hugging Face token to pull models; see the example below for creating the secret.
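
If the token secret does not exist in your cluster yet, you can create it up front. A minimal sketch, assuming the default secret name `hf-token` and key `token` noted in the parameter lists later in this guide:

```bash
# Create a Kubernetes Secret holding the Hugging Face token.
# Assumes the chart defaults documented below: secret name "hf-token", key "token".
kubectl create secret generic hf-token \
  --from-literal=token='<YOUR_HUGGINGFACE_TOKEN>'
```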

## Shared Prefix Dataset Configuration

The chart uses the `shared_prefix` dataset type, which is designed to test caching efficiency. These parameters are located under `config.data.shared_prefix`:

* `num_groups`: The number of shared prefix groups.
* `num_prompts_per_group`: The number of prompts within each shared prefix group.
* `system_prompt_len`: The length of the system prompt.
* `question_len`: The length of the question part of the prompt.
* `output_len`: The desired length of the model's output.

The default values for the dataset are defined in the chart, but you can override them using `--set config.data.shared_prefix.<parameter>` flags.

Example:

```bash
helm install my-release ../inference-perf -f high-cache-values.yaml --set config.data.shared_prefix.num_groups=512
```
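
Overrides can be combined, and the total dataset size is the product `num_groups × num_prompts_per_group`. In this dataset the system prompt is the shared portion within a group, so a larger `system_prompt_len` relative to `question_len` means more of each request can potentially be served from the prefix cache. A hypothetical smoke-test override (the values here are illustrative, not tuned recommendations):

```bash
# Hypothetical quick run: 16 groups x 8 prompts = 128 total prompts,
# with shorter prompts and outputs. Parameter names come from the list
# above; the specific values are only examples.
helm install smoke-test ../inference-perf -f high-cache-values.yaml \
  --set config.data.shared_prefix.num_groups=16 \
  --set config.data.shared_prefix.num_prompts_per_group=8 \
  --set config.data.shared_prefix.system_prompt_len=512 \
  --set config.data.shared_prefix.question_len=64 \
  --set config.data.shared_prefix.output_len=128
```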

## Deployment

This chart supports two main configurations, defined in `high-cache-values.yaml` and `low-cache-values.yaml`.

### 1. Check out the repo.

```bash
git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension/benchmarking/prefix-cache-aware
```
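
With the repo checked out, you can compare the two scenario files directly to see what actually differs between them (both files live in this directory, per the install commands below):

```bash
# Show how the high-cache and low-cache scenarios differ.
diff high-cache-values.yaml low-cache-values.yaml
```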

### 2. Get the target IP.

The examples below show how to get the IP of a gateway or a Kubernetes service.

```bash
# Get the gateway IP
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
# Get the LoadBalancer k8s service IP
SVC_IP=$(kubectl get service/vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo $GW_IP
echo $SVC_IP
```
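
Before installing the chart, it is worth confirming the target is reachable. A minimal check, assuming an OpenAI-compatible server such as vLLM behind the endpoint (`/v1/models` is a standard route in that API):

```bash
# Quick reachability check against the chosen target (gateway or service).
# Substitute the port your gateway or service listens on.
curl -s "http://${GW_IP}:<YOUR_PORT>/v1/models"
```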

### 3. Deploy the high-cache configuration.

This configuration is optimized for scenarios where a high cache hit rate is expected. It uses the `high-cache-values.yaml` file. Run the following from the `benchmarking/prefix-cache-aware` directory you entered in step 1:

```bash
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
helm install high-cache ../inference-perf -f high-cache-values.yaml \
  --set hfToken=${HF_TOKEN} \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `high-cache`: The Helm release name; choose a unique name per deployment.
* `hfTokenSecret.name`: The name of the Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in the Secret that holds the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the high-cache scenario.
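
Before starting the second scenario, you can confirm the release is up and the benchmark pod is running. A sketch assuming the standard `app.kubernetes.io/instance` label that Helm-templated charts usually apply; adjust the selector if this chart labels pods differently:

```bash
# Confirm the release deployed and watch its pods start.
helm status high-cache
kubectl get pods -l app.kubernetes.io/instance=high-cache -w
```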

### 4. Deploy the low-cache configuration.

This configuration is designed for scenarios with a lower cache hit rate. It uses the `low-cache-values.yaml` file. As in step 3, run it from the `benchmarking/prefix-cache-aware` directory:

```bash
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
helm install low-cache ../inference-perf -f low-cache-values.yaml \
  --set hfToken=${HF_TOKEN} \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `low-cache`: The Helm release name; choose a unique name per deployment.
* `hfTokenSecret.name`: The name of the Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in the Secret that holds the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the low-cache scenario.
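
Both runs must complete before their results are worth comparing. A hedged sketch for following a run's progress from its pod logs, again assuming the standard Helm instance label:

```bash
# Stream logs from the low-cache run's pod(s); assumes the standard
# app.kubernetes.io/instance label set by Helm-templated charts.
kubectl logs -f -l app.kubernetes.io/instance=low-cache --all-containers
```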

## Clean Up

To uninstall the deployed charts:

```bash
helm uninstall high-cache
helm uninstall low-cache
```
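
If you created the `hf-token` secret manually in the prerequisites step, you can remove it as well:

```bash
# Only needed if you created the secret yourself earlier.
kubectl delete secret hf-token
```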

## Post Benchmark Analysis

Follow the benchmarking guide instructions to [compare benchmark results](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/#analyze-the-results).