8 changes: 4 additions & 4 deletions docs/proposals/003-model-server-protocol/README.md
@@ -24,13 +24,13 @@ Note the requirements here are aligned with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
into the reference endpoint picker implementation.

| Metric | Type | Description | vLLM metric | Triton TensorRT-LLM| SGLang |
| ----- | ---- | ------------ | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage` |
| [Optional] BlockSize | Labeled | The block size in tokens to allocate memory, used by the prefix cache scorer. If this metric is not available, the BlockSize will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin). | name: `vllm:cache_config_info`, label name: `block_size` | | |
| [Optional] NumGPUBlocks | Labeled | The total number of blocks in the HBM KV cache, used by the prefix cache scorer. If this metric is not available, the NumGPUBlocks will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin). | name: `vllm:cache_config_info`, label name: `num_gpu_blocks` | | |
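
For illustration, scraping the Prometheus endpoint of a vLLM replica would surface the metrics above roughly as follows. The sample values are made up, and additional labels that vLLM attaches (such as the model name) are omitted for brevity:

```
# Illustrative scrape output; values are examples only.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 3
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.42
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",num_gpu_blocks="27000"} 1
```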


### LoRA Adapter Serving
@@ -60,4 +60,4 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
Starting from [v0.4.0](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.4.0),
the EPP supports [prefix cache optimized request scheduling](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
To benefit from optimal prefix-aware request scheduling, model servers SHOULD support prefix
cache reuse, such as the [vLLM automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
67 changes: 39 additions & 28 deletions site-src/guides/epp-configuration/prefix-aware.md
@@ -15,43 +15,54 @@ Like any other plugins, the prefix cache aware plugin can be enabled/disabled vi
The prefix cache plugin exposes the following advanced configuration parameters:

* `blockSize`: The plugin matches prefixes in the unit of blocks. This is the size
of each block in number of bytes. At runtime, EPP can dynamically fetch this information from the
inference engine metrics, so this config is only used when that metric is not available. In
vLLM, the metric name is `vllm:cache_config_info` and the metric label is `block_size`. See the
[model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
for more details.

The vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
extremely long inputs.

* `maxPrefixBlocksToMatch`: The maximum number of blocks to search for a prefix match. The default is
256 (or 256*64=16384 characters, or roughly 4096 tokens). This is useful to trade off prefix match accuracy
for performance.

* `lruCapacityPerServer`: Maximum capacity of the prefix LRU cache, in number of block hashes per server (pod).
Similar to `blockSize`, EPP can dynamically fetch this from the inference engine metrics endpoint.
In vLLM, the metric name is `vllm:cache_config_info` and the metric label is `num_gpu_blocks`. See the
[model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
for more details.

If such a metric is not available, you can follow the guide below to estimate this value. A configuration
sketch covering all three parameters is also shown at the end of this page.

The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
the EPP cache is smaller than HBM cache, a positive EPP cache match is more accurate, but there are more
false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**

NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache entries
in tokens, so a character <-> token conversion is needed.

Below are the formulas to estimate the EPP prefix indexer size:

```
max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
```

Let's take an example:

* Model: llama3 8B
* Accelerator: Nvidia H100 80GB
* Num replicas: 3
* Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))

```
max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
# assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
# each entry is about 358 bytes, so the memory footprint is about 11 MB per server
lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31,250
```
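
For reference, the arithmetic above can be reproduced with a short script. This is a minimal sketch, not part of EPP; the HBM size, model size, KV cache size per token, and characters per token are assumptions taken from the example and should be replaced with numbers for your own deployment:

```
# Sketch: estimate lruCapacityPerServer using the formulas above.
# All inputs are assumptions from the llama3 8B / H100 example (decimal GB/KB, as in the example).

GB = 10**9
KB = 10**3

hbm_size = 80 * GB             # H100 HBM capacity
model_size = 16 * GB           # ~8B parameters at 2 bytes each
kv_size_per_token = 128 * KB   # KV cache bytes per token (model/config dependent)
avg_chars_per_token = 4        # rough average for English text
block_size = 64                # EPP blockSize in characters (default)

max_kv_tokens_per_server = (hbm_size - model_size) / kv_size_per_token
lru_capacity_per_server = max_kv_tokens_per_server * avg_chars_per_token / block_size

print(f"max_kv_tokens_per_server = {max_kv_tokens_per_server:,.0f}")  # 500,000
print(f"lruCapacityPerServer     = {lru_capacity_per_server:,.0f}")   # 31,250
```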
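
Putting the three parameters together, the sketch below shows how the resulting values could be wired into the EPP plugin configuration. The parameter names are the ones documented on this page; the surrounding schema (apiVersion, kind, and plugin type name) is an assumption and should be verified against the EPP configuration guide before use:

```
# Illustrative only: the plugin type and top-level schema are assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
  parameters:
    blockSize: 64                # characters per block (16-token vLLM blocks * ~4 chars/token)
    maxPrefixBlocksToMatch: 256  # cap on blocks scanned per request
    lruCapacityPerServer: 31250  # from the estimate above; only needed if the metric is unavailable
```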