Commit aa8be26

committed
Updates to documentation
Signed-off-by: Shmuel Kallner <[email protected]>
1 parent 2c32461 commit aa8be26

3 files changed

+110
-21
lines changed

mkdocs.yml

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ nav:
7373
- InferencePool Rollout: guides/inferencepool-rollout.md
7474
- Metrics and Observability: guides/metrics-and-observability.md
7575
- Configuration Guide:
76-
- Configuring the plugins via configuration YAML file: guides/epp-configuration/config-text.md
76+
- Configuring the EndPoint Picker via configuration YAML file: guides/epp-configuration/config-text.md
7777
- Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
7878
- Migration Guide: guides/ga-migration.md
7979
- Troubleshooting Guide: guides/troubleshooting.md

site-src/guides/epp-configuration/config-text.md

Lines changed: 98 additions & 17 deletions
@@ -1,23 +1,14 @@
1-
# Configuring Plugins via YAML
1+
# Configuring via YAML
22

3-
The set of lifecycle hooks (plugins) that are used by the Inference Gateway (IGW) is determined by how
4-
it is configured. The IGW is primarily configured via a configuration file.
3+
The Inference Gateway (IGW) can be configured via a YAML file.
54

6-
The YAML file can either be specified as a path to a file or in-line as a parameter. The configuration defines the set of
7-
plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling
8-
the same plugin type to be instantiated multiple times, if needed (such as when configuring multiple scheduling profiles).
5+
At this time the YAML file based configuration allows for:
96

10-
Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request.
11-
If no scheduling profile is specified, a default profile, named `default` will be added and will reference all of the
12-
instantiated plugins.
7+
1. The set of lifecycle hooks (plugins) that are used by the IGW.
8+
2. The configuration of the saturation detector.
9+
3. A set of feature gates that are used to enable experimental features.
1310

14-
The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
15-
will be used for a particular request. A Profile Handler must be specified, unless the configuration only
16-
contains one profile, in which case the `SingleProfileHandler` will be used.
17-
18-
In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
19-
the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
20-
instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
11+
The YAML file can either be specified as a path to a file or in-line as a parameter.
2112

2213
***NOTE***: While the configuration text looks like a Kubernetes CRD, it is
2314
**NOT** a Kubernetes CRD. Specifically, the config is not reconciled upon, and is only read on startup.
@@ -33,10 +24,46 @@ plugins:
3324
schedulingProfiles:
3425
- ....
3526
- ....
27+
saturationDetector:
28+
...
29+
featureGates:
30+
...
3631
```
3732
3833
The first two lines of the configuration are constant and must appear as is.
3934
35+
The plugins section defines the set of plugins that will be instantiated and their parameters. This section is described in more detail in the section [Configuring Plugins via YAML](#configuring-plugins-via-yaml).
36+
37+
The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
38+
requests to pods. This section is described in more detail in the section [Configuring Plugins via YAML](#configuring-plugins-via-yaml)
39+
40+
The saturationDetector section configures the saturation detector, which is used to determine if special
41+
action needs to be taken due to the system being overloaded or saturated. This section is described in more detail in the section [Saturation Detector configuration](#saturation-detector-configuration)
42+
43+
The featureGates section enables experimental features of the IGW. This section is
44+
described in more detail in the section [Feature Gates](#feature-gates)
45+
46+
## Configuring Plugins via YAML
47+
48+
The set of plugins that are used by the IGW is determined by how it is configured. The IGW is
49+
primarily configured via a configuration file.
50+
51+
The configuration defines the set of plugins to be instantiated along with their parameters.
52+
Each plugin can also be given a name, enabling the same plugin type to be instantiated multiple
53+
times, if needed (such as when configuring multiple scheduling profiles).
54+
55+
Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling
56+
a request. If one is not defined, a default one named `default` will be added and will reference all of
57+
the instantiated plugins.
58+
59+
The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
60+
will be used for a particular request. A Profile Handler must be specified, unless the configuration only
61+
contains one profile, in which case the `SingleProfileHandler` will be used.
62+
63+
In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
64+
the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
65+
instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
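
The defaulting rules above can be illustrated with a minimal Python sketch. This is not the IGW's implementation: the dictionary layout mirrors the YAML only loosely, `apply_defaults`, `PICKER_TYPES`, and the `queue-scorer` type used in the usage note are hypothetical, and using `max-score-picker` as both the name and type of the added picker is an assumption.

```python
import copy

# Hypothetical: plugin types that count as pickers in this sketch.
PICKER_TYPES = {"max-score-picker"}


def apply_defaults(config):
    """Return a copy of config with the documented defaults filled in.

    config is a dict with a "plugins" list of {"name", "type"} entries and
    an optional "schedulingProfiles" list of {"name", "plugins"} entries.
    """
    cfg = copy.deepcopy(config)
    plugins = cfg.setdefault("plugins", [])
    profiles = cfg.setdefault("schedulingProfiles", [])

    # No profile defined: add one named "default" that references
    # every instantiated plugin.
    if not profiles:
        profiles.append({
            "name": "default",
            "plugins": [{"pluginRef": p["name"]} for p in plugins],
        })

    types_by_name = {p["name"]: p["type"] for p in plugins}
    for profile in profiles:
        has_picker = any(
            types_by_name.get(ref["pluginRef"]) in PICKER_TYPES
            for ref in profile["plugins"]
        )
        # No picker referenced: add a max-score picker to this profile.
        if not has_picker:
            if "max-score-picker" not in types_by_name:
                plugins.append({"name": "max-score-picker",
                                "type": "max-score-picker"})
                types_by_name["max-score-picker"] = "max-score-picker"
            profile["plugins"].append({"pluginRef": "max-score-picker"})
    return cfg
```

Calling `apply_defaults({"plugins": [{"name": "queue", "type": "queue-scorer"}]})` yields a single profile named `default` that references `queue` followed by the added picker.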
66+
4067
The plugins section defines the set of plugins that will be instantiated and their parameters.
4168
Each entry in this section has the following form:
4269

@@ -184,7 +211,7 @@ schedulingProfiles:
184211
-pluginRef: max-score-picker
185212
```
186213

187-
## Plugin Configuration
214+
### Plugin Configuration
188215

189216
This section describes how to set up the various plugins that are available with the IGW.
190217

@@ -269,3 +296,57 @@ scored higher (since it's more available to serve new request).
269296

270297
- *Type*: lora-affinity-scorer
271298
- *Parameters*: none
299+
300+
## Saturation Detector configuration
301+
302+
The Saturation Detector is used to determine if the cluster is overloaded, i.e. saturated. When
303+
the cluster is saturated, special actions will be taken depending on what has been enabled. At this time, sheddable requests will be dropped.
304+
305+
The Saturation Detector determines that the cluster is saturated by looking at the following metrics provided by the inference servers:
306+
307+
- Backend waiting queue size
308+
- KV cache utilization
309+
- Metrics staleness
310+
311+
The Saturation Detector is configured via the saturationDetector section of the overall configuration.
312+
It has the following form:
313+
314+
```yaml
315+
saturationDetector:
316+
queueDepthThreshold: 8
317+
kvCacheUtilThreshold: 0.75
318+
metricsStalenessThreshold: 150ms
319+
```
320+
321+
The various sub-fields of the saturationDetector section are:
322+
323+
- The `queueDepthThreshold` field which defines the backend waiting queue size above which a
324+
pod is considered to have insufficient capacity for new requests. This field is optional; if
325+
omitted a value of `5` will be used.
326+
- The `kvCacheUtilThreshold` field which defines the KV cache utilization (0.0 to 1.0) above
327+
which a pod is considered to have insufficient capacity. This field is optional; if omitted
328+
a value of `0.8` will be used.
329+
- The `metricsStalenessThreshold` field which defines how old a pod's metrics can be. If a pod's
330+
metrics are older than this, it might be excluded from "good capacity" considerations or treated
331+
as having no capacity for safety. This field is optional; if omitted, a value of `200ms` will be used.
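
The three thresholds can be read as a per-pod capacity check. The following is a minimal Python sketch, not the IGW's Go implementation: the class and function names are hypothetical, staleness is expressed in seconds for simplicity, and treating the pool as saturated when no pod has capacity is an assumption drawn from the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class SaturationConfig:
    # Defaults mirror the documented fallbacks: 5, 0.8, and 200ms.
    queue_depth_threshold: int = 5
    kv_cache_util_threshold: float = 0.8
    metrics_staleness_threshold_s: float = 0.2


@dataclass
class PodMetrics:
    waiting_queue_size: int
    kv_cache_utilization: float  # 0.0 to 1.0
    metrics_age_s: float         # seconds since the metrics were refreshed


def has_capacity(pod: PodMetrics, cfg: SaturationConfig) -> bool:
    """A pod has capacity only if its metrics are fresh and under both limits."""
    if pod.metrics_age_s > cfg.metrics_staleness_threshold_s:
        # Stale metrics: treated here as no capacity, for safety.
        return False
    return (pod.waiting_queue_size <= cfg.queue_depth_threshold
            and pod.kv_cache_utilization <= cfg.kv_cache_util_threshold)


def is_saturated(pods: list[PodMetrics], cfg: SaturationConfig) -> bool:
    """Assumption: the pool is saturated when no pod has capacity."""
    return not any(has_capacity(p, cfg) for p in pods)
```

With the example values from the YAML above (`8`, `0.75`, `150ms`), a pod reporting a queue of 9 would be counted as lacking capacity even if its KV cache is mostly free.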
332+
333+
## Feature Gates
334+
335+
The Feature Gates section allows for the enabling of experimental features of the IGW. These experimental
336+
features are all disabled unless you explicitly enable them one by one.
337+
338+
The Feature Gates section has the following form:
339+
340+
```yaml
341+
featureGates:
342+
- dataLayer
343+
- flowControl
344+
```
345+
346+
The Feature Gates section is an array of flags, each of which enables one experimental feature.
347+
The available values for these elements are:
348+
349+
- `dataLayer` which, if present, enables the experimental Datalayer APIs.
350+
- `flowControl` which, if present, enables the experimental FlowControl feature.
351+
352+
In all cases if the appropriate element isn't present, that experimental feature will be disabled.
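
The "disabled unless listed" behavior can be sketched in a few lines of Python. This is an illustration only: `resolve_feature_gates` and `KNOWN_GATES` are hypothetical names, and rejecting unknown gate names with an error is an assumption, since the text above does not say how the IGW handles them.

```python
# The two gate names documented above.
KNOWN_GATES = ("dataLayer", "flowControl")


def resolve_feature_gates(enabled):
    """Map each known gate to True if it is listed, otherwise False."""
    unknown = [gate for gate in enabled if gate not in KNOWN_GATES]
    if unknown:
        # Assumption: unknown names are treated as a configuration error.
        raise ValueError(f"unknown feature gates: {unknown}")
    return {gate: gate in enabled for gate in KNOWN_GATES}
```

An empty list leaves both features disabled, and listing only `flowControl` enables that feature while `dataLayer` stays off.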

site-src/guides/troubleshooting.md

Lines changed: 11 additions & 3 deletions
@@ -23,9 +23,17 @@ This error indicates that the entire inference pool has exceeded its saturation
2323
* **v1.0.0 and later**: Ensure the `InferenceObjective` you're using has a `priority` greater than or equal to 0. A negative priority can cause requests to be dropped.
2424

2525
* Pool Thresholds: Check the defined pool [thresholds](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/f36111cab0ed5a309d1eafade896d4f37ab623a6/pkg/epp/saturationdetector/config.go#L41) to understand the saturation limits. Currently, we use three main metrics to assess the system's load:
26-
* `DefaultQueueDepthThreshold`: This is the maximum number of requests waiting in the queue for a backend. The default value is 5. If the queue for a model server exceeds this number, the saturation detector may consider the system under pressure. To override this, set the `SD_QUEUE_DEPTH_THRESHOLD` environment variable.
27-
* `DefaultKVCacheUtilThreshold`: This is the maximum utilization of the Key-Value (KV) cache on the model server, expressed as a decimal from 0.0 to 1.0. The default is 0.8, or 80%. The KV cache stores attention keys and values to speed up inference for subsequent tokens. When its utilization exceeds this threshold, it's an indication that the model server is nearing its memory capacity and may be becoming saturated. To override this, set the `SD_KV_CACHE_UTIL_THRESHOLD` environment variable.
28-
* `DefaultMetricsStalenessThreshold`: This defines the maximum age of metrics data before it's considered outdated. The default is 200 milliseconds. The saturation detector needs up-to-date metrics to make accurate decisions about system load. If the metrics are older than this threshold, the detector won't use them. This value is tied to how often metrics are refreshed, and setting it slightly higher ensures that there's always fresh data available. To override this, set the `SD_METRICS_STALENESS_THRESHOLD` environment variable.
26+
* `DefaultQueueDepthThreshold`: This is the maximum number of requests waiting in the queue for a backend. The default value is 5. If the queue for a model server exceeds this number, the saturation detector may consider the system under pressure.
27+
To override this, set the `queueDepthThreshold` field in the `saturationDetector` section of the text based configuration.
28+
See [Saturation Detector configuration](../epp-configuration/config-text#saturation-detector-configuration).
29+
<br>**Note:** The use of the `SD_QUEUE_DEPTH_THRESHOLD` environment variable to override this is now deprecated.
30+
* `DefaultKVCacheUtilThreshold`: This is the maximum utilization of the Key-Value (KV) cache on the model server, expressed as a decimal from 0.0 to 1.0. The default is 0.8, or 80%. The KV cache stores attention keys and values to speed up inference for subsequent tokens. When its utilization exceeds this threshold, it's an indication that the model server is nearing its memory capacity and may be becoming saturated.
31+
To override this, set the `kvCacheUtilThreshold` field in the `saturationDetector` section of the text based configuration.
32+
See [Saturation Detector configuration](../epp-configuration/config-text#saturation-detector-configuration).
33+
<br>**Note:** The use of the `SD_KV_CACHE_UTIL_THRESHOLD` environment variable to override this is now deprecated.
34+
* `DefaultMetricsStalenessThreshold`: This defines the maximum age of metrics data before it's considered outdated. The default is 200 milliseconds. The saturation detector needs up-to-date metrics to make accurate decisions about system load. If the metrics are older than this threshold, the detector won't use them. This value is tied to how often metrics are refreshed, and setting it slightly higher ensures that there's always fresh data available.
35+
To override this, set the `metricsStalenessThreshold` field in the `saturationDetector` section of the text based configuration. See [Saturation Detector configuration](../epp-configuration/config-text#saturation-detector-configuration).
36+
<br>**Note:** The use of the `SD_METRICS_STALENESS_THRESHOLD` environment variable to override this is now deprecated.
2937

3038
## 500 Internal Server Error
3139
### `fault filter abort`
