ETCD-less Dynamo in K8s #37
Open
mohammedabdulwahhab wants to merge 19 commits into main from mohammedabdulwahhab-patch-2
Changes from 9 commits

Commits (19, all by mohammedabdulwahhab):
eb13577 Add document on ETCD alternatives for Dynamo in K8s
23fc0d5 Revise mermaid diagram for Kubernetes service flow
636197a Revise trade-offs and notes in etcd-k8s.md
4d1349c Update etcd-k8s.md
2f2d76f Refine etcd-k8s.md with updated notes and trade-offs
8bf5d6c Revise DynamoEndpoint management approach
73902af Update etcd-k8s.md
372e600 Update etcd-k8s.md
074687b Update title for etcd-k8s documentation
5c79080 Refine EndpointSlice discovery documentation
f6ad4f3 Update etcd-k8s.md
4a93fc7 Update etcd-k8s.md
9e1004f Update etcd-k8s.md
63f391a Update deps/etcd-k8s.md
df9ddfa Update etcd-k8s.md
388dca1 Update etcd-k8s.md
5aef0b0 Rename document title to Service and Model Discovery Interface
bab299a Update etcd-k8s.md
981844c Update implementation breakdown in etcd-k8s.md
# ETCD-less Dynamo setup in Kubernetes

## Problem

Customers are hesitant to stand up and maintain dedicated etcd clusters to deploy Dynamo. ETCD, however, is a hard dependency of the Dynamo Runtime (DRT).

It enables the following functionality within DRT:

- Heartbeats/leases
- Component Registry/Service Discovery
- Cleanup on Shutdown
- General Key-value storage for various purposes (KVCache Metadata, Model Metadata, etc.)

## ETCD and Kubernetes

Kubernetes stores its own state in ETCD. With a few trade-offs, we can use native Kubernetes APIs and resources to achieve the same functionality. Under the hood, these operations are still backed by etcd. This amounts to an alternative implementation of the Dynamo Runtime interface that can run in K8s environments without a dedicated ETCD cluster.

This document explores a few approaches to achieving this.

Note: This document only explores alternative approaches to eliminating the ETCD dependency. Decoupling from NATS (the transport layer) is a separate concern.

## DRT Component Registry Primer

Here is a primer on the core workflow that ETCD is used for in DRT:

```python
# server.py - Server side workflow
from dynamo.runtime import DistributedRuntime, dynamo_worker

ns = "dynamo"  # Dynamo namespace to register under

@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):
    component = runtime.namespace(ns).component("backend")
    await component.create_service()
    endpoint = component.endpoint("generate")

    # RequestHandler is the user-defined class implementing the generate handler
    await endpoint.serve_endpoint(RequestHandler().generate)
```

```python
# client.py - Client side workflow
from dynamo.runtime import DistributedRuntime, dynamo_worker

ns = "dynamo"  # Dynamo namespace to discover endpoints in

@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):
    await init(runtime, "dynamo")  # application-specific initialization helper

    endpoint = runtime.namespace(ns).component("backend").endpoint("generate")

    # Create client and wait for an endpoint to be ready
    client = await endpoint.client()
    await client.wait_for_instances()

    stream = await client.generate("hello world")
    async for char in stream:
        print(char)
```

### Server Side:
1. Server registers with DRT, receiving a primary lease (a unique instance identifier) from ETCD.
2. Server creates an entry in ETCD advertising its endpoint. The entry is tied to the lease of the instance.
3. Server continues to renew the lease, which keeps its associated endpoint entries alive. If a lease fails to renew, the associated endpoint entries are automatically removed by ETCD.

### Client Side:
1. Client asks ETCD for a watch of endpoints.
2. Client maintains a cache of such entries, updating the cache as it receives updates from ETCD.
3. Client uses the transport to target one of the instances in its cache.

```bash
# Example etcd keys showing endpoints
$ etcdctl get --prefix instances/
instances/dynamo/worker/generate:5fc98e41ac8ce3b
{
  "endpoint":"generate",
  "namespace":"dynamo",
  "component":"worker",
  "endpoint_id":"generate",
  "instance_id":"worker-abc123",
  "transport":"nats..."
}
```

Summary of entities:
- Leases: Unique identifier for an instance, with TTL and auto-renewal
- Endpoints: Service registration entries associated with a lease
- Watches: Real-time subscriptions to key prefix changes

Summary of operations:
- Creating/renewing leases: create_lease(), keep_alive background task
- Creating endpoints: kv_create() with lease attachment
- Watching endpoints: kv_get_and_watch_prefix() for real-time updates
- Automatic cleanup: Lease expiration removes associated keys

Key ETCD APIs used in Dynamo Runtime (from lib/runtime/src/transports/etcd.rs):

Lease Management:
- create_lease() - Create a lease with a TTL
- revoke_lease() - Revoke a lease explicitly

Key-Value Operations:
- kv_create() - Create a key if it doesn't exist
- kv_create_or_validate() - Create a key or validate an existing one
- kv_get_and_watch_prefix() - Get initial values and watch for changes

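For intuition, here is a minimal sketch of the lease + key + prefix-watch pattern that these APIs wrap, written against raw etcd with the third-party `python-etcd3` client. This is illustrative only (key names, TTLs, and values are made up); it is not the DRT implementation.

```python
# Illustrative sketch of the etcd primitives DRT relies on (not DRT code).
import json
import etcd3

etcd = etcd3.client(host="localhost", port=2379)

# Server side: create a lease and register an endpoint key attached to it.
lease = etcd.lease(ttl=10)  # DRT keeps the lease alive from a background task
key = "instances/dynamo/worker/generate:5fc98e41ac8ce3b"  # example key from above
value = json.dumps({"endpoint": "generate", "namespace": "dynamo", "component": "worker"})
etcd.put(key, value, lease=lease)
lease.refresh()  # renewing keeps the key alive; lease expiry deletes it automatically

# Client side: read the current endpoints, then watch the prefix for changes.
for val, meta in etcd.get_prefix("instances/dynamo/"):
    print(meta.key, val)

def on_change(response):
    # Called with a WatchResponse containing Put/Delete events under the prefix
    for event in response.events:
        print(type(event).__name__, event.key)

watch_id = etcd.add_watch_prefix_callback("instances/dynamo/", on_change)
```
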
## Approach 1: Lease-Based Endpoint Registry

We use the Kubernetes `watch` API, the built-in Lease resource, and a DynamoEndpoint custom resource to achieve similar functionality.

### Server side:
1. Server registers with DRT, creating a Lease resource in K8s.
2. Server creates a DynamoEndpoint CR in K8s with its ownerReference set to the Lease resource.
3. Server renews its Lease, which keeps the controller/operator from deleting it (and, via the owner reference, the associated DynamoEndpoint CR). A sketch of these steps follows.

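As a rough illustration, the server-side steps above could look like the following with the official Kubernetes Python client. The `dynamo.nvidia.com/v1alpha1` group/version, the `dynamoendpoints` plural, and all names are assumptions taken from the example manifests later in this section, not an existing API.

```python
# Sketch only: Approach 1 server-side registration (assumed CRD names and values).
from datetime import datetime, timezone
from kubernetes import client, config

config.load_incluster_config()  # running inside the pod
coordination = client.CoordinationV1Api()
custom = client.CustomObjectsApi()

NAMESPACE = "dynamo"
LEASE_NAME = "dynamo-lease"

# 1. Create the Lease that represents this instance.
lease = client.V1Lease(
    metadata=client.V1ObjectMeta(name=LEASE_NAME, namespace=NAMESPACE),
    spec=client.V1LeaseSpec(
        holder_identity="dynamo-pod-abc123",
        lease_duration_seconds=30,
        renew_time=datetime.now(timezone.utc),
    ),
)
created = coordination.create_namespaced_lease(NAMESPACE, lease)

# 2. Create the DynamoEndpoint CR owned by the Lease, so deleting the Lease
#    garbage-collects the endpoint entry.
endpoint_cr = {
    "apiVersion": "dynamo.nvidia.com/v1alpha1",
    "kind": "DynamoEndpoint",
    "metadata": {
        "name": "dynamo-endpoint",
        "labels": {
            "dynamo-namespace": "dynamo",
            "dynamo-component": "backend",
            "dynamo-endpoint": "generate",
        },
        "ownerReferences": [{
            "apiVersion": "coordination.k8s.io/v1",
            "kind": "Lease",
            "name": LEASE_NAME,
            "uid": created.metadata.uid,
        }],
    },
    "spec": {"component": "backend", "endpoint": "generate"},
}
custom.create_namespaced_custom_object(
    group="dynamo.nvidia.com", version="v1alpha1",
    namespace=NAMESPACE, plural="dynamoendpoints", body=endpoint_cr,
)

# 3. Periodically renew the Lease (normally from a background task).
created.spec.renew_time = datetime.now(timezone.utc)
coordination.replace_namespaced_lease(LEASE_NAME, NAMESPACE, created)
```
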
### Client side:
1. Client asks K8s for a watch of DynamoEndpoints (using the Kubernetes `watch` API); see the sketch below.
2. Client maintains a cache of such entries, updating the cache as it receives updates from K8s.
3. Client uses the transport to target one of the instances in its cache.

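The client-side cache could then be sketched as follows, again assuming the hypothetical `dynamoendpoints` CRD. The watch stream plays the same role as an etcd prefix watch, with labels standing in for key prefixes.

```python
# Sketch only: Approach 1 client-side discovery cache (assumed CRD names).
from kubernetes import client, config, watch

config.load_incluster_config()
custom = client.CustomObjectsApi()

instances = {}  # name -> DynamoEndpoint object; the client's local cache

w = watch.Watch()
for event in w.stream(
    custom.list_namespaced_custom_object,
    group="dynamo.nvidia.com", version="v1alpha1",
    namespace="dynamo", plural="dynamoendpoints",
    label_selector="dynamo-component=backend,dynamo-endpoint=generate",
):
    obj = event["object"]
    name = obj["metadata"]["name"]
    if event["type"] in ("ADDED", "MODIFIED"):
        instances[name] = obj           # new or updated endpoint entry
    elif event["type"] == "DELETED":
        instances.pop(name, None)       # lease expired or instance shut down
    # Requests are then routed to one of `instances` over the NATS transport.
```
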
### Controller:
1. The Dynamo controller is responsible for deleting leases that have not been renewed (see the sketch below).
2. When a lease is deleted, all associated DynamoEndpoint CRs are also deleted via their owner references.

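The controller's expiry sweep might look roughly as follows. The poll interval and the fallback 30-second duration are assumptions; deleting the owned DynamoEndpoint CRs is left to the Kubernetes garbage collector via the owner references.

```python
# Sketch only: controller loop that deletes Leases whose renewTime is stale.
import time
from datetime import datetime, timezone
from kubernetes import client, config

config.load_incluster_config()
coordination = client.CoordinationV1Api()

NAMESPACE = "dynamo"

while True:
    now = datetime.now(timezone.utc)
    for lease in coordination.list_namespaced_lease(NAMESPACE).items:
        renewed = lease.spec.renew_time or lease.metadata.creation_timestamp
        duration = lease.spec.lease_duration_seconds or 30
        if (now - renewed).total_seconds() > duration:
            # Deleting the Lease lets the garbage collector remove any
            # DynamoEndpoint CRs that set it as their ownerReference.
            coordination.delete_namespaced_lease(lease.metadata.name, NAMESPACE)
    time.sleep(5)
```
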
```mermaid
sequenceDiagram
    participant Server
    participant K8s API
    participant Controller
    participant Client

    Note over Server: Dynamo Instance Startup
    Server->>K8s API: Create Lease resource
    Server->>K8s API: Create DynamoEndpoint CR (ownerRef=Lease)
    Server->>K8s API: Renew Lease periodically

    Note over Controller: Lease Management
    Controller->>K8s API: Watch Lease resources
    Controller->>Controller: Check lease expiry
    Controller->>K8s API: Delete expired Leases
    K8s API->>K8s API: Auto-delete DynamoEndpoints (ownerRef)

    Note over Client: Service Discovery
    Client->>K8s API: Watch DynamoEndpoint resources
    K8s API-->>Client: Stream endpoint changes
    Client->>Client: Update instance cache
    Client->>Server: Route requests via NATS
```

```yaml
# Example Lease resource
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: dynamo-lease
  namespace: dynamo
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: dynamo-pod-abc123
spec:
  holderIdentity: dynamo-pod-abc123
  leaseDurationSeconds: 30
  renewTime: ... # updated by the holder on each renewal
```

```yaml
# Example DynamoEndpoint CR
apiVersion: dynamo.nvidia.com/v1alpha1
kind: DynamoEndpoint
metadata:
  name: dynamo-endpoint
  namespace: dynamo
  ownerReferences:
    - apiVersion: coordination.k8s.io/v1
      kind: Lease
      name: dynamo-lease
  labels:
    dynamo-namespace: dynamo
    dynamo-component: backend
    dynamo-endpoint: generate
spec:
  ...
```

Kubernetes concepts:
- Lease: Existing Kubernetes resource in the coordination.k8s.io API group
- Owner references: Metadata establishing parent-child relationships for automatic cleanup
- Kubernetes `watch` API: Real-time subscription to resource changes
- Custom resources: Extension mechanism for arbitrary data storage. We can introduce new CRDs such as DynamoEndpoint and DynamoModel to store state associated with a lease.

Notes:
- Requires a controller to delete leases on expiry. This is not something K8s handles for us automatically.
- Prefix-based watching for changes is not supported by the Kubernetes `watch` API. We can, however, watch on a set of labels that correspond to the endpoints we are interested in.
- Unavoidable: the overhead of going through the kube-apiserver as opposed to direct etcd calls.
- Need to work out atomicity of operations.

## Approach 2: Controller-managed DynamoEndpoint Resources

Pods create DynamoEndpoint resources directly, but a controller keeps their status in sync with the underlying pod's readiness. Instead of using leases, we rely on the readiness status of the pod (as advertised by its readiness probe).

### Server side:
1. Dynamo pods create DynamoEndpoint CRs directly when they start serving an endpoint.
2. Pods set ownerReferences to themselves on the DynamoEndpoint resources.
3. DynamoEndpoint resources are automatically cleaned up when pods terminate.

### Controller:
1. The Dynamo controller watches pod lifecycle events and readiness status.
2. Controller updates the status field of DynamoEndpoint resources based on pod readiness (see the sketch below).
3. Controller maintains instance lists and transport information in the DynamoEndpoint status.

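A rough sketch of the controller's status sync is shown below. It assumes the DynamoEndpoint status carries a single `ready` flag (as in the example manifest that follows), that the CRD enables the status subresource, and a hypothetical pod-to-endpoint naming convention.

```python
# Sketch only: Approach 2 controller syncing DynamoEndpoint status with pod readiness.
from kubernetes import client, config, watch

config.load_incluster_config()
core = client.CoreV1Api()
custom = client.CustomObjectsApi()

def pod_is_ready(pod):
    # A pod is ready when its "Ready" condition is True.
    for cond in (pod.status.conditions or []):
        if cond.type == "Ready":
            return cond.status == "True"
    return False

w = watch.Watch()
for event in w.stream(core.list_namespaced_pod, namespace="dynamo",
                      label_selector="dynamo-component=backend"):
    pod = event["object"]
    # Hypothetical convention: the DynamoEndpoint is named after the pod.
    endpoint_name = f"{pod.metadata.name}-generate"
    custom.patch_namespaced_custom_object_status(
        group="dynamo.nvidia.com", version="v1alpha1",
        namespace="dynamo", plural="dynamoendpoints", name=endpoint_name,
        body={"status": {"ready": pod_is_ready(pod)}},
    )
```
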
### Client side:
1. Client watches DynamoEndpoint resources using the Kubernetes `watch` API.
2. Client maintains a cache of available instances from the DynamoEndpoint status.
3. Client routes requests to healthy instances via the NATS transport.

```mermaid
sequenceDiagram
    participant Pod
    participant K8s API
    participant Controller
    participant Client

    Note over Pod: Dynamo Pod Startup
    Pod->>K8s API: Create DynamoEndpoint CR
    Pod->>K8s API: Set ownerReference to pod

    Note over Controller: Status Management
    Controller->>K8s API: Watch Pod lifecycle events
    Controller->>K8s API: Watch Pod readiness changes
    Controller->>K8s API: Update DynamoEndpoint status
    K8s API->>K8s API: Auto-delete DynamoEndpoint (ownerRef)

    Note over Client: Service Discovery
    Client->>K8s API: Watch DynamoEndpoint resources
    K8s API-->>Client: Stream status changes
    Client->>Client: Update instance cache
    Client->>Pod: Route requests via NATS
```

```yaml
# Example DynamoEndpoint resource
apiVersion: dynamo.nvidia.com/v1alpha1
kind: DynamoEndpoint
metadata:
  name: dynamo-generate-endpoint
  namespace: dynamo
  labels:
    dynamo-namespace: dynamo
    dynamo-component: backend
    dynamo-endpoint: generate
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: dynamo-pod-abc123
      uid: abc123-def456
spec:
  endpoint: generate
  namespace: dynamo
  component: backend
status:
  ready: true # controller updates this on readiness change events
```

Kubernetes concepts:
- Custom Resource Definitions: Define the schema for DynamoEndpoint resources
- Owner references: Automatic cleanup when pods terminate

Notes:
- The controller is in charge of updating the status of the DynamoEndpoint as the underlying pod's readiness changes.

## Approach 3: EndpointSlice-based discovery

Disclaimer: This idea is still WIP. It is similar to Approach 2, but eliminates the need for a dedicated controller by relying on the built-in Kubernetes Service controller to keep EndpointSlices up to date.

### Server side:
1. Dynamo operator creates server instances. Pods are labeled with `dynamo-namespace` and `dynamo-component`.
2. When a pod wants to serve an endpoint, it performs two actions (sketched after this list):
   - Creates a Service for that endpoint (if it doesn't already exist) with selectors:
     - `dynamo-namespace`
     - `dynamo-component`
     - `dynamo-endpoint-<NAME>: "true"`
   - Patches its own labels to add `dynamo-endpoint-<NAME>: "true"`
3. The readiness probe status reflects the health of this specific endpoint.

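A sketch of those two actions with the official Python client is shown below; the Service name, selector labels, and dummy port mirror the example manifests later in this section and are otherwise assumptions.

```python
# Sketch only: a pod registering itself for the "generate" endpoint (Approach 3).
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException

config.load_incluster_config()
core = client.CoreV1Api()

NAMESPACE = "dynamo"
POD_NAME = "dynamo-pod-abc123"  # normally injected via the downward API

# 1. Create the per-endpoint Service if it doesn't already exist.
service = client.V1Service(
    metadata=client.V1ObjectMeta(name="dynamo-generate-service", namespace=NAMESPACE),
    spec=client.V1ServiceSpec(
        selector={
            "dynamo-namespace": "dynamo",
            "dynamo-component": "backend",
            "dynamo-endpoint-generate": "true",
        },
        # Dummy port: transport still happens over NATS; the Service exists only
        # so Kubernetes maintains EndpointSlices for us.
        ports=[client.V1ServicePort(port=8000)],
        type="ClusterIP",
    ),
)
try:
    core.create_namespaced_service(NAMESPACE, service)
except ApiException as e:
    if e.status != 409:  # 409 Conflict: another pod already created it
        raise

# 2. Patch this pod's own labels so the Service selector matches it.
core.patch_namespaced_pod(
    POD_NAME, NAMESPACE,
    {"metadata": {"labels": {"dynamo-endpoint-generate": "true"}}},
)
```
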
### Client side:
1. Client watches the EndpointSlices associated with the target Kubernetes Service.
2. EndpointSlices maintain the current state of pods serving the endpoint and their readiness status.
3. Client maintains a cache of available instances, updating as EndpointSlice changes arrive (see the sketch below).
4. Client routes requests to healthy instances via the NATS transport.

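The client-side EndpointSlice watch could be sketched as follows; the Service name matches the example below, and mapping ready addresses back to NATS subjects is omitted.

```python
# Sketch only: Approach 3 client-side discovery via EndpointSlices.
from kubernetes import client, config, watch

config.load_incluster_config()
discovery = client.DiscoveryV1Api()

ready_instances = set()  # pod IPs currently passing their readiness probe

w = watch.Watch()
for event in w.stream(
    discovery.list_namespaced_endpoint_slice,
    namespace="dynamo",
    label_selector="kubernetes.io/service-name=dynamo-generate-service",
):
    endpoint_slice = event["object"]
    for ep in (endpoint_slice.endpoints or []):
        ready = bool(ep.conditions and ep.conditions.ready)
        for address in (ep.addresses or []):
            if ready:
                ready_instances.add(address)      # pod is serving the endpoint
            else:
                ready_instances.discard(address)  # not ready (or terminating)
    # Requests are then routed to one of `ready_instances` over NATS.
```
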
```mermaid
sequenceDiagram
    participant Pod
    participant K8s API
    participant Client

    Note over Pod: Dynamo Pod Lifecycle
    Pod->>K8s API: Create Service (if doesn't exist)
    Pod->>K8s API: Update pod labels for endpoint
    Pod->>Pod: Readiness probe health check

    Note over K8s API: Service Management
    K8s API->>K8s API: Create EndpointSlice for Service
    K8s API->>K8s API: Update EndpointSlice with pod readiness
    K8s API->>K8s API: Remove failed pods from EndpointSlice

    Note over Client: Service Discovery
    Client->>K8s API: Watch EndpointSlice for target Service
    K8s API-->>Client: Stream readiness changes
    Client->>Client: Update instance cache
    Client->>Pod: Route requests via NATS
```

Kubernetes concepts:
- Services and EndpointSlices: Services define pod sets; EndpointSlices track pod addresses and readiness
- Readiness probes: Health checks that determine pod readiness for traffic

```yaml
# Example Service and EndpointSlice
apiVersion: v1
kind: Service
metadata:
  name: dynamo-generate-service
  namespace: dynamo
  labels:
    dynamo-namespace: dynamo
    dynamo-component: backend
    dynamo-endpoint: generate
spec:
  selector:
    dynamo-namespace: dynamo
    dynamo-component: backend
    dynamo-endpoint-generate: "true"
  ports: # dummy port since transport isn't actually taking place through this
    - ...
  type: ClusterIP

---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: dynamo-generate-service-abc12
  namespace: dynamo
  labels:
    kubernetes.io/service-name: dynamo-generate-service
addressType: IPv4
ports:
  - ... # dummy port since transport isn't actually taking place through this
endpoints:
  - addresses:
      - "10.0.1.100"
    conditions:
      ready: true
    targetRef:
      apiVersion: v1
      kind: Pod
      name: dynamo-pod-abc123
  - addresses:
      - "10.0.1.101"
    conditions:
      ready: false # Pod failed readiness probe
    targetRef:
      apiVersion: v1
      kind: Pod
      name: dynamo-pod-def456
```

Notes:
- Pro: We don't need a dedicated controller to delete leases on expiry (there are no leases).
- We need a better pattern for a pod to influence the Services it is part of than mutating its own label set. Potentially a controller could be involved.
- The Service is not actually used for transport here; it only exists to manage the EndpointSlices, which do the bookkeeping for which pods are backing the endpoint.

Review comment: TODO: Figure out if there is a compelling reason to keep a separate entity for MDC
Review comment: if each instance has MDC embedded, there are consistency and race conditions to watch for