
@kevinmingtarja (Collaborator) commented Oct 24, 2025

When a pod is preempted (by Kueue, for instance) or fails during launch (say, because the docker image is not compatible with running the SkyPilot runtime, e.g. busybox), we would hang indefinitely in the _wait_for_pods_to_run() loop, logging only generic "Missing pods" messages with no useful information.

This PR improves a few things:

  1. Check pod status conditions inside _get_pod_termination_reason, which indicate whether the pod was preempted by Kueue. This improves the debug message we log to the cluster events table.
  2. In query_instances (invoked as part of sky status --refresh), check not only for pods in the Failed or Unknown phase, but also for Terminating pods (pods that have a deletion_timestamp set but are still visible via the k8s API, typically because they have finalizers).
  3. Fail early on the launch path, rather than waiting forever, with a hint showing users where to look for details on exactly where the failure happened.
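The detection logic described in points 1 and 2 can be sketched roughly as follows. This is purely illustrative (the function name, dict-shaped pod, and condition matching are assumptions for this sketch, not SkyPilot's actual helpers):

```python
from typing import Optional

# Hypothetical sketch of the checks described above: a pod counts as
# terminated if its phase is Failed/Unknown, or if it is Terminating
# (deletion_timestamp set but still visible via the k8s API, often
# because finalizers are keeping the object around).
def pod_termination_reason(pod: dict) -> Optional[str]:
    """Return a human-readable reason if the pod has terminated, else None."""
    phase = pod.get('status', {}).get('phase')
    terminating = pod.get('metadata', {}).get('deletion_timestamp') is not None
    if phase not in ('Failed', 'Unknown') and not terminating:
        return None
    # Status conditions can carry the eviction reason when Kueue preempts,
    # e.g. reason=WorkloadEvictedDueToPodsReadyTimeout (substring matching
    # here is an assumption for illustration).
    for cond in pod.get('status', {}).get('conditions', []) or []:
        reason = cond.get('reason') or ''
        if 'Evicted' in reason or 'Preempt' in reason:
            return f"Preempted by Kueue: {reason} ({cond.get('message', '')})"
    return f'Pod terminated (phase={phase}, terminating={terminating})'
```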

Note: For pods that were preempted and have already disappeared from the k8s API, _get_pod_missing_reason should already take care of fetching the events, which should include the preemption event from Kueue. The caveat is that it may not show up as the last cluster event, so you might have to dig through the table manually to find it.
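Since the preemption event may be buried among later events, the kind of filtering involved looks roughly like this (purely illustrative, not SkyPilot code; assumes events have already been fetched as dicts with a `reason` field, and the matched substrings are guesses based on the log output below):

```python
# Illustrative only: pick Kueue preemption/eviction events out of a pod's
# event list, since the relevant event is not necessarily the latest one.
def find_preemption_events(events: list) -> list:
    return [
        e for e in events
        if any(s in (e.get('reason') or '') for s in ('Preempt', 'Evicted'))
    ]
```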


How to reproduce Kueue preemption easily:

  1. In your Kueue config, enable waitForPodsReady and set waitForPodsReady.timeout to something small, say 20s
  2. Launch a cluster using a large container image (around 1.5 GB), for example: sky launch --image-id docker:nvidia/cuda:13.0.1-runtime-ubuntu24.04 --infra k8s -c cuda
  3. Pulling this image should take a while, depending on your k8s cluster's network bandwidth
  4. After 20s, Kueue will preempt the pod, which moves to the Terminating state
  5. Provisioning fails fast, and the provision logs show:
D 10-27 12:56:59 instance.py:986] run_instances: calling create_namespaced_pod (count=1).
D 10-27 12:57:00 instance.py:1186] run_instances: waiting for 30s for pods to schedule and run: ['cuda-7a2eebbf-head']
D 10-27 12:57:00 instance.py:1199] run_instances: waiting for pods to be running (pulling images): ['cuda-7a2eebbf-head']
W 10-27 12:57:21 instance.py:574] Pod cuda-7a2eebbf-head terminated: Preempted by Kueue: WorkloadEvictedDueToPodsReadyTimeout (Exceeded the PodsReady timeout default/cuda-7a2eebbf).
W 10-27 12:57:21 instance.py:574] Last known state: ContainersNotReady (containers with unready status: [ray-node]).
W 10-27 12:57:21 instance.py:1224] run_instances: Error occurred when creating pods: sky.provision.kubernetes.config.KubernetesError: Pod cuda-7a2eebbf-head has terminated or failed unexpectedly. Run `sky logs --provision cuda` for more details.
D 10-27 12:57:21 provisioner.py:176] Failed to provision 'cuda' on Kubernetes (all zones).
D 10-27 12:57:21 provisioner.py:178] bulk_provision for 'cuda' failed. Stacktrace:
...
...
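For step 1, the waitForPodsReady knob lives in Kueue's Configuration object. A minimal sketch (field names per Kueue's Configuration API; the 20s timeout matches this repro):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 20s
```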

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Two unit tests: one mocking a pod in the Terminating state, one in the Failed state
    • Smoke test that uses busybox as the container image on k8s, which reliably fails because busybox doesn't have bash, thus exercising our error-handling code path
    • TODO: add a test that exercises a real Kueue preemption scenario (similar to the manual repro above); this requires enabling Kueue in our test infra first
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kevinmingtarja (Collaborator, Author) commented Oct 27, 2025

@kevinmingtarja changed the title from "[k8s] Add more observability for preempted pods" to "[k8s] Add more observability when pods get preempted or failed during provisioning" on Oct 27, 2025
@kevinmingtarja

/smoke-test --kubernetes

@kevinmingtarja

/quicktest-core --kubernetes

@kevinmingtarja

/smoke-test -k test_kubernetes_pod_failure_detection --kubernetes

@kevinmingtarja kevinmingtarja marked this pull request as ready for review October 27, 2025 21:28
@kevinmingtarja

/smoke-test -k test_kubernetes_pod_failure_detection --kubernetes

@SeungjinYang (Collaborator) left a comment:

Thanks @kevinmingtarja for the detailed explanation!

@kyuds (Collaborator) left a comment:

lgtm
