
@kevinmingtarja (Collaborator) commented Oct 24, 2025

When a pod is preempted (by Kueue, for instance) or fails during launch (say, because the docker image is not compatible with running the SkyPilot runtime, e.g. busybox), we would hang indefinitely in the _wait_for_pods_to_run() loop, logging only generic "Missing pods" messages with no useful information.

This PR improves a few things:

  1. Check pod status conditions inside _get_pod_termination_reason, which indicate whether the pod was preempted by Kueue. This improves the debug message we log to the cluster events table.
  2. In query_instances (invoked as part of sky status --refresh), check not only for pods in the Failed or Unknown phase, but also for Terminating pods (pods that have a deletion_timestamp set but are still visible via the k8s API, typically because they have finalizers).
  3. Fail early on the launch path, rather than waiting forever, with a hint showing users where to look for details on exactly where the failure happened.
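The detection logic described in points 1 and 2 can be sketched roughly as follows. This is purely illustrative (the function name, dict-shaped pod, and condition matching are assumptions for this sketch, not SkyPilot's actual helpers):

```python
from typing import Optional

# Hypothetical sketch of the checks described above: a pod counts as
# terminated if its phase is Failed/Unknown, or if it is Terminating
# (deletion_timestamp set but still visible via the k8s API, often
# because finalizers are keeping the object around).
def pod_termination_reason(pod: dict) -> Optional[str]:
    """Return a human-readable reason if the pod has terminated, else None."""
    phase = pod.get('status', {}).get('phase')
    terminating = pod.get('metadata', {}).get('deletion_timestamp') is not None
    if phase not in ('Failed', 'Unknown') and not terminating:
        return None
    # Status conditions can carry the eviction reason when Kueue preempts,
    # e.g. reason=WorkloadEvictedDueToPodsReadyTimeout (substring matching
    # here is an assumption for illustration).
    for cond in pod.get('status', {}).get('conditions', []) or []:
        reason = cond.get('reason') or ''
        if 'Evicted' in reason or 'Preempt' in reason:
            return f"Preempted by Kueue: {reason} ({cond.get('message', '')})"
    return f'Pod terminated (phase={phase}, terminating={terminating})'
```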

Note: For pods that were preempted and have already disappeared from the k8s API, _get_pod_missing_reason should already take care of fetching the events, which should include the preemption event from Kueue. The caveat is that it may not show up as the last cluster event, so you might have to dig through the table manually to find it.
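Since the preemption event may be buried among later events, the kind of filtering involved looks roughly like this (purely illustrative, not SkyPilot code; assumes events have already been fetched as dicts with a `reason` field, and the matched substrings are guesses based on the log output below):

```python
# Illustrative only: pick Kueue preemption/eviction events out of a pod's
# event list, since the relevant event is not necessarily the latest one.
def find_preemption_events(events: list) -> list:
    return [
        e for e in events
        if any(s in (e.get('reason') or '') for s in ('Preempt', 'Evicted'))
    ]
```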


How to reproduce Kueue preemption easily:

  1. In your Kueue config, enable waitForPodsReady and set waitForPodsReady.timeout to something small, say 20s
  2. Launch a cluster using a large container image (around 1.5 GB), for example: sky launch --image-id docker:nvidia/cuda:13.0.1-runtime-ubuntu24.04 --infra k8s -c cuda
  3. Pulling this image should take a while, depending on your k8s cluster's network bandwidth
  4. After 20s, Kueue will preempt the pod, which moves to the Terminating state
  5. Provisioning fails fast, and the provision logs show:
D 10-27 12:56:59 instance.py:986] run_instances: calling create_namespaced_pod (count=1).
D 10-27 12:57:00 instance.py:1186] run_instances: waiting for 30s for pods to schedule and run: ['cuda-7a2eebbf-head']
D 10-27 12:57:00 instance.py:1199] run_instances: waiting for pods to be running (pulling images): ['cuda-7a2eebbf-head']
W 10-27 12:57:21 instance.py:574] Pod cuda-7a2eebbf-head terminated: Preempted by Kueue: WorkloadEvictedDueToPodsReadyTimeout (Exceeded the PodsReady timeout default/cuda-7a2eebbf).
W 10-27 12:57:21 instance.py:574] Last known state: ContainersNotReady (containers with unready status: [ray-node]).
W 10-27 12:57:21 instance.py:1224] run_instances: Error occurred when creating pods: sky.provision.kubernetes.config.KubernetesError: Pod cuda-7a2eebbf-head has terminated or failed unexpectedly. Run `sky logs --provision cuda` for more details.
D 10-27 12:57:21 provisioner.py:176] Failed to provision 'cuda' on Kubernetes (all zones).
D 10-27 12:57:21 provisioner.py:178] bulk_provision for 'cuda' failed. Stacktrace:
...
...
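For step 1, the waitForPodsReady knob lives in Kueue's Configuration object. A minimal sketch (field names per Kueue's Configuration API; the 20s timeout matches this repro):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 20s
```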

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Two unit tests: one mocking a pod in the Terminating state, one in the Failed state
    • Smoke test that uses busybox as the container image on k8s, which reliably fails because busybox doesn't have bash, thus exercising our error-handling code path
    • TODO: add a test that exercises a real Kueue preemption scenario (similar to the manual repro above); this requires enabling Kueue in our test infra first
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kevinmingtarja (Collaborator, Author) commented Oct 27, 2025

@kevinmingtarja changed the title from "[k8s] Add more observability for preempted pods" to "[k8s] Add more observability when pods get preempted or failed during provisioning" on Oct 27, 2025
@kevinmingtarja

/smoke-test --kubernetes

@kevinmingtarja

/quicktest-core --kubernetes

@kevinmingtarja

/smoke-test -k test_kubernetes_pod_failure_detection --kubernetes

@kevinmingtarja kevinmingtarja marked this pull request as ready for review October 27, 2025 21:28
@kevinmingtarja

/smoke-test -k test_kubernetes_pod_failure_detection --kubernetes

@SeungjinYang (Collaborator) left a comment:

Thanks @kevinmingtarja for the detailed explanation!

@kyuds (Collaborator) left a comment:

lgtm
