[k8s] Add more observability when pods get preempted or failed during provisioning #7726
Conversation
/smoke-test -k test_kubernetes_pod_failure_detection --kubernetes 🟢 Before this PR, this would hang forever and the test would eventually time out; see https://buildkite.com/skypilot-1/smoke-tests/builds/4841/steps/canvas?jid=019a23ff-5c0c-48b2-91d7-ab866cc53d73

/smoke-test --kubernetes

/quicktest-core --kubernetes

/smoke-test -k test_kubernetes_pod_failure_detection --kubernetes

… provision_timeout

/smoke-test -k test_kubernetes_pod_failure_detection --kubernetes
Thanks @kevinmingtarja for the detailed explanation!
lgtm
When a pod is preempted (by Kueue, for instance) or fails during launch (say, because the Docker image is not compatible with the SkyPilot runtime, e.g. busybox), we would hang indefinitely in the `_wait_for_pods_to_run()` loop, only logging generic "Missing pods" messages without any useful information.

This PR improves a few things:
- Adds `_get_pod_termination_reason`, which indicates whether the pod was preempted by Kueue -> this improves the debug message we log to the cluster events table
- In `query_instances` (which gets invoked as part of `sky status --refresh`), check not only for pods that are in the Failed or Unknown phase, but also Terminating (a pod which has a `deletion_timestamp` but is still visible on the k8s API, which may be because it has finalizers)

Note: for pods that are preempted and have already gone missing from the k8s API, `_get_pod_missing_reason` should already take care of getting the events, which should contain the preemption event from Kueue. The catch is that it may not show up as the last cluster event, so you might have to dig through the table manually to find it.

How to reproduce Kueue preemption easily:
- Set `waitForPodsReady.timeout` to something small, say 20s
- `sky launch --image-id docker:nvidia/cuda:13.0.1-runtime-ubuntu24.04 --infra k8s -c cuda` (this image is around 1.5GB)

Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)
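The Terminating check described above can be sketched as follows. This is a minimal illustration, not SkyPilot's actual code: the dataclasses are stdlib stand-ins for the real `kubernetes.client` objects (`V1Pod`, `V1ObjectMeta`, `V1PodStatus`, whose field names they mirror), and `classify_pod` and its return strings are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Stand-ins for the Kubernetes API objects; field names mirror the
# official Python client (kubernetes.client) so the logic transfers.
@dataclass
class PodMeta:
    deletion_timestamp: Optional[datetime] = None

@dataclass
class PodStatus:
    phase: str = "Pending"

@dataclass
class Pod:
    metadata: PodMeta
    status: PodStatus

def classify_pod(pod: Pod) -> str:
    """Map a pod to a coarse provisioning state (illustrative names).

    A pod with a deletion_timestamp is being terminated; it may linger
    in the API because of finalizers, so it must be treated as gone
    even though its phase can still read "Running".
    """
    if pod.metadata.deletion_timestamp is not None:
        return "TERMINATING"
    if pod.status.phase in ("Failed", "Unknown"):
        return "FAILED"
    return "ALIVE"

# A Kueue-preempted pod: phase still "Running", but marked for deletion.
preempted = Pod(PodMeta(deletion_timestamp=datetime.now(timezone.utc)),
                PodStatus(phase="Running"))
print(classify_pod(preempted))  # -> TERMINATING
```

Checking only `status.phase`, as before this PR, would report the preempted pod as alive, which is why the wait loop could hang forever.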