
test_managed_jobs_ha_kill_starting failure on master #7814

@zpoint

Description


https://buildkite.com/skypilot-1/smoke-tests/builds/5052#019a3a85-5875-43aa-a3b2-f43f4196beb0

After limiting --cpu=4 in #7794, we hit a ValueError during test_managed_jobs_ha_kill_starting:

sky-jobs-controller-3bde5781-3bde5781-deployment-689768d45j6g4z   1/1     Running   0          3m4s
(base) buildkite@54:~/.buildkite-agent/builds/kind/54-226-105-94-buildkite-kind5-1/skypilot-1/smoke-tests$ kubectl logs -n test-namespace sky-jobs-controller-3bde5781-3bde5781-deployment-689768d45j6g4z --tail=100
Defaulted container "ray-node" out of: ray-node, init-copy-home (init)
++ id -u
+ '[' 1000 -eq 0 ']'
+ true
+ STEPS=("apt-ssh-setup" "runtime-setup" "env-setup")
+ mkdir -p /home/sky/.sky/.controller_recovery_task_run
+ '[' -f /home/sky/k8s_container_ready ']'
+ SKYPILOT_HA_RECOVERY_LOG=/tmp/ha_recovery.log
++ date
+ echo 'Starting HA recovery at Fri Oct 31 14:14:59 UTC 2025'
+ start_time=0
+ retry_count=0
+++ '[' -s /home/sky/.sky/python_path ']'
+++ cat /home/sky/.sky/python_path
++ env -u PYTHONPATH /home/sky/skypilot-runtime/bin/python -c 'from sky.provision import instance_setup; print(instance_setup.RAY_STATUS_WITH_SKY_RAY_PORT_COMMAND)'
+ '[' 0 -ne 0 ']'
+ '[' 0 -ne 0 ']'
+ GET_RAY_STATUS_CMD='RAY_PORT=$(env -u PYTHONPATH $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c "from sky import sky_logging
with sky_logging.silent(): from sky.skylet import job_lib; print(job_lib.get_ray_port())" 2> /dev/null || echo 6379);env -u PYTHONPATH $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c "from sky.utils import message_utils; print(message_utils.encode_payload({'\''ray_port'\'': $RAY_PORT}))"; RAY_ADDRESS=127.0.0.1:$RAY_PORT env -u PYTHONPATH $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) status'
+ true
+ retry_count=1
+ current_duration=1
+ echo 'Attempt 1 to get Ray status after 1 seconds...'
+ bash --login -c 'RAY_PORT=$(env -u PYTHONPATH $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c "from sky import sky_logging
with sky_logging.silent(): from sky.skylet import job_lib; print(job_lib.get_ray_port())" 2> /dev/null || echo 6379);env -u PYTHONPATH $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c "from sky.utils import message_utils; print(message_utils.encode_payload({'\''ray_port'\'': $RAY_PORT}))"; RAY_ADDRESS=127.0.0.1:$RAY_PORT env -u PYTHONPATH $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) status'
<sky-payload>{"ray_port": 6380}</sky-payload>

No cluster status. It may take a few seconds for the Ray internal services to start up.
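The <sky-payload> line above is the encoded answer that the recovery command parses to find the Ray port (note it is 6380 here, not the 6379 fallback in GET_RAY_STATUS_CMD). A minimal sketch of that wrapping, assuming the real sky.utils.message_utils puts a JSON body between fixed markers (the helper names below are illustrative, not SkyPilot's exact API):

```python
import json

# Markers as they appear in the log line: <sky-payload>{"ray_port": 6380}</sky-payload>
PREFIX, SUFFIX = '<sky-payload>', '</sky-payload>'

def encode_payload(payload: dict) -> str:
    """Wrap a JSON-serializable dict between the payload markers."""
    return f'{PREFIX}{json.dumps(payload)}{SUFFIX}'

def decode_payload(line: str) -> dict:
    """Extract and parse the JSON body from a marked line."""
    start = line.index(PREFIX) + len(PREFIX)
    end = line.index(SUFFIX)
    return json.loads(line[start:end])

print(decode_payload('<sky-payload>{"ray_port": 6380}</sky-payload>'))
# -> {'ray_port': 6380}
```

The markers let the caller pick the payload out of mixed output (shell tracing, warnings, the "No cluster status" banner) without needing a clean stdout.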
+ '[' 0 -eq 0 ']'
+ wait_duration=5
+ echo 'Ray ready after waiting 5 seconds (took 1 attempts)'
+ break
+ chmod +x /home/sky/.sky/.controller_recovery_setup_commands.sh
+ /bin/bash --login -c 'true && export OMP_NUM_THREADS=1 PYTHONWARNINGS='\''ignore'\'' && ~/.sky/.controller_recovery_setup_commands.sh > /tmp/controller_recovery_setup_commands.log 2>&1'
+ '[' 0 -ne 0 ']'
++ date
+ echo '=== Controller setup commands completed for recovery at Fri Oct 31 14:15:46 UTC 2025 ==='
+ touch /home/sky/.sky/.controller_recovery_restarting_signal
+++ '[' -s /home/sky/.sky/python_path ']'
+++ cat /home/sky/.sky/python_path
++ env -u PYTHONPATH /home/sky/skypilot-runtime/bin/python -c 'from sky.jobs import state; jobs, _ = state.get_managed_jobs_with_filters(fields=['\''job_id'\'', '\''schedule_state'\'']); print('\'' '\''.join({str(job['\''job_id'\'']) for job in jobs if job['\''schedule_state'\''] not in [state.ManagedJobScheduleState.DONE, state.ManagedJobScheduleState.WAITING]}) if jobs else None)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/state.py", line 273, in wrapper
    return func(*args, **kwargs)
  File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/state.py", line 1468, in get_managed_jobs_with_filters
    job_dict['status'] = ManagedJobStatus(job_dict['status'])
  File "/opt/conda/lib/python3.10/enum.py", line 385, in __call__
    return cls.__new__(cls, value)
  File "/opt/conda/lib/python3.10/enum.py", line 710, in __new__
    raise ve_exc
ValueError: None is not a valid ManagedJobStatus
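The root cause is visible in the traceback: get_managed_jobs_with_filters unconditionally converts the row's status column through the ManagedJobStatus enum, and a row whose status is still NULL (plausible when the controller is killed while the job is STARTING, which is exactly what this test does) makes the enum constructor raise. A minimal reproduction plus one possible guard; the enum members and parse_status below are illustrative stand-ins, not SkyPilot's actual definitions:

```python
from enum import Enum

# Minimal stand-in for sky.jobs.state.ManagedJobStatus (members illustrative).
class ManagedJobStatus(Enum):
    PENDING = 'PENDING'
    STARTING = 'STARTING'
    RUNNING = 'RUNNING'

# Enum lookup with None raises, exactly as in the traceback above.
try:
    ManagedJobStatus(None)
except ValueError as exc:
    print(exc)  # -> None is not a valid ManagedJobStatus

# One possible guard: only convert when the DB row actually carries a status,
# leaving half-initialized rows as None instead of crashing the whole query.
def parse_status(raw):
    return ManagedJobStatus(raw) if raw is not None else None

assert parse_status('RUNNING') is ManagedJobStatus.RUNNING
assert parse_status(None) is None
```

Because the query is run from the HA recovery script's in-progress-jobs check, the unguarded conversion turns one NULL row into a failure of the entire recovery pass.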
+ ALL_IN_PROGRESS_JOBS=
+ '[' '' '!=' None ']'
+ read -ra ALL_IN_PROGRESS_JOBS_SEQ
+ for file in ~/.sky/.controller_recovery_task_run/*
++ basename /home/sky/.sky/.controller_recovery_task_run/sky_job_1
++ sed s/sky_job_//
+ JOB_ID=1
+ '[' '' '!=' None ']'
+ [[ !    =~  1  ]]
+ continue
+ rm /home/sky/.sky/.controller_recovery_restarting_signal
+ duration=48
++ date
+ echo 'HA recovery completed at Fri Oct 31 14:15:47 UTC 2025'
+ echo 'Total recovery time: 48 seconds'
+ touch /home/sky/k8s_container_ready
+ set +x
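The trace above also shows the knock-on effect: because the state query crashed, ALL_IN_PROGRESS_JOBS came back empty, so the marker file sky_job_1 failed the `[[ ! ... =~ " 1 " ]]` membership check and hit `continue`, i.e. job 1 was silently never recovered. A sketch of that selection logic under those assumptions (select_jobs_to_recover is our illustrative name for what the shell loop does):

```python
import os
import tempfile

def select_jobs_to_recover(marker_dir: str, in_progress_ids: set) -> list:
    """Re-run only jobs whose marker file id is in the in-progress set."""
    to_recover = []
    for name in sorted(os.listdir(marker_dir)):
        job_id = name.removeprefix('sky_job_')  # mirrors `sed s/sky_job_//`
        if job_id in in_progress_ids:
            to_recover.append(job_id)
        # else: skipped, matching the `continue` branch in the log
    return to_recover

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, 'sky_job_1'), 'w').close()
    print(select_jobs_to_recover(d, set()))  # -> [] (empty set: job 1 skipped, as in the log)
    print(select_jobs_to_recover(d, {'1'}))  # -> ['1']
```

The design takes the in-progress set as ground truth, so a failed query is indistinguishable from "no jobs in progress"; distinguishing the two (e.g. failing loudly when the query errors) would avoid the silent skip.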
D 10-31 14:14:40 skypilot_config.py:514]       infra: kubernetes
D 10-31 14:14:40 skypilot_config.py:514]       cpus: 2
D 10-31 14:14:40 skypilot_config.py:514]     high_availability: true
D 10-31 14:14:40 skypilot_config.py:514]
D 10-31 14:14:40 skypilot_config.py:521] Config syntax check passed for path: /home/sky/.sky/managed_jobs/t-jobs-ha-kill-st-d7-6c-f811.config_yaml
/opt/conda/lib/python3.10/runpy.py:126: RuntimeWarning: 'sky.jobs.scheduler' found in sys.modules after import of package 'sky.jobs', but prior to execution of 'sky.jobs.scheduler'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
D 10-31 14:14:41 scheduler.py:299] Storing job 1 file contents in database (DAG bytes=734, original user yaml bytes=259, env bytes=526).
I 10-31 14:14:41 scheduler.py:180] Running controller with command: source ~/skypilot-runtime/bin/activate;/home/sky/skypilot-runtime/bin/python -u -msky.jobs.controller ca2b5967-9cfc-4c89-9ef9-4c2f6ba4c2d2
I 10-31 14:14:41 scheduler.py:265] Started 1 controllers
(base) buildkite@54:~/.buildkite-agent/builds/kind/54-226-105-94-buildkite-kind5-1/skypilot-1/smoke-tests$ kubectl exec -n test-namespace sky-jobs-controller-3bde5781-3bde5781-deployment-689768d45j6g4z -- cat /tmp/ha_recovery.log
Defaulted container "ray-node" out of: ray-node, init-copy-home (init)
Starting HA recovery at Fri Oct 31 14:14:59 UTC 2025
Attempt 1 to get Ray status after 1 seconds...
Ray ready after waiting 5 seconds (took 1 attempts)
=== Controller setup commands completed for recovery at Fri Oct 31 14:15:46 UTC 2025 ===
HA recovery completed at Fri Oct 31 14:15:47 UTC 2025
Total recovery time: 48 seconds
(base) buildkite@54:~/.buildkite-agent/builds/kind/54-226-105-94-buildkite-kind5-1/skypilot-1/smoke-tests$ kubectl exec -n test-namespace sky-jobs-controller-3bde5781-3bde5781-deployment-689768d45j6g4z -- cat /home/sky/sky_logs/jobs_controller/1.log 2>/dev/null || echo "Log file doesn't exist yet"
I 10-31 14:14:42 controller.py:931] Starting job loop for 1
I 10-31 14:14:42 controller.py:932]   log_file=/home/sky/sky_logs/jobs_controller/1.log
I 10-31 14:14:42 controller.py:933]   pool=None
I 10-31 14:14:42 controller.py:934] From controller ca2b5967-9cfc-4c89-9ef9-4c2f6ba4c2d2
I 10-31 14:14:42 controller.py:935]   pid=2869
I 10-31 14:14:42 controller.py:941] Loading 13 environment variables for job 1
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_DEV=0
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_DEBUG=1
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_DISABLE_USAGE_COLLECTION=1
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_MINIMIZE_LOGGING=0
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_SUPPRESS_SENSITIVE_LOG=0
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1
D 10-31 14:14:42 controller.py:947] Set environment variable: BUILDKITE=1
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_ENABLE_GRPC=0
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_ALLOW_ALL_KUBERNETES_CONTEXTS=0
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_USER=buildkite
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_USER_ID=3bde5781
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_USING_REMOTE_API_SERVER=False
D 10-31 14:14:42 controller.py:947] Set environment variable: SKYPILOT_CONFIG=~/.sky/managed_jobs/t-jobs-ha-kill-st-d7-6c-f811.config_yaml
D 10-31 14:14:42 skypilot_config.py:563] Using config path: /home/sky/.sky/managed_jobs/t-jobs-ha-kill-st-d7-6c-f811.config_yaml
D 10-31 14:14:42 skypilot_config.py:514] Config loaded from /home/sky/.sky/managed_jobs/t-jobs-ha-kill-st-d7-6c-f811.config_yaml:
D 10-31 14:14:42 skypilot_config.py:514] azure:
D 10-31 14:14:42 skypilot_config.py:514]   storage_account: buildkitestorage
D 10-31 14:14:42 skypilot_config.py:514]
D 10-31 14:14:42 skypilot_config.py:514] jobs:
D 10-31 14:14:42 skypilot_config.py:514]   controller:
D 10-31 14:14:42 skypilot_config.py:514]     resources:
D 10-31 14:14:42 skypilot_config.py:514]       infra: kubernetes
D 10-31 14:14:42 skypilot_config.py:514]       cpus: 2
D 10-31 14:14:42 skypilot_config.py:514]     high_availability: true
D 10-31 14:14:42 skypilot_config.py:514]
D 10-31 14:14:42 skypilot_config.py:521] Config syntax check passed for path: /home/sky/.sky/managed_jobs/t-jobs-ha-kill-st-d7-6c-f811.config_yaml
I 10-31 14:14:42 controller.py:144] Initializing JobsController for job_id=1
I 10-31 14:14:42 controller.py:149] Loaded DAG: DAG:
I 10-31 14:14:42 controller.py:149] [Task<name=t-jobs-ha-kill-st-d7-6c>(run='conda env list\necho...')
I 10-31 14:14:42 controller.py:149]   resources: Kubernetes(cpus=2+, mem=4+)]
I 10-31 14:14:42 controller.py:700] Starting JobsController run for job 1
I 10-31 14:14:42 controller.py:708] Processing task 0/0: t-jobs-ha-kill-st-d7-6c
I 10-31 14:14:42 controller.py:267] Starting task 0 (t-jobs-ha-kill-st-d7-6c) for job 1
I 10-31 14:14:42 state.py:1961] Launching the spot cluster...
I 10-31 14:14:42 controller.py:340] Submitted managed job 1 (task: 0, name: 't-jobs-ha-kill-st-d7-6c'); SKYPILOT_TASK_ID: sky-managed-2025-10-31-14-14-42-717307_t-jobs-ha-kill-st-d7-6c_1-0
I 10-31 14:14:42 controller.py:344] Started monitoring.
I 10-31 14:14:42 scheduler.py:363] Starting job 1
D 10-31 14:14:42 recovery_strategy.py:388] Cleared env var: SKYPILOT_CONFIG
D 10-31 14:14:42 recovery_strategy.py:388] Cleared env var: SKYPILOT_USER_ID
D 10-31 14:14:42 recovery_strategy.py:388] Cleared env var: SKYPILOT_USER
D 10-31 14:14:42 recovery_strategy.py:388] Cleared env var: SKYPILO
