Skip to content
This repository was archived by the owner on Jun 27, 2025. It is now read-only.
This repository was archived by the owner on Jun 27, 2025. It is now read-only.

Checking batch job status fails #456

@sokil

Description

@sokil

I have batch job that perform some one-time short-running task. Successfull deploument looks like:

2022-06-29T16:00:17Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/deploy: evaluation e9d76b4c-8f4b-68e5-05e3-eee20a82d225 finished successfully job_id=some_nomad_job_name
2022-06-29T16:00:18Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: job has status running job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in pending state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in running state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: all allocations in deployment of job are running job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/deploy: job deployment successful job_id=some_nomad_job_name

Today i'v got error:

2022-07-06T14:57:01Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-07-06T14:57:03Z |INFO| levant/deploy: evaluation ffa905f9-e937-e178-2e1a-d2b3d18ed8a8 finished successfully job_id=some_nomad_job_name
2022-07-06T14:57:03Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/job_status_checker: job has status dead job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/deploy: job deployment failed job_id=some_nomad_job_name

In successful deployment time between "levant/job_status_checker: running job status checker for job" and first status is 0 seconds.
In failed - 4 seconds. During this time my job was successfully finished and has status 'dead' but levant thinks that this task is just dead so it exited with non zero code and fails by CI pipeline.

As i see, levant have some problems with communication to nomad and its tooks to long time to get job status.
Is it possible to disable check of job? because asynchronous checking of short lived tasks may fail unexpectedly

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions