@BryonLewis
Collaborator

resolves #1560

Adds a health check for the Docker-enabled containers (pipelines, training). Every 15 minutes the check runs nvidia-smi to see whether it returns a 0 exit code. If the exit code is anything else, the container should be marked as unhealthy and restarted.
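
For illustration only, here is a minimal Python sketch of the check described above; it is not the code in this PR. The names `gpu_is_healthy` and `main` and the 30-second timeout are assumptions made for the example. The sketch treats a missing binary or a hung `nvidia-smi` the same as a non-zero exit code.

```python
import subprocess
import sys
from typing import Sequence


def gpu_is_healthy(
    command: Sequence[str] = ("nvidia-smi",),  # command to probe; default is the one named in this PR
    timeout_s: float = 30.0,                   # hypothetical timeout, not taken from the PR
) -> bool:
    """Return True only if the command exits with code 0 within the timeout."""
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,  # output is not needed, only the exit code
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,                # inspect returncode ourselves instead of raising
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: report unhealthy.
        return False
    return result.returncode == 0


def main() -> None:
    """Exit 0 when healthy and 1 otherwise, as a health-check command would."""
    sys.exit(0 if gpu_is_healthy() else 1)
```

A script like this could call `main()` as its entry point and be wired into a Docker `HEALTHCHECK` with `--interval=15m` to match the 15-minute schedule. Note that a plain Docker health check only marks the container as unhealthy; the restart has to come from whatever supervises the container.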

Development

Successfully merging this pull request may close these issues.

Health Check for Nvidia Containers
