Skip to content

Conversation

@lloyd-brown
Copy link
Collaborator

@lloyd-brown lloyd-brown commented Nov 4, 2025

We saw an error last week #7761 where using rsync to set up the files required by a cluster In Kubernetes would fail with OCI runtime exec failed: exec failed: unable to start container process: exec: "rsync": executable file not found in $PATH: unknown showing that rsync was not available on the kubernetes pod.

After some digging I found that while we do use a file to denote the end of the package installation process and we do check if the file exists before running cluster setup we don't check the existence of this file in internal_file_mounts. So if package installation is sufficiently slow we will run this function and error out with rsync not being found.

Now when we exec into the pod before we start running rsync we will check on the pod if rsync exists. This should help cases where the package installation is slightly slow.

I tested this by

  • Injecting a long sleep into the package installation code and verifying that we wait until
  • Making sure we don’t wait in the case rsync is immediately ready.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@lloyd-brown
Copy link
Collaborator Author

/smoke-test -k test_docker_storage_mounts --kubernetes

@lloyd-brown lloyd-brown force-pushed the lloyd/fix-rsync-not-found branch from 86e64fa to e8ee5c7 Compare November 4, 2025 01:18
@lloyd-brown
Copy link
Collaborator Author

/smoke-test -k test_docker_storage_mounts --kubernetes

1 similar comment
@lloyd-brown
Copy link
Collaborator Author

/smoke-test -k test_docker_storage_mounts --kubernetes

@lloyd-brown lloyd-brown marked this pull request as ready for review November 4, 2025 02:01
Copy link
Collaborator

@cg505 cg505 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the happy path (rsync already installed), this will introduce an additional roundtrip while uploading files - what is the overhead of that?
If significant, can we instead execute a bash script that does the waiting on the remote cluster and than execs `"$@" once rsync is installed?
Also, doesn't block the PR but we should understand why we are not already waiting for rsync to be installed using the signal files in /tmp

@lloyd-brown
Copy link
Collaborator Author

In the happy path (rsync already installed), this will introduce an additional roundtrip while uploading files - what is the overhead of that? If significant, can we instead execute a bash script that does the waiting on the remote cluster and than execs `"$@" once rsync is installed?

The overhead was pretty significant. So I switched to your suggestion waiting on the remote cluster and checking if rsync is available. For your suggestion I saw no difference between our current code and this code in the rsync time in the happy path case. Both took within 10ms of each other.

Also, doesn't block the PR but we should understand why we are not already waiting for rsync to be installed using the signal files in /tmp

So yes we do currently touch a file after we’ve finished installing packages. In the setup commands in the kubernetes ray yaml file we will wait until this file has been created. So setting up the ray runtime will wait for all of the packages to be installed. But the rsync code comes from internal_file_mounts which does not worry about that file.

@lloyd-brown
Copy link
Collaborator Author

/smoke-test -k test_docker_storage_mounts --kubernetes

@lloyd-brown lloyd-brown requested a review from cg505 November 17, 2025 21:52
Copy link
Collaborator

@SeungjinYang SeungjinYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lloyd-brown for the catch!

@lloyd-brown lloyd-brown merged commit 958b91a into master Nov 19, 2025
21 checks passed
@lloyd-brown lloyd-brown deleted the lloyd/fix-rsync-not-found branch November 19, 2025 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants