-
Notifications
You must be signed in to change notification settings - Fork 25
[WIP] support podman on nydus #665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: upstream_2.46.0
Are you sure you want to change the base?
Conversation
Reviewer's GuideIntroduces conditional support for Podman with Nydus by detecting the RAY_PODMAN_USE_NYDUS flag and switching to a rootfs-based workflow: pull commands mount auxiliary scripts and invoke Python via a mounted entrypoint, while container run commands dynamically assemble mounts for devices, libraries, and environment variables. Sequence diagram for Podman container creation with Nydus supportsequenceDiagram
participant Caller
participant ImageManager
participant Podman
participant HostOS
Caller->>ImageManager: _create_impl(image_uri)
ImageManager->>HostOS: Check RAY_PODMAN_USE_NYDUS env
alt Nydus enabled
ImageManager->>HostOS: Get python_executable & get_worker_script
ImageManager->>Podman: podman run --rootfs image_uri:O /tmp/run_python.sh /tmp/get_worker_path.py
Podman->>ImageManager: Return worker_path
ImageManager->>HostOS: Parse worker_path output
else Nydus disabled
ImageManager->>Podman: podman run image_uri python -c "import ..."
Podman->>ImageManager: Return worker_path
end
ImageManager->>Caller: Return worker_path
Entity relationship diagram for environment variable propagation with NyduserDiagram
HOST_ENV_VARS {
string RAY_*
string CLUSTER_NAME
string POD_IP
string NODE_IP
string OTHER
}
CONTAINER_ENV_VARS {
string RAY_*
string CLUSTER_NAME
string POD_IP
string NODE_IP
string OTHER
}
HOST_ENV_VARS ||--|{ CONTAINER_ENV_VARS : "propagates"
CONTEXT_ENV_VARS {
string env_vars
}
CONTEXT_ENV_VARS ||--|{ CONTAINER_ENV_VARS : "merges"
CONTAINER_ENV_VARS {
string nydus_env_keys
}
Class diagram for updated image_uri.py container command assemblyclassDiagram
class ImageManager {
+_create_impl(image_uri, logger)
+_modify_context_impl(context, image_uri, logger)
}
class Context {
+env_vars
}
ImageManager -- Context: uses
class PodmanCommandBuilder {
+assemble_run_command()
+assemble_mounts()
}
ImageManager -- PodmanCommandBuilder: uses
PodmanCommandBuilder : +mount_commands
PodmanCommandBuilder : +env_vars
PodmanCommandBuilder : +run_options
PodmanCommandBuilder : +podman_use_nydus
PodmanCommandBuilder : +python_executable
PodmanCommandBuilder : +image_uri
PodmanCommandBuilder : +dynamic device/library discovery
PodmanCommandBuilder : +conditional sudo
PodmanCommandBuilder : +rootfs workflow
PodmanCommandBuilder : +entrypoint selection
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Summary of ChangesHello @xsuler, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces the foundational support for integrating Nydus with Podman within Ray's runtime environment. The primary goal is to accelerate container image loading and execution, particularly for workloads that leverage GPUs. The changes involve adapting Podman commands for image handling and container setup, with a strong focus on ensuring that all necessary GPU-related resources and environment configurations are correctly propagated into the Nydus-enabled containers. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and found some issues that need to be addressed.
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `python/ray/_private/runtime_env/image_uri.py:307-308` </location>
<code_context>
+ "/usr/lib/libnvidia-*.so",
+ "/usr/lib/libcuda.so",
+ "/usr/lib/libvdpau_nvidia.so",
+ "/usr/lib/libnvcuvid.so"
+ "/usr/lib/libnvidia-*.so.*",
+ "/usr/lib/libcuda.so.*",
+ "/usr/lib/libvdpau_nvidia.so.*",
</code_context>
<issue_to_address>
**issue (bug_risk):** Missing comma between string literals in nvidia_lib_patterns list.
This missing comma will result in a syntax error.
</issue_to_address>
### Comment 2
<location> `python/ray/_private/runtime_env/image_uri.py:358-359` </location>
<code_context>
+ container_command.append("PATH=/root/.tnvm/versions/alinode/v5.20.3/bin:/opt/conda/bin:/opt/conda/bin:/opt/conda/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/odpscmd_public/bin:/opt/taobao/java/bin:/usr/local/cuda/bin")
+ container_command.append("--env")
+ container_command.append("LD_LIBRARY_PATH=/usr/lib64:/root/workdir/astra-build/Asystem-HybridEngine/astra_cache/build/lib:/opt/conda/lib/python3.10/site-packages/aistudio_common/reader/libs/:/opt/taobao/java/jre/lib/amd64/server/:/usr/local/cuda/lib64:/usr/local/nccl/lib:/usr/local/nccl2/lib:/usr/lib64/libibverbs:/usr/lib64/librdmacm")
+ container_command.append("--ulimit")
+ container_command.append("host")
+ container_command.append("--pids-limit")
+ container_command.append("0")
</code_context>
<issue_to_address>
**question (bug_risk):** The use of '--ulimit host' may not be supported by podman.
If using podman, verify that this flag does not cause errors, or use a podman-compatible method for setting ulimits.
</issue_to_address>
### Comment 3
<location> `python/ray/_private/runtime_env/image_uri.py:360-361` </location>
<code_context>
+ container_command.append("LD_LIBRARY_PATH=/usr/lib64:/root/workdir/astra-build/Asystem-HybridEngine/astra_cache/build/lib:/opt/conda/lib/python3.10/site-packages/aistudio_common/reader/libs/:/opt/taobao/java/jre/lib/amd64/server/:/usr/local/cuda/lib64:/usr/local/nccl/lib:/usr/local/nccl2/lib:/usr/lib64/libibverbs:/usr/lib64/librdmacm")
+ container_command.append("--ulimit")
+ container_command.append("host")
+ container_command.append("--pids-limit")
+ container_command.append("0")
+
+ container_command.append("--entrypoint")
</code_context>
<issue_to_address>
**question (bug_risk):** Setting --pids-limit to 0 disables the PID limit; verify this is safe.
Please ensure that removing the PID limit is necessary and won't risk system stability.
</issue_to_address>
### Comment 4
<location> `python/ray/_private/runtime_env/image_uri.py:32` </location>
<code_context>
async def _create_impl(image_uri: str, logger: logging.Logger):
# Pull image if it doesn't exist
# Also get path to `default_worker.py` inside the image.
import os
use_nydus = False
if os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true":
use_nydus = True
python_executable = os.getenv("PYTHON_EXECUTABLE_PATH", "/tmp/run_python.sh")
dir = os.path.dirname(os.path.abspath(__file__))
get_worker_script = os.path.join(dir, "get_worker_path.py")
pull_image_cmd = [
"podman",
"run",
"--rm",
"-v",
f"{python_executable}:/tmp/run_python.sh",
"-v",
f"{get_worker_script}:/tmp/get_worker_path.py",
"--rootfs",
image_uri + ":O",
"/tmp/run_python.sh",
"/tmp/get_worker_path.py",
]
if os.geteuid() != 0:
pull_image_cmd = ["sudo", "-E"] + pull_image_cmd
else:
pull_image_cmd = [
"podman",
"run",
"--rm",
image_uri,
"python",
"-c",
(
"import ray._private.workers.default_worker as default_worker; "
"print(default_worker.__file__)"
),
]
logger.info("Pulling image %s", image_uri)
worker_path = await check_output_cmd(pull_image_cmd, logger=logger)
if use_nydus:
lines = worker_path.strip().split("\n")
for line in lines:
if not line.startswith("time="):
worker_path = line
break
return worker_path.strip()
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Use f-string instead of string concatenation ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
- Don't assign to builtin variable `dir` ([`avoid-builtin-shadow`](https://docs.sourcery.ai/Reference/Default-Rules/comments/avoid-builtin-shadow/))
```suggestion
f"{image_uri}:O",
```
<br/><details><summary>Explanation</summary>
Python has a number of `builtin` variables: functions and constants that
form a part of the language, such as `list`, `getattr`, and `type`
(See https://docs.python.org/3/library/functions.html).
It is valid, in the language, to re-bind such variables:
```python
list = [1, 2, 3]
```
However, this is considered poor practice.
- It will confuse other developers.
- It will confuse syntax highlighters and linters.
- It means you can no longer use that builtin for its original purpose.
How can you solve this?
Rename the variable something more specific, such as `integers`.
In a pinch, `my_list` and similar names are colloquially-recognized
placeholders.</details>
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| "/usr/lib/libnvcuvid.so" | ||
| "/usr/lib/libnvidia-*.so.*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Missing comma between string literals in nvidia_lib_patterns list.
This missing comma will result in a syntax error.
| container_command.append("--ulimit") | ||
| container_command.append("host") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (bug_risk): The use of '--ulimit host' may not be supported by podman.
If using podman, verify that this flag does not cause errors, or use a podman-compatible method for setting ulimits.
| container_command.append("--pids-limit") | ||
| container_command.append("0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (bug_risk): Setting --pids-limit to 0 disables the PID limit; verify this is safe.
Please ensure that removing the PID limit is necessary and won't risk system stability.
| "-v", | ||
| f"{get_worker_script}:/tmp/get_worker_path.py", | ||
| "--rootfs", | ||
| image_uri + ":O", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): We've found these issues:
- Use f-string instead of string concatenation (
use-fstring-for-concatenation) - Don't assign to builtin variable
dir(avoid-builtin-shadow)
| image_uri + ":O", | |
| f"{image_uri}:O", |
Explanation
Python has a number of
builtin variables: functions and constants thatform a part of the language, such as
list, getattr, and type(See https://docs.python.org/3/library/functions.html).
It is valid, in the language, to re-bind such variables:
list = [1, 2, 3]However, this is considered poor practice.
- It will confuse other developers.
- It will confuse syntax highlighters and linters.
- It means you can no longer use that builtin for its original purpose.
How can you solve this?
Rename the variable something more specific, such as integers.
In a pinch, my_list and similar names are colloquially-recognized
placeholders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request introduces support for Podman with Nydus snapshotter, enhancing Ray's runtime environment capabilities. The changes involve conditional logic for Nydus, dynamic mounting of host resources (especially NVIDIA-related), and specific environment variable propagation. While the functionality seems to align with the PR's goal, there are several areas where the code could be improved for clarity, maintainability, and robustness, particularly around the extensive use of hardcoded paths and environment variables, and the dynamic discovery logic.
| ] | ||
| import os | ||
| use_nydus = False | ||
| if os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using os.getenv with a default string and then converting to lowercase and comparing to a string literal is a common pattern, but it can be made more robust by defining a helper function or a constant for the environment variable name and its expected value. This reduces magic strings and improves readability.
NYDUS_ENV_VAR = "RAY_PODMAN_USE_NYDUS"
use_nydus = os.getenv(NYDUS_ENV_VAR, "false").lower() == "true"| if use_nydus: | ||
| lines = worker_path.strip().split("\n") | ||
| for line in lines: | ||
| if not line.startswith("time="): | ||
| worker_path = line | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic to strip time= lines from worker_path seems specific to the output of the pull_image_cmd when Nydus is enabled. It would be more robust to parse the output more explicitly, perhaps by looking for a specific marker or using a regular expression, rather than assuming time= lines are the only non-path output. If the output format changes, this parsing could break.
lines = worker_path.strip().split("\n")
for line in lines:
if not line.startswith("time="):
worker_path = line
break| podman_use_nydus = False | ||
|
|
||
| # Check if nydus is enabled | ||
| if os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous comment, using os.getenv with a default string and then converting to lowercase and comparing to a string literal is a common pattern, but it can be made more robust by defining a helper function or a constant for the environment variable name and its expected value. This reduces magic strings and improves readability.
podman_use_nydus = os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true"| # Discover NVIDIA firmware files using glob | ||
| firmware_patterns = [ | ||
| "/usr/lib/firmware/nvidia/*/gsp_*.bin" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # Discover Vulkan and EGL config files using glob | ||
| vulkan_config_patterns = [ | ||
| "/etc/vulkan/icd.d/nvidia*.json", | ||
| "/etc/vulkan/implicit_layer.d/nvidia*.json", | ||
| "/usr/share/egl/egl_external_platform.d/*nvidia*.json", | ||
| "/usr/share/glvnd/egl_vendor.d/*nvidia*.json" | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vulkan and EGL config file patterns are hardcoded. These paths can vary, and maintaining this list can be challenging. A more dynamic discovery method or configurable options would be more robust.
vulkan_config_patterns = [
"/etc/vulkan/icd.d/nvidia*.json",
"/etc/vulkan/implicit_layer.d/nvidia*.json",
"/usr/share/egl/egl_external_platform.d/*nvidia*.json",
"/usr/share/glvnd/egl_vendor.d/*nvidia*.json"
]|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
| "/tmp/get_worker_path.py", | ||
| ] | ||
| if os.geteuid() != 0: | ||
| pull_image_cmd = ["sudo", "-E"] + pull_image_cmd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
对于同一个集群,是否要加sudo -E是否是确定的
Why are these changes needed?
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.Summary by Sourcery
Enable optional Podman+Nydus workflow for Ray runtime_env by adding environment-controlled snapshotter support, custom rootfs handling, and dynamic host resource mounting
New Features:
Enhancements: