@xsuler
Collaborator

@xsuler xsuler commented Oct 31, 2025

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Summary by Sourcery

Enable optional Podman+Nydus workflow for Ray runtime_env by adding environment-controlled snapshotter support, custom rootfs handling, and dynamic host resource mounting

New Features:

  • Add support for pulling and running runtime_env images via podman using the Nydus snapshotter based on RAY_PODMAN_USE_NYDUS

Enhancements:

  • Propagate specified host environment variables and dynamically mount host resources (GPU devices, CUDA/NCCL libraries, Vulkan/EGL, RDMA/InfiniBand, system paths) into containers when Nydus is enabled
  • Automatically prepend sudo -E to podman commands for non-root users

@xsuler xsuler self-assigned this Oct 31, 2025
@sourcery-ai

sourcery-ai bot commented Oct 31, 2025

Reviewer's Guide

Introduces conditional support for Podman with Nydus by detecting the RAY_PODMAN_USE_NYDUS flag and switching to a rootfs-based workflow: pull commands mount auxiliary scripts and invoke Python via a mounted entrypoint, while container run commands dynamically assemble mounts for devices, libraries, and environment variables.

Sequence diagram for Podman container creation with Nydus support

sequenceDiagram
    participant Caller
    participant ImageManager
    participant Podman
    participant HostOS
    Caller->>ImageManager: _create_impl(image_uri)
    ImageManager->>HostOS: Check RAY_PODMAN_USE_NYDUS env
    alt Nydus enabled
        ImageManager->>HostOS: Get python_executable & get_worker_script
        ImageManager->>Podman: podman run --rootfs image_uri:O /tmp/run_python.sh /tmp/get_worker_path.py
        Podman->>ImageManager: Return worker_path
        ImageManager->>HostOS: Parse worker_path output
    else Nydus disabled
        ImageManager->>Podman: podman run image_uri python -c "import ..."
        Podman->>ImageManager: Return worker_path
    end
    ImageManager->>Caller: Return worker_path

Entity relationship diagram for environment variable propagation with Nydus

erDiagram
    HOST_ENV_VARS {
        string RAY_*
        string CLUSTER_NAME
        string POD_IP
        string NODE_IP
        string OTHER
    }
    CONTAINER_ENV_VARS {
        string RAY_*
        string CLUSTER_NAME
        string POD_IP
        string NODE_IP
        string OTHER
    }
    HOST_ENV_VARS ||--|{ CONTAINER_ENV_VARS : "propagates"
    CONTEXT_ENV_VARS {
        string env_vars
    }
    CONTEXT_ENV_VARS ||--|{ CONTAINER_ENV_VARS : "merges"
    CONTAINER_ENV_VARS {
        string nydus_env_keys
    }

Class diagram for updated image_uri.py container command assembly

classDiagram
    class ImageManager {
        +_create_impl(image_uri, logger)
        +_modify_context_impl(context, image_uri, logger)
    }
    class Context {
        +env_vars
    }
    ImageManager -- Context: uses
    class PodmanCommandBuilder {
        +assemble_run_command()
        +assemble_mounts()
    }
    ImageManager -- PodmanCommandBuilder: uses
    PodmanCommandBuilder : +mount_commands
    PodmanCommandBuilder : +env_vars
    PodmanCommandBuilder : +run_options
    PodmanCommandBuilder : +podman_use_nydus
    PodmanCommandBuilder : +python_executable
    PodmanCommandBuilder : +image_uri
    PodmanCommandBuilder : +dynamic device/library discovery
    PodmanCommandBuilder : +conditional sudo
    PodmanCommandBuilder : +rootfs workflow
    PodmanCommandBuilder : +entrypoint selection

File-Level Changes

Change: Conditional image pull implementation for Nydus
Details:
  • Detect RAY_PODMAN_USE_NYDUS and toggle use_nydus flag
  • Construct alternate podman run command mounting helper scripts and using --rootfs
  • Prepend 'sudo -E' when running as non-root
  • Fallback to original python -c pull command when nydus disabled
Files: python/ray/_private/runtime_env/image_uri.py

Change: Propagation of Nydus-specific environment variables
Details:
  • Detect nydus mode in _modify_context_impl
  • Define a whitelist of environment keys to inherit
  • Merge existing context.env_vars afterwards
  • Log the chosen podman nydus flag
Files: python/ray/_private/runtime_env/image_uri.py

Change: Enhanced container command for Nydus-based run
Details:
  • Prepend 'sudo -E' for non-root users
  • Mount python_executable script into container
  • Dynamically discover and mount GPU devices, CUDA/NCCL installations, RDMA paths, Vulkan/EGL, shared memory, sockets, system config, firmware, and binaries using glob patterns
  • Append privileged flags, environment variables, ulimits, entrypoint, and --rootfs
Files: python/ray/_private/runtime_env/image_uri.py
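
The glob-based mount discovery described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name `assemble_mounts` and the read-only default are assumptions made for illustration; only the idea of expanding host glob patterns into podman `-v` arguments comes from the review.

```python
import glob
from typing import Iterable, List


def assemble_mounts(patterns: Iterable[str], read_only: bool = True) -> List[str]:
    """Expand host glob patterns into podman `-v host:container` arguments.

    Hypothetical helper: each matching host path is mounted at the same
    path inside the container, and duplicates are skipped.
    """
    args: List[str] = []
    seen = set()
    for pattern in patterns:
        # sorted() keeps the generated command stable across runs.
        for path in sorted(glob.glob(pattern)):
            if path in seen:
                continue
            seen.add(path)
            suffix = ":ro" if read_only else ""
            args += ["-v", f"{path}:{path}{suffix}"]
    return args
```

Patterns that match nothing on the host simply contribute no arguments, so the same pattern list can be used on nodes with and without GPUs.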

Possibly linked issues

  • #unknown: PR implements Podman 4.x/5.x compatibility via Nydus and RAY_PODMAN_USE_NYDUS, directly addressing the issue's requirements.


@gemini-code-assist

Summary of Changes

Hello @xsuler, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the foundational support for integrating Nydus with Podman within Ray's runtime environment. The primary goal is to accelerate container image loading and execution, particularly for workloads that leverage GPUs. The changes involve adapting Podman commands for image handling and container setup, with a strong focus on ensuring that all necessary GPU-related resources and environment configurations are correctly propagated into the Nydus-enabled containers.

Highlights

  • Nydus Integration: Introduces conditional support for Nydus, a container image acceleration solution, when using Podman for Ray's runtime environment. This feature is activated by setting the RAY_PODMAN_USE_NYDUS environment variable to true.
  • Enhanced GPU/CUDA Support: Significantly enhances GPU support within Nydus-enabled Podman containers by dynamically discovering and mounting a comprehensive list of NVIDIA-related devices, libraries, configurations, and binaries from the host into the container. This includes CUDA, NCCL, Vulkan, EGL, and RDMA components.
  • Custom Podman Commands: Modifies the podman run commands for both image pulling and container execution to leverage Nydus-specific options, such as --rootfs for on-demand image loading, and custom entrypoints. It also sets specific environment variables like PATH and LD_LIBRARY_PATH and applies resource limits (--ulimit host, --pids-limit 0) when Nydus is active.
  • Environment Variable Inheritance: When Nydus is enabled, a predefined list of critical environment variables (many related to GPU, networking, and Ray-specific configurations) are explicitly inherited from the host system into the Podman container to ensure proper operation.
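
The inheritance-then-merge order described in the summaries (whitelisted host variables first, `context.env_vars` merged afterwards) can be sketched as follows. The key list below uses only the names shown in the entity diagram (`RAY_*`, `CLUSTER_NAME`, `POD_IP`, `NODE_IP`); the PR's full whitelist is not reproduced here, and the function names are hypothetical.

```python
import os
from typing import Dict, List, Mapping, Optional, Sequence

# Assumed subset of the whitelist, taken from the diagram in this PR page.
INHERITED_KEYS = ("CLUSTER_NAME", "POD_IP", "NODE_IP")
INHERITED_PREFIXES = ("RAY_",)


def build_container_env(
    context_env_vars: Mapping[str, str],
    host_env: Optional[Mapping[str, str]] = None,
    keys: Sequence[str] = INHERITED_KEYS,
    prefixes: Sequence[str] = INHERITED_PREFIXES,
) -> Dict[str, str]:
    """Inherit whitelisted host variables, then let context values win."""
    host_env = os.environ if host_env is None else host_env
    env = {
        name: value
        for name, value in host_env.items()
        if name in keys or name.startswith(tuple(prefixes))
    }
    # Merged afterwards, so runtime_env settings override inherited ones.
    env.update(context_env_vars)
    return env


def env_to_podman_args(env: Mapping[str, str]) -> List[str]:
    """Render the mapping as repeated `--env KEY=VALUE` arguments."""
    args: List[str] = []
    for name in sorted(env):
        args += ["--env", f"{name}={env[name]}"]
    return args
```

Because the context values are applied last, a user-supplied `POD_IP` would replace the inherited one rather than the other way round.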

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and found some issues that need to be addressed.

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `python/ray/_private/runtime_env/image_uri.py:307-308` </location>
<code_context>
+            "/usr/lib/libnvidia-*.so",
+            "/usr/lib/libcuda.so",
+            "/usr/lib/libvdpau_nvidia.so",
+            "/usr/lib/libnvcuvid.so"
+            "/usr/lib/libnvidia-*.so.*",
+            "/usr/lib/libcuda.so.*",
+            "/usr/lib/libvdpau_nvidia.so.*",
</code_context>

<issue_to_address>
**issue (bug_risk):** Missing comma between string literals in nvidia_lib_patterns list.

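Python implicitly concatenates adjacent string literals, so this will not raise a syntax error. Instead, the two patterns are silently merged into a single pattern, `/usr/lib/libnvcuvid.so/usr/lib/libnvidia-*.so.*`, which is unlikely to match either library.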
</issue_to_address>
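
A short illustration of what a missing comma does in a Python list of string literals: the code still parses, but two entries collapse into one. The list below is a shortened copy of the patterns quoted in the code context, used only for demonstration.

```python
# Adjacent string literals are joined at compile time, so the first two
# lines below form ONE list element instead of two.
nvidia_lib_patterns = [
    "/usr/lib/libnvcuvid.so"  # <- comma missing here
    "/usr/lib/libnvidia-*.so.*",
    "/usr/lib/libcuda.so.*",
]

print(len(nvidia_lib_patterns))  # 2, not 3
print(nvidia_lib_patterns[0])    # /usr/lib/libnvcuvid.so/usr/lib/libnvidia-*.so.*
```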

### Comment 2
<location> `python/ray/_private/runtime_env/image_uri.py:358-359` </location>
<code_context>
+        container_command.append("PATH=/root/.tnvm/versions/alinode/v5.20.3/bin:/opt/conda/bin:/opt/conda/bin:/opt/conda/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/odpscmd_public/bin:/opt/taobao/java/bin:/usr/local/cuda/bin")
+        container_command.append("--env")
+        container_command.append("LD_LIBRARY_PATH=/usr/lib64:/root/workdir/astra-build/Asystem-HybridEngine/astra_cache/build/lib:/opt/conda/lib/python3.10/site-packages/aistudio_common/reader/libs/:/opt/taobao/java/jre/lib/amd64/server/:/usr/local/cuda/lib64:/usr/local/nccl/lib:/usr/local/nccl2/lib:/usr/lib64/libibverbs:/usr/lib64/librdmacm")
+        container_command.append("--ulimit")
+        container_command.append("host")
+        container_command.append("--pids-limit")
+        container_command.append("0")
</code_context>

<issue_to_address>
**question (bug_risk):** The use of '--ulimit host' may not be supported by podman.

If using podman, verify that this flag does not cause errors, or use a podman-compatible method for setting ulimits.
</issue_to_address>

### Comment 3
<location> `python/ray/_private/runtime_env/image_uri.py:360-361` </location>
<code_context>
+        container_command.append("LD_LIBRARY_PATH=/usr/lib64:/root/workdir/astra-build/Asystem-HybridEngine/astra_cache/build/lib:/opt/conda/lib/python3.10/site-packages/aistudio_common/reader/libs/:/opt/taobao/java/jre/lib/amd64/server/:/usr/local/cuda/lib64:/usr/local/nccl/lib:/usr/local/nccl2/lib:/usr/lib64/libibverbs:/usr/lib64/librdmacm")
+        container_command.append("--ulimit")
+        container_command.append("host")
+        container_command.append("--pids-limit")
+        container_command.append("0")
+
+        container_command.append("--entrypoint")
</code_context>

<issue_to_address>
**question (bug_risk):** Setting --pids-limit to 0 disables the PID limit; verify this is safe.

Please ensure that removing the PID limit is necessary and won't risk system stability.
</issue_to_address>

### Comment 4
<location> `python/ray/_private/runtime_env/image_uri.py:32` </location>
<code_context>
async def _create_impl(image_uri: str, logger: logging.Logger):
    # Pull image if it doesn't exist
    # Also get path to `default_worker.py` inside the image.
    import os
    use_nydus = False
    if os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true":
        use_nydus = True
        python_executable = os.getenv("PYTHON_EXECUTABLE_PATH", "/tmp/run_python.sh")
        dir = os.path.dirname(os.path.abspath(__file__))
        get_worker_script = os.path.join(dir, "get_worker_path.py")

        pull_image_cmd = [
            "podman",
            "run",
            "--rm",
            "-v",
            f"{python_executable}:/tmp/run_python.sh",
            "-v",
            f"{get_worker_script}:/tmp/get_worker_path.py",
            "--rootfs",
            image_uri + ":O",
            "/tmp/run_python.sh",
            "/tmp/get_worker_path.py",
        ]
        if os.geteuid() != 0:
            pull_image_cmd = ["sudo", "-E"] + pull_image_cmd
    else:
        pull_image_cmd = [
            "podman",
            "run",
            "--rm",
            image_uri,
            "python",
            "-c",
            (
                "import ray._private.workers.default_worker as default_worker; "
                "print(default_worker.__file__)"
            ),
        ]
    logger.info("Pulling image %s", image_uri)
    worker_path = await check_output_cmd(pull_image_cmd, logger=logger)
    if use_nydus:
        lines = worker_path.strip().split("\n")
        for line in lines:
            if not line.startswith("time="):
                worker_path = line
                break
    return worker_path.strip()

</code_context>

<issue_to_address>
**suggestion (code-quality):** We've found these issues:

- Use f-string instead of string concatenation ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
- Don't assign to builtin variable `dir` ([`avoid-builtin-shadow`](https://docs.sourcery.ai/Reference/Default-Rules/comments/avoid-builtin-shadow/))

```suggestion
            f"{image_uri}:O",
```

<br/><details><summary>Explanation</summary>
Python has a number of `builtin` variables: functions and constants that
form a part of the language, such as `list`, `getattr`, and `type`
(See https://docs.python.org/3/library/functions.html).
It is valid, in the language, to re-bind such variables:

```python
list = [1, 2, 3]
```
However, this is considered poor practice.
- It will confuse other developers.
- It will confuse syntax highlighters and linters.
- It means you can no longer use that builtin for its original purpose.

How can you solve this?

Rename the variable something more specific, such as `integers`.
In a pinch, `my_list` and similar names are colloquially-recognized
placeholders.</details>
</issue_to_address>
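
Applied to the code in this comment, the two suggestions might look like the sketch below. The helper names `helper_script_path` and `rootfs_argument` are hypothetical and only illustrate renaming `dir` and using an f-string; the `:O` suffix is copied from the PR as-is.

```python
import os


def helper_script_path(module_file: str, script_name: str = "get_worker_path.py") -> str:
    """Locate a helper script next to the given module file.

    `module_dir` replaces the original variable name `dir`, which shadowed
    the builtin of the same name.
    """
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, script_name)


def rootfs_argument(image_uri: str) -> str:
    """Build the value passed after --rootfs, using an f-string."""
    return f"{image_uri}:O"
```

Inside `image_uri.py` the first helper would be called with `__file__` as `module_file`.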



@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces support for Podman with Nydus snapshotter, enhancing Ray's runtime environment capabilities. The changes involve conditional logic for Nydus, dynamic mounting of host resources (especially NVIDIA-related), and specific environment variable propagation. While the functionality seems to align with the PR's goal, there are several areas where the code could be improved for clarity, maintainability, and robustness, particularly around the extensive use of hardcoded paths and environment variables, and the dynamic discovery logic.

]
import os
use_nydus = False
if os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true":

medium

Using os.getenv with a default string and then converting to lowercase and comparing to a string literal is a common pattern, but it can be made more robust by defining a helper function or a constant for the environment variable name and its expected value. This reduces magic strings and improves readability.

    NYDUS_ENV_VAR = "RAY_PODMAN_USE_NYDUS"
    use_nydus = os.getenv(NYDUS_ENV_VAR, "false").lower() == "true"
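
One way to realise the helper-function idea is a small flag parser shared by both call sites. This is a sketch under assumptions: the name `env_flag` is hypothetical, and accepting `1` in addition to `true` is a choice made here, whereas the PR itself only checks for `true`.

```python
import os
from typing import Mapping, Optional

NYDUS_ENV_VAR = "RAY_PODMAN_USE_NYDUS"
# Assumption: treat "1" as truthy too; the PR only accepts "true".
_TRUE_VALUES = frozenset({"1", "true"})


def env_flag(
    name: str,
    default: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Read a boolean flag from the environment, case-insensitively."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES
```

Both `_create_impl` and `_modify_context_impl` could then call `env_flag(NYDUS_ENV_VAR)` instead of repeating the string comparison.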

Comment on lines +53 to +58
    if use_nydus:
        lines = worker_path.strip().split("\n")
        for line in lines:
            if not line.startswith("time="):
                worker_path = line
                break

medium

This logic to strip time= lines from worker_path seems specific to the output of the pull_image_cmd when Nydus is enabled. It would be more robust to parse the output more explicitly, perhaps by looking for a specific marker or using a regular expression, rather than assuming time= lines are the only non-path output. If the output format changes, this parsing could break.

        lines = worker_path.strip().split("\n")
        for line in lines:
            if not line.startswith("time="):
                worker_path = line
                break
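
A more explicit parse could select the line that looks like the worker path rather than skipping lines that look like logs. The sketch below assumes the helper script prints the absolute path of `default_worker.py` on its own line, as the non-Nydus command does; the function name and the suffix check are illustrative, not part of the PR.

```python
from typing import Optional


def extract_worker_path(
    output: str, expected_suffix: str = "default_worker.py"
) -> Optional[str]:
    """Return the first line that looks like an absolute worker path.

    Assumption: the wanted line is an absolute path ending in
    `default_worker.py`; any other line (log output, warnings) is ignored.
    """
    for line in output.splitlines():
        candidate = line.strip()
        if candidate.startswith("/") and candidate.endswith(expected_suffix):
            return candidate
    return None
```

Returning `None` when nothing matches lets the caller raise a clear error instead of passing a log line on as the worker path.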

podman_use_nydus = False

# Check if nydus is enabled
if os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true":

medium

Similar to the previous comment, using os.getenv with a default string and then converting to lowercase and comparing to a string literal is a common pattern, but it can be made more robust by defining a helper function or a constant for the environment variable name and its expected value. This reduces magic strings and improves readability.

    podman_use_nydus = os.getenv("RAY_PODMAN_USE_NYDUS", "false").lower() == "true"

Comment on lines +319 to +321
        # Discover NVIDIA firmware files using glob
        firmware_patterns = [
            "/usr/lib/firmware/nvidia/*/gsp_*.bin"

medium

The firmware patterns are hardcoded. While glob helps, the base path /usr/lib/firmware/nvidia is fixed. This might not cover all possible firmware locations or future changes in NVIDIA's directory structure.

        firmware_patterns = [
            "/usr/lib/firmware/nvidia/*/gsp_*.bin"
        ]

Comment on lines +329 to +335
        # Discover Vulkan and EGL config files using glob
        vulkan_config_patterns = [
            "/etc/vulkan/icd.d/nvidia*.json",
            "/etc/vulkan/implicit_layer.d/nvidia*.json",
            "/usr/share/egl/egl_external_platform.d/*nvidia*.json",
            "/usr/share/glvnd/egl_vendor.d/*nvidia*.json"
        ]

medium

Vulkan and EGL config file patterns are hardcoded. These paths can vary, and maintaining this list can be challenging. A more dynamic discovery method or configurable options would be more robust.

        vulkan_config_patterns = [
            "/etc/vulkan/icd.d/nvidia*.json",
            "/etc/vulkan/implicit_layer.d/nvidia*.json",
            "/usr/share/egl/egl_external_platform.d/*nvidia*.json",
            "/usr/share/glvnd/egl_vendor.d/*nvidia*.json"
        ]
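
One possible shape for the "configurable options" mentioned in this comment is to keep the built-in patterns as defaults and append operator-supplied ones. Everything beyond the four default patterns is an assumption: the environment variable `RAY_PODMAN_EXTRA_MOUNT_PATTERNS` and the function `resolve_patterns` do not exist in Ray or in this PR and are named here purely for illustration.

```python
import os
from typing import List, Mapping, Optional, Sequence

DEFAULT_VULKAN_CONFIG_PATTERNS = (
    "/etc/vulkan/icd.d/nvidia*.json",
    "/etc/vulkan/implicit_layer.d/nvidia*.json",
    "/usr/share/egl/egl_external_platform.d/*nvidia*.json",
    "/usr/share/glvnd/egl_vendor.d/*nvidia*.json",
)

# Hypothetical variable name, not defined anywhere in Ray.
EXTRA_PATTERNS_ENV_VAR = "RAY_PODMAN_EXTRA_MOUNT_PATTERNS"


def resolve_patterns(
    defaults: Sequence[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the defaults plus extra patterns from the environment.

    Extra patterns are separated by os.pathsep (":" on Linux), the same
    convention PATH uses; blanks and duplicates are dropped.
    """
    environ = os.environ if environ is None else environ
    patterns = list(defaults)
    for extra in environ.get(EXTRA_PATTERNS_ENV_VAR, "").split(os.pathsep):
        extra = extra.strip()
        if extra and extra not in patterns:
            patterns.append(extra)
    return patterns
```

With this shape, a cluster whose driver files live elsewhere can extend the list without a code change, while the default behaviour stays as in the PR.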

@xsuler xsuler requested a review from wumuzi520 October 31, 2025 02:55
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 15, 2025
@Qstar Qstar requested a review from liying919 November 17, 2025 09:35
@github-actions github-actions bot removed the stale label Nov 18, 2025
"/tmp/get_worker_path.py",
]
if os.geteuid() != 0:
    pull_image_cmd = ["sudo", "-E"] + pull_image_cmd
Collaborator

For a given cluster, is it deterministic whether sudo -E needs to be added?
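
The PR makes the sudo decision at each call site via `os.geteuid()`. If the answer is meant to be fixed per deployment, the check could be centralised in one place so that the pull and run commands cannot disagree. A minimal sketch, with hypothetical function names and an injectable `euid` for testing:

```python
import os
from typing import List, Optional


def podman_prefix(euid: Optional[int] = None) -> List[str]:
    """Return the command prefix: plain podman for root, sudo -E otherwise."""
    euid = os.geteuid() if euid is None else euid
    return ["podman"] if euid == 0 else ["sudo", "-E", "podman"]


def podman_command(*args: str, euid: Optional[int] = None) -> List[str]:
    """Build a full podman command line with the shared prefix."""
    return [*podman_prefix(euid), *args]
```

For example, `podman_command("run", "--rm", euid=1000)` yields `["sudo", "-E", "podman", "run", "--rm"]`. Whether the effective user is actually uniform across a cluster's nodes is a deployment question this sketch does not answer.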


3 participants