🔄 daily merge: master → main 2025-10-31 #666
base: main
Conversation
…RS (ray-project#57801) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…ject#57770) Allow skipping parameterizing limits in Training ingest benchmark. Signed-off-by: Srinath Krishnamachari <[email protected]>
…ray-project#57753) This change allows the `Aggregator` actor to execute tasks out of order to avoid head-of-line blocking. It also ensures that overridden Ray remote args are applied as an overlay on top of the default ones. Signed-off-by: Alexey Kudinkin <[email protected]>
## Description Shortening the template per @edoakes' feedback. ## Related issues Follow-up to ray-project#57193. ## Additional information Made the following changes: 1. Removed `Types of change` and `Checklist`. 2. Updated contribution guide to point to Ray Docs. 3. Renamed `Additional context` to `Additional information` to be more encompassing. --------- Signed-off-by: Matthew Deng <[email protected]>
This pull request improves the display names for the Download operator, making them more descriptive by including the URI column name. This is a great enhancement for observability. --------- Signed-off-by: kyuds <[email protected]>
Previously we deprecated the runtime_env_info field in Task and Actor definition events; for backwards compatibility, we continued populating these fields in the proto objects. This PR stops populating the field so that it is fully deprecated. Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…ing, HF Transformers, and HF Accelerate (ray-project#57632) This PR adds new release tests for Ray Train v2 compatibility with PyTorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate. Each release test launches a train run on a real cluster. Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Updating release test lockfiles due to a bug in raydepsets that doesn't respect package declarations in configs the way it does requirements. Also adding unsafe ray to ray image building to keep ray out of the release test dependencies. Successful aws_cluster_launcher run: https://buildkite.com/ray-project/release/builds/63731#_ (only manual tests are failing: aws_cluster_launcher_release_image). Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…t#57804) Cleaning up the raylet worker class by pulling worker_interface out into its own file. Modified bazel files and includes. Signed-off-by: joshlee <[email protected]>
Adding a pre-commit and CI semgrep check for Ray Data test files. Semgrep test for a missing pytest.main in python/ray/data/tests/test_arrow_serialization.py: https://buildkite.com/ray-project/microcheck/builds/28269#0199cb94-4aab-4f48-9fad-776f90563d53 Pre-commit local test on test_should_fail.py: <img width="1968" height="1792" alt="image" src="https://github.com/user-attachments/assets/e1a21849-6319-4b80-a964-2b236d5e26d5" /> Signed-off-by: elliot-barn <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>
…ay-project#57809) Before this PR, since the thread that kills the actor is different from the thread that executes ray_send/ray_recv, `ray.get(actors[0].recv.remote(1))` could sometimes execute before the actor was killed. This PR moves that call after `check_logs` to ensure the actor has been killed before executing this line.
Ports over the remaining unit tests that were marked as TODOs from this series of PRs: ray-project#57534, ray-project#57256, ray-project#56868, ray-project#56820, ray-project#56816. Notably: * `test_new_dataset_config -> test_data_integration` * `test_backend -> test_torch_trainer, test_worker_group` * `test_gpu -> test_torch_gpu` This PR also finishes migrating the Tune LightGBM/Keras examples which were unblocked by ray-project#57042 and ray-project#57121. --------- Signed-off-by: Justin Yu <[email protected]>
…t#57782) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]>
…project#57747) Adding base extra test deps cuda (12.8) depset. Successful release test run: https://buildkite.com/ray-project/release/builds/63766 Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
## Why are these changes needed?

Add an ApproximateQuantile aggregator to Ray Data using DataSketches KLL.

Reason:
• Enables efficient support for the summary API.
• More scalable than exact Quantile on large datasets.

Note: DataSketches is not added as a Ray dependency; if missing, users are prompted to install it.

Here's a simple test to show the efficiency difference between `ApproximateQuantile` and `Quantile`:

```py
import time

import ray
import ray.data
from ray.data.aggregate import ApproximateQuantile, Quantile

ray.init(num_cpus=16)

ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5])))
print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds")

ds = ray.data.range(10**8)
start_time = time.time()
print(ds.aggregate(Quantile(on="id", q=0.5)))
print(f"Time taken Quantile: {time.time() - start_time} seconds")
```

In this run with 1e8 rows, the approximate median returned 49,979,428.0 in ~12.46s, while the exact Quantile returned 49,999,999.5 in ~163.33s. The difference reflects the sketch's accuracy trade-off for significant speed and scalability gains. When k=800 (the default), the error rate is guaranteed to be < 0.45%; in this test the error rate is (49,999,999.5 - 49,979,428.0) / 49,999,999.5 = 0.00041143 = 0.041143%, well under the bound, and we get the approximate median **13.11x** faster.

```
{'approx_quantile(id)': [49979428.0]}
Time taken ApproximateQuantile: 12.457247257232666 seconds
{'quantile(id)': 49999999.5}
Time taken Quantile: 163.32705521583557 seconds
```

Signed-off-by: You-Cheng Lin (Owen) <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>
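The quoted error and speedup figures can be checked with a few lines of arithmetic (numbers taken from the run above; the 0.45% bound for k=800 comes from the message, not recomputed from the KLL guarantees):

```python
exact = 49_999_999.5    # exact median from Quantile
approx = 49_979_428.0   # approximate median from ApproximateQuantile

error = abs(exact - approx) / exact
print(f"relative error: {error:.8f} ({error * 100:.6f}%)")
assert error < 0.0045   # within the < 0.45% bound quoted for k=800

speedup = 163.32705521583557 / 12.457247257232666
print(f"speedup: {speedup:.2f}x")
```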
Expanding nested depsets recursively when expanding. Expand currently only allows a single level of depset -> depset expansion; if there is more than a single level (e.g. depset1 -> depset2 -> depset3), depset3 will not contain the requirements of depset1. Signed-off-by: elliot-barn <[email protected]>
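Recursive expansion fixes this by following the chain transitively. A minimal sketch of the idea (the data shapes here are illustrative, not raydepsets' actual config schema):

```python
def expand_depset(name, depsets, _seen=None):
    """Collect a depset's requirements plus those of every depset it
    expands, recursively, so multi-level chains are fully flattened."""
    seen = _seen if _seen is not None else set()
    if name in seen:  # guard against cycles
        return set()
    seen.add(name)
    spec = depsets[name]
    reqs = set(spec.get("requirements", []))
    for child in spec.get("expands", []):
        reqs |= expand_depset(child, depsets, seen)
    return reqs

depsets = {
    "depset1": {"requirements": ["a"], "expands": ["depset2"]},
    "depset2": {"requirements": ["b"], "expands": ["depset3"]},
    "depset3": {"requirements": ["c"]},
}
print(sorted(expand_depset("depset1", depsets)))  # ['a', 'b', 'c']
```

With single-level expansion, depset1 would only pick up `a` and `b`; the recursion is what pulls in `c`.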
…e initialization, etc. (ray-project#57257) Signed-off-by: ahao-anyscale <[email protected]>
so that it does not need to always rebuild ray core binary bits from source Signed-off-by: Lonnie Liu <[email protected]>
Updating the scipy inventory link to reference the forked scipy repo that contains the artifact in a release. Signed-off-by: elliot-barn <[email protected]> Signed-off-by: Lonnie Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…dashboard (ray-project#53370) Signed-off-by: kenmcheng <[email protected]> Co-authored-by: Alan Guo <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
) Signed-off-by: zac <[email protected]> Signed-off-by: Zac Policzer <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]>
…ig` (ray-project#57169) ## Why are these changes needed? The logger pipeline works as log() -> buffer -> flush() -> target (file/console/remote_uri). On shutdown (triggered by logging.config.dictConfig), the MemoryHandler is closed, which sets its output target to None. With the target set to None, the logs in the buffer have nowhere to be flushed, so they're retained in memory and accumulate without bound. Solution: temporarily detach the handlers for the loggers (preserving attributes like `level`, `target`, `filter`, and `formatters`) and restore them after calling `dictConfig()`. Based off of https://github.com/ray-project/ray/pull/48958/files#diff-d5e32f872cf7b123f0899714728ac7fc0916a5609844a7d1bf7317d98094778c (Ray Core - Ray Data log fix). Signed-off-by: Goutam V. <[email protected]> Signed-off-by: Goutam <[email protected]>
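The detach-and-restore idea can be sketched with standard-library logging alone (the helper name is ours, and Ray's actual implementation differs in detail):

```python
import logging
import logging.config

def dict_config_preserving_handlers(config, logger_names):
    """Detach handlers before dictConfig() closes them (which would set a
    MemoryHandler's target to None), then reattach them afterwards so
    buffered records still have somewhere to flush."""
    saved = {name: logging.getLogger(name).handlers[:] for name in logger_names}
    for name in logger_names:
        logging.getLogger(name).handlers = []  # detach: dictConfig() won't close these
    try:
        logging.config.dictConfig(config)
    finally:
        for name, handlers in saved.items():
            logging.getLogger(name).handlers = handlers  # restore

log = logging.getLogger("ray.data")
log.addHandler(logging.NullHandler())
dict_config_preserving_handlers({"version": 1}, ["ray.data"])
print(len(log.handlers))  # 1 -- the handler survived reconfiguration
```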
- Use released image:
  - for test jobs that specify a release version and don't need a custom image
  - for custom image build jobs that can use the released image as base
- Change dependency path so these jobs can skip waiting for the image build

Signed-off-by: kevin <[email protected]>
- add telemetry to track number of replicas using asynchronous inference --------- Signed-off-by: harshit <[email protected]>
…-project#57776) For more details about the resource isolation project see ray-project#54703. This PR moves the driver into the workers cgroup when it registers with the NodeManager. Also updates the tests to reflect this. This now includes changes from ray-project#57800. Signed-off-by: irabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…hon version matching (ray-project#57552) The current default parameter is "patch", and this can be quite restrictive in some use cases where, temporarily, the head machine and the worker machines do not match 100% in Python version at the patch level. We propose making this constraint relaxable via the RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL env var, so that users can set export RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL=minor. Signed-off-by: a.damato <[email protected]> Signed-off-by: Anthony D'Amato <[email protected]> Co-authored-by: a.damato <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
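The relaxed matching can be illustrated with a small sketch (the env var name comes from this PR; the helper itself is hypothetical, not Ray's internal code):

```python
import os

def python_versions_match(head: str, worker: str) -> bool:
    """Compare two version strings up to the level named by
    RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL: 'major', 'minor', or 'patch'."""
    level = os.environ.get("RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL", "patch")
    depth = {"major": 1, "minor": 2, "patch": 3}[level]
    return head.split(".")[:depth] == worker.split(".")[:depth]

os.environ["RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL"] = "minor"
print(python_versions_match("3.10.12", "3.10.14"))  # True under 'minor'

os.environ["RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL"] = "patch"
print(python_versions_match("3.10.12", "3.10.14"))  # False under 'patch' (the old default)
```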
Logs in the controller can be expensive for large clusters; the autoscaling state logs in particular are just noisy. Signed-off-by: abrar <[email protected]>
) This PR updates Ray Train's circular dependency tests to isolate each import statement in the test within a separate subprocess. This is because each import has the side effect of mutating `sys.modules` which can cause other imports to pass when they could have failed in an isolated subprocess. --------- Signed-off-by: JasonLi1909 <[email protected]>
…e functions (ray-project#57810) Ray Tune users need to stop using `ray.train.get_context()`, `ray.train.get_checkpoint()`, and `ray.train.report()`. This PR improves the error messages raised if they try to call these instead of the `ray.tune` counterparts. Note that we've already soft-deprecated this usage with a warning message for 6+ months. Signed-off-by: Justin Yu <[email protected]>
updating tune release tests to run on python 3.10 Successful release test run: https://buildkite.com/ray-project/release/builds/65655 (failing tests are already disabled) --------- Signed-off-by: elliot-barn <[email protected]>
Remove the actor handle from the object that gets passed around in long-poll communication. Returning an actor handle in nested objects from a task makes the caller of that task a borrower from the reference-counting POV. This pattern, although allowed, is not very well tested. Hence we break it by passing actor_name from listen_for_change instead. Signed-off-by: abrar <[email protected]>
## Description The full name was probably hallucinated by an LLM. Signed-off-by: Rui Qiao <[email protected]>
…ross-node parallelism (ray-project#57261) Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nikhil G <[email protected]>
…imit` (ray-project#58303) Fix comment ray-project#58264 (comment). Signed-off-by: You-Cheng Lin <[email protected]>
…egy (ray-project#58306) Signed-off-by: wei-chenglai <[email protected]>
…ist with nixl (ray-project#58263) ## Description For nixl, reuse previous metadata if transferring the same tensor list. This is to avoid repeated `register_memory` before `deregister_memory`. Signed-off-by: Dhyey Shah <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Co-authored-by: Stephanie Wang <[email protected]>
…tuned_examples/`` in ``rllib`` (ray-project#56746) Seventh split of ray-project#56416. Signed-off-by: Gagandeep Singh <[email protected]> Signed-off-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Mark Towers <[email protected]>
…ct#57835) ## Description Builds atop ray-project#58047. This PR ensures the following when `auth_mode` is `token`: calling `ray.init()` (without passing an existing cluster address) checks whether a token is present and, if not, generates one and stores it in the default path; calling `ray.init(address="xyz")` (connecting to an existing cluster) checks whether a token is present and raises an exception if one is not. Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
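The described behavior reduces to a small decision sketch (the path handling and names are illustrative, not Ray's actual auth code):

```python
import os
import secrets

def ensure_token(token_path: str, connecting_to_existing_cluster: bool) -> str:
    """Local init: generate and store a token if absent.
    Connecting to an existing cluster: require one, else raise."""
    if os.path.exists(token_path):
        with open(token_path) as f:
            return f.read().strip()
    if connecting_to_existing_cluster:
        raise RuntimeError(
            "auth_mode is 'token' but no token was found; "
            "obtain the cluster's token before connecting."
        )
    token = secrets.token_hex(16)  # generate and persist for local init
    with open(token_path, "w") as f:
        f.write(token)
    return token
```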
…indefinite waiting (ray-project#58238) ## Description Add guidance for RayService initialization timeout to prevent indefinite waiting with the `ray.io/initializing-timeout` annotation on RayService. ## Related issues Closes ray-project/kuberay#4138 ## Additional information None Signed-off-by: wei-chenglai <[email protected]> Signed-off-by: Wei-Cheng Lai <[email protected]> Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
…check in accelerator_context.py (ray-project#58269) ## Description - Change the caught exception type so that the correct exception is handled when the expected accelerator ID is not included in the accelerator_visible_list: `list.index` raises a [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) if there is no such item (https://docs.python.org/3/tutorial/datastructures.html), not an IndexError. <img width="797" height="79" alt="image" src="https://github.com/user-attachments/assets/830cf4aa-d9cb-44d3-9363-8e5cd576bae9" /> Now, the error logs will be correctly captured and printed. <img width="1454" height="143" alt="image" src="https://github.com/user-attachments/assets/2b0ed0aa-60c7-49c3-84b0-7f0d4f1ebe48" /> Signed-off-by: daiping8 <[email protected]>
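The distinction is easy to demonstrate: `list.index` raises `ValueError` (not `IndexError`) for a missing item, so catching the wrong exception type lets the error escape:

```python
accelerator_visible_list = ["GPU-0", "GPU-1"]

try:
    idx = accelerator_visible_list.index("GPU-9")
except ValueError:  # list.index raises ValueError when the item is absent
    idx = None

print(idx)  # None -- the missing-accelerator case is handled, not crashed
```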
Significant component, I keep forgetting it's buried inside of `common/`. Also cleaned up the mock & proto build targets that were in the top-level `BUILD.bazel` file. We should also at some point clean it up to follow the common pattern of separate targets for an interface, client, & server. --------- Signed-off-by: Edward Oakes <[email protected]>
…t#58070) RayEvent provides a special API, merge, which allows multiple events to be combined into a single event. This reduces gRPC message size and network bandwidth usage, and is essential for scaling task event exports. This PR leverages that feature. Specifically, it clusters events into groups based on (i) entity ID and (ii) event type. Each group is merged into a single event, which is then added to the gRPC message body. The EntityId is a user-defined function, implemented by the event class creator, that determines which events can be safely merged.
```
Note: this is a redo of ray-project#56558, which got reverted because it randomized the order of exported events, leading to flaky tests. This attempt maintains the order even after merging.
```
Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
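The grouping step can be sketched as follows (the event shape and merge rule here are illustrative; RayEvent's actual merge API lives in the C++/proto layer):

```python
from collections import OrderedDict

def merge_events(events):
    """Cluster events by (entity_id, event_type), merge each group into a
    single event, and preserve first-seen group order to keep exports
    deterministic (the fix over the earlier, order-randomizing attempt)."""
    groups = OrderedDict()
    for event in events:
        key = (event["entity_id"], event["type"])
        groups.setdefault(key, []).append(event)
    return [
        {"entity_id": eid, "type": etype,
         "merged": [e["payload"] for e in group]}
        for (eid, etype), group in groups.items()
    ]

events = [
    {"entity_id": "task-1", "type": "STATUS", "payload": "RUNNING"},
    {"entity_id": "task-2", "type": "STATUS", "payload": "RUNNING"},
    {"entity_id": "task-1", "type": "STATUS", "payload": "FINISHED"},
]
print(merge_events(events))  # task-1's two events collapse into one merged event
```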
… in core worker (ray-project#58315) The `metrics_agent_client_` depends on `client_call_manager_`, but previously it was pulling out a reference to it from the core worker, which is not guaranteed to outlive the agent client. Modifying it to keep the `client_call_manager_` as a field of the `core_worker_process` instead. I think we may also need to drain any ongoing RPCs from the `metrics_agent_client_` on shutdown. Leaving that for a future PR. Signed-off-by: Edward Oakes <[email protected]>
## Description Historically, the intention was to avoid failures upon attempts to modify the provided batch in-place when, for example, using Torch tensors. However, that unjustifiably penalizes 99.9% of use cases for 0.1% of scenarios. As such, we're flipping this setting to `zero_copy_batch=True` by default. Signed-off-by: Alexey Kudinkin <[email protected]>
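The trade-off is visible with plain NumPy: zero-copy batches arrive read-only, so the rare code path that mutates in place must opt into a copy itself (a sketch, assuming the read-only flag mirrors the zero-copy behavior):

```python
import numpy as np

batch = np.arange(4)
batch.setflags(write=False)  # simulate a zero-copy (read-only) batch

try:
    batch[0] = 99            # in-place mutation fails on a read-only view
except ValueError:
    batch = batch.copy()     # pay for the copy only when mutation is needed
    batch[0] = 99

print(batch)  # now holds [99, 1, 2, 3]
```

This is why defaulting to zero-copy penalizes almost no one: the 0.1% of mutating callers get a clear error and an easy fix, while everyone else skips an unnecessary copy.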
A few release tests currently use the `raw_metric` API to retrieve Prometheus metrics. This approach is unreliable because it directly polls the metric export endpoint, which can race with the Prometheus server that also polls the same endpoint. To address this, add a method to query metrics directly from the Prometheus server instead. This ensures that in end-to-end and production environments, Prometheus remains the sole poller of exported metrics and the single source of truth for metric values. Test: - CI - The memory metric is collected and the number makes sense: https://buildkite.com/ray-project/release/builds/66031/steps/canvas?sid=019a3215-e3be-4e9b-86a3-1f0a8253fea7#019a3215-e3f2-4d0e-877f-5f43d97d6e8e/557-654 Signed-off-by: Cuong Nguyen <[email protected]>
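Querying the Prometheus server's HTTP API, rather than scraping the export endpoint, looks roughly like this (the URL shape follows the Prometheus HTTP API; the helper names and the metric name are ours):

```python
import json
from urllib.parse import urlencode

def build_query_url(prometheus_base: str, promql: str) -> str:
    """Instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_base}/api/v1/query?{urlencode({'query': promql})}"

def extract_values(response_text: str):
    """Pull the sample values out of an instant-query vector response."""
    payload = json.loads(response_text)
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]

url = build_query_url("http://localhost:9090", 'ray_node_mem_used{SessionName="s1"}')
print(url)

# Shape of a successful instant-query response, per the Prometheus HTTP API:
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"instance": "head"}, "value": [1700000000, "42.0"]}],
    },
})
print(extract_values(sample_response))  # [42.0]
```

Because the test asks the Prometheus server directly, it never races with the server's own scrapes of the export endpoint.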
Extend token auth support to the dashboard head (all APIs). Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
This PR also updates the cluster resource scheduler logic to account for the list of `LabelSelector`s specified by the `fallback_strategy`, falling back to each fallback strategy `LabelSelector` in-order until one is satisfied when selecting the best node. We're able to support fallback selectors by considering them in the cluster resource scheduler in-order using the existing label selector logic in `IsFeasible` and `IsAvailable`, returning the first valid node returned by `GetBestSchedulableNode`. ray-project#51564 --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]>
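In-order fallback selection reduces to a simple loop; a sketch under assumed shapes (Ray's real scheduler works through `IsFeasible`/`IsAvailable` in C++, not this Python):

```python
def get_best_schedulable_node(nodes, label_selectors):
    """Try each fallback LabelSelector in order; return the first node whose
    labels satisfy one, mirroring the first-valid-node behavior described."""
    for selector in label_selectors:
        for node in nodes:
            if all(node["labels"].get(k) == v for k, v in selector.items()):
                return node["id"], selector
    return None, None

nodes = [
    {"id": "node-1", "labels": {"accelerator": "A100"}},
    {"id": "node-2", "labels": {"accelerator": "T4"}},
]
selectors = [{"accelerator": "H100"}, {"accelerator": "T4"}]
print(get_best_schedulable_node(nodes, selectors))  # ('node-2', {'accelerator': 'T4'})
```

The ordering matters: the preferred selector (`H100` here) is exhausted against all nodes before the scheduler falls back to the next one.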
…t Raylet logs (ray-project#58244) `test_gcs_fault_tolerance.py::test_worker_raylet_resubscription` is still flaky in CI despite bumping up the timeout. Making a few improvements here: - Increasing the timeout to `20s` just in case it's a timeout issue (unlikely). - Changing to scheduling an actor instead of using `internal_kv` for our signal that the GCS is back up. This should better indicate that the Raylet is resubscribed. - Cleaning up some system logs. - Modifying the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death. It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR. Signed-off-by: Edward Oakes <[email protected]>
The pull request #666 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5083.
Code Review
This pull request is a large automated merge that includes a significant amount of refactoring across the build system, CI configuration, and dependency management. Key changes include a major overhaul of the Bazel build system, including changing the workspace name to io_ray, breaking down the monolithic root BUILD.bazel file, and modernizing packaging rules. It also features a comprehensive refactoring of the Buildkite CI pipelines, a switch from miniconda to miniforge, and the introduction of a new dependency management tool raydepsets. Additionally, there's an expansion of aarch64 support and standardization on arm64 for macOS CI, along with numerous cleanups and improvements to tests, linting, and documentation. Overall, these changes are a massive step forward for the project's infrastructure. My review focuses on a few specific areas that could be improved or have significant impact.
```diff
@@ -1,4 +1,4 @@
-workspace(name = "com_github_ray_project_ray")
+workspace(name = "io_ray")
```
The workspace name has been changed from com_github_ray_project_ray to io_ray. This is a significant breaking change for any downstream projects that depend on Ray via Bazel. While this is likely intentional as part of the build system overhaul, it's a major change that should be clearly communicated to the community.
```cpp
std::vector<std::pair<std::string_view, std::string>> tags_pair_vec;
tags_pair_vec.reserve(tags.size());
for (const auto &tag : tags) {
  tags_pair_vec.emplace_back(std::string_view(tag.first), tag.second);
}
metric->Record(value, std::move(tags_pair_vec));
```
The Metric::Record function has been changed to construct a std::vector<std::pair<std::string_view, std::string>> from the input std::unordered_map. This involves iterating over the map and creating a new vector for every call. Given that metrics can be recorded frequently in performance-sensitive code, this conversion could introduce a non-trivial overhead.
If possible, consider changing the underlying stats::Metric::Record to accept const std::unordered_map<std::string, std::string> & directly to avoid this conversion. If that's not feasible, this change is acceptable, but the performance implication is worth noting.
```diff
 head_node=${nodes_array[0]}
 head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

 # if we detect a space character in the head node IP, we'll
 # convert it to an ipv4 address. This step is optional.
 if [[ "$head_node_ip" == *" "* ]]; then
     IFS=' ' read -ra ADDR <<<"$head_node_ip"
     if [[ ${#ADDR[0]} -gt 16 ]]; then
         head_node_ip=${ADDR[1]}
     else
         head_node_ip=${ADDR[0]}
     fi
     echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
 fi
 # __doc_head_address_end__

 # __doc_head_ray_start__
 port=6379
-ip_head=$head_node_ip:$port
+ip_head=$head_node:$port
 export ip_head
 echo "IP Head: $ip_head"
 # __doc_head_address_end__
```
The script has been updated to use ray symmetric-run, which is a great simplification. However, the method for determining the head node IP address has changed. The new version uses the hostname directly (ip_head=$head_node:$port), which assumes that the hostname is resolvable to the correct IP address from all worker nodes. This might not be true in all network configurations.
The previous version was more robust as it used hostname --ip-address to explicitly get the IP address. To avoid potential issues in different environments, I recommend reverting to a more explicit IP discovery method.
Suggested change — replace:

```bash
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
    IFS=' ' read -ra ADDR <<<"$head_node_ip"
    if [[ ${#ADDR[0]} -gt 16 ]]; then
        head_node_ip=${ADDR[1]}
    else
        head_node_ip=${ADDR[0]}
    fi
    echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
# __doc_head_address_end__
# __doc_head_ray_start__
port=6379
ip_head=$head_node:$port
export ip_head
echo "IP Head: $ip_head"
# __doc_head_address_end__
```

with:

```bash
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | awk '{print $1}')
port=6379
ip_head=${head_node_ip}:${port}
export ip_head
echo "IP Head: $ip_head"
# __doc_head_address_end__
```
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from the `master` branch into the `main` branch.
📅 Created: 2025-10-31
🔁 Merge direction: `master` → `main`
🤖 Triggered by: Scheduled
Please review and merge if everything looks good.