🔄 daily merge: master → main 2025-10-31 #666

antfin-oss · 2025-10-31T02:55:15Z

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-10-31
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

…RS (ray-project#57801) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

…ject#57770) Allow skipping parameterizing limits in Training ingest benchmark --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Srinath Krishnamachari <[email protected]>

https://anyscale-ray--57756.com.readthedocs.build/en/57756/serve/advanced-guides/advanced-autoscaling.html#application-level-autoscaling --------- Signed-off-by: abrar <[email protected]>

…ray-project#57753)    ## Description This change would allow `Aggregator` actor to execute tasks out of order to avoid head-of-line blocking scenario. Also, made sure that overridden Ray remote args are applied as an overlay on top of default ones. ## Related issues  ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [ ] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes ⚠️ - [ ] No  **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context  --------- Signed-off-by: Alexey Kudinkin <[email protected]>

@edoakes

## Description Shortening the template per @edoakes' feedback. ## Related issues Follow-up to ray-project#57193. ## Additional information Made the following changes: 1. Removed `Types of change` and `Checklist`. 2. Updated contribution guide to point to Ray Docs. 3. Renamed `Additional context` to `Additional information` to be more encompassing. --------- Signed-off-by: Matthew Deng <[email protected]>

This pull request improves the display names for the Download operator, making them more descriptive by including the URI column name. This is a great enhancement for observability. --------- Signed-off-by: kyuds <[email protected]>

Previously we deprecated runtime_env_info field in Task and Actor definition events, for backwards compatibility we continued populating these fields in the proto objects. This pr stops populating the filed so that it is fully deprecated Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>

…ing, HF Transformers, and HF Accelerate (ray-project#57632) This PR adds new release tests to test Ray Train v2 compatibility with Pytorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate. Each release tests launches a train run on a real cluster. --------- Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

updating release test lockfiles due to a bug in raydepsets that doesn't respect package declaration in configs as it does requirements Adding unsafe ray to ray image building to keep ray out of the release test dependencies Successful aws_cluster_launcher run: https://buildkite.com/ray-project/release/builds/63731#_ (only manual tests are failing): aws_cluster_launcher_release_image --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

…t#57804) Cleaning up raylet worker class by pulling out worker_interface into its own file. Modified bazel files and includes. --------- Signed-off-by: joshlee <[email protected]>

adding pre commit and ci semgrep check for data test files semgrep test for missing pytest.main in python/ray/data/tests/test_arrow_serialization.py: https://buildkite.com/ray-project/microcheck/builds/28269#0199cb94-4aab-4f48-9fad-776f90563d53 Pre commit local test test_should_fail.py <img width="1968" height="1792" alt="image" src="https://github.com/user-attachments/assets/e1a21849-6319-4b80-a964-2b236d5e26d5" /> --------- Signed-off-by: elliot-barn <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>

…ay-project#57809) Before this PR, since the thread that kills the actor is different from the thread that executes ray_send/ray_recv, sometimes `ray.get(actors[0].recv.remote(1))` can be executed before the actor is killed. In this PR, move this after `check_logs` to ensure the actor has been killed before executing this line.

Ports over the remaining unit tests that were marked as TODOs from this series of PRs: ray-project#57534, ray-project#57256, ray-project#56868, ray-project#56820, ray-project#56816. Notably: * `test_new_dataset_config -> test_data_integration` * `test_backend -> test_torch_trainer, test_worker_group` * `test_gpu -> test_torch_gpu` This PR also finishes migrating the Tune LightGBM/Keras examples which were unblocked by ray-project#57042 and ray-project#57121. --------- Signed-off-by: Justin Yu <[email protected]>

…t#57782) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]>

…project#57747) adding base extra test deps cuda (12.8) depset successful release test run: https://buildkite.com/ray-project/release/builds/63766 --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

## Why are these changes needed? Add ApproximateQuantile aggregator to Ray Data using DataSketches KLL. Reason: • Enables efficient support for the summary API. • More scalable than exact Quantile on large datasets. Note: • DataSketches is not added as a Ray dependency; if missing, users are prompted to install it. --- Here's a simple test to show the efficiency difference between `ApproximateQuantile` and `Quantile` ```py import ray import ray.data import time ray.init(num_cpus=16) from ray.data.aggregate import ApproximateQuantile, Quantile ds = ray.data.range(10**8) start_time = time.time() print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5]))) print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds") ds = ray.data.range(10**8) start_time = time.time() print(ds.aggregate(Quantile(on="id", q=0.5))) print(f"Time taken Quantile: {time.time() - start_time} seconds") ``` In this run with 1e8 rows, the approximate median returned 49,979,428.0 in ~12.46s, while the exact Quantile returned 49,999,999.5 in ~163.33s. The difference reflects the sketch’s accuracy trade-off for significant speed and scalability gains. When k=800 (the default), we are guaranteed to have the error rate < 0.45% , in this test our error rate is `(49,999,999.5-49,979,428.0)/49,999,999.5`= 0.00041143 = 0.041143% which is < 0.45% , but we get the approximate median **13.11x** faster. ``` {'approx_quantile(id)': [49979428.0]} Time taken ApproximateQuantile: 12.457247257232666 seconds {'quantile(id)': 49999999.5} Time taken Quantile: 163.32705521583557 seconds ```  ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: You-Cheng Lin (Owen) <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>

Expanding nested depsets recursively when expanding. Expand currently only allows for single level depset-> depset expansion if there are more than a single level (ex depset1 -> depset2 -> depset3. Depset3 will not contain the requirements of depset 1) --------- Signed-off-by: elliot-barn <[email protected]>

…e initialization, etc. (ray-project#57257) Signed-off-by: ahao-anyscale <[email protected]>

so that it does not need to always rebuild ray core binary bits from source Signed-off-by: Lonnie Liu <[email protected]>

updating scipy inventory link to referenced forked scipy repo that contains the artifact in a release --------- Signed-off-by: elliot-barn <[email protected]> Signed-off-by: Lonnie Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

…dashboard (ray-project#53370) Signed-off-by: kenmcheng <[email protected]> Co-authored-by: Alan Guo <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>

) Signed-off-by: zac <[email protected]> Signed-off-by: Zac Policzer <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]>

…ig` (ray-project#57169)   ## Why are these changes needed? So the way the logger works is log() -> buffer -> flush() -> target (file/console/remote_uri) On shutdown (triggered by logging.config.dictConfig), the MemoryHandler is closed which sets the output target to None When the target is set to None, the logs in the buffer have nowhere to be flushed, so they're retained in memory, hence it accumulates to no end. Solution: Temporarily detach the handlers for loggers (preserve attributes like `level`, `target`, `filter` and `formatters`) and restore them after calling `dictConfig()` Based off of https://github.com/ray-project/ray/pull/48958/files#diff-d5e32f872cf7b123f0899714728ac7fc0916a5609844a7d1bf7317d98094778c (Ray Core - Ray Data log fix) ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Goutam V. <[email protected]> Signed-off-by: Goutam <[email protected]>

- Use released image: - for test jobs that specify release version and don't need custom img - for custom image build jobs that can use released image as base - Change dependency path so these jobs can skip waiting for image build --------- Signed-off-by: kevin <[email protected]>

- add telemetry to track number of replicas using asynchronous inference --------- Signed-off-by: harshit <[email protected]>

…-project#57776) For more details about the resource isolation project see ray-project#54703. This PR moves the driver into the workers cgroup when it registers with the NodeManager. Also updates the tests to reflect this. This now includes changes from ray-project#57800. --------- Signed-off-by: irabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>

…hon version matching (ray-project#57552) Current default parameter is "patch" and this may be quite restrictive in some uses cases where temporarily the head machine and the worker machines are not 100% matching in Python version at the patch level. We propose here to be able to relax this constraint via the RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL env var so that users can set export RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL=minor. --------- Signed-off-by: a.damato <[email protected]> Signed-off-by: Anthony D'Amato <[email protected]> Co-authored-by: a.damato <[email protected]> Co-authored-by: Edward Oakes <[email protected]>

logs in controller can be expensive for large cluster logs in autoscaling state and just noisy --------- Signed-off-by: abrar <[email protected]>

) This PR updates Ray Train's circular dependency tests to isolate each import statement in the test within a separate subprocess. This is because each import has the side effect of mutating `sys.modules` which can cause other imports to pass when they could have failed in an isolated subprocess. --------- Signed-off-by: JasonLi1909 <[email protected]>

…e functions (ray-project#57810) Ray Tune users need to stop using `ray.train.get_context()`, `ray.train.get_checkpoint()` and `ray.train.report`. This PR improves the error messages raise if they try to call these instead of the `ray.tune` counterparts. Note that we've already soft-deprecated this usage with a warning message for 6+ months. --------- Signed-off-by: Justin Yu <[email protected]>

updating tune release tests to run on python 3.10 Successful release test run: https://buildkite.com/ray-project/release/builds/65655 (failing tests are already disabled) --------- Signed-off-by: elliot-barn <[email protected]>

Remove actor handle from object that get's passed around in long poll communication. Return actor handle in nested objects from the task make the caller of this task a borrower from the reference counting POV. But this pattern, although allowed, is not very well tested. Hence breaking it by passing actor_name from listen_for_change instead. --------- Signed-off-by: abrar <[email protected]>

## Description The full name was probably hallucinated from LLM. ## Related issues ## Additional information Signed-off-by: Rui Qiao <[email protected]>

…ross-node parallelism (ray-project#57261) Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nikhil G <[email protected]>

…imit` (ray-project#58303) ## Description ## Related issues Fix comment ray-project#58264 (comment) ## Additional information Signed-off-by: You-Cheng Lin <[email protected]>

…egy (ray-project#58306) Signed-off-by: wei-chenglai <[email protected]>

…ist with nixl (ray-project#58263) ## Description For nixl, reuse previous metadata if transferring the same tensor list. This is to avoid repeated `register_memory` before `deregister_memory` --------- Signed-off-by: Dhyey Shah <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Co-authored-by: Stephanie Wang <[email protected]>

…tuned_examples/`` in ``rllib`` (ray-project#56746)   ## Why are these changes needed?  Seventh split of ray-project#56416 ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Gagandeep Singh <[email protected]> Signed-off-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Mark Towers <[email protected]>

…ct#57835) ## Description builds atop of ray-project#58047, this pr ensures the following when `auth_mode` is `token`: calling `ray.init() `(without passing an existing cluster address) -> check if token is present, generate and store in default path if not present calling `ray.init(address="xyz")` (connecting to an existing cluster) -> check if token is present, raise exception if one is not present --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>

…indefinite waiting (ray-project#58238) ## Description Add guidance for RayService initialization timeout to prevent indefinite waiting with `ray.io/initializing-timeout` annotation on RayService. ## Related issues Closes ray-project/kuberay#4138 ## Additional information None --------- Signed-off-by: wei-chenglai <[email protected]> Signed-off-by: Wei-Cheng Lai <[email protected]> Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>

…check in accelerator_context.py (ray-project#58269) ## Description - Change the caught exception type from IndexError to TypeError - This modification ensures that the correct exception is raised when the expected accelerator ID is not included in the accelerator_visible_list `list.index` will raise a [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) if there is no such item. https://docs.python.org/3/tutorial/datastructures.html <img width="797" height="79" alt="image" src="https://github.com/user-attachments/assets/830cf4aa-d9cb-44d3-9363-8e5cd576bae9" /> Now, the error logs will be correctly captured and printed. <img width="1454" height="143" alt="image" src="https://github.com/user-attachments/assets/2b0ed0aa-60c7-49c3-84b0-7f0d4f1ebe48" /> Signed-off-by: daiping8 <[email protected]>

Significant component, I keep forgetting it's buried inside of `common/`. Also cleaned up the mock & proto build targets that were in the top-level `BUILD.bazel` file. We should also at some point clean it up to follow the common pattern of separate targets for an interface, client, & server. --------- Signed-off-by: Edward Oakes <[email protected]>

…t#58070) RayEvent provides a special API, merge, which allows multiple events to be combined into a single event. This reduces gRPC message size, network bandwidth usage, and is essential for scaling task event exports. This PR leverages that feature. Specifically, it clusters events into groups based on (i) entity ID and (ii) event type. Each group is merged into a single event, which is then added to the gRPC message body. The EntityId is a user-defined function, implemented by the event class creator, that determines which events can be safely merged. ``` Note: this is a redo of ray-project#56558 which gets converted because it randomize the order the events that get exported, lead to flaky tests etc. This attempt maintain the order even after merging. ``` Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>

… in core worker (ray-project#58315) The `metrics_agent_client_` depends on `client_call_manager_`, but previously it was pulling out a reference to it from the core worker, which is not guaranteed to outlive the agent client. Modifying it to keep the `client_call_manager_` as a field of the `core_worker_process` instead. I think we may also need to drain any ongoing RPCs from the `metrics_agent_client_` on shutdown. Leaving that for a future PR. --------- Signed-off-by: Edward Oakes <[email protected]>

## Description Historically, the intention was to avoid failures upon attempts to modify provided batch in-place when, for ex, using Torch tensors. However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of scenarios. As such, we're flipping this setting to be `zero_copy_batch=True` by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <[email protected]>

A few release tests currently use the `raw_metric` API to retrieve Prometheus metrics. This approach is unreliable because it directly polls the metric export endpoint, which can race with the Prometheus server that also polls the same endpoint. To address this, add a method to query metrics directly from the Prometheus server instead. This ensures that in end-to-end and production environments, Prometheus remains the sole poller of exported metrics and the single source of truth for metric values. Test: - CI - The memory metric is collected and the number makes sense: https://buildkite.com/ray-project/release/builds/66031/steps/canvas?sid=019a3215-e3be-4e9b-86a3-1f0a8253fea7#019a3215-e3f2-4d0e-877f-5f43d97d6e8e/557-654 Signed-off-by: Cuong Nguyen <[email protected]>

Extend token auth support to dashboard head (all API's) --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>

This PR also updates the cluster resource scheduler logic to account for the list of `LabelSelector`s specified by the `fallback_strategy`, falling back to each fallback strategy `LabelSelector` in-order until one is satisfied when selecting the best node. We're able to support fallback selectors by considering them in the cluster resource scheduler in-order using the existing label selector logic in `IsFeasible` and `IsAvailable`, returning the first valid node returned by `GetBestSchedulableNode`. ray-project#51564 --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]>

…t Raylet logs (ray-project#58244) `test_gcs_fault_tolerance.py:: test_worker_raylet_resubscription` is still flaky in CI despite bumping up the timeout. Making a few improvements here: - Increasing the timeout to `20s` just in case it's a timeout issue (unlikely). - Changing to scheduling an actor instead of using `internal_kv` for our signal that the GCS is back up. This should better indicate that the Raylet is resubscribed. - Cleaning up some system logs. - Modifying the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death. It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR. --------- Signed-off-by: Edward Oakes <[email protected]>

sourcery-ai

The pull request #666 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5083.

gemini-code-assist · 2025-10-31T03:03:55Z

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest changes from the master branch into main, encompassing a broad range of updates to the project's build system, CI/CD pipelines, and development tooling. The changes aim to enhance build reliability, improve developer experience through updated linting and dependency management, and ensure compatibility across various platforms and Python versions. The documentation infrastructure has also been significantly revised to align with new standards and tools.

Highlights

Bazel Configuration Updates: The .bazelrc file has been updated to make --incompatible_strict_action_env a global default, add workspace_status_command for Linux, include /utf-8 cxxopt for Windows, and ignore warnings for third-party files. It also introduces test_env=PATH for CI Python tests and allows custom user options via .user.bazelrc. The BUILD.bazel file now uses pkg_files and pkg_zip for packaging, and introduces Python binaries for generating packages. The WORKSPACE file has been updated to io_ray, includes rules_java, and adds uv and platform-specific redis binaries. Various bazel/*.bzl files have been updated for consistency with the new workspace name, updated dependencies, and new Python macros.
CI/CD Pipeline Refinements (Buildkite): Significant changes have been made to the Buildkite pipeline configurations. A new _images.rayci.yml file defines various image build steps, including extra variants. Python 3.10 has been added to several matrix builds, and cibase tags are now widely used. Dedicated build steps for ray-core, ray-dashboard, and ray-java have been introduced. Test configurations have been streamlined, with cgraph python tests and workflow tests removed, and dask & modin tests split. The lint.rayci.yml now includes semgrep_lint and adjusts docbuild environments. Automated dependency compilation is now handled by a new dependencies.rayci.yml file and raydepsets tool. macOS builds have been consolidated to macos-arm64, and RAYCI_DISABLE_TEST_DB is now used in release automation. Dockerfiles for forge and manylinux have been updated to use miniforge3 and uv, and include Node.js and Azure CLI installations.
Codebase and Documentation Updates: The .github/CODEOWNERS file has been simplified and updated to reflect new team responsibilities and file structures. The .github/PULL_REQUEST_TEMPLATE.md has been revised for clarity. .gitignore now includes new entries for build artifacts and user configurations. The .pre-commit-config.yaml has been updated with new pydoclint and semgrep hooks, and expanded buildifier and cpplint coverage. The .rayciversion has been incremented. The .readthedocs.yaml now points to a specific lock file for documentation dependencies. .vale configuration files have been updated with new rules and vocabulary. C++ code has seen updates to actor_creator.h, actor_task_caller.h, metric.h, ray_runtime.h, task_caller.h, config_internal.cc, abstract_ray_runtime.cc, metric.cc, native_ray_runtime.cc, local_mode_object_store.cc, runtime_env.cc, local_mode_task_submitter.cc, native_task_submitter.cc, task_executor.cc, api_test.cc, cluster_mode_test.cc, counter.cc, simple_kv_store.cc, process_helper.cc, process_helper.h, util.cc, and util.h for consistency, bug fixes, and improved network utility. The cpp/test_submit_cpp_job.py has been updated to remove _node_ip_address from _ray_start.

Ignored Files

Ignored by pattern: .gemini/** (1)
- .gemini/config.yaml
Ignored by pattern: .github/workflows/** (1)
- .github/workflows/stale_pull_request.yaml

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request is a large automated merge that includes a significant amount of refactoring across the build system, CI configuration, and dependency management. Key changes include a major overhaul of the Bazel build system, including changing the workspace name to io_ray, breaking down the monolithic root BUILD.bazel file, and modernizing packaging rules. It also features a comprehensive refactoring of the Buildkite CI pipelines, a switch from miniconda to miniforge, and the introduction of a new dependency management tool raydepsets. Additionally, there's an expansion of aarch64 support and standardization on arm64 for macOS CI, along with numerous cleanups and improvements to tests, linting, and documentation. Overall, these changes are a massive step forward for the project's infrastructure. My review focuses on a few specific areas that could be improved or have significant impact.

gemini-code-assist · 2025-10-31T03:08:06Z

WORKSPACE

@@ -1,4 +1,4 @@
-workspace(name = "com_github_ray_project_ray")
+workspace(name = "io_ray")


The workspace name has been changed from com_github_ray_project_ray to io_ray. This is a significant breaking change for any downstream projects that depend on Ray via Bazel. While this is likely intentional as part of the build system overhaul, it's a major change that should be clearly communicated to the community.

gemini-code-assist · 2025-10-31T03:08:06Z

cpp/src/ray/runtime/metric/metric.cc

+  std::vector<std::pair<std::string_view, std::string>> tags_pair_vec;
+  tags_pair_vec.reserve(tags.size());
+  for (const auto &tag : tags) {
+    tags_pair_vec.emplace_back(std::string_view(tag.first), tag.second);
+  }
+  metric->Record(value, std::move(tags_pair_vec));


The Metric::Record function has been changed to construct a std::vector<std::pair<std::string_view, std::string>> from the input std::unordered_map. This involves iterating over the map and creating a new vector for every call. Given that metrics can be recorded frequently in performance-sensitive code, this conversion could introduce a non-trivial overhead.

If possible, consider changing the underlying stats::Metric::Record to accept const std::unordered_map<std::string, std::string> & directly to avoid this conversion. If that's not feasible, this change is acceptable, but the performance implication is worth noting.

gemini-code-assist · 2025-10-31T03:08:06Z

doc/source/cluster/doc_code/slurm-basic.sh

 head_node=${nodes_array[0]}
-head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

-# if we detect a space character in the head node IP, we'll
-# convert it to an ipv4 address. This step is optional.
-if [[ "$head_node_ip" == *" "* ]]; then
-IFS=' ' read -ra ADDR <<<"$head_node_ip"
-if [[ ${#ADDR[0]} -gt 16 ]]; then
-  head_node_ip=${ADDR[1]}
-else
-  head_node_ip=${ADDR[0]}
-fi
-echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
-fi
-# __doc_head_address_end__
-
-# __doc_head_ray_start__
 port=6379
-ip_head=$head_node_ip:$port
+ip_head=$head_node:$port
 export ip_head
 echo "IP Head: $ip_head"
+# __doc_head_address_end__


The script has been updated to use ray symmetric-run, which is a great simplification. However, the method for determining the head node IP address has changed. The new version uses the hostname directly (ip_head=$head_node:$port), which assumes that the hostname is resolvable to the correct IP address from all worker nodes. This might not be true in all network configurations.

The previous version was more robust as it used hostname --ip-address to explicitly get the IP address. To avoid potential issues in different environments, I recommend reverting to a more explicit IP discovery method.

Suggested change

head_node=${nodes_array[0]}

head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll

# convert it to an ipv4 address. This step is optional.

if [[ "$head_node_ip" == *" "* ]]; then

IFS=' ' read -ra ADDR <<<"$head_node_ip"

if [[ ${#ADDR[0]} -gt 16 ]]; then

head_node_ip=${ADDR[1]}

else

head_node_ip=${ADDR[0]}

fi

echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"

fi

# __doc_head_address_end__

# __doc_head_ray_start__

port=6379

ip_head=$head_node_ip:$port

ip_head=$head_node:$port

export ip_head

echo "IP Head: $ip_head"

# __doc_head_address_end__

head_node=${nodes_array[0]}

head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | awk '{print $1}')

port=6379

ip_head=${head_node_ip}:${port}

export ip_head

echo "IP Head: $ip_head"

# __doc_head_address_end__

github-actions · 2025-11-15T01:40:57Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

kouroshHakha and others added 30 commits October 16, 2025 18:08

[docs] Exclude /doc/source/serve/llm/ from ray-serve team in CODEOWNE…

9a434c7

…RS (ray-project#57801) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

[Train] Support skipping limits in Training ingest benchmark (ray-pro…

c0ed5a9

…ject#57770) Allow skipping parameterizing limits in Training ingest benchmark --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Srinath Krishnamachari <[email protected]>

add ray io docs for app level autoscaling (ray-project#57756)

10bc0bf

https://anyscale-ray--57756.com.readthedocs.build/en/57756/serve/advanced-guides/advanced-autoscaling.html#application-level-autoscaling --------- Signed-off-by: abrar <[email protected]>

[core] Pull out raylet worker interface into its own file (ray-projec…

5510a8c

…t#57804) Cleaning up raylet worker class by pulling out worker_interface into its own file. Modified bazel files and includes. --------- Signed-off-by: joshlee <[email protected]>

[docs] Convert serve/llm/quick-start.rst to MyST Markdown (ray-projec…

ef08695

…t#57782) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]>

[serve][llm] Introduce callback API for customizing things like engin…

4e5ca6a

…e initialization, etc. (ray-project#57257) Signed-off-by: ahao-anyscale <[email protected]>

[ci] use prebuilt core bits for doc/api checks (ray-project#57779)

968b270

so that it does not need to always rebuild ray core binary bits from source Signed-off-by: Lonnie Liu <[email protected]>

Filter out ANSI escape codes from logs when retrieving logs from the …

7d60a3d

…dashboard (ray-project#53370) Signed-off-by: kenmcheng <[email protected]> Co-authored-by: Alan Guo <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>

[core] Remove RAY_CHECK from GetAndPinArgsForExecutor (ray-project#56866

386bf59

) Signed-off-by: zac <[email protected]> Signed-off-by: Zac Policzer <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]>

add telemetry usage for async inference (ray-project#57665)

61a4e7e

- add telemetry to track number of replicas using asynchronous inference --------- Signed-off-by: harshit <[email protected]>

remove expensive debug logs from controller (ray-project#57813)

d84d0fd

logs in controller can be expensive for large cluster logs in autoscaling state and just noisy --------- Signed-off-by: abrar <[email protected]>

elliot-barn and others added 20 commits October 29, 2025 16:13

[LLM] Correct NIXL in docs (ray-project#58258)

3cd8202

## Description The full name was probably hallucinated from LLM. ## Related issues ## Additional information Signed-off-by: Rui Qiao <[email protected]>

[Data] Add logging to limit pushdown when map `min_rows > limit_op._l…

fa69de8

…imit` (ray-project#58303) ## Description ## Related issues Fix comment ray-project#58264 (comment) ## Additional information Signed-off-by: You-Cheng Lin <[email protected]>

[Docs] Update RayJob documentation to introduce the New DeletionStrat…

3fcb2a2

…egy (ray-project#58306) Signed-off-by: wei-chenglai <[email protected]>

[core]move python_callbacks to common (ray-project#57909)

d7b6f1e

antfin-oss requested review from SongGuyang and kfstorm as code owners October 31, 2025 02:55

antfin-oss added auto-generated daily-merge labels Oct 31, 2025

antfin-oss assigned ffbin Oct 31, 2025

sourcery-ai bot reviewed Oct 31, 2025

View reviewed changes

gemini-code-assist bot reviewed Oct 31, 2025

View reviewed changes

github-actions bot added the stale label Nov 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

🔄 daily merge: master → main 2025-10-31 #666

🔄 daily merge: master → main 2025-10-31 #666

Uh oh!

antfin-oss commented Oct 31, 2025

Uh oh!

sourcery-ai bot left a comment

Uh oh!

gemini-code-assist bot commented Oct 31, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Oct 31, 2025

Uh oh!

gemini-code-assist bot Oct 31, 2025

Uh oh!

gemini-code-assist bot Oct 31, 2025

Uh oh!

github-actions bot commented Nov 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

76 participants

		@@ -1,4 +1,4 @@
		workspace(name = "com_github_ray_project_ray")
		workspace(name = "io_ray")

🔄 daily merge: master → main 2025-10-31 #666

Are you sure you want to change the base?

🔄 daily merge: master → main 2025-10-31 #666

Uh oh!

Conversation

antfin-oss commented Oct 31, 2025

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot commented Oct 31, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

76 participants