Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

πŸ“… Created: 2025-10-31
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

kouroshHakha and others added 30 commits October 16, 2025 18:08
…ject#57770)

Allow skipping parameterizing limits in Training ingest benchmark

---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
…ray-project#57753)

<!-- Thank you for contributing to Ray! πŸš€ -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- πŸ’‘ Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

This change allows the `Aggregator` actor to execute tasks out of order, avoiding head-of-line blocking.

Also made sure that overridden Ray remote args are applied as an overlay
on top of the default ones.
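The overlay behavior can be sketched as a plain dict merge (a hedged illustration; `DEFAULT_REMOTE_ARGS` and the helper name are hypothetical, not Ray's actual internals):

```python
# Hedged sketch of an args overlay: user overrides win, and any default
# the user did not override is preserved.
DEFAULT_REMOTE_ARGS = {"num_cpus": 1, "max_restarts": 0}

def apply_remote_args_overlay(overrides=None):
    # Later keys win in a dict merge, so overrides shadow defaults.
    return {**DEFAULT_REMOTE_ARGS, **(overrides or {})}
```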

## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix πŸ›
- [ ] New feature ✨
- [ ] Enhancement πŸš€
- [ ] Code refactoring πŸ”§
- [ ] Documentation update πŸ“–
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [ ] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [ ] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [ ] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Description
Shortening the template per @edoakes' feedback.

## Related issues
Follow-up to ray-project#57193.

## Additional information
Made the following changes:
1. Removed `Types of change` and `Checklist`.
2. Updated contribution guide to point to Ray Docs.
3. Renamed `Additional context` to `Additional information` to be more
encompassing.

---------

Signed-off-by: Matthew Deng <[email protected]>
This pull request improves the display names for the Download operator, making them more descriptive by including the URI column name. This is a great enhancement for observability.

---------

Signed-off-by: kyuds <[email protected]>
Previously we deprecated the runtime_env_info field in Task and Actor definition events; for backwards compatibility we continued populating these fields in the proto objects. This PR stops populating the field so that it is fully deprecated.

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…ing, HF Transformers, and HF Accelerate (ray-project#57632)

This PR adds new release tests to test Ray Train v2 compatibility with
Pytorch Lightning, Hugging Face Transformers, and Hugging Face
Accelerate. Each release test launches a train run on a real cluster.

---------

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Updating release test lockfiles due to a bug in raydepsets that doesn't
respect package declarations in configs the way it does requirements.

Adding unsafe ray to ray image building to keep ray out of the release
test dependencies

Successful aws_cluster_launcher run:
https://buildkite.com/ray-project/release/builds/63731#_
(only manual tests are failing):
aws_cluster_launcher_release_image

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…t#57804)

Cleaning up raylet worker class by pulling out worker_interface into its
own file. Modified bazel files and includes.

---------

Signed-off-by: joshlee <[email protected]>
Adding pre-commit and CI semgrep checks for data test files.

semgrep test for missing pytest.main in
python/ray/data/tests/test_arrow_serialization.py:
https://buildkite.com/ray-project/microcheck/builds/28269#0199cb94-4aab-4f48-9fad-776f90563d53

Pre-commit local test of test_should_fail.py:
<img width="1968" height="1792" alt="image"
src="https://github.com/user-attachments/assets/e1a21849-6319-4b80-a964-2b236d5e26d5"
/>

---------

Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…ay-project#57809)

Before this PR, since the thread that kills the actor is different from
the thread that executes ray_send/ray_recv, sometimes
`ray.get(actors[0].recv.remote(1))` can be executed before the actor is
killed. In this PR, move this after `check_logs` to ensure the actor has
been killed before executing this line.
Ports over the remaining unit tests that were marked as TODOs from this
series of PRs: ray-project#57534, ray-project#57256, ray-project#56868, ray-project#56820, ray-project#56816.

Notably:
* `test_new_dataset_config -> test_data_integration`
* `test_backend -> test_torch_trainer, test_worker_group`
* `test_gpu -> test_torch_gpu`

This PR also finishes migrating the Tune LightGBM/Keras examples which
were unblocked by ray-project#57042 and
ray-project#57121.

---------

Signed-off-by: Justin Yu <[email protected]>
…project#57747)

adding base extra test deps cuda (12.8) depset


successful release test run:
https://buildkite.com/ray-project/release/builds/63766

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
Add ApproximateQuantile aggregator to Ray Data using DataSketches KLL.

Reason:
 β€’ Enables efficient support for the summary API.
 β€’ More scalable than exact Quantile on large datasets.

Note:
β€’ DataSketches is not added as a Ray dependency; if missing, users are
prompted to install it.

---

Here's a simple test to show the efficiency difference between
`ApproximateQuantile` and `Quantile`
```py
import ray
import ray.data
import time

ray.init(num_cpus=16)
from ray.data.aggregate import ApproximateQuantile, Quantile

ds = ray.data.range(10**8)
start_time = time.time()

print(ds.aggregate(ApproximateQuantile(on="id", quantiles=[0.5])))
print(f"Time taken ApproximateQuantile: {time.time() - start_time} seconds")

ds = ray.data.range(10**8)
start_time = time.time()

print(ds.aggregate(Quantile(on="id", q=0.5)))
print(f"Time taken Quantile: {time.time() - start_time} seconds")
```
In this run with 1e8 rows, the approximate median returned 49,979,428.0
in ~12.46s, while the exact Quantile returned 49,999,999.5 in ~163.33s.
The difference reflects the sketch’s accuracy trade-off for significant
speed and scalability gains.

When k=800 (the default), we are guaranteed to have the error rate <
0.45% , in this test our error rate is
`(49,999,999.5-49,979,428.0)/49,999,999.5`= 0.00041143 = 0.041143% which
is < 0.45% , but we get the approximate median **13.11x** faster.
```
{'approx_quantile(id)': [49979428.0]}
Time taken ApproximateQuantile: 12.457247257232666 seconds
{'quantile(id)': 49999999.5}
Time taken Quantile: 163.32705521583557 seconds
```
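As a standalone sanity check of the error-rate arithmetic quoted above (the numbers come from the run in this description, not from re-running the benchmark):

```python
# Verify the quoted error rate falls within the documented <0.45% bound
# for the default sketch parameter k=800.
exact = 49_999_999.5
approx = 49_979_428.0
error = (exact - approx) / exact
assert error < 0.0045
print(f"{error:.6%}")  # prints 0.041143%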
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Expand nested depsets recursively when expanding.
Currently, expansion only handles a single level of depset -> depset
expansion; with more than one level (e.g. depset1 -> depset2 -> depset3),
depset3 will not contain the requirements of depset1.
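The recursive expansion can be sketched as a transitive closure (a hedged illustration with a hypothetical data model; `depends_on` maps a depset to the depsets it expands, `requirements` maps a depset to its own requirements):

```python
# Hedged sketch of transitive depset expansion: collect requirements from
# every reachable depset, not just direct children.
def expand(name, depends_on, requirements, _seen=None):
    seen = _seen if _seen is not None else set()
    if name in seen:
        return set()  # cycle guard
    seen.add(name)
    reqs = set(requirements.get(name, ()))
    for child in depends_on.get(name, ()):
        reqs |= expand(child, depends_on, requirements, seen)
    return reqs
```

With a chain depset3 -> depset2 -> depset1, a single-level expansion would miss depset1's requirements; the recursive version collects all of them.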

---------

Signed-off-by: elliot-barn <[email protected]>
so that it does not need to always rebuild ray core binary bits from
source

Signed-off-by: Lonnie Liu <[email protected]>
Updating the scipy inventory link to reference the forked scipy repo that
contains the artifact in a release.

---------

Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…dashboard (ray-project#53370)

Signed-off-by: kenmcheng <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
)

Signed-off-by: zac <[email protected]>
Signed-off-by: Zac Policzer <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
…ig` (ray-project#57169)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

The logger works as follows: log() -> buffer -> flush() -> target
(file/console/remote_uri).
On shutdown (triggered by logging.config.dictConfig), the MemoryHandler
is closed, which sets the output target to None.
When the target is None, the logs in the buffer have nowhere to be
flushed, so they're retained in memory and accumulate without bound.

Solution:
Temporarily detach the handlers for loggers (preserve attributes like
`level`, `target`, `filter` and `formatters`) and restore them after
calling `dictConfig()`
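The detach-and-restore idea can be sketched as a small context manager (a hedged illustration of the approach described above, not Ray's actual implementation):

```python
import logging
import logging.config
from contextlib import contextmanager

# Detach a logger's handlers so dictConfig() cannot close them (which
# would drop buffered records), then reattach the same handler objects,
# with their level, filters, and formatter intact.
@contextmanager
def preserve_handlers(logger):
    saved = list(logger.handlers)
    for handler in saved:
        logger.removeHandler(handler)
    try:
        yield
    finally:
        for handler in saved:
            logger.addHandler(handler)
```

Usage would look like `with preserve_handlers(logging.getLogger("ray.data")): logging.config.dictConfig(new_config)`, where `"ray.data"` and `new_config` are placeholders.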

Based off of
https://github.com/ray-project/ray/pull/48958/files#diff-d5e32f872cf7b123f0899714728ac7fc0916a5609844a7d1bf7317d98094778c
(Ray Core - Ray Data log fix)


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Goutam <[email protected]>
- Use released image:
  - for test jobs that specify release version and don't need custom img
  - for custom image build jobs that can use released image as base
- Change dependency path so these jobs can skip waiting for image build

---------

Signed-off-by: kevin <[email protected]>
- add telemetry to track number of replicas using asynchronous inference

---------

Signed-off-by: harshit <[email protected]>
…-project#57776)

For more details about the resource isolation project see
ray-project#54703.

This PR moves the driver into the workers cgroup when it registers with
the NodeManager. Also updates the tests to reflect this.

This now includes changes from ray-project#57800.

---------

Signed-off-by: irabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…hon version matching (ray-project#57552)

Current default parameter is "patch" and this may be quite restrictive
in some uses cases where temporarily the head machine and the worker
machines are not 100% matching in Python version at the patch level.

We propose here to be able to relax this constraint via the
RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL env var so that users can set
export RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL=minor.
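The proposed relaxation amounts to truncating the version comparison at a configurable granularity; a hedged sketch (illustrative helper, not Ray's actual implementation):

```python
import os

# Compare Python versions at the granularity selected by the env var
# described above: "major", "minor", or "patch" (the current default).
_LEVELS = {"major": 1, "minor": 2, "patch": 3}

def versions_match(head, worker):
    level = os.environ.get("RAY_DEFAULT_PYTHON_VERSION_MATCH_LEVEL", "patch")
    n = _LEVELS[level]
    return head.split(".")[:n] == worker.split(".")[:n]
```

With `minor`, heads and workers on 3.10.12 and 3.10.14 would be considered matching; with the default `patch`, they would not.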

---------

Signed-off-by: a.damato <[email protected]>
Signed-off-by: Anthony D'Amato <[email protected]>
Co-authored-by: a.damato <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Logs in the controller can be expensive for a large cluster, and the
autoscaling-state logs are just noisy.

---------

Signed-off-by: abrar <[email protected]>
)

This PR updates Ray Train's circular dependency tests to isolate each
import statement in the test within a separate subprocess. This is
because each import has the side effect of mutating `sys.modules` which
can cause other imports to pass when they could have failed in an
isolated subprocess.

---------

Signed-off-by: JasonLi1909 <[email protected]>
…e functions (ray-project#57810)

Ray Tune users need to stop using `ray.train.get_context()`,
`ray.train.get_checkpoint()` and `ray.train.report`. This PR improves
the error messages raise if they try to call these instead of the
`ray.tune` counterparts. Note that we've already soft-deprecated this
usage with a warning message for 6+ months.

---------

Signed-off-by: Justin Yu <[email protected]>
elliot-barn and others added 20 commits October 29, 2025 16:13
updating tune release tests to run on python 3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65655
(failing tests are already disabled)

---------

Signed-off-by: elliot-barn <[email protected]>
Remove the actor handle from the object that gets passed around in long-poll
communication.

Returning an actor handle in nested objects from a task makes the caller of
that task a borrower from the reference-counting POV. This pattern,
although allowed, is not very well tested, hence we break it by passing
actor_name from listen_for_change instead.

---------

Signed-off-by: abrar <[email protected]>
## Description

The full name was probably hallucinated by an LLM.

## Related issues

## Additional information

Signed-off-by: Rui Qiao <[email protected]>
…ross-node parallelism (ray-project#57261)

Signed-off-by: jeffreyjeffreywang <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: jeffreyjeffreywang <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nikhil G <[email protected]>
…imit` (ray-project#58303)

## Description

## Related issues

Fix comment
ray-project#58264 (comment)
## Additional information

Signed-off-by: You-Cheng Lin <[email protected]>
…ist with nixl (ray-project#58263)

## Description
For nixl, reuse the previous metadata when transferring the same tensor list.
This avoids repeated `register_memory` calls before `deregister_memory`.
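The reuse pattern is essentially memoization keyed by the tensors' identity; a hedged sketch (`MetadataCache` and `register` are illustrative stand-ins, not the nixl API):

```python
# Cache registration metadata keyed by the identity of the tensors in the
# list, so transferring the same list again skips a register/deregister
# round-trip. `register` stands in for the real registration call.
class MetadataCache:
    def __init__(self, register):
        self._register = register
        self._cache = {}

    def get(self, tensors):
        key = tuple(id(t) for t in tensors)
        if key not in self._cache:
            self._cache[key] = self._register(tensors)
        return self._cache[key]
```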

---------

Signed-off-by: Dhyey Shah <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
…tuned_examples/`` in ``rllib`` (ray-project#56746)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

Seventh split of ray-project#56416

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Gagandeep Singh <[email protected]>
Signed-off-by: Kamil Kaczmarek <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
…ct#57835)

## Description
Builds atop ray-project#58047. This PR ensures the following when `auth_mode` is `token`:
- Calling `ray.init()` (without passing an existing cluster address): check if a token is present; if not, generate one and store it in the default path.
- Calling `ray.init(address="xyz")` (connecting to an existing cluster): check if a token is present; raise an exception if it is not.
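The described behavior can be sketched as a small helper (a hedged illustration; the function name, error message, and path handling are hypothetical, not Ray's actual implementation):

```python
import secrets
from pathlib import Path

# If a token exists on disk, use it. Otherwise: starting a new cluster
# generates and stores one; connecting to an existing cluster fails fast.
def ensure_token(token_path: Path, connecting_to_existing_cluster: bool) -> str:
    if token_path.exists():
        return token_path.read_text()
    if connecting_to_existing_cluster:
        raise RuntimeError("auth_mode=token: no token found for existing cluster")
    token = secrets.token_hex(32)
    token_path.write_text(token)
    return token
```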

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…indefinite waiting (ray-project#58238)

## Description
Add guidance for RayService initialization timeout to prevent indefinite
waiting with `ray.io/initializing-timeout` annotation on RayService.

## Related issues
Closes ray-project/kuberay#4138

## Additional information
None

---------

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
…check in accelerator_context.py (ray-project#58269)

## Description

- Change the caught exception type from IndexError to ValueError
- This modification ensures that the correct exception is caught when
the expected accelerator ID is not included in the
accelerator_visible_list

`list.index` will raise a
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError)
if there is no such item.

https://docs.python.org/3/tutorial/datastructures.html

<img width="797" height="79" alt="image"
src="https://github.com/user-attachments/assets/830cf4aa-d9cb-44d3-9363-8e5cd576bae9"
/>

Now, the error logs will be correctly captured and printed.
<img width="1454" height="143" alt="image"
src="https://github.com/user-attachments/assets/2b0ed0aa-60c7-49c3-84b0-7f0d4f1ebe48"
/>

Signed-off-by: daiping8 <[email protected]>
Significant component, I keep forgetting it's buried inside of
`common/`.

Also cleaned up the mock & proto build targets that were in the
top-level `BUILD.bazel` file.

We should also at some point clean it up to follow the common pattern of
separate targets for an interface, client, & server.

---------

Signed-off-by: Edward Oakes <[email protected]>
…t#58070)

RayEvent provides a special API, merge, which allows multiple events to
be combined into a single event. This reduces gRPC message size, network
bandwidth usage, and is essential for scaling task event exports. This
PR leverages that feature.

Specifically, it clusters events into groups based on (i) entity ID and
(ii) event type. Each group is merged into a single event, which is then
added to the gRPC message body. The EntityId is a user-defined function,
implemented by the event class creator, that determines which events can
be safely merged.

```
Note: this is a redo of ray-project#56558, which got reverted because it randomized the order of the events that get exported, leading to flaky tests etc. This attempt maintains the order even after merging.
```
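The grouping-and-merging scheme can be sketched as order-preserving bucketing (a hedged illustration; the event shape and callbacks here are hypothetical, not the RayEvent API):

```python
# Bucket events by (entity_id, event_type), collapse each bucket with a
# user-supplied merge function, and emit groups in first-seen order
# (dicts preserve insertion order in Python 3.7+), mirroring the
# order-preservation requirement described above.
def merge_events(events, entity_id, merge):
    groups = {}
    for event in events:
        key = (entity_id(event), event["type"])
        groups[key] = merge(groups[key], event) if key in groups else event
    return list(groups.values())
```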

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
… in core worker (ray-project#58315)

The `metrics_agent_client_` depends on `client_call_manager_`, but
previously it was pulling out a reference to it from the core worker,
which is not guaranteed to outlive the agent client.

Modifying it to keep the `client_call_manager_` as a field of the
`core_worker_process` instead.

I think we may also need to drain any ongoing RPCs from the
`metrics_agent_client_` on shutdown. Leaving that for a future PR.

---------

Signed-off-by: Edward Oakes <[email protected]>
## Description

Historically, the intention was to avoid failures upon attempts to
modify the provided batch in-place when, for example, using Torch tensors.

However, that unjustifiably penalizes 99.9% of use cases for 0.1% of
scenarios. As such, we're flipping this setting to
`zero_copy_batch=True` by default.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
A few release tests currently use the `raw_metric` API to retrieve
Prometheus metrics. This approach is unreliable because it directly
polls the metric export endpoint, which can race with the Prometheus
server that also polls the same endpoint.

To address this, add a method to query metrics directly from the
Prometheus server instead. This ensures that in end-to-end and
production environments, Prometheus remains the sole poller of exported
metrics and the single source of truth for metric values.
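Querying the Prometheus server uses its documented instant-query HTTP endpoint (`/api/v1/query`); a hedged sketch, where the server address and PromQL expression are placeholders:

```python
import json
import urllib.parse
import urllib.request

# Build the instant-query URL for a PromQL expression.
def build_query_url(prom_addr, promql):
    return f"{prom_addr}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

# Ask the Prometheus server (the single source of truth) rather than
# scraping the metric export endpoint directly.
def query_prometheus(prom_addr, promql):
    with urllib.request.urlopen(build_query_url(prom_addr, promql)) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]
```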

Test:
- CI
- The memory metric is collected and the number makes sense:
https://buildkite.com/ray-project/release/builds/66031/steps/canvas?sid=019a3215-e3be-4e9b-86a3-1f0a8253fea7#019a3215-e3f2-4d0e-877f-5f43d97d6e8e/557-654

Signed-off-by: Cuong Nguyen <[email protected]>
Extend token auth support to the dashboard head (all APIs).

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
This PR also updates the cluster resource scheduler logic to account for
the list of `LabelSelector`s specified by the `fallback_strategy`,
falling back to each fallback strategy `LabelSelector` in-order until
one is satisfied when selecting the best node. We're able to support
fallback selectors by considering them in the cluster resource scheduler
in-order using the existing label selector logic in `IsFeasible` and
`IsAvailable`, returning the first valid node returned by
`GetBestSchedulableNode`.
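The in-order fallback logic can be sketched as follows (a hedged, toy stand-in for the `IsFeasible`/`IsAvailable` checks over node labels; the node shape is illustrative):

```python
# Try each label selector in order and return the first node that
# satisfies it; only fall through to the next selector when no node
# matches the current one.
def best_node(nodes, selectors):
    for selector in selectors:
        for node in nodes:
            if all(node["labels"].get(k) == v for k, v in selector.items()):
                return node["id"]
    return None
```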

ray-project#51564

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
…t Raylet logs (ray-project#58244)

`test_gcs_fault_tolerance.py:: test_worker_raylet_resubscription` is
still flaky in CI despite bumping up the timeout. Making a few
improvements here:

- Increasing the timeout to `20s` just in case it's a timeout issue
(unlikely).
- Changing to scheduling an actor instead of using `internal_kv` for our
signal that the GCS is back up. This should better indicate that the
Raylet is resubscribed.
- Cleaning up some system logs.
- Modifying the `ObjectLostError` logs to avoid logging
likely-irrelevant plasma usage on owner death.

It's likely that the underlying issue here is that we don't actually
reliably resubscribe to all worker death notifications, as indicated in
the TODO in the PR.

---------

Signed-off-by: Edward Oakes <[email protected]>
@sourcery-ai sourcery-ai bot left a comment

The pull request #666 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5083.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest changes from the master branch into main, encompassing a broad range of updates to the project's build system, CI/CD pipelines, and development tooling. The changes aim to enhance build reliability, improve developer experience through updated linting and dependency management, and ensure compatibility across various platforms and Python versions. The documentation infrastructure has also been significantly revised to align with new standards and tools.

Highlights

  • Bazel Configuration Updates: The .bazelrc file has been updated to make --incompatible_strict_action_env a global default, add workspace_status_command for Linux, include /utf-8 cxxopt for Windows, and ignore warnings for third-party files. It also introduces test_env=PATH for CI Python tests and allows custom user options via .user.bazelrc. The BUILD.bazel file now uses pkg_files and pkg_zip for packaging, and introduces Python binaries for generating packages. The WORKSPACE file has been updated to io_ray, includes rules_java, and adds uv and platform-specific redis binaries. Various bazel/*.bzl files have been updated for consistency with the new workspace name, updated dependencies, and new Python macros.
  • CI/CD Pipeline Refinements (Buildkite): Significant changes have been made to the Buildkite pipeline configurations. A new _images.rayci.yml file defines various image build steps, including extra variants. Python 3.10 has been added to several matrix builds, and cibase tags are now widely used. Dedicated build steps for ray-core, ray-dashboard, and ray-java have been introduced. Test configurations have been streamlined, with cgraph python tests and workflow tests removed, and dask & modin tests split. The lint.rayci.yml now includes semgrep_lint and adjusts docbuild environments. Automated dependency compilation is now handled by a new dependencies.rayci.yml file and raydepsets tool. macOS builds have been consolidated to macos-arm64, and RAYCI_DISABLE_TEST_DB is now used in release automation. Dockerfiles for forge and manylinux have been updated to use miniforge3 and uv, and include Node.js and Azure CLI installations.
  • Codebase and Documentation Updates: The .github/CODEOWNERS file has been simplified and updated to reflect new team responsibilities and file structures. The .github/PULL_REQUEST_TEMPLATE.md has been revised for clarity. .gitignore now includes new entries for build artifacts and user configurations. The .pre-commit-config.yaml has been updated with new pydoclint and semgrep hooks, and expanded buildifier and cpplint coverage. The .rayciversion has been incremented. The .readthedocs.yaml now points to a specific lock file for documentation dependencies. .vale configuration files have been updated with new rules and vocabulary. C++ code has seen updates to actor_creator.h, actor_task_caller.h, metric.h, ray_runtime.h, task_caller.h, config_internal.cc, abstract_ray_runtime.cc, metric.cc, native_ray_runtime.cc, local_mode_object_store.cc, runtime_env.cc, local_mode_task_submitter.cc, native_task_submitter.cc, task_executor.cc, api_test.cc, cluster_mode_test.cc, counter.cc, simple_kv_store.cc, process_helper.cc, process_helper.h, util.cc, and util.h for consistency, bug fixes, and improved network utility. The cpp/test_submit_cpp_job.py has been updated to remove _node_ip_address from _ray_start.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a large automated merge that includes a significant amount of refactoring across the build system, CI configuration, and dependency management. Key changes include a major overhaul of the Bazel build system, including changing the workspace name to io_ray, breaking down the monolithic root BUILD.bazel file, and modernizing packaging rules. It also features a comprehensive refactoring of the Buildkite CI pipelines, a switch from miniconda to miniforge, and the introduction of a new dependency management tool raydepsets. Additionally, there's an expansion of aarch64 support and standardization on arm64 for macOS CI, along with numerous cleanups and improvements to tests, linting, and documentation. Overall, these changes are a massive step forward for the project's infrastructure. My review focuses on a few specific areas that could be improved or have significant impact.

@@ -1,4 +1,4 @@
workspace(name = "com_github_ray_project_ray")
workspace(name = "io_ray")


high

The workspace name has been changed from com_github_ray_project_ray to io_ray. This is a significant breaking change for any downstream projects that depend on Ray via Bazel. While this is likely intentional as part of the build system overhaul, it's a major change that should be clearly communicated to the community.

Comment on lines +40 to +45

```cpp
std::vector<std::pair<std::string_view, std::string>> tags_pair_vec;
tags_pair_vec.reserve(tags.size());
for (const auto &tag : tags) {
  tags_pair_vec.emplace_back(std::string_view(tag.first), tag.second);
}
metric->Record(value, std::move(tags_pair_vec));
```


Severity: medium

The Metric::Record function has been changed to construct a std::vector<std::pair<std::string_view, std::string>> from the input std::unordered_map. This involves iterating over the map and creating a new vector for every call. Given that metrics can be recorded frequently in performance-sensitive code, this conversion could introduce a non-trivial overhead.

If possible, consider changing the underlying stats::Metric::Record to accept const std::unordered_map<std::string, std::string> & directly to avoid this conversion. If that's not feasible, this change is acceptable, but the performance implication is worth noting.

Comment on lines 18 to +24

```diff
 head_node=${nodes_array[0]}
-head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
-
-# if we detect a space character in the head node IP, we'll
-# convert it to an ipv4 address. This step is optional.
-if [[ "$head_node_ip" == *" "* ]]; then
-    IFS=' ' read -ra ADDR <<<"$head_node_ip"
-    if [[ ${#ADDR[0]} -gt 16 ]]; then
-        head_node_ip=${ADDR[1]}
-    else
-        head_node_ip=${ADDR[0]}
-    fi
-    echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
-fi
-# __doc_head_address_end__

 # __doc_head_ray_start__
 port=6379
-ip_head=$head_node_ip:$port
+ip_head=$head_node:$port
 export ip_head
 echo "IP Head: $ip_head"
 # __doc_head_address_end__
```


Severity: medium

The script has been updated to use ray symmetric-run, which is a great simplification. However, the method for determining the head node IP address has changed. The new version uses the hostname directly (ip_head=$head_node:$port), which assumes that the hostname is resolvable to the correct IP address from all worker nodes. This might not be true in all network configurations.

The previous version was more robust as it used hostname --ip-address to explicitly get the IP address. To avoid potential issues in different environments, I recommend reverting to a more explicit IP discovery method.

Suggested change

```bash
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | awk '{print $1}')
port=6379
ip_head=${head_node_ip}:${port}
export ip_head
echo "IP Head: $ip_head"
# __doc_head_address_end__
```
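To illustrate the failure mode the `awk '{print $1}'` guards against: on some hosts, `hostname --ip-address` prints several space-separated addresses, and concatenating that whole string with a port yields a malformed `ip:port`. A minimal sketch with simulated output (the addresses below are made up, not captured from a real cluster):

```shell
# Simulate `hostname --ip-address` on a host reporting both an IPv4 and
# an IPv6 address; the exact output shape varies by distro.
fake_hostname_output="10.31.0.5 fe80::1ff:fe23:4567:890a"

# Naive use keeps the whole string, producing a malformed address:port.
naive_ip_head="${fake_hostname_output}:6379"
echo "naive:  $naive_ip_head"

# Taking only the first field yields a single usable address.
head_node_ip=$(echo "$fake_hostname_output" | awk '{print $1}')
ip_head="${head_node_ip}:6379"
echo "robust: $ip_head"
```

Note that `awk '{print $1}'` simply keeps whichever address is listed first; if the IPv4 address is not first on a given system, a more targeted filter would be needed.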

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 15, 2025
