@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-10-27
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits October 13, 2025 11:40
…roject#57675)

out of `util.py`. also adding its own `py_library`

Signed-off-by: Lonnie Liu <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

I've encountered an issue where Ray sends SIGKILL to child processes
(grandchild will not receive the signal) launched by a Ray actor. As a
result, the subprocess cannot catch the signal to gracefully clean up
its child processes. Therefore, the grandchild processes of the actor
will leak.
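
For context, one common POSIX pattern for avoiding leaked grandchildren is to start the subprocess in its own process group and signal the whole group, sending SIGTERM first so that handlers can run before escalating to SIGKILL. The sketch below illustrates only that general pattern. It is not Ray's implementation, and `spawn_in_new_group` and `terminate_group` are hypothetical helper names.

```python
import os
import signal
import subprocess


def spawn_in_new_group(cmd):
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, grace_seconds=5.0):
    """Send SIGTERM to the whole group, then SIGKILL if the leader does not exit."""
    try:
        pgid = os.getpgid(proc.pid)
    except ProcessLookupError:
        # The leader is already gone; just collect its exit status.
        return proc.wait()
    # Signal every process in the group, including grandchildren.
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The leader ignored SIGTERM; escalate for the whole group.
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

Because the signal goes to the group rather than to a single PID, processes the child spawned receive it too, as long as they did not move themselves into a different group.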

I'm glad to see ray-project#56476 by
@codope, and I also built a similar solution myself. This PR adds the
case I ran into.

@codope why not enable this feature by default?

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Kai-Hsun Chen <[email protected]>
into `anyscale_job_runner`. it is only used in `anyscale_job_runner` now

Signed-off-by: Lonnie Liu <[email protected]>
…project#57682)

should all be using hermetic python with python 3.8 or above now

Signed-off-by: Lonnie Liu <[email protected]>
make it so that `test_in_docker` does not depend on the entire `ray_release`
library, but only depends on python files that are required for the test
db to work. this removes the dependency of `cryptography` library from
`ray_ci`, so that windows wheels can be built and windows tests can run
again.

Signed-off-by: Lonnie Liu <[email protected]>
…ystem reserved resources (ray-project#57653)

Signed-off-by: irabbani <[email protected]>
Signed-off-by: israbbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Cleaning out plasma and the plasma client and its neighbors.

The plasma client had a pimpl implementation, even though we didn't
really need anything that would come with pimpl out of plasma. So just
killing the separate impl class and just having the plasma client and
its interface. One note about this is that it needs `shared_from_this`
and the old plasma client would always contain a shared ptr to the impl,
so had to refactor the raylet to use a shared ptr to the plasma client
so we could keep using the `shared_from_this`.

Other cleanup:
- a lot of the ipc functions always returned status::ok so changed to
void
- some extra reserving of vectors and moving
- unnecessary consts in pull manager that would prevent moves
etc.

---------

Signed-off-by: dayshah <[email protected]>
ray-project#57626)

The test times out frequently in CI.

Before this change, the test took `~40s` to run on my laptop. After the
change, the test took `~15s` to run on my laptop.

There also seems to be hanging related to in-order execution semantics,
so for now flipping to `allow_out_of_order_execution=True`. @dayshah will
add the `False` variant when he fixes the underlying issue.

---------

Signed-off-by: Edward Oakes <[email protected]>
…oject#56853)

This PR refactors the `TaskExecutionEvent` proto in two ways:
- Rename the file to `events_task_lifecycle_event.proto`
- Refactor the task_state from a map to an array of TaskState and
timestamp. Also rename the field to `state_transitions` for consistency.

This PR depends on the upstream to update their logic to consume this
new schema.
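
As a rough illustration of the schema change, the sketch below models `state_transitions` as an ordered list of (state, timestamp) pairs and shows how a consumer could collapse it back into a state-to-timestamp map. The class name, field names, and state strings are illustrative assumptions, not the actual proto definitions, and the "latest timestamp wins" rule is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StateTransition:
    """Hypothetical stand-in for one entry of `state_transitions`."""
    state: str          # Illustrative state name, e.g. "RUNNING".
    timestamp_ns: int   # Time at which the task entered the state.


def to_state_ts_ns(transitions: List[StateTransition]) -> Dict[str, int]:
    """Collapse an ordered transition list into a state -> timestamp map.

    If a state appears more than once, the latest timestamp is kept
    (an assumption made for this sketch).
    """
    result: Dict[str, int] = {}
    for t in transitions:
        previous = result.get(t.state, t.timestamp_ns)
        result[t.state] = max(previous, t.timestamp_ns)
    return result
```

One reason a list can be preferable to a map here: a list can record the same state being entered several times, whereas a map keyed by state keeps only one timestamp per state.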

Test:
- CI

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Renames task execution event to task lifecycle event and changes its
schema from a state map to an ordered state_transitions list, updating
core, GCS, dashboard, builds, tests, and docs.
> 
> - **Proto/API changes (breaking)**
> - Rename `TaskExecutionEvent` → `TaskLifecycleEvent` and update
`RayEvent.EventType` (`TASK_EXECUTION_EVENT` → `TASK_LIFECYCLE_EVENT`).
> - Replace `task_state` map with `state_transitions` (list of `{state,
timestamp}`) in `events_task_lifecycle_event.proto`.
> - Update `events_base_event.proto` field from `task_execution_event` →
`task_lifecycle_event` and imports/BUILD deps accordingly.
> - **Core worker**
> - Update buffer/conversion logic in
`src/ray/core_worker/task_event_buffer.{h,cc}` to populate/emit
`TaskLifecycleEvent` with `state_transitions`.
> - **GCS**
> - Update `GcsRayEventConverter` to consume `TASK_LIFECYCLE_EVENT` and
convert `state_transitions` to `state_ts_ns`.
> - **Dashboard/Aggregator**
> - Switch exposable type defaults/env to `TASK_LIFECYCLE_EVENT` in
`python/.../aggregator_agent.py`.
> - **Tests**
> - Adjust unit tests to new event/type and schema across core worker,
GCS, and dashboard.
> - **Docs**
>   - Update event export guide references to new lifecycle event proto.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
61507e8. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Signed-off-by: Cuong Nguyen <[email protected]>
- removing the capability enum as it is not being used; for more
details:
ray-project#56707 (comment)

---------

Signed-off-by: harshit <[email protected]>
… default (ray-project#57623)

Previously we were using `DeprecationWarning` which is silenced by
default. Now this is printed:
```
>>> ray.init(local_mode=True)
/Users/eoakes/code/ray/python/ray/_private/client_mode_hook.py:104: FutureWarning: `local_mode` is an experimental feature that is no longer maintained and will be removed in the near future. For debugging consider using the Ray distributed debugger.
  return func(*args, **kwargs)
```
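
The difference matters because Python's default warning filters ignore `DeprecationWarning` unless it is triggered by code in `__main__`, while `FutureWarning` is shown by default. Below is a minimal sketch of the pattern; `init` is a hypothetical stand-in rather than Ray's `ray.init`, and the message text is shortened.

```python
import warnings


def init(local_mode=False):
    """Hypothetical entry point with a flag that is scheduled for removal."""
    if local_mode:
        warnings.warn(
            "`local_mode` is no longer maintained and will be removed.",
            FutureWarning,  # Shown to end users by default.
            stacklevel=2,   # Attribute the warning to the caller's line.
        )
    return {"local_mode": local_mode}


def collect_warning_categories(fn, *args, **kwargs):
    """Call fn and return the categories of all warnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # Record every warning, ignoring filters.
        fn(*args, **kwargs)
    return [w.category for w in caught]
```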

---------

Signed-off-by: Edward Oakes <[email protected]>
- adding a new note about using filesystem as a broker in celery

---------

Signed-off-by: harshit <[email protected]>
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Improved the Ray pull request template to make it less overwhelming for
contributors while giving maintainers better information for reviews and
release notes. The new template has clearer sections and organized
checklists that are much easier to fill out. This should encourage more
contributions while making the review process smoother and release note
generation more straightforward.

## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…epseek support) (ray-project#56906)

Signed-off-by: Jiang Wu <[email protected]>
Signed-off-by: Jiang Wu <[email protected]>
Co-authored-by: Nikhil G <[email protected]>
…ject#57702)

This PR refactors the operator metrics logging tests in `test_stats.py`
to improve clarity, reliability, and maintainability.

- Replaced `test_op_metrics_logging` and `test_op_state_logging` with a
single, more focused test:
`test_executor_logs_metrics_on_operator_completion`
- Uses pytest's `caplog` fixture instead of mocking the logger (more
idiomatic)
- Tests the core behavior (operator completion metrics logged exactly
once) without depending on exact log message formatting
- Eliminates reliance on helper functions and complex string matching
- More descriptive test name following unit testing best practices
- Reduced test code complexity while maintaining coverage of critical
logging behavior
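
To illustrate the `caplog` approach described above, here is a minimal sketch. The logger name, `finish_operator`, and the log message are hypothetical and are not taken from `test_stats.py`; only pytest's `caplog` fixture (`at_level`, `records`) is real API.

```python
import logging

logger = logging.getLogger("example.executor")  # Hypothetical logger name.


def finish_operator(name, completed):
    """Log completion metrics for an operator the first time it completes."""
    if name in completed:
        return
    completed.add(name)
    logger.info("Operator %s completed", name)


def test_executor_logs_metrics_on_operator_completion(caplog):
    # caplog is injected by pytest; no logger mocking is needed.
    with caplog.at_level(logging.INFO, logger="example.executor"):
        completed = set()
        finish_operator("Map", completed)
        finish_operator("Map", completed)  # A second call must not log again.

    records = [r for r in caplog.records if r.name == "example.executor"]
    # Assert on the count, not on the exact message formatting.
    assert len(records) == 1
```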

Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?
as titled
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: iamjustinhsu <[email protected]>
## Why are these changes needed?

This PR speeds up the Data CI pipeline by increasing parallelism and
improving test distribution:

1. **Increased parallelism for parallel tests**: Bumped from 2 to 8
workers for both `data9test` and `dataltest` jobs that handle tests
tagged with `data` (but not `data_non_parallel`)
2. **Added parallelism for non-parallel tests**: Added 3-way parallelism
to `data9test_non_parallel` and `dataltest_non_parallel` jobs with
proper worker distribution flags (`--workers` and `--worker-id`)

These changes should significantly reduce CI runtime for Data tests by
better utilizing available resources.
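
A rough sketch of how `--workers` and `--worker-id` style flags can split a test list across jobs follows. The flag names come from the description above; the 0-based worker IDs and the round-robin assignment are assumptions for illustration and may not match how the CI tooling actually distributes tests.

```python
import argparse
from typing import List, Sequence


def shard(tests: Sequence[str], workers: int, worker_id: int) -> List[str]:
    """Return the tests assigned to worker_id (0-based) out of `workers` shards."""
    if workers < 1 or not 0 <= worker_id < workers:
        raise ValueError("worker_id must be in [0, workers)")
    ordered = sorted(tests)  # Sort so every worker sees the same order.
    return ordered[worker_id::workers]  # Round-robin assignment.


def parse_args(argv):
    """Parse the two distribution flags from a command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--worker-id", type=int, default=0)
    return parser.parse_args(argv)
```

Each test lands in exactly one shard, so running all worker IDs from 0 to `workers - 1` covers the full list without overlap.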

## Related issue number

N/A

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [x] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <[email protected]>
temporarily soft failing on llm dependency compilation

Signed-off-by: elliot-barn <[email protected]>
Including config_name in depsets 
Remove build_arg_sets from config class

---------

Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: Elliot Barnwell <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Lonnie Liu <[email protected]>
…f format.sh (ray-project#57703)

Update Ray documentation and pre-push hooks to standardize on pre-commit
for linting and formatting.

## Summary

- Updated `ci/lint/pre-push` hook to use `pre-commit run` instead of
`ci/lint/format.sh`
- Updated development documentation to reference `pre-commit` instead of
`format.sh` for linting
- Removed language suggesting pre-commit is "opt-in" or "planned for the
future" since it's now the standard approach
- Updated installation instructions to use `pre-commit install`

## Test plan

- Verified documentation changes are accurate
- Confirmed pre-commit configuration works correctly

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: angelinalg <[email protected]>
ray-project#57548)

part 2 of ray-project#56149, a significant
portion of the code is taken from the original PR.

This PR does not introduce any change in functionality. Autoscaling is
still performed at the deployment level. This will help us make the
transition towards application level autoscaling.

The only changes in this PR are:
1. Moving the autoscaling control loop from the deployment state to
the application state.
2. Adding an application autoscaling state class. In the new design, the
autoscaling state manager manages a list of application autoscaling
states, and each application autoscaling state manages a list of
deployment autoscaling states.
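
The ownership hierarchy described above can be sketched roughly as follows. All class names, fields, and the replica formula are hypothetical simplifications for illustration and are not the Serve implementation; the point is only that the loop runs at the application level while each decision is still made per deployment.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DeploymentAutoscalingState:
    name: str
    target_ongoing_requests: float = 2.0  # Assumed per-replica target.
    total_ongoing_requests: float = 0.0

    def desired_replicas(self) -> int:
        # Toy policy: enough replicas to meet the per-replica target.
        ratio = self.total_ongoing_requests / self.target_ongoing_requests
        return max(1, math.ceil(ratio))


@dataclass
class ApplicationAutoscalingState:
    name: str
    deployments: Dict[str, DeploymentAutoscalingState] = field(default_factory=dict)

    def desired_replicas(self) -> Dict[str, int]:
        # The application owns the loop over its deployments.
        return {n: d.desired_replicas() for n, d in self.deployments.items()}


@dataclass
class AutoscalingStateManager:
    applications: Dict[str, ApplicationAutoscalingState] = field(default_factory=dict)

    def run_control_loop_once(self) -> Dict[str, Dict[str, int]]:
        return {n: a.desired_replicas() for n, a in self.applications.items()}
```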

Signed-off-by: abrar <[email protected]>
…n actor (ray-project#57688)

## Summary
This change disables Ray Core's streaming generator backpressure for the
partition actor used in download operations. The partition actor is a
lightweight, fast operation that batches URIs before they're sent to
download tasks. When backpressure was enabled, Ray Core would throttle
the partition actor's output, which starved the downstream download
tasks of work and reduced parallelism.

## Changes
- Set `_generator_backpressure_num_objects` to -1 for the partition
actor
- Use dedicated `ray_remote_args` for the partition actor instead of the
user-provided args (which should only affect download tasks, not
internal partitioning logic)
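
A minimal sketch of the separation described in the list above, without importing Ray: the partition actor gets its own fixed options, and user-provided args are applied only to download tasks. The two helper functions are hypothetical; `_generator_backpressure_num_objects` and the value -1 are taken from the description.

```python
from typing import Any, Dict, Optional


def partition_actor_remote_args() -> Dict[str, Any]:
    """Dedicated options for the internal partition actor."""
    # -1 disables streaming generator backpressure, per the description above.
    return {"_generator_backpressure_num_objects": -1}


def download_task_remote_args(user_args: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """User-provided options apply to download tasks only."""
    return dict(user_args or {})  # Copy so the caller's dict is not mutated.
```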

## Test plan
- [ ] Verify download operations complete successfully
- [ ] Confirm improved parallelism in download tasks
- [ ] Check that backpressure is properly disabled for partition actor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Currently, node events support only two states: ALIVE and DEAD. This PR
introduces a new substate of ALIVE, called ALIVE_DRAINING.

While this state may be triggered repeatedly, the consumer (dashboard)
only needs to observe it once. To prevent overwhelming the event system,
we add a flag to ensure the ALIVE_DRAINING event is emitted only once.
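
The emit-once behavior can be sketched as below. This is a simplified Python illustration, not the GCS C++ code: `DrainingEventEmitter` and its methods are hypothetical names, and tracking is done with a set of node IDs that is cleared when a node is removed.

```python
from typing import Callable, Set


class DrainingEventEmitter:
    """Emit an ALIVE_DRAINING event at most once per node."""

    def __init__(self, emit: Callable[[str, str], None]):
        self._emit = emit
        self._draining_node_ids: Set[str] = set()

    def on_node_update(self, node_id: str, is_draining: bool) -> bool:
        """Return True if an event was emitted for this update."""
        if not is_draining or node_id in self._draining_node_ids:
            return False  # Not draining, or already reported.
        self._draining_node_ids.add(node_id)
        self._emit(node_id, "ALIVE_DRAINING")
        return True

    def on_node_removed(self, node_id: str) -> None:
        # Forget the node so the set does not grow without bound.
        self._draining_node_ids.discard(node_id)
```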

Test:
- CI 

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Adds DRAINING and IDLE_OR_ACTIVE node lifecycle states, emits DRAINING
only once, and updates proto, event mapping, GCS manager, and tests
accordingly.
> 
> - **Proto**:
> - Update `events_node_lifecycle_event.proto`: replace `ALIVE` with
`IDLE_OR_ACTIVE` and add `DRAINING` state.
> - **Observability**:
> - `RayNodeLifecycleEvent`: when `GcsNodeInfo` is ALIVE, emit
`DRAINING` if `state_snapshot` is `DRAINING`, else emit
`IDLE_OR_ACTIVE`.
> - **GCS Node Manager**:
> - Track `draining_node_ids_` to ensure `DRAINING` export event is
written once; clear on node removal.
> - `UpdateAliveNode(...)`: set snapshot to `DRAINING` when draining and
write a single export event for the transition.
> - **Tests**:
>   - Adjust expectations from `ALIVE` to `IDLE_OR_ACTIVE`.
> - Add assertion that only one `DRAINING` lifecycle event is exported
for repeated draining updates.
>   - Update dashboard aggregator test to expect `IDLE_OR_ACTIVE`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
a5f1e37. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Cuong Nguyen <[email protected]>

## Why are these changes needed?
In ray-project#57035 we deprecated the `concurrency` param in favor of `compute` in
`map`, `map_batches`, `flat_map` and `filter`, so the related docs should
be changed to use it as well so that users won't use the deprecated param.
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number
Follow-up for ray-project#57035
<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
…57730)

Example flake:
https://buildkite.com/ray-project/postmerge/builds/13666#0199de53-b97e-4dea-9c1d-37ef56433b7c/607-1103

The way the test was written was inherently flaky because the first GC
could happen at any time, so the timeouts that attempt to measure the
time between the interval were inaccurate.

It somewhat pains me to not make the test fully deterministic, but in an
attempt to deflake without spending too much time here, I've improved it
to at least wait until the first GC interval before starting the clock.

I've also split the driver & actor conditions because the
timers/intervals can be out of sync between them.

There's also some weirdness here that we have two configs to control the
GC interval, one for C++ and one for Python, but I'm letting that
sleeping dog lie...

---------

Signed-off-by: Edward Oakes <[email protected]>
khluu and others added 20 commits October 24, 2025 09:52
Add Azure CLI and dependencies into `base-extra` images

---------

Signed-off-by: kevin <[email protected]>
…8088)

those tests have been failing and jailed for quite some time

related to:
- ray-project#46687 
- ray-project#49847
- ray-project#49846

Signed-off-by: Lonnie Liu <[email protected]>
removing format script and all references

---------

Signed-off-by: elliot-barn <[email protected]>
… submission/block generation metrics (ray-project#57246)


## Why are these changes needed?
On executor shutdown, the metrics persist even after execution. The plan
is to reset them on streaming_executor.shutdown. This PR also includes 2
potential drive-by fixes for metric calculation.
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <[email protected]>
Add example output for the command to help end users verify
the execution result.


Signed-off-by: fscnick <[email protected]>
## Description

This PR adds a `preserve_row` option to `map_batches`. When
`preserve_row` is true, the limit operator can be pushed down through
this `map_batches` call for optimization.

Note: `map_group` is built on `map_batches`, but limit pushdown
support for `map_group` is out of scope for this PR, so
`preserve_row_count` is set to false for it.
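
To make the optimization concrete, here is a toy sketch of limit pushdown over a linear plan. The `Read`, `MapBatches`, and `Limit` classes and the `preserves_row_count` field are hypothetical stand-ins, not Ray Data's logical operators or optimizer rules.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Read:
    source: str


@dataclass
class MapBatches:
    fn_name: str
    preserves_row_count: bool = False  # Hypothetical flag for this sketch.


@dataclass
class Limit:
    n: int


Op = Union[Read, MapBatches, Limit]


def push_down_limits(plan: List[Op]) -> List[Op]:
    """Move each Limit upstream past maps that keep the row count unchanged."""
    ops = list(plan)
    for i in range(len(ops)):
        if not isinstance(ops[i], Limit):
            continue
        j = i
        # Swap the limit with its upstream neighbor while that is safe.
        while (
            j > 0
            and isinstance(ops[j - 1], MapBatches)
            and ops[j - 1].preserves_row_count
        ):
            ops[j - 1], ops[j] = ops[j], ops[j - 1]
            j -= 1
    return ops
```

The limit stops at the first map that may add or drop rows, since applying the limit before such a map could change the result.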


## Related issues

## Additional information

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>

~~Before:~~

~~https://github.com/user-attachments/assets/9db00f37-0c37-4e99-874a-a14481878e4a~~
~~In before, the progress bar won't update until the first tasks
finishes.~~

~~After:~~

~~https://github.com/user-attachments/assets/99877a3f-7b52-4293-aae5-7702edfaabec~~

~~In After, the progress bar won't update until the first task generates
output. If a task generates 10 blocks, we will update the progress bar
while it's generating blocks, even if the task hasn't finished. Once the
task finishes, we default back to the way it was before.~~

~~This is better because the very 1st progress bar update will occur
sooner, and won't feel abrupt to the user.~~

Refactoring the progress bar estimates using known metrics.

## Why are these changes needed?
Currently we use the number of finished tasks. This is OK, but since we use
a streaming generator, 1 task can produce thousands of blocks. This is troublesome
when read parquet uses an additional split factor (split blocks).
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <[email protected]>
…58046)

This PR sets up the helper classes and utils to enable token-based
authentication for Ray Core RPC calls.

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
I suspect that when we deploy the app config, we don't wait long enough
before sending traffic, so requests could go to the wrong version.

---------

Signed-off-by: abrar <[email protected]>
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.
 
Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ject#57930)

Add actor+job+node event to ray event export doc

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Dhyey Shah <[email protected]>
Signed-off-by: Qiaolin-Yu <[email protected]>
Signed-off-by: Qiaolin Yu <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
disabled the wrong test with a different name from the issue

mistakenly associated issue:
ray-project#46687

Signed-off-by: Lonnie Liu <[email protected]>
upgrading batch inference tests to py3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65258

all except for image_embedding_from_jsonl are running on python 3.10

---------

Signed-off-by: elliot-barn <[email protected]>

@sourcery-ai sourcery-ai bot left a comment

The pull request #661 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 4926.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a substantial overhaul of the project's build and continuous integration infrastructure. The primary goal is to enhance build determinism, efficiency, and maintainability across various components, including Python, C++, and Java. Key changes involve adopting a new dependency management tool (raydepsets with uv), reorganizing Buildkite pipelines for improved modularity, and updating core Bazel configurations and Docker image creation processes. These efforts are crucial for ensuring a robust and scalable development workflow.

Highlights

  • Bazel Configuration Updates: Several changes to .bazelrc were implemented, including enabling --incompatible_strict_action_env by default, adding a workspace status command for Linux, specifying UTF-8 encoding for Windows C++ compiler options, and suppressing warnings for third-party code. Some previous configurations related to strict action environments and specific Linux compiler flags were removed.
  • CI Pipeline Refactoring: The Buildkite CI YAML files underwent significant restructuring. Image definitions were moved to a new _images.rayci.yml file for better organization. New build steps were introduced for core, dashboard, and Java components, and dependencies for various test stages were updated to reflect these changes and new CUDA versions.
  • Dependency Management Modernization: The project adopted uv for Python dependency management and introduced a new raydepsets tool. This replaces older pip-compile methods and manual requirement handling, aiming to streamline the compilation and management of Python dependency sets for improved consistency and efficiency.
  • MacOS Build and Test Migration: MacOS builds and tests were migrated to macos-arm64 instances, with the removal of x86_64 MacOS wheel builds. Associated MacOS build scripts were consolidated to support this new architecture.
  • Linting and Code Ownership Enhancements: The .pre-commit-config.yaml was updated to integrate new linters like semgrep and vale. Additionally, the .github/CODEOWNERS file was revised to reflect broader team ownership and include new file paths, ensuring better code quality and accountability.
  • Python Protobuf Generation Refinement: The process for generating Python protobuf files was refactored. It now utilizes pkg_zip and genrule for more efficient packaging and improved dependency handling within the Bazel build system.
  • C++ API and Runtime Adjustments: Updates were made to C++ API headers and runtime implementations. This includes changes to RemoteFunctionHolder and how metrics are recorded, along with the removal of some ray_common dependencies, contributing to a cleaner and potentially more performant C++ codebase.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is an automated daily merge from master to main, containing a large number of changes, primarily focused on a significant refactoring and improvement of the build and CI system. The changes are extensive and well-executed, leading to a more modular, maintainable, and robust system.

Key improvements include:

  • Bazel Refactoring: The root BUILD.bazel file has been cleaned up, with targets moved to more appropriate sub-packages. The use of rules_pkg for packaging is a welcome modernization.
  • CI/CD Overhaul: The Buildkite pipelines have been heavily refactored. Builds are now more modular (e.g., core, dashboard, java are built as separate artifacts). Dependency management is enhanced with the introduction of the raydepsets tool.
  • Dependency Management: The project has migrated from miniconda to miniforge, and the uv package manager has been introduced for faster dependency resolution. Several package versions have been updated.
  • Linting and Style: The pre-commit configuration has been significantly improved with the addition of tools like semgrep and vale, and better configuration for existing tools. The CODEOWNERS file and PR template have also been updated.
  • Code Modernization: Several C++ components have been updated to use modern C++ features (e.g., std::invoke_result_t), and code style has been improved. Python code has been updated to use newer APIs where applicable.

Overall, these changes represent a substantial step forward for the project's infrastructure. The refactoring is logical and the improvements are clear. I did not find any issues that require attention.

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 11, 2025