
Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-10
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

wph95 and others added 30 commits October 21, 2025 17:11
…oject#57948)

There's a potential bug in the current midpoint calculation: it can produce
wrong results when the values are negative.

Line 167: `lower_bound + buckets[0] / 2.0`
Line 171: `(buckets[i] + buckets[i - 1]) / 2.0`

I improved the formula and added a test to make sure it works.
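
A sketch of a sign-safe form (function name hypothetical), assuming the intended midpoint is the arithmetic mean of each bucket's edges:

```python
def bucket_midpoints(lower_bound, buckets):
    # First bucket spans [lower_bound, buckets[0]); take the mean of both edges.
    mids = [(lower_bound + buckets[0]) / 2.0]
    # Remaining buckets span [buckets[i - 1], buckets[i]).
    mids += [(buckets[i] + buckets[i - 1]) / 2.0 for i in range(1, len(buckets))]
    return mids

# Correct for negative bounds: midpoints stay inside their buckets.
assert bucket_midpoints(-10.0, [-5.0, 0.0]) == [-7.5, -2.5]
```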

Signed-off-by: justwph <[email protected]>
## Description

See  ray-project#57924

## Related issues
Fixes ray-project#57924

## Additional information
- [x] Verify by manual testing -> confirmed with example using
`write_numpy`

Signed-off-by: kyuds <[email protected]>
## Why are these changes needed?

With ray-project#55207 Ray Train now has
support for training functions with a JAX backend through the new
`JaxTrainer` API. This guide provides a short overview of the API, how to
configure it with TPUs, and how to edit a JAX script to use Ray Train.

TODO: I will link a longer e2e guide with KubeRay, MaxText, and the
JaxTrainer on TPUs in GKE
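
As a rough illustration (not taken from this PR), a minimal `JaxTrainer` sketch; the import path and the TPU-related `ScalingConfig` fields are assumptions based on the Ray Train V2 pattern:

```python
import ray.train
from ray.train import ScalingConfig
from ray.train.v2.jax import JaxTrainer  # assumed import path

def train_func():
    # An existing JAX training loop, edited to report metrics to Ray Train.
    ray.train.report({"loss": 0.0})

trainer = JaxTrainer(
    train_func,
    # use_tpu/topology are assumed TPU configuration fields.
    scaling_config=ScalingConfig(num_workers=4, use_tpu=True, topology="4x4"),
)
result = trainer.fit()
```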

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Currently, RDT object metadata sticks around on the driver. This can lead to
some weird behavior, like the src actor living past expectations because the
actor handle sticks around in the metadata. I'm now freeing the metadata on
the owner whenever we decide to tell the primary copy owner to free it (when
the ref counter decides it's "OutOfScope").


https://github.com/ray-project/ray/blob/d8b7a54be63638924369ac5c7d5e671a23f151a7/python/ray/experimental/gpu_object_manager/gpu_object_manager.py#L29-L38

A couple of TODOs for the future:
- Currently not freeing metadata on borrowers when you use NIXL to put ->
get.
- We currently don't support lineage reconstruction for RDT objects; once we
do, we may have to change the metadata free to happen on ref deletion rather
than on OutOfScope, or add some metadata-recreation path.

---------

Signed-off-by: dayshah <[email protected]>
Run it continuously to capture potential issues.

Signed-off-by: Kevin H. Luu <[email protected]>
…project#57947)

Creating a depset for doc builds.
Pinning pydantic==2.50 for doc requirements because api_policy_check was
failing due to dependencies (failure below).
Installing Ray without deps.

Failure due to dependencies:
https://buildkite.com/ray-project/premerge/builds/52231#019a04a9-e9f8-4151-b5e5-d4bceb48a3cc

Successful api_policy_check run on this PR:
https://buildkite.com/ray-project/microcheck/builds/29312#019a05b8-ef40-4449-9ca2-6dd9d8e790a7

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
## Description
Change `train_colocate_trainer` release test frequency to be manual.

## Related issues
Related to ray-project#49454.

## Additional information

`ScalingConfig.trainer_resources` has been deprecated in Ray Train V2.
As a result, we should disable the test for now. In the future, we can
either:
1. Delete this test entirely.
2. Add back functionality for colocation & re-enable this test.

Signed-off-by: Matthew Deng <[email protected]>
…or (ray-project#57869)

# Summary

We observed that whenever `after_worker_group_poll_status` raised an
exception, the Train Run would fail ungracefully and show up as
`ABORTED` in the dashboard. This happened in the following situations:
1) Different workers report remote checkpoints with different paths ->
`(TrainController pid=46993) RuntimeError: The storage path of the
checkpoints in the training results is not the same. This means the
checkpoints are not consistent. Got a mix of the following checkpoint
paths: {'/tmp/tmpl95kv7ax', '/tmp/tmp__8e6etk'} ` -> `ABORTED` Train Run
2) `ray.train.report({"loss": ...}, checkpoint=checkpoint)` in
`train_func` -> `TypeError: Object of type 'ellipsis' is not JSON
serializable` in `CheckpointManager._save_state` -> `ABORTED` Train Run

This PR catches these exceptions, wraps them in a `ControllerError`, and
goes through the `FailurePolicy`, ultimately resulting in an `ERRORED`
Train Run, which is more intuitive because it happened due to an error
in the training workers (`The Train run failed due to an error in the
training workers.` is the comment associated with `RunStatus.ERRORED`).

I considered implementing a more general solution that caught all
`WorkerGroupCallback` errors and resurfaced them as `ControllerError`s,
but decided against it because:
* Callbacks occur in many different places and we might want to add
custom try/catch logic in each case.
* `after_worker_group_poll_status` is the only offender so far and most
of its errors are from user mistakes; other callback errors could be
legitimate bugs that should result in `ABORTED`

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <[email protected]>
Since we are now logging into Azure using a certificate, these environment
variables are no longer used.

Signed-off-by: Kevin H. Luu <[email protected]>
…API (ray-project#57977)

## Summary

This PR updates the document embedding benchmark to use the canonical
Ray Data implementation pattern, following best practices for the
framework.

## Key Changes

### Use `download()` expression instead of separate materialization
**Before:**
```python
file_paths = (
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .take_all()
)
file_paths = [row["uploaded_pdf_path"] for row in file_paths]
ds = ray.data.read_binary_files(file_paths, include_paths=True)
```

**After:**
```python
(
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .with_column("bytes", download("uploaded_pdf_path"))
)
```

This change:
- Eliminates the intermediate materialization with `take_all()`, which
loads all data into memory
- Uses the `download()` expression to lazily fetch file contents as part
of the pipeline
- Removes the need for a separate `read_binary_files()` call

### Method chaining for cleaner code
All operations are now chained in a single pipeline, making the data
flow more clear and idiomatic.

### Consistent column naming
Updated references from `path` to `uploaded_pdf_path` throughout the
code for consistency with the source data schema.

Signed-off-by: Balaji Veeramani <[email protected]>
This PR addresses several failing release tests likely due to the recent
Ray Train V2 default enablement.

The following failing release tests are addressed: 
- huggingface_transformers
- distributed_training.regular
- distributed_training.chaos

distributed_training fix:
`distributed_training.regular` and `distributed_training.chaos` were
failing because they relied on the deprecated free-floating-metrics
reporting functionality. The tests attempted to access a nonexistent key in
`result.metrics` that was never reported. The fix uploads a checkpoint to
ensure this key exists (see the sketch below).
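
A minimal sketch of that pattern (the metric name and checkpoint contents are hypothetical):

```python
import tempfile

import ray.train
from ray.train import Checkpoint

def train_func():
    with tempfile.TemporaryDirectory() as tmpdir:
        # Reporting alongside a checkpoint ensures the key shows up in
        # result.metrics; free-floating metrics are deprecated in Train V2.
        ray.train.report({"loss": 0.1}, checkpoint=Checkpoint.from_directory(tmpdir))
```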

huggingface_transformers:
The `huggingface_transformers` test was failing due to outdated accelerate
and peft versions. The fix uses a post-build file to install the proper
accelerate and peft versions.

Tests:

Test Name | Before | After
-- | -- | --
huggingface_transformers | https://buildkite.com/ray-project/release/builds/64733#019a0559-25da-401f-8d7e-3128b8f7d287 | https://buildkite.com/ray-project/release/builds/64888#019a090d-5f53-4f7d-b0ac-ac8cf7c529b6
distributed_training.regular | https://buildkite.com/ray-project/release/builds/64733#019a0572-1095-4b6f-b3bc-b496227c9280 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b5-41b6-aaf6-2bbd443a0a1e
distributed_training.chaos | https://buildkite.com/ray-project/release/builds/64733#019a0574-3862-4da2-a264-a9e11333bd72 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b6-4344-90f2-cbcd637aae3d

---------

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
`num_waiter == 0` does not necessarily mean that the request has been
completed.

---------

Signed-off-by: abrar <[email protected]>
…es (ray-project#57883)

This PR adds a test to verify that DataOpTask handles node failures
correctly during execution. To enable this testing, callback seams are
added to DataOpTask that allow tests to simulate preemption scenarios by
killing and restarting nodes at specific points during task execution.

## Summary
- Add callback seams (`block_ready_callback` and
`metadata_ready_callback`) to `DataOpTask` for testing purposes
- Add `has_finished` property to track task completion state
- Create `create_stub_streaming_gen` helper function to simplify test
setup
- Refactor existing `DataOpTask` tests to use the new helper function
- Add new parametrized test `test_on_data_ready_with_preemption` to
verify behavior when nodes fail during execution

## Test plan
- Existing tests pass with refactored code
- New preemption test validates that `on_data_ready` handles node
failures correctly by testing both block and metadata callback scenarios
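
A simplified illustration of the callback-seam idea described in the summary above (all names are stand-ins for the real classes):

```python
class DataOpTask:  # simplified stand-in for the real class
    def __init__(self, streaming_gen, block_ready_callback=None):
        self._gen = streaming_gen
        # The seam: tests inject a callback that runs when a block is ready,
        # e.g. to kill a node and simulate preemption at that exact point.
        self._block_ready_callback = block_ready_callback or (lambda block: None)
        self.has_finished = False

    def on_data_ready(self):
        for block in self._gen:
            self._block_ready_callback(block)
        self.has_finished = True

# In a test: inject a callback that records (or triggers) a node failure.
events = []
task = DataOpTask(iter([b"block0", b"block1"]),
                  block_ready_callback=lambda b: events.append(("kill_node", b)))
task.on_data_ready()
assert task.has_finished and len(events) == 2
```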

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description
1. `mlflow.start_run()` does not have a `tracking_uri` arg:
https://mlflow.org/docs/latest/api_reference/python_api/mlflow.html#mlflow.start_run
2. Rewrite the MLflow setup as follows:
```python
mlflow.set_tracking_uri(uri="file://some_shared_storage_path/mlruns")
mlflow.set_experiment("my_experiment")
mlflow.start_run()
```

## Related issues
N/A

---------

Signed-off-by: Lehui Liu <[email protected]>
…t#57855)



## Description
1. Add visitors for collecting column names from all expressions and
renaming columns across the tree.
2. Use expressions for `rename_columns`, `with_column`, and
`select_columns`, and remove `cols` and `cols_rename` from `Project`.
3. Modify projection pushdown to work correctly with combinations of the
above operators (see the sketch below).
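
For example, a pipeline combining these operators, which the pushdown must now handle correctly (dataset and column names hypothetical):

```python
import ray
from ray.data.expressions import col

ds = (
    ray.data.range(100)
    .with_column("double_id", col("id") * 2)   # expression-backed with_column
    .rename_columns({"double_id": "did"})      # rename applied across the tree
    .select_columns(["did"])                   # projection to push down
)
```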

## Related issues
Closes ray-project#56878,
ray-project#57700


Signed-off-by: Goutam <[email protected]>
…ject#55291)

Resolves: ray-project#55288 (wrong `np.array` in `TensorType`)

Furthermore changes:
- Changed comments to (semi-)docstrings, which IDEs (e.g., VSCode +
Pylance) display as tooltips, making that information available to the user.
- `AgentID: Any -> Hashable`, as it is used for dict keys.
- Changed `DeviceType` so it is no longer a TypeVar (that makes no sense
given how it is currently used); it now also includes `torch`'s
DeviceLikeType (`int | str | device`). IMO it could fully replace the
current type, but to be defensive I only added it as an extra possible type.
- Used the updated `DeviceType` to improve the type of `Runner._device`.
- Used torch's own type in `data`, since the current code supports more than
just `str`. I refrained from adding a reference to `rllib`, though it would
be nice if they were in sync.
- Some extra formatting forced by pre-commit.
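
A condensed sketch of the resulting aliases (per the summary note below; simplified, and the real `TensorType` also covers tf/jax tensors):

```python
from typing import TYPE_CHECKING, Hashable, Union

import numpy as np
import numpy.typing as npt

if TYPE_CHECKING:
    import torch  # avoid a hard runtime dependency on torch

TensorType = Union[npt.NDArray[np.generic], "torch.Tensor"]
"""A tensor-like value; docstrings like this show up as IDE tooltips."""

DeviceType = Union[str, "torch.device", int]
"""No longer a TypeVar; mirrors torch's DeviceLikeType (int | str | device)."""

AgentID = Hashable
"""Tightened from Any, since agent IDs are used as dict keys."""
```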

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Revamps `rllib.utils.typing` (NDArray-based `TensorType`, broader
`DeviceType`, `AgentID` as `Hashable`, docstring cleanups) and updates
call sites to use optional device typing and improved hints.
> 
> - **Types**:
>   - Overhaul `rllib/utils/typing.py`:
> - `TensorType` now uses `numpy.typing.NDArray`; heavy use of
`TYPE_CHECKING` to avoid runtime deps on torch/tf/jax.
> - `DeviceType` widened to `Union[str, torch.device, int]` (was
`TypeVar`).
> - `AgentID` tightened to `Hashable`; `NetworkType` uses `keras.Model`.
> - Refined aliases (e.g., `FromConfigSpec`, `SpaceStruct`) and added
concise docstrings.
> - **Runners**:
> - `Runner._device` now `Optional` (`Union[DeviceType, None]`) with
updated docstring; same change in offline runners' `_device` properties.
> - **Connectors**:
> - `NumpyToTensor`: `device` param typed as `Optional[DeviceType]` (via
`TYPE_CHECKING`).
> - **Utils**:
> - `from_config`: typed `config: Optional[FromConfigSpec]` with
`TYPE_CHECKING` import.
> - **Misc**:
>   - Minor formatting/import ordering and comment typo fixes.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
ae2e422. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Daniel Sperber <[email protected]>
Signed-off-by: Daraan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Kamil Kaczmarek <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
…57993)

Although Spark-on-Ray depends on the Java bindings, the `java` tests are
triggered by all C++ changes, and we don't want to run Spark-on-Ray tests
every time we change C++ code.

---------

Signed-off-by: Edward Oakes <[email protected]>
…ect#57974)

This PR replaces STATS with Metric as the way to define metrics inside Ray
(as a unification effort) in all object-manager components. Normally,
metrics are defined at the top-level component and passed down to
sub-components. However, because the object manager is used as an API across
components, doing so would be unnecessarily cumbersome, so I decided to
define the metrics inline within each client and server class instead.

Note that the metric classes (Metric, Gauge, Sum, etc.) are simply
wrappers around static OpenCensus/OpenTelemetry entities.


**Details**
Full context of this refactoring work.
- Each component (e.g., gcs, raylet, core_worker, etc.) now has a
metrics.h file located in its top-level directory. This file defines all
metrics for that component.
- In most cases, metrics are defined once in the main entry point of
each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for
Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.).
These metrics are then passed down to subcomponents via the
ray::observability::MetricInterface.
- This approach significantly reduces rebuild time when metric
infrastructure changes. Previously, a change would trigger a full Ray
rebuild; now, only the top-level entry points of each component need
rebuilding.
- There are a few exceptions where metrics are tracked inside object
libraries (e.g., task_specification). In these cases, metrics are
defined within the library itself, since there is no corresponding
top-level entry point.
- Finally, the obsolete metric_defs.h and metric_defs.cc files can now
be completely removed. This paves the way for further dead code cleanup
in a future PR.

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
## Description
Use `tune.report` instead of `train.report`.
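
A hedged sketch of the change inside a Tune trainable (metric name hypothetical):

```python
from ray import tune

def trainable(config):
    # Previously reported via ray.train.report; Tune code should use tune.report.
    tune.report({"loss": 0.1})
```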

Signed-off-by: Matthew Deng <[email protected]>
…t#57620)


## Why are these changes needed?

This will be used to help control the targets that are returned.

Signed-off-by: akyang-anyscale <[email protected]>

## Description

This PR adds a new check to make sure proxies are ready to serve traffic
before finishing serve.run. For now, the check immediately finishes.


---------

Signed-off-by: akyang-anyscale <[email protected]>
…roject#57793)

When deploying Ray on YARN using Skein, it's useful to expose Ray's
dashboard via Skein's web UI. This PR shows how to expose that and updates
the related documentation.
Signed-off-by: Zakelly <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…cgroup even if they are drivers (ray-project#57955)

For more details about the resource isolation project see
ray-project#54703.

Driver processes that are registered in Ray's internal namespace (such as
the Ray dashboard's job and serve modules) are considered system processes.
Therefore, they will not be moved into the workers cgroup when they register
with the raylet.

---------

Signed-off-by: irabbani <[email protected]>
xinyuangui2 and others added 20 commits November 7, 2025 18:38
…t stats (ray-project#58422)

## Why These Changes Are Needed

This PR adds a new metric to track the time spent retrieving `RefBundle`
objects during dataset iteration. This metric provides better visibility
into the performance breakdown of batch iteration, specifically
capturing the time spent in `get_next_ref_bundle()` calls within the
`prefetch_batches_locally` function.

## Related Issue Number

N/A

## Example

```
  dataloader/train = {'producer_throughput': 8361.841782656593, 'iter_stats': {'prefetch_block-avg': inf, 'prefetch_block-min': inf, 'prefetch_block-max': 0, 'prefetch_block-total': 0, 'get_ref_bundles-avg': 0.05172277254545271, 'get_ref_bundles-min': 1.1991999997462699e-05, 'get_ref_bundles-max': 11.057470971999976, 'get_ref_bundles-total': 15.361663445999454, 'fetch_block-avg': 0.31572694455743233, 'fetch_block-min': 0.0006362799999806157, 'fetch_block-max': 2.1665870369999993, 'fetch_block-total': 93.45517558899996, 'block_to_batch-avg': 0.001048687573988573, 'block_to_batch-min': 2.10620000302697e-05, 'block_to_batch-max': 0.049948245999985375, 'block_to_batch-total': 2.048086831999683, 'format_batch-avg': 0.0001013781433686053, 'format_batch-min': 1.415700000961806e-05, 'format_batch-max': 0.009682661999988795, 'format_batch-total': 0.19799151399888615, 'collate-avg': 0.01303446213312943, 'collate-min': 0.00025646699998560507, 'collate-max': 0.9855495820000328, 'collate-total': 25.456304546001775, 'finalize-avg': 0.012211385266257683, 'finalize-min': 0.004209667999987232, 'finalize-max': 0.3785081949999949, 'finalize-total': 23.848835425001255, 'time_spent_blocked-avg': 0.04783407008137157, 'time_spent_blocked-min': 1.2316999971062614e-05, 'time_spent_blocked-max': 12.46102861700001, 'time_spent_blocked-total': 93.46777293900004, 'time_spent_training-avg': 0.015053571562211652, 'time_spent_training-min': 1.3704999958008557e-05, 'time_spent_training-max': 1.079616685000019, 'time_spent_training-total': 29.399625260999358}}
```

## Checks

- Testing Strategy
   - [x] Unit tests

---------

Signed-off-by: xgui <[email protected]>
Signed-off-by: Xinyuan <[email protected]>
## Description

When token auth is enabled, the dashboard prompts the user to enter a valid
auth token and caches it as a browser cookie. When token-based auth is
disabled, existing behavior is retained.

All dashboard UI RPCs to the Ray cluster set the authorization
header in their requests.
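
For illustration, a hedged sketch of what an authenticated request against the dashboard might look like from a client; the endpoint and the Bearer scheme are assumptions, not confirmed by this PR:

```python
import requests

# Hypothetical example: a dashboard API call carrying the auth token.
resp = requests.get(
    "http://127.0.0.1:8265/api/version",
    headers={"Authorization": "Bearer <auth-token>"},  # assumed header scheme
)
resp.raise_for_status()
```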

## Screenshots

token popup
<img width="3440" height="2146" alt="image"
src="https://github.com/user-attachments/assets/004c23a3-991e-4a2c-a2ad-5a0ce2e60893"
/>


on entering an invalid token
<img width="3440" height="2146" alt="image"
src="https://github.com/user-attachments/assets/7183a798-ceb7-4657-8706-39ce5fe8e61e"
/>

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…ants (ray-project#57910)

1. **Remove direct environment variable access patterns**
- Replace all instances of `os.getenv("RAY_enable_open_telemetry") ==
"1"`
- Standardize to use `ray_constants.RAY_ENABLE_OPEN_TELEMETRY`
consistently throughout the codebase

2. **Unify default value format for RAY_enable_open_telemetry**
   - Standardize the default value to `"true"` | `"false"` 
- Previously, the codebase had mixed usage of `"1"` and `"true"`, which
is now unified

3. **Backward compatibility maintained**
- Carefully verified that the existing `RAY_ENABLE_OPEN_TELEMETRY`
constant properly handles both `"1"` and `"true"` values
   - This change will not introduce any breaking behavior
   - The `env_bool` helper function already supports both formats:
```python
RAY_ENABLE_OPEN_TELEMETRY = env_bool("RAY_enable_open_telemetry", False)
def env_bool(key, default):
    if key in os.environ:
        return (
            True
            if os.environ[key].lower() == "true" or os.environ[key] == "1"
            else False
        )
    return default
```

---
Most of the current code uses: `RAY_enable_open_telemetry: "1"`

A smaller portion (not zero) uses: `RAY_enable_open_telemetry: "true"`

https://github.com/ray-project/ray/blob/fe7ad00f9720a722fde5fecba5bb681234bcdb63/python/ray/tests/test_metrics_agent.py#L497

My personal preference is "true": it's concise and unambiguous. If it's
"1", I have to think/guess whether it means "true" or "false".

---------

Signed-off-by: justwph <[email protected]>
…y-project#58217)

Change the unit of `scheduler_placement_time` from seconds to
milliseconds. The current buckets span 0.1 s to 2.5 hours, which doesn't
make sense. According to a sample of data, the range we are interested in
is from microseconds to seconds. Thanks @ZacAttack for pointing this out.

```
Note: This is an internal (non-public-facing) metric, so we only need to update its usage within Ray (e.g., the dashboard). A simple code change should suffice.
```

<img width="1609" height="421"
alt="505491038-c5d81017-b86c-406f-acf4-614560752062"
src="https://github.com/user-attachments/assets/cc647b97-42ec-42eb-bf01-4d1867940207"
/>

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
…s in the Raylet (ray-project#58342)

Found it very hard to parse what was happening here, so helping future
me (or you!).

Also:

- Deleted vestigial `next_resource_seq_no_`.
- Converted from non-monotonic clock to a monotonically incremented
`uint64_t` for the version number for commands.
- Added logs when we drop messages with stale versions.

---------

Signed-off-by: Edward Oakes <[email protected]>
## Description
There was a typo

## Related issues
N/A

## Additional information
N/A

Signed-off-by: Daniel Shin <[email protected]>
be consistent with the CI base env specified in `--build-name`

Signed-off-by: Lonnie Liu <[email protected]>
getting ready to run things on python 3.10

Signed-off-by: Lonnie Liu <[email protected]>
…tion on a single node (ray-project#58456)


## Description

Currently, finalization is scheduled in batches sequentially, i.e., a batch
of N adjacent partitions is finalized at once (in a sliding window).

This creates a lensing effect since:

1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators
j and j+1 (since membership is determined as j = i % num_aggregators).
2. Adjacent aggregators have a high likelihood of landing on the same node
(since they are scheduled at about the same time, in sequence).

To address this, the change applies random sampling when choosing the next
partitions to finalize, so that partitions are chosen uniformly, reducing
concurrent finalization of adjacent partitions (see the sketch below).
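
A minimal sketch of the sampling change (function and variable names hypothetical):

```python
import random

def next_partitions_to_finalize(pending, batch_size):
    """Pick the next batch uniformly at random instead of a sliding window,
    so adjacent partitions (which map to adjacent aggregators via
    j = i % num_aggregators) are unlikely to finalize concurrently."""
    k = min(batch_size, len(pending))
    return random.sample(sorted(pending), k)

pending = set(range(64))
batch = next_partitions_to_finalize(pending, 8)
pending -= set(batch)
```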


---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Description

Making the NotifyGCSRestart RPC fault tolerant and idempotent. There were
multiple places where we always returned Status::OK() in the
gcs_subscriber, making idempotency harder to understand, and there was dead
code for one of the resubscribes, so this includes a minor cleanup. Added a
Python integration test to verify retry behavior; left out the C++ test
since on the raylet side there's nothing to test, as it's just making a
gcs_client RPC call.

---------

Signed-off-by: joshlee <[email protected]>
…ct#58445)

## Summary
Creates a dedicated `tests/unit/` directory for unit tests that don't
require Ray runtime or external dependencies.

## Changes
- Created `tests/unit/` directory structure
- Moved 13 pure unit tests to `tests/unit/`
- Added `conftest.py` with fixtures to prevent `ray.init()` and
`time.sleep()`
- Added `README.md` documenting unit test requirements
- Updated `BUILD.bazel` to run unit tests with "small" size tag

## Test Files Moved
1. test_arrow_type_conversion.py
2. test_block.py
3. test_block_boundaries.py
4. test_data_batch_conversion.py
5. test_datatype.py
6. test_deduping_schema.py
7. test_expression_evaluator.py
8. test_expressions.py
9. test_filename_provider.py
10. test_logical_plan.py
11. test_object_extension.py
12. test_path_util.py
13. test_ruleset.py

These tests are fast (<1s each), isolated (no Ray runtime), and
deterministic (no time.sleep or randomness).
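
A sketch of what a guard fixture like the one in the new `conftest.py` can look like (fixture name and error text hypothetical):

```python
# conftest.py -- fail fast if a "unit" test accidentally calls
# ray.init() or time.sleep().
import pytest

@pytest.fixture(autouse=True)
def forbid_ray_and_sleep(monkeypatch):
    def _forbid(name):
        def _raise(*args, **kwargs):
            raise RuntimeError(f"{name} is not allowed in unit tests")
        return _raise

    monkeypatch.setattr("ray.init", _forbid("ray.init"))
    monkeypatch.setattr("time.sleep", _forbid("time.sleep"))
```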

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

## Description


### [Data] Concurrency Cap Backpressure tuning
- Maintain an asymmetric EWMA of total queued bytes (this op + downstream)
as the typical level: `level`.
- Maintain an asymmetric EWMA of the absolute residual vs. the previous
level as a scale proxy: `dev = EWMA(|q - level_prev|)`.
- Define a deadband: `[lower, upper] = [level - K_DEV * dev, level +
K_DEV * dev]`.
  - If `q > upper` -> target cap = `running - BACKOFF_FACTOR` (back off)
  - If `q < lower` -> target cap = `running + RAMPUP_FACTOR` (ramp up)
  - Else -> target cap = `running` (hold)
- Clamp to `[1, configured_cap]`; admit iff `running < target cap`.
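
A self-contained sketch of the controller described above; the EWMA coefficients and step sizes are assumptions, not the tuned values from this PR:

```python
class ConcurrencyCapController:
    """Deadband controller over queued bytes, per the description above."""

    def __init__(self, configured_cap, k_dev=2.0, backoff=1, rampup=1,
                 alpha_fast=0.5, alpha_slow=0.05):
        self.configured_cap = configured_cap
        self.k_dev = k_dev            # K_DEV: deadband width in dev units
        self.backoff = backoff        # BACKOFF_FACTOR
        self.rampup = rampup          # RAMPUP_FACTOR
        self.alpha_fast = alpha_fast  # asymmetric EWMA: fast on increases
        self.alpha_slow = alpha_slow  # ...slow on decreases
        self.level = 0.0              # EWMA of total queued bytes
        self.dev = 0.0                # EWMA of |q - level_prev|

    def _ewma(self, prev, x):
        alpha = self.alpha_fast if x > prev else self.alpha_slow
        return prev + alpha * (x - prev)

    def target_cap(self, queued_bytes, running):
        level_prev = self.level
        self.level = self._ewma(self.level, queued_bytes)
        self.dev = self._ewma(self.dev, abs(queued_bytes - level_prev))
        lower = self.level - self.k_dev * self.dev
        upper = self.level + self.k_dev * self.dev
        if queued_bytes > upper:
            cap = running - self.backoff  # back off
        elif queued_bytes < lower:
            cap = running + self.rampup   # ramp up
        else:
            cap = running                 # hold
        return max(1, min(cap, self.configured_cap))

    def admit(self, queued_bytes, running):
        # Admit a new task iff running < target cap.
        return running < self.target_cap(queued_bytes, running)
```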


---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… in read-only mode (ray-project#58460)

This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).

## Description

Autoscaler v2 fails to report Prometheus metrics when operating in
read-only mode on KubeRay, with the following KeyError:

```
2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```

This happens because the `ReadOnlyProviderConfigReader` populates
`ctx.get_node_type_configs()` using node IDs as node types. That is correct
for local Ray, where `RAY_NODE_TYPE_NAME` is not set, but incorrect for
KubeRay, where `RAY_NODE_TYPE_NAME` is set and `ray_node_type_name` is
present and expected.

As a result, in read-only mode the scheduler sees a node type name (e.g.,
small-group) that does not exist in the populated configs.

This PR fixes the issue by using `ray_node_type_name` when it exists,
and only falling back to node ID when it does not.
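
A minimal sketch of the fallback (class and field names simplified from the autoscaler's actual types):

```python
from dataclasses import dataclass

@dataclass
class NodeState:  # simplified stand-in for the autoscaler's node state
    node_id: str
    ray_node_type_name: str = ""

def node_type_for(node: NodeState) -> str:
    # Prefer the KubeRay-provided type name; fall back to the node ID for
    # local Ray, where RAY_NODE_TYPE_NAME is unset.
    return node.ray_node_type_name or node.node_id

assert node_type_for(NodeState("abc123", "small-group")) == "small-group"
assert node_type_for(NodeState("abc123")) == "abc123"
```
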
## Related issues
Fixes ray-project#58227

Signed-off-by: Rueian <[email protected]>
…cess: bool (ray-project#58384)

## Description
Pass in `status_code` directly into `do_reply`. This is a follow up to
ray-project#58255
## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #673 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5283.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is an automated daily merge from the master branch into the main branch. It incorporates a wide array of changes across the project's infrastructure, including substantial updates to the Bazel build system, modernization of CI/CD pipelines, and enhancements to code quality tooling. The changes aim to streamline development workflows, improve build reliability, and maintain high code standards.

Highlights

- Bazel Build System Refactoring: The Bazel build system has undergone significant refactoring, including enabling --incompatible_strict_action_env by default, adding platform-specific Redis binaries, and introducing a more modular approach to RPC definitions and Python package generation. This aims to improve build efficiency and consistency.
- CI/CD Pipeline Modernization: Extensive updates have been made to the Buildkite CI/CD pipelines. This includes new image build steps, Python version updates (e.g., adding Python 3.10 to various jobs), and a transition from miniconda to miniforge in Docker environments, enhancing CI reliability and build processes.
- Code Quality and Linting Enhancements: New pre-commit hooks for semgrep, vale, cython-lint, and eslint have been integrated. Existing Python and C++ linting rules have been updated to enforce stricter code quality standards and improve overall code maintainability.
- C++ API and Runtime Adjustments: The C++ API has been refined with consistent naming conventions for remote function holders and metric tags. Updates to network utility functions for IP address resolution and address building have also been implemented, contributing to a more robust C++ core.
- Documentation and Style Guide Updates: The pull request template has been modernized, and .readthedocs.yaml configurations have been updated. New .vale style rules have been added to align with updated project standards, ensuring better documentation consistency and quality.

Ignored Files

- Ignored by pattern: .gemini/** (1)
  - .gemini/config.yaml
- Ignored by pattern: .github/workflows/** (1)
  - .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge that incorporates a massive set of refactoring and improvement changes from the master branch. Key changes include a significant overhaul of the Bazel build system for better modularity and hermeticity, a major refactoring of the Buildkite CI pipelines to improve organization and test coverage, and the introduction of a new dependency management tool raydepsets. The C++ code has been modernized, and numerous configurations for linting, code ownership, and Docker builds have been updated and improved. Overall, these changes represent a substantial step forward in the project's build system, CI infrastructure, and code quality practices. The changes appear to be well-executed and beneficial.

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 25, 2025