daily merge: master → main 2025-11-10 #673
base: main
Conversation
…oject#57948) There's a potential risk in the current midpoint calculation: it can be wrong when values are negative. Line 167: `lower_bound + buckets[0] / 2.0` Line 171: `(buckets[i] + buckets[i - 1]) / 2.0` I improved the formula and added a test to make sure it works. Signed-off-by: justwph <[email protected]>
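A minimal sketch of the safer midpoint form, assuming `lower_bound` is the lowest bucket boundary and `buckets` holds the upper boundaries (names taken from the quoted lines; the actual fix in the PR may differ):

```python
def bucket_midpoint(lower: float, upper: float) -> float:
    # Writing the midpoint as lower + (upper - lower) / 2.0 keeps the result
    # inside [lower, upper] even when both boundaries are negative, whereas
    # lower + upper / 2.0 can fall outside the bucket.
    return lower + (upper - lower) / 2.0

# Hypothetical usage mirroring the two quoted lines:
# midpoints = [bucket_midpoint(lower_bound, buckets[0])] + [
#     bucket_midpoint(buckets[i - 1], buckets[i]) for i in range(1, len(buckets))
# ]
```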
## Description See ray-project#57924 ## Related issues Fixes ray-project#57924 ## Additional information - [x] Verify by manual testing -> confirmed with example using `write_numpy` Signed-off-by: kyuds <[email protected]>
## Why are these changes needed? With ray-project#55207 Ray Train now has support for training functions with a JAX backend through the new `JaxTrainer` API. This guide provides a short overview of the API, how to configure with TPUs, and how to edit a JAX script to use Ray Train. TODO: I will link a longer e2e guide with KubeRay, MaxText, and the JaxTrainer on TPUs in GKE --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <[email protected]> Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Currently RDT object metadata will stick around on the driver. This can lead to some weird behavior, like the src actor living past expectations because the actor handle sticks around in the metadata. I'm now freeing the metadata on the owner whenever we decide to tell the primary copy owner to free (when the ref counter decides it's "OutOfScope"). https://github.com/ray-project/ray/blob/d8b7a54be63638924369ac5c7d5e671a23f151a7/python/ray/experimental/gpu_object_manager/gpu_object_manager.py#L29-L38 A couple of TODOs for the future: - Currently not freeing metadata on borrowers when you use NIXL to put -> get. - We currently don't support lineage reconstruction for RDT objects; once we do, we may have to change the metadata free to happen on ref deletion rather than on OutOfScope. Or have some metadata recreation path? --------- Signed-off-by: dayshah <[email protected]>
run it continuously to capture potential issues Signed-off-by: Kevin H. Luu <[email protected]>
…project#57947) Creating a depset for doc builds. Pinning pydantic==2.50 for doc requirements because api_policy_check is failing due to dependencies (failure below). Installing ray without deps. Failure due to dependencies: https://buildkite.com/ray-project/premerge/builds/52231#019a04a9-e9f8-4151-b5e5-d4bceb48a3cc Successful api_policy_check run on this PR: https://buildkite.com/ray-project/microcheck/builds/29312#019a05b8-ef40-4449-9ca2-6dd9d8e790a7 --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
## Description Change `train_colocate_trainer` release test frequency to be manual. ## Related issues Related to ray-project#49454. ## Additional information `ScalingConfig.trainer_resources` has been deprecated in Ray Train V2. As a result, we should disable the test for now. In the future, we can either: 1. Delete this test entirely. 2. Add back functionality for colocation & reenable this test. Signed-off-by: Matthew Deng <[email protected]>
…or (ray-project#57869) # Summary We observed that whenever `after_worker_group_poll_status` raised an exception, the Train Run would fail ungracefully and show up as `ABORTED` in the dashboard. This happened in the following situations: 1) Different workers report remote checkpoints with different paths -> `(TrainController pid=46993) RuntimeError: The storage path of the checkpoints in the training results is not the same. This means the checkpoints are not consistent. Got a mix of the following checkpoint paths: {'/tmp/tmpl95kv7ax', '/tmp/tmp__8e6etk'}` -> `ABORTED` Train Run 2) `ray.train.report({"loss": ...}, checkpoint=checkpoint)` in `train_func` -> `TypeError: Object of type 'ellipsis' is not JSON serializable` in `CheckpointManager._save_state` -> `ABORTED` Train Run This PR catches these exceptions, wraps them in a `ControllerError`, and goes through the `FailurePolicy`, ultimately resulting in an `ERRORED` Train Run, which is more intuitive because it happened due to an error in the training workers (`The Train run failed due to an error in the training workers.` is the comment associated with `RunStatus.ERRORED`). I considered implementing a more general solution that caught all `WorkerGroupCallback` errors and resurfaced them as `ControllerError`s, but decided against it because: * Callbacks occur in many different places and we might want to add custom try/catch logic in each case. * `after_worker_group_poll_status` is the only offender so far and most of its errors are from user mistakes; other callback errors could be legitimate bugs that should result in `ABORTED` # Testing Unit tests --------- Signed-off-by: Timothy Seah <[email protected]>
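The behavior described above boils down to something like the following sketch (illustrative only, not Ray Train's actual code; `ControllerError` is a stand-in for the internal error type):

```python
class ControllerError(Exception):
    """Stand-in for Ray Train's internal controller error type."""


def poll_status_with_failure_policy(callback, status):
    # Errors raised by after_worker_group_poll_status are wrapped so they flow
    # through the FailurePolicy and surface as an ERRORED run instead of ABORTED.
    try:
        return callback.after_worker_group_poll_status(status)
    except Exception as exc:
        raise ControllerError(exc) from exc
```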
## Description Fix release tests that are still importing from `ray.train` to import from `ray.tune`, as described in ray-project#49454. ## Additional information Test Runs:
| Test Name | Before | After |
| --- | --- | --- |
| `cluster_tune_scale_up_down.aws` | https://buildkite.com/ray-project/release/builds/64733#019a0546-dbe2-47b1-ba85-983d48098352 | https://buildkite.com/ray-project/release/builds/64809/steps/canvas?sid=019a0804-dc62-4177-a24a-b23806cd0d51 |
| `cluster_tune_scale_up_down.kuberay` | https://buildkite.com/ray-project/release/builds/64733#019a0549-25dd-4dd3-ba84-df9d3ce5a72a | https://buildkite.com/ray-project/release/builds/64809/steps/canvas?sid=019a0804-dc63-49b3-8efc-ff3bd5fb8d28 |
| `tune_worker_fault_tolerance` | https://buildkite.com/ray-project/release/builds/64734#019a058c-2f15-401c-88d9-894002152fff | https://buildkite.com/ray-project/release/builds/64845/steps/canvas?sid=019a08aa-6c85-43d4-83e0-bf56d49ab474 |
--------- Signed-off-by: Matthew Deng <[email protected]>
Since we are now logging into Azure using a certificate, these environment variables are no longer used. Signed-off-by: Kevin H. Luu <[email protected]>
…API (ray-project#57977) ## Summary This PR updates the document embedding benchmark to use the canonical Ray Data implementation pattern, following best practices for the framework.
## Key Changes
### Use `download()` expression instead of separate materialization
**Before:**
```python
file_paths = (
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .take_all()
)
file_paths = [row["uploaded_pdf_path"] for row in file_paths]
ds = ray.data.read_binary_files(file_paths, include_paths=True)
```
**After:**
```python
(
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .with_column("bytes", download("uploaded_pdf_path"))
)
```
This change: - Eliminates the intermediate materialization with `take_all()`, which loads all data into memory - Uses the `download()` expression to lazily fetch file contents as part of the pipeline - Removes the need for a separate `read_binary_files()` call
### Method chaining for cleaner code
All operations are now chained in a single pipeline, making the data flow clearer and more idiomatic.
### Consistent column naming
Updated references from `path` to `uploaded_pdf_path` throughout the code for consistency with the source data schema. Signed-off-by: Balaji Veeramani <[email protected]>
This PR addresses several failing release tests, likely due to the recent Ray Train V2 default enablement. The following failing release tests are addressed: - huggingface_transformers - distributed_training.regular - distributed_training.chaos
distributed_training fix: `distributed_training.regular` and `distributed_training.chaos` were failing because they relied on the deprecated free-floating metrics reporting functionality. The tests attempted to access a nonexistent key in `result.metrics` that was never reported. The fix uploads a checkpoint to ensure this key exists.
huggingface_transformers: The `huggingface_transformers` test was failing due to outdated accelerate and peft versions. The fix leverages a post-build file to ensure the proper accelerate and peft versions.
Tests:
| Test Name | Before | After |
| -- | -- | -- |
| huggingface_transformers | https://buildkite.com/ray-project/release/builds/64733#019a0559-25da-401f-8d7e-3128b8f7d287 | https://buildkite.com/ray-project/release/builds/64888#019a090d-5f53-4f7d-b0ac-ac8cf7c529b6 |
| distributed_training.regular | https://buildkite.com/ray-project/release/builds/64733#019a0572-1095-4b6f-b3bc-b496227c9280 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b5-41b6-aaf6-2bbd443a0a1e |
| distributed_training.chaos | https://buildkite.com/ray-project/release/builds/64733#019a0574-3862-4da2-a264-a9e11333bd72 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b6-4344-90f2-cbcd637aae3d |
--------- Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
num_waiter == 0 does not necessarily mean that the request has been completed. --------- Signed-off-by: abrar <[email protected]>
…es (ray-project#57883) This PR adds a test to verify that DataOpTask handles node failures correctly during execution. To enable this testing, callback seams are added to DataOpTask that allow tests to simulate preemption scenarios by killing and restarting nodes at specific points during task execution. ## Summary - Add callback seams (`block_ready_callback` and `metadata_ready_callback`) to `DataOpTask` for testing purposes - Add `has_finished` property to track task completion state - Create `create_stub_streaming_gen` helper function to simplify test setup - Refactor existing `DataOpTask` tests to use the new helper function - Add new parametrized test `test_on_data_ready_with_preemption` to verify behavior when nodes fail during execution ## Test plan - Existing tests pass with refactored code - New preemption test validates that `on_data_ready` handles node failures correctly by testing both block and metadata callback scenarios --------- Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description 1. `mlflow.start_run()` does not have a `tracking_uri` arg: https://mlflow.org/docs/latest/api_reference/python_api/mlflow.html#mlflow.start_run 2. Rewrite the mlflow setup as follows:
```
mlflow.set_tracking_uri(uri="file://some_shared_storage_path/mlruns")
mlflow.set_experiment("my_experiment")
mlflow.start_run()
```
## Related issues N/A --------- Signed-off-by: Lehui Liu <[email protected]>
…t#57855) ## Description 1. Add visitors for collecting column names from all expressions and renaming names across the tree. 2. Use expressions for `rename_columns`, `with_column`, `select_columns`, and remove `cols` and `cols_rename` in `Project`. 3. Modify projection pushdown to work correctly with combinations of the above operators. ## Related issues Closes ray-project#56878, ray-project#57700 Signed-off-by: Goutam <[email protected]>
…ject#55291) Resolves: ray-project#55288 (wrong `np.array` in `TensorType`) Further changes: - Changed comments to (semi)docstrings, which will be displayed as tooltips by IDEs (e.g. VSCode + Pylance), making that information available to the user. - `AgentID: Any -> Hashable` as it is used for dict keys - Changed `DeviceType` to not be a TypeVar (makes no sense in the way it is currently used); also includes DeviceLikeType (`int | str | device`) from `torch`. IMO it can fully replace the current type, but being defensive I only added it as an extra possible type - Used the updated DeviceType to improve the type of Runner._device and make it more correct - Used torch's own type in `data`; the current code supports more than just `str`. I refrained from adding a reference to `rllib` despite it being nice if they would be in sync. - Some extra formatting that is forced by pre-commit --- > [!NOTE] > Revamps `rllib.utils.typing` (NDArray-based `TensorType`, broader `DeviceType`, `AgentID` as `Hashable`, docstring cleanups) and updates call sites to use optional device typing and improved hints. > - **Types**: > - Overhaul `rllib/utils/typing.py`: > - `TensorType` now uses `numpy.typing.NDArray`; heavy use of `TYPE_CHECKING` to avoid runtime deps on torch/tf/jax. > - `DeviceType` widened to `Union[str, torch.device, int]` (was `TypeVar`). > - `AgentID` tightened to `Hashable`; `NetworkType` uses `keras.Model`. > - Refined aliases (e.g., `FromConfigSpec`, `SpaceStruct`) and added concise docstrings. > - **Runners**: > - `Runner._device` now `Optional` (`Union[DeviceType, None]`) with updated docstring; same change in offline runners' `_device` properties. > - **Connectors**: > - `NumpyToTensor`: `device` param typed as `Optional[DeviceType]` (via `TYPE_CHECKING`). > - **Utils**: > - `from_config`: typed `config: Optional[FromConfigSpec]` with `TYPE_CHECKING` import. > - **Misc**: > - Minor formatting/import ordering and comment typo fixes. --------- Signed-off-by: Daniel Sperber <[email protected]> Signed-off-by: Daraan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]>
…57993) Although Spark-on-Ray depends on the Java bindings, the `java` tests are triggered by all C++ changes, and we don't want to run Spark-on-Ray tests every time we change C++ code. --------- Signed-off-by: Edward Oakes <[email protected]>
…ay-project#57771) Signed-off-by: Nikhil Ghosh <[email protected]>
…ay-project#57987) Signed-off-by: daiping8 <[email protected]>
…ect#57974) This PR replaces STATS with Metric as the way to define metrics inside Ray (as a unification effort) in all object-manager components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, in this case, because the object manager is used as an API across components, doing so would feel unnecessarily cumbersome. I decided to define the metrics inline within each client and server class instead. Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities. **Details** Full context of this refactoring work: - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
## Description Use `tune.report` instead of `train.report`. Signed-off-by: Matthew Deng <[email protected]>
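A minimal sketch of the substitution, assuming the dict-style `report` signature that Ray Tune mirrors from Ray Train (the exact call sites being fixed may differ):

```python
from ray import tune


def objective(config):
    score = config["x"] ** 2
    # Previously written as ray.train.report({...}); inside a Tune trainable
    # the metrics should go through ray.tune.report instead.
    tune.report({"score": score})


tuner = tune.Tuner(objective, param_space={"x": tune.grid_search([1, 2, 3])})
tuner.fit()
```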
…t#57620) ## Why are these changes needed? This will be used to help control the targets that are returned. Signed-off-by: akyang-anyscale <[email protected]>
## Description This PR adds a new check to make sure proxies are ready to serve traffic before finishing serve.run. For now, the check immediately finishes. --------- Signed-off-by: akyang-anyscale <[email protected]>
…roject#57793) When deploying Ray on YARN using Skein, it's useful to expose Ray's dashboard via Skein's web UI. This PR shows how to expose it and updates the related document. Signed-off-by: Zakelly <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…cgroup even if they are drivers (ray-project#57955) For more details about the resource isolation project see ray-project#54703. Driver processes that are registered in ray's internal namespace (such as ray dashboard's job and serve modules) are considered system processes. Therefore, they will not be moved into the workers cgroup when they register with the raylet. --------- Signed-off-by: irabbani <[email protected]>
…t stats (ray-project#58422) ## Why These Changes Are Needed This PR adds a new metric to track the time spent retrieving `RefBundle` objects during dataset iteration. This metric provides better visibility into the performance breakdown of batch iteration, specifically capturing the time spent in `get_next_ref_bundle()` calls within the `prefetch_batches_locally` function. ## Related Issue Number N/A ## Example
```
dataloader/train = {'producer_throughput': 8361.841782656593, 'iter_stats': {'prefetch_block-avg': inf, 'prefetch_block-min': inf, 'prefetch_block-max': 0, 'prefetch_block-total': 0, 'get_ref_bundles-avg': 0.05172277254545271, 'get_ref_bundles-min': 1.1991999997462699e-05, 'get_ref_bundles-max': 11.057470971999976, 'get_ref_bundles-total': 15.361663445999454, 'fetch_block-avg': 0.31572694455743233, 'fetch_block-min': 0.0006362799999806157, 'fetch_block-max': 2.1665870369999993, 'fetch_block-total': 93.45517558899996, 'block_to_batch-avg': 0.001048687573988573, 'block_to_batch-min': 2.10620000302697e-05, 'block_to_batch-max': 0.049948245999985375, 'block_to_batch-total': 2.048086831999683, 'format_batch-avg': 0.0001013781433686053, 'format_batch-min': 1.415700000961806e-05, 'format_batch-max': 0.009682661999988795, 'format_batch-total': 0.19799151399888615, 'collate-avg': 0.01303446213312943, 'collate-min': 0.00025646699998560507, 'collate-max': 0.9855495820000328, 'collate-total': 25.456304546001775, 'finalize-avg': 0.012211385266257683, 'finalize-min': 0.004209667999987232, 'finalize-max': 0.3785081949999949, 'finalize-total': 23.848835425001255, 'time_spent_blocked-avg': 0.04783407008137157, 'time_spent_blocked-min': 1.2316999971062614e-05, 'time_spent_blocked-max': 12.46102861700001, 'time_spent_blocked-total': 93.46777293900004, 'time_spent_training-avg': 0.015053571562211652, 'time_spent_training-min': 1.3704999958008557e-05, 'time_spent_training-max': 1.079616685000019, 'time_spent_training-total': 29.399625260999358}}
```
Testing Strategy: - [x] Unit tests --------- Signed-off-by: xgui <[email protected]> Signed-off-by: Xinyuan <[email protected]>
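As a rough illustration of what such a metric captures (hypothetical names; the real instrumentation lives inside Ray Data's iterator code):

```python
import time


def timed_get_ref_bundle(get_next_ref_bundle, stats):
    # Accumulate wall-clock time spent waiting on the next RefBundle,
    # mirroring the get_ref_bundles-* entries in the example stats above.
    start = time.perf_counter()
    bundle = get_next_ref_bundle()
    elapsed = time.perf_counter() - start
    stats["get_ref_bundles-total"] = stats.get("get_ref_bundles-total", 0.0) + elapsed
    stats["get_ref_bundles-max"] = max(stats.get("get_ref_bundles-max", 0.0), elapsed)
    return bundle
```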
## Description When token auth is enabled, the dashboard prompts the user to enter the valid auth token and caches it (as a browser cookie). When token-based auth is disabled, existing behaviour is retained. All dashboard-UI RPCs to the Ray cluster set the authorization header in their requests. ## Screenshots token popup <img width="3440" height="2146" alt="image" src="https://github.com/user-attachments/assets/004c23a3-991e-4a2c-a2ad-5a0ce2e60893" /> on entering an invalid token <img width="3440" height="2146" alt="image" src="https://github.com/user-attachments/assets/7183a798-ceb7-4657-8706-39ce5fe8e61e" /> --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…ants (ray-project#57910) 1. **Remove direct environment variable access patterns** - Replace all instances of `os.getenv("RAY_enable_open_telemetry") == "1"` - Standardize to use `ray_constants.RAY_ENABLE_OPEN_TELEMETRY` consistently throughout the codebase 2. **Unify default value format for RAY_enable_open_telemetry** - Standardize the default value to `"true"` | `"false"` - Previously, the codebase had mixed usage of `"1"` and `"true"`, which is now unified 3. **Backward compatibility maintained** - Carefully verified that the existing `RAY_ENABLE_OPEN_TELEMETRY` constant properly handles both `"1"` and `"true"` values - This change will not introduce any breaking behavior - The `env_bool` helper function already supports both formats:
```python
RAY_ENABLE_OPEN_TELEMETRY = env_bool("RAY_enable_open_telemetry", False)

def env_bool(key, default):
    if key in os.environ:
        return (
            True
            if os.environ[key].lower() == "true" or os.environ[key] == "1"
            else False
        )
    return default
```
--- Most of the current code uses: `RAY_enable_open_telemetry: "1"` A smaller portion (not zero) uses: `RAY_enable_open_telemetry: "true"` https://github.com/ray-project/ray/blob/fe7ad00f9720a722fde5fecba5bb681234bcdb63/python/ray/tests/test_metrics_agent.py#L497 My personal preference is "true": it's concise and unambiguous. If it's "1", I have to think/guess whether it means "true" or "false". --------- Signed-off-by: justwph <[email protected]>
…y-project#58217) Change the unit of `scheduler_placement_time` from seconds to milliseconds. The current bucket range of 0.1s to 2.5 hours doesn't make sense; according to a sample of data, the range we are interested in spans from microseconds to seconds. Thanks @ZacAttack for pointing this out.
```
Note: This is an internal (non-public-facing) metric, so we only need to update its usage within Ray (e.g., the dashboard). A simple code change should suffice.
```
<img width="1609" height="421" alt="505491038-c5d81017-b86c-406f-acf4-614560752062" src="https://github.com/user-attachments/assets/cc647b97-42ec-42eb-bf01-4d1867940207" /> Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
…s in the Raylet (ray-project#58342) Found it very hard to parse what was happening here, so helping future me (or you!). Also: - Deleted vestigial `next_resource_seq_no_`. - Converted from non-monotonic clock to a monotonically incremented `uint64_t` for the version number for commands. - Added logs when we drop messages with stale versions. --------- Signed-off-by: Edward Oakes <[email protected]>
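A small sketch of the stale-message handling idea (illustrative Python with made-up names; the actual change is in the Raylet's C++ code):

```python
class CommandStream:
    """Drop commands that arrive with a version older than what was already applied."""

    def __init__(self):
        self._version = 0       # monotonically incremented, not a wall-clock timestamp
        self._last_applied = 0

    def next_version(self) -> int:
        self._version += 1
        return self._version

    def apply(self, version: int, command) -> bool:
        if version <= self._last_applied:
            # Log and drop stale commands instead of applying them out of order.
            print(f"Dropping stale command: version {version} <= {self._last_applied}")
            return False
        self._last_applied = version
        command()
        return True
```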
## Description There was a typo ## Related issues N/A ## Additional information N/A Signed-off-by: Daniel Shin <[email protected]>
be consistent with the CI base env specified in `--build-name` Signed-off-by: Lonnie Liu <[email protected]>
getting ready to run things on python 3.10 Signed-off-by: Lonnie Liu <[email protected]>
…tion on a single node (ray-project#58456) ## Description Currently, finalization is scheduled in batches sequentially, i.e., a batch of N adjacent partitions is finalized at once (in a sliding window). This creates a lensing effect since: 1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators j and j+1 (since membership is determined as j = i % num_aggregators) 2. Adjacent aggregators have a high likelihood of getting scheduled on the same node (due to similarly being scheduled at about the same time in sequence) To address that, this change applies random sampling when choosing the next partitions to finalize, to make sure partitions are chosen uniformly, reducing concurrent finalization of adjacent partitions (see the sketch below). --------- Signed-off-by: Alexey Kudinkin <[email protected]>
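A minimal sketch of the idea, with hypothetical names (not the actual shuffle-finalization scheduler code):

```python
import random


def pick_finalization_batch(pending: list[int], batch_size: int) -> list[int]:
    # A sliding window would finalize adjacent partitions together, which lands
    # on adjacent aggregators (j = i % num_aggregators) and therefore often on
    # the same node. Uniform random sampling spreads concurrent finalizations.
    batch = random.sample(pending, min(batch_size, len(pending)))
    for partition in batch:
        pending.remove(partition)
    return batch
```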
## Description Making NotifyGCSRestart RPC fault tolerant and idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber, making idempotency harder to understand, and there was dead code for one of the resubscribes, so I did a minor cleanup. Added a Python integration test to verify retry behavior; left out the C++ test since on the raylet side there's nothing to test, as it's just making a gcs_client RPC call. --------- Signed-off-by: joshlee <[email protected]>
…ct#58445) ## Summary Creates a dedicated `tests/unit/` directory for unit tests that don't require Ray runtime or external dependencies. ## Changes - Created `tests/unit/` directory structure - Moved 13 pure unit tests to `tests/unit/` - Added `conftest.py` with fixtures to prevent `ray.init()` and `time.sleep()` - Added `README.md` documenting unit test requirements - Updated `BUILD.bazel` to run unit tests with "small" size tag ## Test Files Moved 1. test_arrow_type_conversion.py 2. test_block.py 3. test_block_boundaries.py 4. test_data_batch_conversion.py 5. test_datatype.py 6. test_deduping_schema.py 7. test_expression_evaluator.py 8. test_expressions.py 9. test_filename_provider.py 10. test_logical_plan.py 11. test_object_extension.py 12. test_path_util.py 13. test_ruleset.py These tests are fast (<1s each), isolated (no Ray runtime), and deterministic (no time.sleep or randomness). --------- Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
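A plausible shape for such fixtures, sketched under the assumption that the guards are implemented as autouse pytest fixtures (the actual `tests/unit/conftest.py` may differ):

```python
import pytest


@pytest.fixture(autouse=True)
def _forbid_ray_init(monkeypatch):
    # Unit tests must not start a Ray runtime.
    import ray

    def _fail(*args, **kwargs):
        raise RuntimeError("ray.init() is not allowed in tests/unit/")

    monkeypatch.setattr(ray, "init", _fail)


@pytest.fixture(autouse=True)
def _forbid_sleep(monkeypatch):
    # Keep unit tests fast and deterministic.
    import time

    def _fail(*args, **kwargs):
        raise RuntimeError("time.sleep() is not allowed in tests/unit/")

    monkeypatch.setattr(time, "sleep", _fail)
```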
## Description ### [Data] Concurrency Cap Backpressure tuning - Maintain an asymmetric EWMA of total queued bytes (this op + downstream) as the typical level: `level`. - Maintain an asymmetric EWMA of the absolute residual vs. the previous level as a scale proxy: `dev = EWMA(|q - level_prev|)`. - Define a deadband: `[lower, upper] = [level - K_DEV * dev, level + K_DEV * dev]`. If q > upper -> target cap = running - BACKOFF_FACTOR (back off). If q < lower -> target cap = running + RAMPUP_FACTOR (ramp up). Else -> target cap = running (hold). - Clamp to [1, configured_cap]; admit iff running < target cap. A sketch of this policy follows below. --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Srinath Krishnamachari <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
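A runnable sketch of the deadband logic described above, with illustrative constants and state handling (not Ray Data's actual backpressure-policy code):

```python
def admit_new_task(q_bytes, running, configured_cap, state,
                   alpha_up=0.3, alpha_down=0.05, k_dev=2.0,
                   backoff_factor=1, rampup_factor=1):
    """Return True if another task may start, per the deadband policy above."""
    # Asymmetric EWMA of queued bytes: react faster to increases than decreases.
    prev_level = state.setdefault("level", float(q_bytes))
    alpha = alpha_up if q_bytes > prev_level else alpha_down
    state["level"] = prev_level + alpha * (q_bytes - prev_level)

    # Asymmetric EWMA of the absolute residual vs. the previous level (scale proxy).
    residual = abs(q_bytes - prev_level)
    prev_dev = state.setdefault("dev", 0.0)
    alpha_d = alpha_up if residual > prev_dev else alpha_down
    state["dev"] = prev_dev + alpha_d * (residual - prev_dev)

    lower = state["level"] - k_dev * state["dev"]
    upper = state["level"] + k_dev * state["dev"]

    if q_bytes > upper:
        target_cap = running - backoff_factor   # back off
    elif q_bytes < lower:
        target_cap = running + rampup_factor    # ramp up
    else:
        target_cap = running                    # hold

    target_cap = max(1, min(configured_cap, target_cap))
    return running < target_cap
```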
Signed-off-by: Nikhil Ghosh <[email protected]>
… in read-only mode (ray-project#58460) This ensures node type names are correctly reported even when the autoscaler is disabled (read-only mode). ## Description Autoscaler v2 fails to report prometheus metrics when operating in read-only mode on KubeRay, with the following KeyError:
```
2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```
This happens because the `ReadOnlyProviderConfigReader` populates `ctx.get_node_type_configs()` using node IDs as node types, which is correct for local Ray (where `RAY_NODE_TYPE_NAME` is not set) but incorrect for KubeRay, where `ray_node_type_name` is present and `RAY_NODE_TYPE_NAME` is expected to be set. As a result, in read-only mode the scheduler sees a node type name (e.g., small-group) that never exists in the populated configs. This PR fixes the issue by using `ray_node_type_name` when it exists, and only falling back to the node ID when it does not. ## Related issues Fixes ray-project#58227 Signed-off-by: Rueian <[email protected]>
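The fallback boils down to something like the following (a hypothetical helper, not the actual `ReadOnlyProviderConfigReader` code):

```python
def node_type_key(node_id: str, ray_node_type_name: str | None) -> str:
    # Prefer the node type name reported via RAY_NODE_TYPE_NAME (set on KubeRay,
    # e.g. "small-group"); fall back to the node ID for local Ray, where it is unset.
    return ray_node_type_name if ray_node_type_name else node_id
```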
…cess: bool (ray-project#58384) ## Description Pass `status_code` directly into `do_reply`. This is a follow-up to ray-project#58255. --------- Signed-off-by: iamjustinhsu <[email protected]>
The pull request #673 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5283.
Summary of Changes
Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request is an automated daily merge from the `master` branch into `main`.
Highlights
Ignored Files
Code Review
This pull request is an automated daily merge that incorporates a massive set of refactoring and improvement changes from the master branch. Key changes include a significant overhaul of the Bazel build system for better modularity and hermeticity, a major refactoring of the Buildkite CI pipelines to improve organization and test coverage, and the introduction of a new dependency management tool raydepsets. The C++ code has been modernized, and numerous configurations for linting, code ownership, and Docker builds have been updated and improved. Overall, these changes represent a substantial step forward in the project's build system, CI infrastructure, and code quality practices. The changes appear to be well-executed and beneficial.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
Created: 2025-11-10
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.