
Conversation

@antfin-oss

This pull request was created automatically to merge the latest changes from the master branch into the main branch.

πŸ“… Created: 2025-10-24
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

HenryL27 and others added 30 commits October 10, 2025 09:44

## Why are these changes needed?

See ray-project#57226. I got my environment working; it was accidentally running Python 3.13.


## Related issue number

Solves ray-project#57226


## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Henry Lindeman <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…ray-project#57541)

`test_api.py::test_max_constructor_retry_count` was failing on Windows.

I tried expanding the timeout on wait_on_condition at the last part of the
test to 20s-40s and added a debug statement to check how far the counter
increments. The value varies from run to run, but I observed 9-12, never
reaching 13.

After some digging, it seems our Ray actor worker processes are created by
forking on Linux, while Windows uses `CreateProcessA`, which builds each
process from scratch rather than forking. This difference causes the
counter to grow more slowly on Windows, IIUC. The Windows call to
`CreateProcessA` is available
[here](https://github.com/ray-project/ray/blob/1296dc4699a3c1681fe3de6dd9f63af51d287582/src/ray/util/process.cc#L171),
and the forking path for Linux is available here.

Hence, the solution is to reduce the test's resource requirements by
launching 3 replicas instead of 4 and attempting fewer retries, so that the
test passes on both Linux and Windows.

---------

Signed-off-by: doyoung <[email protected]>
…#57535)

Part 1 of ray-project#56149.

1. Move `_serialized_policy_def` into `AutoscalingPolicy` from
`AutoscalingConfig`. We need this in order to reuse `AutoscalingPolicy`
for application-level autoscaling.
2. Make `autoscaling_policy` a top-level config in
`ServeApplicationSchema`.

---------

Signed-off-by: abrar <[email protected]>
1. Add docs under advanced autoscaling.
2. Promote `autoscaling_context` to a public API.

---------

Signed-off-by: abrar <[email protected]>

## Why are these changes needed?

Add `_unresolved_paths` for file based datasources for lineage tracking
capabilities.


## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam <[email protected]>

## Why are these changes needed?
Adds histograms for block completion time, task completion time, and block size.

Note that to calculate the block completion time histogram, we must
approximate, because we only measure the task completion time. To
approximate, we assume each block took an equal amount of time within a
task and split the task's time evenly among its blocks.
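As a rough sketch of the even-split approximation (the bucket boundaries and helper names below are illustrative assumptions, not the actual `OpRuntimeMetrics` implementation):

```python
from bisect import bisect_right

BUCKETS_S = [0.1, 1, 10, 60, 300]  # assumed bucket boundaries, for illustration only

def find_bucket_index(value, boundaries):
    # Index of the bucket that `value` falls into; the last bucket catches overflow.
    return bisect_right(boundaries, value)

def observe_task(task_duration_s, block_sizes_bytes, histogram):
    # Approximate per-block completion time by splitting the task's wall time
    # evenly across the blocks it produced.
    per_block_s = task_duration_s / max(len(block_sizes_bytes), 1)
    for _ in block_sizes_bytes:
        histogram[find_bucket_index(per_block_s, BUCKETS_S)] += 1

hist = [0] * (len(BUCKETS_S) + 1)
observe_task(12.0, [1 << 20] * 4, hist)  # one 12s task that produced 4 blocks
print(hist)  # each block is counted as ~3s
```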

<img width="1342" height="316" alt="Screenshot 2025-09-28 at 1 14 33β€―PM"
src="https://github.com/user-attachments/assets/baf1e9c3-26a2-48ce-92e4-3299d698ddaf"
/>
<img width="1359" height="321" alt="Screenshot 2025-09-28 at 1 14 52β€―PM"
src="https://github.com/user-attachments/assets/84c3c7a4-2631-4626-9677-d947d1afb112"
/>





## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduce histogram metrics for task/block completion time and block
sizes, wire them through export, and add Grafana bar chart panels to
visualize these distributions.
> 
> - **Ray Data metrics**:
>   - Add histogram metrics in `OpRuntimeMetrics`:
> - `task_completion_time`, `block_completion_time`, `block_size_bytes`,
`block_size_rows` with predefined bucket boundaries and thread-safe
updates.
> - New helpers `find_bucket_index` and bucket constants; support reset
via `as_dict(..., reset_histogram_metrics=True)`.
> - **Stats pipeline**:
> - `_StatsManager` now snapshots and resets histogram metrics before
sending to `_StatsActor` (both periodic and force updates).
> - `_StatsActor.update_execution_metrics` handles histogram lists by
observing per-bucket values.
> - **Dashboard/Grafana**:
> - Add `HISTOGRAM_BAR_CHART` target and `BAR_CHART` panel templates in
`dashboards/common.py`.
> - Replace `Task Completion Time` with histogram bar chart; add new
panels: `Block Completion Time Histogram (s)`, `Block Size (Bytes)
Histogram`, `Block Size (Rows) Histogram` in `data_dashboard_panels.py`
and include them in `Outputs` and `Tasks` rows.
>   - Normalize time units from `seconds` to `s` for several panels.
> - **Factory**:
> - Panel generation respects template-specific settings (no behavioral
change beyond using new templates).
> - **Tests**:
> - Add tests for histogram initialization, bucket indexing, and
counting for task/block durations and block sizes in
`test_op_runtime_metrics.py`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
7143039. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Alan Guo <[email protected]>
Adding workspace funcs to parse all configs
not currently used in raydepsets

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ect#57629)

## Why are these changes needed?

We seem to only make this mistake in our docs. See here:
https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/router.py#L545

This part:
```python
            metrics.Gauge(
                "serve_deployment_queued_queries",
                description=(
                    "The current number of queries to this deployment waiting"
                    " to be assigned to a replica."
                ),
                tag_keys=("deployment", "application", "handle", "actor_id"),
            )
```



## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Marwan Sarieddine <[email protected]>
…roject#57567)


## Why are these changes needed?


`test_hanging_detector_detects_issues` checks that Ray Data emits a log
if one task takes a lot longer than the others. The issue is that the
test doesn't capture the log output correctly, so the test fails
even though Ray Data correctly emits the log.

To make this test more robust, this PR uses pytest's `caplog` fixture to
capture the logs rather than a bespoke handler.

```
[2025-10-08T09:00:41Z] >           assert hanging_detected, log_output
[2025-10-08T09:00:41Z] E           AssertionError:
[2025-10-08T09:00:41Z] E           assert False
[2025-10-08T09:00:41Z]
[2025-10-08T09:00:41Z] python/ray/data/tests/test_issue_detection_manager.py:153: AssertionError
```
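For reference, a minimal sketch of the `caplog` pattern this PR switches to (the logger name and the warning message are placeholders, not the real test body):

```python
import logging

def run_pipeline_with_one_slow_task():
    # Placeholder for the real test body: the hanging detector would emit a
    # warning like this when one task lags far behind the rest.
    logging.getLogger("ray.data").warning("Operator Map appears to be hanging")

def test_hanging_detector_detects_issues(caplog):
    # caplog hooks into the standard logging machinery, so no bespoke handler
    # needs to be attached to the Ray Data logger.
    with caplog.at_level(logging.WARNING, logger="ray.data"):
        run_pipeline_with_one_slow_task()
    assert any("hanging" in rec.getMessage().lower() for rec in caplog.records), caplog.text
```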


## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <[email protected]>
…rm build. (ray-project#57244)

For more details about the resource isolation project see
ray-project#54703.

This PR introduces two public bazel targets from the
`//src/ray/common/cgroup2` subsystem.
* `CgroupManagerFactory` is a cross-platform target that exports a
working CgroupManager on Linux if resource isolation is enabled. It
exports a Noop implementation if running on a non-Linux platform or if
resource isolation is not enabled on Linux.
* `CgroupManagerInterface` is the public API of CgroupManager.

It also introduces a few other changes:
1. All resource isolation related configuration parsing and input
validation has been moved into CgroupManagerFactory.
2. NodeManager now controls the lifecycle (and destruction) of
CgroupManager.
3. SysFsCgroupDriver uses a Linux header file to find the path of the
mount file instead of hardcoding it, because different Linux distributions
can use different files.

---------

Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
- adding a test for having multiple task consumers in a single ray serve
application

---------

Signed-off-by: harshit <[email protected]>
…ieval (ray-project#56451)

## Why are these changes needed?

This PR fixes a critical driver hang issue in Ray Data's streaming
generator. The problem occurs when computation completes and block data
is generated, but the worker crashes before the metadata object is
generated, causing the driver to hang completely until the task's
metadata is successfully rebuilt. This creates severe performance
issues, especially in cluster environments with significant resource
fluctuations.

## What was the problem?

**Specific scenario:**
1. Computation completes, block data is generated
2. Worker crashes before the metadata object is generated
3. Driver enters the
[physical_operator.on_data_ready()](https://github.com/ray-project/ray/blob/ray-2.46.0/python/ray/data/_internal/execution/interfaces/physical_operator.py#L124)
logic and waits indefinitely for metadata until task retry succeeds and
meta object becomes available
4. If cluster resources are insufficient, the task cannot be retried
successfully, causing the driver to hang for hours (an actual case lasted 12 hours)

**Technical causes:**
- Using `ray.get(next(self._streaming_gen))` for metadata content
retrieval, which may hang indefinitely
- Lack of timeout mechanisms and state tracking, preventing driver
recovery from hang state
- No proper handling when worker crashes between block generation and
metadata generation

## What does this fix do?

- Adds `_pending_block_ref` and `_pending_meta_ref` state tracking to
properly handle block/metadata pairs
- Uses `ray.get(meta_ref, timeout=1)` with timeout for metadata content
retrieval
- Adds error handling for `GetTimeoutError` with warning logs
- Prevents unnecessary re-fetching of already obtained block references
- **Key improvement: Prevents driver from hanging for extended periods
when worker crashes between block and metadata generation**
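A minimal sketch of the timeout-based metadata fetch described above (variable names are illustrative, not the exact Ray Data internals):

```python
import ray
from ray.exceptions import GetTimeoutError

def try_fetch_metadata(meta_ref, logger):
    # Poll the metadata object with a short timeout instead of blocking forever.
    # If the worker died between block and metadata generation, log a warning and
    # return None so the caller can retry on a later scheduling iteration instead
    # of hanging the driver.
    try:
        return ray.get(meta_ref, timeout=1)
    except GetTimeoutError:
        logger.warning("Timed out waiting for block metadata %s; will retry.", meta_ref)
        return None
```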

## Related issue number

Fixes a critical performance issue in streaming data processing that
causes the driver to hang for extended periods (up to 12 hours) when workers
crash between block generation and metadata generation, especially in
cluster environments with significant resource fluctuations.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- **Testing Strategy**
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: dragongu <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…roject#57118)

This PR adds a template that demonstrates how to use Ray Train with
DeepSpeed.

At a high level, this template covers:
- A hands-on example of fine-tuning a language model.
- Saving and loading model checkpoints with Ray Train and DeepSpeed.
- Key DeepSpeed configurations (ZeRO stages, offloading, mixed
precision).

Tested environment: 1 CPU head node + 2 NVIDIA T4 GPU worker nodes.
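For orientation, a minimal sketch of the pattern the template follows, combining `TorchTrainer` with `deepspeed.initialize`. The model, the loop body, and the DeepSpeed config values below are placeholders, not the template's actual code:

```python
import deepspeed
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Tiny stand-in model; the template fine-tunes a real language model instead.
    model = torch.nn.Linear(128, 1)
    # DeepSpeed wraps the model and builds the optimizer per the ZeRO config.
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=config["ds_config"],
    )
    for _ in range(config["steps"]):
        x = torch.randn(8, 128, device=engine.device)
        loss = engine(x).pow(2).mean()
        engine.backward(loss)  # DeepSpeed handles loss scaling / ZeRO partitioning
        engine.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={
        "steps": 10,
        # Minimal ZeRO-2 config with mixed precision (fp16 here; the template
        # discusses BF16 and offloading as well).
        "ds_config": {
            "train_micro_batch_size_per_gpu": 8,
            "gradient_accumulation_steps": 1,
            "zero_optimization": {"stage": 2},
            "fp16": {"enabled": True},
            "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
        },
    },
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # 2 GPU workers
)
trainer.fit()
```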

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Adds a DeepSpeed + Ray Train LLM fine-tuning example with a training
script, tutorial notebook, and cloud configs, and links it in the
examples index.
> 
> - **New example: `examples/pytorch/deepspeed_finetune`**
> - `train.py`: End-to-end LLM fine-tuning with Ray Train + DeepSpeed
(ZeRO config, BF16, checkpoint save/load, resume support, debug steps).
> - `README.ipynb`: Step-by-step tutorial covering setup, dataloader,
model init, training loop, checkpointing, and launch via `TorchTrainer`.
> - Cluster configs: `configs/aws.yaml`, `configs/gce.yaml` for 1 head +
2 T4 workers.
> - **Docs**
> - Add example entry in `doc/source/train/examples.yml` linking to
`examples/pytorch/deepspeed_finetune/README`.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e376180. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Jason Li <[email protected]>
Co-authored-by: angelinalg <[email protected]>
To remove overloaded semantics, we will undeprecate the compute
parameter, deprecate the concurrency parameters, and ask users to use
`ActorPoolStrategy` and `TaskPoolStrategy` directly, which makes the
logic straightforward.
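For illustration, roughly what the user-facing API looks like with the `compute` parameter, reflecting the direction described above (assuming the existing `ray.data.ActorPoolStrategy` export; the task-pool case is shown via the default rather than an explicit `TaskPoolStrategy`, whose import path may change with this work):

```python
import ray
from ray.data import ActorPoolStrategy

class AddOne:
    # Actor-based UDFs are callable classes so the same actors can be reused.
    def __call__(self, batch):
        batch["id"] = batch["id"] + 1
        return batch

ds = ray.data.range(1000)

# Task-based execution (the default when no compute strategy is given).
ds_tasks = ds.map_batches(lambda batch: {"id": batch["id"] + 1})

# Actor-based execution: a fixed pool of 4 actors processes the batches.
ds_actors = ds.map_batches(AddOne, compute=ActorPoolStrategy(size=4))

print(ds_actors.take(3))
```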
and related `test_wheels` test

Signed-off-by: Lonnie Liu <[email protected]>
…57656)

Starting to break down `ray_release` into smaller Bazel build rules.

Required to fix Windows CI builds, where `test_in_docker` now imports
third-party libraries that do not yet work on Windows machines.

Signed-off-by: Lonnie Liu <[email protected]>
…#57660)

it is part of the `py_binary`

Signed-off-by: Lonnie Liu <[email protected]>
the scripts are `py_binary` entrypoints, not for ray_release binary.

Signed-off-by: Lonnie Liu <[email protected]>
…#57661)

Working towards breaking down the `ray_release` package into smaller modules.

Signed-off-by: Lonnie Liu <[email protected]>
… Troubleshooting Guide (ray-project#55236)

## Why are these changes needed?
Some users may not know how to configure `ReconcileConcurrency` in
KubeRay.

Docs link:
https://anyscale-ray--55236.com.readthedocs.build/en/55236/cluster/kubernetes/troubleshooting/troubleshooting.html#how-to-configure-reconcile-concurrency-when-there-are-large-mount-of-crs

ray-project/kuberay#3909

---------

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Rueian <[email protected]>
…ses. (ray-project#57269)

This PR stacks on ray-project#57244.

For more details about the resource isolation project see
ray-project#54703.

In the previous Ray cgroup hierarchy, all processes that were in the
path `--cgroup-path` were moved into the system cgroup. This changes the
hierarchy to have a separate cgroup for all non-Ray processes.
The new cgroup hierarchy looks like
```
      cgroup_path (e.g. /sys/fs/cgroup)
            |
    ray-node_<node_id>
    |                 |
  system             user
    |               |    |
  leaf        workers  non-ray
```

The cgroups contain the following processes
* system/leaf (all ray non-worker processes e.g. raylet,
runtime_env_agent, gcs_server, ...)
* user/workers (all ray worker processes)
* user/non-ray (all non-ray processes migrated from cgroup_path).

Note: If you're running ray inside a container, all non-ray processes
running in the container will be migrated to `user/non-ray`

The following controllers will be enabled
* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)

The following constraints are applied
* system (cpu.weight, memory.min)
* user (cpu.weight)

---------

Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
… to fix over-provisioning (ray-project#57130)

This PR is related to
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864)
The v1 autoscaler monitor currently pulls metrics from two
different modules in GCS:

- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_`
and updates it at two different intervals (`UpdateNodeResourceUsage`
every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_`
and updates it via `UpdateResourceLoadAndUsage`. This module is already
the source for the v2 autoscaler.

| Field | Triggered by | Update path | Update period | Reference |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |

Because these two modules update asynchronously, the autoscaler can end
up seeing inconsistent resource states. That causes a race condition
where extra nodes may be launched before the updated availability
actually shows up. In practice, this means clusters can become
over-provisioned even though the demand was already satisfied.

In the long run, the right fix is to fully switch the v1 autoscaler over
to GcsAutoscalerStateManager::HandleGetClusterResourceState, just like
v2 already does. But since v1 will eventually be deprecated, this PR
takes a practical interim step: it merges the necessary info from both
GcsResourceManager::HandleGetAllResourceUsage and
GcsAutoscalerStateManager::HandleGetClusterResourceState in a hybrid
approach.

This keeps v1 correct without big changes, while still leaving the path
open for a clean migration to v2 later on.

## Details

This PR follows the fix suggested by @rueian in ray-project#52864 by switching the
v1 autoscaler's node state source from
`GcsResourceManager::HandleGetAllResourceUsage` to
`GcsAutoscalerStateManager::HandleGetClusterResourceState`.

Root cause: the v1 autoscaler previously pulled data from two
asynchronous update cycles:

- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`

This created a race condition where newly allocated resources would be
visible before demand metrics updated, causing the autoscaler to
incorrectly perceive unmet demand and launch extra nodes.

The Fix: By using v2's `HandleGetClusterResourceState` for node
iteration, both current resources and pending demands now come from the
same consistent snapshot (same tick), so the extra-node race condition
goes away.

## Proposed Changes in update_load_metrics()

This PR updates how the v1 autoscaler collects cluster metrics.

Most node state information is now taken from **v2
(`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while
certain fields still rely on **v1
(`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 doesn't
have an equivalent yet.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |

### Additional Notes

- This hybrid approach addresses the race condition while still using
legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the
change is backward-compatible and does not break existing behavior.
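Conceptually, the merged view looks something like the following sketch. The data classes and the merge function are simplified stand-ins, not the actual GCS protobufs or `update_load_metrics()` code:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeState:                 # simplified stand-in for autoscaler.proto NodeState
    node_id: str
    total: Dict[str, float]
    available: Dict[str, float]
    idle_ms: int

@dataclass
class ClusterSnapshot:           # stand-in for ClusterResourceState (v2, one tick)
    node_states: List[NodeState] = field(default_factory=list)

@dataclass
class LegacyUsage:               # stand-in for the v1 GetAllResourceUsage reply
    resource_load_by_shape: list = field(default_factory=list)
    placement_group_load: list = field(default_factory=list)
    cluster_full: bool = False

def merge_for_v1_autoscaler(snapshot: ClusterSnapshot, legacy: LegacyUsage) -> dict:
    # Node resources come from the v2 snapshot, while demand shapes and
    # placement group load still come from the v1 reply that has no v2
    # equivalent yet; both are read in the same monitor tick.
    return {
        "nodes": {n.node_id: (n.total, n.available, n.idle_ms) for n in snapshot.node_states},
        "waiting_and_infeasible": legacy.resource_load_by_shape,
        "pending_placement_groups": legacy.placement_group_load,
        "cluster_full": legacy.cluster_full,
    }
```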

## Changed Behavior (Observed)

(Autoscaler config & serving code are same as this
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864))

After switching to v2 autoscaler state (cluster resource), the issue no
longer occurs:

- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, Node
Provider only launches a single `ray.worker.4090.standard` node. (No
extra requests for additional nodes are observed.)


[debug.log](https://github.com/user-attachments/files/22659163/debug.log)

## Related issue number

Closes ray-project#52864 

Signed-off-by: jinbum-kim <[email protected]>
Co-authored-by: Rueian <[email protected]>
…#57667)

towards fixing test_in_docker for windows

Signed-off-by: Lonnie Liu <[email protected]>
…#57669)

as cluster envs are anyscale specific concepts

Signed-off-by: Lonnie Liu <[email protected]>
omatthew98 and others added 20 commits October 22, 2025 14:56
## Description
We are using `read_parquet` in two of our tests in
`test_operator_fusion.py`; this switches those tests to use `range` to make
them less brittle.

Signed-off-by: Matthew Owen <[email protected]>
with comments to github issues

Signed-off-by: Lonnie Liu <[email protected]>
Otherwise, the ordering of messages looks strange on Windows.

Signed-off-by: Lonnie Liu <[email protected]>
Updates the Vicuna Lightning DeepSpeed example to run with Train V2.

---------

Signed-off-by: Justin Yu <[email protected]>
…8020)

## Description

Currently, streaming repartition doesn't combine blocks up to
`target_num_rows_per_block`, which is problematic in the sense that it can
only split blocks but not recombine them.

This PR addresses that by allowing it to recombine smaller blocks
into bigger ones. One caveat is that the remainder block
could still be under `target_num_rows_per_block`.
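A toy illustration of the recombining behavior (plain Python lists stand in for blocks; this is not the actual streaming repartition operator):

```python
from typing import Iterable, Iterator, List

def recombine(blocks: Iterable[List[int]], target_rows: int) -> Iterator[List[int]]:
    # Accumulate incoming blocks until at least target_rows rows are buffered,
    # emitting full-size blocks; the trailing remainder may stay under target.
    buffer: List[int] = []
    for block in blocks:
        buffer.extend(block)
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]
    if buffer:
        yield buffer  # remainder block, possibly smaller than target_rows

blocks = [[1, 2], [3], [4, 5, 6], [7]]
print(list(recombine(blocks, target_rows=3)))  # [[1, 2, 3], [4, 5, 6], [7]]
```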


---------

Signed-off-by: Alexey Kudinkin <[email protected]>
…e buildup (ray-project#57996)


## Description

### [Data] ConcurrencyCapBackpressurePolicy - Handle internal output
queue buildup

**Issue**

- When there is internal output queue buildup, specifically when
`preserve_order` is set, we don't limit task concurrency in the streaming
executor and just honor the static concurrency cap.
- When the concurrency cap is unlimited, we keep queuing more blocks into
the internal output queue, leading to spill and a steep spill curve.


**Solution**

In `ConcurrencyCapBackpressurePolicy`, detect internal output queue
buildup and then limit the concurrency of the tasks.

- Keep a history of the internal output queue and detect trends in
percentage and size in GB. Based on the trend, increase or decrease the
concurrency cap.
- Given that queue-based buffering is needed for `preserve_order`, allow an
adaptive queuing threshold. This still results in spill, but it flattens
the spill curve and avoids runaway growth of the buffering queue.
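An illustrative sketch of the trend-based cap adjustment (the growth factor and thresholds are made-up values, not the actual policy code):

```python
def adjust_concurrency_cap(cap: int, queue_bytes_history: list, threshold_bytes: float) -> int:
    # Compare the most recent queue size against the previous sample to detect a trend.
    if len(queue_bytes_history) < 2:
        return cap
    previous, current = queue_bytes_history[-2], queue_bytes_history[-1]
    growing = current > previous * 1.1           # >10% growth between samples
    over_threshold = current > threshold_bytes
    if growing and over_threshold:
        return max(1, cap - 1)                   # back off: the queue is building up
    if not growing and current < 0.5 * threshold_bytes:
        return cap + 1                           # queue draining: allow more tasks
    return cap
```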



---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
…#57999)

We have a feature flag to control the rollout of Ray export events,
but the feature flag doesn't gate
`StartExportingEvents`. This PR fixes the issue.

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
Otherwise they fail the Windows core Python tests.

Signed-off-by: Lonnie Liu <[email protected]>
…y-project#58023)

## Description

### [Data] ConcurrencyCapBackpressurePolicy - Only increase threshold

When `_update_queue_threshold` adjusts the queue threshold used to cap
concurrency based on current queued bytes:

- Only allow increasing the threshold or maintaining it.
- Don't decrease the threshold, because the steady state of queued bytes is
not known.
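In effect, the threshold update becomes a one-way ratchet, roughly like this (illustrative only, not the actual method):

```python
def update_queue_threshold(current_threshold: float, queued_bytes: float) -> float:
    # Ratchet the threshold upward only; never lower it, since the steady-state
    # queue size under preserve_order isn't known.
    return max(current_threshold, queued_bytes)
```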


---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
combining all depset checks into a single job

TODO: add raydepset feature to build all depsets for the depset graph

---------

Signed-off-by: elliot-barn <[email protected]>
- The default deployment name was changed to `_TaskConsumerWrapper` after
the async inference implementation; this fixes it.

Signed-off-by: harshit <[email protected]>
…#58033)

## Description

This change properly handles pushing the renaming projections into
read ops (those that support projections, like Parquet reads).


---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Description

This PR adds support for reading Unity Catalog Delta tables in Ray Data
with automatic credential vending. This enables secure, temporary access
to Delta Lake tables stored in Databricks Unity Catalog without
requiring users to manage cloud credentials manually.

### What's Added

- **`ray.data.read_unity_catalog()`** - Updated public API for reading
Unity Catalog Delta tables
- **`UnityCatalogConnector`** - Handles Unity Catalog REST API
integration and credential vending
- **Multi-cloud support** - Works with AWS S3, Azure Data Lake Storage,
and Google Cloud Storage
- **Automatic credential management** - Obtains temporary,
least-privilege credentials via Unity Catalog API
- **Delta Lake integration** - Properly configures PyArrow filesystem
for Delta tables with session tokens

### Key Features

βœ… **Production-ready credential vending API** - Uses stable, public
Unity Catalog APIs
βœ… **Secure by default** - Temporary credentials with automatic cleanup  
βœ… **Multi-cloud** - AWS (S3), Azure (Blob Storage), and GCP (Cloud
Storage)
βœ… **Delta Lake optimized** - Handles session tokens and PyArrow
filesystem configuration
βœ… **Comprehensive error handling** - Helpful messages for common issues
(deletion vectors, permissions, etc.)
βœ… **Full logging support** - Debug and info logging throughout

### Usage Example

```python
import ray

# Read a Unity Catalog Delta table
ds = ray.data.read_unity_catalog(
    table="main.sales.transactions",
    url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
    token="dapi...",
    region="us-west-2"  # Optional, for AWS
)

# Use standard Ray Data operations
ds = ds.filter(lambda row: row["amount"] > 100)
ds.show(5)
```

### Implementation Notes

This is a **simplified, focused implementation** that:
- Supports **Unity Catalog tables only** (no volumes - that's in private
preview)
- Assumes **Delta Lake format** (most common Unity Catalog use case)
- Uses **production-ready APIs** only (no private preview features)
- Provides ~600 lines of clean, reviewable code

The full implementation with volumes and multi-format support is
available in the `data_uc_volumes` branch and can be added in a future
PR once this foundation is reviewed.

### Testing

- βœ… All ruff lint checks pass
- βœ… Code formatted per Ray standards
- βœ… Tested with real Unity Catalog Delta tables on AWS S3
- βœ… Proper PyArrow filesystem configuration verified
- βœ… Credential vending flow validated

## Related issues

Related to Unity Catalog and Delta Lake support requests in Ray Data.

## Additional information

### Architecture

The implementation follows the **connector pattern** rather than a
`Datasource` subclass because Unity Catalog is a metadata/credential
layer, not a data format. The connector:

1. Fetches table metadata from Unity Catalog REST API
2. Obtains temporary credentials via credential vending API
3. Configures cloud-specific environment variables
4. Delegates to `ray.data.read_delta()` with proper filesystem
configuration

### Delta Lake Special Handling

Delta Lake on AWS requires explicit PyArrow S3FileSystem configuration
with session tokens (environment variables alone are insufficient). This
implementation correctly creates and passes the filesystem object to the
`deltalake` library.

### Cloud Provider Support

| Provider | Credential Type | Implementation |
| --- | --- | --- |
| AWS S3 | Temporary IAM credentials | PyArrow S3FileSystem with session token |
| Azure Blob | SAS tokens | Environment variables (AZURE_STORAGE_SAS_TOKEN) |
| GCP Cloud Storage | OAuth tokens / Service account | Environment variables (GCP_OAUTH_TOKEN, GOOGLE_APPLICATION_CREDENTIALS) |

### Error Handling

Comprehensive error messages for common issues:
- **Deletion Vectors**: Guidance on upgrading deltalake library or
disabling the feature
- **Column Mapping**: Compatibility information and solutions
- **Permissions**: Clear list of required Unity Catalog permissions
- **Credential issues**: Detailed troubleshooting steps

### Future Enhancements

Potential follow-up PRs:
- Unity Catalog volumes support (when out of private preview)
- Multi-format support (Parquet, CSV, JSON, images, etc.)
- Custom datasource integration
- Advanced Delta Lake features (time travel, partition filters)

### Dependencies

- Requires `deltalake` package for Delta Lake support
- Uses standard Ray Data APIs (`read_delta`, `read_datasource`)
- Integrates with existing PyArrow filesystem infrastructure

### Documentation

- Full docstrings with examples
- Type hints throughout
- Inline comments with references to external documentation
- Comprehensive error messages with actionable guidance

---------

Signed-off-by: soffer-anyscale <[email protected]>
…ease test (ray-project#58048)

## Summary

This PR removes the `image_classification_chaos_no_scale_back` release
test and its associated setup script
(`setup_cluster_compute_config_updater.py`). This test has become
non-functional and is no longer providing useful signal.

## Background

The `image_classification_chaos_no_scale_back` release test was designed
to validate Ray Data's fault tolerance when many nodes abruptly get
preempted at the same time.

The test worked by:
1. Running on an autoscaling cluster with 1-10 nodes
2. Updating the compute config mid-test to downscale to 5 nodes
3. Asserting that there are dead nodes as a sanity check

## Why This Test Is Broken

After the removal of Parquet metadata fetching in ray-project#56105 (September 2,
2025), the autoscaling behavior changed significantly:

- **Before metadata removal**: The cluster would autoscale more
aggressively because metadata fetching created additional tasks that
triggered faster scale-up. The cluster would scale past 5 nodes, then
downscale, leaving dead nodes that the test could detect.

- **After metadata removal**: Without the metadata fetching tasks, the
cluster doesn't scale up fast enough to get past 5 nodes before the
downscale happens. This means there are no dead nodes to detect, causing
the test to fail.

## Why We're Removing It

1. **Test is fundamentally broken**: The test's assumptions about
autoscaling behavior are no longer valid after the metadata fetching
removal
2. **Not actively monitored**: The test is marked as unstable and isn't
closely watched

## Changes

- Removed `image_classification_chaos_no_scale_back` test from
`release/release_data_tests.yaml`
- Deleted
`release/nightly_tests/setup_cluster_compute_config_updater.py` (only
used by this test)

## Related

See ray-project#56105

Fixes ray-project#56528

Signed-off-by: Balaji Veeramani <[email protected]>
These numbers are outdated, and the ones we report are not very useful.
We will refresh them soon.

Signed-off-by: Edward Oakes <[email protected]>
…54857)

Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Rueian <[email protected]>
## Description


https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.to_numpy
<img width="772" height="270" alt="Screenshot 2025-10-18 at 3 14 36β€―PM"
src="https://github.com/user-attachments/assets/d9cbf986-4271-41e6-9c4c-96201d32d1c6"
/>


`zero_copy_only` actually defaults to `True`, so we should explicitly pass
`False` for PyArrow versions < 13.0.0:

https://github.com/ray-project/ray/blob/1e38c9408caa92c675f0aa3e8bb60409c2d9159f/python/ray/data/_internal/arrow_block.py#L540-L546
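For example, a small repro sketch: an array containing nulls cannot be converted zero-copy, which is where the old default raises on pre-13 PyArrow:

```python
import pyarrow as pa

arr = pa.array([1, 2, None, 4])

# With zero_copy_only=True (the Array.to_numpy default), this conversion raises
# ArrowInvalid because the nulls force a copy. Passing False makes it work.
np_values = arr.to_numpy(zero_copy_only=False)
print(np_values)  # [ 1.  2. nan  4.]
```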

## Related issues
Closes ray-project#57819


---------

Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>
)

Updating the default value calculation in the docstrings for the public
API.

Signed-off-by: irabbani <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #660 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 4910.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is an automatic daily merge from the master branch into the main branch. It encompasses a broad range of updates and refactorings across the CI/CD pipeline, Bazel build configurations, Docker image generation, Python dependency management, and documentation infrastructure. The changes aim to improve build consistency, streamline development workflows, and enhance code quality checks.

Highlights

  • CI/CD Pipeline Refactoring: Extensive changes to .buildkite YAML files, introducing new build steps for core, dashboard, and Java components, and reorganizing existing steps. This includes adding _images.rayci.yml and dependencies.rayci.yml for better modularity.
  • Bazel Build System Updates: Modifications to .bazelrc for stricter action environments, platform-specific options, and warning suppression. Significant changes in BUILD.bazel and bazel/ray.bzl to use rules_pkg for artifact packaging and streamline C++ and Python protobuf generation.
  • Docker Image Management: Introduction of new Dockerfiles and Wanda configurations for ray-core, ray-dashboard, and ray-java images, along with extra variants for various Ray types (ML, LLM).
  • Python Dependency Management (raydepsets): Introduction of a new raydepsets system for managing Python dependencies, replacing manual uv pip compile scripts, and ensuring more consistent and reproducible builds.
  • Documentation Tooling & Style: Updates to .pre-commit-config.yaml to include semgrep and vale for linting, and a new ray-docs-style.mdc rule for enforcing documentation style. The PR template has also been updated.
  • Windows/macOS Build & Test Improvements: Refactoring of macOS build scripts to use miniforge and macos-arm64 instances. Windows build configurations were adjusted for shorter paths and cache handling.
  • C++ API and Runtime Changes: Updates to C++ API headers and runtime code, including changes to argument wrapping, metric recording, object store interactions, and runtime environment protobufs.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a large-scale refactoring and cleanup effort, primarily focused on improving the CI/CD pipeline and the Bazel build system. Key changes include modularizing the build process, centralizing dependency management, and modernizing build rules. The CI pipeline has been significantly reorganized for better clarity and parallelism. Notably, support for x86_64 macOS seems to have been dropped in favor of arm64. The project has also switched from Miniconda to Miniforge for environment management and has integrated several new code quality tools like semgrep and eslint.

Overall, these changes are excellent and represent a significant step forward in terms of maintainability and build efficiency. I have found one potential issue in the C++ code where an error status check was removed, which could lead to silently ignoring errors in local mode. My feedback includes a suggestion to restore this check.

Comment on lines +44 to 45

```cpp
memory_store_->Put(
    ::ray::RayObject(buffer, nullptr, std::vector<rpc::ObjectReference>()), object_id);
```


high

The status check for memory_store_->Put() has been removed. The underlying CoreWorkerMemoryStore::Put can return an error status, for example, if the object already exists. Silently ignoring this error could lead to unexpected behavior or hide bugs in tests or applications running in local mode. It would be safer to restore the status check and throw an exception on failure.

```cpp
  auto status = memory_store_->Put(
      ::ray::RayObject(buffer, nullptr, std::vector<rpc::ObjectReference>()), object_id);
  if (!status.ok()) {
    throw RayException("Put object error: " + status.ToString());
  }
```

@github-actions

github-actions bot commented Nov 8, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 8, 2025