daily merge: master → main 2025-10-30 #664
Conversation
…oject#57765) Created by release automation bot. Update with commit 692c7c7 Signed-off-by: kevin <[email protected]>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <img width="1263" height="859" alt="Screenshot 2025-10-07 at 9 46 09β―PM" src="https://github.com/user-attachments/assets/45249e77-af49-4e3d-a758-608a51d15e10" /> [Link to video recording](https://drive.google.com/file/d/13Erjd_K4OXmn1r_7iMS97u8cgUdBgTvG/view?usp=sharing) Currently, by default, the original `tqdm` based progress is used. To enable `rich` progress reporting as shown in the screenshot, set: ``` ray.data.DataContext.get_current().enable_rich_progress_bars = True ``` or set the envvar: ``` export RAY_DATA_ENABLE_RICH_PROGRESS_BARS=1 ``` <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Fixes ray-project#52505 <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Replaces legacy per-operator/global progress bars with a Rich-based progress manager that tracks global/operator progress and resources, refactors topology/progress APIs, and updates tests. > > - **Execution Progress (Rich-based)**: > - Introduces `progress_manager.py` with `RichExecutionProgressManager` for global/operator progress, rates, elapsed/remaining time, and live resource usage. > - Streaming executor integrates manager (start/refresh/close, finishing messages), updates on row/output and resources, and periodic refresh via `PROGRESS_MANAGER_UPDATE_INTERVAL`. > - **Operator/State Refactor**: > - `OpState` gains `OpDisplayMetrics`, `progress_manager_uuid`, `output_row_count`, and `update_display_metrics`; removes legacy progress bar handling and summary methods. > - `_debug_dump_topology` now logs `op_display_metrics.display_str()`. > - Minor: add TODOs on sub-progress-bar helpers in `AllToAllOperator` and `HashShuffleProgressBarMixin`. > - **Topology API**: > - `build_streaming_topology(...)` now returns only `Topology` (no progress bar count); all call sites and tests updated. > - **Iterator/Reporting**: > - `_ClosingIterator` updates total progress via manager. > - Resource reporting moved to progress manager. > - **Tests**: > - Adjust unit tests to new topology return type and removed progress bar expectations. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 935f3c3. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Daniel Shin <[email protected]> Signed-off-by: kyuds <[email protected]> Signed-off-by: kyuds <[email protected]>
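A minimal end-to-end sketch of enabling the Rich-based reporting described above; the flag and environment variable come from this PR's description, while the small pipeline is only an illustrative workload, not part of the change:

```python
# Sketch: turn on Rich progress reporting for Ray Data, then run any pipeline.
import ray
import ray.data

ray.init()

# Flag from the PR description (the default remains the tqdm-based reporter).
ray.data.DataContext.get_current().enable_rich_progress_bars = True
# Equivalent environment variable: RAY_DATA_ENABLE_RICH_PROGRESS_BARS=1

def add_one(batch):
    # ray.data.range() produces a single "id" column.
    batch["id"] = batch["id"] + 1
    return batch

# Illustrative workload; progress is rendered by the Rich progress manager.
ray.data.range(10_000_000).map_batches(add_one).materialize()
```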
…56568) Signed-off-by: joshlee <[email protected]>
<!-- Thank you for contributing to Ray! π --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- π‘ Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description <!-- Briefly describe what this PR accomplishes and why it's needed --> Add support for downloading from multiple URI columns in a single Download operation. This enhancement extends Ray Data's download functionality to efficiently handle multiple URI columns simultaneously, reducing the number of operations needed when working with datasets that contain multiple file references. ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix π - [ ] New feature β¨ - [ ] Enhancement π - [x] Code refactoring π§ - [ ] Documentation update π - [ ] Chore π§Ή - [ ] Style π¨ ## Checklist **Does this PR introduce breaking changes?** - [ ] Yesβ οΈ - [x] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested β _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Balaji Veeramani <[email protected]>
β¦project#57065) # Summary Add validation to checkpoint doc. Also mention that async checkpoint uploading requires you to create a long-lasting checkpoint directory. # Testing docbuild <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Documents checkpoint upload modes and async validation with new examples and API refs, adds s3torchconnector deps, and updates CI conda install to use the classic solver. > > - **Docs (Train)**: > - **Checkpoints guide**: Rename to "Saving, Validating, and Loading Checkpoints"; add sections on `CheckpointUploadMode` (sync/async/custom) and async checkpoint validation with `validate_fn`; include new examples and figure. > - **API refs**: Add `train.CheckpointUploadMode` to `doc/source/train/api/api.rst`. > - **Doc examples**: Add `checkpoints.py` snippets for `CheckpointUploadMode.SYNC/ASYNC/NO_UPLOAD`, Torch/XLA validation via `TorchTrainer`, Ray Data `map_batches`, and reporting with `validate_function`. > - **Monitoring & logging**: Note validating checkpoints as a primary reporting use case. > - **Dependencies**: > - Add `s3torchconnector==1.4.3` (and transitively `s3torchconnectorclient==1.4.3`) to `python/requirements_*` for custom S3 uploads. > - **CI**: > - Include `ci/env/install-miniforge.sh` in `ci/docker/base.ml.wanda.yaml` build context. > - In `ci/env/install-miniforge.sh`, install `mpi4py` with `conda ... --solver classic` for Python <3.12. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c5741d9. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Timothy Seah <[email protected]> Co-authored-by: angelinalg <[email protected]>
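A hedged sketch of the checkpoint-upload and validation flow summarized above; `CheckpointUploadMode` is named in the summary, but the exact `ray.train.report` keyword names (`checkpoint_upload_mode`, `validate_fn`) are assumptions taken from the note, so treat this as an illustration rather than an API reference:

```python
# Assumption-laden sketch of async checkpoint upload with validation.
import tempfile

import ray.train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def validate_checkpoint(checkpoint, validate_config):
    # Hypothetical validation callback: runs against the uploaded checkpoint
    # and returns metrics describing whether it is usable.
    return {"valid": True}


def train_loop_per_worker(config):
    for epoch in range(2):
        # Per the doc change above, async upload needs a checkpoint directory
        # that outlives the report() call, so don't use a with-block tempdir.
        ckpt_dir = tempfile.mkdtemp()
        # ... write model/optimizer state into ckpt_dir ...
        ray.train.report(
            metrics={"epoch": epoch},
            checkpoint=Checkpoint.from_directory(ckpt_dir),
            checkpoint_upload_mode=ray.train.CheckpointUploadMode.ASYNC,  # assumed kwarg
            validate_fn=validate_checkpoint,  # assumed kwarg
        )


trainer = TorchTrainer(
    train_loop_per_worker, scaling_config=ScalingConfig(num_workers=2)
)
trainer.fit()
```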
Phase 3 of https://gist.github.com/abrarsheikh/abde819a199c2190baba082f0e6fdd86 --------- Signed-off-by: abrar <[email protected]>
…57745) and update commits used in tests. Also stops testing x86_64 osx wheels on "latest"; those are deprecated. Signed-off-by: Lonnie Liu <[email protected]>
ray-project#57716) Signed-off-by: Mark Rossett <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
removing cu121 build arg set Signed-off-by: elliot-barn <[email protected]>
…ts (ray-project#57724) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Currently there are two different shutdown flags on the raylet. There's `shutted_down` in main.cc which tracks whether `shutdown_raylet_gracefully_internal` has been executed yet, and then there's `is_shutting_down_` which tracks whether `shutdown_raylet_gracefully_` has been called from NodeManager. 1. `shutted_down` isn't atomically checked + changed inside `shutdown_raylet_gracefully_internal`. So it's possible to have the internal shutdown path happen twice. 2. When the raylet gets sigtermed it calls `shutdown_raylet_gracefully_internal` which only sets `shutted_down` in main.cc and not `is_shutting_down_` in node_manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead before the shutdown completes. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See ray-project#56966 for more context. 3. `shutdown_raylet_gracefully_` can also be called from the local resource manager or runtime env agent, and we won't set a flag directly in that case. 4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe. ## Solution The solution is to just have one `shutdown_raylet_gracefully_` and one `shutting_down` flag that's set inside it. It's ok to not post `shutdown_raylet_gracefully_` because I've made `NodeInfoAccessor::UnregisterSelf` thread safe by removing the `local_node_id_` + `local_node_info_` state, which was unnecessary since you can just pass the node id into `UnregisterSelf`. The unregister callback is where the real shutdown work happens, and that's posted onto the main io service, so it's ok to not post the graceful shutdown callback. `shutting_down` is now checked in the pubsub NodeRemoved callback so the raylet won't crash and RAY_LOG(FATAL) itself when it gets its own node death publish and shutdown has already started. Also a bit of miscellaneous cpp and gcs client node accessor cleanup while I was there... --------- Signed-off-by: dayshah <[email protected]>
…rovider::Get (ray-project#57691) This is 1/n in a series of PRs to fix ray-project#54007. CoreWorker::Get has two parts * CoreWorkerMemoryStore::Get (to wait for objects to become ready _anywhere_ in the cluster) * PlasmaStoreProvider::Get (to fetch objects into local plasma and return a ptr to shm). The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time it does this in practice is when it calls CoreWorkerMemoryStore::Get. The rationale (as discussed in ray-project#12912) is that the worker is not using any resources. PlasmaStoreProvider::Get does not yield CPU by notifying the raylet that it's blocked, but instead calls NotifyWorkerUnblocked. This is a bug. It does this to clean up an inflight or completed "Get" request from the worker. In this PR, I clean up PlasmaStoreProvider::Get so it * No longer calls NotifyWorkerUnblocked sometimes (with some convoluted checking to see if we're executing a NORMAL_TASK on the main thread or an ACTOR_CREATION_TASK). * Instead calls CancelGetRequest on (almost) all exits from the function. This is because even if PlasmaStoreProvider::Get is successful, it still needs to clean up the "Get" request on the raylet. * Removes unnecessary parameters. --------- Signed-off-by: irabbani <[email protected]>
Timed out in a recent run: https://buildkite.com/ray-project/postmerge/builds/13756#0199eb6a-32b9-4892-9a04-d47be5b24a10/652-1951 Looks like it's just bumping up against the timeout: https://buildkite.com/ray-project/postmerge/builds/13749#0199eaff-d7b7-4299-9455-6ffa16433b41/835-999 Signed-off-by: Edward Oakes <[email protected]>
## Why are these changes needed? These changes should have been part of ray-project#56953. The reason we did ray-project#56930 and ray-project#56953 was that any kind of task would always be tagged and therefore limited per actor/type of task.
…ctor args (ray-project#57202) ## Changes 1. Wire up the AutoScalingContext constructor args to make metrics readable in the custom AutoScalingPolicy function. 2. dropped `requests_per_replica` since it's expensive to compute 3. renamed `queued_requests` to `total_queued_requests` for consistency with `total_num_requests` 4. added `total_running_requests` 5. added tests asserting new fields are populated correctly 6. run custom metrics tests with `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER` = 0 and 1 7. updated docs --------- Signed-off-by: abrar <[email protected]> Signed-off-by: Arthur Leung <[email protected]> Signed-off-by: Arthur Leung <[email protected]> Co-authored-by: abrar <[email protected]> Co-authored-by: Arthur Leung <[email protected]>
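A brief sketch of what a custom policy might do with the newly exposed fields; the attribute names (`total_queued_requests`, `total_running_requests`) come from the list above, while the policy signature and everything else here is an assumption, not the Serve API:

```python
# Sketch only: illustrates reading the context fields named in this PR.
def my_autoscaling_policy(ctx) -> int:
    """Return a desired replica count from aggregate request metrics."""
    in_flight = ctx.total_running_requests + ctx.total_queued_requests  # fields from this PR
    target_per_replica = 10  # illustrative target, not a Serve default
    return max(1, round(in_flight / target_per_replica))
```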
Example failure: https://buildkite.com/ray-project/postmerge/builds/13756#0199eb6a-3534-4cd5-9274-694269ff2c60 Didn't account for the carriage return on Windows... using `.strip()` now. Closes ray-project#51135 --------- Signed-off-by: Edward Oakes <[email protected]>
<!-- Thank you for contributing to Ray! π --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- π‘ Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description Subject ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix π - [ ] New feature β¨ - [ ] Enhancement π - [ ] Code refactoring π§ - [ ] Documentation update π - [ ] Chore π§Ή - [ ] Style π¨ ## Checklist **Does this PR introduce breaking changes?** - [ ] Yesβ οΈ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested β _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> Signed-off-by: Alexey Kudinkin <[email protected]>
flaky test ``` RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S=0.1 \ RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1 \ RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0 \ pytest -svvx "python/ray/serve/tests/test_autoscaling_policy.py::TestAutoscalingMetrics::test_basic[min]" ``` What I think is the likely cause: When using `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1` with `min` aggregation: 1. **Replicas emit metrics at slightly different times** (even if just 10ms apart due to the timestamp bucketing/rounding) 2. **The merged timeseries reflects the ramp-up**: - At t=0: Maybe only replica 1 is reporting → total = 25 requests - At t=0.01: Replica 2 starts reporting → total = 40 requests - At t=0.02: Replica 3 starts reporting → total = 50 requests - etc. 3. **`min` aggregation captures the starting point**: - `aggregate_timeseries(..., aggregation_function="min")` takes the minimum value from the merged timeseries - This will always be one of those initial low values (like 25) when only a subset of replicas had started reporting - This value can never be ≥ 45, making the test inherently flaky Removing `min` from the test fixture. I think a more robust solution is to keep the last report in the controller, generate the final time series using both reports, then clip the data at the mid-point, then apply the aggregation function. Signed-off-by: abrar <[email protected]>
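A self-contained illustration of the ramp-up effect described above: when replicas start reporting at slightly different times, the merged timeseries starts low, so a `min` aggregation always returns one of the early totals (the numbers mirror the example in the comment):

```python
# Totals seen by the controller as replicas 1, 2, 3 come online, then steady state.
ramped_totals = [25, 40, 50, 50, 50]

def aggregate(series, aggregation_function):
    # Stand-in for aggregate_timeseries(...); only the aggregation step is shown.
    fns = {"min": min, "max": max, "mean": lambda s: sum(s) / len(s)}
    return fns[aggregation_function](series)

assert aggregate(ramped_totals, "min") == 25   # always captures the ramp-up value
assert aggregate(ramped_totals, "mean") == 43  # closer to the steady state
# A test asserting the aggregated value is >= 45 can therefore never pass with "min".
```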
β¦t dashboard) (ray-project#57549) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? The goal of this set of changes is to make it easier to debug OOD failures (e.g. due to object store spilling) with some general improvements in the metrics experience. Right now we only have a % usage graph for memory. We want to see the % usage graph for all resources to monitor utilization & resource exhaustion issues easily. We also want to add new graphs to track Object Store memory usage (and usage %) by node, as well as absolute memory spilled to disk per node. ### Changes and screenshots * Added % usage graphs for all hardware resources & introduced consistent naming <img width="1420" height="856" alt="image" src="https://github.com/user-attachments/assets/90f17229-cba0-466d-966a-39d58c92c107" /> * Moved the current `Object Store Memory by Location` graph into the Ray primitives metrics group (alongside Tasks, Actors & PGs). This was in the `Ray Resources` group earlier, which we now want to change to `Ray Resources by Node` <img width="1409" height="620" alt="image" src="https://github.com/user-attachments/assets/66e0763c-0502-4565-9788-0572e7a8d4ac" /> * Introduced per node panels for Object Store Memory usage, and also % usage in the `Ray Resources by Node` group <img width="1418" height="850" alt="image" src="https://github.com/user-attachments/assets/0d0e6f7b-6375-4b13-bd40-f8de2700ff86" /> * Introduced a panel tracking absolute amount of memory spilled to disk per node (unstacked graph) <img width="713" height="312" alt="image" src="https://github.com/user-attachments/assets/ecf21bd5-1b6e-4315-af90-d7b8f9ac6259" /> I considered adding the stacked graph too, to help track total absolute disk spill, but that information can be seen in the `Object Store Memory by Location` panel as well so might not need it here. * Switched on`max` and `avg` columns by default, they should be helpful in monitoring utilization & debugging OOD errors etc <img width="2644" height="960" alt="image" src="https://github.com/user-attachments/assets/c9d905ce-124a-4d69-b962-213b72c3fd93" /> * Couple of fixes to PromQL queries that had minor errors. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: anmol <[email protected]> Co-authored-by: anmol <[email protected]>
…er inheritance from LLMServer (ray-project#57743) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…RS (ray-project#57801) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…ject#57770) Allow skipping parameterizing limits in Training ingest benchmark --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Srinath Krishnamachari <[email protected]>
β¦ray-project#57753) <!-- Thank you for contributing to Ray! π --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- π‘ Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description This change would allow `Aggregator` actor to execute tasks out of order to avoid head-of-line blocking scenario. Also, made sure that overridden Ray remote args are applied as an overlay on top of default ones. ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix π - [ ] New feature β¨ - [ ] Enhancement π - [ ] Code refactoring π§ - [ ] Documentation update π - [ ] Chore π§Ή - [ ] Style π¨ ## Checklist **Does this PR introduce breaking changes?** - [ ] Yesβ οΈ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested β _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Alexey Kudinkin <[email protected]>
## Description Shortening the template per @edoakes' feedback. ## Related issues Follow-up to ray-project#57193. ## Additional information Made the following changes: 1. Removed `Types of change` and `Checklist`. 2. Updated contribution guide to point to Ray Docs. 3. Renamed `Additional context` to `Additional information` to be more encompassing. --------- Signed-off-by: Matthew Deng <[email protected]>
This pull request improves the display names for the Download operator, making them more descriptive by including the URI column name. This is a great enhancement for observability. --------- Signed-off-by: kyuds <[email protected]>
Previously we deprecated the runtime_env_info field in Task and Actor definition events; for backwards compatibility we continued populating these fields in the proto objects. This PR stops populating the field so that it is fully deprecated. Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…ing, HF Transformers, and HF Accelerate (ray-project#57632) This PR adds new release tests to test Ray Train v2 compatibility with PyTorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate. Each release test launches a train run on a real cluster. --------- Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Updating release test lockfiles due to a bug in raydepsets that doesn't respect package declarations in configs as it does requirements. Adding unsafe ray to ray image building to keep ray out of the release test dependencies. Successful aws_cluster_launcher run: https://buildkite.com/ray-project/release/builds/63731#_ (only manual tests are failing: aws_cluster_launcher_release_image). --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: angelinalg <[email protected]>
…t#57579) Signed-off-by: dayshah <[email protected]> Signed-off-by: joshlee <[email protected]> Co-authored-by: dayshah <[email protected]>
β¦AggType, U]) (ray-project#57281) ## Why are these changes needed? The current Generic types in `AggregateFnV2` are not tied to the class, so they are not picked up properly by static type checkers such as mypy. <!-- Please give a short summary of the change and the problem this solves. --> By adding the Generic[] in the class definition, we get full type checking support. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [N/A] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [N/A] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Arthur <[email protected]> Co-authored-by: Goutam <[email protected]>
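To make the typing point concrete, here is a generic sketch (not the actual `AggregateFnV2` code) showing why the type variables must appear in `Generic[...]` on the class for mypy to relate them across methods and subclasses:

```python
from typing import Generic, List, TypeVar

AggType = TypeVar("AggType")
U = TypeVar("U")

class AggregateFn(Generic[AggType, U]):
    # Because AggType and U are bound at the class level, mypy can check that
    # subclasses and call sites use them consistently across these methods.
    def init(self) -> AggType: ...
    def merge(self, a: AggType, b: AggType) -> AggType: ...
    def finalize(self, accumulator: AggType) -> U: ...

class ListSum(AggregateFn[List[int], int]):
    def init(self) -> List[int]:
        return []
    def merge(self, a: List[int], b: List[int]) -> List[int]:
        return a + b
    def finalize(self, accumulator: List[int]) -> int:
        return sum(accumulator)

# Without Generic[AggType, U] on the base class, the type variables are scoped
# to each method individually and the ListSum overrides above go unchecked.
```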
``` REGRESSION 51.89%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 8378.589542828342 to 4030.5453313124744 in microbenchmark.json REGRESSION 33.66%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 4105.951978131054 to 2723.9690685388855 in microbenchmark.json REGRESSION 29.43%: client__tasks_and_put_batch (THROUGHPUT) regresses from 12873.97871447783 to 9085.541921590711 in microbenchmark.json REGRESSION 27.77%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 13779.913284159866 to 9952.762154617178 in microbenchmark.json REGRESSION 26.34%: client__get_calls (THROUGHPUT) regresses from 1129.1409512898194 to 831.689705073893 in microbenchmark.json REGRESSION 25.07%: multi_client_put_gigabytes (THROUGHPUT) regresses from 36.697734067834084 to 27.49866375110667 in microbenchmark.json REGRESSION 23.94%: actors_per_second (THROUGHPUT) regresses from 508.9808896382363 to 387.1219957094043 in benchmarks/many_actors.json REGRESSION 17.22%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4826.171895058453 to 3995.0258578261814 in microbenchmark.json REGRESSION 16.58%: single_client_tasks_async (THROUGHPUT) regresses from 7034.736389002367 to 5868.239300602419 in microbenchmark.json REGRESSION 14.30%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.9643252583791863 to 0.8264370292993273 in microbenchmark.json REGRESSION 14.27%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1037.1186014627438 to 889.14113884267 in microbenchmark.json REGRESSION 12.90%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1014.64288638885 to 883.7347522770161 in microbenchmark.json REGRESSION 11.37%: client__put_calls (THROUGHPUT) regresses from 805.4069136266919 to 713.82381443796 in microbenchmark.json REGRESSION 11.02%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5222.99132120111 to 4647.56843532278 in microbenchmark.json REGRESSION 7.95%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 12.272053704608084 to 11.296427707979271 in microbenchmark.json REGRESSION 7.86%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 524.6134993747014 to 483.38098840508496 in microbenchmark.json REGRESSION 7.26%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4498.3519827438895 to 4171.572402867286 in microbenchmark.json REGRESSION 6.47%: single_client_wait_1k_refs (THROUGHPUT) regresses from 4.700920788730696 to 4.396844484606209 in microbenchmark.json REGRESSION 6.40%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 2766.907182403518 to 2589.906655726785 in microbenchmark.json REGRESSION 5.06%: single_client_put_gigabytes (THROUGHPUT) regresses from 19.30103208209274 to 18.324991353469613 in microbenchmark.json REGRESSION 1.73%: 1_n_actor_calls_async (THROUGHPUT) regresses from 7474.798821945149 to 7345.613928457275 in microbenchmark.json REGRESSION 421.24%: dashboard_p99_latency_ms (LATENCY) regresses from 232.641 to 1212.615 in benchmarks/many_pgs.json REGRESSION 377.77%: dashboard_p95_latency_ms (LATENCY) regresses from 11.336 to 54.16 in benchmarks/many_pgs.json REGRESSION 306.02%: dashboard_p99_latency_ms (LATENCY) regresses from 749.022 to 3041.184 in benchmarks/many_tasks.json REGRESSION 162.57%: dashboard_p50_latency_ms (LATENCY) regresses from 11.744 to 30.836 in benchmarks/many_actors.json REGRESSION 94.47%: dashboard_p95_latency_ms (LATENCY) regresses from 487.355 to 947.76 in benchmarks/many_tasks.json REGRESSION 35.48%: dashboard_p99_latency_ms (LATENCY) 
regresses from 49.716 to 67.355 in benchmarks/many_nodes.json REGRESSION 33.15%: dashboard_p95_latency_ms (LATENCY) regresses from 2876.107 to 3829.61 in benchmarks/many_actors.json REGRESSION 27.55%: dashboard_p95_latency_ms (LATENCY) regresses from 13.982 to 17.834 in benchmarks/many_nodes.json REGRESSION 7.56%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 13.777409734000003 to 14.81957527099999 in scalability/object_store.json REGRESSION 4.48%: 3000_returns_time (LATENCY) regresses from 6.1422604579999955 to 6.417386639000014 in scalability/single_node.json REGRESSION 3.41%: avg_pg_remove_time_ms (LATENCY) regresses from 1.419495533032749 to 1.4678401576576676 in stress_tests/stress_test_placement_group.json REGRESSION 1.48%: 10000_get_time (LATENCY) regresses from 25.136106761999997 to 25.508083513999992 in scalability/single_node.json REGRESSION 0.76%: stage_2_avg_iteration_time (LATENCY) regresses from 36.08304100036621 to 36.358218574523924 in stress_tests/stress_test_many_tasks.json ``` Signed-off-by: kevin <[email protected]>
β¦ay-project#57631) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? To prevent unknown issues with streaming executor not completing after hours, we will add assert statements to rule out potential isses. This PR ensures that when an operator is completed - Internal Input queue is empty - Internal Output queue is empty - External Input Queue is empty The external output queue can be non-empty, because the downstream operators will consume from it <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <[email protected]>
…roject#58234) ## Description We need to make sure we're running tests on at least SF100 so that we're capturing regressions that could otherwise fall under the noise level. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: Alexey Kudinkin <[email protected]>
…/5) (ray-project#58068) Updating docgpubuild to run on Python 3.10; updating minbuild-multiply job name to minbuild-serve. Post-merge test that uses the docgpubuild image: https://buildkite.com/ray-project/postmerge/builds/14073 --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Nikhil G <[email protected]>
…optimized download function (ray-project#57854) Signed-off-by: ahao-anyscale <[email protected]>
…ay-project#58261) ## Description This change makes Ray Data dump verbose telemetry for `ResourceManager` into the `ray-data.log` by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <[email protected]>
…GRADES(3/5) " (ray-project#58266) Reverts ray-project#58068
## Description Update formatting of FailurePolicy log message to be more readable. ## Additional information **Before:** ``` [FailurePolicy] Decision: FailureDecision.RAISE, Error source: controller, Error count / maximum errors allowed: 1/0, Error: Training failed due to controller error: Worker group is not active. Call WorkerGroup.create() to create a new worker group. ``` **After:** ``` [FailurePolicy] RAISE Source: controller Error count: 1 (max allowed: 0) Training failed due to controller error: Worker group is not active. Call WorkerGroup.create() to create a new worker group. ``` Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Yicheng-Lu-llll <[email protected]> Signed-off-by: Yicheng-Lu-llll <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
β¦#56742) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> Fifth split of ray-project#56416 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Enables Ruff import sorting for rllib/examples by narrowing per-file ignores and updates example filesβ imports accordingly with no functional changes. > > - **Tooling/Lint**: > - Update `pyproject.toml` Ruff per-file-ignores (replace blanket `rllib/*` with targeted subpaths) to enable import-order linting for `rllib/examples`. > - **Examples**: > - Reorder and normalize imports across `rllib/examples/**` to satisfy Ruff isort rules; no logic or behavior changes. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 101586e. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Gagandeep Singh <[email protected]> Signed-off-by: Kamil Kaczmarek <[email protected]> Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]>
…_options. (ray-project#58275) ## Description > Exclude IMPLICIT_RESOURCE_PREFIX from ReplicaConfig.ray_actor_options ## Related issues > Link related issues: "Fixes ray-project#58085" Signed-off-by: xingsuo-zbz <[email protected]>
β¦ + discrepancy fix in Python API 'serve.start' function (ray-project#57622) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? 1. Fix bug with 'proxy_location' set for 'serve run' CLI command `serve run` CLI command ignores `proxy_location` from config and uses default value `EveryNode`. Steps to reproduce: - have a script: ```python # hello_world.py from ray.serve import deployment @deployment async def hello_world(): return "Hello, world!" hello_world_app = hello_world.bind() ``` Execute: ``` ray stop ray start --head serve build -o config.yaml hello_world:hello_world_app ``` - change `proxy_location` in the `config.yaml`: EveryNode -> Disabled ``` serve run config.yaml curl -s -X GET "http://localhost:8265/api/serve/applications/" | jq -r '.proxy_location' ``` Output: ``` Before change: EveryNode - but Disabled expected After change: Disabled ``` 2. Fix discrepancy for 'proxy_location' in the Python API 'start' method `serve.start` function in Python API sets different `http_options.location` depending on if `http_options` is provided. Steps to reproduce: - have a script: ```python # discrepancy.py import time from ray import serve from ray.serve.context import _get_global_client if __name__ == '__main__': serve.start() client = _get_global_client() print(f"Empty http_options: `{client.http_config.location}`") serve.shutdown() time.sleep(5) serve.start(http_options={"host": "0.0.0.0"}) client = _get_global_client() print(f"Non empty http_options: `{client.http_config.location}`") ``` Execute: ``` ray stop ray start --head python -m discrepancy ``` Output: ``` Before change: Empty http_options: `EveryNode` Non empty http_options: `HeadOnly` After change: Empty http_options: `EveryNode` Non empty http_options: `EveryNode` ``` ------------------------------------------------------------- It changes current behavior in the following ways: 1. `serve run` CLI command respects `proxy_location` parameter from config instead of using the hardcoded `EveryNode`. 2. `serve.start` function in Python API stops using the default `HeadOnly` in case of empty `proxy_location` and provided `http_options` dictionary without `location` specified. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> Aims to simplify changes in the PR: ray-project#56507 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: axreldable <[email protected]>
## Description ### Status Quo This PR ray-project#54667 addressed issues of OOM by sampling a few lines of the file. However, this code always assumes the input file is seekable (i.e., not compressed). This means zipped files are broken, as in this issue: ray-project#55356 ### Potential Workaround - Refactor reused code between JsonDatasource and FileDatasource - default to 10000 if a zipped file is found ## Related issues ray-project#55356 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: iamjustinhsu <[email protected]>
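A rough sketch of the workaround described above, using hypothetical helper names (this is not Ray's actual JSON datasource code): sample the stream to estimate rows per block only when it is seekable, and fall back to the fixed default for compressed, non-seekable streams:

```python
import io

DEFAULT_ROWS_PER_BLOCK = 10_000  # fallback value mentioned above

def estimate_rows_per_block(f: io.IOBase, target_block_bytes: int = 128 << 20) -> int:
    if not f.seekable():
        # Compressed (e.g. gzip-wrapped) streams can't be cheaply sampled and
        # rewound, so use the fixed default instead of reading the whole file.
        return DEFAULT_ROWS_PER_BLOCK
    start = f.tell()
    sample = f.read(1 << 20)  # sample ~1 MiB to estimate average row size
    f.seek(start)             # rewind so the real read starts from the beginning
    lines = max(1, sample.count(b"\n"))
    avg_row_bytes = max(1, len(sample) // lines)
    return max(1, target_block_bytes // avg_row_bytes)
```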
…ray-project#58180) ## Expose Route Patterns in Proxy Metrics fixes ray-project#52212 ### Problem Proxy metrics (`ray_serve_num_http_requests_total`, `ray_serve_http_request_latency_ms`) only expose `route_prefix` (e.g., `/api`) instead of actual route patterns (e.g., `/api/users/{user_id}`). This prevents granular monitoring of individual endpoints without causing high cardinality from unique request paths. ### Design **Route Pattern Extraction & Propagation:** - Replicas extract route patterns from ASGI apps (FastAPI/Starlette) at initialization using `extract_route_patterns()` - Patterns propagate: Replica → `ReplicaMetadata` → `DeploymentState` → `EndpointInfo` → Proxy - Works with both normal patterns (routes in class) and factory patterns (callable returns app) **Proxy Route Matching:** - `ProxyRouter.match_route_pattern()` matches incoming requests to specific patterns using cached mock Starlette apps - Metrics tag requests with parameterized routes (e.g., `/api/users/{user_id}`) instead of prefixes - Fallback to `route_prefix` if patterns unavailable or matching fails **Performance:**

| Metric | Before | After |
| -- | -- | -- |
| Requests per second (RPS) | 403.39 | 397.82 |
| Mean latency (ms) | 247.9 | 251.37 |
| p50 (ms) | 224 | 223 |
| p90 (ms) | 415 | 428 |
| p99 (ms) | 526 | 544 |

### Testing - Unit tests for `extract_route_patterns()` - Integration test verifying metrics use patterns and avoid high cardinality - Parametrized for both normal and factory patterns --------- Signed-off-by: abrar <[email protected]>
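A minimal sketch of the route-pattern-extraction idea using plain FastAPI/Starlette APIs; the helper below is a simplified stand-in for the `extract_route_patterns()` mentioned above, and none of the Serve-side propagation (ReplicaMetadata, EndpointInfo, proxy matching) is reproduced here:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/users/{user_id}")
def get_user(user_id: int):
    return {"user_id": user_id}

def list_route_patterns(asgi_app) -> list:
    # Starlette/FastAPI apps expose declared routes with parameterized paths,
    # which is what lets metrics be tagged with "/api/users/{user_id}" instead
    # of raw request paths (avoiding high-cardinality labels).
    return [route.path for route in asgi_app.routes if hasattr(route, "path")]

print(list_route_patterns(app))
# Includes FastAPI's built-in docs routes plus "/api/users/{user_id}".
```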
Signed-off-by: Future-Outlier <[email protected]> Signed-off-by: Han-Ju Chen (Future-Outlier) <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
β¦#58060) This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
The pull request #664 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5020.
Summary of Changes
Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request represents an automated daily merge from the `master` branch into the `main` branch.
Code Review
This pull request is an automated daily merge from master to main. It contains a large number of changes, primarily focused on a significant refactoring of the CI and build systems. Key changes include modularizing the build process, migrating from miniconda to miniforge, introducing a new dependency management tool (raydepsets) using uv, and updating various build configurations and scripts. These changes appear to be a positive step towards a more modern and maintainable build system. My review has identified one critical issue where an error check was removed, potentially leading to silent failures.
```
memory_store_->Put(
    ::ray::RayObject(buffer, nullptr, std::vector<rpc::ObjectReference>()), object_id);
```
The status check for memory_store_->Put has been removed. The Put method can fail (e.g., by returning Status::OutOfMemory), and this failure is now silently ignored. This could lead to objects not being stored and subsequent hard-to-debug failures. The status check should be restored to ensure errors are properly handled.
```
auto status = memory_store_->Put(
    ::ray::RayObject(buffer, nullptr, std::vector<rpc::ObjectReference>()), object_id);
if (!status.ok()) {
  throw RayException("Put object error: " + status.ToString());
}
```
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
- Created: 2025-10-30
- Merge direction: `master` → `main`
- Triggered by: Scheduled

Please review and merge if everything looks good.