daily merge: master → main 2025-10-20 #655
… be combined (ray-project#57240)

## Why are these changes needed?

The original [PR](ray-project#56918), while fixing all of the infra, missed deleting the line at the end to relax this constraint:

1. Removed the constraint, allowing AVSTT w/ diverging `ndim`s to be merged
2. Added tests

---------
Signed-off-by: Alexey Kudinkin <[email protected]>
…t#57227) Signed-off-by: dayshah <[email protected]>
…e same task are storing returns (ray-project#54904) Signed-off-by: dayshah <[email protected]>
…project#57061) Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Nikhil Ghosh <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Co-authored-by: Nikhil Ghosh <[email protected]>
^^ title says it RFC Link: ray-project#54652 --------- Signed-off-by: harshit <[email protected]> Co-authored-by: Douglas Strodtman <[email protected]>
This pull request adds a configurable `max_constructor_retry_count` for deployments, enabling users to define how many times a failing constructor should be retried. The value can now be set via both an environment variable and the deployment config. When both are provided, the environment variable takes precedence. GH issue link: ray-project#55786 --------- Signed-off-by: harshit <[email protected]> Co-authored-by: Cindy Zhang <[email protected]>
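The precedence rule described above (environment variable over deployment config) can be sketched in a few lines. This is a minimal illustration, not the actual Ray Serve code: the environment variable name `EXAMPLE_MAX_CONSTRUCTOR_RETRY_COUNT`, the default of 3, and the helper `resolve_retry_count` are all hypothetical, since the commit message does not state them.

```python
import os
from typing import Mapping, Optional

# Hypothetical names: the source gives neither the environment variable's name
# nor the built-in default, so both are placeholders for illustration only.
ENV_VAR = "EXAMPLE_MAX_CONSTRUCTOR_RETRY_COUNT"
DEFAULT_RETRY_COUNT = 3


def resolve_retry_count(
    config_value: Optional[int],
    environ: Mapping[str, str] = os.environ,
) -> int:
    """Return the retry count, letting the environment variable win."""
    env_value = environ.get(ENV_VAR)
    if env_value is not None:
        # The environment variable takes precedence when both are provided.
        return int(env_value)
    if config_value is not None:
        return config_value
    return DEFAULT_RETRY_COUNT
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.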
fix `text_embedding_*` release tests Signed-off-by: Lonnie Liu <[email protected]>
… instead of `KEYS` when identifying GCS keys to clean up (ray-project#56907) Signed-off-by: acrewdson <[email protected]>
) moving configs to a configs directory for raydepsets --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…tion (ray-project#57247) Signed-off-by: dayshah <[email protected]>
ray-project#56826) This PR adds support to parse the `GangResourceRequest.bundle_selectors.resource_requests` field for gang resource requests in the V2 Autoscaler. This proto field replaces the deprecated `GangResourceRequest.resource_requests` ([definition](https://github.com/ray-project/ray/blob/3408fe94a687e0ed03f6861ab8f9e8708a68763a/src/ray/protobuf/autoscaler.proto#L85)) in order to support repeated selectors for fallback strategy. This change is required for autoscaling to work with the `bundle_label_selector` placement group option. This PR also adds an e2e test case for scaling up a placement group with `bundle_label_selector` specified. This test verifies that the v2 scheduler scales nodes satisfying the given label constraints, preferring nodes with the required `labels` over node types that have sufficient resources but lack those labels. ray-project#51564 --------- Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]>
…y-project#57138) Signed-off-by: dayshah <[email protected]> Co-authored-by: dayshah <[email protected]>
^^ title says it Signed-off-by: harshit <[email protected]>
…project#56374)

This PR contains only the Python changes from ray-project#56369, adding `fallback_strategy` as an option to the remote decorator of Tasks/Actors. A fallback strategy consists of a list of dicts of decorator options. The options in each dict are evaluated together, and the first satisfied strategy dict is scheduled. With this PR, the only supported option is `label_selector`.

Example using `fallback_strategy` to schedule on different instance types:

```
import ray


@ray.remote(
    label_selector={"instance_type": "m5.16xlarge"},
    fallback_strategy=[
        # Fall back to selector for a "m5.large" instance type if "m5.16xlarge"
        # cannot be satisfied.
        {"label_selector": {"instance_type": "m5.large"}},
        # Finally, fall back to an empty set of labels (no constraints) if
        # neither desired m5 type can be satisfied.
        {"label_selector": {}},
    ],
)
class A:
    pass
```

In the above example, the `label_selector` field is tried first. Then the scheduler iterates through each dict in `fallback_strategy` and attempts to schedule using the label selector specified there (first `{"instance_type": "m5.large"}` and then the empty set). The first satisfied `label_selector` is scheduled.

ray-project#51564

---------
Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
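The evaluation order described in the `fallback_strategy` commit above (primary selector first, then each fallback dict in order, first satisfiable one wins) can be sketched in plain Python. This is only an illustration of the ordering, not Ray's scheduler: `pick_label_selector` and `matches_any_node` are hypothetical helpers, nodes are modeled as plain label dicts, and matching is simple equality, ignoring any richer selector syntax Ray may support.

```python
from typing import Callable, Dict, List, Optional

LabelSelector = Dict[str, str]


def pick_label_selector(
    primary: LabelSelector,
    fallback_strategy: List[Dict[str, LabelSelector]],
    is_satisfiable: Callable[[LabelSelector], bool],
) -> Optional[LabelSelector]:
    """Return the first satisfiable selector, trying the primary one first."""
    candidates = [primary] + [opts["label_selector"] for opts in fallback_strategy]
    for selector in candidates:
        if is_satisfiable(selector):
            return selector
    return None  # nothing can be scheduled right now


def matches_any_node(nodes: List[Dict[str, str]]) -> Callable[[LabelSelector], bool]:
    """Build a predicate: does any node carry every label in the selector?"""

    def is_satisfiable(selector: LabelSelector) -> bool:
        return any(
            all(labels.get(key) == value for key, value in selector.items())
            for labels in nodes
        )

    return is_satisfiable
```

With a single node labeled `{"instance_type": "m5.large"}`, the primary `m5.16xlarge` selector fails and the first fallback is chosen; an empty selector matches any existing node because it imposes no constraints.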
## Why are these changes needed?

We currently have inqueue metrics; we also need outqueue metrics (for the last operator). This PR creates 3 panels:

- external outqueue bytes
- external outqueue blocks
- combined panels (overview section)

---

> [!NOTE]
> Introduce external output-queue metrics (blocks/bytes), wire them into execution, and surface via new/renamed Grafana panels with combined output-queue view; update tests accordingly.
>
> - **Metrics/runtime (Ray Data)**:
>   - Add new metrics: `num_external_outqueue_blocks`, `num_external_outqueue_bytes` in `OpRuntimeMetrics`.
>   - Update `StreamingExecutor` state to increment/decrement external outqueue metrics on output produced, dispatched, and consumed.
> - **Dashboard (Grafana panels)**:
>   - New panels: `EXTERNAL_OUTQUEUE_BLOCKS_PANEL`, `EXTERNAL_OUTQUEUE_BYTES_PANEL`, and `COMBINED_OUTQUEUE_BLOCKS_PANEL`.
>   - Add new panels to "Pending Outputs" and "Overview" rows.
>   - Rename panel titles from "Inqueue/Outqueue" to "Input/Output Queue" for clarity (internal/external, blocks/bytes).
> - **Tests**:
>   - Extend expected metrics to include new external outqueue fields; adjust logging test to use `take_all()` and normalize expectations.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 2dda787. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

---------
Signed-off-by: iamjustinhsu <[email protected]>
) use core binary bits and dashboard builds from previous bits for testing. This caches C/C++ binary parts much more aggressively and speeds up CI for Python-only changes by about 15 minutes. Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Haoyuan Ge <[email protected]>
…d check (ray-project#57253) Signed-off-by: dayshah <[email protected]>
Signed-off-by: abrar <[email protected]>
## Why are these changes needed?

The Footsies environment tests are flaky because multiple actors can race to download, unzip, or rename the same files at the same time.

## Related issue number

flaky tests

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---

> [!NOTE]
> Prevents concurrent download/extraction of Footsies binaries by adding file locks around these steps.
>
> - **Footsies env binary handling (`footsies_binary.py`)**:
>   - **Concurrency control**: Add `filelock.FileLock` to serialize binary download (`.footsies-download.lock`) and unzip/rename (`.footsies-unzip.lock`).
>   - **Download**: Skip when `full_download_path` exists; otherwise stream-download with lock.
>   - **Unzip/rename**: Skip when `renamed_path` exists; otherwise extract and rename within lock.
>   - **Imports**: Add `from filelock import FileLock`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 3a02473. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

---------
Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]>
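The "skip if present, otherwise do the work under a lock" pattern from the Footsies fix above can be sketched as follows. This is a simplified illustration, not the code in `footsies_binary.py`: `ensure_file` and `demo` are hypothetical names, and a `threading.Lock` stands in for `filelock.FileLock` so the sketch needs only the standard library (a cross-process setup would pass a file lock instead). The sketch checks for the file inside the lock, which is an assumption about ordering rather than something the commit message confirms.

```python
import tempfile
import threading
from pathlib import Path
from typing import Callable, ContextManager, List


def ensure_file(path: Path, fetch: Callable[[Path], None], lock: ContextManager) -> bool:
    """Create `path` via `fetch` at most once; return True if this call did the work."""
    with lock:
        # Check inside the lock: another worker may have finished while we waited.
        if path.exists():
            return False
        tmp = path.with_name(path.name + ".part")
        fetch(tmp)
        tmp.replace(path)  # rename into place once the content is complete
        return True


def demo(num_workers: int = 4) -> int:
    """Run several competing workers and return how many times `fetch` ran."""
    calls: List[Path] = []

    def fake_fetch(dest: Path) -> None:
        calls.append(dest)
        dest.write_bytes(b"binary-contents")

    lock = threading.Lock()  # stand-in for filelock.FileLock(...)
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "footsies.zip"
        workers = [
            threading.Thread(target=ensure_file, args=(target, fake_fetch, lock))
            for _ in range(num_workers)
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
    return len(calls)
```

Writing to a temporary name and renaming afterwards means a reader that only checks for the final path never sees a half-written file.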
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]> Co-authored-by: matthewdeng <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <[email protected]>
…al frequency (ray-project#57273) The `image_embedding_from_jsonl_fixed_size_chaos` release test runs a large image embedding workload with a preemption every minute. Since this test features long-running tasks and frequent preemptions, it's expected to time out (it's not a regression). So, this PR changes the frequency to manual. --------- Signed-off-by: Balaji Veeramani <[email protected]>
…-project#57147) Replace `CheckpointManager`'s usage of pydantic v2 APIs with v1 APIs instead. --------- Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> Co-authored-by: Justin Yu <[email protected]>
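The ordering fix in the race-condition commit above (drain the monitor queue before flipping the "running" flag) can be illustrated with a small standard-library sketch. This is not Ray Train's `ThreadRunner`: the class `ThreadRunnerSketch` and its methods are hypothetical, and only the `is_running` / `get_error` names echo the commit message.

```python
import queue
import threading
from typing import Callable, Optional


class ThreadRunnerSketch:
    """Runs a target in a thread; a monitor thread records any exception."""

    def __init__(self) -> None:
        self._errors: "queue.Queue[BaseException]" = queue.Queue()
        self._error: Optional[BaseException] = None
        self._running = False
        self._thread: Optional[threading.Thread] = None
        monitor = threading.Thread(target=self._monitor_loop, daemon=True)
        monitor.start()

    def _monitor_loop(self) -> None:
        while True:
            exc = self._errors.get()
            self._error = exc
            self._errors.task_done()

    def run(self, target: Callable[[], None]) -> None:
        def wrapped() -> None:
            try:
                target()
            except Exception as exc:
                self._errors.put(exc)
            finally:
                # Join the queue first, so every queued exception has been
                # processed by the monitor before we report "not running".
                self._errors.join()
                self._running = False

        self._running = True
        self._thread = threading.Thread(target=wrapped, daemon=True)
        self._thread.start()

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()

    def is_running(self) -> bool:
        return self._running

    def get_error(self) -> Optional[BaseException]:
        return self._error
```

Without the `join()` call, a caller could observe `is_running() == False` while the exception is still sitting in the queue, and wrongly conclude the run succeeded.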
…ion manager (ray-project#57270)

## Why are these changes needed?

Split text_execution_optimizer into multiple files and lower the cardinality for test_consumption.

---------
Signed-off-by: iamjustinhsu <[email protected]>
…ath (ray-project#57095) Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError. Closes ray-project#57092 --------- Signed-off-by: Sagar Sumit <[email protected]>
With ray-project#56436 it is possible to get multiple unsubscribes with the following set of operations:

1. Ref goes out of scope
2. CommandBatch request sent (message immediately published to mailbox since ref is out of scope)
3. CommandBatch reply lost
4. Retry CommandBatch request (another message published to mailbox!)
5. LongPollingResponse retrieves 2 WORKER_REF_REMOVED messages, which trigger unsubscribe twice and trip the RAY_CHECK

Since this function just cleans up local subscriber state, I don't think there are any issues with making it a void return type instead of returning a bool.

---------
Signed-off-by: joshlee <[email protected]>
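The idea behind the unsubscribe change above, making a repeated unsubscribe harmless instead of asserting on a boolean result, can be sketched in Python. The real code is C++ inside Ray's pub/sub subscriber; the class and method names below are hypothetical and only illustrate idempotent cleanup of local state.

```python
from typing import Callable, Dict


class SubscriberStateSketch:
    """Tracks local subscriptions; unsubscribing twice is a harmless no-op."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, Callable[[str], None]] = {}

    def subscribe(self, key: str, callback: Callable[[str], None]) -> None:
        self._callbacks[key] = callback

    def unsubscribe(self, key: str) -> None:
        # No return value to check: removing a key that is already gone is
        # fine, so a duplicated "ref removed" message cannot trip an assertion.
        self._callbacks.pop(key, None)

    def is_subscribed(self, key: str) -> bool:
        return key in self._callbacks
```

A retried request that publishes the same removal message twice then simply results in two calls, the second of which does nothing.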
…-project#57230)

# Summary

We currently expose many data iterator metrics in `DatasetStats.to_summary` (https://github.com/ray-project/ray/blob/3408fe94a687e0ed03f6861ab8f9e8708a68763a/python/ray/data/_internal/stats.py#L1010) but not in Prometheus. This PR adds most of these metrics to Prometheus as well.

# Testing

I ran a typical Ray Train + Data job in an Anyscale workspace. The time metrics look reasonable

<img width="3417" height="1125" alt="Screenshot 2025-10-06 at 7 04 23 PM" src="https://github.com/user-attachments/assets/1db3caba-6c00-4846-b15b-f63875f64cd3" />

but the iteration blocks metrics show negative numbers for some reason

<img width="3415" height="568" alt="Screenshot 2025-10-06 at 7 05 54 PM" src="https://github.com/user-attachments/assets/d418bb2f-1dbc-4196-8ef5-12ce5f11d3f2" />

---------
Signed-off-by: Timothy Seah <[email protected]>
…ing cluster teardown (ray-project#57610) After running `ray down`, several resources (a managed service identity, network security group, etc.) are left in the subscription, and re-running `ray up` (without specifying --no-config-cache) will sometimes run into errors because of this. ## Related issue number Fixes: ray-project#55392 Signed-off-by: Mark Rossett <[email protected]>
The old expression evaluator did not correctly handle `is_in` which failed tests in `test_expression_evaluator` ## Related issues Fixes ray-project#57820 --------- Signed-off-by: Goutam <[email protected]>
Release tests, unit tests, doctests, and examples are all migrated to V2, so this PR turns V2 on by default. To run with Train V1 (deprecated), set `RAY_TRAIN_V2_ENABLED=0`. --------- Signed-off-by: Justin Yu <[email protected]>
…ct#57133) This PR adds a workspace template that walks users through how to integrate PyTorch Profiler with Ray Train. The purpose of this template is to show users how to generate trace/memory profiles with PyTorch Profiler in the TorchTrainer. For a high-level overview, this template covers: 1. A hands-on example of training an image classification model 2. A simple torch profiler integration script 3. Some more advanced use cases, including `record_function` to customize the profiling experience. 4. A successful release test run: https://buildkite.com/ray-project/release/builds/63492#0199e3e4-fa8c-4f4f-a92c-f6d47a415c53 Testing: tested in an Anyscale workspace --------- Signed-off-by: Lehui Liu <[email protected]>
Signed-off-by: abrar <[email protected]>
…uides and improved navigation (ray-project#57787) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: angelinalg <[email protected]>
…ss/postprocess (ray-project#57826) Signed-off-by: Nikhil Ghosh <[email protected]>
## Description This removes an orphaned code file that was previously used by the Preprocessor User Guide. ## Related issues Corresponding User Guide was removed in ray-project#44006. Closes ray-project#57867. ## Additional details This test started failing because of the new `XGBoostTrainer` API enabled by default with Ray Train V2. Rather than update the snippet, removing this code instead. Signed-off-by: Matthew Deng <[email protected]>
adding eslint and prettier script to precommit before getting rid of format.sh 1 step closer to replacing scripts/format.sh with pre-commit (pre-commit is currently missing eslint) tested locally: <img width="898" height="929" alt="image" src="https://github.com/user-attachments/assets/58c77fb7-bdde-47ae-ac2b-b864334b3f30" /> --------- Signed-off-by: elliot-barn <[email protected]>
First test running on AKS cloud! --------- Signed-off-by: kevin <[email protected]> Signed-off-by: Kevin H. Luu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
## Description Updating so that the module shows as `ray.train` rather than `ray.train.v2.api.exceptions` ## Testing https://anyscale-ray--57865.com.readthedocs.build/en/57865/train/api/doc/ray.train.v2.api.data_parallel_trainer.DataParallelTrainer.fit.html#ray.train.v2.api.data_parallel_trainer.DataParallelTrainer.fit <img width="960" height="302" alt="image" src="https://github.com/user-attachments/assets/02206542-54fe-4674-b2b4-1868fa7e8580" /> Signed-off-by: Matthew Deng <[email protected]>
- Add 2 hello world tests with regular base image & custom image running on GCE --------- Signed-off-by: kevin <[email protected]> Signed-off-by: Kevin H. Luu <[email protected]>
## Description Bump from small to medium due to timeouts happening specifically in py3.12 tests. --------- Signed-off-by: Matthew Deng <[email protected]>
## Why are these changes needed?

The `num_module_steps_trained_(lifetime)_throughput` metrics are biased because of how throughput times are recorded in a loop over module batches. This PR fixes this bias.

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------
Signed-off-by: simonsays1980 <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]>
…orker` (ray-project#57859) ## Description The type annotation for `actor_location_tracker` is currently `ActorLocationTracker`, but it should be `ray.actor.ActorHandle[ActorLocationTracker]`. This PR fixes that issue. Signed-off-by: Balaji Veeramani <[email protected]>
ray-project#57834) Signed-off-by: Jiajun Yao <[email protected]>
…r'. (ray-project#57673)

## Why are these changes needed?

The type hints for `learner_connector` in `AlgorithmConfig.training` were deprecated, still using the `RLModule` as a parameter. This PR adjusts the type hints to the actual expected form of the callable.

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: simonsays1980 <[email protected]>
`result_of_t` is deprecated Signed-off-by: Lonnie Liu <[email protected]>
The pull request #655 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 4756.
Summary of Changes
Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily focuses on a major refactoring and enhancement of the CI/CD pipeline, particularly around Bazel configurations, Docker image builds, and Python dependency management. It introduces a more modular approach to defining and building various Ray components and their dependencies, improves linting and code quality checks, and updates the CI testing infrastructure for better efficiency and clarity.
Code Review
This pull request is a large automated merge that includes a significant refactoring of the build system and CI pipelines. Key changes include migrating from setup.py based wheel builds to pip wheel, extensive reorganization of Bazel BUILD files for better modularity, and updates to CI configurations across various platforms. The PR also introduces a new dependency management tool raydepsets and updates many dependencies and linting configurations. My review found one potential issue where a status check was removed, which could lead to silent failures. Overall, the changes seem to be a major step towards modernizing and improving the maintainability of the project's build and CI infrastructure.
memory_store_->Put(
    ::ray::RayObject(buffer, nullptr, std::vector<rpc::ObjectReference>()), object_id);
if (!status) {
  throw RayException("Put object error");
}
This pull request has been automatically marked as stale because it has not had … You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no more activity in the 14 days … Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution!
This Pull Request was created automatically to merge the latest changes from `master` into `main` branch.
Created: 2025-10-20
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.