daily merge: master → main 2025-11-10 #673
base: main
Conversation
…oject#57948) There's a potential risk in the current midpoint calculation: it can be wrong when values are negative. Line 167: `lower_bound + buckets[0] / 2.0` Line 171: `(buckets[i] + buckets[i - 1]) / 2.0` I improved the formula and added a test to make sure it works. Signed-off-by: justwph <[email protected]>
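A minimal sketch of the safer midpoint form, assuming `lower_bound` is the lowest bucket boundary and `buckets` holds the upper boundaries (names taken from the quoted lines; the actual fix in the PR may differ):

```python
def bucket_midpoint(lower: float, upper: float) -> float:
    # Writing the midpoint as lower + (upper - lower) / 2.0 keeps the result
    # inside [lower, upper] even when both boundaries are negative, whereas
    # lower + upper / 2.0 can fall outside the bucket.
    return lower + (upper - lower) / 2.0

# Hypothetical usage mirroring the two quoted lines:
# midpoints = [bucket_midpoint(lower_bound, buckets[0])] + [
#     bucket_midpoint(buckets[i - 1], buckets[i]) for i in range(1, len(buckets))
# ]
```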
## Description See ray-project#57924 ## Related issues Fixes ray-project#57924 ## Additional information - [x] Verify by manual testing -> confirmed with example using `write_numpy` Signed-off-by: kyuds <[email protected]>
## Why are these changes needed? With ray-project#55207 Ray Train now has support for training functions with a JAX backend through the new `JaxTrainer` API. This guide provides a short overview of the API, how to configure with TPUs, and how to edit a JAX script to use Ray Train. TODO: I will link a longer e2e guide with KubeRay, MaxText, and the JaxTrainer on TPUs in GKE --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <[email protected]> Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Sagar Sumit <[email protected]> Signed-off-by: Jiajun Yao <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Currently RDT object metadata will stick around on the driver. This can lead to some weird behavior, like the src actor living past expectations because the actor handle sticks around in the metadata. I'm now freeing the metadata on the owner whenever we decide to tell the primary copy owner to free (when the ref counter decides it's "OutOfScope"). https://github.com/ray-project/ray/blob/d8b7a54be63638924369ac5c7d5e671a23f151a7/python/ray/experimental/gpu_object_manager/gpu_object_manager.py#L29-L38 A couple of TODOs for the future: - Currently not freeing metadata on borrowers when you use NIXL to put -> get. - We currently don't support lineage reconstruction for RDT objects; once we do, we may have to change the metadata free to happen on ref deletion rather than on OutOfScope. Or have some metadata recreation path? --------- Signed-off-by: dayshah <[email protected]>
run it continuously to capture potential issues Signed-off-by: Kevin H. Luu <[email protected]>
…project#57947) Creating a depset for doc builds. Pinning pydantic==2.50 for doc requirements because api_policy_check is failing due to dependencies (failure below). Installing ray without deps. Failure due to dependencies: https://buildkite.com/ray-project/premerge/builds/52231#019a04a9-e9f8-4151-b5e5-d4bceb48a3cc Successful api_policy_check run on this PR: https://buildkite.com/ray-project/microcheck/builds/29312#019a05b8-ef40-4449-9ca2-6dd9d8e790a7 --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
## Description Change `train_colocate_trainer` release test frequency to be manual. ## Related issues Related to ray-project#49454. ## Additional information `ScalingConfig.trainer_resources` has been deprecated in Ray Train V2. As a result, we should disable the test for now. In the future, we can either: 1. Delete this test entirely. 2. Add back functionality for colocation & reenable this test. Signed-off-by: Matthew Deng <[email protected]>
…or (ray-project#57869) # Summary We observed that whenever `after_worker_group_poll_status` raised an exception, the Train Run would fail ungracefully and show up as `ABORTED` in the dashboard. This happened in the following situations: 1) Different workers report remote checkpoints with different paths -> `(TrainController pid=46993) RuntimeError: The storage path of the checkpoints in the training results is not the same. This means the checkpoints are not consistent. Got a mix of the following checkpoint paths: {'/tmp/tmpl95kv7ax', '/tmp/tmp__8e6etk'}` -> `ABORTED` Train Run 2) `ray.train.report({"loss": ...}, checkpoint=checkpoint)` in `train_func` -> `TypeError: Object of type 'ellipsis' is not JSON serializable` in `CheckpointManager._save_state` -> `ABORTED` Train Run This PR catches these exceptions, wraps them in a `ControllerError`, and goes through the `FailurePolicy`, ultimately resulting in an `ERRORED` Train Run, which is more intuitive because it happened due to an error in the training workers (`The Train run failed due to an error in the training workers.` is the comment associated with `RunStatus.ERRORED`). I considered implementing a more general solution that caught all `WorkerGroupCallback` errors and resurfaced them as `ControllerError`s, but decided against it because: * Callbacks occur in many different places and we might want to add custom try/catch logic in each case. * `after_worker_group_poll_status` is the only offender so far and most of its errors are from user mistakes; other callback errors could be legitimate bugs that should result in `ABORTED` # Testing Unit tests --------- Signed-off-by: Timothy Seah <[email protected]>
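The behavior described above boils down to something like the following sketch (illustrative only, not Ray Train's actual code; `ControllerError` is a stand-in for the internal error type):

```python
class ControllerError(Exception):
    """Stand-in for Ray Train's internal controller error type."""


def poll_status_with_failure_policy(callback, status):
    # Errors raised by after_worker_group_poll_status are wrapped so they flow
    # through the FailurePolicy and surface as an ERRORED run instead of ABORTED.
    try:
        return callback.after_worker_group_poll_status(status)
    except Exception as exc:
        raise ControllerError(exc) from exc
```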
## Description Fix release tests that are still importing from `ray.train` to import from `ray.tune`, as described in ray-project#49454. ## Additional information Test Runs:
| Test Name | Before | After |
| --- | --- | --- |
| `cluster_tune_scale_up_down.aws` | https://buildkite.com/ray-project/release/builds/64733#019a0546-dbe2-47b1-ba85-983d48098352 | https://buildkite.com/ray-project/release/builds/64809/steps/canvas?sid=019a0804-dc62-4177-a24a-b23806cd0d51 |
| `cluster_tune_scale_up_down.kuberay` | https://buildkite.com/ray-project/release/builds/64733#019a0549-25dd-4dd3-ba84-df9d3ce5a72a | https://buildkite.com/ray-project/release/builds/64809/steps/canvas?sid=019a0804-dc63-49b3-8efc-ff3bd5fb8d28 |
| `tune_worker_fault_tolerance` | https://buildkite.com/ray-project/release/builds/64734#019a058c-2f15-401c-88d9-894002152fff | https://buildkite.com/ray-project/release/builds/64845/steps/canvas?sid=019a08aa-6c85-43d4-83e0-bf56d49ab474 |
--------- Signed-off-by: Matthew Deng <[email protected]>
Since we are now logging into Azure using a certificate, these environment variables are no longer used. Signed-off-by: Kevin H. Luu <[email protected]>
…API (ray-project#57977) ## Summary This PR updates the document embedding benchmark to use the canonical Ray Data implementation pattern, following best practices for the framework.
## Key Changes
### Use `download()` expression instead of separate materialization
**Before:**
```python
file_paths = (
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .take_all()
)
file_paths = [row["uploaded_pdf_path"] for row in file_paths]
ds = ray.data.read_binary_files(file_paths, include_paths=True)
```
**After:**
```python
(
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .with_column("bytes", download("uploaded_pdf_path"))
)
```
This change: - Eliminates the intermediate materialization with `take_all()`, which loads all data into memory - Uses the `download()` expression to lazily fetch file contents as part of the pipeline - Removes the need for a separate `read_binary_files()` call
### Method chaining for cleaner code
All operations are now chained in a single pipeline, making the data flow clearer and more idiomatic.
### Consistent column naming
Updated references from `path` to `uploaded_pdf_path` throughout the code for consistency with the source data schema. Signed-off-by: Balaji Veeramani <[email protected]>
This PR addresses several failing release tests, likely due to the recent Ray Train V2 default enablement. The following failing release tests are addressed: - huggingface_transformers - distributed_training.regular - distributed_training.chaos
distributed_training fix: `distributed_training.regular` and `distributed_training.chaos` were failing because they relied on the deprecated free-floating metrics reporting functionality. The tests attempted to access a nonexistent key in `result.metrics` that was never reported. The fix uploads a checkpoint to ensure this key exists.
huggingface_transformers: The `huggingface_transformers` test was failing due to outdated accelerate and peft versions. The fix leverages a post-build file to ensure the proper accelerate and peft versions.
Tests:
| Test Name | Before | After |
| -- | -- | -- |
| huggingface_transformers | https://buildkite.com/ray-project/release/builds/64733#019a0559-25da-401f-8d7e-3128b8f7d287 | https://buildkite.com/ray-project/release/builds/64888#019a090d-5f53-4f7d-b0ac-ac8cf7c529b6 |
| distributed_training.regular | https://buildkite.com/ray-project/release/builds/64733#019a0572-1095-4b6f-b3bc-b496227c9280 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b5-41b6-aaf6-2bbd443a0a1e |
| distributed_training.chaos | https://buildkite.com/ray-project/release/builds/64733#019a0574-3862-4da2-a264-a9e11333bd72 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b6-4344-90f2-cbcd637aae3d |
--------- Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
num_waiter == 0 does not necessarily mean that the request has been completed. --------- Signed-off-by: abrar <[email protected]>
…es (ray-project#57883) This PR adds a test to verify that DataOpTask handles node failures correctly during execution. To enable this testing, callback seams are added to DataOpTask that allow tests to simulate preemption scenarios by killing and restarting nodes at specific points during task execution. ## Summary - Add callback seams (`block_ready_callback` and `metadata_ready_callback`) to `DataOpTask` for testing purposes - Add `has_finished` property to track task completion state - Create `create_stub_streaming_gen` helper function to simplify test setup - Refactor existing `DataOpTask` tests to use the new helper function - Add new parametrized test `test_on_data_ready_with_preemption` to verify behavior when nodes fail during execution ## Test plan - Existing tests pass with refactored code - New preemption test validates that `on_data_ready` handles node failures correctly by testing both block and metadata callback scenarios --------- Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description 1. `mlflow.start_run()` does not have a `tracking_uri` arg: https://mlflow.org/docs/latest/api_reference/python_api/mlflow.html#mlflow.start_run 2. Rewrite the mlflow setup as follows:
```
mlflow.set_tracking_uri(uri="file://some_shared_storage_path/mlruns")
mlflow.set_experiment("my_experiment")
mlflow.start_run()
```
## Related issues N/A --------- Signed-off-by: Lehui Liu <[email protected]>
…t#57855) ## Description 1. Add visitors for collecting column names from all expressions and renaming names across the tree. 2. Use expressions for `rename_columns`, `with_column`, `select_columns`, and remove `cols` and `cols_rename` in `Project`. 3. Modify projection pushdown to work correctly with combinations of the above operators. ## Related issues Closes ray-project#56878, ray-project#57700 Signed-off-by: Goutam <[email protected]>
…ject#55291) Resolves: ray-project#55288 (wrong `np.array` in `TensorType`) Further changes: - Changed comments to (semi)docstrings, which will be displayed as tooltips by IDEs (e.g. VSCode + Pylance), making that information available to the user. - `AgentID: Any -> Hashable` as it is used for dict keys - Changed `DeviceType` to not be a TypeVar (makes no sense in the way it is currently used); also includes DeviceLikeType (`int | str | device`) from `torch`. IMO it can fully replace the current type, but being defensive I only added it as an extra possible type - Used the updated DeviceType to improve the type of Runner._device and make it more correct - Used torch's own type in `data`; the current code supports more than just `str`. I refrained from adding a reference to `rllib` despite it being nice if they would be in sync. - Some extra formatting that is forced by pre-commit --- > [!NOTE] > Revamps `rllib.utils.typing` (NDArray-based `TensorType`, broader `DeviceType`, `AgentID` as `Hashable`, docstring cleanups) and updates call sites to use optional device typing and improved hints. > - **Types**: > - Overhaul `rllib/utils/typing.py`: > - `TensorType` now uses `numpy.typing.NDArray`; heavy use of `TYPE_CHECKING` to avoid runtime deps on torch/tf/jax. > - `DeviceType` widened to `Union[str, torch.device, int]` (was `TypeVar`). > - `AgentID` tightened to `Hashable`; `NetworkType` uses `keras.Model`. > - Refined aliases (e.g., `FromConfigSpec`, `SpaceStruct`) and added concise docstrings. > - **Runners**: > - `Runner._device` now `Optional` (`Union[DeviceType, None]`) with updated docstring; same change in offline runners' `_device` properties. > - **Connectors**: > - `NumpyToTensor`: `device` param typed as `Optional[DeviceType]` (via `TYPE_CHECKING`). > - **Utils**: > - `from_config`: typed `config: Optional[FromConfigSpec]` with `TYPE_CHECKING` import. > - **Misc**: > - Minor formatting/import ordering and comment typo fixes. --------- Signed-off-by: Daniel Sperber <[email protected]> Signed-off-by: Daraan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]>
…57993) Although Spark-on-Ray depends on the Java bindings, the `java` tests are triggered by all C++ changes, and we don't want to run Spark-on-Ray tests every time we change C++ code. --------- Signed-off-by: Edward Oakes <[email protected]>
…ay-project#57771) Signed-off-by: Nikhil Ghosh <[email protected]>
…ay-project#57987) Signed-off-by: daiping8 <[email protected]>
…ect#57974) This PR replaces STATS with Metric as the way to define metrics inside Ray (as a unification effort) in all object-manager components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, in this case, because the object manager is used as an API across components, doing so would feel unnecessarily cumbersome. I decided to define the metrics inline within each client and server class instead. Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities. **Details** Full context of this refactoring work: - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
## Description Use `tune.report` instead of `train.report`. Signed-off-by: Matthew Deng <[email protected]>
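A minimal sketch of the substitution, assuming the dict-style `report` signature that Ray Tune mirrors from Ray Train (the exact call sites being fixed may differ):

```python
from ray import tune


def objective(config):
    score = config["x"] ** 2
    # Previously written as ray.train.report({...}); inside a Tune trainable
    # the metrics should go through ray.tune.report instead.
    tune.report({"score": score})


tuner = tune.Tuner(objective, param_space={"x": tune.grid_search([1, 2, 3])})
tuner.fit()
```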
…t#57620) ## Why are these changes needed? This will be used to help control the targets that are returned. Signed-off-by: akyang-anyscale <[email protected]>
## Description This PR adds a new check to make sure proxies are ready to serve traffic before finishing serve.run. For now, the check immediately finishes. --------- Signed-off-by: akyang-anyscale <[email protected]>
…roject#57793) When deploying Ray on YARN using Skein, it's useful to expose Ray's dashboard via Skein's web UI. This PR shows how to expose it and updates the related document. Signed-off-by: Zakelly <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…cgroup even if they are drivers (ray-project#57955) For more details about the resource isolation project see ray-project#54703. Driver processes that are registered in ray's internal namespace (such as ray dashboard's job and serve modules) are considered system processes. Therefore, they will not be moved into the workers cgroup when they register with the raylet. --------- Signed-off-by: irabbani <[email protected]>
…t stats (ray-project#58422) ## Why These Changes Are Needed This PR adds a new metric to track the time spent retrieving `RefBundle` objects during dataset iteration. This metric provides better visibility into the performance breakdown of batch iteration, specifically capturing the time spent in `get_next_ref_bundle()` calls within the `prefetch_batches_locally` function. ## Related Issue Number N/A ## Example
```
dataloader/train = {'producer_throughput': 8361.841782656593, 'iter_stats': {'prefetch_block-avg': inf, 'prefetch_block-min': inf, 'prefetch_block-max': 0, 'prefetch_block-total': 0, 'get_ref_bundles-avg': 0.05172277254545271, 'get_ref_bundles-min': 1.1991999997462699e-05, 'get_ref_bundles-max': 11.057470971999976, 'get_ref_bundles-total': 15.361663445999454, 'fetch_block-avg': 0.31572694455743233, 'fetch_block-min': 0.0006362799999806157, 'fetch_block-max': 2.1665870369999993, 'fetch_block-total': 93.45517558899996, 'block_to_batch-avg': 0.001048687573988573, 'block_to_batch-min': 2.10620000302697e-05, 'block_to_batch-max': 0.049948245999985375, 'block_to_batch-total': 2.048086831999683, 'format_batch-avg': 0.0001013781433686053, 'format_batch-min': 1.415700000961806e-05, 'format_batch-max': 0.009682661999988795, 'format_batch-total': 0.19799151399888615, 'collate-avg': 0.01303446213312943, 'collate-min': 0.00025646699998560507, 'collate-max': 0.9855495820000328, 'collate-total': 25.456304546001775, 'finalize-avg': 0.012211385266257683, 'finalize-min': 0.004209667999987232, 'finalize-max': 0.3785081949999949, 'finalize-total': 23.848835425001255, 'time_spent_blocked-avg': 0.04783407008137157, 'time_spent_blocked-min': 1.2316999971062614e-05, 'time_spent_blocked-max': 12.46102861700001, 'time_spent_blocked-total': 93.46777293900004, 'time_spent_training-avg': 0.015053571562211652, 'time_spent_training-min': 1.3704999958008557e-05, 'time_spent_training-max': 1.079616685000019, 'time_spent_training-total': 29.399625260999358}}
```
Testing Strategy: - [x] Unit tests --------- Signed-off-by: xgui <[email protected]> Signed-off-by: Xinyuan <[email protected]>
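As a rough illustration of what such a metric captures (hypothetical names; the real instrumentation lives inside Ray Data's iterator code):

```python
import time


def timed_get_ref_bundle(get_next_ref_bundle, stats):
    # Accumulate wall-clock time spent waiting on the next RefBundle,
    # mirroring the get_ref_bundles-* entries in the example stats above.
    start = time.perf_counter()
    bundle = get_next_ref_bundle()
    elapsed = time.perf_counter() - start
    stats["get_ref_bundles-total"] = stats.get("get_ref_bundles-total", 0.0) + elapsed
    stats["get_ref_bundles-max"] = max(stats.get("get_ref_bundles-max", 0.0), elapsed)
    return bundle
```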
## Description When token auth is enabled, the dashboard prompts the user to enter the valid auth token and caches it (as a browser cookie). When token-based auth is disabled, existing behaviour is retained. All dashboard-UI RPCs to the Ray cluster set the authorization header in their requests. ## Screenshots token popup <img width="3440" height="2146" alt="image" src="https://github.com/user-attachments/assets/004c23a3-991e-4a2c-a2ad-5a0ce2e60893" /> on entering an invalid token <img width="3440" height="2146" alt="image" src="https://github.com/user-attachments/assets/7183a798-ceb7-4657-8706-39ce5fe8e61e" /> --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…ants (ray-project#57910) 1. **Remove direct environment variable access patterns** - Replace all instances of `os.getenv("RAY_enable_open_telemetry") == "1"` - Standardize to use `ray_constants.RAY_ENABLE_OPEN_TELEMETRY` consistently throughout the codebase 2. **Unify default value format for RAY_enable_open_telemetry** - Standardize the default value to `"true"` | `"false"` - Previously, the codebase had mixed usage of `"1"` and `"true"`, which is now unified 3. **Backward compatibility maintained** - Carefully verified that the existing `RAY_ENABLE_OPEN_TELEMETRY` constant properly handles both `"1"` and `"true"` values - This change will not introduce any breaking behavior - The `env_bool` helper function already supports both formats:
```python
RAY_ENABLE_OPEN_TELEMETRY = env_bool("RAY_enable_open_telemetry", False)

def env_bool(key, default):
    if key in os.environ:
        return (
            True
            if os.environ[key].lower() == "true" or os.environ[key] == "1"
            else False
        )
    return default
```
--- Most of the current code uses: `RAY_enable_open_telemetry: "1"` A smaller portion (not zero) uses: `RAY_enable_open_telemetry: "true"` https://github.com/ray-project/ray/blob/fe7ad00f9720a722fde5fecba5bb681234bcdb63/python/ray/tests/test_metrics_agent.py#L497 My personal preference is "true": it's concise and unambiguous. If it's "1", I have to think/guess whether it means "true" or "false". --------- Signed-off-by: justwph <[email protected]>
…y-project#58217) Change the unit of `scheduler_placement_time` from seconds to milliseconds. The current bucket range of 0.1s to 2.5 hours doesn't make sense; according to a sample of data, the range we are interested in spans from microseconds to seconds. Thanks @ZacAttack for pointing this out.
```
Note: This is an internal (non-public-facing) metric, so we only need to update its usage within Ray (e.g., the dashboard). A simple code change should suffice.
```
<img width="1609" height="421" alt="505491038-c5d81017-b86c-406f-acf4-614560752062" src="https://github.com/user-attachments/assets/cc647b97-42ec-42eb-bf01-4d1867940207" /> Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
…s in the Raylet (ray-project#58342) Found it very hard to parse what was happening here, so helping future me (or you!). Also: - Deleted vestigial `next_resource_seq_no_`. - Converted from non-monotonic clock to a monotonically incremented `uint64_t` for the version number for commands. - Added logs when we drop messages with stale versions. --------- Signed-off-by: Edward Oakes <[email protected]>
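A small sketch of the stale-message handling idea (illustrative Python with made-up names; the actual change is in the Raylet's C++ code):

```python
class CommandStream:
    """Drop commands that arrive with a version older than what was already applied."""

    def __init__(self):
        self._version = 0       # monotonically incremented, not a wall-clock timestamp
        self._last_applied = 0

    def next_version(self) -> int:
        self._version += 1
        return self._version

    def apply(self, version: int, command) -> bool:
        if version <= self._last_applied:
            # Log and drop stale commands instead of applying them out of order.
            print(f"Dropping stale command: version {version} <= {self._last_applied}")
            return False
        self._last_applied = version
        command()
        return True
```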
## Description There was a typo ## Related issues N/A ## Additional information N/A Signed-off-by: Daniel Shin <[email protected]>
be consistent with the CI base env specified in `--build-name` Signed-off-by: Lonnie Liu <[email protected]>
getting ready to run things on python 3.10 Signed-off-by: Lonnie Liu <[email protected]>
…tion on a single node (ray-project#58456) ## Description Currently, finalization is scheduled in batches sequentially, i.e., a batch of N adjacent partitions is finalized at once (in a sliding window). This creates a lensing effect since: 1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators j and j+1 (since membership is determined as j = i % num_aggregators) 2. Adjacent aggregators have a high likelihood of getting scheduled on the same node (due to similarly being scheduled at about the same time in sequence) To address that, this change applies random sampling when choosing the next partitions to finalize, to make sure partitions are chosen uniformly, reducing concurrent finalization of adjacent partitions (see the sketch below). --------- Signed-off-by: Alexey Kudinkin <[email protected]>
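A minimal sketch of the idea, with hypothetical names (not the actual shuffle-finalization scheduler code):

```python
import random


def pick_finalization_batch(pending: list[int], batch_size: int) -> list[int]:
    # A sliding window would finalize adjacent partitions together, which lands
    # on adjacent aggregators (j = i % num_aggregators) and therefore often on
    # the same node. Uniform random sampling spreads concurrent finalizations.
    batch = random.sample(pending, min(batch_size, len(pending)))
    for partition in batch:
        pending.remove(partition)
    return batch
```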
## Description Making NotifyGCSRestart RPC fault tolerant and idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber, making idempotency harder to understand, and there was dead code for one of the resubscribes, so I did a minor cleanup. Added a Python integration test to verify retry behavior; left out the C++ test since on the raylet side there's nothing to test, as it's just making a gcs_client RPC call. --------- Signed-off-by: joshlee <[email protected]>
…ct#58445) ## Summary Creates a dedicated `tests/unit/` directory for unit tests that don't require Ray runtime or external dependencies. ## Changes - Created `tests/unit/` directory structure - Moved 13 pure unit tests to `tests/unit/` - Added `conftest.py` with fixtures to prevent `ray.init()` and `time.sleep()` - Added `README.md` documenting unit test requirements - Updated `BUILD.bazel` to run unit tests with "small" size tag ## Test Files Moved 1. test_arrow_type_conversion.py 2. test_block.py 3. test_block_boundaries.py 4. test_data_batch_conversion.py 5. test_datatype.py 6. test_deduping_schema.py 7. test_expression_evaluator.py 8. test_expressions.py 9. test_filename_provider.py 10. test_logical_plan.py 11. test_object_extension.py 12. test_path_util.py 13. test_ruleset.py These tests are fast (<1s each), isolated (no Ray runtime), and deterministic (no time.sleep or randomness). --------- Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
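A plausible shape for such fixtures, sketched under the assumption that the guards are implemented as autouse pytest fixtures (the actual `tests/unit/conftest.py` may differ):

```python
import pytest


@pytest.fixture(autouse=True)
def _forbid_ray_init(monkeypatch):
    # Unit tests must not start a Ray runtime.
    import ray

    def _fail(*args, **kwargs):
        raise RuntimeError("ray.init() is not allowed in tests/unit/")

    monkeypatch.setattr(ray, "init", _fail)


@pytest.fixture(autouse=True)
def _forbid_sleep(monkeypatch):
    # Keep unit tests fast and deterministic.
    import time

    def _fail(*args, **kwargs):
        raise RuntimeError("time.sleep() is not allowed in tests/unit/")

    monkeypatch.setattr(time, "sleep", _fail)
```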
## Description ### [Data] Concurrency Cap Backpressure tuning - Maintain an asymmetric EWMA of total queued bytes (this op + downstream) as the typical level: `level`. - Maintain an asymmetric EWMA of the absolute residual vs. the previous level as a scale proxy: `dev = EWMA(|q - level_prev|)`. - Define a deadband: `[lower, upper] = [level - K_DEV * dev, level + K_DEV * dev]`. If q > upper -> target cap = running - BACKOFF_FACTOR (back off). If q < lower -> target cap = running + RAMPUP_FACTOR (ramp up). Else -> target cap = running (hold). - Clamp to [1, configured_cap]; admit iff running < target cap. A sketch of this policy follows below. --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Srinath Krishnamachari <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
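A runnable sketch of the deadband logic described above, with illustrative constants and state handling (not Ray Data's actual backpressure-policy code):

```python
def admit_new_task(q_bytes, running, configured_cap, state,
                   alpha_up=0.3, alpha_down=0.05, k_dev=2.0,
                   backoff_factor=1, rampup_factor=1):
    """Return True if another task may start, per the deadband policy above."""
    # Asymmetric EWMA of queued bytes: react faster to increases than decreases.
    prev_level = state.setdefault("level", float(q_bytes))
    alpha = alpha_up if q_bytes > prev_level else alpha_down
    state["level"] = prev_level + alpha * (q_bytes - prev_level)

    # Asymmetric EWMA of the absolute residual vs. the previous level (scale proxy).
    residual = abs(q_bytes - prev_level)
    prev_dev = state.setdefault("dev", 0.0)
    alpha_d = alpha_up if residual > prev_dev else alpha_down
    state["dev"] = prev_dev + alpha_d * (residual - prev_dev)

    lower = state["level"] - k_dev * state["dev"]
    upper = state["level"] + k_dev * state["dev"]

    if q_bytes > upper:
        target_cap = running - backoff_factor   # back off
    elif q_bytes < lower:
        target_cap = running + rampup_factor    # ramp up
    else:
        target_cap = running                    # hold

    target_cap = max(1, min(configured_cap, target_cap))
    return running < target_cap
```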
Signed-off-by: Nikhil Ghosh <[email protected]>
… in read-only mode (ray-project#58460) This ensures node type names are correctly reported even when the autoscaler is disabled (read-only mode). ## Description Autoscaler v2 fails to report prometheus metrics when operating in read-only mode on KubeRay, with the following KeyError:
```
2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```
This happens because the `ReadOnlyProviderConfigReader` populates `ctx.get_node_type_configs()` using node IDs as node types, which is correct for local Ray (where `RAY_NODE_TYPE_NAME` is not set) but incorrect for KubeRay, where `ray_node_type_name` is present and `RAY_NODE_TYPE_NAME` is expected to be set. As a result, in read-only mode the scheduler sees a node type name (e.g., small-group) that never exists in the populated configs. This PR fixes the issue by using `ray_node_type_name` when it exists, and only falling back to the node ID when it does not. ## Related issues Fixes ray-project#58227 Signed-off-by: Rueian <[email protected]>
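The fallback boils down to something like the following (a hypothetical helper, not the actual `ReadOnlyProviderConfigReader` code):

```python
def node_type_key(node_id: str, ray_node_type_name: str | None) -> str:
    # Prefer the node type name reported via RAY_NODE_TYPE_NAME (set on KubeRay,
    # e.g. "small-group"); fall back to the node ID for local Ray, where it is unset.
    return ray_node_type_name if ray_node_type_name else node_id
```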
…cess: bool (ray-project#58384) ## Description Pass `status_code` directly into `do_reply`. This is a follow-up to ray-project#58255. --------- Signed-off-by: iamjustinhsu <[email protected]>
The pull request #673 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5283.
Summary of Changes
Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request is an automated daily merge from the `master` branch into `main`.
Highlights
Ignored Files
Code Review
This pull request is an automated daily merge that incorporates a massive set of refactoring and improvement changes from the master branch. Key changes include a significant overhaul of the Bazel build system for better modularity and hermeticity, a major refactoring of the Buildkite CI pipelines to improve organization and test coverage, and the introduction of a new dependency management tool raydepsets. The C++ code has been modernized, and numerous configurations for linting, code ownership, and Docker builds have been updated and improved. Overall, these changes represent a substantial step forward in the project's build system, CI infrastructure, and code quality practices. The changes appear to be well-executed and beneficial.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
Created: 2025-11-10
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.