@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

πŸ“… Created: 2025-11-13
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

elliot-barn and others added 30 commits October 23, 2025 03:56
combining all depset checks into a single job

TODO: add raydepset feature to build all depsets for the depset graph

---------

Signed-off-by: elliot-barn <[email protected]>
- The default deployment name was changed to `_TaskConsumerWrapper` after the
async inference implementation; this fixes it.

Signed-off-by: harshit <[email protected]>
…#58033)

## Description

This change properly handles pushing renaming projections into
read ops that support projections (like Parquet reads).

## Related issues

## Additional information

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Description

This PR adds support for reading Unity Catalog Delta tables in Ray Data
with automatic credential vending. This enables secure, temporary access
to Delta Lake tables stored in Databricks Unity Catalog without
requiring users to manage cloud credentials manually.

### What's Added

- **`ray.data.read_unity_catalog()`** - Updated public API for reading
Unity Catalog Delta tables
- **`UnityCatalogConnector`** - Handles Unity Catalog REST API
integration and credential vending
- **Multi-cloud support** - Works with AWS S3, Azure Data Lake Storage,
and Google Cloud Storage
- **Automatic credential management** - Obtains temporary,
least-privilege credentials via Unity Catalog API
- **Delta Lake integration** - Properly configures PyArrow filesystem
for Delta tables with session tokens

### Key Features

βœ… **Production-ready credential vending API** - Uses stable, public
Unity Catalog APIs
βœ… **Secure by default** - Temporary credentials with automatic cleanup  
βœ… **Multi-cloud** - AWS (S3), Azure (Blob Storage), and GCP (Cloud
Storage)
βœ… **Delta Lake optimized** - Handles session tokens and PyArrow
filesystem configuration
βœ… **Comprehensive error handling** - Helpful messages for common issues
(deletion vectors, permissions, etc.)
βœ… **Full logging support** - Debug and info logging throughout

### Usage Example

```python
import ray

# Read a Unity Catalog Delta table
ds = ray.data.read_unity_catalog(
    table="main.sales.transactions",
    url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
    token="dapi...",
    region="us-west-2"  # Optional, for AWS
)

# Use standard Ray Data operations
ds = ds.filter(lambda row: row["amount"] > 100)
ds.show(5)
```

### Implementation Notes

This is a **simplified, focused implementation** that:
- Supports **Unity Catalog tables only** (no volumes - that's in private
preview)
- Assumes **Delta Lake format** (most common Unity Catalog use case)
- Uses **production-ready APIs** only (no private preview features)
- Provides ~600 lines of clean, reviewable code

The full implementation with volumes and multi-format support is
available in the `data_uc_volumes` branch and can be added in a future
PR once this foundation is reviewed.

### Testing

- βœ… All ruff lint checks pass
- βœ… Code formatted per Ray standards
- βœ… Tested with real Unity Catalog Delta tables on AWS S3
- βœ… Proper PyArrow filesystem configuration verified
- βœ… Credential vending flow validated

## Related issues

Related to Unity Catalog and Delta Lake support requests in Ray Data.

## Additional information

### Architecture

The implementation follows the **connector pattern** rather than a
`Datasource` subclass because Unity Catalog is a metadata/credential
layer, not a data format. The connector:

1. Fetches table metadata from Unity Catalog REST API
2. Obtains temporary credentials via credential vending API
3. Configures cloud-specific environment variables
4. Delegates to `ray.data.read_delta()` with proper filesystem
configuration
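
The four steps above can be sketched as a small orchestration function (all helper callables here are injustrated stand-ins injected for illustration, not the actual implementation):

```python
def read_unity_catalog_sketch(table, fetch_metadata, vend_credentials,
                              configure_cloud, read_delta):
    # 1. Fetch table metadata from the Unity Catalog REST API.
    metadata = fetch_metadata(table)
    # 2. Obtain temporary, least-privilege credentials.
    creds = vend_credentials(metadata)
    # 3. Configure cloud-specific settings (e.g. a filesystem object).
    filesystem = configure_cloud(creds)
    # 4. Delegate to the Delta reader with the resolved storage location.
    return read_delta(metadata["storage_location"], filesystem)
```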

### Delta Lake Special Handling

Delta Lake on AWS requires explicit PyArrow S3FileSystem configuration
with session tokens (environment variables alone are insufficient). This
implementation correctly creates and passes the filesystem object to the
`deltalake` library.

### Cloud Provider Support

| Provider | Credential Type | Implementation |
|----------|----------------|----------------|
| AWS S3 | Temporary IAM credentials | PyArrow S3FileSystem with session token |
| Azure Blob | SAS tokens | Environment variables (`AZURE_STORAGE_SAS_TOKEN`) |
| GCP Cloud Storage | OAuth tokens / Service account | Environment variables (`GCP_OAUTH_TOKEN`, `GOOGLE_APPLICATION_CREDENTIALS`) |

### Error Handling

Comprehensive error messages for common issues:
- **Deletion Vectors**: Guidance on upgrading deltalake library or
disabling the feature
- **Column Mapping**: Compatibility information and solutions
- **Permissions**: Clear list of required Unity Catalog permissions
- **Credential issues**: Detailed troubleshooting steps

### Future Enhancements

Potential follow-up PRs:
- Unity Catalog volumes support (when out of private preview)
- Multi-format support (Parquet, CSV, JSON, images, etc.)
- Custom datasource integration
- Advanced Delta Lake features (time travel, partition filters)

### Dependencies

- Requires `deltalake` package for Delta Lake support
- Uses standard Ray Data APIs (`read_delta`, `read_datasource`)
- Integrates with existing PyArrow filesystem infrastructure

### Documentation

- Full docstrings with examples
- Type hints throughout
- Inline comments with references to external documentation
- Comprehensive error messages with actionable guidance

---------

Signed-off-by: soffer-anyscale <[email protected]>
…ease test (ray-project#58048)

## Summary

This PR removes the `image_classification_chaos_no_scale_back` release
test and its associated setup script
(`setup_cluster_compute_config_updater.py`). This test has become
non-functional and is no longer providing useful signal.

## Background

The `image_classification_chaos_no_scale_back` release test was designed
to validate Ray Data's fault tolerance when many nodes abruptly get
preempted at the same time.

The test worked by:
1. Running on an autoscaling cluster with 1-10 nodes
2. Updating the compute config mid-test to downscale to 5 nodes
3. Asserting that there are dead nodes as a sanity check

## Why This Test Is Broken

After the removal of Parquet metadata fetching in ray-project#56105 (September 2,
2025), the autoscaling behavior changed significantly:

- **Before metadata removal**: The cluster would autoscale more
aggressively because metadata fetching created additional tasks that
triggered faster scale-up. The cluster would scale past 5 nodes, then
downscale, leaving dead nodes that the test could detect.

- **After metadata removal**: Without the metadata fetching tasks, the
cluster doesn't scale up fast enough to get past 5 nodes before the
downscale happens. This means there are no dead nodes to detect, causing
the test to fail.

## Why We're Removing It

1. **Test is fundamentally broken**: The test's assumptions about
autoscaling behavior are no longer valid after the metadata fetching
removal
2. **Not actively monitored**: The test is marked as unstable and isn't
closely watched

## Changes

- Removed `image_classification_chaos_no_scale_back` test from
`release/release_data_tests.yaml`
- Deleted
`release/nightly_tests/setup_cluster_compute_config_updater.py` (only
used by this test)

## Related

See ray-project#56105

Fixes ray-project#56528

Signed-off-by: Balaji Veeramani <[email protected]>
These numbers are outdated, and the ones we report are not very useful.
We will refresh them soon.

Signed-off-by: Edward Oakes <[email protected]>
…54857)

Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Rueian <[email protected]>
## Description


https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.to_numpy
<img width="772" height="270" alt="Screenshot 2025-10-18 at 3 14 36β€―PM"
src="https://github.com/user-attachments/assets/d9cbf986-4271-41e6-9c4c-96201d32d1c6"
/>


`zero_copy_only` actually defaults to `True`, so we should explicitly pass
`False` for pyarrow versions < 13.0.0:

https://github.com/ray-project/ray/blob/1e38c9408caa92c675f0aa3e8bb60409c2d9159f/python/ray/data/_internal/arrow_block.py#L540-L546

## Related issues
Closes ray-project#57819

## Additional information

---------

Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>
)

Updating the default value calculation in the docstrings for the public
API.

Signed-off-by: irabbani <[email protected]>
…ay-project#58057)

## Description
This PR allows users to override individual flags set by
`RAY_SERVE_THROUGHPUT_OPTIMIZED`. This improves the current UX, where users
had to set every flag individually if any one of them differed from what
`RAY_SERVE_THROUGHPUT_OPTIMIZED` sets.

## Related issues

## Additional information

---------

Signed-off-by: akyang-anyscale <[email protected]>
…ct#57964)

## Description

This PR fixes the import statement in `MCAPDatasource` to follow Python
best practices for explicit imports.

**Change**: Updated from `import mcap` + `mcap.reader.make_reader(f)` to
`from mcap.reader import make_reader` for cleaner, more explicit
importing.

**Why this is needed**: 
- Follows Python best practices for explicit imports (PEP 8)
- Improves code readability by making it clear what's being used
- Better compatibility with different mcap package versions
- Reduces namespace pollution

## Related issues

None - this is a code quality improvement.

## Additional information

### What changed

**Before:**
```python
import mcap

reader = mcap.reader.make_reader(f)
```

**After:**
```python
from mcap.reader import make_reader

reader = make_reader(f)
```

### Impact

- **Minimal risk**: This is a refactoring of the import statement only
- **No API changes**: The `MCAPDatasource` public API remains unchanged
- **No behavioral changes**: The functionality is identical
- **Tested**: All ruff lint checks pass

### File modified

- `python/ray/data/_internal/datasource/mcap_datasource.py` (2 lines
changed)

This change makes the code more maintainable and follows modern Python
conventions for explicit imports.

Signed-off-by: soffer-anyscale <[email protected]>
…ject#58002)

## Description
Currently, the aggregators don't consider total available memory usage
in scheduling, and will default to 1GiB if the `estimated_dataset_bytes`
is None. This can be troublesome for smaller machines. See
ray-project#57979 for details. This PR
uses total available memory to calculate memory per aggregator.
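
A rough sketch of the described calculation (the function name and budget fraction are illustrative assumptions, not the PR's actual constants):

```python
def memory_per_aggregator(total_available_bytes: int,
                          num_aggregators: int,
                          budget_fraction: float = 0.5) -> int:
    # Split a budgeted share of the node's available memory evenly
    # across aggregators, instead of defaulting to a fixed 1 GiB when
    # estimated_dataset_bytes is None.
    budget = int(total_available_bytes * budget_fraction)
    return max(1, budget // num_aggregators)
```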

## Related issues
Fixes ray-project#57979

## Additional information

---------

Signed-off-by: iamjustinhsu <[email protected]>
…ray-project#58040)

## Description

This change addresses the issue that, upon column renaming, we currently
don't remove the original columns.

## Related issues

## Additional information

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: tianyi-ge <[email protected]>
Signed-off-by: Tianyi <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…oject#56855)

Currently the `RequestRouterConfig` is initialized from the controller.
When `RequestRouterConfig` is initialized, we try to import the class
and serialize it. This is unsafe because the controller does not have
the application runtime env. See the following stack trace for evidence
of how `RequestRouterConfig` is initialized in the controller:

```
	Unexpected error occurred while applying config for application 'app1': 
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/_private/application_state.py", line 691, in _reconcile_build_app_task
    overrided_infos = override_deployment_info(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/_private/application_state.py", line 1333, in override_deployment_info
    override_options["deployment_config"] = DeploymentConfig(**original_options)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/main.py", line 339, in __init__
    values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/main.py", line 1074, in validate_model
    v_, errors_ = field.validate(value, values, loc=field.alias, cls=cls_)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/fields.py", line 881, in validate
    v, errors = self._validate_singleton(v, values, loc, cls)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/fields.py", line 1098, in _validate_singleton
    return self._apply_validators(v, values, loc, cls, self.validators)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/fields.py", line 1154, in _apply_validators
    v = validator(cls, v, values, self, self.model_config)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/class_validators.py", line 337, in <lambda>
    return lambda cls, v, values, field, config: validator(v)
                                                 ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/main.py", line 711, in validate
    return cls(**value)
           ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/config.py", line 135, in __init__
    self._serialize_request_router_cls()
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/config.py", line 151, in _serialize_request_router_cls
    request_router_class = import_attr(request_router_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_common/utils.py", line 44, in import_attr
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1324, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'custom_request_router'
```


This PR imports the custom class from `build_serve_application` so that
the operation is performed in the context of the runtime env.

The same idea applies to the deployment level autoscaling function and
application-level autoscaling function.

Another issue happens because `cloudpickle` sees the
`UniformRequestRouter` class as **a top-level symbol in an importable
module** (`custom_request_router`).

When `cloudpickle.dumps()` runs, it recognizes that the object can be
**re-imported** via
`import custom_request_router;
custom_request_router.UniformRequestRouter`,
so instead of embedding the actual code, it just stores that import
reference.

That’s why the pickle bytes only contain
`b'...custom_request_router...UniformRequestRouter...'` β€” no source or
bytecode.

But if we want `cloudpickle` to embed the class definition instead of just
referencing it, we call:

```python
import importlib, cloudpickle
mod = importlib.import_module("custom_request_router")
cloudpickle.register_pickle_by_value(mod)
payload = cloudpickle.dumps(mod.UniformRequestRouter)
```

This tells `cloudpickle` to serialize everything from that module **by
value (with code included)** rather than **by reference**.

We cannot rely on the reference because the custom request router class can
be used in the Serve proxy running outside of the runtime env; embedding the
entire bytecode makes `cloudpickle.loads` possible there.


Changes
```
 request_router_config: RequestRouterConfig = Field(
        default=RequestRouterConfig(),
        update_type=DeploymentOptionUpdateType.NeedsActorReconfigure,
    )
```

to 
```
request_router_config: RequestRouterConfig = Field(
        default_factory=RequestRouterConfig,
        update_type=DeploymentOptionUpdateType.NeedsActorReconfigure,
    )
```

because `RequestRouterConfig` was getting initialized at import time and
causing a problem in the docs build.
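
The distinction matters because `default=` evaluates its argument once, at class-definition (import) time, while `default_factory=` defers construction to each instantiation. A stdlib sketch of the same pattern:

```python
from dataclasses import dataclass, field

@dataclass
class Holder:
    # field(default_factory=list) builds a fresh list per instance;
    # a plain default value would be constructed once at import time
    # and shared, which is exactly the docs-build problem above.
    values: list = field(default_factory=list)

a, b = Holder(), Holder()
a.values.append(1)  # b.values is unaffected
```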

also changed
```
created_at: float = field(default_factory=lambda: time.time())
```

to

```
self._get_curr_time_s = (
            get_curr_time_s if get_curr_time_s is not None else lambda: time.time()
        )
```

because a reference to `time.time` is not serializable by cloudpickle

---------

Signed-off-by: abrar <[email protected]>
Upgrading all CI tests to run Python 3.10 (minus 3 jobs: ml: train
minimal, serve: tests, and api_policy_check).
previous build link:
https://buildkite.com/ray-project/microcheck/builds/28045#_


Will convert the failing jobs in the future

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Nikhil G <[email protected]>
improving error messages for raydepsets 

Updated unit tests to verify error messages

Signed-off-by: elliot-barn <[email protected]>
Add Azure CLI and dependencies into `base-extra` images

---------

Signed-off-by: kevin <[email protected]>
…8088)

Those tests have been failing and jailed for quite some time.

related to:
- ray-project#46687 
- ray-project#49847
- ray-project#49846

Signed-off-by: Lonnie Liu <[email protected]>
removing format script and all references

---------

Signed-off-by: elliot-barn <[email protected]>
… submission/block generation metrics (ray-project#57246)


## Why are these changes needed?
On executor shutdown, the metrics persist even after execution. The plan
is to reset on streaming_executor.shutdown. This PR also includes 2
potential drive-by fixes for metric calculation

## Related issue number



---------

Signed-off-by: iamjustinhsu <[email protected]>
Add some example output of the command to help end users verify
the execution result.


Signed-off-by: fscnick <[email protected]>
## Description

This PR adds a `preserve_row` option to `map_batches`. When
`preserve_row` is true, the limit operator can be pushed down through
this `map_batches` call for optimization.

Note: `map_group` is built on `map_batches`, but limit pushdown
support for `map_group` is out of scope for this PR, so
`preserve_row_count` is set to false for it.


## Related issues

## Additional information

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>

~~Before:~~

~~https://github.com/user-attachments/assets/9db00f37-0c37-4e99-874a-a14481878e4a~~
~~In before, the progress bar won't update until the first tasks
finishes.~~

~~After:~~

~~https://github.com/user-attachments/assets/99877a3f-7b52-4293-aae5-7702edfaabec~~

~~In After, the progress bar won't update until the first task generates
output. If a task generates 10 blocks, we will update the progress bar
while it's generating blocks, even if the task hasn't finished. Once the
task finishes, we default back to the way it was before.~~

~~This is better because the very 1st progress bar update will occur
sooner, and won't feel abrupt to the user.~~

Refactoring the progress bar estimates to use known metrics.

## Why are these changes needed?
Currently we use the number of finished tasks. This is OK, but since we use
streaming generators, 1 task can produce thousands of blocks. This is
troublesome with the additional split factor (split blocks) in Parquet reads.

## Related issue number



---------

Signed-off-by: iamjustinhsu <[email protected]>
…58046)

This PR sets up the helper classes and utils to enable token-based
authentication for Ray Core RPC calls.

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
dayshah and others added 20 commits November 11, 2025 20:44
The python test step is failing on master now because of this. Probably
a logical merge conflict.
```
FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
...

[2025-11-11T22:11:54Z]     from ray.tests.authentication_test_utils import (
--
Β  | [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
```

Signed-off-by: dayshah <[email protected]>
be consistent with the default build environment

Signed-off-by: Lonnie Liu <[email protected]>
## Description
- Renamed the `RAY_auth_mode` environment variable to `RAY_AUTH_MODE` across
the codebase
- Excluded healthcheck endpoints from authentication for Kubernetes
compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale
tokens when switching clusters

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…ls (ray-project#58424)

## Description
- Use a client interceptor for adding auth tokens in gRPC calls when
`AUTH_MODE=token`
- `BuildChannel()` will automatically include the interceptor
- Removed the `auth_token` parameter from `ClientCallImpl`
- Removed manual auth from `python_gcs_subscriber.cc`
- Added tests to verify auth works for autoscaler APIs

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…`) (ray-project#57090)

When actors terminate gracefully, Ray calls the actor's
`__ray_shutdown__()` method if defined, allowing for cleanup of
resources. However, it is not invoked when an actor goes out of scope due
to `del actor`.

### Why `del actor` doesn't invoke `__ray_shutdown__`

I traced through the entire code path; here's what happens:

Flow when `del actor` is called:

1. **Python side**: `ActorHandle.__del__()` ->
`worker.core_worker.remove_actor_handle_reference(actor_id)`

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040

2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
`reference_counter_->RemoveLocalReference()`
- When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
callback

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506

3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
`AsyncReportActorOutOfScope()` to GCS

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51

4. **GCS receives notification**: `HandleReportActorOutOfScope()` 
- **THE PROBLEM IS HERE** ([line 279 in
`src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
   ```cpp
   DestroyActor(actor_id,
                GenActorOutOfScopeCause(actor),
                /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
                [reply, send_reply_callback]() {
   ```

5. **Actor worker receives kill signal**: `HandleKillActor()` in
[`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
   ```cpp
   if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
       ForceExit(...)  // Skips __ray_shutdown__
   } else {
       Exit(...)  // Would call __ray_shutdown__
   }
   ```

6. **ForceExit path**: Bypasses graceful shutdown -> No
`__ray_shutdown__` callback invoked.

This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
actors. Also, updated the docs.
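
The dispatch in step 5 can be modeled in a few lines (a Python sketch of the C++ logic, with `shutdown_hook` standing in for `__ray_shutdown__`):

```python
def handle_kill_actor(force_kill: bool, shutdown_hook=None) -> str:
    # Before the fix, OUT_OF_SCOPE kills arrived with force_kill=True,
    # taking the ForceExit branch and skipping the hook entirely.
    if force_kill:
        return "ForceExit"
    if shutdown_hook is not None:
        shutdown_hook()  # graceful path runs the cleanup hook
    return "Exit"
```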

---------

Signed-off-by: Sagar Sumit <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
Currently, a node is considered idle while pulling objects from the
remote object store. This can lead to situations where a node is
terminated as idle, causing the cluster to enter an infinite loop when
pulling large objects that exceed the node idle termination timeout.

This PR fixes the issue by treating object pulling as a busy activity.
Note that nodes can still accept additional tasks while pulling objects
(since pulling consumes no resources), but the auto-scaler will no
longer terminate the node prematurely.

Closes ray-project#54372

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
…_FACTOR` to 2 (ray-project#58262)


## Description

This was setting the value to be aligned with the previous default of 4.

However, after some consideration I've realized that 4 is too high, so this
actually lowers it to 2.

## Related issues

## Additional information

Signed-off-by: Alexey Kudinkin <[email protected]>
…y-project#58523)

## Description

This PR improves documentation consistency in the `python/ray/data`
module by converting all remaining rST-style docstrings (`:param:`,
`:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
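
For illustration, the conversion pattern looks like this (the function and parameters are hypothetical, not taken from the PR):

```python
def hash_column_rst(column, seed=0):
    """Hash a block column (rST style, before).

    :param column: Values to hash.
    :param seed: Seed value.
    :return: The hashed values.
    """
    return [hash((v, seed)) for v in column]

def hash_column_google(column, seed=0):
    """Hash a block column (Google style, after).

    Args:
        column: Values to hash.
        seed: Seed value.

    Returns:
        The hashed values.
    """
    return [hash((v, seed)) for v in column]
```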

## Additional information

**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted
`StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted
`unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

Signed-off-by: Balaji Veeramani <[email protected]>
…oject#58549)

## Description

The original `test_concurrency` function combined multiple test
scenarios into a single test with complex control flow and expensive Ray
cluster initialization. This refactoring extracts the parameter
validation tests into focused, independent tests that are faster,
clearer, and easier to maintain.

Additionally, the original test included "validation" cases that tested
valid concurrency parameters but didn't actually verify that concurrency
was being limited correctlyβ€”they only checked that the output was
correct, which isn't useful for validating the concurrency feature
itself.

**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and
`test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to
`ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element since we're only
validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior
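
A sketch of the parametrized shape described above (the validator is a hypothetical stand-in for Ray Data's actual parameter check):

```python
import pytest

def validate_concurrency(concurrency):
    # Hypothetical stand-in for the real validation logic.
    if isinstance(concurrency, int):
        if concurrency <= 0:
            raise ValueError("concurrency must be positive")
    elif isinstance(concurrency, tuple) and len(concurrency) == 2:
        lo, hi = concurrency
        if lo > hi:
            raise ValueError("min concurrency must not exceed max")
    else:
        raise ValueError("unsupported concurrency spec")

@pytest.mark.parametrize("concurrency", [0, -1, (2, 1)])
def test_invalid_func_concurrency_raises(concurrency):
    with pytest.raises(ValueError):
        validate_concurrency(concurrency)
```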

## Related issues

N/A

## Additional information

The validation tests now execute significantly faster and provide
clearer failure messages. Each test has a single, well-defined purpose
making maintenance and debugging easier.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Previously it was actually using 0.4.0, which is set up by the grpc
repo; the declaration in the workspace file was being shadowed.
Signed-off-by: Lonnie Liu <[email protected]>
## Description
Creates a ranker interface that ranks the best operator to run next
in `select_operator_to_run`. This change only refactors the existing
code. The ranking value must be something that is comparable.
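
As a rough sketch of the interface shape being described (class and function names are hypothetical, not Ray's actual internals): the ranker returns a comparable value per operator, and selection simply takes the max.

```python
# Hypothetical sketch of a ranker interface; not Ray's actual classes.
from abc import ABC, abstractmethod


class OperatorRanker(ABC):
    @abstractmethod
    def rank(self, op):
        """Return a comparable ranking value for this operator."""


class SmallestQueueRanker(OperatorRanker):
    # Toy heuristic: prefer the operator with the fewest queued bytes.
    def rank(self, op):
        return -op["queued_bytes"]


def select_operator_to_run(ops, ranker):
    # Higher rank wins; the ranking value only needs to be comparable.
    return max(ops, key=ranker.rank)
```

Swapping ranking policies then means swapping the ranker object, without touching the selection loop.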

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <[email protected]>
…#57783)

1. JaxTrainer relies on the runtime env var "JAX_PLATFORMS" being set
to initialize jax.distributed:
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
2. Before this change, users had to both configure `use_tpu=True`
in `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able
to start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
3. If a user uses other jax.distributed libraries like Orbax, this can
sometimes lead to a misleading error about distributed initialization.
4. After this change, if the user sets `use_tpu=True`, we automatically add
this to the env var.
5. A TPU unit test is not available at this time; we will explore how to
cover it later.
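
A minimal sketch of the env-var handling described in point 4 (the helper name is hypothetical; only the `JAX_PLATFORMS` semantics come from the description above):

```python
# Hypothetical helper; only the JAX_PLATFORMS semantics come from the PR text.
def ensure_tpu_platform(env, use_tpu):
    if not use_tpu:
        return env
    # JAX_PLATFORMS may already be a comma-separated list, e.g. "tpu,cpu".
    platforms = [p for p in env.get("JAX_PLATFORMS", "").split(",") if p]
    if "tpu" not in platforms:
        platforms.insert(0, "tpu")
    env["JAX_PLATFORMS"] = ",".join(platforms)
    return env
```

The key property is idempotence: if the user already passed `JAX_PLATFORMS=tpu,cpu`, the value is left unchanged.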


---------

Signed-off-by: Lehui Liu <[email protected]>
and ask people to use that lock file for building docs.

Signed-off-by: Lonnie Liu <[email protected]>
…regression (ray-project#58390)

## Description
This PR addresses the performance regression introduced in the [PR to make
ray.get thread safe](ray-project#57911).
Specifically, the previous PR required the worker to block and wait for
AsyncGet to return a reply with the request id needed to correctly
clean up get requests. This additional synchronous step caused the
plasma store Get to regress in performance.

This PR moves the request id generation step to the plasma store,
removing the blocking step to fix the perf regression.

## Related issues
- [PR which introduced perf
regression](ray-project#57911)
- [PR which observed the
regression](ray-project#58175)

## Additional information
New performance of the change measured by `ray microbenchmark`.
<img width="485" height="17" alt="image"
src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
/>

Original performance prior to the change. Here we focus on the
regressing `single client get calls (Plasma Store)` metric, where the
new performance returns to the original 10k-per-second range,
compared to the sub-5k per second seen before this fix.
<img width="811" height="355" alt="image"
src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
/>
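
A toy model of the change, purely for illustration (none of these class or method names exist in Ray): the store assigns the request id itself and hands it back with the reply, so the client no longer blocks on a separate id-allocation round trip.

```python
# Illustrative model only; Ray's plasma store is C++ and far more involved.
import itertools


class PlasmaStoreModel:
    def __init__(self):
        self._next_id = itertools.count(1)
        self._inflight = {}

    def async_get(self, object_refs):
        # The store generates the request id and returns it with the reply;
        # the client no longer performs a blocking call just to obtain it.
        request_id = next(self._next_id)
        self._inflight[request_id] = object_refs
        return request_id

    def complete(self, request_id):
        # Cleanup still keys off the (now store-assigned) request id.
        return self._inflight.pop(request_id)
```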

---------

Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
## Description
Support token auth in the Ray client server by using the existing grpc
interceptors. This PR refactors the code to:
- add/rename sync and async client and server interceptors
- create grpc utils to house grpc channel and server creation logic;
the Python codebase is updated to use these methods
- separate tests for sync and async interceptors
- make the existing authentication integration tests run in RAY_CLIENT
mode
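
A minimal model of the token-auth interceptor pattern, with plain functions standing in for the real grpc interceptor API (all names here are illustrative):

```python
# Toy interceptor chain; not the actual grpc.ServerInterceptor API.
class AuthError(Exception):
    pass


def token_auth_interceptor(expected_token):
    # Returns an interceptor that wraps a handler and checks call metadata.
    def intercept(handler):
        def wrapped(request, metadata):
            if metadata.get("authorization") != f"Bearer {expected_token}":
                raise AuthError("missing or invalid token")
            return handler(request, metadata)
        return wrapped
    return intercept


def echo_handler(request, metadata):
    return request


# Compose the interceptor with a handler, as a grpc server would at setup.
secured = token_auth_interceptor("s3cret")(echo_handler)
```

The same wrapping shape applies on the client side, where an interceptor injects the token into outgoing metadata instead of checking it.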

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…oject#58371)

## Description
Currently Ray Data has a preprocessor called `RobustScaler`, which scales
the data based on given quantiles. Calculating the quantiles involves
sorting the entire dataset once per column (C sorts for C columns),
which is very expensive for a large dataset.

** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`,
as I couldn't find a well-maintained tdigest library for
Python; ddsketch is better maintained.

** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile`
aggregator
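
For illustration, the aggregator shape can be sketched like this; the toy version below keeps exact values rather than a ddsketch, but follows the same per-block accumulate/merge/finalize pattern that avoids one global sort per column (names are hypothetical, not Ray's `ApproximateQuantile`):

```python
# Toy mergeable quantile aggregator; a real sketch (e.g. ddsketch) would
# keep bounded-size state instead of all values.
import math


class QuantileAggregator:
    def __init__(self):
        self.values = []

    def add_block(self, column_values):
        # Per-block accumulation; blocks can be processed in parallel.
        self.values.extend(column_values)
        return self

    def merge(self, other):
        # Combine partial aggregates from different blocks.
        self.values.extend(other.values)
        return self

    def finalize(self, q):
        # Nearest-rank quantile over the combined values.
        ordered = sorted(self.values)
        idx = min(len(ordered) - 1, math.ceil(q * len(ordered)) - 1)
        return ordered[max(idx, 0)]
```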

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: kyuds <[email protected]>
Signed-off-by: Daniel Shin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>
Generating depsets for base extra Python requirements.
Installing requirements in the base extra image.

---------

Signed-off-by: elliot-barn <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #676 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5359.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest changes from the master branch into main, primarily focusing on enhancing and streamlining the project's continuous integration and deployment infrastructure. It introduces a more modular and robust system for managing build pipelines, Python dependencies, and Docker image creation. The changes aim to improve build reliability, expand platform support, and enforce stricter code quality standards, while also updating internal tooling and documentation practices.

Highlights

  • CI/CD Infrastructure Overhaul: Buildkite pipelines have been extensively refactored into new, modular YAML files, centralizing Docker image builds and documentation processes for better organization and maintainability.
  • Enhanced Dependency Management: The project now leverages uv for Python dependency management and introduces a new raydepsets tool, improving build determinism and control over package versions.
  • Expanded Platform and Python Support: CI matrices have been updated to include Python 3.10 and 3.13, alongside new CUDA versions (e.g., 12.8.1), and macOS builds are now standardized on arm64.
  • Refined Docker Image Strategy: New Dockerfiles and Wanda configurations enable more granular control over Ray component image builds, including extra variants for specialized environments.
  • C++ API and Runtime Modernization: Updates to the C++ API and runtime code reflect a move towards more modern C++ features and improved internal consistency, including changes to RemoteFunctionHolder and network utility functions.
  • Comprehensive Linting and Style Checks: New pre-commit hooks and linting tools like semgrep, vale, and eslint have been integrated to enforce stricter code quality and documentation style guidelines.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a massive and impressive refactoring of the build system, CI pipelines, and dependency management. The move to modular Bazel files, rules_pkg, uv, and miniforge are all great steps forward for maintainability and build performance. The CI configuration is now much more organized. The codebase also sees some nice modernizations and cleanups. I've found a couple of minor issues to improve upon.

}

"$@"
build "$@"


medium

Calling build "$@" at the end of the script is a bit confusing. Since this script is called without arguments from the CI configuration, "$@" will be empty. However, if someone runs this script locally with arguments, they will be passed to the build function, which doesn't expect them. It would be clearer and more robust to just call build without arguments.

Suggested change
build "$@"
build

Comment on lines +149 to +151
f"Lock files are not up to date for config: {self.config_name}. Please update lock files and push the changes.\n"
+ "".join(diffs)
)


medium

The difflib.unified_diff with lineterm="" yields lines without trailing newlines. Joining them with "" will produce a single long line of diff output, which is hard to read. It would be better to join them with "\n" to make the diff readable in the error message.

Suggested change
f"Lock files are not up to date for config: {self.config_name}. Please update lock files and push the changes.\n"
+ "".join(diffs)
)
+ "\n".join(diffs)
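
A quick standalone illustration of why the join separator matters with `lineterm=""`:

```python
# With lineterm="", unified_diff yields lines without trailing newlines,
# so joining with "" smashes the whole diff onto one line.
import difflib

old = ["a", "b"]
new = ["a", "c"]
diffs = list(difflib.unified_diff(old, new, lineterm=""))

flat = "".join(diffs)        # unreadable single line
readable = "\n".join(diffs)  # one diff line per text line
```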

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 28, 2025