
Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

πŸ“… Created: 2025-11-14
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

iamjustinhsu and others added 30 commits October 24, 2025 12:29
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

~~Before:~~

~~https://github.com/user-attachments/assets/9db00f37-0c37-4e99-874a-a14481878e4a~~
~~Before, the progress bar won't update until the first task finishes.~~

~~After:~~

~~https://github.com/user-attachments/assets/99877a3f-7b52-4293-aae5-7702edfaabec~~

~~In After, the progress bar won't update until the first task generates
output. If a task generates 10 blocks, we will update the progress bar
while it's generating blocks, even if the task hasn't finished. Once the
task finishes, we default back to the way it was before.~~

~~This is better because the very 1st progress bar update will occur
sooner, and won't feel abrupt to the user.~~

Refactoring the progress bar estimates using known metrics.

## Why are these changes needed?
Currently we use the number of finished tasks. This is OK, but since we use
a streaming generator, 1 task can produce thousands of blocks. This is
troublesome with the additional split factor (split blocks) in `read_parquet`.
<!-- Please give a short summary of the change and the problem this
solves. -->
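The idea can be sketched roughly as follows. This is a minimal illustration with hypothetical names (the actual implementation lives in Ray Data's progress-bar code): give partial credit for blocks already streamed out of running tasks instead of counting only finished tasks.

```python
# Hypothetical sketch: estimate progress from streamed blocks rather than
# finished tasks, so the bar moves while a long task is still generating
# output. All names and the weighting scheme are assumptions.
def estimate_progress(finished_tasks, total_tasks,
                      blocks_of_running_tasks, avg_blocks_per_task):
    """Return a fraction in [0, 1] combining finished tasks with partial
    credit for blocks already produced by still-running tasks."""
    if total_tasks == 0:
        return 0.0
    if avg_blocks_per_task:
        # Partial credit, capped at one task's worth of blocks.
        partial = min(blocks_of_running_tasks / avg_blocks_per_task, 1.0)
    else:
        partial = 0.0
    return min((finished_tasks + partial) / total_tasks, 1.0)
```

With this shape, the very first update arrives as soon as a running task emits its first block, rather than only when the task completes.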

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <[email protected]>
…58046)

This PR sets up the helper classes and utilities to enable token-based
authentication for Ray Core RPC calls.

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
I suspect that when we deploy the app config, we don't wait long enough
before sending traffic, so requests could go to the wrong version.

---------

Signed-off-by: abrar <[email protected]>
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.
 
Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.
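The state handling described above can be sketched as a small enum plus an export filter. This is an illustrative sketch, not the actual `TrainControllerState` implementation; only the state names come from the description above.

```python
from enum import Enum

# Sketch of the controller states named in this PR description; hidden
# states appear in StateManager logs and Grafana but not the state export.
class TrainControllerState(Enum):
    RUNNING = "RUNNING"
    RESCHEDULING = "RESCHEDULING"    # hidden
    SHUTTING_DOWN = "SHUTTING_DOWN"  # hidden: worker group done, validations pending
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"
    ABORTED = "ABORTED"

HIDDEN_STATES = {TrainControllerState.RESCHEDULING,
                 TrainControllerState.SHUTTING_DOWN}

def exported_state(state):
    """Only surface non-hidden states in the state export."""
    return None if state in HIDDEN_STATES else state.value
```

This mirrors the rule that terminal states only show up in the export once `fit()` has returned and results are accessible.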

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ject#57930)

Add actor+job+node event to ray event export doc

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Dhyey Shah <[email protected]>
Signed-off-by: Qiaolin-Yu <[email protected]>
Signed-off-by: Qiaolin Yu <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
disabled the wrong test with a different name from the issue

mistakenly associated issue:
ray-project#46687

Signed-off-by: Lonnie Liu <[email protected]>
upgrading batch inference tests to py3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65258

all except for image_embedding_from_jsonl are running on python 3.10

---------

Signed-off-by: elliot-barn <[email protected]>
…roject#57896)

## Description
Add missing imports to autoscaling policy example

## Related issues
Link related issues:
ray-project#57876 (comment)

---------

Signed-off-by: daiping8 <[email protected]>
Signed-off-by: Ping Dai <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description
As title states.

Example:

```
from ray.data.expressions import col, lit
expr = (col("x") + lit(5)) * col("y")
print(expr)
  MUL
  β”œβ”€β”€ left: ADD
  β”‚   β”œβ”€β”€ left: COL('x')
  β”‚   └── right: LIT(5)
  └── right: COL('y')
```

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <[email protected]>
fixes min setup build

Signed-off-by: Lonnie Liu <[email protected]>
…oject#58030)

## Description

Currently, we implicitly assume that `RefBundle` holds exactly 1 block.
That's not a safe assumption, and this change addresses it by explicitly
referring to the number of blocks instead.
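For illustration, here is a simplified stand-in (not the real `RefBundle`, which lives in Ray Data's internal execution interfaces) showing the explicit block count:

```python
from dataclasses import dataclass, field
from typing import List

# Simplified stand-in for Ray Data's RefBundle; block refs are modeled as
# strings here purely for illustration.
@dataclass
class RefBundle:
    blocks: List[str] = field(default_factory=list)

def total_blocks(bundles):
    # Sum per-bundle block counts explicitly instead of assuming
    # one block per bundle (i.e., instead of len(bundles)).
    return sum(len(b.blocks) for b in bundles)
```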

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
…ring. (ray-project#58212)

Referenced `ray._private.worker.global_worker`, which users don't know
or care about.

Also cleaned up the wording for `get_node_id` and moved the Ray client
note there.

---------

Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the two-phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Adding an `--all-configs` option to raydepsets to build all configs.

Added CLI and workspace unit tests.

---------

Signed-off-by: elliot-barn <[email protected]>
…ay-project#57788)

<!-- Thank you for contributing to Ray! πŸš€ -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- πŸ’‘ Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

This change primarily converts `OpResourceAllocator` APIs to make data
flow explicit by exposing required params in the APIs.

Additionally:

1. Abstracting common methods into the `OpResourceAllocator` base class.
2. Adding allocations to the progress bar in verbose mode, logging budgets
and allocations.
3. Adding the byte size of all enqueued blocks to the progress bar.

## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix πŸ›
- [ ] New feature ✨
- [ ] Enhancement πŸš€
- [ ] Code refactoring πŸ”§
- [ ] Documentation update πŸ“–
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [ ] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [ ] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [ ] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
running serve tests on py3.10
failing tests already are set to manual frequency:
https://buildkite.com/ray-project/release/builds/62142#_

---------

Signed-off-by: elliot-barn <[email protected]>
## Description
In the original TaskEvent proto, `worker_id` is marked as optional:
https://github.com/ray-project/ray/blob/830a456b9b558028853423c9042f7e2763ec5283/src/ray/protobuf/gcs.proto#L201

but in the Ray event proto it is not:
https://github.com/ray-project/ray/blob/f635de7c86d0d0f813a305a9fd5e864a64257894/src/ray/protobuf/public/events_task_lifecycle_event.proto#L42

In the converter we always set the `worker_id` field even if it's an empty
string:
https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_ray_event_converter.cc#L145
If an optional field is set, even if it is empty (the default proto value),
it is considered as having a value, and during `MergeFrom()` calls the value
overwrites the destination object's existing value.

Source: https://protobuf.dev/programming-guides/field_presence/
> Explicitly set fields – including default values – are merged-from.


This PR fixes this gap in the conversion logic.
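The presence semantics can be demonstrated with a small pure-Python simulation. This is not actual protobuf code; it only mimics the "explicitly set fields are merged-from" rule described above.

```python
# Pure-Python simulation of proto explicit field presence: an optional
# field that was explicitly set -- even to the default "" -- overwrites
# the destination on merge, while an unset field does not.
class Msg:
    def __init__(self):
        self.fields = {}       # field name -> value
        self._present = set()  # fields that were explicitly set

    def set(self, name, value):
        self.fields[name] = value
        self._present.add(name)

    def merge_from(self, other):
        # Only explicitly-present fields are copied, matching proto
        # MergeFrom() semantics for optional fields.
        for name in other._present:
            self.fields[name] = other.fields[name]
            self._present.add(name)

dst = Msg()
dst.set("worker_id", "abc123")
src = Msg()
src.set("worker_id", "")  # explicitly set to empty -> still "present"
dst.merge_from(src)
# dst.fields["worker_id"] is now "": the real value was clobbered,
# which is exactly the bug the converter fix avoids.
```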

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
…y-project#58022)

Deprecate `CheckpointConfig(checkpoint_at_end, checkpoint_frequency)`
and mark the `resume_from_checkpoint, metadata` Trainer constructor
arguments as deprecated in the docstrings.

Update the "inspecting results" user guide doc code to show how to catch
and inspect errors raised by `trainer.fit()`. The previous recommendation
to check `result.error` is unusable because we always raise the error,
which prevents the user from accessing the result object.

---------

Signed-off-by: Justin Yu <[email protected]>
…pported operator filter (ray-project#57970)

## Description
The per-node metrics on the OSS Ray Data dashboard are not displayed as
expected. Because of code change ray-project#55495, a filter on
`operator` was added to the following three metrics; that label is [not
supported](https://github.com/ray-project/ray/blob/e51f8039bc6992d37834bcff109a3d340e78fcde/python/ray/data/_internal/stats.py#L448)
by per-node metrics and causes empty results:
- ray_data_num_tasks_finished_per_node
- ray_data_bytes_outputs_of_finished_tasks_per_node
- ray_data_blocks_outputs_of_finished_tasks_per_node
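The shape of the fix can be sketched as a query builder that drops unsupported labels. This is a hedged illustration; the supported label set below is an assumption, not the actual list from `stats.py`.

```python
# Illustrative sketch: per-node metrics only support a subset of labels,
# so strip anything else (like "operator") before building the query.
PER_NODE_SUPPORTED_LABELS = {"dataset", "node_ip"}  # assumed label set

def build_query(metric, labels):
    """Build a Prometheus-style selector using only supported labels."""
    supported = {k: v for k, v in labels.items()
                 if k in PER_NODE_SUPPORTED_LABELS}
    selector = ", ".join(f'{k}="{v}"' for k, v in sorted(supported.items()))
    return f"{metric}{{{selector}}}"
```

Including an unsupported label in the selector matches no series at all, which is why the dashboard panels rendered empty.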

Signed-off-by: cong.qian <[email protected]>
…#58223)

upgrading tune scalability release tests to python 3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65669#_
Only the failing GCE tests were disabled.

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
- Add Azure VM launcher release test
- Change region for the Azure cluster to be in `centralus` since
`westus2` has trouble with availability.
- Add helper function to authenticate with Azure using service principal
in launch cluster script

---------

Signed-off-by: kevin <[email protected]>
## Description
1. Update docs.
2. Catch the exception and redirect users to the docs.

## Related issues
ray-project#56855

## Additional information
Hard to write tests for this situation.

Manually verified that this is the right exception to catch
```
try:
    cloudpickle.loads(b'\x80\x05\x95\xbc\x03\x00\x00\x00\x00\x00\x00\x8c\x1bray.cloudpickle.cloudpickle\x94\x8c\x0e_make_function\x94\x93\x94(h\x00\x8c\r_builtin_type\x94\x93\x94\x8c\x08CodeType\x94\x85\x94R\x94(K\x01K\x00K\x00K\x03K\x03KCCnt\x00\xa0\x01\xa1\x00}\x01|\x01j\x02}\x02t\x03t\x04\x83\x01\x01\x00d\x01|\x02\x04\x00\x03\x00k\x01r*d\x02k\x00r:n\x04\x01\x00n\x0cd\x03d\x04d\x05i\x01f\x02S\x00d\x06|\x02\x04\x00\x03\x00k\x01rNd\x07k\x00r^n\x04\x01\x00n\x0cd\x08d\x04d\ti\x01f\x02S\x00d\nd\x04d\x0bi\x01f\x02S\x00d\x00S\x00\x94(NK\tK\x11K\x02\x8c\x06reason\x94\x8c\x0eBusiness hours\x94K\x12K\x14K\x04\x8c\x18Evening batch processing\x94K\x01\x8c\x0eOff-peak hours\x94t\x94(\x8c\x08datetime\x94\x8c\x03now\x94\x8c\x04hour\x94\x8c\x05print\x94\x8c\x04avro\x94t\x94\x8c\x03ctx\x94\x8c\x0ccurrent_time\x94\x8c\x0ccurrent_hour\x94\x87\x94\x8c\x1b/home/ubuntu/apps/policy.py\x94\x8c!scheduled_batch_processing_policy\x94K\x0eC\x10\x00\x03\x08\x01\x06\x01\x08\x02\x18\x01\x0c\x02\x18\x01\x0c\x03\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94N\x8c\x08__name__\x94\x8c\x08__main__\x94\x8c\x08__file__\x94h\x18uNNNt\x94R\x94h\x00\x8c\x12_function_setstate\x94\x93\x94h#}\x94}\x94(h\x1fh\x19\x8c\x0c__qualname__\x94h\x19\x8c\x0f__annotations__\x94}\x94(h\x14\x8c\x10ray.serve.config\x94\x8c\x12AutoscalingContext\x94\x93\x94\x8c\x06return\x94h\x04\x8c\x0cGenericAlias\x94\x85\x94R\x94\x8c\x08builtins\x94\x8c\x05tuple\x94\x93\x94h2\x8c\x03int\x94\x93\x94\x8c\t_operator\x94\x8c\x07getitem\x94\x93\x94\x8c\x06typing\x94\x8c\x04Dict\x94\x93\x94h2\x8c\x03str\x94\x93\x94h:\x8c\x03Any\x94\x93\x94\x86\x94\x86\x94R\x94\x86\x94\x86\x94R\x94u\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h \x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94(h\x0eh\x0e\x8c\x08datetime\x94\x93\x94h\x12h\x00\x8c\tsubimport\x94\x93\x94h\x12\x85\x94R\x94uu\x86\x94\x86R0.')
except (ModuleNotFoundError, ImportError) as e:
    print(f"caused by {e} {type(e)}")
```

```
❯ python policy.py
caused by No module named 'avro' <class 'ModuleNotFoundError'>
```

---------

Signed-off-by: abrar <[email protected]>
…56481)

http://github.com/ray-project/ray/pull/50092 warned that we'd be
changing the default `file_extensions` for Parquet from `None` to
`[parquet]`. This was the motivation:
> People often have non-Parquet files in their datasets (e.g., _SUCCESS
or stale files). However, the default for file_extensions is None, so
read_parquet tries reading the non-Parquet files. To avoid this issue,
we'll change the default file extensions to something like ["parquet"].
This PR adds a warning for that change.

This PR follows up and actually changes the default.
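The new default can be illustrated with a small path-filtering sketch. This is a simplification of what `read_parquet` does internally, not the actual implementation; passing `file_extensions=None` keeps the old read-everything behavior.

```python
# Sketch of the new default: only paths ending in ".parquet" are read,
# so stray files like _SUCCESS or stale files are skipped.
def filter_paths(paths, file_extensions=("parquet",)):
    """Keep only paths matching the given extensions; None disables
    filtering (the pre-change behavior)."""
    if file_extensions is None:
        return list(paths)
    exts = tuple(f".{e}" for e in file_extensions)
    return [p for p in paths if p.endswith(exts)]
```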

---------

Signed-off-by: Balaji Veeramani <[email protected]>
israbbani and others added 22 commits November 13, 2025 02:48
nothing is using it anymore

Signed-off-by: Lonnie Liu <[email protected]>
…58580)

Adding an optional `include_setuptools` flag for depset configuration.

If the flag is set on a depset config, `--unsafe-package setuptools` will
not be included for depset compilation.

If the flag does not exist (default: false) on a depset config,
`--unsafe-package setuptools` will be appended to the default arguments.

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
otherwise, the newer docker client will refuse to communicate with the
docker daemon that is on an older version.

Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#58542)

## What does this PR do?
   
Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) require `open_input_stream` instead
of `open_input_file`.
   
   ## Changes
   
- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
   - Failed downloads now return `None` gracefully instead of crashing
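The fallback behavior can be sketched like this. The two method names mirror the pyarrow filesystem API mentioned above, but the helper itself is hypothetical, not the actual `plan_download_op.py` code.

```python
# Hedged sketch of the try-both-openers fallback: random-access open
# first, then streaming open; return None if both fail so one bad URI
# doesn't crash the whole batch.
def open_with_fallback(fs, uri):
    for opener in ("open_input_file", "open_input_stream"):
        try:
            return getattr(fs, opener)(uri)
        except Exception:
            continue  # e.g. "Cannot seek streaming HTTP file"
    return None
```

Returning `None` (rather than raising) is what lets failed downloads surface as `None` cells in the output table.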
   
   ## Testing
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded

# Test URLs: one valid, one 404
urls = [    
    "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]

# Create PyArrow table and call download function
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))

# Check results
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked βœ“")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f"  URL: {url[:60]}...")

print("\nβœ… Test passed: Failed downloads return None instead of crashing.")
```

Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
    test_download_expression_with_streaming_fallback()
  File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
    with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
    if not self.__exit__(*sys.exc_info()):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
    setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
  File "/home/ray/default/test.py", line 16, in <module>
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
    uri_bytes = list(
                ^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
  URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
   
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
   - βœ… Successfully downloads HTTP stream files
   - βœ… Gracefully handles failed downloads (returns None)
   - βœ… Maintains backward compatibility with existing file downloads

---------

Signed-off-by: xyuzh <[email protected]>
Signed-off-by: Robert Nishihara <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
## Description

Today we have very little observability into pubsub. On a raylet, one of
the most important pieces of state that needs to be propagated through the
cluster via pubsub is cluster membership. All raylets should, in an
eventual but timely fashion, agree on the list of available nodes. This
metric simply emits a counter to keep track of the node count.

More pubsub observability to come.
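As a toy sketch of what the metric tracks (names are illustrative, not the raylet's actual code):

```python
# Illustrative sketch: each raylet maintains its current view of cluster
# membership and exposes the node count for a metrics exporter to emit.
class NodeMembershipTracker:
    def __init__(self):
        self.alive_nodes = set()

    def on_node_added(self, node_id):
        self.alive_nodes.add(node_id)

    def on_node_removed(self, node_id):
        self.alive_nodes.discard(node_id)

    def record_metric(self):
        # Value a real exporter would emit; comparing it across raylets
        # reveals nodes whose membership view has diverged.
        return len(self.alive_nodes)
```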

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: zac <[email protected]>
Signed-off-by: Zac Policzer <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
all tests are passing

Signed-off-by: Lonnie Liu <[email protected]>
…#58587)

also stops building python 3.9 aarch64 images

Signed-off-by: Lonnie Liu <[email protected]>
so that importing test.py does not always import `github`

The `github` package imports `jwt`, which then imports `cryptography` and
can lead to issues on Windows.

Signed-off-by: Lonnie Liu <[email protected]>
this makes it possible to run on a different python version than the CI
wrapper code.

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
…ecurity (ray-project#58591)

Migrates Ray dashboard authentication from JavaScript-managed cookies to
server-side HttpOnly cookies to enhance security against XSS attacks.
This addresses code review feedback to improve the authentication
implementation (ray-project#58368)

main changes:
- authentication middleware first looks for `Authorization` header, if
not found it then looks at cookies to look for the auth token
- new `api/authenticate` endpoint for verifying token and setting the
auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
`secure=true` (when using https))
- removed javascript based cookie manipulation utils and axios
interceptors (were previously responsible for setting cookies)
- cookies are deleted when connecting to a cluster with
`AUTH_MODE=disabled`. connecting to a different ray cluster (with
different auth token) using the same endpoint (eg due to port-forwarding
or local testing) will reshow the popup and ask users to input the right
token.
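The lookup order can be sketched as follows. The cookie name and helper function are assumptions for illustration, not the actual dashboard implementation.

```python
# Sketch of the middleware's token lookup: Authorization header first,
# then the HttpOnly cookie set by the (assumed) /api/authenticate endpoint.
AUTH_COOKIE_NAME = "ray-auth-token"  # assumed cookie name

def extract_token(headers, cookies):
    """Return the auth token from the header if present, else the cookie,
    else None (unauthenticated)."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return cookies.get(AUTH_COOKIE_NAME)
```

Because the cookie is `HttpOnly`, page JavaScript never touches the token, which is the XSS-hardening point of this change.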

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
add support for `ray get-auth-token` cli command + test

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
ray-project#57590)

As discovered in the [PR to better define the interface for reference
counter](ray-project#57177 (review)),
plasma store provider and memory store both share thin dependencies on
reference counter that can be refactored out. This will reduce
entanglement in our code base and improve maintainability.

The main logic changes are located in:
* src/ray/core_worker/store_provider/plasma_store_provider.cc, where
reference-counter-related logic is refactored into the core worker
* src/ray/core_worker/core_worker.cc, where the factored-out reference
counter logic is resolved
* src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
logic related to the reference counter has either been removed because it
is tech debt or refactored into caller functions.

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks
Microbenchmark:
```
single client get calls (Plasma Store) per second 10592.56 +- 535.86
single client put calls (Plasma Store) per second 4908.72 +- 41.55
multi client put calls (Plasma Store) per second 14260.79 +- 265.48
single client put gigabytes per second 11.92 +- 10.21
single client tasks and get batch per second 8.33 +- 0.19
multi client put gigabytes per second 32.09 +- 1.63
single client get object containing 10k refs per second 13.38 +- 0.13
single client wait 1k refs per second 5.04 +- 0.05
single client tasks sync per second 960.45 +- 15.76
single client tasks async per second 7955.16 +- 195.97
multi client tasks async per second 17724.1 +- 856.8
1:1 actor calls sync per second 2251.22 +- 63.93
1:1 actor calls async per second 9342.91 +- 614.74
1:1 actor calls concurrent per second 6427.29 +- 50.3
1:n actor calls async per second 8221.63 +- 167.83
n:n actor calls async per second 22876.04 +- 436.98
n:n actor calls with arg async per second 3531.21 +- 39.38
1:1 async-actor calls sync per second 1581.31 +- 34.01
1:1 async-actor calls async per second 5651.2 +- 222.21
1:1 async-actor calls with args async per second 3618.34 +- 76.02
1:n async-actor calls async per second 7379.2 +- 144.83
n:n async-actor calls async per second 19768.79 +- 211.95
```
This PR mainly makes logic changes to the `ray.get` call chain. As we
can see from the benchmark above, single-client get call performance
matches pre-regression levels.

---------

Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
…ay-project#58471)

2. **Extracted generic `RankManager` class** - Created reusable rank
management logic separated from deployment-specific concerns

3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
replacing raw integers

4. **Simplified error handling** - not supporting self healing

5. **Updated tests** - Refactored unit tests to use new API and removed
flag-dependent test cases

**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support


Next PR ray-project#58473

---------

Signed-off-by: abrar <[email protected]>
Currently, Ray metrics and events are exported through a centralized
process called the Dashboard Agent. This process functions as a gRPC
server, receiving data from all other components (GCS, Raylet, workers,
etc.). However, during a node shutdown, the Dashboard Agent may
terminate before the other components, resulting in gRPC errors and
potential loss of metrics and events.

When this issue occurs, the OTel SDK logs become very noisy. Add a default
option to disable OTel SDK logs to avoid confusion.

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
which is a non-flaky version of `fetch_prometheus`. Update all test
usages accordingly.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… RD Datatype (ray-project#58225)

## Description
As title suggests

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Goutam <[email protected]>
…ay-project#58581)

allowing for py3.13 images (cpu & cu123) in release tests

Signed-off-by: elliot-barn <[email protected]>
## Description
Add avg prompt length metric

When using uniform prompt length (especially in testing), the P50 and
P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
Average prompt length provides another useful dimension to look at and
validate.

For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
9400, and avg accurately shows 5000.

<img width="1186" height="466" alt="image"
src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
/>
 

## Related issues

## Additional information

---------

Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Prometheus auto-appends the `_total` suffix to all Counter metrics. Ray
has historically supported counter metrics both with and without the
`_total` suffix for backward compatibility, but it is now time to drop
that support (2 years since the warning was added).

There is one place in the Ray Serve dashboard that still doesn't use the
`_total` suffix, so fix it in this PR.
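The convention can be sketched as a tiny normalization helper (illustrative only, not the dashboard's actual code):

```python
# Sketch: normalize counter names to the Prometheus convention by
# ensuring the `_total` suffix when building dashboard queries.
def counter_query_name(name):
    """Append `_total` to a counter name unless it already has it."""
    return name if name.endswith("_total") else name + "_total"
```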

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
will delegate authentication and authorization of Ray access to
Kubernetes TokenReview and SubjectAccessReview APIs.

---------

Signed-off-by: Andrew Sy Kim <[email protected]>
unifying to python 3.10

Signed-off-by: Lonnie Liu <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #677 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5363.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a routine daily merge from the master branch into main, focusing on comprehensive updates to the project's build and continuous integration systems. The changes streamline Docker image creation, enhance dependency resolution, and refine code quality checks, ensuring a more robust and maintainable development workflow.

Highlights

  • CI/CD Infrastructure Updates: The CI/CD pipeline has undergone significant refactoring, including a migration from miniconda to miniforge for Python environments, and a consolidation of Docker image build definitions into a new _images.rayci.yml file. Documentation-related CI steps are now centralized in doc.rayci.yml.
  • Bazel Build Configuration Enhancements: Bazel build configurations have been updated to enable strict_action_env by default, introduce new Linux workspace status commands, and add UTF-8 support for Windows CXX options. Many RPC and core component build rules have been refactored into sub-BUILD files for better organization.
  • Python Dependency Management with RayDepsets: A new raydepsets tool has been introduced for more robust Python dependency management, allowing for precise control over dependency compilation, subsetting, and expansion across various Ray components and environments.
  • Docker Image Tagging and Registry Updates: Docker image tagging logic has been enhanced to include rayci_build_id in tags and support new *-extra image types. Azure Container Registry (ACR) integration has been added for Anyscale Docker images, alongside existing AWS ECR and GCP registries.
  • Code Ownership and Linting Refinements: Code ownership rules in .github/CODEOWNERS have been consolidated and expanded. Pre-commit hooks have been updated to include semgrep, vale, cython-lint, and eslint for improved code quality and consistency.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main. It incorporates a vast number of changes, primarily focused on a large-scale refactoring and improvement of the build and CI systems. Key themes include:

  • CI/CD Refactoring: The Buildkite pipelines have been significantly modularized. Steps for building images, running documentation checks, and managing dependencies have been moved into separate, dedicated pipeline files (_images.rayci.yml, doc.rayci.yml, dependencies.rayci.yml). This improves clarity and maintainability.
  • Bazel Overhaul: The build system has undergone a major cleanup. The root BUILD.bazel file has been simplified, with many targets moved to more appropriate sub-packages. The use of pkg_zip and pkg_files replaces older genrule scripts for packaging, which is a move towards more hermetic and declarative builds. The workspace name has been updated to io_ray, following Bazel best practices.
  • Dependency Management: There's a clear shift towards more robust dependency management. This includes the introduction of a new raydepsets tool, the adoption of uv for Python dependency resolution, and a switch from miniconda to miniforge.
  • Linting and Static Analysis: The .pre-commit-config.yaml has been greatly expanded with more tools like semgrep, vale, and eslint, enhancing code quality and consistency across the repository.
  • Code Modernization: Several C++ files have been updated to use modern language features (e.g., std::invoke_result_t instead of the deprecated std::result_of_t) and to align with internal API refactorings.

Overall, these changes represent a significant step forward in the project's engineering practices. The refactoring makes the build system more robust, maintainable, and easier to understand. I have reviewed the changes and found no issues of medium or higher severity. The modifications are consistent and well-aligned with the goal of improving the development infrastructure.
