daily merge: master → main 2025-11-14 #677
Conversation
~~Before: https://github.com/user-attachments/assets/9db00f37-0c37-4e99-874a-a14481878e4a~~ ~~In Before, the progress bar won't update until the first task finishes.~~ ~~After: https://github.com/user-attachments/assets/99877a3f-7b52-4293-aae5-7702edfaabec~~ ~~In After, the progress bar updates as soon as the first task generates output. If a task generates 10 blocks, we update the progress bar while it's generating blocks, even if the task hasn't finished. Once the task finishes, we fall back to the previous behavior.~~ ~~This is better because the very first progress bar update occurs sooner and doesn't feel abrupt to the user.~~ Refactoring the progress bar estimates using known metrics. ## Why are these changes needed? Currently we use the number of finished tasks. This is OK, but since we use a streaming generator, 1 task = thousands of blocks. This is troublesome for the additional split factor (split blocks) in `read_parquet`. ## Checks - [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <[email protected]>
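The block-based estimate described above (count blocks emitted by the streaming generator, not just finished tasks) can be sketched as follows; the helper and its names are illustrative, not Ray Data's actual code:

```python
def progress_fraction(blocks_from_finished_tasks: int,
                      blocks_from_running_tasks: int,
                      total_expected_blocks: int) -> float:
    """Estimate progress from blocks produced so far, counting blocks
    already streamed out of tasks that have not finished yet."""
    produced = blocks_from_finished_tasks + blocks_from_running_tasks
    return min(1.0, produced / total_expected_blocks)

# One finished task produced 10 blocks; a still-running task has already
# streamed 3 more. A task-count estimate would not move until the second
# task finishes, but the block-based estimate moves as output appears.
print(progress_fraction(10, 3, 20))  # 0.65
```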
…58046) This PR sets up the helper classes and utils to enable token-based authentication for Ray core RPC calls. --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
I suspect that when we deploy the app config, we don't wait long enough before sending traffic, so requests could go to the wrong version. --------- Signed-off-by: abrar <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>
Signed-off-by: ahao-anyscale <[email protected]>
…58092) Signed-off-by: Jiajun Yao <[email protected]>
…ay-project#57882) # Summary The crux of the issue is that in the past, train run status was synonymous with final worker group status, but now, when there are pending validations, the worker group is finished but the train run is not. This leads to confusing situations in which the Train Run is `FINISHED`, but because there are pending validations, the `controller` actor is alive and results are inaccessible. This PR: * Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after the worker group finishes but before the controller shuts everything down. * Makes `ValidationManager` logging slightly cleaner. Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in `StateManager` logs and Grafana but not in the state export. We only want to show terminal states in the state export after `fit()` has returned and results are accessible. More concretely: * Finished/errored: The worker group finishes (Train Run is `RUNNING` but internal state is `SHUTTING_DOWN`), validation finishes (both Train Run and internal state say `FINISHED` or `ERRORED`), then results are accessible. * Aborted: Ideally, the worker group should be aborted and in-flight validation tasks canceled before the Train Run is `ABORTED`. However, this PR doesn't change the current behavior, in which the Train Run might be `ABORTED` before reference counting cleans up the validation tasks. I will cancel validation tasks before marking the train run `ABORTED` in a future PR. I considered polling both the worker group and validations in `_step` itself, but decided to leave `_step` as a function that only cares about the worker group. # Testing Unit tests --------- Signed-off-by: Timothy Seah <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
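The hidden-state idea can be sketched minimally; the enum values mirror the PR's description, but the export rule below is an illustrative assumption, not Ray Train's actual implementation:

```python
from enum import Enum

class TrainControllerState(Enum):
    RUNNING = "RUNNING"
    RESCHEDULING = "RESCHEDULING"      # hidden state
    SHUTTING_DOWN = "SHUTTING_DOWN"    # hidden: worker group done, validations pending
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"
    ABORTED = "ABORTED"

# Hidden states show up in internal logs and metrics, but not in the
# state export, which should only report terminal states once results
# are accessible.
HIDDEN_STATES = {TrainControllerState.RESCHEDULING, TrainControllerState.SHUTTING_DOWN}

def exported_state(state: TrainControllerState) -> TrainControllerState:
    # While hidden (e.g. waiting on pending validations), keep reporting
    # RUNNING to external observers.
    return TrainControllerState.RUNNING if state in HIDDEN_STATES else state
```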
…ject#57930) Add actor+job+node event to ray event export doc Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Dhyey Shah <[email protected]> Signed-off-by: Qiaolin-Yu <[email protected]> Signed-off-by: Qiaolin Yu <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Co-authored-by: Stephanie Wang <[email protected]>
disabled the wrong test with a different name from the issue mistakenly associated issue: ray-project#46687 Signed-off-by: Lonnie Liu <[email protected]>
upgrading batch inference tests to py3.10 Successful release test run: https://buildkite.com/ray-project/release/builds/65258 all except for image_embedding_from_jsonl are running on python 3.10 --------- Signed-off-by: elliot-barn <[email protected]>
…ay-project#57636) Signed-off-by: Mengjin Yan <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
…roject#57896) ## Description Add missing imports to autoscaling policy example ## Related issues Link related issues: ray-project#57876 (comment) --------- Signed-off-by: daiping8 <[email protected]> Signed-off-by: Ping Dai <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ect#57878) ## Description Now we have the custom autoscaling policy name in logs. 1. Deployment level Test code from https://docs.ray.io/en/master/serve/advanced-guides/advanced-autoscaling.html#custom-policy-for-deployment <img width="1333" height="294" alt="image" src="https://github.com/user-attachments/assets/bd26e576-fd3c-489b-94c2-4c11b25bb400" /> 2. Application level Test code from https://docs.ray.io/en/master/serve/advanced-guides/advanced-autoscaling.html#application-level-autoscaling <img width="1321" height="431" alt="image" src="https://github.com/user-attachments/assets/d51f4952-faf5-47eb-80d0-6be357437505" /> ## Related issues Closes ray-project#57846 ## Additional information Signed-off-by: daiping8 <[email protected]>
## Description
As title states.
Example:
```
from ray.data.expressions import col, lit
expr = (col("x") + lit(5)) * col("y")
print(expr)
MUL
├── left: ADD
│   ├── left: COL('x')
│   └── right: LIT(5)
└── right: COL('y')
```
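A rendering like the one above can be produced with a small recursive helper. This is an illustrative sketch over plain tuples, not Ray Data's actual `Expr` implementation:

```python
def render(node):
    """Render a tree as box-drawing text. `node` is (name, children),
    where children is a list of (label, node) pairs."""
    name, children = node
    lines = [name]
    for i, (label, child) in enumerate(children):
        last = i == len(children) - 1
        branch = "└── " if last else "├── "
        cont = "    " if last else "│   "
        sub = render(child).splitlines()
        lines.append(branch + label + ": " + sub[0])
        # Indent the child's own subtree under its branch marker.
        lines.extend(cont + s for s in sub[1:])
    return "\n".join(lines)

# (col("x") + lit(5)) * col("y") as a plain tuple tree:
expr = ("MUL", [
    ("left", ("ADD", [("left", ("COL('x')", [])),
                      ("right", ("LIT(5)", []))])),
    ("right", ("COL('y')", [])),
])
print(render(expr))
```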
---------
Signed-off-by: Goutam <[email protected]>
fixes min setup build Signed-off-by: Lonnie Liu <[email protected]>
…oject#58030) ## Description Currently, we implicitly assume that `RefBundle` holds exactly 1 block. That's not a safe assumption; this change addresses it by explicitly referring to the number of blocks instead. --------- Signed-off-by: Alexey Kudinkin <[email protected]>
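The fix can be illustrated with a toy stand-in for `RefBundle` (the class and names here are illustrative, not Ray Data's actual code): count blocks explicitly instead of assuming one block per bundle.

```python
from dataclasses import dataclass, field

@dataclass
class Bundle:
    """Toy stand-in for Ray Data's RefBundle: may hold N blocks, not always 1."""
    blocks: list = field(default_factory=list)

    @property
    def num_blocks(self) -> int:
        return len(self.blocks)

def total_blocks(bundles) -> int:
    # Wrong: len(bundles) -- implicitly assumes 1 block per bundle.
    # Right: sum the per-bundle block counts.
    return sum(b.num_blocks for b in bundles)

bundles = [Bundle(["b0", "b1"]), Bundle(["b2"])]
print(len(bundles), total_blocks(bundles))  # 2 bundles, 3 blocks
```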
…ring. (ray-project#58212) Referenced `ray._private.worker.global_worker`, which users don't know or care about. Also cleaned up the wording for `get_node_id` and moved the Ray client note there. --------- Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…ay-project#56723) This PR adds to the utility library for TPU slice placement group scheduling. We generalize the two-phase approach that the JaxTrainer uses to reserve and schedule the workers on the TPU slice. ray-project#55162 --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Aaron Liang <[email protected]> Co-authored-by: Ryan O'Leary <[email protected]> Co-authored-by: Ryan O'Leary <[email protected]>
adding --all-configs option to raydepsets to build all configs. Added cli and workspace unit tests --------- Signed-off-by: elliot-barn <[email protected]>
…ay-project#57788) ## Description This change primarily converts `OpResourceAllocator` APIs to make data flow explicit by exposing required params in the APIs. Additionally: 1. Abstracting common methods inside the `OpResourceAllocator` base class. 2. Adding allocation to the progress bar in verbose mode, logging budgets & allocations. 3. Adding the byte size of all enqueued blocks to the progress bar. ## Types of change - [ ] Bug fix - [ ] New feature - [ ] Enhancement - [ ] Code refactoring - [ ] Documentation update - [ ] Chore - [ ] Style ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes - [ ] No **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) --------- Signed-off-by: Alexey Kudinkin <[email protected]>
running serve tests on py3.10; failing tests are already set to manual frequency: https://buildkite.com/ray-project/release/builds/62142#_ --------- Signed-off-by: elliot-barn <[email protected]>
## Description In the original TaskEvent proto, worker_id is marked as optional: https://github.com/ray-project/ray/blob/830a456b9b558028853423c9042f7e2763ec5283/src/ray/protobuf/gcs.proto#L201. But in Ray event it is not: https://github.com/ray-project/ray/blob/f635de7c86d0d0f813a305a9fd5e864a64257894/src/ray/protobuf/public/events_task_lifecycle_event.proto#L42. In the converter we always set the worker_id field, even if it's an empty string: https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_ray_event_converter.cc#L145. If an optional field is set, even if it is empty (the default proto value), it is considered as having a value, and during MergeFrom() calls that value is considered and overwrites the destination object's existing value. Source: https://protobuf.dev/programming-guides/field_presence/ ("Explicitly set fields, including default values, are merged-from.") This PR fixes this gap in the conversion logic. --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
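The presence pitfall can be demonstrated without protobuf using a dict-based sketch (the field names and converter functions are illustrative): an explicitly set empty string survives the merge and clobbers the destination's value, so the fix is to only set the optional field when it actually carries a value.

```python
def merge_from(dst: dict, src: dict) -> dict:
    """Mimics protobuf MergeFrom for explicit-presence fields: every field
    present in src, even one holding the default value, overwrites dst."""
    dst.update(src)
    return dst

def convert_buggy(task_event: dict) -> dict:
    # Always sets worker_id, even when the source had no value.
    return {"worker_id": task_event.get("worker_id", "")}

def convert_fixed(task_event: dict) -> dict:
    # Only set the optional field when it actually carries a value.
    out = {}
    if task_event.get("worker_id"):
        out["worker_id"] = task_event["worker_id"]
    return out

existing = {"worker_id": "abc123"}
assert merge_from(dict(existing), convert_buggy({}))["worker_id"] == ""        # clobbered
assert merge_from(dict(existing), convert_fixed({}))["worker_id"] == "abc123"  # preserved
```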
…y-project#58022) Deprecate `CheckpointConfig(checkpoint_at_end, checkpoint_frequency)` and mark the `resume_from_checkpoint, metadata` Trainer constructor arguments as deprecated in the docstrings. Update the "inspecting results" user guide doc code to show how to catch and inspect errors raised by `trainer.fit()`. The previous recommendation to check `result.error` is unusable because we always raise the error, which prevents the user from accessing the result object. --------- Signed-off-by: Justin Yu <[email protected]>
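The recommended pattern looks roughly like the following. `FakeTrainer` and `TrainingFailedError` are self-contained stand-ins for the real Ray Train classes, so this is a sketch of the shape of the pattern, not Ray Train's actual API:

```python
class TrainingFailedError(RuntimeError):
    """Stand-in for the error type raised when training fails."""

class FakeTrainer:
    """Stand-in trainer whose fit() raises instead of returning a result
    with .error set."""
    def fit(self):
        raise TrainingFailedError("worker group failed")

trainer = FakeTrainer()
try:
    result = trainer.fit()
    failure = None
except TrainingFailedError as exc:
    # Since fit() raises, catching the exception is the only way to
    # inspect the failure; there is no result object to check.
    result, failure = None, str(exc)

print(failure)  # worker group failed
```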
…pported operator filter (ray-project#57970) ## Description The per-node metrics on the OSS Ray Data dashboard are not displayed as expected. Because of the code change in ray-project#55495, the following three metrics had an `operator` filter added, which is [not supported](https://github.com/ray-project/ray/blob/e51f8039bc6992d37834bcff109a3d340e78fcde/python/ray/data/_internal/stats.py#L448) by per-node metrics and causes empty results: ray_data_num_tasks_finished_per_node, ray_data_bytes_outputs_of_finished_tasks_per_node, ray_data_blocks_outputs_of_finished_tasks_per_node. Signed-off-by: cong.qian <[email protected]>
…#58223) upgrading tune scalability release tests to python 3.10 Successful release test run: https://buildkite.com/ray-project/release/builds/65669#_ only disabled GCE tests failing --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
- Add Azure VM launcher release test - Change region for the Azure cluster to be in `centralus` since `westus2` has trouble with availability. - Add helper function to authenticate with Azure using service principal in launch cluster script --------- Signed-off-by: kevin <[email protected]>
## Description 1. Update docs 2. catch exception and redirect users to docs ## Related issues ray-project#56855 ## Additional information Hard to write tests for this situation. Manually verified that this is the right exception to catch ``` try: cloudpickle.loads(b'\x80\x05\x95\xbc\x03\x00\x00\x00\x00\x00\x00\x8c\x1bray.cloudpickle.cloudpickle\x94\x8c\x0e_make_function\x94\x93\x94(h\x00\x8c\r_builtin_type\x94\x93\x94\x8c\x08CodeType\x94\x85\x94R\x94(K\x01K\x00K\x00K\x03K\x03KCCnt\x00\xa0\x01\xa1\x00}\x01|\x01j\x02}\x02t\x03t\x04\x83\x01\x01\x00d\x01|\x02\x04\x00\x03\x00k\x01r*d\x02k\x00r:n\x04\x01\x00n\x0cd\x03d\x04d\x05i\x01f\x02S\x00d\x06|\x02\x04\x00\x03\x00k\x01rNd\x07k\x00r^n\x04\x01\x00n\x0cd\x08d\x04d\ti\x01f\x02S\x00d\nd\x04d\x0bi\x01f\x02S\x00d\x00S\x00\x94(NK\tK\x11K\x02\x8c\x06reason\x94\x8c\x0eBusiness hours\x94K\x12K\x14K\x04\x8c\x18Evening batch processing\x94K\x01\x8c\x0eOff-peak hours\x94t\x94(\x8c\x08datetime\x94\x8c\x03now\x94\x8c\x04hour\x94\x8c\x05print\x94\x8c\x04avro\x94t\x94\x8c\x03ctx\x94\x8c\x0ccurrent_time\x94\x8c\x0ccurrent_hour\x94\x87\x94\x8c\x1b/home/ubuntu/apps/policy.py\x94\x8c!scheduled_batch_processing_policy\x94K\x0eC\x10\x00\x03\x08\x01\x06\x01\x08\x02\x18\x01\x0c\x02\x18\x01\x0c\x03\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94N\x8c\x08__name__\x94\x8c\x08__main__\x94\x8c\x08__file__\x94h\x18uNNNt\x94R\x94h\x00\x8c\x12_function_setstate\x94\x93\x94h#}\x94}\x94(h\x1fh\x19\x8c\x0c__qualname__\x94h\x19\x8c\x0f__annotations__\x94}\x94(h\x14\x8c\x10ray.serve.config\x94\x8c\x12AutoscalingContext\x94\x93\x94\x8c\x06return\x94h\x04\x8c\x0cGenericAlias\x94\x85\x94R\x94\x8c\x08builtins\x94\x8c\x05tuple\x94\x93\x94h2\x8c\x03int\x94\x93\x94\x8c\t_operator\x94\x8c\x07getitem\x94\x93\x94\x8c\x06typing\x94\x8c\x04Dict\x94\x93\x94h2\x8c\x03str\x94\x93\x94h:\x8c\x03Any\x94\x93\x94\x86\x94\x86\x94R\x94\x86\x94\x86\x94R\x94u\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h 
\x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94(h\x0eh\x0e\x8c\x08datetime\x94\x93\x94h\x12h\x00\x8c\tsubimport\x94\x93\x94h\x12\x85\x94R\x94uu\x86\x94\x86R0.') except (ModuleNotFoundError, ImportError) as e: print(f"caused by {e} {type(e)}") ``` ``` β― python policy.py caused by No module named 'avro' <class 'ModuleNotFoundError'> ``` --------- Signed-off-by: abrar <[email protected]>
…56481) http://github.com/ray-project/ray/pull/50092 warned that we'd be changing the default `file_extensions` for Parquet from `None` to `["parquet"]`. This was the motivation: > People often have non-Parquet files in their datasets (e.g., _SUCCESS or stale files). However, the default for file_extensions is None, so read_parquet tries reading the non-Parquet files. To avoid this issue, we'll change the default file extensions to something like ["parquet"]. That earlier PR added a warning for the change; this PR follows up and actually changes the default. --------- Signed-off-by: Balaji Veeramani <[email protected]>
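The effect of the new default can be sketched with a small filter helper (the helper is illustrative, not the real `read_parquet` implementation):

```python
def filter_paths(paths, file_extensions):
    """With file_extensions=["parquet"] (the new default), stray files like
    _SUCCESS are skipped; None (the old default) keeps every file."""
    if file_extensions is None:
        return list(paths)
    suffixes = tuple("." + ext.lstrip(".") for ext in file_extensions)
    return [p for p in paths if p.endswith(suffixes)]

paths = ["data/part-0.parquet", "data/_SUCCESS", "data/part-1.parquet"]
print(filter_paths(paths, ["parquet"]))  # ['data/part-0.parquet', 'data/part-1.parquet']
print(filter_paths(paths, None))         # all three files, old behavior
```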
…ect#58577) Signed-off-by: irabbani <[email protected]>
nothing is using it anymore Signed-off-by: Lonnie Liu <[email protected]>
…58580) Adding an optional `include_setuptools` flag for depset configuration. If the flag is set on a depset config, `--unsafe-package setuptools` will not be included for depset compilation. If the flag does not exist (default false) on a depset config, `--unsafe-package setuptools` will be appended to the default arguments. --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
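The flag's effect on the compile invocation can be sketched like this; the base command and config shape are illustrative assumptions, only the `--unsafe-package setuptools` behavior follows the PR description:

```python
def build_compile_args(depset_config: dict) -> list:
    """Append '--unsafe-package setuptools' unless the depset opts in via
    include_setuptools (default False)."""
    args = ["uv", "pip", "compile", "requirements.in"]  # illustrative base command
    if not depset_config.get("include_setuptools", False):
        args += ["--unsafe-package", "setuptools"]
    return args

print(build_compile_args({}))                            # default: setuptools excluded
print(build_compile_args({"include_setuptools": True}))  # flag set: no extra args
```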
otherwise, the newer docker client will refuse to communicate with the docker daemon that is on an older version. Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#58542) ## What does this PR do? Fixes HTTP streaming file downloads in Ray Data's download operation. Some URIs (especially HTTP streams) require `open_input_stream` instead of `open_input_file`. ## Changes - Modified `download_bytes_threaded` in `plan_download_op.py` to try both `open_input_file` and `open_input_stream` for each URI - Improved error handling to distinguish between different error types - Failed downloads now return `None` gracefully instead of crashing ## Testing ``` import pyarrow as pa from ray.data.context import DataContext from ray.data._internal.planner.plan_download_op import download_bytes_threaded # Test URLs: one valid, one 404 urls = [ "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&", ] # Create PyArrow table and call download function table = pa.table({"url": urls}) ctx = DataContext.get_current() results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx)) # Check results result_table = results[0] for i in range(result_table.num_rows): url = result_table['url'][i].as_py() bytes_data = result_table['bytes'][i].as_py() if bytes_data is None: print(f"Row {i}: FAILED (None) - try-catch worked ✅") else: print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)") print(f" URL: {url[:60]}...") print("\n✅ Test passed: Failed downloads return None instead of crashing.") ``` Before the fix: ``` TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/ray/default/test_streaming_fallback.py", line 110, in <module> test_download_expression_with_streaming_fallback() File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback with patch.object(pafs.FileSystem,
"open_input_file", mock_open_input_file): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__ if not self.__exit__(*sys.exc_info()): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__ setattr(self.target, self.attribute, self.temp_original) TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem' (base) ray@ip-10-0-39-21:~/default$ python test.py 2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker! Traceback (most recent call last): File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker for result in fn(input_queue_iter): ^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes yield f.read() ^^^^^^^^ File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek raise ValueError("Cannot seek streaming HTTP file") ValueError: Cannot seek streaming HTTP file Traceback (most recent call last): File "/home/ray/default/test.py", line 16, in <module> results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded uri_bytes = list( ^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen raise item 
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker for result in fn(input_queue_iter): ^^^^^^^^^^^^^^^^^^^^ File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes yield f.read() ^^^^^^^^ File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek raise ValueError("Cannot seek streaming HTTP file") ValueError: Cannot seek streaming HTTP file ``` After the fix: ``` Row 0: SUCCESS (189370 bytes) URL: https://static-assets.tesla.com/configurator/compositor?cont... ``` Tested with HTTP streaming URLs (e.g., Tesla configurator images) that previously failed: - ✅ Successfully downloads HTTP stream files - ✅ Gracefully handles failed downloads (returns None) - ✅ Maintains backward compatibility with existing file downloads --------- Signed-off-by: xyuzh <[email protected]> Signed-off-by: Robert Nishihara <[email protected]> Co-authored-by: Robert Nishihara <[email protected]>
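The fallback behavior can be sketched generically. The fake filesystem classes stand in for a pyarrow filesystem, and the helper mirrors the described behavior (try `open_input_file`, fall back to `open_input_stream`, return `None` on total failure) rather than Ray's exact code:

```python
def open_uri(filesystem, path):
    """Try open_input_file first (random access), then open_input_stream
    (works for HTTP streams); return None instead of raising on failure."""
    for opener in ("open_input_file", "open_input_stream"):
        try:
            return getattr(filesystem, opener)(path)
        except Exception:
            continue
    return None

class FakeStreamOnlyFS:
    """Simulates an HTTP stream: seekable open fails, streaming open works."""
    def open_input_file(self, path):
        raise OSError("Cannot seek streaming HTTP file")
    def open_input_stream(self, path):
        return f"stream:{path}"

class FakeBrokenFS:
    """Simulates a 404: both open styles fail."""
    def open_input_file(self, path):
        raise OSError("404")
    def open_input_stream(self, path):
        raise OSError("404")

print(open_uri(FakeStreamOnlyFS(), "http://example.com/img"))      # falls back to stream
print(open_uri(FakeBrokenFS(), "http://example.com/missing"))      # None, no crash
```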
## Description Today we have very little observability into pubsub. On a raylet, one of the most important pieces of state that needs to be propagated through the cluster via pubsub is cluster membership. All raylets should, in an eventual but timely fashion, agree on the list of available nodes. This metric just emits a simple counter to keep track of the node count. More pubsub observability to come. --------- Signed-off-by: zac <[email protected]> Signed-off-by: Zac Policzer <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
all tests are passing Signed-off-by: Lonnie Liu <[email protected]>
…#58587) also stops building python 3.9 aarch64 images Signed-off-by: Lonnie Liu <[email protected]>
so that importing test.py does not always import github; the github package imports jwt, which then imports cryptography and can lead to issues on windows. Signed-off-by: Lonnie Liu <[email protected]>
this makes it possible to run on a different python version than the CI wrapper code. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Lonnie Liu <[email protected]>
…ecurity (ray-project#58591) Migrates Ray dashboard authentication from JavaScript-managed cookies to server-side HttpOnly cookies to enhance security against XSS attacks. This addresses code review feedback to improve the authentication implementation (ray-project#58368) main changes: - authentication middleware first looks for the `Authorization` header; if not found, it then looks at cookies for the auth token - new `api/authenticate` endpoint for verifying the token and setting the auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and `secure=true` (when using https)) - removed javascript-based cookie manipulation utils and axios interceptors (previously responsible for setting cookies) - cookies are deleted when connecting to a cluster with `AUTH_MODE=disabled`. connecting to a different ray cluster (with a different auth token) using the same endpoint (eg due to port-forwarding or local testing) will reshow the popup and ask users to input the right token. --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
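The middleware's lookup order can be sketched as follows; the cookie name and the bearer scheme are illustrative assumptions, only the header-then-cookie order follows the PR description:

```python
def resolve_auth_token(headers: dict, cookies: dict):
    """Prefer the Authorization header; fall back to the HttpOnly cookie
    set by the /api/authenticate endpoint."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return cookies.get("ray-auth-token")  # cookie name is a placeholder

print(resolve_auth_token({"Authorization": "Bearer t1"}, {"ray-auth-token": "t2"}))  # header wins
print(resolve_auth_token({}, {"ray-auth-token": "t2"}))  # cookie fallback
print(resolve_auth_token({}, {}))  # no token found
```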
add support for `ray get-auth-token` cli command + test --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
ray-project#57590) As discovered in the [PR to better define the interface for reference counter](ray-project#57177 (review)), the plasma store provider and memory store both share thin dependencies on the reference counter that can be refactored out. This will reduce entanglement in our code base and improve maintainability. The main logic changes are located in * src/ray/core_worker/store_provider/plasma_store_provider.cc, where reference counter related logic is refactored into the core worker * src/ray/core_worker/core_worker.cc, where the factored-out reference counter logic is resolved * src/ray/core_worker/store_provider/memory_store/memory_store.cc, where logic related to the reference counter has either been removed (because it is tech debt) or refactored into caller functions. ## Checks Microbenchmark: ``` single client get calls (Plasma Store) per second 10592.56 +- 535.86 single client put calls (Plasma Store) per second 4908.72 +- 41.55 multi client put calls (Plasma Store) per second 14260.79 +- 265.48 single client put gigabytes per second 11.92 +- 10.21 single client tasks and get batch per second 8.33 +- 0.19 multi client put gigabytes per second 32.09 +- 1.63 single client get object containing 10k refs per second 13.38 +- 0.13 single client wait 1k refs per second 5.04 +- 0.05 single client tasks sync per second 960.45 +- 15.76 single client tasks async per second 7955.16 +- 195.97 multi client tasks async per second 17724.1 +- 856.8 1:1 actor calls sync per second 2251.22 +- 63.93 1:1 actor calls async per second 9342.91 +- 614.74 1:1 actor calls concurrent per second 6427.29 +- 50.3 1:n actor calls async per second 8221.63 +- 167.83 n:n actor calls async per second 22876.04 +- 436.98 n:n actor calls with arg async per second 3531.21 +- 39.38 1:1 async-actor calls sync per second 1581.31 +- 34.01 1:1 async-actor calls async per second 5651.2 +- 222.21 1:1 async-actor calls with args async per second 3618.34 +- 76.02 1:n async-actor calls async per second 7379.2 +- 144.83 n:n async-actor calls async per second 19768.79 +- 211.95 ``` This PR mainly makes logic changes to the `ray.get` call chain. As we can see from the benchmark above, single client get call performance matches pre-regression levels. --------- Signed-off-by: davik <[email protected]> Co-authored-by: davik <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]>
…ay-project#58471) 2. **Extracted generic `RankManager` class** - Created reusable rank management logic separated from deployment-specific concerns 3. **Introduced `ReplicaRank` schema** - Type-safe rank representation replacing raw integers 4. **Simplified error handling** - not supporting self healing 5. **Updated tests** - Refactored unit tests to use new API and removed flag-dependent test cases **Impact:** - Cleaner separation of concerns in rank management - Foundation for future multi-level rank support Next PR ray-project#58473 --------- Signed-off-by: abrar <[email protected]>
Currently, Ray metrics and events are exported through a centralized process called the Dashboard Agent. This process functions as a gRPC server, receiving data from all other components (GCS, Raylet, workers, etc.). However, during a node shutdown, the Dashboard Agent may terminate before the other components, resulting in gRPC errors and potential loss of metrics and events. When this issue occurs, the otel SDK logs become very noisy. Add a default option to disable otel SDK logs to avoid confusion. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`, which is a non-flaky version of `fetch_prometheus`. Update all test usages accordingly. Test: - CI --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… RD Datatype (ray-project#58225) ## Description As the title suggests. Signed-off-by: Goutam <[email protected]>
…ay-project#58581) allowing for py3.13 images (cpu & cu123) in release tests Signed-off-by: elliot-barn <[email protected]>
## Description Add avg prompt length metric. When using uniform prompt lengths (especially in testing), the P50 and P90 computations are skewed due to the 1_2_5 buckets used in vLLM. Average prompt length provides another useful dimension to look at and validate. For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows 9400, while avg accurately shows 5000. <img width="1186" height="466" alt="image" src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a" /> --------- Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: Rui Qiao <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Prometheus auto-appends the `_total` suffix to all Counter metrics. Ray has historically supported counter metrics with and without the `_total` suffix for backward compatibility, but it is now time to drop that support (2 years since the warning was added). There is one place in the Ray Serve dashboard that still doesn't use the `_total` suffix, so fix it in this PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
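A tiny helper makes the convention concrete (illustrative, not the dashboard's actual code): always refer to counters by the `_total`-suffixed name rather than supporting both variants.

```python
def counter_metric_name(name: str) -> str:
    """Prometheus conventions append _total to counter names; use only
    the suffixed form instead of supporting both variants."""
    return name if name.endswith("_total") else name + "_total"

# Hypothetical metric name for illustration:
print(counter_metric_name("ray_serve_num_http_requests"))        # suffix appended
print(counter_metric_name("ray_serve_num_http_requests_total"))  # already suffixed, unchanged
```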
This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray will delegate authentication and authorization of Ray access to Kubernetes TokenReview and SubjectAccessReview APIs. --------- Signed-off-by: Andrew Sy Kim <[email protected]>
unifying to python 3.10 Signed-off-by: Lonnie Liu <[email protected]>
The pull request #677 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5363.
Code Review
This pull request is an automated daily merge from master to main. It incorporates a vast number of changes, primarily focused on a large-scale refactoring and improvement of the build and CI systems. Key themes include:
- CI/CD Refactoring: The Buildkite pipelines have been significantly modularized. Steps for building images, running documentation checks, and managing dependencies have been moved into separate, dedicated pipeline files (`_images.rayci.yml`, `doc.rayci.yml`, `dependencies.rayci.yml`). This improves clarity and maintainability.
- Bazel Overhaul: The build system has undergone a major cleanup. The root `BUILD.bazel` file has been simplified, with many targets moved to more appropriate sub-packages. The use of `pkg_zip` and `pkg_files` replaces older `genrule` scripts for packaging, which is a move towards more hermetic and declarative builds. The workspace name has been updated to `io_ray`, following Bazel best practices.
- Dependency Management: There's a clear shift towards more robust dependency management. This includes the introduction of a new `raydepsets` tool, the adoption of `uv` for Python dependency resolution, and a switch from `miniconda` to `miniforge`.
- Linting and Static Analysis: The `.pre-commit-config.yaml` has been greatly expanded with more tools like `semgrep`, `vale`, and `eslint`, enhancing code quality and consistency across the repository.
- Code Modernization: Several C++ files have been updated to use modern language features (e.g., `std::invoke_result_t` instead of the deprecated `std::result_of_t`) and to align with internal API refactorings.
Overall, these changes represent a significant step forward in the project's engineering practices. The refactoring makes the build system more robust, maintainable, and easier to understand. I have reviewed the changes and found no issues of medium or higher severity. The modifications are consistent and well-aligned with the goal of improving the development infrastructure.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
Created: 2025-11-14
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.