
Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

πŸ“… Created: 2025-11-14
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

iamjustinhsu and others added 30 commits October 24, 2025 12:29
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

~~Before:~~

~~https://github.com/user-attachments/assets/9db00f37-0c37-4e99-874a-a14481878e4a~~
~~Before, the progress bar won't update until the first task finishes.~~

~~After:~~

~~https://github.com/user-attachments/assets/99877a3f-7b52-4293-aae5-7702edfaabec~~

~~In After, the progress bar won't update until the first task generates
output. If a task generates 10 blocks, we will update the progress bar
while it's generating blocks, even if the task hasn't finished. Once the
task finishes, we default back to the way it was before.~~

~~This is better because the very 1st progress bar update will occur
sooner, and won't feel abrupt to the user.~~

Refactoring the progress bar estimates using known metrics.

## Why are these changes needed?
Currently we use the number of finished tasks. This is OK, but since we use
a streaming generator, 1 task can produce thousands of blocks. This is
troublesome with the additional split factor (split blocks) in `read_parquet`.
<!-- Please give a short summary of the change and the problem this
solves. -->
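The idea can be sketched roughly as follows. This is a minimal illustration with hypothetical names (the actual implementation lives in Ray Data's progress-bar code): give partial credit for blocks already streamed out of running tasks instead of counting only finished tasks.

```python
# Hypothetical sketch: estimate progress from streamed blocks rather than
# finished tasks, so the bar moves while a long task is still generating
# output. All names and the weighting scheme are assumptions.
def estimate_progress(finished_tasks, total_tasks,
                      blocks_of_running_tasks, avg_blocks_per_task):
    """Return a fraction in [0, 1] combining finished tasks with partial
    credit for blocks already produced by still-running tasks."""
    if total_tasks == 0:
        return 0.0
    if avg_blocks_per_task:
        # Partial credit, capped at one task's worth of blocks.
        partial = min(blocks_of_running_tasks / avg_blocks_per_task, 1.0)
    else:
        partial = 0.0
    return min((finished_tasks + partial) / total_tasks, 1.0)
```

With this shape, the very first update arrives as soon as a running task emits its first block, rather than only when the task completes.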

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <[email protected]>
…58046)

This PR sets up the helper classes and utilities to enable token-based
authentication for Ray Core RPC calls.

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
I suspect that when we deploy the app config, we don't wait long enough
before sending traffic, so requests could go to the wrong version.

---------

Signed-off-by: abrar <[email protected]>
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.
 
Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.
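The state handling described above can be sketched as a small enum plus an export filter. This is an illustrative sketch, not the actual `TrainControllerState` implementation; only the state names come from the description above.

```python
from enum import Enum

# Sketch of the controller states named in this PR description; hidden
# states appear in StateManager logs and Grafana but not the state export.
class TrainControllerState(Enum):
    RUNNING = "RUNNING"
    RESCHEDULING = "RESCHEDULING"    # hidden
    SHUTTING_DOWN = "SHUTTING_DOWN"  # hidden: worker group done, validations pending
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"
    ABORTED = "ABORTED"

HIDDEN_STATES = {TrainControllerState.RESCHEDULING,
                 TrainControllerState.SHUTTING_DOWN}

def exported_state(state):
    """Only surface non-hidden states in the state export."""
    return None if state in HIDDEN_STATES else state.value
```

This mirrors the rule that terminal states only show up in the export once `fit()` has returned and results are accessible.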

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ject#57930)

Add actor+job+node event to ray event export doc

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Dhyey Shah <[email protected]>
Signed-off-by: Qiaolin-Yu <[email protected]>
Signed-off-by: Qiaolin Yu <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
disabled the wrong test with a different name from the issue

mistakenly associated issue:
ray-project#46687

Signed-off-by: Lonnie Liu <[email protected]>
upgrading batch inference tests to py3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65258

all except for image_embedding_from_jsonl are running on python 3.10

---------

Signed-off-by: elliot-barn <[email protected]>
…roject#57896)

## Description
Add missing imports to autoscaling policy example

## Related issues
Link related issues:
ray-project#57876 (comment)

---------

Signed-off-by: daiping8 <[email protected]>
Signed-off-by: Ping Dai <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description
As title states.

Example:

```
from ray.data.expressions import col, lit
expr = (col("x") + lit(5)) * col("y")
print(expr)
  MUL
  β”œβ”€β”€ left: ADD
  β”‚   β”œβ”€β”€ left: COL('x')
  β”‚   └── right: LIT(5)
  └── right: COL('y')
```

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Goutam <[email protected]>
fixes min setup build

Signed-off-by: Lonnie Liu <[email protected]>
…oject#58030)

## Description

Currently, we implicitly assume that `RefBundle` holds exactly 1 block.
That's not a safe assumption, and this change addresses it by explicitly
referring to the number of blocks instead.
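For illustration, here is a simplified stand-in (not the real `RefBundle`, which lives in Ray Data's internal execution interfaces) showing the explicit block count:

```python
from dataclasses import dataclass, field
from typing import List

# Simplified stand-in for Ray Data's RefBundle; block refs are modeled as
# strings here purely for illustration.
@dataclass
class RefBundle:
    blocks: List[str] = field(default_factory=list)

def total_blocks(bundles):
    # Sum per-bundle block counts explicitly instead of assuming
    # one block per bundle (i.e., instead of len(bundles)).
    return sum(len(b.blocks) for b in bundles)
```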

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
…ring. (ray-project#58212)

Referenced `ray._private.worker.global_worker`, which users don't know
or care about.

Also cleaned up the wording for `get_node_id` and moved the Ray client
note there.

---------

Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the two-phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Adding an `--all-configs` option to raydepsets to build all configs.

Added CLI and workspace unit tests.

---------

Signed-off-by: elliot-barn <[email protected]>
…ay-project#57788)

<!-- Thank you for contributing to Ray! πŸš€ -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- πŸ’‘ Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

This change primarily converts `OpResourceAllocator` APIs to make data
flow explicit by exposing required params in the APIs.

Additionally:

1. Abstracting common methods into the `OpResourceAllocator` base class.
2. Adding allocations to the progress bar in verbose mode, logging budgets
and allocations.
3. Adding the byte size of all enqueued blocks to the progress bar.

## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix πŸ›
- [ ] New feature ✨
- [ ] Enhancement πŸš€
- [ ] Code refactoring πŸ”§
- [ ] Documentation update πŸ“–
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [ ] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [ ] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [ ] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
running serve tests on py3.10
failing tests already are set to manual frequency:
https://buildkite.com/ray-project/release/builds/62142#_

---------

Signed-off-by: elliot-barn <[email protected]>
## Description
In the original TaskEvent proto, `worker_id` is marked as optional:
https://github.com/ray-project/ray/blob/830a456b9b558028853423c9042f7e2763ec5283/src/ray/protobuf/gcs.proto#L201

but in the Ray event proto it is not:
https://github.com/ray-project/ray/blob/f635de7c86d0d0f813a305a9fd5e864a64257894/src/ray/protobuf/public/events_task_lifecycle_event.proto#L42

In the converter we always set the `worker_id` field even if it's an empty
string:
https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_ray_event_converter.cc#L145
If an optional field is set, even if it is empty (the default proto value),
it is considered as having a value, and during `MergeFrom()` calls the value
overwrites the destination object's existing value.

Source: https://protobuf.dev/programming-guides/field_presence/
> Explicitly set fields – including default values – are merged-from.


This PR fixes this gap in the conversion logic.
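The presence semantics can be demonstrated with a small pure-Python simulation. This is not actual protobuf code; it only mimics the "explicitly set fields are merged-from" rule described above.

```python
# Pure-Python simulation of proto explicit field presence: an optional
# field that was explicitly set -- even to the default "" -- overwrites
# the destination on merge, while an unset field does not.
class Msg:
    def __init__(self):
        self.fields = {}       # field name -> value
        self._present = set()  # fields that were explicitly set

    def set(self, name, value):
        self.fields[name] = value
        self._present.add(name)

    def merge_from(self, other):
        # Only explicitly-present fields are copied, matching proto
        # MergeFrom() semantics for optional fields.
        for name in other._present:
            self.fields[name] = other.fields[name]
            self._present.add(name)

dst = Msg()
dst.set("worker_id", "abc123")
src = Msg()
src.set("worker_id", "")  # explicitly set to empty -> still "present"
dst.merge_from(src)
# dst.fields["worker_id"] is now "": the real value was clobbered,
# which is exactly the bug the converter fix avoids.
```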

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
…y-project#58022)

Deprecate `CheckpointConfig(checkpoint_at_end, checkpoint_frequency)`
and mark the `resume_from_checkpoint, metadata` Trainer constructor
arguments as deprecated in the docstrings.

Update the "inspecting results" user guide doc code to show how to catch
and inspect errors raised by `trainer.fit()`. The previous recommendation
to check `result.error` is unusable because we always raise the error,
which prevents the user from accessing the result object.

---------

Signed-off-by: Justin Yu <[email protected]>
…pported operator filter (ray-project#57970)

## Description
The per-node metrics on the OSS Ray Data dashboard are not displayed as
expected. Because of code change ray-project#55495, a filter on
`operator` was added to the following three metrics; that label is [not
supported](https://github.com/ray-project/ray/blob/e51f8039bc6992d37834bcff109a3d340e78fcde/python/ray/data/_internal/stats.py#L448)
by per-node metrics and causes empty results:
- ray_data_num_tasks_finished_per_node
- ray_data_bytes_outputs_of_finished_tasks_per_node
- ray_data_blocks_outputs_of_finished_tasks_per_node
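The shape of the fix can be sketched as a query builder that drops unsupported labels. This is a hedged illustration; the supported label set below is an assumption, not the actual list from `stats.py`.

```python
# Illustrative sketch: per-node metrics only support a subset of labels,
# so strip anything else (like "operator") before building the query.
PER_NODE_SUPPORTED_LABELS = {"dataset", "node_ip"}  # assumed label set

def build_query(metric, labels):
    """Build a Prometheus-style selector using only supported labels."""
    supported = {k: v for k, v in labels.items()
                 if k in PER_NODE_SUPPORTED_LABELS}
    selector = ", ".join(f'{k}="{v}"' for k, v in sorted(supported.items()))
    return f"{metric}{{{selector}}}"
```

Including an unsupported label in the selector matches no series at all, which is why the dashboard panels rendered empty.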

Signed-off-by: cong.qian <[email protected]>
…#58223)

upgrading tune scalability release tests to python 3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65669#_
Only the failing GCE tests were disabled.

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
- Add Azure VM launcher release test
- Change region for the Azure cluster to be in `centralus` since
`westus2` has trouble with availability.
- Add helper function to authenticate with Azure using service principal
in launch cluster script

---------

Signed-off-by: kevin <[email protected]>
## Description
1. Update docs.
2. Catch the exception and redirect users to the docs.

## Related issues
ray-project#56855

## Additional information
Hard to write tests for this situation.

Manually verified that this is the right exception to catch
```
try:
    cloudpickle.loads(b'\x80\x05\x95\xbc\x03\x00\x00\x00\x00\x00\x00\x8c\x1bray.cloudpickle.cloudpickle\x94\x8c\x0e_make_function\x94\x93\x94(h\x00\x8c\r_builtin_type\x94\x93\x94\x8c\x08CodeType\x94\x85\x94R\x94(K\x01K\x00K\x00K\x03K\x03KCCnt\x00\xa0\x01\xa1\x00}\x01|\x01j\x02}\x02t\x03t\x04\x83\x01\x01\x00d\x01|\x02\x04\x00\x03\x00k\x01r*d\x02k\x00r:n\x04\x01\x00n\x0cd\x03d\x04d\x05i\x01f\x02S\x00d\x06|\x02\x04\x00\x03\x00k\x01rNd\x07k\x00r^n\x04\x01\x00n\x0cd\x08d\x04d\ti\x01f\x02S\x00d\nd\x04d\x0bi\x01f\x02S\x00d\x00S\x00\x94(NK\tK\x11K\x02\x8c\x06reason\x94\x8c\x0eBusiness hours\x94K\x12K\x14K\x04\x8c\x18Evening batch processing\x94K\x01\x8c\x0eOff-peak hours\x94t\x94(\x8c\x08datetime\x94\x8c\x03now\x94\x8c\x04hour\x94\x8c\x05print\x94\x8c\x04avro\x94t\x94\x8c\x03ctx\x94\x8c\x0ccurrent_time\x94\x8c\x0ccurrent_hour\x94\x87\x94\x8c\x1b/home/ubuntu/apps/policy.py\x94\x8c!scheduled_batch_processing_policy\x94K\x0eC\x10\x00\x03\x08\x01\x06\x01\x08\x02\x18\x01\x0c\x02\x18\x01\x0c\x03\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94N\x8c\x08__name__\x94\x8c\x08__main__\x94\x8c\x08__file__\x94h\x18uNNNt\x94R\x94h\x00\x8c\x12_function_setstate\x94\x93\x94h#}\x94}\x94(h\x1fh\x19\x8c\x0c__qualname__\x94h\x19\x8c\x0f__annotations__\x94}\x94(h\x14\x8c\x10ray.serve.config\x94\x8c\x12AutoscalingContext\x94\x93\x94\x8c\x06return\x94h\x04\x8c\x0cGenericAlias\x94\x85\x94R\x94\x8c\x08builtins\x94\x8c\x05tuple\x94\x93\x94h2\x8c\x03int\x94\x93\x94\x8c\t_operator\x94\x8c\x07getitem\x94\x93\x94\x8c\x06typing\x94\x8c\x04Dict\x94\x93\x94h2\x8c\x03str\x94\x93\x94h:\x8c\x03Any\x94\x93\x94\x86\x94\x86\x94R\x94\x86\x94\x86\x94R\x94u\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h \x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94(h\x0eh\x0e\x8c\x08datetime\x94\x93\x94h\x12h\x00\x8c\tsubimport\x94\x93\x94h\x12\x85\x94R\x94uu\x86\x94\x86R0.')
except (ModuleNotFoundError, ImportError) as e:
    print(f"caused by {e} {type(e)}")
```

```
❯ python policy.py
caused by No module named 'avro' <class 'ModuleNotFoundError'>
```

---------

Signed-off-by: abrar <[email protected]>
…56481)

http://github.com/ray-project/ray/pull/50092 warned that we'd be
changing the default `file_extensions` for Parquet from `None` to
`[parquet]`. This was the motivation:
> People often have non-Parquet files in their datasets (e.g., _SUCCESS
or stale files). However, the default for file_extensions is None, so
read_parquet tries reading the non-Parquet files. To avoid this issue,
we'll change the default file extensions to something like ["parquet"].
This PR adds a warning for that change.

This PR follows up and actually changes the default.
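The new default can be illustrated with a small path-filtering sketch. This is a simplification of what `read_parquet` does internally, not the actual implementation; passing `file_extensions=None` keeps the old read-everything behavior.

```python
# Sketch of the new default: only paths ending in ".parquet" are read,
# so stray files like _SUCCESS or stale files are skipped.
def filter_paths(paths, file_extensions=("parquet",)):
    """Keep only paths matching the given extensions; None disables
    filtering (the pre-change behavior)."""
    if file_extensions is None:
        return list(paths)
    exts = tuple(f".{e}" for e in file_extensions)
    return [p for p in paths if p.endswith(exts)]
```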

---------

Signed-off-by: Balaji Veeramani <[email protected]>
israbbani and others added 22 commits November 13, 2025 02:48
nothing is using it anymore

Signed-off-by: Lonnie Liu <[email protected]>
…58580)

Adding an optional `include_setuptools` flag for depset configuration.

If the flag is set on a depset config, `--unsafe-package setuptools` will
not be included for depset compilation.

If the flag does not exist (default: false) on a depset config,
`--unsafe-package setuptools` will be appended to the default arguments.

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
otherwise, the newer docker client will refuse to communicate with the
docker daemon that is on an older version.

Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#58542)

## What does this PR do?
   
Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) require `open_input_stream` instead
of `open_input_file`.
   
   ## Changes
   
- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
   - Failed downloads now return `None` gracefully instead of crashing
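The fallback behavior can be sketched like this. The two method names mirror the pyarrow filesystem API mentioned above, but the helper itself is hypothetical, not the actual `plan_download_op.py` code.

```python
# Hedged sketch of the try-both-openers fallback: random-access open
# first, then streaming open; return None if both fail so one bad URI
# doesn't crash the whole batch.
def open_with_fallback(fs, uri):
    for opener in ("open_input_file", "open_input_stream"):
        try:
            return getattr(fs, opener)(uri)
        except Exception:
            continue  # e.g. "Cannot seek streaming HTTP file"
    return None
```

Returning `None` (rather than raising) is what lets failed downloads surface as `None` cells in the output table.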
   
   ## Testing
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded

# Test URLs: one valid, one 404
urls = [    
    "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]

# Create PyArrow table and call download function
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))

# Check results
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked βœ“")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f"  URL: {url[:60]}...")

print("\nβœ… Test passed: Failed downloads return None instead of crashing.")
```

Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
    test_download_expression_with_streaming_fallback()
  File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
    with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
    if not self.__exit__(*sys.exc_info()):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
    setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
  File "/home/ray/default/test.py", line 16, in <module>
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
    uri_bytes = list(
                ^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
  URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
   
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
   - βœ… Successfully downloads HTTP stream files
   - βœ… Gracefully handles failed downloads (returns None)
   - βœ… Maintains backward compatibility with existing file downloads

---------

Signed-off-by: xyuzh <[email protected]>
Signed-off-by: Robert Nishihara <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
## Description

Today we have very little observability into pubsub. On a raylet, one of
the most important pieces of state that needs to be propagated through the
cluster via pubsub is cluster membership. All raylets should, in an
eventual but timely fashion, agree on the list of available nodes. This
metric simply emits a counter to keep track of the node count.

More pubsub observability to come.
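As a toy sketch of what the metric tracks (names are illustrative, not the raylet's actual code):

```python
# Illustrative sketch: each raylet maintains its current view of cluster
# membership and exposes the node count for a metrics exporter to emit.
class NodeMembershipTracker:
    def __init__(self):
        self.alive_nodes = set()

    def on_node_added(self, node_id):
        self.alive_nodes.add(node_id)

    def on_node_removed(self, node_id):
        self.alive_nodes.discard(node_id)

    def record_metric(self):
        # Value a real exporter would emit; comparing it across raylets
        # reveals nodes whose membership view has diverged.
        return len(self.alive_nodes)
```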

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: zac <[email protected]>
Signed-off-by: Zac Policzer <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
all tests are passing

Signed-off-by: Lonnie Liu <[email protected]>
…#58587)

also stops building python 3.9 aarch64 images

Signed-off-by: Lonnie Liu <[email protected]>
so that importing test.py does not always import `github`

The `github` package imports `jwt`, which then imports `cryptography` and
can lead to issues on Windows.

Signed-off-by: Lonnie Liu <[email protected]>
this makes it possible to run on a different python version than the CI
wrapper code.

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
…ecurity (ray-project#58591)

Migrates Ray dashboard authentication from JavaScript-managed cookies to
server-side HttpOnly cookies to enhance security against XSS attacks.
This addresses code review feedback to improve the authentication
implementation (ray-project#58368)

main changes:
- authentication middleware first looks for `Authorization` header, if
not found it then looks at cookies to look for the auth token
- new `api/authenticate` endpoint for verifying token and setting the
auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and
`secure=true` (when using https))
- removed javascript based cookie manipulation utils and axios
interceptors (were previously responsible for setting cookies)
- cookies are deleted when connecting to a cluster with
`AUTH_MODE=disabled`. connecting to a different ray cluster (with
different auth token) using the same endpoint (eg due to port-forwarding
or local testing) will reshow the popup and ask users to input the right
token.
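The lookup order can be sketched as follows. The cookie name and helper function are assumptions for illustration, not the actual dashboard implementation.

```python
# Sketch of the middleware's token lookup: Authorization header first,
# then the HttpOnly cookie set by the (assumed) /api/authenticate endpoint.
AUTH_COOKIE_NAME = "ray-auth-token"  # assumed cookie name

def extract_token(headers, cookies):
    """Return the auth token from the header if present, else the cookie,
    else None (unauthenticated)."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return cookies.get(AUTH_COOKIE_NAME)
```

Because the cookie is `HttpOnly`, page JavaScript never touches the token, which is the XSS-hardening point of this change.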

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
add support for `ray get-auth-token` cli command + test

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
ray-project#57590)

As discovered in the [PR to better define the interface for reference
counter](ray-project#57177 (review)),
plasma store provider and memory store both share thin dependencies on
reference counter that can be refactored out. This will reduce
entanglement in our code base and improve maintainability.

The main logic changes are located in:
* src/ray/core_worker/store_provider/plasma_store_provider.cc, where
reference-counter-related logic is refactored into the core worker
* src/ray/core_worker/core_worker.cc, where the factored-out reference
counter logic is resolved
* src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
logic related to the reference counter has either been removed because it
is tech debt or refactored into caller functions.

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks
Microbenchmark:
```
single client get calls (Plasma Store) per second 10592.56 +- 535.86
single client put calls (Plasma Store) per second 4908.72 +- 41.55
multi client put calls (Plasma Store) per second 14260.79 +- 265.48
single client put gigabytes per second 11.92 +- 10.21
single client tasks and get batch per second 8.33 +- 0.19
multi client put gigabytes per second 32.09 +- 1.63
single client get object containing 10k refs per second 13.38 +- 0.13
single client wait 1k refs per second 5.04 +- 0.05
single client tasks sync per second 960.45 +- 15.76
single client tasks async per second 7955.16 +- 195.97
multi client tasks async per second 17724.1 +- 856.8
1:1 actor calls sync per second 2251.22 +- 63.93
1:1 actor calls async per second 9342.91 +- 614.74
1:1 actor calls concurrent per second 6427.29 +- 50.3
1:n actor calls async per second 8221.63 +- 167.83
n:n actor calls async per second 22876.04 +- 436.98
n:n actor calls with arg async per second 3531.21 +- 39.38
1:1 async-actor calls sync per second 1581.31 +- 34.01
1:1 async-actor calls async per second 5651.2 +- 222.21
1:1 async-actor calls with args async per second 3618.34 +- 76.02
1:n async-actor calls async per second 7379.2 +- 144.83
n:n async-actor calls async per second 19768.79 +- 211.95
```
This PR mainly makes logic changes to the `ray.get` call chain. As we
can see from the benchmark above, single-client get call performance
matches pre-regression levels.

---------

Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
…ay-project#58471)

2. **Extracted generic `RankManager` class** - Created reusable rank
management logic separated from deployment-specific concerns

3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
replacing raw integers

4. **Simplified error handling** - not supporting self healing

5. **Updated tests** - Refactored unit tests to use new API and removed
flag-dependent test cases

**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support


Next PR ray-project#58473

---------

Signed-off-by: abrar <[email protected]>
Currently, Ray metrics and events are exported through a centralized
process called the Dashboard Agent. This process functions as a gRPC
server, receiving data from all other components (GCS, Raylet, workers,
etc.). However, during a node shutdown, the Dashboard Agent may
terminate before the other components, resulting in gRPC errors and
potential loss of metrics and events.

When this issue occurs, the OTel SDK logs become very noisy. Add a default
option to disable OTel SDK logs to avoid confusion.

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
which is a non-flaky version of `fetch_prometheus`. Update all test
usages accordingly.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… RD Datatype (ray-project#58225)

## Description
As title suggests

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Goutam <[email protected]>
…ay-project#58581)

allowing for py3.13 images (cpu & cu123) in release tests

Signed-off-by: elliot-barn <[email protected]>
## Description
Add avg prompt length metric

When using uniform prompt length (especially in testing), the P50 and
P90 computations are skewed due to the 1_2_5 buckets used in vLLM.
Average prompt length provides another useful dimension to look at and
validate.

For example, using uniformly ISL=5000, P50 shows 7200 and P90 shows
9400, and avg accurately shows 5000.

<img width="1186" height="466" alt="image"
src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
/>
 

## Related issues

## Additional information

---------

Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Prometheus auto-appends the `_total` suffix to all Counter metrics. Ray
has historically supported counter metrics both with and without the
`_total` suffix for backward compatibility, but it is now time to drop
that support (2 years since the warning was added).

There is one place in the Ray Serve dashboard that still doesn't use the
`_total` suffix, so fix it in this PR.
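The convention can be sketched as a tiny normalization helper (illustrative only, not the dashboard's actual code):

```python
# Sketch: normalize counter names to the Prometheus convention by
# ensuring the `_total` suffix when building dashboard queries.
def counter_query_name(name):
    """Append `_total` to a counter name unless it already has it."""
    return name if name.endswith("_total") else name + "_total"
```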

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
will delegate authentication and authorization of Ray access to
Kubernetes TokenReview and SubjectAccessReview APIs.

---------

Signed-off-by: Andrew Sy Kim <[email protected]>
unifying to python 3.10

Signed-off-by: Lonnie Liu <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #677 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5363.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a routine daily merge from the master branch into main, focusing on comprehensive updates to the project's build and continuous integration systems. The changes streamline Docker image creation, enhance dependency resolution, and refine code quality checks, ensuring a more robust and maintainable development workflow.

Highlights

  • CI/CD Infrastructure Updates: The CI/CD pipeline has undergone significant refactoring, including a migration from miniconda to miniforge for Python environments, and a consolidation of Docker image build definitions into a new _images.rayci.yml file. Documentation-related CI steps are now centralized in doc.rayci.yml.
  • Bazel Build Configuration Enhancements: Bazel build configurations have been updated to enable strict_action_env by default, introduce new Linux workspace status commands, and add UTF-8 support for Windows CXX options. Many RPC and core component build rules have been refactored into sub-BUILD files for better organization.
  • Python Dependency Management with RayDepsets: A new raydepsets tool has been introduced for more robust Python dependency management, allowing for precise control over dependency compilation, subsetting, and expansion across various Ray components and environments.
  • Docker Image Tagging and Registry Updates: Docker image tagging logic has been enhanced to include rayci_build_id in tags and support new *-extra image types. Azure Container Registry (ACR) integration has been added for Anyscale Docker images, alongside existing AWS ECR and GCP registries.
  • Code Ownership and Linting Refinements: Code ownership rules in .github/CODEOWNERS have been consolidated and expanded. Pre-commit hooks have been updated to include semgrep, vale, cython-lint, and eslint for improved code quality and consistency.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main. It incorporates a vast number of changes, primarily focused on a large-scale refactoring and improvement of the build and CI systems. Key themes include:

  • CI/CD Refactoring: The Buildkite pipelines have been significantly modularized. Steps for building images, running documentation checks, and managing dependencies have been moved into separate, dedicated pipeline files (_images.rayci.yml, doc.rayci.yml, dependencies.rayci.yml). This improves clarity and maintainability.
  • Bazel Overhaul: The build system has undergone a major cleanup. The root BUILD.bazel file has been simplified, with many targets moved to more appropriate sub-packages. The use of pkg_zip and pkg_files replaces older genrule scripts for packaging, which is a move towards more hermetic and declarative builds. The workspace name has been updated to io_ray, following Bazel best practices.
  • Dependency Management: There's a clear shift towards more robust dependency management. This includes the introduction of a new raydepsets tool, the adoption of uv for Python dependency resolution, and a switch from miniconda to miniforge.
  • Linting and Static Analysis: The .pre-commit-config.yaml has been greatly expanded with more tools like semgrep, vale, and eslint, enhancing code quality and consistency across the repository.
  • Code Modernization: Several C++ files have been updated to use modern language features (e.g., std::invoke_result_t instead of the deprecated std::result_of_t) and to align with internal API refactorings.

Overall, these changes represent a significant step forward in the project's engineering practices. The refactoring makes the build system more robust, maintainable, and easier to understand. I have reviewed the changes and found no issues of medium or higher severity. The modifications are consistent and well-aligned with the goal of improving the development infrastructure.
