
[pull] master from ray-project:master#823

Merged
pull[bot] merged 12 commits into garymm:master from ray-project:master
Mar 13, 2026

Conversation


@pull pull bot commented Mar 13, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


jeffreywang-anyscale and others added 12 commits March 12, 2026 12:54
## Description
vLLM is moving to py312 + cuda13. To accommodate that, add CUDA 13 CI
and release images for ray-llm and core-gpu.

## Approach
### Ray-LLM on CUDA 13

- **CI:** `llmgpubuild-py312` build step using `cu130` base
(`.buildkite/llm.rayci.yml`)
- **Release images:** `ray-llm-anyscale` py3.12 + `cu13.0.0-cudnn` built
and published (`.buildkite/release/build.rayci.yml`)
- **BYOD type:** `llm-cu130` (requires Python 3.12)
- **Python:** 3.12

### Core-GPU (Compiled Graphs) on CUDA 13

- **CI:** `coregpu-cu130-build` + `core-multi-gpu-cu130-tests` steps
(`.buildkite/core.rayci.yml`)
- **Release images:** `ray-ml-anyscale` py3.10 + `cu13.0.0-cudnn` built
and published through the full ray-ml chain
- **BYOD type:** `gpu-cu130` (requires Python 3.10)
- **Release tests:** `compiled_graphs_GPU_cu130` and
`compiled_graphs_GPU_multinode_cu130` on L4 instances
- **Python:** 3.10


### Summary Table

| Workload  | BYOD Type   | Python | CUDA | Image Repo         |
|------------|------------|--------|------|-------------------|
| Ray-LLM    | llm-cu128  | 3.11   | 12.8 | anyscale/ray-llm  |
| Ray-LLM    | llm-cu130  | 3.12   | 13.0 | anyscale/ray-llm  |
| Core-GPU   | gpu        | 3.10   | 12.1 | anyscale/ray-ml   |
| Core-GPU   | gpu-cu130  | 3.10   | 13.0 | anyscale/ray-ml   |

## Prerequisite
#61496

## Follow up
- Add BYOD types (llm-cu130, gpu-cu130), release test entries, and
compute configs for compiled graph tests on CUDA 13.
- Switch ray-llm tests to the new py312 + cuda13 image, and core-gpu
(compiled graphs) tests to the new cuda13 image.

## Related issues
This is the first step to resolve
#61384.


---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Andrew Pollack-Gray <andrew@anyscale.com>
## Description
The current implementation of `iter_groups` in aggregate determines the
boundary of each key by iterating row by row when it needs to split the
rows into multiple groups, which is an expensive operation. #58910
provides a function with the same functionality, named
`_iter_groups_sorted`, which is more efficient than the original.
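The core of the speedup is finding group boundaries in an already-sorted key column with one vectorized comparison instead of a Python-level row loop. A minimal sketch of that idea (illustrative only; Ray's `_iter_groups_sorted` is the real implementation and differs in detail):

```python
import numpy as np

def iter_group_slices(sorted_keys):
    """Yield (start, end) index ranges for each run of equal keys.

    Assumes `sorted_keys` is already sorted. A single vectorized
    comparison of adjacent elements locates every boundary at once,
    rather than checking each row in a Python loop.
    """
    keys = np.asarray(sorted_keys)
    if keys.size == 0:
        return
    # Indices where the key differs from the previous row.
    change = np.nonzero(keys[1:] != keys[:-1])[0] + 1
    bounds = np.concatenate(([0], change, [keys.size]))
    for start, end in zip(bounds[:-1], bounds[1:]):
        yield int(start), int(end)

print(list(iter_group_slices([50, 50, 60, 60, 60, 70])))
```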

To verify, I mocked the following table:
| length|content |
| --- | --- |
|50|12343432323232323232324343434234243434343423433333|
|60|123434323232323232323243434342342434343434234333331111111111|
|...|...|

The `content` column is randomly generated, and the `length` column is
the length of the `content` column. I used the following script to test
aggregate performance:
```python
import ray
import pyarrow.parquet as pq
import pyarrow.compute as pac
from ray.data._internal.arrow_ops.transform_pyarrow import take_table
from ray.data._internal.execution.operators.hash_aggregate import ReducingAggregation
from typing import Optional
from ray.data.aggregate import AggregateFnV2
from ray.data.block import AggType, Block, BlockAccessor

import time

class SubstrAndSum(AggregateFnV2):
    def __init__(
        self,
        on: Optional[str] = None,
        ignore_nulls: bool = True,
        alias_name: Optional[str] = None,
    ):
        super().__init__(
            alias_name if alias_name else "SubstrAndSum",
            on=on,
            ignore_nulls=ignore_nulls,
            zero_factory=lambda: 0,
        )

    def aggregate_block(self, block: Block) -> AggType:
        total = 0  # avoid shadowing the built-in sum()
        for v in block["value"]:
            total += int(v.as_py()[0:2])
        return total

    def combine(self, current_accumulator: AggType, new: AggType) -> AggType:
        return current_accumulator + new

aggregation = SubstrAndSum()
table = pq.read_table('~/part-01018-8710c510-d42e-4324-ac8b-37184c1541e4-c000.zstd.parquet')
sort_key = ReducingAggregation._get_sort_key(["length"])
indices = pac.sort_indices(table, sort_keys=sort_key.to_arrow_sort_args())
sort_table = take_table(table, indices)

start = time.time()
aggregate_table = BlockAccessor.for_block(sort_table)._aggregate(sort_key, [aggregation])
end = time.time()
print("time: ", (end-start))
```
The test results are:
|Data Size|CPU spec|Code Version|Time consumed|
| --- | --- | --- | --- |
|6,276,904 records|Apple M4|original|20 s|
|6,276,904 records|Apple M4|optimized|5 s|
|6,276,904 records|Intel Xeon(R) Gold 6330 CPU @ 2.00GHz|original|150 s|
|6,276,904 records|Intel Xeon(R) Gold 6330 CPU @ 2.00GHz|optimized|25 s|


---------

Signed-off-by: yifan.xie <xyfabcd@163.com>
…61300)

## Description
This PR adds a util to check for the number of alive, complete TPU
slices in a RayCluster. This PR also adds better test coverage.

This utility is used in the Ray Train elastic policy to cap the number
of workers that can be scaled by the AutoscalingCoordinator.
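The counting logic can be sketched in isolation (a hypothetical simplification: `worker_states` as `(slice_name, is_alive)` pairs stands in for inspecting a RayCluster, and the function name is illustrative, not Ray's actual API):

```python
from collections import Counter

def count_complete_slices(worker_states, workers_per_slice):
    """Count TPU slices in which every expected worker is alive.

    Hypothetical sketch of the utility described above: a slice is
    "complete" when all `workers_per_slice` workers assigned to it are
    alive. Ray's real helper inspects a RayCluster instead of a list.
    """
    alive = Counter(name for name, is_alive in worker_states if is_alive)
    return sum(1 for count in alive.values() if count >= workers_per_slice)

workers = [("slice-a", True), ("slice-a", True),
           ("slice-b", True), ("slice-b", False)]
print(count_complete_slices(workers, workers_per_slice=2))
```

A cap on elastic scaling could then be derived as `complete_slices * workers_per_slice`.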

## Related issues
#55162

Related PR: #61299

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
#61618)

## Description
This PR ensures `Node._node_labels` is initialized regardless of
`connect_only`. The original code only initialized `Node._node_labels`
when `connect_only = True`.
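The shape of the fix can be sketched as follows (a simplified stand-in, not Ray's actual `Node` class):

```python
class Node:
    """Simplified stand-in for the fix, not Ray's actual Node class."""

    def __init__(self, connect_only: bool = False):
        # Initialize unconditionally so the attribute exists on every
        # code path. Previously the assignment ran in only one branch,
        # so the other path raised AttributeError on access.
        self._node_labels = {}
        if connect_only:
            # Connecting nodes may populate labels from elsewhere (elided).
            pass

    @property
    def node_labels(self):
        return self._node_labels

print(Node().node_labels)
```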

## Related issues
Closes #61604

## Additional information
### Verification
#### Testing Script
```python
import ray

ray.init()
print(ray.runtime_context.get_runtime_context().get_node_labels())
```

#### Before Fix
```
Traceback (most recent call last):
  File "/code/labs/jakob/ray/node_labels.py", line 7, in <module>
    print(ray.runtime_context.get_runtime_context().get_node_labels())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/runtime_context.py", line 598, in get_node_labels
    return worker.current_node_labels
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/_private/worker.py", line 634, in current_node_labels
    return self.node.node_labels
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/_private/node.py", line 685, in node_labels
    return self._node_labels
           ^^^^^^^^^^^^^^^^^
AttributeError: 'Node' object has no attribute '_node_labels'. Did you mean: 'node_labels'?
```

#### After Fix
```
{'ray.io/node-id': 'b32f99e4d621a79a6a001662c1e0fd54b929765b0f69bb12e263c2b7'}
```

---------

Signed-off-by: dancingactor <s990346@gmail.com>
…61363)

## Description

This PR continues the migration from `ray._private` to `ray._common` so
that Ray Serve (and other libraries) do not depend on internal private
APIs. It migrates `tls_utils` and logging constants out of `_private`
into `_common`, and updates all consumers (Serve, dashboard, client
server) to import from the new locations.

**Changes:**

1. **`ray._common.tls_utils`**  
- Moved `generate_self_signed_tls_certs()` from `_private.tls_utils`
into `_common.tls_utils` (it only depended on `_common.network_utils`).
   - `_private.tls_utils` re-exports it for backward compatibility.  
- **Serve:** `serve/tests/test_https_proxy.py` now imports from
`ray._common.tls_utils`.
- **Other consumers:** `_private/grpc_utils.py`,
`_private/test_utils.py`, `dashboard/agent.py`,
`util/client/server/proxier.py`, `util/client/server/server.py` updated.

2. **`ray._common.logging_constants`**  
- New module containing `LOGRECORD_STANDARD_ATTRS` (frozenset),
`LOGGER_FLATTEN_KEYS`, and `LogKey` enum — all moved from
`_private.ray_logging.constants`.
- `_private.ray_logging.constants` is deleted;
`_private.ray_logging.logging_config` now imports from `_common`.
- **Serve:** `serve/schema.py` now imports `LOGRECORD_STANDARD_ATTRS`
from `ray._common.logging_constants`.

3. **`ray._common.filters` / `ray._common.formatters`**  
- Updated internal imports to use `ray._common.logging_constants`
instead of `ray._private.ray_logging.constants`.

**Tests:**  
- New `ray/_common/tests/test_logging_constants.py` — tests the full
`LOGRECORD_STANDARD_ATTRS` frozenset, all `LogKey` enum members, and
`LOGGER_FLATTEN_KEYS`.
- New `ray/_common/tests/test_tls_utils.py` — tests
`generate_self_signed_tls_certs`.
- Lint: ruff check and format pass on all changed files.

## Related issues

Fixes #53478 (partial: migrates Serve-side imports for `tls_utils` and
logging constants; remaining items like `runtime_env_uri` and
`worker_compat` can be done in follow-up PRs).

## Additional information

- **Backward compatibility:** Existing code that imports
`generate_self_signed_tls_certs` from `_private.tls_utils` continues to
work via re-export. `_private.ray_logging.constants` is removed since
all consumers have been updated.
- **Scope:** Some Serve files still use `ray._private.worker.*` as
attribute access (e.g. `global_worker`, `_global_node`) without a direct
import; migrating those would require exposing them via `_common` and is
left for a separate PR.
- Sample PRs referenced in the issue: #53457 (signal/semaphore to
_common), #53652 (wait_for_condition to _common).
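The backward-compatibility re-export pattern can be demonstrated in isolation (module names here are stand-ins for illustration, not Ray's real import paths, and the stub function body is fabricated):

```python
import sys
import types

# New home of the function (stand-in for ray._common.tls_utils).
common = types.ModuleType("common_tls_utils")

def generate_self_signed_tls_certs():
    # Stub body for illustration only.
    return "cert", "key"

common.generate_self_signed_tls_certs = generate_self_signed_tls_certs
sys.modules["common_tls_utils"] = common

# Old location (stand-in for ray._private.tls_utils): a bare re-export,
# so existing imports keep resolving to the same object.
private = types.ModuleType("private_tls_utils")
private.generate_self_signed_tls_certs = common.generate_self_signed_tls_certs
sys.modules["private_tls_utils"] = private

import private_tls_utils  # old import path still works
assert (private_tls_utils.generate_self_signed_tls_certs
        is common.generate_self_signed_tls_certs)
print("re-export ok")
```

In the real codebase the shim is just a `from ray._common.tls_utils import generate_self_signed_tls_certs` line left in the old module.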

---------

Signed-off-by: mkdev11 <MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
## Description

Add a user guide for using Ray token auth with Kubernetes RBAC (via
`RAY_ENABLE_K8S_TOKEN_AUTH`).

## Docs link
1.
https://anyscale-ray--61644.com.readthedocs.build/en/61644/cluster/kubernetes/user-guides/kuberay-auth-rbac.html

<img width="1193" height="482" alt="image"
src="https://github.com/user-attachments/assets/d2cc1c91-0be0-477a-8417-9f98a26c1d02"
/>


2.
https://anyscale-ray--61644.com.readthedocs.build/en/61644/cluster/kubernetes/user-guides.html
<img width="674" height="72" alt="image"
src="https://github.com/user-attachments/assets/5def7875-c7a0-447a-b6ab-dfe838a74277"
/>

## Related issues
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…60659)

## Description

Add observability for `fallback_strategy` in State API and GCS.

While Ray currently provides visibility for `label_selector` (#53423),
there is no mechanism to observe the `fallback_strategy` from outside
the system.

This PR exposes `fallback_strategy` in `TaskInfoEntry` and
`ActorTableData`. The ability to read and record `fallback_strategy` is
essential for our custom autoscaler development. When the primary
`label_selector` constraints cannot be met, the autoscaler must access
these recorded fallback strategies to prioritize and allocate
alternative devices.

Beyond autoscaling, adding this feature will provide a better debugging
experience by allowing users to transparently track the entire
scheduling intent, including the `fallback_strategy` for both tasks and
actors.

## Related issues

Related to #51564

## Additional information
```python
from ray import serve
import ray

@serve.deployment(
    name="soft_docker_deployment",
    ray_actor_options={
        "label_selector": {"docker-image": "in(test-image)"},
        "fallback_strategy": [
            {"label_selector": {"docker-image": "in(test-image2)"}},
        ],
    },
)
class SoftDockerDeployment:
    def __call__(self, request):
        node_labels = ray.get_runtime_context().get_node_labels()
        return {
            "message": "Hello from soft-docker deployment!",
            "node_labels": node_labels,
        }

if __name__ == "__main__":
    serve.start(http_options={"host": "0.0.0.0", "port": 8000})
    serve.run(SoftDockerDeployment.bind())
```
#### GlobalStateAccessor.get_actor_table
<img width="1224" height="1076" alt="image"
src="https://github.com/user-attachments/assets/5c66a483-9fce-46a1-a4e7-86874f6a8b27"
/>

#### ray list actors --detail
<img width="836" height="724" alt="image"
src="https://github.com/user-attachments/assets/d99c1c6f-b0f7-4d25-9638-4a2fdd805a0d"
/>

---------

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
## Description

- Clean up `BundleClass`, `BlockMetadata`, `BlockSlice`, and other
classes so they are properly frozen.
- Add `PandasBlockSchema` as a dataclass.
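What "properly frozen" buys can be shown with a minimal dataclass (a simplified stand-in; Ray's real `PandasBlockSchema` has different fields):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class PandasBlockSchema:
    # Simplified stand-in for illustration only.
    names: tuple
    types: tuple

schema = PandasBlockSchema(names=("a", "b"), types=("int64", "object"))
try:
    schema.names = ("c",)  # mutation on a frozen dataclass is rejected
except FrozenInstanceError:
    print("frozen: mutation rejected")
```

Frozen instances are hashable by default and safe to share across code paths without defensive copies.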


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…ehavior for resumption (#61510)

This PR avoids a race condition causing a flaky validation resumption test by:
1) Changing `after_controller_state_update` to do nothing if the state
is terminal. We will process validation tasks from `FINISHED` and
`ERRORED` train runs in `before_controller_shutdown` and cancel
validation tasks in `ABORTED` train runs in `before_controller_abort`.
2) Changing `test_report_validation_fn_resumption` to SIGINT a process
(the same pattern as `test_sigint_abort`) rather than `ray.cancel` a
task (the old pattern) to deterministically test the graceful abortion
path.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Joe Cotant <joe@anyscale.com>
Restrict Slack notifications to only labeled issues and PRs.

Signed-off-by: Joe Cotant <joe@anyscale.com>
@pull pull bot merged commit 17c122c into garymm:master Mar 13, 2026