[pull] master from ray-project:master #823
Merged
pull[bot] merged 12 commits into garymm:master from ray-project:master on Mar 13, 2026
Conversation
## Description

vLLM is moving to py312 + cuda13. To accommodate that, add CUDA 13 CI and release images for ray-llm and core-gpu.

## Approach

### Ray-LLM on CUDA 13

- **CI:** `llmgpubuild-py312` build step using the `cu130` base (`.buildkite/llm.rayci.yml`)
- **Release images:** `ray-llm-anyscale` py3.12 + `cu13.0.0-cudnn` built and published (`.buildkite/release/build.rayci.yml`)
- **BYOD type:** `llm-cu130` (requires Python 3.12)
- **Python:** 3.12

### Core-GPU (Compiled Graphs) on CUDA 13

- **CI:** `coregpu-cu130-build` + `core-multi-gpu-cu130-tests` steps (`.buildkite/core.rayci.yml`)
- **Release images:** `ray-ml-anyscale` py3.10 + `cu13.0.0-cudnn` built and published through the full ray-ml chain
- **BYOD type:** `gpu-cu130` (requires Python 3.10)
- **Release tests:** `compiled_graphs_GPU_cu130` and `compiled_graphs_GPU_multinode_cu130` on L4 instances
- **Python:** 3.10

### Summary Table

| Workload | BYOD Type | Python | CUDA | Image Repo |
|----------|-----------|--------|------|-------------------|
| Ray-LLM  | llm-cu128 | 3.11 | 12.8 | anyscale/ray-llm |
| Ray-LLM  | llm-cu130 | 3.12 | 13.0 | anyscale/ray-llm |
| Core-GPU | gpu       | 3.10 | 12.1 | anyscale/ray-ml |
| Core-GPU | gpu-cu130 | 3.10 | 13.0 | anyscale/ray-ml |

## Prerequisite

#61496

## Follow up

- Add BYOD types (llm-cu130, gpu-cu130), release test entries, and compute configs for compiled graph tests on CUDA 13.
- Switch ray-llm tests to the new py312 + cuda13 image, and core-gpu (compiled graphs) tests to run on the new cuda13 image.

## Related issues

This is the first step toward resolving #61384.

---------
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Andrew Pollack-Gray <andrew@anyscale.com>
## Description

The current implementation of `iter_groups` in aggregate determines the boundary of each key by iterating row by row when splitting all rows into groups, which is an expensive operation. #58910 provides the same functionality in a function named `_iter_groups_sorted`, which is more efficient than the original. To verify, I mocked the following table:

| length | content |
| --- | --- |
| 50 | 12343432323232323232324343434234243434343423433333 |
| 60 | 123434323232323232323243434342342434343434234333331111111111 |
| ... | ... |

The `content` column is randomly generated; the `length` column is the length of the `content` column. I used the following script to test aggregate performance:

```python
import time
from typing import Optional

import pyarrow.compute as pac
import pyarrow.parquet as pq

import ray
from ray.data._internal.arrow_ops.transform_pyarrow import take_table
from ray.data._internal.execution.operators.hash_aggregate import ReducingAggregation
from ray.data.aggregate import AggregateFnV2
from ray.data.block import AggType, Block, BlockAccessor


class SubstrAndSum(AggregateFnV2):
    def __init__(
        self,
        on: Optional[str] = None,
        ignore_nulls: bool = True,
        alias_name: Optional[str] = None,
    ):
        super().__init__(
            alias_name if alias_name else "SubstrAndSum",
            on=on,
            ignore_nulls=ignore_nulls,
            zero_factory=lambda: 0,
        )

    def aggregate_block(self, block: Block) -> AggType:
        total = 0
        for v in block["value"]:
            total += int(v.as_py()[0:2])
        return total

    def combine(self, current_accumulator: AggType, new: AggType) -> AggType:
        return current_accumulator + new


aggregation = SubstrAndSum()
table = pq.read_table(
    "~/part-01018-8710c510-d42e-4324-ac8b-37184c1541e4-c000.zstd.parquet"
)
sort_key = ReducingAggregation._get_sort_key(["length"])
indices = pac.sort_indices(table, sort_keys=sort_key.to_arrow_sort_args())
sort_table = take_table(table, indices)

start = time.time()
aggregate_table = BlockAccessor.for_block(sort_table)._aggregate(sort_key, [aggregation])
end = time.time()
print("time: ", (end - start))
```

The test results:

| Data Size | CPU spec | Code Version | Time consumed |
| --- | --- | --- | --- |
| 6276904 records | Apple M4 | original | 20s |
| 6276904 records | Apple M4 | optimized | 5s |
| 6276904 records | Intel Xeon(R) Gold 6330 CPU @ 2.00GHz | original | 150s |
| 6276904 records | Intel Xeon(R) Gold 6330 CPU @ 2.00GHz | optimized | 25s |

---------
Signed-off-by: yifan.xie <xyfabcd@163.com>
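The actual `_iter_groups_sorted` lives in the Ray Data internals, but the core idea — once rows are sorted by key, each group is contiguous, so its boundaries can be found in one vectorized pass instead of a Python-level row-by-row scan — can be sketched with NumPy. Function name and shapes here are illustrative, not Ray's implementation:

```python
import numpy as np


def iter_groups_sorted(keys):
    """Yield (start, end) index ranges for each run of equal keys.

    Assumes `keys` is already sorted, so every group's rows are
    contiguous and all boundaries are found in one vectorized
    comparison rather than a per-row Python loop.
    """
    keys = np.asarray(keys)
    if keys.size == 0:
        return
    # Positions where the key differs from the previous row.
    change = np.flatnonzero(keys[1:] != keys[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [keys.size]))
    for start, end in zip(starts, ends):
        yield int(start), int(end)


# Three groups in a sorted key column:
print(list(iter_groups_sorted([50, 50, 60, 60, 60, 70])))
# → [(0, 2), (2, 5), (5, 6)]
```

The speedup reported in the table is consistent with this shift: the boundary detection cost moves from interpreted per-row comparisons to a handful of array operations.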
…61300)

## Description

This PR adds a util to check the number of alive, complete TPU slices in a RayCluster, and adds better test coverage. The utility is used in the Ray Train elastic policy to cap the number of workers that the AutoscalingCoordinator can scale.

## Related issues

#55162
Related PR: #61299

---------
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
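The PR itself defines the real utility; as a rough illustration of the idea, a slice is "complete" when every worker node belonging to it is alive. The sketch below uses hypothetical `slice_name`/`alive` fields standing in for the node and label information the real utility reads from the RayCluster:

```python
from collections import Counter


def count_complete_tpu_slices(nodes, workers_per_slice):
    """Count slices for which every expected worker node is alive.

    `nodes` is a list of dicts with hypothetical keys "slice_name"
    and "alive"; the real Ray utility derives this from cluster
    state and node labels instead.
    """
    alive_per_slice = Counter(
        n["slice_name"] for n in nodes if n["alive"]
    )
    return sum(
        1
        for count in alive_per_slice.values()
        if count >= workers_per_slice
    )


nodes = [
    {"slice_name": "slice-0", "alive": True},
    {"slice_name": "slice-0", "alive": True},
    {"slice_name": "slice-1", "alive": True},
    {"slice_name": "slice-1", "alive": False},  # slice-1 is incomplete
]
print(count_complete_tpu_slices(nodes, workers_per_slice=2))  # → 1
```

A count like this gives the elastic policy an upper bound on how many workers the AutoscalingCoordinator can safely target.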
#61618)

## Description

This PR ensures `Node._node_labels` is initialized regardless of `connect_only`. The original code only initialized `Node._node_labels` when `connect_only = True`.

## Related issues

Closes #61604

## Additional information

### Verification

#### Testing Script

```python
import ray

ray.init()
print(ray.runtime_context.get_runtime_context().get_node_labels())
```

#### Before Fix

```
Traceback (most recent call last):
  File "/code/labs/jakob/ray/node_labels.py", line 7, in <module>
    print(ray.runtime_context.get_runtime_context().get_node_labels())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/runtime_context.py", line 598, in get_node_labels
    return worker.current_node_labels
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/_private/worker.py", line 634, in current_node_labels
    return self.node.node_labels
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/_private/node.py", line 685, in node_labels
    return self._node_labels
           ^^^^^^^^^^^^^^^^^
AttributeError: 'Node' object has no attribute '_node_labels'. Did you mean: 'node_labels'?
```

#### After Fix

```
{'ray.io/node-id': 'b32f99e4d621a79a6a001662c1e0fd54b929765b0f69bb12e263c2b7'}
```

---------
Signed-off-by: dancingactor <s990346@gmail.com>
…61363)

## Description

This PR continues the migration from `ray._private` to `ray._common` so that Ray Serve (and other libraries) do not depend on internal private APIs. It migrates `tls_utils` and logging constants out of `_private` into `_common`, and updates all consumers (Serve, dashboard, client server) to import from the new locations.

**Changes:**

1. **`ray._common.tls_utils`**
   - Moved `generate_self_signed_tls_certs()` from `_private.tls_utils` into `_common.tls_utils` (it only depended on `_common.network_utils`).
   - `_private.tls_utils` re-exports it for backward compatibility.
   - **Serve:** `serve/tests/test_https_proxy.py` now imports from `ray._common.tls_utils`.
   - **Other consumers:** `_private/grpc_utils.py`, `_private/test_utils.py`, `dashboard/agent.py`, `util/client/server/proxier.py`, `util/client/server/server.py` updated.
2. **`ray._common.logging_constants`**
   - New module containing `LOGRECORD_STANDARD_ATTRS` (frozenset), `LOGGER_FLATTEN_KEYS`, and the `LogKey` enum — all moved from `_private.ray_logging.constants`.
   - `_private.ray_logging.constants` is deleted; `_private.ray_logging.logging_config` now imports from `_common`.
   - **Serve:** `serve/schema.py` now imports `LOGRECORD_STANDARD_ATTRS` from `ray._common.logging_constants`.
3. **`ray._common.filters` / `ray._common.formatters`**
   - Updated internal imports to use `ray._common.logging_constants` instead of `ray._private.ray_logging.constants`.

**Tests:**

- New `ray/_common/tests/test_logging_constants.py` — tests the full `LOGRECORD_STANDARD_ATTRS` frozenset, all `LogKey` enum members, and `LOGGER_FLATTEN_KEYS`.
- New `ray/_common/tests/test_tls_utils.py` — tests `generate_self_signed_tls_certs`.
- Lint: ruff check and format pass on all changed files.

## Related issues

Fixes #53478 (partial: migrates Serve-side imports for `tls_utils` and logging constants; remaining items like `runtime_env_uri` and `worker_compat` can be done in follow-up PRs).

## Additional information

- **Backward compatibility:** Existing code that imports `generate_self_signed_tls_certs` from `_private.tls_utils` continues to work via re-export. `_private.ray_logging.constants` is removed since all consumers have been updated.
- **Scope:** Some Serve files still use `ray._private.worker.*` via attribute access (e.g. `global_worker`, `_global_node`) without a direct import; migrating those would require exposing them via `_common` and is left for a separate PR.
- Sample PRs referenced in the issue: #53457 (signal/semaphore to _common), #53652 (wait_for_condition to _common).

---------
Signed-off-by: mkdev11 <MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
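The backward-compatibility mechanism is an ordinary re-export: the old module imports the function from its new home, so both import paths resolve to the same object. The sketch below demonstrates the pattern with synthetic modules built via `types.ModuleType` (the names `common_tls_utils`/`private_tls_utils` and the stub function are stand-ins for `ray._common.tls_utils`/`ray._private.tls_utils`):

```python
import sys
import types

# New location: the implementation lives here.
new_mod = types.ModuleType("common_tls_utils")


def generate_self_signed_tls_certs():
    # Stub standing in for the real certificate generator.
    return "cert", "key"


new_mod.generate_self_signed_tls_certs = generate_self_signed_tls_certs
sys.modules["common_tls_utils"] = new_mod

# Old location: a shim that only re-exports from the new module.
old_mod = types.ModuleType("private_tls_utils")
old_mod.generate_self_signed_tls_certs = (
    new_mod.generate_self_signed_tls_certs
)
sys.modules["private_tls_utils"] = old_mod

# Both import paths yield the very same callable.
from common_tls_utils import generate_self_signed_tls_certs as new_fn
from private_tls_utils import generate_self_signed_tls_certs as old_fn

print(new_fn is old_fn)  # → True
```

Because the shim holds a reference to the same function object, callers importing from the old path are unaffected by the move.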
## Description

Add a user guide for using Ray token auth with Kubernetes RBAC (via `RAY_ENABLE_K8S_TOKEN_AUTH`).

## Docs link

1. https://anyscale-ray--61644.com.readthedocs.build/en/61644/cluster/kubernetes/user-guides/kuberay-auth-rbac.html

   <img width="1193" height="482" alt="image" src="https://github.com/user-attachments/assets/d2cc1c91-0be0-477a-8417-9f98a26c1d02" />

2. https://anyscale-ray--61644.com.readthedocs.build/en/61644/cluster/kubernetes/user-guides.html

   <img width="674" height="72" alt="image" src="https://github.com/user-attachments/assets/5def7875-c7a0-447a-b6ab-dfe838a74277" />

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…60659)

## Description

Add observability for `fallback_strategy` in the State API and GCS. While Ray currently provides visibility for `label_selector` (#53423), there is no mechanism to observe the `fallback_strategy` from outside the system. This PR exposes `fallback_strategy` in `TaskInfoEntry` and `ActorTableData`.

The ability to read and record `fallback_strategy` is essential for our custom autoscaler development. When primary `label_selector` constraints cannot be met, the autoscaler must access these recorded fallback strategies to prioritize and allocate alternative devices. Beyond autoscaling, this feature provides a better debugging experience by letting users transparently track the entire scheduling intent, including the `fallback_strategy` for both tasks and actors.

## Related issues

Related to #51564

## Additional information

```py
from ray import serve
import ray
from ray.util.scheduling_strategies import NodeLabelSchedulingStrategy, In, Exists


@serve.deployment(
    name="soft_docker_deployment",
    ray_actor_options={
        "label_selector": {"docker-image": "in(test-image)"},
        "fallback_strategy": [
            {"label_selector": {"docker-image": "in(test-image2)"}},
        ],
    },
)
class SoftDockerDeployment:
    def __call__(self, request):
        node_labels = ray.get_runtime_context().get_node_labels()
        return {
            "message": "Hello from soft-docker deployment!",
            "node_labels": node_labels,
        }


if __name__ == "__main__":
    serve.start(http_options={"host": "0.0.0.0", "port": 8000})
    serve.run(SoftDockerDeployment.bind())
```

#### GlobalStateAccessor.get_actor_table

<img width="1224" height="1076" alt="image" src="https://github.com/user-attachments/assets/5c66a483-9fce-46a1-a4e7-86874f6a8b27" />

#### ray list actors --detail

<img width="836" height="724" alt="image" src="https://github.com/user-attachments/assets/d99c1c6f-b0f7-4d25-9638-4a2fdd805a0d" />

---------
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
## Description

- Clean up `BundleClass`, `BlockMetadata`, `BlockSlice`, and other classes so they are properly frozen.
- Added `PandasBlockSchema` as a data class.

---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
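"Properly frozen" here means instances reject attribute assignment after construction. A minimal sketch of that property, using a toy `BlockMetadata` with hypothetical fields (the real Ray class has a different, richer definition):

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BlockMetadata:
    # Hypothetical fields for illustration only.
    num_rows: int
    size_bytes: int


meta = BlockMetadata(num_rows=100, size_bytes=4096)

# Assigning to a field of a frozen instance raises FrozenInstanceError.
try:
    meta.num_rows = 200
    frozen = False
except FrozenInstanceError:
    frozen = True

print(frozen)  # → True
```

Frozen metadata classes can be safely shared across operators and hashed without fear of in-place mutation.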
…ehavior for resumption (#61510)

This PR avoids a race condition causing a flaky validation resumption test by:

1. Changing `after_controller_state_update` to do nothing if the state is terminal. We will process validation tasks from `FINISHED` and `ERRORED` train runs in `before_controller_shutdown` and cancel validation tasks in `ABORTED` train runs in `before_controller_abort`.
2. Changing `test_report_validation_fn_resumption` to SIGINT a process (the same pattern as `test_sigint_abort`) rather than `ray.cancel` a task (the old pattern) to deterministically test the graceful abortion path.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Joe Cotant <joe@anyscale.com>
Restrict Slack notifications to only labeled issues and PRs. Signed-off-by: Joe Cotant <joe@anyscale.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)