[pull] master from ray-project:master #823
Merged
pull[bot] merged 12 commits into garymm:master from ray-project:master on Mar 13, 2026
Conversation
## Description

vLLM is moving to py312 + cuda13. To accommodate that, add CUDA 13 CI and release images for ray-llm and core-gpu.

## Approach

### Ray-LLM on CUDA 13

- **CI:** `llmgpubuild-py312` build step using the `cu130` base (`.buildkite/llm.rayci.yml`)
- **Release images:** `ray-llm-anyscale` py3.12 + `cu13.0.0-cudnn` built and published (`.buildkite/release/build.rayci.yml`)
- **BYOD type:** `llm-cu130` (requires Python 3.12)
- **Python:** 3.12

### Core-GPU (Compiled Graphs) on CUDA 13

- **CI:** `coregpu-cu130-build` + `core-multi-gpu-cu130-tests` steps (`.buildkite/core.rayci.yml`)
- **Release images:** `ray-ml-anyscale` py3.10 + `cu13.0.0-cudnn` built and published through the full ray-ml chain
- **BYOD type:** `gpu-cu130` (requires Python 3.10)
- **Release tests:** `compiled_graphs_GPU_cu130` and `compiled_graphs_GPU_multinode_cu130` on L4 instances
- **Python:** 3.10

### Summary Table

| Workload | BYOD Type | Python | CUDA | Image Repo |
|----------|-----------|--------|------|-------------------|
| Ray-LLM  | llm-cu128 | 3.11 | 12.8 | anyscale/ray-llm |
| Ray-LLM  | llm-cu130 | 3.12 | 13.0 | anyscale/ray-llm |
| Core-GPU | gpu       | 3.10 | 12.1 | anyscale/ray-ml |
| Core-GPU | gpu-cu130 | 3.10 | 13.0 | anyscale/ray-ml |

## Prerequisite

#61496

## Follow up

- Add BYOD types (llm-cu130, gpu-cu130), release test entries, and compute configs for compiled graph tests on CUDA 13.
- Switch ray-llm tests to the new py312 + cuda13 image, and core-gpu (compiled graphs) tests to run on the new cuda13 image.

## Related issues

This is the first step toward resolving #61384.

---------
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Andrew Pollack-Gray <andrew@anyscale.com>
## Description

The current implementation of `iter_groups` in aggregate determines the boundary of each key by iterating row by row when splitting all rows into groups, which is an expensive operation. #58910 provides the same functionality in a function named `_iter_groups_sorted`, which is more efficient than the original. To verify, I mocked the following table:

| length | content |
| --- | --- |
| 50 | 12343432323232323232324343434234243434343423433333 |
| 60 | 123434323232323232323243434342342434343434234333331111111111 |
| ... | ... |

The `content` column is randomly generated; the `length` column is the length of the `content` column. I used the following script to test aggregate performance:

```python
import time
from typing import Optional

import pyarrow.compute as pac
import pyarrow.parquet as pq

import ray
from ray.data._internal.arrow_ops.transform_pyarrow import take_table
from ray.data._internal.execution.operators.hash_aggregate import ReducingAggregation
from ray.data.aggregate import AggregateFnV2
from ray.data.block import AggType, Block, BlockAccessor


class SubstrAndSum(AggregateFnV2):
    def __init__(
        self,
        on: Optional[str] = None,
        ignore_nulls: bool = True,
        alias_name: Optional[str] = None,
    ):
        super().__init__(
            alias_name if alias_name else "SubstrAndSum",
            on=on,
            ignore_nulls=ignore_nulls,
            zero_factory=lambda: 0,
        )

    def aggregate_block(self, block: Block) -> AggType:
        total = 0
        for v in block["value"]:
            total += int(v.as_py()[0:2])
        return total

    def combine(self, current_accumulator: AggType, new: AggType) -> AggType:
        return current_accumulator + new


aggregation = SubstrAndSum()
table = pq.read_table(
    "~/part-01018-8710c510-d42e-4324-ac8b-37184c1541e4-c000.zstd.parquet"
)
sort_key = ReducingAggregation._get_sort_key(["length"])
indices = pac.sort_indices(table, sort_keys=sort_key.to_arrow_sort_args())
sort_table = take_table(table, indices)

start = time.time()
aggregate_table = BlockAccessor.for_block(sort_table)._aggregate(sort_key, [aggregation])
end = time.time()
print("time: ", (end - start))
```

The test results:

| Data Size | CPU spec | Code Version | Time consumed |
| --- | --- | --- | --- |
| 6276904 records | Apple M4 | original | 20s |
| 6276904 records | Apple M4 | optimized | 5s |
| 6276904 records | Intel Xeon(R) Gold 6330 CPU @ 2.00GHz | original | 150s |
| 6276904 records | Intel Xeon(R) Gold 6330 CPU @ 2.00GHz | optimized | 25s |

---------
Signed-off-by: yifan.xie <xyfabcd@163.com>
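The actual `_iter_groups_sorted` lives in the Ray Data internals, but the core idea — once rows are sorted by key, each group is contiguous, so its boundaries can be found in one vectorized pass instead of a Python-level row-by-row scan — can be sketched with NumPy. Function name and shapes here are illustrative, not Ray's implementation:

```python
import numpy as np


def iter_groups_sorted(keys):
    """Yield (start, end) index ranges for each run of equal keys.

    Assumes `keys` is already sorted, so every group's rows are
    contiguous and all boundaries are found in one vectorized
    comparison rather than a per-row Python loop.
    """
    keys = np.asarray(keys)
    if keys.size == 0:
        return
    # Positions where the key differs from the previous row.
    change = np.flatnonzero(keys[1:] != keys[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [keys.size]))
    for start, end in zip(starts, ends):
        yield int(start), int(end)


# Three groups in a sorted key column:
print(list(iter_groups_sorted([50, 50, 60, 60, 60, 70])))
# → [(0, 2), (2, 5), (5, 6)]
```

The speedup reported in the table is consistent with this shift: the boundary detection cost moves from interpreted per-row comparisons to a handful of array operations.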
…61300)

## Description

This PR adds a util to check the number of alive, complete TPU slices in a RayCluster, and adds better test coverage. The utility is used in the Ray Train elastic policy to cap the number of workers that the AutoscalingCoordinator can scale.

## Related issues

#55162
Related PR: #61299

---------
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
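The PR itself defines the real utility; as a rough illustration of the idea, a slice is "complete" when every worker node belonging to it is alive. The sketch below uses hypothetical `slice_name`/`alive` fields standing in for the node and label information the real utility reads from the RayCluster:

```python
from collections import Counter


def count_complete_tpu_slices(nodes, workers_per_slice):
    """Count slices for which every expected worker node is alive.

    `nodes` is a list of dicts with hypothetical keys "slice_name"
    and "alive"; the real Ray utility derives this from cluster
    state and node labels instead.
    """
    alive_per_slice = Counter(
        n["slice_name"] for n in nodes if n["alive"]
    )
    return sum(
        1
        for count in alive_per_slice.values()
        if count >= workers_per_slice
    )


nodes = [
    {"slice_name": "slice-0", "alive": True},
    {"slice_name": "slice-0", "alive": True},
    {"slice_name": "slice-1", "alive": True},
    {"slice_name": "slice-1", "alive": False},  # slice-1 is incomplete
]
print(count_complete_tpu_slices(nodes, workers_per_slice=2))  # → 1
```

A count like this gives the elastic policy an upper bound on how many workers the AutoscalingCoordinator can safely target.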
#61618)

## Description

This PR ensures `Node._node_labels` is initialized regardless of `connect_only`. The original code only initialized `Node._node_labels` when `connect_only = True`.

## Related issues

Closes #61604

## Additional information

### Verification

#### Testing Script

```python
import ray

ray.init()
print(ray.runtime_context.get_runtime_context().get_node_labels())
```

#### Before Fix

```
Traceback (most recent call last):
  File "/code/labs/jakob/ray/node_labels.py", line 7, in <module>
    print(ray.runtime_context.get_runtime_context().get_node_labels())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/runtime_context.py", line 598, in get_node_labels
    return worker.current_node_labels
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/_private/worker.py", line 634, in current_node_labels
    return self.node.node_labels
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ray/_private/node.py", line 685, in node_labels
    return self._node_labels
           ^^^^^^^^^^^^^^^^^
AttributeError: 'Node' object has no attribute '_node_labels'. Did you mean: 'node_labels'?
```

#### After Fix

```
{'ray.io/node-id': 'b32f99e4d621a79a6a001662c1e0fd54b929765b0f69bb12e263c2b7'}
```

---------
Signed-off-by: dancingactor <s990346@gmail.com>
…61363)

## Description

This PR continues the migration from `ray._private` to `ray._common` so that Ray Serve (and other libraries) do not depend on internal private APIs. It migrates `tls_utils` and logging constants out of `_private` into `_common`, and updates all consumers (Serve, dashboard, client server) to import from the new locations.

**Changes:**

1. **`ray._common.tls_utils`**
   - Moved `generate_self_signed_tls_certs()` from `_private.tls_utils` into `_common.tls_utils` (it only depended on `_common.network_utils`).
   - `_private.tls_utils` re-exports it for backward compatibility.
   - **Serve:** `serve/tests/test_https_proxy.py` now imports from `ray._common.tls_utils`.
   - **Other consumers:** `_private/grpc_utils.py`, `_private/test_utils.py`, `dashboard/agent.py`, `util/client/server/proxier.py`, `util/client/server/server.py` updated.
2. **`ray._common.logging_constants`**
   - New module containing `LOGRECORD_STANDARD_ATTRS` (frozenset), `LOGGER_FLATTEN_KEYS`, and the `LogKey` enum — all moved from `_private.ray_logging.constants`.
   - `_private.ray_logging.constants` is deleted; `_private.ray_logging.logging_config` now imports from `_common`.
   - **Serve:** `serve/schema.py` now imports `LOGRECORD_STANDARD_ATTRS` from `ray._common.logging_constants`.
3. **`ray._common.filters` / `ray._common.formatters`**
   - Updated internal imports to use `ray._common.logging_constants` instead of `ray._private.ray_logging.constants`.

**Tests:**

- New `ray/_common/tests/test_logging_constants.py` — tests the full `LOGRECORD_STANDARD_ATTRS` frozenset, all `LogKey` enum members, and `LOGGER_FLATTEN_KEYS`.
- New `ray/_common/tests/test_tls_utils.py` — tests `generate_self_signed_tls_certs`.
- Lint: ruff check and format pass on all changed files.

## Related issues

Fixes #53478 (partial: migrates Serve-side imports for `tls_utils` and logging constants; remaining items like `runtime_env_uri` and `worker_compat` can be done in follow-up PRs).

## Additional information

- **Backward compatibility:** Existing code that imports `generate_self_signed_tls_certs` from `_private.tls_utils` continues to work via re-export. `_private.ray_logging.constants` is removed since all consumers have been updated.
- **Scope:** Some Serve files still use `ray._private.worker.*` via attribute access (e.g. `global_worker`, `_global_node`) without a direct import; migrating those would require exposing them via `_common` and is left for a separate PR.
- Sample PRs referenced in the issue: #53457 (signal/semaphore to _common), #53652 (wait_for_condition to _common).

---------
Signed-off-by: mkdev11 <MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
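The backward-compatibility mechanism is an ordinary re-export: the old module imports the function from its new home, so both import paths resolve to the same object. The sketch below demonstrates the pattern with synthetic modules built via `types.ModuleType` (the names `common_tls_utils`/`private_tls_utils` and the stub function are stand-ins for `ray._common.tls_utils`/`ray._private.tls_utils`):

```python
import sys
import types

# New location: the implementation lives here.
new_mod = types.ModuleType("common_tls_utils")


def generate_self_signed_tls_certs():
    # Stub standing in for the real certificate generator.
    return "cert", "key"


new_mod.generate_self_signed_tls_certs = generate_self_signed_tls_certs
sys.modules["common_tls_utils"] = new_mod

# Old location: a shim that only re-exports from the new module.
old_mod = types.ModuleType("private_tls_utils")
old_mod.generate_self_signed_tls_certs = (
    new_mod.generate_self_signed_tls_certs
)
sys.modules["private_tls_utils"] = old_mod

# Both import paths yield the very same callable.
from common_tls_utils import generate_self_signed_tls_certs as new_fn
from private_tls_utils import generate_self_signed_tls_certs as old_fn

print(new_fn is old_fn)  # → True
```

Because the shim holds a reference to the same function object, callers importing from the old path are unaffected by the move.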
## Description

Add a user guide for using Ray token auth with Kubernetes RBAC (via `RAY_ENABLE_K8S_TOKEN_AUTH`).

## Docs link

1. https://anyscale-ray--61644.com.readthedocs.build/en/61644/cluster/kubernetes/user-guides/kuberay-auth-rbac.html

   <img width="1193" height="482" alt="image" src="https://github.com/user-attachments/assets/d2cc1c91-0be0-477a-8417-9f98a26c1d02" />

2. https://anyscale-ray--61644.com.readthedocs.build/en/61644/cluster/kubernetes/user-guides.html

   <img width="674" height="72" alt="image" src="https://github.com/user-attachments/assets/5def7875-c7a0-447a-b6ab-dfe838a74277" />

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…60659)

## Description

Add observability for `fallback_strategy` in the State API and GCS. While Ray currently provides visibility for `label_selector` (#53423), there is no mechanism to observe the `fallback_strategy` from outside the system. This PR exposes `fallback_strategy` in `TaskInfoEntry` and `ActorTableData`.

The ability to read and record `fallback_strategy` is essential for our custom autoscaler development. When primary `label_selector` constraints cannot be met, the autoscaler must access these recorded fallback strategies to prioritize and allocate alternative devices. Beyond autoscaling, this feature provides a better debugging experience by letting users transparently track the entire scheduling intent, including the `fallback_strategy` for both tasks and actors.

## Related issues

Related to #51564

## Additional information

```py
from ray import serve
import ray
from ray.util.scheduling_strategies import NodeLabelSchedulingStrategy, In, Exists


@serve.deployment(
    name="soft_docker_deployment",
    ray_actor_options={
        "label_selector": {"docker-image": "in(test-image)"},
        "fallback_strategy": [
            {"label_selector": {"docker-image": "in(test-image2)"}},
        ],
    },
)
class SoftDockerDeployment:
    def __call__(self, request):
        node_labels = ray.get_runtime_context().get_node_labels()
        return {
            "message": "Hello from soft-docker deployment!",
            "node_labels": node_labels,
        }


if __name__ == "__main__":
    serve.start(http_options={"host": "0.0.0.0", "port": 8000})
    serve.run(SoftDockerDeployment.bind())
```

#### GlobalStateAccessor.get_actor_table

<img width="1224" height="1076" alt="image" src="https://github.com/user-attachments/assets/5c66a483-9fce-46a1-a4e7-86874f6a8b27" />

#### ray list actors --detail

<img width="836" height="724" alt="image" src="https://github.com/user-attachments/assets/d99c1c6f-b0f7-4d25-9638-4a2fdd805a0d" />

---------
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
## Description

- Clean up `BundleClass`, `BlockMetadata`, `BlockSlice`, and other classes so they are properly frozen.
- Added `PandasBlockSchema` as a data class.

---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
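"Properly frozen" here means instances reject attribute assignment after construction. A minimal sketch of that property, using a toy `BlockMetadata` with hypothetical fields (the real Ray class has a different, richer definition):

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BlockMetadata:
    # Hypothetical fields for illustration only.
    num_rows: int
    size_bytes: int


meta = BlockMetadata(num_rows=100, size_bytes=4096)

# Assigning to a field of a frozen instance raises FrozenInstanceError.
try:
    meta.num_rows = 200
    frozen = False
except FrozenInstanceError:
    frozen = True

print(frozen)  # → True
```

Frozen metadata classes can be safely shared across operators and hashed without fear of in-place mutation.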
…ehavior for resumption (#61510)

This PR avoids a race condition causing a flaky validation resumption test by:

1. Changing `after_controller_state_update` to do nothing if the state is terminal. We will process validation tasks from `FINISHED` and `ERRORED` train runs in `before_controller_shutdown` and cancel validation tasks in `ABORTED` train runs in `before_controller_abort`.
2. Changing `test_report_validation_fn_resumption` to SIGINT a process (the same pattern as `test_sigint_abort`) rather than `ray.cancel` a task (the old pattern) to deterministically test the graceful abortion path.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Joe Cotant <joe@anyscale.com>
Restrict Slack notifications to only labeled issues and PRs. Signed-off-by: Joe Cotant <joe@anyscale.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)