[pull] master from ray-project:master#833
Merged
pull[bot] merged 11 commits into garymm:master from ray-project:master on Mar 17, 2026
Conversation
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…verage for actor-failure corner cases (#61758)

In chained `DeploymentResponse` flows, a downstream replica can surface an upstream actor death while remaining healthy. Previously, the router treated these failures as local replica deaths and incorrectly removed healthy downstream replicas from routing. This change prevents that misattribution and preserves correct replica health behavior.

- Ran the repro provided in #61594 and it passes.
- Added new unit and integration tests.

---------
Signed-off-by: abrar <abrar@anyscale.com>
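The guard described above can be sketched as a toy router model. Everything here is illustrative (`Router`, `ActorDiedError`, and the replica IDs are invented names, not Ray Serve internals): the rule is to evict a replica only when the dead actor is the replica that was called, not when the replica merely relayed an upstream death.

```python
# Toy router model (names invented for illustration; not Ray Serve internals).
# Rule: evict a replica only if the dead actor *is* that replica; a relayed
# upstream actor death leaves the downstream replica healthy.

class ActorDiedError(Exception):
    def __init__(self, actor_id):
        super().__init__(actor_id)
        self.actor_id = actor_id

class Router:
    def __init__(self, replicas):
        self.replicas = set(replicas)

    def handle_failure(self, replica_id, error):
        # Previously (per the commit message) any such error evicted
        # replica_id; the fix is to check where the death originated.
        if error.actor_id == replica_id:
            self.replicas.discard(replica_id)  # genuinely dead replica

router = Router({"upstream", "downstream"})
# The downstream replica surfaces the upstream actor's death:
router.handle_failure("downstream", ActorDiedError("upstream"))
assert "downstream" in router.replicas   # healthy replica stays routable
router.handle_failure("upstream", ActorDiedError("upstream"))
assert "upstream" not in router.replicas  # a real death is still evicted
```

The essential design point is attribution: the error carries the identity of the actor that died, and the router compares it against the replica it actually called.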
## Description
Stabilizing flaky tests:

<img width="1597" height="82" alt="Screenshot 2026-03-13 at 12 21 04 PM" src="https://github.com/user-attachments/assets/2ed304c3-6aea-46f5-b17c-774da27ce008" />

## Approach
- The dashboard may be unavailable because a previous test's dashboard process is still holding port 8265 when the next test starts a new Ray cluster. The new cluster's dashboard fails to bind to that port, so `list_actors()` (which requires the dashboard) fails. Using `use_controller=True` avoids this by querying replica states through `serve.status()`, which goes through the Serve controller via GCS.
- Remove file-based synchronization and prefer signal actors.
- Relax timeouts.

---------
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
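The port-conflict failure mode is easy to reproduce with plain sockets. This is a stdlib sketch, not Ray code: an ephemeral port stands in for the dashboard's 8265, and the second bind plays the role of the new cluster's dashboard.

```python
import socket

# A lingering process holding a port (here an ephemeral one, standing in for
# the dashboard's 8265) blocks the next process from binding it.
old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old.bind(("127.0.0.1", 0))        # grab any free port
port = old.getsockname()[1]
old.listen()                      # the "previous test's dashboard", still alive

new = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    new.bind(("127.0.0.1", port))  # the "new cluster's dashboard"
    conflict = False
except OSError:
    conflict = True               # EADDRINUSE: the flaky-test failure mode
finally:
    new.close()
    old.close()

print("port conflict:", conflict)
```

This is why routing the query through the Serve controller (which needs no HTTP port of its own beyond GCS connectivity) sidesteps the flake entirely.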
…de (#61386)

## Summary
- Add HAProxy load balancing section to the Serve performance tuning guide
- Add interdeployment gRPC transport section

---------
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
…otFound Error in Chained Left Joins (#61507)

## Description
Per #60013, chained left joins fail with `ColumnNotFoundError` when the first join produces empty intermediate blocks. #60520 attempted a fix, but a refined reproduction script shows it does not resolve the underlying issue. This PR proposes a targeted fix and a deterministic regression test.

### Root cause
Using the example in #60013, when the streaming executor feeds the second join's input, the first block delivered can have zero rows. The bug is then triggered through the following sequence:
1. `_do_add_input_inner` sees that this is the first block for input sequence 0 (or 1), so it submits a `_shuffle_block` task with `send_empty_blocks=True` and immediately sets `self._has_schemas_broadcasted[input_index] = True`;
2. The remote `_shuffle_block` worker task **triggers an early return of `(empty_metadata, {})`**. No `aggregator.submit()` calls are ever made and the schema never reaches any aggregator;
3. All subsequent blocks are submitted with `send_empty_blocks=False`. Aggregators with no non-empty data are never contacted at all, leaving their bucket queues empty;
4. At finalization, `drain_queue()` returns `[]` for those partitions, so `_combine([])` builds an `ArrowBlockBuilder` with no `add_block()` calls and produces an empty table with no columns;
5. When `JoiningAggregation.finalize()` calls `pa.Table.join()` on this columnless table, it raises `ColumnNotFoundError` as observed.

### Why #60520 does not fix this issue
#60520 modifies `ArrowBlockBuilder.build()` to use a stored `self._schema` when `len(tables) == 0`. However, `self._schema` is only populated inside `add_block()` calls. When `partition_shards` is `[]` in `_combine(...)`, `self._schema` remains `None`.

### This fix
In `_shuffle_block`, when `block.num_rows == 0` and `send_empty_blocks=True`, explicitly broadcast schema-carrying empty tables to every aggregator before returning.
This mirrors the broadcast logic for non-empty blocks, ensuring every aggregator holds at least one schema-carrying block and thus finalizes correctly.

### Alternative fix
Deleting the entire early-return branch in `_shuffle_block` would also eliminate the issue. However, since the bug only affects the edge case where the first incoming block is empty, removing the full early-return branch risks performance degradation.

## Related issues
Fixes #60013 and follows up on #60520.

## Additional information
The original reproduction script in #60013 occasionally misses the error due to the uncertain order of blocks fed to the second join. To force the bug, add the following lines to the reproduction script:

```python
...
shapes = [b.shape for b in blocks]
print(f"Columns flattened via map_batches: {flatten_columns}")
print("Block shapes after first join:", shapes)

# ----- Add the following lines -----
# Force the bug.
# The streaming executor delivers blocks in completion order, so non-empty
# partitions finish faster and arrive first, letting schema broadcast succeed
# silently. Reconstructing the dataset with empty blocks at the front
# guarantees that _shuffle_block() sees a zero-row block as the very first
# block for the left input sequence, triggering the premature
# _has_schemas_broadcasted flag and the resulting (0,0) empty-table bug.
import pyarrow as pa

empty = [b for b in blocks if b.num_rows == 0]
nonempty = [b for b in blocks if b.num_rows > 0]
assert empty, "No empty blocks found — cannot reproduce the bug with this dataset."
print(f"Reordering: {len(empty)} empty blocks first, then {len(nonempty)} non-empty.")
ds_joined = ray.data.from_arrow(empty + nonempty)
print(
    "Block shapes after reordering:",
    [b.shape for b in (ray.get(ref) for ref in ds_joined.get_internal_block_refs())],
)
# ----------------------------------

# Create mapping table
# Use some of the location_ids for the mapping
shared_location_ids = location_ids[: max(1, len(location_ids) // 3)]
...
```

The augmented script forces the order of blocks so that the first block going into the second join is always empty. The new test case in `test_join.py` places the empty block first in the list fed to `from_arrow`, preserving block order and ensuring that the second join always sees the empty block first. The bug fires reliably on every run before the fix.

---------
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
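The root-cause chain above can also be modeled without Ray at all. The sketch below is a deliberately simplified stand-in (dict-based blocks; `ToyBlockBuilder` and `join_on` are invented names, not Ray Data classes): a builder whose schema is learned only in `add_block()` yields a columnless table when built from zero blocks, and a subsequent keyed join fails, mirroring the `ColumnNotFoundError`.

```python
# Toy model of the failure mode (not Ray Data's real classes).

class ToyBlockBuilder:
    def __init__(self):
        self._schema = None   # populated only by add_block(), as in #60520
        self._blocks = []

    def add_block(self, block):
        self._schema = list(block.keys())
        self._blocks.append(block)

    def build(self):
        if not self._blocks:
            # Mirrors _combine([]): self._schema is still None,
            # so no columns survive into the built table.
            return {}
        merged = {col: [] for col in self._schema}
        for block in self._blocks:
            for col in merged:
                merged[col].extend(block[col])
        return merged

def join_on(table, key):
    # Analogue of pa.Table.join() raising ColumnNotFoundError.
    if key not in table:
        raise KeyError(f"column {key!r} not found")
    return table[key]

empty_partition = ToyBlockBuilder().build()   # drain_queue() returned []
try:
    join_on(empty_partition, "id")
except KeyError as exc:
    print("join failed:", exc)
```

The fix in this PR corresponds to guaranteeing that every partition's builder receives at least one schema-carrying (possibly zero-row) block, so `build()` never takes the columnless path.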
`get_metric_dictionaries` internally calls `wait_for_condition` to block until a metric appears. When tests wrap this call inside their own `wait_for_condition` (the common pattern for checking metric values or counts), the two waits nest: the inner one times out and raises, preventing the outer loop from ever retrying. This caused intermittent failures in `test_metrics`, `test_metrics_2`, `test_metrics_3`, and `test_metrics_haproxy`, depending on how quickly Prometheus scraped.

**What**

Added a `wait: bool = True` parameter to `get_metric_dictionaries`:
- `wait=True` (default): preserves existing behavior and blocks until the metric appears.
- `wait=False`: performs a single fetch and returns immediately (possibly empty), letting the caller's `wait_for_condition` drive the retry loop.

All call sites inside test `check_*` / `metrics_available` lambdas that are already wrapped in `wait_for_condition` are switched to `wait=False`.

Signed-off-by: abrar <abrar@anyscale.com>
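The nested-wait hazard and the `wait=False` escape hatch can be illustrated with a self-contained sketch. Here `wait_for_condition` and `get_metric_dictionaries` are simplified stand-ins for the test utilities, backed by an in-memory store instead of Prometheus:

```python
import time

def wait_for_condition(pred, timeout=1.0, interval=0.01):
    # Simplified stand-in for the Ray test helper: poll until pred() is
    # truthy, else raise. Nesting two of these is the bug: the inner raise
    # aborts the outer loop instead of letting it retry.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if pred():
            return
        time.sleep(interval)
    raise TimeoutError("condition not met")

_store = {}  # stands in for whatever Prometheus has scraped so far

def get_metric_dictionaries(name, wait=True, timeout=0.1):
    if wait:
        # wait=True (default): old behavior, blocks until the metric exists.
        wait_for_condition(lambda: name in _store, timeout=timeout)
    # wait=False: single fetch, possibly empty; the caller retries.
    return _store.get(name, [])

# The fixed pattern: the OUTER wait_for_condition owns the retry loop,
# and an empty result just means "not yet", not a nested timeout.
def check():
    return len(get_metric_dictionaries("requests_total", wait=False)) == 1

_store["requests_total"] = [{"value": 1}]   # "Prometheus scrape lands"
wait_for_condition(check, timeout=1.0)
print("metric observed")
```

With `wait=True` inside `check`, a slow scrape would make the inner call raise `TimeoutError` out of the outer loop on its first iteration, which is exactly the flake being fixed.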
…les used by RLlib (#60877)

## Description
Avoids triggering v2 module loading when RLlib imports `BackendExecutor` with `RAY_TRAIN_V2_ENABLED=1`.

## Additional information
As a follow-up, we can extend this safe-import logic across the other files as well.

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
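The safe-import idea reduces to choosing the module at call time instead of at import time, so merely importing the wrapper file pulls in neither module tree. A hedged stand-in (only the `RAY_TRAIN_V2_ENABLED` flag name comes from the commit; `csv` and `json` play the roles of the two Train module trees):

```python
import os
from importlib import import_module

def load_backend_executor():
    # Decide lazily, inside the function: importing THIS file never
    # imports either candidate module, avoiding the unwanted side effects
    # of eagerly loading the wrong tree.
    if os.environ.get("RAY_TRAIN_V2_ENABLED") == "1":
        return import_module("csv")    # stand-in for one module tree
    return import_module("json")       # stand-in for the other

os.environ["RAY_TRAIN_V2_ENABLED"] = "1"
print(load_backend_executor().__name__)  # csv
```

The same pattern generalizes to any flag-gated module split: keep the `import` statement out of module scope wherever evaluating it has side effects.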
…61709)

Hand-written REST client using `requests`. This commit adds the core layer: `GitHubException`, a bare `GitHubRepo` handle, and `GitHubClient` with all HTTP methods (`_get`, `_get_paginated`, `_post`, `_patch`). Tests use the `responses` library to intercept HTTP at the transport layer.

Signed-off-by: andrew <andrew@anyscale.com>
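The paginated-GET core can be sketched with the transport injected, which is the same seam the `responses` library exploits in the real tests. Only the class and method names (`GitHubException`, `GitHubClient`, `_get`, `_get_paginated`) come from the commit; the transport signature, pagination scheme, and fake data are illustrative assumptions:

```python
class GitHubException(Exception):
    pass

class GitHubClient:
    def __init__(self, transport):
        # transport: callable (path, params) -> (status_code, json_body);
        # in the real client this would be a requests session.
        self._transport = transport

    def _get(self, path, params=None):
        status, body = self._transport(path, params or {})
        if status >= 400:
            raise GitHubException(f"GET {path} -> {status}")
        return body

    def _get_paginated(self, path):
        # Page-number pagination: keep fetching until an empty page.
        page, items = 1, []
        while True:
            batch = self._get(path, {"page": page, "per_page": 2})
            if not batch:
                return items
            items.extend(batch)
            page += 1

# Fake transport standing in for `responses`-style HTTP interception:
DATA = ["a", "b", "c"]
def fake_transport(path, params):
    start = (params["page"] - 1) * params["per_page"]
    return 200, DATA[start : start + params["per_page"]]

client = GitHubClient(fake_transport)
print(client._get_paginated("/repos/x/y/issues"))  # ['a', 'b', 'c']
```

Injecting the transport keeps the client's error handling and pagination loop testable without any network, which is presumably why the tests intercept at that layer.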
…only provider (#61732)

## Description
On a static Ray cluster, the autoscaler runs with a read-only cloud provider so that it can keep reconciling and warn about infeasible requests, but it does not do the actual scaling. However, it still emits other logs that are misleading for a static cluster. For example, we will see these annoying logs on a static Ray cluster:

<img width="876" height="366" alt="image" src="https://github.com/user-attachments/assets/42dd8027-1a1b-4cd0-834d-ba928b157ad8" />

This PR makes the event logger emit only the warnings for infeasible requests and suppresses autoscaler action logs when the cloud provider is read-only.

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
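The suppression rule reduces to one conditional in the event logger. A toy sketch (class and method names are illustrative, not the autoscaler's real API):

```python
# Toy event logger: a read-only provider still surfaces infeasible-request
# warnings but drops scaling-action logs, since no scaling will happen.

class EventLogger:
    def __init__(self, provider_read_only):
        self.provider_read_only = provider_read_only
        self.emitted = []

    def warn_infeasible(self, request):
        # Always useful: tells the user their request can never be satisfied.
        self.emitted.append(("WARN", f"infeasible request: {request}"))

    def log_action(self, action):
        if self.provider_read_only:
            return  # static cluster: "Adding N node(s)" would only mislead
        self.emitted.append(("INFO", action))

logger = EventLogger(provider_read_only=True)
logger.log_action("Adding 2 node(s) of type worker.")
logger.warn_infeasible("{'GPU': 128}")
print(logger.emitted)  # only the warning survives
```

Gating at the logger rather than at each call site keeps the reconciliation loop unchanged while silencing only the misleading output.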
…gcs client (#61666)

## Description
This PR adds the `resize_raylet_resource_instances` function to the GCS Cython client. It will be used by the autoscaler in a follow-up PR for IPPR. This PR also addresses review comments on the previous PR: #61654 (comment) and #61654 (comment).

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)