[pull] master from ray-project:master#834

Merged
pull[bot] merged 4 commits into garymm:master from ray-project:master
Mar 17, 2026
Conversation


@pull pull bot commented Mar 17, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

jeffreywang-anyscale and others added 4 commits March 16, 2026 18:33
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…equest completion, reduce replica update overhead (#61755)

Fixes elevated P99 latency observed when scaling Ray Serve deployments
with `max_ongoing_requests=1`. The root cause is that the queue length
cache is incremented when a request is sent (`on_send_request`) but
never decremented when the request completes, causing cache entries to
get "stuck" at values >= `max_ongoing_requests`. This forces every
subsequent routing decision to fall back to blocking probe RPCs instead
of using cached values.

This regression was introduced when `on_send_request` was added in the
router refactor (commit de1494e, Aug 2025). Prior to that (Ray <=
2.10), the cache was only updated from replica-reported values (probes
and rejection protocol responses), so there was no
increment-without-matching-decrement problem.

### Changes

**1. Decrement queue length cache on request completion (primary fix)**

Implements `decrement_queue_len_cache` in `RequestRouter` to decrement
the cache entry by 1 when a request finishes. This restores the
increment/decrement symmetry that was missing since `on_send_request`
was introduced.

With `max_ongoing_requests=1`, the routing algorithm in
`_select_from_candidate_replicas` treats any cache entry >=
`max_ongoing_requests` as "needs probing". Before this fix, every routed
request would bump the cache to 1, and it would stay there until either
the 10s TTL expired or a probe happened to refresh it. This meant the
cache was nearly useless, and most routing decisions required a blocking
probe RPC (~20-40ms round trip), directly explaining the observed P99
increase.
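The increment/decrement symmetry described above can be illustrated with a minimal sketch. The class and method shapes here are hypothetical stand-ins for Ray Serve's actual `RequestRouter` internals; only the names `on_send_request`, `decrement_queue_len_cache`, and the 10s TTL come from the description above.

```python
import time


class QueueLenCacheSketch:
    """Minimal sketch of the symmetric cache update described above.

    Hypothetical stand-in for Ray Serve's router-side queue length cache;
    the real implementation differs in structure and naming.
    """

    def __init__(self, ttl_s: float = 10.0):
        self._ttl_s = ttl_s
        self._entries = {}  # replica_id -> (queue_len, last_update_time)

    def on_send_request(self, replica_id: str) -> None:
        # Increment the cached queue length when a request is routed.
        length, _ = self._entries.get(replica_id, (0, 0.0))
        self._entries[replica_id] = (length + 1, time.monotonic())

    def decrement_queue_len_cache(self, replica_id: str) -> None:
        # The fix: decrement by 1 when the request completes. Without this,
        # entries get stuck at >= max_ongoing_requests and every routing
        # decision falls back to a blocking probe RPC.
        if replica_id in self._entries:
            length, _ = self._entries[replica_id]
            self._entries[replica_id] = (max(0, length - 1), time.monotonic())

    def get(self, replica_id: str):
        # Return None for missing or stale entries; the caller then falls
        # back to probing the replica directly.
        entry = self._entries.get(replica_id)
        if entry is None or time.monotonic() - entry[1] > self._ttl_s:
            return None
        return entry[0]


cache = QueueLenCacheSketch()
cache.on_send_request("replica-1")            # cached length -> 1
cache.decrement_queue_len_cache("replica-1")  # cached length -> 0
print(cache.get("replica-1"))                 # 0: cache stays usable
```

With `max_ongoing_requests=1`, the decrement is what keeps the cached value below the threshold between requests, so the router can keep using the cache instead of probing.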

**2. Reuse existing replica wrappers in `_update_running_replicas`**

Previously, every replica update created new `RunningReplica` wrappers
for *all* replicas, even those that hadn't changed. During scaling
storms (100+ updates with 250+ replicas each), this caused O(n)
synchronous work per update on the router's event loop.

Now reuses existing wrappers for known replicas, only creating wrappers
for genuinely new ones. This reduces per-update work from
O(all_replicas) to O(new_replicas).
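The wrapper-reuse idea can be sketched as follows. `RunningReplicaSketch` and `update_running_replicas` are hypothetical simplifications of Ray Serve's `RunningReplica` and `_update_running_replicas`, shown only to illustrate the O(new_replicas) construction cost.

```python
class RunningReplicaSketch:
    """Hypothetical stand-in for Ray Serve's RunningReplica wrapper."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id


def update_running_replicas(existing: dict, new_ids: list):
    """Rebuild the replica map, reusing wrappers for replicas we already know.

    Only genuinely new replica IDs pay the wrapper-construction cost, so a
    scaling-storm update touching 250+ replicas does O(new) work instead of
    O(all).
    """
    updated = {}
    num_created = 0
    for rid in new_ids:
        if rid in existing:
            updated[rid] = existing[rid]  # reuse, no new allocation
        else:
            updated[rid] = RunningReplicaSketch(rid)
            num_created += 1
    return updated, num_created


existing = {"r1": RunningReplicaSketch("r1")}
updated, num_created = update_running_replicas(existing, ["r1", "r2"])
print(num_created)  # 1: only "r2" needed a new wrapper
```

Replicas absent from `new_ids` are dropped implicitly, since the new map is built only from the updated ID list.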

**3. Reduce replica update log noise**

The "Got updated replicas" log line previously serialized every replica
ID (250+) into a string on every update. Changed to log only the total
count and the added/removed counts, reducing both log volume and the
synchronous formatting cost on the event loop.
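A count-only log line along these lines (function name and message format are illustrative, not the actual Ray Serve code):

```python
import logging

logger = logging.getLogger("router_sketch")


def summarize_replica_update(old_ids: set, new_ids: set):
    """Log only aggregate counts, never the full (possibly 250+) ID list.

    Computing three set sizes is cheap and constant-size to format, unlike
    serializing every replica ID into the log message.
    """
    total = len(new_ids)
    added = len(new_ids - old_ids)
    removed = len(old_ids - new_ids)
    logger.info(
        "Got updated replicas: total=%d, added=%d, removed=%d",
        total, added, removed,
    )
    return total, added, removed


print(summarize_replica_update({"a", "b"}, {"b", "c", "d"}))  # (3, 2, 1)
```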

## Load Test Results

| Scale | Client | Master QPS | Master P99 Latency | Optimized QPS | Optimized P99 Latency |
|-------|--------|------------|--------------------|---------------|-----------------------|
| **Up to 100 Users** | <img src="https://github.com/user-attachments/assets/d87573da-9fb8-4c22-b93d-04ce4cac2635" width="250"> | <img src="https://github.com/user-attachments/assets/d0de09c8-5932-4ea0-9988-cb798e8d6328" width="400"> | <img src="https://github.com/user-attachments/assets/c46dc371-c19d-4e2d-8327-b78648e2c393" width="400"> | <img src="https://github.com/user-attachments/assets/fccfd88c-6cd2-4d56-bf32-da5cec54e937" width="400"> | <img src="https://github.com/user-attachments/assets/6fc9260d-f164-415d-baa9-da8a837a1d23" width="400"> |
| **Up to 200 Users** | <img src="https://github.com/user-attachments/assets/8455e66d-9945-480c-89eb-78c1d3641e4e" width="250"> | <img src="https://github.com/user-attachments/assets/5a755c67-3588-4add-bfa5-f7e974f8b547" width="400"> | <img src="https://github.com/user-attachments/assets/2a7cfe6d-c660-4433-b350-a2b406421a70" width="400"> | <img src="https://github.com/user-attachments/assets/1ecea44d-2eb0-445f-a9e2-70b61eb4eba9" width="400"> | <img src="https://github.com/user-attachments/assets/e683bffb-9c45-4755-a86c-c6f39f84568b" width="400"> |

---------

Signed-off-by: abrar <abrar@anyscale.com>
…er (#61299)

## Description
This PR implements support for elastic training on TPUs using the
`JaxTrainer` API and the elastic scaling policy.

Specifically, this PR utilizes a new TPU utility
`get_num_ready_tpu_slices` to return the number of full, ready TPU
slices in the RayCluster and then adjusts the `_count_possible_workers`
calculation when running on TPUs to scale atomically by TPU slices. This
PR also adds comprehensive unit tests and an e2e test for the new
support.

I'll split the `ray.util.tpu` change out into a separate PR, but have left it in for now so that the tests can pass.
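The slice-atomic adjustment to the worker count could look roughly like the sketch below. The function shape and parameters here are hypothetical; the actual `_count_possible_workers` and `get_num_ready_tpu_slices` in Ray differ, and this only illustrates rounding the worker count down to whole TPU slices.

```python
def count_possible_workers(
    num_ready_slices: int,
    workers_per_slice: int,
    requested_workers: int,
) -> int:
    """Hypothetical sketch of the TPU-aware worker-count adjustment.

    TPU slices must be scaled atomically: a slice is either fully used or
    not used at all, so the worker count is capped by the ready slices and
    then rounded down to a whole-slice multiple.
    """
    max_by_slices = num_ready_slices * workers_per_slice
    usable = min(requested_workers, max_by_slices)
    # Round down to an integral number of slices.
    return (usable // workers_per_slice) * workers_per_slice


# 3 ready slices of 4 hosts each, 10 workers requested:
# capped at 12, then rounded down to 8 (2 whole slices).
print(count_possible_workers(3, 4, 10))  # 8
```

The elastic policy can then grow or shrink the worker group only in slice-sized steps, matching how TPU slices become ready or unavailable.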

## Related issues
Implements milestone 3 of
#55162

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pull pull bot locked and limited conversation to collaborators Mar 17, 2026
@pull pull bot added the ⤵️ pull label Mar 17, 2026
@pull pull bot merged commit 4397fcb into garymm:master Mar 17, 2026