Skip to content

[tune](deps): Bump nevergrad from 0.4.2.post5 to 0.4.3.post4 in /python/requirements/tune#1

Closed
dependabot[bot] wants to merge 1 commit intomasterfrom
dependabot/pip/python/requirements/tune/nevergrad-0.4.3.post4
Closed

[tune](deps): Bump nevergrad from 0.4.2.post5 to 0.4.3.post4 in /python/requirements/tune#1
dependabot[bot] wants to merge 1 commit intomasterfrom
dependabot/pip/python/requirements/tune/nevergrad-0.4.3.post4

Conversation

@dependabot
Copy link

@dependabot dependabot bot commented on behalf of github Jun 8, 2021

Bumps nevergrad from 0.4.2.post5 to 0.4.3.post4.

Release notes

Sourced from nevergrad's releases.

0.4.3

This version provides a few fixes and the new multi-objective API of optimizers (you can now provide a list/array of float to tell directly). This allows fore more efficient multi-objective optimization for some optimizers (DE, NGOpt). Future work will continue to improve multi-objective capacities and aim at improving constraints management.

See CHANGELOG for details.

Changelog

Sourced from nevergrad's changelog.

Changelog

master

Breaking changes

  • copy() method of a Parameter does not change the parameters's random state anymore (it used to reset it to None #1048
  • MultiobjectiveFunction does not exist anymore #1034.
  • Choice and TransitionChoice have some of their API changed for uniformization. In particular, indices is now an ng.p.Array (and not an np.ndarray) which contains the selected indices (or index) of the Choice. The sampling is performed by specific "layers" that are applied to Data parameters #1065.
  • Parameter.set_standardized_space does not take a deterministic parameter anymore #1068. This is replaced by the more general with ng.p.helpers.determistic_sampling(parameter) context. One-shot algorithms are also updated to choose options of Choice parameters deterministically, since it is a simpler behavior to expect compared to sampling the standardized space than sampling the option stochastically from there
  • RandomSearch now defaults to sample values using the parameter.sample() instead of a Gaussian #1068. The only difference comes with bounded variables since in this case parameter.sample() samples uniformly (unless otherwise specified). The previous behavior can be obtained with RandomSearchMaker(sampler="gaussian").
  • PSO API has been slightly changed #1073
  • Parameter instances descriptor attribute is deprecated, in favor of a combinaison of an analysis function (ng.p.helpers.analyze) returning information about the parameter (eg: whether continuous, deterministic etc...) and a new function attribute which can be used to provide information about the function (eg: whether deterministic etc) #1076.
  • Half the budget alloted to solve cheap constrained is now used by a sub-optimizer #1047. More changes of constraint management will land in the near future.
  • Experimental methods Array.set_recombination and Array.set_mutation(custom=.) are removed in favor of layers changing Array behaviors #1086. Caution: this is still very experimental (and undocumented).

Important changes

  • Parameter classes are undergoing heavy changes, please open an issue if you encounter any problem. The midterm aim is to allow for simpler constraint management.
  • Parameter have been updated have undergone heavy changes to ease the handling of their tree structure ( #1029 #1036 #1038 #1043 #1044 )
  • Parameter classes have now a layer structure #1045 which simplifies changing their behavior. In future PRs this system will take charge of bounds, other constraints, sampling etc.
  • The layer structures allows disentangling bounds and log-distribution. This goal has been reached with #1053 but may create some instabilities. In particular, the representation (__repr__) of Array has changed, and their bounds attribute is no longer reliable for now. This change will eventually lead to a new syntax for settings bounds and distribution, but it's not ready yet.

... (truncated)

Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [nevergrad](https://github.com/facebookresearch/nevergrad) from 0.4.2.post5 to 0.4.3.post4.
- [Release notes](https://github.com/facebookresearch/nevergrad/releases)
- [Changelog](https://github.com/facebookresearch/nevergrad/blob/master/CHANGELOG.md)
- [Commits](facebookresearch/nevergrad@0.4.2.post5...0.4.3.post4)

---
updated-dependencies:
- dependency-name: nevergrad
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot added the dependencies Pull requests that update a dependency file label Jun 8, 2021
@dependabot @github
Copy link
Author

dependabot bot commented on behalf of github Aug 14, 2021

Superseded by #20.

@dependabot dependabot bot closed this Aug 14, 2021
@dependabot dependabot bot deleted the dependabot/pip/python/requirements/tune/nevergrad-0.4.3.post4 branch August 14, 2021 07:09
matthewdeng pushed a commit that referenced this pull request Jul 28, 2022
We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
   from /lib64/libstdc++.so.6
#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
   from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe.
matthewdeng pushed a commit that referenced this pull request Aug 21, 2025
… condition (ray-project#55367)

## Why are these changes needed?

Workers crash with a fatal `RAY_CHECK` failure when the plasma store
connection is broken during shutdown, causing the following error:
```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```
Stacktrace:
```
core_worker.cc:720 C  Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe 
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```

Stack trace flow:
1. Task lease request fails ->
`NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` ->
`MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via
`put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of
graceful handling.

Root Cause:

This is a shutdown ordering race condition:
1. Raylet shuts down first: The raylet stops its IO context
([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146))
which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing
pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in
plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error
and crashes the process.

Fix:

1. Shutdown-aware plasma operations
- Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
- Skip plasma operations entirely when shutdown is in progress.
- Prevents attempting operations on already-closed connections.

2. Targeted error handling for connection failures
- Replace blanket `RAY_CHECK_OK()` with specific error type checking.
- Handle connection errors (Broken pipe, Connection reset, Bad file
descriptor) as warnings during shutdown scenarios.
- Maintain `RAY_CHECK_OK()` for other error types to catch real issues.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
matthewdeng pushed a commit that referenced this pull request Dec 3, 2025
…BRT: "corrupted size vs. prev_size") (ray-project#58660)

## Summary

This PR fixes a heap corruption bug that causes the driver to crash with
SIGABRT. The issue is caused by a use-after-free when the `RayletClient`
object is destroyed while an asynchronous RPC callback is still pending.

## Problem Description

### Scenario

A Ray Data job (Ray 2.50.0) with pipeline `read_parquet -> filter ->
map_batches -> write` running for 4+ hours, where workers use elastic
resources with low job priority causing frequent worker deaths due to
pod preemption, crashes the driver with SIGABRT:
```
corrupted size vs. prev_size
*** SIGABRT received at time=1761916578 on cpu 30 ***
PC: @ 0x7f073569d9fc (unknown) pthread_kill
Aborted (core dumped)
```



### Trigger Conditions

After reproducing with an ASan image, Asan reveals the actual
use-after-free at:
```
 #0 0x7ff282967361 in std::__atomic_base<long>::fetch_sub(long, std::memory_order) /usr/include/c++/11/bits/atomic_base.h:628
    #1 0x7ff282967361 in std::__atomic_base<long>::operator--(int) /usr/include/c++/11/bits/atomic_base.h:377
    #2 0x7ff282967361 in operator() src/ray/raylet_rpc_client/raylet_client.cc:338
    #3 0x7ff282967361 in __invoke_impl<void, ray::rpc::RayletClient::PinObjectIDs(const ray::rpc::Address&, const std::vector<ray::ObjectID>&, const ray::ObjectID&, ray::rpc::ClientCallback<ray::rpc::PinObjectIDsReply>&)::<lambda(ray::Status, ray::rpc::PinObjectIDsReply&&)>&, const ray::Status&, ray::rpc::PinObjectIDsReply> /usr/include/c++/11/bits/invoke.h:61
    #4 0x7ff282967361 in __invoke_r<void, ray::rpc::RayletClient::PinObjectIDs(const ray::rpc::Address&, const std::vector<ray::ObjectID>&, const ray::ObjectID&, ray::rpc::ClientCallback<ray::rpc::PinObjectIDsReply>&)::<lambda(ray::Status, ray::rpc::PinObjectIDsReply&&)>&, const ray::Status&, ray::rpc::PinObjectIDsReply> /usr/include/c++/11/bits/invoke.h:111
    #5 0x7ff282967361 in _M_invoke /usr/include/c++/11/bits/std_function.h:290
    #6 0x7ff2829fbadf in std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>::operator()(ray::Status const&, ray::rpc::PinObjectIDsReply&&) const /usr/include/c++/11/bits/std_function.h:590
    #7 0x7ff2829fbadf in ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Create<ray::rpc::NodeManagerService, ray::rpc::PinObjectIDsRequest, ray::rpc::PinObjectIDsReply>(std::weak_ptr<ray::rpc::RetryableGrpcClient>, std::unique_ptr<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply>, std::default_delete<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply> > > (ray::rpc::NodeManagerService::Stub::*)(grpc::ClientContext*, ray::rpc::PinObjectIDsRequest const&, grpc::CompletionQueue*), std::shared_ptr<ray::rpc::GrpcClient<ray::rpc::NodeManagerService> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ray::rpc::PinObjectIDsRequest, std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>, long)::{lambda(ray::Status const&)#2}::operator()(ray::Status const&) const bazel-out/k8-dbg/bin/src/ray/rpc/_virtual_includes/retryable_grpc_client/ray/rpc/retryable_grpc_client.h:293
    #8 0x7ff2829fbadf in void std::__invoke_impl<void, ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Create<ray::rpc::NodeManagerService, ray::rpc::PinObjectIDsRequest, ray::rpc::PinObjectIDsReply>(std::weak_ptr<ray::rpc::RetryableGrpcClient>, std::unique_ptr<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply>, std::default_delete<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply> > > (ray::rpc::NodeManagerService::Stub::*)(grpc::ClientContext*, ray::rpc::PinObjectIDsRequest const&, grpc::CompletionQueue*), std::shared_ptr<ray::rpc::GrpcClient<ray::rpc::NodeManagerService> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ray::rpc::PinObjectIDsRequest, std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>, long)::{lambda(ray::Status const&)#2}&, ray::Status>(std::__invoke_other, ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Create<ray::rpc::NodeManagerService, ray::rpc::PinObjectIDsRequest, ray::rpc::PinObjectIDsReply>(std::weak_ptr<ray::rpc::RetryableGrpcClient>, std::unique_ptr<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply>, std::default_delete<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply> > > (ray::rpc::NodeManagerService::Stub::*)(grpc::ClientContext*, ray::rpc::PinObjectIDsRequest const&, grpc::CompletionQueue*), std::shared_ptr<ray::rpc::GrpcClient<ray::rpc::NodeManagerService> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ray::rpc::PinObjectIDsRequest, std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>, long)::{lambda(ray::Status const&)#2}&, ray::Status&&) /usr/include/c++/11/bits/invoke.h:61
    #9 0x7ff2829fbadf in std::enable_if<is_invocable_r_v<void, ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Create<ray::rpc::NodeManagerService, ray::rpc::PinObjectIDsRequest, ray::rpc::PinObjectIDsReply>(std::weak_ptr<ray::rpc::RetryableGrpcClient>, std::unique_ptr<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply>, std::default_delete<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply> > > (ray::rpc::NodeManagerService::Stub::*)(grpc::ClientContext*, ray::rpc::PinObjectIDsRequest const&, grpc::CompletionQueue*), std::shared_ptr<ray::rpc::GrpcClient<ray::rpc::NodeManagerService> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ray::rpc::PinObjectIDsRequest, std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>, long)::{lambda(ray::Status const&)#2}&, ray::Status>, void>::type std::__invoke_r<void, ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Create<ray::rpc::NodeManagerService, ray::rpc::PinObjectIDsRequest, ray::rpc::PinObjectIDsReply>(std::weak_ptr<ray::rpc::RetryableGrpcClient>, std::unique_ptr<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply>, std::default_delete<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply> > > (ray::rpc::NodeManagerService::Stub::*)(grpc::ClientContext*, ray::rpc::PinObjectIDsRequest const&, grpc::CompletionQueue*), std::shared_ptr<ray::rpc::GrpcClient<ray::rpc::NodeManagerService> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ray::rpc::PinObjectIDsRequest, std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>, long)::{lambda(ray::Status const&)#2}&, ray::Status>(ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Create<ray::rpc::NodeManagerService, ray::rpc::PinObjectIDsRequest, ray::rpc::PinObjectIDsReply>(std::weak_ptr<ray::rpc::RetryableGrpcClient>, std::unique_ptr<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply>, std::default_delete<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply> > > (ray::rpc::NodeManagerService::Stub::*)(grpc::ClientContext*, ray::rpc::PinObjectIDsRequest const&, grpc::CompletionQueue*), std::shared_ptr<ray::rpc::GrpcClient<ray::rpc::NodeManagerService> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ray::rpc::PinObjectIDsRequest, std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>, long)::{lambda(ray::Status const&)#2}&, ray::Status&&) /usr/include/c++/11/bits/invoke.h:111
    #10 0x7ff2829fbadf in std::_Function_handler<void (ray::Status), ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Create<ray::rpc::NodeManagerService, ray::rpc::PinObjectIDsRequest, ray::rpc::PinObjectIDsReply>(std::weak_ptr<ray::rpc::RetryableGrpcClient>, std::unique_ptr<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply>, std::default_delete<grpc::ClientAsyncResponseReader<ray::rpc::PinObjectIDsReply> > > (ray::rpc::NodeManagerService::Stub::*)(grpc::ClientContext*, ray::rpc::PinObjectIDsRequest const&, grpc::CompletionQueue*), std::shared_ptr<ray::rpc::GrpcClient<ray::rpc::NodeManagerService> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ray::rpc::PinObjectIDsRequest, std::function<void (ray::Status const&, ray::rpc::PinObjectIDsReply&&)>, long)::{lambda(ray::Status const&)#2}>::_M_invoke(std::_Any_data const&, ray::Status&&) /usr/include/c++/11/bits/std_function.h:290
    #11 0x7ff2834657e9 in std::function<void (ray::Status)>::operator()(ray::Status) const /usr/include/c++/11/bits/std_function.h:590
    #12 0x7ff2834657e9 in ray::rpc::RetryableGrpcClient::RetryableGrpcRequest::Fail(ray::Status const&) bazel-out/k8-dbg/bin/src/ray/rpc/_virtual_includes/retryable_grpc_client/ray/rpc/retryable_grpc_client.h:109
    #13 0x7ff2834657e9 in operator() src/ray/rpc/retryable_grpc_client.cc:30
    #14 0x7ff2834657e9 in __invoke_impl<void, ray::rpc::RetryableGrpcClient::~RetryableGrpcClient()::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
    #15 0x7ff2834657e9 in __invoke_r<void, ray::rpc::RetryableGrpcClient::~RetryableGrpcClient()::<lambda()>&> /usr/include/c++/11/bits/invoke.h:111
    #16 0x7ff2834657e9 in _M_invoke /usr/include/c++/11/bits/std_function.h:290
2025-11-14 16:15:05,608	INFO streaming_executor_state.py:511 -- Running activate tasks is {'MapBatches(QwenInfer)': ['MapBatches(QwenInfer)-79153', 'MapBatches(QwenInfer)-80170', 'MapBatches(QwenInfer)-80225', 'MapBatches(QwenInfer)-80299', 'MapBatches(QwenInfer)-82624'], 'MapBatches(drop_columns)->Write': ['MapBatches(drop_columns)->Write-25244', 'MapBatches(drop_columns)->Write-34438', 'MapBatches(drop_columns)->Write-34439', 'MapBatches(drop_columns)->Write-34440', 'MapBatches(drop_columns)->Write-34441']}
    #17 0x7ff2834e2407 in std::function<void ()>::operator()() const /usr/include/c++/11/bits/std_function.h:590
    #18 0x7ff2834e2407 in EventTracker::RecordExecution(std::function<void ()> const&, std::shared_ptr<StatsHandle>) src/ray/common/event_stats.cc:112
    #19 0x7ff2834bea54 in operator() src/ray/common/asio/instrumented_io_context.cc:110
    #20 0x7ff2834bea54 in __invoke_impl<void, instrumented_io_context::post(std::function<void()>, std::string, int64_t)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
    #21 0x7ff2834bea54 in __invoke_r<void, instrumented_io_context::post(std::function<void()>, std::string, int64_t)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:111
    #22 0x7ff2834bea54 in _M_invoke /usr/include/c++/11/bits/std_function.h:290
    #23 0x7ff28242fb5b in std::function<void ()>::operator()() const /usr/include/c++/11/bits/std_function.h:590
    #24 0x7ff28242fb5b in boost::asio::detail::binder0<std::function<void ()> >::operator()() external/boost/boost/asio/detail/bind_handler.hpp:60
    #25 0x7ff28242fb5b in void boost::asio::asio_handler_invoke<boost::asio::detail::binder0<std::function<void ()> > >(boost::asio::detail::binder0<std::function<void ()> >&, ...) external/boost/boost/asio/handler_invoke_hook.hpp:88
    #26 0x7ff28242fb5b in void boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<std::function<void ()> >, std::function<void ()> >(boost::asio::detail::binder0<std::function<void ()> >&, std::function<void ()>&) external/boost/boost/asio/detail/handler_invoke_helpers.hpp:54
    #27 0x7ff28242fb5b in void boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0<std::function<void ()> >, std::function<void ()> >(boost::asio::detail::binder0<std::function<void ()> >&, boost::asio::detail::binder0<std::function<void ()> >*) external/boost/boost/asio/detail/bind_handler.hpp:111
    #28 0x7ff28242fb5b in void boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<std::function<void ()> >, boost::asio::detail::binder0<std::function<void ()> > >(boost::asio::detail::binder0<std::function<void ()> >&, boost::asio::detail::binder0<std::function<void ()> >&) external/boost/boost/asio/detail/handler_invoke_helpers.hpp:54
    #29 0x7ff28242fb5b in boost::asio::detail::executor_op<boost::asio::detail::binder0<std::function<void ()> >, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) external/boost/boost/asio/detail/executor_op.hpp:70
    #30 0x7ff2838607d6 in boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) external/boost/boost/asio/detail/scheduler_operation.hpp:40
    #31 0x7ff2838607d6 in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) external/boost/boost/asio/detail/impl/scheduler.ipp:492
    #32 0x7ff283892d35 in boost::asio::detail::scheduler::run(boost::system::error_code&) external/boost/boost/asio/detail/impl/scheduler.ipp:210
    #33 0x7ff2838981e0 in boost::asio::io_context::run() external/boost/boost/asio/impl/io_context.ipp:63
2025-11-14 16:15:05,742	INFO streaming_executor_state.py:511 -- Running activate tasks is {'MapBatches(QwenInfer)': ['MapBatches(QwenInfer)-79153', 'MapBatches(QwenInfer)-80170', 'MapBatches(QwenInfer)-80225', 'MapBatches(QwenInfer)-80299', 'MapBatches(QwenInfer)-82624'], 'MapBatches(drop_columns)->Write': ['MapBatches(drop_columns)->Write-25244', 'MapBatches(drop_columns)->Write-34438', 'MapBatches(drop_columns)->Write-34439', 'MapBatches(drop_columns)->Write-34440', 'MapBatches(drop_columns)->Write-34441']}
    #34 0x7ff281e9d0aa in operator() src/ray/core_worker/core_worker_process.cc:193
    #35 0x7ff281e9d247 in run external/boost/boost/thread/detail/thread.hpp:120
    #36 0x7ff282503c47 in thread_proxy external/boost/libs/thread/src/pthread/thread.cpp:179
    #37 0x7ff28b013ac2 in start_thread nptl/pthread_create.c:442
    #38 0x7ff28b0a58bf  (/lib/x86_64-linux-gnu/libc.so.6+0x1268bf)

0x50c003fd3d30 is located 112 bytes inside of 120-byte region [0x50c003fd3cc0,0x50c003fd3d38)
freed by thread T68 here:
2025-11-14 16:15:05,876	INFO streaming_executor_state.py:511 -- Running activate tasks is {'MapBatches(QwenInfer)': ['MapBatches(QwenInfer)-79153', 'MapBatches(QwenInfer)-80170', 'MapBatches(QwenInfer)-80225', 'MapBatches(QwenInfer)-80299', 'MapBatches(QwenInfer)-82624'], 'MapBatches(drop_columns)->Write': ['MapBatches(drop_columns)->Write-25244', 'MapBatches(drop_columns)->Write-34438', 'MapBatches(drop_columns)->Write-34439', 'MapBatches(drop_columns)->Write-34440', 'MapBatches(drop_columns)->Write-34441']}
    #0 0x7ff28b39924f in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x7ff281eceb5f in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<ray::rpc::RayletClient, std::allocator<ray::rpc::RayletClient>, (__gnu_cxx::_Lock_policy)2> >::deallocate(std::_Sp_counted_ptr_inplace<ray::rpc::RayletClient, std::allocator<ray::rpc::RayletClient>, (__gnu_cxx::_Lock_policy)2>*, unsigned long) /usr/include/c++/11/ext/new_allocator.h:145
    #2 0x7ff281eceb5f in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<ray::rpc::RayletClient, std::allocator<ray::rpc::RayletClient>, (__gnu_cxx::_Lock_policy)2> > >::deallocate(std::allocator<std::_Sp_counted_ptr_inplace<ray::rpc::RayletClient, std::allocator<ray::rpc::RayletClient>, (__gnu_cxx::_Lock_policy)2> >&, std::_Sp_counted_ptr_inplace<ray::rpc::RayletClient, std::allocator<ray::rpc::RayletClient>, (__gnu_cxx::_Lock_policy)2>*, unsigned long) /usr/include/c++/11/bits/alloc_traits.h:496
    #3 0x7ff281eceb5f in std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<ray::rpc::RayletClient, std::allocator<ray::rpc::RayletClient>, (__gnu_cxx::_Lock_policy)2> > >::~__allocated_ptr() /usr/include/c++/11/bits/allocated_ptr.h:74
    #4 0x7ff281eceb5f in std::_Sp_counted_ptr_inplace<ray::rpc::RayletClient, std::allocator<ray::rpc::RayletClient>, (__gnu_cxx::_Lock_policy)2>::_M_destroy() /usr/include/c++/11/bits/shared_ptr_base.h:538
    #5 0x7ff282a73f0a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/11/bits/shared_ptr_base.h:184
    #6 0x7ff282a73f0a in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/11/bits/shared_ptr_base.h:705
    #7 0x7ff282a73f0a in std::__shared_ptr<ray::RayletClientInterface, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/11/bits/shared_ptr_base.h:1154
    #8 0x7ff282a73f0a in std::shared_ptr<ray::RayletClientInterface>::~shared_ptr() /usr/include/c++/11/bits/shared_ptr.h:122
    #9 0x7ff282a73f0a in std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> >::~pair() /usr/include/c++/11/bits/stl_pair.h:211
    #10 0x7ff282a73f0a in void __gnu_cxx::new_allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > >::destroy<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > >(std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> >*) /usr/include/c++/11/ext/new_allocator.h:168
    #11 0x7ff282a73f0a in void std::allocator_traits<std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > > >::destroy<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > >(std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > >&, std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> >*) /usr/include/c++/11/bits/alloc_traits.h:535
    #12 0x7ff282a73f0a in void absl::lts_20230802::container_internal::map_slot_policy<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >::destroy<std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > > >(std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > >*, absl::lts_20230802::container_internal::map_slot_type<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >*) external/com_google_absl/absl/container/internal/container_memory.h:421
    #13 0x7ff282a73f0a in void absl::lts_20230802::container_internal::FlatHashMapPolicy<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >::destroy<std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > > >(std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > >*, absl::lts_20230802::container_internal::map_slot_type<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >*) external/com_google_absl/absl/container/flat_hash_map.h:578
    #14 0x7ff282a73f0a in void absl::lts_20230802::container_internal::common_policy_traits<absl::lts_20230802::container_internal::FlatHashMapPolicy<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >, void>::destroy<std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > > >(std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > >*, absl::lts_20230802::container_internal::map_slot_type<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >*) external/com_google_absl/absl/container/internal/common_policy_traits.h:50
    #15 0x7ff282a73f0a in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >, absl::lts_20230802::hash_internal::Hash<ray::NodeID>, std::equal_to<ray::NodeID>, std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > > >::erase(absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<ray::NodeID, std::shared_ptr<ray::RayletClientInterface> >, absl::lts_20230802::hash_internal::Hash<ray::NodeID>, std::equal_to<ray::NodeID>, std::allocator<std::pair<ray::NodeID const, std::shared_ptr<ray::RayletClientInterface> > > >::iterator) external/com_google_absl/absl/container/internal/raw_hash_set.h:2183
    #16 0x7ff282a73f0a in ray::rpc::RayletClientPool::Disconnect(ray::NodeID) src/ray/raylet_rpc_client/raylet_client_pool.cc:114
    #17 0x7ff282a7aa61 in operator() src/ray/raylet_rpc_client/raylet_client_pool.cc:69
    #18 0x7ff282a7ac66 in __invoke_impl<void, ray::rpc::RayletClientPool::GetDefaultUnavailableTimeoutCallback(ray::gcs::GcsClient*, ray::rpc::RayletClientPool*, const ray::rpc::Address&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
    #19 0x7ff282a7ac66 in __invoke_r<void, ray::rpc::RayletClientPool::GetDefaultUnavailableTimeoutCallback(ray::gcs::GcsClient*, ray::rpc::RayletClientPool*, const ray::rpc::Address&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:111
    #20 0x7ff282a7ac66 in _M_invoke /usr/include/c++/11/bits/std_function.h:290
    #21 0x7ff28346a1ac in std::function<void ()>::operator()() const /usr/include/c++/11/bits/std_function.h:590
    #22 0x7ff28346a1ac in ray::rpc::RetryableGrpcClient::CheckChannelStatus(bool) src/ray/rpc/retryable_grpc_client.cc:85
    #23 0x7ff28346c06a in operator() src/ray/rpc/retryable_grpc_client.cc:45
```

This is a **non-deterministic race condition** that occurs under the
following sequence:

1. Worker A's pod is preempted → Worker A dies
2. Objects on Worker A are lost
3. Objects are found on Worker B → `PinObjectIDs` RPC is initiated
4. Worker B dies or becomes unavailable → `CheckChannelStatus` detects
this → `Disconnect` is called
5. The `RayletClient` corresponding to Worker B on the driver is
destroyed
6. RPC callback executes and accesses the already-freed `RayletClient` →
use-after-free triggers crash

Whether the use-after-free occurs depends on the relative timing of
steps 5 and 6. In scenarios with frequent pod preemptions, object
recovery frequently triggers `PinObjectIDs`, making this race condition
more likely to occur.

### Root Cause

In `RayletClient::PinObjectIDs`, the RPC callback lambda directly
captured the raw `this` pointer:

```cpp
auto rpc_callback = [this, callback = std::move(callback)](...) {
    pins_in_flight_--;  // Accessing member via 'this' pointer
    ...
};
```

If the `RayletClient` object is destroyed before the async RPC callback
executes, the callback will access freed memory through the dangling
`this` pointer, leading to heap corruption and SIGABRT with the error
message "corrupted size vs. prev_size".

## Solution

The fix ensures that the `RayletClient` object remains alive during the
asynchronous callback execution by:

1. **Inheriting from `std::enable_shared_from_this<RayletClient>`**: The
class already inherits from this base class (line 43 in
`raylet_client.h`), which enables safe shared pointer management.

2. **Capturing `shared_from_this()` in the lambda**: Instead of
capturing the raw `this` pointer, the callback now captures a
`shared_ptr` to the object. The `shared_from_this()` is called before
incrementing `pins_in_flight_` to ensure proper lifetime management:

```cpp
// Capture shared_from_this() before incrementing to ensure object lifetime
// is extended for the async callback, preventing use-after-free.
auto self = shared_from_this();
pins_in_flight_++;
auto rpc_callback = [self, callback = std::move(callback)](
                        Status status, rpc::PinObjectIDsReply &&reply) {
  self->pins_in_flight_--;
  callback(status, std::move(reply));
};
```

This ensures that the `RayletClient` object's lifetime is extended until
the callback completes, preventing the use-after-free bug. By capturing
the shared pointer before incrementing the counter, we also ensure that
if `shared_from_this()` were to fail (though it shouldn't in normal
usage), we don't leave the counter in an inconsistent state.

## Code Changes

- **File**: `src/ray/raylet_rpc_client/raylet_client.cc`
- **Method**: `RayletClient::PinObjectIDs`
- **Change**: Replace `this` capture with `shared_from_this()` capture
in the RPC callback lambda

Signed-off-by: dragongu <andrewgu@vip.qq.com>
Co-authored-by: gulonglong <gulonglong@stepfun.com>
matthewdeng pushed a commit that referenced this pull request Feb 9, 2026
…c callback during shutdown (ray-project#60048)

## Description
When a Ray worker process shuts down (e.g., during `ray.shutdown()` or
node termination), the OpenTelemetry `PeriodicExportingMetricReader`'s
background thread may still be invoking the gauge callback
(`_DoubleGaugeCallback`), which then accesses already-destroyed member
data, resulting in a use-after-free crash.

The error message:
```
(bundle_reservation_check_func pid=1543823) pure virtual method called
(bundle_reservation_check_func pid=1543823) __cxa_deleted_virtual
```


I looked further into this, and ideally, at the OpenTelemetry code
level, shutdown should be handled correctly.

[PeriodicExportingMetricReader's
shutdown](https://github.com/open-telemetry/opentelemetry-cpp/blob/f33dcc07c56c7e3b18fd18e13986f0eda965d116/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L292-L299)
waits for `worker_thread_` to finish.
```c
bool PeriodicExportingMetricReader::OnShutDown(std::chrono::microseconds timeout) noexcept
{
  if (worker_thread_.joinable())
  {
    cv_.notify_all();
    worker_thread_.join();
  }
  return exporter_->Shutdown(timeout);
}
```

And callback(`worker_thread_`) is in a [while (IsShutdown() !=
true)](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L147)
loop.

Therefore, there should be no use-after-free race condition at the
OpenTelemetry code level, and it should be safe to call
`meter_provider_->Shutdown()`.

However, the issue is that the last callback appears to access member
data that has already been destroyed during ForceFlush, which is called
before Shutdown. This member data belongs to the OpenTelemetry SDK
itself.

The more I look into it, the more it feels like this is actually a bug
in the OpenTelemetry SDK.

And even further, I found this:[[SDK] Use shared_ptr internally for
AttributesProcessor to prevent use-after-free
](open-telemetry/opentelemetry-cpp#3457)

Which is exactly the issue I encountered!

This PR upgrade the OpenTelemetry C++ SDK version to include this fix.



## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
It is quit easy to reproduced, For example, if we manually running the
`test_placement_group_reschedule_node_dead` in
`python/ray/autoscaler/v2/tests/test_e2e.py`.
```
(docs) ubuntu@devbox:~/ray$ pkill -9 -f raylet 2>/dev/null || true; pkill -9 -f gcs_server 2>/dev/null || true; ray stop --force 2>/dev/null || true; sleep 2
Did not find any active Ray processes.
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"

............

__cxa_deleted_virtual
opentelemetry::v1::sdk::metrics::FilteredOrderedAttributeMap::FilteredOrderedAttributeMap()::{lambda()#1}::operator()()
opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
opentelemetry::v1::sdk::metrics::ObserverResultT<>::Observe()
opentelemetry::v1::metrics::ObserverResultT<>::Observe<>()
ray::observability::OpenTelemetryMetricRecorder::CollectGaugeMetricValues()
(anonymous namespace)::_DoubleGaugeCallback()
opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
opentelemetry::v1::sdk::metrics::Meter::Collect()
opentelemetry::v1::sdk::metrics::MetricCollector::Produce()
opentelemetry::v1::sdk::metrics::MetricReader::Collect()
opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()
std::thread::_State_impl<>::_M_run()


............
```

after this pr, no such error message:
```
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/docs/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: asyncio-1.3.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 2 items

python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v1] Did not find any active Ray processes.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.5.171

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.5.171:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
2026-01-12 12:30:00,347 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379...
2026-01-12 12:30:00,385 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(autoscaler +11s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +11s) Resized to 0 CPUs.
(autoscaler +12s) Resized to 0 CPUs.
(autoscaler +14s) Resized to 0 CPUs.
(autoscaler +15s) Resized to 0 CPUs.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +16s) Resized to 0 CPUs.
(autoscaler +16s) Adding 1 node(s) of type type-1.
(autoscaler +16s) Adding 1 node(s) of type type-2.
(autoscaler +16s) Adding 1 node(s) of type type-3.
Killing pids 1566233
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB.
    [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a     (1) raylet crashes unexpectedly (OOM, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB.
    [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9
Stopped all 10 Ray processes.

(autoscaler +32s) Resized to 0 CPUs.
(autoscaler +32s) Adding 1 node(s) of type type-1.
(autoscaler +32s) Adding 1 node(s) of type type-2.
(autoscaler +32s) Adding 1 node(s) of type type-3.
(autoscaler +32s) Adding 1 node(s) of type type-3.
(autoscaler +32s) Removing 1 nodes of type type-3 (idle).
PASSED
python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v2] Did not find any active Ray processes.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.5.171

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.5.171:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
2026-01-12 12:30:40,170 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379...
2026-01-12 12:30:40,202 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
Stopped only 9 out of 12 Ray processes within the grace period 16 seconds. Set `-v` to see more details. Remaining processes [psutil.Process(pid=1569612, name='raylet', status='terminated'), psutil.Process(pid=1569160, name='raylet', status='terminated'), psutil.Process(pid=1568952, name='raylet', status='terminated')] will be forcefully terminated.
You can also use `--force` to forcefully terminate processes or set higher `--grace-period` to wait longer time for proper termination.
Killing pids 1568744
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB.
    [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

(raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a  (1) raylet crashes unexpectedly (OOM, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB.
    [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

PASSED

========================= 2 passed in 80.90s (0:01:20) =========================
EXIT CODE: 0
(docs) ubuntu@devbox:~/ray$ 
```

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
matthewdeng pushed a commit that referenced this pull request Mar 2, 2026
ray-project#61034)

Currently, there is a chance that a worker can crash on the `getenv`
syscall from the otel lazy initialization. We found the race is between
`setenv` on the user thread (`setenv(RBLN_DEVICES)`) and `getenv` on the
worker internal thread. However, we can't forbid `setenv` on a user's
thread; the only thing we can do is not call `getenv` once the user's
thread starts.

Here is the backtrace of the crash we found by intercepting the
`getenv`:

```
[getenv_preload] setenv name=RBLN_DEVICES value= overwrite=1
[getenv_preload] setenv backtrace:
  #0 /home/ray/getenv_trace_preload.so(setenv+0x73) [0x748a77ea870b]
  #1 ray::IDLE(+0x224d5b) [0x59f10aeead5b]
  #2 ray::IDLE(+0x13dfc3) [0x59f10ae03fc3]
  #3 ray::IDLE(_PyEval_EvalFrameDefault+0x313) [0x59f10adf3703]
  #4 ray::IDLE(+0x184bfd) [0x59f10ae4abfd]
  #5 ray::IDLE(+0x19da04) [0x59f10ae63a04]
  #6 ray::IDLE(_PyEval_EvalFrameDefault+0x115a) [0x59f10adf454a]
  #7 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc]
  #8 ray::IDLE(_PyEval_EvalFrameDefault+0x49ae) [0x59f10adf7d9e]
  #9 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc]
  #10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x9a9333) [0x748a76270333]
  #11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_PSN_ISG_IS11_bESaIS16_EERSO_INS0_17LocalMemoryBufferEEPbPSsS1E_RKSN_INS0_16ConcurrencyGroupESaIS1F_EESsbbblRKSt8optionalISsEEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S19_S1C_S1D_S1E_S1E_S1J_SsbbblS1L_EE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1X_OS19_S1C_OS1D_OS1E_S20_S1J_S1W_ObS21_S21_OlS1N_+0x1ab) [0x748a761786ab]
  #12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDESt10shared_ptrINS_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSsS15_+0x1166) [0x748a76320a96]
  #13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDESt10shared_ptrINS0_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsS15_ESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_SK_SS_SS_SW_S13_S14_S15_S15_EPS19_St12_PlaceholderILi1EES1D_ILi2EES1D_ILi3EES1D_ILi4EES1D_ILi5EES1D_ILi6EES1D_ILi7EES1D_ILi8EES1D_ILi9EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSS_S1U_OSW_OS13_OS14_OS15_S1Y_+0x87) [0x748a762e8647]
  #14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb5186d) [0x748a7641886d]
  #15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb557c5) [0x748a7641c7c5]
  #16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x748a769053eb]
  #17 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x748a768fbf0b]
  #18 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x748a7643621b]
  #19 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x748a76e503cb]
  #20 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x748a76e51d69]
  #21 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x748a76e52472]
  #22 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x132) [0x748a762e4252]
  #23 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x41) [0x748a76336bd1]
  #24 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x8a45c1) [0x748a7616b5c1]
  #25 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb]
  #26 ray::IDLE(_PyFunction_Vectorcall+0x6c) [0x59f10ae03dfc]
  #27 ray::IDLE(_PyEval_EvalFrameDefault+0x6fb) [0x59f10adf3aeb]
  #28 ray::IDLE(+0x1d5cac) [0x59f10ae9bcac]
  #29 ray::IDLE(PyEval_EvalCode+0x85) [0x59f10ae9bbf5]
  #30 ray::IDLE(+0x20732a) [0x59f10aecd32a]
  #31 ray::IDLE(+0x201d13) [0x59f10aec7d13]
  #32 ray::IDLE(+0x976be) [0x59f10ad5d6be]
  #33 ray::IDLE(_PyRun_SimpleFileObject+0x1bb) [0x59f10aec23db]
  #34 ray::IDLE(_PyRun_AnyFileObject+0x44) [0x59f10aec1f74]
  #35 ray::IDLE(Py_RunMain+0x371) [0x59f10aebf3e1]
  #36 ray::IDLE(Py_BytesMain+0x37) [0x59f10ae8f447]
  #37 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x748a77baad90]
  #38 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x748a77baae40]
  #39 ray::IDLE(+0x1c930e) [0x59f10ae8f30e]

[getenv_preload] getenv name=OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER
[getenv_preload] backtrace:
  #0 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10a9d17) [0x7321ce3c9d17]
  #1 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x10abe2b) [0x7321ce3cbe2b]
  #2 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1050ffc) [0x7321ce370ffc]
  #3 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x104f4d7) [0x7321ce36f4d7]
  #4 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1045833) [0x7321ce365833]
  #5 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6c760) [0x7321cdd8c760]
  #6 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xe69d9a) [0x7321ce189d9a]
  #7 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_16HealthCheckReplyEE15OnReplyReceivedEv+0x165) [0x7321ce18c005]
  #8 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7321cdd8e475]
  #9 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x103e3eb) [0x7321ce35e3eb]
  #10 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x1034f0b) [0x7321ce354f0b]
  #11 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xb6f21b) [0x7321cde8f21b]
  #12 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x15893cb) [0x7321ce8a93cb]
  #13 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158ad69) [0x7321ce8aad69]
  #14 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x158b472) [0x7321ce8ab472]
  #15 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xa6bb54) [0x7321cdd8bb54]
  #16 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xba2250) [0x7321cdec2250]
  #17 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7321cf66eac3]
  #18 /lib/x86_64-linux-gnu/libc.so.6(+0x1268d0) [0x7321cf7008d0]

*** SIGSEGV received at time=1770862205 on cpu 1 ***
PC: @     0x748a77bc5c1d  (unknown)  getenv
    @     0x748a77bc3520  (unknown)  (unknown)
{"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"*** SIGSEGV received at time=1770862205 on cpu 1 ***","filename":"logging.cc","lineno":474}
{"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"PC: @     0x748a77bc5c1d  (unknown)  getenv","filename":"logging.cc","lineno":474}
{"asctime":"2026-02-11 18:10:05,910","levelname":"E","message":"    @     0x748a77bc3520  (unknown)  (unknown)","filename":"logging.cc","lineno":474}
Fatal Python error: Segmentation fault
```

According to the backtrace, we can identify that it is the
`OtlpGrpcMetricExporterOptions`, [which called
`getenv(OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_BACKOFF_MULTIPLIER)`](https://github.com/open-telemetry/opentelemetry-cpp/blob/13ad05a6f431efb76995cffb1225d26b45374749/exporters/otlp/src/otlp_grpc_metric_exporter_options.cc#L47),
getting initialized by calling `InitOpenTelemetryExporter` in the
`metrics_agent_client_->WaitForServerReady()` callback, that causes the
issue.

This PR moves `OtlpGrpcMetricExporterOptions` into
`OpenTelemetryMetricRecorder` (so that we keep otel details
encapsulated) and moves its initialization early to `stats::Init()`, to
force the `OtlpGrpcMetricExporterOptions` to be initialized early, so
that we don't call `getenv` afterward.

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
matthewdeng pushed a commit that referenced this pull request Mar 2, 2026
…opy_me__upb_internal_use_only) (ray-project#61147)

We found that if the version of python profobuf library mismatches with
the raylet's, the ray client python server will hit segment fault with
this trace:

```
* thread #12, name = 'python3.11', stop reason = signal SIGSEGV: address not mapped to object (fault address=0x16)
  * frame #0: 0x0000733c1ea38a59 _raylet.so`_upb_Arena_SlowMalloc_dont_copy_me__upb_internal_use_only + 41
    frame #1: 0x0000733c1ea366ad _raylet.so`_upb_Array_Realloc_dont_copy_me__upb_internal_use_only + 285
    frame #2: 0x0000733c1c54517b _message.cpython-311-x86_64-linux-gnu.so`_upb_Decoder_DecodeMessage + 3835
    frame #3: 0x0000733c1c545f0c _message.cpython-311-x86_64-linux-gnu.so`upb_Decoder_Decode + 108
    frame #4: 0x0000733c1c543ff9 _message.cpython-311-x86_64-linux-gnu.so`upb_Decode + 201
    frame #5: 0x0000733c1c52907d _message.cpython-311-x86_64-linux-gnu.so`PyUpb_Message_MergeFromString + 237
    frame #6: 0x0000733c1c5293c4 _message.cpython-311-x86_64-linux-gnu.so`PyUpb_Message_FromString + 36
    frame #7: 0x0000568134f2d81a python3.11`cfunction_vectorcall_O(func=0x0000733c15c0b560, args=0x0000733c20200480, nargsf=<unavailable>, kwnames=<unavailable>) at methodobject.c:514:24
    frame #8: 0x0000568135270620 python3.11
```

We can see from the trace that the python profobuf library
(`message.cpython-311-x86_64-linux-gnu.so`) tried to decode a message
with a function `_upb_Array_Realloc_dont_copy_me__upb_internal_use_only`
from `_raylet.so`, which is apparently not ideal. Ideally, the python
profobuf library should not use a function from `_raylet.so`.

That happens because the current exporting rule `*ray*internal*`
accidentally matches
`_upb_Array_Realloc_dont_copy_me__upb_internal_use_only`, so we have it
exposed globally from raylet:

<img width="1162" height="169" alt="image"
src="https://github.com/user-attachments/assets/f40ae524-9675-454d-8cce-f6c43d2d901c"
/>

The problematic rule `*ray*internal*` aims to export `ray::internal`
only, so this PR makes the pattern strict and does not expose
_upb_Arena_SlowMalloc_dont_copy_me__upb_internal_use_only.

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
matthewdeng pushed a commit that referenced this pull request Mar 2, 2026
…#60480)

## Description
### Larger Issue

Refactored the execution callback system from dynamic, per-job injection
to static, eager initialization. DataContext now serves as a stateless
registry of callback factories, enabling the StreamingExecutor to load
all callbacks at startup. Each callback then self-configures or no-ops
based on the runtime environment.

### Context

This is Part 1 of a four-part change set intended to resolve the issue
described above.

Closes ray-project#60279.

---------

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
matthewdeng pushed a commit that referenced this pull request Mar 13, 2026
## Description
grpc 1.57.1 will call `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")`
on every grpc channel establishment for parsing load-balancing policy.
This causes race conditions between user tasks as they are allowed to do
setenv at anytime. This PR upgrades the grpc lib to 1.58.0 to get rid of
the `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")`.

```
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44
#6  __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78
#7  __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#8  0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#9  <signal handler called>
#10 __GI_getenv (name=0x7580a6a078c2 "PC_EXPERIMENTAL_PICKFIRST_LB_CONFIG") at ./stdlib/getenv.c:84
#11 0x00007580a67e8b8a in grpc_core::GetEnv(char const*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#12 0x00007580a649601f in grpc_core::ShufflePickFirstEnabled() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#13 0x00007580a64960ed in grpc_core::json_detail::FinishedJsonObjectLoader<grpc_core::(anonymous namespace)::PickFirstConfig, 1ul, void>::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#14 0x00007580a6787384 in grpc_core::json_detail::LoadWrapped::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#15 0x00007580a6497b07 in grpc_core::(anonymous namespace)::PickFirstFactory::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#16 0x00007580a67c18a7 in grpc_core::LoadBalancingPolicyRegistry::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#17 0x00007580a66ad9b8 in grpc_core::ClientChannel::OnResolverResultChangedLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#18 0x00007580a66ae452 in grpc_core::ClientChannel::ResolverResultHandler::ReportResult(grpc_core::Resolver::Result) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#19 0x00007580a63bc603 in grpc_core::PollingResolver::OnRequestCompleteLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#20 0x00007580a63bcb2d in std::_Function_handler<void (), grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
    () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#21 0x00007580a67cbf46 in grpc_core::WorkSerializer::WorkSerializerImpl::Run(std::function<void ()>, grpc_core::DebugLocation const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#22 0x00007580a67cc0ea in grpc_core::WorkSerializer::Run(std::function<void ()>, grpc_core::DebugLocation const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#23 0x00007580a63bd117 in grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#24 0x00007580a63b3f86 in grpc_core::(anonymous namespace)::AresClientChannelDNSResolver::AresRequestWrapper::OnHostnameResolved(void*, absl::lts_20230802::Status) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#25 0x00007580a67c44c4 in grpc_core::ExecCtx::Flush() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#26 0x00007580a63408a2 in grpc_core::ExecCtx::~ExecCtx() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#27 0x00007580a6740343 in grpc_call_start_batch () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#28 0x00007580a5e281e9 in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpRecvInitialMetadata, grpc::internal::CallOpRecvMessage<google::protobuf::MessageLite>, grpc::internal::CallOpClientSendClose, grpc::internal::CallOpClientRecvStatus>::ContinueFillOpsAfterInterception() ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#29 0x00007580a5e2d809 in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::inte--Type <RET> for more, q to quit, c to continue without paging--c
rnal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#30 0x00007580a62d76ea in opentelemetry::proto::collector::metrics::v1::MetricsService::Stub::Export(grpc::ClientContext*, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest const&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#31 0x00007580a62ca40c in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::DelegateExport(opentelemetry::proto::collector::metrics::v1::MetricsService::StubInterface*, std::unique_ptr<grpc::ClientContext, std::default_delete<grpc::ClientContext> >&&, std::unique_ptr<google::protobuf::Arena, std::default_delete<google::protobuf::Arena> >&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#32 0x00007580a62c23ed in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#33 0x00007580a62c0334 in (anonymous namespace)::OpenTelemetryMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#34 0x00007580a62e5fdf in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}::operator()() const::{lambda(opentelemetry::v1::sdk::metrics::ResourceMetrics&)#1}::operator()(opentelemetry::v1::sdk::metrics::ResourceMetrics&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#35 0x00007580a62ee7a6 in opentelemetry::v1::sdk::metrics::MetricReader::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics&)>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#36 0x00007580a62e5085 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}> > >::_M_run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#37 0x00007580a6997be0 in execute_native_thread_routine () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#38 0x00007580a7597ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#39 0x00007580a76298d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants