Add Slurm support to rrun with PMIx-based coordination #775
rapids-bot[bot] merged 75 commits into rapidsai:main
Conversation
gforsyth
left a comment
The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes
gforsyth
left a comment
Oh, ok, I'm not part of the cmake codeowners here (correctly) -- approving packaging changes.
Thanks Gil, appreciate it. Would indeed be nice to have Kyle review CMake as well, thanks for tagging him.
wence-
left a comment
I have some small questions, and I think we are not being consistent with cleanup of PMIx-allocated data everywhere. Overall this is looking good, though.
cpp/tools/rrun.cpp
Outdated
```cpp
// Capture all parameters by value to avoid any potential issues
int captured_global_rank = global_rank;
int captured_local_rank = local_rank;
int captured_total_ranks = total_ranks;

return fork_with_piped_stdio(
    out_fd_stdout,
    out_fd_stderr,
    /*combine_stderr*/ false,
    [&cfg, captured_global_rank, captured_local_rank, captured_total_ranks]() {
        // Set custom environment variables first (can be overridden by specific vars)
        for (auto const& env_pair : cfg.env_vars) {
            setenv(env_pair.first.c_str(), env_pair.second.c_str(), 1);
        }
        setenv("RAPIDSMPF_RANK", std::to_string(captured_global_rank).c_str(), 1);
        setenv("RAPIDSMPF_NRANKS", std::to_string(captured_total_ranks).c_str(), 1);
```

```cpp
for (size_t i = 0; i < pids.size(); ++i) {
    int status;
    while (true) {
        pid_t result = waitpid(pids[i], &status, 0);
        if (result < 0) {
            if (errno == EINTR) {
                // Retry waitpid for the same pid
                continue;
            }
            std::cerr << "Error waiting for rank " << i << ": "
                      << std::strerror(errno) << std::endl;
            overall_status = 1;
            break;
```
Why is it not sufficient to capture by value in the lambda capture? Also, we're not capturing the cfg by value...

Sorry, probably a leftover from previous behavior. Removed.
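The capture point above can be illustrated in isolation: listing the parameters directly in the lambda's capture list copies them at creation time, so no named `captured_*` temporaries are needed. A minimal sketch (the helper name below is hypothetical, not part of rrun):

```cpp
#include <cassert>
#include <functional>

// Sketch: capturing function parameters by value directly in the capture
// list. The ints are copied when the lambda is created, so they remain valid
// even after the enclosing scope's variables change or go away.
std::function<int()> make_child_setup(int global_rank, int total_ranks) {
    return [global_rank, total_ranks]() { return global_rank * 1000 + total_ranks; };
}
```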
```cpp
        setenv(env_pair.first.c_str(), env_pair.second.c_str(), 1);
    }

    apply_topology_bindings(cfg, gpu_id, cfg.verbose);
```
OK, so in passthrough mode, the rrun binary does two things:
- remap SLURM_ envvars to RAPIDSMPF_ ones
- apply some process affinity bindings (based on the selected GPU ID)
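The envvar remapping step can be sketched as follows. This is an illustration of the pattern, not rrun's actual code: the helper name and the exact list of variables here are assumptions.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Illustrative sketch of remapping Slurm-provided environment variables to
// RAPIDSMPF_ ones inside the launched process.
inline void remap_slurm_env() {
    struct Mapping { char const* from; char const* to; };
    static constexpr Mapping mappings[] = {
        {"SLURM_PROCID", "RAPIDSMPF_RANK"},      // assumed source variables
        {"SLURM_NTASKS", "RAPIDSMPF_NRANKS"},
    };
    for (auto const& m : mappings) {
        if (char const* value = std::getenv(m.from)) {
            // Overwrite so the remapped value always wins in the child.
            setenv(m.to, value, /*overwrite=*/1);
        }
    }
}
```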
cpp/src/bootstrap/slurm_backend.cpp
Outdated
```cpp
static PmixGlobalState& instance() {
    static PmixGlobalState state;
    return state;
}
```
Question: Is it going to be problematic that the dtor here is going to run below main?

Actually this is not necessary anymore; it was an artifact of an old implementation. I think we can now finalize in the destructor without problems. Removed the global state and moved the finalizer to the destructor.
cpp/src/bootstrap/slurm_backend.cpp
Outdated
```cpp
    std::array<char, PMIX_MAX_NSLEN + 1> const& nspace, std::string const& operation_name
) {
    pmix_proc_t proc;
    PMIX_PROC_CONSTRUCT(&proc);
```
Do we need to PMIX_PROC_DESTRUCT?

Good catch, fixed here and other missing entries as well.
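One generic way to keep C-style construct/destruct pairs (like PMIX_PROC_CONSTRUCT / PMIX_PROC_DESTRUCT) balanced on every exit path is a small RAII guard. This is a general-purpose sketch of the pattern, not the fix the PR actually applied:

```cpp
#include <cassert>
#include <utility>

// Runs an arbitrary cleanup action when the guard leaves scope, so the
// matching "destruct" call cannot be skipped by early returns or exceptions.
template <typename F>
class ScopeExit {
  public:
    explicit ScopeExit(F fn) : fn_(std::move(fn)) {}
    ~ScopeExit() { fn_(); }
    ScopeExit(ScopeExit const&) = delete;
    ScopeExit& operator=(ScopeExit const&) = delete;

  private:
    F fn_;
};
```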
cpp/src/bootstrap/slurm_backend.cpp
Outdated
```cpp
        std::memcpy(data, bcast_data.data(), size);
    }

    barrier();
```
question: What is this barrier for?

The barrier here is exclusively to prevent processes from continuing until they have all received the data. Perhaps I'm being overly cautious, but I'm not certain we can always guarantee that it is safe for some processes to continue while others may not have finished retrieving this data. I can try to remove it if you prefer.

And I have now realized we're not using broadcast for anything at the moment, only the put/sync directly, so I've removed it entirely. With that, the API docs for put()/get() needed to be updated, so I've done that too.
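The broadcast-over-put/get pattern being discussed can be illustrated with an in-memory stand-in for the PMIx key store (PMIx itself is not used here; `MockStore` and `broadcast` are illustrative names, and the real backend would call PMIx_Put, commit, fence, then PMIx_Get):

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// In-memory stand-in for a PMIx-style key/value store.
struct MockStore {
    std::map<std::string, std::string> kv;
    void put(std::string const& key, std::string const& value) { kv[key] = value; }
    std::optional<std::string> get(std::string const& key) const {
        auto it = kv.find(key);
        if (it == kv.end()) return std::nullopt;
        return it->second;
    }
};

// Rank 0 publishes the payload; every rank then reads rank 0's key. A trailing
// barrier would only be needed if the key could be overwritten before all
// ranks had read it.
std::string broadcast(MockStore& store, int rank, std::string const& payload) {
    if (rank == 0) store.put("bcast_key", payload);
    // ...a commit + fence/sync would sit here so non-root ranks see the put...
    return store.get("bcast_key").value();
}
```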
cpp/src/bootstrap/slurm_backend.cpp
Outdated
```cpp
    );
}

// Commit to make the data available
```

Suggested change:
```cpp
// Commit to make the data available
```
cpp/src/bootstrap/slurm_backend.cpp
Outdated
```cpp
auto start = std::chrono::steady_clock::now();
auto poll_interval = std::chrono::milliseconds{100};

// Get from rank 0 specifically (since that's where the key is stored)
```
I see the confusion; it's not obvious from the implementation and you need to be aware of PMIx semantics. What happens is that PMIx_Put puts a value in its rank-local key store, making it available globally (with PMIX_GLOBAL). In broadcast, only rank 0 puts a value, while other ranks get from rank 0, and that's why we know where the key is stored, as that's currently the only use of the put/get functions. I have made an attempt to add comments that clarify that; please let me know if anything is unclear.
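The poll-until-deadline shape visible in the hunk above (a start timestamp plus a fixed poll interval) can be sketched on its own. The `try_get` callback stands in for a single PMIx_Get attempt; all names here are illustrative, not the backend's API, and the interval is shortened for the example:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Retry a lookup every poll_interval until it yields a value or the deadline
// passes; returns nullopt on timeout.
std::optional<std::string> get_with_timeout(
    std::function<std::optional<std::string>()> const& try_get,
    std::chrono::milliseconds timeout
) {
    auto start = std::chrono::steady_clock::now();
    auto poll_interval = std::chrono::milliseconds{10};
    while (true) {
        if (auto value = try_get()) return value;
        if (std::chrono::steady_clock::now() - start >= timeout) return std::nullopt;
        std::this_thread::sleep_for(poll_interval);
    }
}
```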
```cpp
    comm = std::make_shared<ucxx::UCXX>(std::move(ucxx_initialized_rank), options);
}

comm->barrier();
```
Aside (not to be solved here): I think I have observed that the implementation of the barrier here is not quite barrier-like (mainly when debugging deadlocks due to incorrect cleanup of other objects). Non-root ranks can leave the barrier before all non-root ranks have arrived (if the active message send from root to non-root advertising that the barrier has begun goes over the eager protocol).

Thanks Lawrence, that's important to investigate. I've opened #857 to do so.
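The property Lawrence describes — no participant may leave a barrier before every participant has arrived — can be shown with a minimal generation-counted barrier. This is an illustrative single-process sketch, not the UCXX implementation:

```cpp
#include <condition_variable>
#include <mutex>

// Minimal counting barrier: the last arriver bumps the generation and wakes
// everyone; earlier arrivers block until the generation changes, so nobody
// can exit before all n have arrived.
class CountingBarrier {
  public:
    explicit CountingBarrier(int n) : n_(n) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        long const gen = generation_;
        if (++arrived_ == n_) {
            arrived_ = 0;
            ++generation_;  // releases everyone waiting on this generation
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != gen; });
        }
    }

  private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int const n_;
    int arrived_ = 0;
    long generation_ = 0;
};
```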
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
cpp/src/bootstrap/slurm_backend.cpp
Outdated
```cpp
void SlurmBackend::put(std::string const& key, std::string const& value) {
    pmix_value_t pmix_value;
    PMIX_VALUE_CONSTRUCT(&pmix_value);
```

No, PMIx_Value_destruct (I've switched to the function API, replacing the deprecated macro API) should only be used with PMIx_Value_create or with a value returned by PMIx_Get, which are owning objects. PMIx_Put doesn't take ownership, and thus we use PMIx_Value_construct for a non-owning reference.
Thanks all for the reviews!

/merge
This PR adds Slurm support for rrun, enabling RapidsMPF to run without MPI. This is achieved by adding a SlurmBackend class that wraps PMIx for process coordination, implementing bootstrap operations (put/get/barrier/sync) using PMIx primitives.

The new execution mode delivers a passthrough approach with multiple tasks per node, one task per GPU. This is similar to the way MPI applications launch in Slurm, but unlike mpirun, which should not be part of the application execution, rrun must act as the launcher for the application. If rrun is omitted, Slurm will automatically fall back to MPI (if available).

Usage example:

```shell
srun \
  --mpi=pmix \
  --nodes=2 \
  --ntasks-per-node=4 \
  --cpus-per-task=36 \
  --gpus-per-task=1 \
  --gres=gpu:4 \
  rrun ./benchmarks/bench_shuffle -C ucxx
```