
Add Slurm support to rrun with PMIx-based coordination#775

Merged
rapids-bot[bot] merged 75 commits into rapidsai:main from pentschev:rrun-slurm
Feb 18, 2026

Conversation

@pentschev
Member

@pentschev pentschev commented Jan 11, 2026

This PR adds Slurm support for rrun, enabling RapidsMPF to run without MPI. This is achieved by adding a SlurmBackend class that wraps PMIx for process coordination, implementing the bootstrap operations (put/get/barrier/sync) with PMIx primitives.

The new execution mode delivers a passthrough approach with multiple tasks per node, one task per GPU. This is similar to how MPI applications launch under Slurm, but unlike mpirun, which should not be part of the application execution, rrun must act as the launcher for the application. If rrun is omitted, Slurm automatically falls back to MPI (if available).

Usage example:

  srun \
      --mpi=pmix \
      --nodes=2 \
      --ntasks-per-node=4 \
      --cpus-per-task=36 \
      --gpus-per-task=1 \
      --gres=gpu:4 \
      rrun ./benchmarks/bench_shuffle -C ucxx

@pentschev pentschev self-assigned this Jan 11, 2026
@pentschev pentschev added feature request New feature or request non-breaking Introduces a non-breaking change labels Jan 11, 2026
@copy-pr-bot

copy-pr-bot bot commented Jan 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@pentschev
Member Author

/ok to test


Contributor

@gforsyth gforsyth left a comment

The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes

Contributor

@gforsyth gforsyth left a comment

Oh, ok, I'm not part of the cmake codeowners here (correctly) -- approving packaging changes.

@pentschev
Member Author

The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes

Oh, ok, I'm not part of the cmake codeowners here (correctly) -- approving packaging changes.

Thanks Gil, appreciate it. Would indeed be nice to have Kyle review CMake as well, thanks for tagging him.

Contributor

@wence- wence- left a comment

I have some small questions, and I think we are not being consistent with cleanup of PMIx-allocated data everywhere. Overall this is looking good, though.

Comment on lines +1055 to +1072
// Capture all parameters by value to avoid any potential issues
int captured_global_rank = global_rank;
int captured_local_rank = local_rank;
int captured_total_ranks = total_ranks;

return fork_with_piped_stdio(
    out_fd_stdout,
    out_fd_stderr,
    /*combine_stderr*/ false,
    [&cfg, captured_global_rank, captured_local_rank, captured_total_ranks]() {
        // Set custom environment variables first (can be overridden by specific vars)
        for (auto const& env_pair : cfg.env_vars) {
            setenv(env_pair.first.c_str(), env_pair.second.c_str(), 1);
        }

        setenv("RAPIDSMPF_RANK", std::to_string(captured_global_rank).c_str(), 1);
        setenv("RAPIDSMPF_NRANKS", std::to_string(captured_total_ranks).c_str(), 1);

for (size_t i = 0; i < pids.size(); ++i) {
    int status;
    while (true) {
        pid_t result = waitpid(pids[i], &status, 0);
        if (result < 0) {
            if (errno == EINTR) {
                // Retry waitpid for the same pid
                continue;
            }
            std::cerr << "Error waiting for rank " << i << ": "
                      << std::strerror(errno) << std::endl;
            overall_status = 1;
            break;

Contributor

Why is it not sufficient to capture by value in the lambda capture? Also, we're not capturing the cfg by value...

Member Author

Sorry, probably a leftover from previous behavior. Removed.

setenv(env_pair.first.c_str(), env_pair.second.c_str(), 1);
}

apply_topology_bindings(cfg, gpu_id, cfg.verbose);
Contributor

OK, so in passthrough mode, the rrun binary does two things:

  1. remap SLURM_ envvars to RAPIDSMPF_ ones
  2. apply some process affinity bindings (based on the selected GPU ID).
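The env-var remapping in step 1 can be sketched as follows. This is an illustrative, self-contained sketch, not the actual rrun code: the helper name remap_slurm_env is hypothetical, and the exact set of remapped variables is an assumption beyond the RAPIDSMPF_RANK/RAPIDSMPF_NRANKS names quoted in the diff above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper sketching step 1: copy Slurm's per-task environment
// into the RAPIDSMPF_ names the child process reads. SLURM_PROCID and
// SLURM_NPROCS are standard Slurm task variables.
void remap_slurm_env() {
    std::vector<std::pair<char const*, char const*>> const mapping = {
        {"SLURM_PROCID", "RAPIDSMPF_RANK"},    // global rank of this task
        {"SLURM_NPROCS", "RAPIDSMPF_NRANKS"},  // total number of tasks
    };
    for (auto const& [src, dst] : mapping) {
        if (char const* value = std::getenv(src)) {
            setenv(dst, value, /*overwrite=*/1);
        }
    }
}
```

In passthrough mode this remapping would run in the forked child before exec'ing the application, so the application only ever sees the RAPIDSMPF_ names.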

Comment on lines +44 to +47
static PmixGlobalState& instance() {
static PmixGlobalState state;
return state;
}
Contributor

Question: Is it going to be problematic that the dtor here is going to run below main?

Member Author

Actually this is not necessary anymore; it was an artifact of an old implementation. I think we can now finalize in the destructor without problems. Removed the global state and moved finalization to the destructor.

std::array<char, PMIX_MAX_NSLEN + 1> const& nspace, std::string const& operation_name
) {
pmix_proc_t proc;
PMIX_PROC_CONSTRUCT(&proc);
Contributor

Do we need to PMIX_PROC_DESTRUCT?

Member Author

Good catch, fixed here and other missing entries as well.
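One way to make such construct/destruct pairs leak-proof on every exit path is an RAII guard. This is a generic illustration, not the actual fix in the PR; with PMIx the same pattern could wrap PMIX_PROC_CONSTRUCT/PMIX_PROC_DESTRUCT with pmix_proc_t as T, but here a plain struct stands in.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Generic RAII guard pairing a "construct" call with its matching
// "destruct" call, so cleanup runs on every exit path, including early
// returns and exceptions.
template <typename T>
class ScopedHandle {
  public:
    template <typename Construct, typename Destruct>
    ScopedHandle(Construct construct, Destruct destruct)
        : destruct_{std::move(destruct)} {
        construct(&object_);
    }
    ~ScopedHandle() { destruct_(&object_); }

    ScopedHandle(ScopedHandle const&) = delete;
    ScopedHandle& operator=(ScopedHandle const&) = delete;

    T* get() { return &object_; }

  private:
    T object_{};
    std::function<void(T*)> destruct_;
};
```

A hypothetical use would be `ScopedHandle<pmix_proc_t> proc{...}` passing the construct/destruct calls as lambdas, removing the need to remember PMIX_PROC_DESTRUCT at each return.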

std::memcpy(data, bcast_data.data(), size);
}

barrier();
Contributor

question: What is this barrier for?

Member Author

The barrier here is exclusively to prevent processes from continuing until they have all received the data. Perhaps I'm being overly cautious, but I'm not certain we can always guarantee that it is safe for some processes to continue while others may not have finished retrieving this data. I can try to remove it if you prefer.

Member Author

And I have now realized we're not using broadcast for anything at the moment, only the put/sync directly, so I've removed it entirely. With that the API docs for put()/get() need to be updated, so I've done that too.

);
}

// Commit to make the data available
Contributor

Suggested change
// Commit to make the data available

auto start = std::chrono::steady_clock::now();
auto poll_interval = std::chrono::milliseconds{100};

// Get from rank 0 specifically (since that's where the key is stored)
Contributor

How do we know this?

Member Author

I see the confusion: it's not obvious from the implementation, and you need to be aware of how PMIx works. What happens is that PMIx_Put stores a value in the calling rank's local key store, making it available globally (with PMIX_GLOBAL). In broadcast, only rank 0 puts a value, while the other ranks get it from rank 0; that's how we know, since that is currently the only use of the put/get functions. I've attempted to add comments that clarify this, please let me know if anything is still unclear.
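The semantics described here can be modeled with ordinary maps. This is purely a conceptual sketch of the PMIx behavior (no real PMIx calls, and the class and method names are invented for illustration): each rank owns its own key/value store, put writes only to the caller's store, so a get must address the rank that performed the put.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Conceptual model: every rank has a rank-local key/value store. put()
// writes only to the calling rank's store, so get() must name the owning
// rank -- which is why a broadcast-style exchange reads from rank 0, the
// only rank that put the value.
class RankLocalStores {
  public:
    explicit RankLocalStores(int nranks) : stores_(nranks) {}

    // Analogous to PMIx_Put + PMIx_Commit performed by `rank`.
    void put(int rank, std::string const& key, std::string const& value) {
        stores_[rank][key] = value;
    }

    // Analogous to PMIx_Get addressed at `owner`: only sees keys that
    // `owner` itself put.
    std::optional<std::string> get(int owner, std::string const& key) const {
        auto it = stores_[owner].find(key);
        if (it == stores_[owner].end()) {
            return std::nullopt;
        }
        return it->second;
    }

  private:
    std::vector<std::map<std::string, std::string>> stores_;
};
```

Under this model, querying any rank other than the one that put the key simply finds nothing, matching the "get from rank 0 specifically" comment in the diff.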

comm = std::make_shared<ucxx::UCXX>(std::move(ucxx_initialized_rank), options);
}

comm->barrier();
Contributor

Aside (not to be solved here). I think I have observed that the implementation of the barrier here is not quite barrier-like (mainly when debugging deadlocks due to incorrect cleanup of other objects). Non-root ranks can leave the barrier before all non-root ranks have arrived (if the active message send from root to non-root advertising that the barrier has begun goes over the eager protocol)

Member Author

Thanks Lawrence, that's important to investigate. I've opened #857 to do so.

Member Author

@pentschev pentschev left a comment

Thanks @wence for the review. I think I have addressed all your comments and simplified the implementation a bit more in the process. Please have another look!



void SlurmBackend::put(std::string const& key, std::string const& value) {
pmix_value_t pmix_value;
PMIX_VALUE_CONSTRUCT(&pmix_value);
Member Author

No, PMIx_Value_destruct (I've switched to the function API, replacing the deprecated macro API) should only be used with values from PMIx_Value_create or values returned by PMIx_Get, which are owning objects. PMIx_Put doesn't take ownership, so we use PMIx_Value_construct for a non-owning reference.

// Get from rank 0 specifically (since that's where the key is stored)
// Using PMIX_RANK_WILDCARD doesn't seem to work reliably
pmix_proc_t proc;
PMIX_PROC_CONSTRUCT(&proc);
Member Author

Fixed too.


@pentschev pentschev requested a review from wence- February 12, 2026 22:45
Member

@madsbk madsbk left a comment

LGTM, thanks @pentschev

@pentschev
Member Author

Thanks all for the reviews!

@pentschev
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 5ad21a6 into rapidsai:main Feb 18, 2026
89 checks passed
@pentschev pentschev deleted the rrun-slurm branch February 18, 2026 16:24
