
Add Slurm support to rrun with PMIx-based coordination#775

Merged
rapids-bot[bot] merged 75 commits into rapidsai:main from pentschev:rrun-slurm
Feb 18, 2026

Conversation

@pentschev
Member

@pentschev pentschev commented Jan 11, 2026

This PR adds Slurm support for rrun, enabling RapidsMPF to run without MPI. This is achieved by adding a SlurmBackend class that wraps PMIx for process coordination, implementing the bootstrap operations (put/get/barrier/sync) with PMIx primitives.

The new execution mode delivers a passthrough approach with multiple tasks per node, one task per GPU. This is similar to how MPI applications launch under Slurm, but unlike mpirun, which should not be part of the application execution, rrun must act as the launcher for the application. If rrun is omitted, Slurm automatically falls back to MPI (if available).

Usage example:

  srun \
      --mpi=pmix \
      --nodes=2 \
      --ntasks-per-node=4 \
      --cpus-per-task=36 \
      --gpus-per-task=1 \
      --gres=gpu:4 \
      rrun ./benchmarks/bench_shuffle -C ucxx

@pentschev pentschev self-assigned this Jan 11, 2026
@pentschev pentschev added feature request New feature or request non-breaking Introduces a non-breaking change labels Jan 11, 2026
@copy-pr-bot

copy-pr-bot bot commented Jan 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@pentschev
Member Author

/ok to test


Contributor

@gforsyth gforsyth left a comment

The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes

Contributor

@gforsyth gforsyth left a comment

Oh, ok, I'm not part of the cmake codeowners here (correctly) -- approving packaging changes.

@pentschev
Member Author

The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes

Oh, ok, I'm not part of the cmake codeowners here (correctly) -- approving packaging changes.

Thanks Gil, appreciate it. Would indeed be nice to have Kyle review CMake as well, thanks for tagging him.

Contributor

@wence- wence- left a comment

I have some small questions, and I think we are not being consistent with cleanup of PMIx-allocated data everywhere. Overall this is looking good, though.

Comment on lines +1055 to +1072
// Capture all parameters by value to avoid any potential issues
int captured_global_rank = global_rank;
int captured_local_rank = local_rank;
int captured_total_ranks = total_ranks;

return fork_with_piped_stdio(
    out_fd_stdout,
    out_fd_stderr,
    /*combine_stderr*/ false,
    [&cfg, captured_global_rank, captured_local_rank, captured_total_ranks]() {
        // Set custom environment variables first (can be overridden by specific vars)
        for (auto const& env_pair : cfg.env_vars) {
            setenv(env_pair.first.c_str(), env_pair.second.c_str(), 1);
        }

        setenv("RAPIDSMPF_RANK", std::to_string(captured_global_rank).c_str(), 1);
        setenv("RAPIDSMPF_NRANKS", std::to_string(captured_total_ranks).c_str(), 1);

for (size_t i = 0; i < pids.size(); ++i) {
    int status;
    while (true) {
        pid_t result = waitpid(pids[i], &status, 0);
        if (result < 0) {
            if (errno == EINTR) {
                // Retry waitpid for the same pid
                continue;
            }
            std::cerr << "Error waiting for rank " << i << ": "
                      << std::strerror(errno) << std::endl;
            overall_status = 1;
            break;

Contributor

Why is it not sufficient to capture by value in the lambda capture? Also, we're not capturing the cfg by value...

Member Author

Sorry, probably a leftover from previous behavior. Removed.

setenv(env_pair.first.c_str(), env_pair.second.c_str(), 1);
}

apply_topology_bindings(cfg, gpu_id, cfg.verbose);
Contributor

OK, so in passthrough mode, the rrun binary does two things:

  1. remap SLURM_ envvars to RAPIDSMPF_ ones
  2. apply some process affinity bindings (based on the selected GPU ID).
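The env-var remapping in step 1 can be sketched as follows. This is an illustrative, self-contained sketch, not the actual rrun code: the helper name remap_slurm_env is hypothetical, and the exact set of remapped variables is an assumption beyond the RAPIDSMPF_RANK/RAPIDSMPF_NRANKS names quoted in the diff above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper sketching step 1: copy Slurm's per-task environment
// into the RAPIDSMPF_ names the child process reads. SLURM_PROCID and
// SLURM_NPROCS are standard Slurm task variables.
void remap_slurm_env() {
    std::vector<std::pair<char const*, char const*>> const mapping = {
        {"SLURM_PROCID", "RAPIDSMPF_RANK"},    // global rank of this task
        {"SLURM_NPROCS", "RAPIDSMPF_NRANKS"},  // total number of tasks
    };
    for (auto const& [src, dst] : mapping) {
        if (char const* value = std::getenv(src)) {
            setenv(dst, value, /*overwrite=*/1);
        }
    }
}
```

In passthrough mode this remapping would run in the forked child before exec'ing the application, so the application only ever sees the RAPIDSMPF_ names.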

Comment on lines +44 to +47
static PmixGlobalState& instance() {
static PmixGlobalState state;
return state;
}
Contributor

Question: Is it going to be problematic that the dtor here is going to run below main?

Member Author

Actually this is not necessary anymore; it was an artifact of an old implementation. I think we can now finalize in the destructor without problems. Removed the global state and moved finalization to the destructor.

std::array<char, PMIX_MAX_NSLEN + 1> const& nspace, std::string const& operation_name
) {
pmix_proc_t proc;
PMIX_PROC_CONSTRUCT(&proc);
Contributor

Do we need to PMIX_PROC_DESTRUCT?

Member Author

Good catch, fixed here and other missing entries as well.
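One way to make such construct/destruct pairs leak-proof on every exit path is an RAII guard. This is a generic illustration, not the actual fix in the PR; with PMIx the same pattern could wrap PMIX_PROC_CONSTRUCT/PMIX_PROC_DESTRUCT with pmix_proc_t as T, but here a plain struct stands in.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Generic RAII guard pairing a "construct" call with its matching
// "destruct" call, so cleanup runs on every exit path, including early
// returns and exceptions.
template <typename T>
class ScopedHandle {
  public:
    template <typename Construct, typename Destruct>
    ScopedHandle(Construct construct, Destruct destruct)
        : destruct_{std::move(destruct)} {
        construct(&object_);
    }
    ~ScopedHandle() { destruct_(&object_); }

    ScopedHandle(ScopedHandle const&) = delete;
    ScopedHandle& operator=(ScopedHandle const&) = delete;

    T* get() { return &object_; }

  private:
    T object_{};
    std::function<void(T*)> destruct_;
};
```

A hypothetical use would be `ScopedHandle<pmix_proc_t> proc{...}` passing the construct/destruct calls as lambdas, removing the need to remember PMIX_PROC_DESTRUCT at each return.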

std::memcpy(data, bcast_data.data(), size);
}

barrier();
Contributor

question: What is this barrier for?

Member Author

The barrier here is exclusively to prevent processes from continuing until they have all received the data. Perhaps I'm being overly cautious, but I'm not certain we can always guarantee that it is safe for some processes to continue while others may not have finished retrieving this data. I can try to remove it if you prefer.

Member Author

And I have now realized we're not using broadcast for anything at the moment, only the put/sync directly, so I've removed it entirely. With that the API docs for put()/get() need to be updated, so I've done that too.

);
}

// Commit to make the data available
Contributor

Suggested change
// Commit to make the data available

auto start = std::chrono::steady_clock::now();
auto poll_interval = std::chrono::milliseconds{100};

// Get from rank 0 specifically (since that's where the key is stored)
Contributor

How do we know this?

Member Author

I see the confusion: it's not obvious from the implementation, and you need to be aware of how PMIx works. What happens is that PMIx_Put stores a value in the calling rank's local key store, making it available globally (with PMIX_GLOBAL). In broadcast, only rank 0 puts a value, while the other ranks get it from rank 0; that's how we know, since that is currently the only use of the put/get functions. I've attempted to add comments that clarify this, please let me know if anything is still unclear.
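The semantics described here can be modeled with ordinary maps. This is purely a conceptual sketch of the PMIx behavior (no real PMIx calls, and the class and method names are invented for illustration): each rank owns its own key/value store, put writes only to the caller's store, so a get must address the rank that performed the put.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Conceptual model: every rank has a rank-local key/value store. put()
// writes only to the calling rank's store, so get() must name the owning
// rank -- which is why a broadcast-style exchange reads from rank 0, the
// only rank that put the value.
class RankLocalStores {
  public:
    explicit RankLocalStores(int nranks) : stores_(nranks) {}

    // Analogous to PMIx_Put + PMIx_Commit performed by `rank`.
    void put(int rank, std::string const& key, std::string const& value) {
        stores_[rank][key] = value;
    }

    // Analogous to PMIx_Get addressed at `owner`: only sees keys that
    // `owner` itself put.
    std::optional<std::string> get(int owner, std::string const& key) const {
        auto it = stores_[owner].find(key);
        if (it == stores_[owner].end()) {
            return std::nullopt;
        }
        return it->second;
    }

  private:
    std::vector<std::map<std::string, std::string>> stores_;
};
```

Under this model, querying any rank other than the one that put the key simply finds nothing, matching the "get from rank 0 specifically" comment in the diff.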

comm = std::make_shared<ucxx::UCXX>(std::move(ucxx_initialized_rank), options);
}

comm->barrier();
Contributor

Aside (not to be solved here). I think I have observed that the implementation of the barrier here is not quite barrier-like (mainly when debugging deadlocks due to incorrect cleanup of other objects). Non-root ranks can leave the barrier before all non-root ranks have arrived (if the active message send from root to non-root advertising that the barrier has begun goes over the eager protocol)

Member Author

Thanks Lawrence, that's important to investigate. I've opened #857 to do so.

Member Author

@pentschev pentschev left a comment

Thanks @wence for the review. I think I have addressed all your comments and simplified the implementation a bit more in the process. Please have another look!



void SlurmBackend::put(std::string const& key, std::string const& value) {
pmix_value_t pmix_value;
PMIX_VALUE_CONSTRUCT(&pmix_value);
Member Author

No, PMIx_Value_destruct (I've switched to the function API, replacing the deprecated macro API) should only be used with values from PMIx_Value_create or values returned by PMIx_Get, which are owning objects. PMIx_Put doesn't take ownership, so we use PMIx_Value_construct for a non-owning reference.

// Get from rank 0 specifically (since that's where the key is stored)
// Using PMIX_RANK_WILDCARD doesn't seem to work reliably
pmix_proc_t proc;
PMIX_PROC_CONSTRUCT(&proc);
Member Author

Fixed too.


@pentschev pentschev requested a review from wence- February 12, 2026 22:45
Member

@madsbk madsbk left a comment

LGTM, thanks @pentschev

@pentschev
Member Author

Thanks all for the reviews!

@pentschev
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 5ad21a6 into rapidsai:main Feb 18, 2026
89 checks passed
@pentschev pentschev deleted the rrun-slurm branch February 18, 2026 16:24
