Add hybrid Slurm support to rrun with PMIx-based coordination#844
pentschev wants to merge 11 commits into `rapidsai:main`
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
cpp/tools/rrun.cpp
Outdated
```cpp
bool is_root_parent = (cfg.slurm_global_rank == 0);

// Coordinate root address with other nodes via PMIx
int slurm_ntasks = cfg.slurm_ntasks > 0 ? cfg.slurm_ntasks : 1;
```
Is SLURM_NTASKS ever unset?
It might be safer to explicitly check and throw if SLURM_NTASKS is not set. Otherwise, ucxx::init could end up waiting indefinitely for ranks that will never show up.
No, SLURM_NTASKS should never be unset; it is available from the moment the Slurm job starts until its end. We read it with detect_slurm_environment during parse_args before we get here. If it was not parsed correctly (which shouldn't ever happen AFAIK, since it should always be a positive integer), we default to 1 task here, which is safe, although probably not what the user would expect.
I suggest we throw instead of defaulting to 1. As you say, if that happens something is going seriously wrong.
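Following that suggestion, a minimal sketch of failing fast instead of defaulting to 1 could look like this (the helper name and error messages are illustrative, not from the PR):

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: read SLURM_NTASKS and throw on anything invalid,
// so ucxx::init never waits indefinitely for ranks that will never show up.
int require_slurm_ntasks() {
  const char* raw = std::getenv("SLURM_NTASKS");
  if (raw == nullptr) {
    throw std::runtime_error("SLURM_NTASKS is not set; is this running inside a Slurm job?");
  }
  // std::stoi throws std::invalid_argument on garbage, std::out_of_range on overflow.
  int ntasks = std::stoi(raw);
  if (ntasks <= 0) {
    throw std::runtime_error("SLURM_NTASKS must be a positive integer, got: " + std::string(raw));
  }
  return ntasks;
}
```

This keeps the failure at the launch step, with a message pointing at the environment, rather than a silent hang later in rank coordination.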
```cpp
// Read the hex-encoded address and remove file
std::string encoded_address;
std::ifstream addr_stream(address_file);
std::getline(addr_stream, encoded_address);
```
Do we need a sanity check here? For example, what happens if the file exists but is empty?
What kind of sanity check do you have in mind? If the content is corrupted in any way, UCX will fail to connect, so an error will occur. If you think it's best, we can add some kind of checksum for the address to raise a more specific error message instead.
I was just thinking we could throw early if encoded_address is empty or obviously garbage, but maybe it’s fine to let UCX complain instead.
Maybe just: `if (!addr_stream || encoded_address.empty()) throw`
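Putting that suggestion together, a sketch of the read with an early check (function name and error text are illustrative; only `address_file` and the stream logic come from the PR):

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Read the hex-encoded root address, throwing early if the file is
// missing, unreadable, or empty, instead of letting UCX fail later
// with a less specific connection error.
std::string read_encoded_address(const std::string& address_file) {
  std::ifstream addr_stream(address_file);
  std::string encoded_address;
  std::getline(addr_stream, encoded_address);
  // failbit is set both when the file could not be opened and when
  // getline extracted no characters (e.g. an empty file).
  if (!addr_stream || encoded_address.empty()) {
    throw std::runtime_error("Invalid or empty root address file: " + address_file);
  }
  return encoded_address;
}
```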
cpp/tools/rrun.cpp
Outdated
```cpp
}

// Launch remaining ranks (skip rank 0 if we already have it in pre_launched_process)
int start_local_rank = pre_launched_process.has_value() ? 1 : 0;
```
start_local_rank is computed after std::move(*pre_launched_process). This works because moving the value does not empty the optional, but this confused me :)
Consider computing start_local_rank before the if (pre_launched_process) block to make the intent clear.
Yes, I see the confusion now. This should be improved now in 71d469d
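For illustration, a minimal sketch of the suggested restructuring, computing `start_local_rank` before the optional is consumed (the `Process` type and function are simplified stand-ins, not the actual rrun code):

```cpp
#include <optional>
#include <utility>

// Simplified stand-in for the real pre-launched process type in rrun.
struct Process { int rank; };

int launch_ranks(std::optional<Process> pre_launched_process) {
  // Compute the starting local rank *before* anything moves out of the
  // optional, so readers need not recall that std::move(*opt) leaves the
  // optional engaged (it does, but relying on that is confusing).
  const int start_local_rank = pre_launched_process.has_value() ? 1 : 0;
  if (pre_launched_process) {
    Process rank0 = std::move(*pre_launched_process);
    (void)rank0;  // placeholder for tracking the pre-launched rank 0
  }
  return start_local_rank;
}
```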
````cpp
 * # Hybrid mode: one task per node, 4 GPUs per task, two nodes.
 * srun \
 *   --mpi=pmix \
 *   --nodes=2 \
 *   --ntasks-per-node=1 \
 *   --cpus-per-task=144 \
 *   --gpus-per-task=4 \
 *   --gres=gpu:4 \
 *   rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
 * ```
````
Can you add some documentation on why I might want to use hybrid launch mode? IIUC, it is because then I just need to ensure that I launch the correct number of processes per node and not worry about binding via Slurm (the rrun launch takes care of that).

But one thing I don't understand is why we can't just launch in passthrough mode and then apply bindings to all the processes we get via rrun's topology detection.

What, concretely, is hybrid mode buying us?
In passthrough mode we can do that, although in a somewhat limited manner. Take CPUs as a resource, for example; you have two options:
- Set up the job itself to partition the CPUs: I believe this works as expected if the cluster is properly configured, but it puts the burden on the user to define the resources appropriately, which requires knowing a priori the number of CPUs available and dividing them evenly across tasks; or
- Just pass through all CPUs: in this case we can use rrun to determine the CPUs to bind to. However, the GPU index is not known (the GPU always appears as index 0 to each task), so we cannot partition the CPUs per GPU, only bind to the whole CPU socket/NUMA node. Note that CPU partitioning is not currently implemented in rrun, but I intend to add it. We could also technically do that partitioning with rrun based on the Slurm local ID, but that would also require a different specialization.
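The even per-GPU CPU split described above could be sketched as follows (a hypothetical helper, not current rrun code, since that partitioning is not yet implemented):

```cpp
#include <utility>

// Divide a node's CPUs evenly among its local GPUs, giving each GPU a
// distinct contiguous half-open range [begin, end) of CPU indices. The
// last GPU absorbs any remainder when num_cpus is not divisible.
std::pair<int, int> cpu_range_for_gpu(int num_cpus, int num_gpus, int gpu_index) {
  int per_gpu = num_cpus / num_gpus;
  int begin = gpu_index * per_gpu;
  int end = (gpu_index == num_gpus - 1) ? num_cpus : begin + per_gpu;
  return {begin, end};
}
```

With the 144-CPU, 4-GPU node from the usage example, each rank would bind to a 36-CPU slice; doing this requires knowing the true GPU index, which is exactly what passthrough mode hides.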
However, there are a few more reasons why I would like to have hybrid mode:
- If we have another process to coordinate work submission (analogous to a Dask/Distributed client, as we previously discussed, for example), launching a single task per node will greatly simplify coordination as the coordination can be all prepared via the hybrid mode implementation.
- I'm leaning towards using PMIx for other future specializations as a sort of distributed KV store. It seems like it would be a good fit for extending rrun to other distributed systems (like when using SSH). I know this will require a bit more scaffolding in rrun (like setting up a PMIx server), but using PMIx is probably a better alternative than writing our own KV store for launching purposes, and may also allow us to avoid file-based synchronization. Therefore, this existing implementation with Slurm serves also as a test ground for future PMIx use.
however, the GPU index is not known (since the GPU always appears as index 0 to each task) and thus we cannot partition the CPUs per GPU but only bind to the whole CPU socket/NUMA node.
Can we not map the CUDA runtime device ID to its NVML counterpart and therefore determine the physical CPUs to bind to?
I think not. This is an intentional isolation layer from Slurm (or rather enroot/pyxis), so that only the resources actually available to the allocation are visible at runtime. That includes NVML: it reports a single device being available, and you cannot determine the device's true index or "order".
This PR adds hybrid Slurm support for `rrun`, enabling RapidsMPF to run without MPI. This is achieved by reusing the `SlurmBackend` that provides a passthrough mode. The new hybrid Slurm execution model allows running one task per node and multiple GPUs per task with parent-mediated coordination: `rrun` parents coordinate via PMIx, then spawn non-root ranks with the root rank's address.

The hybrid execution mode requires specifying the number of tasks/ranks (`-n`) directly to `rrun` so it knows it should run in hybrid mode and how many tasks/ranks per node it should launch.

Usage example:

```shell
srun \
  --mpi=pmix \
  --nodes=2 \
  --ntasks-per-node=1 \
  --cpus-per-task=144 \
  --gpus-per-task=4 \
  --gres=gpu:4 \
  rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
```

Requires #775