
Add hybrid Slurm support to rrun with PMIx-based coordination#844

Open
pentschev wants to merge 11 commits into rapidsai:main from pentschev:rrun-slurm-hybrid

Conversation

@pentschev (Member)

This PR adds hybrid Slurm support to rrun, enabling RapidsMPF to run without MPI. This is achieved by reusing the SlurmBackend, which provides a passthrough mode.

The new hybrid Slurm execution model runs one task per node with multiple GPUs per task, using parent-mediated coordination:

  • The root parent launches rank 0 first to obtain its UCXX address, then broadcasts it via PMIx to the parents on all nodes
  • rrun parents coordinate via PMIx, then spawn the non-root ranks with the root rank's address

The hybrid execution mode requires passing the number of tasks/ranks (-n) directly to rrun, so it knows it should run in hybrid mode and how many tasks/ranks to launch per node.

Usage example:

  srun \
      --mpi=pmix \
      --nodes=2 \
      --ntasks-per-node=1 \
      --cpus-per-task=144 \
      --gpus-per-task=4 \
      --gres=gpu:4 \
      rrun -n 4 ./benchmarks/bench_shuffle -C ucxx

Requires #775

@pentschev pentschev self-assigned this Feb 5, 2026
@pentschev pentschev added feature request New feature or request non-breaking Introduces a non-breaking change labels Feb 5, 2026
@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@pentschev (Member Author)

/ok to test

@pentschev pentschev marked this pull request as ready for review February 18, 2026 16:42
@pentschev pentschev requested a review from a team as a code owner February 18, 2026 16:42
bool is_root_parent = (cfg.slurm_global_rank == 0);

// Coordinate root address with other nodes via PMIx
int slurm_ntasks = cfg.slurm_ntasks > 0 ? cfg.slurm_ntasks : 1;
Member

Is SLURM_NTASKS ever unset?

It might be safer to explicitly check and throw if SLURM_NTASKS is not set. Otherwise, ucxx::init could end up waiting indefinitely for ranks that will never show up.

Member Author

No, SLURM_NTASKS should never be unset; it is available from the moment the Slurm job starts until its end. We read it with detect_slurm_environment during parse_args before we get here. If it was not parsed correctly (which shouldn't ever happen AFAIK, it should always be a positive integer), we default to 1 task here, which is safe, although probably not what the user would expect.

Member

I suggest we throw instead of defaulting to 1. As you say, something is going seriously wrong.

Member Author

Done in 6bd1324

// Read the hex-encoded address and remove file
std::string encoded_address;
std::ifstream addr_stream(address_file);
std::getline(addr_stream, encoded_address);
Member

Do we need a sanity check here? For example, what happens if the file exists but is empty?

Member Author

What kind of sanity check do you have in mind? If the content is corrupted in any way, UCX will fail to connect, so an error will occur. If you prefer, we can add some kind of checksum for the address to raise a more specific error message instead.

Member

I was just thinking we could throw early if encoded_address is empty or obviously garbage, but maybe it’s fine to let UCX complain instead.

Member

Maybe just:

`if (!addr_stream || encoded_address.empty()) throw;`

Member Author

Sounds reasonable, done in dee4029

@pentschev pentschev requested a review from madsbk February 19, 2026 20:22
}

// Launch remaining ranks (skip rank 0 if we already have it in pre_launched_process)
int start_local_rank = pre_launched_process.has_value() ? 1 : 0;
Member

start_local_rank is computed after std::move(*pre_launched_process). This works because moving the value does not empty the optional, but this confused me :)
Consider computing start_local_rank before the if (pre_launched_process) block to make the intent clear.

Member Author

Yes, I see the confusion now. This is improved in 71d469d

Comment on lines +43 to 52
* # Hybrid mode: one task per node, 4 GPUs per task, two nodes.
* srun \
* --mpi=pmix \
* --nodes=2 \
* --ntasks-per-node=1 \
* --cpus-per-task=144 \
* --gpus-per-task=4 \
* --gres=gpu:4 \
* rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
* ```
Contributor

Can you add some documentation on why I might want to use hybrid launch mode? IIUC, it is because then I just need to ensure I launch the correct number of processes per node and not worry about binding via Slurm (the rrun launch takes care of that).

But one thing I don't understand is why we can't just launch in passthrough mode and then apply bindings to all the processes via the rrun topology detection.

What, concretely, is hybrid mode buying us?

Member Author

In passthrough mode we can do that, although in a somewhat limited manner. Take CPUs as a resource, for example; you have two options:

  1. Set up the job itself to partition the CPUs: I believe this works as expected if the cluster is properly configured, but it puts the burden on the user to define the resources appropriately, which requires knowing a priori how many CPUs are available and dividing them evenly among tasks; or
  2. Just pass through all CPUs: in this case we can use rrun to determine the CPUs to bind to; however, the GPU index is not known (the GPU always appears as index 0 to each task), so we cannot partition the CPUs per GPU, only bind to the whole CPU socket/NUMA node. Note that CPU partitioning is not currently implemented in rrun, but I intend to add it. We could technically also do that partitioning in rrun based on the Slurm local ID, but that would require another specialization.

However, there are a few more reasons why I would like to have hybrid mode:

  1. If we have another process to coordinate work submission (analogous to a Dask/Distributed client, as we previously discussed, for example), launching a single task per node greatly simplifies coordination, since it can all be prepared via the hybrid mode implementation.
  2. I'm leaning towards using PMIx in other future specializations as a sort of distributed KV store. It seems like a good fit for extending rrun to other distributed systems (like when using SSH). I know this will require a bit more scaffolding in rrun (like setting up a PMIx server), but using PMIx is probably a better alternative than writing our own KV store for launching purposes, and may also allow us to avoid file-based synchronization. Therefore, this existing implementation with Slurm also serves as a test ground for future PMIx use.

Contributor

however, the GPU index is not known (since the GPU always appears as index 0 to each task) and thus we cannot partition the CPUs per GPU but only bind to the whole CPU socket/NUMA node.

Can we not map the CUDA runtime device id to its NVML counterpart and thereby determine the physical CPUs to bind to?

Member Author

I think not. This is an intentional isolation layer in Slurm (or rather enroot/pyxis), so that only the resources actually available to the allocation are visible at runtime. That includes NVML: NVML reports a single device as available, and you cannot determine the true index or "order" of the device.
