Add hybrid Slurm support to rrun with PMIx-based coordination#844
pentschev wants to merge 11 commits into `rapidsai:main`
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
cpp/tools/rrun.cpp
Outdated
```cpp
bool is_root_parent = (cfg.slurm_global_rank == 0);

// Coordinate root address with other nodes via PMIx
int slurm_ntasks = cfg.slurm_ntasks > 0 ? cfg.slurm_ntasks : 1;
```
Is SLURM_NTASKS ever unset?
It might be safer to explicitly check and throw if SLURM_NTASKS is not set. Otherwise, ucxx::init could end up waiting indefinitely for ranks that will never show up.
No, SLURM_NTASKS should never be unset; it is available from the moment the Slurm job starts until its end. We read it with detect_slurm_environment during parse_args before we get here. If it was not parsed correctly (which shouldn't ever happen AFAIK, since it should always be a positive integer), we default to 1 task here, which is safe, although probably not what the user would expect.
I suggest we throw instead of defaulting to 1. As you say, if that happens something is going seriously wrong.
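Following that suggestion, a minimal sketch of failing fast instead of defaulting to 1 could look like this (the helper name and error messages are illustrative, not from the PR):

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: read SLURM_NTASKS and throw on anything invalid,
// so ucxx::init never waits indefinitely for ranks that will never show up.
int require_slurm_ntasks() {
  const char* raw = std::getenv("SLURM_NTASKS");
  if (raw == nullptr) {
    throw std::runtime_error("SLURM_NTASKS is not set; is this running inside a Slurm job?");
  }
  // std::stoi throws std::invalid_argument on garbage, std::out_of_range on overflow.
  int ntasks = std::stoi(raw);
  if (ntasks <= 0) {
    throw std::runtime_error("SLURM_NTASKS must be a positive integer, got: " + std::string(raw));
  }
  return ntasks;
}
```

This keeps the failure at the launch step, with a message pointing at the environment, rather than a silent hang later in rank coordination.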
```cpp
// Read the hex-encoded address and remove file
std::string encoded_address;
std::ifstream addr_stream(address_file);
std::getline(addr_stream, encoded_address);
```
Do we need a sanity check here? For example, what happens if the file exists but is empty?
What kind of sanity check do you have in mind? If the content is corrupted in any way, UCX will fail to connect, so an error will occur. If you think it's best, we can add some kind of checksum for the address to raise a more specific error message instead.
I was just thinking we could throw early if encoded_address is empty or obviously garbage, but maybe it’s fine to let UCX complain instead.
Maybe just: `if (!addr_stream || encoded_address.empty()) throw`
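Putting that suggestion together, a sketch of the read with an early check (function name and error text are illustrative; only `address_file` and the stream logic come from the PR):

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Read the hex-encoded root address, throwing early if the file is
// missing, unreadable, or empty, instead of letting UCX fail later
// with a less specific connection error.
std::string read_encoded_address(const std::string& address_file) {
  std::ifstream addr_stream(address_file);
  std::string encoded_address;
  std::getline(addr_stream, encoded_address);
  // failbit is set both when the file could not be opened and when
  // getline extracted no characters (e.g. an empty file).
  if (!addr_stream || encoded_address.empty()) {
    throw std::runtime_error("Invalid or empty root address file: " + address_file);
  }
  return encoded_address;
}
```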
cpp/tools/rrun.cpp
Outdated
```cpp
}

// Launch remaining ranks (skip rank 0 if we already have it in pre_launched_process)
int start_local_rank = pre_launched_process.has_value() ? 1 : 0;
```
start_local_rank is computed after std::move(*pre_launched_process). This works because moving the value does not empty the optional, but this confused me :)
Consider computing start_local_rank before the if (pre_launched_process) block to make the intent clear.
Yes, I see the confusion now. This should be improved now in 71d469d
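For illustration, a minimal sketch of the suggested restructuring, computing `start_local_rank` before the optional is consumed (the `Process` type and function are simplified stand-ins, not the actual rrun code):

```cpp
#include <optional>
#include <utility>

// Simplified stand-in for the real pre-launched process type in rrun.
struct Process { int rank; };

int launch_ranks(std::optional<Process> pre_launched_process) {
  // Compute the starting local rank *before* anything moves out of the
  // optional, so readers need not recall that std::move(*opt) leaves the
  // optional engaged (it does, but relying on that is confusing).
  const int start_local_rank = pre_launched_process.has_value() ? 1 : 0;
  if (pre_launched_process) {
    Process rank0 = std::move(*pre_launched_process);
    (void)rank0;  // placeholder for tracking the pre-launched rank 0
  }
  return start_local_rank;
}
```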
````cpp
 * # Hybrid mode: one task per node, 4 GPUs per task, two nodes.
 * srun \
 *   --mpi=pmix \
 *   --nodes=2 \
 *   --ntasks-per-node=1 \
 *   --cpus-per-task=144 \
 *   --gpus-per-task=4 \
 *   --gres=gpu:4 \
 *   rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
 * ```
````
Can you add some documentation on why I might want to use hybrid launch mode? IIUC, it is because then I just need to ensure that I launch the correct number of processes per node and not worry about binding via Slurm (the rrun launch takes care of that).

But one thing I don't understand is why we can't just launch in passthrough mode and then apply bindings to all the processes we get via rrun's topology detection.

What, concretely, is hybrid mode buying us?
In passthrough mode we can do that, although in a somewhat limited manner. Take CPUs as a resource, for example; you have two options:
- Set up the job itself to partition the CPUs: I believe this works as expected if the cluster is properly configured, but it puts the burden on the user to define the resources appropriately, which requires knowing a priori the number of CPUs available and dividing them evenly across tasks; or
- Just pass through all CPUs: in this case we can use rrun to determine the CPUs to bind to. However, the GPU index is not known (the GPU always appears as index 0 to each task), so we cannot partition the CPUs per GPU, only bind to the whole CPU socket/NUMA node. Note that CPU partitioning is not currently implemented in rrun, but I intend to add it. We could also technically do that partitioning with rrun based on the Slurm local ID, but that would also require a different specialization.
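The even per-GPU CPU split described above could be sketched as follows (a hypothetical helper, not current rrun code, since that partitioning is not yet implemented):

```cpp
#include <utility>

// Divide a node's CPUs evenly among its local GPUs, giving each GPU a
// distinct contiguous half-open range [begin, end) of CPU indices. The
// last GPU absorbs any remainder when num_cpus is not divisible.
std::pair<int, int> cpu_range_for_gpu(int num_cpus, int num_gpus, int gpu_index) {
  int per_gpu = num_cpus / num_gpus;
  int begin = gpu_index * per_gpu;
  int end = (gpu_index == num_gpus - 1) ? num_cpus : begin + per_gpu;
  return {begin, end};
}
```

With the 144-CPU, 4-GPU node from the usage example, each rank would bind to a 36-CPU slice; doing this requires knowing the true GPU index, which is exactly what passthrough mode hides.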
However, there are a few more reasons why I would like to have hybrid mode:
- If we have another process to coordinate work submission (analogous to a Dask/Distributed client, as we previously discussed, for example), launching a single task per node will greatly simplify coordination as the coordination can be all prepared via the hybrid mode implementation.
- I'm leaning towards using PMIx for other future specializations as a sort of distributed KV store. It seems like it would be a good fit for extending rrun to other distributed systems (like when using SSH). I know this will require a bit more scaffolding in rrun (like setting up a PMIx server), but using PMIx is probably a better alternative than writing our own KV store for launching purposes, and may also allow us to avoid file-based synchronization. Therefore, this existing implementation with Slurm serves also as a test ground for future PMIx use.
however, the GPU index is not known (since the GPU always appears as index 0 to each task) and thus we cannot partition the CPUs per GPU but only bind to the whole CPU socket/NUMA node.
Can we not map the CUDA runtime device ID to its NVML counterpart and therefore determine the physical CPUs to bind to?
I think not. This is an intentional isolation layer from Slurm (or rather enroot/pyxis), so that only the resources actually available to the allocation are visible at runtime. That includes NVML: it reports a single device being available, and you cannot determine the device's true index or "order".
This PR adds hybrid Slurm support for `rrun`, enabling RapidsMPF to run without MPI. This is achieved by reusing the `SlurmBackend` that provides a passthrough mode. The new hybrid Slurm execution model allows running one task per node and multiple GPUs per task with parent-mediated coordination: `rrun` parents coordinate via PMIx, then spawn non-root ranks with the root rank's address.

The hybrid execution mode requires specifying the number of tasks/ranks (`-n`) directly to `rrun` so it knows it should run in hybrid mode and how many tasks/ranks per node it should launch.

Usage example:

```shell
srun \
  --mpi=pmix \
  --nodes=2 \
  --ntasks-per-node=1 \
  --cpus-per-task=144 \
  --gpus-per-task=4 \
  --gres=gpu:4 \
  rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
```

Requires #775