Description
For psiflow configurations where multiple apps run in parallel on the same worker node, it seems that MPI core affinities can bind processes of different apps to the same cores, leading to very poor performance.
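For reference, the per-process affinity lines in the logs below can be produced with a small MPI probe along the following lines (a minimal sketch; the actual DUMMY.py is not shown here and is only assumed to do something similar):

# affinity_check.py: hypothetical stand-in for DUMMY.py, only to show where the affinity logs come from
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
# every rank reports its PID and the set of cores the kernel allows it to run on
report = f"pid: {os.getpid()}, CPU affinity: {os.sched_getaffinity(0)}"
for line in comm.gather(report, root=0) or []:
    print(line)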
Default behaviour
Launching two GPAW evaluations on the same node
[CONFIG]
cores_per_worker: 8
launch_command: 'apptainer exec -e --no-init oras://ghcr.io/molmod/gpaw:24.1 /opt/entry.sh mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw python DUMMY.py'
slurm:
nodes_per_block: 1
cores_per_node: 16
[OUTPUT JOB 1]
User: ???@node3521.doduo.os
pid: 771665, CPU affinity: {75}
pid: 771662, CPU affinity: {2}
pid: 771667, CPU affinity: {76}
pid: 771673, CPU affinity: {79}
pid: 771675, CPU affinity: {80}
pid: 771671, CPU affinity: {78}
pid: 771660, CPU affinity: {0}
pid: 771669, CPU affinity: {77}
======================== JOB MAP ========================
Data for node: node3521 Num slots: 16 Max slots: 0 Num procs: 8
Process OMPI jobid: [44168,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./././././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./././././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/././././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B/./././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[./.][././B/././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[./.][./././B/./././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[./.][././././B/././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[./.][./././././B/./././././././.]
=============================================================
[OUTPUT JOB 2]
User: ???@node3521.doduo.os
pid: 771664, CPU affinity: {75}
pid: 771663, CPU affinity: {2}
pid: 771666, CPU affinity: {76}
pid: 771661, CPU affinity: {0}
pid: 771670, CPU affinity: {78}
pid: 771674, CPU affinity: {80}
pid: 771672, CPU affinity: {79}
pid: 771668, CPU affinity: {77}
======================== JOB MAP ========================
Data for node: node3521 Num slots: 16 Max slots: 0 Num procs: 8
Process OMPI jobid: [44169,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./././././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./././././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/././././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B/./././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[./.][././B/././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[./.][./././B/./././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[./.][././././B/././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[./.][./././././B/./././././././.]
=============================================================
Both calculations run extremely slowly on the same 8 cores, but they do not crash.
This is not unexpected: independent mpirun calls do not communicate with each other about which cores to use. It can easily be avoided by setting cores_per_worker equal to cores_per_node in the psiflow config, but that seems to defeat the parsl idea of blocks.
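For completeness, that avoidance would look something like the following (a sketch reusing the keys from the config above; at most one worker, and hence one mpirun, then fits on a node):

[CONFIG]
cores_per_worker: 16   # equal to cores_per_node, so at most one worker (and one mpirun) per node
launch_command: '<same as above>'   # with -np 8, the remaining 8 cores of the node simply idle
slurm:
nodes_per_block: 1
cores_per_node: 16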
A hacky workaround I seem to have found is to wrap the mpirun call inside a Slurm job step:
srun -n 1 -c 8 -v --cpu-bind=v,cores apptainer exec [...] mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw [...]
where we have to ask for 1 task with 8 cores, because otherwise mpirun complains about the number of available slots. Very elegant.
Hacky behaviour
Launching two GPAW evaluations on the same node
[CONFIG]
cores_per_worker: 8
launch_command: 'srun -n 1 -c 8 -v --cpu-bind=v,cores apptainer exec -e --no-init oras://ghcr.io/molmod/gpaw:24.1 /opt/entry.sh mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw python DUMMY.py'
slurm:
nodes_per_block: 1
cores_per_node: 16
[OUTPUT JOB 1]
User: ???@node4217.shinx.os
pid: 1935915, CPU affinity: {102}
pid: 1935897, CPU affinity: {94}
pid: 1935922, CPU affinity: {104}
pid: 1935857, CPU affinity: {93}
pid: 1935902, CPU affinity: {95}
pid: 1935907, CPU affinity: {99}
pid: 1935918, CPU affinity: {103}
pid: 1935911, CPU affinity: {100}
======================== JOB MAP ========================
Data for node: node4217 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [64657,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/./.][././././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B/.][././././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0]]:[././B][././././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[././.][B/./././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[././.][./B/././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[././.][././B/./.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[././.][./././B/.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[././.][././././B]
=============================================================
[OUTPUT JOB 2]
User: ???@node4217.shinx.os
pid: 1935906, CPU affinity: {114}
pid: 1935912, CPU affinity: {116}
pid: 1935908, CPU affinity: {115}
pid: 1935901, CPU affinity: {112}
pid: 1935903, CPU affinity: {113}
pid: 1935896, CPU affinity: {111}
pid: 1935914, CPU affinity: {117}
pid: 1935864, CPU affinity: {105}
======================== JOB MAP ========================
Data for node: node4217 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [64663,1] App: 0 Process rank: 0 Bound: socket 1[core 0[hwt 0]]:[][B/././././././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 1 Bound: socket 1[core 1[hwt 0]]:[][./B/./././././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[][././B/././././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[][./././B/./././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[][././././B/././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[][./././././B/./.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[][././././././B/.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[][./././././././B]
=============================================================
Here, the JOB MAP info seems to contradict the CPU affinity logs, but I suspect this is because srun restricts which cores MPI has access to (note Num slots: 8 instead of the 16 available on the node), so Open MPI presumably numbers the cores relative to the cpuset it was handed. Also, both calculations run fine.
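This is easy to verify without GPAW: running the same kind of job step around a bare affinity probe shows the cpuset that srun hands down (a sketch, using only the srun flags that already appear above):

srun -n 1 -c 8 -v --cpu-bind=v,cores python3 -c 'import os; print(sorted(os.sched_getaffinity(0)))'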
Generally, I think this issue would present itself for most bash apps running on the same node (all reference calculations, but also i-PI simulations). Why does psiflow not use srun (or an equivalent) to separate app resources, e.g., through some parsl Launcher mechanism?
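To make the question concrete, below is a sketch of the kind of thing I mean, assuming psiflow builds its executors on parsl's SlurmProvider; note that a parsl Launcher wraps the command that starts the worker pool of a block, not each individual bash app, so this may well be the wrong level entirely:

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

# sketch: let srun set up the job step / cpuset for the block,
# instead of relying on each mpirun call to pick its own cores
config = Config(
    executors=[
        HighThroughputExecutor(
            label='gpaw',
            cores_per_worker=8,
            provider=SlurmProvider(
                nodes_per_block=1,
                cores_per_node=16,
                launcher=SrunLauncher(),
            ),
        )
    ]
)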
As a side note, it is apparently possible to launch GPAW directly with srun (see), but that does not seem to work with the container sandwiched in between. That's an issue for the GPAW repo, however.