
Reference (GPAW) apps fighting for cores on node (SLURM) #76

@pdobbelaere

Description

For psiflow configurations where multiple apps run in parallel on the same worker node, it seems like the MPI core bindings of independent apps can overlap, pinning multiple processes to the same cores and leading to very poor performance.

Default behaviour

Launching two GPAW evaluations on the same node

[CONFIG]
cores_per_worker: 8
launch_command: 'apptainer exec -e --no-init oras://ghcr.io/molmod/gpaw:24.1 /opt/entry.sh mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw python DUMMY.py'
slurm:
  nodes_per_block: 1
  cores_per_node: 16
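
For context, the per-rank "pid / CPU affinity" lines in the outputs below are printed from within the dummy GPAW script. DUMMY.py itself is not shown here, but a minimal sketch of such a diagnostic (assuming mpi4py is importable inside the container) could look like:

# affinity_check.py -- illustrative sketch, not the actual DUMMY.py
import os
from mpi4py import MPI  # assumed to be available alongside GPAW in the container

comm = MPI.COMM_WORLD
# sched_getaffinity(0) returns the set of cores this rank is allowed to run on
line = 'pid: {}, CPU affinity: {}'.format(os.getpid(), os.sched_getaffinity(0))
for entry in comm.gather(line, root=0) or []:
    print(entry)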

[OUTPUT JOB 1]
User:   ???@node3521.doduo.os
pid: 771665, CPU affinity: {75}
pid: 771662, CPU affinity: {2}
pid: 771667, CPU affinity: {76}
pid: 771673, CPU affinity: {79}
pid: 771675, CPU affinity: {80}
pid: 771671, CPU affinity: {78}
pid: 771660, CPU affinity: {0}
pid: 771669, CPU affinity: {77}

 ========================   JOB MAP   ========================

 Data for node: node3521	Num slots: 16	Max slots: 0	Num procs: 8
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./././././././././././././.]
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./././././././././././././.]
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/././././././././././././.]
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B/./././././././././././.]
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[./.][././B/././././././././././.]
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[./.][./././B/./././././././././.]
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[./.][././././B/././././././././.]
 	Process OMPI jobid: [44168,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[./.][./././././B/./././././././.]

 =============================================================

[OUTPUT JOB 2]
User:   ???@node3521.doduo.os
pid: 771664, CPU affinity: {75}
pid: 771663, CPU affinity: {2}
pid: 771666, CPU affinity: {76}
pid: 771661, CPU affinity: {0}
pid: 771670, CPU affinity: {78}
pid: 771674, CPU affinity: {80}
pid: 771672, CPU affinity: {79}
pid: 771668, CPU affinity: {77}

 ========================   JOB MAP   ========================

 Data for node: node3521	Num slots: 16	Max slots: 0	Num procs: 8
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./././././././././././././.]
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./././././././././././././.]
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/././././././././././././.]
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B/./././././././././././.]
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[./.][././B/././././././././././.]
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[./.][./././B/./././././././././.]
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[./.][././././B/././././././././.]
 	Process OMPI jobid: [44169,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[./.][./././././B/./././././././.]

 =============================================================

Both calculations run extremely slowly on the same 8 cores, but do not crash.

This is not unexpected: independent mpirun calls do not communicate with each other about which cores to use. It can easily be avoided by setting cores_per_worker equal to cores_per_node in the psiflow config, but that seems to counteract the Parsl idea of blocks.
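
For completeness, that workaround would amount to a config along these lines (a sketch for the same 16-core node as above; each block then runs only one evaluation at a time):

cores_per_worker: 16
slurm:
  nodes_per_block: 1
  cores_per_node: 16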

A hacky workaround I seem to have found is to wrap the MPI call inside an srun job step:
srun -n 1 -c 8 -v --cpu-bind=v,cores apptainer exec [...] mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw [...]
where we have to ask for 1 task with 8 cores, because otherwise mpirun complains about the number of available slots. Very elegant.

Hacky behaviour

Launching two GPAW evaluations on the same node

[CONFIG]
cores_per_worker: 8
launch_command: 'srun -n 1 -c 8 -v --cpu-bind=v,cores apptainer exec -e --no-init oras://ghcr.io/molmod/gpaw:24.1 /opt/entry.sh mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw python DUMMY.py'
slurm:
  nodes_per_block: 1
  cores_per_node: 16

[OUTPUT JOB 1]
User:   ???@node4217.shinx.os
pid: 1935915, CPU affinity: {102}
pid: 1935897, CPU affinity: {94}
pid: 1935922, CPU affinity: {104}
pid: 1935857, CPU affinity: {93}
pid: 1935902, CPU affinity: {95}
pid: 1935907, CPU affinity: {99}
pid: 1935918, CPU affinity: {103}
pid: 1935911, CPU affinity: {100}

 ========================   JOB MAP   ========================

 Data for node: node4217	Num slots: 8	Max slots: 0	Num procs: 8
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/./.][././././.]
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B/.][././././.]
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0]]:[././B][././././.]
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[././.][B/./././.]
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[././.][./B/././.]
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[././.][././B/./.]
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[././.][./././B/.]
 	Process OMPI jobid: [64657,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[././.][././././B]

 =============================================================

[OUTPUT JOB 2]
User:   ???@node4217.shinx.os
pid: 1935906, CPU affinity: {114}
pid: 1935912, CPU affinity: {116}
pid: 1935908, CPU affinity: {115}
pid: 1935901, CPU affinity: {112}
pid: 1935903, CPU affinity: {113}
pid: 1935896, CPU affinity: {111}
pid: 1935914, CPU affinity: {117}
pid: 1935864, CPU affinity: {105}

 ========================   JOB MAP   ========================

 Data for node: node4217	Num slots: 8	Max slots: 0	Num procs: 8
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 0 Bound: socket 1[core 0[hwt 0]]:[][B/././././././.]
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 1 Bound: socket 1[core 1[hwt 0]]:[][./B/./././././.]
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[][././B/././././.]
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[][./././B/./././.]
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[][././././B/././.]
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[][./././././B/./.]
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[][././././././B/.]
 	Process OMPI jobid: [64663,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[][./././././././B]

 =============================================================

Here, the JOB MAP info seems to contradict the CPU affinity logs, but I suspect this is because srun restricts which cores MPI has access to (note Num slots: 8 instead of the 16 available on the node). Also, both calculations seem to run fine.
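
As a sanity check for that suspicion, the cpuset that srun hands to a job step can be inspected directly, e.g.:

srun -n 1 -c 8 --cpu-bind=v,cores grep Cpus_allowed_list /proc/self/status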

Generally, I think this issue would present itself for most bash apps running on the same node (all reference calculations, but also i-PI simulations). Why does psiflow not use srun (or an equivalent) to separate app resources (e.g., through one of Parsl's launchers)?
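
To illustrate what such a launcher-based setup might look like, here is a rough Python sketch (not psiflow's actual configuration; note that a Parsl launcher wraps the worker pool of a block rather than individual bash apps, so by itself it may not give per-app core separation):

# sketch: a Parsl executor whose blocks are started through srun
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label='reference',
            cores_per_worker=8,
            provider=SlurmProvider(
                nodes_per_block=1,
                cores_per_node=16,
                launcher=SrunLauncher(),  # start the block's workers via srun instead of bare bash
            ),
        )
    ]
)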

On a side note, it is apparently possible to launch GPAW directly with srun (see), but this does not seem to work with the container sandwiched in between. That's an issue for the GPAW repo, however.
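
For comparison, that direct launch (without the container) would presumably look something like the following; I have not verified this exact incantation:

srun -n 8 --cpu-bind=cores gpaw python DUMMY.py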
