Description
We are developing comm_replay and have found a problem with torch.distributed.batch_isend_irecv, which is used in one of our test traces.
The p2p comm sequence during real training between rank 0 and rank 8 is:
rank 0: batch -> send -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> recv
rank 8: batch -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send
The p2p comm sequence in the Execution Trace used for replay between rank 0 and rank 8 is:
rank0: send -> send -> recv -> send -> recv -> send -> recv -> recv
rank8: recv -> send -> recv -> send -> recv -> send -> recv -> send
The issue can be reproduced with the collected ET for https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py#L3846
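For reference, a rough sketch of the pattern that unit test and the attached traces exercise (this is an illustrative, simplified program, not the actual test code; the two-rank setup and tensor shapes are assumptions):

```python
import torch
import torch.distributed as dist

def run():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    peer = 1 - rank  # assumes world_size == 2 for illustration

    send_buf = torch.full((4,), float(rank), device="cuda")
    recv_buf = torch.empty(4, device="cuda")

    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    # All ops passed in one call are issued as a single coalesced NCCL group;
    # in the Execution Trace this group currently ends with an nccl::coalesced node.
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    run()  # launch with e.g. torchrun --nproc_per_node=2
```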
The two attached files, batched-send-recv-0.json and batched-send-recv-1.json, are a simpler version of that unit test (with only one batch_isend_irecv call). In them you can find an nccl::coalesced node, which marks the end of the coalescing buffer. I think the trace is missing a node to mark the start of the coalescing buffer. Once that is added, all send/recv nodes between the start and the end of the coalescing should be treated as one coalesced group for replay (see the sketch below).
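A hypothetical sketch of the grouping this issue proposes, once a start-of-coalescing marker exists in the trace. The node and field names here ("coalesced_begin", "name", "send", "recv") are illustrative assumptions about the ET schema, not the actual comm_replay implementation:

```python
def group_coalesced_ops(nodes):
    """Split a flat list of p2p trace nodes into groups to replay.

    Every send/recv node between a (proposed) coalescing-start marker and the
    existing nccl::coalesced end marker becomes one group, to be replayed as a
    single batch_isend_irecv call; other send/recv nodes stay standalone.
    """
    groups, current, in_coalesced = [], [], False
    for node in nodes:
        name = node["name"]
        if name == "coalesced_begin":       # proposed start-of-coalescing marker
            in_coalesced, current = True, []
        elif name == "nccl::coalesced":     # existing end-of-coalescing marker
            groups.append(current)
            in_coalesced, current = False, []
        elif name in ("send", "recv"):
            if in_coalesced:
                current.append(node)        # replay later as one coalesced batch
            else:
                groups.append([node])       # standalone p2p op
    return groups
```

With such grouping, the replayer could rebuild one P2POp list per group and issue it with a single batch_isend_irecv call, matching the per-batch structure of the original training sequence shown above.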