Description
We are developing comm_replay and have found a problem with torch.distributed.batch_isend_irecv, which is used in one of our test traces.
The p2p comm sequence during real training between rank 0 and rank 8 is:
rank 0: batch -> send -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> recv
rank 8: batch -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send
The p2p comm sequence in the Execution Trace used for replay between rank 0 and rank 8 is:
rank0: send -> send -> recv -> send -> recv -> send -> recv -> recv
rank8: recv -> send -> recv -> send -> recv -> send -> recv -> send
The issue can be reproduced with the collected ET for https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py#L3846
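For reference, a rough sketch of the pattern that unit test and the attached traces exercise (this is an illustrative, simplified program, not the actual test code; the two-rank setup and tensor shapes are assumptions):

```python
import torch
import torch.distributed as dist

def run():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    peer = 1 - rank  # assumes world_size == 2 for illustration

    send_buf = torch.full((4,), float(rank), device="cuda")
    recv_buf = torch.empty(4, device="cuda")

    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    # All ops passed in one call are issued as a single coalesced NCCL group;
    # in the Execution Trace this group currently ends with an nccl::coalesced node.
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    run()  # launch with e.g. torchrun --nproc_per_node=2
```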
The two attached files, batched-send-recv-0.json and batched-send-recv-1.json, are a simpler version of that unit test (with only one batch_isend_irecv call). In them you can find an nccl::coalesced node, which marks the end of the coalescing buffer. I think the trace is missing a node to mark the start of the coalescing buffer. Once that is added, all send/recv nodes between the start and the end of the coalescing should be treated as one coalesced group for replay (see the sketch below).
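A hypothetical sketch of the grouping this issue proposes, once a start-of-coalescing marker exists in the trace. The node and field names here ("coalesced_begin", "name", "send", "recv") are illustrative assumptions about the ET schema, not the actual comm_replay implementation:

```python
def group_coalesced_ops(nodes):
    """Split a flat list of p2p trace nodes into groups to replay.

    Every send/recv node between a (proposed) coalescing-start marker and the
    existing nccl::coalesced end marker becomes one group, to be replayed as a
    single batch_isend_irecv call; other send/recv nodes stay standalone.
    """
    groups, current, in_coalesced = [], [], False
    for node in nodes:
        name = node["name"]
        if name == "coalesced_begin":       # proposed start-of-coalescing marker
            in_coalesced, current = True, []
        elif name == "nccl::coalesced":     # existing end-of-coalescing marker
            groups.append(current)
            in_coalesced, current = False, []
        elif name in ("send", "recv"):
            if in_coalesced:
                current.append(node)        # replay later as one coalesced batch
            else:
                groups.append([node])       # standalone p2p op
    return groups
```

With such grouping, the replayer could rebuild one P2POp list per group and issue it with a single batch_isend_irecv call, matching the per-batch structure of the original training sequence shown above.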