Open
Description
I noticed that a ring-reduce algorithm is chosen for a fully connected topology.
`LLMCompass/software_model/communication_primitives.py`, lines 62 to 72 in bcc54eb:

```python
# stage 1: ring reduce
latency = (
    edge_latency
    + effective_data_size_per_device / edge_bandwidth_both_direction
) * (device_count - 1)
# stage 2: broadcast
latency += effective_data_size_per_device / edge_bandwidth_per_direction
latency += (
    data_size / interconnect_module.internal_link_bandwidth_per_direction
)
self.latency = latency
```
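Why is this the case? In a fully connected topology every device has a direct link to every other device, but a ring only uses the links to each device's two neighbours, so I believe this leaves a significant share of the available bandwidth unused. Wouldn't a reduce-scatter followed by an all-gather be a better implementation?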
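To make the comparison concrete, here is a rough sketch of the two cost models. The first function mirrors the arithmetic of the quoted lines. The second is my own assumption of what a direct reduce-scatter plus all-gather would cost on a fully connected topology, assuming each device can drive all of its `device_count - 1` links at full per-direction bandwidth at the same time. The function names, the flat parameter list, and the example numbers are all hypothetical and not taken from LLMCompass.

```python
def quoted_ring_latency(chunk, data_size, device_count, edge_latency,
                        bw_per_direction, bw_both_direction, internal_bw):
    """Latency following the arithmetic of the quoted snippet.

    `chunk` stands in for effective_data_size_per_device; how the
    repository derives that value is not shown in the snippet.
    """
    # stage 1: ring reduce, (device_count - 1) sequential steps
    latency = (edge_latency + chunk / bw_both_direction) * (device_count - 1)
    # stage 2: broadcast
    latency += chunk / bw_per_direction
    # internal link term, as in the quoted code
    latency += data_size / internal_bw
    return latency


def direct_rs_ag_latency(chunk, data_size, edge_latency,
                         bw_per_direction, internal_bw):
    """Assumed cost of reduce-scatter + all-gather with direct links.

    Assumption: every device sends one chunk to each peer over a
    dedicated link, all links active in parallel, so each phase takes
    a single step regardless of the device count.
    """
    reduce_scatter = edge_latency + chunk / bw_per_direction
    all_gather = edge_latency + chunk / bw_per_direction
    # keep the same internal link term so the two models are comparable
    return reduce_scatter + all_gather + data_size / internal_bw


# Hypothetical example: 8 devices, 8 GB total, 1 GB per chunk,
# 1 us link latency, 100 GB/s per direction, 1 TB/s internal links.
params = dict(chunk=1e9, data_size=8e9, edge_latency=1e-6,
              bw_per_direction=1e11, internal_bw=1e12)
ring = quoted_ring_latency(device_count=8, bw_both_direction=2e11, **params)
direct = direct_rs_ag_latency(**params)
print(f"quoted model: {ring:.6f} s, direct RS+AG: {direct:.6f} s")
```

Under these assumptions the ring term grows linearly with `device_count` while the direct variant stays constant, which is the gap I am asking about. If devices cannot saturate all links at once, the direct estimate would be too optimistic.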