All-reduce implementation for Fully Connected topology #6

@averageFOSSenjoyer

I noticed that a ring-reduce algorithm is chosen for a fully connected topology.

```python
# stage 1: ring reduce
latency = (
    edge_latency
    + effective_data_size_per_device / edge_bandwidth_both_direction
) * (device_count - 1)
# stage 2: broadcast
latency += effective_data_size_per_device / edge_bandwidth_per_direction
latency += (
    data_size / interconnect_module.internal_link_bandwidth_per_direction
)
self.latency = latency
```

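Why is this the case? I believe this wastes much of the available bandwidth: a ring only uses each device's links to its neighbors, while a fully connected topology has a direct link between every pair of devices. Wouldn't a reduce-scatter followed by an all-gather be a better implementation?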
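To make the question concrete, here is a rough sketch of how the two approaches might compare under a simple latency-plus-bandwidth cost model. All function and parameter names are hypothetical and are not taken from the repository. The sketch assumes every pair of devices has a dedicated link that can run at full per-direction bandwidth concurrently with all other links, and it ignores the time spent on the reduction itself and the `internal_link_bandwidth_per_direction` term from the snippet above.

```python
def ring_all_reduce_latency(data_size, device_count, link_latency, link_bandwidth):
    """Ring reduce-scatter + all-gather: 2 * (N - 1) sequential steps,
    each moving one chunk of size data_size / N to the next neighbor."""
    if device_count < 2:
        return 0.0  # nothing to communicate with a single device
    chunk = data_size / device_count
    steps = 2 * (device_count - 1)
    return steps * (link_latency + chunk / link_bandwidth)


def direct_all_reduce_latency(data_size, device_count, link_latency, link_bandwidth):
    """Direct reduce-scatter + all-gather on a fully connected topology.

    Assumption: each device sends a different chunk to each of its N - 1
    peers at the same time over separate links, so each phase costs one step.
    """
    if device_count < 2:
        return 0.0
    chunk = data_size / device_count
    return 2 * (link_latency + chunk / link_bandwidth)


if __name__ == "__main__":
    # Illustrative numbers only: 8 MB payload, 8 devices, 1 us latency, 50 GB/s links.
    args = dict(data_size=8_000_000, device_count=8,
                link_latency=1e-6, link_bandwidth=50e9)
    ring = ring_all_reduce_latency(**args)
    direct = direct_all_reduce_latency(**args)
    print(f"ring: {ring:.3e} s, direct: {direct:.3e} s, ratio: {ring / direct:.2f}")
```

Under these assumptions the direct variant is faster by a factor of `device_count - 1`, because both the latency and the bandwidth terms are paid once per phase instead of once per ring step. A real interconnect may not sustain full bandwidth on all links simultaneously, so the actual gap could be smaller.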
