Conversation

@Scusemua
Summary:
Adds the ring broadcast algorithm for the Pipes API.
Data flows around the ring one hop at a time: root -> rank1 -> rank2 -> ... -> rank(N-1).

This algorithm is included for completeness and testing purposes. For pure broadcast operations, flat-tree or binomial tree is recommended. Ring is designed for reduce-scatter/all-gather patterns but is useful for benchmarking.

Also adds parameterized tests covering message sizes from 1MB to 8MB.
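As a host-side illustration of the hop sequence described above (the function name and structure are mine, not the actual Pipes kernel), the ring schedule can be sketched as:

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of the ring broadcast schedule: at hop h, rank
// (root + h) % N forwards the message to rank (root + h + 1) % N, so the
// data travels root -> rank1 -> ... -> rank(N-1), one hop at a time.
std::vector<std::pair<int, int>> ring_broadcast_schedule(int n_ranks, int root) {
    std::vector<std::pair<int, int>> hops;
    for (int h = 0; h < n_ranks - 1; ++h) {
        int src = (root + h) % n_ranks;       // rank that already holds the data
        int dst = (root + h + 1) % n_ranks;   // its ring neighbor
        hops.push_back({src, dst});
    }
    return hops;
}
```

Note the N-1 serial hops: this is why ring has poor latency for pure broadcast compared to the O(log N) binomial tree, but its per-hop transfers are full-bandwidth, which matters for large messages.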

Details

Empirical benchmarks showed that the ring algorithm significantly outperforms binomial tree for large messages, which motivated adjusting the algorithm-selection thresholds.

**Algorithm Threshold Updates:**

- Changed the adaptive algorithm to use ring for messages ≥8MB (was binomial tree at 64KB)
- Updated `BroadcastAdaptive.cuh` to delegate to `broadcast_adaptive()`
- Simplified `BroadcastBinomialTree.cuh` to use round-major ordering (entire message per round) instead of chunk-major
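The selection logic implied by these thresholds can be sketched as follows. The 8 MiB ring cutoff comes from this summary; the flat-tree/binomial boundary (64 KiB here) and the function name are assumptions for illustration only:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch of adaptive algorithm selection by message size.
// kRingMin reflects the threshold change described in the summary; the
// lower boundary is an assumed placeholder.
std::string choose_broadcast_algorithm(std::size_t msg_bytes) {
    constexpr std::size_t kBinomialMin = 64 * 1024;       // 64 KiB (assumed)
    constexpr std::size_t kRingMin = 8 * 1024 * 1024;     // 8 MiB (per summary)
    if (msg_bytes >= kRingMin) return "ring";             // bandwidth-bound
    if (msg_bytes >= kBinomialMin) return "binomial_tree"; // latency/bandwidth mix
    return "flat_tree";                                    // latency-bound
}
```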

Differential Revision: D91697545

@meta-cla meta-cla bot added the CLA Signed label Jan 29, 2026
…builds (meta-pytorch#456)

Summary:
Pull Request resolved: meta-pytorch#456

**TL;DR:** Adds `PIPES_DEVICE_CHECK` and `PIPES_DEVICE_CHECK_MSG` macros for device-side assertions that remain active in `mode/opt` builds, enabling detection of invariant violations within GPU kernels.

## Context & Motivation

IIUC, standard C++ `assert()` statements are disabled when `NDEBUG` is defined, which occurs in optimized/release builds (`mode/opt`). This creates a dangerous gap in GPU kernel code: invariant violations that would be caught during development go undetected in production, potentially causing silent data corruption or undefined behavior.
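A minimal host-side analogue of the idea (the macro name and body here are sketches of mine, not the actual `PIPES_DEVICE_CHECK` definition, which presumably uses a device-side trap):

```cpp
#include <cstdio>
#include <cstdlib>

// Unlike assert(), this check is NOT compiled out when NDEBUG is defined,
// so invariant violations are still caught in optimized (mode/opt) builds.
#define PIPES_CHECK_SKETCH(cond)                                        \
    do {                                                                \
        if (!(cond)) {                                                  \
            std::fprintf(stderr, "check failed: %s (%s:%d)\n", #cond,   \
                         __FILE__, __LINE__);                           \
            std::abort();                                               \
        }                                                               \
    } while (0)

// Small helper so the passing path can be exercised without aborting.
bool check_passes(int x) {
    PIPES_CHECK_SKETCH(x >= 0);  // fires (and aborts) only when x < 0
    return true;
}
```

A device-side variant would replace `fprintf`/`abort` with `printf` and `__trap()` inside the kernel, but the exact mechanism used by the macros in this PR is not shown here.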

Differential Revision: D91689639
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Jan 30, 2026
@meta-codesync
meta-codesync bot commented Jan 30, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91697545.

Ben Carver added 3 commits January 30, 2026 08:30
…eta-pytorch#469)

Summary:
Pull Request resolved: meta-pytorch#469

Adds the flat-tree (star) broadcast collective algorithm for the Pipes API. The root rank sends directly to each non-root rank in parallel using warp partitioning.
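The warp-partitioning idea can be sketched on the host side as a mapping from warp to destination rank (the round-robin convention and function name here are illustrative assumptions, not the actual Pipes implementation):

```cpp
// Sketch of flat-tree warp partitioning: the root's warps are divided
// round-robin across the N-1 non-root ranks, so every peer is served
// in parallel rather than sequentially.
int warp_to_peer(int warp_id, int n_ranks, int root) {
    int n_peers = n_ranks - 1;         // everyone except the root
    int peer_idx = warp_id % n_peers;  // round-robin over peer slots
    // Map the peer slot to an actual rank, skipping the root's own rank.
    return peer_idx < root ? peer_idx : peer_idx + 1;
}
```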

Also introduces the broadcast test infrastructure including:
- Base test fixtures (`BroadcastTestFixture`, `BroadcastParamTest`)
- Parameterized test configurations for various message sizes (64B-1MB)
- Edge case tests for single-rank and zero-byte broadcasts

Comprehensive benchmarks for the Pipes Broadcast collective are introduced in D91727873, and D91715149 includes a deep-dive into the performance of the Flat-Tree algorithm.

## Future Work
Abstract the topology into a topology class so that we don't need a separate topology implementation per collective. The topology itself is just the series of steps to execute. This is being tracked in T253140119.

Differential Revision: D91697523
Summary:
Adds the main broadcast benchmark suite that compares Pipes broadcast implementations
against NCCL baseline across various message sizes, algorithms, and configurations.

**Available Benchmarks (7 total):**
- algorithm: Compares flat-tree vs binomial tree algorithms against NCCL
- clustered: Compares standard vs clustered kernel launch
- rootsweep: Tests all ranks as root to identify topology-dependent variations
- extended: Extended sweep from 64B to 256MB with adaptive auto-tuning
- gridconfig: Sweeps block/thread configurations for 16MB messages

Also includes:
- BroadcastTimingStats struct for detailed profiling support
- broadcastFlatKernel declaration and implementation

Build target: //comms/pipes/benchmarks:broadcast_benchmark (8 GPUs per node)

Differential Revision: D91727873
Summary:
Adds the binomial tree broadcast algorithm for the Pipes API.
This algorithm uses O(log N) rounds for better bandwidth efficiency on large messages:
- Round 0: Root sends to rank 1
- Round 1: Ranks 0,1 send to ranks 2,3
- Round 2: Ranks 0-3 send to ranks 4-7
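The rounds listed above can be sketched as a host-side schedule (root fixed at rank 0 for simplicity; names are illustrative, and the real kernel additionally chunks and pipelines the payload):

```cpp
#include <utility>
#include <vector>

// Binomial-tree broadcast schedule: in round r, ranks [0, 2^r) each send
// to their partner 2^r ranks away, doubling the informed set every round,
// so ceil(log2(N)) rounds suffice.
std::vector<std::pair<int, int>> binomial_round(int round, int n_ranks) {
    std::vector<std::pair<int, int>> sends;
    int half = 1 << round;  // ranks below `half` already hold the data
    for (int src = 0; src < half; ++src) {
        int dst = src + half;
        if (dst < n_ranks) sends.push_back({src, dst});
    }
    return sends;
}
```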

Includes chunk-based pipelining for large messages and parameterized tests
covering message sizes from 64KB to 1MB.

---

Also introduces two additional benchmarks:
- optimal: Tests pre-tuned optimal configurations for key message sizes
- tuning: Parameter sweeps for staging buffer sizes

Differential Revision: D91729677
Labels

CLA Signed · fb-exported · meta-exported
