Add binomial tree broadcast algorithm #486

Scusemua · 2026-01-29T22:46:06Z

Summary:
Adds the binomial tree broadcast algorithm for the Pipes API.
This algorithm completes in O(log N) rounds for N ranks, for better bandwidth efficiency on large messages:

- Round 0: Root sends to rank 1
- Round 1: Ranks 0,1 send to ranks 2,3
- Round 2: Ranks 0-3 send to ranks 4-7
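
The round structure above can be sketched as a small schedule generator. This is an illustrative model only, not the Pipes implementation: the function name `binomial_broadcast_schedule`, the exact sender-to-receiver pairing (rank `i` sends to rank `i + 2^round`), and the handling of a non-zero root by rotating rank numbers are all assumptions.

```python
def binomial_broadcast_schedule(num_ranks, root=0):
    """Return a list of rounds; each round is a list of (src, dst) pairs.

    Hypothetical helper for illustration. Assumes rank i (relative to the
    root) sends to rank i + 2**round, which matches the round listing above
    for root 0 but is not confirmed to be the pairing the PR uses.
    """
    rounds = []
    have = 1  # number of root-relative ranks that already hold the data
    while have < num_ranks:
        sends = []
        for rel_src in range(have):
            rel_dst = rel_src + have
            if rel_dst < num_ranks:  # skip partners that do not exist
                src = (rel_src + root) % num_ranks
                dst = (rel_dst + root) % num_ranks
                sends.append((src, dst))
        rounds.append(sends)
        have *= 2  # the set of ranks holding the data doubles each round
    return rounds


if __name__ == "__main__":
    for i, sends in enumerate(binomial_broadcast_schedule(8)):
        print(f"Round {i}: {sends}")
```

For 8 ranks this yields three rounds, and the number of rounds grows as the ceiling of log2(N) because the set of ranks holding the data doubles each round.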

Includes chunk-based pipelining for large messages and parameterized tests
covering message sizes from 64KB to 1MB.
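
As a rough illustration of chunk-based pipelining, a large message can be split into fixed-size chunks so that a rank can forward one chunk while the next is still arriving. The sketch below only computes the chunk boundaries; the helper name `chunk_ranges` and the 64KB chunk size in the usage example are assumptions, not values taken from this PR.

```python
def chunk_ranges(message_bytes, chunk_bytes):
    """Split a message into (offset, length) chunks.

    Hypothetical helper for illustration. The last chunk may be shorter
    than chunk_bytes when the message size is not an exact multiple.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return [
        (offset, min(chunk_bytes, message_bytes - offset))
        for offset in range(0, message_bytes, chunk_bytes)
    ]


if __name__ == "__main__":
    # Assumed sizes: a 1MB message split into 64KB chunks.
    chunks = chunk_ranges(1 << 20, 64 << 10)
    print(len(chunks), chunks[0], chunks[-1])
```

Smaller chunks let downstream ranks start forwarding sooner, at the cost of more per-chunk overhead, which is presumably the trade-off the staging buffer size sweep explores.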

Also introduces two additional benchmarks:

- optimal: Tests pre-tuned optimal configurations for key message sizes
- tuning: Parameter sweeps for staging buffer sizes

Differential Revision: D91729677

Summary: Pull Request resolved: meta-pytorch#486

meta-codesync · 2026-01-30T15:49:52Z

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91729677.

…builds (meta-pytorch#456)

Summary:

**TL;DR:** Adds `PIPES_DEVICE_CHECK` and `PIPES_DEVICE_CHECK_MSG` macros for device-side assertions that remain active in `mode/opt` builds, enabling detection of invariant violations within GPU kernels.

## Context & Motivation

IIUC, standard C++ `assert()` statements are disabled when `NDEBUG` is defined, which occurs in optimized/release builds (`mode/opt`). This creates a dangerous gap in GPU kernel code: invariant violations that would be caught during development go undetected in production, potentially causing silent data corruption or undefined behavior.

Differential Revision: D91689639

…eta-pytorch#469)

Summary: Adds the flat-tree (star) broadcast collective algorithm for the Pipes API. The root rank sends directly to each non-root rank in parallel using warp partitioning.

Also introduces the broadcast test infrastructure including:

- Base test fixtures (`BroadcastTestFixture`, `BroadcastParamTest`)
- Parameterized test configurations for various message sizes (64B-1MB)
- Edge case tests for single-rank and zero-byte broadcasts

Comprehensive benchmarks for the Pipes Broadcast collective are introduced in D91727873, and D91715149 includes a deep-dive into the performance of the Flat-Tree algorithm.

## Future Work

Abstract the topology into a topology class so that we don't have to have one topology / collective combo. The topology itself is just the series of steps to execute. This is being tracked in T253140119.

Differential Revision: D91697523
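
The warp partitioning mentioned in the flat-tree description above can be pictured as dividing the available warps among the non-root ranks, so that each group of warps copies to one destination in parallel. The sketch below is a host-side model under assumptions: the name `partition_warps` and the even split with remainder going to the first peers are hypothetical, and the actual kernel may assign warps differently.

```python
def partition_warps(num_warps, num_ranks, root=0):
    """Map each non-root rank to the list of warp indices that serve it.

    Hypothetical helper for illustration. Warps are split as evenly as
    possible; any remainder goes to the first few peers.
    """
    peers = [rank for rank in range(num_ranks) if rank != root]
    if not peers:
        return {}  # single-rank broadcast: nothing to send
    base, extra = divmod(num_warps, len(peers))
    assignment = {}
    next_warp = 0
    for i, peer in enumerate(peers):
        count = base + (1 if i < extra else 0)
        assignment[peer] = list(range(next_warp, next_warp + count))
        next_warp += count
    return assignment


if __name__ == "__main__":
    # Assumed configuration: 14 warps serving 7 non-root ranks.
    for peer, warps in partition_warps(14, 8).items():
        print(f"rank {peer}: warps {warps}")
```

Unlike the binomial tree, every transfer here originates at the root, so the root's outgoing bandwidth is shared across all peers.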

Summary: Adds the main broadcast benchmark suite that compares Pipes broadcast implementations against NCCL baseline across various message sizes, algorithms, and configurations.

**Available Benchmarks (7 total):**

- algorithm: Compares flat-tree vs binomial tree algorithms against NCCL
- clustered: Compares standard vs clustered kernel launch
- rootsweep: Tests all ranks as root to identify topology-dependent variations
- extended: Extended sweep from 64B to 256MB with adaptive auto-tuning
- gridconfig: Sweeps block/thread configurations for 16MB messages

Also includes `broadcastFlatKernel` declaration and implementation.

Build target: `//comms/pipes/benchmarks:broadcast_benchmark` (8 GPUs per node)

Differential Revision: D91727873

meta-cla bot added the CLA Signed label Jan 29, 2026

meta-codesync bot added fb-exported meta-exported labels Jan 29, 2026

Scusemua force-pushed the export-D91729677 branch 2 times, most recently from 979a731 to 26cc971 January 30, 2026 15:46

Scusemua force-pushed the export-D91729677 branch from 26cc971 to a257c2a January 30, 2026 15:47

Scusemua force-pushed the export-D91729677 branch from a257c2a to a191308 January 30, 2026 15:49

Ben Carver added 4 commits January 30, 2026 08:55

Scusemua force-pushed the export-D91729677 branch from a191308 to ff29f56 January 30, 2026 16:56
