@Scusemua

Summary:
Adds the binomial tree broadcast algorithm for the Pipes API.
This algorithm completes in O(log N) rounds, with the set of sending ranks doubling each round, for better bandwidth efficiency on large messages:

  • Round 0: Root sends to rank 1
  • Round 1: Ranks 0,1 send to ranks 2,3
  • Round 2: Ranks 0-3 send to ranks 4-7
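
The round structure above can be sketched as a small host-side schedule generator. This is only an illustration, not the kernel from this PR: the name `binomialBroadcastSchedule`, the `Transfer` alias, and the handling of a non-zero root (by rotating ranks) are assumptions made for the example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// (source rank, destination rank); hypothetical alias for this sketch.
using Transfer = std::pair<int, int>;

// Hypothetical helper: lists which rank sends to which in every round.
// In the round with stride s = 2^k, each rank that already holds the data
// (virtual rank v < s) sends to virtual rank v + s, if that rank exists.
std::vector<std::vector<Transfer>> binomialBroadcastSchedule(int nranks,
                                                             int root = 0) {
  assert(nranks >= 1 && root >= 0 && root < nranks);
  std::vector<std::vector<Transfer>> rounds;
  for (int stride = 1; stride < nranks; stride *= 2) {
    std::vector<Transfer> round;
    for (int v = 0; v < stride && v + stride < nranks; ++v) {
      // Assumption: a non-zero root is handled by rotating virtual ranks.
      int src = (v + root) % nranks;
      int dst = (v + stride + root) % nranks;
      round.emplace_back(src, dst);
    }
    rounds.push_back(std::move(round));
  }
  return rounds;
}
```

For 8 ranks with root 0 this yields three rounds: 0→1, then 0→2 and 1→3, then 0→4, 1→5, 2→6 and 3→7, matching the list above.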

Includes chunk-based pipelining for large messages and parameterized tests
covering message sizes from 64KB to 1MB.
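
As a rough illustration of the chunking step behind pipelining, the sketch below splits a message into fixed-size chunks so that a rank could begin forwarding one chunk while the next is still arriving. The `Chunk` struct, the `splitIntoChunks` name, and the chunk size are hypothetical; the PR's actual chunk and staging-buffer logic may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one chunk of the message.
struct Chunk {
  std::size_t offset;  // byte offset into the message
  std::size_t length;  // bytes in this chunk (the last one may be shorter)
};

// Hypothetical helper: partitions messageBytes into chunks of at most
// chunkBytes each, in order.
std::vector<Chunk> splitIntoChunks(std::size_t messageBytes,
                                   std::size_t chunkBytes) {
  assert(chunkBytes > 0);
  std::vector<Chunk> chunks;
  for (std::size_t off = 0; off < messageBytes; off += chunkBytes) {
    chunks.push_back({off, std::min(chunkBytes, messageBytes - off)});
  }
  return chunks;
}
```

With an assumed 256KB chunk size, a 1MB message splits into four equal chunks, and a zero-byte message produces none.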


Also introduces two additional benchmarks:

  • optimal: Tests pre-tuned optimal configurations for key message sizes
  • tuning: Parameter sweeps for staging buffer sizes

Differential Revision: D91729677

@meta-cla meta-cla bot added the CLA Signed label Jan 29, 2026
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Jan 30, 2026
@Scusemua Scusemua force-pushed the export-D91729677 branch 2 times, most recently from 979a731 to 26cc971 January 30, 2026 15:46
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Jan 30, 2026
Summary:
Pull Request resolved: meta-pytorch#486


meta-codesync bot commented Jan 30, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91729677.

Ben Carver added 4 commits January 30, 2026 08:55
…builds (meta-pytorch#456)

Summary:

**TL;DR:** Adds `PIPES_DEVICE_CHECK` and `PIPES_DEVICE_CHECK_MSG` macros for device-side assertions that remain active in `mode/opt` builds, enabling detection of invariant violations within GPU kernels.

## Context & Motivation

IIUC, standard C++ `assert()` statements are disabled when `NDEBUG` is defined, which occurs in optimized/release builds (`mode/opt`). This creates a dangerous gap in GPU kernel code: invariant violations that would be caught during development go undetected in production, potentially causing silent data corruption or undefined behavior.
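
To make the idea concrete, here is a minimal host-side sketch of a check macro that never consults `NDEBUG`, so it stays active in optimized builds. It is not the definition of `PIPES_DEVICE_CHECK`: the names `SKETCH_CHECK`, `SKETCH_CHECK_MSG`, and `sketchCheckFailed` are hypothetical, and the sketch throws an exception where a real device-side macro would have to print and trap inside the kernel.

```cpp
#include <cassert>  // for comparison only: assert() expands to nothing under NDEBUG
#include <stdexcept>
#include <string>

// Hypothetical failure handler for this sketch. Host code can throw; a
// device-side version could not and would need to abort the kernel instead.
[[noreturn]] inline void sketchCheckFailed(const char* expr, const char* file,
                                           int line, const char* msg) {
  std::string what = std::string(file) + ":" + std::to_string(line) +
                     ": check failed: " + expr;
  if (msg[0] != '\0') {
    what += std::string(" (") + msg + ")";
  }
  throw std::runtime_error(what);
}

// Unlike assert(), these macros do not depend on NDEBUG, so the condition
// is still evaluated when the code is compiled with -DNDEBUG.
#define SKETCH_CHECK_MSG(cond, msg)                            \
  do {                                                         \
    if (!(cond)) {                                             \
      sketchCheckFailed(#cond, __FILE__, __LINE__, (msg));     \
    }                                                          \
  } while (0)
#define SKETCH_CHECK(cond) SKETCH_CHECK_MSG(cond, "")

// Example use: the invariant is enforced in every build mode.
inline int checkedDivide(int a, int b) {
  SKETCH_CHECK_MSG(b != 0, "divisor must be nonzero");
  return a / b;
}
```

The trade-off is that the condition is evaluated on every call, which is why such checks are usually reserved for cheap invariants.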

Differential Revision: D91689639
…eta-pytorch#469)

Summary:

Adds the flat-tree (star) broadcast collective algorithm for the Pipes API. The root rank sends directly to each non-root rank in parallel using warp partitioning.
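
One way to picture warp partitioning is a mapping from each warp to the peer it serves. The sketch below assumes a simple round-robin assignment; the function name `peerForWarp` and the assignment policy are hypothetical, and the actual kernel may partition warps differently (for example, in contiguous groups).

```cpp
#include <cassert>

// Hypothetical helper: returns the destination rank served by a given warp
// on the root. Warps are dealt round-robin across the nranks - 1 non-root
// ranks, skipping the root itself.
inline int peerForWarp(int warpId, int nranks, int root) {
  assert(nranks >= 2 && root >= 0 && root < nranks && warpId >= 0);
  int numPeers = nranks - 1;
  int peerIdx = warpId % numPeers;  // index among the non-root ranks
  return peerIdx < root ? peerIdx : peerIdx + 1;
}
```

With 8 ranks and root 0, warps 0 through 6 map to ranks 1 through 7, and warp 7 wraps around to rank 1 again.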

Also introduces the broadcast test infrastructure including:
- Base test fixtures (`BroadcastTestFixture`, `BroadcastParamTest`)
- Parameterized test configurations for various message sizes (64B-1MB)
- Edge case tests for single-rank and zero-byte broadcasts

Comprehensive benchmarks for the Pipes Broadcast collective are introduced in D91727873, and D91715149 includes a deep-dive into the performance of the Flat-Tree algorithm.

## Future Work
Abstract the topology into a topology class so that we don't need a separate implementation for every topology/collective combination. The topology itself is just the series of steps to execute. This is being tracked in T253140119.

Differential Revision: D91697523
Summary:

Adds the main broadcast benchmark suite that compares Pipes broadcast implementations
against NCCL baseline across various message sizes, algorithms, and configurations.

**Available Benchmarks (7 total; the 5 below plus `optimal` and `tuning` from D91729677):**
- algorithm: Compares flat-tree vs binomial tree algorithms against NCCL
- clustered: Compares standard vs clustered kernel launch
- rootsweep: Tests all ranks as root to identify topology-dependent variations
- extended: Extended sweep from 64B to 256MB with adaptive auto-tuning
- gridconfig: Sweeps block/thread configurations for 16MB messages

Also includes the `broadcastFlatKernel` declaration and implementation.

Build target: `//comms/pipes/benchmarks:broadcast_benchmark` (8 GPUs per node)

Differential Revision: D91727873
Summary:

Adds the binomial tree broadcast algorithm for the Pipes API.
This algorithm completes in O(log N) rounds, with the set of sending ranks doubling each round, for better bandwidth efficiency on large messages:
- Round 0: Root sends to rank 1
- Round 1: Ranks 0,1 send to ranks 2,3
- Round 2: Ranks 0-3 send to ranks 4-7

Includes chunk-based pipelining for large messages and parameterized tests
covering message sizes from 64KB to 1MB.

---

Also introduces two additional benchmarks:
- optimal: Tests pre-tuned optimal configurations for key message sizes
- tuning: Parameter sweeps for staging buffer sizes

Differential Revision: D91729677
Labels: CLA Signed, fb-exported, meta-exported
