Add ring broadcast algorithm #487
Open
Scusemua
wants to merge
5
commits into
meta-pytorch:main
from
Scusemua:export-D91697545
…builds (meta-pytorch#456)

Summary: Pull Request resolved: meta-pytorch#456

**TL;DR:** Adds `PIPES_DEVICE_CHECK` and `PIPES_DEVICE_CHECK_MSG` macros for device-side assertions that remain active in `mode/opt` builds, enabling detection of invariant violations within GPU kernels.

## Context & Motivation

IIUC, standard C++ `assert()` statements are disabled when `NDEBUG` is defined, which occurs in optimized/release builds (`mode/opt`). This creates a dangerous gap in GPU kernel code: invariant violations that would be caught during development go undetected in production, potentially causing silent data corruption or undefined behavior.

Differential Revision: D91689639
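To illustrate the idea behind an always-on check, here is a minimal host-side sketch. It is not the `PIPES_DEVICE_CHECK` implementation: the names `SKETCH_CHECK`, `SKETCH_CHECK_MSG`, `sketch_check_fail`, `sketch_check_failures`, and `chunk_fits` are hypothetical, and the sketch records failures in a counter where a device-side macro would presumably print and trap the kernel. The only point it makes is that the macro is not gated on `NDEBUG`.

```cpp
#include <cstdio>

// Hypothetical names throughout; this is a host-side stand-in, not the PR's macro.
// A device build would likely print and trap; here we count failures so the
// behaviour can be observed without aborting the process.
static int sketch_check_failures = 0;

static void sketch_check_fail(const char* expr, const char* msg,
                              const char* file, int line) {
  std::fprintf(stderr, "%s:%d: check failed: %s%s%s\n", file, line, expr,
               msg ? " - " : "", msg ? msg : "");
  ++sketch_check_failures;
}

// Deliberately NOT wrapped in #ifndef NDEBUG, so the check survives
// optimized builds where assert() compiles to nothing.
#define SKETCH_CHECK_MSG(cond, msg)                               \
  do {                                                            \
    if (!(cond)) {                                                \
      sketch_check_fail(#cond, (msg), __FILE__, __LINE__);        \
    }                                                             \
  } while (0)
#define SKETCH_CHECK(cond) SKETCH_CHECK_MSG(cond, nullptr)

// Example invariant (assumed for illustration): a chunk must fit in its buffer.
static bool chunk_fits(unsigned long chunk_bytes, unsigned long buffer_bytes) {
  SKETCH_CHECK_MSG(chunk_bytes <= buffer_bytes, "chunk exceeds staging buffer");
  return chunk_bytes <= buffer_bytes;
}
```

Compiling this with `-DNDEBUG` leaves the check in place, which is the property the commit message is after.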
Scusemua
pushed a commit
to Scusemua/torchcomms
that referenced
this pull request
Jan 30, 2026
Summary: Pull Request resolved: meta-pytorch#487

Adds the ring broadcast algorithm for the Pipes API. Data flows around the ring one hop at a time: root -> rank1 -> rank2 -> ... -> rank(N-1).

This algorithm is included for completeness and testing purposes. For pure broadcast operations, flat-tree or binomial tree is recommended. Ring is designed for reduce-scatter/all-gather patterns but is useful for benchmarking.

Also adds parameterized tests covering message sizes from 1MB to 8MB.

## Details

Empirical benchmarks showed that the ring algorithm significantly outperforms binomial tree for large messages, requiring threshold adjustments.

**Algorithm Threshold Updates:**
- Changed adaptive algorithm to use ring for messages ≥8MB (was binomial tree at 64KB)
- Updated `BroadcastAdaptive.cuh` to delegate to `broadcast_adaptive()`
- Simplified `BroadcastBinomialTree.cuh` to use round-major ordering (entire message per round) instead of chunk-major

Differential Revision: D91697545
dbf3c3d to
1139f22
1139f22 to
e447e09
e447e09 to
0927f04
…eta-pytorch#469)

Summary: Pull Request resolved: meta-pytorch#469

Adds the flat-tree (star) broadcast collective algorithm for the Pipes API. The root rank sends directly to each non-root rank in parallel using warp partitioning.

Also introduces the broadcast test infrastructure including:
- Base test fixtures (`BroadcastTestFixture`, `BroadcastParamTest`)
- Parameterized test configurations for various message sizes (64B-1MB)
- Edge case tests for single-rank and zero-byte broadcasts

Comprehensive benchmarks for the Pipes Broadcast collective are introduced in D91727873, and D91715149 includes a deep-dive into the performance of the Flat-Tree algorithm.

## Future Work

Abstract the topology into a topology class so that we don't have to have one topology / collective combo. The topology itself is just the series of steps to execute. This is being tracked in T253140119.

Differential Revision: D91697523
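The warp partitioning mentioned in the flat-tree commit can be pictured with a small host-side sketch. The helper name `flat_tree_dest_for_warp` and the even, contiguous split of warps across peers are assumptions made for illustration; the actual kernel may divide work differently.

```cpp
#include <cassert>

// Hypothetical helper: maps a warp on the root rank to the non-root rank it
// sends to. Assumes (for simplicity) that num_warps is a multiple of the
// number of peers, so every peer gets the same number of warps.
static int flat_tree_dest_for_warp(int warp_id, int num_warps, int num_ranks,
                                   int root) {
  const int num_peers = num_ranks - 1;  // every rank except the root
  assert(num_peers > 0 && num_warps % num_peers == 0);
  assert(warp_id >= 0 && warp_id < num_warps);

  const int warps_per_peer = num_warps / num_peers;
  const int peer_index = warp_id / warps_per_peer;  // 0 .. num_peers-1

  // Convert a peer index to a rank by skipping over the root.
  return peer_index < root ? peer_index : peer_index + 1;
}
```

With 4 ranks, root 1, and 6 warps, warps 0-1 would target rank 0, warps 2-3 rank 2, and warps 4-5 rank 3. The single-rank case covered by the edge-case tests has no peers and is outside what this helper handles.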
Summary: Adds the main broadcast benchmark suite that compares Pipes broadcast implementations against NCCL baseline across various message sizes, algorithms, and configurations.

**Available Benchmarks (7 total):**
- algorithm: Compares flat-tree vs binomial tree algorithms against NCCL
- clustered: Compares standard vs clustered kernel launch
- rootsweep: Tests all ranks as root to identify topology-dependent variations
- extended: Extended sweep from 64B to 256MB with adaptive auto-tuning
- gridconfig: Sweeps block/thread configurations for 16MB messages

Also includes:
- BroadcastTimingStats struct for detailed profiling support
- broadcastFlatKernel declaration and implementation

Build target: //comms/pipes/benchmarks:broadcast_benchmark (8 GPUs per node)

Differential Revision: D91727873
Summary: Adds the binomial tree broadcast algorithm for the Pipes API. This algorithm uses O(log N) rounds for better bandwidth efficiency on large messages:
- Round 0: Root sends to rank 1
- Round 1: Ranks 0,1 send to ranks 2,3
- Round 2: Ranks 0-3 send to ranks 4-7

Includes chunk-based pipelining for large messages and parameterized tests covering message sizes from 64KB to 1MB.

---

Also introduces two additional benchmarks:
- optimal: Tests pre-tuned optimal configurations for key message sizes
- tuning: Parameter sweeps for staging buffer sizes

Differential Revision: D91729677
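The round structure listed in the binomial tree commit can be reproduced with a short host-side sketch. The function name `binomial_tree_rounds` is hypothetical, and handling a non-zero root by rotating ranks relative to the root is an assumption; the commit message only spells out the root-0 case.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: for each round, lists the (sender, receiver) pairs.
// In the round with stride s (1, 2, 4, ...), every rank whose root-relative
// index is below s already holds the data and sends to the rank s positions
// further along, if that rank exists.
static std::vector<std::vector<std::pair<int, int>>> binomial_tree_rounds(
    int num_ranks, int root) {
  std::vector<std::vector<std::pair<int, int>>> rounds;
  for (int stride = 1; stride < num_ranks; stride *= 2) {
    std::vector<std::pair<int, int>> sends;
    for (int rel = 0; rel < stride && rel + stride < num_ranks; ++rel) {
      // Assumption: a non-zero root is handled by rotating rank indices.
      sends.emplace_back((root + rel) % num_ranks,
                         (root + rel + stride) % num_ranks);
    }
    rounds.push_back(sends);
  }
  return rounds;
}
```

For 8 ranks and root 0 this yields three rounds with 1, 2, and 4 sends, matching the rounds listed in the commit message. When the rank count is not a power of two, the last round simply has fewer senders.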
0927f04 to
04a3628
04a3628 to
c0300ac
Summary:
Adds the ring broadcast algorithm for the Pipes API.
Data flows around the ring one hop at a time: root -> rank1 -> rank2 -> ... -> rank(N-1).
This algorithm is included for completeness and testing purposes. For pure broadcast operations, flat-tree or binomial tree is recommended. Ring is designed for reduce-scatter/all-gather patterns but is useful for benchmarking.
Also adds parameterized tests covering message sizes from 1MB to 8MB.
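The hop order described in the summary can be sketched on the host as follows. The function name `ring_broadcast_hops` is hypothetical, and treating a non-zero root as a rotation of the ring, `(root + i) % N`, is an assumption; the summary only shows the root-0 ordering.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: returns the (sender, receiver) pair for each hop, in
// the order the data travels. A broadcast over N ranks takes N-1 hops.
static std::vector<std::pair<int, int>> ring_broadcast_hops(int num_ranks,
                                                            int root) {
  std::vector<std::pair<int, int>> hops;
  for (int step = 0; step + 1 < num_ranks; ++step) {
    // Assumption: the ring starts at the root and wraps around modulo N.
    const int sender = (root + step) % num_ranks;
    const int receiver = (root + step + 1) % num_ranks;
    hops.emplace_back(sender, receiver);
  }
  return hops;
}
```

For 4 ranks and root 0 the hops are (0,1), (1,2), (2,3). The last rank only receives, so the chain has N-1 sequential hops, compared with the O(log N) rounds of the binomial tree.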
Details
Empirical benchmarks showed that the ring algorithm significantly outperforms binomial tree for large messages, requiring threshold adjustments.
Algorithm Threshold Updates:
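- Changed adaptive algorithm to use ring for messages ≥8MB (was binomial tree at 64KB)
- Updated `BroadcastAdaptive.cuh` to delegate to `broadcast_adaptive()`
- Simplified `BroadcastBinomialTree.cuh` to use round-major ordering (entire message per round) instead of chunk-major
Differential Revision: D91697545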
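The size-based selection in the threshold update might be sketched as below. This is not the code in `BroadcastAdaptive.cuh`: the enum, constant, and function names are hypothetical, "8MB" is assumed to mean 8 MiB, and the description does not say which algorithm handles each size below the threshold, so the sketch leaves that path as a single opaque value.

```cpp
#include <cstddef>

// Hypothetical names; only the ">= 8MB uses ring" rule comes from the PR text.
enum class BroadcastAlgo {
  kBelowRingThreshold,  // placeholder: sub-threshold choices are not specified
  kRing,
};

// Assumption: "8MB" is read as 8 MiB (8 * 1024 * 1024 bytes).
constexpr std::size_t kRingThresholdBytes = 8u * 1024u * 1024u;

constexpr BroadcastAlgo select_broadcast_algo(std::size_t message_bytes) {
  return message_bytes >= kRingThresholdBytes
             ? BroadcastAlgo::kRing
             : BroadcastAlgo::kBelowRingThreshold;
}
```

Keeping the threshold in one named constant makes it easy to re-tune if later benchmarks move the crossover point.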