Conversation

@Scusemua
Summary:
Adds the ring broadcast algorithm for the Pipes API.
Data flows around the ring one hop at a time: root -> rank1 -> rank2 -> ... -> rank(N-1).

This algorithm is included for completeness and testing purposes. For pure broadcast operations, flat-tree or binomial tree is recommended. Ring is designed for reduce-scatter/all-gather patterns but is useful for benchmarking.

Also adds parameterized tests covering message sizes from 1MB to 8MB.
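As a host-side illustration of the hop sequence described above (the function name and structure are mine, not the actual Pipes kernel), the ring schedule can be sketched as:

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of the ring broadcast schedule: at hop h, rank
// (root + h) % N forwards the message to rank (root + h + 1) % N, so the
// data travels root -> rank1 -> ... -> rank(N-1), one hop at a time.
std::vector<std::pair<int, int>> ring_broadcast_schedule(int n_ranks, int root) {
    std::vector<std::pair<int, int>> hops;
    for (int h = 0; h < n_ranks - 1; ++h) {
        int src = (root + h) % n_ranks;       // rank that already holds the data
        int dst = (root + h + 1) % n_ranks;   // its ring neighbor
        hops.push_back({src, dst});
    }
    return hops;
}
```

Note the N-1 serial hops: this is why ring has poor latency for pure broadcast compared to the O(log N) binomial tree, but its per-hop transfers are full-bandwidth, which matters for large messages.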

Details

Empirical benchmarks showed that the ring algorithm significantly outperforms binomial tree for large messages, which motivated adjusting the algorithm-selection thresholds.

**Algorithm Threshold Updates:**

- Changed the adaptive algorithm to use ring for messages ≥8MB (was binomial tree at 64KB)
- Updated `BroadcastAdaptive.cuh` to delegate to `broadcast_adaptive()`
- Simplified `BroadcastBinomialTree.cuh` to use round-major ordering (entire message per round) instead of chunk-major
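The selection logic implied by these thresholds can be sketched as follows. The 8 MiB ring cutoff comes from this summary; the flat-tree/binomial boundary (64 KiB here) and the function name are assumptions for illustration only:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch of adaptive algorithm selection by message size.
// kRingMin reflects the threshold change described in the summary; the
// lower boundary is an assumed placeholder.
std::string choose_broadcast_algorithm(std::size_t msg_bytes) {
    constexpr std::size_t kBinomialMin = 64 * 1024;       // 64 KiB (assumed)
    constexpr std::size_t kRingMin = 8 * 1024 * 1024;     // 8 MiB (per summary)
    if (msg_bytes >= kRingMin) return "ring";             // bandwidth-bound
    if (msg_bytes >= kBinomialMin) return "binomial_tree"; // latency/bandwidth mix
    return "flat_tree";                                    // latency-bound
}
```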

Differential Revision: D91697545

@meta-cla meta-cla bot added the CLA Signed label Jan 29, 2026
…builds (meta-pytorch#456)

Summary:
Pull Request resolved: meta-pytorch#456

**TL;DR:** Adds `PIPES_DEVICE_CHECK` and `PIPES_DEVICE_CHECK_MSG` macros for device-side assertions that remain active in `mode/opt` builds, enabling detection of invariant violations within GPU kernels.

## Context & Motivation

IIUC, standard C++ `assert()` statements are disabled when `NDEBUG` is defined, which occurs in optimized/release builds (`mode/opt`). This creates a dangerous gap in GPU kernel code: invariant violations that would be caught during development go undetected in production, potentially causing silent data corruption or undefined behavior.
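A minimal host-side analogue of the idea (the macro name and body here are sketches of mine, not the actual `PIPES_DEVICE_CHECK` definition, which presumably uses a device-side trap):

```cpp
#include <cstdio>
#include <cstdlib>

// Unlike assert(), this check is NOT compiled out when NDEBUG is defined,
// so invariant violations are still caught in optimized (mode/opt) builds.
#define PIPES_CHECK_SKETCH(cond)                                        \
    do {                                                                \
        if (!(cond)) {                                                  \
            std::fprintf(stderr, "check failed: %s (%s:%d)\n", #cond,   \
                         __FILE__, __LINE__);                           \
            std::abort();                                               \
        }                                                               \
    } while (0)

// Small helper so the passing path can be exercised without aborting.
bool check_passes(int x) {
    PIPES_CHECK_SKETCH(x >= 0);  // fires (and aborts) only when x < 0
    return true;
}
```

A device-side variant would replace `fprintf`/`abort` with `printf` and `__trap()` inside the kernel, but the exact mechanism used by the macros in this PR is not shown here.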

Differential Revision: D91689639
Scusemua pushed a commit to Scusemua/torchcomms that referenced this pull request Jan 30, 2026
@meta-codesync
meta-codesync bot commented Jan 30, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91697545.

Ben Carver added 3 commits January 30, 2026 08:30
…eta-pytorch#469)

Summary:
Pull Request resolved: meta-pytorch#469

Adds the flat-tree (star) broadcast collective algorithm for the Pipes API. The root rank sends directly to each non-root rank in parallel using warp partitioning.
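The warp-partitioning idea can be sketched on the host side as a mapping from warp to destination rank (the round-robin convention and function name here are illustrative assumptions, not the actual Pipes implementation):

```cpp
// Sketch of flat-tree warp partitioning: the root's warps are divided
// round-robin across the N-1 non-root ranks, so every peer is served
// in parallel rather than sequentially.
int warp_to_peer(int warp_id, int n_ranks, int root) {
    int n_peers = n_ranks - 1;         // everyone except the root
    int peer_idx = warp_id % n_peers;  // round-robin over peer slots
    // Map the peer slot to an actual rank, skipping the root's own rank.
    return peer_idx < root ? peer_idx : peer_idx + 1;
}
```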

Also introduces the broadcast test infrastructure including:
- Base test fixtures (`BroadcastTestFixture`, `BroadcastParamTest`)
- Parameterized test configurations for various message sizes (64B-1MB)
- Edge case tests for single-rank and zero-byte broadcasts

Comprehensive benchmarks for the Pipes Broadcast collective are introduced in D91727873, and D91715149 includes a deep-dive into the performance of the Flat-Tree algorithm.

## Future Work
Abstract the topology into a topology class so that we don't need a separate topology implementation per collective. The topology itself is just the series of steps to execute. This is being tracked in T253140119.

Differential Revision: D91697523
Summary:
Adds the main broadcast benchmark suite that compares Pipes broadcast implementations
against NCCL baseline across various message sizes, algorithms, and configurations.

**Available Benchmarks (7 total):**
- algorithm: Compares flat-tree vs binomial tree algorithms against NCCL
- clustered: Compares standard vs clustered kernel launch
- rootsweep: Tests all ranks as root to identify topology-dependent variations
- extended: Extended sweep from 64B to 256MB with adaptive auto-tuning
- gridconfig: Sweeps block/thread configurations for 16MB messages

Also includes:
- BroadcastTimingStats struct for detailed profiling support
- broadcastFlatKernel declaration and implementation

Build target: //comms/pipes/benchmarks:broadcast_benchmark (8 GPUs per node)

Differential Revision: D91727873
Summary:
Adds the binomial tree broadcast algorithm for the Pipes API.
This algorithm uses O(log N) rounds for better bandwidth efficiency on large messages:
- Round 0: Root sends to rank 1
- Round 1: Ranks 0,1 send to ranks 2,3
- Round 2: Ranks 0-3 send to ranks 4-7
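The rounds listed above can be sketched as a host-side schedule (root fixed at rank 0 for simplicity; names are illustrative, and the real kernel additionally chunks and pipelines the payload):

```cpp
#include <utility>
#include <vector>

// Binomial-tree broadcast schedule: in round r, ranks [0, 2^r) each send
// to their partner 2^r ranks away, doubling the informed set every round,
// so ceil(log2(N)) rounds suffice.
std::vector<std::pair<int, int>> binomial_round(int round, int n_ranks) {
    std::vector<std::pair<int, int>> sends;
    int half = 1 << round;  // ranks below `half` already hold the data
    for (int src = 0; src < half; ++src) {
        int dst = src + half;
        if (dst < n_ranks) sends.push_back({src, dst});
    }
    return sends;
}
```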

Includes chunk-based pipelining for large messages and parameterized tests
covering message sizes from 64KB to 1MB.

---

Also introduces two additional benchmarks:
- optimal: Tests pre-tuned optimal configurations for key message sizes
- tuning: Parameter sweeps for staging buffer sizes

Differential Revision: D91729677
Labels

CLA Signed · fb-exported · meta-exported
