Add NCCL GIN device-side implementation and integration tests #472
Open
goelayu wants to merge 4 commits into meta-pytorch:main from goelayu:export-D91629468
Conversation
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026:

…ytorch#472)

Summary: Completes the device-side implementation of the TorchComms Device API, enabling CUDA kernels to perform GPU-initiated networking via NCCL GIN. The implementation uses a header-only design to work with NCCL GIN's templated APIs, provides simplified semantics (GE-only comparisons, ADD-only signals) to match current NCCL GIN capabilities, and includes integration tests with CI gating.

## Key Changes
- **Device Implementation**: Header-only `TorchCommDeviceNCCLX.cuh` implementing all Device API methods using NCCL GIN primitives
- **Integration Tests**: `DeviceApiTest` suite validating window creation, buffer registration, and signal/counter allocation
- **CI Gating**: Environment-based test gating via the `RUN_DEVICE_API_TEST` flag

## Semantic Gaps
- **CmpOp Support**: Only the `GE` (>=) comparison is supported; other operators return an error (-1)
- **SignalOp Support**: Only the `ADD` operation is supported; `SET` is deferred until NCCL adds support
- **Fence**: No-op for the GIN backend (ordering is guaranteed by NCCL); LSA support will need an implementation
- **Barrier**: No-op placeholder; a world-scope barrier requires host-side `ncclGinBarrierHandle` allocation

## Next Main TODOs
1. **LSA Support**: Add P2P/NVLink direct-access transport alongside RDMA
2. **Scope Flexibility**: Expose a cooperation scope API (currently hardcoded to thread scope)
3. **Barrier Implementation**: Allocate a world-scope barrier handle via `ncclGinBarrierCreateRequirement` on the host
4. **Fix GIN source buffer registration**: `register_local_buffer()` fails for device-side put because split-comm windows have separate window tables from the parent `ncclDevComm`. The current workaround uses collective window registration. We need to either implement proper non-collective GIN registration, use the LSA path, or work with the NCCL team on cross-comm window access.

Differential Revision: D91629468
b7b0902 to f8abc2b (Compare)
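Since the commit message above describes the Device API only in terms of its semantics (GE-only waits, ADD-only signals, GPU-initiated puts), here is a minimal CUDA sketch of how a kernel might drive such an API. The handle type and function names (`DeviceWindow`, `window_put`, `window_signal_add`, `window_wait_signal_ge`) are hypothetical stand-ins, not the actual interface of `TorchCommDeviceNCCLX.cuh`, and the bodies are stubs.

```cuda
// Hypothetical sketch only: the real Device API in TorchCommDeviceNCCLX.cuh
// wraps NCCL GIN primitives; the types and names below are invented for
// illustration and the function bodies are stubs.
#include <cstddef>
#include <cstdint>

struct DeviceWindow;  // opaque device-side window handle (illustrative)

// Stand-ins for GIN-backed operations; a real implementation would issue
// GPU-initiated RDMA through NCCL GIN.
__device__ int window_put(DeviceWindow*, int /*peer*/, const void* /*src*/,
                          size_t /*dst_off*/, size_t /*bytes*/) { return 0; }
__device__ int window_signal_add(DeviceWindow*, int /*peer*/,
                                 uint32_t /*signal_id*/, uint64_t /*delta*/) { return 0; }
__device__ int window_wait_signal_ge(DeviceWindow*, uint32_t /*signal_id*/,
                                     uint64_t /*threshold*/) { return 0; }

// One thread performs a put, bumps the peer's signal with ADD, then waits
// until its own signal is >= 1 (the only comparison currently supported).
__global__ void exchange_kernel(DeviceWindow* win, int peer,
                                const float* src, size_t bytes) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    window_put(win, peer, src, /*dst_off=*/0, bytes);
    window_signal_add(win, peer, /*signal_id=*/0, /*delta=*/1);
    window_wait_signal_ge(win, /*signal_id=*/0, /*threshold=*/1);
  }
}
```

Under the gating described above, an integration test would launch such a kernel only when the `RUN_DEVICE_API_TEST` environment variable is set.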
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying essentially the same commit message as above. f8abc2b to e16e4f7 (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying the same commit message. e16e4f7 to a8a0e0f (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying the same commit message. a8a0e0f to 08b920c (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying the same commit message. 08b920c to 6e190db (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 30, 2026, carrying the same commit message. 6e190db to c448b6f (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 30, 2026, carrying the same commit message. c448b6f to 6ddddde (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 31, 2026, carrying the same commit message (now also noting "Reviewed By: siyengar"). 6ddddde to 84398af (Compare)
[torchcomms][ncclx] Add device API infrastructure with explicit window path control (D91499629)

Summary: This diff adds the host-side infrastructure for the TorchComms device API, enabling GPU-initiated networking (GIN) from CUDA kernels.

## Key Design Decisions

### 1. Dual Window Architecture
`tensor_register()` now creates **both**:
- a **CTRAN window** for the host API (`put`/`signal`/`wait_signal`), and
- an **NCCL orig window** for the device API with GIN support.

This allows the host and device APIs to coexist.

### 2. Explicit Window Path Control
Added an `NCCL_WIN_FORCE_ORIG_PATH` flag to NCCLX that bypasses the global `NCCL_RMA_ALGO` environment variable. This allows device API windows to always use the NCCL orig path (which has GIN support) regardless of the default RMA algorithm setting.

### 3. Non-Collective Local Buffer Registration
Uses `ncclCommSplit` to create a 1-rank local communicator that shares `ginState` with the parent. This makes `register_local_buffer()` truly **non-collective**, since all bootstrap barriers become no-ops when `nranks=1` (see the sketch after this summary).

### 4. Single Synchronization Point
All collective operations (`initLocalComm()`, `initNcclOrigWindow()`) now happen in `tensor_register()`, which is already a well-defined synchronization point. This prevents potential deadlocks from lazy initialization.

## Changes

### NCCLX Core (v2_27 and v2_28)
- Added the `NCCL_WIN_FORCE_ORIG_PATH` (0x02) window flag to `nccl.h.in`
- Modified `ncclCommWindowRegister()` to check for this flag and bypass the CTRAN path when it is set

### TorchComms NcclxApi
- Added a `winFlags` parameter to `commWindowRegister()` with default `NCCL_WIN_DEFAULT`

### TorchCommWindowNCCLX
- **`tensor_register()`**: Now initializes the local communicator and the NCCL orig window alongside the CTRAN window
- **`register_local_buffer()`**: Registers source buffers for device-side put operations (non-collective)
- **`deregister_local_buffer()`**: Deregisters local buffers (non-collective)
- **`get_device_window()`**: Placeholder for device window allocation (to be implemented in Diff 2B)
- Added cleanup logic in the destructor for the new resources

### BUCK
- Added a `device-api-headers` library target for `TorchCommDeviceComm.hpp`
- Added a dependency on `device-api-headers` to the ncclx libraries

Differential Revision: D91499629
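As a rough illustration of design decision 3, the snippet below isolates the 1-rank `ncclCommSplit` idea under the assumption that each rank splits with its own color. It is a sketch only, not the actual `TorchCommWindowNCCLX::initLocalComm()` code, and error handling is omitted.

```cpp
// Illustration of the 1-rank split-communicator idea; not the real
// TorchCommWindowNCCLX implementation.
#include <nccl.h>

// Create a single-rank communicator from the parent. Because every rank uses
// a distinct color, the resulting communicator has nranks == 1, so bootstrap
// barriers inside later registration calls degenerate to local no-ops,
// making buffer registration effectively non-collective.
ncclResult_t makeLocalComm(ncclComm_t parent, int myRank, ncclComm_t* localComm) {
  return ncclCommSplit(parent, /*color=*/myRank, /*key=*/0, localComm,
                       /*config=*/nullptr);
}
```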
Summary: Implements the `get_device_window()` function that creates device-side window structures for GPU-initiated networking.

The implementation uses NCCL's native GIN (GPU-Initiated Networking) mechanisms for signals and counters rather than allocating separate arrays, avoiding memory duplication. The `ncclDevComm` created via `ncclDevCommCreate` already contains GIN signals/counters based on the `ncclDevCommRequirements`, so we leverage those directly.

This also makes `TorchCommDeviceComm.hpp` backend-agnostic by replacing ncclDevComm-specific types with opaque `void*` pointers, enabling future NVSHMEM backend support.

## Key Changes
- Added a `DeviceBackend` abstraction interface for backend-specific device state management
- Implemented `NCCLGinDeviceBackend` for the NCCL GIN backend
- Updated `get_device_window()` to use a pass-by-value pattern (matching NCCL's `ncclDevComm` pattern)
- Added a `CudaApi` memcpy wrapper and mock
- Added unit tests for `TorchCommWindowNCCLX` and `NCCLGinDeviceBackend`

Differential Revision: D91536533
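To make the opaque-pointer design concrete, here is a minimal sketch of what a backend-agnostic descriptor and backend interface along these lines could look like. The names, fields, and method shapes are assumptions for illustration, not the real `DeviceBackend`/`NCCLGinDeviceBackend` definitions from `TorchCommDeviceComm.hpp`.

```cpp
// Minimal sketch of a backend-agnostic device-window descriptor; names and
// layout are illustrative assumptions, not the real TorchCommDeviceComm.hpp.
#include <cstddef>

struct DeviceWindowDesc {
  void* backendComm;   // opaque backend state (e.g. an ncclDevComm for GIN)
  void* remoteWindow;  // opaque handle to the registered remote window
  void* signals;       // backend-managed GIN signals (no separate allocation)
  void* counters;      // backend-managed GIN counters
};

class DeviceBackend {
 public:
  virtual ~DeviceBackend() = default;
  // Returned by value so the descriptor can be copied into kernel arguments,
  // mirroring how ncclDevComm itself is passed to device code.
  virtual DeviceWindowDesc getDeviceWindow(size_t windowIndex) = 0;
};

class NCCLGinDeviceBackend final : public DeviceBackend {
 public:
  DeviceWindowDesc getDeviceWindow(size_t windowIndex) override {
    // Real code would populate this from the ncclDevComm created via
    // ncclDevCommCreate; zero-initialized here as a placeholder.
    (void)windowIndex;
    return DeviceWindowDesc{};
  }
};
```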
Summary:
Fix "Double registration of singletons" error in NCCLX by converting folly::Singleton to Meyers singleton pattern.
The library_object_internal static library gets linked into multiple shared objects (DSOs) simultaneously:
libcomms_torchcomms_ncclx_ncclx-api.so (torchcomms NCCLX API)
libcaffe2__libtorch_cuda.so (via torch-cpp → gloo_gpu_cuda → gloo/nccl → comms/ncclx:nccl)
When both DSOs are loaded into the same process, each contains its own copy of the folly::Singleton registration code. During static initialization, folly detects the duplicate registration and aborts:
```
Double registration of singletons of the same underlying type;
check for multiple definitions of type folly::Singleton<ncclx::comms_monitor::CommsMonitor>
```
This blocks the upcoming torchcomms Device API integration tests that exercise the full NCCLX stack with PyTorch/libtorch dependencies.
## What We Tried (and Why It Didn't Work)

### 1. Unifying the dependency chain via BUCK/BZL changes
- Modified `get_ncclx_dependency()` in `nccl_deps.bzl` to return `nccl-internal` instead of `nccl`
- Updated the torchcomms BUCK files to use `nccl-internal` consistently
- Result: Still failed. The `library_object_internal` static library was still being compiled into multiple DSOs regardless of the logical dependency unification. This approach also required touching main build configuration files outside of ncclx.

### 2. Converting to folly::LeakySingleton
- Changed from `folly::Singleton` to `folly::LeakySingleton`
- Result: Same error (`singletonWarnLeakyDoubleRegistrationAndAbort`). `LeakySingleton` uses the same global registration mechanism; "leaky" only affects destruction behavior, not registration.

### 3. Making nccl-internal a proper shared library
- Modified `nccl_build_config.bzl` to build `nccl-internal` as a DSO with a soname
- Result: Would have worked, but required extensive changes to the build infrastructure and affects the entire NCCLX ecosystem.
## The Solution: Meyers Singleton

Converted all `folly::Singleton` instances to use local static variables in `getInstance()`:

```cpp
// Before
folly::Singleton<CommsMonitor, CommsMonitorSingletonTag> commsMonitorSingleton;

std::shared_ptr<CommsMonitor> CommsMonitor::getInstance() {
  return commsMonitorSingleton.try_get();
}

// After
std::shared_ptr<CommsMonitor> CommsMonitor::getInstance() {
  static CommsMonitor instance;
  return std::shared_ptr<CommsMonitor>(&instance, [](CommsMonitor*) {});
}
```

C++11 guarantees that local static variables are initialized exactly once in a thread-safe manner. Unlike `folly::Singleton`, there is no global registration mechanism that can conflict across DSO boundaries.
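As a standalone illustration of the C++11 guarantee relied on here (not taken from the NCCLX sources), the toy program below constructs the function-local static exactly once even when several threads race on the first `getInstance()` call, while a no-op deleter preserves the `std::shared_ptr`-returning API:

```cpp
// Standalone Meyers-singleton demo; class and names are illustrative only.
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

class Registry {
 public:
  static std::shared_ptr<Registry> getInstance() {
    // C++11: initialization of a function-local static is thread-safe and
    // runs exactly once, even under concurrent first use.
    static Registry instance;
    // No-op deleter: callers keep the shared_ptr API, but the
    // process-lifetime instance is never destroyed.
    return std::shared_ptr<Registry>(&instance, [](Registry*) {});
  }

 private:
  Registry() { std::printf("constructed once\n"); }
};

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) {
    threads.emplace_back([] { Registry::getInstance(); });
  }
  for (auto& t : threads) t.join();
  return 0;  // "constructed once" is printed exactly one time
}
```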
## Files Changed
- `v2_27/meta/comms-monitor/CommsMonitor.cc`
- `v2_28/meta/comms-monitor/CommsMonitor.cc`
- `v2_27/meta/hints/GlobalHints.cc`
- `v2_28/meta/hints/GlobalHints.cc`
- `v2_27/meta/algoconf/AlgoConfig.cc`
- `v2_28/meta/algoconf/AlgoConfig.cc`
- `v2_27/meta/analyzer/NCCLXCommsTracingServiceUtil.cc`
- `v2_28/meta/analyzer/NCCLXCommsTracingServiceUtil.cc`
## Impact on External Codebases

This change should be transparent to external consumers of NCCLX:
- The `getInstance()` API remains unchanged (it still returns a `std::shared_ptr`)
- Singleton behavior is preserved (single instance, thread-safe initialization)
- No changes to public headers or build targets

The only behavioral difference is that the singletons now live forever (no-op destructor) rather than being destroyed during `folly::SingletonVault` shutdown. This is acceptable because:
- These singletons were already effectively immortal (process-lifetime)
- NCCLX cleanup happens through explicit `ncclCommDestroy` calls, not singleton destruction
## Test Plan
- Verified that `DeviceApiTest_1x8_backend_ncclx_fast_init_default_alloc_ctran` now passes singleton initialization (the test proceeds past static init and fails only because there are no GPUs on the devserver, as expected)
- Existing NCCLX unit tests continue to pass
- `buck2 build` for the affected targets succeeds

Differential Revision: D91628869
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 31, 2026, carrying the same commit message as the earlier pushes (with "Reviewed By: siyengar"). 84398af to b530b83 (Compare)
A further push carried the same commit message. b530b83 to 04fe608 (Compare)