Add NCCL GIN device-side implementation and integration tests #472
Open
goelayu wants to merge 4 commits into meta-pytorch:main from goelayu:export-D91629468
Conversation
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026:

…ytorch#472)

Summary: Completes the device-side implementation of the TorchComms Device API, enabling CUDA kernels to perform GPU-initiated networking via NCCL GIN. The implementation uses a header-only design to work with NCCL GIN's templated APIs, provides simplified semantics (GE-only comparisons, ADD-only signals) to match current NCCL GIN capabilities, and includes integration tests with CI gating.

## Key Changes
- **Device Implementation**: Header-only `TorchCommDeviceNCCLX.cuh` implementing all Device API methods using NCCL GIN primitives
- **Integration Tests**: `DeviceApiTest` suite validating window creation, buffer registration, and signal/counter allocation
- **CI Gating**: Environment-based test gating via the `RUN_DEVICE_API_TEST` flag

## Semantic Gaps
- **CmpOp Support**: Only the `GE` (>=) comparison is supported; other operators return an error (-1)
- **SignalOp Support**: Only the `ADD` operation is supported; `SET` is deferred until NCCL adds support
- **Fence**: No-op for the GIN backend (ordering is guaranteed by NCCL); LSA support will need an implementation
- **Barrier**: No-op placeholder; a world-scope barrier requires host-side `ncclGinBarrierHandle` allocation

## Next Main TODOs
1. **LSA Support**: Add P2P/NVLink direct-access transport alongside RDMA
2. **Scope Flexibility**: Expose a cooperation scope API (currently hardcoded to thread scope)
3. **Barrier Implementation**: Allocate a world-scope barrier handle via `ncclGinBarrierCreateRequirement` on the host
4. **Fix GIN source buffer registration**: `register_local_buffer()` fails for device-side put because split-comm windows have separate window tables from the parent `ncclDevComm`. The current workaround uses collective window registration. We need to either implement proper non-collective GIN registration, use the LSA path, or work with the NCCL team on cross-comm window access.

Differential Revision: D91629468
b7b0902 to f8abc2b (Compare)
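Since the commit message above describes the Device API only in terms of its semantics (GE-only waits, ADD-only signals, GPU-initiated puts), here is a minimal CUDA sketch of how a kernel might drive such an API. The handle type and function names (`DeviceWindow`, `window_put`, `window_signal_add`, `window_wait_signal_ge`) are hypothetical stand-ins, not the actual interface of `TorchCommDeviceNCCLX.cuh`, and the bodies are stubs.

```cuda
// Hypothetical sketch only: the real Device API in TorchCommDeviceNCCLX.cuh
// wraps NCCL GIN primitives; the types and names below are invented for
// illustration and the function bodies are stubs.
#include <cstddef>
#include <cstdint>

struct DeviceWindow;  // opaque device-side window handle (illustrative)

// Stand-ins for GIN-backed operations; a real implementation would issue
// GPU-initiated RDMA through NCCL GIN.
__device__ int window_put(DeviceWindow*, int /*peer*/, const void* /*src*/,
                          size_t /*dst_off*/, size_t /*bytes*/) { return 0; }
__device__ int window_signal_add(DeviceWindow*, int /*peer*/,
                                 uint32_t /*signal_id*/, uint64_t /*delta*/) { return 0; }
__device__ int window_wait_signal_ge(DeviceWindow*, uint32_t /*signal_id*/,
                                     uint64_t /*threshold*/) { return 0; }

// One thread performs a put, bumps the peer's signal with ADD, then waits
// until its own signal is >= 1 (the only comparison currently supported).
__global__ void exchange_kernel(DeviceWindow* win, int peer,
                                const float* src, size_t bytes) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    window_put(win, peer, src, /*dst_off=*/0, bytes);
    window_signal_add(win, peer, /*signal_id=*/0, /*delta=*/1);
    window_wait_signal_ge(win, /*signal_id=*/0, /*threshold=*/1);
  }
}
```

Under the gating described above, an integration test would launch such a kernel only when the `RUN_DEVICE_API_TEST` environment variable is set.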
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying essentially the same commit message as above. f8abc2b to e16e4f7 (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying the same commit message. e16e4f7 to a8a0e0f (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying the same commit message. a8a0e0f to 08b920c (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 29, 2026, carrying the same commit message. 08b920c to 6e190db (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 30, 2026, carrying the same commit message. 6e190db to c448b6f (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 30, 2026, carrying the same commit message. c448b6f to 6ddddde (Compare)
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 31, 2026, carrying the same commit message (now also noting "Reviewed By: siyengar"). 6ddddde to 84398af (Compare)
[torchcomms][ncclx] Add device API infrastructure with explicit window path control (D91499629)

Summary: This diff adds the host-side infrastructure for the TorchComms device API, enabling GPU-initiated networking (GIN) from CUDA kernels.

## Key Design Decisions

### 1. Dual Window Architecture
`tensor_register()` now creates **both**:
- a **CTRAN window** for the host API (`put`/`signal`/`wait_signal`), and
- an **NCCL orig window** for the device API with GIN support.

This allows the host and device APIs to coexist.

### 2. Explicit Window Path Control
Added an `NCCL_WIN_FORCE_ORIG_PATH` flag to NCCLX that bypasses the global `NCCL_RMA_ALGO` environment variable. This allows device API windows to always use the NCCL orig path (which has GIN support) regardless of the default RMA algorithm setting.

### 3. Non-Collective Local Buffer Registration
Uses `ncclCommSplit` to create a 1-rank local communicator that shares `ginState` with the parent. This makes `register_local_buffer()` truly **non-collective**, since all bootstrap barriers become no-ops when `nranks=1` (see the sketch after this summary).

### 4. Single Synchronization Point
All collective operations (`initLocalComm()`, `initNcclOrigWindow()`) now happen in `tensor_register()`, which is already a well-defined synchronization point. This prevents potential deadlocks from lazy initialization.

## Changes

### NCCLX Core (v2_27 and v2_28)
- Added the `NCCL_WIN_FORCE_ORIG_PATH` (0x02) window flag to `nccl.h.in`
- Modified `ncclCommWindowRegister()` to check for this flag and bypass the CTRAN path when it is set

### TorchComms NcclxApi
- Added a `winFlags` parameter to `commWindowRegister()` with default `NCCL_WIN_DEFAULT`

### TorchCommWindowNCCLX
- **`tensor_register()`**: Now initializes the local communicator and the NCCL orig window alongside the CTRAN window
- **`register_local_buffer()`**: Registers source buffers for device-side put operations (non-collective)
- **`deregister_local_buffer()`**: Deregisters local buffers (non-collective)
- **`get_device_window()`**: Placeholder for device window allocation (to be implemented in Diff 2B)
- Added cleanup logic in the destructor for the new resources

### BUCK
- Added a `device-api-headers` library target for `TorchCommDeviceComm.hpp`
- Added a dependency on `device-api-headers` to the ncclx libraries

Differential Revision: D91499629
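As a rough illustration of design decision 3, the snippet below isolates the 1-rank `ncclCommSplit` idea under the assumption that each rank splits with its own color. It is a sketch only, not the actual `TorchCommWindowNCCLX::initLocalComm()` code, and error handling is omitted.

```cpp
// Illustration of the 1-rank split-communicator idea; not the real
// TorchCommWindowNCCLX implementation.
#include <nccl.h>

// Create a single-rank communicator from the parent. Because every rank uses
// a distinct color, the resulting communicator has nranks == 1, so bootstrap
// barriers inside later registration calls degenerate to local no-ops,
// making buffer registration effectively non-collective.
ncclResult_t makeLocalComm(ncclComm_t parent, int myRank, ncclComm_t* localComm) {
  return ncclCommSplit(parent, /*color=*/myRank, /*key=*/0, localComm,
                       /*config=*/nullptr);
}
```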
Summary: Implements the `get_device_window()` function that creates device-side window structures for GPU-initiated networking.

The implementation uses NCCL's native GIN (GPU-Initiated Networking) mechanisms for signals and counters rather than allocating separate arrays, avoiding memory duplication. The `ncclDevComm` created via `ncclDevCommCreate` already contains GIN signals/counters based on the `ncclDevCommRequirements`, so we leverage those directly.

This also makes `TorchCommDeviceComm.hpp` backend-agnostic by replacing ncclDevComm-specific types with opaque `void*` pointers, enabling future NVSHMEM backend support.

## Key Changes
- Added a `DeviceBackend` abstraction interface for backend-specific device state management
- Implemented `NCCLGinDeviceBackend` for the NCCL GIN backend
- Updated `get_device_window()` to use a pass-by-value pattern (matching NCCL's `ncclDevComm` pattern)
- Added a `CudaApi` memcpy wrapper and mock
- Added unit tests for `TorchCommWindowNCCLX` and `NCCLGinDeviceBackend`

Differential Revision: D91536533
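To make the opaque-pointer design concrete, here is a minimal sketch of what a backend-agnostic descriptor and backend interface along these lines could look like. The names, fields, and method shapes are assumptions for illustration, not the real `DeviceBackend`/`NCCLGinDeviceBackend` definitions from `TorchCommDeviceComm.hpp`.

```cpp
// Minimal sketch of a backend-agnostic device-window descriptor; names and
// layout are illustrative assumptions, not the real TorchCommDeviceComm.hpp.
#include <cstddef>

struct DeviceWindowDesc {
  void* backendComm;   // opaque backend state (e.g. an ncclDevComm for GIN)
  void* remoteWindow;  // opaque handle to the registered remote window
  void* signals;       // backend-managed GIN signals (no separate allocation)
  void* counters;      // backend-managed GIN counters
};

class DeviceBackend {
 public:
  virtual ~DeviceBackend() = default;
  // Returned by value so the descriptor can be copied into kernel arguments,
  // mirroring how ncclDevComm itself is passed to device code.
  virtual DeviceWindowDesc getDeviceWindow(size_t windowIndex) = 0;
};

class NCCLGinDeviceBackend final : public DeviceBackend {
 public:
  DeviceWindowDesc getDeviceWindow(size_t windowIndex) override {
    // Real code would populate this from the ncclDevComm created via
    // ncclDevCommCreate; zero-initialized here as a placeholder.
    (void)windowIndex;
    return DeviceWindowDesc{};
  }
};
```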
Summary:
Fix "Double registration of singletons" error in NCCLX by converting folly::Singleton to Meyers singleton pattern.
The library_object_internal static library gets linked into multiple shared objects (DSOs) simultaneously:
libcomms_torchcomms_ncclx_ncclx-api.so (torchcomms NCCLX API)
libcaffe2__libtorch_cuda.so (via torch-cpp → gloo_gpu_cuda → gloo/nccl → comms/ncclx:nccl)
When both DSOs are loaded into the same process, each contains its own copy of the folly::Singleton registration code. During static initialization, folly detects the duplicate registration and aborts:
```
Double registration of singletons of the same underlying type;
check for multiple definitions of type folly::Singleton<ncclx::comms_monitor::CommsMonitor>
```
This blocks the upcoming torchcomms Device API integration tests that exercise the full NCCLX stack with PyTorch/libtorch dependencies.
## What We Tried (and Why It Didn't Work)

### 1. Unifying the dependency chain via BUCK/BZL changes
- Modified `get_ncclx_dependency()` in `nccl_deps.bzl` to return `nccl-internal` instead of `nccl`
- Updated the torchcomms BUCK files to use `nccl-internal` consistently
- Result: Still failed. The `library_object_internal` static library was still being compiled into multiple DSOs regardless of the logical dependency unification. This approach also required touching main build configuration files outside of ncclx.

### 2. Converting to folly::LeakySingleton
- Changed from `folly::Singleton` to `folly::LeakySingleton`
- Result: Same error (`singletonWarnLeakyDoubleRegistrationAndAbort`). `LeakySingleton` uses the same global registration mechanism; "leaky" only affects destruction behavior, not registration.

### 3. Making nccl-internal a proper shared library
- Modified `nccl_build_config.bzl` to build `nccl-internal` as a DSO with a soname
- Result: Would have worked, but required extensive changes to the build infrastructure and affects the entire NCCLX ecosystem.
## The Solution: Meyers Singleton

Converted all `folly::Singleton` instances to use local static variables in `getInstance()`:

```cpp
// Before
folly::Singleton<CommsMonitor, CommsMonitorSingletonTag> commsMonitorSingleton;

std::shared_ptr<CommsMonitor> CommsMonitor::getInstance() {
  return commsMonitorSingleton.try_get();
}

// After
std::shared_ptr<CommsMonitor> CommsMonitor::getInstance() {
  static CommsMonitor instance;
  return std::shared_ptr<CommsMonitor>(&instance, [](CommsMonitor*) {});
}
```

C++11 guarantees that local static variables are initialized exactly once in a thread-safe manner. Unlike `folly::Singleton`, there is no global registration mechanism that can conflict across DSO boundaries.
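As a standalone illustration of the C++11 guarantee relied on here (not taken from the NCCLX sources), the toy program below constructs the function-local static exactly once even when several threads race on the first `getInstance()` call, while a no-op deleter preserves the `std::shared_ptr`-returning API:

```cpp
// Standalone Meyers-singleton demo; class and names are illustrative only.
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

class Registry {
 public:
  static std::shared_ptr<Registry> getInstance() {
    // C++11: initialization of a function-local static is thread-safe and
    // runs exactly once, even under concurrent first use.
    static Registry instance;
    // No-op deleter: callers keep the shared_ptr API, but the
    // process-lifetime instance is never destroyed.
    return std::shared_ptr<Registry>(&instance, [](Registry*) {});
  }

 private:
  Registry() { std::printf("constructed once\n"); }
};

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) {
    threads.emplace_back([] { Registry::getInstance(); });
  }
  for (auto& t : threads) t.join();
  return 0;  // "constructed once" is printed exactly one time
}
```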
## Files Changed
- `v2_27/meta/comms-monitor/CommsMonitor.cc`
- `v2_28/meta/comms-monitor/CommsMonitor.cc`
- `v2_27/meta/hints/GlobalHints.cc`
- `v2_28/meta/hints/GlobalHints.cc`
- `v2_27/meta/algoconf/AlgoConfig.cc`
- `v2_28/meta/algoconf/AlgoConfig.cc`
- `v2_27/meta/analyzer/NCCLXCommsTracingServiceUtil.cc`
- `v2_28/meta/analyzer/NCCLXCommsTracingServiceUtil.cc`
## Impact on External Codebases

This change should be transparent to external consumers of NCCLX:
- The `getInstance()` API remains unchanged (it still returns a `std::shared_ptr`)
- Singleton behavior is preserved (single instance, thread-safe initialization)
- No changes to public headers or build targets

The only behavioral difference is that the singletons now live forever (no-op destructor) rather than being destroyed during `folly::SingletonVault` shutdown. This is acceptable because:
- These singletons were already effectively immortal (process-lifetime)
- NCCLX cleanup happens through explicit `ncclCommDestroy` calls, not singleton destruction
## Test Plan
- Verified that `DeviceApiTest_1x8_backend_ncclx_fast_init_default_alloc_ctran` now passes singleton initialization (the test proceeds past static init and fails only because there are no GPUs on the devserver, as expected)
- Existing NCCLX unit tests continue to pass
- `buck2 build` for the affected targets succeeds

Differential Revision: D91628869
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Jan 31, 2026, carrying the same commit message as the earlier pushes (with "Reviewed By: siyengar"). 84398af to b530b83 (Compare)
A further push carried the same commit message. b530b83 to 04fe608 (Compare)