Conversation

goelayu commented Jan 28, 2026

Summary:
Completes the device-side implementation of the TorchComms Device API, enabling CUDA kernels to perform GPU-initiated networking via NCCL GIN.

The implementation uses a header-only design to work with NCCL GIN's templated APIs, provides simplified semantics (GE-only comparisons, ADD-only signals) to match current NCCL GIN capabilities, and includes integration tests with CI gating.

Key Changes

  • Device Implementation: Header-only TorchCommDeviceNCCLX.cuh implementing all Device API methods using NCCL GIN primitives
  • Integration Tests: DeviceApiTest suite validating window creation, buffer registration, and signal/counter allocation
  • CI Gating: Environment-based test gating via RUN_DEVICE_API_TEST flag
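For reference, a minimal sketch (assuming a GoogleTest fixture; the actual DeviceApiTest wiring may differ) of how the RUN_DEVICE_API_TEST flag can gate the suite:

```cpp
#include <cstdlib>
#include <gtest/gtest.h>

// Hypothetical gating sketch: skip every test in the suite unless the CI flag is set.
class DeviceApiTest : public ::testing::Test {
 protected:
  void SetUp() override {
    if (std::getenv("RUN_DEVICE_API_TEST") == nullptr) {
      GTEST_SKIP() << "Set RUN_DEVICE_API_TEST=1 to run Device API integration tests";
    }
  }
};
```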

Semantic Gaps

  • CmpOp Support: Only GE (>=) comparison supported; other operators return error (-1)
  • SignalOp Support: Only ADD operation supported; SET deferred until NCCL adds support
  • Fence: No-op for GIN backend (ordering guaranteed by NCCL); LSA support will need implementation
  • Barrier: No-op placeholder; world-scope barrier requires host-side ncclGinBarrierHandle allocation
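To make the supported subset concrete, here is a hedged device-side sketch; the type and method names (TorchCommDeviceWindow, put, signal, wait_signal, CmpOp, SignalOp) follow the description above and are illustrative rather than the exact Device API signatures:

```cuda
// Illustrative only: one rank writes into a peer's registered window and bumps a
// signal (ADD is the only SignalOp today); the other rank spins on the signal
// (GE is the only CmpOp today; other comparisons return -1).
__global__ void exchange(TorchCommDeviceWindow win, int peer, bool is_sender,
                         const void* src, size_t nbytes) {
  if (is_sender) {
    win.put(peer, src, nbytes, /*dst_offset=*/0);
    win.signal(peer, /*signal_id=*/0, /*value=*/1, SignalOp::ADD);
  } else {
    win.wait_signal(/*signal_id=*/0, /*expected=*/1, CmpOp::GE);
  }
}
```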

Next Main TODOs

  1. LSA Support: Add P2P/NVLink direct access transport alongside RDMA
  2. Scope Flexibility: Expose cooperation scope API (currently hardcoded to thread scope)
  3. Barrier Implementation: Allocate world-scope barrier handle via ncclGinBarrierCreateRequirement at host

Differential Revision: D91629468

meta-cla bot added the CLA Signed label on Jan 28, 2026
meta-codesync bot commented Jan 28, 2026

@goelayu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91629468.

goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Jan 29, 2026
…ytorch#472)

Summary:

Completes the device-side implementation of the TorchComms Device API, enabling CUDA kernels to perform GPU-initiated networking via NCCL GIN.

The implementation uses a header-only design to work with NCCL GIN's templated APIs, provides simplified semantics (GE-only comparisons, ADD-only signals) to match current NCCL GIN capabilities, and includes integration tests with CI gating.

## Key Changes

- **Device Implementation**: Header-only `TorchCommDeviceNCCLX.cuh` implementing all Device API methods using NCCL GIN primitives
- **Integration Tests**: `DeviceApiTest` suite validating window creation, buffer registration, and signal/counter allocation
- **CI Gating**: Environment-based test gating via `RUN_DEVICE_API_TEST` flag

## Semantic Gaps

- **CmpOp Support**: Only `GE` (>=) comparison supported; other operators return error (-1)
- **SignalOp Support**: Only `ADD` operation supported; `SET` deferred until NCCL adds support
- **Fence**: No-op for GIN backend (ordering guaranteed by NCCL); LSA support will need implementation
- **Barrier**: No-op placeholder; world-scope barrier requires host-side `ncclGinBarrierHandle` allocation

## Next Main TODOs

1. **LSA Support**: Add P2P/NVLink direct access transport alongside RDMA
2. **Scope Flexibility**: Expose cooperation scope API (currently hardcoded to thread scope)
3. **Barrier Implementation**: Allocate world-scope barrier handle via `ncclGinBarrierCreateRequirement` at host
4. **Fix GIN source buffer registration**: `register_local_buffer()` fails for device-side put because split-comm windows have separate window tables from the parent `ncclDevComm`. The current workaround uses collective window registration. We need to either implement proper non-collective GIN registration, use the LSA path, or work with the NCCL team on cross-comm window access.

Differential Revision: D91629468
Ayush Goel added 3 commits January 30, 2026 16:11
Summary:

This diff adds the host-side infrastructure for TorchComms device API, enabling GPU-initiated networking (GIN) from CUDA kernels.

## Key Design Decisions

### 1. Dual Window Architecture
`tensor_register()` now creates **BOTH**:
- A **CTRAN window** for host API (`put`/`signal`/`wait_signal`)
- An **NCCL orig window** for device API with GIN support

This allows host and device APIs to coexist.

### 2. Explicit Window Path Control
Added `NCCL_WIN_FORCE_ORIG_PATH` flag to NCCLX that bypasses the global `NCCL_RMA_ALGO` environment variable. This allows device API windows to always use the NCCL orig path (which has GIN support) regardless of the default RMA algorithm setting.
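A minimal host-side sketch of passing the flag (error handling elided; `NCCL_WIN_FORCE_ORIG_PATH` is the NCCLX-specific flag added here, while the registration call itself is the existing `ncclCommWindowRegister()` API):

```cpp
// Register a buffer as a device-API window, forcing the NCCL orig path
// (which carries GIN support) regardless of NCCL_RMA_ALGO.
ncclWindow_t win = nullptr;
ncclResult_t res = ncclCommWindowRegister(
    comm, buffer, nbytes, &win, NCCL_WIN_FORCE_ORIG_PATH);
```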

### 3. Non-Collective Local Buffer Registration
Uses `ncclCommSplit` to create a 1-rank local communicator that shares `ginState` with the parent. This enables `register_local_buffer()` to be truly **non-collective** since all bootstrap barriers become no-ops when `nranks=1`.
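A hedged sketch of the 1-rank split (variable names are illustrative):

```cpp
// Give every rank its own single-rank communicator by using its rank as the
// split color; bootstrap barriers on localComm are then no-ops (nranks == 1),
// so later registrations against it do not block on peers.
ncclComm_t localComm = nullptr;
ncclResult_t res = ncclCommSplit(
    parentComm, /*color=*/parentRank, /*key=*/0, &localComm, /*config=*/nullptr);
```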

### 4. Single Synchronization Point
All collective operations (`initLocalComm()`, `initNcclOrigWindow()`) now happen in `tensor_register()`, which is already a well-defined synchronization point. This prevents potential deadlocks from lazy initialization.

## Changes

### NCCLX Core (v2_27 and v2_28)
- Added `NCCL_WIN_FORCE_ORIG_PATH` (0x02) window flag to `nccl.h.in`
- Modified `ncclCommWindowRegister()` to check for this flag and bypass CTRAN path when set

### TorchComms NcclxApi
- Added `winFlags` parameter to `commWindowRegister()` with default `NCCL_WIN_DEFAULT`

### TorchCommWindowNCCLX
- **`tensor_register()`**: Now initializes local communicator and NCCL orig window alongside CTRAN window
- **`register_local_buffer()`**: Registers source buffers for device-side put operations (non-collective)
- **`deregister_local_buffer()`**: Deregisters local buffers (non-collective)
- **`get_device_window()`**: Placeholder for device window allocation (to be implemented in Diff 2B)
- Added cleanup logic in destructor for new resources

### BUCK
- Added `device-api-headers` library target for `TorchCommDeviceComm.hpp`
- Added dependency on `device-api-headers` to ncclx libraries

Differential Revision: D91499629

[torchcomms][ncclx] Add device API infrastructure with explicit window path control

D91499629
Summary:
Implements the get_device_window() function that creates device-side window structures for GPU-initiated networking.

The implementation uses NCCL's native GIN (GPU-Initiated Networking) mechanisms for signals and counters rather than allocating separate arrays, avoiding memory duplication. The ncclDevComm created via ncclDevCommCreate already contains GIN signals/counters based on the ncclDevCommRequirements, so we leverage those directly.

Also makes TorchCommDeviceComm.hpp backend-agnostic by replacing ncclDevComm-specific types with opaque void* pointers, enabling future NVSHMEM backend support.
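For illustration, a hedged sketch of what the backend-agnostic handle can look like; the field names are hypothetical and need not match TorchCommDeviceComm.hpp:

```cpp
// Backend-specific state hides behind opaque pointers so the same header works
// for the NCCL GIN backend today and, e.g., an NVSHMEM backend later.
struct TorchCommDeviceComm {
  void* backend_state;  // e.g. a pointer to the ncclDevComm for the GIN backend
  void* signals;        // backend-owned GIN signal storage (not duplicated here)
  void* counters;       // backend-owned GIN counter storage
  int rank;
  int world_size;
};
```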

## Key Changes

- Added DeviceBackend abstraction interface for backend-specific device state management
- Implemented NCCLGinDeviceBackend for NCCL GIN backend
- Updated get_device_window() to use a pass-by-value pattern (matching NCCL's ncclDevComm)
- Added CudaApi memcpy wrapper and mock
- Added unit tests for TorchCommWindowNCCLX and NCCLGinDeviceBackend

Differential Revision: D91536533
Summary:
Fix "Double registration of singletons" error in NCCLX by converting folly::Singleton to Meyers singleton pattern.

The `library_object_internal` static library gets linked into multiple shared objects (DSOs) simultaneously:

- libcomms_torchcomms_ncclx_ncclx-api.so (torchcomms NCCLX API)
- libcaffe2__libtorch_cuda.so (via torch-cpp → gloo_gpu_cuda → gloo/nccl → comms/ncclx:nccl)

When both DSOs are loaded into the same process, each contains its own copy of the folly::Singleton registration code. During static initialization, folly detects the duplicate registration and aborts:

```
Double registration of singletons of the same underlying type;
check for multiple definitions of type folly::Singleton<ncclx::comms_monitor::CommsMonitor>
```

This blocks the upcoming torchcomms Device API integration tests that exercise the full NCCLX stack with PyTorch/libtorch dependencies.

## What We Tried (and Why It Didn't Work)

### 1. Unifying the dependency chain via BUCK/BZL changes

- Modified `get_ncclx_dependency()` in `nccl_deps.bzl` to return `nccl-internal` instead of `nccl`
- Updated torchcomms BUCK files to use `nccl-internal` consistently
- Result: Still failed. The `library_object_internal` static library was still being compiled into multiple DSOs regardless of the logical dependency unification. This approach also required touching main build configuration files outside of ncclx.

### 2. Converting to folly::LeakySingleton

- Changed from `folly::Singleton` to `folly::LeakySingleton`
- Result: Same error (`singletonWarnLeakyDoubleRegistrationAndAbort`). LeakySingleton uses the same global registration mechanism; "leaky" only affects destruction behavior, not registration.

### 3. Making nccl-internal a proper shared library

- Modified `nccl_build_config.bzl` to build `nccl-internal` as a DSO with soname
- Result: Would have worked, but required extensive changes to build infrastructure and affects the entire NCCLX ecosystem.

## The Solution: Meyers Singleton

Converted all `folly::Singleton` instances to use local static variables in `getInstance()`:

```cpp
// Before
folly::Singleton<CommsMonitor, CommsMonitorSingletonTag> commsMonitorSingleton;
std::shared_ptr<CommsMonitor> CommsMonitor::getInstance() {
  return commsMonitorSingleton.try_get();
}

// After
std::shared_ptr<CommsMonitor> CommsMonitor::getInstance() {
  static CommsMonitor instance;
  return std::shared_ptr<CommsMonitor>(&instance, [](CommsMonitor*) {});
}
```

C++11 guarantees local static variables are initialized exactly once in a thread-safe manner. Unlike folly::Singleton, there's no global registration mechanism that can conflict across DSO boundaries.

## Files Changed

- v2_27/meta/comms-monitor/CommsMonitor.cc
- v2_28/meta/comms-monitor/CommsMonitor.cc
- v2_27/meta/hints/GlobalHints.cc
- v2_28/meta/hints/GlobalHints.cc
- v2_27/meta/algoconf/AlgoConfig.cc
- v2_28/meta/algoconf/AlgoConfig.cc
- v2_27/meta/analyzer/NCCLXCommsTracingServiceUtil.cc
- v2_28/meta/analyzer/NCCLXCommsTracingServiceUtil.cc
## Impact on External Codebases

This change should be transparent to external consumers of NCCLX:

- The getInstance() API remains unchanged (still returns std::shared_ptr)
- Singleton behavior is preserved (single instance, thread-safe initialization)
- No changes to public headers or build targets

The only behavioral difference is that singletons now live forever (no-op destructor) rather than being destroyed during folly::SingletonVault shutdown. This is acceptable because:

- These singletons were already effectively immortal (process-lifetime)
- NCCLX cleanup happens through explicit ncclCommDestroy calls, not singleton destruction
## Test Plan

- Verified DeviceApiTest_1x8_backend_ncclx_fast_init_default_alloc_ctran now passes singleton initialization (the test proceeds past static init and fails only because there are no GPUs on the devserver, as expected)
- Existing NCCLX unit tests continue to pass
- buck2 build for affected targets succeeds

Differential Revision: D91628869