feat: add uccl-ep to docker build and add USING_UEP flag to run_pretrain and runner hook #540

zhenhuang12 wants to merge 12 commits into `main`.
Conversation
Pull request overview
This PR adds support for UCCL-EP (Unified Collective Communication Library - Expert Parallelism) to the project, enabling an alternative backend for MoE dispatch/combine operations. The changes include Docker build integration, runtime configuration through environment variables, and conditional logic to select between TURBO and DEEP_EP backends.
Changes:
- Adds UCCL-EP installation to Docker build process with configurable commit hash
- Implements USING_UEP flag to enable UCCL-EP functionality in runner hooks and pretrain scripts
- Updates validation logic to conditionally apply turbo_deepep_num_cu constraint based on backend selection
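
For context, enabling the new path might look like the following (both variables are introduced by this PR; the invocation itself is illustrative, not taken from the diff):

```bash
# Hypothetical invocation: rebuild UCCL inside the container and enable the
# UCCL-EP dispatch/combine path. REBUILD_UCCL and USING_UEP are the variables
# this PR wires through run_pretrain.sh and the runner hook.
REBUILD_UCCL=1 USING_UEP=1 bash examples/run_pretrain.sh
```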
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/ci.yaml | Adds UCCL_COMMIT environment variable and passes it to Docker builds |
| .github/workflows/docker/Dockerfile | Adds UCCL-EP installation steps during Docker image build |
| runner/helpers/hooks/05_using_uep.sh | New hook script to validate and configure UCCL-EP environment |
| examples/run_pretrain.sh | Adds UCCL rebuild option and UEP configuration logic |
| examples/run_local_pretrain.sh | Passes REBUILD_UCCL and USING_UEP environment variables to Docker container |
| examples/moe_package/run_deepseek_v2_pretrain_mi355x.sh | Enables UCCL-EP feature (option 8) in MoE features array |
| examples/moe_package/run_deepseek_v2_lite_pretrain_mi355x.sh | Enables UCCL-EP feature (option 8) in MoE features array |
| primus/modules/trainer/megatron/utils.py | Updates validation to only enforce CU constraint when using TURBO backend |
examples/run_pretrain.sh (outdated)

```bash
git clone https://github.com/uccl-project/uccl.git
cd uccl || exit
cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd ..
```
Hardcoded GPU architectures 'gfx942;gfx950' should be configurable via an environment variable to support different hardware configurations.
Suggested change:

```diff
-cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd ..
+cd ep && PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH:-gfx942;gfx950}" python3 setup.py build && cd ..
```
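As background on the suggested pattern, `${VAR:-default}` expands to the default only when `VAR` is unset or empty, so callers can override the arch list without editing the script:

```bash
# With the variable unset, the default arch list is used.
unset PYTORCH_ROCM_ARCH
echo "${PYTORCH_ROCM_ARCH:-gfx942;gfx950}"    # prints: gfx942;gfx950

# With the variable set by the caller, the override wins.
export PYTORCH_ROCM_ARCH="gfx90a"
echo "${PYTORCH_ROCM_ARCH:-gfx942;gfx950}"    # prints: gfx90a
```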
examples/run_pretrain.sh (outdated)

```bash
LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"

if [ "$ENABLE_NUMA_BINDING" != "1" ]; then
    LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly."
```
This duplicates the warning logic from the hook file (runner/helpers/hooks/05_using_uep.sh line 27). Consider using LOG_WARN for consistency with the hook implementation, as both locations serve the same warning purpose.
| LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." | |
| LOG_WARN "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." |
Force-pushed from 0156bcc to 9f03657.
```python
    args.expert_model_parallel_size >= 16
    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
):
    # Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine.
```
Grammar error in comment: 'is not supported CUs' should be 'does not support CUs' or 'is not supported for CUs'.
Suggested change:

```diff
-    # Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine.
+    # Turbo DeepEP is not supported for CUs > 32 when using internode dispatch/combine.
```
```diff
 UCCL_BUILD_DIR="${UCCL_BUILD_DIR:-/tmp/uccl_${HOSTNAME:-$(hostname)}}"
 UCCL_REF="${UCCL_REF:-}"
-GPU_ARCHS="${GPU_ARCHS:-gfx942;gfx950}"
+GPU_ARCHS="${GPU_ARCHS:-gfx942,gfx950}"
```
This changes both the arch-list separator and the environment variable used for ROCm builds. For PyTorch ROCm extensions, `PYTORCH_ROCM_ARCH` is typically expected (and list separators are commonly `;`), while `TORCH_CUDA_ARCH_LIST` is CUDA-oriented and may be ignored on ROCm, potentially producing a build without the intended gfx targets. Consider reverting to `PYTORCH_ROCM_ARCH` and using the separator format expected by the build toolchain.
```diff
 LOG_INFO_RANK0 "[hook system] Building uccl ep"
-cd ep && PYTORCH_ROCM_ARCH="${GPU_ARCHS}" python3 setup.py build && cd ..
+cd ep && TORCH_CUDA_ARCH_LIST="${GPU_ARCHS}" python3 setup.py build && cd ..
```
Same concern as the previous hunk: prefer `PYTORCH_ROCM_ARCH` with the separator format the ROCm build toolchain expects.
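
If the hook needs to support both variables, one possible shape is sketched below (the defaulting and comma-to-semicolon normalization are assumptions for illustration, not the PR's code):

```bash
# Sketch: prefer an explicit PYTORCH_ROCM_ARCH, fall back to GPU_ARCHS,
# normalizing commas to the semicolon separator PyTorch's ROCm build expects.
GPU_ARCHS="${GPU_ARCHS:-gfx942;gfx950}"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH:-${GPU_ARCHS//,/;}}"
export PYTORCH_ROCM_ARCH
cd ep && python3 setup.py build && cd ..
```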
```diff
-if args.expert_model_parallel_size >= 16:
+if (
+    args.expert_model_parallel_size >= 16
+    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
+):
```
If `PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND` is unset, this condition becomes false and the `turbo_deepep_num_cu <= 32` constraint is skipped even though the effective default backend elsewhere appears to be TURBO. To avoid silently bypassing validation, default the getenv to "TURBO" (or otherwise ensure the env var is always set before validation runs).
Suggested change:

```diff
-    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
+    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "TURBO") == "TURBO"
```
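
The difference is easy to see in isolation: with no default, an unset variable yields `None` and the equality check silently fails (a minimal demonstration, not project code):

```python
import os

os.environ.pop("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", None)

# Without a default: unset -> None, so the TURBO branch is silently skipped.
print(os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO")           # False

# With a default: unset -> "TURBO", so validation still runs.
print(os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "TURBO") == "TURBO")  # True
```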
examples/run_pretrain.sh (outdated)

```bash
cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd ..
cp ep/build/**/*.so uccl
```
`cp ep/build/**/*.so uccl` depends on globstar (`**`) expansion, which is typically disabled by default and can lead to no `.so` files being copied (and subsequent install/runtime failures). Use a glob pattern that doesn't require globstar or switch to a find-based copy. Also consider aligning the arch env var/list format with whatever the ROCm build expects (see UCCL rebuild hook).
Suggested change:

```diff
-cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd ..
-cp ep/build/**/*.so uccl
+cd ep && TORCH_CUDA_ARCH_LIST="gfx942;gfx950" python3 setup.py build && cd ..
+find ep/build -name '*.so' -exec cp -t uccl {} +
```
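
To see why the original pattern is fragile: without `shopt -s globstar`, bash treats `**` the same as a single `*`, so `ep/build/**/*.so` only matches one directory level. A quick reproduction with illustrative paths:

```bash
# Illustrative reproduction of the globstar pitfall.
mkdir -p demo/build/lib/deep demo/out
touch demo/build/lib/deep/ep.so

shopt -u globstar
ls demo/build/**/*.so 2>/dev/null   # no match: ** degrades to a single-level *

shopt -s globstar
ls demo/build/**/*.so               # matches demo/build/lib/deep/ep.so

# find needs no shell options and handles arbitrary nesting.
find demo/build -name '*.so' -exec cp -t demo/out {} +
```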
```bash
fi

git clone https://github.com/uccl-project/uccl.git
cd uccl || exit
```
The rebuild path clones UCCL from main without pinning to a commit/tag, which makes runs non-reproducible and can break unexpectedly over time. Consider supporting a UCCL_COMMIT/UCCL_REF env var (similar to CI) and checking out that ref when provided.
Suggested change:

```diff
 cd uccl || exit
+# Optionally pin UCCL to a specific ref/commit for reproducible rebuilds.
+# If UCCL_REF is not set, UCCL_COMMIT can be used as a fallback.
+UCCL_CHECKOUT_REF="${UCCL_REF:-$UCCL_COMMIT}"
+if [ -n "${UCCL_CHECKOUT_REF}" ]; then
+    LOG_INFO "Checking out UCCL ref: ${UCCL_CHECKOUT_REF}"
+    git checkout "${UCCL_CHECKOUT_REF}" || exit
+fi
```
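
With such a change, a reproducible rebuild could be requested in one line (the ref value is a placeholder):

```bash
# Hypothetical: pin the rebuild to a known-good commit or tag.
UCCL_REF=abc1234 REBUILD_UCCL=1 USING_UEP=1 bash examples/run_pretrain.sh
```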
runner/helpers/hooks/05_using_uep.sh (outdated)

```bash
if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then
    LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1."
    exit 1
fi
LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)"
LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"
```
Using pip directly can point at a different Python environment than python3/pip3 (especially in containers/venvs), leading to false negatives/positives when checking installation. Prefer python3 -m pip show ... (or consistently pip3) to ensure the check targets the same interpreter used elsewhere in the scripts.
Suggested change:

```diff
-if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then
+if ! python3 -m pip show uccl &>/dev/null || ! python3 -m pip show deep_ep &>/dev/null; then
     LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1."
     exit 1
 fi
-LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)"
-LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"
+LOG_INFO "uccl package is installed: $(python3 -m pip show uccl | grep Version)"
+LOG_INFO "deep_ep package is installed: $(python3 -m pip show deep_ep | grep Version)"
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 57 out of 57 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
docs/backends/extending-backends.md:1
- This guide still describes trainer-class lookup via `BackendRegistry.get_trainer_class(...)`, but the PR removes trainer-class registration APIs from `BackendRegistry` and updates adapters to load trainer classes directly. The doc should be updated to match the new contract (adapters own `load_trainer_class()` without going through registry trainer-class APIs), otherwise the guide will not work for readers.
# Backend Extension Guide
```diff
 run_patches(
     backend="megatron",
     phase="before_train",
     backend_version=type(self).detect_megatron_version(),
     model_name=self.model_name,
     extra={
-        "module_config": module_config,
+        "backend_args": self.backend_args,
+        "primus_config": self.primus_config,
+        "module_config": self.module_config,
     },
 )
```
This applies before_train patches inside the trainer constructor, but the new architecture already applies before_train in PrimusRuntime (and tests assert the phase order). This will cause duplicate patch application for Megatron-Bridge and breaks the runtime-owned phase placement contract; remove patch execution from __init__ and rely on PrimusRuntime._run_phase_patches() (or move any truly-required setup into setup() and keep it patch-free).
```python
        # Logger may not be initialized yet; sys.path is already updated.
        pass
    return norm_path
assert False, error_msg
```
Using assert False for runtime path validation is fragile (asserts can be optimized out with -O) and raises AssertionError instead of a more appropriate exception. Prefer raising FileNotFoundError (or RuntimeError with a clear message) when the path doesn’t exist.
Suggested change:

```diff
-assert False, error_msg
+raise FileNotFoundError(error_msg)
```
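
The `-O` behavior is easy to reproduce: with optimizations on, the assert is compiled away and execution falls through silently:

```bash
python3 -c 'assert False, "path missing"'     # raises AssertionError
python3 -O -c 'assert False, "path missing"'  # prints nothing: the assert is stripped
```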
```diff
 def merge_namespace(dst: SimpleNamespace, src: SimpleNamespace, allow_override=False, excepts: list = None):
     src_dict = vars(src)
     dst_dict = vars(dst)
     excepts = excepts or []
     for key, value in src_dict.items():
         if key in excepts:
             continue
         if key in dst_dict and not allow_override:
-            raise ValueError(f"Key '{key}' from {src.name} already exists in {dst.name}.")
-        else:
-            setattr(dst, key, value)
+            continue  # Skip duplicate keys, keep dst value
+        setattr(dst, key, value)
```
This change silently skips conflicting keys by default, whereas the previous behavior raised a ValueError. That’s a behavioral breaking change that can hide misconfigurations (e.g., typos or unintended duplicate keys). Consider restoring the previous default (raise on conflict) and adding an explicit mode/flag for 'skip duplicates', or at least logging conflicts when skipping.
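
A small test makes the regression concrete. This is a sketch assuming the old raise-on-conflict contract; the import path of `merge_namespace` is illustrative:

```python
# Regression-test sketch for the old contract: duplicate keys raise unless
# allow_override=True. The import path below is hypothetical.
from types import SimpleNamespace

import pytest

from primus.core.utils import merge_namespace  # hypothetical location


def test_merge_namespace_raises_on_conflict():
    dst = SimpleNamespace(name="dst", lr=0.1)
    src = SimpleNamespace(name="src", lr=0.2)
    # 'lr' exists in both namespaces, so the old contract raised ValueError.
    with pytest.raises(ValueError):
        merge_namespace(dst, src, excepts=["name"])


def test_merge_namespace_override():
    dst = SimpleNamespace(name="dst", lr=0.1)
    src = SimpleNamespace(name="src", lr=0.2)
    merge_namespace(dst, src, allow_override=True, excepts=["name"])
    assert dst.lr == 0.2  # explicit override wins
```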
```python
def _get_backend_version(self) -> Optional[str]:
    assert self.ctx is not None, "TrainContext must be initialized before detecting backend version."
    if self.ctx.backend_version is not None:
        return self.ctx.backend_version
    if self.ctx.adapter is None:
        return None
    try:
        self.ctx.backend_version = self.ctx.adapter.detect_backend_version()
    except Exception:
        self.ctx.backend_version = None
    return self.ctx.backend_version
```
Swallowing all exceptions here makes version-dependent patch selection silently degrade to backend_version=None, which can lead to incorrect patch applicability without any signal. Recommend logging a warning with the exception details (at least rank 0), and/or providing a strict mode where version detection failures propagate.
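
A minimal version of the suggested hedge, log and degrade rather than swallow silently (the wrapper function and logger setup are illustrative, not project code):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def detect_version_with_warning(adapter) -> Optional[str]:
    """Illustrative wrapper: still degrade to None, but leave a signal in the logs."""
    try:
        return adapter.detect_backend_version()
    except Exception as exc:  # deliberate broad catch, now surfaced
        logger.warning("Backend version detection failed; patch selection may be wrong: %s", exc)
        return None
```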
```python
# These should stay in sync with the phases used by:
# - BackendAdapter._apply_setup_patches() / _apply_build_args_patches()
# - PrimusRuntime (setup / build_args / before_train / after_train)
# - BaseTrainer.run() (before_train / after_train)
```
The comment references BaseTrainer.run(), but BaseTrainer no longer owns a run() template method in this PR (lifecycle is runtime-owned and trainers implement train()). Update/remove this reference to avoid misleading phase ownership documentation.
Suggested change:

```diff
-# - BaseTrainer.run() (before_train / after_train)
+# - Trainer lifecycle (before_train / after_train)
```
```markdown
- `BackendAdapter` – integrates a backend framework
- `BackendRegistry` – discovers and instantiates adapters
- `BaseTrainer` – defines the minimal training lifecycle that all backends follow
- `PrimusRuntime` – orchestrates config loading, env setup, patches, adapter & trainer
```
Same doc-contract concern as noted above: the guide still routes trainer-class lookup through `BackendRegistry.get_trainer_class(...)`, which this PR removes in favor of adapters owning `load_trainer_class()`.
```python
from typing import Any, Dict

from primus.core.backend.backend_adapter import BackendAdapter
from primus.core.backend.backend_registry import BackendRegistry
```
Same concern as above. Suggested change:

```diff
-from primus.core.backend.backend_registry import BackendRegistry
```
| """ | ||
|
|
||
| def __init__(self, primus_config: Any, module_config: Any, backend_args: Any = None): | ||
| def __init__(self, backend_args: Any): |
Tests and runtime behavior indicate backend_args can be None (e.g., tests explicitly construct DummyTrainer(backend_args=None)). Consider making the constructor accept an optional value (backend_args: Any | None = None) to align the public trainer interface with actual supported usage and avoid inconsistent typing/expectations.
Force-pushed from f7810e0 to 9622cba.
```diff
@@ -47,7 +45,7 @@ if [[ -n "$UCCL_REF" ]]; then
 fi

 LOG_INFO_RANK0 "[hook system] Building uccl ep"
```
Removing PYTORCH_ROCM_ARCH/GPU arch injection can break building ROCm extensions in environments without a visible GPU (common in container builds/CI), or produce binaries targeting the wrong arch. Consider restoring a configurable arch env (e.g., PYTORCH_ROCM_ARCH="${GPU_ARCHS}") with a sane default, or otherwise ensuring the build is deterministic across nodes.
Suggested change:

```diff
 LOG_INFO_RANK0 "[hook system] Building uccl ep"
+# Ensure deterministic ROCm arch selection for extension build
+if [[ -z "${PYTORCH_ROCM_ARCH:-}" ]]; then
+    if [[ -n "${GPU_ARCHS:-}" ]]; then
+        export PYTORCH_ROCM_ARCH="${GPU_ARCHS}"
+    else
+        # Fallback to a sane default ROCm arch if none is provided
+        export PYTORCH_ROCM_ARCH="gfx90a"
+    fi
+fi
+LOG_INFO_RANK0 "[hook system] Using PYTORCH_ROCM_ARCH='${PYTORCH_ROCM_ARCH}' for uccl ep build"
```
```python
if (
    args.expert_model_parallel_size >= 16
    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "DEEP_EP") == "TURBO"
):
```
This backend-dependent validation is easy to regress because it depends on an environment variable and specific expert_model_parallel_size thresholds. Add a unit/integration test that exercises validate_args_on_rocm for both PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=TURBO and =DEEP_EP (and unset), verifying the assertion behavior around turbo_deepep_num_cu when expert_model_parallel_size >= 16.
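
A sketch of such a test, driving the env var with monkeypatch. The import path, the `make_args` fixture, and the AssertionError expectation are assumptions based on the hunk above, not existing project code:

```python
# Test sketch for the backend-dependent CU-count validation.
import pytest

from primus.modules.trainer.megatron.utils import validate_args_on_rocm  # assumed path


@pytest.mark.parametrize("backend,should_enforce", [
    ("TURBO", True),
    ("DEEP_EP", False),
    (None, False),  # unset: with the "DEEP_EP" default, the constraint is skipped
])
def test_cu_constraint_by_backend(monkeypatch, backend, should_enforce, make_args):
    # make_args is a hypothetical fixture building a Megatron-style args namespace.
    if backend is None:
        monkeypatch.delenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", raising=False)
    else:
        monkeypatch.setenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", backend)
    args = make_args(expert_model_parallel_size=16, turbo_deepep_num_cu=64)
    if should_enforce:
        with pytest.raises(AssertionError):
            validate_args_on_rocm(args)
    else:
        validate_args_on_rocm(args)  # constraint skipped for non-TURBO backends
```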
No description provided.