feat: add uccl-ep to docker build and add USING_UEP flag to run_pretrain and runner hook #540

zhenhuang12 wants to merge 12 commits into `main`.
Conversation
Pull request overview
This PR adds support for UCCL-EP (Unified Collective Communication Library - Expert Parallelism) to the project, enabling an alternative backend for MoE dispatch/combine operations. The changes include Docker build integration, runtime configuration through environment variables, and conditional logic to select between TURBO and DEEP_EP backends.
Changes:
- Adds UCCL-EP installation to Docker build process with configurable commit hash
- Implements USING_UEP flag to enable UCCL-EP functionality in runner hooks and pretrain scripts
- Updates validation logic to conditionally apply turbo_deepep_num_cu constraint based on backend selection
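
For context, enabling the new path might look like the following (both variables are introduced by this PR; the invocation itself is illustrative, not taken from the diff):

```bash
# Hypothetical invocation: rebuild UCCL inside the container and enable the
# UCCL-EP dispatch/combine path. REBUILD_UCCL and USING_UEP are the variables
# this PR wires through run_pretrain.sh and the runner hook.
REBUILD_UCCL=1 USING_UEP=1 bash examples/run_pretrain.sh
```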
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/ci.yaml | Adds UCCL_COMMIT environment variable and passes it to Docker builds |
| .github/workflows/docker/Dockerfile | Adds UCCL-EP installation steps during Docker image build |
| runner/helpers/hooks/05_using_uep.sh | New hook script to validate and configure UCCL-EP environment |
| examples/run_pretrain.sh | Adds UCCL rebuild option and UEP configuration logic |
| examples/run_local_pretrain.sh | Passes REBUILD_UCCL and USING_UEP environment variables to Docker container |
| examples/moe_package/run_deepseek_v2_pretrain_mi355x.sh | Enables UCCL-EP feature (option 8) in MoE features array |
| examples/moe_package/run_deepseek_v2_lite_pretrain_mi355x.sh | Enables UCCL-EP feature (option 8) in MoE features array |
| primus/modules/trainer/megatron/utils.py | Updates validation to only enforce CU constraint when using TURBO backend |
examples/run_pretrain.sh (outdated)

```bash
git clone https://github.com/uccl-project/uccl.git
cd uccl || exit
cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd ..
```
Hardcoded GPU architectures 'gfx942;gfx950' should be configurable via an environment variable to support different hardware configurations.
Suggested change:

```diff
-cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd ..
+cd ep && PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH:-gfx942;gfx950}" python3 setup.py build && cd ..
```
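As background on the suggested pattern, `${VAR:-default}` expands to the default only when `VAR` is unset or empty, so callers can override the arch list without editing the script:

```bash
# With the variable unset, the default arch list is used.
unset PYTORCH_ROCM_ARCH
echo "${PYTORCH_ROCM_ARCH:-gfx942;gfx950}"    # prints: gfx942;gfx950

# With the variable set by the caller, the override wins.
export PYTORCH_ROCM_ARCH="gfx90a"
echo "${PYTORCH_ROCM_ARCH:-gfx942;gfx950}"    # prints: gfx90a
```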
examples/run_pretrain.sh (outdated)

```bash
LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"

if [ "$ENABLE_NUMA_BINDING" != "1" ]; then
    LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly."
```
This duplicates the warning logic from the hook file (runner/helpers/hooks/05_using_uep.sh line 27). Consider using LOG_WARN for consistency with the hook implementation, as both locations serve the same warning purpose.
| LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." | |
| LOG_WARN "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." |
Force-pushed from 0156bcc to 9f03657.
```python
    args.expert_model_parallel_size >= 16
    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
):
    # Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine.
```
Grammar error in comment: 'is not supported CUs' should be 'does not support CUs' or 'is not supported for CUs'.
Suggested change:

```diff
-    # Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine.
+    # Turbo DeepEP is not supported for CUs > 32 when using internode dispatch/combine.
```
```diff
 UCCL_BUILD_DIR="${UCCL_BUILD_DIR:-/tmp/uccl_${HOSTNAME:-$(hostname)}}"
 UCCL_REF="${UCCL_REF:-}"
-GPU_ARCHS="${GPU_ARCHS:-gfx942;gfx950}"
+GPU_ARCHS="${GPU_ARCHS:-gfx942,gfx950}"
```
This changes both the arch-list separator and the environment variable used for ROCm builds. For PyTorch ROCm extensions, `PYTORCH_ROCM_ARCH` is typically expected (and list separators are commonly `;`), while `TORCH_CUDA_ARCH_LIST` is CUDA-oriented and may be ignored on ROCm, potentially producing a build without the intended gfx targets. Consider reverting to `PYTORCH_ROCM_ARCH` and using the separator format expected by the build toolchain.
```diff
 LOG_INFO_RANK0 "[hook system] Building uccl ep"
-cd ep && PYTORCH_ROCM_ARCH="${GPU_ARCHS}" python3 setup.py build && cd ..
+cd ep && TORCH_CUDA_ARCH_LIST="${GPU_ARCHS}" python3 setup.py build && cd ..
```
Same concern as the previous hunk: prefer `PYTORCH_ROCM_ARCH` with the separator format the ROCm build toolchain expects.
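
If the hook needs to support both variables, one possible shape is sketched below (the defaulting and comma-to-semicolon normalization are assumptions for illustration, not the PR's code):

```bash
# Sketch: prefer an explicit PYTORCH_ROCM_ARCH, fall back to GPU_ARCHS,
# normalizing commas to the semicolon separator PyTorch's ROCm build expects.
GPU_ARCHS="${GPU_ARCHS:-gfx942;gfx950}"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH:-${GPU_ARCHS//,/;}}"
export PYTORCH_ROCM_ARCH
cd ep && python3 setup.py build && cd ..
```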
```diff
-if args.expert_model_parallel_size >= 16:
+if (
+    args.expert_model_parallel_size >= 16
+    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
+):
```
If `PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND` is unset, this condition becomes false and the `turbo_deepep_num_cu <= 32` constraint is skipped even though the effective default backend elsewhere appears to be TURBO. To avoid silently bypassing validation, default the getenv to "TURBO" (or otherwise ensure the env var is always set before validation runs).
Suggested change:

```diff
-    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
+    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "TURBO") == "TURBO"
```
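
The difference is easy to see in isolation: with no default, an unset variable yields `None` and the equality check silently fails (a minimal demonstration, not project code):

```python
import os

os.environ.pop("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", None)

# Without a default: unset -> None, so the TURBO branch is silently skipped.
print(os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO")           # False

# With a default: unset -> "TURBO", so validation still runs.
print(os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "TURBO") == "TURBO")  # True
```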
examples/run_pretrain.sh (outdated)

```bash
cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd ..
cp ep/build/**/*.so uccl
```
`cp ep/build/**/*.so uccl` depends on globstar (`**`) expansion, which is typically disabled by default and can lead to no `.so` files being copied (and subsequent install/runtime failures). Use a glob pattern that doesn't require globstar or switch to a find-based copy. Also consider aligning the arch env var/list format with whatever the ROCm build expects (see UCCL rebuild hook).
Suggested change:

```diff
-cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd ..
-cp ep/build/**/*.so uccl
+cd ep && TORCH_CUDA_ARCH_LIST="gfx942;gfx950" python3 setup.py build && cd ..
+find ep/build -name '*.so' -exec cp -t uccl {} +
```
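
To see why the original pattern is fragile: without `shopt -s globstar`, bash treats `**` the same as a single `*`, so `ep/build/**/*.so` only matches one directory level. A quick reproduction with illustrative paths:

```bash
# Illustrative reproduction of the globstar pitfall.
mkdir -p demo/build/lib/deep demo/out
touch demo/build/lib/deep/ep.so

shopt -u globstar
ls demo/build/**/*.so 2>/dev/null   # no match: ** degrades to a single-level *

shopt -s globstar
ls demo/build/**/*.so               # matches demo/build/lib/deep/ep.so

# find needs no shell options and handles arbitrary nesting.
find demo/build -name '*.so' -exec cp -t demo/out {} +
```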
```bash
fi

git clone https://github.com/uccl-project/uccl.git
cd uccl || exit
```
The rebuild path clones UCCL from main without pinning to a commit/tag, which makes runs non-reproducible and can break unexpectedly over time. Consider supporting a UCCL_COMMIT/UCCL_REF env var (similar to CI) and checking out that ref when provided.
Suggested change:

```diff
 cd uccl || exit
+# Optionally pin UCCL to a specific ref/commit for reproducible rebuilds.
+# If UCCL_REF is not set, UCCL_COMMIT can be used as a fallback.
+UCCL_CHECKOUT_REF="${UCCL_REF:-$UCCL_COMMIT}"
+if [ -n "${UCCL_CHECKOUT_REF}" ]; then
+    LOG_INFO "Checking out UCCL ref: ${UCCL_CHECKOUT_REF}"
+    git checkout "${UCCL_CHECKOUT_REF}" || exit
+fi
```
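
With such a change, a reproducible rebuild could be requested in one line (the ref value is a placeholder):

```bash
# Hypothetical: pin the rebuild to a known-good commit or tag.
UCCL_REF=abc1234 REBUILD_UCCL=1 USING_UEP=1 bash examples/run_pretrain.sh
```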
runner/helpers/hooks/05_using_uep.sh (outdated)

```bash
if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then
    LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1."
    exit 1
fi
LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)"
LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"
```
Using pip directly can point at a different Python environment than python3/pip3 (especially in containers/venvs), leading to false negatives/positives when checking installation. Prefer python3 -m pip show ... (or consistently pip3) to ensure the check targets the same interpreter used elsewhere in the scripts.
Suggested change:

```diff
-if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then
+if ! python3 -m pip show uccl &>/dev/null || ! python3 -m pip show deep_ep &>/dev/null; then
     LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1."
     exit 1
 fi
-LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)"
-LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"
+LOG_INFO "uccl package is installed: $(python3 -m pip show uccl | grep Version)"
+LOG_INFO "deep_ep package is installed: $(python3 -m pip show deep_ep | grep Version)"
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 57 out of 57 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
docs/backends/extending-backends.md:1
- This guide still describes trainer-class lookup via `BackendRegistry.get_trainer_class(...)`, but the PR removes trainer-class registration APIs from `BackendRegistry` and updates adapters to load trainer classes directly. The doc should be updated to match the new contract (adapters own `load_trainer_class()` without going through registry trainer-class APIs), otherwise the guide will not work for readers.
# Backend Extension Guide
```diff
 run_patches(
     backend="megatron",
     phase="before_train",
     backend_version=type(self).detect_megatron_version(),
     model_name=self.model_name,
     extra={
-        "module_config": module_config,
+        "backend_args": self.backend_args,
+        "primus_config": self.primus_config,
+        "module_config": self.module_config,
     },
 )
```
This applies before_train patches inside the trainer constructor, but the new architecture already applies before_train in PrimusRuntime (and tests assert the phase order). This will cause duplicate patch application for Megatron-Bridge and breaks the runtime-owned phase placement contract; remove patch execution from __init__ and rely on PrimusRuntime._run_phase_patches() (or move any truly-required setup into setup() and keep it patch-free).
```python
        # Logger may not be initialized yet; sys.path is already updated.
        pass
    return norm_path
assert False, error_msg
```
Using assert False for runtime path validation is fragile (asserts can be optimized out with -O) and raises AssertionError instead of a more appropriate exception. Prefer raising FileNotFoundError (or RuntimeError with a clear message) when the path doesn’t exist.
Suggested change:

```diff
-assert False, error_msg
+raise FileNotFoundError(error_msg)
```
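
The `-O` behavior is easy to reproduce: with optimizations on, the assert is compiled away and execution falls through silently:

```bash
python3 -c 'assert False, "path missing"'     # raises AssertionError
python3 -O -c 'assert False, "path missing"'  # prints nothing: the assert is stripped
```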
```diff
 def merge_namespace(dst: SimpleNamespace, src: SimpleNamespace, allow_override=False, excepts: list = None):
     src_dict = vars(src)
     dst_dict = vars(dst)
     excepts = excepts or []
     for key, value in src_dict.items():
         if key in excepts:
             continue
         if key in dst_dict and not allow_override:
-            raise ValueError(f"Key '{key}' from {src.name} already exists in {dst.name}.")
-        else:
-            setattr(dst, key, value)
+            continue  # Skip duplicate keys, keep dst value
+        setattr(dst, key, value)
```
This change silently skips conflicting keys by default, whereas the previous behavior raised a ValueError. That’s a behavioral breaking change that can hide misconfigurations (e.g., typos or unintended duplicate keys). Consider restoring the previous default (raise on conflict) and adding an explicit mode/flag for 'skip duplicates', or at least logging conflicts when skipping.
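
A small test makes the regression concrete. This is a sketch assuming the old raise-on-conflict contract; the import path of `merge_namespace` is illustrative:

```python
# Regression-test sketch for the old contract: duplicate keys raise unless
# allow_override=True. The import path below is hypothetical.
from types import SimpleNamespace

import pytest

from primus.core.utils import merge_namespace  # hypothetical location


def test_merge_namespace_raises_on_conflict():
    dst = SimpleNamespace(name="dst", lr=0.1)
    src = SimpleNamespace(name="src", lr=0.2)
    # 'lr' exists in both namespaces, so the old contract raised ValueError.
    with pytest.raises(ValueError):
        merge_namespace(dst, src, excepts=["name"])


def test_merge_namespace_override():
    dst = SimpleNamespace(name="dst", lr=0.1)
    src = SimpleNamespace(name="src", lr=0.2)
    merge_namespace(dst, src, allow_override=True, excepts=["name"])
    assert dst.lr == 0.2  # explicit override wins
```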
```python
def _get_backend_version(self) -> Optional[str]:
    assert self.ctx is not None, "TrainContext must be initialized before detecting backend version."
    if self.ctx.backend_version is not None:
        return self.ctx.backend_version
    if self.ctx.adapter is None:
        return None
    try:
        self.ctx.backend_version = self.ctx.adapter.detect_backend_version()
    except Exception:
        self.ctx.backend_version = None
    return self.ctx.backend_version
```
Swallowing all exceptions here makes version-dependent patch selection silently degrade to backend_version=None, which can lead to incorrect patch applicability without any signal. Recommend logging a warning with the exception details (at least rank 0), and/or providing a strict mode where version detection failures propagate.
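
A minimal version of the suggested hedge, log and degrade rather than swallow silently (the wrapper function and logger setup are illustrative, not project code):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def detect_version_with_warning(adapter) -> Optional[str]:
    """Illustrative wrapper: still degrade to None, but leave a signal in the logs."""
    try:
        return adapter.detect_backend_version()
    except Exception as exc:  # deliberate broad catch, now surfaced
        logger.warning("Backend version detection failed; patch selection may be wrong: %s", exc)
        return None
```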
```python
# These should stay in sync with the phases used by:
# - BackendAdapter._apply_setup_patches() / _apply_build_args_patches()
# - PrimusRuntime (setup / build_args / before_train / after_train)
# - BaseTrainer.run() (before_train / after_train)
```
The comment references BaseTrainer.run(), but BaseTrainer no longer owns a run() template method in this PR (lifecycle is runtime-owned and trainers implement train()). Update/remove this reference to avoid misleading phase ownership documentation.
Suggested change:

```diff
-# - BaseTrainer.run() (before_train / after_train)
+# - Trainer lifecycle (before_train / after_train)
```
```markdown
- `BackendAdapter` – integrates a backend framework
- `BackendRegistry` – discovers and instantiates adapters
- `BaseTrainer` – defines the minimal training lifecycle that all backends follow
- `PrimusRuntime` – orchestrates config loading, env setup, patches, adapter & trainer
```
Same doc-contract concern as noted above: the guide still routes trainer-class lookup through `BackendRegistry.get_trainer_class(...)`, which this PR removes in favor of adapters owning `load_trainer_class()`.
```python
from typing import Any, Dict

from primus.core.backend.backend_adapter import BackendAdapter
from primus.core.backend.backend_registry import BackendRegistry
```
Same concern as above. Suggested change:

```diff
-from primus.core.backend.backend_registry import BackendRegistry
```
| """ | ||
|
|
||
| def __init__(self, primus_config: Any, module_config: Any, backend_args: Any = None): | ||
| def __init__(self, backend_args: Any): |
Tests and runtime behavior indicate backend_args can be None (e.g., tests explicitly construct DummyTrainer(backend_args=None)). Consider making the constructor accept an optional value (backend_args: Any | None = None) to align the public trainer interface with actual supported usage and avoid inconsistent typing/expectations.
Force-pushed from f7810e0 to 9622cba.
```diff
@@ -47,7 +45,7 @@ if [[ -n "$UCCL_REF" ]]; then
 fi

 LOG_INFO_RANK0 "[hook system] Building uccl ep"
```
Removing PYTORCH_ROCM_ARCH/GPU arch injection can break building ROCm extensions in environments without a visible GPU (common in container builds/CI), or produce binaries targeting the wrong arch. Consider restoring a configurable arch env (e.g., PYTORCH_ROCM_ARCH="${GPU_ARCHS}") with a sane default, or otherwise ensuring the build is deterministic across nodes.
Suggested change:

```diff
 LOG_INFO_RANK0 "[hook system] Building uccl ep"
+# Ensure deterministic ROCm arch selection for extension build
+if [[ -z "${PYTORCH_ROCM_ARCH:-}" ]]; then
+    if [[ -n "${GPU_ARCHS:-}" ]]; then
+        export PYTORCH_ROCM_ARCH="${GPU_ARCHS}"
+    else
+        # Fallback to a sane default ROCm arch if none is provided
+        export PYTORCH_ROCM_ARCH="gfx90a"
+    fi
+fi
+LOG_INFO_RANK0 "[hook system] Using PYTORCH_ROCM_ARCH='${PYTORCH_ROCM_ARCH}' for uccl ep build"
```
```python
if (
    args.expert_model_parallel_size >= 16
    and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "DEEP_EP") == "TURBO"
):
```
This backend-dependent validation is easy to regress because it depends on an environment variable and specific expert_model_parallel_size thresholds. Add a unit/integration test that exercises validate_args_on_rocm for both PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=TURBO and =DEEP_EP (and unset), verifying the assertion behavior around turbo_deepep_num_cu when expert_model_parallel_size >= 16.
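
A sketch of such a test, driving the env var with monkeypatch. The import path, the `make_args` fixture, and the AssertionError expectation are assumptions based on the hunk above, not existing project code:

```python
# Test sketch for the backend-dependent CU-count validation.
import pytest

from primus.modules.trainer.megatron.utils import validate_args_on_rocm  # assumed path


@pytest.mark.parametrize("backend,should_enforce", [
    ("TURBO", True),
    ("DEEP_EP", False),
    (None, False),  # unset: with the "DEEP_EP" default, the constraint is skipped
])
def test_cu_constraint_by_backend(monkeypatch, backend, should_enforce, make_args):
    # make_args is a hypothetical fixture building a Megatron-style args namespace.
    if backend is None:
        monkeypatch.delenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", raising=False)
    else:
        monkeypatch.setenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", backend)
    args = make_args(expert_model_parallel_size=16, turbo_deepep_num_cu=64)
    if should_enforce:
        with pytest.raises(AssertionError):
            validate_args_on_rocm(args)
    else:
        validate_args_on_rocm(args)  # constraint skipped for non-TURBO backends
```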
No description provided.