
feat: add uccl-ep to docker built and add USING_UEP flag to run_pretrain and runner hook #540

Open
zhenhuang12 wants to merge 12 commits into main from dev/zhuang12/uep

Conversation

@zhenhuang12
Contributor

No description provided.

Copilot AI review requested due to automatic review settings February 5, 2026 09:10
Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for UCCL-EP (Unified Collective Communication Library - Expert Parallelism) to the project, enabling an alternative backend for MoE dispatch/combine operations. The changes include Docker build integration, runtime configuration through environment variables, and conditional logic to select between TURBO and DEEP_EP backends.

Changes:

  • Adds UCCL-EP installation to Docker build process with configurable commit hash
  • Implements USING_UEP flag to enable UCCL-EP functionality in runner hooks and pretrain scripts
  • Updates validation logic to conditionally apply turbo_deepep_num_cu constraint based on backend selection
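The flag wiring described above can be sketched in a few lines of shell. This is a minimal illustration, assuming `USING_UEP=1` should route MoE dispatch/combine to the DeepEP-compatible UCCL-EP path; the function name and exact mapping are hypothetical, not the PR's code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the USING_UEP gate; only the env var names come
# from this PR, the mapping below is illustrative.
USING_UEP="${USING_UEP:-0}"

select_dispatch_backend() {
  if [ "$USING_UEP" = "1" ]; then
    # UCCL-EP exposes a DeepEP-compatible dispatch/combine backend.
    echo "DEEP_EP"
  else
    echo "TURBO"
  fi
}

export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND="$(select_dispatch_backend)"
echo "dispatch/combine backend: $PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND"
```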

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Summary per file:

.github/workflows/ci.yaml: Adds UCCL_COMMIT environment variable and passes it to Docker builds
.github/workflows/docker/Dockerfile: Adds UCCL-EP installation steps during Docker image build
runner/helpers/hooks/05_using_uep.sh: New hook script to validate and configure UCCL-EP environment
examples/run_pretrain.sh: Adds UCCL rebuild option and UEP configuration logic
examples/run_local_pretrain.sh: Passes REBUILD_UCCL and USING_UEP environment variables to Docker container
examples/moe_package/run_deepseek_v2_pretrain_mi355x.sh: Enables UCCL-EP feature (option 8) in MoE features array
examples/moe_package/run_deepseek_v2_lite_pretrain_mi355x.sh: Enables UCCL-EP feature (option 8) in MoE features array
primus/modules/trainer/megatron/utils.py: Updates validation to only enforce CU constraint when using TURBO backend

Copilot AI review requested due to automatic review settings February 5, 2026 09:12
@AMD-AGI deleted a comment from Copilot AI Feb 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


git clone https://github.com/uccl-project/uccl.git
cd uccl || exit
cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd ..
Copilot AI Feb 5, 2026

Hardcoded GPU architectures 'gfx942;gfx950' should be configurable via an environment variable to support different hardware configurations.

Suggested change
cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd ..
cd ep && PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH:-gfx942;gfx950}" python3 setup.py build && cd ..
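The `${VAR:-default}` expansion in the suggestion only falls back to the hardcoded list when the variable is unset or empty; a quick standalone demonstration:

```shell
#!/usr/bin/env bash
# ${VAR:-default} keeps the hardcoded archs purely as a fallback.
unset PYTORCH_ROCM_ARCH
echo "${PYTORCH_ROCM_ARCH:-gfx942;gfx950}"   # prints the default list

PYTORCH_ROCM_ARCH="gfx90a"
echo "${PYTORCH_ROCM_ARCH:-gfx942;gfx950}"   # prints gfx90a
```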

Copilot AI review requested due to automatic review settings February 5, 2026 09:13
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"

if [ "$ENABLE_NUMA_BINDING" != "1" ]; then
LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly."
Copilot AI Feb 5, 2026

This duplicates the warning logic from the hook file (runner/helpers/hooks/05_using_uep.sh line 27). Consider using LOG_WARN for consistency with the hook implementation, as both locations serve the same warning purpose.

Suggested change
LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly."
LOG_WARN "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly."

Copilot AI review requested due to automatic review settings February 6, 2026 02:39
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

args.expert_model_parallel_size >= 16
and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
):
# Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine.
Copilot AI Feb 6, 2026

Grammar error in comment: 'is not supported CUs' should be 'does not support CUs' or 'is not supported for CUs'.

Suggested change
# Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine.
# Turbo DeepEP is not supported for CUs > 32 when using internode dispatch/combine.

Copilot AI review requested due to automatic review settings February 6, 2026 02:48
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings February 7, 2026 08:19
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings February 7, 2026 11:27
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

UCCL_BUILD_DIR="${UCCL_BUILD_DIR:-/tmp/uccl_${HOSTNAME:-$(hostname)}}"
UCCL_REF="${UCCL_REF:-}"
GPU_ARCHS="${GPU_ARCHS:-gfx942;gfx950}"
GPU_ARCHS="${GPU_ARCHS:-gfx942,gfx950}"
Copilot AI Feb 7, 2026

This changes both the arch-list separator and the environment variable used for ROCm builds. For PyTorch ROCm extensions, PYTORCH_ROCM_ARCH is typically expected (and list separators are commonly ;), while TORCH_CUDA_ARCH_LIST is CUDA-oriented and may be ignored on ROCm—potentially producing a build without the intended gfx targets. Consider reverting to PYTORCH_ROCM_ARCH and using the separator format expected by the build toolchain.


LOG_INFO_RANK0 "[hook system] Building uccl ep"
cd ep && PYTORCH_ROCM_ARCH="${GPU_ARCHS}" python3 setup.py build && cd ..
cd ep && TORCH_CUDA_ARCH_LIST="${GPU_ARCHS}" python3 setup.py build && cd ..
Copilot AI Feb 7, 2026

This changes both the arch-list separator and the environment variable used for ROCm builds. For PyTorch ROCm extensions, PYTORCH_ROCM_ARCH is typically expected (and list separators are commonly ;), while TORCH_CUDA_ARCH_LIST is CUDA-oriented and may be ignored on ROCm—potentially producing a build without the intended gfx targets. Consider reverting to PYTORCH_ROCM_ARCH and using the separator format expected by the build toolchain.

Copilot uses AI. Check for mistakes.
if args.expert_model_parallel_size >= 16:
if (
args.expert_model_parallel_size >= 16
and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
Copilot AI Feb 7, 2026

If PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND is unset, this condition becomes false and the turbo_deepep_num_cu <= 32 constraint is skipped even though the effective default backend elsewhere appears to be TURBO. To avoid silently bypassing validation, default the getenv to "TURBO" (or otherwise ensure the env var is always set before validation runs).

Suggested change
and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO"
and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "TURBO") == "TURBO"
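The difference can be reproduced with `os.getenv` alone; the variable name matches the review, the values are just for illustration:

```python
import os

# Without a default, an unset variable yields None, so the == "TURBO"
# comparison is False and the TURBO-only validation branch is skipped.
os.environ.pop("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", None)
print(os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO")           # False

# With an explicit default, the unset case is treated as TURBO and the
# turbo_deepep_num_cu constraint would still be enforced.
print(os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "TURBO") == "TURBO")  # True
```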

Comment on lines 394 to 395
cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd ..
cp ep/build/**/*.so uccl
Copilot AI Feb 7, 2026

cp ep/build/**/*.so uccl depends on globstar (**) expansion, which is typically disabled by default and can lead to no .so files being copied (and subsequent install/runtime failures). Use a glob pattern that doesn't require globstar or switch to a find-based copy. Also consider aligning the arch env var/list format with whatever the ROCm build expects (see UCCL rebuild hook).

Suggested change
cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd ..
cp ep/build/**/*.so uccl
cd ep && TORCH_CUDA_ARCH_LIST="gfx942;gfx950" python3 setup.py build && cd ..
find ep/build -name '*.so' -exec cp -t uccl {} +
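The globstar pitfall is easy to reproduce in isolation; this self-contained sketch uses throwaway paths, not the Dockerfile's real layout:

```shell
#!/usr/bin/env bash
# With globstar off (the default in non-interactive bash), ** behaves like
# a single *, so a nested .so is missed; find works regardless.
set -euo pipefail
workdir="$(mktemp -d)"
cd "$workdir"
mkdir -p ep/build/lib.linux/uccl uccl
touch ep/build/lib.linux/uccl/ep.so

shopt -u globstar
compgen -G 'ep/build/**/*.so' || echo "glob matched nothing"

# find does not depend on shell options and copies .so files at any depth.
find ep/build -name '*.so' -exec cp -t uccl {} +
ls uccl   # ep.so
```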

fi

git clone https://github.com/uccl-project/uccl.git
cd uccl || exit
Copilot AI Feb 7, 2026

The rebuild path clones UCCL from main without pinning to a commit/tag, which makes runs non-reproducible and can break unexpectedly over time. Consider supporting a UCCL_COMMIT/UCCL_REF env var (similar to CI) and checking out that ref when provided.

Suggested change
cd uccl || exit
cd uccl || exit
# Optionally pin UCCL to a specific ref/commit for reproducible rebuilds.
# If UCCL_REF is not set, UCCL_COMMIT can be used as a fallback.
UCCL_CHECKOUT_REF="${UCCL_REF:-$UCCL_COMMIT}"
if [ -n "${UCCL_CHECKOUT_REF}" ]; then
LOG_INFO "Checking out UCCL ref: ${UCCL_CHECKOUT_REF}"
git checkout "${UCCL_CHECKOUT_REF}" || exit
fi

Comment on lines 19 to 24
if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then
LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1."
exit 1
fi
LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)"
LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"
Copilot AI Feb 7, 2026

Using pip directly can point at a different Python environment than python3/pip3 (especially in containers/venvs), leading to false negatives/positives when checking installation. Prefer python3 -m pip show ... (or consistently pip3) to ensure the check targets the same interpreter used elsewhere in the scripts.

Suggested change
if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then
LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1."
exit 1
fi
LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)"
LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)"
if ! python3 -m pip show uccl &>/dev/null || ! python3 -m pip show deep_ep &>/dev/null; then
LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1."
exit 1
fi
LOG_INFO "uccl package is installed: $(python3 -m pip show uccl | grep Version)"
LOG_INFO "deep_ep package is installed: $(python3 -m pip show deep_ep | grep Version)"

Copilot AI review requested due to automatic review settings February 9, 2026 09:27
zhenhuang12 and others added 9 commits February 9, 2026 09:30
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 57 out of 57 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

docs/backends/extending-backends.md:1

  • This guide still describes trainer-class lookup via BackendRegistry.get_trainer_class(...), but the PR removes trainer-class registration APIs from BackendRegistry and updates adapters to load trainer classes directly. The doc should be updated to match the new contract (adapters own load_trainer_class() without going through registry trainer-class APIs), otherwise the guide will not work for readers.
# Backend Extension Guide

Comment on lines 60 to 68
run_patches(
backend="megatron",
phase="before_train",
backend_version=type(self).detect_megatron_version(),
model_name=self.model_name,
extra={
"module_config": module_config,
"backend_args": self.backend_args,
"primus_config": self.primus_config,
"module_config": self.module_config,
},
)
Copilot AI Feb 9, 2026

This applies before_train patches inside the trainer constructor, but the new architecture already applies before_train in PrimusRuntime (and tests assert the phase order). This will cause duplicate patch application for Megatron-Bridge and breaks the runtime-owned phase placement contract; remove patch execution from __init__ and rely on PrimusRuntime._run_phase_patches() (or move any truly-required setup into setup() and keep it patch-free).

# Logger may not be initialized yet; sys.path is already updated.
pass
return norm_path
assert False, error_msg
Copilot AI Feb 9, 2026

Using assert False for runtime path validation is fragile (asserts can be optimized out with -O) and raises AssertionError instead of a more appropriate exception. Prefer raising FileNotFoundError (or RuntimeError with a clear message) when the path doesn’t exist.

Suggested change
assert False, error_msg
raise FileNotFoundError(error_msg)
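The -O caveat the review mentions can be verified directly; a small sketch that runs the interpreter with optimizations enabled via subprocess:

```python
import subprocess
import sys

# Under `python -O`, assert statements are compiled out entirely, so an
# `assert False, msg` guard silently vanishes.
proc = subprocess.run([sys.executable, "-O", "-c", "assert False, 'bad path'"])
print(proc.returncode)   # 0 -- the assert never executed

# An explicit raise survives optimization and still fails the process.
proc = subprocess.run(
    [sys.executable, "-O", "-c", "raise FileNotFoundError('bad path')"],
    stderr=subprocess.DEVNULL,
)
print(proc.returncode)   # 1
```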

Comment on lines 93 to 102
def merge_namespace(dst: SimpleNamespace, src: SimpleNamespace, allow_override=False, excepts: list = None):
src_dict = vars(src)
dst_dict = vars(dst)
excepts = excepts or []
for key, value in src_dict.items():
if key in excepts:
continue
if key in dst_dict and not allow_override:
raise ValueError(f"Key '{key}' from {src.name} already exists in {dst.name}.")
else:
setattr(dst, key, value)
continue # Skip duplicate keys, keep dst value
setattr(dst, key, value)
Copilot AI Feb 9, 2026

This change silently skips conflicting keys by default, whereas the previous behavior raised a ValueError. That’s a behavioral breaking change that can hide misconfigurations (e.g., typos or unintended duplicate keys). Consider restoring the previous default (raise on conflict) and adding an explicit mode/flag for 'skip duplicates', or at least logging conflicts when skipping.
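One way to keep both behaviors available is an explicit conflict policy; this sketch uses a hypothetical on_conflict parameter rather than the PR's actual signature:

```python
from types import SimpleNamespace

def merge_namespace(dst, src, on_conflict="raise", excepts=None):
    """Merge src into dst; on_conflict is 'raise', 'skip', or 'override'."""
    excepts = excepts or []
    for key, value in vars(src).items():
        if key in excepts:
            continue
        if key in vars(dst):
            if on_conflict == "raise":
                raise ValueError(f"Key '{key}' already exists in destination.")
            if on_conflict == "skip":
                continue  # keep the existing dst value
        setattr(dst, key, value)
    return dst

dst = SimpleNamespace(a=1)
src = SimpleNamespace(a=2, b=3)
merged = merge_namespace(dst, src, on_conflict="skip")
print(merged.a, merged.b)  # 1 3
```

Defaulting on_conflict to "raise" preserves the old fail-fast behavior, while call sites that genuinely want duplicate-skipping opt in visibly.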

Comment on lines 126 to 136
def _get_backend_version(self) -> Optional[str]:
assert self.ctx is not None, "TrainContext must be initialized before detecting backend version."
if self.ctx.backend_version is not None:
return self.ctx.backend_version
if self.ctx.adapter is None:
return None
try:
self.ctx.backend_version = self.ctx.adapter.detect_backend_version()
except Exception:
self.ctx.backend_version = None
return self.ctx.backend_version
Copilot AI Feb 9, 2026

Swallowing all exceptions here makes version-dependent patch selection silently degrade to backend_version=None, which can lead to incorrect patch applicability without any signal. Recommend logging a warning with the exception details (at least rank 0), and/or providing a strict mode where version detection failures propagate.

# These should stay in sync with the phases used by:
# - BackendAdapter._apply_setup_patches() / _apply_build_args_patches()
# - PrimusRuntime (setup / build_args / before_train / after_train)
# - BaseTrainer.run() (before_train / after_train)
Copilot AI Feb 9, 2026

The comment references BaseTrainer.run(), but BaseTrainer no longer owns a run() template method in this PR (lifecycle is runtime-owned and trainers implement train()). Update/remove this reference to avoid misleading phase ownership documentation.

Suggested change
# - BaseTrainer.run() (before_train / after_train)
# - Trainer lifecycle (before_train / after_train)

Comment on lines 6 to 9
- `BackendAdapter` – integrates a backend framework
- `BackendRegistry` – discovers and instantiates adapters
- `BaseTrainer` – defines the minimal training lifecycle that all backends follow
- `PrimusRuntime` – orchestrates config loading, env setup, patches, adapter & trainer
Copilot AI Feb 9, 2026

This guide still describes trainer-class lookup via BackendRegistry.get_trainer_class(...), but the PR removes trainer-class registration APIs from BackendRegistry and updates adapters to load trainer classes directly. The doc should be updated to match the new contract (adapters own load_trainer_class() without going through registry trainer-class APIs), otherwise the guide will not work for readers.

from typing import Any, Dict

from primus.core.backend.backend_adapter import BackendAdapter
from primus.core.backend.backend_registry import BackendRegistry
Copilot AI Feb 9, 2026

This guide still describes trainer-class lookup via BackendRegistry.get_trainer_class(...), but the PR removes trainer-class registration APIs from BackendRegistry and updates adapters to load trainer classes directly. The doc should be updated to match the new contract (adapters own load_trainer_class() without going through registry trainer-class APIs), otherwise the guide will not work for readers.

Suggested change
from primus.core.backend.backend_registry import BackendRegistry

"""

def __init__(self, primus_config: Any, module_config: Any, backend_args: Any = None):
def __init__(self, backend_args: Any):
Copilot AI Feb 9, 2026

Tests and runtime behavior indicate backend_args can be None (e.g., tests explicitly construct DummyTrainer(backend_args=None)). Consider making the constructor accept an optional value (backend_args: Any | None = None) to align the public trainer interface with actual supported usage and avoid inconsistent typing/expectations.

@AMD-AGI deleted 8 comments from Copilot AI Feb 9, 2026
Copilot AI review requested due to automatic review settings February 9, 2026 09:38
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

@@ -47,7 +45,7 @@ if [[ -n "$UCCL_REF" ]]; then
fi

LOG_INFO_RANK0 "[hook system] Building uccl ep"
Copilot AI Feb 9, 2026

Removing PYTORCH_ROCM_ARCH/GPU arch injection can break building ROCm extensions in environments without a visible GPU (common in container builds/CI), or produce binaries targeting the wrong arch. Consider restoring a configurable arch env (e.g., PYTORCH_ROCM_ARCH="${GPU_ARCHS}") with a sane default, or otherwise ensuring the build is deterministic across nodes.

Suggested change
LOG_INFO_RANK0 "[hook system] Building uccl ep"
LOG_INFO_RANK0 "[hook system] Building uccl ep"
# Ensure deterministic ROCm arch selection for extension build
if [[ -z "${PYTORCH_ROCM_ARCH:-}" ]]; then
if [[ -n "${GPU_ARCHS:-}" ]]; then
export PYTORCH_ROCM_ARCH="${GPU_ARCHS}"
else
# Fallback to a sane default ROCm arch if none is provided
export PYTORCH_ROCM_ARCH="gfx90a"
fi
fi
LOG_INFO_RANK0 "[hook system] Using PYTORCH_ROCM_ARCH='${PYTORCH_ROCM_ARCH}' for uccl ep build"

Comment on lines +539 to +542
if (
args.expert_model_parallel_size >= 16
and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "DEEP_EP") == "TURBO"
):
Copilot AI Feb 9, 2026

This backend-dependent validation is easy to regress because it depends on an environment variable and specific expert_model_parallel_size thresholds. Add a unit/integration test that exercises validate_args_on_rocm for both PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=TURBO and =DEEP_EP (and unset), verifying the assertion behavior around turbo_deepep_num_cu when expert_model_parallel_size >= 16.

@AMD-AGI deleted 3 comments from Copilot AI Feb 9, 2026