
Feat/baseline #444

Closed
gphuang wants to merge 19 commits into feat/baseline from main

Conversation


@gphuang gphuang commented Dec 19, 2025

No description provided.

RuibinCheung and others added 19 commits December 16, 2025 11:40
…extension (#299)

* Add mxfp8 recipe support
* Add `PrimusTurboLinear` to replace `TELinear`

mxfp8 currently runs on some models without errors, as shown below, but
performance is poor, only about half that of FP8. This will be addressed
later.
<img width="2894" height="1002" alt="image"
src="https://github.com/user-attachments/assets/df73d725-aec5-47cd-b78c-f7efb32c0462"
/>
## Summary

This PR refactors the test structure so that CI runs can better
distinguish failure types and pinpoint issues more easily.

---------

Co-authored-by: Xiaoming-AMD <xiaompen@amd.com>
This PR fixes an incorrect container image tag used by primus-cli in
container mode.

The image tag was previously set to rocm/primus:v25.10_gfx942, which
does not exist; it has been corrected to a valid tag.
## 🎯 Overview

This PR introduces explicit precision indicators in configuration
filenames and adds FP8 training support across all model configurations.

## 📋 Changes Summary

### 1. Configuration File Renaming
- **Renamed all existing configs**: `xxx-pretrain.yaml` →
`xxx-BF16-pretrain.yaml`
- **Total files renamed**: 42 configuration files (MI300X: 20, MI355X:
22)
- **Purpose**: Explicitly indicate BF16 precision in filenames for
better clarity

### 2. FP8 Configuration Support
- **Created FP8 variants**: Added `xxx-FP8-pretrain.yaml` for all models
- **Total new files**: 42 FP8 configuration files
- **FP8-specific settings**:
  ```yaml
  # enable fp8 training
  fp8: hybrid
  moe_use_legacy_grouped_gemm: false
  ```

### 3. Configuration Cleanup
- Removed deprecated fusion-related configurations:
  - `moe_permute_fusion`
  - `moe_use_fused_router_with_aux_score`
  - Related comments and unnecessary parameters
- Retained `gradient_accumulation_fusion: false` where it existed in
original configs

### 4. Documentation Updates
- **Updated `examples/README.md`**:
  - All example commands now reference `-BF16-pretrain.yaml` configs
  - Updated model table with new config file links
  - Updated HipBLASLt tuning examples
  - Updated Kubernetes examples
  
- **Updated `tests/trainer/test_megatron_trainer.py`**:
  - All test cases updated to use `-BF16-pretrain.yaml` configs
  - 11 test methods updated

## 🗂️ Affected Models

**MI300X**: deepseek_v2, deepseek_v2_lite, deepseek_v3, gpt_oss_20B,
grok1, grok2, llama2 (7B/70B), llama3 (8B/70B), llama3.1 (8B/70B/405B),
llama3.3_70B, llama4 (17B16E/17B128E), mixtral (8x7B/8x22B), qwen2.5
(7B/72B)

**MI355X**: All above models + qwen3 (8B/30B_A3B/235B_A22B)

## ✅ Benefits

1. **Clear precision indication**: Users can easily identify BF16 vs FP8
configurations
2. **FP8 training ready**: All models now have pre-configured FP8
training support
3. **Optimized settings**: FP8 configs include recommended settings
(`moe_use_legacy_grouped_gemm: false`)
4. **Cleaner configs**: Removed deprecated parameters for better
maintainability
5. **Backward compatibility**: Original BF16 training behavior preserved

## 🧪 Testing

- All configuration file references updated in test suite
- Existing tests continue to work with renamed BF16 configs
- FP8 configs follow the same structure as BF16 with precision-specific
optimizations

## 📝 Migration Guide

### For existing users:
- Replace `xxx-pretrain.yaml` → `xxx-BF16-pretrain.yaml` in your scripts
- Examples:
  ```bash
  # Old
  EXP=examples/megatron/configs/MI300X/llama3_8B-pretrain.yaml
  
  # New
  EXP=examples/megatron/configs/MI300X/llama3_8B-BF16-pretrain.yaml
  ```

### To use FP8 training:
```bash
# Simply switch to FP8 config
EXP=examples/megatron/configs/MI300X/llama3_8B-FP8-pretrain.yaml bash ./examples/run_pretrain.sh
```
This PR improves CI stability for JAX / MaxText training jobs by adding
an explicit training completion log and using it as a success signal in
unit tests.

In some cases, the training process may terminate abnormally **after
training has already completed** (e.g. random core dump), which causes
CI failures even though the training itself finished successfully.

To address this, we now log an explicit marker after the MaxText
training finishes:

```text
MaxText Pre-Trainer: after training is done
```

In CI and unit tests, if the training process exits with an error **but
this log is present**, the run is treated as successful.
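
For illustration, a minimal sketch of how this marker could be used as a success signal (the wrapper script and log path below are assumptions, not the actual CI implementation):

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run the training command passed as arguments, capture
# its output, and treat the run as successful if the completion marker is
# present even when the process exits abnormally after training.
LOG_FILE="${UT_LOG_PATH}/maxtext_pretrain.log"   # assumed log location

"$@" 2>&1 | tee "$LOG_FILE"
status=${PIPESTATUS[0]}

if [ "$status" -ne 0 ] && grep -q "MaxText Pre-Trainer: after training is done" "$LOG_FILE"; then
    echo "Training finished before the abnormal exit; treating run as successful."
    status=0
fi
exit "$status"
```
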
## Background

Currently, CI sets `UT_LOG_PATH` to fixed directories in some cases
(e.g. `.../ut_out/latest`).
This means different CI runs (especially consecutive pushes to `main` or
multiple runners) can
reuse the same path, leading to:

- Previous run logs/results being overwritten
- Cleanup logic having trouble distinguishing runs, making debugging
harder

## Changes

- **Torch CI (`run-unittest-torch`)**
  - Change `UT_LOG_PATH` from fixed paths to unique, per-run directories that
    include a **timestamp** and **short commit SHA**:
    - Pull requests: `ut_out/pr-<pr_number>-<YYYYMMDD-HHMMSS>-<commit>`
    - Push to `main`: `ut_out/main-<YYYYMMDD-HHMMSS>-<commit>`
    - Releases: `ut_out/<tag>-<YYYYMMDD-HHMMSS>-<commit>`
    - Other events: `ut_out/others-<YYYYMMDD-HHMMSS>-<commit>`

- **JAX CI (`run-unittest-jax`)**
  - Apply the same `UT_LOG_PATH` naming scheme as Torch CI, with timestamp and
    short commit SHA for PR / `main` / release / other events (a sketch of the
    path construction is shown below).
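
As an illustration, a minimal sketch of how such a per-run path could be composed in a CI step (assuming a bash step on a GitHub Actions runner; `PR_NUMBER` is a hypothetical variable that would come from the workflow context):

```bash
# Sketch only: build a unique per-run UT_LOG_PATH from event type, timestamp,
# and short commit SHA. GITHUB_EVENT_NAME, GITHUB_SHA and GITHUB_REF_NAME are
# standard GitHub Actions environment variables.
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SHORT_SHA=${GITHUB_SHA::7}

case "$GITHUB_EVENT_NAME" in
  pull_request) PREFIX="pr-${PR_NUMBER}" ;;     # PR_NUMBER: hypothetical, from workflow context
  push)         PREFIX="main" ;;                # pushes to main
  release)      PREFIX="${GITHUB_REF_NAME}" ;;  # release tag
  *)            PREFIX="others" ;;
esac

export UT_LOG_PATH="ut_out/${PREFIX}-${TIMESTAMP}-${SHORT_SHA}"
echo "UT_LOG_PATH=${UT_LOG_PATH}"
```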

## Benefits

- Each CI run writes UT logs to a globally unique directory, avoiding
cross-run interference.
- The log path encodes both time and commit, making it easy to trace
logs back to a specific run.
- No change to the actual tests, only to where logs are written, so the
risk is low.

## Testing

- Verified in CI logs that `UT_LOG_PATH` is set to the expected
  `<event>-<timestamp>-<commit>` pattern for both Torch and JAX jobs.
- Confirmed that the UT jobs create and use the new per-run log
directories successfully.
…374)

(1) Fix the hardcoded settings in the MaxText Docker image.
(2) Set default values for the wandb args in MaxText.
…lt Models (#436)

Added a Primus auto-benchmarking tool for the default models, supporting the
Megatron and TorchTitan backends.

**Features:** 

```
✅ Interactive Menu System - User-friendly CLI with color-coded outputs and ASCII banner
✅ Multi-Backend Support - Compatible with Megatron and TorchTitan with device-specific configs
✅ Batch Processing - Run multiple model configurations sequentially with flexible selection
✅ Configuration Viewing - Preview YAML configs before execution
✅ Configuration Editing - Edit YAML configs individually or in batch before execution
✅ Parameter Overrides - Override specific parameters without editing files permanently
✅ Auto Device Detection - Automatically detects AMD MI300X/MI355X GPUs with intelligent fallback
✅ Device-Specific Paths - Automatically uses device-specific config directories (MI300X/MI355X)
✅ Comprehensive Logging - Timestamped logs saved in organized backend-specific directories
✅ Environment Management - Custom device-specific environment variable support
✅ Automatic Metrics Generation - Backend-specific metrics tables generated after completion
✅ Smart Config Management - Handles edited/override configs properly with automatic cleanup
```
Update runs-on label for run-unittest-jax job:
- Old: primus-jax-l85pj
- New: primus-llm-cicd-jax-7b4zw

This updates the CI to use the new JAX runner infrastructure for better
stability and performance in JAX unit tests.
This PR introduces support for the Primus Turbo grouped gemm backend in
the MaxText MoE implementation.

### Key Changes
- Added `use_turbo_grouped_gemm` option in configuration.
- Implemented fallback to default `ragged_dot` when Primus Turbo is
unavailable.
- Added related logging.

### How to Turn It On
- In the config file, e.g.
  `examples/maxtext/configs/MI355X/mixtral_8x7B-pretrain.yaml`:
  - add `use_turbo_grouped_gemm: true`
  - make sure `sparse_matmul: true` and `megablox: false`
- In the shell, set `JAX_ENABLE_X64=1` (a combined launch sketch is shown below)
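
Putting this together, a minimal launch sketch (the config path matches the example above; the launch script is an assumption, use your usual MaxText entry point):

```bash
# Sketch only: the config file is assumed to already contain
#   use_turbo_grouped_gemm: true
#   sparse_matmul: true
#   megablox: false
# The launch script below is an assumption.
JAX_ENABLE_X64=1 \
EXP=examples/maxtext/configs/MI355X/mixtral_8x7B-pretrain.yaml \
bash ./examples/run_pretrain.sh
```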

### Functional Testing
- Integration test on single node and two nodes
- Tested different config combinations, for example (1) `JAX_ENABLE_X64=1` and
  (2) `use_turbo_grouped_gemm` with megablox
- Verified fallback behavior when Primus Turbo is not available
- Confirmed correct logging

### Logging:
- When `use_turbo_grouped_gemm` is on
<img width="816" height="20" alt="image"
src="https://github.com/user-attachments/assets/85e4821e-c681-4bb2-a24a-3945da7fa053"
/>

- Use `use_turbo_grouped_gemm` with `megablox` 
<img width="830" height="25" alt="image"
src="https://github.com/user-attachments/assets/8559757a-3364-464f-9302-d24087fe12a9"
/>

- When `primus_turbo` cannot be loaded
<img width="1131" height="21" alt="image"
src="https://github.com/user-attachments/assets/20ad05b1-4fc9-4982-b2b1-4c39d5657522"
/>

### Benchmarking
The following tests were run on MI355 using the ds-proxy-e128-h2048
configuration with N=1. Compared to sparse+ragged_dot, the Primus Turbo
grouped GEMM backend enables much larger per-device batch sizes without
OOM.
<img width="950" height="621" alt="image"
src="https://github.com/user-attachments/assets/a8eb149f-d6a8-4985-9f25-903679cf655e"
/>

Primus ver: `a16f2524e2ad5b35d06eb306da64b22652478785`
Primus Turbo ver: `d8f8dd0af5c82af0d30489a1dada61ffe9463869`
Docker: `rocm/jax-training:maxtext-v25.9`

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Update deepseek_v3_16b-pretrain.yaml bs=13
…n using measured layer-wise latencies. (#362)

Example usage:

bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v3-pretrain.yaml

Example output:

[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 00 bubble: 2414.28 ms (ratio=7.25%), activation_peak=144.41 GB, param_memory=136.67 GB, total_peak=281.08 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 01 bubble: 2563.14 ms (ratio=7.70%), activation_peak=139.49 GB, param_memory=136.67 GB, total_peak=276.16 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 02 bubble: 2563.14 ms (ratio=7.70%), activation_peak=135.45 GB, param_memory=136.67 GB, total_peak=272.11 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 03 bubble: 2563.14 ms (ratio=7.70%), activation_peak=131.40 GB, param_memory=136.67 GB, total_peak=268.07 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 04 bubble: 2563.14 ms (ratio=7.70%), activation_peak=127.36 GB, param_memory=136.67 GB, total_peak=264.03 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 05 bubble: 2563.14 ms (ratio=7.70%), activation_peak=123.32 GB, param_memory=136.67 GB, total_peak=259.99 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 06 bubble: 2563.14 ms (ratio=7.70%), activation_peak=119.27 GB, param_memory=136.67 GB, total_peak=255.94 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 07 bubble: 877.40 ms (ratio=2.64%), activation_peak=116.00 GB, param_memory=136.67 GB, total_peak=252.66 GB

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
…ckend (#376)

* Add `PRIMUS_DETERMINISTIC` env var to enable deterministic execution (see the sketch below).
* Bring the deterministic unit test back.
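
For illustration, a hedged sketch of enabling deterministic execution for a run (the value `1` and the launch command are assumptions; the config path follows the examples elsewhere in this PR):

```bash
# Sketch only: the env var value and launch script are assumptions.
PRIMUS_DETERMINISTIC=1 \
EXP=examples/megatron/configs/MI300X/llama3_8B-BF16-pretrain.yaml \
bash ./examples/run_pretrain.sh
```
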

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
- Refactor existing Megatron MoE patch logic into the Primus patch
system.
- No new MoE functionality is introduced; this only changes how patches
are registered and applied.

## Changes

- **Deprecated MoE layer patch**
  - File: `primus/backends/megatron/patches/moe_patches/deprecated_layer_patches.py`
  - Wrap existing deprecated MoE layer logic into a registered patch:
    - When `use_deprecated_20241209_moe_layer=True`, replace `MoELayer`,
      `MoESubmodules`, and expert MLP classes with the deprecated versions.
    - Update `megatron.core.models.gpt.moe_module_specs` to point to the same
      deprecated classes as before.

- **MoE permute fusion patch**
  - File: `primus/backends/megatron/patches/moe_patches/permute_fusion_patches.py`
  - Move existing fused permutation logic into a patch:
    - When `moe_permute_fusion=True`, replace TE permute/unpermute and sort
      functions with the Primus fused implementations.
    - Apply the same replacements in `megatron.core.transformer.moe.moe_utils`
      and set `HAVE_TE = True`.

- **Primus TopKRouter patch**
  - File: `primus/backends/megatron/patches/moe_patches/topk_router_patches.py`
  - Register the existing `PrimusTopKRouter` integration as a patch:
    - By default (unless `disable_primus_topk_router=True`), replace
      `TopKRouter` in `megatron.core.transformer.moe.router` and `moe_layer`
      with `PrimusTopKRouter`.
    - If `use_deprecated_20241209_moe_layer=True`, also patch
      `deprecated_20251209.moe_layer.TopKRouter`.

## 📋 Summary

This PR introduces a new modular Transformer Engine (TE) patches module
under `primus/backends/megatron/patches/te_patches/`. It replaces the
monolithic `patch_get_extra_te_kwargs()` and `patch_te_tp_overlap()`
methods from `MegatronTrainer` with well-organized, condition-based
patches.

## 🎯 Motivation

**Problems with old approach:**
- Single large method handling multiple TE configurations
- Hard to understand which patches apply in which scenarios
- Version-specific logic mixed with feature logic
- Difficult to test individual TE patches

**New approach:**
- Each TE feature has its own patch file
- Clear version-based separation (TE < 2.0 vs >= 2.0)
- Condition-driven patch application
- Reusable utility functions
@gphuang gphuang closed this Dec 19, 2025