…extension (#299)

* Add mxfp8 recipe support
* Add `PrimusTurboLinear` to replace `TELinear`

The current mxfp8 recipe works on some models (without any errors), as shown below, but performance is poor: roughly half of FP8. This will be resolved later.

<img width="2894" height="1002" alt="image" src="https://github.com/user-attachments/assets/df73d725-aec5-47cd-b78c-f7efb32c0462" />
## Summary

This PR refactors the test structure so that CI runs can better distinguish failure types, making it easier to pinpoint issues.

---------

Co-authored-by: Xiaoming-AMD <xiaompen@amd.com>
This PR fixes an incorrect container image tag used by `primus-cli` in container mode. The image tag was previously set to `rocm/primus:v25.10_gfx942`, which does not exist. It is now corrected to use the valid tag:
## 🎯 Overview

This PR introduces explicit precision indicators in configuration filenames and adds FP8 training support across all model configurations.

## 📋 Changes Summary

### 1. Configuration File Renaming

- **Renamed all existing configs**: `xxx-pretrain.yaml` → `xxx-BF16-pretrain.yaml`
- **Total files renamed**: 42 configuration files (MI300X: 20, MI355X: 22)
- **Purpose**: Explicitly indicate BF16 precision in filenames for better clarity

### 2. FP8 Configuration Support

- **Created FP8 variants**: Added `xxx-FP8-pretrain.yaml` for all models
- **Total new files**: 42 FP8 configuration files
- **FP8-specific settings**:
  ```yaml
  # enable fp8 training
  fp8: hybrid
  moe_use_legacy_grouped_gemm: false
  ```

### 3. Configuration Cleanup

- Removed deprecated fusion-related configurations:
  - `moe_permute_fusion`
  - `moe_use_fused_router_with_aux_score`
  - Related comments and unnecessary parameters
- Retained `gradient_accumulation_fusion: false` where it existed in original configs

### 4. Documentation Updates

- **Updated `examples/README.md`**:
  - All example commands now reference `-BF16-pretrain.yaml` configs
  - Updated model table with new config file links
  - Updated HipBLASLt tuning examples
  - Updated Kubernetes examples
- **Updated `tests/trainer/test_megatron_trainer.py`**:
  - All test cases updated to use `-BF16-pretrain.yaml` configs
  - 11 test methods updated

## 🗂️ Affected Models

**MI300X**: deepseek_v2, deepseek_v2_lite, deepseek_v3, gpt_oss_20B, grok1, grok2, llama2 (7B/70B), llama3 (8B/70B), llama3.1 (8B/70B/405B), llama3.3_70B, llama4 (17B16E/17B128E), mixtral (8x7B/8x22B), qwen2.5 (7B/72B)

**MI355X**: All above models + qwen3 (8B/30B_A3B/235B_A22B)

## ✅ Benefits

1. **Clear precision indication**: Users can easily identify BF16 vs FP8 configurations
2. **FP8 training ready**: All models now have pre-configured FP8 training support
3. **Optimized settings**: FP8 configs include recommended settings (`moe_use_legacy_grouped_gemm: false`)
4. **Cleaner configs**: Removed deprecated parameters for better maintainability
5. **Backward compatibility**: Original BF16 training behavior preserved

## 🧪 Testing

- All configuration file references updated in test suite
- Existing tests continue to work with renamed BF16 configs
- FP8 configs follow the same structure as BF16 with precision-specific optimizations

## 📝 Migration Guide

### For existing users:

- Replace `xxx-pretrain.yaml` → `xxx-BF16-pretrain.yaml` in your scripts
- Examples:

```bash
# Old
EXP=examples/megatron/configs/MI300X/llama3_8B-pretrain.yaml

# New
EXP=examples/megatron/configs/MI300X/llama3_8B-BF16-pretrain.yaml
```

### To use FP8 training:

```bash
# Simply switch to FP8 config
EXP=examples/megatron/configs/MI300X/llama3_8B-FP8-pretrain.yaml bash ./examples/run_pretrain.sh
```
This PR improves CI stability for JAX / MaxText training jobs by adding an explicit training-completion log and using it as a success signal in unit tests.

In some cases, the training process may terminate abnormally **after training has already completed** (e.g. a random core dump), which causes CI failures even though the training itself finished successfully. To address this, we now log an explicit marker after the MaxText training finishes:

```text
MaxText Pre-Trainer: after training is done
```

In CI and unit tests, if the training process exits with an error **but this log is present**, the run is treated as successful.
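A minimal sketch of how a test could implement this rule, assuming a hypothetical helper and log path (the actual unit-test code may differ):

```python
import subprocess

# Marker emitted by the MaxText pre-trainer once training has finished.
COMPLETION_MARKER = "MaxText Pre-Trainer: after training is done"


def run_training_and_check(cmd: list[str], log_path: str) -> bool:
    """Treat the run as successful if it exits cleanly, or if it crashed
    only after the completion marker was already logged."""
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode == 0:
        return True

    # Abnormal exit (e.g. a late core dump): fall back to checking the log.
    try:
        with open(log_path, "r", errors="ignore") as f:
            return COMPLETION_MARKER in f.read()
    except FileNotFoundError:
        return False
```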
## Background

Currently, CI sets `UT_LOG_PATH` to fixed directories in some cases (e.g. `.../ut_out/latest`). This means different CI runs (especially consecutive pushes to `main` or multiple runners) can reuse the same path, leading to:

- Previous run logs/results being overwritten
- Cleanup logic having trouble distinguishing runs, making debugging harder
## Changes

- **Torch CI (`run-unittest-torch`)**
  - Change `UT_LOG_PATH` from fixed paths to unique, per-run directories that include a **timestamp** and a **short commit SHA** (see the sketch below):
    - Pull requests: `ut_out/pr-<pr_number>-<YYYYMMDD-HHMMSS>-<commit>`
    - Push to `main`: `ut_out/main-<YYYYMMDD-HHMMSS>-<commit>`
    - Releases: `ut_out/<tag>-<YYYYMMDD-HHMMSS>-<commit>`
    - Other events: `ut_out/others-<YYYYMMDD-HHMMSS>-<commit>`
- **JAX CI (`run-unittest-jax`)**
  - Apply the same `UT_LOG_PATH` naming scheme as Torch CI, with timestamp and short commit SHA, for PR / `main` / release / other events.
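For illustration only, a small Python sketch of the naming scheme; the actual CI jobs build this path inside the GitHub Actions workflow, and the helper below is hypothetical:

```python
import subprocess
from datetime import datetime


def build_ut_log_path(event: str, pr_number: int | None = None, tag: str | None = None) -> str:
    """Illustrative reconstruction of the per-run UT_LOG_PATH naming scheme."""
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    commit = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

    if event == "pull_request":
        prefix = f"pr-{pr_number}"
    elif event == "push":  # push to main
        prefix = "main"
    elif event == "release":
        prefix = tag
    else:
        prefix = "others"

    return f"ut_out/{prefix}-{timestamp}-{commit}"


# Example: build_ut_log_path("pull_request", pr_number=123)
# -> "ut_out/pr-123-20251215-235014-a16f252"
```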
## Benefits

- Each CI run writes UT logs to a globally unique directory, avoiding cross-run interference.
- The log path encodes both time and commit, making it easy to trace logs back to a specific run.
- No change to the actual tests, only to where logs are written, so the risk is low.

## Testing

- Verified in CI logs that `UT_LOG_PATH` is set to the expected `<event>-<timestamp>-<commit>` pattern for both Torch and JAX jobs.
- Confirmed that the UT jobs create and use the new per-run log directories successfully.
* Fix primus-turbo spec provider
…374)

1. Fix the hardcoded settings in the MaxText Docker image
2. Set default values for the wandb args in MaxText
…lt Models (#436)

Added a Primus auto-benchmarking tool for the default models with the Megatron and TorchTitan backends.

**Features:**

```
✅ Interactive Menu System - User-friendly CLI with color-coded outputs and ASCII banner
✅ Multi-Backend Support - Compatible with Megatron and TorchTitan with device-specific configs
✅ Batch Processing - Run multiple model configurations sequentially with flexible selection
✅ Configuration Viewing - Preview YAML configs before execution
✅ Configuration Editing - Edit YAML configs individually or in batch before execution
✅ Parameter Overrides - Override specific parameters without editing files permanently
✅ Auto Device Detection - Automatically detects AMD MI300X/MI355X GPUs with intelligent fallback
✅ Device-Specific Paths - Automatically uses device-specific config directories (MI300X/MI355X)
✅ Comprehensive Logging - Timestamped logs saved in organized backend-specific directories
✅ Environment Management - Custom device-specific environment variable support
✅ Automatic Metrics Generation - Backend-specific metrics tables generated after completion
✅ Smart Config Management - Handles edited/override configs properly with automatic cleanup
```
Update the `runs-on` label for the `run-unittest-jax` job:

- Old: `primus-jax-l85pj`
- New: `primus-llm-cicd-jax-7b4zw`

This updates the CI to use the new JAX runner infrastructure for better stability and performance in JAX unit tests.
This PR introduces support for the Primus Turbo grouped GEMM backend in the MaxText MoE implementation.

### Key Changes

- Added `use_turbo_grouped_gemm` option in configuration.
- Implemented fallback to the default `ragged_dot` when Primus Turbo is unavailable.
- Added related logging.

### How to Turn It On

- In a config file, e.g. `examples/maxtext/configs/MI355X/mixtral_8x7B-pretrain.yaml`:
  - add `use_turbo_grouped_gemm: true`
  - make sure `sparse_matmul: true` and `megablox: false`
- In the shell, add `JAX_ENABLE_X64=1`

### Functional Testing

- Integration tests on a single node and on two nodes
- Tested different config combinations, for example: 1) `JAX_ENABLE_X64=1`, 2) `use_turbo_grouped_gemm` together with megablox
- Verified fallback behavior when Primus Turbo is not available
- Confirmed correct logging

### Logging

- When `use_turbo_grouped_gemm` is on
  <img width="816" height="20" alt="image" src="https://github.com/user-attachments/assets/85e4821e-c681-4bb2-a24a-3945da7fa053" />
- When `use_turbo_grouped_gemm` is used with `megablox`
  <img width="830" height="25" alt="image" src="https://github.com/user-attachments/assets/8559757a-3364-464f-9302-d24087fe12a9" />
- When `primus_turbo` cannot be loaded
  <img width="1131" height="21" alt="image" src="https://github.com/user-attachments/assets/20ad05b1-4fc9-4982-b2b1-4c39d5657522" />

### Benchmarking

The following tests were run on MI355 using the ds-proxy-e128-h2048 configuration with N=1. Compared to sparse + ragged_dot, the Primus Turbo grouped GEMM backend enables much larger per-device batch sizes without OOM.

<img width="950" height="621" alt="image" src="https://github.com/user-attachments/assets/a8eb149f-d6a8-4985-9f25-903679cf655e" />

Primus version: `a16f2524e2ad5b35d06eb306da64b22652478785`
Primus Turbo version: `d8f8dd0af5c82af0d30489a1dada61ffe9463869`
Docker: `rocm/jax-training:maxtext-v25.9`

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
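A minimal sketch of the fallback pattern described above; the `primus_turbo` import path and `turbo_grouped_gemm` entry point are hypothetical stand-ins, and the real MaxText integration in the MoE layer may differ:

```python
import logging

import jax

try:
    # Hypothetical import path; the real Primus Turbo API may differ.
    from primus_turbo.jax import grouped_gemm as turbo_grouped_gemm

    HAVE_PRIMUS_TURBO = True
except ImportError:
    HAVE_PRIMUS_TURBO = False


def moe_grouped_matmul(lhs, rhs, group_sizes, use_turbo_grouped_gemm: bool):
    """Dispatch the MoE grouped matmul to Primus Turbo when available,
    otherwise fall back to the default ragged_dot path."""
    if use_turbo_grouped_gemm and HAVE_PRIMUS_TURBO:
        logging.info("MoE: using Primus Turbo grouped GEMM backend")
        return turbo_grouped_gemm(lhs, rhs, group_sizes)

    if use_turbo_grouped_gemm:
        logging.warning("MoE: primus_turbo not available, falling back to ragged_dot")
    return jax.lax.ragged_dot(lhs, rhs, group_sizes)
```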
Update deepseek_v3_16b-pretrain.yaml bs=13
…n using measured layer-wise latencies. (#362)

Example usage:

    bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v3-pretrain.yaml

Example output:

    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 00 bubble: 2414.28 ms (ratio=7.25%), activation_peak=144.41 GB, param_memory=136.67 GB, total_peak=281.08 GB
    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 01 bubble: 2563.14 ms (ratio=7.70%), activation_peak=139.49 GB, param_memory=136.67 GB, total_peak=276.16 GB
    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 02 bubble: 2563.14 ms (ratio=7.70%), activation_peak=135.45 GB, param_memory=136.67 GB, total_peak=272.11 GB
    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 03 bubble: 2563.14 ms (ratio=7.70%), activation_peak=131.40 GB, param_memory=136.67 GB, total_peak=268.07 GB
    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 04 bubble: 2563.14 ms (ratio=7.70%), activation_peak=127.36 GB, param_memory=136.67 GB, total_peak=264.03 GB
    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 05 bubble: 2563.14 ms (ratio=7.70%), activation_peak=123.32 GB, param_memory=136.67 GB, total_peak=259.99 GB
    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 06 bubble: 2563.14 ms (ratio=7.70%), activation_peak=119.27 GB, param_memory=136.67 GB, total_peak=255.94 GB
    [20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 07 bubble: 877.40 ms (ratio=2.64%), activation_peak=116.00 GB, param_memory=136.67 GB, total_peak=252.66 GB

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
…ckend (#376)

* Add a `PRIMUS_DETERMINISTIC` env var to enable deterministic execution.
* Bring the deterministic unit test back.

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
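A minimal sketch of how such a flag could be wired up for a PyTorch backend; the specific knobs toggled below are assumptions, not necessarily what the Primus implementation does:

```python
import os

import torch


def maybe_enable_deterministic() -> bool:
    """Enable deterministic execution when PRIMUS_DETERMINISTIC is set (assumed semantics)."""
    if os.environ.get("PRIMUS_DETERMINISTIC", "0") not in ("1", "true", "True"):
        return False

    # Typical PyTorch determinism knobs; the actual Primus implementation may differ.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Required by some deterministic GEMM paths.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    return True
```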
- Refactor the existing Megatron MoE patch logic into the Primus patch system.
- No new MoE functionality is introduced; this only changes how patches are registered and applied.

## Changes

- **Deprecated MoE layer patch**
  - File: `primus/backends/megatron/patches/moe_patches/deprecated_layer_patches.py`
  - Wrap the existing deprecated MoE layer logic into a registered patch:
    - When `use_deprecated_20241209_moe_layer=True`, replace `MoELayer`, `MoESubmodules`, and the expert MLP classes with the deprecated versions.
    - Update `megatron.core.models.gpt.moe_module_specs` to point to the same deprecated classes as before.
- **MoE permute fusion patch**
  - File: `primus/backends/megatron/patches/moe_patches/permute_fusion_patches.py`
  - Move the existing fused permutation logic into a patch:
    - When `moe_permute_fusion=True`, replace the TE permute/unpermute and sort functions with the Primus fused implementations.
    - Apply the same replacements in `megatron.core.transformer.moe.moe_utils` and set `HAVE_TE = True`.
- **Primus TopKRouter patch**
  - File: `primus/backends/megatron/patches/moe_patches/topk_router_patches.py`
  - Register the existing `PrimusTopKRouter` integration as a patch:
    - By default (unless `disable_primus_topk_router=True`), replace `TopKRouter` in `megatron.core.transformer.moe.router` and `moe_layer` with `PrimusTopKRouter`.
    - If `use_deprecated_20241209_moe_layer=True`, also patch `deprecated_20251209.moe_layer.TopKRouter`.
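For context, a hypothetical sketch of the condition-driven patch pattern; `register_patch`, the registry, and the argument handling below stand in for Primus's actual patch-registration API, which may differ:

```python
# Hypothetical registry; Primus's real patch system has its own registration API.
_PATCHES = []


def register_patch(condition):
    """Register a patch that is applied only when `condition(args)` holds."""
    def decorator(fn):
        _PATCHES.append((condition, fn))
        return fn
    return decorator


@register_patch(condition=lambda args: getattr(args, "use_deprecated_20241209_moe_layer", False))
def patch_deprecated_moe_layer(args):
    # In the real patch, MoELayer / MoESubmodules and the expert MLP classes in
    # megatron.core would be replaced with the deprecated versions here.
    print("Applying deprecated MoE layer patch")


def apply_patches(args):
    for condition, patch in _PATCHES:
        if condition(args):
            patch(args)


if __name__ == "__main__":
    from types import SimpleNamespace

    apply_patches(SimpleNamespace(use_deprecated_20241209_moe_layer=True))
```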
## 📋 Summary

This PR introduces a new modular Transformer Engine (TE) patches module under `primus/backends/megatron/patches/te_patches/`. It replaces the monolithic `patch_get_extra_te_kwargs()` and `patch_te_tp_overlap()` methods from `MegatronTrainer` with well-organized, condition-based patches.

## 🎯 Motivation

**Problems with the old approach:**

- A single large method handling multiple TE configurations
- Hard to understand which patches apply in which scenarios
- Version-specific logic mixed with feature logic
- Difficult to test individual TE patches

**New approach:**

- Each TE feature has its own patch file
- Clear version-based separation (TE < 2.0 vs >= 2.0)
- Condition-driven patch application
- Reusable utility functions
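A hypothetical sketch of the version-based separation described above; the helper names are illustrative and not the actual `te_patches` API:

```python
from packaging.version import Version

try:
    import transformer_engine as te

    # Fall back to "0.0" if the attribute is missing; hedged assumption.
    TE_VERSION = Version(getattr(te, "__version__", "0.0"))
except ImportError:
    TE_VERSION = None


def should_apply_legacy_te_patch() -> bool:
    """Apply the TE < 2.0 variant of a patch (illustrative condition)."""
    return TE_VERSION is not None and TE_VERSION < Version("2.0")


def should_apply_modern_te_patch() -> bool:
    """Apply the TE >= 2.0 variant of a patch (illustrative condition)."""
    return TE_VERSION is not None and TE_VERSION >= Version("2.0")
```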