
Feat/baseline #444

Closed
gphuang wants to merge 19 commits into feat/baseline from main

Conversation


@gphuang gphuang commented Dec 19, 2025

No description provided.

RuibinCheung and others added 19 commits December 16, 2025 11:40
…extension (#299)

* Add mxfp8 recipe support
* Add `PrimusTurboLinear` to replace `TELinear`

mxfp8 currently runs on some models without errors, as shown below, but
performance is poor, only about half that of FP8. This will be addressed
later.
<img width="2894" height="1002" alt="image"
src="https://github.com/user-attachments/assets/df73d725-aec5-47cd-b78c-f7efb32c0462"
/>
## Summary

This PR refactors the test structure so that CI runs can better
distinguish failure types and pinpoint issues more easily.

---------

Co-authored-by: Xiaoming-AMD <xiaompen@amd.com>
This PR fixes an incorrect container image tag used by primus-cli in
container mode.

The image tag was previously set to rocm/primus:v25.10_gfx942, which
does not exist; it has been corrected to a valid tag.
## 🎯 Overview

This PR introduces explicit precision indicators in configuration
filenames and adds FP8 training support across all model configurations.

## 📋 Changes Summary

### 1. Configuration File Renaming
- **Renamed all existing configs**: `xxx-pretrain.yaml` →
`xxx-BF16-pretrain.yaml`
- **Total files renamed**: 42 configuration files (MI300X: 20, MI355X:
22)
- **Purpose**: Explicitly indicate BF16 precision in filenames for
better clarity

### 2. FP8 Configuration Support
- **Created FP8 variants**: Added `xxx-FP8-pretrain.yaml` for all models
- **Total new files**: 42 FP8 configuration files
- **FP8-specific settings**:
  ```yaml
  # enable fp8 training
  fp8: hybrid
  moe_use_legacy_grouped_gemm: false
  ```

### 3. Configuration Cleanup
- Removed deprecated fusion-related configurations:
  - `moe_permute_fusion`
  - `moe_use_fused_router_with_aux_score`
  - Related comments and unnecessary parameters
- Retained `gradient_accumulation_fusion: false` where it existed in
original configs

### 4. Documentation Updates
- **Updated `examples/README.md`**:
  - All example commands now reference `-BF16-pretrain.yaml` configs
  - Updated model table with new config file links
  - Updated HipBLASLt tuning examples
  - Updated Kubernetes examples
  
- **Updated `tests/trainer/test_megatron_trainer.py`**:
  - All test cases updated to use `-BF16-pretrain.yaml` configs
  - 11 test methods updated

## 🗂️ Affected Models

**MI300X**: deepseek_v2, deepseek_v2_lite, deepseek_v3, gpt_oss_20B,
grok1, grok2, llama2 (7B/70B), llama3 (8B/70B), llama3.1 (8B/70B/405B),
llama3.3_70B, llama4 (17B16E/17B128E), mixtral (8x7B/8x22B), qwen2.5
(7B/72B)

**MI355X**: All above models + qwen3 (8B/30B_A3B/235B_A22B)

## ✅ Benefits

1. **Clear precision indication**: Users can easily identify BF16 vs FP8
configurations
2. **FP8 training ready**: All models now have pre-configured FP8
training support
3. **Optimized settings**: FP8 configs include recommended settings
(`moe_use_legacy_grouped_gemm: false`)
4. **Cleaner configs**: Removed deprecated parameters for better
maintainability
5. **Backward compatibility**: Original BF16 training behavior preserved

## 🧪 Testing

- All configuration file references updated in test suite
- Existing tests continue to work with renamed BF16 configs
- FP8 configs follow the same structure as BF16 with precision-specific
optimizations

## 📝 Migration Guide

### For existing users:
- Replace `xxx-pretrain.yaml` → `xxx-BF16-pretrain.yaml` in your scripts
- Examples:
  ```bash
  # Old
  EXP=examples/megatron/configs/MI300X/llama3_8B-pretrain.yaml
  
  # New
  EXP=examples/megatron/configs/MI300X/llama3_8B-BF16-pretrain.yaml
  ```

### To use FP8 training:
```bash
# Simply switch to FP8 config
EXP=examples/megatron/configs/MI300X/llama3_8B-FP8-pretrain.yaml bash ./examples/run_pretrain.sh
```
This PR improves CI stability for JAX / MaxText training jobs by adding
an explicit training completion log and using it as a success signal in
unit tests.

In some cases, the training process may terminate abnormally **after
training has already completed** (e.g. random core dump), which causes
CI failures even though the training itself finished successfully.

To address this, we now log an explicit marker after the MaxText
training finishes:

```text
MaxText Pre-Trainer: after training is done
```

In CI and unit tests, if the training process exits with an error **but
this log is present**, the run is treated as successful.
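
For illustration, a minimal sketch of how this marker could be used as a success signal (the wrapper script and log path below are assumptions, not the actual CI implementation):

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run the training command passed as arguments, capture
# its output, and treat the run as successful if the completion marker is
# present even when the process exits abnormally after training.
LOG_FILE="${UT_LOG_PATH}/maxtext_pretrain.log"   # assumed log location

"$@" 2>&1 | tee "$LOG_FILE"
status=${PIPESTATUS[0]}

if [ "$status" -ne 0 ] && grep -q "MaxText Pre-Trainer: after training is done" "$LOG_FILE"; then
    echo "Training finished before the abnormal exit; treating run as successful."
    status=0
fi
exit "$status"
```
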
## Background

Currently, CI sets `UT_LOG_PATH` to fixed directories in some cases
(e.g. `.../ut_out/latest`).
This means different CI runs (especially consecutive pushes to `main` or
multiple runners) can
reuse the same path, leading to:

- Previous run logs/results being overwritten
- Cleanup logic having trouble distinguishing runs, making debugging
harder

## Changes

- **Torch CI (`run-unittest-torch`)**
  - Change `UT_LOG_PATH` from fixed paths to unique, per-run directories that
    include a **timestamp** and **short commit SHA**:
    - Pull requests: `ut_out/pr-<pr_number>-<YYYYMMDD-HHMMSS>-<commit>`
    - Push to `main`: `ut_out/main-<YYYYMMDD-HHMMSS>-<commit>`
    - Releases: `ut_out/<tag>-<YYYYMMDD-HHMMSS>-<commit>`
    - Other events: `ut_out/others-<YYYYMMDD-HHMMSS>-<commit>`

- **JAX CI (`run-unittest-jax`)**
  - Apply the same `UT_LOG_PATH` naming scheme as Torch CI, with timestamp and
    short commit SHA for PR / `main` / release / other events (a sketch of the
    path construction is shown below).
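
As an illustration, a minimal sketch of how such a per-run path could be composed in a CI step (assuming a bash step on a GitHub Actions runner; `PR_NUMBER` is a hypothetical variable that would come from the workflow context):

```bash
# Sketch only: build a unique per-run UT_LOG_PATH from event type, timestamp,
# and short commit SHA. GITHUB_EVENT_NAME, GITHUB_SHA and GITHUB_REF_NAME are
# standard GitHub Actions environment variables.
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SHORT_SHA=${GITHUB_SHA::7}

case "$GITHUB_EVENT_NAME" in
  pull_request) PREFIX="pr-${PR_NUMBER}" ;;     # PR_NUMBER: hypothetical, from workflow context
  push)         PREFIX="main" ;;                # pushes to main
  release)      PREFIX="${GITHUB_REF_NAME}" ;;  # release tag
  *)            PREFIX="others" ;;
esac

export UT_LOG_PATH="ut_out/${PREFIX}-${TIMESTAMP}-${SHORT_SHA}"
echo "UT_LOG_PATH=${UT_LOG_PATH}"
```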

## Benefits

- Each CI run writes UT logs to a globally unique directory, avoiding
cross-run interference.
- The log path encodes both time and commit, making it easy to trace
logs back to a specific run.
- No change to the actual tests, only to where logs are written, so the
risk is low.

## Testing

- Verified in CI logs that `UT_LOG_PATH` is set to the expected
  `<event>-<timestamp>-<commit>` pattern for both Torch and JAX jobs.
- Confirmed that the UT jobs create and use the new per-run log
directories successfully.
…374)

(1) Fix the hardcoded settings in the MaxText Docker image.
(2) Set default values for the wandb args in MaxText.
…lt Models (#436)

Added a Primus auto-benchmarking tool for the default models, supporting the
Megatron and TorchTitan backends.

**Features:** 

```
✅ Interactive Menu System - User-friendly CLI with color-coded outputs and ASCII banner
✅ Multi-Backend Support - Compatible with Megatron and TorchTitan with device-specific configs
✅ Batch Processing - Run multiple model configurations sequentially with flexible selection
✅ Configuration Viewing - Preview YAML configs before execution
✅ Configuration Editing - Edit YAML configs individually or in batch before execution
✅ Parameter Overrides - Override specific parameters without editing files permanently
✅ Auto Device Detection - Automatically detects AMD MI300X/MI355X GPUs with intelligent fallback
✅ Device-Specific Paths - Automatically uses device-specific config directories (MI300X/MI355X)
✅ Comprehensive Logging - Timestamped logs saved in organized backend-specific directories
✅ Environment Management - Custom device-specific environment variable support
✅ Automatic Metrics Generation - Backend-specific metrics tables generated after completion
✅ Smart Config Management - Handles edited/override configs properly with automatic cleanup
```
Update runs-on label for run-unittest-jax job:
- Old: primus-jax-l85pj
- New: primus-llm-cicd-jax-7b4zw

This updates the CI to use the new JAX runner infrastructure for better
stability and performance in JAX unit tests.
This PR introduces support for the Primus Turbo grouped gemm backend in
the MaxText MoE implementation.

### Key Changes
- Added `use_turbo_grouped_gemm` option in configuration.
- Implemented fallback to default `ragged_dot` when Primus Turbo is
unavailable.
- Added related logging.

### How to Turn It On
- In the config file, e.g.
  `examples/maxtext/configs/MI355X/mixtral_8x7B-pretrain.yaml`:
  - add `use_turbo_grouped_gemm: true`
  - make sure `sparse_matmul: true` and `megablox: false`
- In the shell, set `JAX_ENABLE_X64=1` (a combined launch sketch is shown below)
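
Putting this together, a minimal launch sketch (the config path matches the example above; the launch script is an assumption, use your usual MaxText entry point):

```bash
# Sketch only: the config file is assumed to already contain
#   use_turbo_grouped_gemm: true
#   sparse_matmul: true
#   megablox: false
# The launch script below is an assumption.
JAX_ENABLE_X64=1 \
EXP=examples/maxtext/configs/MI355X/mixtral_8x7B-pretrain.yaml \
bash ./examples/run_pretrain.sh
```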

### Functional Testing
- Integration test on single node and two nodes
- Tested different config combinations, for example (1) `JAX_ENABLE_X64=1` and
  (2) `use_turbo_grouped_gemm` with megablox
- Verified fallback behavior when Primus Turbo is not available
- Confirmed correct logging

### Logging:
- When `use_turbo_grouped_gemm` is on
<img width="816" height="20" alt="image"
src="https://github.com/user-attachments/assets/85e4821e-c681-4bb2-a24a-3945da7fa053"
/>

- Use `use_turbo_grouped_gemm` with `megablox` 
<img width="830" height="25" alt="image"
src="https://github.com/user-attachments/assets/8559757a-3364-464f-9302-d24087fe12a9"
/>

- When `primus_turbo` cannot be loaded
<img width="1131" height="21" alt="image"
src="https://github.com/user-attachments/assets/20ad05b1-4fc9-4982-b2b1-4c39d5657522"
/>

### Benchmarking
The following tests were run on MI355 using the ds-proxy-e128-h2048
configuration with N=1. Compared to sparse+ragged_dot, the Primus Turbo
grouped GEMM backend enables much larger per-device batch sizes without
OOM.
<img width="950" height="621" alt="image"
src="https://github.com/user-attachments/assets/a8eb149f-d6a8-4985-9f25-903679cf655e"
/>

Primus ver: `a16f2524e2ad5b35d06eb306da64b22652478785`
Primus Turbo ver: `d8f8dd0af5c82af0d30489a1dada61ffe9463869`
Docker: `rocm/jax-training:maxtext-v25.9`

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Update deepseek_v3_16b-pretrain.yaml bs=13
…n using measured layer-wise latencies. (#362)

Example usage:

bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v3-pretrain.yaml

Example output:

[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 00 bubble: 2414.28 ms (ratio=7.25%), activation_peak=144.41 GB, param_memory=136.67 GB, total_peak=281.08 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 01 bubble: 2563.14 ms (ratio=7.70%), activation_peak=139.49 GB, param_memory=136.67 GB, total_peak=276.16 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 02 bubble: 2563.14 ms (ratio=7.70%), activation_peak=135.45 GB, param_memory=136.67 GB, total_peak=272.11 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 03 bubble: 2563.14 ms (ratio=7.70%), activation_peak=131.40 GB, param_memory=136.67 GB, total_peak=268.07 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 04 bubble: 2563.14 ms (ratio=7.70%), activation_peak=127.36 GB, param_memory=136.67 GB, total_peak=264.03 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 05 bubble: 2563.14 ms (ratio=7.70%), activation_peak=123.32 GB, param_memory=136.67 GB, total_peak=259.99 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 06 bubble: 2563.14 ms (ratio=7.70%), activation_peak=119.27 GB, param_memory=136.67 GB, total_peak=255.94 GB
[20251215 23:50:14][rank-0/8][DEBUG] [------projection.py:374] : Rank 07 bubble: 877.40 ms (ratio=2.64%), activation_peak=116.00 GB, param_memory=136.67 GB, total_peak=252.66 GB

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
…ckend (#376)

* Add `PRIMUS_DETERMINISTIC` env var to enable deterministic execution (see the sketch below).
* Bring the deterministic unit test back.
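
For illustration, a hedged sketch of enabling deterministic execution for a run (the value `1` and the launch command are assumptions; the config path follows the examples elsewhere in this PR):

```bash
# Sketch only: the env var value and launch script are assumptions.
PRIMUS_DETERMINISTIC=1 \
EXP=examples/megatron/configs/MI300X/llama3_8B-BF16-pretrain.yaml \
bash ./examples/run_pretrain.sh
```
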

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
- Refactor existing Megatron MoE patch logic into the Primus patch
system.
- No new MoE functionality is introduced; this only changes how patches
are registered and applied.

## Changes

- **Deprecated MoE layer patch**
  - File: `primus/backends/megatron/patches/moe_patches/deprecated_layer_patches.py`
  - Wrap existing deprecated MoE layer logic into a registered patch:
    - When `use_deprecated_20241209_moe_layer=True`, replace `MoELayer`,
      `MoESubmodules`, and expert MLP classes with the deprecated versions.
    - Update `megatron.core.models.gpt.moe_module_specs` to point to the same
      deprecated classes as before.

- **MoE permute fusion patch**
  - File: `primus/backends/megatron/patches/moe_patches/permute_fusion_patches.py`
  - Move existing fused permutation logic into a patch:
    - When `moe_permute_fusion=True`, replace TE permute/unpermute and sort
      functions with the Primus fused implementations.
    - Apply the same replacements in `megatron.core.transformer.moe.moe_utils`
      and set `HAVE_TE = True`.

- **Primus TopKRouter patch**
  - File: `primus/backends/megatron/patches/moe_patches/topk_router_patches.py`
  - Register the existing `PrimusTopKRouter` integration as a patch:
    - By default (unless `disable_primus_topk_router=True`), replace
      `TopKRouter` in `megatron.core.transformer.moe.router` and `moe_layer`
      with `PrimusTopKRouter`.
    - If `use_deprecated_20241209_moe_layer=True`, also patch
      `deprecated_20251209.moe_layer.TopKRouter`.

## 📋 Summary

This PR introduces a new modular Transformer Engine (TE) patches module
under `primus/backends/megatron/patches/te_patches/`. It replaces the
monolithic `patch_get_extra_te_kwargs()` and `patch_te_tp_overlap()`
methods from `MegatronTrainer` with well-organized, condition-based
patches.

## 🎯 Motivation

**Problems with old approach:**
- Single large method handling multiple TE configurations
- Hard to understand which patches apply in which scenarios
- Version-specific logic mixed with feature logic
- Difficult to test individual TE patches

**New approach:**
- Each TE feature has its own patch file
- Clear version-based separation (TE < 2.0 vs >= 2.0)
- Condition-driven patch application
- Reusable utility functions
@gphuang gphuang closed this Dec 19, 2025