feat: Add MLflow artifact upload for traces and logs by gphuang · Pull Request #440 · AMD-AGI/Primus

gphuang · 2025-12-18T09:10:45Z

feat: Add MLflow artifact upload for traces and logs

Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.

Features

- Upload PyTorch profiler trace files to MLflow artifacts/traces/
- Upload training log files to MLflow artifacts/logs/
- Unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container

Config Options

mlflow_upload_traces: true # Upload profiler trace files to MLflow
mlflow_upload_logs: true # Upload training log files to MLflow

Usage

When MLflow is enabled, artifacts are automatically uploaded at the end of training:

- Trace files from tensorboard_dir → MLflow artifacts/traces/
- Log files from exp_root_path/logs/ → MLflow artifacts/logs/
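The mapping above can be sketched roughly as follows. This is a minimal illustration, not the code in `mlflow_artifacts.py`: the function names, the file patterns, and the injected `log_artifact` callable are all assumptions made so the sketch runs without MLflow installed.

```python
import glob
import os
from typing import Callable, Iterable, List

# Assumed file patterns; the PR description does not list the exact ones.
TRACE_PATTERNS = ("*.pt.trace.json", "*.pt.trace.json.gz")
LOG_PATTERNS = ("*.log",)


def find_files(root: str, patterns: Iterable[str]) -> List[str]:
    """Recursively collect files under root that match any pattern."""
    if not root or not os.path.isdir(root):
        return []
    found = set()
    for pattern in patterns:
        # glob.escape keeps characters such as '[' in root literal.
        query = os.path.join(glob.escape(root), "**", pattern)
        found.update(glob.glob(query, recursive=True))
    return sorted(found)


def upload_artifacts(
    log_artifact: Callable[..., None],
    tensorboard_dir: str,
    exp_root_path: str,
) -> int:
    """Send traces to 'traces' and logs to 'logs'; return the file count."""
    count = 0
    for path in find_files(tensorboard_dir, TRACE_PATTERNS):
        log_artifact(path, artifact_path="traces")
        count += 1
    for path in find_files(os.path.join(exp_root_path, "logs"), LOG_PATTERNS):
        log_artifact(path, artifact_path="logs")
        count += 1
    return count
```

In a real run the callable would presumably be `mlflow.log_artifact`, which accepts a local path and an optional `artifact_path`. Passing it in as an argument is a choice made for this sketch only, so the logic can be exercised with a stub.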

Example

```bash
export CONFIG_NAME="deepseek_v2_lite-FP8-pretrain"
export EXP="examples/megatron/configs/MI300X/${CONFIG_NAME}.yaml"
export MLFLOW_RUN_NAME="${CONFIG_NAME}.single-node.baseline"

# Profile steps 2-4 on all ranks, then upload metrics, traces, and logs to MLflow.
bash ./examples/run_pretrain.sh \
    --train_iters=5 \
    --profile_step_start=2 \
    --profile_step_end=4 \
    --profile_ranks=ALL \
    --mlflow_run_name="${MLFLOW_RUN_NAME}" \
    --mlflow_experiment_name=/Performance-data/Megatron-LM/primus-test \
    --mlflow_upload_performance_metrics=True \
    --mlflow_upload_traces=True \
    --mlflow_upload_logs=True
```

Output

- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container

Copilot

Pull request overview

This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.

Key changes:

- New artifact upload module with functions to collect and upload trace/log files to MLflow
- Integration of artifact uploads before MLflow run completion in the trainer
- Configuration options to control trace and log uploads (defaulting to enabled)
- Shell script improvements for timestamp-based output directories with multi-node consistency

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.


| File | Description |
| --- | --- |
| primus/backends/megatron/training/mlflow_artifacts.py | New module implementing trace/log file discovery and MLflow artifact upload functionality |
| primus/backends/megatron/training/global_vars.py | Adds global variable for exp_root_path and wrapper function for artifact uploads |
| primus/modules/trainer/megatron/trainer.py | Integrates artifact upload calls before MLflow run termination in two exit paths |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true) |
| examples/run_slurm_pretrain.sh | Implements timestamp-based output directory naming and exports timestamp for multi-node consistency |
| examples/run_pretrain.sh | Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message |
| examples/run_local_pretrain.sh | Adds MLflow environment variables and Primus path variables to Docker container environment |
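The conditional timestamp generation described for the launch scripts is shell logic in the PR. The same idea, written in Python for illustration, might look like the sketch below. The variable name `EXP_TIMESTAMP` and the timestamp format are hypothetical; the summary only says that a timestamp is exported and reused.

```python
import os
import time


def resolve_run_timestamp(env=None, var="EXP_TIMESTAMP"):
    """Reuse an exported timestamp if present, otherwise create and export one.

    `var` is a hypothetical variable name used only for this sketch.
    """
    env = os.environ if env is None else env
    timestamp = env.get(var)
    if not timestamp:
        # Single-node case: nothing was exported by a launcher, so generate one.
        timestamp = time.strftime("%Y%m%d_%H%M%S")
        env[var] = timestamp
    return timestamp
```

When a launcher exports the value once before starting the nodes, every node resolves the same string and therefore the same output directory. Without that, each node would stamp its own start time and the directories could differ.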


primus/backends/megatron/training/mlflow_artifacts.py

examples/run_slurm_pretrain.sh

primus/backends/megatron/training/global_vars.py

primus/backends/megatron/training/mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

examples/run_pretrain.sh

primus/backends/megatron/training/mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


primus/backends/megatron/training/mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

Copilot · 2025-12-18T10:20:26Z

@gphuang I've opened a new pull request, #441, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


primus/modules/trainer/megatron/trainer.py

The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().
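The failure mode and the fix can be reproduced with the standard library alone. The directory name below is a made-up stand-in for the real experiment name, chosen only because it contains square brackets.

```python
import glob
import os
import tempfile

# Hypothetical experiment directory whose name contains square brackets.
root = tempfile.mkdtemp()
exp_dir = os.path.join(root, "[exp]-rank[0]")
os.makedirs(exp_dir)
open(os.path.join(exp_dir, "trace.json"), "w").close()

# Unescaped: "[exp]" and "[0]" are parsed as character classes, so the
# pattern no longer names the real directory and nothing matches.
unescaped = glob.glob(os.path.join(exp_dir, "*.json"))

# Escaped: the directory part is literal; only "*.json" stays a wildcard.
escaped = glob.glob(os.path.join(glob.escape(exp_dir), "*.json"))

print(len(unescaped), len(escaped))  # prints: 0 1
```

Only the directory part is escaped. Escaping the whole pattern would also neutralize the `*` wildcard.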

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.


primus/backends/megatron/training/mlflow_artifacts.py

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

primus/modules/trainer/megatron/trainer.py

tests/unit_tests/backends/megatron/test_mlflow_artifacts.py

When mlflow_upload_traces or mlflow_upload_logs is True:
- Auto-enable mlflow (set disable_mlflow=False)
- Auto-enable profiling if trace upload is requested

This removes the need to explicitly set:
- --disable_mlflow=False
- --profile=True
- --use_pytorch_profiler=True

The profiler saves traces to tensorboard_dir, which is None when tensorboard is disabled. This caused a TypeError during trace save. Moved auto-enable logic before tensorboard section and added tensorboard auto-enable when mlflow_upload_traces is True.
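A rough sketch of the trace-related part of this auto-enable step follows. The helper name and the `default_tb_dir` fallback are hypothetical. Only `mlflow_upload_traces`, `profile`, `use_pytorch_profiler`, and `tensorboard_dir` are names taken from the commit messages, and the real logic in `trainer.py` may differ.

```python
from types import SimpleNamespace


def apply_trace_upload_defaults(args, default_tb_dir="output/tensorboard"):
    """Turn on what trace upload depends on; leave everything else untouched.

    `default_tb_dir` is a placeholder path used only in this sketch.
    """
    if not getattr(args, "mlflow_upload_traces", False):
        return args
    # Traces only exist if the PyTorch profiler actually runs.
    args.profile = True
    args.use_pytorch_profiler = True
    # The profiler writes into tensorboard_dir, so it must not stay None.
    if getattr(args, "tensorboard_dir", None) is None:
        args.tensorboard_dir = default_tb_dir
    return args


example = apply_trace_upload_defaults(
    SimpleNamespace(mlflow_upload_traces=True, profile=False,
                    use_pytorch_profiler=False, tensorboard_dir=None)
)
```

The ordering matters for the reason given in the commit message: the defaults have to be applied before the tensorboard setup reads `tensorboard_dir`, otherwise the profiler still sees `None`.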

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

primus/modules/trainer/megatron/trainer.py

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

primus/modules/trainer/megatron/trainer.py

primus/backends/megatron/training/mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py


gphuang · 2026-02-09T14:56:22Z

@wenxie-amd @Xiaoming-AMD @limou102 Could you please review?

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

primus/modules/trainer/megatron/trainer.py

Scope MLflow artifact imports to call sites, add exception detail and tracebacks, and avoid forcing default upload flags when args omit them.

Co-authored-by: Cursor <cursoragent@cursor.com>
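The first two points of that commit message, call-site imports and failure reports that carry the exception and a traceback, might be combined along these lines. This is an illustrative sketch rather than the PR's code: the function name and the `module_name` parameter are hypothetical, with the parameter added only so the sketch can be run without MLflow.

```python
import importlib
import logging
import traceback

logger = logging.getLogger(__name__)


def upload_with_deferred_import(paths, module_name="mlflow", artifact_path="traces"):
    """Upload each path and return how many succeeded; never raise."""
    try:
        # Imported at the call site, so importing this file does not need MLflow.
        tracker = importlib.import_module(module_name)
    except ImportError as exc:
        logger.warning("Skipping upload, cannot import %s: %s", module_name, exc)
        return 0

    uploaded = 0
    for path in paths:
        try:
            tracker.log_artifact(path, artifact_path=artifact_path)
            uploaded += 1
        except Exception as exc:
            # Report the exception and the full traceback, then keep going.
            logger.warning("Failed to upload %s: %s\n%s",
                           path, exc, traceback.format_exc())
    return uploaded
```

Catching per file means one bad artifact does not prevent the remaining uploads, and the training job itself is not failed by an upload problem.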

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

tests/unit_tests/backends/megatron/test_mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

Align Slurm EXP defaults with local training, ensure finalize barriers wait for uploads, and update tests to match handled exceptions.

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI review requested due to automatic review settings December 18, 2025 09:10

Copilot started reviewing on behalf of gphuang December 18, 2025 09:11

Copilot AI reviewed Dec 18, 2025

gphuang requested a review from Copilot December 18, 2025 10:10

Copilot started reviewing on behalf of gphuang December 18, 2025 10:11

Copilot AI reviewed Dec 18, 2025

primus/backends/megatron/training/mlflow_artifacts.py (outdated)

primus/modules/trainer/megatron/trainer.py

Copilot AI mentioned this pull request Dec 18, 2025

Move MLflow import to function scope to avoid import-time dependencies #441

Closed

docs: Clarify MLflow upload defaults are opt-out when MLflow enabled

13dfa81

Copilot AI review requested due to automatic review settings December 18, 2025 10:30

Copilot started reviewing on behalf of gphuang December 18, 2025 10:31

gphuang force-pushed the feat/6-enable-mlflow-uploading branch from 3c149be to 13dfa81 Compare December 18, 2025 10:33

Update primus/modules/trainer/megatron/trainer.py

1f2e136

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI reviewed Dec 18, 2025

Update examples/run_pretrain.sh

d30b920

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings December 18, 2025 10:37

Copilot started reviewing on behalf of gphuang December 18, 2025 10:38

Update primus/backends/megatron/training/mlflow_artifacts.py

b2da61b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI reviewed Dec 18, 2025

primus/modules/trainer/megatron/trainer.py (outdated)

gphuang mentioned this pull request Dec 18, 2025

feat: Add TraceLens integration for trace analysis with MLflow upload #439

Open

gphuang and others added 2 commits December 18, 2025 15:15

Merge branch 'main' into feat/6-enable-mlflow-uploading

476c05d

Copilot AI review requested due to automatic review settings December 19, 2025 08:26

gphuang marked this pull request as ready for review December 19, 2025 08:26

gphuang requested review from Xiaoming-AMD, limou102 and wenxie-amd as code owners December 19, 2025 08:26

Copilot started reviewing on behalf of gphuang December 19, 2025 08:27

Copilot AI reviewed Dec 19, 2025

Copilot AI reviewed Feb 3, 2026

primus/modules/trainer/megatron/trainer.py (outdated)

tests/unit_tests/backends/megatron/test_mlflow_artifacts.py

gphuang added 2 commits February 5, 2026 13:43

Copilot AI review requested due to automatic review settings February 5, 2026 14:16

Copilot started reviewing on behalf of gphuang February 5, 2026 14:17

Copilot AI reviewed Feb 5, 2026

gphuang added 2 commits February 9, 2026 10:12

Merge branch 'main' into feat/6-enable-mlflow-uploading

fdb79f6

Merge branch 'main' into feat/6-enable-mlflow-uploading

7d86d07

Copilot AI review requested due to automatic review settings February 9, 2026 08:21

Copilot started reviewing on behalf of gphuang February 9, 2026 08:22

Copilot AI reviewed Feb 9, 2026

primus/modules/trainer/megatron/trainer.py (outdated)

gphuang and others added 2 commits February 9, 2026 14:14

Merge branch 'main' into feat/6-enable-mlflow-uploading

1400630

Keep MLflow opt-in: do not override disable_mlflow from upload flags

7a87c51

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI review requested due to automatic review settings February 9, 2026 12:26

Copilot started reviewing on behalf of gphuang February 9, 2026 12:27

Copilot AI reviewed Feb 9, 2026

primus/modules/trainer/megatron/trainer.py (outdated)

primus/backends/megatron/training/mlflow_artifacts.py (outdated)

primus/modules/trainer/megatron/trainer.py

gphuang and others added 3 commits February 9, 2026 14:47

Fix barrier deadlock: run barrier before mlflow_writer check so all r…

1e0f3b8

…anks participate Co-authored-by: Cursor <cursoragent@cursor.com>
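The deadlock pattern behind this fix can be simulated without a GPU cluster. In the sketch below, threads stand in for ranks and `threading.Barrier` stands in for a distributed barrier. Both substitutions are assumptions for illustration, as are the function names; the trainer itself would use its own collective.

```python
import threading


def finalize(rank, writer, barrier, uploaded):
    # Collective first: every rank reaches it, with or without a writer.
    barrier.wait(timeout=5)
    # The early return is now safe because no rank is left waiting.
    if writer is None:
        return
    uploaded.append(rank)  # placeholder for the real artifact upload


def run(world_size=4):
    """Run one simulated finalize step; only the last rank owns a writer."""
    barrier = threading.Barrier(world_size)
    uploaded = []
    threads = []
    for rank in range(world_size):
        writer = object() if rank == world_size - 1 else None
        thread = threading.Thread(target=finalize,
                                  args=(rank, writer, barrier, uploaded))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return uploaded
```

If the `writer is None` check came before the barrier, the ranks without a writer would return early and the one rank holding a writer would wait at the barrier for peers that never arrive.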

Guard NNODES/SLURM_NNODES parse: catch ValueError and default to 1 node

f1fa6a1

Co-authored-by: Cursor <cursoragent@cursor.com>
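A guarded parse of this kind could look like the following. The function name and the lookup order (NNODES first, then SLURM_NNODES) are assumptions. The commit title only states that a ValueError is caught and that the fallback is one node.

```python
import os


def get_num_nodes(env=None):
    """Read the node count from the environment, falling back to 1."""
    env = os.environ if env is None else env
    # Assumed precedence: NNODES, then SLURM_NNODES, then a single node.
    raw = env.get("NNODES") or env.get("SLURM_NNODES") or "1"
    try:
        return int(raw)
    except ValueError:
        # Malformed values such as "two" or "4x" should not crash the run.
        return 1
```

For example, `get_num_nodes({"NNODES": "abc"})` returns 1 instead of raising.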

Reset exp_root_path on MLflow finalization to avoid stale global in s…

961246b

…ame process Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feat/6-enable-mlflow-uploading

585626a

Copilot AI review requested due to automatic review settings February 10, 2026 08:01

Copilot started reviewing on behalf of gphuang February 10, 2026 08:01

Copilot AI reviewed Feb 10, 2026

primus/modules/trainer/megatron/trainer.py

primus/modules/trainer/megatron/trainer.py

gphuang and others added 2 commits February 10, 2026 10:30

Improve MLflow artifact upload robustness

1fc65ff

Scope MLflow artifact imports to call sites, add exception detail and tracebacks, and avoid forcing default upload flags when args omit them.

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feat/6-enable-mlflow-uploading

0ab1a87

Copilot AI review requested due to automatic review settings February 11, 2026 09:02

Copilot started reviewing on behalf of gphuang February 11, 2026 09:03

Copilot AI reviewed Feb 11, 2026

tests/unit_tests/backends/megatron/test_mlflow_artifacts.py

primus/modules/trainer/megatron/trainer.py

Fix Copilot review notes for PR 440

1b36c3d

Align Slurm EXP defaults with local training, ensure finalize barriers wait for uploads, and update tests to match handled exceptions.

Co-authored-by: Cursor <cursoragent@cursor.com>
